Encoding language corpora: current trends and future directions

Tomaž Erjavec
Department of Knowledge Technologies
Jožef Stefan Institute, Ljubljana, Slovenia
[email protected], http://nl.ijs.si/et/

National Institute for Japanese Language
2006-09-28



Overview

1. History and current practices in corpus encoding: TEI P4, CES
2. Open issues: multiple annotations, metadata and analytical tools
3. Future directions: TEI P5, ISO TC 37


I. Some history

• 1980s: corpora (and other language resources) were encoded in idiosyncratic formats, usually bound to specific tools
• corpora were expensive to produce, but:
  – difficult to exchange and reuse
  – quickly became obsolete
• to address these problems, the Text Encoding Initiative was established in 1987
• the initiative comes from humanities computing: sponsorship by ACH, ALLC, ACL


Text Encoding Initiative

• TEI is the only systematized attempt to develop a fully general text encoding model and a set of encoding conventions based upon it
• intended for processing and analysis of any type of text, in any language
• main result: the TEI Guidelines for Electronic Text Encoding and Interchange
• SGML was chosen as the underlying standard for the TEI Guidelines
• drafts: TEI P1 (1990), TEI P2 (1993)


TEI P3 and P4

• the third version of the Guidelines, TEI P3 (1994), was published in two substantial green volumes (1200 pp.) and soon also on the Web
• a major revision, TEI P4, was published in 2002
• TEI P4 addresses the following issues:
  – error correction
  – provides equal support for XML and SGML
  – retains backward compatibility with TEI P3
• today, TEI P4 is the most widely used version of TEI: over 130 projects are listed on the TEI web pages


The TEI scheme

• TEI P4 consists of the written guidelines + a set of DTD fragments
• to obtain a project-specific DTD (TEI parameterisation), the DTD fragments are combined:
  1. core tagset (always present); includes the TEI header
  2. base tagsets (specific text types), e.g. prose, dictionaries, drama
  3. additional tagsets (particular analyses), e.g. dates & times, certainty, simple linguistic analysis
  4. user extensions, which extend or modify the TEI
• a widely used parameterisation of TEI: TEI Lite


What is good about TEI

• is a "standard"
• offers a rich vocabulary of tags with extensive documentation
• can be extended and modified
• many best practice scenarios
• software and user community support (tei-c web pages & tei-l mailing list)
• tutorials teaching TEI


What is bad about TEI

• steep learning curve (difficult to start using it)
• TEI is general, so tags are often too generic for the needs of particular projects; also, too deeply nested (tag bloat)
• it is often not clear how to encode a particular phenomenon (more than one possibility exists)
• while TEI is modular, it will still allow lots of tags that a project (encoder) has no need for
• never really became accepted in the computational linguistics community
• some areas are missing or not up to date: computational lexicons, terminological databases, complex linguistic annotations


TEI for corpus encoding

• base module: TEI.prose
• additional modules:
  – TEI.corpus: additional tags in the header
  – TEI.analysis: tags for simple analytic mechanisms
  – TEI.linking: tags for linking, segmentation, and alignment
  – TEI.fs: tags for feature structure analysis
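A parameterisation like the one listed above is usually written down as parameter entity declarations in the DOCTYPE internal subset. The following is a minimal sketch, assuming the P4 convention of switching tagsets on with entities set to INCLUDE. The public identifier and the tei2.dtd system identifier are assumptions that should be checked against the P4 Guidelines, and the helper name tei_p4_doctype is hypothetical.

```python
def tei_p4_doctype(base, additional=(), xml=True):
    """Assemble a DOCTYPE declaration that switches on the chosen TEI tagsets."""
    # TEI.XML selects the XML (rather than SGML) flavour of the DTD (assumed convention).
    tagsets = (["TEI.XML"] if xml else []) + [base, *additional]
    entities = "\n".join(f"  <!ENTITY % {name} 'INCLUDE'>" for name in tagsets)
    return (
        '<!DOCTYPE TEI.2 PUBLIC "-//TEI P4//DTD Main Document Type//EN" "tei2.dtd" [\n'
        f"{entities}\n]>"
    )


# Base and additional modules named on the slide above.
doctype = tei_p4_doctype(
    "TEI.prose", ["TEI.corpus", "TEI.analysis", "TEI.linking", "TEI.fs"]
)
print(doctype)
```

The point of the sketch is only that a corpus project picks one base tagset and a handful of additional ones; the core tagset needs no declaration because it is always present.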


Example annotated text

<seg id="orwl.en.24" corresp="orwl.sl.24">
  <s id="Oen.1.1.4.5">
    <c type="open" ctag='"'>"</c>
    <w ana="Af" lemma="big">Big</w>
    <w ana="Ncms" lemma="brother">Brother</w>
    <w ana="Vaip3s" lemma="be">is</w>
    <w ana="Vmpp" lemma="watch">watching</w>
    <w ana="Pp2" lemma="you">you</w>
    <c ctag='"'>"</c>
    <w ana="Dd" lemma="the">the</w>
    <w ana="Ncns" lemma="caption">caption</w>
    <w ana="Vmis" lemma="say">said</w>
    <c ctag=".">.</c>
  </s>
</seg>
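Because every word carries its annotation in attributes, the sentence above can be read with any XML parser. A minimal sketch using Python's standard library follows; the sample is a shortened copy of the slide's example and the function name tokens is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the annotated sentence shown above.
SAMPLE = '''<seg id="orwl.en.24" corresp="orwl.sl.24">
  <s id="Oen.1.1.4.5">
    <c type="open" ctag='"'>"</c>
    <w ana="Af" lemma="big">Big</w>
    <w ana="Ncms" lemma="brother">Brother</w>
    <w ana="Vaip3s" lemma="be">is</w>
    <w ana="Vmpp" lemma="watch">watching</w>
    <w ana="Pp2" lemma="you">you</w>
    <c ctag='"'>"</c>
  </s>
</seg>'''


def tokens(xml_text):
    """Return (word form, lemma, morphosyntactic tag) for every <w> element."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("lemma"), w.get("ana")) for w in root.iter("w")]


rows = tokens(SAMPLE)
for form, lemma, msd in rows:
    print(f"{form}\t{lemma}\t{msd}")
```

Punctuation (<c>) is skipped here simply because only <w> elements are iterated.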


Example morphosyntactic encoding

In text:

<w ana="Ncfda" lemma="ženska">ženskama</w>

In the MSD specification:

<fsLib>
  <fs type="Noun" id="Ncfda" select="sl" feats="N1.c N2.f N3.d N4.a"/>
  <fs type="Noun" id="Ncfdd" select="sl" feats="N1.c N2.f N3.d N4.d"/>
  <fs type="Noun" id="Ncfdg" select="sl" feats="N1.c N2.f N3.d N4.g"/>
  ...
</fsLib>

<fLib>
  <f id="N1.c" select="en ro sl cs bg et hu hr" name="Type"><sym value="common"/></f>
  <f id="N1.p" select="en ro sl cs bg et hu hr" name="Type"><sym value="proper"/></f>
  <f id="N2.m" select="en ro sl cs bg hr" name="Gender"><sym value="masculine"/></f>
  <f id="N2.f" select="en ro sl cs bg hr" name="Gender"><sym value="feminine"/></f>
  <f id="N2.n" select="en ro sl cs bg hr" name="Gender"><sym value="neuter"/></f>
  ...
</fLib>
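The two libraries above turn a compact tag such as Ncfda into named features: the <fs> entry lists feature identifiers, and each identifier is looked up in <fLib>. A minimal sketch of that lookup follows. It uses only the entries visible in the excerpt, so identifiers whose definitions are elided (N3.d, N4.a) are reported as unresolved rather than guessed; the function name expand_msd is hypothetical.

```python
import xml.etree.ElementTree as ET

FS_LIB = '''<fsLib>
  <fs type="Noun" id="Ncfda" select="sl" feats="N1.c N2.f N3.d N4.a"/>
</fsLib>'''

F_LIB = '''<fLib>
  <f id="N1.c" select="en ro sl cs bg et hu hr" name="Type"><sym value="common"/></f>
  <f id="N2.f" select="en ro sl cs bg hr" name="Gender"><sym value="feminine"/></f>
</fLib>'''


def expand_msd(msd, fs_lib, f_lib):
    """Resolve an MSD id into {feature name: value} plus a list of unresolved ids."""
    features = {
        f.get("id"): (f.get("name"), f.find("sym").get("value"))
        for f in ET.fromstring(f_lib).iter("f")
    }
    for fs in ET.fromstring(fs_lib).iter("fs"):
        if fs.get("id") == msd:
            ids = fs.get("feats").split()
            resolved = dict(features[i] for i in ids if i in features)
            unresolved = [i for i in ids if i not in features]
            return resolved, unresolved
    raise KeyError(msd)


resolved, unresolved = expand_msd("Ncfda", FS_LIB, F_LIB)
print(resolved, unresolved)
```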


CES: the Corpus Encoding Standard

• CES was developed in the scope of EU EAGLES, the Expert Advisory Group on Language Engineering Standards (1996)
• CES is an SGML DTD and is a particular parameterization (and modification) of TEI P3
• XCES (2002) is the XML version of CES
• (X)CES has been used in a number of corpus projects, mainly because it is simpler to use and understand than the full TEI
• however, there is no prescribed way to modify or extend it
• also, it is less strictly maintained than the TEI


II. Open issues

• multiple annotations
• metadata
• corpus analytical tools


Multiple annotations

• more and more linguistic annotation is being added to the data, e.g. sentences, words, punctuation, part-of-speech, (morphosyntactic) tags, multi-word units (terms), named entities, syntactic structure, co-reference annotation (anaphora), word-sense information
• also rhetorical structure: quoted speech, paragraphs, lists, …
• even more annotation can be added to multimodal data, e.g. speech signals
• furthermore, the same level of analysis can be marked up by more than one tool / annotator


How to combine these annotations?

• simply have distinct tags & attributes for each of the phenomena covered
  – easy to understand and hand-edit
  – easy to validate
  – easy to process
• but XML requires a tree structure; what if the tags do not nest properly?


Crossing hierarchies

• simple example, page breaks vs. paragraph boundaries:
  <page> … <p> … </page> … </p>
• a well-known problem for XML encoding, but with multiple annotations it is now becoming more severe
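The overlap above is not merely awkward: an XML parser rejects it outright. A minimal sketch that makes this concrete with Python's standard parser; the two sample strings and the <text> wrapper are invented for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A page that ends in the middle of a paragraph: the elements overlap.
overlapping = "<text><page>one <p>two </page>three</p></text>"
# The same words with the paragraph fully inside the page: a proper tree.
nested = "<text><page>one <p>two three</p></page></text>"

print(is_well_formed(overlapping), is_well_formed(nested))
```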


Solutions to crossing hierarchies

Discussed in TEI chapter 14 "Linking, Segmentation, and Alignment":

• split elements:
  <page broken="yes" id="p1" next="p2">…</page>
  <p> <page broken="yes" id="p2" prev="p1">…</page> </p>
• "milestones", i.e. empty elements:
  <page/> … <p> … <page/> … </p>
• but somewhat difficult to process and not very general
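With milestones the page is no longer an element that contains text, so a program has to reconstruct it by counting empty <page/> elements in document order. A minimal sketch of that bookkeeping follows; the sample document and the function name paragraph_pages are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the first paragraph runs across a page break.
DOC = "<text><page/>Intro. <p>First part <page/>second part.</p><p>More.</p></text>"


def paragraph_pages(xml_text):
    """Return (start page, end page, text) for each <p>, counting <page/> milestones."""
    root = ET.fromstring(xml_text)
    page = 0
    spans = []
    for elem in root.iter():  # iter() walks the tree in document order
        if elem.tag == "page":
            page += 1
        elif elem.tag == "p":
            breaks_inside = len(elem.findall(".//page"))
            spans.append((page, page + breaks_inside, "".join(elem.itertext())))
    return spans


spans = paragraph_pages(DOC)
print(spans)
```

The page number here is state carried by the traversal rather than something the markup expresses directly, which is one reason the approach is harder to process than ordinary nesting.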


Stand-off markup

• the general solution to crossing hierarchies is to keep markup in separate documents that only point into the text (or other markup)
• several specific recommendations and projects:
  – TEI P5 and the TEI Workgroup on Stand-Off Markup, XLink and XPointer
  – Annotation Graphs with AGTK
  – TIGER annotation scheme


Stand-off markup example: TIGER

<s id="s5">
  <graph root="s5_504">
    <terminals>
      <t id="s5_1" word="Die" pos="ART" morph="Def.Fem.Nom.Sg"/>
      <t id="s5_2" word="Tagung" pos="NN" morph="Fem.Nom.Sg.*"/>
      <t id="s5_3" word="hat" pos="VVFIN" morph="3.Sg.Pres.Ind"/>
      <t id="s5_4" word="mehr" pos="PIAT" morph="--"/>
      <t id="s5_5" word="Teilnehmer" pos="NN" morph="Masc.Akk.Pl.*"/>
      <t id="s5_6" word="als" pos="KOKOM" morph="--"/>
      <t id="s5_7" word="je" pos="ADV" morph="--"/>
      <t id="s5_8" word="zuvor" pos="ADV" morph="--"/>
    </terminals>
    <nonterminals>
      <nt id="s5_500" cat="NP">
        <edge label="NK" idref="s5_1"/>
        <edge label="NK" idref="s5_2"/>
      </nt>
      <nt id="s5_501" cat="AVP">
        <edge label="CM" idref="s5_6"/>
        <edge label="MO" idref="s5_7"/>
        <edge label="HD" idref="s5_8"/>
      </nt>
      …
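In the TIGER example the syntactic nodes do not contain the words; they point at them through idref attributes. A minimal sketch of following those pointers is given below. The sample keeps only the two nonterminals visible in the truncated excerpt, with closing tags added so that it parses; the function name phrases is hypothetical.

```python
import xml.etree.ElementTree as ET

# Reduced copy of the excerpt above, closed off so that it is well-formed.
TIGER = '''<s id="s5">
  <graph root="s5_504">
    <terminals>
      <t id="s5_1" word="Die" pos="ART" morph="Def.Fem.Nom.Sg"/>
      <t id="s5_2" word="Tagung" pos="NN" morph="Fem.Nom.Sg.*"/>
      <t id="s5_6" word="als" pos="KOKOM" morph="--"/>
      <t id="s5_7" word="je" pos="ADV" morph="--"/>
      <t id="s5_8" word="zuvor" pos="ADV" morph="--"/>
    </terminals>
    <nonterminals>
      <nt id="s5_500" cat="NP">
        <edge label="NK" idref="s5_1"/>
        <edge label="NK" idref="s5_2"/>
      </nt>
      <nt id="s5_501" cat="AVP">
        <edge label="CM" idref="s5_6"/>
        <edge label="MO" idref="s5_7"/>
        <edge label="HD" idref="s5_8"/>
      </nt>
    </nonterminals>
  </graph>
</s>'''


def phrases(xml_text):
    """Return (category, [(edge label, word or unresolved id), ...]) per nonterminal."""
    root = ET.fromstring(xml_text)
    words = {t.get("id"): t.get("word") for t in root.iter("t")}
    result = []
    for nt in root.iter("nt"):
        parts = [
            # Fall back to the raw id when the target is not a terminal in this sample.
            (edge.get("label"), words.get(edge.get("idref"), edge.get("idref")))
            for edge in nt.iter("edge")
        ]
        result.append((nt.get("cat"), parts))
    return result


tree = phrases(TIGER)
for cat, parts in tree:
    print(cat, parts)
```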


Problems with stand-off markup

• need tools to link the data: more difficult processing and editing
• no automatic validity checking: consistency, cycles
• difficult to change (correct) primary data or downstream annotations
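Since a schema validator does not follow pointers between documents, the consistency and cycle checks mentioned above have to be programmed separately. The following is a minimal sketch under the assumption that the links have already been extracted into a plain dictionary; the function name check_links and the sample identifiers are hypothetical.

```python
def check_links(edges, known_ids):
    """edges maps a node id to the ids it points at.

    Returns (sorted list of dangling target ids, True if the links contain a cycle).
    """
    dangling = sorted(
        {t for targets in edges.values() for t in targets if t not in known_ids}
    )

    # Depth-first search with three states to detect a cycle.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in edges}

    def visit(node):
        colour[node] = GREY
        for target in edges.get(node, ()):
            # Targets without outgoing links (e.g. terminals) cannot be part of a cycle.
            state = colour.get(target, BLACK)
            if state == GREY or (state == WHITE and visit(target)):
                return True
        colour[node] = BLACK
        return False

    has_cycle = any(colour[node] == WHITE and visit(node) for node in edges)
    return dangling, has_cycle


good = check_links(
    {"nt1": ["t1", "t2"], "nt2": ["nt1", "t3"]},
    {"t1", "t2", "t3", "nt1", "nt2"},
)
bad = check_links({"nt1": ["nt2"], "nt2": ["nt1", "t9"]}, {"nt1", "nt2"})
print(good, bad)
```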


Metadata

• description of the corpus or corpus elements
• traditional bibliographic standards (MARC)
• but computer corpora also need to be documented along other dimensions: availability, size, markup used, relation of digital file to source text, etc.
• EAD was developed for archives, but has many similarities to corpus description
• a metadata recommendation closely coupled with the data itself is the TEI header


TEI header

<teiHeader> is an obligatory part of every TEI document and consists of:

• <fileDesc>, file description: full bibliographical description of the computer file itself; includes information about the source or sources of the electronic text
• <encodingDesc>, encoding description: describes the relationship between the electronic text and its source: normalization, ambiguity resolution, levels of encoding or analysis, etc.
• <profileDesc>, text profile: classificatory & contextual information, e.g. subject matter. Important for corpora, to perform retrievals from a body of text in terms of text type or origin (taxonomies)
• <revisionDesc>, revision history: history of changes made during the development of the electronic text
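A quick way to see how a given header uses these four components is to list which of them are present. The following is a minimal sketch only: the sample header is invented, it is checked for well-formedness but not validated against any TEI DTD, and the function name header_parts is hypothetical.

```python
import xml.etree.ElementTree as ET

HEADER_PARTS = ("fileDesc", "encodingDesc", "profileDesc", "revisionDesc")

# Hypothetical sample header with only two of the four components.
SAMPLE_HEADER = """<teiHeader>
  <fileDesc>
    <titleStmt><title>Sample corpus text</title></titleStmt>
    <publicationStmt><p>Unpublished sample.</p></publicationStmt>
    <sourceDesc><p>Born digital.</p></sourceDesc>
  </fileDesc>
  <profileDesc>
    <textClass><keywords><term>fiction</term></keywords></textClass>
  </profileDesc>
</teiHeader>"""


def header_parts(xml_text):
    """Report which of the four header components occur directly under <teiHeader>."""
    header = ET.fromstring(xml_text)
    present = {child.tag for child in header}
    return {name: name in present for name in HEADER_PARTS}


report = header_parts(SAMPLE_HEADER)
print(report)
```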


TEI header II.

• an example of a TEI header
• very detailed information is possible, but again, there are many ways to express the same information (e.g. free text or structured in elements)
• stricter, but poorer alternatives exist: Dublin Core


Dublin Core

• the Dublin Core Metadata Initiative (DCMI) was founded in 1995 with the aim of creating a core set of metadata descriptions for Web-based resources that would be useful for categorizing the Web for easier search and retrieval
• the Dublin Core Metadata Element Set (DCES) defines 15 elements: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights
• can be extended
• DC is used e.g. by the Open Language Archives Community (OLAC)
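Because the element set is small and closed, checking a record against it is straightforward, which is what makes Dublin Core "stricter, but poorer" than a TEI header. A minimal sketch follows: the record describes this presentation using values from its title slide, the Genre key is deliberately invented to show an unknown element, and the function name check_record is hypothetical.

```python
# The 15 elements of the Dublin Core Metadata Element Set, as listed above.
DCES = (
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
)


def check_record(record):
    """Return (keys that are not DCES elements, DCES elements the record leaves out)."""
    unknown = sorted(key for key in record if key not in DCES)
    missing = [element for element in DCES if element not in record]
    return unknown, missing


record = {
    "Title": "Encoding language corpora: current trends and future directions",
    "Creator": "Tomaž Erjavec",
    "Date": "2006-09-28",
    "Language": "en",
    "Genre": "slides",  # not a DCES element; included to trigger the check
}
unknown, missing = check_record(record)
print(unknown, missing)
```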


Corpus analytical tools

Currently, many corpus exploration tools exist, and they typically offer:

• search with regular expressions over strings
• sometimes search over (lemma/PoS) annotations
• concordance and word frequency list display of results
• sometimes search and display of parallel corpora
• sometimes basic statistical tests (keywordness, collocation strength)
• examples: WordSmith, MonoConc, IMS CQP, Manatee/Bonito, SARA/Xaira, TIGERSearch
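The two most basic functions in the list above, a regular-expression search shown as a concordance and a word frequency list, fit in a few lines. This is a toy sketch and not how any of the named tools is implemented; the sample text is invented around the example sentence used earlier, and the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def concordance(tokens, pattern, width=3):
    """Return (left context, keyword, right context) for tokens fully matching pattern."""
    rx = re.compile(pattern)
    lines = []
    for i, tok in enumerate(tokens):
        if rx.fullmatch(tok):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


TEXT = "Big Brother is watching you, the caption said. The caption was big."
toks = tokenize(TEXT)
freq = Counter(toks)
hits = concordance(toks, r"capt.*", width=2)

for left, key, right in hits:
    print(f"{left:>12} [{key}] {right}")
print(freq.most_common(3))
```

Searching over lemma or PoS annotations works the same way once each token is a record of attributes rather than a bare string.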


What is missing

• possibility to combine different types of annotation in queries and displays, esp. for multimodal corpora
• integration of more powerful statistical methods, esp. for collocations and parallel corpora
• tools targeted to different types of users (e.g. Sketch Engine)
• merging of digital library viewers with corpus concordancing software


Corpora vs. digital libraries

• classical reference corpora were composed of samples, and were interesting only for their linguistic content
• today, more and more corpora contain integral texts, which are of interest in themselves (e.g. historical texts)
• conversely, digital libraries are growing in size and accessibility and are becoming interesting also for linguistic research
• what is needed is a system that can perform two tasks: enable selection of (fragments of) heavily structured (multimedia, text-critical) texts for reading, and allow for concordance views of selections
• currently the only available open-source (OS) system that attempts this is PhiloLogic from the University of Chicago


III. Future directions

Two directions in standardisation of corpus and language resource annotation:

• next version of TEI, version P5
• work by ISO TC 37 SC4


TEI P5

the next version of TEI, currently at beta stage: available, but not stable

significantly revised and brought in line with current practices

not backward compatible with P3/P4 (although scripts exist for conversion)

formal specification based on the ISO RELAX NG schema language (although DTDs and W3C XML Schema versions are also available)

parameterisation also produces dedicated documentation
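
One visible consequence of the break with P4 is that P5 documents use a `<TEI>` root element in the namespace `http://www.tei-c.org/ns/1.0`, whereas P4 documents use an un-namespaced `<TEI.2>` root. The Python sketch below shows how a script might tell the two apart and read word tokens from a P5 file. The sample document and the helper names `looks_like_p5` and `words` are invented for illustration; they are not part of any TEI tool.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented minimal P5-style document, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample text</title></titleStmt>
      <publicationStmt><p>Unpublished sample.</p></publicationStmt>
      <sourceDesc><p>Invented for illustration.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p><s><w>Hello</w> <w>world</w></s></p>
    </body>
  </text>
</TEI>"""

def looks_like_p5(xml_string):
    """True if the root element is <TEI> in the TEI namespace."""
    root = ET.fromstring(xml_string)
    return root.tag == f"{{{TEI_NS}}}TEI"

def words(xml_string):
    """Return the text of every <w> element in the TEI namespace."""
    root = ET.fromstring(xml_string)
    return [w.text for w in root.iter(f"{{{TEI_NS}}}w")]

print(looks_like_p5(SAMPLE))
print(looks_like_p5("<TEI.2><teiHeader/><text/></TEI.2>"))
print(words(SAMPLE))
```

Code written for P4 that looks up elements by bare name will find nothing in a P5 file, which is one practical reason the conversion scripts are needed.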

ISO TC 37

ISO TC 37: ISO Technical Committee on Terminology, est. 1952

perhaps best known for ISO 639 and MARTIF

in 2002 changed its name to Technical Committee on Terminology and Other Language Resources

also established ISO TC 37/SC 4, the Sub-Committee on Language Resource Management

ISO TC 37 SC4 WGs

WG 1: Basic descriptors and mechanisms for language resources
– terminology used in language resources
– basic mechanisms and data structures for linguistic representation
– meta-data representation scheme to document linguistic information structures and processes

WG 2: Representation schemes
– definition of annotation/representation schemes for morpho-syntax and syntax
– representation scheme for the semantic content of multimodal information
– metadata for discourse level representation scheme

WG 3: Multilingual text representation
– translation memory and alignment of parallel corpora
– segmentation and counting algorithms
– meta-markup for Globalization, Internationalization and Localization (GIL)

WG 4: Lexical databases
– standardization of lexical representation formats for the various types of NLP applications (Machine Readable Lexica)

WG 5: Workflow of language resource management
– standardization of guidelines for language validation and net-based distributed cooperative work

WG4 standards

Language resource management - Feature structures

Language resource management - Lexical markup framework (LMF)

Language resource management - Morpho-syntactic annotation framework (MAF)

all under development!

Conclusions

I presented some history, the current state and possible future directions of encoding standardisation, mainly for corpora

the main recommendation (for me!) still seems to be TEI: it combines tradition with innovation

Thank you!