Advanced Language Technologies Information and Communication Technologies Module "Knowledge Technologies" Jožef Stefan International Postgraduate School

Advanced Language Technologies

Information and Communication TechnologiesModule "Knowledge Technologies"

Jožef Stefan International Postgraduate SchoolWinter 2010 / Spring 2011

Tomaž Erjavec

Lecture II.Computer Corpora

Overview of the Overview of the lecturelecture

1.1. BackgroundBackground

2.2. Corpus compilation and markupCorpus compilation and markup

3.3. Morphosyntactic taggingMorphosyntactic tagging

BackgroundBackground

• What is a corpus?

• Using corpora

• Characteristics of a corpus

• Typology of corpora

• History

• Slovene language corpora

A corpus isA corpus is::

a large collection of textsa large collection of texts in digital formatin digital format language “as it is”language “as it is” a sample of the language it is a sample of the language it is

meant to representmeant to represent used for describing language used for describing language

(des(desccriptivriptivee/empi/empirical linguisticsrical linguistics))

A more precise A more precise definitiondefinition

CorpusCorpus (plural (plural corporacorpora) is Latin for ) is Latin for bodybody Guidelines of the Expert Advisory Group on Guidelines of the Expert Advisory Group on

Language Engineering Standards, Language Engineering Standards, EAGLES::– Corpus : A collection of pieces of language that are : A collection of pieces of language that are

selected and ordered according to explicit linguistic selected and ordered according to explicit linguistic criteria in order to be used as a sample of the criteria in order to be used as a sample of the language.language.

– Computer corpus : a corpus which is encoded in a : a corpus which is encoded in a standardised and homogeneous way for open-ended standardised and homogeneous way for open-ended retrieval tasks. Its constituent pieces of language retrieval tasks. Its constituent pieces of language are documented as to their origins and provenance. are documented as to their origins and provenance.

For computer scientists: a datasetFor computer scientists: a dataset

Using corporaUsing corpora

Applied linguistics: Applied linguistics: – LexicographyLexicography: making dictionaries (first users of : making dictionaries (first users of

corpora)corpora)– Translation studiesTranslation studies: translation equivalents with : translation equivalents with

contextscontextstranslation memories, machine aided translations translation memories, machine aided translations

– Language learningLanguage learning: real-life examples, curriculum : real-life examples, curriculum development development

Corpus linguistics:Corpus linguistics:– linguistics based not on introspection, but on linguistics based not on introspection, but on

observation of real dataobservation of real data Language technologyLanguage technology: :

– testing set for developed methods; testing set for developed methods; – training settraining set for inductive learning for inductive learning

((statistical Natural Language Processing) )

Characteristics of a Characteristics of a (good) corpus(good) corpus QuantityQuantity::

the bigger, the better the bigger, the better Quality Quality ::

the texts are authentic; the mark-up is the texts are authentic; the mark-up is validated validated

SimplicitySimplicity::the computer representation is the computer representation is understandable, with the markup easily understandable, with the markup easily separated from the text separated from the text

DocumentedDocumented::the corpus contains bibliographic and other the corpus contains bibliographic and other meta-datameta-data

Typology of corpora I.Typology of corpora I.

Medium:Medium:– written languagewritten language– spokenspoken languagelanguage (spoken, but in writing / (spoken, but in writing /

transcription)transcription)– speechspeech corporacorpora (actual speech signal) (actual speech signal)

Content:Content:– referencereference corpora (representative), e.g. corpora (representative), e.g. BNC – sub-language corporasub-language corpora (specialised), e.g. (specialised), e.g. COLT

Structure:Structure:– corpora with corpora with integralintegral texts texts – corpora or of text corpora or of text samplessamples (historical and legal (historical and legal

reasons)reasons)e.g. e.g. Brown

Typology of corpora IITypology of corpora II

Time:Time:– staticstatic corpora corpora– monitormonitor corpora (language change) corpora (language change)

Languages:Languages:– monolingual monolingual corporacorpora– multilingual multilingual parallelparallel corpora (e.g. corpora (e.g. Hansard, ,

Europarl, , JRC Acquis))– multilingual multilingual comparablecomparable corpora corpora

Annotation:Annotation:– plain textplain text corpora corpora– annotatedannotated corpora corpora

ReferenReference corporace corpora

Characteristics:Characteristics:– a sample of the “complete” languagea sample of the “complete” language– large, expensive, detailed and explicit design criterialarge, expensive, detailed and explicit design criteria– ttypically of contemporary languageypically of contemporary language– dodocumented and annotatedcumented and annotated– legaly clean, availablelegaly clean, available (but usu. only via a (but usu. only via a

concordancer)concordancer) Criteria for including textsCriteria for including texts::

– representativenessrepresentativeness::corpus includes corpus includes ““allall” ” text typestext types

– balancebalance::the sizes of text type samples are in proportion to the sizes of text type samples are in proportion to their “importance” for the speakers of the languagetheir “importance” for the speakers of the language

metodmetodhodology v.s. practical constraintshodology v.s. practical constraints

HistoryHistory of of corporacorpora First milestones: First milestones:

Brown (1 million words) 1964; (1 million words) 1964; LOB (also 1M) 1974 (also 1M) 1974 The spread of reference corpora: Cobuild Bank of The spread of reference corpora: Cobuild Bank of

English (monitor, 100..200..M) 1980; English (monitor, 100..200..M) 1980; BNC (100M) (100M) 1995; Czech 1995; Czech CNC (100M) 1998; Croatian (100M) 1998; Croatian HNK (100M) 1999... (100M) 1999...

Slovene reference corpora: Slovene reference corpora: FIDA (100M), (100M), Nova Beseda (100M...) 1998; (100M...) 1998; FIDA+ (600M) 2006 (600M) 2006; ; gigaFIDA (2011?)gigaFIDA (2011?). .

EU corpus oriented projects in the '90: NERC, EU corpus oriented projects in the '90: NERC, MULTEXT-East,... ,...

Language resources brokers: Language resources brokers: LDC 1992, 1992, ELRA 1995 1995

Web as Corpus (2000Web as Corpus (2000....)): ukWaC, itWaC, … slWaC: ukWaC, itWaC, … slWaC more, larger, for more languages, with diverse more, larger, for more languages, with diverse

annotationsannotations: EUROPARL, PDT, … : EUROPARL, PDT, …

Slovene language Slovene language corporacorporaMonolingual reference corpora:Monolingual reference corpora: ZRC SAZU: ZRC SAZU: Beseda, 1998; , 1998; Nova beseda, 2000- , 2000- DZS, Amebis, FF, IJS: DZS, Amebis, FF, IJS: FIDA, 1998, , 1998, FidaPlus, 2006, 2006 IJS, FF: IJS, FF: JOS corpora corpora

Parallel corpora:Parallel corpora: IJS: IJS: MULTEXT-East 1998- 1998-, , SVEZ-IJS, 2004, , 2004, JRC-ACQUIS, 2006 , 2006 SVEZ: EuroKorpusSVEZ: EuroKorpus FF: FF: TRANS, 2002, 2002

Speech corpora:Speech corpora: Laboratory for Digital Signal Processing, University of Maribo

r::SpeechDat, ONOMASTICA... SpeechDat, ONOMASTICA...

Laboratory of Articifical Perception, Systems and Cybernetics, University of Ljubljana::SQEL, GOPOLIS,... SQEL, GOPOLIS,...

II. Compilation and II. Compilation and markup of corporamarkup of corpora

• Steps in the preparation of a corpus

• What annotation can be added to the text

• Computer coding of corpora

• Markup Methods

Before making your Before making your own corpusown corpus

check if an appropriate corpus is check if an appropriate corpus is already availablealready available

googlegoogle [email protected] LDC, , ELRA

Steps in the Steps in the preparation of a preparation of a corpuscorpus1.1. Choosing the component texts Choosing the component texts

and acquiring digital originalsand acquiring digital originals

2.2. Up-translation to standard Up-translation to standard formatformat

3.3. Linguistic annotationLinguistic annotation

4.4. DocumentationDocumentation

5.5. Use and DisseminationUse and Dissemination

Getting the textGetting the text

1.1. Choosing the component texts:Choosing the component texts:linguistic and non-linguistic criteria; linguistic and non-linguistic criteria; availability; simplicity; size availability; simplicity; size

2.2. CopyrightCopyrightsensitivity of source (financial and sensitivity of source (financial and privacy considerations); agreement privacy considerations); agreement with providers; usage, publication with providers; usage, publication

3.3. Acquiring digital originalsAcquiring digital originalsOCR; digital originals; Web OCR; digital originals; Web

BootCatBootCat

ProcessingProcessing

1.1. Conversion to common format Conversion to common format consistency; character set consistency; character set encodings; structure encodings; structure

Web as Corpus: Wacky toolsWeb as Corpus: Wacky tools

2.2. DocumentationDocumentatione.g. TEI header; Open Archives e.g. TEI header; Open Archives etc. etc.

3.3. Linguistic annotationLinguistic annotationlanguage dependent methods; language dependent methods; errors errors

Use and disseminationUse and dissemination

Using the corpus:Using the corpus:– concordancer (linguists)concordancer (linguists)

e.ge.g FidaPLUS, , SKE, , iKorpus, JOS, IMP, JOS, IMP– statistics extraction statistics extraction – development of new methods for analysisdevelopment of new methods for analysis

Dissemination:Dissemination:– legalities (source copyright, corpus use legalities (source copyright, corpus use

agreement)agreement)– mode: concordancer or datasetmode: concordancer or dataset

Computer coding of Computer coding of corporacorpora Encoding must ensure Encoding must ensure

– durabilitydurability– interchange between computer platformsinterchange between computer platforms– interchange between applications interchange between applications

Basic standard: Basic standard: XML – companion standards: companion standards: W3C W3C Schema, ISO Relax NG, Schema, ISO Relax NG,

XSLT, XSLT, XPath, XQuery, ... XPath, XQuery, ... XML XML vocabulary of annotationsvocabulary of annotations of arbitrary of arbitrary

texts:texts:Text Encoding InitiativeText Encoding Initiative, , TEI

ISO TC 37 „Terminology and other language ISO TC 37 „Terminology and other language resources“: many standards for text resources“: many standards for text encodingencoding

Corpus annotationCorpus annotation

Annotation = interpretation Annotation = interpretation Documentation about the corpus (Documentation about the corpus (example) ) Document structure (Document structure (example) ) Basic linguistic markup: sentences, words (Basic linguistic markup: sentences, words (

example), punctuation, abbreviations ), punctuation, abbreviations (example) (example)

Lemmas and morphosyntactic descriptions Lemmas and morphosyntactic descriptions (example) (example)

Syntax (example) Syntax (example) Alignment (example) Alignment (example) Terms, semantics, anaphora, pragmatics, Terms, semantics, anaphora, pragmatics,

intonation,... intonation,...

Example: TEI headerExample: TEI header

<teiHeader id="ecmr.H" type="text" lang="sl-en" creator=ET <teiHeader id="ecmr.H" type="text" lang="sl-en" creator=ET status="update" date.created="1999-04-13" date.updated="1999-06-status="update" date.created="1999-04-13" date.updated="1999-06-22" > 22" >

<fileDesc><fileDesc> <titleStmt><titleStmt> <title lang="sl">Ekonomsko ogledalo; 13 &scaron;tevilk 98/99</title> <title lang="sl">Ekonomsko ogledalo; 13 &scaron;tevilk 98/99</title> <title lang="en">Slovenian Economic Mirror; 13 issues, 98/99</title> <title lang="en">Slovenian Economic Mirror; 13 issues, 98/99</title> <respstmt> <respstmt> <name>Andrej Skubic, FF</name> <name>Andrej Skubic, FF</name> <resp lang="sl">Zagotovitev digitalnega originala, <resp lang="sl">Zagotovitev digitalnega originala,

poravnava</resp> poravnava</resp> <resp lang="en">Provision of digital original, alignment</resp> <resp lang="en">Provision of digital original, alignment</resp> <name>Toma<name>Tomažž Erjavec, IJS</name> Erjavec, IJS</name> <resp lang="sl">Tokenizacija, pretvorba v TEI</resp><resp lang="sl">Tokenizacija, pretvorba v TEI</resp> <resp lang="en">Tokenisation, conversion to TEI</resp> <resp lang="en">Tokenisation, conversion to TEI</resp> </respStmt></respStmt> </titleStmt> ... </titleStmt> ...

Example: text Example: text structurestructure<quote id="Osl.1.8.18" rend="center;it"><quote id="Osl.1.8.18" rend="center;it"> <lg id="Osl.1.8.18.1"> <lg id="Osl.1.8.18.1"> <l id="Osl.1.8.18.1.1">Tam pod kostanjevim <l id="Osl.1.8.18.1.1">Tam pod kostanjevim

drevesom</l> drevesom</l> <l id="Osl.1.8.18.1.2">izdala si me,</l> <l id="Osl.1.8.18.1.2">izdala si me,</l> <l id="Osl.1.8.18.1.3">izdal sem te,</l> <l id="Osl.1.8.18.1.3">izdal sem te,</l> <l id="Osl.1.8.18.1.4">ne da bi trenila z očesom.</l> <l id="Osl.1.8.18.1.4">ne da bi trenila z očesom.</l> </lg> </lg> </quote> </quote> <p id="Osl.1.8.19"> <p id="Osl.1.8.19"> <s id="Osl.1.8.19.1">Trije možje se niso niti ganili.</s><s id="Osl.1.8.19.1">Trije možje se niso niti ganili.</s> <s id="Osl.1.8.19.2">Toda ko je <s id="Osl.1.8.19.2">Toda ko je

<name>Winston</name> znova pogledal v Rutherfordov <name>Winston</name> znova pogledal v Rutherfordov propadli obraz, je opazil, da so njegove oči polne propadli obraz, je opazil, da so njegove oči polne solz.</s> ... solz.</s> ...

Example: Example: morphosyntactic morphosyntactic taggingtagging<s id="Osl.1.2.2.1"> <s id="Osl.1.2.2.1"> <w lemma="biti" ana="Vcps-sma">Bil</w> <w lemma="biti" ana="Vcps-sma">Bil</w> <w lemma="biti" ana="Vcip3s--n">je</w><w lemma="biti" ana="Vcip3s--n">je</w> <w lemma="jasen“ ana="Afpmsnn">jasen</w><c>,</c><w lemma="jasen“ ana="Afpmsnn">jasen</w><c>,</c> <w lemma="mrzel" ana="Afpmsnn">mrzel</w><w lemma="mrzel" ana="Afpmsnn">mrzel</w> <w lemma="aprilski" ana="Aopmsn">aprilski</w><w lemma="aprilski" ana="Aopmsn">aprilski</w> <w lemma="dan" ana="Ncmsn">dan</w> <w lemma="dan" ana="Ncmsn">dan</w> <w lemma="in" ana="Ccs">in</w> <w lemma="in" ana="Ccs">in</w> <w lemma="ura" ana="Ncfpn">ure</w> <w lemma="ura" ana="Ncfpn">ure</w> <w lemma="biti" ana="Vcip3p--n">so</w> <w lemma="biti" ana="Vcip3p--n">so</w> <w lemma="biti" ana="Vmps-pfa">bile</w> <w lemma="biti" ana="Vmps-pfa">bile</w> <w lemma="trinajst" <w lemma="trinajst"

ana="Mcnpnl">trinajst</w><c>.</c> ana="Mcnpnl">trinajst</w><c>.</c> </s> </s>

Example: alignmentExample: alignment

<linkGrp id="Oslen.1" type="body" targtype="s" <linkGrp id="Oslen.1" type="body" targtype="s" domains="Oen Osl"> domains="Oen Osl">

<link xtargets="Osl.1.2.2.1 ; Oen.1.1.1.1"> <link xtargets="Osl.1.2.2.1 ; Oen.1.1.1.1"> <link xtargets="Osl.1.2.2.2 ; Oen.1.1.1.2"> <link xtargets="Osl.1.2.2.2 ; Oen.1.1.1.2"> <link xtargets="Osl.1.2.3.1 ; Oen.1.1.2.1"> <link xtargets="Osl.1.2.3.1 ; Oen.1.1.2.1"> <link xtargets="Osl.1.2.3.2 ; Oen.1.1.2.2"> <link xtargets="Osl.1.2.3.2 ; Oen.1.1.2.2"> ... ... <link xtargets="Osl.1.2.6.5 ; Oen.1.1.5.5"><link xtargets="Osl.1.2.6.5 ; Oen.1.1.5.5"> <link xtargets="Osl.1.2.6.6 ; Oen.1.1.5.6 <link xtargets="Osl.1.2.6.6 ; Oen.1.1.5.6

Oen.1.1.5.7"> Oen.1.1.5.7"> <link xtargets="Osl.1.2.6.7 ; Oen.1.1.5.8"> <link xtargets="Osl.1.2.6.7 ; Oen.1.1.5.8"> ... ...

Methods for linguistic Methods for linguistic markupmarkup hand annotationhand annotation: documentation, first steps: documentation, first steps

generic (XML, spreadsheet) editors or specialised editors generic (XML, spreadsheet) editors or specialised editors semi-automaticsemi-automatic: morphosyntactic and other linguistic : morphosyntactic and other linguistic

annotationannotationcyclic approach: machine, hand, validate, correct, cyclic approach: machine, hand, validate, correct, machine, ... machine, ...

machine, with hand-written rulesmachine, with hand-written rules: tokenisation: tokenisationregular expression regular expression

machine, with inductively built models from annotated machine, with inductively built models from annotated datadata: : "supervised learning"; HMMs, decision trees, inductive "supervised learning"; HMMs, decision trees, inductive logic programming,... logic programming,...

machine, with inductively built models from un-machine, with inductively built models from un-annotated dataannotated data: : "unsupervised leaning"; clustering techniques "unsupervised leaning"; clustering techniques

overview of the field overview of the field

III. Morphosyntactic III. Morphosyntactic taggingtagging Better known as part-of-speech (PoS) taggingBetter known as part-of-speech (PoS) tagging Tagging is the task of labeling each word in a

sequence of words with its appropriate part-of-speech

Words are often ambiguous with respect to their POS:– saw → singular noun „I brought a saw“– saw → past tense of verb „I saw a tree“

Purposes and applications (examples):– pre-processing step for further analyses:

lemmatisation syntactic structure, etc.

– text indexing, e.g. nouns are more useful than verbs– pronunciation in speech processing

Steps in taggingSteps in tagging

for each word token in text the tagger for each word token in text the tagger needs to know all its possible tags needs to know all its possible tags (ambiguity class) (ambiguity class) →→ a morphological lexicon a morphological lexicon

given the context in which the word given the context in which the word appears in, the tagger must decide in appears in, the tagger must decide in the correct tag:the correct tag:– he saw/V a man carrying a saw/Nhe saw/V a man carrying a saw/N

so, tagging performs limited syntactic so, tagging performs limited syntactic disambiguationdisambiguation

Example: Penn Example: Penn TreebankTreebankUnder/IN the/DT proposal/NN ,/,

Delmed/NNP would/MD issue/VB about/IN 123.5/CD million/CD additional/JJ Delmed/NNP common/JJ shares/NNS to/TO Fresenius/NNP at/IN an/DT average/JJ price/NN of/IN about/IN 65/CD cents/NNS a/DT share/NN ,/, though/IN under/IN no/DT circumstances/NNS more/JJR than/IN 75/CD cents/NNS a/DT share/NN ./.

PoS taggersPoS taggers

Most taggers induce the language Most taggers induce the language model from a hand-annotated model from a hand-annotated corpuscorpus

Typically, two resources are Typically, two resources are induced:induced:– lexicon, giving the lexicon, giving the ambiguity class ambiguity class of a of a

word and their frequencies in the word and their frequencies in the training corpustraining corpus

– tag n-gramstag n-grams

Tagging with Markov Models Sequence of tags in a text is regarded a Markov

chain Limited horizon: A word’s tag only depends on the

previous tag: p(xi+1 = tj | x1, ..., xi) = p (xi+1 = tj | xi) Time invariant: This dependency does not change

over time: p(xi+1 = tj | xi) = p(x2 = tj | x1) Task: Find the most probable tag sequence for a

sequence of words Maximum likelihood estimate of tag tk following tj:

p(tk | tj) = f(tj,tk) / f(tj) Optimal tags for a sentence:

t´1,n = arg max p(t1,n | w1,n) = Π p(wi | ti) p(ti | ti-1)

Most popular Markov Most popular Markov model taggermodel tagger TnTTnT (Trigrams ‘n Tags) (Trigrams ‘n Tags) induces lexicon and tag trigrams from induces lexicon and tag trigrams from

the training corpusthe training corpus has heuristics to tag unknown wordshas heuristics to tag unknown words has no problem with large tagsetshas no problem with large tagsets fast in training and taggingfast in training and tagging freely available for non-commercial usefreely available for non-commercial use but only as a Linux executablebut only as a Linux executable OS alternative: hunposOS alternative: hunpos

Yet another TaggerYet another Tagger

For a while, trying out new For a while, trying out new approaches to tagging was in approaches to tagging was in fashionfashion

Maximum Entropy taggersMaximum Entropy taggers Support Vector Machine taggersSupport Vector Machine taggers Memory based taggersMemory based taggers ……

TagsetsTagsets

A tagset is a set of part-of-speech tagsA tagset is a set of part-of-speech tags Classical 8 classes (Thrax, 100 BC):

noun, verb, article, participle, pronoun, preposition, adverb, conjunction

But all tagset use more tags than that! Criteria:Criteria:

– specifiability: degree to which humans use the tagset uniformly on the same text

– accuracy: evaluation of output on tagged text

– suitability for intended application

Tagsets for EnglishTagsets for English

For English, there exist several tagsets: For English, there exist several tagsets: Brown, CLAWS, Penn, …Brown, CLAWS, Penn, …

English tagsets include PoS + some other English tagsets include PoS + some other morphological (inflectional) properties: 30-80 morphological (inflectional) properties: 30-80 tagstags

Penn Treebank Tagset for English: 37 tags, e.g.Penn Treebank Tagset for English: 37 tags, e.g.– JJ adjective, positive– JJR adjective, comparative– JJS adjective, superlative– NN non-plural common noun– NNS plural common noun– NNP non-plural proper name– NNPS plural proper name– IN preposition– ……

Morphosyntactic Morphosyntactic tagsetstagsets For inflectionaly rich languages (such as For inflectionaly rich languages (such as

Slavic languages), tagsets contain much Slavic languages), tagsets contain much more information than just PoSmore information than just PoS

Slovene, Czech, etc. > 1Slovene, Czech, etc. > 1,,000 different 000 different morphosyntactic tagsmorphosyntactic tags– gendgendeer, number, case, animacy, definiteness, …r, number, case, animacy, definiteness, …

Efforts to standardise tagsets across Efforts to standardise tagsets across languages:languages:– EaglesEagles– MULTEXTMULTEXT– MULTEXT-EastMULTEXT-East

MULTEXT-EastMULTEXT-East

EU project in ’90s: development of EU project in ’90s: development of language resources for Central and East-language resources for Central and East-European languagesEuropean languages

Several later releasesSeveral later releases, V4 in 2010 (17 , V4 in 2010 (17 languages)languages)

DDevelopment of morphosyntactic evelopment of morphosyntactic specifications, lexica and annotated corpusspecifications, lexica and annotated corpus

Parallel annotated corpus:Parallel annotated corpus:Orwell’s 1984Orwell’s 1984

Web site: http://nl.ijs.si/ME/ Web site: http://nl.ijs.si/ME/

MULTEXT-East MULTEXT-East morphosyntactic morphosyntactic specificationsspecifications

Specify Specify – what morphosyntactic features particular what morphosyntactic features particular

languages distinguish, languages distinguish, – what their names and values are, what their names and values are, – how they can be mapped to how they can be mapped to tags

(morphosyntactic descriptions, MSDs) e.g. that e.g. that Ncms is:

– a valid for Slovene– is equivalent to PoS:Noun, Type:common,

Gender:masculine, Number:singular http://nl.ijs.si/ME/Vhttp://nl.ijs.si/ME/V44/msd/html//msd/html/

JOS projectJOS project

JOS language resources are meant to facilitate JOS language resources are meant to facilitate developments of human language technologies and developments of human language technologies and corpus linguistics for the Slovene language corpus linguistics for the Slovene language

MMorphosyntactic specificationsorphosyntactic specifications TTwo annotated corporawo annotated corpora ( (morphosyntactic descriptions and morphosyntactic descriptions and

lemmaslemmas))– jos100kjos100k (hand validated)(hand validated)– jos1M (partially hand validated)jos1M (partially hand validated)

SSampled ampled from FidaPLUS corpusfrom FidaPLUS corpus jos100kjos100k:: syntactic and semantic levels of linguistic syntactic and semantic levels of linguistic

descriptiondescription TTwo web services wo web services

– concordancer concordancer – text annotation tooltext annotation tool

Encoded in Encoded in TEI P5 TEI P5 FFreely available reely available (CC): (CC): http://nl.ijs.si/jos/http://nl.ijs.si/jos/

jos100k encodingjos100k encoding

<s xml:id="F0020003.557.2"><s xml:id="F0020003.557.2"> <w xml:id="F0020003.557.2.1" lemma="ta" msd="Zk-sei">To</w><S/><w xml:id="F0020003.557.2.1" lemma="ta" msd="Zk-sei">To</w><S/> <w xml:id="F0020003.557.2.2" lemma="biti" <w xml:id="F0020003.557.2.2" lemma="biti" msd="Gp-ste-n">je</w><S/>msd="Gp-ste-n">je</w><S/> <term type="sloWNet" sortKey="kraj<term type="sloWNet" sortKey="kraj” ” key="ENG20-08114200-n">key="ENG20-08114200-n"> <w xml:id="F0020003.557.2.3" lemma="turisti<w xml:id="F0020003.557.2.3" lemma="turističčen„en„ mmsd="Ppnmein">turistisd="Ppnmein">turističčen</w><S/>en</w><S/> <w xml:id="F0020003.557.2.4" lemma="kraj" msd="Somei">kraj</w><w xml:id="F0020003.557.2.4" lemma="kraj" msd="Somei">kraj</w> </term></term> <c xml:id="F0020003.557.2.5">.</c><S/><c xml:id="F0020003.557.2.5">.</c><S/></s></s><linkGrp type="syntax" targFunc="head argument" <linkGrp type="syntax" targFunc="head argument" corresp="#F0020003.557.2">corresp="#F0020003.557.2"> <link type="ena" targets="#F0020003.557.2.2 #F0020003.557.2.1"/><link type="ena" targets="#F0020003.557.2.2 #F0020003.557.2.1"/> <link type="modra" targets="#F0020003.557.2 #F0020003.557.2.2"/><link type="modra" targets="#F0020003.557.2 #F0020003.557.2.2"/> <link type="dol" targets="#F0020003.557.2.4 #F0020003.557.2.3"/><link type="dol" targets="#F0020003.557.2.4 #F0020003.557.2.3"/> <link type="dol" targets="#F0020003.557.2.2 #F0020003.557.2.4"/><link type="dol" targets="#F0020003.557.2.2 #F0020003.557.2.4"/> <link type="modra" targets="#F0020003.557.2 #F0020003.557.2.5"/><link type="modra" targets="#F0020003.557.2 #F0020003.557.2.5"/></linkGrp</linkGrp>>

Processing Historical Processing Historical LanguageLanguage interesting for diachronic linguistics and interesting for diachronic linguistics and

better access to digital librariesbetter access to digital libraries problems: problems:

– difficult to obtain good transcriptionsdifficult to obtain good transcriptions– great variation in spellinggreat variation in spelling– no resouces for tool training no resouces for tool training

Historical slv:Historical slv: Late standardisation (XIX ≠ XX)Late standardisation (XIX ≠ XX) BeforeBefore 1850: 1850: ſ ſh s sh z zh → s š z ž c čſ ſh s sh z zh → s š z ž c č No corporaNo corpora/lexica /lexica of historical Sloveneof historical Slovene

AHLibAHLib (2004–08) (2004–08)Deutsch-slowenische/kroatische Übersetzung 1848–1918Deutsch-slowenische/kroatische Übersetzung 1848–1918

– SScans + correction + (lemmatisation) of ger→slv bookscans + correction + (lemmatisation) of ger→slv books– AAS & Karl-Franzens University, Graz (prof. Erich Prunč)AAS & Karl-Franzens University, Graz (prof. Erich Prunč)– JSI: correction JSI: correction & & lemmatisation environmentlemmatisation environment

EU EU IP IP IMPACTIMPACT ((ext. ext. 2010–2011)2010–2011)

– BBetter OCR for historical textsetter OCR for historical texts– NUK: GTD transcriptions NUK: GTD transcriptions – JSI: (semi)manual lexicon constructionJSI: (semi)manual lexicon construction

Google awardGoogle award (2011)(2011)Developing language models for historical SloveneDeveloping language models for historical Slovene

– ZRC SAZU: transcriptions of old textsZRC SAZU: transcriptions of old texts– JSI: annotating a corpus of XIXJSI: annotating a corpus of XIXthth century Slovene century Slovene

BackgroundBackground

Tomaž Erjavec: Tomaž Erjavec: Annotating Historical Annotating Historical Slovene Slovene

4242

Producing the IMP Producing the IMP corpuscorpus Representative Representative & & balanced, sampledbalanced, sampled Corpus element: unbroken & contiguous text Corpus element: unbroken & contiguous text

from 1 pagefrom 1 page Sampled by decade Sampled by decade & & texttext Target size: 1,000 pages (~200,000 words)Target size: 1,000 pages (~200,000 words) Encoded in TEI P5Encoded in TEI P5 Automatically annotatedAutomatically annotated TToolool for manual annotation for manual annotation: IMPACT INL Cobalt: IMPACT INL Cobalt Annotator training & management: MayAnnotator training & management: May Manual correction: June–NovemberManual correction: June–November


4343

Annotation toolAnnotation tool

Approach:Approach:ModernisModernisee, then process as contemporary language, then process as contemporary languageLanguage independent (trainable) modulesLanguage independent (trainable) modules

Steps:Steps:

1.1.ToTokenisation kenisation (mlToken)(mlToken)

2.2.TrTranscription anscription (Vaam)(Vaam)

3.3.TaTagginggging (TnT(TnT))4.4.LeLemmatisationmmatisation (CLOG)(CLOG)

= ToTrTaLe= ToTrTaLePipeline in PerlPipeline in PerlTEI P5 I/OTEI P5 I/O


4444

ConclusionsConclusions

What is a corpusWhat is a corpus How to make itHow to make it How to annotate itHow to annotate it Case studies: MULTEXT-East, JOS, Case studies: MULTEXT-East, JOS,

IMPIMP

Documents

Advanced Language Technologies Information and Communication Technologies Module "Knowledge Technologies" Jožef Stefan International Postgraduate School