Upload
sabrina-tucker
View
216
Download
0
Tags:
Embed Size (px)
Citation preview
Advanced Language Technologies
Information and Communication TechnologiesModule "Knowledge Technologies"
Jožef Stefan International Postgraduate SchoolWinter 2010 / Spring 2011
Tomaž Erjavec
Lecture II.Computer Corpora
Overview of the Overview of the lecturelecture
1.1. BackgroundBackground
2.2. Corpus compilation and markupCorpus compilation and markup
3.3. Morphosyntactic taggingMorphosyntactic tagging
BackgroundBackground
• What is a corpus?
• Using corpora
• Characteristics of a corpus
• Typology of corpora
• History
• Slovene language corpora
A corpus isA corpus is::
a large collection of textsa large collection of texts in digital formatin digital format language “as it is”language “as it is” a sample of the language it is a sample of the language it is
meant to representmeant to represent used for describing language used for describing language
(des(desccriptivriptivee/empi/empirical linguisticsrical linguistics))
A more precise A more precise definitiondefinition
CorpusCorpus (plural (plural corporacorpora) is Latin for ) is Latin for bodybody Guidelines of the Expert Advisory Group on Guidelines of the Expert Advisory Group on
Language Engineering Standards, Language Engineering Standards, EAGLES::– Corpus : A collection of pieces of language that are : A collection of pieces of language that are
selected and ordered according to explicit linguistic selected and ordered according to explicit linguistic criteria in order to be used as a sample of the criteria in order to be used as a sample of the language.language.
– Computer corpus : a corpus which is encoded in a : a corpus which is encoded in a standardised and homogeneous way for open-ended standardised and homogeneous way for open-ended retrieval tasks. Its constituent pieces of language retrieval tasks. Its constituent pieces of language are documented as to their origins and provenance. are documented as to their origins and provenance.
For computer scientists: a datasetFor computer scientists: a dataset
Using corporaUsing corpora
Applied linguistics: Applied linguistics: – LexicographyLexicography: making dictionaries (first users of : making dictionaries (first users of
corpora)corpora)– Translation studiesTranslation studies: translation equivalents with : translation equivalents with
contextscontextstranslation memories, machine aided translations translation memories, machine aided translations
– Language learningLanguage learning: real-life examples, curriculum : real-life examples, curriculum development development
Corpus linguistics:Corpus linguistics:– linguistics based not on introspection, but on linguistics based not on introspection, but on
observation of real dataobservation of real data Language technologyLanguage technology: :
– testing set for developed methods; testing set for developed methods; – training settraining set for inductive learning for inductive learning
((statistical Natural Language Processing) )
Characteristics of a Characteristics of a (good) corpus(good) corpus QuantityQuantity::
the bigger, the better the bigger, the better Quality Quality ::
the texts are authentic; the mark-up is the texts are authentic; the mark-up is validated validated
SimplicitySimplicity::the computer representation is the computer representation is understandable, with the markup easily understandable, with the markup easily separated from the text separated from the text
DocumentedDocumented::the corpus contains bibliographic and other the corpus contains bibliographic and other meta-datameta-data
Typology of corpora I.Typology of corpora I.
Medium:Medium:– written languagewritten language– spokenspoken languagelanguage (spoken, but in writing / (spoken, but in writing /
transcription)transcription)– speechspeech corporacorpora (actual speech signal) (actual speech signal)
Content:Content:– referencereference corpora (representative), e.g. corpora (representative), e.g. BNC – sub-language corporasub-language corpora (specialised), e.g. (specialised), e.g. COLT
Structure:Structure:– corpora with corpora with integralintegral texts texts – corpora or of text corpora or of text samplessamples (historical and legal (historical and legal
reasons)reasons)e.g. e.g. Brown
Typology of corpora IITypology of corpora II
Time:Time:– staticstatic corpora corpora– monitormonitor corpora (language change) corpora (language change)
Languages:Languages:– monolingual monolingual corporacorpora– multilingual multilingual parallelparallel corpora (e.g. corpora (e.g. Hansard, ,
Europarl, , JRC Acquis))– multilingual multilingual comparablecomparable corpora corpora
Annotation:Annotation:– plain textplain text corpora corpora– annotatedannotated corpora corpora
ReferenReference corporace corpora
Characteristics:Characteristics:– a sample of the “complete” languagea sample of the “complete” language– large, expensive, detailed and explicit design criterialarge, expensive, detailed and explicit design criteria– ttypically of contemporary languageypically of contemporary language– dodocumented and annotatedcumented and annotated– legaly clean, availablelegaly clean, available (but usu. only via a (but usu. only via a
concordancer)concordancer) Criteria for including textsCriteria for including texts::
– representativenessrepresentativeness::corpus includes corpus includes ““allall” ” text typestext types
– balancebalance::the sizes of text type samples are in proportion to the sizes of text type samples are in proportion to their “importance” for the speakers of the languagetheir “importance” for the speakers of the language
metodmetodhodology v.s. practical constraintshodology v.s. practical constraints
HistoryHistory of of corporacorpora First milestones: First milestones:
Brown (1 million words) 1964; (1 million words) 1964; LOB (also 1M) 1974 (also 1M) 1974 The spread of reference corpora: Cobuild Bank of The spread of reference corpora: Cobuild Bank of
English (monitor, 100..200..M) 1980; English (monitor, 100..200..M) 1980; BNC (100M) (100M) 1995; Czech 1995; Czech CNC (100M) 1998; Croatian (100M) 1998; Croatian HNK (100M) 1999... (100M) 1999...
Slovene reference corpora: Slovene reference corpora: FIDA (100M), (100M), Nova Beseda (100M...) 1998; (100M...) 1998; FIDA+ (600M) 2006 (600M) 2006; ; gigaFIDA (2011?)gigaFIDA (2011?). .
EU corpus oriented projects in the '90: NERC, EU corpus oriented projects in the '90: NERC, MULTEXT-East,... ,...
Language resources brokers: Language resources brokers: LDC 1992, 1992, ELRA 1995 1995
Web as Corpus (2000Web as Corpus (2000....)): ukWaC, itWaC, … slWaC: ukWaC, itWaC, … slWaC more, larger, for more languages, with diverse more, larger, for more languages, with diverse
annotationsannotations: EUROPARL, PDT, … : EUROPARL, PDT, …
Slovene language Slovene language corporacorporaMonolingual reference corpora:Monolingual reference corpora: ZRC SAZU: ZRC SAZU: Beseda, 1998; , 1998; Nova beseda, 2000- , 2000- DZS, Amebis, FF, IJS: DZS, Amebis, FF, IJS: FIDA, 1998, , 1998, FidaPlus, 2006, 2006 IJS, FF: IJS, FF: JOS corpora corpora
Parallel corpora:Parallel corpora: IJS: IJS: MULTEXT-East 1998- 1998-, , SVEZ-IJS, 2004, , 2004, JRC-ACQUIS, 2006 , 2006 SVEZ: EuroKorpusSVEZ: EuroKorpus FF: FF: TRANS, 2002, 2002
Speech corpora:Speech corpora: Laboratory for Digital Signal Processing, University of Maribo
r::SpeechDat, ONOMASTICA... SpeechDat, ONOMASTICA...
Laboratory of Articifical Perception, Systems and Cybernetics, University of Ljubljana::SQEL, GOPOLIS,... SQEL, GOPOLIS,...
II. Compilation and II. Compilation and markup of corporamarkup of corpora
• Steps in the preparation of a corpus
• What annotation can be added to the text
• Computer coding of corpora
• Markup Methods
Before making your Before making your own corpusown corpus
check if an appropriate corpus is check if an appropriate corpus is already availablealready available
googlegoogle [email protected] LDC, , ELRA
Steps in the Steps in the preparation of a preparation of a corpuscorpus1.1. Choosing the component texts Choosing the component texts
and acquiring digital originalsand acquiring digital originals
2.2. Up-translation to standard Up-translation to standard formatformat
3.3. Linguistic annotationLinguistic annotation
4.4. DocumentationDocumentation
5.5. Use and DisseminationUse and Dissemination
Getting the textGetting the text
1.1. Choosing the component texts:Choosing the component texts:linguistic and non-linguistic criteria; linguistic and non-linguistic criteria; availability; simplicity; size availability; simplicity; size
2.2. CopyrightCopyrightsensitivity of source (financial and sensitivity of source (financial and privacy considerations); agreement privacy considerations); agreement with providers; usage, publication with providers; usage, publication
3.3. Acquiring digital originalsAcquiring digital originalsOCR; digital originals; Web OCR; digital originals; Web
BootCatBootCat
ProcessingProcessing
1.1. Conversion to common format Conversion to common format consistency; character set consistency; character set encodings; structure encodings; structure
Web as Corpus: Wacky toolsWeb as Corpus: Wacky tools
2.2. DocumentationDocumentatione.g. TEI header; Open Archives e.g. TEI header; Open Archives etc. etc.
3.3. Linguistic annotationLinguistic annotationlanguage dependent methods; language dependent methods; errors errors
Use and disseminationUse and dissemination
Using the corpus:Using the corpus:– concordancer (linguists)concordancer (linguists)
e.ge.g FidaPLUS, , SKE, , iKorpus, JOS, IMP, JOS, IMP– statistics extraction statistics extraction – development of new methods for analysisdevelopment of new methods for analysis
Dissemination:Dissemination:– legalities (source copyright, corpus use legalities (source copyright, corpus use
agreement)agreement)– mode: concordancer or datasetmode: concordancer or dataset
Computer coding of Computer coding of corporacorpora Encoding must ensure Encoding must ensure
– durabilitydurability– interchange between computer platformsinterchange between computer platforms– interchange between applications interchange between applications
Basic standard: Basic standard: XML – companion standards: companion standards: W3C W3C Schema, ISO Relax NG, Schema, ISO Relax NG,
XSLT, XSLT, XPath, XQuery, ... XPath, XQuery, ... XML XML vocabulary of annotationsvocabulary of annotations of arbitrary of arbitrary
texts:texts:Text Encoding InitiativeText Encoding Initiative, , TEI
ISO TC 37 „Terminology and other language ISO TC 37 „Terminology and other language resources“: many standards for text resources“: many standards for text encodingencoding
Corpus annotationCorpus annotation
Annotation = interpretation Annotation = interpretation Documentation about the corpus (Documentation about the corpus (example) ) Document structure (Document structure (example) ) Basic linguistic markup: sentences, words (Basic linguistic markup: sentences, words (
example), punctuation, abbreviations ), punctuation, abbreviations (example) (example)
Lemmas and morphosyntactic descriptions Lemmas and morphosyntactic descriptions (example) (example)
Syntax (example) Syntax (example) Alignment (example) Alignment (example) Terms, semantics, anaphora, pragmatics, Terms, semantics, anaphora, pragmatics,
intonation,... intonation,...
Example: TEI headerExample: TEI header
<teiHeader id="ecmr.H" type="text" lang="sl-en" creator=ET <teiHeader id="ecmr.H" type="text" lang="sl-en" creator=ET status="update" date.created="1999-04-13" date.updated="1999-06-status="update" date.created="1999-04-13" date.updated="1999-06-22" > 22" >
<fileDesc><fileDesc> <titleStmt><titleStmt> <title lang="sl">Ekonomsko ogledalo; 13 številk 98/99</title> <title lang="sl">Ekonomsko ogledalo; 13 številk 98/99</title> <title lang="en">Slovenian Economic Mirror; 13 issues, 98/99</title> <title lang="en">Slovenian Economic Mirror; 13 issues, 98/99</title> <respstmt> <respstmt> <name>Andrej Skubic, FF</name> <name>Andrej Skubic, FF</name> <resp lang="sl">Zagotovitev digitalnega originala, <resp lang="sl">Zagotovitev digitalnega originala,
poravnava</resp> poravnava</resp> <resp lang="en">Provision of digital original, alignment</resp> <resp lang="en">Provision of digital original, alignment</resp> <name>Toma<name>Tomažž Erjavec, IJS</name> Erjavec, IJS</name> <resp lang="sl">Tokenizacija, pretvorba v TEI</resp><resp lang="sl">Tokenizacija, pretvorba v TEI</resp> <resp lang="en">Tokenisation, conversion to TEI</resp> <resp lang="en">Tokenisation, conversion to TEI</resp> </respStmt></respStmt> </titleStmt> ... </titleStmt> ...
Example: text Example: text structurestructure<quote id="Osl.1.8.18" rend="center;it"><quote id="Osl.1.8.18" rend="center;it"> <lg id="Osl.1.8.18.1"> <lg id="Osl.1.8.18.1"> <l id="Osl.1.8.18.1.1">Tam pod kostanjevim <l id="Osl.1.8.18.1.1">Tam pod kostanjevim
drevesom</l> drevesom</l> <l id="Osl.1.8.18.1.2">izdala si me,</l> <l id="Osl.1.8.18.1.2">izdala si me,</l> <l id="Osl.1.8.18.1.3">izdal sem te,</l> <l id="Osl.1.8.18.1.3">izdal sem te,</l> <l id="Osl.1.8.18.1.4">ne da bi trenila z očesom.</l> <l id="Osl.1.8.18.1.4">ne da bi trenila z očesom.</l> </lg> </lg> </quote> </quote> <p id="Osl.1.8.19"> <p id="Osl.1.8.19"> <s id="Osl.1.8.19.1">Trije možje se niso niti ganili.</s><s id="Osl.1.8.19.1">Trije možje se niso niti ganili.</s> <s id="Osl.1.8.19.2">Toda ko je <s id="Osl.1.8.19.2">Toda ko je
<name>Winston</name> znova pogledal v Rutherfordov <name>Winston</name> znova pogledal v Rutherfordov propadli obraz, je opazil, da so njegove oči polne propadli obraz, je opazil, da so njegove oči polne solz.</s> ... solz.</s> ...
Example: Example: morphosyntactic morphosyntactic taggingtagging<s id="Osl.1.2.2.1"> <s id="Osl.1.2.2.1"> <w lemma="biti" ana="Vcps-sma">Bil</w> <w lemma="biti" ana="Vcps-sma">Bil</w> <w lemma="biti" ana="Vcip3s--n">je</w><w lemma="biti" ana="Vcip3s--n">je</w> <w lemma="jasen“ ana="Afpmsnn">jasen</w><c>,</c><w lemma="jasen“ ana="Afpmsnn">jasen</w><c>,</c> <w lemma="mrzel" ana="Afpmsnn">mrzel</w><w lemma="mrzel" ana="Afpmsnn">mrzel</w> <w lemma="aprilski" ana="Aopmsn">aprilski</w><w lemma="aprilski" ana="Aopmsn">aprilski</w> <w lemma="dan" ana="Ncmsn">dan</w> <w lemma="dan" ana="Ncmsn">dan</w> <w lemma="in" ana="Ccs">in</w> <w lemma="in" ana="Ccs">in</w> <w lemma="ura" ana="Ncfpn">ure</w> <w lemma="ura" ana="Ncfpn">ure</w> <w lemma="biti" ana="Vcip3p--n">so</w> <w lemma="biti" ana="Vcip3p--n">so</w> <w lemma="biti" ana="Vmps-pfa">bile</w> <w lemma="biti" ana="Vmps-pfa">bile</w> <w lemma="trinajst" <w lemma="trinajst"
ana="Mcnpnl">trinajst</w><c>.</c> ana="Mcnpnl">trinajst</w><c>.</c> </s> </s>
Example: alignmentExample: alignment
<linkGrp id="Oslen.1" type="body" targtype="s" <linkGrp id="Oslen.1" type="body" targtype="s" domains="Oen Osl"> domains="Oen Osl">
<link xtargets="Osl.1.2.2.1 ; Oen.1.1.1.1"> <link xtargets="Osl.1.2.2.1 ; Oen.1.1.1.1"> <link xtargets="Osl.1.2.2.2 ; Oen.1.1.1.2"> <link xtargets="Osl.1.2.2.2 ; Oen.1.1.1.2"> <link xtargets="Osl.1.2.3.1 ; Oen.1.1.2.1"> <link xtargets="Osl.1.2.3.1 ; Oen.1.1.2.1"> <link xtargets="Osl.1.2.3.2 ; Oen.1.1.2.2"> <link xtargets="Osl.1.2.3.2 ; Oen.1.1.2.2"> ... ... <link xtargets="Osl.1.2.6.5 ; Oen.1.1.5.5"><link xtargets="Osl.1.2.6.5 ; Oen.1.1.5.5"> <link xtargets="Osl.1.2.6.6 ; Oen.1.1.5.6 <link xtargets="Osl.1.2.6.6 ; Oen.1.1.5.6
Oen.1.1.5.7"> Oen.1.1.5.7"> <link xtargets="Osl.1.2.6.7 ; Oen.1.1.5.8"> <link xtargets="Osl.1.2.6.7 ; Oen.1.1.5.8"> ... ...
Methods for linguistic Methods for linguistic markupmarkup hand annotationhand annotation: documentation, first steps: documentation, first steps
generic (XML, spreadsheet) editors or specialised editors generic (XML, spreadsheet) editors or specialised editors semi-automaticsemi-automatic: morphosyntactic and other linguistic : morphosyntactic and other linguistic
annotationannotationcyclic approach: machine, hand, validate, correct, cyclic approach: machine, hand, validate, correct, machine, ... machine, ...
machine, with hand-written rulesmachine, with hand-written rules: tokenisation: tokenisationregular expression regular expression
machine, with inductively built models from annotated machine, with inductively built models from annotated datadata: : "supervised learning"; HMMs, decision trees, inductive "supervised learning"; HMMs, decision trees, inductive logic programming,... logic programming,...
machine, with inductively built models from un-machine, with inductively built models from un-annotated dataannotated data: : "unsupervised leaning"; clustering techniques "unsupervised leaning"; clustering techniques
overview of the field overview of the field
III. Morphosyntactic III. Morphosyntactic taggingtagging Better known as part-of-speech (PoS) taggingBetter known as part-of-speech (PoS) tagging Tagging is the task of labeling each word in a
sequence of words with its appropriate part-of-speech
Words are often ambiguous with respect to their POS:– saw → singular noun „I brought a saw“– saw → past tense of verb „I saw a tree“
Purposes and applications (examples):– pre-processing step for further analyses:
lemmatisation syntactic structure, etc.
– text indexing, e.g. nouns are more useful than verbs– pronunciation in speech processing
Steps in taggingSteps in tagging
for each word token in text the tagger for each word token in text the tagger needs to know all its possible tags needs to know all its possible tags (ambiguity class) (ambiguity class) →→ a morphological lexicon a morphological lexicon
given the context in which the word given the context in which the word appears in, the tagger must decide in appears in, the tagger must decide in the correct tag:the correct tag:– he saw/V a man carrying a saw/Nhe saw/V a man carrying a saw/N
so, tagging performs limited syntactic so, tagging performs limited syntactic disambiguationdisambiguation
Example: Penn Example: Penn TreebankTreebankUnder/IN the/DT proposal/NN ,/,
Delmed/NNP would/MD issue/VB about/IN 123.5/CD million/CD additional/JJ Delmed/NNP common/JJ shares/NNS to/TO Fresenius/NNP at/IN an/DT average/JJ price/NN of/IN about/IN 65/CD cents/NNS a/DT share/NN ,/, though/IN under/IN no/DT circumstances/NNS more/JJR than/IN 75/CD cents/NNS a/DT share/NN ./.
PoS taggersPoS taggers
Most taggers induce the language Most taggers induce the language model from a hand-annotated model from a hand-annotated corpuscorpus
Typically, two resources are Typically, two resources are induced:induced:– lexicon, giving the lexicon, giving the ambiguity class ambiguity class of a of a
word and their frequencies in the word and their frequencies in the training corpustraining corpus
– tag n-gramstag n-grams
Tagging with Markov Models Sequence of tags in a text is regarded a Markov
chain Limited horizon: A word’s tag only depends on the
previous tag: p(xi+1 = tj | x1, ..., xi) = p (xi+1 = tj | xi) Time invariant: This dependency does not change
over time: p(xi+1 = tj | xi) = p(x2 = tj | x1) Task: Find the most probable tag sequence for a
sequence of words Maximum likelihood estimate of tag tk following tj:
p(tk | tj) = f(tj,tk) / f(tj) Optimal tags for a sentence:
t´1,n = arg max p(t1,n | w1,n) = Π p(wi | ti) p(ti | ti-1)
Most popular Markov Most popular Markov model taggermodel tagger TnTTnT (Trigrams ‘n Tags) (Trigrams ‘n Tags) induces lexicon and tag trigrams from induces lexicon and tag trigrams from
the training corpusthe training corpus has heuristics to tag unknown wordshas heuristics to tag unknown words has no problem with large tagsetshas no problem with large tagsets fast in training and taggingfast in training and tagging freely available for non-commercial usefreely available for non-commercial use but only as a Linux executablebut only as a Linux executable OS alternative: hunposOS alternative: hunpos
Yet another TaggerYet another Tagger
For a while, trying out new For a while, trying out new approaches to tagging was in approaches to tagging was in fashionfashion
Maximum Entropy taggersMaximum Entropy taggers Support Vector Machine taggersSupport Vector Machine taggers Memory based taggersMemory based taggers ……
TagsetsTagsets
A tagset is a set of part-of-speech tagsA tagset is a set of part-of-speech tags Classical 8 classes (Thrax, 100 BC):
noun, verb, article, participle, pronoun, preposition, adverb, conjunction
But all tagset use more tags than that! Criteria:Criteria:
– specifiability: degree to which humans use the tagset uniformly on the same text
– accuracy: evaluation of output on tagged text
– suitability for intended application
Tagsets for EnglishTagsets for English
For English, there exist several tagsets: For English, there exist several tagsets: Brown, CLAWS, Penn, …Brown, CLAWS, Penn, …
English tagsets include PoS + some other English tagsets include PoS + some other morphological (inflectional) properties: 30-80 morphological (inflectional) properties: 30-80 tagstags
Penn Treebank Tagset for English: 37 tags, e.g.Penn Treebank Tagset for English: 37 tags, e.g.– JJ adjective, positive– JJR adjective, comparative– JJS adjective, superlative– NN non-plural common noun– NNS plural common noun– NNP non-plural proper name– NNPS plural proper name– IN preposition– ……
Morphosyntactic Morphosyntactic tagsetstagsets For inflectionaly rich languages (such as For inflectionaly rich languages (such as
Slavic languages), tagsets contain much Slavic languages), tagsets contain much more information than just PoSmore information than just PoS
Slovene, Czech, etc. > 1Slovene, Czech, etc. > 1,,000 different 000 different morphosyntactic tagsmorphosyntactic tags– gendgendeer, number, case, animacy, definiteness, …r, number, case, animacy, definiteness, …
Efforts to standardise tagsets across Efforts to standardise tagsets across languages:languages:– EaglesEagles– MULTEXTMULTEXT– MULTEXT-EastMULTEXT-East
MULTEXT-EastMULTEXT-East
EU project in ’90s: development of EU project in ’90s: development of language resources for Central and East-language resources for Central and East-European languagesEuropean languages
Several later releasesSeveral later releases, V4 in 2010 (17 , V4 in 2010 (17 languages)languages)
DDevelopment of morphosyntactic evelopment of morphosyntactic specifications, lexica and annotated corpusspecifications, lexica and annotated corpus
Parallel annotated corpus:Parallel annotated corpus:Orwell’s 1984Orwell’s 1984
Web site: http://nl.ijs.si/ME/ Web site: http://nl.ijs.si/ME/
MULTEXT-East MULTEXT-East morphosyntactic morphosyntactic specificationsspecifications
Specify Specify – what morphosyntactic features particular what morphosyntactic features particular
languages distinguish, languages distinguish, – what their names and values are, what their names and values are, – how they can be mapped to how they can be mapped to tags
(morphosyntactic descriptions, MSDs) e.g. that e.g. that Ncms is:
– a valid for Slovene– is equivalent to PoS:Noun, Type:common,
Gender:masculine, Number:singular http://nl.ijs.si/ME/Vhttp://nl.ijs.si/ME/V44/msd/html//msd/html/
JOS projectJOS project
JOS language resources are meant to facilitate JOS language resources are meant to facilitate developments of human language technologies and developments of human language technologies and corpus linguistics for the Slovene language corpus linguistics for the Slovene language
MMorphosyntactic specificationsorphosyntactic specifications TTwo annotated corporawo annotated corpora ( (morphosyntactic descriptions and morphosyntactic descriptions and
lemmaslemmas))– jos100kjos100k (hand validated)(hand validated)– jos1M (partially hand validated)jos1M (partially hand validated)
SSampled ampled from FidaPLUS corpusfrom FidaPLUS corpus jos100kjos100k:: syntactic and semantic levels of linguistic syntactic and semantic levels of linguistic
descriptiondescription TTwo web services wo web services
– concordancer concordancer – text annotation tooltext annotation tool
Encoded in Encoded in TEI P5 TEI P5 FFreely available reely available (CC): (CC): http://nl.ijs.si/jos/http://nl.ijs.si/jos/
jos100k encodingjos100k encoding
<s xml:id="F0020003.557.2"><s xml:id="F0020003.557.2"> <w xml:id="F0020003.557.2.1" lemma="ta" msd="Zk-sei">To</w><S/><w xml:id="F0020003.557.2.1" lemma="ta" msd="Zk-sei">To</w><S/> <w xml:id="F0020003.557.2.2" lemma="biti" <w xml:id="F0020003.557.2.2" lemma="biti" msd="Gp-ste-n">je</w><S/>msd="Gp-ste-n">je</w><S/> <term type="sloWNet" sortKey="kraj<term type="sloWNet" sortKey="kraj” ” key="ENG20-08114200-n">key="ENG20-08114200-n"> <w xml:id="F0020003.557.2.3" lemma="turisti<w xml:id="F0020003.557.2.3" lemma="turističčen„en„ mmsd="Ppnmein">turistisd="Ppnmein">turističčen</w><S/>en</w><S/> <w xml:id="F0020003.557.2.4" lemma="kraj" msd="Somei">kraj</w><w xml:id="F0020003.557.2.4" lemma="kraj" msd="Somei">kraj</w> </term></term> <c xml:id="F0020003.557.2.5">.</c><S/><c xml:id="F0020003.557.2.5">.</c><S/></s></s><linkGrp type="syntax" targFunc="head argument" <linkGrp type="syntax" targFunc="head argument" corresp="#F0020003.557.2">corresp="#F0020003.557.2"> <link type="ena" targets="#F0020003.557.2.2 #F0020003.557.2.1"/><link type="ena" targets="#F0020003.557.2.2 #F0020003.557.2.1"/> <link type="modra" targets="#F0020003.557.2 #F0020003.557.2.2"/><link type="modra" targets="#F0020003.557.2 #F0020003.557.2.2"/> <link type="dol" targets="#F0020003.557.2.4 #F0020003.557.2.3"/><link type="dol" targets="#F0020003.557.2.4 #F0020003.557.2.3"/> <link type="dol" targets="#F0020003.557.2.2 #F0020003.557.2.4"/><link type="dol" targets="#F0020003.557.2.2 #F0020003.557.2.4"/> <link type="modra" targets="#F0020003.557.2 #F0020003.557.2.5"/><link type="modra" targets="#F0020003.557.2 #F0020003.557.2.5"/></linkGrp</linkGrp>>
Processing Historical Processing Historical LanguageLanguage interesting for diachronic linguistics and interesting for diachronic linguistics and
better access to digital librariesbetter access to digital libraries problems: problems:
– difficult to obtain good transcriptionsdifficult to obtain good transcriptions– great variation in spellinggreat variation in spelling– no resouces for tool training no resouces for tool training
Historical slv:Historical slv: Late standardisation (XIX ≠ XX)Late standardisation (XIX ≠ XX) BeforeBefore 1850: 1850: ſ ſh s sh z zh → s š z ž c čſ ſh s sh z zh → s š z ž c č No corporaNo corpora/lexica /lexica of historical Sloveneof historical Slovene
AHLibAHLib (2004–08) (2004–08)Deutsch-slowenische/kroatische Übersetzung 1848–1918Deutsch-slowenische/kroatische Übersetzung 1848–1918
– SScans + correction + (lemmatisation) of ger→slv bookscans + correction + (lemmatisation) of ger→slv books– AAS & Karl-Franzens University, Graz (prof. Erich Prunč)AAS & Karl-Franzens University, Graz (prof. Erich Prunč)– JSI: correction JSI: correction & & lemmatisation environmentlemmatisation environment
EU EU IP IP IMPACTIMPACT ((ext. ext. 2010–2011)2010–2011)
– BBetter OCR for historical textsetter OCR for historical texts– NUK: GTD transcriptions NUK: GTD transcriptions – JSI: (semi)manual lexicon constructionJSI: (semi)manual lexicon construction
Google awardGoogle award (2011)(2011)Developing language models for historical SloveneDeveloping language models for historical Slovene
– ZRC SAZU: transcriptions of old textsZRC SAZU: transcriptions of old texts– JSI: annotating a corpus of XIXJSI: annotating a corpus of XIXthth century Slovene century Slovene
BackgroundBackground
Tomaž Erjavec: Tomaž Erjavec: Annotating Historical Annotating Historical Slovene Slovene
4242
Producing the IMP Producing the IMP corpuscorpus Representative Representative & & balanced, sampledbalanced, sampled Corpus element: unbroken & contiguous text Corpus element: unbroken & contiguous text
from 1 pagefrom 1 page Sampled by decade Sampled by decade & & texttext Target size: 1,000 pages (~200,000 words)Target size: 1,000 pages (~200,000 words) Encoded in TEI P5Encoded in TEI P5 Automatically annotatedAutomatically annotated TToolool for manual annotation for manual annotation: IMPACT INL Cobalt: IMPACT INL Cobalt Annotator training & management: MayAnnotator training & management: May Manual correction: June–NovemberManual correction: June–November
Tomaž Erjavec: Tomaž Erjavec: Annotating Historical Annotating Historical Slovene Slovene
4343
Annotation toolAnnotation tool
Approach:Approach:ModernisModernisee, then process as contemporary language, then process as contemporary languageLanguage independent (trainable) modulesLanguage independent (trainable) modules
Steps:Steps:
1.1.ToTokenisation kenisation (mlToken)(mlToken)
2.2.TrTranscription anscription (Vaam)(Vaam)
3.3.TaTagginggging (TnT(TnT))4.4.LeLemmatisationmmatisation (CLOG)(CLOG)
= ToTrTaLe= ToTrTaLePipeline in PerlPipeline in PerlTEI P5 I/OTEI P5 I/O
Tomaž Erjavec: Tomaž Erjavec: Annotating Historical Annotating Historical Slovene Slovene
4444