Research on historical data at Språkbanken
Dana Dannélls
Språkbanken, Department of Swedish Language
University of Gothenburg
Språkbanken Kick-off meeting, 2015-01-27
Digital historical texts
• Digitization of historical texts aims at preserving cultural heritage and making it more accessible.
• Ongoing digitization initiatives seek to create digital text resources which can be searched and processed by machines.
• Together with the availability of historical text resources in digital form, there is a growing interest in applying NLP methods and tools to historical texts.
Digitized historical Swedish texts
• Old Swedish (fornsvenska)
  • starts with Latin-script manuscripts of law text (around 1225)
  • ends with (pre-)publication of the Gustav Vasa Bible (1526)
• Modern Swedish (nysvenska)
  • Early Modern Swedish (äldre nysvenska)
    • starts from the Gustav Vasa Bible, 1526
    • ends with Olof v. Dalin’s Then Swänska Argus, 1732
  • Late Modern Swedish (yngre nysvenska)
    • starts from 1732
    • ends with August Strindberg’s Röda rummet, 1879
Material includes codices, legal manuscripts, religious prose, poetry, non-fiction, letters, etc.
Characterization of historical texts
• lack of standardized orthography
• old words and word forms
• spelling variation
• grammar change: inflectional complexity, word order, subordination
• OCR errors
Problems for LT: limited resources, lack of annotated material, lack of grammar and morphological descriptions, no native speakers.
Old Swedish
Han beddis almoso af petro ok johanne Tha han saa them til byriä at inga j mönstrit Petrus sagde til hans Jak hafuir ey gul ällär silfuer vtan thz som iak hafuir gifuir jak thik j ihesu christi nazareni nampn stat vp ok gak Ok ginstan grep sanctus petrus hans höghre hand ok vplypte han ok ämbrat festos hans sinor ok fötir ok han sprang vp stodh ok gik in j mönstrit medhär thöm gangande ok springande ok lowande gudh ok alt folkit saa han gangande ok lofuande gudh ok kiändo han at han var then sami som saat for mönstersins port thiggiande almoso ok vndradho mykit a thz som honom var hänt ok then tidh the hioldo petrum oc iohannem lop alt folkit til therä vndrande
Early Modern Swedish
Ingen lärer kunna neka, at ju sådane Skriffter hafwa stor nytta med sig, som, på ett angenämt och lustigt sätt, föreställa Lärdomar och Wettenskaper; Derföre hafwa och de gamla, under roliga Dikter, liufliga Samtahl eller nöysamma Historier, underwisat Folket om Dygden, och likasom skiämtewijs förehållit dem alfwarsamma Sede-Läror. I nyare tider, och än i dag, se wi äfwen, hos kloka Nationer, sådane Skriffter med mycken nytta utgifwas och älskas. Men fast än hwarken de gamla sådane Läro-sätt skulle älskat eller nyare frägdade Folckeslag dem älska, så wete wi dock at Hof-smaken . . .
Digital image (Then Swänska Argus)
Exploring sentence boundaries
Thaa kesarinnan fik höra ath then wnga herrän war kommen thaa
tilreddhe hon sigh mz iomfrum och klädom som hon aldra bäst kunne
och kom gaangande til konungen och til hans son ther the saatho
baaden til samman ‖ konungen bad hänne sätia sig när sonnen ‖
kesarinnan sade til konungen herra är tättha edher son som saa länghe
hafuer borto waridh när the wisa mästara ‖ konungen sade ya män jak
kan ekke wettha hurw thz gaar til thy han wil inthe tala, ‖ thaa sadhe
hon herra antuarden honnom mik jak skal wil wäl göra honnom talande
och togh honom widh haandena och wille hafuan mz sigh, ‖ thaa
warde han sig och wille ekke mz, ‖ fadren bad honom gaa mz hänne ‖
thaa negh han sinom fader ödhmyuklighan och war honom lydoger ‖
kesarinnan ledde honom in j en kammara och badh alla wtgaa och
satte honnom oppaa en sänga stok när sigh och sadhe, hiärtans käre
diocleciane, (Sju vise mästare C, 1492)
Machine learning with HaCOSSA
Sentence annotation:
• Hamburg Corpus of Old Swedish with Syntactic Annotations (Höder 2011): annotation for clauses
• Construct sentence-like annotation: 8k sentences in 137k words
• Average sentence length: 16.5 tokens
Sentence as a sequence of tags:
• S | L0 (L1 M*)? R
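The tag pattern above can be read as a regular expression over per-token tags. The following is a minimal sketch under that reading, assuming one tag per token and that a sentence is a full match of the pattern. The slide does not spell out what the tags mean, and `is_sentence` is a hypothetical helper name.

```python
import re

# Pattern from the slide, with tags separated by single spaces.
# Assumption: each token carries exactly one of the tags S, L0, L1, M, R.
SENTENCE_PATTERN = re.compile(r"S|L0(?: L1(?: M)*)? R")


def is_sentence(tags):
    """Return True if the tag sequence forms exactly one sentence."""
    return SENTENCE_PATTERN.fullmatch(" ".join(tags)) is not None


print(is_sentence(["L0", "L1", "M", "M", "R"]))
print(is_sentence(["L0", "M", "R"]))
```

Under this reading, a sequence tagger only has to predict one of five tags per token, and the pattern decides which tag sequences are well formed.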
Sentence boundaries evaluation
All features, 10-fold: precision 82.9, recall 66.4
All features, leave-one-out: precision 76.0, recall 58.2
Lexical information helps, but it is difficult to generalize.
Simple spelling normalization helps, especially precision for leave-one-out (more than +5%).
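For reference, precision and recall over sentence boundaries can be computed as in the sketch below. It assumes boundaries are represented as token positions and that a predicted boundary counts as correct only on an exact position match. The slides do not state the matching criterion, so both choices are assumptions, as is the function name.

```python
def boundary_precision_recall(gold, predicted):
    """Precision and recall of predicted boundary positions (exact match)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy positions for illustration only, not data from the evaluation.
precision, recall = boundary_precision_recall({5, 12, 20, 31}, {5, 12, 25})
print(precision, recall)
```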
Spelling variation
o → u: 0.2 arvuþi ærvoþi
æ → e: 0.27 ær er
au → ö: 0.31 barnlös barnalös barnalaus
pt → ft: 0.42 apter after æftær
gg# → g#: 0.43 væg vægg vegg
þer → n: 0.44 maþer man
th → þ: 0.44 oþolskipti othalskipte
mp → m: 0.45 hamn hampn
eli → li: 0.45 lastelika lastlika
ghi → i: 0.62 aplöia opplöghia
(Ahlberg & Bouma 2012)
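A minimal sketch of how rewrite rules like those above could be used to link a spelling variant to a lexicon entry. It assumes that the number attached to each rule acts as a cost, that `#` marks the end of a word, and that a single rule application is enough. The toy lexicon and the function names are hypothetical, and this is not presented as the procedure of the cited work.

```python
# A few rules copied from the slide: (left side, right side, weight).
RULES = [
    ("æ", "e", 0.27),
    ("pt", "ft", 0.42),
    ("gg#", "g#", 0.43),
    ("mp", "m", 0.45),
]


def single_rule_variants(word, rules=RULES):
    """Yield (candidate, weight) for every single application of a rule."""
    marked = word + "#"  # '#' marks the end of the word
    for lhs, rhs, weight in rules:
        start = marked.find(lhs)
        while start != -1:
            candidate = marked[:start] + rhs + marked[start + len(lhs):]
            yield candidate.rstrip("#"), weight
            start = marked.find(lhs, start + 1)


def link_to_lexicon(word, lexicon, rules=RULES):
    """Return lexicon entries reachable from `word`, cheapest first."""
    if word in lexicon:
        return [(word, 0.0)]
    hits = [(c, w) for c, w in single_rule_variants(word, rules) if c in lexicon]
    return sorted(hits, key=lambda hit: hit[1])


toy_lexicon = {"hamn", "after", "væg"}  # hypothetical lexicon entries
print(link_to_lexicon("hampn", toy_lexicon))
print(link_to_lexicon("vægg", toy_lexicon))
```

Chaining several rules and summing their weights would extend the sketch, at the price of a larger candidate space.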
Spelling variation evaluation
Link-up 96% of Fornsvenska Textbanken
Estimated correctness in top 3: 78%
Does not handle morphology: öknen → ökn.N
Problems with compound splitting:
villhonnugh → vilder.A + hunagh.N
POS-tagging: Projecting syntactic information
[Figure: time axis from 1526 to 1873; marked texts: Swart, 1560 and Dalin, 1732-1734]
POS-tagging: Extending Hunpos with morphological information
• Idea: use the model for contemporary Swedish, and plug in an extra morphology for historical language.
• Swedberg and Dalin (Borin & Forsberg 2008)
+ komm NN → VB
+ doch UO → AB (foreign words)
+ ähra PM → NN (proper nouns)
- Stolpe NN → PM (proper/common nouns)
- moste VB → NN (errors in the morphology)
- wisa VB → NN (morphology not yet covering)
öfverväxa vb inf aktiv
öfvervälta vb inf aktiv
öfvervintra vb inf aktiv
öfvervinna vb inf aktiv
öfvervika vb inf aktiv
prisar vb pres sg ind aktiv
prisas vb pres sg ind s-form
prise vb pres pl ind p1 aktiv
prises vb pres pl ind p1 s-form
prisen vb pres pl ind p2 aktiv
prisens vb pres pl ind p2 s-form
prisa vb pres pl ind p3 aktiv
Morphology
Semi-supervised learning?
Can we make abstractions from known inflection tables?
Paradigm induction
Can we use these to predict inflection of unseen words?
Lexicon construction
(Ahlberg, Forsberg & Hulden 2014)
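To make the two questions above concrete, here is a deliberately simplified sketch. It abstracts an inflection table into a paradigm by treating the longest common prefix of the forms as a single variable `x1`, then fills that variable with a new stem. This prefix-only abstraction is an assumption chosen for brevity and is not presented as the method of the cited work; the function names and the stem `tal` are hypothetical.

```python
import os


def abstract_paradigm(forms):
    """Split forms into a shared stem and a list of 'x1+suffix' slots."""
    stem = os.path.commonprefix(forms)  # character-wise common prefix
    return stem, ["x1+" + form[len(stem):] for form in forms]


def instantiate(paradigm, stem):
    """Fill the variable x1 of every slot with the given stem."""
    return [slot.replace("x1+", stem, 1) for slot in paradigm]


# Word forms taken from the morphology excerpt on the Hunpos slide.
stem, paradigm = abstract_paradigm(["prisar", "prisas", "prise", "prises"])
print(stem, paradigm)
print(instantiate(paradigm, "tal"))  # hypothetical unseen stem
```

A prefix-based variable cannot capture stem-internal changes such as umlaut, which is one reason a real system needs a richer abstraction.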
Morphology: Paradigm induction
Morphology: Lexicon induction
Morphology evaluation
Modern languages:
Table accuracy of 76.50–98.00%
Form accuracy of 91.81–99.58%
Old Swedish: (Adesam et al. 2014)
Table accuracy of ∼54.00%
Fornsvenska reader: http://demo.spraakdata.gu.se/fsvreader/
Morfologilabbet: http://spraakbanken.gu.se/karp/morfologilabbet/
OCR error correction
• OCR Swedish text with high quality, independent of the quality of the print.
• OCRopus: open-source OCR engine, neural-network based
• Material:
  • Blackletter texts printed between approx. 1600 and 1800
  • Olof v. Dalin’s Swänska Argus (1732–1734, Stockholm)
Post-processing
Noisy channel model:
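$\arg\max_{\mathit{orig}} \; p(\mathit{orig} \mid \mathit{ocr}) = \arg\max_{\mathit{orig}} \; [\, p(\mathit{ocr} \mid \mathit{orig}) \cdot p(\mathit{orig}) \,]$
• error model (EM) = p(ocr | orig)
• language model (LM) = p(orig)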
Kolak, O. (2005). OCR post-processing for low density languages. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP).
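The noisy channel decision rule can be sketched for single words as follows. This is a toy illustration under strong assumptions: candidates come from a small word list, the error model only handles character substitutions between strings of equal length, and all probabilities and counts are invented for the example. The function names, `p_same`, `p_other` and the confusion table are hypothetical.

```python
import math


def error_logprob(ocr, orig, confusion, p_same=0.9, p_other=1e-4):
    """log p(ocr | orig) under a per-character substitution model."""
    if len(ocr) != len(orig):
        return float("-inf")  # insertions and deletions are not modelled here
    total = 0.0
    for orig_char, ocr_char in zip(orig, ocr):
        if orig_char == ocr_char:
            total += math.log(p_same)
        else:
            total += math.log(confusion.get((orig_char, ocr_char), p_other))
    return total


def correct_word(ocr, word_counts, confusion):
    """Return the argmax over orig of p(ocr | orig) * p(orig), in log space."""
    n = sum(word_counts.values())

    def score(orig):
        return error_logprob(ocr, orig, confusion) + math.log(word_counts[orig] / n)

    best = max(word_counts, key=score)
    # Keep the OCR output when no candidate has non-zero probability.
    return best if score(best) > float("-inf") else ocr


# Toy unigram counts and confusion probabilities, for illustration only.
word_counts = {"eller": 50, "aller": 5, "del": 30, "hälst": 10}
confusion = {("e", "c"): 0.05, ("ä", "å"): 0.05}
print(correct_word("cller", word_counts, confusion))
print(correct_word("hålst", word_counts, confusion))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.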
Error model
Estimates the probability that a certain transformation can occur to a string
ocr: ε apitel cller dcl är ey så wäl utarbetad,
orig: Capitel eller del är ey så wäl utarbetad,
ocr: dock hålst welat blifwa wid det sättet,
orig: dock hälst welat blifwa wid det sättet,
ocr: wårt Bärck.
orig: wårt Wärck.
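One simple way to estimate such transformation probabilities is by relative frequency over aligned pairs. The sketch below assumes word pairs of equal length, so that characters align one to one. Deletions such as the missing "C" in the first example would need a proper character alignment, which is not shown. The function name is hypothetical.

```python
from collections import Counter, defaultdict


def estimate_confusions(pairs):
    """Estimate p(ocr_char | orig_char) from equal-length (ocr, orig) pairs."""
    counts = defaultdict(Counter)
    for ocr, orig in pairs:
        if len(ocr) != len(orig):
            continue  # skipped: needs alignment with insertions/deletions
        for orig_char, ocr_char in zip(orig, ocr):
            counts[orig_char][ocr_char] += 1
    return {
        orig_char: {c: n / sum(seen.values()) for c, n in seen.items()}
        for orig_char, seen in counts.items()
    }


# Word pairs taken from the examples above.
pairs = [("cller", "eller"), ("dcl", "del"), ("hålst", "hälst"), ("Bärck.", "Wärck.")]
probs = estimate_confusions(pairs)
print(probs["e"])
print(probs["ä"])
```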
Language model
• trigram model: constructed from a training corpus
• unigram model: words and their frequencies (72,359 entries)
• word-based model: wordlists constructed from the Swedberg and Dalin full-form dictionary (567,108 entries)
Post-processing evaluation
Argus (1732–1734)
CER 11–15%
WER 40–50%
Mixed texts (1600–1800)
CER 17–25%
WER 55–60%
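For readers unfamiliar with the two measures: character error rate (CER) and word error rate (WER) are commonly computed as the edit distance between OCR output and reference, divided by the reference length in characters or words. A minimal sketch, assuming plain Levenshtein distance with unit costs and whitespace tokenization; the slides do not state the exact setup used in the evaluation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, 1):
        current = [i]
        for j, hyp_item in enumerate(hyp, 1):
            current.append(min(
                previous[j] + 1,                          # deletion
                current[j - 1] + 1,                       # insertion
                previous[j - 1] + (ref_item != hyp_item), # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference, ocr):
    """Character error rate relative to the reference length."""
    return edit_distance(reference, ocr) / len(reference)


def wer(reference, ocr):
    """Word error rate relative to the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, ocr.split()) / len(ref_words)


print(cer("wårt Wärck.", "wårt Bärck."))
print(wer("wårt Wärck.", "wårt Bärck."))
```

A single wrong character makes a whole word count as wrong, which is why WER is typically much higher than CER on the same text.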
OCR cloud: http://demo.spraakdata.gu.se/ocr/
Word Sense Changes
1. Automatically find word senses
  • Unsupervised Word Sense Discrimination
2. Track the senses over time to find change
Experiments on The Times Archive, London (1785-1985)
First we had to correct OCR errors!
Correction increased the number of clusters: on average 24% more clusters, and 61% more clusters before 1815.
[Figure: time axis with points t1, t2, t3, t4, ti and tnow]
Evaluation
• Evolved sense (broadened/narrowed)
  • Personal computer, mobile phone
• Novel related sense
  • Music tape, computer mouse
• Novel unrelated sense
  • (new word, sense), e.g., Internet
  • (exist. word, new sense), e.g., rock music
• Existing sense
  • Stone sense of rock
  • Deer, horse, …
Exp 1 counts recall in any unit
Exp 2 counts recall in correct form
95% recall in our units, 82% in correct form!
On average, it takes 6.3 to 9.4 years to find a change from the first cluster evidence.
It takes 29 to 32 years to find a change from the definition.
Correct OCR errors
For the Kubhist data:
• Use a sliding window method to create a graph
• Cluster the graph using word sense discrimination
Many spelling variations end up in the same cluster
Examples:
{Hvitliafre, hvithafpe, rag, hvete, svarthafre, hvithafro, hvithafre, korn, slipsten, ny, kora}
{Fianinon, planlnon, fotograflapparatcr, planinon, pnino, orafofoner, orgel, flaninon, fotografiapparate, rurafofoner, grafofoner, fotograflapparater}
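The sliding-window step for the Kubhist data can be sketched as building a word co-occurrence graph. This is a minimal version, assuming a fixed window counted in tokens and edge weights equal to co-occurrence counts. The window size, the weighting and the function name are assumptions, and the clustering step is not shown.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=3):
    """Build an undirected weighted graph of words co-occurring in a window.

    `window` is the number of consecutive tokens covered by one window,
    including the current token.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        for neighbour in tokens[i + 1:i + window]:
            if neighbour != word:
                graph[word][neighbour] += 1
                graph[neighbour][word] += 1
    return graph


# Toy token sequence built from words in the example clusters.
graph = cooccurrence_graph(["hvete", "korn", "rag", "hvete", "korn"], window=2)
print(dict(graph["hvete"]))
```

Misspelled variants of a word tend to appear among the same neighbours as the correct form, which is consistent with the observation that spelling variations land in the same cluster.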