
Research on historical data at Språkbanken

Dana Dannélls

Språkbanken, Department of Swedish Language

University of Gothenburg

Språkbanken Kick-off meeting, 2015-01-27

Digital historical texts

• Digitization of historical texts aims at preserving cultural heritage and making it more accessible.
• Ongoing digitization initiatives seek to create digital text resources which can be searched and processed by machines.
• Together with the availability of historical text resources in digital form, there is a growing interest in applying NLP methods and tools to historical texts.

Digitized historical Swedish texts

• Old Swedish (fornsvenska)
  • starts with Latin-script manuscripts of law text (around 1225)
  • ends with (pre-)publication of the Gustav Vasa bible (1526)
• Modern Swedish (nysvenska)
  • Early Modern Swedish (äldre nysvenska)
    • starts from the Gustav Vasa bible (1526)
    • ends with Olof v. Dalin’s Then Swänska Argus (1732)
  • Late Modern Swedish (yngre nysvenska)
    • starts from 1732
    • ends with August Strindberg’s Röda rummet (1879)

Material includes codices, legal manuscripts, religious prose, poetry, non-fiction, letters, etc.

Characterization of historical texts

• lack of standardized orthography
• old words and word forms
• spelling variation
• grammar change: inflectional complexity, word order, subordination
• OCR errors

Problems for LT: limited resources, lack of annotated material, lack of grammar and morphological descriptions, no native speakers.

Old Swedish

Han beddis almoso af petro ok johanne Tha han saa them til byriä at inga j mönstrit Petrus sagde til hans Jak hafuir ey gul ällär silfuer vtan thz som iak hafuir gifuir jak thik j ihesu christi nazareni nampn stat vp ok gak Ok ginstan grep sanctus petrus hans höghre hand ok vplypte han ok ämbrat festos hans sinor ok fötir ok han sprang vp stodh ok gik in j mönstrit medhär thöm gangande ok springande ok lowande gudh ok alt folkit saa han gangande ok lofuande gudh ok kiändo han at han var then sami som saat for mönstersins port thiggiande almoso ok vndradho mykit a thz som honom var hänt ok then tidh the hioldo petrum oc iohannem lop alt folkit til therä vndrande

Early Modern Swedish

Ingen lärer kunna neka, at ju sådane Skriffter hafwa stor nytta med sig, som, på ett angenämt och lustigt sätt, föreställa Lärdomar och Wettenskaper; Derföre hafwa och de gamla, under roliga Dikter, liufliga Samtahl eller nöysamma Historier, underwisat Folket om Dygden, och likasom skiämtewijs förehållit dem alfwarsamma Sede-Läror. I nyare tider, och än i dag, se wi äfwen, hos kloka Nationer, sådane Skriffter med mycken nytta utgifwas och älskas. Men fast än hwarken de gamla sådane Läro-sätt skulle älskat eller nyare frägdade Folckeslag dem älska, så wete wi dock at Hof-smaken . . .

Digital image (Then Swänska Argus)

Exploring sentence boundaries

Thaa kesarinnan fik höra ath then wnga herrän war kommen thaa tilreddhe hon sigh mz iomfrum och klädom som hon aldra bäst kunne och kom gaangande til konungen och tilhans son ther the saatho baaden til samman ‖ konungen bad hänne sätia sig när sonnen ‖ kesarinnan sade til konungen herra är tättha edher son som saa länghe hafuer borto waridh när the wisa mästara ‖ konungen sade ya män jak kan ekke wettha hurw thz gaar til thy han wil inthe tala, ‖ thaa sadhe hon herra antuarden honnom mik jak skal wil wäl göra honnom talande och togh honom widh haandena och wille hafuan mz sigh, ‖ thaa warde han sig och wille ekke mz, ‖ fadren bad honom gaa mz hänne ‖ thaa negh han sinom fader ödhmyuklighan och war honom lydoger ‖ kesarinnan ledde honom in j en kammara och badh alla wtgaa och satte honnom oppaa en sänga stok när sigh och sadhe, hiärtans käre diocleciane,

(Sju vise mästare C, 1492)

Machine learning with HaCOSSA

Sentence annotation:

• Hamburg Corpus of Old Swedish with Syntactic Annotations (Höder 2011): annotation for clauses
• Construct sentence-like annotation: 8k sentences in 137k words
• Average sentence length: 16.5 tokens

Sentence as a sequence of tags:

• S|L0(L1M∗)?R

Sentence boundaries evaluation

All features, 10-fold: precision 82.9, recall 66.4

All features, leave-one-out: precision 76.0, recall 58.2

Lexical information helps, but its generalization is difficult

Simple spelling normalization helps, especially precision for leave-one-out (>+5%)

Spelling variation

o → u: 0.2 (arvuþi, ærvoþi)
æ → e: 0.27 (ær, er)
au → ö: 0.31 (barnlös, barnalös, barnalaus)
pt → ft: 0.42 (apter, after, æftær)
gg# → g#: 0.43 (væg, vægg, vegg)
þer → n: 0.44 (maþer, man)
th → þ: 0.44 (oþolskipti, othalskipte)
mp → m: 0.45 (hamn, hampn)
eli → li: 0.45 (lastelika, lastlika)
ghi → i: 0.62 (aplöia, opplöghia)

(Ahlberg & Bouma 2012)
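A minimal sketch of how weighted rewrite rules like these could be used to compare two spelling variants. It is a plain dynamic-programming edit distance, not the method of the cited paper. Several things are assumptions made for illustration: that the numbers are rule costs (lower means a more plausible variation), that rules may apply in either direction, that ordinary single-character edits cost 1.0, and that `#` marks the end of a word. The names `RULES` and `variant_distance` are hypothetical.

```python
# Rule list copied from the slide; the interpretation as costs is an assumption.
RULES = [
    ("o", "u", 0.2), ("æ", "e", 0.27), ("au", "ö", 0.31), ("pt", "ft", 0.42),
    ("gg#", "g#", 0.43), ("þer", "n", 0.44), ("th", "þ", 0.44),
    ("mp", "m", 0.45), ("eli", "li", 0.45), ("ghi", "i", 0.62),
]
DEFAULT_COST = 1.0  # assumed cost of a plain insertion, deletion or substitution


def variant_distance(a, b, rules=RULES):
    """Cheapest cost of rewriting a into b with plain edits and weighted rules."""
    a, b = a + "#", b + "#"  # '#' marks the word end so boundary rules can match
    # Assumption: every rule may be applied in both directions.
    subs = list(rules) + [(rhs, lhs, w) for lhs, rhs, w in rules]
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            cur = d[i][j]
            if cur == inf:
                continue
            if i < len(a) and j < len(b):  # match or plain substitution
                step = 0.0 if a[i] == b[j] else DEFAULT_COST
                d[i + 1][j + 1] = min(d[i + 1][j + 1], cur + step)
            if i < len(a):  # plain deletion
                d[i + 1][j] = min(d[i + 1][j], cur + DEFAULT_COST)
            if j < len(b):  # plain insertion
                d[i][j + 1] = min(d[i][j + 1], cur + DEFAULT_COST)
            for lhs, rhs, w in subs:  # weighted multi-character rewrite
                if a.startswith(lhs, i) and b.startswith(rhs, j):
                    ni, nj = i + len(lhs), j + len(rhs)
                    d[ni][nj] = min(d[ni][nj], cur + w)
    return d[len(a)][len(b)]
```

Under these assumptions, `variant_distance("hamn", "hampn")` uses the mp/m rule instead of a full-cost insertion, so pairs covered by a rule end up closer than arbitrary one-character edits.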

Spelling variation evaluation

Link-up 96% of Fornsvenska Textbanken

Estimated correctness in top3: 78%

Does not handle morphology: öknen → ökn.N

Problems with compound splitting

villhonnugh → vilder.A + hunagh.N

POS-tagging: Projecting syntactic information

[Figure: timeline from 1526 to 1873, marking Swart (1560) and Dalin (1732-1734)]

POS-tagging: Extending Hunpos with morphological information

• Idea: use the model for contemporary Swedish, and plug in an extra morphology for the historical language.
• Swedberg and Dalin (Borin & Forsberg 2008)

+ komm NN → VB
+ doch UO → AB (foreign words)
+ ähra PM → NN (proper nouns)

- Stolpe NN → PM (proper/common nouns)
- moste VB → NN (errors in the morphology)
- wisa VB → NN (morphology not yet covering)

öfverväxa vb inf aktiv
öfvervälta vb inf aktiv
öfvervintra vb inf aktiv
öfvervinna vb inf aktiv
öfvervika vb inf aktiv
prisar vb pres sg ind aktiv
prisas vb pres sg ind s-form
prise vb pres pl ind p1 aktiv
prises vb pres pl ind p1 s-form
prisen vb pres pl ind p2 aktiv
prisens vb pres pl ind p2 s-form
prisa vb pres pl ind p3 aktiv
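As a rough illustration of "plugging in an extra morphology", the lexicon entries above could be turned into a lookup from word form to candidate part-of-speech tags, which a tagger could then consult for words it has not seen in training. This is only a sketch: the function name `build_morph_table` is hypothetical, mapping `vb` to `VB` by upper-casing is an assumption, and the way Hunpos actually consumes such a table is not shown here.

```python
from collections import defaultdict

# A few entries copied from the slide: "<form> <pos> <features...>".
LEXICON_LINES = [
    "öfvervinna vb inf aktiv",
    "prisar vb pres sg ind aktiv",
    "prisas vb pres sg ind s-form",
    "prisa vb pres pl ind p3 aktiv",
]


def build_morph_table(lines):
    """Map each word form to the set of coarse POS tags the lexicon allows."""
    table = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip malformed lines
        form, pos = parts[0], parts[1]
        table[form].add(pos.upper())  # assumption: 'vb' corresponds to tag 'VB'
    return dict(table)


morph_table = build_morph_table(LEXICON_LINES)
```

With such a table, a form like `prisar` would be restricted to `{"VB"}` even if the contemporary model has never seen it.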

Morphology

Semi-supervised learning?

Can we make abstractions from known inflectiontables?

Paradigm induction

Can we use these to predict inflection of unseenwords?

Lexicon construction

(Ahlberg, Forsberg & Hulden 2014)
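The two questions above can be illustrated with a deliberately simplified sketch: abstract a known inflection table into a stem plus per-slot suffixes, then reuse the suffixes for another stem. This is not the procedure of the cited paper, which generalizes over more than a shared prefix; here the stem is simply assumed to be the longest common prefix. The function names and the choice of `älsk` as a test stem are illustrative assumptions.

```python
from os.path import commonprefix

# Inflection table taken from the lexicon excerpt on the Hunpos slide.
PRISA = {
    "pres sg ind aktiv": "prisar",
    "pres sg ind s-form": "prisas",
    "pres pl ind p1 aktiv": "prise",
    "pres pl ind p1 s-form": "prises",
    "pres pl ind p2 aktiv": "prisen",
    "pres pl ind p2 s-form": "prisens",
    "pres pl ind p3 aktiv": "prisa",
}


def induce_paradigm(table):
    """Split a table into a shared stem and one suffix per slot."""
    stem = commonprefix(list(table.values()))  # character-wise common prefix
    suffixes = {slot: form[len(stem):] for slot, form in table.items()}
    return stem, suffixes


def inflect(stem, suffixes):
    """Fill every slot of the paradigm for a new stem."""
    return {slot: stem + suffix for slot, suffix in suffixes.items()}


stem, suffixes = induce_paradigm(PRISA)
predicted = inflect("älsk", suffixes)  # hypothetical unseen stem
```

A prefix-only abstraction cannot capture stem-internal changes, which is one reason a richer abstraction is needed in practice.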

Morphology: Paradigm induction

Morphology: Lexicon induction

Morphology evaluation

Modern languages:

Table accuracy of 76.50–98.00%

Form accuracy of 91.81–99.58%

Old Swedish: (Adesam et al. 2014)

Table accuracy of ∼54.00%

Fornsvenska reader: http://demo.spraakdata.gu.se/fsvreader/

Morfologilabbet: http://spraakbanken.gu.se/karp/morfologilabbet/

OCR error correction

• OCR Swedish text with high quality, independent of the quality of the print.
• OCROpus: open source OCR engine, neural-network based
• Material:
  • Blackletter texts printed between approx. 1600 and 1800
  • Olof v. Dalin’s Swänska Argus (1732–1734, Stockholm)

Post-processing

Noisy channel model:

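\[
\operatorname*{argmax}_{\mathit{orig}} \; p(\mathit{orig} \mid \mathit{ocr})
= \operatorname*{argmax}_{\mathit{orig}} \; \bigl[\, p(\mathit{ocr} \mid \mathit{orig}) \cdot p(\mathit{orig}) \,\bigr]
\]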

• error model (EM) = p(ocr | orig)
• language model (LM) = p(orig)

Kolak, O. (2005). OCR post-processing for low density languages. In Proceedings of human language technology conference and conference on empirical methods in natural language processing (HLT/EMNLP).
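The noisy channel decision rule can be sketched as follows: among a set of candidate originals, pick the one that maximizes error-model probability times language-model probability. Everything concrete here is an assumption for illustration: the toy error model (each character is kept with probability 0.9), the add-one smoothed unigram model, the word counts, and the candidate lists. None of it reflects the models actually used in the project.

```python
import math


def make_unigram_lm(counts):
    """Toy p(orig): add-one smoothed unigram model built from word counts."""
    total = sum(counts.values()) + len(counts) + 1
    return lambda word: (counts.get(word, 0) + 1) / total


def toy_error_model(ocr, orig, p_keep=0.9):
    """Toy p(ocr | orig): each character survives OCR with probability p_keep."""
    if len(ocr) != len(orig):
        return 1e-12  # assumption: length changes are treated as very unlikely
    prob = 1.0
    for seen, true in zip(ocr, orig):
        prob *= p_keep if seen == true else 1.0 - p_keep
    return prob


def correct(ocr, candidates, lm, em=toy_error_model):
    """Return argmax over candidates of p(ocr | orig) * p(orig), in log space."""
    return max(candidates, key=lambda orig: math.log(em(ocr, orig)) + math.log(lm(orig)))


# Hypothetical counts, chosen only to make the example concrete.
lm = make_unigram_lm({"eller": 50, "del": 10, "den": 40})
best_short = correct("dcl", ["del", "den"], lm)
best_long = correct("cller", ["eller", "aller"], lm)
```

In the first call the error model outweighs the language model ("den" is more frequent but needs two character changes); in the second both candidates are one change away, so the language model decides.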

Error model

Estimates the probability that a certain transformation can occur to a string

ocr: ε apitel cller dcl är ey så wäl utarbetad,
orig: Capitel eller del är ey så wäl utarbetad,

ocr: dock hålst welat blifwa wid det sättet,
orig: dock hälst welat blifwa wid det sättet,

ocr: wårt Bärck.
orig: wårt Wärck.
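One simple way such transformation probabilities might be estimated is by counting character confusions in aligned ocr/orig word pairs. The sketch below assumes the pairs are already aligned and of equal length, so insertions and deletions (such as the lost "C" in "Capitel") are ignored; the function name `estimate_confusions` is hypothetical and the four pairs are taken from the examples above.

```python
from collections import Counter, defaultdict

# (ocr, orig) word pairs from the slide examples, equal-length cases only.
PAIRS = [("cller", "eller"), ("dcl", "del"), ("hålst", "hälst"), ("Bärck.", "Wärck.")]


def estimate_confusions(pairs):
    """Relative-frequency estimate of p(ocr_char | orig_char)."""
    counts = defaultdict(Counter)
    for ocr, orig in pairs:
        if len(ocr) != len(orig):
            continue  # assumption: skip pairs that would need a real alignment
        for seen, true in zip(ocr, orig):
            counts[true][seen] += 1
    return {
        true: {seen: n / sum(c.values()) for seen, n in c.items()}
        for true, c in counts.items()
    }


confusions = estimate_confusions(PAIRS)
```

On this tiny sample, `confusions["e"]` records that "e" was read as "c" in two of its three occurrences; real estimates would need far more data and smoothing.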

Language model

• trigram model: constructed from a training corpus
• unigram model: words and their frequencies (72,359 entries)
• word-based model: wordlists constructed from the Swedberg and Dalin full-form dictionary (567,108 entries)

Post-processing evaluation

Argus (1732–1734): CER 11–15%, WER 40–50%

Mixed texts (1600–1800): CER 17–25%, WER 55–60%
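For reference, character error rate (CER) and word error rate (WER) are commonly computed as the edit distance between the reference and the OCR output, divided by the reference length in characters or words. The sketch below follows that common definition; whether the figures above were computed in exactly this way is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def cer(reference, ocr_output):
    """Character error rate relative to the reference length."""
    return edit_distance(reference, ocr_output) / len(reference)


def wer(reference, ocr_output):
    """Word error rate relative to the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, ocr_output.split()) / len(ref_words)


example_cer = cer("wårt Wärck.", "wårt Bärck.")
example_wer = wer("wårt Wärck.", "wårt Bärck.")
```

The example pair from the error model slide shows why WER is so much higher than CER: a single wrong character in an 11-character reference already makes one of its two words wrong.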

OCR cloud: http://demo.spraakdata.gu.se/ocr/

Word Sense Changes

1. Automatically find word senses
   • Unsupervised Word Sense Discrimination

2. Track the senses over time to find change

Experiments on The Times Archive, London (1785-1985)

First we had to correct OCR errors!

Increased number of clusters: on average 24% more clusters, 61% more clusters before 1815.

[Figure: time axis marking t1, t2, t3, t4, ti and tnow]

Evaluation
• Evolved sense (broadened/narrowed)
  • Personal computer, mobile phone, email
• Novel related sense
  • Music tape, computer mouse
• Novel unrelated sense
  • (new word, sense), e.g., Internet
  • (exist. word, new sense), e.g., rock music
• Existing sense
  • Stone sense of rock
  • Deer, horse, …

Exp 1 counts recall in any unit

Exp 2 counts recall in correct form

95% recall in our units, 82% in correct form!

On average, 6.3 to 9.4 years to find change from first cluster evidence.

It takes 29-32 years to find change from definition.

Correct OCR errors

For the Kubhist data:

• Use a sliding window method to create a graph

• Cluster the graph using word sense discrimination

Many spelling variations end up in the same cluster

Examples:

{Hvitliafre, hvithafpe, rag, hvete, svarthafre, hvithafro, hvithafre, korn, slipsten, ny, kora}

{Fianinon, planlnon, fotograflapparatcr, planinon, pnino, orafofoner, orgel, flaninon, fotografiapparate, rurafofoner, grafofoner, fotograflapparater}
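The first step for the Kubhist data, building a graph with a sliding window, might look roughly like the sketch below: words that occur within the same window are connected, and edge weights count how often that happens. The window size, the function name `cooccurrence_graph` and the token list are illustrative assumptions, and the clustering step is not shown.

```python
from collections import Counter


def cooccurrence_graph(tokens, window=3):
    """Count undirected co-occurrence edges between words within a sliding window."""
    edges = Counter()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:  # the next window-1 tokens
            if other != word:
                edges[frozenset((word, other))] += 1
    return edges


# Hypothetical token sequence using words from the first example cluster.
graph = cooccurrence_graph(["hvete", "korn", "hvithafre", "korn"], window=2)
```

Because OCR variants of a word tend to appear in the same contexts as the correct form, they get linked to the same neighbours in such a graph, which is consistent with the variants landing in one cluster.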
