Research on historical data at Språkbanken Dana Dannélls Språkbanken Department of Swedish Language University of Gothenburg Språkbanken Kick-off meeting 2015-01-27




Digital historical texts

• Digitization of historical texts aims at preserving cultural heritage and making it more accessible.

• Ongoing digitization initiatives seek to create digital text resources which can be searched and processed by machines.

• Together with the availability of historical text resources in digital form, there is a growing interest in applying NLP methods and tools to historical texts.


Digitized historical Swedish texts

• Old Swedish (fornsvenska)
    • starts with Latin-script manuscripts of law text (around 1225)
    • ends with (pre-)publication of the Gustav Vasa Bible (1526)

• Modern Swedish (nysvenska)
    • Early Modern Swedish (äldre nysvenska)
        • starts from the Gustav Vasa Bible, 1526
        • ends with Olof v. Dalin’s Then Swänska Argus, 1732
    • Late Modern Swedish (yngre nysvenska)
        • starts from 1732
        • ends with August Strindberg’s Röda rummet, 1879

Material includes codices, legal manuscripts, religious prose, poetry, non-fiction, letters, etc.


Characterization of historical texts

• lack of standardized orthography

• old words and word forms

• spelling variation

• grammar change: inflectional complexity, word order, subordination

• OCR errors

Problems for LT: limited resources, lack of annotated material, lack of grammar and morphological descriptions, no native speakers.


Old Swedish

Han beddis almoso af petro ok johanne Tha han saa them til byriä at inga j mönstrit Petrus sagde til hans Jak hafuir ey gul ällär silfuer vtan thz som iak hafuir gifuir jak thik j ihesu christi nazareni nampn stat vp ok gak Ok ginstan grep sanctus petrus hans höghre hand ok vplypte han ok ämbrat festos hans sinor ok fötir ok han sprang vp stodh ok gik in j mönstrit medhär thöm gangande ok springande ok lowande gudh ok alt folkit saa han gangande ok lofuande gudh ok kiändo han at han var then sami som saat for mönstersins port thiggiande almoso ok vndradho mykit a thz som honom var hänt ok then tidh the hioldo petrum oc iohannem lop alt folkit til therä vndrande


Early Modern Swedish

Ingen lärer kunna neka, at ju sådane Skriffter hafwa stor nytta med sig, som, på ett angenämt och lustigt sätt, föreställa Lärdomar och Wettenskaper; Derföre hafwa och de gamla, under roliga Dikter, liufliga Samtahl eller nöysamma Historier, underwisat Folket om Dygden, och likasom skiämtewijs förehållit dem alfwarsamma Sede-Läror. I nyare tider, och än i dag, se wi äfwen, hos kloka Nationer, sådane Skriffter med mycken nytta utgifwas och älskas. Men fast än hwarken de gamla sådane Läro-sätt skulle älskat eller nyare frägdade Folckeslag dem älska, så wete wi dock at Hof-smaken . . .


Digital image (Then Swänska Argus)


Exploring sentence boundaries

Thaa kesarinnan fik höra ath then wnga herrän war kommen thaa tilreddhe hon sigh mz iomfrum och klädom som hon aldra bäst kunne och kom gaangande til konungen och til hans son ther the saatho baaden til samman ‖ konungen bad hänne sätia sig när sonnen ‖ kesarinnan sade til konungen herra är tättha edher son som saa länghe hafuer borto waridh när the wisa mästara ‖ konungen sade ya män jak kan ekke wettha hurw thz gaar til thy han wil inthe tala, ‖ thaa sadhe hon herra antuarden honnom mik jak skal wil wäl göra honnom talande och togh honom widh haandena och wille hafuan mz sigh, ‖ thaa warde han sig och wille ekke mz, ‖ fadren bad honom gaa mz hänne ‖ thaa negh han sinom fader ödhmyuklighan och war honom lydoger ‖ kesarinnan ledde honom in j en kammara och badh alla wtgaa och satte honnom oppaa en sänga stok när sigh och sadhe, hiärtans käre diocleciane,

(Sju vise mästare C, 1492)


Machine learning with HaCOSSA

Sentence annotation:

• Hamburg Corpus of Old Swedish with Syntactic Annotations (Höder 2011): annotation for clauses

• Construct sentence-like annotation: 8k sentences in 137k words

• Average sentence length: 16.5 tokens

Sentence as a sequence of tags:

• S|L0(L1M*)?R


Sentence boundaries evaluation

All features, 10-fold cross-validation: precision 82.9, recall 66.4

All features, leave-one-out: precision 76.0, recall 58.2

Lexical information helps, but its generalization is difficult

Simple spelling normalization helps, especially precision for leave-one-out (> +5%)


Spelling variation

o → u: 0.2 (arvuþi, ærvoþi)
æ → e: 0.27 (ær, er)
au → ö: 0.31 (barnlös, barnalös, barnalaus)
pt → ft: 0.42 (apter, after, æftær)
gg# → g#: 0.43 (væg, vægg, vegg)
þer → n: 0.44 (maþer, man)
th → þ: 0.44 (oþolskipti, othalskipte)
mp → m: 0.45 (hamn, hampn)
eli → li: 0.45 (lastelika, lastlika)
ghi → i: 0.62 (aplöia, opplöghia)

(Ahlberg & Bouma 2012)
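
The weighted rules on this slide can be read as substitution costs, where a cheaper rule marks a more plausible spelling correspondence. The following is a minimal sketch under two assumptions that are not stated on the slide: rules are applied as plain substring rewrites, and the costs of several rewrites add up. The `RULES` table copies a few weights from the slide; `one_step_variants` and `variant_cost` are hypothetical helper names and not the implementation of Ahlberg & Bouma (2012).

```python
from heapq import heappop, heappush

# A few (source, target) -> weight rules copied from the slide.
RULES = {
    ("pt", "ft"): 0.42,
    ("æ", "e"): 0.27,
    ("o", "u"): 0.2,
    ("mp", "m"): 0.45,
}


def one_step_variants(word, rules=RULES):
    """Yield (variant, cost) for every single application of one rule."""
    for (src, dst), cost in rules.items():
        start = word.find(src)
        while start != -1:
            yield word[:start] + dst + word[start + len(src):], cost
            start = word.find(src, start + 1)


def variant_cost(source, target, rules=RULES, max_cost=1.0):
    """Cheapest summed rule cost that rewrites source into target.

    Uniform-cost search over rewrites; paths more expensive than
    max_cost are pruned. Returns None if target is not reachable.
    """
    queue = [(0.0, source)]
    seen = set()
    while queue:
        cost, word = heappop(queue)
        if word == target:
            return cost
        if word in seen:
            continue
        seen.add(word)
        for variant, step in one_step_variants(word, rules):
            if cost + step <= max_cost:
                heappush(queue, (cost + step, variant))
    return None
```

For example, `variant_cost("apter", "after")` applies the single rule pt → ft and returns its weight, while a pair that no rule sequence connects yields `None`.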


Spelling variation evaluation

Link-up 96% of Fornsvenska Textbanken

Estimated correctness in top3: 78%

Does not handle morphology: öknen → ökn.N

Problems with compound splitting: villhonnugh → vilder.A + hunagh.N


POS-tagging: Projecting syntactic information

[Figure with panels labelled 1526 and 1873]


[Figure with panels labelled "Dalin, 1732-1734" and "Swart, 1560"]


POS-tagging: Extending Hunpos with morphological information

• Idea: use the model for contemporary Swedish, and plug in an extra morphology for historical language.

• Swedberg and Dalin (Borin & Forsberg 2008)

+ komm NN → VB
+ doch UO → AB (foreign words)
+ ähra PM → NN (proper nouns)

- Stolpe NN → PM (proper/common nouns)
- moste VB → NN (errors in the morphology)
- wisa VB → NN (morphology not yet covering)

öfverväxa vb inf aktiv
öfvervälta vb inf aktiv
öfvervintra vb inf aktiv
öfvervinna vb inf aktiv
öfvervika vb inf aktiv
prisar vb pres sg ind aktiv
prisas vb pres sg ind s-form
prise vb pres pl ind p1 aktiv
prises vb pres pl ind p1 s-form
prisen vb pres pl ind p2 aktiv
prisens vb pres pl ind p2 s-form
prisa vb pres pl ind p3 aktiv


Morphology

Semi-supervised learning?

Can we make abstractions from known inflection tables?

Paradigm induction

Can we use these to predict inflection of unseen words?

Lexicon construction

(Ahlberg, Forsberg & Hulden 2014)
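
The two questions on this slide can be made concrete with a toy example: abstract a known inflection table into a stem plus a list of endings, then reuse the endings for a new stem. The following is a minimal sketch that assumes purely suffixing inflection and takes the longest common prefix as the stem; it is a simplification for illustration, not the method of Ahlberg, Forsberg & Hulden (2014). The function names are hypothetical, the table entries are verb forms listed on the POS-tagging slide, and the unseen stem `älsk` is an input chosen for the example.

```python
import os


def induce_paradigm(table):
    """Abstract an inflection table {tag: form} into (stem, {tag: suffix}).

    The stem is the longest common prefix of all forms in the table.
    """
    stem = os.path.commonprefix(list(table.values()))
    return stem, {tag: form[len(stem):] for tag, form in table.items()}


def inflect(stem, paradigm):
    """Predict the forms of an unseen stem from an induced paradigm."""
    return {tag: stem + suffix for tag, suffix in paradigm.items()}


# Known table: a subset of the forms shown for "prisa".
known = {
    "pres sg ind aktiv": "prisar",
    "pres sg ind s-form": "prisas",
    "pres pl ind p1 aktiv": "prise",
    "pres pl ind p3 aktiv": "prisa",
}
stem, paradigm = induce_paradigm(known)

# Lexicon construction step: apply the paradigm to a stem not in the table.
predicted = inflect("älsk", paradigm)
```

A prefix-based stem cannot capture stem-internal changes such as umlaut, which is one reason a real system needs a richer abstraction than this sketch.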


Morphology: Paradigm induction


Morphology: Lexicon induction


Morphology evaluation

Modern languages:

Table accuracy of 76.50–98.00%

Form accuracy of 91.81–99.58%

Old Swedish: (Adesam et al. 2014)

Table accuracy of ∼54.00%


Fornsvenska reader

http://demo.spraakdata.gu.se/fsvreader/


Morfologilabbet

http://spraakbanken.gu.se/karp/morfologilabbet/


OCR error correction

• OCR Swedish text with high quality, independent of the quality of the print.

• OCRopus: open source OCR engine, neural-network based

• Material:
    • Blackletter texts printed between approx. 1600 and 1800
    • Olof v. Dalin’s Swänska Argus (1732–1734, Stockholm)


Post-processing

Noisy channel model:

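\[
\arg\max_{\mathit{orig}} \; p(\mathit{orig} \mid \mathit{ocr}) = \arg\max_{\mathit{orig}} \; \left[ p(\mathit{ocr} \mid \mathit{orig}) \cdot p(\mathit{orig}) \right]
\]

• error model (EM) = p(ocr | orig)

• language model (LM) = p(orig)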

Kolak, O. (2005). OCR post-processing for low density languages. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP).
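
The noisy channel decision rule on this slide can be sketched in a few lines: for each candidate original string, multiply the error model probability by the language model probability and keep the best. The following is a minimal sketch, not the system described in the talk. The function `best_correction`, the toy error model with its 0.9 and 0.05 probabilities, and the `TOY_LM` values are all invented for illustration; only the OCR string "dcl" and its correction "del" come from the slides.

```python
import math


def best_correction(ocr, candidates, error_model, language_model):
    """Return the candidate maximising p(ocr | orig) * p(orig).

    Scores are summed in log space; candidates with zero probability
    under either model are skipped. Returns None if none qualifies.
    """
    best, best_score = None, -math.inf
    for orig in candidates:
        p_em = error_model(ocr, orig)
        p_lm = language_model(orig)
        if p_em <= 0.0 or p_lm <= 0.0:
            continue
        score = math.log(p_em) + math.log(p_lm)
        if score > best_score:
            best, best_score = orig, score
    return best


def toy_error_model(ocr, orig):
    """Toy p(ocr | orig): per-character substitution, invented values."""
    if len(ocr) != len(orig):
        return 0.0
    p = 1.0
    for ocr_ch, orig_ch in zip(ocr, orig):
        p *= 0.9 if ocr_ch == orig_ch else 0.05
    return p


# Toy p(orig): invented unigram probabilities.
TOY_LM = {"del": 0.02, "dal": 0.001, "det": 0.03}


def toy_language_model(word):
    return TOY_LM.get(word, 0.0)


choice = best_correction(
    "dcl", ["del", "dal", "det"], toy_error_model, toy_language_model
)
```

Here "det" has the highest language model probability, but it needs two character substitutions, so the error model term makes "del" the overall winner.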


Error model

Estimates the probability that a certain transformation can occur to a string

ocr:  ε apitel cller dcl är ey så wäl utarbetad,
orig: Capitel eller del är ey så wäl utarbetad,

ocr:  dock hålst welat blifwa wid det sättet,
orig: dock hälst welat blifwa wid det sättet,

ocr:  wårt Bärck.
orig: wårt Wärck.
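
One simple way to estimate such transformation probabilities is to count character confusions in aligned OCR and original strings. The following is a minimal sketch, assuming equal-length pairs so that characters align position by position; insertions and deletions, such as the missing "C" in the first example, are outside its scope. The function name `estimate_confusions` is hypothetical, and the pairs are fragments of the examples above.

```python
from collections import Counter, defaultdict


def estimate_confusions(pairs):
    """Estimate p(ocr_char | orig_char) from (ocr, orig) string pairs.

    Only equal-length pairs are used, so characters align one to one;
    pairs of different length are skipped.
    """
    counts = defaultdict(Counter)
    for ocr, orig in pairs:
        if len(ocr) != len(orig):
            continue
        for ocr_ch, orig_ch in zip(ocr, orig):
            counts[orig_ch][ocr_ch] += 1
    return {
        orig_ch: {
            ocr_ch: n / sum(seen.values()) for ocr_ch, n in seen.items()
        }
        for orig_ch, seen in counts.items()
    }


PAIRS = [
    ("cller dcl", "eller del"),
    ("hålst", "hälst"),
    ("Bärck.", "Wärck."),
]
probs = estimate_confusions(PAIRS)
```

With these three pairs, an original "e" is read as "c" in two of its three occurrences, so `probs["e"]["c"]` is 2/3; with so little data the estimates are only illustrative.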


Language model

• trigram model: constructed from a training corpus

• unigram model: words and their frequencies (72,359 entries)

• word-based model: word lists constructed from the Swedberg and Dalin full-form dictionary (567,108 entries)
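
As an illustration of the unigram variant, a word frequency model can be built directly from token counts. The following is a minimal sketch, assuming add-one smoothing with one extra slot for unseen words; the smoothing scheme and the function name `train_unigram` are assumptions for this example, and the slides do not say how the actual models were smoothed. The training tokens are taken from one of the error model examples.

```python
from collections import Counter


def train_unigram(tokens):
    """Return a function word -> p(word) with add-one smoothing.

    One extra slot is reserved for unseen words, so every word gets a
    non-zero probability and seen words plus the slot sum to one.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    slots = len(counts) + 1  # vocabulary plus one slot for unseen words

    def prob(word):
        return (counts[word] + 1) / (total + slots)

    return prob


tokens = "dock hälst welat blifwa wid det sättet".split()
p = train_unigram(tokens)
```

A seen word such as "hälst" then scores higher than the OCR variant "hålst", which is what lets the language model prefer the original spelling during post-processing.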


Post-processing evaluation

Argus (1732–1734)

CER 11–15%
WER 40–50%

Mixed texts (1600–1800)

CER 17–25%
WER 55–60%
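
CER and WER are commonly computed as edit distance divided by reference length, over characters and over words respectively. The following is a minimal sketch of that standard definition, assuming whitespace tokenisation and a non-empty reference; the slides do not state how the figures above were computed, so details such as normalisation may differ. The function names are chosen for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits / reference length."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate: word edits / number of reference words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)
```

For the pair "wårt Wärck." (reference) and "wårt Bärck." (OCR output), one wrong character out of eleven gives a CER of 1/11, while one wrong word out of two gives a WER of 1/2, which shows why WER is so much higher than CER on the same text.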


OCR cloud

http://demo.spraakdata.gu.se/ocr/


Word Sense Changes

1. Automatically find word senses
    • Unsupervised Word Sense Discrimination

2. Track the senses over time to find change

Experiments on The Times Archive, London (1785-1985)

First we had to correct OCR errors!

Increased number of clusters: on average 24% more clusters, 61% more clusters before 1815.

[Figure: time axis with points t1, t2, t3, t4, ti and tnow]


Evaluation

• Evolved sense (broadened/narrowed)
    • Personal computer, mobile phone, email

• Novel related sense
    • Music tape, computer mouse

• Novel unrelated sense
    • (new word, sense), e.g., Internet
    • (exist. word, new sense), e.g., rock music

• Existing sense
    • Stone sense of rock
    • Deer, horse, …

Exp 1 counts recall in any unit

Exp 2 counts recall in correct form

95% recall in our units, 82% in correct form!

On average, 6.3–9.4 years to find change from first cluster evidence.

It takes 29–32 years to find change from definition.


Correct OCR errors

For the Kubhist data:

• Use a sliding window method to create a graph

• Cluster the graph using word sense discrimination

Many spelling variations end up in the same cluster

Examples:

{Hvitliafre, hvithafpe, rag, hvete, svarthafre, hvithafro, hvithafre, korn, slipsten, ny, kora}

{Fianinon, planlnon, fotograflapparatcr, planinon, pnino, orafofoner, orgel, flaninon, fotografiapparate, rurafofoner, grafofoner, fotograflapparater}
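
The first step on this slide, building a graph with a sliding window, can be sketched as follows. This is a minimal sketch that assumes a plain co-occurrence graph in which two words are linked when they appear within a fixed number of tokens of each other; the window size, the edge weighting and the function name `cooccurrence_graph` are assumptions for this example. The clustering step (word sense discrimination) is not covered here.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=3):
    """Build an undirected weighted graph from a token sequence.

    Two different words are linked if they occur inside the same
    window of `window` consecutive tokens; the edge weight counts
    how often that happens.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + window]:
            if left != right:
                graph[left][right] += 1
                graph[right][left] += 1
    return graph


# Toy input using words from the first example cluster.
graph = cooccurrence_graph(["hvete", "hvithafre", "korn", "rag"], window=3)
```

Because OCR variants of a word tend to occur next to the same neighbours, they end up strongly connected to the same nodes, which is consistent with the observation that many spelling variations land in the same cluster.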