
OCR Post Correction for Endangered Language Texts

Shruti Rijhwani,1 Antonios Anastasopoulos,2,† Graham Neubig1

1 Language Technologies Institute, Carnegie Mellon University
2 Department of Computer Science, George Mason University

{srijhwan,gneubig}@cs.cmu.edu, [email protected]

Abstract

There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned images. In this work, we address the task of extracting text from these resources. We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages and present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting of endangered languages. We develop an OCR post-correction method tailored to ease training in this data-scarce setting, reducing the recognition error rate by 34% on average across the three languages.1

1 Introduction

Natural language processing (NLP) systems exist for a small fraction of the world's over 6,000 living languages, the primary reason being the lack of resources required to train and evaluate models. Technological advances are concentrated on languages that have readily available data, and most other languages are left behind (Joshi et al., 2020). This is particularly notable in the case of endangered languages, i.e., languages that are in danger of becoming extinct due to dwindling numbers of native speakers and the younger generations shifting to using other languages. For most endangered languages, finding any data at all is challenging.

In many cases, natural language text in these languages does exist. However, it is locked away in formats that are not machine-readable — paper books, scanned images, and unstructured web pages. These include books from local publishing houses within the communities that speak endangered languages, such as educational or cultural materials. Additionally, linguists documenting these languages also create data such as word lists and interlinear glosses, often in the form of handwritten notes. Examples from such scanned documents are shown in Figure 1. Digitizing the textual data from these sources will not only enable NLP for endangered languages but also aid linguistic documentation, preservation, and accessibility efforts.

†: Work done at Carnegie Mellon University.
1 Code and data are available at https://shrutirij.github.io/ocr-el/.

(a) Ainu (left) – Japanese (right)

(b) Griko (top) – Italian (bottom)

(c) Yakkha (top) – Nepali (middle) – English (bottom)

(d) Handwritten Shangaji – typed English glosses

[Figure 1 (d) shows a page of interlinear glossed Shangaji with typed English glosses, e.g.: nxúuzi wa náńtiíkwa / mu-xuzi o-a nantikwa / 3-sauce 3-Conn 1a.cashew / "A sauce of green cashew nuts."]

Figure 1: Examples of scanned documents in endangered languages accompanied by translations from the same scanned book (a, b, c) or linguistic archive (d).




In this work, we create a benchmark dataset and propose a suite of methods to extract data from these resources, focusing on scanned images of paper books containing endangered language text. Typically, this sort of digitization requires an optical character recognition (OCR) system. However, the large amounts of textual data and transcribed images needed to train state-of-the-art OCR models from scratch are unavailable in the endangered language setting. Instead, we focus on post-correcting the output of an off-the-shelf OCR tool that can handle a variety of scripts. We show that targeted methods for post-correction can significantly improve performance on endangered languages.

Although OCR post-correction is relatively well-studied, most existing methods rely on considerable resources in the target language, including a substantial amount of textual data to train a language model (Schnober et al., 2016; Dong and Smith, 2018; Rigaud et al., 2019) or to create synthetic data (Krishna et al., 2018). While readily available for high-resource languages, these resources are severely limited in endangered languages, preventing the direct application of existing post-correction methods in our setting.

As an alternative, we present a method that builds on previous models for OCR post-correction, making three improvements tailored to the data-scarce setting. First, we use a multi-source model to incorporate information from the high-resource translations that commonly appear in endangered language books. These translations are usually in the lingua franca of the region (e.g., Figure 1 (a, b, c)) or the documentary linguist's primary language (e.g., Figure 1 (d) from Devos (2019)). Next, we introduce structural biases to ease learning from small amounts of data. Finally, we add pretraining methods to utilize the little unannotated data that exists in endangered languages.

We summarize our main contributions as follows:

• A benchmark dataset for OCR post-correction on three critically endangered languages: Ainu, Griko, and Yakkha.

• A systematic analysis of a general-purpose OCR system, demonstrating that it is not robust to the data-scarce setting of endangered languages.

• An OCR post-correction method that adapts the standard neural encoder-decoder framework to the highly under-resourced endangered language setting, reducing both the character error rate and the word error rate by 34% over a state-of-the-art general-purpose OCR system.

2 Problem Setting

In this section, we first define the task of OCR post-correction and introduce how we incorporate translations into the correction model. Next, we discuss the sources from which we obtain scanned documents containing endangered language texts.

2.1 Formulation

Optical Character Recognition  OCR tools are trained to find the best transcription corresponding to the text in an image. The system typically consists of a recognition model that produces candidate text sequences conditioned on the input image and a language model that determines the probability of these sequences in the target language. We use a general-purpose OCR system (detailed in Section 4) to produce a first pass transcription of the endangered language text in the image. Let this be a sequence of characters x = [x_1, ..., x_N].

OCR post-correction  The goal of post-correction is to reduce recognition errors in the first pass transcription — often caused by low quality scanning, physical deterioration of the paper book, or diverse layouts and typefaces (Dong and Smith, 2018). The focus of our work is on using post-correction to counterbalance the lack of OCR training data in the target endangered languages. The correction model takes x as input and produces the final transcription of the endangered language document, a sequence of characters y = [y_1, ..., y_K].

y = \arg\max_{y'} p_{corr}(y' \mid x)

Incorporating translations  We use information from high-resource translations of the endangered language text. These translations are commonly found within the same paper book or linguistic archive (e.g., Figure 1). We use an existing OCR system to obtain a transcription of the scanned translation, a sequence of characters t = [t_1, ..., t_M]. This is used to condition the model:

y = \arg\max_{y'} p_{corr}(y' \mid x, t)

2.2 Endangered Language Documents

We explore online archives to determine how many scanned documents in endangered languages exist as potential sources for data extraction (as of this writing, October 2020).



The Internet Archive,2 a general-purpose archive of web content, has scanned books labeled with the language of their content. We find 11,674 books labeled with languages classified as "endangered" by UNESCO. Additionally, we find that endangered language linguistic archives contain thousands of documents in PDF format — the Archive of the Indigenous Languages of Latin America (AILLA)3 contains ≈10,000 such documents and the Endangered Languages Archive (ELAR)4 has ≈7,000.

How common are translations?  As described in the introduction, endangered language documents often contain a translation into another (usually high-resource) language. While it is difficult to estimate the number of documents with translations, multilingual documents represent the majority in the archives we examined; AILLA contains 4,383 PDFs with bilingual text and 1,246 PDFs with trilingual text, while ELAR contains ≈5,000 multilingual documents. The structure of translations in these documents is varied, from dictionaries and interlinear glosses to scanned multilingual books.

3 Benchmark Dataset

From the sources described above, we select documents from three critically endangered languages5 for annotation — Ainu, Griko, and Yakkha. These languages were chosen in an effort to create a geographically, typologically, and orthographically diverse benchmark. We focus this initial study on scanned images of printed books as opposed to handwritten notes, which are a relatively more challenging domain for OCR.

We manually transcribed the text corresponding to the endangered language content. The text corresponding to the translations is not manually transcribed. We also aligned the endangered language text to the OCR output on the translations, per the formulation in Section 2.1. We describe the annotated documents below; example images from our dataset are in Figure 1 (a), (b), and (c).

2 https://archive.org/
3 https://ailla.utexas.org
4 https://elar.soas.ac.uk/
5 UNESCO defines critically endangered languages as those where the youngest speakers are grandparents and older, and they speak the language partially and infrequently.

Ainu is a severely endangered indigenous language from northern Japan, typically considered a language isolate. In our dataset, we use a book of Ainu epic poetry (yukara), with the "Kutune Shirka" yukara (Kindaichi, 1931) in Ainu transcribed in Latin script.6 Each page in the book has a two-column structure — the left column has the Ainu text, and the right has its Japanese translation, already aligned at the line level, removing the need for manual alignment (see Figure 1 (a)). The book has 338 pages in total. Given the effort involved in annotation, we transcribe the Ainu text from 33 pages, totaling 816 lines.

Griko is an endangered Greek dialect spoken in southern Italy. The language uses a combination of the Latin alphabet and the Greek alphabet as its writing system. The document we use is a book of Griko folk tales compiled by Stomeo (1980). The book is structured such that in each fold of two pages, the left page has Griko text, and the right page has the corresponding translation in Italian. Of the 175 pages in the book, we annotate the Griko text from 33 pages and manually align it at the sentence level to the Italian translation. This results in 807 annotated Griko sentences.

Yakkha is an endangered Sino-Tibetan language spoken in Nepal. It uses the Devanagari writing system. We use scanned images of three children's books, each of which has a story written in Yakkha along with its translation in Nepali and English (Schackow, 2012). We manually transcribe the Yakkha text from all three books. We also align the Yakkha text to both the Nepali and the English OCR at the sentence level, with the help of an existing Yakkha dictionary (Schackow, 2015). In total, we have 159 annotated Yakkha sentences.

4 OCR Systems: Promises and Pitfalls

As briefly alluded to in the introduction, training an OCR model for each endangered language is challenging, given the limited available data. Instead, we use the general-purpose OCR system from the Google Vision AI toolkit7 to get the first pass OCR transcription on our data.

The Google Vision OCR system (Fujii et al., 2017; Ingle et al., 2019) is highly performant and supports 60 major languages in 29 scripts. It can transcribe a wide range of higher resource languages with high accuracy, which is ideal for our proposed method of incorporating high-resource translations into the post-correction model.

6 Some transcriptions of Ainu also use the Katakana script. See Howell (1951) for a discussion of Ainu folklore.

7 https://cloud.google.com/vision


Language   CER    WER
Ainu       1.34   6.27
Griko      3.27   15.63
Yakkha     8.90   31.64

Table 1: Character error rate and word error rate with the Google Vision OCR system on our dataset.

Moreover, it is particularly well-suited to our task because it provides script-specific OCR models in addition to language-specific ones. Per-script models are more robust to unknown languages because they are trained on data from multiple languages and can act as a general character recognizer without relying on a single language's model. Since most endangered languages adopt standard scripts (often from the region's dominant language) as their writing systems, the per-script recognition models can provide a stable starting point for post-correction.

The metrics we use for evaluating performance are character error rate (CER) and word error rate (WER), representing the ratio of erroneous characters or words in the OCR prediction to the total number in the annotated transcription. More details are in Section 6. The CER and WER using the Google Vision OCR on our dataset are in Table 1.

4.1 OCR Performance

Across the three languages, the error rates indicate that we have a first pass transcription of reasonable quality, giving our post-correction method a reliable starting point. We note the particularly low CER for the Ainu data, reflecting previous work that has evaluated the Google Vision system to have strong performance on typed Latin script documents (Fujii et al., 2017). However, there remains considerable room for improvement in both CER and WER for all three languages.

Next, we look at the edit distance between the predicted and the gold transcriptions, in terms of insertion, deletion, and replacement operations. Replacement accounts for over 84% of the errors in the Griko and Ainu datasets, and 55% overall. This pattern is expected in the OCR task, as the recognition model uses the image to make predictions and is more likely to confuse a character's shape for another than to hallucinate or erase pixels. However, we observe that the errors in the Yakkha dataset do not follow this pattern. Instead, 87% of the errors for Yakkha occur because of deleted characters.
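As a concrete illustration of this breakdown (not the paper's analysis code), the following Python sketch computes the Levenshtein distance between a first pass OCR prediction and the gold transcription, then backtraces one optimal alignment to count operation types; dividing the total by the gold length gives the CER used above. The same function applied to word lists (pred.split(), gold.split()) yields the WER.

    def edit_ops(pred, gold):
        # Levenshtein dynamic program between the OCR prediction and the gold text.
        m, n = len(pred), len(gold)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = d[i - 1][j - 1] + (pred[i - 1] != gold[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        # Backtrace one optimal alignment. "insert": gold character the OCR dropped;
        # "delete": spurious OCR character; "replace": misrecognized character.
        ops, i, j = {"insert": 0, "delete": 0, "replace": 0}, m, n
        while i > 0 or j > 0:
            if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (pred[i - 1] != gold[j - 1]):
                ops["replace"] += pred[i - 1] != gold[j - 1]
                i, j = i - 1, j - 1
            elif i > 0 and d[i][j] == d[i - 1][j] + 1:
                ops["delete"] += 1
                i -= 1
            else:
                ops["insert"] += 1
                j -= 1
        return ops  # sum(ops.values()) / len(gold) is the CER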

Figure 2: Examples of errors in Griko (top) and Yakkha (bottom) when using the Google Vision OCR. For instance, the Griko line "eχi i kaḍḍinàra!" is recognized as "exi i kaddinara". [Scanned-image examples not reproduced.]

4.2 Types of Errors

To better understand the challenges posed by the endangered language setting, we manually inspect all the errors made by the OCR system. While some errors are commonly seen in the OCR task, such as misidentified punctuation or incorrect word boundaries, 85% of the total errors occur due to specific characteristics of endangered languages that general-purpose OCR systems do not account for. Broadly, they can be categorized into two types, examples of which are shown in Figure 2:

• Mixed scripts  The existing scripts that most endangered languages adopt as writing systems are often not ideal for comprehensively representing the language. For example, the Devanagari script does not have a grapheme for the glottal stop — as a solution, printed texts in the Yakkha language use the IPA symbol 'ʔ' (Schackow, 2015). Similarly, both Greek and Latin characters are used to write Griko. The Google Vision OCR is trained to detect script at the line level and is not equipped to handle multiple scripts within a single word. As seen in Figure 2, the system does not recognize the Greek character χ in Griko and the IPA symbol ʔ in Yakkha. Mixed scripts cause 11% of the OCR errors.

• Uncommon characters and diacritics  Endangered languages often use graphemes and diacritics that are part of the standard script but are not commonly seen in high-resource languages. Since these are likely rare in the OCR system's training data, they are frequently misidentified, accounting for 74% of the errors. In Figure 2, we see that the OCR system substitutes the uncommon diacritic ḍ in Griko. The system also deletes the Yakkha character ङ, a 'half form' character that is infrequent in several other Devanagari script languages (such as Hindi).

5 OCR Post-Correction Model

In this section, we describe our proposed OCR post-correction model. The base architecture of the model is a multi-source sequence-to-sequence framework (Zoph and Knight, 2016; Libovicky and Helcl, 2017) that uses an LSTM encoder-decoder model with attention (Bahdanau et al., 2015). We propose improvements to training and modeling for the multi-source architecture, specifically tailored to ease learning in data-scarce settings.

Figure 3: The proposed multi-source architecture, with the encoder for an endangered language segment (left) and an encoder for the translated segment (right). The input to the encoders is the first pass OCR over the scanned images of each segment: for example, the OCR on the scanned images of some Ainu text (left) and its Japanese translation (right). [Architecture diagram not reproduced.]

5.1 Multi-source Architecture

Our post-correction formulation takes as input the first pass OCR of the endangered language segment, x, and the OCR of the translated segment, t, to predict an error-free endangered language text y. The model architecture is shown in Figure 3.

The model consists of two encoders — one that encodes x and one that encodes t. Each encoder is a character-level bidirectional LSTM (Hochreiter and Schmidhuber, 1997) and transforms the input sequence of characters into a sequence of hidden state vectors: h^x for the endangered language text and h^t for the translation.

The model uses an attention mechanism during the decoding process to use information from the encoder hidden states. We compute the attention weights over each of the two encoders independently. At decoding time step k:

e^x_{k,i} = v^x \tanh(W^x_1 s_{k-1} + W^x_2 h^x_i)    (1)
\alpha^x_k = \mathrm{softmax}(e^x_k)
c^x_k = \sum_i \alpha^x_{k,i} h^x_i

where s_{k-1} is the decoder state of the previous time step and v^x, W^x_1, and W^x_2 are trainable parameters.

The encoder hidden states h^x are weighted by the attention distribution \alpha^x_k to produce the context vector c^x_k. We follow a similar procedure for the second encoder to produce c^t_k. We concatenate the context vectors to combine attention from both sources (Zoph and Knight, 2016):

c_k = [c^x_k ; c^t_k]

c_k is used by the decoder LSTM to compute the next hidden state s_k and, subsequently, the probability distribution for predicting the next character y_k of the target sequence y:

s_k = \mathrm{LSTM}(s_{k-1}, c_k, y_{k-1})    (2)
P(y_k) = \mathrm{softmax}(W s_k + b)    (3)

Training and Inference  The model is trained for each language with the cross-entropy loss (L_{ce}) on the small amount of transcribed data we have. Beam search is used at inference time.

5.2 Model and Training Improvements

With the minimal annotated data we have, it is challenging for the neural network to learn a good distribution over the target characters. We propose a set of adaptations to the base architecture that improves the post-correction performance without additional annotation. The adaptations are based on characteristics of the OCR task itself and the performance of the upstream OCR tool (Section 4).

Diagonal attention loss  As seen in Section 4, substitution errors are more frequent in the OCR task than insertions or deletions; consequently, we expect the source and target to have similar lengths. Moreover, post-correction is a monotonic sequence-to-sequence task, and reordering rarely occurs (Schnober et al., 2016). Hence, we expect attention weights to be higher at characters close to the diagonal for the endangered language encoder.

We modify the model such that all the elements in the attention vector that are not within j steps (we use j = 3) of the current time step k are added to the training loss, thereby encouraging elements away from the diagonal to have lower values. The diagonal loss summed over all time steps for a training instance, where N is the length of x, is:

L_{diag} = \sum_k \left( \sum_{i=1}^{k-j} \alpha^x_{k,i} + \sum_{i=k+j}^{N} \alpha^x_{k,i} \right)
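A small sketch of how L_diag could be computed from a full attention matrix; indices are 0-based here, translated from the 1-indexed sum above, and the band half-width j = 3 follows Appendix A.1.

    import numpy as np

    def diagonal_attention_loss(alpha, j=3):
        # alpha: (K, N) attention over source positions at each decoding step.
        # Penalize all attention mass farther than j steps from the diagonal.
        K, N = alpha.shape
        loss = 0.0
        for k in range(K):
            loss += alpha[k, : max(k - j + 1, 0)].sum() + alpha[k, k + j :].sum()
        return loss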

Page 6: OCR Post Correction for Endangered Language Texts

Copy mechanism  Table 1 indicates that the first pass OCR predicts a majority of the characters accurately. In this scenario, enabling the model to directly copy characters from the first pass OCR rather than generate a character at each time step might lead to better performance (Gu et al., 2016).

We incorporate the copy mechanism proposed in See et al. (2017). The mechanism computes a "generation probability" at each time step k, which is used to choose between generating a character (Equation 3) or copying a character from the source text by sampling from the attention distribution \alpha^x_k.
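A minimal sketch of the resulting output distribution, assuming the vocabulary distribution from Equation 3 and the attention weights over the first pass OCR are already computed. How the generation probability p_gen is parameterized follows See et al. (2017) and is abbreviated to a comment; the vocab mapping is an illustrative assumption.

    import numpy as np

    def output_distribution(p_vocab, alpha_x, p_gen, src_chars, vocab):
        # p_gen in [0, 1] is predicted from the context vector, decoder state, and
        # previous output character (See et al., 2017).
        # P(y_k = w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i : x_i = w} alpha_{k,i}
        p = p_gen * p_vocab
        for i, ch in enumerate(src_chars):
            if ch in vocab:  # copy mass from matching first pass OCR characters
                p[vocab[ch]] += (1.0 - p_gen) * alpha_x[i]
        return p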

Coverage  Given the monotonicity of the post-correction task, the model should not attend to the same character repeatedly. However, repetition is a common problem for neural encoder-decoder models (Mi et al., 2016; Tu et al., 2016). To account for this problem, we adapt the coverage mechanism from See et al. (2017), which keeps track of the attention distribution over past time steps in a coverage vector. For time step k, the coverage vector is:

g_k = \sum_{k'=0}^{k-1} \alpha^x_{k'}

g_k is used as an extra input to the attention mechanism, ensuring that future attention decisions take the weights from previous time steps into account. Equation 1, with learnable parameter w_g, becomes:

e^x_{k,i} = v^x \tanh(W^x_1 s_{k-1} + W^x_2 h^x_i + w_g g_{k,i})

g_k is also used to penalize attending to the same locations repeatedly with a coverage loss. The coverage loss summed over all time steps k is:

L_{cov} = \sum_k \sum_{i=1}^{N} \min(\alpha^x_{k,i}, g_{k,i})
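Both the coverage vector and L_cov can be accumulated in a single pass over the attention matrix, as in this sketch (same assumed (K, N) attention layout as above):

    import numpy as np

    def coverage_loss(alpha):
        # g_k accumulates attention from steps < k; overlap with the current step's
        # attention (the elementwise minimum) is penalized, discouraging re-attending.
        K, N = alpha.shape
        g = np.zeros(N)
        loss = 0.0
        for k in range(K):
            loss += np.minimum(alpha[k], g).sum()
            g += alpha[k]
        return loss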

Therefore, with our model adaptations, the loss for a single training instance is:

L = L_{ce} + L_{diag} + L_{cov}    (4)

5.3 Utilizing Untranscribed Data

As discussed in Section 3, given the effort involved, we transcribe only a subset of the pages in each scanned book. Nonetheless, we leverage the remaining unannotated pages for pretraining our model. We use the upstream OCR tool to get a first pass transcription on all the unannotated pages.

We then create “pseudo-target” transcriptions forthe endangered language text as described below:

• Denoising rules  Using a small fraction of the available annotated pages, we compute the edit distance operations between the first pass OCR and the gold transcription. The operations of each type (insertion, deletion, and replacement) are counted for each character and divided by the number of times that character appears in the first pass OCR. This forms a probability of how often the operation should be applied to that specific character.

  We use these probabilities to form rules, such as p(replace d with ḍ) = 0.4 for Griko and p(replace ? with ʔ) = 0.7 for Yakkha. These rules are applied to remove errors from, or "denoise", the first pass OCR transcription (a code sketch follows this list).

• Sentence alignment  We use Yet Another Sentence Aligner (Lamraoui and Langlais, 2013) for unsupervised alignment of the endangered language and translation on the unannotated pages.
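A rough sketch of learning and applying the denoising rules, restricted to substitutions (insertions and deletions follow the same counting recipe). It uses difflib's approximate character alignment in place of a full edit-distance alignment, so it is an assumption-laden illustration rather than the paper's exact procedure.

    import difflib
    import random
    from collections import Counter

    def learn_substitution_rules(first_pass, gold):
        # p(replace a with b) = count(a -> b) / count(a in the first pass OCR)
        subs, chars = Counter(), Counter(first_pass)
        matcher = difflib.SequenceMatcher(a=first_pass, b=gold)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace":
                for a, b in zip(first_pass[i1:i2], gold[j1:j2]):
                    subs[(a, b)] += 1
        return {(a, b): n / chars[a] for (a, b), n in subs.items()}

    def denoise(text, rules, rng=None):
        # Stochastically apply each rule, e.g. p(replace d with ḍ) = 0.4 for Griko.
        rng = rng or random.Random(0)
        out = []
        for ch in text:
            for (a, b), p in rules.items():
                if a == ch and rng.random() < p:
                    ch = b
                    break
            out.append(ch)
        return "".join(out)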

Given the aligned first pass OCR for the endangered language text and the translation, along with the pseudo-target text (x, t, and y respectively), the pretraining steps, in order, are as follows (a schedule sketch appears after the list):

• Pretraining the encoders  We pretrain both the forward and backward LSTMs of each encoder with a character-level language model objective: the endangered language encoder on x and the translation encoder on t.

• Pretraining the decoder  The decoder is pretrained on the pseudo-target y with a character-level language model objective.

• Pretraining the seq-to-seq model  The model is pretrained with x and t as the sources and the pseudo-target y as the target transcription, using the post-correction loss function L as defined in Equation 4.
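Put together, the schedule could look like the sketch below. The model object and its methods are hypothetical stand-ins rather than the paper's actual interface; only the ordering and the epoch counts (from Appendix A.1) are taken from the paper.

    # Hypothetical trainer interface; method names are illustrative only.
    def pretrain_and_finetune(model, x_unl, t_unl, y_pseudo, annotated_data):
        # 1) character-level LM pretraining of each encoder on its own side
        model.source_encoder.pretrain_char_lm(x_unl, epochs=10)
        model.translation_encoder.pretrain_char_lm(t_unl, epochs=10)
        # 2) character-level LM pretraining of the decoder on the pseudo-target
        model.decoder.pretrain_char_lm(y_pseudo, epochs=10)
        # 3) seq-to-seq pretraining with the full post-correction loss L (Equation 4)
        model.fit(sources=(x_unl, t_unl), target=y_pseudo, epochs=5)
        # 4) fine-tune on the small annotated set, early stopping on validation CER
        model.fit(*annotated_data, early_stopping="validation_cer")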

6 Experiments

This section discusses our experimental setup and the post-correction performance on the three endangered languages in our dataset.

6.1 Experimental Setup

Data Splits  We perform 10-fold cross-validation for all experimental settings because of the small size of the datasets. For each language, we divide the transcribed data into 11 segments — we use one segment for creating the denoising rules described in the previous section and the remaining ten as the folds for cross-validation. In each cross-validation fold, eight segments are used for training, one for validation, and one for testing.

We divide the dataset at the page level for the Ainu and Griko documents, resulting in 11 segments of three pages each. For the Yakkha documents, we divide at the paragraph level, due to the small size of the dataset. We have 33 paragraphs across the three books in our dataset, resulting in 11 segments that contain three paragraphs each. The multi-source results for Yakkha reported in Table 2 use the English translations. Results with Nepali are similar and are included in Appendix A.

          Character Error Rate                          Word Error Rate
          Ainu          Griko         Yakkha            Ainu          Griko          Yakkha
Model     Multi  Single Multi  Single Multi   Single    Multi  Single Multi  Single  Multi  Single
FP-OCR    –      1.34   –      3.27   –       8.90      –      6.27   –      15.63   –      31.64
BASE      1.56   1.41   6.78   5.95   70.39   71.71     8.56   7.88   15.13  13.67   98.15  99.10
COPY      2.04   1.99   2.54   2.28   14.77   12.30     9.48   8.57   9.33   8.90    30.36  27.81
OURS      0.92   0.80   1.66   1.70   7.75    8.44      5.75   5.19   7.46   7.51    20.95  21.33

Table 2: Our method improves performance over all baselines (10-fold cross-validation averaged over five randomly seeded runs). We present multi- and single-source variants and highlight the best model for each language.
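A sketch of the split described above: eleven segments per language, one reserved for the denoising rules and ten rotating through train/validation/test. How the validation fold is paired with each test fold is our assumption, not specified in the paper.

    import random

    def cross_validation_folds(segments, seed=0):
        # segments: 11 page- or paragraph-level chunks of the transcribed data
        assert len(segments) == 11
        order = segments[:]
        random.Random(seed).shuffle(order)
        denoising, folds = order[0], order[1:]
        for k in range(10):
            val = (k + 1) % 10
            train = [f for i, f in enumerate(folds) if i not in {k, val}]
            yield train, folds[val], folds[k], denoising  # train, validation, test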

Metrics  We use two metrics for evaluating our systems: character error rate (CER) and word error rate (WER). Both metrics are based on edit distance and are standard for evaluating OCR and OCR post-correction (Berg-Kirkpatrick et al., 2013; Schulz and Kuhn, 2017). CER is the edit distance between the predicted and the gold transcriptions of the document, divided by the total number of characters in the gold transcription. WER is similar but is calculated at the word level.

Methods  In our experiments, we compare the performance of our proposed method with the first pass OCR and with two systems from recent work in OCR post-correction. All the post-correction methods have two variants: the single-source model with only the endangered language encoder, and the multi-source model that additionally uses the high-resource translation encoder.

• FP-OCR: The first pass transcription obtained from the Google Vision OCR system.

• BASE: The base sequence-to-sequence architecture described in Section 5.1. Both the single-source and multi-source variants of this system are used for English OCR post-correction in Dong and Smith (2018).

• COPY: The base architecture with a copy mechanism, as described in Section 5.2. The single-source variant of this model is used for OCR post-correction on Romanized Sanskrit in Krishna et al. (2018).8

• OURS: The model with all the adaptations proposed in Section 5.2 and Section 5.3.

Implementation  The post-correction models are implemented using the DyNet neural network toolkit (Neubig et al., 2017), and all reported results are the average of five training runs with different random seeds. We assume knowledge of the entire alphabet of the endangered language for all the methods, which is straightforward to obtain for most languages. The decoder's vocabulary contains all these characters, irrespective of their presence in the training data, with corresponding randomly-initialized character embeddings.
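A small sketch of building such an alphabet-complete decoder vocabulary with randomly initialized embeddings. The special tokens and the initialization scale are assumptions; the embedding size matches Appendix A.1.

    import numpy as np

    def build_decoder_vocab(alphabet, embed_dim=128, seed=0):
        # every character of the language's alphabet gets an entry, whether or not
        # it appears in the training data; embeddings start random and are learned
        symbols = ["<pad>", "<s>", "</s>", "<unk>"] + sorted(alphabet)
        vocab = {ch: i for i, ch in enumerate(symbols)}
        emb = np.random.default_rng(seed).normal(scale=0.1, size=(len(symbols), embed_dim))
        return vocab, emb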

6.2 Main Results

Table 2 shows the performance of the baselines and our proposed method for each language. Overall, our method results in an improved CER and WER over existing methods across all three languages.

The BASE system does not improve the recognition rate over the first pass transcription, apart from a small decrease in the Griko WER. The performance on Yakkha, particularly, is significantly worse than FP-OCR: likely because the data size of Yakkha is much smaller than that of Griko and Ainu, and the model is unable to learn a reasonable distribution. However, on adding a copy mechanism to the base model in the COPY system, the performance is notably better for both Griko and Yakkha. This indicates that adaptations to the base model that cater to specific characteristics of the post-correction task can alleviate some of the challenges of learning from small amounts of data.

8 Although Krishna et al. (2018) use BPE tokenization, preliminary experiments showed that character-level models result in much better performance on our dataset, likely due to the limited data available for training the BPE model.


Figure 4: Our model fixes many mixed script and uncommon diacritics errors, as in panels (a) Griko and (b) Yakkha; for example, the first pass OCR "exi i kaddinara" is corrected to "eχi i kaḍḍinara". In rare cases, it "over-corrects" the first pass OCR transcription, introducing errors as in panels (c) Griko and (d) Yakkha; for example, the first pass "e ffacilo" is over-corrected to "è ffaćilo" instead of the gold "è ffacilo". [Scanned-image examples not reproduced.]

Figure 5: WER with model component ablations on the best model setting in Table 2, shown for Ainu, Griko, and Yakkha. "all" includes all the adaptations we propose. Each ablation removes a single component from the "all" model, e.g. "-pretr. s2s" removes the seq-to-seq model pretraining. [Bar charts not reproduced.]


The single-source and the multi-source variants of our proposed method improve over the baselines, demonstrating that our proposed model adaptations can improve recognition even without translations. We see that using the high-resource translations results in better post-correction performance for Griko and Yakkha, but the single-source model achieves better accuracy for Ainu. We attribute this to two factors: the very low error rate of the first pass transcription for Ainu and the relatively high error rate (based on manual inspection) of the OCR on the Japanese translation. Despite Japanese being a high-resource language, OCR is difficult due to the complexity of Japanese characters and low scan quality. The noise resulting from the Japanese OCR errors likely hurts the multi-source model.

6.3 Ablation Studies

Next, we study the effect of our proposed adaptations and evaluate their benefit to the performance for each language. Figure 5 shows the word error rate with models that remove one adaptation from the model with all the adaptations ("all").

For Ainu and Griko, removing any single component increases the WER, with the complete ("all") method performing the best. There is little variance in the Ainu ablations, likely due to the high-quality first pass transcription.

Our proposed adaptations add the most benefit for Yakkha, which has the least training data and a relatively less accurate first pass OCR. The copy mechanism is crucial for good performance, but removing the decoder pretraining ("-pretr. dec") leads to the best scores among all the ablations. The denoising rules used to create the pseudo-target data for Yakkha are likely not accurate, since they are derived from only three paragraphs of annotated data. Consequently, using this data to pretrain the decoder leads to a poor language model.

6.4 Error Analysis

We systematically inspect all the recognition errors in the output of our post-correction model to determine the sources of improvement with respect to the first pass OCR. We also examine the types of errors introduced by the post-correction process.

We observe a 91% reduction in the number of errors due to mixed scripts and a 58% reduction in the errors due to uncommon characters and diacritics (as defined in Section 4). Examples of these are shown in Figure 4 (a) and (b): mixed script errors such as the χ character in Griko and the glottal stop ʔ in Yakkha are successfully corrected by the model. The model is also able to correct uncommon character errors like ḍ in Griko and ङ in Yakkha.


Examples of errors introduced by the model are shown in Figure 4 (c) and (d). Example (c) is in Griko, where the model incorrectly adds a diacritic to a character. We attribute this to the fact that the first pass OCR does not recognize diacritics well; hence, the model learns to add diacritics frequently while generating the output. Example (d) is in Yakkha. The model inserts several incorrect characters, which can likely be attributed to the lack of a good language model, given the relatively small amount of training data we have in Yakkha.

7 Related Work

Post-correction for OCR is well-studied for high-resource languages. Early approaches include lexical methods and weighted finite-state methods (see Schulz and Kuhn (2017) for an overview). Recent work has primarily focused on using neural sequence-to-sequence models. Hämäläinen and Hengchen (2019) use a BiLSTM encoder-decoder with attention for historical English post-correction. Similar to our base model, Dong and Smith (2018) use a multi-source model to combine the first pass OCR from duplicate documents in English.

There has been little work on lower-resourced languages. Kolak and Resnik (2005) present a probabilistic edit distance based post-correction model applied to Cebuano and Igbo, and Krishna et al. (2018) show improvements on Romanized Sanskrit OCR by adding a copy mechanism to a neural sequence-to-sequence model.

Multi-source encoder-decoder models have been used for various tasks including machine translation (Zoph and Knight, 2016; Libovicky and Helcl, 2017) and morphological inflection (Kann et al., 2017; Anastasopoulos and Neubig, 2019). Perhaps most relevant to our work is the multi-source model presented by Anastasopoulos and Chiang (2018), which uses high-resource translations to improve speech transcription of lower-resourced languages.

Finally, Bustamante et al. (2020) construct corpora for four endangered languages from text-based PDFs using rule-based heuristics. Data creation from such unstructured text files is an important research direction, complementing our method of extracting data from scanned images.

8 Conclusion

This work presents a first step towards extracting textual data in endangered languages from scanned images of paper books. We create a benchmark dataset with transcribed images in three endangered languages: Ainu, Griko, and Yakkha. We propose an OCR post-correction method that facilitates learning from small amounts of data, which results in a 34% average relative error reduction in both the character and word recognition rates.

As future work, we plan to investigate the effect of using other available data for the three languages (for example, word lists collected by documentary linguists or the additional Griko folk tales collected by Anastasopoulos et al. (2018)).

Additionally, it would be valuable to examine whether our method can improve the OCR on high-resource languages, which typically have much better recognition rates in the first pass transcription than the endangered languages in our dataset.

Further, we note our use of the Google Vision OCR system to obtain the first pass OCR for our experiments, primarily because it provides script-specific models, as opposed to other general-purpose OCR systems that rely on language-specific models (as discussed in Section 4). Future work that focuses on overcoming the challenges of applying language-specific models to endangered language texts would be needed to confirm our method's applicability to post-correcting the first pass transcriptions from different OCR systems.

Lastly, given the annotation effort involved, this paper explores only a small fraction of the endangered language data available in linguistic and general-purpose archives. Future work will focus on large-scale digitization of scanned documents, aiming to expand our OCR benchmark to as many endangered languages as possible, in the hope of both easing linguistic documentation and preservation efforts and collecting enough data for NLP system development in under-represented languages.

Acknowledgements

We thank David Chiang, Walter Scheirer, and William Theisen for initial discussions on the project, the University of Notre Dame Library for the scanned "Kutune Shirka" Ainu-Japanese book, and Josep Quer for the scanned Griko folk tales book. We also thank Taylor Berg-Kirkpatrick, Shuyan Zhou, Zi-Yi Dou, Yansen Wang, Zhen Fan, and Deepak Gopinath for feedback on the paper.

This material is based upon work supported in part by the National Science Foundation under Grant No. 1761548. Shruti Rijhwani is supported by a Bloomberg Data Science Ph.D. Fellowship.


References

Antonios Anastasopoulos and David Chiang. 2018. Leveraging translations for speech transcription in low-resource settings. In Proc. INTERSPEECH.

Antonios Anastasopoulos, Marika Lekakou, Josep Quer, Eleni Zimianiti, Justin DeBenedetto, and David Chiang. 2018. Part-of-speech tagging on an endangered language: a parallel Griko-Italian resource. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2529–2539, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Antonios Anastasopoulos and Graham Neubig. 2019. Pushing the limits of low-resource morphological inflection. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 984–996, Hong Kong, China. Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.

Taylor Berg-Kirkpatrick, Greg Durrett, and Dan Klein. 2013. Unsupervised transcription of historical documents. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 207–217, Sofia, Bulgaria. Association for Computational Linguistics.

Gina Bustamante, Arturo Oncevay, and Roberto Zariquiey. 2020. No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2914–2923, Marseille, France. European Language Resources Association.

Maud Devos. 2019. Shangaji. A Maka or Swahili language of Mozambique. Grammar, texts and wordlist. https://elar.soas.ac.uk/Collection/MPI1029699. Accessed: 2020-02-02.

Rui Dong and David Smith. 2018. Multi-input attention for unsupervised OCR correction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2363–2372, Melbourne, Australia. Association for Computational Linguistics.

Yasuhisa Fujii, Karel Driesen, Jonathan Baccash, Ash Hurst, and Ashok C. Popat. 2017. Sequence-to-label script identification for multilingual OCR. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 161–168. IEEE.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany. Association for Computational Linguistics.

Mika Hämäläinen and Simon Hengchen. 2019. From the paft to the fiiture: a fully automatic NMT and word embeddings method for OCR post-correction. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 431–436, Varna, Bulgaria. INCOMA Ltd.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Richard W. Howell. 1951. The classification and description of Ainu folklore. The Journal of American Folklore, 64(254):361–369.

R. Reeve Ingle, Yasuhisa Fujii, Thomas Deselaers, Jonathan Baccash, and Ashok C. Popat. 2019. A scalable handwritten text recognition system. arXiv preprint arXiv:1904.09150.

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online. Association for Computational Linguistics.

Katharina Kann, Ryan Cotterell, and Hinrich Schütze. 2017. Neural multi-source morphological reinflection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 514–524, Valencia, Spain. Association for Computational Linguistics.

Kyosuke Kindaichi. 1931. Ainu Jojishi Yukara no Kenkyu [Research on Ainu Epic Yukar]. Tokyo: Tokyo Bunko.

Okan Kolak and Philip Resnik. 2005. OCR post-processing for low density languages. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 867–874, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Amrith Krishna, Bodhisattwa P. Majumder, Rajesh Bhat, and Pawan Goyal. 2018. Upcycle your OCR: Reusing OCRs for post-OCR text correction in Romanised Sanskrit. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 345–355, Brussels, Belgium. Association for Computational Linguistics.

Fethi Lamraoui and Philippe Langlais. 2013. Yet another fast, robust and open source sentence aligner. Time to reconsider sentence alignment? In XIV Machine Translation Summit, Nice, France.

Jindřich Libovický and Jindřich Helcl. 2017. Attention strategies for multi-source sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 196–202, Vancouver, Canada. Association for Computational Linguistics.

Haitao Mi, Baskaran Sankaran, Zhiguo Wang, and Abe Ittycheriah. 2016. Coverage embedding models for neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 955–960, Austin, Texas. Association for Computational Linguistics.

Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980.

C. Rigaud, A. Doucet, M. Coustaty, and J. Moreux. 2019. ICDAR 2019 competition on post-OCR text correction. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1588–1593.

Diana Schackow. 2012. Documentation and grammatical description of Yakkha, Nepal. https://elar.soas.ac.uk/Collection/MPI186180. Accessed: 2020-02-02.

Diana Schackow. 2015. A grammar of Yakkha. Language Science Press.

Carsten Schnober, Steffen Eger, Erik-Lân Do Dinh, and Iryna Gurevych. 2016. Still not there? Comparing traditional sequence-to-sequence models to encoder-decoder neural networks on monotone string translation tasks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1703–1714, Osaka, Japan. The COLING 2016 Organizing Committee.

Sarah Schulz and Jonas Kuhn. 2017. Multi-modular domain-tailored OCR post-correction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2716–2726, Copenhagen, Denmark. Association for Computational Linguistics.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.

Paolo Stomeo. 1980. Racconti greci inediti di Sternatía. La nuova Ellade, s.I.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–85, Berlin, Germany. Association for Computational Linguistics.

Barret Zoph and Kevin Knight. 2016. Multi-source neural translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 30–34, San Diego, California. Association for Computational Linguistics.


A Appendix

A.1 Implementation Details

The hyperparameters used are:

• Character embedding size = 128

• Number of LSTM layers = 1

• Hidden state size of the LSTM = 256

• Attention size = 256

• Beam size = 4

• For the diagonal loss, j = 3

• Minibatch size for training = 1

• Maximum number of epochs = 150

• Patience for early stopping = 10 epochs

• Pretraining epochs for encoder/decoder = 10

• Pretraining epochs for seq-to-seq model = 5

We use the same values of the hyperparameters for each language and all the systems. We select the best model with early stopping on the character error rate of the validation set.

A.2 Additional Experimental Results

Performance on Yakkha with Nepali  Table 3 shows the performance for the Yakkha dataset when using Nepali as the high-resource translation input to the multi-source model. The performance is similar to that of the experiments using the English translations, as presented in Table 2.

Standard deviation on the main results  Table 4 and Table 5 show the character error rate and word error rate, respectively, including the standard deviation over five randomly seeded runs, corresponding to the systems described in Table 2.

Model     CER     WER
FP-OCR    8.90    31.64
BASE      70.89   100.00
COPY      11.60   26.74
OURS      7.95    20.83

Table 3: Character error rate (CER) and word error rate (WER) for the Yakkha dataset with the multi-source model that uses the OCR on Nepali as the high-resource translation. The table shows the mean over five random runs.

(a) Ainu
Model     Multi          Single
FP-OCR    –              1.34
BASE      1.56 ± 0.23    1.41 ± 0.16
COPY      2.04 ± 0.62    1.99 ± 0.41
OURS      0.92 ± 0.05    0.80 ± 0.07

(b) Griko
Model     Multi          Single
FP-OCR    –              3.27
BASE      6.78 ± 0.62    5.95 ± 0.52
COPY      2.54 ± 0.31    2.28 ± 0.28
OURS      1.66 ± 0.03    1.70 ± 0.21

(c) Yakkha
Model     Multi          Single
FP-OCR    –              8.90
BASE      70.39 ± 0.49   71.71 ± 0.71
COPY      14.77 ± 0.97   12.30 ± 2.39
OURS      7.75 ± 0.46    8.44 ± 0.90

Table 4: Mean and standard deviation of the character error rate with 10-fold cross-validation over five random seeds. The results presented are the same as Table 2, with the added information of standard deviation. The best models for each language are highlighted.

(a) Ainu
Model     Multi          Single
FP-OCR    –              6.27
BASE      8.56 ± 1.01    7.88 ± 0.64
COPY      9.48 ± 3.07    8.57 ± 1.45
OURS      5.75 ± 0.24    5.19 ± 0.31

(b) Griko
Model     Multi          Single
FP-OCR    –              15.63
BASE      15.13 ± 0.99   13.67 ± 1.17
COPY      9.33 ± 0.49    8.90 ± 0.51
OURS      7.46 ± 0.09    7.51 ± 0.31

(c) Yakkha
Model     Multi          Single
FP-OCR    –              31.64
BASE      98.15 ± 1.55   99.10 ± 2.20
COPY      30.36 ± 1.39   27.81 ± 1.65
OURS      20.95 ± 1.04   21.33 ± 0.53

Table 5: Mean and standard deviation of the word error rate with 10-fold cross-validation over five random seeds. The results presented are the same as Table 2, with the added information of standard deviation. The best models for each language are highlighted.