Tesseract OCR customization - digitisation.eu · Alchemy API - Training Material, version 1.0, 09/12/2013

Tesseract OCR customization – Cutouts and page-generator - Training Material | WP3

Author: Adam Dudczak (PSNC)

Introduction

Creating searchable versions of historical documents is, in general, a hard problem. Early printed documents differ considerably from modern ones in their letters, fonts and conventions. Contemporary commercial OCR applications were trained to recognise modern documents, which limits their applicability to historical material. One possible solution to this problem is OCR customization.

Tesseract (https://code.google.com/p/tesseract-ocr/) is a well-known open-source OCR application that features, among other things, layout analysis and training capabilities. Because Tesseract is a command-line tool, it is easy to integrate into a larger digitisation workflow. This document describes how to create a custom recognition profile for a specific kind of document using a web application called Cutouts (http://wlt.synat.pcss.pl/cutouts) and a command-line tool called page-generator (https://github.com/psnc-dl/page-generator).

Requirements

Cutouts is available at http://wlt.synat.pcss.pl/cutouts/ and page-generator is available on GitHub: https://github.com/psnc-dl/page-generator

Usage

Figure 1. First step in training material preparation, adjusting the boundaries of the glyph.

Succeed is supported by the European Union under FP7-ICT and coordinated by Universidad de Alicante.


First, the training images are processed with Tesseract OCR and uploaded to the Cutouts application. Users then prepare the training material by adjusting the boundaries of the glyphs that Tesseract recognised in the initial step (see Figure 1). As a result, each processed glyph is represented by four small files:

• original, not-binarized image with a glyph,

• binarized version of the image with a glyph,

• final version, which includes binarization and manual correction performed by user, e.g. removal of overlapping glyphs,

• XML file with metadata.

The XML file contains several important pieces of information about the glyph itself and about the original scanned image: the coordinates of the glyph, the size of the original image, the name of the original file, and the Unicode character associated with the glyph. In addition, Cutouts allows users to record the print quality of a given glyph. A user can mark "noisy" glyphs as unreadable, so that these characters can be filtered out when the training material is prepared.
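To make the role of the metadata file concrete, the following Python sketch reads one glyph record and filters out glyphs marked as unreadable. The XML layout is not given in this document, so every element and attribute name below (`glyph`, `char`, `sourceFile`, `sourceSize`, `box`, `unreadable`) is a hypothetical stand-in. The real Cutouts schema may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata record; the real Cutouts schema may differ.
SAMPLE_XML = """
<glyph>
  <char>ſ</char>
  <sourceFile>page_001.png</sourceFile>
  <sourceSize width="2480" height="3508"/>
  <box x="412" y="230" width="18" height="31"/>
  <unreadable>false</unreadable>
</glyph>
"""

def parse_glyph(xml_text):
    """Read one glyph record into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    box = root.find("box").attrib
    size = root.find("sourceSize").attrib
    return {
        "char": root.findtext("char"),
        "source_file": root.findtext("sourceFile"),
        "source_size": (int(size["width"]), int(size["height"])),
        "box": tuple(int(box[k]) for k in ("x", "y", "width", "height")),
        "unreadable": root.findtext("unreadable").strip().lower() == "true",
    }

def readable_only(glyphs):
    """Drop glyphs that a user marked as unreadable."""
    return [g for g in glyphs if not g["unreadable"]]

glyph = parse_glyph(SAMPLE_XML)
noisy = dict(glyph, unreadable=True)   # same glyph, flagged as noisy
kept = readable_only([glyph, noisy])
print(len(kept), kept[0]["char"], kept[0]["box"])
```

Filtering at this stage keeps noisy glyphs out of the generated training pages without deleting the volunteers' work.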

Figure 2. Example of original page and training image created using page-generator.

The next step uses page-generator, which converts the Cutouts output into Tesseract training images (see Figure 2). A simple Bash script is then launched to perform the training, which produces a new recognition profile. This profile can then be uploaded to an OCR service or used in other tools, e.g. the Virtual Transcription Laboratory (http://wlt.synat.pcss.pl), which offers a web-based user interface for OCR and post-correction.
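The Bash script itself is not reproduced in this document. As an illustration only, the Python sketch below lists the commands of the standard Tesseract 3.02 training procedure that such a script would typically wrap; it builds the command lines and does not run them. It assumes that the training images and box files are named `<lang>.<font>.exp<N>.tif` and `.box`, and that a `font_properties` file already exists. The language code `pol` and font name `antiqua` are made-up examples.

```python
def training_commands(lang, font, pages):
    """Return the Tesseract 3.02 training commands as argument lists."""
    bases = ["%s.%s.exp%d" % (lang, font, n) for n in range(pages)]
    tr_files = [b + ".tr" for b in bases]
    cmds = []
    # 1. Produce a .tr feature file from every training image and its .box file.
    for base in bases:
        cmds.append(["tesseract", base + ".tif", base, "nobatch", "box.train"])
    # 2. Collect the character set used in the box files.
    cmds.append(["unicharset_extractor"] + [b + ".box" for b in bases])
    # 3. Cluster the extracted features into character prototypes.
    cmds.append(["mftraining", "-F", "font_properties", "-U", "unicharset",
                 "-O", lang + ".unicharset"] + tr_files)
    cmds.append(["cntraining"] + tr_files)
    # 4. Prefix the generated files with the language code.
    for name in ("inttemp", "normproto", "pffmtable", "shapetable"):
        cmds.append(["mv", name, lang + "." + name])
    # 5. Pack everything into <lang>.traineddata.
    cmds.append(["combine_tessdata", lang + "."])
    return cmds

cmds = training_commands("pol", "antiqua", 2)
for cmd in cmds:
    print(" ".join(cmd))
```

The resulting `<lang>.traineddata` file corresponds to the recognition profile mentioned above. Other Tesseract versions use a different set of steps.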

The two most tedious parts of this process, preparation of the training material and OCR post-correction, are implemented as web-based tools, so the whole process can be accelerated by distributing small units of work among a group of volunteers.

ASSURING QUALITY WITH CUTOUTS

The nature of crowdsourcing does not guarantee that only skilled, well-trained experts will create the training material. For that reason Cutouts also features an audit interface. For each scanned image, the application administrator can check a statistically relevant sample of the material prepared by the volunteers and, based on this sample, judge the quality of the prepared material. Figure 3 presents the Cutouts audit interface; incorrect elements can be rejected, in which case the editor has to process them again.

Figure 3. Audit interface for checking the quality of materials prepared by volunteers.
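The idea of auditing a sample rather than every glyph can be sketched in Python as follows. This is not how Cutouts selects its sample, which this document does not describe. The function names, the 10% sampling fraction and the 5% error threshold are all assumptions chosen for illustration.

```python
import random

def draw_audit_sample(glyph_ids, fraction=0.1, seed=None):
    """Pick a random subset of glyphs for the administrator to review."""
    ids = list(glyph_ids)
    if not ids:
        return []
    size = max(1, round(len(ids) * fraction))
    return random.Random(seed).sample(ids, size)

def accept_page(verdicts, max_error_rate=0.05):
    """Accept the page if the share of rejected sample glyphs is small enough.

    verdicts maps each sampled glyph id to True (correct) or False (rejected).
    """
    rejected = sum(1 for ok in verdicts.values() if not ok)
    return rejected / len(verdicts) <= max_error_rate

sample = draw_audit_sample(range(200), fraction=0.1, seed=1)
all_good = {glyph_id: True for glyph_id in sample}
two_bad = dict(all_good)
for glyph_id in sample[:2]:
    two_bad[glyph_id] = False          # administrator rejects two glyphs
print(len(sample), accept_page(all_good), accept_page(two_bad))
```

With these assumed values, two rejected glyphs in a sample of twenty give an error rate of 10%, which exceeds the threshold, so the page would go back to the editor.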

