Historical Document Digitization through Layout Analysis and Deep Content Classification
Andrea Corbelli, Lorenzo Baraldi, Costantino Grana and Rita Cucchiara
Dipartimento di Ingegneria Enzo Ferrari
Università degli Studi di Modena e Reggio Emilia, 41125 Modena, Italy
Abstract—Document layout segmentation and recognition is an important task in the creation of digitized document collections, especially when dealing with historical documents. This paper presents a hybrid approach to layout segmentation as well as a strategy to classify document regions, which is applied to the process of digitization of a historical encyclopedia. Our layout analysis method merges a classic top-down approach and a bottom-up classification process based on local geometrical features, while regions are classified by means of features extracted from a Convolutional Neural Network and merged in a Random Forest classifier. Experiments are conducted on the first volume of the Enciclopedia Treccani, a large dataset containing 999 manually annotated pages from the historical Italian encyclopedia.
I. INTRODUCTION
Document digitization is a very important aspect of document preservation. For historical documents in particular, the digitization process improves not only the durability of the documents, but also their accessibility. Historical documents have often been transcribed, but a plain text version lacks all the metadata available in the original version, regarding images, layout arrangements and links to the original page. An extensive digitization process involves, along with the creation of a digital copy, the extraction of the information contained in a document. Simple optical character recognition (OCR) is only one of the necessary steps that lead to the creation of a fully digital version of a document; others include the recognition of all the different elements that might appear in a document, such as images, tables and formulas. All of these must be segmented and classified to obtain a truly digital version of the document.
Despite its long history, layout analysis is still a difficult task due to the great number of possible layout configurations and content arrangements. In this paper we present a document analysis pipeline which includes a layout segmentation algorithm and a content classification method to classify the actual content of the segmented regions. Moreover, in case a manual transcription is available, we propose a technique to map OCR-ed text to the transcription, to enrich the digitized version with additional metadata.
We test our pipeline on the Italian historical encyclopedia, the Enciclopedia Treccani, published between 1929 and 1936. The encyclopedia consists of 35 volumes, for a total of around 60000 articles. In our experiments we focus on the first
volume, which contains all the different layout arrangements found in the encyclopedia and which has been manually annotated by us. A possible outcome of this work is indeed the digitization of the entire encyclopedia.
The rest of this paper is structured as follows: Section 2 gives a brief discussion of the state of the art in layout analysis and content classification, Section 3 explains the main components of our pipeline, and Section 4 reports the performance evaluation and a comparison against the state of the art.
II. RELATED WORK
Layout analysis has been an active area of research since the seventies. There are two main approaches to this task, namely bottom-up and top-down. Top-down methods, such as XY cuts or methods that exploit white streams or projection profiles, are usually fast but tend to fail when dealing with complex layouts. Bottom-up methods are instead more flexible: they process the page image from the pixel level and subsequently aggregate pixels into higher-level regions, but at a higher computational cost.
These approaches are usually based on mathematical morphology, Connected Components (CCs), Voronoi diagrams or run-length smearing. Many other methods exist which do not fit exactly into either of these categories: the so-called mixed or hybrid approaches try to combine the high speed of the top-down approaches with the robustness of the bottom-up ones. Chen et al. propose a method based on whitespace rectangle extraction and grouping: initially the foreground CCs are extracted and linked into chains according to their horizontal adjacency relationship; whitespace rectangles are then extracted from the gaps between horizontally adjacent CCs; CCs and whitespaces are progressively grouped and filtered to form text lines and afterwards text blocks. Lazzara et al. provide a chain of steps to first recognize text regions and subsequently non-text elements. Foreground CCs are extracted, then delimiters (such as lines, whitespaces and tab-stops) are detected with object alignment and morphological algorithms. Since text components are usually well aligned, have a uniform size and are close to each other, the authors propose to regroup CCs by looking for their neighbors. Filters can also be applied to a group of CCs to validate the link between two CCs. Another algorithm based on whitespace analysis has been
Fig. 1. The page layout segmentation pipeline: (a) XY-Cut, (b) detected illustrations, (c) final result. First the Recursive XY-Cut algorithm is applied to detect candidate regions inside the page; then, illustrations are detected using local autocorrelation features. A second application of the XY-Cut algorithm gives the final segmentation.
proposed by Baird et al.: the algorithm uses the white space in the page as a layout delimiter and tries to find the biggest empty background rectangles to extract connected regions.
On a different note, document content classification is also a problem that has been faced by many researchers. Regarding the applied strategies, existing algorithms can be subdivided into two main categories: rule-based and statistical. Some papers present algorithms to classify whole pages into different categories, such as title page, table of contents, etc., while a different approach is to classify homogeneous regions in a document into different classes, such as text, image, table, etc. It is interesting to note that many papers face this problem trying to distinguish only one class: for example, Zanibbi et al. focus on mathematical expressions, Hu et al. on tables, Chen et al. and Pham on logo detection, and Futrelle et al. on diagrams. These approaches solve only part of the classification problem.
Regarding multi-class algorithms, many of them exploit rules built specifically for certain document classes: for example, Krishnamoorthy et al. and Lee et al. developed algorithms to identify entities in scientific journals and papers, Mitchell et al. identify entities in newspaper pages, and Jain and Yu identify various types of entities in different types of documents. These approaches often rely on hand-built heuristics, strictly tied to specific document types. Other approaches use statistical features and image features to classify document regions into different categories. The algorithm proposed by Sivaramakrishnan et al. distinguishes between nine different classes, extracting features for each region using run-length mean and variance, number of black pixels and aspect ratio of the region. Fan and Wang use density features and connectivity histograms to classify regions into text, images and graphics. Li et al. proposed an algorithm that models images using two-dimensional HMMs, and Wang et al. proposed an algorithm based on the representation of a region by means of a 25-dimensional feature vector and on a classification tree.
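The run-length statistics mentioned above are simple to compute. The following is a minimal illustrative sketch of such region descriptors, not the exact feature set of any of the cited works; the function name, the restriction to horizontal runs, and the four features chosen are our assumptions:

```python
import numpy as np

def region_features(region):
    """Statistical descriptors for a binarized region (1 = ink):
    horizontal black-run statistics, ink density and aspect ratio,
    in the spirit of run-length based region classifiers."""
    h, w = region.shape
    runs, length = [], 0
    for row in region:
        for px in row:
            if px:
                length += 1
            elif length:
                runs.append(length)   # black run ended at a white pixel
                length = 0
        if length:                    # run ending at the right border
            runs.append(length)
            length = 0
    runs = np.array(runs if runs else [0], dtype=float)
    return {
        "run_mean": runs.mean(),                 # mean horizontal black-run length
        "run_var": runs.var(),                   # variance of run lengths
        "black_ratio": region.sum() / (h * w),   # ink density
        "aspect_ratio": w / h,                   # region shape
    }
```

Text regions tend to exhibit short, regular runs and moderate ink density, whereas graphics produce long runs and more variable statistics, which is what makes such features discriminative.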
III. PROPOSED APPROACH
A. Layout Segmentation
Layout segmentation aims at segmenting an input page into coherent regions, like text, tables and images, and is therefore a necessary step for classifying the contents of a document. In the following, we propose a top-down layout analysis approach, which builds upon the classic XY-Cut algorithm and is able to deal with more complex layouts.
The XY-Cut algorithm is applied as the first step of our layout segmentation pipeline. This classic top-down algorithm recursively splits a region of interest into two, based on the spatial arrangement of white pixels. The algorithm exploits the vertical and horizontal projections of the pixels in order to find low-density regions in the projection histograms. These low-density points correspond to white spaces separating two different layout elements, and the region is therefore split accordingly.
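The recursive splitting step can be sketched as follows. This is a minimal sketch under our own assumptions (function names, the `min_gap` threshold and the all-zero valley criterion are illustrative choices, not the exact implementation used in the paper):

```python
import numpy as np

def interior_gap(profile, min_gap):
    """Center of the widest all-zero run of at least min_gap bins that is
    strictly inside the profile (runs touching a border are margins, not cuts)."""
    runs, start = [], None
    for i, v in enumerate(profile):
        if v == 0 and start is None:
            start = i
        elif v != 0 and start is not None:
            runs.append((start, i))
            start = None
    # a run still open at the end touches the right border and is ignored;
    # also drop runs starting at the left border or narrower than min_gap
    runs = [(s, e) for s, e in runs if s > 0 and e - s >= min_gap]
    if not runs:
        return None
    s, e = max(runs, key=lambda r: r[1] - r[0])
    return (s + e) // 2

def xy_cut(page, min_gap=10):
    """Recursive XY-Cut on a binarized page (1 = ink). Returns a list of
    (y0, y1, x0, x1) rectangles that could not be split any further."""
    regions = []
    def rec(y0, y1, x0, x1):
        block = page[y0:y1, x0:x1]
        gy = interior_gap(block.sum(axis=1), min_gap)   # row projection
        if gy is not None:                              # horizontal cut
            rec(y0, y0 + gy, x0, x1)
            rec(y0 + gy, y1, x0, x1)
            return
        gx = interior_gap(block.sum(axis=0), min_gap)   # column projection
        if gx is not None:                              # vertical cut
            rec(y0, y1, x0, x0 + gx)
            rec(y0, y1, x0 + gx, x1)
            return
        if block.any():                                 # keep non-empty leaves
            regions.append((y0, y1, x0, x1))
    rec(0, page.shape[0], 0, page.shape[1])
    return regions
```

Restricting cuts to interior valleys prevents infinite recursion on regions that merely have empty margins; in practice the zero-valley test would be relaxed to a low-density threshold to tolerate noise.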
The XY-Cut algorithm is suitable to find rectangular regions surrounded by white space, but in real layouts it is not uncommon to find images within text areas (see Fig. 1 for an example). The involved text area, instead of being a plain rectangle, is identifiable as a rectilinear polygon, which cannot be recognized by the XY-Cut algorithm. To address this problem, an additional analysis step is performed in order to detect images and illustrations in the page.
The image detection problem is approached exploiting local autocorrelation features, and the method we propose is inspired by a previously proposed algorithm. The assumption is that text areas are clearly distinguishable thanks to the text line pattern, which is absent in images; thus the autocorrelation matrix of a region is used as an effective descriptor for this task. The image is subdivided into square blocks of size n × n, and for each block the autocorrelation matrix C(k, l), with k, l ∈ [−n/2, n/2], is computed. Then, the autocorrelation
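A block autocorrelation matrix of this kind can be computed efficiently via the FFT. The following is an illustrative sketch (the FFT-based formulation, mean removal and normalization are our assumptions, not necessarily the exact computation used in the paper):

```python
import numpy as np

def block_autocorrelation(block):
    """Autocorrelation matrix of a gray-scale n x n block, computed via the
    Wiener-Khinchin theorem: C = IFFT(|FFT(b)|^2), with the zero lag shifted
    to the center so that lag indices span roughly [-n/2, n/2]."""
    b = block.astype(float)
    b -= b.mean()                          # remove the DC component
    c = np.fft.ifft2(np.abs(np.fft.fft2(b)) ** 2).real
    c = np.fft.fftshift(c)                 # zero lag (k = l = 0) at the center
    peak = c.max()
    return c / peak if peak > 0 else c     # normalize so that C(0, 0) = 1
```

On a block containing horizontal text lines, C shows periodic peaks along the vertical lag axis at multiples of the line spacing; a flat or non-repetitive image block lacks this pattern, which is what makes the descriptor discriminative.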
TABLE I
STRUCTURE OF THE CONTENT CLASSIFIER CNN. THE NETWORK TAKES AS INPUT A GRAY-SCALE n × n PATCH, AND CONSISTS OF A SEQUENCE OF CONVOLUTIONAL (CONV), MAXPOOLING (MP) AND