
HAL Id: inria-00516678
https://hal.inria.fr/inria-00516678

Submitted on 10 Sep 2010

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


A Platform for Storing, Visualizing, and Interpreting Collections of Noisy Documents

Bart Lamiroy, Daniel Lopresti

To cite this version:
Bart Lamiroy, Daniel Lopresti. A Platform for Storing, Visualizing, and Interpreting Collections of Noisy Documents. Fourth Workshop on Analytics for Noisy Unstructured Text Data (AND'10), IAPR, Oct 2010, Toronto, Canada. 10.1145/1871840.1871844. inria-00516678


A Platform for Storing, Visualizing, and Interpreting

Collections of Noisy Documents∗

Bart Lamiroy†

Nancy Université – LORIA

Campus Scientifique, BP 239

54506 Vandoeuvre Cedex, France

[email protected]

Daniel Lopresti

Computer Science and Engineering

Lehigh University

Bethlehem, PA 18015, USA

[email protected]

Abstract

The goal of document image analysis is to produce interpretations that match those of a fluent and knowledgeable human when viewing the same input. Because computer vision techniques are not perfect, the text that results when processing scanned pages is frequently noisy. Building on previous work, we propose a new paradigm for handling the inevitable incomplete, partial, erroneous, or slightly orthogonal interpretations that commonly arise in document datasets. Starting from the observation that interpretations are dependent on application context or user viewpoint, we describe a platform now under development that is capable of managing multiple interpretations for a document and offers an unprecedented level of interaction so that users can freely build upon, extend, or correct existing interpretations.

∗© ACM, 2010. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in the ACM International Conference Proceeding Series, Proceedings of The Fourth Workshop on Analytics for Noisy Unstructured Text Data, 2010, http://doi.acm.org/10.1145/1871840.1871844

†Bart Lamiroy is a Visiting Scientist in the Department of Computer Science and Engineering at Lehigh University, on an INRIA délégation with the Unité de Recherche Nancy – Grand Est.

In this way, the system supports the creation of a continuously expanding and improving document analysis repository which can be used to support research in the field.

1 Introduction

The goal of document image analysis is to achieve performance using automated tools that is comparable to what a careful human expert would achieve, or at least to do better than existing algorithms on the same task.

Our use of terms like "performance," "comparable," and "better" indicates that there is an underlying notion of quality and therefore of measurement. It suggests a controlled process that continually improves toward perfection. However, we also make mention of "careful" humans, "tasks," and "existing algorithms." While humans may believe themselves to be expert and careful when performing a task, there are situations where they unavoidably disagree [7, 17, 22, 2], meaning that, at best, quality and improvement are subjective notions. It also strongly suggests that, depending on the task, measurements will differ, advocating again for multiple ways of measuring overall performance.

On the other hand, shared reference benchmarks are essential in scientific domains where reproducible experiments are vital to the peer review process. For instance, there have been numerous attempts to produce common datasets for problems which arise in document analysis [14, 25, 24]. It is important to note, however, that shared datasets are only a part of what is needed for performance evaluation, and since research in document analysis is often task-driven, specific interpretations of a dataset may exist. So whether the problem is invoice routing, building the semantic desktop, digital libraries, global intelligence, or document authentication, to name a few, the result tends to be application-specific, resulting in software solutions that integrate a complete pipeline of cascading methods and algorithms [14, 23]. This most certainly does not affect the intrinsic quality of the underlying research, but it does tend to generate isolated clusters of very focused problem definitions and experimental requirements. Crossing boundaries and agreeing on what kinds of tools, formats or measurements are the most useful is difficult and may, in fact, be impossible since the pursuit of goals may prove orthogonal between domains.

In this paper we put forth a rather radical point of view: quality measurement of both human and automated document interpretations, ground-truths, and therefore performance measurements are so context dependent that it does not always make sense to consider them in an absolute reference frame where true and false would be universally agreed upon for a particular document and its interpretation. Instead, we present a paradigm in which multiple interpretations and measurements coexist, and where measuring, comparing and interpreting require the presence of a well defined context. Taking this into account is a very different way of considering document analysis research and opens up a wide range of possible new research topics, provided the framework and tools for doing so are available. Section 3 describes such a platform, as it is currently under development as part of the DAE project at Lehigh University [11], and details its means of representing, comparing, and correcting data and interpretations in Section 4. Section 5 concludes with a discussion of open questions and ongoing work. Before that, the next section develops the different document models, abstractions and interpretations that need to be considered in order to make the rest of our work possible.

2 Contents, Abstractions and Interpretations

2.1 Vocabulary

In the introduction we mention "interpretations", "performance", "ground-truth" and other terms referring to what could be considered "true" or "false" in a context of document interpretation. Precisely defining all these terms would make this document unnecessarily verbose and long. However, in order to understand our work, and to capture the semantics of the vocabulary used, it is necessary to view these terms, and document interpretation in general, in the light of [5]. Documents are physical supports1 that were created by an author to convey a message to a reader.

Documents exist and are unambiguous. They are mere physical entities and have an undisputed content value (pixel values on scanned documents, tags and fields in HTML documents, sampling and impulse values for audio recordings, ink molecules on a vellum ...). These content values may or may not be the result of the author's intent.

1One can argue about the term physical support. In our framework it might be a recorded audio message, as well as a twelfth century handwritten codex or a complex HTML or PDF document.


The author's message is embedded in the document and is a complex mix of syntactic representations, cultural context presuppositions, etc. The transcription of the message to the document is a noisy and imperfect process, and it can generally be assumed [5] that it is not reversible without meta knowledge.

The reader's interpretation is an attempt to retrieve all or part of the author's message, based on the physical content of the document and a set of contextual assumptions2.

This way of perceiving document interpretation sheds new light on how to consider noisy documents, since it covers not only the noise that affects the physical transcription of the author's syntax onto the document support, but also the lack of contextual knowledge that affects the interpretation by the reader.

2.2 Comparing Interpretations

The definitions in the previous section insist that both the author and the reader operate in their own contextual frames. There is no guarantee that the author and a reader, or that two different readers, share the same context. With this postulate, trying to determine which of two interpretations is "better" becomes difficult.

Notwithstanding, it seems essential that, in the context of experimental noisy document analysis, methods and algorithms can be compared in order to evaluate scientific contributions.

2There is no need for the reader's interpretation to actually be an attempt to retrieve part of the author's message. Interpretation can also consist of trying to recover part of the contextual assumptions of the author, or of trying to retrieve information concerning the physical transcription and/or capture process.

This is the reason collections of evaluation documents are annotated down to a fine level with so-called "ground-truth" (e.g. the location and identity of every character represented in the document, in some cases, or even richer annotations, like the type size and typeface of each character, in other cases). It would be a mistake to consider this ground-truth to be an absolute fact. Given the paradigm of the previous section, it is just an instance of one reader's interpretation. It is certainly not unique and may not cover the author's whole intent. It is merely a reflection of a specific reader's interpretation context.

Existing tools allow the user to indicate how he or she believes a document should be interpreted, but they do little to help users understand differences in interpretations. Such differences might be called "errors" when there is a strong consensus about what constitutes the right answer. In many cases, however, there are legitimate differences of opinion [8, 15] by various readers of the document, and these may differ from the intention of the author (which is usually hard or impossible to determine, although sometimes we can get access to it [5]).

So, although standard document collections exist, their annotations or ground truth may be specific, recorded in pre-determined representations, and incomplete or partially flawed for more generic contexts, while, on the other hand, there is a need to collect and manage annotations in ways that make it possible to construct more robust and general document analysis solutions (and therefore encompass broader contexts).

In the next sections of this paper, we seek to explore methodologies for storing, visualizing, and interpreting document collections that acknowledge ambiguity and integrate the fact that multiple interpretations are unavoidable. Our approach embodies the following principles:

3

Page 5: A Platform for Storing, Visualizing, and Interpreting ... · A Platform for Storing, Visualizing, and Interpreting Collections of Noisy Documents Bart Lamiroy y Nancy Université

Figure 1: Alternate interpretations and relative error analyses (from [16]).

• allow that an interpretation for an entity on a page may be incompatible with a given context and not with another, assuming there may be more than one acceptable interpretation for a particular entity;

• support the interleaving of machine (automatic) and human (manual) interpretation steps that are intended to improve the quality of the document representation over time;

• facilitate the development of more accurate recognition algorithms by retaining and exploiting all of the user's interactions with the collection;

• help the collection as a whole to evolve to higher and higher levels of quality over time.

The next section briefly summarizes the features we believe should be present in a comprehensive system, drawing from the discussion in our earlier paper [16]. The concept of interpretation, which we defined previously as reader specific, plays a central role. An interpretation reflects the opinion of a reader of the document and, since opinions can vary, there may be no unique "correct" interpretation.3

3It should be understood that interpretations are not limited to simple transcriptions of text appearing on the page, but even in this case there could be differences of opinion.

2.3 Document Models

In order to support comparison of interpretations, there needs to be a set of basic entities in which to express document contents. These basic entities are generally agreed upon and consist of a hierarchy of regions that are labeled as pages, zones, text lines, words, and characters (see, e.g., TrueViz [13] and DocLib [9]). In other cases, entities may also be graphical components expressed with a visual vocabulary (see, e.g., the Qgar toolkit [19], or [10] for more elaborate graphical document descriptions). Other relationships between entities should also be supported, including containment and reading order. Moreover, while sharing of document models is important for comparison of interpretations, they should not restrict users in their ways of interpreting documents. Users should be allowed to create their own label types and relationships without restriction to address the needs of specific applications.

Obviously, the successive pages that comprise a single document should be linked in sequential order. Some categories of documents naturally contain cross-page references (including this paper). In addition, it may be that different versions of a document are present in a corpus, or multiple copies of the same version.


Two photocopies of a page may lead to nearly identical images in the dataset, or they may differ in substantive ways (e.g., one may contain handwritten annotations added after the original printing).

2.4 Alternative Interpretations

Interpretations for a page are created either by humans or by algorithms which have been designed to mimic human behavior on a certain class of inputs. In our model, it is natural to assume that a given page will have received multiple interpretations over time, originating from varying sources and relating to varying contexts. Interpretations can be created from scratch (working directly from the page image), or they can build on previous interpretations (attempting to correct perceived errors). As suggested earlier in our discussion of document models, we can generally assume that information will be recorded at the character level, word level, sentence level, paragraph level, page level, and document level. The ability to compare interpretations is critical when it comes to quantifying how well an algorithm has done (i.e., how closely it matches human performance). Figure 1 illustrates a range of possibilities.

We have suggested there may be no such thing as a "correct" interpretation, but when a page has multiple interpretations, which is the preferred one for comparison purposes? Here the notion of on-line reputation as practiced in Web 2.0 recommender systems may hold the key [18, 26, 20]. Researchers and algorithms already have informal reputations within the community. Extending this to the document interpretation data can provide a mechanism for deciding which annotations to trust.

The presence of differing interpretations provides a basis for defining what "noise" means in the context of document image analysis.

We might take input noise to be any artifact that prevents a fluent reader, whether human or machine, from arriving at the interpretation of a document the author intended.4 Noise in the input manifests itself as noise in the output. It is often unrealistic, however, to expect that we can recover the author's original intent. Hence, we prefer a more practical definition of noise as being the relative difference between the interpretations of two or more readers. If human readers arrive at similar interpretations which a particular computer algorithm cannot match, then we can conclude that there is noise in the output of the algorithm, but we should be careful about assuming there is noise in the input, since it could be that we are simply dealing with a bad algorithm. If no computer algorithm can produce an interpretation similar to that of the humans, then we can say that the input document itself is noisy and that the problem is hard. An examination of this empirical approach to defining noise is one of the investigations made possible by the server architecture we describe next.

3 Supporting Platform

To begin studying how to support these goals, we have developed an operational platform that is publicly accessible [3] and capable of storing data, meta-data and interpretations as well as interaction software. The system, known as DAE (for "Document Analysis and Exploitation"), runs on a 12-core machine with 32 GB RAM and 48 TB disk space, and, as such, is a credible proof-of-concept prototype. It also stores the complete provenance [11] (i.e. the full data and event pipeline that has contributed to the creation of an element in the database) of all data generated through the platform, in order to capture all available context information that might affect the measurement of differences between interpretations.

4Here "fluent" refers both to language skills and to domain knowledge.


The data model is based on the following principles:

• all data is typed; users can define new types;

• data can be attached to specific parts of a document image (but need not be);

• both data and algorithms are modeled; algorithms transform data of one type into data of another type;

• full provenance of all data transformations is recorded.

The DAE server has been implemented using both Oracle 11.2 and MySQL back-end database management systems. It is accessed by a web front-end that provides a Web 2.0-like interface and encapsulates SQL queries to the back-end. It also relies on an independent application server that is used for executing registered algorithms on the data.

A simplified representation of the data model is shown in Figure 2. It consists of three key elements: algorithms, data_items and algorithm_runs. The underlying reasoning is that data is transformed by algorithms. data_items are instances of data and are related to algorithms by explicit algorithm_runs. New data_items thus produced are stored in the database with exact information on how they were obtained.
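To make this concrete, the following is a minimal SQL sketch of these three elements and the linking tables used by the queries later in this paper. Only the names ALGORITHM_RUN_OF and ALGORITHM_RUN_OUTPUT actually appear in those queries; ALGORITHM_RUN_INPUT and all column definitions are illustrative assumptions, and the authoritative schema is the published entity-relationship model [4].

create table ALGORITHM (
    ALGORITHM_ID  varchar(64) primary key      -- e.g. 'OCR 1'
);

create table DATA_ITEM (
    DATA_ITEM_ID  integer primary key,
    DATA_TYPE     varchar(64)                  -- user-definable type
);

create table ALGORITHM_RUN (
    ALGORITHM_RUN_ID  integer primary key
);

-- which algorithm a run is an execution of
create table ALGORITHM_RUN_OF (
    ALGORITHM_RUN_ID  integer references ALGORITHM_RUN,
    ALGORITHM_ID      varchar(64) references ALGORITHM
);

-- provenance: the data items a run consumed and produced
create table ALGORITHM_RUN_INPUT (
    ALGORITHM_RUN_ID  integer references ALGORITHM_RUN,
    DATA_ITEM_ID      integer references DATA_ITEM
);

create table ALGORITHM_RUN_OUTPUT (
    ALGORITHM_RUN_ID  integer references ALGORITHM_RUN,
    DATA_ITEM_ID      integer references DATA_ITEM
);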

Pre-defined types of data_items are files, page_images, page_elements, datasets and page_element_properties. The first three are straightforward generic data types that fit into any document analysis schema:

file is a data_item corresponding to a file containing data pertaining to some specific problem, in any format. This allows users to plug into our framework in an unconstrained and direct manner, without having to convert individual data and file formats.

page_image is an image file representing a physical page at a given resolution and with a given image quality. It is perfectly possible to have multiple page_images representing the same physical page, as shown in Figure 1.

page_element is an area of a page_image, defined in as unconstrained a way as possible, either by a bounding box, a pixel map, or other representations by means of a specific page_element_property.

These elementary data_items can be further extended by user-defined page_element_properties and grouped into datasets. Furthermore, every data_item can be annotated, commented, and rated through a Web 2.0 interface, as shown in Figure 3.

The focus is not just on the data itself. Data semantics come from the fact that data items have been applied to, or are the results from, specific algorithms, and therefore embody the tasks and interpretations mentioned in the introduction. Since all data is structured in the database, it becomes straightforward to query for "all image regions to which at least two OCR algorithms have been applied and for which the resulting interpretations differ", or "find all OCR algorithm results for this image patch"; the first of these queries is sketched below. This greatly enhances the processes we have been describing in our previous work [16].
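As an illustration, the first query can be sketched over the simplified tables above, reading "differ" as "produced distinct result items"; the join through ALGORITHM_RUN_INPUT is an assumption, and comparing the recognized text itself would require joining the corresponding property values as well.

select distinct in1.DATA_ITEM_ID as REGION_ID
from ALGORITHM_RUN_INPUT in1, ALGORITHM_RUN_OF of1, ALGORITHM_RUN_OUTPUT out1,
     ALGORITHM_RUN_INPUT in2, ALGORITHM_RUN_OF of2, ALGORITHM_RUN_OUTPUT out2
where in1.DATA_ITEM_ID = in2.DATA_ITEM_ID           -- same image region
  and of1.ALGORITHM_RUN_ID = in1.ALGORITHM_RUN_ID
  and of2.ALGORITHM_RUN_ID = in2.ALGORITHM_RUN_ID
  and of1.ALGORITHM_ID <> of2.ALGORITHM_ID          -- two different algorithms
  and out1.ALGORITHM_RUN_ID = in1.ALGORITHM_RUN_ID
  and out2.ALGORITHM_RUN_ID = in2.ALGORITHM_RUN_ID
  and out1.DATA_ITEM_ID <> out2.DATA_ITEM_ID;       -- differing interpretations
-- (a further restriction of ALGORITHM_ID to OCR algorithms is omitted for brevity)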

The advantages of our platform go even further and can be summarized as follows:

Formats and Representations are transparently handled by the system, since the user can define any format, naming or association convention within it. Data can be associated with image regions, image regions can be of any shape and format, and there is no restriction on uniqueness or redundancy, so multiple interpretations are not an issue. Furthermore, data can be conveniently grouped together in sets. These sets can in their turn be named and annotated as well. Data items do not need to belong exclusively to a single set, so new sets can be created by recomposition or combination of existing data sets.

Figure 2: Simplified data model for the DAE web server (adapted from [11]; full version available on-line [4])

Storage and Access architecture of the system makes it extremely easy to scale to higher demands, as both the storage and the computing infrastructures are conceived as physically separate entities. The current version already distributes some of its computing onto a high-performance computing cluster for specific algorithms.

Querying and Retrieval are the great gain that the DAE platform offers. Because of its underlying data model and architecture, all data is accessible through SQL queries. The standard datasets that can be downloaded from the platform are no longer monolithic .zip files, but potentially complex queries that generate new datasets on demand. Because of the degree of flexibility in annotating and supplementing existing data with meta-data, the potential uses go far beyond simple storage and retrieval of fixed data corpora.

Interaction with the data is integrated in the data model on the one hand (it represents algorithms, their inputs and outputs), but goes further by hosting algorithms that can be executed on the stored data, thus producing new meta-data and interpretations. A query such as finding all OCR results produced by a specified algorithm can be used as an interpretation of a document, but can equally serve as a benchmarking element for comparison with competing OCR algorithms.

4 Uses and Interactions

The DAE platform supports the kinds of interactions described in Section 2. Because of the flexibility of our data model, they all consist of similar interactions with the database.

7

Page 9: A Platform for Storing, Visualizing, and Interpreting ... · A Platform for Storing, Visualizing, and Interpreting Collections of Noisy Documents Bart Lamiroy y Nancy Université

Figure 3: http://dae.cse.lehigh.edu/DAE/?q=browse/dataitem/54246#47165 showing the on-line interface for annotating, commenting, and rating data.

4.1 Expressing Reading Order

Expressing reading order, as described earlier, is actually a restrictive instance of more general navigation support capacities. Navigation support can be defined as establishing directional links between regions of pages. Since page regions already exist in our data model by means of page_elements, we can define a new page_element_property. As shown in Figure 2, page_element_properties can be typed and are attached to any kind of page_element. We can therefore define a link property relating two page_elements, as sketched below.
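As a minimal sketch, assuming a PAGE_ELEMENT_PROPERTY table with hypothetical column names and element identifiers, a reading-order link could be recorded as:

insert into PAGE_ELEMENT_PROPERTY (PAGE_ELEMENT_ID, PROPERTY_TYPE, VALUE)
values (1201, 'next_in_reading_order', '1202');   -- element 1201 is read just before 1202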

Furthermore, the question of multiple versions of the same document is also handled in our system. Although not represented in the simplified data model in Figure 2, the complete data model [11] represents the notion of document as a set of "pages." Each page can be projected onto any number of page_images. page_images can thus be considered as single, stand-alone document analysis objects in one context, or related to other images known to represent the same physical object but captured under other conditions.

A typical SQL query for retrieving other versions of the same page would consist in retrieving the master document of an image of interest, and then finding all page_images generated from this document. For instance, if the image has an identifier X, then we would get:

select PAGE_ID
from GENERATES_PAGE_IMAGE
where PAGE_IMAGE_ID = X;

to retrieve the master document ID and, calling it M,

select PAGE_IMAGE_ID
from GENERATES_PAGE_IMAGE
where PAGE_ID = M;

8

Page 10: A Platform for Storing, Visualizing, and Interpreting ... · A Platform for Storing, Visualizing, and Interpreting Collections of Noisy Documents Bart Lamiroy y Nancy Université

would yield all copies of the document.
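The two steps can equally be folded into a single self-join on GENERATES_PAGE_IMAGE; the following sketch keeps the same naming conventions:

select g2.PAGE_IMAGE_ID
from GENERATES_PAGE_IMAGE g1, GENERATES_PAGE_IMAGE g2
where g1.PAGE_IMAGE_ID = X        -- the image of interest
  and g2.PAGE_ID = g1.PAGE_ID;    -- all images of the same master page, X included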

4.2 Adding Interpretations

The wide range of interpretations mentioned in Section 2.4 can be handled in three ways.

1. The first way, mentioned previously, is to add properties to data_items.

For instance, during our OCR experiments, we have defined the following datatype_properties:

recognized text represents the recognized textual information from a page element through some OCR algorithm,

number of characters represents the number of characters that a page element has,

textline height represents the height of a text line (in pixels).

However, proceeding in this way would be equivalent to assigning a unique ground truth to an input. This is not what we are aiming at.

2. A more coherent way of adding interpretations is to include algorithms and provenance. This requires the same definition of datatype_properties as in the previous case, but these properties are not directly attached to data_items. Instead, they become the result of algorithm_runs that take data_items as input (a sketch of this recording pattern follows the list). The fundamental difference is that it is now possible to manage a large spectrum of similar annotations and interpretations of the same data. It even becomes possible to reuse and adapt interpretations to new contexts without loss of previous knowledge, as will be explained in Section 4.4.

Furthermore, the semantics are now expressed by the explicit relation with an algorithm, which in turn allows for a very detailed analysis of the results.

3. In a less formal, but still useful way, we have also mentioned that any data_item can be freely annotated and rated by users (as shown in Figure 3). This enables a user-friendly and user-centric interaction, beyond the more formal and algorithm-oriented interactions. It can allow open discussion of alternate viewpoints before updating and modifying data in the database itself.
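The second approach can be sketched over the simplified tables introduced in Section 3; all identifiers below are hypothetical, and in practice the platform records this provenance automatically when a registered algorithm is executed.

-- run 501 of algorithm 'OCR 1' reads page element 1201 ...
insert into ALGORITHM_RUN (ALGORITHM_RUN_ID) values (501);
insert into ALGORITHM_RUN_OF (ALGORITHM_RUN_ID, ALGORITHM_ID) values (501, 'OCR 1');
insert into ALGORITHM_RUN_INPUT (ALGORITHM_RUN_ID, DATA_ITEM_ID) values (501, 1201);

-- ... and produces a new data item carrying the recognized text
insert into DATA_ITEM (DATA_ITEM_ID, DATA_TYPE) values (9001, 'recognized text');
insert into ALGORITHM_RUN_OUTPUT (ALGORITHM_RUN_ID, DATA_ITEM_ID) values (501, 9001);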

4.3 Comparing Interpretations

Given the tools and models described previously, comparing interpretations becomes a challenging goal. If interpretations share identical (syntactic) representations, this might be straightforward. It also introduces very interesting new perspectives when their syntactic representations differ and comparison requires operating on a more semantic level. This includes comparison of results from different algorithms addressing similar problems, combining and pipelining algorithms [12] to achieve similar results, or even different versions of the same algorithm. It is also possible to define oracle algorithms, i.e., algorithms that do not actually correspond to executable code but are, for instance, knowledgeable human experts; they are considered flawless, always giving the exact correct result for a specific context. In that case, the result would be manually annotated datasets.

One of the main innovations in our approach is that the platform allows for modeling, storage, and execution of any kind of algorithm operating on data. Up to now, we have implicitly assumed that these algorithms were document analysis-related, extracting interpretations from document images. But the platform can do more than this, since it can host and execute the evaluation algorithms that compare the final results of a given analysis pipeline.

The scenario depicted in Figure 1 gives rise, on our platform, to the following queries:5

• Retrieving Interpretation 1 and Interpretation 2 would be done through, respectively,

select DATA_ITEM_ID
from ALGORITHM_RUN_OUTPUT, ALGORITHM_RUN_OF
where ALGORITHM_ID = 'OCR 1'
  and ALGORITHM_RUN_OUTPUT.ALGORITHM_RUN_ID =
      ALGORITHM_RUN_OF.ALGORITHM_RUN_ID;

and an equivalent query replacing 'OCR 1' by "Bob's Transcription".

• Retrieving all interpretations compatible with a given evaluation metric (i.e. interpretation data that is in a required format type, for instance) requires that we check the data_type of the interpretations. The queries would therefore retrieve all data_items issuing from any page_image that is a representation of the given physical page and that have a data_type that is the same as the data required as input by the evaluation algorithm; a sketch follows.
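Under the same naming assumptions as before, and assuming runs take page_images as input, such a query might be sketched as:

select d.DATA_ITEM_ID
from GENERATES_PAGE_IMAGE g, ALGORITHM_RUN_INPUT inp,
     ALGORITHM_RUN_OUTPUT out, DATA_ITEM d
where g.PAGE_ID = M                        -- the physical page of interest
  and inp.DATA_ITEM_ID = g.PAGE_IMAGE_ID   -- runs that read one of its images
  and out.ALGORITHM_RUN_ID = inp.ALGORITHM_RUN_ID
  and d.DATA_ITEM_ID = out.DATA_ITEM_ID
  and d.DATA_TYPE = 'recognized text';     -- the evaluation algorithm's input type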

4.4 Reusing Interpretations

Finally, reusing interpretations in new contexts becomes possible. It might however be necessary to adapt (or "correct") them. The need to create a new interpretation for existing data usually arises from a change in context. In that case, one simply declares a new algorithm and uploads the interpretation data resulting from it.

5The SQL queries given in this paper are slightly simplified for ease of reading and hide some join operations with tables that are not essential for understanding the underlying principles.

However, there can also be a need for a correction within an existing context, due to a misinterpretation or a programming error, for instance. In that case, one contributes a new version of an existing algorithm and the platform has all the tools and information to automatically generate all new corresponding meta-data.

While new interpretation contexts often give rise to new algorithms which can be hosted and executed by the platform, manual annotations and interpretations are equally possible. As we have seen previously, manually annotated documents still require the registration of an "oracle" algorithm, so that provenance and dependencies are maintained.

10

Page 12: A Platform for Storing, Visualizing, and Interpreting ... · A Platform for Storing, Visualizing, and Interpreting Collections of Noisy Documents Bart Lamiroy y Nancy Université

5 Discussion

5.1 Paradigm Shift

Our ultimate goals for the research described in this paper distinguish it from past work on document ground-truthing and evaluation. We do not focus on establishing ground-truth or designing evaluation processes, but instead on providing a new paradigm showing how to handle access to multiple, contradictory, and incomplete interpretations. As such, it is not comparable to existing software libraries [19, 9], data formats or ground-truth creation algorithms [13, 21].

The paradigm shift that underpins our work is motivated by the fact that, while the above tools and solutions provide excellent means of sharing algorithm implementations and allow for building on previous work, they do not address the issue of comparing research results coming from different sources. They are merely there for people to build upon (which is, by the way, already quite a step in the right direction). Our work is to be seen in a broader context that integrates such tools. It builds on two major assumptions:

The Need for Peer Assessment of algorithms, methods and datasets is crucial for the research community. Because a lot of the research is conducted in focused or application-specific contexts, it is hard to effectively achieve real incremental research with fully and openly assessed and measured performances and cross-domain impact evaluations.

Crowd-Power is currently the most versatile and dynamic approach to peer assessment. Web 2.0 communities have shown their tremendous capacity to dynamically adapt to new information flows and to evaluate uncontrolled data selectively. This model only works if two major conditions are met:

1. Unrestricted and open access to data for contribution, retrieval or extension (e.g. Wikipedia [1]).

2. The personal benefit perceived by the contributing individual increases with his level of contribution and outweighs the overhead of contributing to the system (e.g. Facebook [6]).

Our platform is exactly that: a comprehensive platform supporting a larger paradigm that will allow peer evaluation of algorithms and datasets through crowd cooperation.

Being able to capture and exploit all user interactions with a collection and its documents requires an approach to representing and relating the alternative interpretations described in this paper. Feeding this information back to improve the performance of document analysis algorithms then follows as a natural extension.

One of the critical points of this kind of paradigm is that it succeeds only in a collaborative and collective environment. In order to work, it needs to be widely adopted and used, but in order to be adopted, it must prove its usefulness (cf. the personal benefit mentioned previously). To help promote this, we are planning to populate our repository with widely used datasets, and to create easy interfaces for upload and retrieval that do not require adopting a specific data format.

The infrastructure we have developed was designed for continuously evolving data and interpretations and will drive our research for a significant period of time, since it provides a new way of viewing experimentation, evaluation, and certification of scientific results in document analysis.

5.2 Further Work

Widespread Adoption Needs Promotion

It is obvious that, although we believe in the viral spreading potential of Web 2.0, our approach is not going to be adopted solely because it is a nice idea. The platform reported in this paper is fully operational, but has not yet been openly promoted. In order to be adopted by the community, it needs a significant initial investment in promotion. At the first stages of adoption, it will need tuning to all the user needs we might not have initially envisioned.

This will be done in several ways:

1. making the full source code and documentation of the platform available under an open GNU-like license;

2. hosting a wide range of publicly available datasets, in collaboration with the IAPR TC-10 and TC-11 committees6;

3. providing a flexible framework for hosting and running document analysis contests;

4. organizing demonstrations and tutorials at international events.

"Eating our own Dog Food" The dissemination initiatives mentioned in the previous section are lacking one essential element: algorithms. We are aware that the hosting and integration of any kind of algorithm is a very challenging task. Although there are mature technological solutions, such as virtual machine hosting to overcome compatibility issues, or web-service infrastructures to take advantage of distributed (if not cloud) computing facilities, the engineering effort is very likely to be substantial (hence the interest of having the platform widely adopted, and thus supported, by the community).

6The International Association for Pattern Recognition is an international association of non-profit, scientific or professional organizations concerned with pattern recognition, computer vision, and image processing in a broad sense. Its technical committees TC-10 and TC-11 are respectively concerned with graphics recognition and reading systems.


In order to overcome the initial barrier, we are extensively using our own platform, currently providing access to and hosting our own algorithms and meta-data, and extending it continuously as our research advances and needs for comparison with other tools arise.

Missing Features Currently missing features of the platform are related to its user interface. The whole data model described in this paper, and detailed in [11, 4], provides for all the functionality we have been mentioning. However, a great part of it requires substantial technical knowledge of the underlying architecture and implementation. This is an inconvenience for the average document analysis user. We are currently working on very low-effort web-based interaction tools that will ease the transition toward widespread adoption.

6 Acknowledgments

The DAE project and resulting platform is a collaborative effort hosted by the Computer Science and Engineering Department at Lehigh University and funded through a Congressional appropriation administered through DARPA IPTO via Raytheon BBN Technologies. The project currently involves the following members (in alphabetical order): Chang An, Sai Lu Mon Aung, Henry Baird, Michael Caffrey, Siyuan Chen, Brian Davison, Jeff Heflin, Hank Korth, Michael Kot, Bart Lamiroy, Daniel Lopresti, Dezhao Song, Pingping Xiu, Dawei Yin.

References

[1] A. Bruns. Blogs, Wikipedia, Second Life, and Beyond: From Production to Produsage. Peter Lang, 2008.

12

Page 14: A Platform for Storing, Visualizing, and Interpreting ... · A Platform for Storing, Visualizing, and Interpreting Collections of Noisy Documents Bart Lamiroy y Nancy Université

[2] A. Clavelli, D. Karatzas, and J. Lladós. A framework for the assessment of text extraction algorithms on complex colour images. In Proceedings of the 8th IAPR International Workshop on Document Analysis Systems, pages 19–26, Boston, MA, USA, 2010. ACM.

[3] Document Analysis and Exploitation (DAE) web server. http://dae.cse.lehigh.edu.

[4] DAE server entity-relationship model specification. http://dae.cse.lehigh.edu/Design/ER.pdf.

[5] U. Eco. The Limits of Interpretation. Indiana University Press, Bloomington, IN, 1990.

[6] N. Ellison, C. Steinfield, and C. Lampe. The benefits of Facebook "friends:" social capital and college students' use of online social network sites. Journal of Computer-Mediated Communication, 12(4):1143–1168, 2007.

[7] J. Hu, R. Kashi, D. Lopresti, G. Nagy, and G. Wilfong. Why table ground-truthing is hard. In Proceedings of the Sixth International Conference on Document Analysis and Recognition, pages 129–133, Seattle, WA, September 2001.

[8] J. Hu, R. Kashi, D. Lopresti, G. Nagy, and G. Wilfong. Why table ground-truthing is hard. In Proceedings of the Sixth International Conference on Document Analysis and Recognition, pages 129–133, Seattle, WA, September 2001.

[9] S. Jaeger, G. Zhu, D. Doermann, K. Chen, and S. Sampat. DOCLIB: a software library for document processing. In Proceedings of Document Recognition and Retrieval XIII (IS&T/SPIE Electronic Imaging), volume 6067, pages 09.1–09.9, San Jose, CA, January 2006.

[10] S. K.C., B. Lamiroy, and J.-P. Ropers. Inductive logic programming for symbol recognition. In 10th International Conference on Document Analysis and Recognition, pages 1330–1334, Los Alamitos, CA, USA, 2009. IEEE Computer Society.

[11] H. F. Korth, D. Song, and J. Heflin. Metadata for structured document datasets. In Proceedings of the 8th IAPR International Workshop on Document Analysis Systems, pages 547–550, Boston, MA, USA, 2010. ACM.

[12] B. Lamiroy and L. Najman. Scan-to-XML: Using software component algebra for intelligent document generation. In D. Blostein and Y.-B. Kwon, editors, 4th International Workshop on Graphics Recognition – Algorithms and Applications, volume 2390 of Lecture Notes in Computer Science, pages 211–221, Kingston, Ontario, Canada, 2002. Springer-Verlag.

[13] C. H. Lee and T. Kanungo. The architecture of TrueViz: a groundTRUth/metadata editing and VIsualiZing toolkit. Pattern Recognition, 36:811–825, March 2003.

[14] J. Liang, R. Rogers, R. M. Haralick, and I. T. Phillips. UW-ISL document image analysis toolbox: An experimental environment. In Proceedings of the Fourth International Conference on Document Analysis and Recognition, pages 984–988, Ulm, Germany, August 1997.

[15] D. Lopresti and G. Nagy. Issues in ground-truthing graphic documents. In Proceedings of the Fourth IAPR International Workshop on Graphics Recognition, pages 59–72, Kingston, Ontario, Canada, September 2001.

[16] D. Lopresti and G. Nagy. Tools for monitoring, visualizing, and refining collections of noisy documents. In AND '09: Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data, pages 9–16, New York, NY, USA, 2009. ACM.

[17] D. Lopresti, G. Nagy, and E. B. Smith. Document analysis issues in reading optical scan ballots. In Proceedings of the 8th IAPR International Workshop on Document Analysis Systems, pages 105–112, Boston, MA, USA, 2010. ACM.

[18] W. Raub and J. Weesie. Reputation and efficiency in social interactions: An example of network effects. American Journal of Sociology, 96(3):626–654, 1990.

[19] J. Rendek, G. Masini, P. Dosch, and K. Tombre. The search for genericity in graphics recognition applications: Design issues of the Qgar software system. In S. Marinai and A. Dengel, editors, 6th IAPR International Workshop on Document Analysis Systems, volume 3163 of Lecture Notes in Computer Science, pages 366–377, Florence, Italy, 2004. Springer-Verlag.

[20] J. Sabater and C. Sierra. Review on computational trust and reputation models. Artificial Intelligence Review, 24(1):33–60, 2005.

[21] E. Saund, J. Lin, and P. Sarkar. PixLabeler: User interface for pixel-level labeling of elements in document images. In 10th International Conference on Document Analysis and Recognition, pages 646–650, 2009.

[22] E. H. B. Smith. An analysis of binarization ground truthing. In Proceedings of the 8th IAPR International Workshop on Document Analysis Systems, pages 27–34, Boston, MA, USA, 2010. ACM.

[23] S. Jaeger, G. Zhu, D. Doermann, K. Chen, and S. Sampat. DOCLIB: a software library for document processing. In International Conference on Document Recognition and Retrieval XIII, pages 1–9, San Jose, CA, 2006.

[24] Tobacco800 data set. http://www.umiacs.umd.edu/~zhugy/Tobacco800.html.

[25] UNLV data set. http://www.isri.unlv.edu/ISRI/OCRtk.

[26] B. Yu and M. Singh. A social mechanism of reputation management in electronic communities. In Cooperative Information Agents IV: The Future of Information Agents in Cyberspace, pages 355–393, 2000.
