
DIATHESIS: OCR based semantic annotation of newspapers.

Martin Doerr, Georgios Markakis, Maria Theodoridou, Minas Tsikritzis

June 2007

Center for Cultural Informatics | Information Systems Laboratory | Institute of Computer Science | Foundation for Research and Technology - Hellas (FORTH) | Vassilika Vouton, P.O. Box 1385, GR 71110 Heraklion, Crete, Greece

Abstract

Digitization of historical newspapers is nowadays an essential means for the preservation and dissemination of this material and for the creation of large scale, distributed digital archives. Currently there are several approaches for rendering this type of digitized material searchable. The first approach relies on the manual completion of metadata fields regarding the physical aspects of the digitized material. The second relies on the manual creation of semantic relations describing and linking the digitized content. Finally, the last approach makes use of OCR (optical character recognition) technology for full text indexing purposes.

Each of the above approaches has its advantages and disadvantages. In the DIATHESIS digital library system we tried to implement a novel hybrid approach based on all three. The system provides the user with a fully web based interface that enables her to annotate specific segments of a document, extract the OCR text that corresponds to each segment, and describe the segment with detailed metadata based on the CIDOC Conceptual Reference Model.

1 Introduction.

Historical newspapers are one of the most significant sources of information for researchers due to the wealth of information they provide regarding every aspect of everyday political, social and intellectual life. Access to this type of archival material is usually obstructed by the following factors:

• In order to protect the archival material from potential damage, some archives prohibit access to the largest part of their collection.

• Direct contact with the original archival material constitutes a potential health hazard (due to dust and fungi).


• The lack of indexes to newspapers, combined with the vastness of the information contained in them, makes research a very time consuming task.

• This type of material is usually dispersed geographically across many archives or private collections.

Many archives adopted digitization of newspapers as a straightforward method to deal with the above problems. Digitized material is easier to preserve and much easier to distribute via the Web. However, conversion of archival material into a digital image format (e.g. JPEG, TIFF, PDF or DJVU) does not solve the problem of rapid access to this material. Digitization alone is inadequate if it does not provide the means of accessing the digitized material in a timely and accurate manner (also known as the searchability issue).

A common practice for rendering digitized newspaper material searchable is to produce a basic set of manually completed metadata tags that describe the whole document/issue and enable the user to access the digitized archive via basic keyword search applied to these fields. A more advanced variation of this practice is the use of ontologies to create semantic relations that describe and semantically link the digitized content with other digital objects, or semantically annotate parts of it.

Another practice frequently met in archive digitization projects is the conversion of the original material into a digital format, accompanied by the extraction of the free text contained in the original image via Optical Character Recognition techniques. In this case the extracted full text is used for indexing and presentation purposes. For future reference we call the first case the physical features based classification approach, the second the conceptual classification approach, and the third the OCR based full text indexing approach to newspaper archival material.

1.1 The physical features based classification approach.

Some real world examples of this approach are the National Library of Austria [1], the German National Library [7] and several other European institutions [4, 9, 6]. In these cases archival material was converted into multimedia material (JPEG or DJVU images) and classified with a basic set of metadata (number of issue, date of publication, newspaper name, number of pages etc). The result of these digitization attempts resembled a browsing mechanism more than a conventional search engine. The final user could browse through an existing catalog of online newspapers in much the same way that she would access the original material via a conventional catalog.

The advantages of this approach lie in its simplicity: the main task in the digitization process is the conversion of the original material into a digital format while maintaining a relatively simple table in a relational database that holds the mappings between the digitized material and its metadata. However, this approach suffers from some serious drawbacks. The final user is unable to conduct full-text searches at the article or issue level. Given the plethora of information contained in a single newspaper issue, this approach retains the "find the needle in the haystack" problems encountered in non digital newspaper archives. In addition, no system of this kind provides the researcher with an exact map representing the conceptual structure of the archive.


1.2 The conceptual classification approach.

The conceptual classification approach overcomes many of the above weaknesses by enabling the user to perform a knowledge engineering task upon the already digitized material via the use of ontologies. The latter differ radically in scope and nature from the metadata used in the previous approach:

• Ontologies are used to express a specific conceptual view over the digitized material. The often quoted definition of ontology is "the specification of one's conceptualization of a knowledge domain". In contrast to the flat metadata that describe physical aspects of the archival material (newspaper name, number of issue), ontologies use different conceptual schemes in order to declare relations between instances of an a priori set of given classes (top level ontologies).

• The use of top level ontologies guarantees to a certain extent the semantic interoperability among different archives. The main idea is that by adopting the common conceptual framework provided by a top level ontology, several institutions could easily create a common semantic based search mechanism. Currently there are several top level domain ontologies that can be used for this purpose (CIDOC CRM [17], Dublin Core [5]).

• In some cases the produced metadata may allow the researcher or intelligent agents to perform logical inference queries exploiting logical relations declared between conceptual entities. Tim Berners-Lee stressed on several occasions this need for machine understandable metadata as the foundation of his semantic web vision [26, 27].

• In addition, the user may classify the document with concepts that are not contained within the document itself (e.g. classifying the document as a pre WWII era document or as a document concerning educational policies). This is more an act of interpretation of the document by an expert than a description of its contents, an act which remains to this day an exclusively human capability.

There have already been some attempts to use this semantic approach for archival purposes. The Neptuno Project [15], for instance (although it did not involve the preservation and classification of archival material but the creation of new material), was an attempt to infuse the article creation process with semantic properties. Neptuno provided an authoring environment where the article editor was able to create content along with its semantic description.

In the domain of cultural documentation/preservation systems there have already been some attempts to adopt this ontology oriented approach. The ARCHON system [23, 32] was such an initial attempt (see footnote 1) by the Institute of Computer Science at FORTH to perform annotation tasks upon digitized material. In this approach the user was encouraged to mark up parts of the representation on his/her medium of choice (e.g. areas on a document image) and associate local information (e.g. file name, spatial or temporal co-ordinates etc.) with text, images, sound-clips, video or sub parts of them without altering the original material.

Footnote 1: A Multimedia System for Archival, Annotation and Retrieval of Historical Documents, Jan. 1997 - June 1998. An ICS-FORTH project originally used for the digitization of 100,000 documents of the Vikelaia Municipal Library of Heraklion.

Figure 1: The Archon System Architecture.

The Encyclopedia of Chicago [34] created an online encyclopedia by semantically linking historical articles, images, videos and audio clips. This material originated from the original contents of the Encyclopedia of Chicago, which were digitized and semantically interrelated via the use of semantic web technologies and Dublin Core metadata. The same approach was followed in the Pergamos system in the University of Athens Digital Library digitization project [20], where original manuscripts were digitized and interlinked via the use of semantic web technologies.

Despite the indisputable advantages of ontology based systems regarding precision, this approach suffers from some serious drawbacks, most of them concerning the efficiency of the knowledge building process:

• Given the density of information in a newspaper, production of metadata is a notoriously time consuming task (knowledge engineering bottleneck).

• For the above reason it is almost impossible to manually define all the semantic relations or entities contained even in a single article. For instance, one can imagine an article containing a long list of participants in a demonstration (more than 50 names in total). The manual conversion of all these names into entities in an RDF graph is an almost prohibitive task. Information Extraction techniques could speed up the whole process by enabling the semi-automatic extraction of named entities from full text, but unfortunately in historical newspapers the required data are usually not available in a textual electronic format. On the other hand, if one is obliged to omit valuable information in order to speed up the classification process (e.g. omit some names of no particularly great historical significance), this information will not be accessible via a search mechanism.

• Article level search is usually not supported or is hard to implement. Manually defining areas of the document is a possible option; however, this is a time consuming activity that further narrows the existing knowledge acquisition bottleneck.

1.3 The OCR based Full Text Indexing Approach.

The knowledge acquisition deficiencies of the latter approach are handled efficiently by automatic digitization approaches that make use of OCR analysis of digitized newspapers. Several institutions [2, 3, 11, 12, 10, 13, 8, 14] adopted this approach in order to digitize efficiently a large volume of their newspaper archives, either by use of commercial software [2] (like Olive Software's Active Paper Reader or dtSearch) or by implementing their own tailor made solutions [12]. There are two main trends in this approach:

1. The semi-automatic full text indexing approach is conducted roughly as follows. In a first phase the newspaper page is converted into a digital format and then, via a supervised OCR process, is broken down into its constituent parts (articles). In the process the page and its constituent articles are transformed into PDF files that contain both the full text and its correlated image segment. In the next phase a group of human operators manually creates an XML file that links all the separate article PDF files with the specific page and contains specific metadata regarding that page. Finally the produced PDF and XML files are bundled into a ZIP file and imported in bulk into a full text indexing mechanism [24].

2. The fully automated full text indexing approach: the archival material is again analyzed via an OCR process and imported in bulk into a full text indexing mechanism in a similar manner to the previous approach. The main difference in this case is that the linking of the page segments that articles consist of, as well as the partial extraction of metadata contained in certain fields (e.g. article title), is conducted in an automatic unsupervised manner via the application of a heuristic algorithm. However, this algorithm is to a large extent context dependent: it depends on the specific layout structure of a given newspaper as well as on the language the articles are written in (see footnote 2).
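The import bundle described for the semi-automatic approach can be illustrated with a minimal sketch. The cited systems do not publish their file layout here, so the element names (page, metadata, articles, article), the file names and the helper functions below are all hypothetical; only the overall shape (article PDFs, a linking XML file, one ZIP per page) comes from the description above.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def build_page_manifest(page_id, page_pdf, article_pdfs, metadata):
    """Return XML bytes linking a page PDF to its article PDFs (assumed schema)."""
    root = ET.Element("page", id=page_id, file=page_pdf)
    meta = ET.SubElement(root, "metadata")
    for key, value in metadata.items():
        ET.SubElement(meta, key).text = value
    articles = ET.SubElement(root, "articles")
    for pdf_name in article_pdfs:
        ET.SubElement(articles, "article", file=pdf_name)
    return ET.tostring(root, encoding="utf-8")


def bundle_page(manifest_xml, pdf_files):
    """Pack the manifest and the PDF payloads into an in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("manifest.xml", manifest_xml)
        for name, payload in pdf_files.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


# Placeholder byte strings stand in for real PDF files.
pdfs = {
    "page1.pdf": b"%PDF-page",
    "page1_art1.pdf": b"%PDF-article-1",
    "page1_art2.pdf": b"%PDF-article-2",
}
manifest = build_page_manifest(
    "page1",
    "page1.pdf",
    ["page1_art1.pdf", "page1_art2.pdf"],
    {"newspaper": "Example Gazette", "date": "1905-03-14"},
)
bundle = bundle_page(manifest, pdfs)
```

Note that the page PDF and the article PDFs travel in the same archive, so the page content is shipped both as a whole and as parts.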

Full Text Indexing techniques are currently considered to be the state of the art in the area of newspaper digitization, mainly for the following reasons:

• Creation of a searchable full text index via OCR is a much faster process than the manual creation of metadata. In the case study of the British Library's digitization project it is estimated that over 20,000 pages of historic documents and newspaper pages containing some 500,000 articles were processed and rendered searchable in only two months [28]. In the case of Utah's digitization project a total of 30,000 pages were processed in less than a year.

Footnote 2: A fine example of this approach was the British Library's newspaper digitization project [2, 28]. In this case a Bitmap Zoning algorithm was used to automatically identify article regions within the newspaper issue. However, different versions of the same algorithm were used for the analysis of newspapers printed before 1900, due to significant structural differences between these newspapers and the ones published during the 20th century.

• Separation of "searchability" and "readability" [28]: by the term "searchability" we refer to the means by which the digitized material is rendered accessible via conventional keyword search methods. A common practice in the context of these techniques is that the full text produced via the OCR analysis of the document is used for full text indexing purposes. It is generally acknowledged that OCR analysis of historical documents (although it has greatly improved during the last few years) never produces a 100% exact textual replica of the original digitized image [21]. In order to overcome the inadequacies of OCR recognition, most of these systems implemented fuzzy search methods that produce satisfactory results even in the presence of slightly corrupted text. However, most of the texts produced via an OCR process cannot be used for presentation purposes (the "readability" issue) due to their poor quality. To overcome this problem, image segmentation techniques were used to extract the appropriate image segment from the issue image and bundle it with the appropriate text (usually in PDF format). This approach solves both the "searchability" and the "readability" issues, since the textual part of the PDF is used for indexing purposes and the image part for presentation purposes.

• It is possible to conduct searches at the page, issue or article level: due to the flexibility and the segmentation capabilities of this OCR based approach, these systems can fetch the exact article that the researcher is looking for. This fact alone dramatically increases precision and recall compared to the two previous approaches.

• The search is conducted via keywords in a manner that is familiar to the average user of contemporary Web search engines (Google, Yahoo etc).

• This approach also addresses the problem of efficient content dissemination over the Web: since the page is segmented into its constituent parts, the latter can be used to represent a specific search result item to the final user. By following this practice, the overall browsing experience is greatly improved and network congestion problems are avoided.
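The fuzzy search idea mentioned in the list above can be illustrated with a small sketch. The cited systems do not specify their matching method, so the similarity measure used here (the ratio from Python's standard difflib module), the threshold value and the function name fuzzy_find are assumptions chosen purely for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_find(query, text, threshold=0.8):
    """Return (word, score) pairs for OCR words that resemble the query."""
    query = query.lower()
    hits = []
    for word in text.lower().split():
        # Drop punctuation that clings to a token after whitespace splitting.
        word = word.strip(".,;:!?()\"'")
        score = SequenceMatcher(None, query, word).ratio()
        if score >= threshold:
            hits.append((word, round(score, 2)))
    return hits


# "Herakl1on" simulates an OCR misreading of "Heraklion".
ocr_text = "The municipal council of Herakl1on met on Tuesday."
matches = fuzzy_find("heraklion", ocr_text)
```

An exact keyword lookup would miss the corrupted token, whereas the similarity threshold still retrieves it. Lowering the threshold trades precision for recall.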

Despite its advantages this approach still suffers from some serious drawbacks:

• Even though the "bag of words" approach nowadays constitutes the predominant search paradigm for the Web, it still suffers from some well known precision/recall issues. In conventional search engines the user has to cope with thousands of results that are essentially irrelevant to her needs. Some search engines have introduced ranking/filtering mechanisms in order to deal with this problem (see footnote 3). Unfortunately such mechanisms are not currently implemented in this OCR based full text indexing approach. This fact, combined with the inherent imperfection of the OCR produced text (even though this imperfection is handled to a certain degree by fuzzy search methods), means that the retrieval capabilities of full text based systems of this kind are necessarily inferior to those of their original counterparts (web search engines).

Footnote 3: See for instance Google's PageRank mechanism.

• Newspaper archives are not as chaotic as the Web: the Web is by nature an unstructured, continuously expanding universe of HTML documents that refer to conceptually diverse material. Under these circumstances, purely information retrieval solutions inevitably appeared to be the most pragmatic approach to bringing order to chaos. Historical archives, on the other hand, are not as chaotic in nature: in most cases the digitized material is produced from a usually overwhelming, yet finite, number of newspaper issues of a collection. It is common practice for humanities scholars to classify historical references contained in newspaper articles under different categories and to seek information by referring to these categories [19] (e.g. by historical periods, types of activities, types of actors etc). This categorization essentially requires the scholar's intervention during the indexing process of the digitized material and constitutes in some sense the conceptual structure of the archive. Such an underlying structure is almost entirely absent in OCR based information retrieval systems.

• The search for information in OCR based information retrieval systems is "conceptually blind": the user usually attempts a blind keyword search within the contents of the digitized archive, without an explicit notion of the semantics conveyed by the archive content. A semantic structure of this kind could provide the user with a concise description of the archive's contents. It could also provide a conceptual guide to be used in conjunction with full text search queries in order to improve the overall precision of the system.

• This approach uses bulky import files, thus rendering the import process computationally expensive. The ZIP files used for the import process consist of the original image file and its segments in PDF format, and an XML file that links these files together. This practically means that the same page is indexed and stored in the content repository twice: once as a whole document and once as the sum of its constituent parts.

2 A fourth approach: hybrid annotation upon OCR results.

The main rationale behind the creation of the DIATHESIS system [30] was to extend the conceptual classification approach in order to perform a conceptual modeling task upon the already digitized material by adding semantic value to newspaper segments. This system attempts to implement a realistic conceptual classification approach by combining the best elements from the three approaches mentioned above:

1. It permits searches on a newspaper issue basis (newspaper issue name, number, publication date) in a similar manner to the physical features based classification approach.


2. It permits searches on an article level basis via the use of full text queries in a similar manner to the OCR based Full Text Indexing Approach.

3. It permits searches on an article level basis via the semantic relationships assigned to each segment.

4. It permits searches that combine all of the above elements.
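The four kinds of search listed above can be pictured as optional filters applied to the same set of annotated segments. The sketch below is only an illustration of that idea: the record layout, the field names and the search function are hypothetical and do not describe the actual DIATHESIS query interface.

```python
def search(segments, newspaper=None, keyword=None, **semantic):
    """Keep the segments that satisfy every criterion that was supplied."""
    results = []
    for seg in segments:
        # Issue level criterion (physical features).
        if newspaper and seg["newspaper"] != newspaper:
            continue
        # Full text criterion on the OCR text of the segment.
        if keyword and keyword.lower() not in seg["full_text"].lower():
            continue
        # Semantic criteria on the metadata assigned to the segment.
        if any(seg["properties"].get(k) != v for k, v in semantic.items()):
            continue
        results.append(seg)
    return results


# Two hypothetical segments from the same issue.
archive = [
    {
        "newspaper": "Example Gazette",
        "issue_date": "1905-03-14",
        "full_text": "Teachers gathered in the square to protest.",
        "properties": {"type": "demonstration", "place": "Heraklion"},
    },
    {
        "newspaper": "Example Gazette",
        "issue_date": "1905-03-14",
        "full_text": "The square was repaved last week.",
        "properties": {"type": "public works", "place": "Heraklion"},
    },
]
hits = search(archive, keyword="square", type="demonstration")
```

The keyword alone matches both segments; adding the semantic criterion narrows the result to the one that actually reports a demonstration, which is the precision gain the combined search aims for.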

Since this approach is basically an extension of the conceptual classification approach, our main concern was to provide efficient means for speeding up the knowledge creation process. The system does so by implementing the following strategy:

• Document annotation is based on a single faceted model based on the CIDOC CRM RDF schema. The newspaper page is modeled as an instance of a Document class (E31.Document) that refers to instances of the Activity class (E7.Activity). By adopting such a fixed conceptualization, we managed to speed up the annotation process to a large extent and to deal with the semantic interoperability issues encountered in purely full text indexing implementations.

• The system adopts a rather shallow semantic annotation approach: the resulting RDF triples do not aim at the formation of a coherent semantic network. Instead they combine full text search and metadata based search in a manner that improves the average precision of the system. In simple words, DIATHESIS tries to improve information filtering by assigning semantic properties to full text search. At the same time the system produces a robust semantic backbone that can be used for the construction of a coherent semantic network in future implementations.

• OCR results have a dual use in our system: they are used for searchability and presentation purposes in a manner similar to the one described in the full text indexing approach. In addition they become building materials for a user friendly annotation environment that allows the user to easily define areas of interest (an article or a group of articles) given the OCR produced segments.

At the core of this system lies the concept of document annotation. It is generally acknowledged that annotation is a frequently encountered scholarly practice during the document understanding process [29]. Because of its popularity as a practice in the real world, annotation interfaces constitute a design pattern frequently met in cultural documentation systems. According to a distinction made by Doerr et al. [33], annotation types in such systems fall under the following three main categories:

1. Annotations by medium:

(a) Lexical or hyperlink annotations.

(b) Visual (icon or highlighting) annotations.

(c) Acoustic (audio signals) annotations.

2. Annotations by locality of reference: Annotations may refer to


(a) Entire texts.

(b) Parts of texts.

(c) Both.

3. Annotation by process:

(a) Textual annotation: involves adding some form of free text commentary to a resource, aimed primarily at human readers.

(b) Link annotation: provides information in the form of the contents of a link destination rather than an explicit piece of text or other data.

(c) Semantic annotation: assigns markup elements according to a specified model which takes values from controlled vocabularies; it is aimed at both human readers and software agents.

According to the first two parts of this classification, the type of annotation performed via the DIATHESIS interface is a partial visual annotation of newspaper material. However, according to the third part of this classification, the type of annotation encountered in this system has multiple dimensions:

• It is textual: upon annotation of a specific segment, the user assigns the full text that corresponds to this selection as an attribute of the created object.

• It represents a link: the annotation indicates a link to the specific coordinates that define a segment of the annotated document.

• It is semantic: the annotation indicates a semantic relationship on two levels. The first level concerns the relationship of the annotated segment to its parent document (newspaper). The second level concerns the relationship of the segment to other semantic entities (Actor, Stuff, Place etc).

Due to its multidimensional nature we could classify this type of annotation as a hybrid one. Hybrid annotations are of particularly great importance, for they enable us to assign semantic properties to whole regions of text. By assigning such properties to text we are able to create a semantic context that dramatically increases the precision of full text queries. The importance of such a semantic context in the querying process has also been acknowledged in the COLLATE project [35].

2.1 The CIDOC CRM Core as a conceptual basis of document annotation.

As mentioned above, the underlying semantics of DIATHESIS are based on the CIDOC Conceptual Reference Model [31]. The CIDOC Conceptual Reference Model (CRM) is a high-level ontology to enable information integration for cultural heritage data and their correlation with library and archive information. It is the culmination of over 10 years of work by an interdisciplinary team, and has been accepted by the International Organization for Standardization as standard ISO 21127. The CIDOC CRM is intended to promote a shared understanding of cultural heritage information by providing a common and extensible semantic framework to which any cultural heritage information can be mapped. The CRM provides definitions and a formal structure for describing the implicit and explicit concepts and relationships utilized in cultural heritage documentation. It is intended as a common language for domain experts and implementers to formulate requirements for information systems, and to serve as a guide for good practice of conceptual modeling. The CRM can thus provide the semantic glue that is needed to mediate between different sources of cultural heritage information, such as museums, libraries and archives.

Figure 2: CIDOC based semantic relationships between the parent document and its constituent parts.

In the DIATHESIS system the newspaper page is modeled as an instance of a Document class (E31.Document) that refers to instances of the Activity class (E7.Activity). This essentially means that each annotation upon the newspaper issue indicates a "refers_to" (reference) relationship between a document and an activity or a group of activities mentioned within the document's text. In practice, each article in the text is modeled as an activity or a group of activities. As a consequence, in this implementation a Document object can link to more than one Activity object.

The second type of semantic relationship encountered in this implementation concerns the relationships stated between the declared activity and other CIDOC CRM classes. More specifically, the following CIDOC classes have been used:

1. E2.Temporal_Entity: this indicates the specific time period when the activity mentioned in the document took place. Usually this is the time of publication of the newspaper issue; however, a single article might refer to a different time period (as in the case of historical references). A "P4F.has_time-span" attribute is used to establish the semantic relationship with the parent Activity class and contains a pair of long numbers that indicate a specific time-span.

2. E39.Actor: this class refers to all the actors that are considered participants in the mentioned activity. There is a "P14F.carried_out_by" relationship between the Actor and Activity objects.

3. E70.Stuff: this class refers to all the objects that were used in the mentioned activity. There is a "P16F.used_specific_object" relationship between the Stuff and Activity classes.

4. E53.Place: this class refers to the geographic location where the activity took place. It has a "P7F.took_place_at" relationship with the Activity class.

5. E55.Type: this class indicates the type of activity that took place. It has a "P2F.has_type" relationship with the Activity class.

6. A String containing the full text included in the original segment. This has a "P3F.has_note" relationship with the Activity class, which is basically a pseudo-semantic relationship indicating that the extracted text is a note upon the defined segment. Although this does not indicate an actual semantic relationship, the inclusion of the full text in this semantic model will later enable us to submit queries that exploit both the full text and the metadata contained in each Activity object.
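The relationships listed above can be written down as plain subject/predicate/object triples. The sketch below is illustrative only: the property names are taken from the list, while the identifiers (doc:issue-42, act:segment-1), the rdf:type statements, the example values and the annotate helper are assumptions, and the actual RDF serialization used by DIATHESIS is not shown in this text.

```python
def annotate(document_uri, activity_uri, full_text, properties):
    """Return (subject, predicate, object) triples for one annotated segment."""
    triples = [
        (document_uri, "rdf:type", "E31.Document"),
        (activity_uri, "rdf:type", "E7.Activity"),
        # The document refers to the activity reported in the segment.
        (document_uri, "refers_to", activity_uri),
        # The OCR text is attached to the activity as a note.
        (activity_uri, "P3F.has_note", full_text),
    ]
    for predicate, values in properties.items():
        for value in values:
            # Objects are plain literals, not links to further resources.
            triples.append((activity_uri, predicate, value))
    return triples


triples = annotate(
    "doc:issue-42",
    "act:segment-1",
    "Teachers gathered in the square to protest.",
    {
        "P14F.carried_out_by": ["Teachers' Union"],
        "P7F.took_place_at": ["Heraklion"],
        "P2F.has_type": ["demonstration"],
    },
)
```

Because the full text sits next to the semantic properties of the same activity, a single query can test both, for example a keyword in the note together with a given activity type.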

Despite the fact that this implementation is based on a purely ontocentric model, it does not aim at the creation of a closed semantic network in the fashion that the ARCHON system did. In a purely semantic network implementation the leaves of this semantic tree structure would be links to distinct RDF resources that could be linked to each other via property nodes, thus deepening the produced semantic net indefinitely.

However, in this implementation the leaves are not resources but plain literals. The literals are either assigned directly by the user (typed in or copied from the full text) or contained in reserved vocabularies (thesauri). These reserved vocabularies are centrally managed and stored in a thesaurus repository (SIS-TMS [18]), arranged in a hierarchical structure that reflects the broader term / narrower term relationships that exist between them. These hierarchies represent the domain specific (context dependent) vocabulary that is used and maintained by a group of annotators, and serve as the conceptual basis for the interpretation of a selected event.
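A broader term / narrower term hierarchy of this kind can be pictured with a very small sketch. The terms below are invented for illustration and the functions are hypothetical; they do not reflect the SIS-TMS data model or its interface.

```python
# Hypothetical thesaurus fragment: each term maps to its narrower terms.
NARROWER = {
    "activity": ["political activity", "cultural activity"],
    "political activity": ["demonstration", "election"],
    "cultural activity": ["theatre performance"],
}


def narrower_terms(term, hierarchy):
    """Collect every term below `term`, depth first."""
    found = []
    for child in hierarchy.get(term, []):
        found.append(child)
        found.extend(narrower_terms(child, hierarchy))
    return found


def broader_term(term, hierarchy):
    """Return the direct broader term, or None for a top level term."""
    for parent, children in hierarchy.items():
        if term in children:
            return parent
    return None


below_political = narrower_terms("political activity", NARROWER)
parent_of_election = broader_term("election", NARROWER)
```

With such a structure an annotator picks a literal from a controlled list, and a later query on a broader term could also be matched against the narrower terms beneath it.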

3 Technical Implementation

DIATHESIS uses the Fedora digital library system as its backend. The latter isan ontocentric content management system specially designed for the creationof digital library collections [25] that permits the creation of semantic net struc-tures linking digitized cultural material. At its core is a digital object modelthat supports multiple views of each digital object and the relationships amongthese objects. Digital objects can encapsulate locally-managed content or makereference to remote content.

Three kinds of Fedora object prototypes were used in this implementation:

Figure 3: CIDOC based semantic properties of a user defined newspaper segment.

1. Entry Objects: These objects contain the information regarding the newspaper issue. They consist of one RELS-EXT datastream that contains the semantic information described in the E31.Document class and an auxiliary datastream called RELATIONS that contains the relations between the current issue, its constituent parts (pages), and its declared segments (articles).

2. Metadata Objects: These objects contain the information regarding each newspaper segment defined by the user. They consist of one RELS-EXT datastream that contains the semantic information described in the E7.Activity class. They also contain an internally managed XML datastream (MAPPINGS) that holds information regarding the exact coordinates of the selected segment and the exact pages where the segment is located.

3. Image Objects: This type of object contains the data required for the visualization of the newspaper page. It contains three types of datastreams: an internally managed IMAGE datastream that contains the JPEG image thumbnail used for presentation purposes, an IMAGEBIG datastream that holds a reference to the original scan (TIFF image) that is stored in a remote location for preservation purposes, and an IMGXML datastream that contains an internally managed XML file that represents the document segments and the text produced by the OCR process. In addition, this object uses a disseminator, whose main purpose is to bind the IMAGE and IMGXML datastreams together. Upon invocation, the disseminator performs an XSL transformation upon the contents of the two datastreams in order to produce the interactive SVG element that visualizes the segmented document page. The produced SVG element is the "semantic canvas" upon which the document annotation task will take place.
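
The binding of the IMAGE and IMGXML datastreams into an SVG canvas can be sketched as follows. Note that this sketch uses Python's standard `xml.etree.ElementTree` module in place of the XSL transformation performed by the disseminator, and that the IMGXML layout shown (a `page` element with `segment` children carrying `x`, `y`, `w`, `h` attributes) is an assumption, since the actual schema is not given here. The function name `build_svg` and the image file name are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical IMGXML content; the real schema is not documented in this paper.
IMGXML = """
<page width="800" height="1200">
  <segment id="s1" x="40" y="60" w="300" h="200">First column text</segment>
  <segment id="s2" x="360" y="60" w="400" h="500">Second column text</segment>
</page>
"""

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"


def build_svg(imgxml: str, image_url: str) -> str:
    """Overlay one rectangle per OCR segment on top of the page image."""
    page = ET.fromstring(imgxml)
    ET.register_namespace("", SVG_NS)
    ET.register_namespace("xlink", XLINK_NS)

    svg = ET.Element(f"{{{SVG_NS}}}svg", {
        "width": page.get("width"),
        "height": page.get("height"),
    })
    # The page thumbnail forms the background of the canvas.
    ET.SubElement(svg, f"{{{SVG_NS}}}image", {
        f"{{{XLINK_NS}}}href": image_url,
        "width": page.get("width"),
        "height": page.get("height"),
    })
    # Each segment becomes a rectangle carrying its OCR text as a title.
    for seg in page.findall("segment"):
        rect = ET.SubElement(svg, f"{{{SVG_NS}}}rect", {
            "id": seg.get("id"),
            "x": seg.get("x"),
            "y": seg.get("y"),
            "width": seg.get("w"),
            "height": seg.get("h"),
            "fill": "none",
            "stroke": "red",
        })
        title = ET.SubElement(rect, f"{{{SVG_NS}}}title")
        title.text = (seg.text or "").strip()
    return ET.tostring(svg, encoding="unicode")


svg_markup = build_svg(IMGXML, "page-0001.jpg")
```

The resulting markup contains one selectable rectangle per segment, which is the property an annotation interface needs from such a canvas.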

3.1 System Workflow and the Import Process.

The initial goal is to populate the system with Entry objects, which represent the newspaper issue as a whole. Entry objects are in some sense the "raw material" from which the annotator can extract segments manually. The ingest process is initiated via the administration menu and has two prerequisite elements:

1. An XML file that defines the location of the directories containing the digitized material per issue (and optionally some metadata concerning this issue) and

2. A specific folder structure that holds the digitized material (thumbnail JPEG and original TIFF images).

These two elements are the product of the digitization process carried out by a team of digitization specialists. Upon completion, the import process produces a series of Entry objects, their corresponding Image objects, and the semantic relationships that link these objects together. The produced material is stored in the Fedora repository and forms the raw material upon which another group, the annotators, will perform the knowledge extraction task. The document annotation team uses the annotation interface of DIATHESIS to isolate specific segments of text upon the produced SVG canvas, thus producing a series of Metadata Objects. The produced objects are used by the search mechanism to conduct searches on an article level or a document level basis.

Figure 4: The DIATHESIS system workflow.
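
The two import prerequisites described in this section, an XML file pointing to the issue directories and a fixed folder structure holding JPEG thumbnails and TIFF originals, can be sketched as below. The manifest element names, the `thumbs` and `originals` folder names and the function `plan_import` are hypothetical; the actual import file format and directory layout are not specified here. The sketch only pairs each thumbnail with its original scan and performs no Fedora ingest.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical manifest layout; the real import file format is not documented.
MANIFEST = """
<import>
  <issue dir="1905-03-12" title="Issue of 12 March 1905"/>
</import>
"""


def plan_import(manifest_xml: str, root: Path) -> list:
    """Return one dict per issue with its pages as (thumbnail, original) pairs."""
    issues = []
    for issue in ET.fromstring(manifest_xml).findall("issue"):
        issue_dir = root / issue.get("dir")
        pages = []
        for jpeg in sorted((issue_dir / "thumbs").glob("*.jpg")):
            tiff = issue_dir / "originals" / (jpeg.stem + ".tif")
            if not tiff.exists():
                raise FileNotFoundError(f"missing original scan for {jpeg.name}")
            pages.append((jpeg.name, tiff.name))
        issues.append({"title": issue.get("title"), "pages": pages})
    return issues


# Build a throwaway folder structure so the sketch can run end to end.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    for sub, ext in (("thumbs", ".jpg"), ("originals", ".tif")):
        folder = base / "1905-03-12" / sub
        folder.mkdir(parents=True)
        for page in ("p001", "p002"):
            (folder / (page + ext)).touch()
    import_plan = plan_import(MANIFEST, base)
```

Each entry of the resulting plan corresponds to one prospective Entry object, and each page pair to one prospective Image object.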

3.2 The system architecture

The system consists of two separate web apps that are bundled in a customized Fedora installation package. The data are stored in and retrieved from a Fedora repository.

The system consists of three main modules:

1. The image annotation application: This is an AJAX-based application that is divided into three tabs. The first tab (Document Annotation Tab) permits the user to select a specific Entry Object (newspaper issue), use a specially designed GUI to define new Metadata Objects, and correlate them with specific regions of each page. The second tab (Issue Metadata Tab) contains metadata that describe the document issue as a whole. Finally, the third tab (Metadata Tab) displays the fields required for the classification of a specific segment. There are two kinds of fields in the Metadata Tab: free text fields (where data is typed in or copied directly from the OCR text) and reserved vocabulary fields (where keywords are selected from SIS-TMS generated SVG tree structures). Upon submission of the document and its annotated regions, the system recursively creates and ingests into the Fedora repository a series of Metadata objects, which hold information regarding the OCRed text contained in the specific page, the coordinates of the specific segment on the given page and a set of CIDOC compliant metadata describing the annotated segment.

Figure 5: The Document Annotation Tab.

Figure 6: The Metadata Tab.

2. The Administration Application: This is the administration portion of the application. Through it, the administrator can import new material into the database, supervise the import process, modify the structure of the metadata used by the system, and obtain statistical information regarding the use of the system (document insertion rate, metadata creation rate, etc.). The administrator screen uses an embedded ActiveX control in order to integrate an instance of the ABBYY FineReader OCR software with the import process.

3. The Viewer Application: This is an AJAX-based application designed to provide rapid access to digitized material over the web. The user can conduct searches on either an article level or an issue level basis. The download mechanism is based on a flexible architecture that allows the timely download of the retrieved material. In a similar manner to most OCR based implementations, the system allows the user to retrieve the exact segment of the document that matches the submitted search criteria.
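
The way the annotation application associates a user defined region with its OCR text can be illustrated with a small sketch. It assumes that the OCR output is available as words with bounding boxes and that a word belongs to a segment when its box lies fully inside the selected rectangle; both are assumptions, as are the names `Box`, `OcrWord` and `text_in_region` and the layout of the resulting record.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page pixel coordinates."""
    x: float
    y: float
    w: float
    h: float

    def contains(self, other: "Box") -> bool:
        """True when `other` lies completely inside this rectangle."""
        return (self.x <= other.x and self.y <= other.y
                and other.x + other.w <= self.x + self.w
                and other.y + other.h <= self.y + self.h)


@dataclass(frozen=True)
class OcrWord:
    text: str
    box: Box


def text_in_region(words: List[OcrWord], region: Box) -> str:
    """Join, top to bottom and left to right, the words inside `region`."""
    inside = [w for w in words if region.contains(w.box)]
    inside.sort(key=lambda w: (w.box.y, w.box.x))
    return " ".join(w.text for w in inside)


# Invented OCR output for one page.
words = [
    OcrWord("announced", Box(10, 30, 70, 12)),
    OcrWord("Election", Box(10, 10, 60, 12)),
    OcrWord("results", Box(75, 10, 50, 12)),
    OcrWord("Weather", Box(300, 10, 60, 12)),   # outside the selected region
]

selected = Box(0, 0, 200, 100)
segment_text = text_in_region(words, selected)

# Hypothetical shape of the record behind a Metadata object.
metadata_object = {
    "page": 1,
    "coordinates": selected,
    "full_text": segment_text,
}
```

Storing the coordinates together with the extracted text is what later allows a viewer to return the exact segment that matches a query rather than the whole page.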

4 Future Research

The system uses an efficient user interface in order to speed up the annotation process. Undoubtedly there is a significant improvement in classification speed over the manual ontocentric classification approach. However, this approach is still significantly slower than the fully automated OCR based systems, insofar as it requires the user's involvement during the classification process.

In order to overcome this problem and narrow the cost/production gap between the hybrid annotation approach and the OCR based full text indexing approach, we plan to implement information extraction techniques for the semiautomatic extraction of named entities from full text. The successful implementation of such an approach could radically speed up the annotation process and relieve the user of the repetitive task of manually locating names of actors, places or objects within the text of countless articles. Information extraction techniques have already been used successfully in digital library systems for this purpose (see the Perseus Digital Library project [22]). We currently focus on Adaptive Information Extraction techniques [16] due to their relative independence from fixed, domain/language specific gazetteers.

5 Conclusion

In DIATHESIS we implemented a novel approach to the digitization of historical newspapers. This approach takes into account the advantages and the disadvantages of most digitization approaches for this type of material. It uses OCR technology for the structural and textual analysis of scanned newspaper material, in a similar manner to the OCR based approach. However, unlike that approach, the data produced by the OCR processing of scanned material are not imported directly into a full text indexing mechanism. Instead they are used for the creation of a highly flexible annotation interface. The system allows users to perform hybrid annotations upon the digitized material, thereby assigning semantic properties to specific regions of text.

The search interface enables the user to conduct searches on a document level and an article level basis via the execution of full text queries. In addition, it provides a semantic framework that enables the user to combine full text and metadata based searches. By allowing this type of query execution, the system provides a "semantic filter" that greatly improves the precision of the conducted searches. It also uses a robust top level domain ontology (CIDOC-CRM) to ensure that the produced knowledge can be exchanged between different institutions. Finally, it provides a flexible presentation mechanism that allows the partial download of the digitized material in order to improve the overall user experience and reduce download time.
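
The effect of combining full text and metadata based searches can be shown with a minimal sketch. The segment records, the metadata field names (`place`, `event_type`) and the `search` function are invented for illustration and do not reflect the actual query mechanism of the system; the sketch simply applies a substring match on the full text and then keeps only the hits whose metadata agree with the given constraints.

```python
from typing import Dict, List

# Invented segment records; field names are hypothetical.
SEGMENTS: List[Dict[str, object]] = [
    {"id": "seg-1",
     "full_text": "The mayor opened the new harbour today",
     "metadata": {"place": "Heraklion", "event_type": "inauguration"}},
    {"id": "seg-2",
     "full_text": "Harbour works were discussed in parliament",
     "metadata": {"place": "Athens", "event_type": "debate"}},
    {"id": "seg-3",
     "full_text": "A concert was held in the municipal garden",
     "metadata": {"place": "Heraklion", "event_type": "concert"}},
]


def search(segments, phrase: str, **metadata: str) -> List[str]:
    """Return ids of segments whose text contains `phrase` (case-insensitive)
    and whose metadata match every given key/value constraint."""
    needle = phrase.lower()
    hits = []
    for seg in segments:
        if needle not in seg["full_text"].lower():
            continue
        if all(seg["metadata"].get(k) == v for k, v in metadata.items()):
            hits.append(seg["id"])
    return hits


full_text_only = search(SEGMENTS, "harbour")
filtered = search(SEGMENTS, "harbour", place="Heraklion")
```

In this toy data set the full text query alone returns two segments, while the added metadata constraint narrows the result to the single segment located in Heraklion.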

Over time the DIATHESIS system has evolved into a stable, lightweight, easily deployable and highly configurable newspaper digitization suite. DIATHESIS is currently being used for the digitization of the Vikelaia Municipal Library's newspaper collection in Heraklion, Greece, where over 30,000 pages have already been classified and indexed. The same system has also been used successfully for the digitization of the Filekpaideytiki Etaireia Archive in Athens, and there are plans to use it for the digitization of the "I AVGHI" newspaper archive.

References

[1] Anno: Austrian newspapers online project (http://deposit.ddb.de/online/exil/exil.htm).

[2] British library online newspaper archive (http://www.uk.olivesoftware.com/).

[3] The brooklyn daily eagle online (http://www.brooklynpubliclibrary.org/eagle/).

[4] Denmark: Digitaliserede danske aviser 1759-1865 (http://www.statsbiblioteket.dk).

[5] The dublin core metadata initiative (http://dublincore.org/).

[6] Estonia: Digiteeritud eesti ajalehed (http://dea.nlib.ee/).

[7] "exilpresse digital. deutsche exilzeitschriften 1933-1945" project (http://deposit.ddb.de/online/exil/exil.htm).

[8] Historical newspapers in washington (http://www.secstate.wa.gov/history/newspapersname.aspx).

[9] Iceland: the vestnord project (1696-2002) (http://www.timarit.is/listi.jsp?lang=4&t=).

[10] Krueger library winona newspaper project (http://www.winona.edu/library/databases/winonanewspaperproject.htm).

[11] Northern new york historical newspapers (http://news.nnyln.net/).

[12] Utah digital newspapers (http://www.lib.utah.edu/digital/unews/).

[13] Wisconsin local history and biography articles (http://www.wisconsinhistory.org/wlhba/).

[14] Balakrishnan. Universal digital library: future research directions. Journal of Zhejiang University SCIENCE, 11:1204-1205, 2005.

[15] Pablo Castells, F. Perdrix, E. Pulido, Mariano Rico, V. Richard Benjamins, Jesús Contreras, and J. Lorés. Neptuno: Semantic web technologies for a digital newspaper archive. In Christoph Bussler, John Davies, Dieter Fensel, and Rudi Studer, editors, ESWS, volume 3053 of Lecture Notes in Computer Science, pages 445-458. Springer, 2004.

[16] F. Ciravegna, A. Dingli, D. Petrelli, and Y. Wilks. Timely and non-intrusive active document annotation via adaptive information extraction. In Workshop Semantic Authoring Annotation and Knowledge Management (European Conf. Artificial Intelligence), 2002.

[17] Martin Doerr and Nicholas Crofts. Electronic esperanto: The role of the object oriented cidoc reference model. In Proc. of the ICHIM '99, Washington DC, September 22-26 1999.

[18] Martin Doerr and Irini Fundulaki. Sis - tms: A thesaurus management system for distributed digital collections. In Proc. of the 2nd European Conference, ECDL'98, pages 215-234, Heraklion, Crete, Greece, September 1998.

[19] George Buchanan, Sally Jo Cunningham, Ann Blandford, Jon Rimmer, and Claire Warwick. Information seeking by humanities scholars. Lecture Notes in Computer Science, 3652/2005:218-229, September 2005.

[20] George Pyrounakis, Kostas Saidis, Mara Nikolaidou, and Vassilios Karakoidas. Introducing pergamos: A fedora-based dl system utilizing digital object prototypes. Lecture Notes in Computer Science, 4172/2006:500-503, September 2006.

[21] Susan Haigh. Optical character recognition (ocr) as a digitization technology. Technical report, Information Technology Services, National Library of Canada, November 15 1996.

[22] Jeffrey A. Rydberg-Cox, Robert F. Chavez, David A. Smith, Anne Mahoney, and Gregory R. Crane. Knowledge management in the perseus digital library. Ariadne Journal, (25), September 2000.

[23] K. Chandrinos, J. Immerkaer, M. Doerr, and P. Trahanias. A visual tagging technique for annotating large-volume multimedia databases - a tool for adding semantic value to improve information rating. 5th DELOS Workshop "Filtering and Collaborative Filtering", European Research Consortium for Informatics and Mathematics (ERCIM), November 10-12 1997.

[24] Kenning Arlitsch, L. Yapp, and Karen Edge. The utah digital newspapers project. D-Lib Magazine, 9(3), 2003.

[25] Carl Lagoze, Sandy Payette, Edwin Shin, and Chris Wilper. Fedora: An architecture for complex objects and their relationships, Aug 2005.

[26] Tim Berners-Lee. The world wide web: Past, present and future. IEEE Computer special issue of October 1996, 1996.

[27] Tim Berners-Lee. Realising the full potential of the web. Based on a talk presented at the W3C meeting, London, 1997.

[28] Marilyn Deegan, Edmund King, and Emil Steinvil. British library microfilmed newspapers and oxford grey literature online. Technical report, Oxford University, British Library, Olive Software Inc., 2003.

[29] C. Marshall. Annotation: from paper books to the digital library. In Proceedings of the ACM Digital Libraries '97 Conference, Philadelphia, pages 23-26, July 1997.

[30] Martin Doerr, Georgios Markakis, and Maria Theodoridou. Digital library of historical newspapers. ERCIM News: Special Issue on Digital Libraries, (66), July 2006.

[31] Nick Crofts, Martin Doerr, Tony Gill, Stephen Stead, and Matthew Stiff. Definition of the CIDOC Conceptual Reference Model. ICOM/CIDOC CRM Special Interest Group, 4.2.1 edition, October 2006.

[32] Panos Constantopoulos, Martin Doerr, Maria Theodoridou, and Manolis Tzobanakis. Historical documents as monuments and as sources. In Computer Applications and Quantitative Methods in Archaeology Conference, Heraklion, Greece, April 2002.

[33] Panos Constantopoulos, Martin Doerr, Maria Theodoridou, and Manolis Tzobanakis. On information organization in annotation systems. Intuitive Human Interface, LNAI 3359:189-200, 2004.

[34] Bill Parod. Encyclopedia of chicago fedora implementation. Technical report, Northwestern University, May 15 2005.

[35] Ulrich Thiel, Holger Brocks, Andrea Dirsch-Weigand, Andre Everts, Ingo Frommholz, and Adelheit Stein. Queries in context: Access to digitized historic documents in a collaboratory for the humanities. From Integrated Publication and Information Systems to Information and Knowledge Environments, pages 117-127, 2005.
