15
International Journal on Document Analysis and Recognition manuscript No. (will be inserted by the editor) Genre identification for office document search and browsing Francine Chen · Andreas Girgensohn · Matthew Cooper · Yijuan Lu · Gerry Filby Received: date / Accepted: date Abstract When searching or browsing documents, the genre of a document is an important consideration that complements topical characterization. We examine design considerations for automatic tagging of office document pages with genre membership. These include selecting features that characterize genre-related information in office documents, examining the utility of text-based features and image-based features, and proposing a simple ensemble method to improve the performance of genre identification. Experiments were conducted on the open-set identification of four coarse office document genres: technical paper, photo, slide and table. Our experiments show that when combined with image-based features, text-based features do not significantly influence performance. These results provide support for a topic-independent approach to iden- tification of coarse office document genres. Experiments also show that our simple ensemble method significantly improves performance relative to using a support vector machine (SVM) classifier alone. We demonstrate the utility of our approach by integrating our automatic genre tags in a faceted search and browsing application for office document collections. Keywords genre identification · office documents · image features · text features · classification Francine Chen · Andreas Girgensohn · Matthew Cooper · Gerry Filby FX Palo Alto Laboratory, Inc. 3400 Hillview Ave, Bldg. 4 Palo Alto, CA 94304 USA Tel.: +1-650-842-4840 Fax: +1-650-842-4880 E-mail: chen, girgensohn, cooper, fi[email protected] Yijuan Lu Texas State University San Marcos, TX 78747 USA E-mail: [email protected] 1 Introduction Genre 1 is a popular way of categorizing movies, songs, and literature. Documents have also been categorized into genres such as fiction, opinion, brochure, slide presentation, techni- cal document, blog, and home page. Document genres such as these can improve document search. While entering query terms is the traditional method of searching for documents, genre provides a complementary, non-topical means to char- acterize documents and web pages, and is useful metadata for indexing, organizing, and searching for documents. Documents occur in different modalities, including plain text, web, and imaged documents. In enterprise intranets and on the web, there are many office documents, which we de- fine as digital documents created and shared by office work- ers. These documents are in formats such as PDF, Power- Point and Word. In this paper, we focus on automatic identi- fication of four coarse genres of office documents: technical paper, photo, slide, and table. We define technical papers as research-type papers such as conference papers, journal papers, and reports. Photos are pictures taken with a cam- era and may be embedded in documents, but do not include drawings or screenshots. Slides are defined as any page or document created for presentation in front of audience. Ta- bles include tables from spreadsheets as well as tables in papers and other documents. Although document genre is related to file format, they are not identical. A document creation tool, such as Word or PowerPoint, can be used to create documents in several genres. For example, in addition to its primary use in cre- ating slides, PowerPoint is also used to create figures and drawings for papers, and also to create handouts of screen shots and photos. And some extensions, such as PDF, are associated with many genres, since files created in different 1 Merriam-Webster: a category of artistic, musical, or literary com- position characterized by a particular style, form, or content.

Genre identification for office document search and … · Genre identification for office document search and browsing ... Abstract When searching or browsing documents, the

Embed Size (px)

Citation preview

Page 1: Genre identification for office document search and … · Genre identification for office document search and browsing ... Abstract When searching or browsing documents, the

International Journal on Document Analysis and Recognition manuscript No.(will be inserted by the editor)

Genre identification for office document search and browsing

Francine Chen · Andreas Girgensohn · Matthew Cooper · Yijuan Lu · Gerry Filby

Received: date / Accepted: date

Abstract When searching or browsing documents, thegenre of a document is an important consideration thatcomplements topical characterization. We examine designconsiderations for automatic tagging of office documentpages with genre membership. These include selectingfeatures that characterize genre-related information in officedocuments, examining the utility of text-based features andimage-based features, and proposing a simple ensemblemethod to improve the performance of genre identification.Experiments were conducted on the open-set identificationof four coarse office document genres: technical paper,photo, slide and table. Our experiments show that whencombined with image-based features, text-based featuresdo not significantly influence performance. These resultsprovide support for a topic-independent approach to iden-tification of coarse office document genres. Experimentsalso show that our simple ensemble method significantlyimproves performance relative to using a support vectormachine (SVM) classifier alone. We demonstrate the utilityof our approach by integrating our automatic genre tags in afaceted search and browsing application for office documentcollections.

Keywords genre identification · office documents · imagefeatures · text features · classification

Francine Chen · Andreas Girgensohn · Matthew Cooper · Gerry FilbyFX Palo Alto Laboratory, Inc.3400 Hillview Ave, Bldg. 4Palo Alto, CA 94304USATel.: +1-650-842-4840Fax: +1-650-842-4880E-mail: chen, girgensohn, cooper, [email protected]

Yijuan LuTexas State UniversitySan Marcos, TX 78747USAE-mail: [email protected]

1 Introduction

Genre1 is a popular way of categorizing movies, songs, andliterature. Documents have also been categorized into genressuch as fiction, opinion, brochure, slide presentation, techni-cal document, blog, and home page. Document genres suchas these can improve document search. While entering queryterms is the traditional method of searching for documents,genre provides a complementary, non-topical means to char-acterize documents and web pages, and is useful metadatafor indexing, organizing, and searching for documents.

Documents occur in different modalities, including plaintext, web, and imaged documents. In enterprise intranets andon the web, there are many office documents, which we de-fine as digital documents created and shared by office work-ers. These documents are in formats such as PDF, Power-Point and Word. In this paper, we focus on automatic identi-fication of four coarse genres of office documents: technicalpaper, photo, slide, and table. We define technical papersas research-type papers such as conference papers, journalpapers, and reports. Photos are pictures taken with a cam-era and may be embedded in documents, but do not includedrawings or screenshots. Slides are defined as any page ordocument created for presentation in front of audience. Ta-bles include tables from spreadsheets as well as tables inpapers and other documents.

Although document genre is related to file format, theyare not identical. A document creation tool, such as Wordor PowerPoint, can be used to create documents in severalgenres. For example, in addition to its primary use in cre-ating slides, PowerPoint is also used to create figures anddrawings for papers, and also to create handouts of screenshots and photos. And some extensions, such as PDF, areassociated with many genres, since files created in different

1 Merriam-Webster: a category of artistic, musical, or literary com-position characterized by a particular style, form, or content.

Page 2: Genre identification for office document search and … · Genre identification for office document search and browsing ... Abstract When searching or browsing documents, the

2 Accepted by IJDAR. The final publication will be available at www.springerlink.com.

Fig. 1 Sample document pages from the genres of (a) paper (b) photo (c) table (d) slide.

formats are often converted to PDF for its portable represen-tation. Thus if a user were looking for slides from talks, andlooked only at PowerPoint files, the user may be presentedwith extra files that are not slides; furthermore, the user willnot see slides that are in other formats, such as PDF.

Layout has been useful in genre identification ofscanned documents (e.g., [1, 10]); it is encoded in the mark-up of web documents and has been found to convey strongcues about web genre (e.g., [27]). In office documents,one cannot assume that all documents contain encodedlayout information. For example, some PDF documentsare created from scanned documents. Furthermore, whenthe information is present, it may be inconsistent, e.g., notall PDF documents correctly encode column information.We thus identify genre from imaged document pages thathave been OCR’d, so that both layout and text informationare available for all pages; this also obviates the needto determine whether layout information, when given, iscorrect.

Examples of the four office genres of (technical) paper,photo, table, and slide are shown in Fig. 1. These examplesillustrate some of the diversity within genres and also theambiguity of the genres. Note in Fig. 1b, the photo categoryis composed of not only photos taken from a camera, butalso document pages that are mostly photo (the top-right andthe middle-right examples). Note also that the bottom-leftslide in Fig. 1d is composed mostly of photos. Thus, it can beappropriate for some pages to be tagged with more than onegenre. We tag a page with a genre when at least half the pagerepresents the target genre. With this definition, the middle-right page in Fig. 1c is tagged as a table and as a paper,and the lower-right page is tagged as a table. Thus, tablesthat also contain text as footnotes and other information aretagged as tables.

While web search strongly relies on link informationfor ranking, office documents contain very little link infor-mation, making genre identification more valuable whensearching office documents. The ability to automaticallyidentify office document genre can allow search results to begrouped by genre, or topical search queries to be augmentedby genre. With our four proposed coarse genres, one couldspecify that technical papers on geysers are of interest, orif one was looking for a brochure, specify that none of thegenres are of interest. These genres complement a numberof finer genre classifications that are characterized primarilyby differences in textual content. Examples of such genresinclude legal, scientific, non-fiction and fiction (e.g., [16])and Editorials, Letters to the Editor, Reportage, and SpotNews (e.g., [31]). Our office genres also complement webgenres, such as home pages, blogs, and wikis.

For genre identification, we assume that the various pagetypes that occur in a particular office document genre shouldbe learned by the classifier. Thus, rather than classifyinginto page types such as title, table of contents, references,etc., we classify each page by genre. This eliminates theneed to create classifiers for each possible page type and theneed to map between page types and genre. For text-basedgenre classification, Kessler et al. [16] distinguish between“surface” cues, which are simple to extract, and “structural”cues, which are related to linguistic structure. Here, we de-fine a set of document-inspired, surface-type image features.It has been suggested that there are advantages to perform-ing genre identification without text features, including lan-guage independence [27] and corpus independence [2]; im-age features may have these advantages if they perform wellwithout text features. Since some documents belong to morethan one genre, our genre identifier is open set, so that adocument is tagged with zero or more genres. We then ex-tend our page genre identification to predict document genre

Page 3: Genre identification for office document search and … · Genre identification for office document search and browsing ... Abstract When searching or browsing documents, the

Accepted by IJDAR. The final publication will be available at www.springerlink.com. 3

membership. The automatic genre identification results areused in a system developed for faceted search and brows-ing of enterprise collections that offers genre as one of thepossible search facets.

The contributions of this paper center on an investiga-tion of methods for office document genre identification.These include: (1) developing a set of document-inspired“surface” image features and experimentally comparingtheir utility to text-based features, (2) adding a simpleensemble method that efficiently makes use of the data toimprove genre identification performance, (3) developing adocument genre identification approach where the differentpage types associated with a genre are learned by theclassifier and (4) deploying the results in a faceted searchand browsing system.

The rest of the paper is organized as follows. In Section2, we review earlier text-based and image-based genreidentification work and compare them to our approach. Wealso discuss current techniques for genre identification ofweb documents and why those techniques do not directlyapply to office documents. In Section 3, we describe thedata set of representative office documents that we collectedand labeled for this work. In Section 4, we describe ourproposed text-based and image-based feature sets, and theclassification methods used, including a simple ensemblemethod. In Section 5, we present experimental results thatshow the utility of our image-based method over a text-based approach. In Section 6, we discuss the combination ofpage genre scores to predict document genre, and then weillustrate the use of our genre tags in a document searchingand browsing system. Conclusions and future directions arepresented in Section 7.

2 Related Work

Many sets of genre categories have been proposed for textgenre identification and web genre identification. For webpage genre, Roussinov et al. [24] conducted a study of usefulgenres and proposed five major web document genre groups,while Santini et al. [25] compiled and published more than80 genres.2 Henderson [12] conducted a study analyzingthe computer folder structure of six knowledge workers andfound genre to be the most common organizing factor. Shealso states that “people may deal with a vastly differing setof genres, depending on their job”. Thus, the set of usefulgenres depends, in part, on the planned application.

In our work, we focus on automatic genre identificationof the types of documents used in a research lab, which area subset of those found in an enterprise. Our genre set canbe further refined to application-specific genres of interestusing text-based methods.

2 http://www.webgenrewiki.org/index.php5/Genre Classes List

Different modalities of features have been defined foruse in genre identification. In works on genre identifica-tion based on textual features, e.g., [16, 31, 19, 7], the fea-tures are primarily computed from surface cues, such asdocument terms and punctuation, or structural cues, suchas part-of-speech tagging. Kessler et al. [16] found that per-formance using surface cues was similar to that using struc-tural cues, and recommended surface cues because they aremuch easier to compute. Stamatatos et al. [31] noted in anexperiment on genre identification that using less than 50of the most frequent terms in English, which are primarilystop words, plus punctuation, as “style” features was effec-tive (greater than 97% accuracy) for classifying a subset ofthe Wall Street Journal text corpus into four genres. Leeand Myaeng [19] examined frequency-based methods foridentifying terms that occur across genres and also acrosssubject, i.e., topic, classes within a genre. The use of sub-ject classes improved genre classification performance whenused with a similarity-based approach, but when used with aNaıve Bayes classifier, the best performance was when sub-ject classes were not used. Since we take a classification-based approach, we do not use subject classes in our text-based genre classifier. As in [31, 19], we use surface cues,but in addition, we use discriminative selection of text fea-tures.

Web genre detection commonly employs features thatdescribe markup attributes of web pages, in addition to an-alyzing the pages’ textual content. Link plus formatting orlayout tags have been used, as well as web page URLs (e.g.,[2, 23, 5]). Meyer zu Eissen and Stein [5] created a webgenre identification system and found HTML mark-up to bemore useful than text features. Levering, et al. [20] foundthat adding HTML features to text features improved per-formance. They also noted that adding visual features de-rived from the HTML tags sometimes helped and sometimeshurt performance, depending on genre. Recently, Scholl [27]presented a system for web genre detection that used onlyfeatures derived from HTML mark-up and obtained rela-tively good accuracy. Thus one may ask whether featurescapturing layout in office documents are similarly more ef-fective than text-based features for genre identification ofoffice documents.

One approach that addresses the absence of encoded lay-out information in office documents is to perform image-based layout analysis (e.g., [1, 10]). Each page is segmentedinto zones that are labeled with tags such as text body, pic-ture, and table. The tagged zones are then used to classifypages according to genre. As can be observed in a surveypaper by Chen and Blostein [4] on classifying “mostly-text”document page images into a variety of types, the use oflayout is predominant in document page image classifica-tion approaches. However, as noted in [4], layout analysiscan be complex and error-prone.

Page 4: Genre identification for office document search and … · Genre identification for office document search and browsing ... Abstract When searching or browsing documents, the

4 Accepted by IJDAR. The final publication will be available at www.springerlink.com.

A few prior works have explored genre identificationbased on image analysis without layout information. DasGupta and Sarkar [8] describe a genre classifier basedon identified salient feature points for discriminatingbetween two genres, journal articles and memos, and testedon 80 page images. Shin et al. [28] used window-basedfeatures to identify page type, i.e., title page, cover page,reference page, table of contents, and form. They defineover 20 “structure-based” features that include number oftext column gaps, configuration of lines on a page, andclassification of each window into the region types of text,image, or graphics. Analogous to the surface and structuraltypes of cues for text features, these features may be thoughtof as structural cues, while simple visual features that canbe extracted directly without classification are analogousto surface cues. We also use window-based features, andsome of our features attempt to capture similar documentcharacteristics. However, our features are primarily sur-face features and our goal is page and document genreidentification, rather than page type identification.

Most previous genre identification systems make use ofeither image-based or text-based features. Kim and Ross[18] compared image and text-based features for genreidentification. For image-based features, they used simple,surface-based features in tiles from the first page of adocument to identify document genre. Kim and Ross alsoinvestigated the use of two types of text-based features.One is based on “prolific” terms, and the other is based onsignificant topical terms. We initially planned to use theirimage-based system and their prolific terms-based system asbaseline systems. We did not consider the topical terms fea-ture because we wanted features that are topic-independent,and as mentioned earlier, other researchers achieved goodresults without topical terms. However, the prolific termsfeature was problematic for our genre set, since two of ourgenres did not have any prolific terms as defined in [17]. Wethen decided to use Stamatatos et al. [31] as a baseline textsystem because their system demonstrated good accuracy.Kim and Ross’s image-based system was used as an imagebaseline.

As with the Kim and Ross image-based system, we forgolayout analysis and use tiles. However, in contrast, we de-velop a set of document-inspired features that attempt tocapture layout characteristics and texture of the tiles withoutrequiring region classification. Our work further differs fromKim and Ross in two ways. First, we examine the utility ofcombining feature types. Second, we develop an open-classsystem for identifying a set of genres.

Most prior image-based genre identification approachesfocus on page genre identification. Those that perform doc-ument genre identification commonly use only the first pageof a document (e.g., [18, 1]), which is the most distinctivepage for some genres. In contrast, we identify the genre for

each page separately and then combine the page genre tagsto identify document genre membership. With our approach,search and browsing can be performed based on documentgenre or page genre.

In summary, there has been much work on text, image,and web page genre identification. Also, HTML-based tagshave shown to be very useful in differentiating web genres.There has been much less work on differentiating the genresof office documents. In this paper, we show how the genreof office documents, which commonly do not have link ortagged layout information, can be identified using surfaceimage-based features. We also show that our text-based fea-tures, while exhibiting better performance than a baselinebased on [31], are not as useful for identifying office doc-ument genres. We improve performance using a simple en-semble method, and combine per-page genre labels to clas-sify document genre.

3 Office Document Data Set

A benchmark data set of office documents that has been la-beled with our target genres, slides, technical papers, tablesand photos, does not exist. To create a corpus, we collecteda set of documents and tagged them with our four targetgenre types. The documents were collected primarily fromthe web, with some documents also added from our corpo-rate intranet. We queried Google for 100 results per query,scraped the returned URLs for office document extensions,such as .pdf, .xls, and .doc, and downloaded those files. Thequeries included a number of document genres (e.g., table,brochure) as well as topical terms covering a variety of sub-jects (e.g., airline, baby, car, disease, dog, plants, semicon-ductor, shipping, software, stocks, vacation). We also addedphotos obtained from Flickr and about 100 documents fromour corporate intranet. Note that in addition to the Flickrphotos, photos also occurred in some of the other docu-ments, such as presentations and magazine articles.

In order to uniformly process all documents, eachpage of a document was represented as a JPEG image.We checked for duplicates, manually looking for thosewith identical or almost identical content, and removedthem. The genre-based queries identified documents bothwithin and outside the desired genre. For example, ‘table’contained document data tables, as well as periodic tables,and pages containing other senses of the word, such as“table tennis”, “table eggs”, “table linens”, and the verb in“table a motion”.

Eleven people participated in labeling the corpus pageswith our four target genres. The labelers used an interface wedeveloped for quickly selecting pages belonging to a speci-fied genre. All pages of the documents in a folder were dis-played in a scrollable window and the pages could be se-lected using common interface selection techniques.

Page 5: Genre identification for office document search and … · Genre identification for office document search and browsing ... Abstract When searching or browsing documents, the

Accepted by IJDAR. The final publication will be available at www.springerlink.com. 5

Table 1 Statistics of the office document data set

Genre # pages # documentspaper 948 87photo 541 447slide 1765 120table 681 144total tags 3935 798corpus 5098 1178

Table 2 Labeler agreement

paper photo slide tableFleiss’ kappa 0.88 0.81 0.97 0.92

Each labeler was instructed to find and mark pages forwhich one given genre (or one genre at a time for the personwho labeled two genres) was apparent in at least half of thepage. With this procedure, there may be pages with multiplegenre labels and pages that are not labeled as any genre.

Labeling pages for a selected genre was sometimes am-biguous. For example, a table of contents usually has twocolumns, but we decided not to include these as ‘tables’ be-cause they are more closely associated with genres such asreports and books, rather than generic tables. Another exam-ple is that the “boundary” defining technical papers is am-biguous, since some magazines and newsletters are on tech-nical topics. To resolve this, we decided to define technicalpapers as research-type papers such as conference papers,journal papers, and reports.

The corpus contained 1267 pages with no tags. As indi-cated by the ambiguous examples above, some of the pageswithout tags were relatively similar in layout to our four tar-get genres. There were also 3727 pages with one tag and 104pages with two tags. The center column of Table 1 shows thenumber of pages in the corpus tagged with each of the fourgenres. There were a total of 3935 tags assigned to 5098pages.

Table 2 shows the agreement among our labelers foreach genre, as measured by Fleiss’ kappa [6]. Fleiss’ kappameasures the extent of labeler agreement above chancewhen there are more than two labelers. It ranges in valuefrom negative (poor agreement) to 1.0 (complete agree-ment). The agreement among our labelers was relativelygood. However, there was some disagreement, indicatingambiguity in the page genres.

For our experiments, the labeled data was partitioned bydocument into three sets with equal numbers of documents(ignoring round-off). The partitioned data was used for per-forming 3-way cross-validation.

4 Genre Identification

In our approach to genre identification, the different pagetypes associated with a genre are learned by the classifier

Fig. 2 Tiling a page

model. That is, we do not create separate models for eachpage type, such as title page, Table of Contents, and ref-erence pages, which are all page types associated with thetechnical papers genre. This is reflected in the instructions tothe labelers to select pages of a given genre, rather than pagetype. With these genre labels, we create one one-versus-restmodel per genre.

In the rest of this section, we describe the features andclassifiers we use for page genre identification. We also pro-pose a simple ensemble method for improving genre identi-fication performance.

4.1 Features for Genre Identification

We developed a set of image-based features to capture docu-ment layout characteristics. We also developed a set of text-based features that combine ideas proposed in earlier works,and incorporate feature selection.

4.1.1 Image-Based Page Genre Features

As an alternative to performing layout analysis of page im-ages, we tile each page image and extract a set of featuresthat capture local document characteristics, such as lines oftext, within a tile. While this requires relatively large tiles,the tiles need to be sufficiently small to distinguish the dif-ferent region types (e.g., heading, figure, body text). Empir-ically, we have found that dividing each page image into agrid of 5 tiles horizontally by 5 tiles vertically, for a total of25 tiles, meets our requirements. A full “page” tile is alsoused to capture features that may span multiple tiles, such astable rule lengths. Fig. 2 illustrates the tiling of a page.

Our feature set was designed to differentiate the typesof regions identified in layout analysis, such as text, images,rules and columns. The tile features are:

Image density. This feature is similar to the image fea-ture used by Kim and Ross [18], except our tile size is muchlarger. This feature attempts to capture areas where print issparse, and should help differentiate titles, regular text, and

Page 6: Genre identification for office document search and … · Genre identification for office document search and browsing ... Abstract When searching or browsing documents, the

6 Accepted by IJDAR. The final publication will be available at www.springerlink.com.

Fig. 3 Feature selection over tiles (each tile is outlined in red, eachsquare in b-e is a feature): (a) Image tiles (b) Correlograms of tiles areconcatenated into a feature vector. mRMR-selected features are col-ored. (c) Union of selected feature positions, the colored features formthe feature set for all tiles. (d) Features to be extracted from concate-nated correlograms are colored. (e) Final feature set.

photos. To compute it, a page image is converted to a binaryimage and the ratio of the number of black pixels to the totalnumber of pixels in each tile is calculated.

Horizontal projections. This feature attempts to cap-ture the number of text lines and the distribution of text lineheights. This helps differentiate genres such as slides, whichare of a larger font, and photos, which have few, if any textlines. The feature can be computed based purely on imageprocessing techniques (e.g., [33]) to project the pixels hori-zontally once the text foreground is identified. Alternatively,word bounding box locations, e.g., provided by an OCR sys-tem, can be used to compute the projections of the boundingboxes. We used the latter approach because each page wasOCR’d for the text features. From a projection, the peakwidths, roughly corresponding to text rows, are quantizedinto a five-bin histogram. This characterizes the number ofrows and the distribution of font sizes.

Vertical projections. This feature attempts to capturethe number of text columns and the distribution of theirwidths. This feature can help to differentiate papers andtables, which may have columns of varying widths. Aswith the horizontal projections, we project word boundingbox locations and compute a five-bin histogram of columnwidths.

Color correlogram. Color correlograms represent thespatial correlation of colors in an image and have been usedfor image retrieval [14]. We select a subset of the correlo-gram values to capture texture and color variation. To com-pute this feature, the images are proportionally scaled to amaximum of 1550 pixels in the horizontal and vertical di-rections, and then a color autocorrelogram computed. The

colors were quantized to 96 colors and distances of 0, 1, and3 pixels were used, resulting in 288 dimensions per tile. Toreduce the number of correlogram coefficients, feature se-lection was performed using the Maximally Relevant Mini-mally Redundant (mRMR) feature selection method [22] toidentify a subset of 50 features. Because the feature valuesdepend in part on tile location (e.g., titles usually appear nearthe top of a page), the feature selection method performs bet-ter when implicit spatial information is preserved. This wasdone by concatenating the correlogram coefficients for eachtile in a page into a vector and then performing feature selec-tion over all concatenated correlogram vectors in the train-ing set (Fig. 3b). Our approach contrasts with the “bag ofvisual words” approach to image classification [29], wherethe locations of the image tiles are not preserved. To enablemodeling of all region types in any tile, the final feature setis composed of the union across tiles of the selected featurepositions (Fig. 3c). Then for each page, the selected featuresfor each tile are concatenated in a fixed order (Figs. 3d,e).With the tile locations implicitly encoded in the feature vec-tor, a genre classifier can learn the feature values observedin each location.

The concatenated vector is combined with page-basedfeatures to create a feature vector representing a page. Twotypes of page-based features are computed:

Lines. Tables often have many rules which extend hor-izontally and/or vertically. To identify rules on a page, runlengths of black pixels in a binary image are computed hor-izontally, allowing for short pixel jogs horizontally or ver-tically [33]. The number of lines in each tile is noted andthe line lengths are uniformly quantized into a four-bin his-togram. This differentiates lines that span a page width, acolumn, and shorter lines. Vertical line histograms are com-puted similarly.

Image size. The width and height of the image. Tech-nical papers and slides commonly occur in standard pagesizes. Photos and tables, on the other hand, may be croppedand also may occur in document pages and slides.

Although we use feature selection on the entire set oftext terms as described in the following section, we do notuse it on the entire set of image features. For our image-based features, the number of dimensions is relatively small,e.g., five-bin histograms per feature type, and each of thefeatures is informative relative to the others. For example,the number of rules is independent of the number of lines oftext. The exception is the color correlogram, where featureselection was employed as described above.

4.1.2 Text-Based Page Genre Features

Prior to extracting text-based features, the page images areOCR’d with Microsoft Document Imaging3 and then pre-

3 http://office.microsoft.com/en-us/help/HP030763951033.aspx

Page 7: Genre identification for office document search and … · Genre identification for office document search and browsing ... Abstract When searching or browsing documents, the

Accepted by IJDAR. The final publication will be available at www.springerlink.com. 7

Table 3 Top terms selected by mRMR for one data partition

Paper Photos Slides Tablesthe photos the ltdfigure skiing these comparativehowever to has westboth peru this informthese racetrack first gethis gourmet are timeshown 1388-10 was 1-2such viewer one pesticidebetween napa have japaneseresults systems that stanford

processed. The recognized text, which contains errors andextraneous characters, was tokenized into terms and punctu-ation. Some token types were mapped to a unique term rep-resenting that type; these token types were: integers, float-ing point numbers, underscores, lines, and dates of the form‘dd/dd/dddd’ where ‘d’ is a digit. To reduce the number ofterms due to OCR errors and to increase robustness, onlyterms that occurred in more than ten documents were kept.Stop words and punctuation were kept, since these tokenscan be indicative of genre [31, 23, 16].

Previous text-based genre identification systems by [18]and [31] select frequent terms as features. “Prolific” termsthat occur in many of the documents in each genre wereused as text features in [18] to capture writing style. A fixedset of frequent terms, which corresponds to a subset of stopwords and punctuation, was used by [31]. The features inthese systems are primarily stop words and were motivatedin part by the desire to select non-topical terms. However,these earlier methods do not consider whether the terms aregood at discriminating among classes. For example, a fre-quent term may occur at approximately the same frequencyin each class, and therefore would not provide useful infor-mation for classification.

Non-discriminative selection methods, such as thoseused by [18] and [31], are suboptimal compared to MutualInformation (MI) and χ2 based methods [21], which arepopular feature selection methods. Since these featureselection methods are greedy, redundant features maybe selected. For our text-based classifier, we used a dis-criminative feature selection method that is an extensionof the Mutual Information (MI) based selection method.The Maximally Relevant Minimally Redundant (mRMR)[22] method selects features that discriminate the differentclasses but minimize redundancy among the retainedfeatures.

We performed preliminary experiments using mRMR toselect 50 terms per genre, but the results were poor. We hy-pothesize that this may be due, in part, to the sparseness oftext in some documents. To address the sparseness problem,we next tried selecting 100 and 200 terms, which seemedto be the limit of the number of terms mRMR could select

from our office document data set in a reasonable amount oftime. The results with 200 terms were much better, and sowe select 200 terms per genre.

For performing cross-validation experiments, the datawas partitioned as described in Section 3, and term selectionwas performed separately for each partition. The selectedterms for a partition form the text features for all partitionswhen that partition is used as the training set.

For each of our four target genres, the top ten terms se-lected by mRMR for one data partition are shown in Table 3.Note that the ‘Paper’ genre includes the content words “fig-ure”, “shown”, and “results”, in addition to stop words. Incontrast, the top mRMR features for the Slides genre areprimarily all stop words, and the top mRMR features for theTables genre are all content words. Note also that terms arepresent for the Photo genre. Although a few photos containtext that was OCR’d, the terms are also from pages that aremostly photo, as described in Section 3 and illustrated bytwo examples in Fig. 1b. We also used mRMR to select themaximally relevant (not minimally redundant) terms for thefour genres and observed that the terms were primarily stopwords.

For each data partition, we combined the set of 200terms selected using maximum relevance and 200 termsselected using mRMR for each of the target genres as thebasis for the text feature vectors. Our text features contrastwith Stamatatos, et al. [31], who found that using lessthan 50 of the most frequent terms provided good genreidentification for their corpus. Including the maximallyrelevant terms provides some redundancy, which may behelpful for genres where text is sparse. It also adds primarilystopwords and punctuation, capturing some of the traits ofthe style features in [18, 31]. The text feature vector foreach document is then composed of the term counts foreach of the selected terms.

4.2 Classification

For most search systems, full coverage of all possible genresis not realistic. Instead, a genre identification system shouldbe able to spot a subset of genres from a large possible set.For this we create for each genre of interest a one-versus-restmodel, where a classifier attempts to differentiate betweenthe genre of interest and the “rest” of the genres. As notedearlier, each classifier learns the variations in page types as-sociated with a genre.

Based on the competitive performance of SVMs (Sup-port Vector Machine [3]) for many classification tasks (e.g.,[15, 5]), we used an SVM classifier. As commonly recom-mended for better SVM performance, the feature vectorcomponents are normalized. We use a normalization thatscales and shifts the component values to range between 0

Page 8: Genre identification for office document search and … · Genre identification for office document search and browsing ... Abstract When searching or browsing documents, the

8 Accepted by IJDAR. The final publication will be available at www.springerlink.com.

and 1, as recommended in [13]. Each vector element, xi isnormalized using:

xi,norm =xi−min j(x j)

max j(x j)−min j(x j)(1)

where {x j} are the observed values for one vector elementin the training data; the same scaling factors are applied toboth the training and test data.

An SVM classifier is a supervised, discriminative clas-sifier that computes a hyperplane that best separates twoclasses: w ·x−b = 0 where x is a set of points from a train-ing set. w and b are chosen to maximize the margin betweenthe two classes. By applying the kernel trick [3] to the dotproduct, a linear SVM classifier can be transformed into anon-linear classifier. This may allow for better handling ofthe non-linear feature space with multiple subtypes within agenre. For our experiments, we used SVMlight [26], whichallows specification of different kernels and cost-factors dur-ing training. In particular, we performed a parameter sweepwith linear, polynomial and radial basis kernels, with the or-der of the polynomial kernels ranging from 2 to 4. We alsoswept the cost-factor from 1 to 15, biased towards detectionof a genre. Since an optimal model is found for each genreand data partition, the best kernel and parameter values var-ied for each trained model.

For our experiments, we performed 3-fold cross-valida-tion on the data. For each train/development/test combina-tion, one data partition was used for training the SVM, anda second data partition was used to tune the genre models,selecting parameters to maximize the balanced F-score, F1:

F1 =2PR

P+R, (2)

the harmonic mean of precision, P, and recall, R. These mea-sures are defined [21] as:

P =t p

t p+ f p(3)

R =t p

t p+ f n(4)

where t p is the number of true positives (both the systemand the ground truth tagged a page with a given genre); f p isthe number of false positives (the system but not the groundtruth tagged a page with a given genre), and f n is the numberof false negatives (the ground truth but not system tagged apage with a given genre). After tuning the genre models,the “optimal” model for each genre was used to classify thepage images in a third data partition.

Using the features and classifiers we have described,we created systems to compare experimentally: image-based features with the Weka [32] Naıve Bayes classifier(imgfeats.NB), image-based features with an SVM classi-fier (imgfeats.SVM), and text-based features with an SVMclassifier (txtfeats.SVM).

Fig. 4 Illustration of two SVM classifier hyperplanes trained on dif-ferent data partitions.

4.3 A Simple Ensemble Classifier

To improve performance, we explored a simple ensemblemethod that makes better use of the data partitions fortraining and tuning our SVM-based system. We initiallyconsidered using a traditional classifier combination methodthat uses the results from different types of classifiers on thesame data. But in preliminary studies comparing differentclassifiers, SVM performance was much better than theother classifiers that we tried—Naıve Bayes, k-Nearest-Neighbors, and Random Forests [32]. Consequently, wedecided instead to use an SVM classifier trained on differentpartitions of the data and combine the results. In contrast tobagging, which avoids overfitting by selecting exemplarswith replacement to form multiple training sets, we divideour training data into two mutually exclusive partitionsto train separate classifiers. We chose this approach tominimize the number of models that need to be trained.

In particular, the data set was divided into three parti-tions, A, B, and C, for cross-validation. The results fromtraining on A, tuning on B, and testing on C to produce bi-nary genre classifications g1 were combined with the resultsfrom training on B, tuning on A, and testing on C, to producebinary genre classifications g2. Thus each test partition is la-beled using two SVMs. Assume that the classification forgenre k of page i by classifier c is gk

c[i]. To create a higherprecision classifier, we used this combination rule for the ithpage:

(gk1[i]∨gk

2[i])∧ (gm1 [i]∧gm

2 [i]) ∀m 6= k.

It requires that at most one of the classifier pairs in a com-peting genre would label the page as the competing genre.

An illustration of the intuition behind this method isshown in Fig. 4. Two classifiers trained on different datapartitions identify different hyperplanes and have a portionof the space in common. We hypothesize that the commonspace is primarily pages of a given genre, and that the spacecovered by only one of the classifiers are pages that maybe a combination of genres or a page that is otherwiseambiguous or unusual. When used with image-basedfeatures, we denote this system imgfeats.Ens.

Page 9: Genre identification for office document search and … · Genre identification for office document search and browsing ... Abstract When searching or browsing documents, the

Accepted by IJDAR. The final publication will be available at www.springerlink.com. 9

4.4 Combination of Text and Image Features for PageGenre Identification

We considered whether using a combination of text and im-age features can improve performance for our genre clas-sification task. In video analysis, Snoek et al. [30] com-pared “late fusion”, which combines output from separatetext-based and image-based classifiers, with “early fusion”,which combines text and image features before classifica-tion. They observed mixed results across the set of classifi-cation concepts and suggested that the fusion strategy shoulddepend on the concept to be classified. Since we were con-cerned about the sparseness of text in many of the photosand some of the slides, we chose to use “early fusion” andcombine the text and image information at the feature level.The features were used in a simple ensemble classifier. Werefer to this system as txt+imgfeats.Ens.

5 Experiments and Discussion

We conducted experiments to investigate the effectivenessof the features and classification methods we have proposedfor office document page genre identification. We firstcompared the utility of our text and image features againstbaseline features and against each other. We then tested theeffectiveness of our partition-based ensemble classificationmethod and of the combination of text and image features.

The pages in our office document data set may be la-beled with zero, one, or multiple genres. For evaluation, wemeasure performance by precision, recall, and F1-score (seeEqns. 2–4), computing the macro-average across genres.The macro-average gives equal weight to each genre, whilethe micro-average gives equal weight to each page [21].We did not use micro-average measures because we did notcollect documents for our corpus in a way to insure that thedistribution of genres is representative of the distributionof office documents on the web. In our experiments, thesignificance of observed differences in performance wastested using the paired Wilcoxon signed rank test.

5.1 Baseline Image Feature Comparison

Kim and Ross’ image-based system [18] was used as a base-line because their system is based on surface features in tilesand was developed for classifying document genres. We im-plemented their image-based feature extraction: each pagewas divided into a 62×62 grid, and each tile in the grid withat least one pixel of value less than 245 was assigned thevalue of ‘0’. The other tiles were assigned a value of ‘1’. Forclassification, Kim and Ross compared Naıve Bayes, NaıveBayes with kernel density estimation, Random Forest, andSVM classifiers using the Weka software package [9]. They

Table 4 Performance of our image-based page genre features againstKim and Ross baseline image features with different classifiers

Classifier Precision Recall F1-scoreKR.NB 0.4161 0.8085r 0.5272KR.NBK 0.4169 0.8162r 0.5326KR.RF 0.6033nk 0.4433 0.4885imgfeats.NB 0.6431nk 0.9406nkr 0.7240nkr

Superscripts ‘n’, ‘k’, ‘r’ and ‘m’ indicate statistically significant im-provement at the 0.05 level over KR.NB, KR.NBK, KR.RF, andimgfeats.NB models, respectively. The best performance in each col-umn is in bold.

Table 5 Performance comparison of text features using SVM-basedgenre classifiers

Classifier Precision Recall F1-scorefreqfeats.SVM 0.4146 0.7743 0.4940txtfeats.SVM 0.7458 f 0.7718 0.7470 f

Superscripts ‘f’ and ‘t’ indicate statistically significant improvement atthe 0.05 level over freqfeats.SVM and txtfeats.SVM models, respec-tively. The best performance in each column is in bold.

Table 6 Performance comparison after adding an ensemble classifierand after adding text features

Classifier Precision Recall F1-scoreimgfeats.SVM 0.8619t 0.8708t 0.8639t

imgfeats.Ens 0.9464ti 0.8790tb 0.9107ti

txt+imgfeats.Ens 0.9724tie 0.8626t 0.9135ti

Superscripts ‘t’, ‘i’, ‘e’, and ‘b’ indicate statistically significantimprovement at the 0.05 level over txtfeats.SVM, imgfeats.SVM,imgfeats.Ens, and txt+imgfeats.Ens models, respectively. The best per-formance in each column is in bold.

observed that the SVM classifier performed much worse ontheir image feature data than the other classifiers. Hence, forthe baseline image feature experiments, the Kim and Rossimage features were tested with the Weka Naıve Bayes clas-sifier using both the plain and the kernel density estimationversions, and the Weka Random Forest classifier. These sys-tems are referred to as KR.NB, KR.NBK, and KR.RF, re-spectively, and form our image feature baseline models.

The page genre identification performance for thesebaseline models is shown in the top three lines of Table 4.Similar to the image feature results in [18], where twodifferent data sets were used, both Naıve Bayes modelsperformed better than Random Forests overall. However,the observed performance was not significantly differentwhether or not kernel density estimates were used with theNaıve Bayes classifier. These results indicate that our officedocument data set is a challenging collection for page genreclassification.

The last row of Table 4, labeled imgfeats.NB, shows theperformance of our document-inspired image features (de-scribed in Section 4.1.1) with the Weka Naıve Bayes clas-sifier. The overall performance as measured by F1-score ofour features was 0.7240 versus the best Kim and Ross F1-score of 0.5326. In all cases, our document-inspired image

Page 10: Genre identification for office document search and … · Genre identification for office document search and browsing ... Abstract When searching or browsing documents, the

10 Accepted by IJDAR. The final publication will be available at www.springerlink.com.

features perform significantly better than the Kim and Rossfeatures [18]. In the rest of the experiments, our document-inspired image features are used as the image features.

5.2 Baseline Text Feature Comparison

We use the text-based genre features proposed by Sta-matatos et al. [31] for our text baseline. These features arecorpus independent and based on frequent terms. Morespecifically, we used the 30 most frequent stopwordsand eight frequent punctuation marks specified in [31] asfeatures. These features were used with the same SVMclassifier as in our text-based method with feature selection.We refer to this baseline system as freqfeats.SVM.

Table 5 compares the page genre identificationperformance of the frequent terms-based baseline (fre-qfeats.SVM) and our text-based method with featureselection (txtfeats.SVM, as described in Sections 4.1.2–4.2). Note that the overall performance measure, F1-score,for the frequent terms was 0.4940 and for the feature se-lection method was 0.7470. While there was no significantdifference in recall, our method employing feature selectionperformed significantly better than the frequent terms basedmethod in precision and F1-score. These results indicatethat using only frequent terms does not provide a richenough set of features for identifying our target genres.

Although we do not show significance for results acrossTables 4 and 5, we observed that our text-based features,txtfeats.SVM, exhibited significantly better F1-score thanany of the Kim and Ross image feature classifiers, but nosignificant difference in F1-score with our image features,imgfeats.NB.

5.3 Image Features and Ensemble ClassificationPerformance

We examined the performance of our image featuresusing the classification models described in Section 4.Table 6 shows the performance of our image features withan SVM classifier (imgfeats.SVM), image features withour ensemble classifier (imgfeats.Ens), and the use ofboth image and text features with our ensemble classifier(txt+imgfeats.Ens). Comparing the image-based systemsagainst our text-based method (txtfeats.SVM), Table 6shows that for the three evaluation measures of precision,recall, and F1-score, all methods with image-based featureshad statistically significant better performance than ourtext-based method. From this, we infer that image-basedfeatures can be more useful than text-based features fordetecting some office document genres.

While it is not indicated in the tables, we also observedthat the use of our image features with an SVM classifier,

Fig. 5 F1-score page genre identification performance by genrefor the methods: txtfeats.SVM, imgfeats.SVM, imgfeats.Ens, andtxt+imgfeats.Ens.

imgfeats.SVM, had significantly better performance, asmeasured by F1-score, over use of our image features witha Naıve Bayes classifier, imgfeats.NB (from Table 4). Thus,the ensemble method was applied to the SVM classifierresults. Examining the addition of our simple ensembleclassifier, imgfeats.Ens, we note statistically significantbetter F1-score performance over imgfeats.SVM.

Although our text-based classifier did not perform aswell as our image-based classifiers, we examined whetherthe combination of our text-based and image-based featuresenhances performance. The last two lines of Table 6 com-pare the combined features (txt+imgfeats.Ens) to the imagefeatures with the ensemble classifier method (imgfeats.Ens).Although precision improved and recall decreased when textfeatures are added (txt+imgfeats.Ens), overall, there was nosignificant difference, as measured by F1-score.

We also examine F1-score performance by each of ourmethods for each target genre. Fig. 5 shows that our imagefeatures (imgfeats.SVM) perform better than using textonly features for all test genres. Both ensemble classifiers(imgfeats.Ens and txt+imgfeats.Ens) outperform the bestplain SVM classifier (imgfeats.SVM) for all genres. Asmight be expected for the Table genre, the F1-score usingthe text-based method (txtfeats.SVM) is much lower thanthe methods that employ features that capture layout.

Comparing these results with Fleiss’ kappa labeleragreement shown in Table 2, note that for the four genres,the rank order of the labeler kappa values is the same asthe rank order of F1-score for the two methods using anensemble classifier (imgfeats.Ens and txt+imgfeats.Ens).This indicates that page genre identification performance isbetter when labeler agreement is higher.

5.4 Discussion

Although Stamatatos et al. [31] had good results using textfeatures, applying their text features to our data resulted inmuch poorer performance. We believe this is due in largepart to the different sets of genres used. In [31], the corpus is

Page 11: Genre identification for office document search and … · Genre identification for office document search and browsing ... Abstract When searching or browsing documents, the

Accepted by IJDAR. The final publication will be available at www.springerlink.com. 11

a subset of Wall Street Journal documents composed of fourgenres: Editorials, Letters to the Editor, Reportage, and SpotNews. Documents in these genres always contain words andare usually composed of complete sentences. Slides and ta-bles tend to contain fewer words, and photos contain littleor no words. With fewer stop words and punctuation char-acters, the features used in [31] may have been too sparse.Thus, supplementing them with selected text terms led toimproved performance for our coarse genres.

It has been suggested that there are advantages to per-forming genre identification without text features. Scholl etal. [27] mention language independence and Kim and Ross[18] mention less language dependence and freeing the pro-cess from “text processing tools with encoding requirementsand problems relating to special characters”. They also notethat “pdftohtml failed to extract information from seventeenpercent of the documents.”

Boese and Howe [2] examined whether different fea-tures for web genre prediction are transferable between webpage corpora. They noted that terms in a web page are gen-erally not transferable. They also noted that the only fea-tures common across the corpora they examined are a stylereadability measure, number of web links, and some HTMLtable tags. These features are independent of the terms in adocument, as are our image-based features.

In agreement with earlier web page genre detection re-sults by [27, 20], our office document genre identificationresults indicate that using only text features is inferior to ourimage-based features. The latter are surface features cap-turing layout without requiring explicit segmentation intozones or classification into structural regions types. Our re-sults also indicate that the addition of simple text featuresdoes not significantly influence performance for our fourcoarse genre types that exhibit differences in layout. Thuswe expect our image-based features with an ensemble clas-sifier to be applicable to other collections of office docu-ments. However, for finer grained genre classification wherelayout is similar, text features such as the stop words used byStamatatos [31] and the surface cues proposed by Kessler etal. [16] could play an important role.

6 Application to Document Search

We applied our genre identification work to documentsearch. Since document search is commonly document-based, rather than page-based, we first discuss labelingdocuments by genre. In particular, we describe how weuse our office document data set for the document genreidentification task and our experiments in predicting docu-ment genre based on the page genre classifications. We thendescribe a document browsing interface that makes use ofthe predicted genres as search facets.

Fig. 6 Quantization of page scores to create a document feature vector.(a) For a given genre, each page in a document is scored by a pair ofclassifiers. (b) The scores are quantized and a histogram created. (c)Counts of the quantized scores for one genre. (d) Counts for each ofthe four genres (blue) and the number of pages (green) form the featurevector.

6.1 Document Genre Data Set

To develop a document genre identifier and evaluate perfor-mance, a set of documents where each document is labeledby genre is needed. Instead of performing a separate manuallabeling of documents, we assign a document to a particulargenre if at least half of its pages have been manually taggedas that genre. The right column of Table 1 shows the numberof documents in the corpus tagged with each of the four gen-res. A total of 798 tags were assigned to 1178 documents.Note that there are documents without any tags.

6.2 Document Genre Identification

To automatically tag documents by genre, the genre scoresfor each page in a document are used. Since page classifica-tion is imperfect, we again employ learning and explore theutility of two feature representations for classifying a doc-ument based on the page scores. Since we biased the genreclassifiers towards positively labeling a page as the specifiedgenre, we assume that a negative page score does not con-tain much differentiating information; thus, we focused onthe positive page scores. We investigated encoding the pagescores in two different ways: (1) quantize the page scores ina document into a small number of bins and (2) combine thepage scores produced by the classifiers for one genre into asingle score.

Page 12: Genre identification for office document search and … · Genre identification for office document search and browsing ... Abstract When searching or browsing documents, the

12 Accepted by IJDAR. The final publication will be available at www.springerlink.com.

Fig. 7 Fusion of a document’s page scores for one genre. (a) Each pageof a document is scored by a pair of classifiers trained to identify onegenre. (b) The maximum score of the pair is selected. (c) The maximumscore is clipped. (d) A document’s page scores are averaged, producingthe document genre score, Sd(g).

The quantization method is illustrated in Fig. 6. Quanti-zation of the page scores produced by the pair of classifiersis performed by providing two bins for negative scores andfour bins for positive scores, with bin breakpoints at {-∞,-1.0, 0.0, 0.33, 0.67, 1.0, +∞}. To use the page scores asfeatures, a feature vector is created that is composed of thecounts of the quantized page scores for each of the four tar-get genres, plus the number of pages in a document. UsingEqn. 1, the feature vector values are scaled and shifted torange between 0 and 1 and then are used in an SVM.

In our second method, which is based on “score fusion”,we combine a document’s page scores for one genre into asingle score (Fig. 7) and use it as a feature in an SVM. Weagain focus on the positive scores produced by the classi-fiers. Instead of making a hard decision as in Section 4.3,the maximum page score, smax(p,g), produced by each pairof classifiers for page p and genre g is used in computing thesummary score (Fig. 7b). To insure that the document genrescore, Sd(g), for document d and genre g is “normalized” torange between 0 and 1 and simultaneously focus on the pos-itive scores in the predominant score range, each maximumpage score smax(p,g) is clipped to a maximum value of 1.0and a minimum value of 0.0 (Fig. 7c). Sd(g) is computedas the average of the individual maximum page scores for adocument (Fig. 7d). Thus, Sd(g) is computed from the pairsof scores as:

Sd(g) =1P ∑

pmax(min(smax(p,g),1.0),0.0)

where P is the number of pages in document d. Sd(g) foreach of the four genres plus the normalized number of pagesin the document form the feature vector used in an SVM.

Table 7 Comparison of document genre identification performance fordifferent page combination models

Genre Precision Recall F1-scoreVoting 0.7120 0.9524s 0.8042Quantized.SVM 0.7870v 0.9316s 0.8429v

Score.SVM 0.8474vq 0.9027 0.8695vq

Superscripts ‘v’, ‘q’, and ‘s’ indicate statistically significant improve-ment at the 0.05 level over voting, Quantized.SVM, and Score.SVMmodels, respectively. The best performance in each column is in bold.

For each feature set, a separate classifier is optimized foreach genre. The parameters of an SVM classifier are opti-mized using the same cross-validation approach as used forpage classification. The method using quantized features isreferred to as Quantized.SVM, and the method based onscore fusion is referred to as Score.SVM.

6.3 Document Genre Performance

In this section we examine the performance of three meth-ods for predicting document genre from page genre. Specif-ically, we compared the two methods described in the pre-vious section, Quantized.SVM and Score.SVM, against abaseline voting classifier. Voting is a common classificationmethod for combining the results from different classifierson the same data. Here we use voting to combine classifi-cation results on different data: the pages of a document.In particular, the page classification scores from the SVMclassifier are used to vote whether to tag a document with agenre.

We expect that a trained classifier should better modelthe noisy features due to page genre identification errors,resulting in better performance than simple voting. Table 7shows the results for the different document genre classifi-cation methods. The score fusion method, Score.SVM, hadthe best overall performance, with an F1-score of 0.8695.

The supervised methods exhibited statistically signifi-cant better F1-score performance than voting. Quantizationwas not better overall than our score fusion method. It maybe that there is not enough data for good estimates in thelarger number of dimensions, and that in the quantized rep-resentation, the features for documents with a small numberof pages are noisy, hurting performance.

6.4 Genres in Document Search

Genre can play a valuable role when searching for officedocuments in a business organization. Unlike the web,where ranking of search results is strongly influencedby links between documents, business documents rarelycontain links. Consequently, other methods for locatingdocuments are needed. Faceted search has been used to

Page 13: Genre identification for office document search and … · Genre identification for office document search and browsing ... Abstract When searching or browsing documents, the

Accepted by IJDAR. The final publication will be available at www.springerlink.com. 13

Fig. 8 Directories with the ‘Tech paper’ genre in an office document search and browsing system.

locate documents based on their attributes (e.g., [11]), andautomatically identified genres have been proposed as auseful facet for such searches [12]. To explore this, weapplied our genre identification method to tag documents ina system we developed for faceted search and browsing of acollection of business documents. The system has been usedinternally to access 30,000 office documents and 230,000images created over the past ten years.

The system provides users with search and navigationoptions through metadata and the document collectionfile structure. Filters for different facets, including auto-matically computed genre, are provided by the interface.The results are displayed within the user-created directoryhierarchy (Figs. 8 and 9). Directories without matchingdocuments are not displayed. Directories are visualized bythree thumbnails of sample documents, cropped to squaresto better fill the directory box. If fewer than three matchingdocuments exist in a directory tree, then only that numberof thumbnails is presented.

To provide genre tags for the search and browse system,we used the genre models trained on the office documentdata set described in Section 3 to tag the documents andimages in the corpus. A user can restrict the genres shownin the interface by selecting one of the genres listed inthe genre option as shown in Fig. 8. The pull-down menuincludes ‘Slide’, ‘Technical Paper’, ‘Photo’, ‘Table’, and

‘Other’. The ‘Other’ option contains any document that wasnot tagged with a genre. This option allows a user to usethe genre facet to limit the documents presented to excludethe tagged genres. Examples of documents that fall into thiscategory include memos, maps, and brochures.

Fig. 8 displays the results of selecting the ‘Tech paper’genre. Fig. 9 shows the interface when the genre ‘Slide’ isselected. The background color of each directory is a differ-ent intensity of orange to indicate the strength of the matchaccording to the match score. The darker intensities indi-cate that those directories have a higher proportion of docu-ments tagged with the selected genre. Note in Figs. 8 and 9that most of the thumbnails representing each directory are(cropped) technical papers and slides, respectively. The tooltip in Fig. 9 displays a larger image of the first slide from oneof the presentations. One can see that a different set of direc-tories is listed in Figs. 8 and 9 and that directories appearingin both listings, such as ‘Activities’ or ‘Conferences’, arerepresented by different document thumbnails.

As would be expected given imperfect genre identifi-cation performance (and the subjectivity of manual genrelabeling), there are some genre tagging errors. However, thenumber of errors is small and the directories and documentspresented are greatly reduced, enabling a user to morequickly browse for slides (or documents of another genre)in the directories.

Page 14: Genre identification for office document search and … · Genre identification for office document search and browsing ... Abstract When searching or browsing documents, the

14 Accepted by IJDAR. The final publication will be available at www.springerlink.com.

Fig. 9 Directories with the ‘Slide’ genre.

The system has been demonstrated to many visitors whocommented that the addition of genre as a search facet wasvery useful for narrowing the number of documents. Theywere excited by the directory thumbnail display that onlyshowed documents of the selected genre. In one instance, weimpressed a visitor by how quickly we could locate a slidefrom his previous visit ten years ago by using a combinationof the slide genre and date facets. The system has also beenused consistently by several people in our lab for finding oldslides or photos to use when creating a new presentation.

7 Conclusions

We investigated design considerations for systems thatperform coarse genre classification of office documents.Our set of document-inspired, visual, surface-type featuresdo not require explicit classification into region types,unlike most earlier image-based systems. Experimentscomparing these features against a simpler baseline set ofimage features showed that our visual features performsignificantly better in identifying page genre as measured byprecision, recall, and balanced F1-score. Feature selectionalso improved performance of a text-based genre classifierover a baseline, frequent-terms based classifier, indicatingthat enhancing the feature set with selected additional termscan improve performance over use of stop words only.

The results show that our image features perform sig-nificantly better than our text features. Additionally, the textfeatures do not significantly improve performance in com-bination with image-based features. Thus, image-based fea-tures that capture the layout of a document can be more in-formative for coarse genre identification than simple textualfeatures. As reported by others, performing genre identifi-cation without textual features enables broader applicabilityto new languages and documents, since textual features mayvary with topic.

A simple ensemble method using data partitioningsignificantly improved the performance of individual one-versus-rest SVM classifiers for page genre identification.

Each of our genre models learned the features of the differ-ent types of pages that occur in a genre, making creation ofseparate page-type identification models unnecessary.

We incorporated genre identification into a system forbrowsing and searching a collection of business documents,visibly reducing the document search space. We comparedtwo methods for identifying document genre from pagegenre scores and used the best method to identify documentgenres in the collection.

In the future, we would like to investigate whether mod-eling the sequence of page genre labels improves documentgenre identification. We also would like to extend our meth-ods to identify other genres that would be useful for officedocuments. And finally, we would like to conduct a userstudy to examine how genre affects search performance andperceived workload.

References

1. Bagdanov A, Worring M (2001) Fine-grained documentgenre classification using first order random graphs. In:Proceedings of the International Conference on Docu-ment Analysis and Recognition, pp 79–83

2. Boese ES, Howe AE (2005) Effects of web documentevolution on genre classification. In: CIKM ’05: Pro-ceedings of the 14th ACM International Conference onInformation and Knowledge Management, New York,NY, USA, pp 632–639

3. Burges CJC (1998) A tutorial on support vector ma-chines for pattern recognition. Data Mining and Knowl-edge Discovery 2(2):121–167

4. Chen N, Blostein D (2007) A survey of document im-age classification: problem statement, classifier archi-tecture and performance evaluation. International Jour-nal on Document Analysis and Recognition 10(1):1–16

5. Meyer zu Eissen S, Stein B (2004) Genre classificationof web pages: User study and feasibility analysis. In:Biundo S, Fruhwirth T, Palm G (eds) Advances in Arti-ficial Intelligence, Springer, pp 256–269

6. Fleiss JL (1971) Measuring nominal scale agreementamong many raters. Psychological Bulletin 76(5):378–382

7. Freund L, Clarke CLA, Toms EG (2006) Towards genreclassification for IR in the workplace. In: IIiX: Proceed-ings of the 1st International Conference on InformationInteraction in Context, pp 30–36

8. Gupta MD, Sarkar P (2007) A shared parts model fordocument image recognition. In: Proceedings of theNinth International Conference on Document Analysisand Recognition, pp 1163–1172

9. Hall M, Frank E, Holmes G, Pfahringer B, ReutemannP, Witten IH (2009) The WEKA data mining software:

Page 15: Genre identification for office document search and … · Genre identification for office document search and browsing ... Abstract When searching or browsing documents, the

Accepted by IJDAR. The final publication will be available at www.springerlink.com. 15

An update. SIGKDD Explorations 11(1):10–18, http://www.cs.waikato.ac.nz/ml/weka/

10. Hao X, Wang J, Bieber M, Ng P (1993) A tool for clas-sifying office documents. In: Proceedings of the FifthInternational Conference on Tools with Artificial Intel-ligence, pp 427–434

11. Hearst MA (2006) Design recommendations for hierar-chical faceted search interfaces. In: Broder AZ, MaarekYS (eds) Proceedings of the SIGIR 2006 Workshop onFaceted Search, pp 26–30

12. Henderson S (2005) Genre, task, topic and time:facets of personal digital document management. In:CHINZ ’05: Proceedings of the 6th ACM SIGCHINew Zealand chapter’s International Conference onComputer-Human Interaction, ACM, New York, NY,USA, pp 75–82

13. Hsu CW, Chang CC, Lin CJ (2010) A practical guide tosupport vector classification. http://www.csie.ntu.edu.tw/∼cjlin/libsvm/

14. Huang J, Kumar SR, Mitra M, Zhu WJ, Zabih R (1997)Image indexing using color correlograms. In: CVPR’97: Proceedings of the 1997 IEEE Conference on Com-puter Vision and Pattern Recognition, pp 762–768

15. Joachims T (1998) Text categorization with supportvector machines: Learning with many relevant features.In: ECML ’98: Proceedings of the 10th European Con-ference on Machine Learning, Springer-Verlag, Lon-don, UK, pp 137–142

16. Kessler B, Nunberg G, Schutze H (1997) Automatic de-tection of text genre. In: Proceedings of the 35th AnnualMeeting of the Association for Computational Linguis-tics, pp 32–38

17. Kim Y, Ross S (2007) Feature type analysis in au-tomated genre classification, http://eprints.erpanet.org/128/

18. Kim Y, Ross S (2008) Examining variations of promi-nent features in genre classification. In: Proceedings ofthe 41st Annual Hawaii International Conference onSystem Sciences

19. Lee YB, Myaeng SH (2002) Text genre classificationwith genre-revealing and subject-revealing features. In:SIGIR ’02: Proceedings of the 25th Annual Interna-tional ACM SIGIR Conference on Research and Devel-opment in Information Retrieval, ACM, New York, NY,USA, pp 145–150

20. Levering R, Cutler M, Yu L (2008) Using visual fea-tures for fine-grained genre classification of web pages.In: Proceedings of the 41st Annual Hawaii InternationalConference on System Sciences

21. Manning C, Raghavan P, Schutze H (2008) Introductionto Information Retrieval, Cambridge University Press,chap Text Classification and Naive Bayes

22. Peng H, Long F, Ding C (2005) Feature selection basedon mutual information: criteria of max-dependency,max-relevance, and min-redundancy. IEEE Transac-tions on Pattern Analysis and Machine Intelligence27:1226–1238, http://penglab.janelia.org/proj/mRMR/index.htm

23. Rauber A, Muller-Kogler A (2001) Integrating auto-matic genre analysis into digital libraries. In: Proceed-ings of the Joint Conference on Digital Libraries

24. Roussinov D, Crowston K, Nilan M, Kwasnik B, Cai J,Liu X (2001) Genre based navigation on the web. In:Proceedings of the 34th Annual Hawaii InternationalConference on System Sciences-Volume 4, IEEE Com-puter Society, Washington, DC, USA

25. Santini M, Rehm G, Sharoff S, Mehler A (2009)Editorial: Automatic genre identification: Issues andprospects. Journal for Language Technology and Com-putational Linguistics 24

26. Scholkopf B, Burges C, Smola A (eds) (1999) Ad-vances in Kernel Methods - Support Vector Learning,MIT-Press, chap 11: Making Large-Scale SVM Learn-ing Practical

27. Scholl P, Domınguez Garcıa R, Bohnstedt D, RensingC, Steinmetz R (2009) Towards language-independentweb genre detection. In: WWW ’09: Proceedings ofthe 18th International Conference on World Wide Web,New York, NY, USA, pp 1157–1158

28. Shin C, Doermann D, Rosenfeld A (2001) Classifi-cation of document pages using structure-based fea-tures. International Journal on Document Analysis andRecognition 3(4):232–247

29. Sivic J, Zisserman A (2003) Video Google: A text re-trieval approach to object matching in videos. In: Pro-ceedings of the Ninth IEEE International Conference onComputer Vision, vol 2

30. Snoek CGM, Worring M, Smeulders AWM (2005)Early versus late fusion in semantic video analysis. In:MULTIMEDIA ’05: Proceedings of the 13th AnnualACM International Conference on Multimedia, ACM,New York, NY, USA, pp 399–402

31. Stamatatos E, Fakotakis N, Kokkinakis G (2000) Textgenre detection using common word frequencies. In:Proceedings of the 18th International Conference onComputational Linguistics (COLING2000), pp 808–814

32. Witten IH, Frank E (2005) Data Mining: Practical Ma-chine Learning Tools and Techniques, 2nd edn. MorganKaufmann

33. Wong K, Casey R, Wahl F (1982) Document analysissystems. IBM Journal of Research and Development