
A Random Walk on an Ontology: Using Thesaurus Structure for Automatic Subject Indexing

Craig Willis
Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, 501 E. Daniel Street, Champaign, IL 61820. E-mail: [email protected]

Robert M. Losee
School of Information and Library Science, University of North Carolina, 216 Lenoir Drive, 302 Manning Hall, Chapel Hill, NC. E-mail: [email protected]

Relationships between terms and features are an essential component of thesauri, ontologies, and a range of controlled vocabularies. In this article, we describe ways to identify important concepts in documents using the relationships in a thesaurus or other vocabulary structures. We introduce a methodology for the analysis and modeling of the indexing process based on a weighted random walk algorithm. The primary goal of this research is the analysis of the contribution of thesaurus structure to the indexing process. The resulting models are evaluated in the context of automatic subject indexing using four collections of documents pre-indexed with 4 different thesauri (AGROVOC [UN Food and Agriculture Organization], high-energy physics taxonomy [HEP], National Agricultural Library Thesaurus [NALT], and medical subject headings [MeSH]). We also introduce a thesaurus-centric matching algorithm intended to improve the quality of candidate concepts. In all cases, the weighted random walk improves automatic indexing performance over matching alone, with an increase in average precision (AP) of 9% for HEP, 11% for MeSH, 35% for NALT, and 37% for AGROVOC. The results of the analysis support our hypothesis that subject indexing is in part a browsing process, and that using the vocabulary and its structure in a thesaurus contributes to the indexing process. The amount that the vocabulary structure contributes was found to differ among the 4 thesauri, possibly due to the vocabulary used in the corresponding thesauri and the structural relationships between the terms. Each of the thesauri and the manual indexing associated with it is characterized using the methods developed here.

Introduction

Our motivation for studying thesauri and subject indexing arose while working on the Helping Interdisciplinary Vocabulary Engineering (HIVE) project, a system for the integration of multiple thesauri and controlled vocabularies (Greenberg et al., 2011). HIVE includes a machine-aided indexing feature based on the Maui machine-learning algorithm (Medelyan, 2009). Early in the project, it became apparent that some vocabularies performed better during automatic indexing than others. Because the Maui algorithm relies primarily on term matching and frequency information, the question arose as to whether different properties of the thesauri, in particular thesaurus structure, could be used to explain differences in thesaurus performance and possibly improve automatic indexing. Research in this area serves not only to improve techniques for automatic indexing, but also to further understanding of vocabularies and the subject indexing process in general, which are the goals of this study.

The manual process of selecting and assigning terms from a thesaurus to represent documents relies in part on the vocabulary structure. Indexers select terms based not only on the frequency of concepts in the documents being indexed, but on term matching between document and thesaurus as well as the term relationships within the thesaurus, often determined by term meaning. The hierarchical structure of vocabularies, and thesauri in particular, introduces a browsing process into subject indexing. The indexer may begin by searching for terms based on concepts identified in the document. Given a set of potential matches, the indexer can browse the results and traverse the various relationships between terms until selecting the best term or set of terms to represent a particular concept. As a result, the final selected terms may not be represented in the document; for example, indexers will often choose a broader concept when multiple narrower concepts are found, even if the broader concept is not specifically mentioned (Hood, 1990).

Received April 30, 2012; revised October 1, 2012; accepted October 1, 2012

© 2013 ASIS&T • Published online 22 May 2013 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/asi.22853

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 64(7):1330–1344, 2013


Although manual indexing includes a significant browsing component, most current automatic indexing techniques rely primarily on term- and phrase-frequency information, and model the process as a sequence of matching and filtering operations (Medelyan, 2009; Medelyan & Witten, 2008). During the matching process, words from the document are compared to the indexing language, resulting in a list of candidate indexing terms. During the filtering process, different properties or attributes of the terms from the document are used to identify the most significant concepts. Common attributes include frequency, position, phrase length, etc. In this sense, the thesaurus is treated as a simple list of terms and phrases with synonyms, and the hierarchical and associative relationships are largely ignored. This limits the automatically assigned terms to only those that can be matched in the text, which does not account for cases, such as the one described above, where indexers may select broader terms when multiple narrower terms are present in the document.

Like electronic documents and webpages, the construction of vocabularies and the selection of indexing terms to represent documents can be described as consistent with random and probabilistic processes. Losee (2007) discusses the uncertainty involved in the inclusion of terms in a thesaurus as well as the uncertainty involved in the selection of indexing terms by the indexer to describe a particular document.

Thesauri, with their broader, narrower, and associative relationships, can be viewed as a tree or graph. The indexing process can then be modeled as browsing or “walking” the thesaurus graph, and an automatic indexing model consistent with this assumption is developed. Individual thesauri are represented as graphs, with terms being linked to other terms by different relationships. A matching algorithm is used to identify a list of candidate starting terms, and a weighted random walk to nearby terms in the thesaurus is used in identifying possible index terms, which are then ranked to estimate which terms are the most likely to have been assigned by a human indexer. This approach does not use term occurrence frequencies, instead relying exclusively on the thesaurus relationships for terms identified in each document. The expectation is that manual subject indexing includes a browsing process that partially determines the final selected terms. Information about how indexers navigate the thesaurus can be used to improve future automatic indexing solutions.

This article is organized as follows. In the next section, we provide an overview of thesauri and indexing languages, followed by an overview of the subject indexing process and related automatic indexing techniques. In the section Matching: Translating to the Indexing Language of the Thesaurus, we present an algorithm for term matching using a thesaurus, followed by a description of the weighted random walk algorithm in the section A Random Walk on a Thesaurus. In the sections Thesauri and Corpora, Evaluation, and Automatic Indexing Performance with Varying Random Walk Parameters, we discuss the test collections, methods of evaluation, and the experimental results. Finally, we present conclusions and a discussion of additional research questions.

Subject Indexing and Thesauri

Historically, the primary purpose of a thesaurus was to aid in the creation of a subject index to support access to documents in a collection. The process of building controlled vocabularies such as thesauri, and indexing documents using them, is broadly referred to as subject analysis. The process of identifying the central subjects of a document and representing them with an indexing language is commonly referred to as subject indexing. Chan, Richmond, and Svenonius (1985), Foskett (1996), and Lancaster (2003) provide thorough descriptions of these processes. We briefly review relevant aspects of manual and automatic subject indexing in the sections that follow.

Manual Subject Indexing

When using a thesaurus, the indexer necessarily takes into account the various relationships between concepts in the document and the thesaurus terms. Indexers draw on their knowledge of the domain, document collection, vocabulary, audience, and prior indexing experience. Indexers are generally allowed to consult external resources, such as indexing guidelines and manuals or specialized dictionaries, to fill gaps in their knowledge during the indexing process. In practice, many thesauri also include additional notes (e.g., scope notes) that define and further clarify when a particular term should or should not be used.

Subject indexing using thesauri and other indexing languages involves (a) a collection of documents, usually constrained to a specific domain; (b) a thesaurus comprising important concepts in the domain, represented as a list of terms and their relationships; and (c) external sources such as indexing guidelines and manuals or specialized dictionaries to aid in the indexing process.

The AGRICOLA Guide to Subject Indexing, a manual for use with the National Agricultural Library Thesaurus (NALT), serves as an example of a local indexing manual and highlights the complexity of the manual indexing process (Hood, 1990). First, indexers are instructed to read the title, abstract, and introduction; to note illustrations, charts, and tables; and to consider bibliographic references, keywords, and even the type of journal or affiliation of the author. Next, the indexer is instructed to identify the concepts that represent the central subjects of the document. The AGRICOLA guide instructs indexers to “give a high priority to concepts the author considers important as evidenced by the manner and frequency of their treatment” (emphasis added). However, it continues, the indexer should also “index concepts that [they] know to be important” (p. 3). Several of the steps that follow require subjective knowledge of the collection and users, such as only considering information that “warrants the time and expense of retrieval” (p. 3). Similar guidelines can be found in ISO 5963 (International Organization for Standardization, 1985).

As can be seen from these basic instructions, the process of indexing is complex. Indexers are working with the natural language of a document, the scaffolding of document structure, the controlled terms of an indexing language, the relationships between terms in the language and the vocabulary, and the knowledge to connect natural language concepts to associated controlled terms in the context of a specific collection.

Over the years, many different approaches to automatic or machine-aided indexing have been developed. These processes employ simplified models of some of the specific aspects of the manual indexing process just described, and are discussed in greater detail in the next section.

Automatic Subject Indexing

Since the adoption of manual processes for subject indexing with controlled vocabularies in the 1950s and 1960s, many researchers have explored techniques to automate this process (Borko, 1962; Klingbiel, 1969; Stevens & Urban, 1964). Over the years, the research agenda has remained remarkably stable. As Maron (1961) wrote, automatic indexing “concerns the problem of deciding automatically what a given document is about” (p. 404). These words are echoed nearly 5 decades later by Medelyan and Witten (2008), who set out to explore the problem of “automatically identifying the main topics of documents” (p. 1027). Over the years, the phrase “automatic indexing” has been used to describe a variety of techniques in modern information retrieval, often outside of the context of subject indexing with controlled vocabularies. There is, however, a body of research specifically focused on approaches using controlled vocabularies in the context of manual subject indexing practice. This is the sense of “automatic indexing” used in this article: shorthand for automatic subject indexing and classification using controlled vocabularies.

It is worth noting that because of the duration of research in this area, researchers have worked with a variety of different document components over the years. Early researchers worked primarily with titles and citations, later followed by abstracts. Document full text has only been available relatively recently. Manual indexers, however, have generally had access to the full text in physical form. Over the years, researchers have also worked with a variety of different controlled vocabularies and indexing languages. Some of these vocabularies are more akin to simple term lists or subject heading lists, and others are closer to thesauri and ontologies with rich relationships. The most recent work in this area, including that of Medelyan and Witten (2008) and Medelyan (2009), focuses on indexing using document full text and thesauri. Still, Névéol, Shooshan, Humphrey, Mork, and Aronson (2009) describe a system that works primarily with bibliographic citations.

According to Sebastiani (2002) and Hlava (2005), approaches to automatic indexing with controlled languages fall into two broad categories: rule based (or expert systems) and statistical (or machine learning). Rule-based approaches, such as those described by Vleduts-Stokolov (1987), Humphrey and Miller (1987), Silvester, Genuardi, and Klingbiel (1994), and Hlava (2005), rely on formalized languages for the assignment of controlled indexing terms based on document text. These formalized languages, sometimes referred to as “semantic vocabularies” or “lexical dictionaries,” contain rules to support the translation from natural language terms into controlled indexing terms based on different senses of words in the collection. Variations of this approach have been adopted in production indexing environments at the Biosciences Information Service (BIOSIS), the US National Aeronautics and Space Administration (NASA), and the National Library of Medicine (NLM), among others.

Early statistical approaches focused on the co-occurrence of headings under various conditions. Field (1975) looked at the co-occurrence of index terms with free-text keywords associated with citations. Leung and Kan (1997) and Plaunt and Norgard (1998) apply statistical learning techniques to associate controlled terms with the natural language in titles and abstracts. Humphrey (1999) computes co-occurrence values between words in document titles and abstracts and controlled indexing terms assigned at the journal level.

More recent statistical learning approaches to automatic subject indexing, such as those described by Medelyan and Witten (2008) and Medelyan (2009), attempt to learn controlled indexing based on features from the document and collection. Features include term frequency/inverse document frequency (tf-idf), position of the term's first occurrence, and node degree (the number of links to a term in the thesaurus). Medelyan (2009) introduces several additional features, including the position of the last occurrence of the term and the term's keyphraseness. Keyphraseness, originally discussed by Mihalcea and Csomai (2007), is the likelihood that a particular term or phrase is a keyphrase, based on information derived from Wikipedia. Another feature included by Medelyan for free text, but not for use with controlled vocabularies, is semantic relatedness, which leverages the structure or relationships between concepts. The author evaluates several different classifiers, concluding that naïve Bayes and bagged decision trees perform best on the tested thesauri and collections.

In practice, rule-based and statistical approaches are combined to provide the best results, usually limited to a single thesaurus or vocabulary. The NLM Medical Text Indexer (MTI) initiative uses a combination of machine learning, natural language processing, and rule-based techniques to provide the best possible results in a production indexing environment (Aronson, Mork, Gay, Humphrey, & Rogers, 2004).

Each of the techniques described looks primarily at the relationship between the natural language in the document and manually assigned controlled terms without regard to the structure inherent in the vocabulary. Medelyan and Witten (2008) and Medelyan (2009) use the number of relationships for a specific vocabulary term as one feature in the statistical learning process. In the field of geographic information retrieval, Martins and Silva (2005) describe an approach to disambiguating geographic place names using the relationships between place names in a gazetteer. Related techniques have been explored by others in the geographic domain (Buscaldi & Rosso, 2008; Leidner & Lieberman, 2011; Overell & Rüger, 2008). Mihalcea and Tarau (2004) describe an approach to ranking important free-text keyphrases using relationships between words in a document and ranking them using a variation of the PageRank algorithm. Aside from disambiguation, to our knowledge, none of the research in the area of automatic subject indexing has considered the impact of thesaurus structure on term selection.

As outlined by Lancaster (2003) and evident in each of the automated techniques, a central step in indexing is matching terms or concepts present in the document to those represented in the vocabulary. Most automated approaches take a document-centric approach. For this article, we introduce a novel thesaurus-centric approach to matching. The matching process is discussed in the next section.

Matching: Translating to the Indexing Language of the Thesaurus

Translation, an essential step in the subject indexing process, is the act of matching the results of the conceptual analysis to the indexing language of the thesaurus (Lancaster, 2003). Human indexers employ a variety of techniques and external resources to aid in the translation process. They can easily draw on their own internal knowledge of language, specialized knowledge of the domain, and prior indexing experience, and refer to external sources such as domain-specific dictionaries. In automatic subject indexing, a common practice is to simply match the language of the document to the terms in the vocabulary. Most existing techniques take a document-centric approach, where the document is tokenized and normalized and compared to the vocabulary. In this section, we review two document-centric approaches and describe the novel thesaurus-centric process used in this study.

Document-Centric Approaches to Matching

Document-centric approaches to matching are described by Vleduts-Stokolov (1987), Silvester et al. (1994), and Medelyan (2009). Each of these takes as input the document and thesaurus and outputs a set of matched terms. Medelyan (2009) describes a 4-step approach. The document is first tokenized, and a set of all phrase n-grams up to a certain length is generated. The n-grams and vocabulary terms are normalized using various techniques, including downcasing, stopword removal, stemming, and word reordering. (In reordering, the words in the n-grams are ordered alphabetically to improve matching.) Finally, the normalized n-grams are looked up in the vocabulary, matching both preferred and nonpreferred terms. The output is the set of matched indexing terms.

Silvester et al. (1994) apply the same approach as Vleduts-Stokolov (1987). The document is parsed into sentences based on simple boundaries (e.g., punctuation). Word n-grams are generated for all combinations of words in a five-word window. The n-gram combinations are looked up against the vocabulary, and the longest match is selected. Due to the nature of the vocabulary used for their research, conflation of preferred and nonpreferred terms is unnecessary. The output is the set of matched indexing terms.
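The document-centric pipeline just described (tokenize, generate n-grams, normalize, look up in the vocabulary) might be sketched as follows. This is a deliberately simplified illustration, not the authors' implementation: stemming is omitted, the stopword list is a toy, and all function names are our own.

```python
import re

STOPWORDS = {"the", "of", "and", "in", "a"}  # toy list; real systems use larger ones

def normalize(phrase):
    """Downcase, strip punctuation, drop stopwords, and reorder words
    alphabetically (stemming is omitted here for brevity)."""
    words = re.sub(r"[^\w\s]", " ", phrase.lower()).split()
    words = [w for w in words if w not in STOPWORDS]
    return " ".join(sorted(words))

def document_ngrams(text, max_n=3):
    """All contiguous word n-grams up to max_n, in normalized form."""
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(normalize(" ".join(tokens[i:i + n])))
    return grams

def match_document_centric(text, vocabulary):
    """Look up each normalized document n-gram in the vocabulary.
    `vocabulary` maps normalized forms to preferred terms."""
    grams = document_ngrams(text)
    return {vocabulary[g] for g in grams if g in vocabulary}
```

For example, with a two-term vocabulary built from "Insect pests" and "Fire ants", the sentence "Fire ants are among the worst insect pests." matches both terms, because the normalized bigrams coincide with the normalized vocabulary entries.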

An alternate approach, and the one used in this study, is to start with the thesaurus and attempt to match each preferred and nonpreferred term to the document. This thesaurus-centric approach is described in the following section.

A Thesaurus-Centric Approach to Matching

In this section, we describe a thesaurus-centric matching approach, dubbed “ThesaurusMatcher”. The input for this process is the thesaurus and document text. The output is the set of matched indexing terms. The algorithm expands on Medelyan (2009) and Silvester et al. (1994). The main difference is that instead of looking up document n-grams in the thesaurus, thesaurus terms are matched against the collection of document n-grams. The ThesaurusMatcher algorithm follows:

1. From the document, generate all word n-grams of size nmin to nmax using a window of size j, keeping track of word positions. Normalize the document text by removing punctuation, downcasing, stopword removal, stemming, and word reordering.

2. Normalize thesaurus terms and order phrases by descending phrase length.

3. For each thesaurus phrase of length nmax to nmin, look up the phrase in the collection of document n-grams. Record each matched phrase, remove all n-grams that contain the words matched at each position, and continue.
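A minimal sketch of the ThesaurusMatcher steps above, under simplifying assumptions: normalization is reduced to downcasing and punctuation removal (no stemming, stopwords, or reordering), the window is simply the whole document, and the names are illustrative rather than the authors' code.

```python
import re

def normalize(phrase):
    """Downcase and strip punctuation; other normalization steps omitted."""
    return tuple(re.sub(r"[^\w\s]", " ", phrase.lower()).split())

def thesaurus_matcher(text, thesaurus_terms, n_max=3):
    """Match thesaurus phrases against document n-grams, longest phrases
    first. Token positions consumed by a match are removed, so shorter
    phrases cannot re-match inside a longer, already-matched one."""
    tokens = normalize(text)
    # Map each n-gram (as a word tuple) to the token positions it covers.
    ngrams = {}
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.setdefault(tokens[i:i + n], []).append(set(range(i, i + n)))
    matched, used = [], set()
    # Step 2: order thesaurus phrases by descending length.
    for term in sorted(thesaurus_terms, key=lambda t: -len(normalize(t))):
        # Step 3: look up the phrase, skipping spans with consumed words.
        for span in ngrams.get(normalize(term), []):
            if not span & used:
                matched.append(term)
                used |= span
                break
    return matched
```

For "Fire ants damage crops." with thesaurus terms ["fire ants", "ants", "crops"], the two-word phrase "fire ants" matches first and consumes its positions, so the shorter "ants" can no longer match; the result is ["fire ants", "crops"].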

Longer phrases in a controlled vocabulary have more ability to identify relevant documents (Losee, 2004, 2006). Thus, ordering thesaurus phrases from the longest to the shortest and not using all the phrases will result in the best phrases for retrieval being selected. The expectation is that this approach will result in higher-quality matches and fewer false positives. In effect, the thesaurus terms become a query against the document, as opposed to the approaches just described, which use the document text to search the vocabulary.

This method is effective as long as the thesaurus and document are relatively small or processing time is not a concern. For larger thesauri, an alternative approach is to generate all phrase n-grams for the thesaurus and document, processing document n-grams in order of descending length and always preferring the longest match.

Each of the approaches to matching provides a list of candidate indexing terms and information about each term from the document (e.g., match position, frequency, etc.). In automatic indexing, often the next step is to filter the list of candidate matches based on some criteria. In this study, the list of candidate terms is used as a set of starting points for a random walk process. Additionally, the term frequencies of the candidate terms are not used once the starting points for random walks are determined. The underlying intuition is that indexers do not necessarily choose only those terms that are explicitly matched in the document. In many cases, they browse the thesaurus and have the option to select different terms based on information in the thesaurus. Random walking and the algorithm used in this study are described in detail below.

A Random Walk on a Thesaurus

A random walk is a method for moving from feature to feature in a graph, such as moving from term to adjacent term in a thesaurus. It provides a mathematical method for modeling stochastic or nondeterministic processes. Random walks have been applied in a variety of disciplines, such as computer science, biology, and economics, to explain the behaviors of web surfers, cells, and the stock market (Berg, 1993; Brin & Page, 1998; Malkiel, 1999). A random walk is a type of Markov chain, characterized by a set of independent steps with a set of probabilities of moving from one step to the next, known as transition probabilities (Ghahramani, 2005). In computer science, random walks also refer to a class of algorithms, generally on graphs (Mihalcea & Radev, 2011). Perhaps the most famous application of a random walk is the PageRank algorithm used by Google to rank webpages (Brin & Page, 1998). PageRank uses a random walk to model the behavior of a “random surfer” who moves from webpage to webpage by clicking on links. Simply put, a score is assigned to each page based on the probability that the surfer will arrive on it. Random walks on graphs have since been used for such diverse applications as word-sense disambiguation and collaborative recommendation.
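For readers unfamiliar with PageRank, the random-surfer score can be sketched as a short power iteration. This is a generic textbook formulation under assumed link data, not the algorithm used in this article.

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank. `links` maps each page to the pages it
    links to; the returned score approximates the probability that a
    random surfer is on each page."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:  # dangling page: spread its rank uniformly
                for q in pages:
                    new[q] += damping * rank[p] / len(pages)
        rank = new
    return rank
```

On a toy web where pages "b" and "c" both link to "a", page "a" ends up with more incoming probability mass than "c", matching the intuition that pages with more incoming links score higher.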

Thesauri, with their many relationships, can also be viewed as trees or graphs, with each term a node and each relationship an edge. In this article, a weighted random walk is used for automatically selecting terms in a thesaurus based on the text in a document. The algorithm is described in detail in the next section, but simple illustrations may provide helpful clarification.

What Is a Random Walk?

According to Klafter and Sokolov (2011), the name “random walk” came from a question posed to readers of Nature by the mathematician Karl Pearson in 1905. Pearson was studying the evolution of mosquito populations in cleared jungle regions, but his question was phrased in terms of a person walking randomly:

A man starts from a point O and walks l yards in a straight line; he then turns through any angle whatever and walks another l yards in a second straight line. He repeats the process n times. I require the probability that after n stretches he is at a distance between r and r + dr from his starting point O. (Pearson, 1905, p. 294)

The resulting model has been used to explain the behavior of nondeterministic processes in a variety of fields. A slightly different story might help to illustrate the approach used in this article:

Flying over an unknown territory, an airplane drops a number of parachutists at a number of different locations. The parachutists have no maps and are instructed to follow any established road or path through a fixed number of towns, villages, or cities. What is the probability that after visiting n towns, the parachutists have reached a specific location?

After enough steps, one would expect the majority of the parachutists to end up in cities with the largest number of incoming roads, as long as there was a route from their starting point to a city. If parachutists were dropped consistently near lakes, one would expect more walks to arrive at lakes than at many other kinds of destinations. In this study, the starting points are the initial set of indexing terms matched in the document, and the walks are expected to arrive at terms related to the starting points, with the most common destinations being near what is common between the different starting points. Figure 1 illustrates a simple set of terms and relationships from a thesaurus. If the starting points are “bot flies,” “fire ants,” and “locusts,” one might expect the random walks to most frequently end on “insect pests.” If the starting points are “butterflies” and “bot flies,” then one might expect the random walks to most frequently end on “insects.” Because it is a random process, some of the walks will sometimes end on unrelated terms. The idea is to identify those terms that are most often endpoints for the walks.

In this article, we use a variation known as a weighted random walk, in which the individual paths are followed with differing probabilities. For example, the random walk may be configured to choose broader relationships 10% of the time, narrower relationships 60% of the time, and associative relationships the remaining 30% of the time. Knowing these probabilities can be helpful in understanding how indexers select indexing terms using specific thesauri, and the probabilities serve as a fundamental part of our model of the indexing process using controlled vocabularies. Because indexers are instructed to account for the relationships between terms during the selection process, any process that intends to model or replicate manual indexing must do the same.

Algorithm for a Weighted Random Walk on a Thesaurus

To model the effect of thesaurus structure on the subject indexing process, we begin with S, the set of one or more candidate vocabulary terms produced by the thesaurus-centric matching process and used as starting points for browsing and selection. We construct an undirected graph G with vertices V for each of the preferred terms in the thesaurus. The edges E are associations between the broader, narrower, and associative thesaurus terms.
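To make the graph construction concrete, the sketch below builds G as a per-relationship adjacency map. This is a minimal illustration in Python, not the authors' implementation; the triple format, the function name, and the sample terms (taken from the Figure 1 fragment) are our own.

```python
from collections import defaultdict

def build_graph(relations):
    """Build a thesaurus graph as an adjacency map keyed by relationship type.

    `relations` is an iterable of (term, rel_type, term) triples, where
    rel_type is 'broader', 'narrower', or 'related' (associative). Each
    triple is stored in both directions, with broader/narrower inverted,
    so every edge can be traversed either way (an undirected graph).
    """
    inverse = {'broader': 'narrower', 'narrower': 'broader', 'related': 'related'}
    graph = defaultdict(lambda: defaultdict(list))
    for a, rel, b in relations:
        graph[a][rel].append(b)           # e.g., a has broader term b
        graph[b][inverse[rel]].append(a)  # so b has narrower term a
    return graph

# A fragment of terms and relationships modeled on Figure 1
g = build_graph([
    ('bot flies', 'broader', 'insect pests'),
    ('fire ants', 'broader', 'insect pests'),
    ('insect pests', 'broader', 'insects'),
])
```

Storing each edge under both endpoints keeps the per-step lookup in the walk to a single dictionary access.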

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, July 2013. DOI: 10.1002/asi

For each starting term si, representing a match between the thesaurus and the document, we begin a series of weighted random walks on the thesaurus graph. Algorithm 1 presents a description of the process. We define five parameters for the weighted random walk: the number of walks N initiated from each particular starting term, the length of the walk K, and the probabilities assigned to each of the thesaurus relationships: broader (pb), narrower (pn), and associative (pr).

Algorithm 1. DoRandomWalk(N, K, pb, pn, pr)
Input: Unordered list of starting terms S
Output: Ordered list of resulting term endpoints and frequency of each endpoint

1. for each starting term s
2.   for n ← 1 to N
3.     for k ← 1 to K
4.       Select next term t using weighted random selection based on the probabilities of selecting each relationship (pb, pn, pr)
5.       endpoints[t]++
6.     end for k
7.   end for n
8. Rank endpoints by frequency

Underlying the weighted random walk is the intuition that indexers select nearby terms by browsing thesaurus relationships, and that some types of relationships are more important than others. A simple example is the selection of a broader term when multiple narrower terms are found to be important, even if the broader term does not occur in the document. Other examples include the selection of narrower or related terms when they are found to be better conceptual matches than the starting term.

Increasing the number of walks N serves to reduce the noise from the random process. With a larger number of walks, the more important or frequently encountered endpoints will cluster or separate more accurately from those less frequently encountered. The length of the walk K represents how far the indexer browses from the starting term to select a final term. The weights assigned to each relationship represent the probability that the indexer will browse a particular type of relationship.

An interesting condition arises when a particular relationship is not present for a term during the random walk. What should be done, for example, if the random walk tries to select a narrower term, but no narrower relationship exists? Other random walk implementations, such as PageRank, “teleport” or jump randomly to another node in the graph. However, this is not a likely model for browsing during subject indexing. In this implementation, if the relationship does not exist, the resulting behavior is to “stay put” on the current node.
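Algorithm 1, together with the "stay put" fallback just described, can be sketched as follows. This is one illustrative Python reading of the pseudocode, not the authors' code; the graph format (term → relationship type → neighbor list) is an assumption. Note that, following Algorithm 1, the visited term is tallied at every step of the walk, not only at the final step.

```python
import random
from collections import Counter

def do_random_walk(graph, starts, n_walks, k, p_broader, p_narrower, p_related, seed=None):
    """Run N weighted random walks of length K from each starting term.

    At each step a relationship type is chosen with probabilities
    (pb, pn, pr). If the current term has no link of the chosen type,
    the walk stays put on the current node. Every visited term is
    tallied, and endpoints are returned ranked by frequency.
    """
    rng = random.Random(seed)
    rel_types = ['broader', 'narrower', 'related']
    weights = [p_broader, p_narrower, p_related]
    endpoints = Counter()
    for s in starts:
        for _ in range(n_walks):
            current = s
            for _ in range(k):
                rel = rng.choices(rel_types, weights=weights)[0]
                neighbors = graph.get(current, {}).get(rel, [])
                if neighbors:
                    current = rng.choice(neighbors)
                # else: no such relationship exists, so stay put
                endpoints[current] += 1   # Algorithm 1 tallies at every step
    return endpoints.most_common()
```

With pb = 1.0 the walk climbs broader links deterministically, which makes the tallying easy to trace by hand against the Figure 1 fragment.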

The optimal values for each parameter are empirically estimated for each thesaurus and document collection, and are reported in later sections. The next section describes the thesauri and document test collections used to evaluate the algorithm we described, and to study the usefulness of incorporating the relationships between entries in a thesaurus or ontology into automatic indexing systems.

Thesauri and Corpora

For evaluation purposes, this study uses the thesauri and corpora first used by Medelyan (2009). The weighted random walk algorithm described in the previous section is evaluated using the four different vocabularies and document collections listed in Table 1. This section presents preliminary results of an analysis of these vocabularies that inform the evaluation. Vocabularies and corpora include the NALT, with a collection of 200 pre-indexed documents from the AGRICOLA bibliographic database; the AGROVOC thesaurus, with 80 pre-indexed documents from the UN Food and Agriculture Organization (FAO) digital library; the HEP taxonomy, with 290 pre-indexed documents from the European Organization for Nuclear Research (CERN) Document Server; and MeSH, with 500 pre-indexed documents from the PubMed bibliographic database. Each thesaurus is encoded using the Simple Knowledge Organization System (SKOS) format. Documents in the AGRICOLA-200, FAO-80, and CERN-290 collections were selected randomly from English-only documents in each system. Documents in the NLM-500 collection were provided by the NLM Indexing Initiative (Aronson et al., 2004).

FIG. 1. Sample terms and relationships from a thesaurus.

Table 2 presents a summary of different properties of the thesauri used in this study. The vocabularies differ in a number of ways. They represent different numbers of concepts. For example, HEP, the smallest vocabulary, includes just over 16,000 concepts, whereas NALT, the largest, includes nearly 50,000. The vocabularies also differ in the number of alternate terms (or equivalence relationships) per concept, an important factor during the matching process. MeSH, which by far has the largest number of alternate terms per concept, has an average of seven alternate terms per concept. On the other hand, HEP concepts have an average of 0.02 alternate terms per concept. Another difference can be seen in the number of associative relationships in each vocabulary. Associative relationships are also referred to as related terms. Whereas AGROVOC has an average of one associative relationship per concept, MeSH has an average of 0.73 associative relationships per concept, and almost no HEP concepts have associative relationships.

Other differences can be seen in the lengths of phrases used in each thesaurus. As illustrated in Figure 2, all four thesauri are primarily comprised of two-word phrases. However, although NALT and AGROVOC have a large number of one-word entries, both HEP and MeSH have larger numbers of three-word phrases. All four of the vocabularies have smaller numbers of longer phrases. The distribution of phrase lengths may be approximated by the Poisson distribution for each vocabulary. The phrase lengths in the thesaurus significantly affect the matching process.
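The phrase-length profile described here can be computed by splitting each preferred or alternate label on whitespace; a minimal sketch of our own, with hypothetical labels:

```python
from collections import Counter

def phrase_length_distribution(labels):
    """Count vocabulary entries by phrase length (1-gram, 2-gram, ...)."""
    return Counter(len(label.split()) for label in labels)

# Hypothetical thesaurus labels for illustration
dist = phrase_length_distribution(
    ['insects', 'insect pests', 'fire ants', 'high energy physics'])
```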

Figure 3 presents the frequencies of phrase lengths of descriptors assigned by manual indexers in each collection. It can be seen that the phrase lengths of selected terms closely relate to the phrase lengths in the vocabularies, as illustrated in Figure 2. The similar distributions suggest that the thesaurus may reflect the use of terminology by authors and perhaps by searchers.

In this study, the frequencies in Figure 2 and Figure 3 are used to establish the minimum and maximum phrase lengths for each vocabulary during the matching process.

In this section, we have described the thesauri and document collections used to evaluate the random walk algorithm described in an earlier section. In the next section, we discuss the evaluation metrics used in this study.

Evaluation

Following Medelyan (2009), this study uses simple precision, recall, and average precision (AP) to measure the performance of the random walk algorithm for automatic indexing. These measures are adapted to the context of subject indexing. In this context, the “good” terms are those automatically identified terms or phrases that match the ones assigned manually by indexers to a document. “Extracted” terms are those returned by the automatic indexing process. We define precision as:

    precision = P(good | extracted) = (# good extracted terms) / (# extracted terms)

and define recall as:

    recall = (# good extracted terms) / (# manually assigned terms).

Average precision is used as a single-value measure for parameter tuning, and is calculated as the average of the precision at each decile recall level from 0 to 1.0. Precision/recall curves are also used to illustrate the full profile of different configurations.
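The text does not spell out how precision is interpolated at each decile recall level, so the sketch below uses the common IR convention: the precision at recall level r is the maximum precision at any rank cutoff whose recall is at least r, averaged over the 11 levels 0.0, 0.1, ..., 1.0. Function and argument names are our own.

```python
def decile_average_precision(ranked_terms, relevant):
    """Average interpolated precision over recall levels 0.0, 0.1, ..., 1.0.

    `ranked_terms` is the system output, best first; `relevant` is the set
    of manually assigned terms. Precision at recall level r is taken as
    the maximum precision at any cutoff whose recall is >= r (a common
    IR convention, assumed here).
    """
    points = []  # (recall, precision) after each rank cutoff
    hits = 0
    for i, term in enumerate(ranked_terms, start=1):
        if term in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))
    interpolated = []
    for level in (l / 10 for l in range(11)):
        candidates = [p for r, p in points if r >= level]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / 11
```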

Automatic Indexing Performance With Varying Random Walk Parameters

The experiments described here are intended to test the effect of various parameter combinations on the performance of the weighted random walk algorithm, measured as AP. Two sets of experiments were run: the first including only broader and narrower relationships, the second including broader, narrower, and associative relationships. The parameters explored in these experiments include:

• Minimum and maximum phrase length during matching
• Optimal walk length (K)
• Optimal probabilities for each relationship type, including broader (pb), narrower (pn), and associative (pr)

The experimental platform is as follows. For each thesaurus and document collection, the weighted random walk algorithm was run with all possible combinations of relationship probabilities summing to 1, with probabilities incrementing by 0.10. For example, if the probability of selecting a broader relationship is 0.40, then the probability of selecting a narrower relationship is 0.60. For the first

TABLE 1. Vocabularies and Document Collections Used in This Study.

Thesaurus   Corpus         Description
NALT        AGRICOLA-200   200 documents from AGRICOLA indexed using the National Agricultural Library Thesaurus (NALT)
AGROVOC     FAO-80         80 documents from the UN Food and Agriculture Organization (FAO) collection indexed using the AGROVOC thesaurus
MeSH        NLM-500        500 documents from the National Library of Medicine (NLM) PubMed database indexed using the Medical Subject Headings (MeSH)
HEP         CERN-290       290 documents from the European Organization for Nuclear Research (CERN) Document Server indexed using the High Energy Physics (HEP) taxonomy


experiment, only combinations of broader and narrower relationship probabilities are considered. For the second, associative relationships were also included. The walk distance, K, increases from 2 to 10. To account for possible return trips, both K and K - 1 are used for a single walk referred to as of distance K. Thus, the analysis with K = 2 represents an actual walk with a K of 2 and a walk with a K of 1. One exception is when K = 0, which is used as a baseline for comparison, and no random walk is performed. Minimum phrase length for the thesaurus-centric matching algorithm varies from one to three words per phrase, with a maximum length of 5. These values are based on the phrase length frequencies illustrated in Figure 2. Average precision is used to identify the best performing parameter combinations for each thesaurus and document collection.
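Enumerating the probability combinations over integer tenths avoids floating-point drift in the constraint that the probabilities sum to 1; a small helper of our own devising:

```python
def probability_grid(include_associative=True, step=0.1):
    """All (pb, pn, pr) combinations summing to 1 in `step` increments.

    With include_associative=False, pr is fixed at 0, matching the
    first experiment (broader/narrower relationships only).
    """
    n = round(1 / step)
    combos = []
    for b in range(n + 1):            # pb in integer steps
        for m in range(n + 1 - b):    # pn in integer steps
            r = n - b - m             # pr is whatever remains
            if not include_associative and r != 0:
                continue
            combos.append((b * step, m * step, r * step))
    return combos
```

With a 0.10 step this yields 11 broader/narrower-only combinations and 66 combinations over all three relationship types.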

The intent here is to assess the contribution of each weighted parameter to the indexing process, not to present a technique for parameter estimation. Future research may explore methods for identifying the best parameter combinations.

Weighted Random Walk Using Broader and Narrower Relationships

The first experiment considered random walks using only broader and narrower relationships. Figure 4 illustrates the average precision for each vocabulary using a walk length of 2 (K = 2) as the narrower relationship probability increases from 0 to 1.0 (and therefore the broader relationship probability decreases from 1.0 to 0). In Figure 4A, the matching uses all phrases of length 5 down to 1. In Figure 4B, the matching uses only phrases of length 5 down to 2.

The best minimum phrase lengths used during the matching phase are thesaurus dependent. A minimum phrase length of 1 produces the highest average precision when using AGROVOC, NALT, and HEP. A minimum phrase length of 2 produces the highest average precision when using MeSH. This suggests that the useful terminology in medicine is longer than that in the other areas being considered.

The effect of changes in the probabilities of selecting broader or narrower relationships differs among the thesauri. Increased narrower relationship probability has a pronounced effect on NALT, until the probability exceeds 0.80 with the smallest phrases being 1-grams, or 0.70 with the smallest phrases being 2-grams, where average precision begins to decrease. Increased narrower relationship probability has a modest effect on AGROVOC, increasing average precision until the probability exceeds 0.70 with the smallest phrases being 1-grams, or 0.60 with the smallest phrases being 2-grams. Increasing the narrower-term probability has a similarly modest positive effect on MeSH. In the case of HEP, average precision decreases as the narrower relationship probability increases.

In this study, average precision is determined by comparing the terms found by the random walk process to the

TABLE 2. Thesaurus Properties.

                                      AGROVOC   HEP      NALT     MeSH
Total number of concepts              28,175    16,018   48,308   25,990
Total number of alternate labels      10,028    370      33,494   177,316
Alt. labels per concept               0.35      0.02     0.69     6.8
Number of broader relationships       27,686    27,333   50,721   35,199
Number of narrower relationships      27,688    27,059   50,721   35,199
Number of associative relationships   27,712    108      21,746   7,188

Note. AGROVOC = UN Food and Agriculture Organization; HEP = High Energy Physics taxonomy; NALT = National Agricultural Library; MeSH = National Library of Medicine.

FIG. 2. Distribution of phrase n-grams for each thesaurus.

FIG. 3. Distributions of phrase n-grams for each test collection.


manually assigned terms from each test collection. The results suggest that, in the cases of NALT, MeSH, and AGROVOC, indexers do not only select from the set of matched terms. Instead, indexers also select from the set of broader and narrower terms for each of the matched candidates. By increasing the probability of browsing narrower terms, the average precision increases up to a point. These results suggest that indexing in the NALT, AGROVOC, and MeSH test collections is more likely to include terms that are narrower than the matched candidates. However, some broader terms are also included.

Weighted Random Walk Using Broader, Narrower, and Associative Relationships

In the second set of experiments, the weighted random walk was evaluated using all three relationship types: broader, narrower, and associative. Figure 5 illustrates the

FIG. 4. (A) Average precision (AP) of terms selected as a result of weighted random walks with increasing narrower relationship probability, using a minimum phrase length of 1-gram. (B) Average precision of terms selected as a result of weighted random walks with increasing narrower relationship probability, using a minimum phrase length of 2-grams.


effect of including the associative relationship on average precision for each vocabulary. It can be seen from this graph that increasing the probability of traversing associative relationships increases average precision for all vocabularies except HEP. For the graph in Figure 5, as the associative relationship probability increases from 0.0 to 1.0, the narrower relationship probability decreases from 1.0 to 0.0. The broader relationship probability is held constant at zero.

The precision and recall graphs in Figure 6 illustrate the effect of minimum phrase length and the inclusion of associative relationships on precision and recall. Each of these graphs plots three different data types: (a) matching only (no random walk), (b) weighted random walk with broader/narrower relationships only (no associative relationships), and (c) weighted random walk with all relationship types. These graphs are for only those configurations with the highest average precision values per thesaurus.

The results of each of these configurations are summarized in Tables 3 and 4. A minimum phrase length of 1 produces the highest average precision values for NALT, AGROVOC, and HEP, and a minimum phrase length of 2 produces the highest average precision values for MeSH. Higher associative relationship probabilities improve the average precision for NALT, AGROVOC, and MeSH. Because HEP has so few associative relationships, the higher probabilities have no effect on performance. Recall that K = 0 is the baseline representing matching only with no random walk. For all other K, the value represents random walks of length K and K - 1.

In all cases, the weighted random walk improves average precision over matching alone. In the cases of NALT, MeSH, and AGROVOC, including associative relationships improves average precision over broader and narrower relationships only. Optimal minimum phrase lengths differ across vocabularies and collections, but a minimum phrase length of 3 consistently underperforms minimum phrase lengths of 2 or 1. Optimal combinations of the probabilities of each relationship type also differ among the thesauri.

For NALT, the highest average precision of 0.1901 is achieved with parameters K = 2, pn = 0.3, pb = 0.1, and pr = 0.6. For AGROVOC, the highest average precision of 0.1862 is achieved with parameters K = 2, pn = 0.0, pb = 0.2, and pr = 0.8. For MeSH, the highest average precision of 0.1431 is achieved with parameters K = 2, pn = 0.1, pb = 0.3, and pr = 0.6. For HEP, the highest average precision of 0.1332 is achieved with parameters K = 10, pn = 0.3, pb = 0.7, and pr = 0.0. These results support the hypothesis that there are differences in thesaurus structure that affect automatic subject indexing.

Successes and Failures

Comparing the rankings of terms output from the weighted random walk algorithm to the rankings of terms output from the thesaurus-centric matching process based on term frequency, the weighted random walk affects average precision in two ways. First, it includes terms that are not present in the document, but are related to other terms found during the matching process. Second, it changes the ranking of terms that are present in the document based on relationships to other terms in the document. The first case tends to include terms at the lower rankings. The second is more

FIG. 5. Average precision (AP) of terms selected as a result of weighted random walks as the associative relationship probability increases from 0.0 to 1.0, the narrower relationship probability decreases from 1.0 to 0.0, and the broader relationship probability is held constant at 0.0.


pronounced at higher rankings. The effect on average precision can be positive or negative.

For example, consider the ranked terms for a sample AGRICOLA document in Table 5. Of the five manually assigned terms, only two were found by matching alone. Given the same matched candidate terms as input, the random walk algorithm identifies an additional term, “ambient temperature,” and improves the ranking of the first term. This comes at the expense of significantly reducing the ranking of the fifth term. This example is intended only as a qualitative illustration. In the future, a range of statistical studies across a variety of results would be useful for providing a stronger statistical picture.

Discussion

The results presented provide a unique picture of thesauri and how their properties and relationships influence the subject indexing process. These results support the hypothesis that thesaurus structure contributes to term selection, and that subject indexing can be modeled in part as a browsing process. The results indicate that there are differences between individual thesauri that affect the matching process and the degree to which thesaurus relationships contribute to subject indexing. These results also help with the characterization of the different thesauri. The following discussion addresses these aspects.

FIG. 6. Precision/recall curves of subject vocabulary selected as a result of weighted random walks for (a) AGRICOLA/NALT, (b) FAO/AGROVOC, (c) NLM/MeSH, and (d) CERN/HEP.

TABLE 3. Average Precision Using Match-Only (K = 0), Broader/Narrower Relationships, and Broader/Narrower/Associative Relationships, Including Percent Increase Over Matching.

          AP (K = 0)   AP (pn, pb)   % incr.   AP (pb, pn, pr)   % incr.
AGROVOC   0.1360       0.1665        22%       0.1862            37%
NALT      0.1403       0.1741        24%       0.1901            35%
MeSH      0.1292       0.1352        5%        0.1431            11%
HEP       0.1223       0.1332        9%        0.1332            9%

Note. AP = average precision; AGROVOC = UN Food and Agriculture Organization; HEP = High Energy Physics taxonomy; NALT = National Agricultural Library; MeSH = Medical Subject Headings, National Library of Medicine.


Effect of Differences Between Thesauri on Matching

The matching process we have described relies primarily on the preferred and alternate labels defined in the vocabulary. The primary factors affecting average precision for the matching process are the number of alternate labels and the phrase length. As discussed, the proportion of alternate labels per concept and the frequency of each phrase length differ between the vocabularies. Because the average precision is determined based on the comparison to manually assigned terms, when considering only the matching process (K = 0, with no random walk), a higher average precision indicates that the vocabulary includes preferred and alternate labels that match terms used in documents in the test collection, and that the indexers select from the set of matched terms. For matching only, the highest average precision for each vocabulary is 0.1403 for NALT, 0.1328 for AGROVOC, 0.1236 for MeSH, and 0.1223 for HEP. NALT and AGROVOC include the largest number of concepts among the four vocabularies, and both include some alternate labels. MeSH has a much higher proportion of alternate labels. This suggests that the most important factor in the matching process is how well the controlled vocabulary represents the language and concepts used in the collection, not simply the number of preferred and alternate terms. The low average precision, compared with the other thesauri, resulting from the matching process indicates either that the vocabularies do not represent the language used in the documents or that indexers select terms other than those that match terms in the documents.

Thesauri and controlled vocabularies are sparse and do not include all of the possible terms and phrases used to represent concepts in the document collection. Early research in automatic subject indexing sought to overcome the poor coverage of controlled vocabularies through the creation of intermediate vocabularies. Vleduts-Stokolov (1987) and Silvester et al. (1994) constructed intermediate vocabularies to map common words in the collection that frequently occur with specific indexing terms. The statistical learning approaches described by Field (1975), Leung and Kan (1997), and Plaunt and Norgard (1998) also rely on co-occurrence between words in the document collection and pre-assigned indexing terms from a training set.

Medelyan and Witten (2008) and Medelyan (2009) do not use an intermediate vocabulary, instead relying entirely on matching the document text to indexing terms in the controlled vocabulary and filtering matched candidates. One of the strengths of this approach is that it does not require a large training corpus to learn the relationships between previously assigned controlled vocabulary terms and words in the document collection. Instead, a small training set is used to learn how to filter the matched candidates.

The approach described in this random walk study is similar to that of Medelyan (2009) in that it does not rely on learning the relationship between controlled vocabulary terms and words in the document collection. The matching algorithm is used to match terms in the controlled vocabulary to the document text. However, the random walk process is not limited to producing only matched terms. By using a random walk, the process is able to identify other candidate terms from the controlled vocabulary that may not appear in the document text. This suggests that the current

TABLE 4. Results of Top Random Walk Configurations for Each Vocabulary.

Vocabulary   Min. phr. len.   K    pn    pb    pr    AP
NALT         1                0    0.0   0.0   0.0   0.1403
             1                2    0.8   0.2   0.0   0.1741
             1                2    0.3   0.1   0.6   0.1901*
             2                2    0.0   0.0   0.0   0.1595
             2                2    0.7   0.3   0.0   0.1669
             2                2    0.3   0.4   0.6   0.1829
             3                2    0.0   0.0   0.0   0.1150
             3                2    0.5   0.5   0.0   0.1235
             3                2    0.2   0.3   0.5   0.1344
AGROVOC      1                0    0.0   0.0   0.0   0.1328
             1                2    0.5   0.5   0.0   0.1554
             1                2    0.0   0.2   0.8   0.1674
             2                0    0.0   0.0   0.0   0.1360
             2                2    0.6   0.4   0.0   0.1665
             2                2    0.0   0.2   0.8   0.1862*
             3                0    0.0   0.0   0.0   0.0980
             3                4    0.3   0.7   0.0   0.1175
             3                10   0.0   0.2   0.8   0.1277
MeSH         1                0    0.0   0.0   0.0   0.1236
             1                2    0.8   0.2   0.0   0.1256
             1                2    0.2   0.1   0.7   0.1414
             2                0    0.0   0.0   0.0   0.1292
             2                2    0.7   0.3   0.0   0.1352
             2                2    0.1   0.3   0.6   0.1431*
             3                0    0.0   0.0   0.0   0.1070
             3                2    0.5   0.5   0.0   0.1165
             3                2    0.3   0.4   0.3   0.1207
HEP          1                0    0.0   0.0   0.0   0.1223
             1                10   0.3   0.7   0.0   0.1332*
             2                0    0.0   0.0   0.0   0.1214
             2                10   0.3   0.7   0.0   0.1296
             2                10   0.5   0.5   0.0   0.1314
             3                0    0.0   0.0   0.0   0.1012
             3                10   0.8   0.2   0.0   0.1065
             3                10   0.5   0.4   0.1   0.1072

Note. Asterisks mark the maximum average precision (AP) achieved for each vocabulary. AGROVOC = UN Food and Agriculture Organization; HEP = High Energy Physics taxonomy; NALT = National Agricultural Library; MeSH = Medical Subject Headings, National Library of Medicine.

TABLE 5. Example Term Rankings for AGRICOLA Document 3508441, Including Manual, Match Frequency, and Random Walk Rankings.

Term                    Manual ranking   Frequency ranking   Random walk ranking
Diaprepes abbreviatus   1                20                  1
Embryogenesis           2                -                   -
Ambient temperature     3                -                   5
Linear models           4                -                   -
Mortality               5                13                  346


state-of-the-art in automatic subject indexing could be improved by using the random walk method; by including methods, like the random walk method, that use additional information about the co-occurrence of certain words in the document collection and existing assigned indexing terms; or by methods that expand the thesaurus to better represent the words that occur in the document collection.

Using Thesaurus Structure to Improve Automatic Indexing

The central question guiding this study is whether the structure in a thesaurus can be used to improve the automatic subject indexing process. Most existing approaches to automatic subject indexing rely primarily on term frequency information. The random walk algorithm improves average precision for all vocabularies over matching alone, suggesting that thesaurus structure indeed plays a significant role in improving subject indexing. The degree to which thesaurus structure contributes to performance differs among the vocabularies and document collections. As illustrated in Table 3, the percent increase in average precision using the random walk with broader and narrower relationships over matching alone is 24% for NALT (K = 2, pb = 0.2, pn = 0.8), 22% for AGROVOC (K = 2, pb = 0.4, pn = 0.6), 9% for HEP (K = 10, pb = 0.7, pn = 0.3), and 5% for MeSH (K = 2, pb = 0.3, pn = 0.7).

The contributions of each relationship type to the indexing process for each vocabulary can be seen in the associated probabilities listed in Table 4. Beginning with the broader and narrower relationships only, we can see that for NALT (pb = 0.2, pn = 0.8) and MeSH (pb = 0.3, pn = 0.7), average precision values improved most with a higher probability of following narrower relationships. For AGROVOC (pb = 0.5, pn = 0.5) the probabilities are equal, and for HEP (pb = 0.7, pn = 0.3), the probability of following broader relationships is higher. This suggests that, given a set of matched candidates, NALT and MeSH indexers are more likely to select narrower terms, whereas HEP indexers are more likely to select broader terms, and AGROVOC indexers are equally likely to select broader and narrower terms.

Including the associative relationship presents a different picture. For NALT, with performance maximized at (pb = 0.1, pn = 0.3, pr = 0.6), AGROVOC, with performance maximized at (pb = 0.2, pn = 0.0, pr = 0.8), and MeSH, with performance maximized at (pb = 0.3, pn = 0.1, pr = 0.6), average precision improves with increased probability of incorporating associative relationships. This suggests that indexers select related terms more frequently than either broader or narrower terms based on matched candidates. The weighted random walk process including associative relationships produces a percent increase in average precision over matching alone of 37% for AGROVOC (K = 2, pb = 0.2, pn = 0.0, pr = 0.8), 35% for NALT (K = 2, pb = 0.1, pn = 0.3, pr = 0.6), 11% for MeSH (K = 2, pb = 0.3, pn = 0.1, pr = 0.6), and 9% for HEP (K = 10, pb = 0.7, pn = 0.3, pr = 0.0).

The output of the matching process is the input to the random walk process. The matching process alone with NALT and AGROVOC produced higher average precision than with MeSH or HEP. NALT and AGROVOC also showed the greatest improvement in average precision with the random walk. It is possible that better matching provides higher-quality starting points for the random walk, independent of differences in vocabulary structure. However, the different rates of improvement in average precision might also be explained by differences in thesaurus structure or manual indexing practice. The differences in the effect of thesaurus structure may also in part be explained by poor term coverage in the thesaurus. It is likely that the differences can be attributed to a combination of vocabulary coverage, higher quality matching, differences in vocabulary structure, and differences in manual indexing practice.

Characterizing Thesauri

This study presents a unique picture of thesauri and how their characteristics affect the subject indexing process. We have described the thesauri used in this study, highlighting differences in size, proportions of alternate labels, differences in the number and types of relationships represented, as well as differences in the phrase lengths of terms in the thesauri and of those terms selected by indexers to represent documents in each test collection. The results of the weighted random walk illustrate differences in how the relationships in each thesaurus may contribute to the manual indexing process.

The AGROVOC thesaurus is the second largest of the vocabularies explored in this study, representing 28,175 concepts. It has similar proportions of broader, narrower, and associative relationships, and the third highest proportion of alternate labels per concept. The vocabulary is comprised mainly of 2-grams (59%) and 1-grams (34%). Indexers selected primarily 2-grams (53%) and 1-grams (43%) to describe documents in the FAO test collection. The matching process alone achieved an average precision of 0.1360, the second highest of the vocabularies studied. AGROVOC saw the highest percent increase in average precision (37%) using the weighted random walk. The highest average precision was achieved with a thesaurus-centric matching minimum phrase length of 1, K = 2, pb = 0.2, pn = 0.0, and pr = 0.8. The results of the matching process suggest that AGROVOC has better coverage of terms used in the document collection. The results of the random walk suggest that FAO indexers select terms other than those matched in the text, often “browsing” associative relationships.

The NALT thesaurus has the largest vocabulary explored in this study, representing 48,308 concepts. It has the second highest proportion of alternate labels per concept. NALT has similar numbers of broader and narrower relationships, but a smaller proportion of associative relationships. NALT is composed of mostly 2-grams (53%) and 1-grams (32%), with a small number of longer phrase lengths. Indexers select primarily 2-grams (55%) and 1-grams (38%) to describe documents in the test collection. NALT produced the highest average precision (0.1403) from the matching process alone and saw the second largest improvement in average precision (35%) using the random walk. This increase was achieved using a thesaurus-centric matching minimum phrase length of 1, K = 2, pb = 0.1, pn = 0.3, and pr = 0.6. As with AGROVOC, the results of the matching process suggest that NALT has better coverage of terms used in the test documents and that "browsing" associative and narrower relationships provides the greatest improvement in average precision.

1342 JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—July 2013. DOI: 10.1002/asi

With 25,990 concepts, MeSH is the third largest vocabulary explored in this study. MeSH has the highest proportion of alternate labels per concept. The vocabulary has a similar number of broader and narrower relationships, with a small number of associative relationships. MeSH is polyhierarchical, where a single term can have multiple broader concepts. It is composed of 2-grams (44%), 3-grams (28%), and 1-grams (17%). Indexers select primarily 1-grams (36%), 2-grams (33%), and 3-grams (23%) to describe documents in the test collection. The weighted random walk improved average precision modestly (11%) over matching alone. The greatest improvement in average precision was produced using a thesaurus-centric matching minimum phrase length of 1, K = 2, pb = 0.3, pn = 0.1, and pr = 0.6. As with NALT and AGROVOC, these results suggest that MeSH indexers browse associative terms. However, unlike the other vocabularies, MeSH indexers are more likely to select terms that are broader than the matched terms.

With only 16,018 concepts, HEP is the smallest vocabulary studied here. It also has the lowest proportion of alternate labels. HEP has a similar number of broader and narrower relationships and almost no associative relationships. Like MeSH, HEP is polyhierarchical. The vocabulary is composed mostly of 2-grams (44%), 3-grams (34%), and 1-grams (11%). Indexers select primarily 3-grams (36%), 2-grams (31%), and 1-grams (16%). Unlike the other vocabularies and test collections, average precision decreases with HEP as the probability of selecting a narrower term increases. The weighted random walk improved average precision modestly (9%) over matching alone. The greatest improvement was produced using a thesaurus-centric matching minimum phrase length of 1, K = 10, pb = 0.7, pn = 0.3, and pr = 0.0. Like MeSH, HEP indexers are more likely to select terms that are broader than those found in the document text.

In all cases, shorter minimum phrase lengths outperform longer phrase lengths, although we expected longer phrases to result in higher-quality initial matches. This may be because there are fewer longer phrases in the studied thesauri. This may also suggest that the vocabulary in the thesaurus does not match the language in the document text. Future research may further explore the performance of different phrase lengths and methods for expanding the thesaurus vocabulary based on phrases that occur in the text.
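The parameter settings discussed in the paragraphs above (K, pb, pn, pr) can be made concrete with a small sketch: starting from matched concepts, a walker follows broader, narrower, or related links with the given probabilities, and visit counts rank candidate index terms. This is a simplified illustration under assumed data structures (a dict-based toy graph and ranking by raw visit counts), not the paper's exact algorithm.

```python
import random

# Sketch of a weighted random walk over a toy thesaurus graph.
# Each concept maps to its broader ("BT"), narrower ("NT"), and
# related ("RT") neighbors. pb/pn/pr weight which relationship type
# the walker follows; K is the number of steps per starting concept.
# The graph and scoring-by-visit-count are illustrative assumptions.

graph = {
    "erosion":       {"BT": ["degradation"], "NT": ["gully erosion"], "RT": ["soil"]},
    "degradation":   {"BT": [], "NT": ["erosion"], "RT": []},
    "gully erosion": {"BT": ["erosion"], "NT": [], "RT": []},
    "soil":          {"BT": [], "NT": [], "RT": ["erosion"]},
}

def weighted_walk(start, K, pb, pn, pr, rng):
    """Walk K steps from `start`, choosing a relationship type with
    probabilities pb (broader), pn (narrower), pr (related).
    Returns the concepts visited, including the start."""
    visited = [start]
    current = start
    for _ in range(K):
        rel = rng.choices(["BT", "NT", "RT"], weights=[pb, pn, pr])[0]
        neighbors = graph[current][rel]
        if not neighbors:  # dead end for this relationship type
            continue
        current = rng.choice(neighbors)
        visited.append(current)
    return visited

rng = random.Random(42)  # seeded for reproducibility
# Accumulate visit counts across walks from the matched starting concepts
# (the AGROVOC-style setting K=2, pb=0.2, pn=0.0, pr=0.8 is used here).
counts = {}
for start in ["erosion", "soil"]:
    for concept in weighted_walk(start, K=2, pb=0.2, pn=0.0, pr=0.8, rng=rng):
        counts[concept] = counts.get(concept, 0) + 1
# Rank candidate index terms by visit count.
print(sorted(counts, key=counts.get, reverse=True))
```

With pn = 0.0 the walker never descends to narrower terms, mirroring the AGROVOC result that indexers browse associative links; raising pb instead would reproduce the broader-term preference seen with MeSH and HEP.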

Conclusions and Next Steps

The analysis presented in this article has explored the effect of vocabulary structure on the subject indexing process using a weighted random walk method. We presented a thesaurus-centric matching algorithm and a weighted random walk technique for automatic indexing of documents using thesauri and other structured vocabularies. The novel approach presented here relies on thesaurus structure instead of term frequency or co-occurrence information in automatically indexing documents. This study is intended to further our understanding of thesauri and other structured vocabularies as they are applied in subject indexing.

The thesaurus-centric matching and weighted random walk algorithms were evaluated using four different thesauri and collections of pre-indexed documents. In all cases, the weighted random walk improves performance over matching alone, with an increase in average precision (AP) of 9% for HEP, 11% for MeSH, 35% for NALT, and 37% for AGROVOC. These results support our hypothesis that subject indexing includes a browsing process and that using the vocabulary structure in a thesaurus contributes to the indexing process. The amount that vocabulary structure contributes was found to differ among the four thesauri. This is possibly due to the vocabulary used in the corresponding thesauri as well as the number and types of relationships found between terms. We discussed several possible explanations for these differences and how this information can be used to improve thesauri. We also discussed how the weighted random walk approach could be used to improve existing approaches to automatic indexing.

Future work is looking at how the thesaurus structure can be combined with frequency and co-occurrence methods to provide a general-purpose automatic subject indexing solution that works with multiple vocabularies, and is superior to automatic indexing systems based only on term frequencies or only on thesaurus structures. Additional work is looking at the use of thesaurus structure for disambiguation for vocabularies with high homonymy.

Acknowledgments

This work was conducted while Craig Willis was a student at the University of North Carolina at Chapel Hill and supported in part by Institute of Museum and Library Services (IMLS) grant LG-07-08-0120-08 (Helping Interdisciplinary Vocabulary Engineering—HIVE). We thank Jane Greenberg for her insights into the subject analysis and indexing processes and helpful suggestions on a draft of this paper.

References

Aronson, A.R., Mork, J.G., Gay, C.W., Humphrey, S.M., & Rogers, W. (2004). The NLM Indexing Initiative's medical text indexer. Medinfo, 11(Pt 1), 368–372.

Berg, H.C. (1993). Random walks in biology. Princeton, NJ: Princeton University Press.

Borko, H. (1962). The construction of an empirically based mathematically derived classification system. In Proceedings of the AIEE-IRE '62 (pp. 279–280). New York: ACM Press.


Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Seventh International World-Wide Web Conference (WWW 1998) (pp. 107–117). Amsterdam: Elsevier.

Buscaldi, D., & Rosso, P. (2008). A conceptual density-based approach for the disambiguation of toponyms. International Journal of Geographical Information Science, 22(3), 301–313.

Chan, L.M., Richmond, P.A., & Svenonius, E. (Eds.). (1985). Theory of subject analysis: A sourcebook. Littleton, CO: Libraries Unlimited.

Field, B.J. (1975). Towards automatic indexing: Automatic assignment of controlled-language indexing from free indexing. Journal of Documentation, 31(4), 246–265.

Foskett, A.C. (1996). The subject approach to information (5th ed.). London, England: Library Association Publishing.

Ghahramani, S. (2005). Fundamentals of probability with stochastic processes (3rd ed.). Upper Saddle River, NJ: Pearson.

Greenberg, J., Losee, R., Pérez Agüera, J.R., Scherle, R., White, H., & Willis, C. (2011). HIVE: Helping interdisciplinary vocabulary engineering. Bulletin of the American Society for Information Science and Technology, 37(4), 23–26.

Hlava, M.M.K. (2005). Automatic indexing. Information Outlook, 9(8), 22–23.

Hood, M.W. (1990). AGRICOLA–Guide to subject indexing. Washington, DC: National Agricultural Library.

Humphrey, S.M. (1999). Automatic indexing of documents from journal descriptors: A preliminary investigation. Journal of the American Society for Information Science, 50(8), 661–674.

Humphrey, S.M., & Miller, N.E. (1987). Knowledge-based indexing of the medical literature: The indexing aid project. Journal of the American Society for Information Science, 38(3), 184–196.

International Organization for Standardization. (1985). ISO 5963–Documentation–Methods for examining documents, determining their subjects, and selecting indexing terms. Geneva, Switzerland: Author.

Klafter, J., & Sokolov, I.M. (2011). First steps in random walks: From tools to applications. New York: Oxford University Press.

Klingbiel, P.H. (1969). Machine-aided indexing (Technical Report DDC-TR-69-1). Fort Belvoir, VA: Defense Technical Information Center.

Lancaster, F.W. (2003). Indexing and abstracting in theory and practice (3rd ed.). London, England: Facet Publishing.

Leidner, J.L., & Lieberman, M.D. (2011). Detecting geographical references in the form of place names and associated spatial natural language. SIGSPATIAL Special, 3(2), 5–11.

Leung, C.-H., & Kan, W.-K. (1997). A statistical learning approach to automatic indexing of controlled index terms. Journal of the American Society for Information Science, 48(1), 55–66.

Losee, R.M. (2004). A performance model of the length and number of subject headings and index phrases. Knowledge Organization, 31(4), 245–251.

Losee, R.M. (2006). Is 1 noun worth 2 adjectives? Measuring the relative feature utility. Information Processing & Management, 42(5), 1248–1259.

Losee, R.M. (2007). Decisions in thesaurus construction and use. Information Processing & Management, 43(4), 958–968.

Malkiel, B.G. (1999). A random walk down Wall Street. New York: W.W. Norton.

Maron, M.E. (1961). Automatic indexing: An experimental inquiry. Journal of the ACM, 8(3), 404–417.

Martins, B., & Silva, M. (2005). A graph-ranking algorithm for geo-referencing documents. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05) (pp. 741–744). Piscataway, NJ: IEEE.

Medelyan, O. (2009). Human-competitive automatic topic indexing. The University of Waikato. Retrieved from http://researchcommons.waikato.ac.nz/handle/10289/3513

Medelyan, O., & Witten, I.H. (2008). Domain independent automatic keyphrase indexing with small training sets. Journal of the American Society for Information Science and Technology, 59(7), 1026–1040.

Mihalcea, R., & Csomai, A. (2007). Linking documents to encyclopedic knowledge. In Proceedings of the ACM Conference on Information and Knowledge Management (pp. 233–242). New York: ACM Press.

Mihalcea, R., & Radev, D. (2011). Graph-based natural language processing and information retrieval. New York: Cambridge University Press.

Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing order into texts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2004) (pp. 404–411). Stroudsburg, PA: Association for Computational Linguistics.

Névéol, A., Shooshan, S.E., Humphrey, S.M., Mork, J.G., & Aronson, A.R. (2009). A recent advance in the automatic indexing of biomedical literature. Journal of Biomedical Informatics, 42, 814–823.

Overell, S., & Rüger, S. (2008). Using co-occurrence models for placename disambiguation. International Journal of Geographical Information Science, 22(3), 265–287.

Pearson, K. (1905). The problem of the random walk. Nature, 72(1865), 294.

Plaunt, C., & Norgard, B.A. (1998). An association-based method for automatic indexing with a controlled vocabulary. Journal of the American Society for Information Science, 49(10), 888–902.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1). Retrieved from http://dx.doi.org/10.1145/505282.505283

Silvester, J.P., Genuardi, M.T., & Klingbiel, P. (1994). Machine-aided indexing at NASA. Information Processing & Management, 30(5), 631–645.

Stevens, M.E., & Urban, G.H. (1964). Training a computer to assign descriptors to documents: Experiments in automatic indexing. In Proceedings of the AFIPS '64 (pp. 563–575). New York: ACM Press.

Vleduts-Stokolov, N. (1987). Concept recognition in an automatic text processing system for the life sciences. Journal of the American Society for Information Science, 38(4), 269–287.
