An Adaptive Information Retrieval System based on ...dieter/research/publications/apccm04-1.pdf · one: “multilingual information retrieval is informa-tion retrieval in any language

An Adaptive Information Retrieval Systembased on Associative Networks

Helmut Berger1 Michael Dittenbach1 Dieter Merkl1,2

1E-Commerce Competence Center – EC3,Donau-City-Straße 1, A–1220 Wien, Austria

2Research Group of Industrial Software Engineering (RISE),Favoritenstraße 9–11/188, A–1040 Wien, Austria

Email: {helmut.berger, michael.dittenbach, dieter.merkl}@ec3.at

Abstract

In this paper we present a multilingual informationretrieval system that provides access to Tourism in-formation by exploiting the intuitiveness of naturallanguage. In particular, we describe the knowledgerepresentation model underlying the information re-trieval system. This knowledge representation ap-proach is based on associative networks and allows thedefinition of semantic relationships between domain-intrinsic information items. The network structureis used to define weighted associations between in-formation items and augments the system with afuzzy search strategy. This particular search strat-egy is performed by a constrained spreading activa-tion algorithm that implements information retrievalon associative networks. Strictly speaking, we takethe relatedness of terms into account and show, howthis fuzzy search strategy yields beneficial results and,moreover, determines highly associated matches tousers’ queries. Thus, the combination of the associa-tive network and the constrained spreading activationapproach constitutes a search algorithm that evalu-ates the relatedness of terms and, therefore, providesa means for implicit query expansion.

Keywords: knowledge representation, associative net-works, constrained spreading activation, natural lan-guage information retrieval.

1 Introduction

Providing easy and intuitive access to informationstill remains a challenge in the area of information sys-tem research and development. Moreover, as Van Ri-jsbergen (1979) points out, the amount of availableinformation is increasing rapidly and offering accu-rate and speedy access to this information is becom-ing ever more difficult. This quote, although about20 years old, is still valid nowadays if you consider theamount of information offered on the Internet. Buthow to address these problems? How to overcomethe limitations associated with conventional search in-terfaces? Furthermore, users of information retrievalsystems are often computer illiterate and not famil-iar with the required logic for formulating appropri-ate queries, e.g. the burdens associated with Booleanlogic. This goes hand in hand with the urge to un-

Copyright c©2004, Australian Computer Society, Inc. This pa-per appeared at First Asia-Pacific Conference on ConceptualModelling (APCCM 2004), Dunedin, New Zealand. Confer-ences in Research and Practice in Information Technology, Vol.32. Sven Hartmann and John Roddick, Ed. Reproduction foracademic, not-for profit purposes permitted provided this textis included.

derstand what users really want to know from infor-mation retrieval systems.

Standard information retrieval interfaces consist ofcheck boxes, predefined option sets or selection listsforcing users to express her or his needs in a very re-stricted manner. Therefore, an approach leaving themeans of expression in users’ hands, narrows the gapbetween users’ needs and interfaces used to expressthese needs. An approach addressing this particu-lar problem is to allow query formulation in naturallanguage. Natural language interfaces offer easy andintuitive access to information sources and users canexpress their information needs in their own words.

Hence, we present a multilingual information re-trieval system allowing for query formulation in nat-ural language. To reduce word sense ambiguities thesystem operates on a restricted domain. In particular,the system provides access to tourism information,like accommodations and their amenities throughoutAustria.

However, the core element of the information re-trieval system remains the underlying knowledge rep-resentation model. In order to provide a knowl-edge representation model allowing to define relationsamong information items, an approach based on anetwork structure, namely an associative network, isused. More precisely, this associative network in-corporates a means for knowledge representation al-lowing for the definition of semantic relationships ofdomain-intrinsic information. Processing the networkand, therefore, result determination is accomplishedby a technique refereed to as spreading activation.Some nodes of the network act as sources of activa-tion and, subsequently, activation is propagated toadjacent nodes via weighted links. These newly acti-vated nodes, in turn, transmit activation to associatednodes, and so on.

We introduce a knowledge representation ap-proach based on an associative network consistingof three layers. Moreover, a constrained spreadingactivation algorithm implements a processing tech-nique that operates on the network. Due to the net-work structure of the knowledge representation modeland the processing technique, implicit query expan-sion enriches the result set with additional matches.Hence, a fuzzy search strategy is implemented.

The remainder of the paper is organized as fol-lows. In Section 2 we review the architecture of theinformation retrieval system that acts as a basis forthe redeveloped approach presented herein. More-over, Section 3 gives an overview about associativenetworks and we present an algorithm for processingsuch networks, i.e. spreading activation. In Section 4we describe our approach based on associative net-works and finally, some conclusions are given in Sec-tion 5.

2 AD.M.IN. – A Natural Language Informa-tion Retrieval System

Crestani (1997) points out that information retrievalis a science that aims to store and allow fast accessto a large amount of data. In contrast to conven-tional database systems, an information retrieval sys-tem does not provide an exact answer to a query buttries to produce a ranking that reflects the intentionof the user. More precisely, documents are ranked ac-cording to statistical similarities based on the occur-rence frequency of terms in queries and documents.The occurrence frequency of a term provides an indi-cator of the significance of this term. Moreover, in or-der to get a measure for determining the significanceof a sentence, the position of terms within a sentenceis taken into account and evaluated. For comprehen-sive reports about information retrieval see Salton& McGill (1983), Salton (1989) and Baeza-Yates &Ribeiro-Neto (1999).

In order to adapt information retrieval systems tothe multilingual demands of users, great efforts havebeen made in the field of multilingual information re-trieval. Hull & Grafenstette (1996) subsume severalattempts to define multilingual information retrieval,where Harman (1995) formulates the most conciseone: “multilingual information retrieval is informa-tion retrieval in any language other than English”.

Multilingual information retrieval systems have tobe augmented by mechanisms for query or documenttranslation to support query formulation in multiplelanguages. Information retrieval is such an inexactdiscipline that it is not clear whether or not querytranslation is necessary or even optimal for identify-ing relevant documents and, therefore, to determineappropriate matches to the user query. Strictly speak-ing, the process of translating documents or queriesrepresents one of the main barriers in multilingual in-formation retrieval.

Due to the shortness of user queries, query trans-lation introduces ambiguities that are hard to over-come. Contrarily, resolving ambiguity in documenttranslation is easier to handle because of the quan-tity of text available. Nevertheless, state-of-the-artmachine translation systems provide only an insuf-ficient means for translating documents. Therefore,resolving ambiguities associated with translations re-mains a crucial task in the field of multilingual in-formation retrieval. Ballesteros & Croft (1998), forinstance, present a technique based on co-occurrencestatistics from unlinked text corpora which can beused to reduce the ambiguity associated with transla-tions. Furthermore, a quite straightforward approachin reducing ambiguities is to restrict the domain amultilingual information retrieval system operates on.

Xu, Netter & Stenzhorn (2000) describe an in-formation retrieval system that aims at providinguniform multilingual access to heterogeneous datasources on the web. The MIETTA (MultilingualTourist Information on the World Wide Web) sys-tem has been applied to the tourism domain con-taining information about three European regions,namely Saarland, Turku, and Rome. The languagessupported are English, Finnish, French, German, andItalian. Since some of the tourism information aboutthe regions were available in only one language, ma-chine translation was used to deal with these webdocuments. Due to the restricted domain, automatictranslation should suffice to understand the basicmeaning of the translated document without havingknowledge of the source language. Users can querythe system in various ways, such as free text queries,form-based queries, or browsing through the concepthierarchy employed in the system. MIETTA makesit transparent to the users whether they search in a

database or a free-form document collection.

2.1 The Architecture of the Original System

The software architecture of the natural language in-formation retrieval system is designed as a pipelinestructure. Hence, successively activated pipeline el-ements apply transformations on natural languagequeries that are posed via arbitrary client devices,such as, for instance, web browsers, PDAs or mobilephones. Due to the flexibility of this approach, differ-ent pipeline layouts can be used to implement differ-ent processing strategies. Figure 1 depicts the layoutof the software architecture and illustrates the way ofinteraction of the pipeline elements.

In a first step, the natural language query is evalu-ated by an automatic language identification moduleto determine the language of the query. Next, the sys-tem corrects typographic errors and misspellings toimprove retrieval performance. Before adding gram-mar rules and semantic information to the queryterms, a converter transforms numerals to their nu-meric equivalents. Depending on the rules assigned tothe query terms, a mapping process associates theseterms with SQL fragments that represent the queryin a formal way. Due to the fact that the systemuses a relational database as backend this mappingprocess is crucial. In a next step the SQL fragmentsare combined according to the modifiers (e.g. “and”,“or”, “near”, “not”) identified in the query and a sin-gle SQL statement that reflects the intention of thequery is obtained. Then the system determines theappropriate result and generates an XML representa-tion for further processing. Finally, the XML resultset is adapted to fit the needs of the client device.

The remainder of this section gives a brief outlineof the system.

2.1.1 The Knowledge Base

A major objective of the Ad.M.In.(Adaptive Multi-lingual Interfaces) system was to separate the pro-gram logic from domain dependent data. In particu-lar, language, domain and device dependent portionsare placed in the knowledge base. Thus, the knowl-edge base represents the backbone of the system andconsists of a relational database and a set of ontolo-gies. The database stores information about domainentities, as, for instance, amenities of accommoda-tions. The ontologies store synonyms, define semanticrelations and grammar rules.

Basically, the knowledge base consists of separateXML files, whereas the synonym ontology is used toassociate terms having the same semantic meaning,i.e. describes linguistic relationships like synonymy.The synonym ontology is based on a flat structure,allowing to define synonymy. Taking a look at thetourism domain, “playground” represents a conceptpossessing several semantic equivalents, as, for in-stance, “court”.

Unfortunately, the synonym ontology provides nomeans to associate concepts. Consider, for example,the three concepts “sauna”, “steam bath” and “vege-tarian kitchen”. Straightforward, someone might de-rive a stronger degree of relatedness between the con-cepts “sauna” and “steam bath” as between “sauna”and “vegetarian kitchen”.

The second component of the knowledge basestores a set of grammar rules. More precisely, alightweight grammar describes how certain conceptsmay be modified by prepositions, adverbial or adjecti-val structures that are also specified in the synonymontology. For a more detailed description we referto Berger (2001).

Figure 1: Software Architecture

2.1.2 Language Identification

To identify the language of a query, an n-gram-basedtext classification approach (cf. Cavnar & Trenkle(1994)) is used. An n-gram is an n-character slice ofa longer character string. As an example, for n = 3,the tri–grams of the string “language” are: { la, lan,ang, ngu, gua, uag, age, ge }. Dealing with mul-tiple words in a string, the blank character is usu-ally replaced by an underscore “ ” and is also takeninto account for the construction of an n-gram docu-ment representation. This language classification ap-proach using n-grams requires sample texts for eachlanguage to build statistical models, i.e. n-gram fre-quency profiles, of the languages. We used varioustourism-related texts, e.g. hotel descriptions and holi-day package descriptions, as well as news articles bothin English and German language. The n-grams, withn ranging from 1...5, of these sample texts were an-alyzed and sorted in descending order according totheir frequency, separately for each language. Thesesorted histograms are the n-gram frequency profilesfor a given language. For a comprehensive descrip-tion see Berger, Dittenbach & Merkl (2003).

2.1.3 Error Correction

To improve retrieval performance, potential ortho-graphic errors have to be considered in the web-basedinterface. After identifying the language, a spell-checking module is used to determine the correctnessof query terms. The efficiency of the spell checkingprocess improves during the runtime of the systemby learning from previous queries. The spell checkeruses the metaphone algorithm (cf. Philips (1990)) totransform the words into their soundalikes. Becausethis algorithm has originally been developed for theEnglish language, the rule set defining the mapping ofwords to the phonetic code has to be adapted for otherlanguages. In addition to the base dictionary of thespell checker, domain-dependent words and propernames like names of cities, regions or states, haveto be added to the dictionary. For every misspelledterm of the query, a list of potentially correct wordsis returned. First, the misspelled word is mapped toits metaphone equivalent, then the words in the dic-tionary, whose metaphone translations have at mostan edit distance (cf. Levenshtein (1966)) of two, are

added to the list of suggested words. The suggestionsare ranked according to the mean of first, the editdistance between the misspelled word and the sug-gested word, and second, the edit distance betweenthe misspelled word’s metaphone and the suggestedword’s. The smaller this value is for a suggestion, themore likely it is to be the correct substitution fromthe orthographic or phonetic point of view. However,this ranking does not take domain-specific informa-tion into account. Because of this deficiency, correctlyspelled words in queries are stored and their respec-tive number of occurrences is counted. The wordsin the suggestion list for a misspelled query term arelooked up in this repository and the suggested wordhaving the highest number of occurrences is chosenas the replacement of the erroneous original queryterm. In case of two or more words having the samenumber of occurrences the word that is ranked firstis selected. If the query term is not present in therepository up to this moment, it is replaced by thefirst suggestion, i.e. the word being phonetically ororthographically closest. Therefore, suggested wordsthat are very similar to the misspelled word, yet makeno sense in the context of the application domain,might be rejected as replacements. Consequently, theword correction process described above is improvedby dynamic adaptation to past knowledge.

Another important issue in interpreting the nat-ural language query is to detect terms consisting ofmultiple words. Proper names like “Bad Kleinkirch-heim” or nouns like “parking garage” have to betreated as one element of the query. Therefore, allmulti-word denominations known to the system arestored in an efficient data structure allowing to iden-tify such cases. More precisely, regular expressionsare used to describe rules applied during the identifi-cation process.

2.1.4 SQL Mapping

With the underlying relational database managementsystem PostgreSQL, the natural language query hasto be transformed into a SQL statement to retrievethe requested information. As mentioned above theknowledge base describes parameterized SQL frag-ments that are used to build a single SQL statementrepresenting the natural language query. The queryterms are tagged with class information, i.e. the rel-

evant concepts of the domain (e.g. “hotel” as a typeof accommodation or “sauna” as a facility providedby a hotel), numerals or modifying terms like “not”,“at least”, “close to” or “in”. If none of the classesspecified in the ontology can be applied, the databasetables containing proper names have to be searched.If a noun is found in one of these tables, it is taggedwith the respective table’s name, such that ”Tyrol”will be marked as a federal state. In the next step,this class information is used by the grammar to se-lect the appropriate SQL fragments. Finally, the SQLfragments have to be combined to a single SQL state-ment reflecting the natural language query of theuser. The operators combining the SQL fragmentsare again chosen according to the definitions in thegrammar.

3 Associative Networks

Quillian (1968) introduced the basic principle of asemantic network and it played, since then, a cen-tral role in knowledge representation. The buildingblocks of semantic networks are, first, nodes that ex-press knowledge in terms of concepts, second, conceptproperties, and third,the hierarchical sub-super classrelationship between these concepts.

Each concept in a semantic network represents asemantic entity. Associations between concepts de-scribe the hierarchical relationship between these se-mantic entities via is-a or instance-of links. Thehigher a concept moves up in the hierarchy along is-a relations, the more abstract is its semantic mean-ing. Properties are attached to concepts and, there-fore, properties are also represented by concepts andlinked to nodes via labeled associations. Furthermore,a property that is linked to a high-level concept is in-herited by all descendants of the concept. Hence, it isassumed that the property applies to all subsequentnodes. An example of a semantic network is depictedin Figure 2.

Semantic networks initially emerged in cognitivepsychology and the term itself has been used in thefield of knowledge representation in a far more gen-eral sense than described above. In particular, theterm semantic network has been commonly used torefer to a conceptual approach known as associativenetwork. An associative network defines a genericnetwork which consists of nodes representing infor-mation items (semantic entities) and associations be-tween nodes, that express, not necessarily defined orlabeled, relations among nodes. Links between par-ticular nodes might be weighted to determine thestrength of connectivity.

3.1 Spreading Activation

A commonly used technique, which implements infor-mation retrieval on semantic or associative networks,is often referred to as spreading activation. Thespreading activation processing paradigm is tight-knitwith the supposed mode of operation of human mem-ory. It was introduced to the field of artificial intel-ligence to obtain a means of processing semantic orassociative networks. The algorithm, which under-lies the spreading activation (SA) paradigm, is basedon a quite simple approach and operates on a datastructure that reflects the relationships between in-formation items. Thus, nodes model real world en-tities and links between these nodes define the re-latedness of entities. Furthermore, links might pos-sess, first, a specific direction, second, a label and,third, a weight that reflects the degree of association.This conceptual approach allows for the definition ofa more general, a more generic network than the basic

structure of a semantic network demands. Neverthe-less, it could be used to model a semantic network aswell as a more generic one, for instance an associativenetwork.

The idea, underlying spreading activation, is topropagate activation starting from source nodes viaweighted links over the network. More precisely, theprocess of propagating activation from one node toadjacent nodes is called a pulse. The SA algorithmis based on an iterative approach that is divided intotwo steps: first, one or more pulses are triggered and,second, a termination check determines if the processhas to continue or to halt.

Furthermore, a single pulse consists of a pre-adjustment phase, the spreading process and a post-adjustment phase. The optional pre- and post-adjustment phases might incorporate a means of ac-tivation decay, or to avoid reactivation from previouspulses. Strictly speaking, these two phases are usedto gain more control over the network. The spreadingphase implements propagation of activation over thenetwork. Spreading activation works according to theformula:

Ij(p) =k∑i

(Oi(p− 1) · wij) (1)

Each node j determines the total input Ij at pulsep of all linked nodes. Therefore, the output Oi(p− 1)at the previous pulse p−1 of node i is multiplied withthe associated weight wij of the link connecting node ito node j and the grand total for all k connected nodesis calculated. Inputs or weights can be expressed bybinary values (0/1), inhibitory or reinforcing values(-1/+1), or real values defining the strength of theconnection between nodes. More precisely, the firsttwo options are used in the application of semanticnetworks, the latter one is commonly used for asso-ciative networks. This is due to the fact that the typeof association does not necessarily have some exactsemantic meaning. The weight rather describes therelationship between nodes. Furthermore, the outputvalue of a node has to be determined. In most cases,no distinction is made between the input value andthe activation level of a node, i.e. the input value of anode and its activation level are equal. Before firingthe activation to adjacent nodes a function calculatesthe output depending on the activation level of thenode:

Oi = f(Ii) (2)

Various functions can be used to determine theoutput value of a node, for instance the sigmoid func-tion, or a linear activation function, but most com-monly used is the threshold function. The thresholdfunction determines, if a node is considered to be ac-tive or not, i.e. the activation level of each node iscompared to the threshold value. If the activationlevel exceeds the threshold, the state of the node isset to active. Subsequent to the calculation of theactivation state, the output value is propagated toadjacent nodes. Normally, the same output value issent to all adjacent nodes. The process describedabove is repeated, pulse after pulse, and activationspreads through the network and activates more andmore nodes until a termination condition is met. Fi-nally, the SA process halts and a final activation stateis obtained. Depending on the application’s task theactivation levels are evaluated and interpreted accord-ingly.

Figure 2: A semantic network example of tourism-related terms

3.2 Taming Spreading Activation

Unfortunately, the basic approach of spreading acti-vation entails some major drawbacks. Strictly speak-ing, without appropriate control, activation might bepropagated all over the network. Furthermore, the se-mantics of labeled associations are not incorporatedin SA and it is quite difficult to integrate an infer-ence mechanism based on the semantics of associa-tions. To overcome these undesired side-effects theintegration of constraints helps to tame the spread-ing process (cf. Crestani (1997)). Some constraintscommonly used are described as follows.

• Fan-out constraint: Nodes with a broad se-mantic meaning possess a vast number of links toadjacent nodes. This circumstance implies thatsuch nodes activate large areas of the network.Therefore, activation should diminish at nodeswith a high degree of connectivity to avoid thisunwanted effect.

• Distance constraint: The basic idea under-lying this constraint is, that activation ceaseswhen it reaches nodes far away from the acti-vation source. Thus, the term far correspondsto the number of links over which activation wasspread, i.e. the greater the distance between twonodes, the weaker is their semantic relationship.According to the distance of two nodes their rela-tion can be classified. Directly connected nodesshare a first order relation. Two nodes connectedvia an intermediate node are associated by a sec-ond order relation, and so on.

• Activation constraint: Threshold values areassigned to nodes (it is not necessary to applythe same value to all nodes) and are interpretedby the threshold function. Moreover, thresholdvalues can be adapted during the spreading pro-cess in relation to the total amount of activity inthe network.

• Path constraint: Usually, activation spreadsover the network using all available links. The in-tegration of preferred paths allows to direct acti-vation according to application-dependent rules.

Another enhancement of the spreading activationmodel is the integration of a feedback process. Theactivation level of some nodes or the entire networkis evaluated by, for instance, another process or bya user. More precisely, a user checks the activationlevel of some nodes and adapts them according toher or his needs. Subsequently, activation spreadsdepending on the user refinement. Additionally, usersmay indicate preferred paths for spreading activationand, therefore, are able to adapt the spreading processto their own needs.

4 Recommendation via Spreading Activation

One of the first information retrieval systems usingconstrained spreading activation was GRANT. Kjeld-sen & Cohen (1987) developed a system that han-dles information about research proposals and poten-tial funding agencies. GRANT’s domain knowledge isstored in a highly associated semantic network. Thesearch process is carried out by constrained spreadingactivation over the network. In particular, the systemextensively uses path constraints in the form of pathendorsement. GRANT can be considered as an infer-ence system applying repeatedly the same inferenceschema:

IF x AND R(x, y) → y (3)

R(x, y) represents a path between two nodes x and y.This inference rule can be interpreted as follows: “ifa founding agency is interested in topic x and thereis a relation between topic x and topic y then thefounding agency might be interested in the relatedtopic y.”

Croft, Lucia, Crigean & Willet (1989) developedan information retrieval system initially intended tostudy the possibility of retrieving documents by plau-sible inference. In order to implement plausible in-ference constrained spreading activation was chosenaccidently. The I3R system acts as a search inter-mediary (cf. Croft & Thompson (1987)). To accom-plish this task the system uses domain knowledge torefine user queries, determines the appropriate searchstrategy, assists the user in evaluating the output andreformulating the query. In its initial version, the do-main knowledge was represented using a tree struc-ture of concepts. The design was later refined to meetthe requirements of a semantic network.

Belew (1989) investigated the use of connection-ist techniques in an information retrieval systemcalled Adaptive Information Retrieval (AIR). Thesystem handles information about scientific publica-tions, like the publication title and the author. AIRuses a weighted graph as knowledge representationparadigm. For each document, author and keyword(keywords are words found in publication titles) anode is created and associations between nodes areconstructed from an initial representation of docu-ments and attributes. A user’s query causes initialactivity to be placed on some nodes of the network.This activity is propagated to other nodes until cer-tain conditions are met. Nodes with the highest levelof activation represent the answer to the query by theAIR system. Furthermore, users are allowed to assigna degree of relevance to the results (++,+,−,−−).This causes new links to be created and the adap-tation of weights between existing links. Moreover,

feedback is averaged across the judgments of manyusers.

A mentionable aspect of the AIR system is that noprovision is made for the traditional Boolean opera-tors like AND and OR. Rather, AIR emulates theselogical operations because “the point is that the dif-ference between AND and OR is a matter of degree”.This insight goes back to Von Neumann (as pointedout by Belew (1989)).

A system based on a combination of an ostensiveapproach with the associative retrieval approach is de-scribed in Crestani & Lee (2000). In the WebSCSA(Web Searching by Constrained Spreading Activa-tion) approach a query does not consist of keywords.Instead, the system is based on an ostensive approachand assumes that the user has already identified rel-evant Web pages that act as a basis for the followingretrieval process. Subsequently, relevant pages areparsed for links and they are followed to search forother relevant associated pages. The user does not ex-plicitly refine the query. More precisely, users point toa number of relevant pages to initiate a query and theWebSCSA system combines the content of these pagesto build a search profile. In contrast to conventionalsearch engines WebSCSA does not make use of exten-sive indices during the search process. Strictly speak-ing, it retrieves relevant information only by navigat-ing the Web at the time the user searches for informa-tion. The navigation is processed and controlled bymeans of a constrained spreading activation model.In order to unleash the power of WebSCSA the sys-tem should be used when users already have a pointto start for her or his search. Pragmatically speaking,the intention of WebSCSA is to enhance conventionalsearch engines, use these as starting points and notto compete with them.

Hartmann & Strothotte (2002) focus on a spread-ing activation approach to automatically find associ-ations between text passages and multimedia mate-rial like illustrations, animations, sounds, and videos.Moreover, a media-independent formal representationof the underlying knowledge is used to automaticallyadapt illustrations to the contents of small text seg-ments. The system contains a hierarchical represen-tation of basic anatomic concepts such as bones, mus-cles, articulations, tendons, as well as their parts andregions.

Network structures provide a flexible model foradaptation and integration of additional informationitems. Nevertheless, Crestani (1997) points out that”... the problem of building a network which effec-tively represents the useful relations (in terms of theIRs aims) has always been the critical point of manyof the attempts to use SA in IR. These networks arevery difficult to build, to maintain and keep up to date.Their construction requires in depth application do-main knowledge that only experts in the applicationdomain can provide.”

Dittenbach, Merkl & Berger (2003) present an ap-proach based on neural networks for organizing wordsof a specific domain according to their semantic re-lations. A two-dimensional map is used to displaysemantically similar words in spatially regions. Thisrepresentation can support the construction and en-richment of information stored in the associative net-work.

4.1 The Redeveloped System Architecture

To overcome the limitations of the knowledge base ofthe original system, the development of an alterna-tive approach to model domain knowledge was nec-essary. Basically, the unassociated, non-hierarchicknowledge representation model inhibits the powerof the system. Strictly speaking, the original system

failed to retrieve results on a fuzzy basis, i.e. the re-sults determined by the system provide exact matchesonly, without respect to first, interactions users madeduring past sessions, second, personal preferences ofusers, third, semantic relations of domain intrinsic in-formation, and fourth, locational interdependencies.

In order to adapt the system architecture accord-ingly, an approach based on associative networks wasdeveloped. This associative network replaces the flatsynonym ontology used in the original system. More-over, both the grammar rules and the SQL frag-ments have been removed from the knowledge base.More precisely, the functionality and logic is now cov-ered by newly developed pipeline elements or implic-itly resolved by the associative network. In anal-ogy to the original pipeline, the first three process-ing steps are accomplished. Next, a newly imple-mented initialization–element associates concepts ex-tracted from the query with nodes of the associativenetwork. These nodes act as activation sources. Sub-sequently, the newly designed spreading–element im-plements the process of activation propagation. Fi-nally, the new evaluation–element analyzes the ac-tivation level of the associative network determinedduring the spreading phase and produces a rankingaccording to this activation level.

4.1.1 The Knowledge Representation Model

Basically, the knowledge base of the information re-trieval system is composed of two major parts: first,a relational database that stores information aboutdomain entities and, second, a data structure basedon an associative network that models the relation-ships among terms relevant to the domain. Each do-main entity is described by a freely definable set of at-tributes. To provide a flexible and extensible meansfor specifying entity attributes, these attributes areorganized as <name, value> pairs. An example fromthe tourism domain is depicted in Table 1.

Hotel Wellnesshof

category 4facility saunafacility solariumfacility ...

Table 1: <name,value>-pair example for entity “HotelWellnesshof”

The associative network consists of a set of nodesand each node represents an information item. More-over, each node is member of one of three logical lay-ers defined as follows:

• Abstraction layer: One objective of the rede-velopment of the knowledge base was to integrateinformation items with abstract semantic mean-ing. More precisely, in contrast to the knowl-edge base used in the original system which onlysupported modeling of entity attributes, the newapproach allows the integration of a broader setof terms, e.g. terms like “wellness” or “summeractivities” that virtually combine several infor-mation items.

• Conceptual layer: The second layer is used toassociate entity attributes according to their se-mantic relationship. Thus, each entity attributehas a representation at the conceptual layer. Fur-thermore, the strengths of the relationships be-tween information items are expressed by a realvalue associated with the link.

• Entity layer: Finally, the entity layer asso-ciates entities with information items (entity at-tributes) of the conceptual layer, e.g. an entitypossessing the attribute “sauna” is associatedwith the sauna–node of the conceptual layer.

The building blocks of the network are concepts.A concept represents an information item possessingseveral semantically equivalent terms, i.e. synonyms,in different languages. Each concept possesses one ofthree different roles:

• Concrete concepts are used to represent in-formation items at the conceptual layer. Moreprecisely, concrete concepts refer to entity at-tributes.

• Concepts with an abstract role refer to terms atthe abstraction layer.

• Finally, the modifier role is used to categorizeconcepts that alter the processing rules for ab-stract or concrete concepts. A modifier like, forinstance, “not” allows the exclusion of conceptsby negation of the assigned initialization value.

Moreover, concepts provide, depending on theirrole, a method for expressing relationships amongthem. The connectedTo relation defines a bidirec-tional weighted link between two concrete concepts,e.g. the concrete concept “sauna” is linked to “steambath”. The second relation used to link informationitems is the parentOf association. It is used to expressthe sub-super class relationship between abstract con-cepts or concrete and abstract concepts.

A set of concepts representing a particular domainis described in a single XML file and acts as inputsource for the information retrieval system. Duringinitialization, the application parses the XML file, in-stantiates all concepts, generates a list of synonymspointing at corresponding concepts, associates con-cepts according to their relations and, finally, links theentities to concrete concepts. Currently, the associa-tive network consists of about 2,200 concepts, 10,000links and more than 13,000 entities. The concept net-work includes terms that describe the tourism domainas well as towns, cities and federal states throughoutAustria.

To get a better picture of the interdependenciesassociated with the layers introduced above see Fig-ure 3. Each layer holds a specific set of concepts.Abstract concepts associate concepts at the same orat the conceptual layer. Concepts at the conceptuallayer define links between entity attributes and asso-ciate these attributes with entities at the entity layer.Finally, entities are placed at the lowest layer, theentity layer. Concepts at the entity layer are not as-sociated with items at the same layer. Consider, forexample, the abstract concept “indoor sports” andthe concept “sauna” as concepts from which activa-tion originates from. First, activation is propagatedbetween the abstraction layer to the conceptual layervia the dashed line from “indoor sports” to “table ten-nis”. We shall note, that dashed lines indicate linksbetween concepts of different layers. Thus, “sauna”and “table tennis” act as source concepts and, more-over, activation is spread through the network alonglinks at the conceptual layer. Activation received byconcepts at the conceptual layer is propagated to theentities at the entity layer stimulating, in this partic-ular case, the entities “Hotel Stams”, “Hotel Thaya”as well as “Wachauerhof”. Moreover, a fraction ofactivation is propagated to adjacent concept nodes atthe conceptual layer, i.e. “solarium”, “whirlpool” aswell as “tennis”, and to entities, i.e. “Hotel Wiental”and “Forellenhof”, respectively.

4.1.2 Processing the Associative Network

Due to the flexibility and adaptivity of the originalsystem, the integration of the redesigned parts hasbeen accomplished with relatively little effort. In par-ticular, the existing knowledge base has been replacedby the associative network and additional pipeline el-ements to implement spreading activation have beenincorporated.

Figure 4 depicts the redeveloped knowledge baseon which the processing algorithm operates. Theconceptual layer stores concrete concepts and theweighted links among them. Associating abstractconcepts with concrete concepts is done at the ab-straction layer. Each entity has a unique identifierthat is equivalent to the entity identifier stored in therelational database. Furthermore, entities are con-nected to concepts at the conceptual layer. Moreprecisely, an entity is connected to all attributes itpossesses. As an example consider the entity “Ho-tel Stams” as depicted in Figure 4. This hotel offersa “sauna”, a “steam bath” and a “solarium” and is,therefore, linked to the corresponding concepts at theconceptual layer.

First, a user’s query, received by the informationretrieval system, is decomposed into single terms. Af-ter applying an error correction mechanism and aphrase detection algorithm to the query, terms foundin the synonym lexicon are linked to their correspond-ing concept at the abstraction or conceptual layer.These concepts act as activation sources and, sub-sequently, the activation process is initiated and ac-tivation spreads according to the algorithm outlinedbelow.

At the beginning, the role of each concept is eval-uated. Depending on its role, different initializationstrategies are applied:

• Modifier role: In case of the “not” modifier,the initialization value of the subsequent conceptis multiplied with a negative number. Due to thefact that the “and” and “or” modifiers are im-plicitly resolved by the associative network, theyreceive no special treatment. More precisely, if,for instance, somebody is searching for an accom-modation with a sauna or solarium, those accom-modations offering both facilities will be rankedhigher than others, providing only one of the de-sired facilities. Furthermore, the “near” modi-fier reflecting geographic dependencies, is auto-matically resolved by associating cities or townswithin a circumference of 15km. Depending onthe distance, the weights are adapted accord-ingly, i.e. the closer they are together, the higheris the weight of the link in the associative net-work.

• Abstract role: If a source concept is abstract,the set of source concepts is expanded by resolv-ing the parentOf relation between parent andchild concepts. This process is repeated untilall abstract concepts are resolved, i.e. the set ofsource concepts contains members of the concep-tual layer only. The initial activation value ispropagated to all child concepts, with respect tothe weighted links.

• Concrete role: The initial activation level ofconcrete concepts is set to initialization valuedefined in the XML source file. The spreadingactivation process takes place at the conceptuallayer, i.e. the connectedTo relations between ad-jacent concepts are used to propagate activationthrough the network.

After the initialization phase has completed, the iter-ative spreading process is activated. During a single

Figure 3: Network layer interdependencies

iteration one pulse is performed, i.e. the number ofiterations equals the number of pulses. Starting fromthe set of source concepts determined during initial-ization, in the current implementation activation isspread to adjacent nodes according to the followingformula:

Oi(p) ={

0 if Ii(p) < τ,Fi

p+1 · Ii(p) else, with Fi = (1− Ci

CT) (4)

The output, Oi(p), sent from node i at pulse p,is calculated as the fraction of Fi, which limits thepropagation according to the degree of connectivityof node i (i.e. fan-out constraint, cf. Section 3.2), andp + 1, expressing the diminishing semantic relation-ship according to the distance of node i to activationsource nodes (i.e. distance constraint, cf. Section 3.2).Moreover, Fi is calculated by dividing the number ofconcepts Ci directly connected to node i by the totalnumber of nodes CT building the associative network.Note, τ represents a threshold value.

Simultaneous to calculating the output value forall connected nodes, the activation level Ii(p) of nodei is added to all associated entities. More precisely,each entity connected to node i receives the samevalue and adds it to an internal variable represent-ing the total activation of the entity. As an example,if the concept node “sauna” is activated, the acti-vation potential is propagated to the entities “HotelStams” and “Hotel Thaya” (cf. Figure 4). Next, allnewly activated nodes are used in the subsequent iter-ation as activation sources and the spreading processcontinues until the maximum number of iterations isreached.

After the spreading process has terminated, thesystem inspects all entities and ranks them accord-ing to their activation. Figure 5 depicts the results

determined for the example query

Ich und meine Kinder mochten in einem Hotel inKitzbuhel Urlaub machen. Es sollte ein Dampfbadhaben.1

In this particular case, the entities “SchwarzerAdler Kitzbuhel” and “Hotel Schloss Lebenberg –Kitzbuhel” located in “Kitzbuhel” are suggested tobe the best matching answers to the query. Moreover,the result set includes matches that are closely relatedto the user’s query. Thus, depending on the relationsstored in the associative network, entities offering re-lated concepts are activated accordingly. More pre-cisely, not only the attributes “hotel”, “steam bath”and “kids” are taken into account, but also all otherrelated entity attributes (e.g. “sauna”, “whirlpool”,“solarium”, etc.) have some influence on the rankingposition. Furthermore, accommodations in cities inthe vicinity of “Kitzbuhel” providing the same or evenbetter offers are also included in the result set. Thus,the associative network provides a means for exactinformation retrieval and incorporates a fuzzy searchstrategy that determines closely related matches tothe user’s query.

5 Conclusion

A natural language system based on an approachdescribed in Berger (2001) and Berger, Dittenbach,Merkl & Winiwarter (2001) has been reviewed in thispaper and, furthermore, provided the basis for the re-search presented herein. The reviewed system offersmultilingual access to information on a restricted do-main. In this particular case the system operates on

1Me and my kids would like to spend our holidays in a hotel inKitzbhel. It should have a steam bath.

Conceptual Layer

Synonym Lexicon

Entity Layer

sauna

solarium

steam bath

pool

indoor pool

w 2

w 5

w 6

w 4

w 1

w 3

2344 | Hotel Stams

1240 | Hotel Thaya

4661 | Forellenhof

1455 | Hotel Wiental

5529 | Wachauerhof

de schwimmbad hallenbad swimmingpool schwimmbecken ... ...

. . .

. . .

Abstraction Layer

wellness

. . .

. . .

Figure 4: Knowledge base architecture

the tourism domain. Moreover, users of the search in-terface are encouraged to formulate queries in naturallanguage, i.e. they are able to express their intentionsin their own words.

We developed a knowledge representation modelthat facilitates the definition of semantic relations be-tween information items exemplified by terms of thetourism domain. In particular, an associative net-work based on a three layered structure was intro-duced. First, the abstraction layer allows modellingof terms with a subjective or broader semantic mean-ing, second, the conceptual layer is used to definerelations via weighted links between terms, and, fi-nally, the entity layer provides a means to associateelements stored in a relational database with infor-mation items in the associative network. Moreover,a constrained spreading activation algorithm imple-ments a processing technique operating on the net-work. Generally, the combination of the associativenature of the knowledge representation model and theconstrained spreading activation approach constitutesa search algorithm that evaluates the relatedness ofterms and, therefore, provides a means for implicitquery expansion.

The flexible method of defining relationships be-tween terms unleashes the ability to determine highlyassociated results as well as results that are predefineddue to personal preferences. Moreover, especially de-signed associative networks can be used to model sce-narios, as, for instance, a winter holiday scenario thatfavors accommodations offering winter sports activi-ties by adapting the weights of links accordingly.

One important task for further enhancement is thepossibility to express the relevance of query terms.Users should be able to assign a degree of significanceto terms. Consider, for example, a user searching foran accommodation with several amenities in the capi-tal city of Austria. Moreover, the user is a vegetarian.Therefore, a means for expressing the importance ofvegetarian kitchen is needed. In order to accomplishthis requirement, the system might be extended tounderstand words that emphasis terms, e.g. in anal-ogy to modifiers like “and”, “or”, “near”, etc. theword “important” is handled like a modifier and in-fluences the activation level of the following queryterm. Additionally, an interface providing a graph-ical instrument to express relevance by means of aslide controller might be considered.

Furthermore, an associative network might act asa kind of short term memory. More precisely, duringa user session a particular network is used to store

the activation level determined during past user in-teractions. A user, for instance, is searching for ahotel in Vienna. Thus, the associative network storesthe activation level for further processing. Next, theuser might restrict the results to accommodations of-fering a sauna. This spreading process is carried outusing the associative network determined during theprevious interaction.

References

Baeza-Yates, R. A. & Ribeiro-Neto, B. (1999),Modern Information Retrieval, Addison-Wesley,Reading, MA.

Ballesteros, L. & Croft, W. B. (1998), Resolvingambiguity for cross-language retrieval, in ‘Re-search and Development in Information Re-trieval’, pp. 64–71.

Belew, R. K. (1989), Adaptive information retrieval:Using a connectionist representation to retrieveand learn about documents, in N. J. NicholasJ. Belkin & C. J. Van Rijsbergen, eds, ‘Pro-ceedings of the 12th International Conference onResearch and Development in Information Re-trieval (SIGIR’89)’, ACM, pp. 11–20.

Berger, H. (2001), Adaptive multilingual interfaces,Master’s thesis, Vienna University of Technol-ogy.

Berger, H., Dittenbach, M. & Merkl, D. (2003),Querying tourism information systems in nat-ural language, in ‘Proceedings of the 2nd In-ternational Conference on Information SystemTechnology and its Applications (ISTA 2003)’,Kharkiv, Ukraine.

Berger, H., Dittenbach, M., Merkl, D. & Winiwarter,W. (2001), Providing multilingual natural lan-guage access to tourism information, in W. Wini-warter, S. Bressan & I. K. Ibrahim, eds, ‘Pro-ceedings of the 3rd International Conference onInformation Integration and Web-based Appli-cations and Services (IIWAS 2001)’, AustrianComputer Society, Linz, Austria, pp. 269–276.

Cavnar, W. B. & Trenkle, J. M. (1994), N-gram-based text categorization, in ‘International Sym-posium on Document Analysis and InformationRetrieval’, Las Vegas, NV.

Figure 5: Weighted result set determined by constrained spreading activation

Crestani, F. (1997), ‘Application of spreading activa-tion techniques in information retrieval’, Artifi-cial Intelligence Review 11(6), 453–582.

Crestani, F. & Lee, P. L. (2000), ‘Searching the webby constrained spreading activation’, Informa-tion Processing and Management 36(4), 585–605.

Croft, W., Lucia, T., Crigean, J. & Willet, P. (1989),‘Retrieving documents by plausible inference: anexperimental study’, Information Processing &Management 25(6), 599–614.

Croft, W. & Thompson, R. H. (1987), ‘I3R: A NewApproach to the Design of Document RetrievalSystems’, Journal of the American Society forInformation Science 38(6), 389–404.

Dittenbach, M., Merkl, D. & Berger, H. (2003), Us-ing a connectionist approach for enhancing do-main ontologies: Self-organizing word categorymaps revisited, in ‘Proceedings of the 5th Inter-national Conference on Data Warehousing andKnowledge Discovery - (DaWaK 2003)’. Ac-cepted for publication.

Harman, D. K. (1995), Overview of the 3rd TextRetrieval Conference (TREC-3), in D. K. Har-man, ed., ‘Proceedings of the 3rd Text RetrievalConference (TREC-3)’, NIST Special Publica-tion 500–225, pp. 1–19.

Hartmann, K. & Strothotte, T. (2002), A spreadingactivation approach to text illustration, in ‘Pro-ceedings of the 2nd International Symposium onSmart graphics’, ACM Press, pp. 39–46.

Hull, D. A. & Grafenstette, G. (1996), Queryingacross languages: A dictionary-based approachto multilingual information retrieval, in ‘Pro-ceedings of ACM SIGIR Conference on Researchand Development in Information Retrieval (SI-GIR 1996)’, pp. 49–57.

Kjeldsen, R. & Cohen, P. (1987), ‘The evolution andperformance of the GRANT system’, IEEE Ex-pert pp. 73–79.

Levenshtein, V. I. (1966), ‘Binary codes capable ofcorrecting deletions, insertions and reversals’,Soviet Physics Doklady 10(8), 707–710.

Philips, L. (1990), ‘Hanging on the metaphone’, Com-puter Language Magazine 7(12).

Quillian, M. R. (1968), Semantic memory, in M. Min-sky, ed., ‘Semantic Information Processing’, MITPress, pp. 227–270.

Salton, G. (1989), Automatic Text Processing: TheTransformation, Analysis, and Retrieval of In-formation by Computer, Addison-Wesley, Read-ing, MA.

Salton, G. & McGill, M. J. (1983), Introductionto Modern Information Retrieval, McGraw-Hill,New York.

Van Rijsbergen, C. J. (1979), Information Retrieval,Department of Computer Science, University ofGlasgow.

Xu, F., Netter, K. & Stenzhorn, H. (2000), Mietta -a framework for uniform and multilingual accessto structured database and web information, in‘Proceedings of the 5th International Workshopon Information Retrieval with Asian languages’.

Documents

An Adaptive Information Retrieval System based on ...dieter/research/publications/apccm04-1.pdf · one: “multilingual information retrieval is informa-tion retrieval in any language