

Data & Knowledge Engineering 69 (2010) 251–260

Contents lists available at ScienceDirect

Data & Knowledge Engineering

journal homepage: www.elsevier.com/locate/datak

Combining ontological profiles with context in information retrieval ☆

Geir Solskinnsbakk *, Jon Atle Gulla

Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway

☆ This research was carried out as part of the IS A project, project no. 176755, funded by the Norwegian Research Council under the VERDIKT program.

* Corresponding author. E-mail addresses: [email protected] (G. Solskinnsbakk), [email protected] (J.A. Gulla).

0169-023X/$ - see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.datak.2009.10.006

Article info

Article history: Available online 13 October 2009

Keywords: Information retrieval; Ontologies; Ontological profiles; Context

Abstract

An ontology is a formal conceptualization of a domain, specifying the concepts of the domain and the relations between them. It is, however, not a straightforward task to use this knowledge for information retrieval purposes. In this paper we describe the concept of an ontological profile, which is a semantic extension of an ontology where each ontology concept is given a description in terms of a vector of weighted keywords. An experiment has been conducted with a prototype search engine using ontological profiles for query expansion. The evaluation shows encouraging results compared to standard keyword-based search. Furthermore, we describe the notion of context in an information retrieval setting and address how we can combine semantics and context in search based on query expansion.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Ontologies are now used in a wide range of applications and have been instrumental in many interoperability projects. They help applications work together by providing common vocabularies that describe all important domain concepts without being tied to particular applications in the domains.

However, used as part of semantic search applications, ontologies have so far had only limited success. Early semantic search engines tried to use ontology concepts and structures as controlled search vocabularies, but this was impractical both functionally and from a usability perspective. Ontologies for query disambiguation or reformulation seem more promising, though there is a fundamental problem with comparing ontology concepts with query or document terms. Concepts are abstract notions that are not necessarily linked to a particular term. A number of terms may refer to the same concept, or a specific term may be the realization of several different concepts. Using conceptual structures to index or retrieve document text requires that there is something bridging the conceptual and real world.

Another issue is tailoring ontologies to the retrieval task. Research indicates that ontologies are of little use if they are not aligned with the documents indexed by the search application. The granularity of the ontology needs to match the granularity of the document collection. While there is no need to have an elaborate ontology for a sub-domain with very few documents, it is often necessary to expand ontologies in areas that are well covered by the document collection.

This paper presents an ontology enrichment approach that both bridges the conceptual and real world and ensures that the ontology is well adapted to the documents at hand. The idea is to provide contextual concept characterizations that reveal how the concepts are referred to semantically in the document collection. The characterizations come in the form of weighted terms that are all, in some sense, related to the concept itself. The ontology together with the concept characterizations is referred to as an ontological profile of the document collection. The approach has already been used for ontology alignment (cf. [16]), and we are now experimenting with these profiles in search and ontology learning. Our initial search prototypes display a significant improvement of search relevance, provided that the quality of the characterizations is sufficient.

Further, we describe the notion of context for information retrieval purposes. We define what a search context is and how it may later be used to improve the information retrieval task for a user in a specific situation.

2. Related work

Su describes in [16] a method for ontology mapping, based on an extension of the ontology. The extension of the ontology consists of constructing a feature vector for each concept of the ontology. The feature vector is made up of weighted terms reflecting relevance between concept and terms found in an underlying document collection. Further, the vectors are used to provide a mapping between concepts of different ontologies.

In [17], Tomassen describes an ontology-driven information retrieval system based on extending ontologies. Each concept of the ontology is extended by associating it with a vector of key phrases describing the concept. First, the ontology concepts are ranked according to ontology relevance. Next, a query is generated for each concept based on relations with other concepts. The returned documents (for each concept) are separately clustered, and for each cluster a set of candidate terms is extracted. The candidate terms for each cluster of the concept are compared to the candidate terms for neighboring concepts (weights are used to differentiate between relation types). The candidate terms with the highest similarity are chosen and used for the final assembly of the concept vector. During search, the concept vectors are used to post-process the result set returned to the user. The two former profiles are document or domain centric, while the next two are user centric.

Kearney et al. [8] use ontological profiles for user recommendation in a web setting (shopping, movies, etc.). The profile is constructed from analyzing single sessions of pages visited by a user. For each page visited, the algorithm tags the page with a set of ontology concepts and calculates weights for the concepts of each visit. The profile is then a weighted representation of the visit in terms of the ontology concepts. A visitor ontological profile (VOP) is generated for the user by summing up each of the single session profiles, giving less importance to old profiles. Finally, the VOP can be used to make recommendations to other users by calculating the distance between the profiles of the users.

Sieg et al. [13] describe an approach for representing user context for search. The context is represented by an ontological user profile, which is a reference ontology annotated with interest scores for each concept. The interest scores are gradually modified based on the searches performed by the user and the documents viewed. The reference ontology consists of term vectors for each concept constructed on the basis of pre-classified documents. The ontological profile is used during search to re-rank the final results by combining query, document, interest scores and the term vector representation of the concepts.

The VIVACE project [11] describes modeling and use of context in a search system supporting engineers in their work. The researchers define six important context dimensions which together identify the user context. Knowledge Elements (K-El, pieces of topic-specific knowledge) are tagged as relevant for a given context, and a weight describing the strength of the relevance is associated with each such (K-El, context) pair. Search is performed using case-based reasoning. The system also has capabilities for learning. This is supported by users being able to re-weight K-Els, or even do full-text search on the K-Els, assigning new context to K-Els found based on the user's context.

Shen et al. [12] report on an approach for context-sensitive information retrieval based on implicit feedback. The implicit feedback is in essence based on click-through history and query history. In this context, the authors make a distinction between short-term and long-term context. Short-term context describes the user's needs for a single session, while the long-term context describes an accumulation of context over time (query history, general interest, past click-through behavior, etc.). In the paper the authors concentrate on the short-term context, and use the query history and click-through history as a basis for introducing context to the information retrieval task.

3. Ontological profiles

As stated in Section 1, matching query terms with ontology concepts is not a straightforward task. Simply relying on a syntactic match between the query term and its conceptual counterpart has its limitations. One of the most obvious problems might be that the mapping of terms to concepts and vice versa is hampered by polysemy (in both directions). As a step towards solving this mapping problem we propose the ontological profile.

An ontological profile is an extension of a domain ontology. The ontology is extended with semantically related terms that are added as vectors for each of the concepts of the ontology. The terms are given weights to reflect the importance of the semantic relation between the concept and the terms. For a definition of concept vectors, see Definition 1 below.

Definition 1 (Concept vector). The definition given here is adapted from Su [16]. Let $T$ be the set of $n$ terms in the document collection used for construction of the ontological profile. $t_i \in T$ denotes term $i$ in the set of terms. Then the concept vector for concept $j$ is defined as the vector $C_j = [w_1, w_2, \ldots, w_n]$, where each $w_i$ denotes the semantic relatedness weight for term $t_i$ with respect to concept $C_j$.

The construction of an ontological profile relies on having a domain-relevant document collection at hand. This means that the documents must cover the same domain as the ontology we are constructing the ontological profile for. By applying text mining techniques to the document collection, we extract terms that are semantically related to the concepts of the ontology. Weights are calculated to express the strength of the relations, and are given in the range [0,1]. A weight of 0 means that there is no relation, whereas a weight of 1 means that the term is highly related or even synonymous to the concept. Hence the final ontological profile will have one concept vector associated with each concept of the ontology. The concept vector can be viewed as an extended semantic characterization of the concept, reflecting the semantics of the concept in the document collection. We would like to stress this point, as it may be hard for authors to use a consistent language over a large domain and large amounts of text.

The concept vectors will typically contain terms that are synonymous to the concept and more indirect references to the concept or the use of it. Sources like thesauri and WordNet [3] contain a similar type of information. However, in some cases the information required for the domain model is of a more specific nature. We therefore argue that the ontological profile may be better suited, since it is adapted to: (i) a specific document collection, (ii) the vocabulary in the documents, and (iii) the use of the ontology concepts in the documents. Moreover, one could also argue that these terms could be added to the ontology to further specialize it. This is however not necessarily a desirable path to take. Firstly, the terms found in the concept vectors may be very specialized since they are tailored towards the underlying document collection. This suggests that the terms are not generic to the domain, resulting in a very specialized ontology which is hard to reuse. Secondly, the purpose of constructing the ontological profile is to do a semantic analysis of the domain based on the documents. The analysis may discover relations that are very specific to the document base and thus the terms may be too fine-grained to add to the ontology.

When applied to information retrieval (see Section 5), we argue that ontological profiles are generic to the search process. By this we mean that an information retrieval system built to utilize such profiles may be adapted to different document collections or even domains by exchanging the document collection and/or the ontology used. In Fig. 1, two different document collections are used to build two different ontological profiles for the ontology. By exchanging the ontological profile, the system can perform a more focused search in a different document collection using the same ontology.

Finally, we will show an example of a concept vector. We will be using the concept Christmas Tree from the IIP ontology [6]. A christmas tree is defined by the ISO 15926 standard, used in the petroleum industry, as "an artefact that is an assembly of pipes and piping parts, with valves and associated control equipment that is connected to the top of a wellhead and is intended for control of fluid from a well". The top 10 terms for the concept vector are shown below:

$$C_{\text{christmas tree}} = [\text{christma } 0.71,\ \text{tree } 0.60,\ \text{valve } 0.13,\ \text{master } 0.11,\ \text{wing } 0.11,\ \text{bop } 0.08,\ \text{located } 0.08,\ \text{stack } 0.07,\ \text{choke } 0.07,\ \text{wellhead } 0.06]$$

The terms from the example were found in the experiment (Section 5) using stemming (which explains the term christma). The two highest weighted terms are the constituents of the concept name. From the definition, we also note that valve and wellhead are related to a christmas tree. Choke (a type of valve) is also relevant since it is an integral part of fluid control. The term bop is at first glance taken for noise, but is actually an abbreviation for blowout-prevention, which certainly is relevant in this context. Bop also demonstrates how semantically related terms are picked up by the text mining process used for constructing ontological profiles.

4. Construction of ontological profiles

The ontological profile is constructed based on a document collection covering the same domain as the ontology. This ensures that the desired connection between the vocabulary of the domain (at least the vocabulary in the document collection used) and the concepts of the domain is found, letting the ontological profile serve as a semantic characterization of the domain. A detailed description of the approach is found in [14], and is based on a method described by Su [16].

Fig. 1. The concept of ontological profiles.

Construction of the ontological profile is based on three important aspects which together reflect the content of the documents used as basis for the ontological profile:

1. Statistics. We apply statistical techniques to the document collection by counting the frequency of the terms in the documents. We hypothesize that terms that are found to co-occur with the concept more frequently are probably more relevant to the concept than terms that co-occur less frequently.

2. Linguistics. A simple linguistic technique is applied to the terms of the documents, namely stemming, to collapse certain semantically similar terms into a single form. There is obviously a trade-off between light and strong stemming algorithms. If the stemming algorithm is too strong, it will collapse the terms too much, leading to problems with semantics of the words in different contexts with a loss of precision [4] as a consequence. We have therefore opted to stem the terms only lightly.

3. Proximity. Finally, we apply a proximity measure to the co-occurrence of concept names and terms. The assumption that lies behind the proximity analysis is that terms found close to each other should be more semantically related than terms found far apart. Term proximity is reflected by the way terms are weighted in the concept vectors.

The overall process of constructing the ontological profile is shown in Fig. 2. The first step is to preprocess the documents used during the construction phase. During this process we remove all stop words, and stem the terms lightly (removing plural s), using the conversion s → ∅ for all terms not terminated with ss.
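The preprocessing step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the stop-word list, the whitespace tokenizer and the function names `light_stem` and `preprocess` are all hypothetical, since the paper only specifies stop-word removal and the rule s → ∅ for terms not ending in ss.

```python
from __future__ import annotations

# Hypothetical stop-word list; the paper does not say which list was used.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to", "with"}


def light_stem(term: str) -> str:
    """Remove one trailing plural 's' unless the term ends in 'ss'."""
    if term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term


def preprocess(text: str) -> list[str]:
    """Lowercase, split on whitespace, drop stop words, stem lightly."""
    # Assumed tokenizer: whitespace split plus stripping of common punctuation.
    tokens = (tok.strip(".,;:()\"'") for tok in text.lower().split())
    return [light_stem(tok) for tok in tokens if tok and tok not in STOP_WORDS]


print(preprocess("The valves of a Christmas tree."))
```

Note that the rule turns "christmas" into "christma", which matches the stemmed term that appears in the example concept vector of Section 3.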

Next, we construct three separate indexes over the documents, used for the proximity analysis. Each document is first indexed as a single object where all terms are equally related. Next we index each document as a set of paragraphs where each paragraph makes up a semantic entity with higher semantic coherence among the terms than in the documents. Finally, we index each document as a set of sentences, where the terms within a single sentence have the highest semantic coherence. We have now formed a hierarchy of increasing semantics over the text; documents have the lowest semantic coherence, while sentences have the highest, reflecting the proximity property described above.

The name of the concept is used as a phrase query into the three indexes, and all documents containing the phrase are assigned to the concept. This implicitly means assigning all of the terms in the documents to the concept in question.
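The three granularities and the phrase lookup can be illustrated with a small in-memory sketch. It rests on several assumptions that are not in the paper: paragraphs are separated by blank lines, sentences end in ".", "!" or "?", and a case-insensitive substring test stands in for a real phrase query against a search index. The function names are hypothetical.

```python
from __future__ import annotations

import re


def split_units(doc: str) -> dict[str, list[str]]:
    """Return the document as one unit, as paragraphs, and as sentences."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", doc) if p.strip()]
    sentences = [
        s.strip()
        for p in paragraphs
        for s in re.split(r"(?<=[.!?])\s+", p)
        if s.strip()
    ]
    return {"document": [doc], "paragraph": paragraphs, "sentence": sentences}


def assign_units(concept_name: str, docs: list[str]) -> dict[str, list[str]]:
    """Collect, per granularity, every unit that contains the concept name."""
    phrase = concept_name.lower()
    hits: dict[str, list[str]] = {"document": [], "paragraph": [], "sentence": []}
    for doc in docs:
        for level, units in split_units(doc).items():
            # Substring match used here as a stand-in for a phrase query.
            hits[level].extend(u for u in units if phrase in u.lower())
    return hits


doc = "The christmas tree sits on the wellhead. It has valves.\n\nA choke controls flow."
print(assign_units("christmas tree", [doc])["sentence"])
```

The units returned for each level correspond to the sets of documents, paragraph documents and sentence documents that are weighted differently in the next step.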

Fig. 2. The process of constructing the ontological profile.

The next step in the construction process is to calculate the weights for the terms assigned to the concepts. The principle behind the weight calculation is that we want to differentiate the score given to the terms based on proximity. Thus we have chosen to give terms found in the same document as the concept name the lowest weight, terms found in the same paragraph higher weight, and finally terms found in the same sentence the highest weight. This weighting scheme reflects our assumption about proximity and strength of relation described above. As a basis for the calculation of the weights, we use the frequency of the terms found in the documents assigned to each concept. Eq. (1) shows the calculation, where $vf_{i,j}$ is the term frequency for term $i$ in concept vector $j$, $f_{i,k}$ is the term frequency for term $i$ in document vector $k$, $D$, $P$, and $S$ are the possibly empty sets of relevant documents, paragraph documents and sentence documents assigned to $j$, and $a = 0.1$, $b = 1.0$, and $c = 10.0$ are the constant modifiers for documents, paragraph documents, and sentence documents, respectively. The constants ($a$, $b$, and $c$) simply reflect the relative importance of terms found in whole documents, paragraph documents, and sentence documents. Although the values for these have not been researched extensively, we have found that the given values perform quite well:

$$vf_{i,j} = a \cdot \sum_{d \in D} f_{i,d} + b \cdot \sum_{p \in P} f_{i,p} + c \cdot \sum_{s \in S} f_{i,s} \tag{1}$$
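A minimal sketch of Eq. (1) follows. It assumes that each unit assigned to a concept has already been preprocessed into a list of tokens; the function name `basic_vector` and the toy token lists are hypothetical, while the three constants are the values given in the text.

```python
from collections import Counter

# Constant modifiers from the text: documents, paragraphs, sentences.
A, B, C = 0.1, 1.0, 10.0


def basic_vector(doc_units, par_units, sent_units):
    """Compute vf for one concept from the token lists assigned to it.

    Each argument is a list of token lists (the sets D, P and S of Eq. (1)).
    """
    vf = Counter()
    for weight, units in ((A, doc_units), (B, par_units), (C, sent_units)):
        for tokens in units:
            for term, freq in Counter(tokens).items():
                vf[term] += weight * freq
    return dict(vf)


# Toy data: one document, one paragraph and one sentence containing the concept name.
docs = [["christma", "tree", "valve", "choke"]]
pars = [["christma", "tree", "valve"]]
sents = [["christma", "tree"]]
print(basic_vector(docs, pars, sents))
```

In this toy input, a term that occurs once at all three levels receives 0.1 + 1.0 + 10.0, while a term seen only at document level receives 0.1, which shows how strongly sentence-level co-occurrence dominates the basic vector.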

The vectors resulting after the calculations in Eq. (1) are what we refer to as the basic vectors. Next, the familiar tfidf [1] score is applied. The idf factor gives less importance to terms that are found in many documents across the collection, and is used in an analogous way here. The difference is that we use the basic vectors as documents, meaning that the idf factor gives lower weight to terms that are found in many concept vectors. The calculations are shown in Eq. (2), where $\mathit{tfidf}_{i,j}$ is the tfidf score for term $i$ in concept vector $j$, $vf_{i,j}$ is the term frequency for term $i$ in concept vector $j$, $\max(vf_{l,j})$ is the frequency of the most frequently occurring term $l$ in concept vector $j$, $N$ is the number of concept vectors, and $n_i$ is the number of concept vectors containing term $i$:

$$\mathit{tfidf}_{i,j} = \frac{vf_{i,j}}{\max(vf_{l,j})} \cdot \log\frac{N}{n_i} \tag{2}$$
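Eq. (2) can be sketched as below, treating each basic vector as a "document" for the idf factor. The natural logarithm is an assumption, since the base of the logarithm is not stated; the function name and the toy vectors are hypothetical.

```python
import math


def tfidf_vectors(basic):
    """Map {concept: {term: vf}} to {concept: {term: tfidf}} following Eq. (2)."""
    n_vectors = len(basic)  # N: number of concept vectors
    containing = {}  # n_i: number of concept vectors containing term i
    for vec in basic.values():
        for term in vec:
            containing[term] = containing.get(term, 0) + 1

    result = {}
    for concept, vec in basic.items():
        max_vf = max(vec.values()) if vec else 0.0
        if max_vf <= 0:
            result[concept] = {}
            continue
        result[concept] = {
            # Assumed natural log; the paper does not give the base.
            term: (vf / max_vf) * math.log(n_vectors / containing[term])
            for term, vf in vec.items()
        }
    return result


basic = {
    "christmas tree": {"tree": 10.0, "valve": 5.0},
    "choke": {"valve": 4.0, "choke": 8.0},
}
print(tfidf_vectors(basic))
```

A term that appears in every concept vector, such as "valve" in this toy profile, gets a weight of zero because log(N / n_i) = log(1) = 0, which is the intended down-weighting of terms that do not discriminate between concepts.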

This completes the ontological profile, although some final calculations may be applied, such as normalizing the vectors to unit length. The last step is to index the vectors so that the vectors may be searched both by concept name and by terms.

In [14], we also introduced the notion of negative concept vectors, which are built in the same way but contain only terms that are not relevant for the domain. The documents used to build these negative concept vectors are strictly non-relevant to the domain.

5. Experiment

In this section, we report on an initial experiment carried out based on using the ontological profile for semantic query expansion in information retrieval. The process consists of two steps: (i) query interpretation and (ii) query expansion. In (i), we map from the query terms entered by the user to one or more concept(s) of the ontology. This is done by comparing the query terms with the terms and weights found in the concept vectors. In (ii), the mapped concepts are used for subsequent query expansion based on the concept vectors, adding semantically related terms to the query.

5.1. Query interpretation

During the query interpretation we try to make a semantic mapping from the keywords of the user query onto a set of one or more concepts of the ontology. The ontological profile plays a key role in this mapping process. As each of the concept vectors is a semantic representation of the concept itself, we use this as a measure of coherence between the keywords of the query and the concept. For this prototype evaluation we have suggested four different strategies for interpretation of the keyword query. Each of the strategies results in a ranked list of concepts to choose from during expansion. As the concept vectors contain weighted terms defining their importance for the concept in question, we try to find the set of concepts that maximizes the semantic coherence between the keywords and the concepts.

5.1.1. Simple query interpretation

This is the most basic strategy for query interpretation. We simply map each query term to a single concept (we do not consider any connection between the query terms). For each query term $t_i$, we find the concept with the highest tfidf score for $t_i$ in its corresponding concept vector. The concept chosen for each of the query terms is subsequently used for query expansion.
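The simple strategy amounts to an independent arg-max per query term. The sketch below assumes the ontological profile is held in memory as a dictionary of concept vectors; the function name, the tie-breaking behavior and the "choke valve" vector are hypothetical, while the "christmas tree" weights are taken from the example in Section 3.

```python
def simple_interpretation(query_terms, profile):
    """Map each query term to the concept whose vector scores it highest.

    profile: {concept name: {term: tfidf weight}}.
    Returns {query term: concept name, or None if no vector contains the term}.
    """
    mapping = {}
    for term in query_terms:
        candidates = [
            (vec[term], concept) for concept, vec in profile.items() if term in vec
        ]
        # Ties are broken by concept name; this is an arbitrary assumption.
        mapping[term] = max(candidates)[1] if candidates else None
    return mapping


profile = {
    "christmas tree": {"christma": 0.71, "tree": 0.60, "valve": 0.13},
    "choke valve": {"choke": 0.80, "valve": 0.50},  # made-up weights
}
print(simple_interpretation(["valve", "tree", "pump"], profile))
```

Because each term is mapped on its own, a two-term query can end up with two unrelated concepts, which is the limitation the remaining strategies try to address.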

5.1.2. Best match query interpretation

In contrast to the simple query interpretation where we assume no connection between the query terms, we do assume that there is some connection between the query terms in the best match strategy. We try to recognize this by assuming that the user is thinking of a specific concept when issuing the query. During interpretation of the query we try to recognize the connection between the query terms by mapping the terms collectively to a single concept. In effect we require each candidate concept to contain all query terms in its corresponding concept vector. The concept which maximizes $S_c$ given in Eq. (3) ($S_c$ is the score for concept $c$, and $t_{i,c}$ is the tfidf score for query term $i$ in concept $c$) is picked for subsequent expansion:

$$S_c = \sum_{i=1}^{n} t_{i,c} \tag{3}$$
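Eq. (3) and the requirement that a candidate concept contain every query term can be sketched as follows. The in-memory profile layout, the function name and the "choke valve" weights are hypothetical; returning `None` when no concept qualifies is also an assumption, since the text does not describe that case.

```python
def best_match(query_terms, profile):
    """Return the concept maximizing the summed tfidf score over all query terms.

    Only concepts whose vector contains every query term are candidates.
    """
    best_concept, best_score = None, float("-inf")
    for concept, vec in profile.items():
        if all(term in vec for term in query_terms):
            score = sum(vec[term] for term in query_terms)  # S_c of Eq. (3)
            if score > best_score:
                best_concept, best_score = concept, score
    return best_concept


profile = {
    "christmas tree": {"christma": 0.71, "tree": 0.60, "valve": 0.13},
    "choke valve": {"choke": 0.80, "valve": 0.50},  # made-up weights
}
print(best_match(["tree", "valve"], profile))
```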

5.1.3. Cosine similarity query interpretation

The two last interpretation strategies are attempts at disambiguating the user query. In these strategies we also assume that there is some relationship between the query terms. However, in contrast to the former strategy, we are not able to directly recognize the relation from the query terms. We therefore try to recognize relations between the concepts that the query terms map to via the ontological profile. For instance, there might be a strong relation between the best concept for term 1 and the second best concept for term 2 which is not recognized by the two former interpretation strategies. In this strategy (and in the ontology structure interpretation) we first apply the same technique as in the simple strategy, ending up with a ranked list of concepts for each of the query terms (we only use the top 15 concepts). We try to recognize the relationship between concepts by calculating the cosine similarity between the concept vectors. A high similarity means that there is a strong relation; likewise, a low score signals a weak relation. For each pair of concepts for term 1 and term 2, we calculate a score (Eq. (4), where $c_{in}$ is the concept ranked as $n$ with respect to query term $i$) which takes into account both the score for each term and the relationship between the concept vectors. The pair of concepts with the highest score will be picked as the semantic representation of the query and used for subsequent expansion:

$$S_{c_{in},c_{jm}} = t_{i,n} \cdot t_{j,m} \cdot \mathrm{cos\_sim}(c_{in}, c_{jm}) \tag{4}$$
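The pair scoring of Eq. (4) can be sketched for a two-term query as below. The profile layout, the function names and the toy vectors are hypothetical. The sketch also keeps pairs in which both terms map to the same concept (cosine similarity 1); whether such pairs were allowed in the prototype is not stated, so this is an assumption.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def ranked_concepts(term, profile, top_k=15):
    """Concepts containing the term, best tfidf first (top 15 as in the text)."""
    scored = [(vec[term], c) for c, vec in profile.items() if term in vec]
    return sorted(scored, reverse=True)[:top_k]


def cosine_interpretation(term_i, term_j, profile, top_k=15):
    """Return (score, concept for term_i, concept for term_j) maximizing Eq. (4)."""
    best = None
    pairs = product(
        ranked_concepts(term_i, profile, top_k),
        ranked_concepts(term_j, profile, top_k),
    )
    for (t_in, c_in), (t_jm, c_jm) in pairs:
        score = t_in * t_jm * cosine(profile[c_in], profile[c_jm])
        if best is None or score > best[0]:
            best = (score, c_in, c_jm)
    return best


profile = {  # made-up weights for illustration
    "c1": {"tree": 0.6, "valve": 0.1},
    "c2": {"valve": 0.5, "choke": 0.5},
    "c3": {"tree": 0.2, "pipe": 0.9},
}
print(cosine_interpretation("tree", "valve", profile))
```

In the toy profile the best pair is ("c1", "c1"): although "c2" scores "valve" higher, its vector shares little with "c1", so the similarity factor outweighs the individual term scores.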

5.1.4. Ontology structure query interpretation

This interpretation strategy is essentially the same as the former, only differing in the measure for concept similarity. We use the structure of the ontology to measure similarity between concepts, assuming that concepts that are found close together are more semantically related than concepts found further apart in the ontology. The measure we use is based on the distance between the concepts in terms of relations. We start by generating a graph representation of the ontology, in which nodes represent concepts and edges represent relations. Next we traverse the graph to find the distance for each pair of concepts. The score for each pair of concepts is calculated as shown in Eq. (5), where the path function gives the distance between the concept pairs. The pair of concepts with the highest score is picked as the semantic representation of the query and used in the subsequent expansion process:

$$S_{c_{in},c_{jm}} = t_{i,n} \cdot t_{j,m} \cdot \frac{1}{\mathrm{path}(c_{in}, c_{jm})} \tag{5}$$
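A sketch of the distance measure and Eq. (5) is given below, using breadth-first search over an undirected adjacency list. Several points are assumptions rather than details from the paper: relations are treated as undirected and unweighted, unreachable pairs are skipped, and pairs consisting of the same concept (distance 0) are skipped to avoid division by zero. The function names and the toy graph are hypothetical.

```python
from collections import deque


def path_length(graph, start, goal):
    """Number of edges on the shortest path, or None if goal is unreachable."""
    if start == goal:
        return 0
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def structure_score(t_in, t_jm, c_in, c_jm, graph):
    """Eq. (5): product of the term scores divided by the path length."""
    dist = path_length(graph, c_in, c_jm)
    if not dist:  # None (unreachable) or 0 (same concept): assumed to be skipped
        return None
    return t_in * t_jm / dist


graph = {  # toy ontology fragment as an undirected adjacency list
    "wellhead": ["christmas tree"],
    "christmas tree": ["wellhead", "valve"],
    "valve": ["christmas tree", "choke"],
    "choke": ["valve"],
}
print(path_length(graph, "wellhead", "choke"))
```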

5.2. Query expansion

The query expansion process reformulates the original query by adding semantically related terms to the query. For each concept chosen for expansion, the top 15 terms from the corresponding concept vector (including weights reflecting relative importance) are added to the query. The weights of the original query terms are boosted to reflect their importance. The final stage of query expansion is to add, for each concept, the top 15 terms of its negative concept vector. These terms are added as NOT terms, filtering documents containing these terms from the result set. The motivation behind adding these terms is to remove some of the noise introduced by the increased set of query terms. Finally, the weighted query is executed against a vector space search engine (Lucene¹) and the result set is returned to the user.
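The expansion step can be sketched as the construction of a weighted query string. The sketch only builds text in the style of Lucene's classic query syntax (`term^boost` for boosts and a leading `-` for prohibited terms) and does not call Lucene. The function name, the boost value of 2.0 for original terms, the two-decimal formatting and the negative term "football" are all hypothetical; the paper states only that original terms are boosted and that the top 15 terms are used.

```python
def _top(vec, k):
    """The k highest-weighted (term, weight) pairs of a concept vector."""
    return sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:k]


def expand_query(query_terms, concept_vec, negative_vec, top_k=15, original_boost=2.0):
    """Build an expanded, weighted query string for one chosen concept."""
    # Original query terms get an assumed fixed boost.
    parts = [f"{t}^{original_boost}" for t in query_terms]
    # Expansion terms carry their concept-vector weight as boost.
    parts += [
        f"{t}^{w:.2f}" for t, w in _top(concept_vec, top_k) if t not in query_terms
    ]
    # Terms from the negative concept vector are added as prohibited terms.
    parts += [f"-{t}" for t, _ in _top(negative_vec, top_k) if t not in query_terms]
    return " ".join(parts)


concept_vec = {"christma": 0.71, "tree": 0.60, "valve": 0.13}
negative_vec = {"football": 0.90}  # made-up negative term
print(expand_query(["valve"], concept_vec, negative_vec, top_k=2))
```

Skipping expansion terms that already occur in the query is a design choice of this sketch, made so that a term is not given two competing boosts.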

5.3. Experimental data and results

For the evaluation of the approach we have chosen to use the Lucene search engine as a baseline. The query expansion module of the system is built as a module on top of the search engine, and will thus use the same underlying search mechanisms as standard keyword-based querying. So, even though using Lucene as a baseline may be viewed as a weakness, we argue that this setup will isolate the effect of the query expansion process to make the two more directly comparable in terms of gain.

The ontology used for the evaluation of our prototype implementation is the IIP [6] core ontology, consisting of 18,675 concepts which cover the domain of sub-sea oil drilling and production. The Schlumberger Oilfield Glossary² has been used as the domain-relevant document collection for the construction of the ontological profile. This resulted in a total of 2195 concept vectors successfully constructed.

The evaluation consisted of 7 two-term queries and 5 test subjects. The 7 queries were divided into four different categories based on the query properties (two queries for three of the categories, and one query in the last). The test subjects were asked to score the top 10 hits from the search results from 0 to 2; 0 meaning totally irrelevant, 1 meaning related, and 2 meaning a full match.

Fig. 3 shows that all four of our reformulation strategies performed significantly better than keyword search. One surpris-ing result was that all four of the strategies performed very equally, none of them stood out in any way with respect to querytype. This suggests that the structure of the ontology is not as important as we first thought. Although the evaluation is notstatistically significant, it gives an impression of the performance of our strategy. For a more thorough description of theevaluation, see [15].

1 www.lucene.apache.org.

2 www.glossary.oilfield.slb.com.

Fig. 3. Evaluation of ontological profiles in search [15].


As we introduce new terms to the query, we hypothesize that the reformulated queries enhance the recall of the system, as a broadened set of query terms will retrieve more documents. We have however not investigated how precision is affected by the proposed search strategy. The quality of the vectors is of critical importance for the system to perform well. Further work includes tweaking of parameters and giving the system a more thorough evaluation.

Looking back at the number of concept vectors that were successfully constructed (2195 out of 18,675 potential concepts, roughly 12%) shows that we are not very successful in recognizing the concept names for a large portion of the concepts. This is partially caused by our “simplistic” approach during construction, where we only use the concept name as a phrase. Some of the concept names consist of four or more terms and may be compositional, so their constituent terms may not appear contiguously in the text. Another problem is that many of the concept names are artificial and not used when the topic is discussed in the text. The second problem is hard to deal with, since it is a problem of the ontology, but it can be handled by attaching labels to the concepts that describe alternative names. However, one must apply this solution cautiously, as it may introduce ambiguity. The first problem is on the other hand somewhat simpler to tackle, as we can in the search break up the concept names into their constituent terms and rank the documents based on phrase match, partial phrase match, etc.
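The idea of ranking by phrase match and partial phrase match can be sketched as follows. This is a hypothetical illustration, not part of the implemented prototype: the scoring scheme (1.0 for a contiguous phrase match, otherwise a down-weighted fraction of the constituent terms found), the `partial_weight` value, and the whitespace tokenizer are assumptions chosen for brevity.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def contains_phrase(tokens: List[str], phrase: List[str]) -> bool:
    """True if the phrase occurs as a contiguous token sequence."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def phrase_match_score(concept_name: str, document: str,
                       partial_weight: float = 0.5) -> float:
    """Score a document against a concept name.

    1.0 for a full phrase match; otherwise partial_weight times the fraction
    of distinct constituent terms that occur anywhere in the document.
    """
    name_terms = tokenize(concept_name)
    doc_tokens = tokenize(document)
    if not name_terms:
        return 0.0
    if contains_phrase(doc_tokens, name_terms):
        return 1.0
    distinct = set(name_terms)
    doc_vocabulary = set(doc_tokens)
    present = sum(1 for term in distinct if term in doc_vocabulary)
    return partial_weight * present / len(distinct)


def rank_documents(concept_name: str, documents: List[str]) -> List[str]:
    """Return the documents with a non-zero score, best match first."""
    scored = [(phrase_match_score(concept_name, doc), doc) for doc in documents]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [doc for score, doc in ranked if score > 0]
```

A document containing the full concept name is thus ranked above one that merely mentions all of its terms, which in turn is ranked above one that mentions only some of them.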

The example vector given in Section 3 does not contain phrases. This is a general weakness of our implemented prototype. In our opinion, the quality of the concept vectors would be improved by adding phrases to the vectors, since phrases carry more semantics than their single constituents.

When comparing our work to previous work done on extending ontologies, we would like to pay special attention to the work by Tomassen [17] and by Sieg et al. [13]. Both of these works extend the ontology with related terms. Tomassen on the one hand uses more information found in the ontology and the structure of the ontology to select terms. Sieg et al. on the other hand use the structure of the Open Directory Project (footnote 3) as a topic hierarchy/ontology. During vector construction, Sieg et al. use documents classified under a given concept, in contrast to Tomassen’s and our work, which use a document collection and classify the documents based on their text.

6. Context

We are currently exploring techniques to extend the information retrieval system with contextual features that provide the user with even more relevant information depending on the user context.

Many definitions of context can be found in the literature; they differ slightly from one another and are typically adapted to a specific application. Dey [2] proposed a definition of context that is more generic than many of the others: “Context is any information that can be used to characterize the situation of an entity. An entity is a person, place, or object that is considered relevant to the interaction between a user and an application, including the user and the applications themselves”. Although generic, we find this definition to be a bit too vague for our purposes, and we will instead use a definition from Kofod-Petersen and Mikalsen [10]: “Context is the set of suitable environmental states and settings concerning a user, which are relevant for a situation sensitive application in the process of adapting the services and information offered to the user”.

3 http://www.dmoz.org.

Kofod-Petersen also gives a definition of context sensitivity in [9]: “Attributes are context if they are non-essential (for tasks)”. With this definition we are really narrowing down what should be considered contextual information. A piece of information (attribute in the definition) can only be considered context if it is not essential for the task carried out by the application. In other words, if the application requires a location to do its computations, location is not context (the application will not work properly without the location). We can thus conclude that only non-essential information used to enhance the application’s computations can be considered context.

6.1. Search context

In most information retrieval systems, the only information known about the user and his task is the query which the user issues. For general searches where the user just wants to get an idea about the topic this is not a large problem. The problem arises once the user is trying to find information related to his specific situation, which may lead to several reformulations of the query before the result set is satisfactory. If the system has knowledge about the situation of the user (user context), such as the task the user is performing, the location of the user, and the user’s role, the system should be able to provide much more precise results without the user having to perform multiple query reformulations.

We can partition the system’s knowledge about the user into two separate parts. The first is of course the query, which gives the system a general idea of what the user is looking for. The second part is additional knowledge which can characterize the situation of the user, i.e., the context. This leads us to the following definition of search context, based on the generic definition of context by Kofod-Petersen and Mikalsen [10]:

Definition 2. Search context is the context of a user posting queries to a search application.

Important pieces of information regarded as context are (among others) work process, user role, and location. Work processes are generally formalized in some sort of business process modeling (BPM) language like Business Process Modeling Notation (BPMN, footnote 4) (see Havey [7] for an introduction to BPM). By assuming that the user is part of such a process, relevant information for the retrieval task can be found in the model. A system having knowledge of the user’s participation in a given work process can use this knowledge together with any data connected to the relevant stages of the model to improve retrieval performance.

Further, the notion of role can be quite important. People have different roles in the process depending on their responsibilities and educational background. The engineer monitoring the flow of oil from an oil well will have different needs for information from the maintenance engineer who is performing maintenance on the oil well, even though both are occupied in the process of producing oil. The system can thus exploit the user’s role to deliver more accurate results to the user.

Location can also be an important piece of information, as the relevance of information presented to the user may be dependent on the user’s location. The basic information need of an engineer monitoring oil flow from an oil well at the Heidrun field is not that different from that of the engineer monitoring flow of oil at the Troll platform. The task of these two engineers is basically the same. However, there may be local variations in equipment, regulations, or in the bedrock of the oilfield. Although the tasks are the same, these local variations may influence which information is regarded as the most relevant for the two engineers.

6.2. Contextualized ontological search

This section will give an overview of how context and semantics can be used in a search application. We are assuming that the search context is known to the system. Hence the search context is hidden from the user, relieving the user from entering this information himself. We also assume that the search context is represented in a similar way as the semantic information in the ontological profile. That is, whereas the semantic information is represented as a vector of weighted terms for each concept of the ontology, the search context is represented by a context query, in the form of a vector of weighted ontology concepts as defined below.

Definition 3 (Context query). Let $C_u$ be the search context of a user $u$ and let $O$ be the set of $n$ ontology concepts of the ontology describing context. $o_i \in O$ denotes concept $i$ in the set of concepts. The context query describing $C_u$ is then defined as the vector $\overrightarrow{CQ_u} = [w_1, w_2, \ldots, w_n]$, where each $w_i$ describes the importance of $o_i$ with respect to $C_u$.

The process of combining the user query and the context query is depicted in Fig. 4. We will in the following describe the actions taken from user query to the result set presented to the user.

First, let us assume that there exist two ontological profiles: the first, $OP_s$, is used to expand the user query, and the second, $OP_c$, is used to expand the context query. Further, we assume that there exists a common language, $T$, containing the terms used in both profiles and in the documents searched. The user query (explicitly stated by the user), $UQ$, is represented by the set of query terms $UQ = \{t_1, t_2, \ldots, t_j\}$, where each $t_i \in T$. The context query (autonomously captured by the system), $\overrightarrow{CQ_u}$, is represented according to the above definition.

4 http://www.bpmn.org/.

Fig. 4. Schematic view of contextualized semantic search.

The first step of the query reformulation is to map $UQ$ to a set of concepts of $OP_s$. This step results in a new set, $UQ' = \{c_1, c_2, \ldots, c_k\}$, where each $c_i \in OP_s$. Next, the set of concepts chosen for expansion, $UQ'$, is transformed to a vector of terms. For each concept, $c_i \in UQ'$, we take into account the most relevant terms of the corresponding concept vector, $\overrightarrow{v_i}$. The terms are added to $\overrightarrow{UQ''} = [w_1, w_2, \ldots, w_j]$, where each $w_i$ is the weight of $t_i \in T$.

Next, the context query, $\overrightarrow{CQ_u}$, is expanded using $OP_c$. For each concept, $o_i \in OP_c$, we take into account only the most relevant terms of the corresponding concept vector, $\overrightarrow{u_i}$. This results in the vector $\overrightarrow{CQ''_u} = [w_1, w_2, \ldots, w_j]$, where each $w_i$ represents the weight of $t_i \in T$.

The next step in the process is to combine the user query with the context query. This is done by constructing a final query vector, $\overrightarrow{FQV} = a\,\overrightarrow{UQ''} + b\,\overrightarrow{CQ''_u}$. The constants $a$ and $b$ ($a + b = 1$) are used to specify the relative importance of the user query and the context query, respectively.

Finally, $\overrightarrow{FQV}$ is issued as a query against a vector space model search engine, and the ranked list of results is presented to the user.
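The combination of the expanded user query and the expanded context query can be sketched in Python as follows. The sketch rests on several assumptions that are not fixed by the description above: vectors are stored as sparse dictionaries, the concepts matched by the user query all receive weight 1.0, a term's weight is the concept weight multiplied by the term's weight in the concept vector (summed over concepts), and the cut-off of 15 terms is borrowed from the experiment in Section 5. The names `expand` and `combine` are hypothetical.

```python
from typing import Dict

Vector = Dict[str, float]  # sparse vector: term (or concept) -> weight


def expand(concept_weights: Vector, profile: Dict[str, Vector],
           top_k: int = 15) -> Vector:
    """Expand weighted concepts into a term vector using an ontological profile."""
    terms: Vector = {}
    for concept, concept_weight in concept_weights.items():
        concept_vector = profile.get(concept, {})
        # Only the most relevant terms of each concept vector are considered.
        best = sorted(concept_vector.items(),
                      key=lambda item: (-item[1], item[0]))[:top_k]
        for term, term_weight in best:
            # Assumption: contributions from several concepts are summed.
            terms[term] = terms.get(term, 0.0) + concept_weight * term_weight
    return terms


def combine(user_terms: Vector, context_terms: Vector, a: float = 0.7) -> Vector:
    """Final query vector: a * user vector + b * context vector, with b = 1 - a."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1] so that a + b = 1 with b >= 0")
    b = 1.0 - a
    vocabulary = set(user_terms) | set(context_terms)
    return {
        term: a * user_terms.get(term, 0.0) + b * context_terms.get(term, 0.0)
        for term in vocabulary
    }
```

With `a` close to 1 the explicit user query dominates the final vector, while a smaller `a` lets the context query pull the results towards the user's situation; the default of 0.7 is arbitrary and only serves the example.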

7. Conclusion

We have in this paper presented the concept of ontological profiles and how they are constructed. An ontological profile is a powerful extension to an ontology, bridging the gap between the “abstract” ontology and the use of the concepts in real documents. The ontological profile describes each concept as a vector of terms with weights describing the strength of the relation between them. The evaluation of using ontological profiles for query expansion shows promising results, and a generally better performance than the baseline. We have also performed a small scale evaluation of the quality of the concept vectors and relations. The evaluation was performed in the project management domain, and we evaluated the quality of the relationships found by calculating the cosine similarity between the concept vectors. By relationships we do not necessarily mean the relationships found in the ontology, but the relationships found among the concepts based on the concept vectors. The number of high quality relations found using concept vectors was generally higher than the number found using association rules [5].

Lastly, we defined search context and discussed how contextual search may be incorporated into our semantic search approach. Future work includes evolving the strategy for constructing ontological profiles together with evaluating the quality of the ontological profile; constructing a context model for the user context; and implementing and evaluating the context model for search.

References

[1] R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, ACM Press, New York, 1999.
[2] A.K. Dey, Understanding and using context, Personal Ubiquitous Computing 5 (1) (2001) 4–7.
[3] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, MIT Press, Cambridge, MA, 1998.
[4] W.B. Frakes, C.J. Fox, Strength and similarity of affix removal stemming algorithms, SIGIR Forum 37 (1) (2003) 26–30.
[5] J.A. Gulla, T. Brasethvik, G. Sveia Kvarv, Using association rules to learn concept relationships in ontologies, in: ICEIS 2008: 10th International Conference on Enterprise Information Systems, Barcelona, Spain, June 12–16, 2008.
[6] J.A. Gulla, S.L. Tomassen, D. Strasunskas, Semantic interoperability in the Norwegian petroleum industry, in: Fifth International Conference on Information Systems Technology and its Applications (ISTA 2006), vol. P-84 of LNI, Köllen Druck Verlag GmbH, Bonn, Klagenfurt Austria, 2006, pp. 81–94.
[7] M. Havey, Essential Business Process Modelling, O’Reilly, 2005.
[8] P. Kearney, S.S. Anand, M. Shapcott, Employing a domain ontology to gain insights into user behaviour, in: Proceedings of the Third Workshop on Intelligent Techniques for Web Personalization (ITWP’05), Edinburgh, Scotland, UK, 2005.
[9] A. Kofod-Petersen, A Case-Based Approach to Realising Ambient Intelligence among Agents, Ph.D. Thesis, Norwegian University of Science and Technology, 2007.
[10] A. Kofod-Petersen, M. Mikalsen, Context: representation and reasoning—representing and reasoning about context in a mobile environment, Revue d’Intelligence Artificielle 19 (3) (2005) 479–498.
[11] R. Redon, A. Larsson, R. Leblond, B. Longueville, VIVACE context based search platform, in: CONTEXT 2007, LNAI, vol. 4635, 2007, pp. 397–410.
[12] X. Shen, B. Tan, C. Zhai, Context-sensitive information retrieval using implicit feedback, in: SIGIR ’05: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, 2005.
[13] A. Sieg, B. Mobasher, R. Burke, Ontological user profiles for personalized web search, in: Proceedings of AAAI Workshop on Intelligent Techniques for Web Personalization, AAAI Press Technical Report WS-07-08, July 2007, pp. 84–91.
[14] G. Solskinnsbakk, Ontology-Driven Query Reformulation in Semantic Search, M.Sc. Thesis, Norwegian University of Science and Technology, Trondheim, Norway, 2007.
[15] G. Solskinnsbakk, J.A. Gulla, Ontological profiles in enterprise search, in: EKAW 2008: 16th International Conference on Knowledge Engineering and Knowledge Management, Acitrezza, Catania, Italy, 2008.
[16] X. Su, Semantic Enrichment for Ontology Mapping, Ph.D. Thesis, Norwegian University of Science and Technology, Trondheim, Norway, 2004.
[17] S.L. Tomassen, Searching with document space adapted ontologies, in: Emerging Technologies and Information Systems for the Knowledge Society, First World Summit, WSKS 2008, vol. 5288, Springer-Verlag, Athens, Greece, 2008, pp. 513–522.

Geir Solskinnsbakk received his M.Sc. degree in computer science (2007) from the Norwegian University of Science and Technology (NTNU). He is currently a Ph.D. student at the Department of Computer and Information Science at NTNU, working on semantics and context for information retrieval. His research interests include information retrieval, ontologies, semantics, context, and text mining.

Jon Atle Gulla has been professor of Information Systems at the Norwegian University of Science and Technology since 2002. He received his M.Sc. in 1988 and his Ph.D. in 1993, both in Information Systems, at the Norwegian Institute of Technology. He also has an M.Sc. in Linguistics from the University of Trondheim and an M.Sc. of Management (Sloan fellow) from London Business School. He has previously worked as a manager in Fast Search & Transfer in Munich and as a project leader for Norsk Hydro in Brussels. His research interests include text mining, semantic search, ontologies, and enterprise modeling.