
Information Sciences 190 (2012) 127–143


Semantic relevance ranking for XML keyword search

Ying Lou a,b, Zhanhuai Li a, Qun Chen a,*
a School of Computer, Northwestern Polytechnical University, Xi'an 710073, PR China
b Electronic Information Engineering College, Henan University of Science and Technology, Luoyang 471003, PR China


Article history:
Received 6 September 2010
Received in revised form 21 October 2011
Accepted 3 December 2011
Available online 14 December 2011

Keywords:
XML
Information retrieval
Keyword search
Relevance ranking

0020-0255/$ - see front matter © 2011 Elsevier Inc. All rights reserved.
doi:10.1016/j.ins.2011.12.011

* Corresponding author. Tel.: +86 29 88431520.
E-mail addresses: [email protected] (Y. Lou), [email protected] (Z. Li), [email protected] (Q. Chen).

Keyword search is a user-friendly mechanism used to retrieve XML data for web and scientific applications. Unlike text data, XML data contain rich semantics, which are obviously useful for information retrieval. It is observed that most existing approaches for XML keyword search either do not consider relevance ranking or perform relevance ranking using traditional text IR techniques. Based on an in-depth analysis of user information need and XML structural semantics, we propose to rank the relevance between a keyword query and an XML fragment by their semantic similarity. We first present a formula to quantify the concept of semantic similarity and then introduce a novel semantic ranking scheme for XML keyword search. Our extensive experiments demonstrate that the proposed scheme outperforms existing approaches in terms of search quality and achieves high efficiency and scalability.

© 2011 Elsevier Inc. All rights reserved.

1. Introduction

As more and more XML data are generated in web and scientific applications, determining how to effectively retrieve information from these data has attracted much research interest in recent years [1–6]. Unlike text data, XML data contain semantic markups. These semantics are obviously useful for XML information retrieval but cannot be exploited by traditional text IR approaches. Thus, an important problem of XML IR is determining how to take advantage of XML semantics to improve search quality.

XML keyword search inherits the user-friendly interface of popular web search engines. Unlike the IR approach based on structured languages [7–9], it conveniently serves the users who are unfamiliar with or uninterested in the structure of underlying XML data. The easy-to-use characteristic of keyword search makes it more suitable for commercial applications. However, it also gives rise to the new challenge of measuring the relevance between unstructured data (keyword query) and structured data (XML fragment). An intuitive and widely adopted keyword search approach is to find the SLCA (Smallest Lowest Common Ancestor) structure [3] in XML data. Given a keyword query Q, a relevant SLCA is an XML fragment that contains all of the keywords of Q but has no subtree that also contains them. The SLCA approach judges the relevance between an XML fragment T and a keyword query Q by simply determining whether T contains all of the keywords of Q. Without considering XML semantics, it has serious shortcomings [2,10,11]. The sample keyword queries addressed in this paper are shown in Table 1. For instance, consider the query Q1(XML, IR) on the SIGMOD Record dataset. It intends to find the publications about XML IR. SLCA may return a result such as that shown in Fig. 1(a). Because the keywords "XML" and "IR" are located in two distinct article elements, they describe different entities. This result is indeed irrelevant to Q1. To overcome this shortcoming, XSEarch [2] proposed that for an LCA to be qualified, it should satisfy the criterion that any two nodes within the tree be interconnected.


Table 1
Sample keyword queries.

Q1   XML, IR
Q2   open_auction, Tom, Jack
Q3   XML, Johnson, position
Q4   article, XML, IR
Q5   article, title, XML
Q6   article, XML, author
Q7   article, title
Q8   XML, IR, author
Q9   XML, Johnson
Q10  issue, article, Johnson

Fig. 1. XML fragments: (a) an articles element with two article children whose titles contain "XML" and "IR", respectively; (b) an open_auction element with two bidders named "Tom" and "Jack"; (c) an article(1) element with title(2) containing "XML", author(3) "Johnson" with position(5) = 1, and author(4) "Sammy" with position(6) = 2; (d) an article element whose title and initpage children contain "XML" and "IR"; (e) an articles element with two article children whose title and initpage contain "XML" and "IR".


Two nodes n1 and n2 are interconnected on an LCA tree if and only if the shortest path between them does not have two distinct nodes with the same label except n1 and n2 themselves. XSEarch disqualifies the result shown in Fig. 1(a) because the tree has two nodes both labelled article on the shortest path between the two nodes matching "XML" and "IR", respectively. Unfortunately, XSEarch may falsely filter relevant matches. For instance, consider the query Q2(open_auction, Tom, Jack) on the auction dataset XMark. The XML fragment shown in Fig. 1(b) is disqualified because the path between the two nodes matching "Tom" and "Jack", respectively, has two distinct nodes both labelled bidder. However, this disqualification is undesirable because the user may well be interested in the open auctions in which both Tom and Jack bid.

A variation of SLCA named MaxMatch has also been proposed in [11]. It states that an SLCA tree T is relevant to a query Q if and only if every node on the paths from the root of T to the keyword match nodes is a contributor to Q. Given XML data D, a node n1 in T is a contributor to Q if it does not have a sibling node n2 in D such that MatchKeywords(n1) ⊂ MatchKeywords(n2), in which MatchKeywords(n) represents the set of keywords in Q that have at least one match in the subtree rooted at n. For example, consider the query Q3(XML, Johnson, position) on the SIGMOD Record dataset. The user intends to find publications about XML that are authored by "Johnson", and wants to know his position in the author list. In Fig. 1(c), the SLCA fragment consisting of the keyword match nodes (2, 3, 6) is not qualified because the node author(4) on the path from the root article(1) to the node position(6) is not a contributor to Q3: it has a sibling node author(3) containing more keywords of Q3. The MaxMatch approach satisfies monotonicity and consistency with respect to the data and query, which are reasoned to be desirable properties of an effective XML keyword search engine. Unfortunately, these properties are not enough to guarantee a good search engine. For instance, MaxMatch still has the shortcoming exemplified by the query Q1: it fails to identify the SLCA shown in Fig. 1(a) as irrelevant.

It is observed that existing SLCA-based approaches determine the relevance of an XML fragment to a keyword query according to its structural properties. The diversity of keyword combinations and user intentions means that these approaches work well under some circumstances but perform poorly if the underlying data and keyword queries change. Their judgements of relevance are not universally valid. On the other hand, these approaches consider an XML fragment to be relevant or irrelevant outright. They cannot flexibly accommodate the probable relevance ambiguity implied by keyword queries. More desirable search approaches should instead rank the most relevant results first and at the same time return likely relevant results with lower ranks as much as possible. For example, consider again the SLCA of Q1 shown in Fig. 1(a). Even though the keywords "XML" and "IR" are contained in two distinct articles, it may still be interesting to the user to find out whether these two articles have the same authors or are published in the same journal issue. Instead of reasoning the search result to be relevant or irrelevant, a desirable search engine should return it with a rank lower than that of the result in which the two keywords appear in one article.

We note that most proposed ranking schemes [12–15,5] measure the relevance of XML keyword searches in the traditional text IR manner. They mainly consider two factors: the compactness of the XML fragment and the TF*IDF formula. One of the typical ranking schemes for XML IR is the INEX approach [16]. INEX classifies IR queries into two types: content-and-structure (CAS) queries and content-only (CO) queries. CAS queries are based on the NEXI structured language, and CO queries correspond to the keyword searches we consider in this paper.


INEX supposes that CO queries only include the keywords that appear in the content nodes of XML data. In contrast, the keyword queries we consider may contain any keyword in XML data, including element and attribute labels. For CO queries, INEX's ranking scheme [12] is a variation of the traditional TF*IDF metric. TF*IDF and compactness measurements, though proven to be effective for text IR, are inadequate for XML IR because they do not consider XML semantics. For instance, consider the query Q4(article, XML, IR) on the SIGMOD Record dataset. The user intends to find the articles about XML IR. Two matching XML fragments are shown in Fig. 1(d) and (e). Naive application of the TF*IDF metric might rank result (e) higher than (d) because result (e) has a higher keyword frequency. However, this ranking is misleading: result (d) is indeed more relevant to user information need than (e).

In this paper, we approach the ranking scheme for XML keyword search from a new perspective. We do not determine an XML fragment's relevance to a keyword query from its semantics alone; we also analyse its semantic similarity to the query. Even though a keyword query is unstructured and may be ambiguous, the associated user intention can usually be determined. For instance, consider the query Q4(article, XML, IR) on the SIGMOD Record dataset. Because the keyword "article" is an element label and the other two keywords appear on the content nodes in the XML data, the most reasonable interpretation of Q4's intention is that the user is interested in the articles with descriptions that include the keywords "XML" and "IR". Based on this analysis, we reexamine the fragments (d) and (e) in Fig. 1. The topic of the fragment (d) is the name of its root node, article. It matches the user's topic of interest semantically. The topic of the fragment (e) is instead the element articles. It does not match the user's topic of interest. Therefore, the fragment (e) should be assigned a lower ranking score than the fragment (d), regardless of its tree size and keyword frequency.

The preceding example illustrates the importance of analysing the semantic relationship between an XML fragment and a keyword query in determining their relevance. This paper proposes a new ranking scheme for XML keyword search based on measuring semantic similarity. Our contributions can be summarised as follows:

1. We propose to determine the relevance between an XML fragment and a keyword query according to their semantic similarity, which can be measured by the similarity between the information topic of the XML fragment and the user's topic of interest implied by the query;

2. We present a formal method of quantifying the concept of topic similarity. Our method is based on an in-depth analysis of user information need and the underlying XML data, and is thus accurate and objective;

3. We introduce a novel semantic ranking scheme called SRank for XML keyword search and design an efficient algorithm to implement it;

4. We conduct comprehensive experiments to verify the effectiveness of the proposed ranking scheme. Our results show that SRank improves XML keyword search quality beyond that of existing approaches and achieves high efficiency as well.

The rest of this paper is organised as follows: we review related work in Section 2; Section 3 presents the data model, query definition and an analysis of user information need; Section 4 describes the concept of semantic similarity, its measurement method and the semantic ranking scheme SRank; a search algorithm implementing SRank is detailed in Section 5; the experimental study is presented in Section 6; and we conclude this paper with some suggestions for future research in Section 7.

2. Related work

There are many studies on how to identify relevant matches for XML keyword search. Besides SLCA [3], XSEarch [2], VLCA [10] and MaxMatch [11], which we mentioned before, there are also other structures such as GDMCT [17] and MLCA [6]. These structures do not consider the semantic meanings of keywords within returned matches either. Except for XSEarch, none of them provides a ranking scheme.

Regarding the ranking scheme, [13,14] proposed to rank query results according to the distance between different keyword matches in a document. XRANK [1] extends Google's PageRank hyperlink metric to XML elements to rank returned LCA results. However, no empirical study has been done to verify the effectiveness of its ranking function. XSEarch [2] combines a simple TF*IDF IR ranking with tree size to rank results. However, its keyword input format requires users to have some knowledge of the underlying schema information. This property limits its usefulness. XReal [5] exploits the statistics of the underlying XML data to infer users' search intention and rank LCAs. Based on the concepts of XML TF (term frequency) and XML DF (document frequency), it proposes an XML TF*IDF similarity ranking scheme. Note that even though XReal infers the semantics of user search intention, it uses this information to improve the way the weights of keywords are set in the TF*IDF formula. Unlike SRank, its ranking scheme is still IR style and does not consider the semantic similarity between keyword queries and XML fragments.

Several studies in the literature have addressed the keyword search problem in graph-structured data. Most of these works have focused on search efficiency. They are all heuristics-based because the reduced tree search problem in a graph is NP-hard [18]. BANKS [19] uses bidirectional expansion heuristic algorithms to search for the smallest structure containing the keywords. BLINKS [20] proposes a bi-level index to accelerate the search for top-k results. XKeyword [13] provides a keyword proximity search approach for graph-structured XML data conforming to a certain schema. EASE [4] combines IR ranking and DB ranking based on structural compactness to perform keyword search on heterogeneous graph data.


There are also some interesting works orthogonal to ours. XSeek [21] studies how to identify return nodes within a query match: besides relevant keyword matches and the paths connecting them, other nodes that do not match keywords may also be relevant and should be returned. SAIL [22] answers keyword queries according to the concept of minimal-cost trees and identifies the top-k answers by using link-based relevance ranking and keyword-pair-based ranking. To determine relevant non-matches, XKeyword [13] and Precis [23] allow a system administrator/user to specify them on a schema graph. More recently, [24] studied how to process top-k keyword searches over probabilistic XML data and [25] investigated the new problem of nearest keyword search in XML databases.

In addition, keyword search has also been used for information retrieval in other application scenarios, such as relational databases [26], the semantic web [27], proxy re-encryption [28], etc.

3. Preliminaries

3.1. Data model and keyword search

We model XML data as a rooted, labelled and unordered tree. Every internal node in the tree has a name and each leaf node has a data value. XML attribute nodes are modelled as children of the associated element nodes and differentiated from element nodes. Besides the attribute nodes explicitly specified in XML data, we categorise a node as an attribute if it has only one child that is a data value. A user query is expressed as a set of keywords, each of which may match name or value nodes in the XML tree.

Definition 1 (Keyword Match). If a keyword k is the name of a node u or is contained in the value of a node u, we say that the node u is a match for the keyword k.

For example, consider the XML fragment shown in Fig. 1(c). The node article(1) is a match for the keyword "article" and the node title(2) is a match for "XML".

Similar to existing approaches [1,3,21,11], we consider conjunctive keyword queries. Under conjunctive semantics, a query match should contain all the keywords of a query.

Definition 2 (Query Match). Processing a keyword query Q on XML data D returns a set of query matches. Each query match is a tree defined by the pair T = (r, M), where r is the root and M is a subset of keyword matches in the tree. Each keyword in Q has at least one match in M. The match tree consists of the paths in D connecting r to each match in M, and its root r should be the LCA (lowest common ancestor) node of all the matches in M.
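To make the data model concrete, the following minimal Java sketch (our illustration, not the authors' implementation; all class and method names are assumptions) represents XML nodes as a rooted labelled tree, treats a node with a single data-value child as an attribute, and implements the keyword-match test of Definition 1.

import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the data model in Section 3.1 (illustrative only).
class XmlNode {
    String name;                              // element or attribute name (null for a pure value node)
    String value;                             // data value (null for internal nodes)
    XmlNode parent;
    List<XmlNode> children = new ArrayList<>();

    XmlNode(String name, String value) { this.name = name; this.value = value; }

    XmlNode addChild(XmlNode child) { child.parent = this; children.add(child); return child; }

    // A node is categorised as an attribute if its only child is a data value.
    boolean isAttribute() {
        return children.size() == 1 && children.get(0).value != null;
    }

    // Definition 1: u is a match for k if k is u's name or is contained in u's value
    // (the value of an attribute node is held in its single leaf child).
    boolean matches(String keyword) {
        if (keyword.equals(name)) return true;
        if (value != null && value.contains(keyword)) return true;
        for (XmlNode c : children)
            if (c.name == null && c.value != null && c.value.contains(keyword)) return true;
        return false;
    }
}

public class DataModelDemo {
    public static void main(String[] args) {
        // Rebuild part of the fragment of Fig. 1(c): article(1) with title(2)="XML" and author(3)="Johnson".
        XmlNode article = new XmlNode("article", null);
        XmlNode title = article.addChild(new XmlNode("title", null));
        title.addChild(new XmlNode(null, "XML"));
        XmlNode author = article.addChild(new XmlNode("author", null));
        author.addChild(new XmlNode(null, "Johnson"));

        System.out.println(title.isAttribute());         // true: single data-value child
        System.out.println(article.matches("article"));  // true: name match
        System.out.println(title.matches("XML"));        // true: value match
    }
}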

In this paper, we study how to effectively measure semantic relevance between a query match and a keyword query. Instead of simply categorising a query match as relevant or irrelevant, we intend to provide a ranking scheme that assigns it an appropriate relevance score. Query matches are then returned to users in decreasing order of their ranking scores. Note that, compared with existing LCA-based approaches, our approach is more consistent with the requirement of IR systems, which are supposed to return the most relevant results first and at the same time return other probably relevant results as often as possible.

3.2. Semantic analysis of user information need

XML data possess rich semantics, which can be conveniently inferred from their structure. A keyword query, on the other hand, is unstructured and contains no obvious semantic information. To determine the semantic relevance, we have to analyse the semantics of the user information need implied by a keyword query.

We model the user information need implied by a keyword query by a template consisting of three Q&As (Question-and-Answer):

• QA(1): what does the user search for?
• QA(2): what are the search conditions?
• QA(3): what information regarding the search targets is the user interested in?

QA(1) corresponds to the FROM clause in an SQL query, which specifies the relational tables to be queried. Its answer indicates a user's topic of interest. QA(2) corresponds to the WHERE clause. A search condition consists of two parts: an attribute name and the keywords to be contained. Note that a search condition can be incomplete: it may specify the keywords to be contained or the attribute name, but not both. QA(3) corresponds to the SELECT clause. It specifies the target information to be returned to the user.

Based on the model above, we classify keywords into four types according to the roles they play in defining the semantics of user information need:

• Topic Keyword. A topic keyword directly answers QA(1). It explicitly reveals what XML elements the user is interested in. For instance, consider the query Q4(article, XML, IR). The keyword "article" is a topic keyword.


• Condition Attribute Keyword. A condition attribute keyword specifies the attribute name of a search condition. For instance, consider Q5(article, title, XML). It can be reasoned that the user intends to find the articles with titles that contain "XML". The keyword "title" is therefore a condition attribute keyword.

• Return Attribute Keyword. A return attribute keyword specifies the attribute information of the search targets to be returned. For instance, consider the query Q6(article, XML, author). The user's intention is to identify the authors of the articles about XML. The keyword "author" is therefore a return attribute keyword.

• Condition Value Keyword. A condition value keyword specifies the attribute values of a search condition. For instance, consider the query Q5(article, title, XML) again. It suggests that the titles of the target articles should contain "XML". The keyword "XML" is therefore a condition value keyword.

Note that a meaningful user query does not have to contain all the types of keywords. For instance, the query Q4(article, XML, IR) only contains a topic keyword and condition value keywords. A query may even contain only one type of keyword. For instance, Q1(XML, IR) only has condition value keywords. It is also possible that a query contains no condition value keyword, such as Q7(article, title). Q7 is supposed to find the titles of all the articles.

According to the model, determining the semantics of user information need requires that the types of the query keywords be properly identified. It is observed that there are three types of words in XML data: element names, attribute names and data values. We thus have the following inference rules:

1. If a keyword is an element name in XML data, it is a topic keyword;
2. If a keyword is an attribute name, it is either a condition attribute keyword or a return attribute keyword;
3. If a keyword is a text value word, it is a condition value keyword.

Consider the query Q5(article, title, XML) on the SIGMOD Record dataset. The keyword "article" is an element name and therefore a topic keyword. The keyword "XML" is a data value word and therefore a condition value keyword. The keyword "title" is formally defined to be an element name. However, because the nodes labelled "title" contain only one child, which is a data value node, they are considered to be attribute nodes. As a result, the keyword "title" is either a condition attribute keyword or a return attribute keyword.
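As an illustration of the three inference rules, the sketch below classifies a query keyword by the kind of position at which it occurs in the data. It is a hypothetical helper of our own, not the paper's implementation; the word-type sets are assumed to be collected while parsing the document.

import java.util.Set;

// Illustrative sketch of the keyword-type inference rules in Section 3.2.
public class KeywordTyping {

    enum KeywordType { TOPIC, ATTRIBUTE, CONDITION_VALUE, UNKNOWN }

    // elementNames / attributeNames / valueWords are assumed to be gathered during parsing.
    static KeywordType infer(String keyword, Set<String> elementNames,
                             Set<String> attributeNames, Set<String> valueWords) {
        if (elementNames.contains(keyword)) return KeywordType.TOPIC;          // rule 1
        if (attributeNames.contains(keyword)) return KeywordType.ATTRIBUTE;    // rule 2 (condition or return)
        if (valueWords.contains(keyword)) return KeywordType.CONDITION_VALUE;  // rule 3
        return KeywordType.UNKNOWN;
    }

    public static void main(String[] args) {
        Set<String> elements = Set.of("article", "articles");
        Set<String> attributes = Set.of("title", "author", "initpage");
        Set<String> values = Set.of("XML", "IR", "Johnson");

        // Q5(article, title, XML) on the SIGMOD Record data:
        for (String k : new String[] {"article", "title", "XML"}) {
            System.out.println(k + " -> " + infer(k, elements, attributes, values));
        }
        // prints: article -> TOPIC, title -> ATTRIBUTE, XML -> CONDITION_VALUE
    }
}

Whether an ATTRIBUTE keyword acts as a condition or a return attribute keyword is resolved per query match, as discussed next.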

Note that if a keyword is an attribute name, its semantic meaning is ambiguous. In the query Q5(article, title, XML), the keyword "title" is a condition attribute keyword. In the query Q6(article, XML, author), the keyword "author" is also an attribute name, but a return attribute keyword. This ambiguity can be clarified by the structure of the underlying XML data. In the query matches for Q5, the attribute nodes labelled "title" contain the keyword "XML". The keyword "title" therefore serves as a condition attribute keyword. In the query matches for Q6, the attribute nodes labelled "author" do not contain any condition value keyword. The keyword "author" is therefore a return attribute keyword.

By properly inferring the types of the query keywords, the semantics of user information need can be determined according to the model of three Q&As. For example, consider the query Q6(article, XML, author). The keyword "article" is a topic keyword and therefore represents the user's topic of interest. The keyword "XML" is a condition value keyword. It should therefore be used to specify a search condition for the target elements. The keyword "author" is a return attribute keyword. Therefore, the author attribute values of the target elements should be returned to the user. Semantically, Q6 intends to identify the authors of the articles with attribute values that contain "XML".

Determining user information need may become more challenging if a keyword query contains incomplete information. For instance, consider the query Q8(XML, IR, author). Its keywords provide no answer to QA(1) and only a partial answer to QA(2). It remains unclear what elements the user is interested in and over which attributes the search conditions should be specified. Another typical query is Q9(XML, Johnson), which intends to identify the publications about XML that are authored by "Johnson". Its keywords only provide partial answers to QA(2). All other semantics of user information need are missing.

To make matters worse, the semantics implied by a keyword query may be ambiguous. For instance, consider the query Q10(issue, article, Johnson). The keywords "issue" and "article" are both inferred to be topic keywords because they are element names. However, only one of them is valid because a query is supposed to have one and only one topic of interest. Without knowledge of the underlying XML data, it is difficult to know which of them represents a user's topic of interest. By analysing the retrieved query matches, we observe that the issue elements contain the article elements. The keyword "issue" therefore represents the user's topic of interest. The proper semantic interpretation of Q10 is that it intends to find the journal issues containing the articles authored by "Johnson". In this example, the keyword "article" plays the role of a condition attribute keyword.

Ambiguity may also arise if a query keyword appears at different types of positions in XML data. In this case, an analysis of user information need should consider all probable combinations of keyword roles. For example, consider the query Q10 again. The keyword "issue" appears at data value nodes as well as element nodes in the SIGMOD Record dataset. Therefore, another interpretation of Q10's intention is to find the articles with titles that contain the keyword "issue" and are authored by "Johnson".

Summary. In this subsection, we propose to reason about the semantics of user information need by inferring the roles of query keywords. Our approach is based on a sound analysis of XML semantics. However, we also show that reasoning based on keywords alone may be inadequate to accurately identify query intention because keywords may provide incomplete or ambiguous information.

4. Semantic ranking scheme

In this section, we propose to measure the semantic relevance between an XML fragment and a keyword query by their semantic similarity and introduce the semantic ranking scheme SRank.

4.1. Semantic similarity

4.1.1. Semantic consistency

If an XML fragment exactly matches the semantics of a keyword query, its semantic relevance to the query should be ranked the highest. Consider the query Q5(article, title, XML) and its query match consisting of article(1) and title(2) shown in Fig. 1(c). A preliminary analysis of user information need shows that "article" represents the user's topic of interest, "XML" is a condition value keyword, and "title" is either a condition attribute keyword or a return attribute keyword. It is also observed that the information topic of the query match is "article" and the attribute node matching "title" contains the keyword "XML". Because the keyword "title" serves as a condition attribute keyword, a reasonable interpretation of Q5's intention is to find the articles with titles that contain "XML". The XML fragment shown in Fig. 1(c) therefore exactly matches the semantics of Q5's user information need. It is worth noting that in the above example the ambiguity of the semantic role of the keyword "title" is clarified by the structure of the XML data. Such reasoning is based on the principle that whatever semantics exist in XML data are proper interpretations of user information need.

We model the semantics of a keyword query using the structure shown in Fig. 2(a). It is based on the three Q&As we defined in Section 3.2. The label of its root node is the topic keyword. It may also contain condition attribute nodes and return attribute nodes, which are labelled with condition attribute keywords and return attribute keywords, respectively. Each condition attribute node has a child data value node containing at least one condition value keyword. If the query provides only condition value keywords but no condition attribute keyword, its condition attribute node is labelled with a wildcard. Note that the condition/return attribute nodes are connected to the root node by ancestor–descendant edges.

Semantic consistency between an XML fragment and a keyword query is formally defined as follows:

Definition 3 (Semantic Consistency). If an XML fragment exactly matches the semantic structure of the user information need implied by a keyword query as shown in Fig. 2(a), we say that it is semantically consistent with the query.

As we have shown in Section 3.2, not all the semantic information in Fig. 2(a) can be accurately reasoned beforehand. A keyword query may provide incomplete or ambiguous information. For instance, consider the query Q6(article, XML, author) on the SIGMOD Record dataset. A preliminary analysis of the query shows that the keyword "article" represents the user's topic of interest, "XML" is a condition value keyword and "author" is an attribute name. However, it remains unclear whether "author" is a condition attribute keyword or a return attribute keyword and what the condition attribute name for "XML" is. In the query match consisting of the nodes (1, 2, 3) in Fig. 1(c), the attribute node labelled "author" does not contain any condition value keyword and the attribute node matching "XML" has the label "title". It can therefore be reasoned that the tree in Fig. 2(b) is a proper semantic interpretation of Q6's search intention. Note that a wildcard can match any attribute name. Generally, the ambiguity of an attribute keyword can be clarified in the following way: if its corresponding attribute node in a query match contains at least one condition value keyword, it is a condition attribute keyword; otherwise, it is a return attribute keyword.
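The disambiguation rule just stated can be phrased as a small check over a query match. The following sketch is our own reading of it; the method and parameter names are assumptions rather than the authors' API.

import java.util.List;
import java.util.Set;

// Sketch of the attribute-keyword disambiguation rule described above (our reading).
public class AttributeRole {

    enum Role { CONDITION_ATTRIBUTE, RETURN_ATTRIBUTE }

    // subtreeWords: the words appearing under the attribute node that matches the
    // attribute keyword in a given query match.
    static Role resolve(Set<String> subtreeWords, List<String> conditionValueKeywords) {
        for (String cvk : conditionValueKeywords) {
            if (subtreeWords.contains(cvk)) return Role.CONDITION_ATTRIBUTE;
        }
        return Role.RETURN_ATTRIBUTE;
    }

    public static void main(String[] args) {
        // Q5(article, title, XML): the matching title node contains "XML".
        System.out.println(resolve(Set.of("XML", "Retrieval"), List.of("XML")));  // CONDITION_ATTRIBUTE
        // Q6(article, XML, author): the matching author node contains no condition value keyword.
        System.out.println(resolve(Set.of("Johnson"), List.of("XML")));           // RETURN_ATTRIBUTE
    }
}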

Another scenario in which ambiguity may arise is when a query contains no topic keyword. In this case, the keywords are either attribute names or data value words. If a keyword's match node is an attribute, its parent should be an element node.

Fig. 2. Semantic structure of keyword query. (a) The general structure: a root node labelled with the topic keyword (TK); condition attribute keyword (CAK) nodes, each with a data value child containing condition value keywords (CVK); and return attribute keyword (RAK) nodes, all connected to the root by ancestor–descendant edges. (b) The inferred structure for Q6: root article, a wildcard (*) condition attribute node whose value contains "XML", and a return attribute node author.


Consider the lowest element node containing this keyword. Its name represents the most specific topic this keyword is supposed to describe; it therefore indicates the user's topic of interest. For instance, consider the query Q1(XML, IR) and its query match in Fig. 1(d). The attribute nodes matching "XML" and "IR" have the same parent, the element node labelled article. The search topic of Q1 is therefore reasoned to be "article". Note that more than one element node may be inferred to indicate the user's search topic. Consider the query match for Q1 shown in Fig. 1(a). Two element nodes, both labelled article, are inferred to represent Q1's search topic. Its search topic is therefore reasoned to be "article". In this case, the query match in Fig. 1(a) is not semantically consistent with Q1 because its information topic is "articles" while the user's topic of interest is "article".

4.1.2. Semantic similarity

Now we consider the problem of how to measure the semantic similarity between a keyword query and its query match. It is observed that in a query match, a condition value keyword should be a text value word and the semantic role of an attribute keyword is determined by whether its matching node contains condition value keywords. A query match therefore automatically matches all the semantics of the structure shown in Fig. 2(a), except the topic keyword. As demonstrated in Fig. 1(a), the topic keyword may appear at internal nodes in the query match. In this case, the semantic difference between a query and its query match can be conceptually explained as follows:

The user is interested in the elements of a topic keyword satisfying all the search conditions as specified in Fig. 2(a). The query match instead describes an element that satisfies the search conditions but is a super-element of the user's element of interest. We call an element u a super-element of v if u contains v in the XML data.

The semantic similarity between a query and its query match therefore corresponds to the semantic similarity between the user's search topic and the information topic of the query match, which can be naturally measured by the distance between their corresponding positions in the XML data. It is formally defined as follows:

S_Sim(M, Q) = λ^d,

where λ is the relevance decay constant, which assumes a value within (0, 1), and d represents the distance between the two topics in the query match.

For instance, consider the query Q1(XML, IR). In the query match shown in Fig. 1(d), the root node's label represents the user's search topic; its semantic similarity with Q1 is therefore 1. In the query match shown in Fig. 1(a), both element nodes probably indicating the user's search topic are located at a distance of 1 from the root; its semantic similarity with Q1 is therefore λ.
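A direct transcription of the formula follows; the decay constant 0.5 is an arbitrary choice for illustration, not a value prescribed by the paper.

// Sketch of S_Sim(M, Q) = lambda^d (illustrative only).
public class SemanticSimilarity {
    static double sSim(double lambda, int d) {
        // d: distance between the node representing the user's search topic
        // and the information topic (root) of the query match.
        return Math.pow(lambda, d);
    }

    public static void main(String[] args) {
        double lambda = 0.5;                  // assumed decay constant
        System.out.println(sSim(lambda, 0));  // 1.0: the inferred topic node is the root
        System.out.println(sSim(lambda, 1));  // 0.5: the inferred topic node is one level below the root
    }
}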

As mentioned in Section 3.2, if a query contains one and only one topic keyword, the topic keyword is reasoned to be the user's search topic. Otherwise, the user's search topic may be any one of multiple possibilities. The possible scenarios include:

1. the lack of topic keywords in the query;
2. the presence of more than one topic keyword in the query;
3. a query keyword that appears at different types of positions in the query match.

In the first case, a user’s search topic is supposed to be determined by the topic node closest to the root in the querymatch. Consider the query match of Q(XML,2008) shown in Fig. 3(a). The two element nodes indicating the user’s search to-pic are article and issue. Because the node issue is closer to the root, it is reasoned to represent user’s search topic. The querymatch’s semantic similarity to the query is therefore k0=1.

Fig. 3. Multiple probabilities of user search topic. (a) The query match of Q(XML, 2008): an issue element with a time child containing "September 2008" and an articles/article subtree whose title contains "XML". (b) The query match of Q(issue, 2007): an issue element with a time child containing "January 2007" and an articles/article subtree whose title contains "issue".


In the second case, if a topic keyword k1 contains another topic keyword k2 in the query match, the user's search topic is inferred to be k1. Otherwise, whichever is closer to the root is chosen to represent the user's search topic.

In the third case, we first determine the user's search topic according to each distinct keyword type and then choose the one most similar to the information topic of the query match. Consider the query match of Q(issue, 2007) shown in Fig. 3(b). The keyword "issue" is both an element name and a data value word. In the case of an element name, its inferred topic node is the root, whose semantic similarity to the information topic of the query match is 1; in the case of a data value word, its inferred topic node is the node article, with a semantic similarity of λ^2. The semantic similarity is therefore measured to be 1.

4.2. Ranking scheme

The ranking scheme SRank values semantic similarity as the most important factor in measuring the relevance between a keyword query and a query match. A query match with a higher value of S_Sim(M, Q) is always ranked higher. When two query matches have the same value of S_Sim(M, Q), SRank further considers the factors of the attribute keyword's relevance to the information topic and the value keyword's relevance to the attribute.

Attribute Keyword's Relevance to Information Topic. In a query match, an attribute keyword, either a condition attribute keyword or a return attribute keyword, is connected to the root by a path. The path length indicates the corresponding attribute node's semantic closeness to the information topic of the query match. It is observed that the shorter the path is, the more relevant the attribute is to the information topic. Given a query match M, we compute a k-labelled attribute node's relevance to M's information topic by the following formula:

AR(k, M) = 1/da,

where da represents the distance of the k-labelled attribute node to the root of M.

Value Keyword's Relevance to Attribute. Suppose that a condition value keyword k is matched by the attribute nodes labelled La in a query match. Consider all the keyword matches (La, k) in the query match. We measure the keyword k's relevance to the attribute La by a variant of the traditional TF*IDF metric. It is based on the following two guidelines:

1. A query match with a higher percentage of La-labelled attribute nodes matching the keyword k in the XML tree with the same root should be considered to be more relevant than a query match with a lower percentage;
2. A keyword appearing in many La-labelled attribute nodes should be considered to be less important than a keyword appearing in only a few.

Given a query match M rooted at an element er, a condition value keyword k's relevance to a condition attribute keyword La is computed by the following formula:

VR(k, La) = perc(k, er) · ln(freq(La) / freq(k, La)),

where perc(k, er) represents the percentage of the La-labelled attribute nodes matching the keyword k in the XML tree rooted at er, freq(La) represents the total number of the La-labelled attribute nodes that are contained by the elements with the same label as er, freq(k, La) represents the number of La-labelled attribute nodes that match the keyword k and are contained by the elements with the same label as er, and ln() is the normalisation function.

Note that in the above formula, perc(k, er) and ln(freq(La)/freq(k, La)) correspond to the TF and IDF factors, respectively. The IDF factor does not consider the La-labelled attribute nodes in the whole XML data, but only those within the elements with the same label as er. It is based on the observation that the La-labelled attributes have the same semantic meaning within the same type of elements but may have different semantic meanings under different types of elements.

Ranking Scheme. As shown in Fig. 2(a), the semantics of a keyword query consist of the search topic, the search conditions and the return attributes. A query match automatically satisfies a query's semantics except for the search topic; therefore, it contains the corresponding instances of its search conditions and return attributes. In the case that two query matches have the same semantic similarity to a query, SRank ranks them by the relevance of their search conditions and return attributes to their information topics. The relevance of a return attribute is measured by its corresponding attribute keyword's relevance to the information topic. The relevance of a search condition is measured by a metric consisting of two factors: its condition attribute keyword's relevance to the information topic and its condition value keyword's relevance to its condition attribute keyword.

SRank ranks the query matches with the same semantic similarity according to the following formula:

α · Σ_{∀k} AR(k, M) + β · Σ_{∀sc(k′, La)} (AR(La, M) · VR(k′, La)),

where k represents a return attribute keyword, sc(k′, La) represents a search condition, k′ represents a condition value keyword, La represents a condition attribute keyword, and α and β are the constants representing the relative importance of search conditions and return attributes.


5. Search algorithm

The search algorithm implementing SRank first generates the query matches in order of semantic similarity and then sorts those with the same semantic similarity using the formula presented in Section 4.2. In the pre-processing step, we parse an XML document, determine its word types, record their positions and construct inverted lists. Each distinct keyword k has an inverted index with entries that correspond to the XML nodes matching k. Each entry in an inverted index stores the following information: (i) the type of its match node, element or attribute; (ii) the Dewey code of its match node [1]. The entries are not ordered by the Dewey codes of their match nodes but by the Dewey codes of the lowest element nodes (LED) containing these match nodes. Note that if a match node is an attribute node, the lowest element node containing it is its parent node. If two entries in an inverted list have the same LED, they are merged into one. For simplicity of presentation, we refer to the Dewey code of a match node's LED as the entry's Dewey code. Because the algorithm sorts the query matches with the same semantic similarity after they are generated, the rest of this section focuses on how to generate query matches in order of semantic similarity. The values of AR(k, M) and VR(k, La) used for sorting are precomputed during the pre-processing step.
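The sketch below illustrates the kind of inverted-list entry and Dewey-code operations this pre-processing relies on (deriving the LED of an attribute match, testing containment, and comparing entries in document order); the class layout is our assumption, not the authors' code.

import java.util.Arrays;

// Sketch of an inverted-list entry keyed by the Dewey code of its lowest element node (LED).
public class DeweyEntry implements Comparable<DeweyEntry> {

    final int[] led;                 // Dewey code of the lowest element node containing the match
    final boolean isAttributeMatch;

    DeweyEntry(int[] matchDewey, boolean isAttributeMatch) {
        // For an attribute match, the LED is the parent of the match node.
        this.led = isAttributeMatch
                ? Arrays.copyOfRange(matchDewey, 0, matchDewey.length - 1)
                : matchDewey;
        this.isAttributeMatch = isAttributeMatch;
    }

    // True if this entry's LED lies in the subtree rooted at the node with Dewey code 'ancestor'.
    boolean containedIn(int[] ancestor) {
        if (ancestor.length > led.length) return false;
        for (int i = 0; i < ancestor.length; i++)
            if (led[i] != ancestor[i]) return false;
        return true;
    }

    // Document order on Dewey codes (lexicographic comparison).
    public int compareTo(DeweyEntry other) {
        return Arrays.compare(led, other.led);
    }

    public static void main(String[] args) {
        // author(0.0.0) is an attribute match, so its LED is article(0.0); likewise for title(0.0.1).
        DeweyEntry johnson = new DeweyEntry(new int[] {0, 0, 0}, true);
        DeweyEntry xml = new DeweyEntry(new int[] {0, 0, 1}, true);
        System.out.println(Arrays.toString(johnson.led));   // [0, 0]
        System.out.println(xml.containedIn(johnson.led));   // true: both matches share LED article(0.0)
        System.out.println(johnson.compareTo(xml) == 0);    // true: same LED, so the entries would be merged
    }
}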

Algorithm 1. SRank Search Algorithm

1:  while (none of ILj (1 ≤ j ≤ m) is empty) do
2:    Select ej with the smallest Dewey code;
3:    if ((ej's LED contains all the other keywords) and (its corresponding query match Mi has a semantic similarity of 1)) then
4:      Generate and output Mi;
5:      if (the number of query matches reaches t) then
6:        algorithm ends;
7:      end if
8:      Remove all the keyword match nodes in Mi from their inverted lists;
9:    else
10:     Remove ej from its inverted list;
11:     if (ej's LED does not contain all the other keywords) then
12:       Attach the parent element of ej's LED to the list Lj1;
13:     else
14:       Attach ej's LED to the list Lj1;
15:     end if
16:   end if
17: end while
18: for (each remaining entry in ILj (1 ≤ j ≤ m)) do
19:   Attach the parent element of its LED to Lj1;
20: end for
21: r ← 1;
22: while (none of Ljr (1 ≤ j ≤ m) is empty) do
23:   for (1 ≤ j ≤ m) do
24:     Sort the elements in Ljr by Dewey code;
25:   end for
26:   Process the lists of Ljr (1 ≤ j ≤ m) in the same way as ILj and generate the new lists Lj(r+1);
27:   r ← r + 1;
28: end while

In the order of r = (0, 1, ..., d), the algorithm sequentially identifies the query matches with a semantic similarity of λ^r until the desired top t answers are returned. The details of this process are sketched in Algorithm 1. The inputs are the query Q consisting of m query keywords (k1, ..., km) and the number of returned query matches t.

In the first part of Algorithm 1 (lines 1–17), the algorithm repeatedly selects the entry with the smallest Dewey code among the m inverted lists (IL1, ..., ILm) and matches it with the entries of the other keywords. Suppose this entry ej is of keyword kj. If its LED contains all the other keywords of Q, the query match Mi rooted at ej's LED has a semantic similarity of 1, except in the case that a topic keyword of Q matches an internal node in Mi but no topic keyword in Q matches the root of Mi. An example of this scenario is shown in Fig. 4. The query consists of three keywords (k1, k2, k3). The keyword k1 is the label of the element node n2 in the query match while the keywords k2 and k3 are the words in the data value nodes n3 and n5, respectively. Because k1 is the only topic keyword in Q, it is inferred to represent the user's search topic. However, the information topic of the query match is the label of the element node n1. Therefore, this query match's semantic similarity to Q is not 1 but λ.

If the query match Mi has a semantic similarity of 1, the algorithm outputs it and at the same time removes the entries of all the keyword matches in Mi from their corresponding inverted lists.


Fig. 4. The exception case: the query match is rooted at the element node n1; the topic keyword is the label of the internal element node n2, and the remaining keywords appear in the data value nodes n3 and n5.


Otherwise, either the entry ej's LED does not contain all the other keywords of Q or Mi does not have a semantic similarity of 1. In the first case, the algorithm removes the entry ej from its inverted list. Because there does not exist any query match rooted at ej's LED, the algorithm inserts the parent element of ej's LED into the list Lj1 for next-round processing. In the second case, the algorithm also removes the entry ej from its inverted list and inserts ej's LED into Lj1 for next-round processing. Whenever an inverted list becomes empty, all the existing entries in the other inverted lists are processed in the following way (lines 18–20): the parent elements of their LEDs are inserted into their corresponding lists Lj1. Note that the first part of Algorithm 1 outputs all the query matches with a semantic similarity of 1.

In the second part of Algorithm 1 (lines 22–28), the algorithm first sorts the elements in the lists Lj1 (1 ≤ j ≤ m) according to their Dewey codes and then processes these lists in the same way that it processes ILj (1 ≤ j ≤ m). In general, the algorithm processes the lists Ljr in rounds (r = 1, 2, ..., d) until the number of generated query matches reaches t or at least one Ljr list becomes empty. In the rth round, one pass over the elements in the lists generates all the query matches with a semantic similarity of λ^r.

Note that, similar to existing approaches such as XRank [1] and SLCA [3], Algorithm 1 includes a keyword match node in at most one query match. However, in the case that two query matches are nested, i.e., a query match M1's root contains the root of another match M2, Algorithm 1 performs differently. If M1's semantic similarity is lower than that of M2, both of them are returned. Otherwise, only the query match M1 is returned and all the keyword match nodes within M2 become part of M1.

As an illustration, suppose that the query Q9(XML, Johnson) is processed on the XML data shown in Fig. 5. The inverted list for the keyword "XML" has two match nodes, title(0.0.1) and title(0.2.1). The inverted list for "Johnson" has two match nodes, author(0.0.0) and author(0.1.0). The algorithm performs as follows.

1. The entry with the smallest Dewey code is author(0.0.0). Consider its LED article(0.0). Because article(0.0) contains title(0.0.1) and the query match M1 rooted at article(0.0) has a semantic similarity of 1, the algorithm outputs M1 and removes title(0.0.1) and author(0.0.0) from their corresponding inverted lists.
2. The algorithm continues to retrieve the next entry author(0.1.0). Since its LED article(0.1) does not contain the keyword "XML", the algorithm removes it from the inverted list and inserts the parent element of its LED, articles(0), into the list L21 for the keyword "Johnson".
3. Similarly, the algorithm continues to process the entry title(0.2.1). It inserts the element articles(0) into the list L11 for the keyword "XML".
4. In the next round, the algorithm processes the lists L11 and L21. It retrieves the entry articles(0) from L11 and finds that it contains an entry in L21. The query match M2 rooted at articles(0) is therefore returned, with a semantic similarity of λ. The algorithm ends.

Fig. 5. An SRank running example: an articles(0) root with three article children, where article(0.0) has author(0.0.0) containing "Johnson" and title(0.0.1) containing "XML IR", article(0.1) has author(0.1.0) containing "Johnson" and title(0.1.1) containing "Database", and article(0.2) has author(0.2.0) containing "Willians" and title(0.2.1) containing "XML Database".


Finally, we describe the implementation details of Algorithm 1 to justify its efficiency in practice. Instead of sorting the elements in the list Ljr after they are all generated, we insert the elements into the list in an orderly way in the first phase of the algorithm. Note that the elements of the list Ljr are generated from the ordered list Lj(r-1). According to the algorithm, the elements of the list Ljr would be generated in an orderly way except when an element's parent contains another element before it in the list Lj(r-1). This corresponds to the case when the same keyword appears at nested positions in the XML tree. Even though the nested case is not rare in XML data, the appearance frequency of the same keyword within a nested tree is usually small. As a result, it incurs a low additional cost to keep the generated elements in the list Ljr in order. The computational cost of Algorithm 1 can therefore be estimated to be around O(d|D|), where |D| and d represent the size of the XML tree and its maximal depth, respectively. Furthermore, it is worth noting that if the exception case exemplified in Fig. 4 does not occur while the algorithm is running, each element in the XML tree D will be checked only once at line 3 of Algorithm 1 in the worst case. The computational cost of Algorithm 1 would then be linear in the XML tree size. This is exactly what we observed in the experimental study.

6. Experimental study

In this section, we empirically evaluate the effectiveness and efficiency of the proposed ranking scheme SRank. Our experiments measure search quality by precision and recall, and measure efficiency by elapsed CPU time.

6.1. Experimental setup

Datasets. The empirical study is conducted on two real XML datasets: DBLP and Mondial. DBLP is a computer science bibliography dataset widely used for XML IR evaluation. It has a relatively simple structure but rich content information. Mondial is a world geographic dataset with rich structural information. Our experiments use a Mondial dataset with a size of 1.4 MB and a DBLP dataset with a size of 127 MB.

Query Sets. We generate a pool of 32 queries for each dataset. Our method first randomly picks 2–4 keywords from the XML data for a query and then manually revises the query so that it has meaningful semantics. We also choose 8 queries on each dataset for detailed performance analysis, as shown in Table 2. These queries represent a variety of scenarios in which both tag names and value words are used for keyword searches.

For each keyword query, we construct its corresponding XQuery expressions and use them to retrieve its relevant XML fragments. These results are then used to assess the relevance of the returned query matches. All retrieval methods are implemented in Java and their inverted lists are stored in the open source database system MySQL 5.0.45. The experiments are performed on a desktop running Windows Vista with an Intel Core 2.0 GHz CPU and 2 GB of RAM.

6.2. Search quality

We compare the performance of SRank with that of SLCA [3], EASE [4] and the TopX approach [12]. We choose them for comparative evaluation because they represent three typical ways of performing an XML keyword search:

• SLCA. SLCA is a classical approach based on LCA. It returns the most specific XML fragments containing all the query keywords.

• EASE. EASE is a general keyword search method that works on both tree-structured and graph-structured data. It returns the compact r-radius graphs that contain all or part of the query keywords.

• TopX. TopX accepts two types of queries, Content-Only (CO) queries and Content-And-Structure (CAS) queries. The keyword search we address in this paper corresponds to the CO query. TopX supposes that a CO query only contains the keywords in data value nodes. In contrast, SRank can accept any keyword in an XML dataset, including element and attribute labels. Note that a desirable XML keyword search engine should be able to accept any keyword present in XML data. For fairness, our comparative performance study on TopX and SRank only selects the queries that include value keywords.


• EASE. EASE is a general keyword search method that works on both tree-structured and graph-structured data. It returns the compact r-radius graphs that contain all or part of the query keywords.

• TopX. TopX accepts two types of queries, Content-Only (CO) queries and Content-And-Structure (CAS) queries. The keyword search we address in this paper corresponds to the CO query. TopX assumes that a CO query contains only keywords from data value nodes. In contrast, SRank accepts any keyword in an XML dataset, including element and attribute labels. Note that a desirable XML keyword search engine should be able to accept any keyword present in XML data. For fairness, our comparative performance study of TopX and SRank selects only queries consisting of value keywords.

We evaluate search quality by the metrics of precision and recall. Precision refers to the percentage of relevant results among all the returned results, and recall refers to the percentage of returned relevant results among all the relevant results existing in an XML dataset. Formally, we have precision = |Rr|/|Rt| and recall = |Rr|/|R|, where Rr, Rt and R represent the set of returned relevant results, the set of all returned results and the set of all relevant results, respectively. We use the cutoff points of 10, 20 and 50 to measure precision. Because the Mondial dataset is small, most queries return only a few query matches. We therefore only measure the precision of the top-10 returned results. On DBLP data, we measure the precision of the top-10, 20, and 50 query matches.
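
To spell out how the cutoff metrics above translate into an evaluation harness, the following helper computes precision at a cutoff k and overall recall from a ranked result list and a ground-truth set. It is a minimal sketch reflecting the definitions just given; the result identifiers are hypothetical.

import java.util.List;
import java.util.Set;

// Sketch of the metrics defined above: precision at a cutoff k and overall
// recall, given a ranked result list and the relevant set R.
public class Metrics {
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        int cutoff = Math.min(k, ranked.size());
        long hits = ranked.subList(0, cutoff).stream().filter(relevant::contains).count();
        return cutoff == 0 ? 0.0 : (double) hits / cutoff;
    }

    static double recall(List<String> returned, Set<String> relevant) {
        long hits = returned.stream().filter(relevant::contains).count();
        return relevant.isEmpty() ? 0.0 : (double) hits / relevant.size();
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("r1", "r2", "r3", "r4");   // hypothetical match ids
        Set<String> relevant = Set.of("r1", "r3", "r9");
        System.out.printf("P@3 = %.2f, recall = %.2f%n",
                precisionAtK(ranked, relevant, 3), recall(ranked, relevant));
    }
}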

6.2.1. Precision

DBLP Dataset. The average precision results of the SRank, SLCA and EASE approaches on the DBLP dataset are presented in Table 3. Among the three, SLCA exhibits the poorest performance and SRank achieves the best result. The evaluation results on the 8 example queries are also presented in Fig. 6.

There are two reasons for the poor performance of SLCA. First, it may return incomplete information fragments. For instance, for QD2, SLCA returns query matches rooted at the title attribute nodes. Second, SLCA does not consider the keywords' semantics within query matches and may thus return results with the keywords in undesirable positions. Consider the query QD1. Its intention is to find the papers about "network" published in the proceedings of SIGMOD. In all of the top-20 results returned by SLCA, the keyword "SIGMOD" does not appear under the desirable booktitle attribute nodes but under the cite nodes. The EASE approach considers keyword frequency and keyword distance within a query match as important ranking factors. In the case that keywords appear frequently but at undesirable positions, EASE exhibits poor performance. For instance, consider the query QD4. Its intention is to find the papers published in "ICDE 2000". In nine of the top-10 results returned by EASE, the keyword "ICDE" is matched by the cite nodes but not the booktitle nodes. These results are therefore irrelevant. Another scenario in which EASE may perform poorly is when several keywords appear near each other but at undesirable positions. Consider the query QD8. It intends to identify the books edited or written by "John". In nine of the top-10 results returned by EASE, the keywords "John" and "Book" both appear under the title attribute nodes. These results are therefore irrelevant.

Table 3
Precision evaluation on DBLP.

          SLCA (%)   EASE (%)   SRank (%)
Top-10    45.5       83.3       88.0
Top-20    47.7       81.4       86.4
Top-50    51.3       78.6       79.8

Fig. 6. Top-10 precision of example queries on DBLP. (y-axis: Precision (%); x-axis: QD1–QD8; series: SLCA, EASE, TopX, SRank.)


The typical scenario in which SRank performs worse than EASE is when the keywords within a query match appear at unexpected positions. For instance, consider the query QD5. It intends to find the articles authored by "Jim Gray". In one of the top-10 results returned by SRank, the keywords "Jim" and "Gray" appear under the title attribute node. This result is irrelevant but has a high ranking because the IDF values of "Jim" and "Gray" are computed to be high. SRank therefore achieves only 90% precision in its top-10 results. Similarly, SRank achieves only 70% precision in its top-10 results for the query QD6. In the three irrelevant query matches, the keywords "Richard" and "Thomas" appear under the title attribute nodes.

It is observed that both EASE and SRank perform significantly better than SLCA. In the cases in which EASE achieves better precision than SRank, it only performs slightly better. For instance, for the query QD5, the top-10 precision is improved from 90% to 100%. In the cases in which SRank achieves better precision than EASE, the performance improvement is much more considerable. For instance, for the query QD4, SRank achieves 90% precision in its top-10 results while EASE achieves a precision of only 10%. As a result, the overall performance of SRank measured by average precision is better than that of EASE. Our experiments also demonstrate that SRank performs more stably than EASE and SLCA, achieving desirable precision on up to 90% of the tested queries.

Mondial Dataset. Compared with DBLP, the Mondial data have a deeper structure and contain richer semantic information. To express user intention unambiguously, most tested queries contain element or attribute labels. The average precision results measured over the query pool of 32 queries are presented in Table 4. SRank achieves the best performance while EASE is the worst among them. The results of the eight example queries are also presented in Fig. 7.

Our experiments show that EASE performs poorly for many queries, including QM2, QM3, QM4 and QM5. The main reason for this is that the keyword frequency that EASE considers to be an important ranking factor is misleading in the Mondial dataset. For instance, consider the query QM4. It intends to identify cities located near rivers. All the top-10 results returned by EASE are rooted at country elements which contain many occurrences of the keyword "city". Consider the query QM5. Its intention is to identify the countries in Europe. Most of the top-10 results returned by EASE are rooted at country elements which contain many occurrences of the keyword "country" but do not contain the keyword "Europe" at desirable positions.

Similar to the DBLP case, the SLCA approach may return query matches rooted at undesirable elements. For instance, consider the query QM7. It intends to determine each country's population. None of the top-10 results returned by SLCA is rooted at the country elements. They are instead rooted at the city elements, which contain the keywords "country" and "population". They are therefore irrelevant to QM7. Consider the query QM2. Its intention is to identify the islands of the USA. Two of the top-10 results returned by SLCA are not rooted at the desirable island elements, but at country elements. SLCA therefore achieves only 80% precision for QM2.

It is observed that SRank achieves overall better performance than SLCA and EASE. There are only a few exceptions, such as QM3, for which SLCA performs better than SRank. QM3's intention is to identify the cities located at latitude 47. In some of the top-10 results returned by SRank, the keyword "47" is matched by the longitude attributes; these results are therefore irrelevant to QM3.

Table 4
Precision evaluation on Mondial.

          SLCA (%)   EASE (%)   SRank (%)
Top-10    66.9       50.0       80.3
Top-20    66.3       49.5       80.2
Top-50    65.3       48.9       78.4

Fig. 7. Top-10 precision of example queries on Mondial. (y-axis: Precision (%); x-axis: QM1–QM8; series: SLCA, EASE, SRank.)


It is worth pointing out that even in the case of QM3, measured with respect to the top-20 and top-50 results, the performance of SRank is slightly better than that of SLCA. Our experiments on the Mondial dataset demonstrate that, compared with SLCA and EASE, SRank is better at identifying the desirable elements and the appropriate positions where the query keywords should be matched.

Among the 32 queries on the Mondial dataset, SRank achieves desirable top-10 precision (greater than 70%) on up to 75% of them. Its performance on the Mondial dataset is not as good as that on the DBLP dataset. This results from the fact that the Mondial data have more complex structures and their keywords have more ambiguous semantic meanings. For instance, the high-incidence keywords "country" and "city" are both data value words and element labels. Such variety leads to more ambiguity in semantic analysis. As a result, there is a higher probability that SRank returns undesirable results.

SRank vs TopX. We compare the performance of SRank with that of TopX on the DBLP dataset because the DBLP data contain rich content information. We generate a pool of 32 tested queries consisting of data value words, including QD1, QD2, QD4 and QD7. The average precision comparisons are presented in Table 5. The detailed results for the four sample queries are also presented in Fig. 6.

It is observed that SRank outperforms TopX by considerable margins. Similar to EASE, TopX considers keyword frequency to be an important ranking factor. It therefore often returns undesirable results in which the keywords appear frequently but at inappropriate positions. For instance, consider the query QD4. TopX returns many papers with cite attributes that contain the keyword "ICDE". These papers are ranked high by TopX because they have many cite attributes matching "ICDE". In contrast, according to the formula presented in Section 3.2, SRank ranks the papers in which only the booktitle attribute matches "ICDE" higher. It therefore achieves better performance.

The scenario in which SRank performs worse than TopX is when it returns results in which two closely related keywords are matched by different attributes. For instance, consider the query QD7. In two of the top-10 results returned by SRank, only one of the two keywords "Simulation" and "Model" is matched by the title attributes. SRank therefore achieves a precision of only 80%. It is worth noting that in the cases in which SRank performs worse than TopX, the performance gap between the two is small. Our experiments demonstrate that SRank performs more stably than TopX, achieving desirable precision on more tested queries.

Table 5
SRank vs TopX: precision comparison with respect to DBLP.

          TopX (%)   SRank (%)
Top-10    73.2       88.6
Top-20    69.8       86.7
Top-50    70.3       82.6

Table 6
SRank vs SLCA: recall comparison.

           SLCA (%)   SRank (%)
DBLP       70.3       100
Mondial    72.9       90.6

Fig. 8. Recall of example queries on DBLP. (y-axis: Recall (%); x-axis: QD1–QD8; series: SLCA, SRank.)


Fig. 9. Recall of example queries on Mondial. (y-axis: Recall (%); x-axis: QM1–QM8; series: SLCA, SRank.)

Fig. 10. Efficiency evaluation of SRank. (Three panels; y-axis: Time (ms); x-axis: Data Size, 10M–50M; series: SRank, SLCA, SRank without DB time.)


Fig. 11. SRank efficiency on the number of returned results. (y-axis: Time (ms); x-axis: Top-20 to Top-100; series: QD(SIGIR, 2000), QD(Multimedia, Database), QD(ICDE, 2000).)


6.2.2. Recall

Regarding the metric of recall, we compare the performance of SRank with that of SLCA. We do not compare SRank with EASE and TopX for two reasons. First, EASE and TopX generate partially matched results containing only some of the keywords in a query, and the number of results they return can be manually set beforehand. Both of them can therefore return many more query matches than SRank and SLCA. Second, as will be shown later, SRank performs very well on both datasets, achieving desirable recall levels.

The recall evaluation results are presented in Table 6. On both datasets, SRank performs considerably better than SLCA. SRank achieves average recall levels of 100% and 90.6% on DBLP and Mondial respectively. SLCA performs poorly when the returned XML fragments are not rooted at appropriate elements or the relevant results do not satisfy the properties of SLCA structures. The detailed results regarding the 8 sample queries are also presented in Figs. 8 and 9. For instance, consider the queries QD2 and QM7. For QD2, SLCA returns many XML trees rooted at the title nodes rather than at meaningful publication entities. For QM7, SLCA returns many fragments rooted at city elements, whereas the ancestor country elements of these city elements are actually the relevant query matches. SRank does not achieve the 100% recall level on the Mondial dataset because it may also fail to return the desirable results rooted at appropriate elements. Consider the query QM(Border, TJ). Its intention is to identify the countries that share a border with the country coded "TJ". SRank returns query matches rooted at border elements. These results are undesirable.

6.3. Efficiency

We measure efficiency as the CPU time consumed by running the algorithms, and we compare the performance of SRank with that of SLCA. We have also studied the scalability of SRank and SLCA; the scalability tests demonstrate how their consumed CPU time varies as the XML data size increases. The experiments are conducted on the Mondial dataset. We generate Mondial datasets with sizes ranging from 10 MB to 50 MB by replicating the original dataset. The detailed results for the three sample queries (QM2, QM4 and QM6) are shown in Fig. 10. Note that the experimental results for the other queries are similar and are thus not presented here. With respect to SRank, the elapsed CPU time includes the time consumed by the operations that retrieve the parameter values of the relevance formula from the database. It can be observed that this parameter retrieval occupies a considerable portion of the total CPU time. If it is excluded, SRank achieves efficiency comparable to that of SLCA. With respect to the Mondial data and our tested queries, as described in the Search Algorithm section, the occurrence frequency of the same keyword within nested trees is usually low and the exception case exemplified in Fig. 4 rarely occurs. It is observed that the CPU time consumed by SRank increases roughly linearly with the data size. Our experiments demonstrate that SRank achieves good scalability.

We have also investigated the scalability of SRank with respect to the number of returned results. The corresponding experiments are conducted on the DBLP dataset. The detailed results for the three sample queries are presented in Fig. 11. Note that the results for the other queries are similar and are thus omitted here. It can be observed that as the number of returned results increases, the CPU time consumed by SRank increases roughly linearly.

7. Conclusion

This paper proposes a novel semantic ranking scheme, SRank, for XML keyword search. SRank is based on a sound analysis of XML semantics and user information need. Its major contribution is the measurement of the relevance between a keyword query and an XML fragment according to their semantic similarity. Our extensive experiments demonstrate that SRank outperforms existing approaches with respect to search quality. Its efficiency and scalability also bode well for its practical implementation.

It is observed that most existing techniques for XML keyword search were proposed for homogeneous XML data, such as DBLP and Wiki data. As XML and its related technologies (such as the semantic web) have evolved, it has become clear that keyword search must be performed over heterogeneous XML data in many application scenarios. Recently, [29] proposed a method for the search problem over large-scale web-extracted data. However, further investigations are clearly necessary. Another research direction worth pursuing is the effective integration of the IR and database approaches for XML keyword search. Over the past years, the database community has initiated the LCA-based approach and the IR community has proposed the search approach based on a language model under the initiative of INEX. Determining how to integrate these two approaches for a better search experience over complex and heterogeneous XML data is also an interesting subject for future work.

Acknowledgments

This work is sponsored partially by the National Natural Science Foundation of China (Nos. 60803043, 60873196 and 61033007) and the National High-Tech Research and Development Plan of China (863) under Grant No. 2009AA01A404.

References

[1] L. Guo, F. Shao, C. Botev, J. Shanmugasundaram, XRank: ranked keyword search over XML documents, in: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD2003), 2003, pp. 16–27.
[2] S. Cohen, J. Mamou, Y. Kanza, Y. Sagiv, XSEarch: a semantic search engine for XML, in: Proceedings of the 29th International Conference on Very Large Data Bases (VLDB2003), 2003, pp. 45–56.
[3] Y. Xu, Y. Papakonstantinou, Efficient keyword search for smallest LCAs in XML databases, in: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD2005), 2005, pp. 537–538.
[4] G. Li, B.C. Ooi, J. Feng, J. Wang, L. Zhou, EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data, in: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD2008), 2008, pp. 903–914.
[5] Z. Bao, T.W. Ling, B. Chen, J. Lu, Effective XML keyword search with relevance oriented ranking, in: Proceedings of the 25th International Conference on Data Engineering (ICDE2009), 2009, pp. 517–528.
[6] Y. Li, C. Yu, H.V. Jagadish, Schema-free XQuery, in: Proceedings of the 30th International Conference on Very Large Data Bases (VLDB2004), 2004, pp. 72–83.
[7] N. Fuhr, K. Großjohann, XIRQL: a query language for information retrieval in XML documents, in: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR2001), 2001, pp. 172–180.
[8] S. Liu, Q. Zou, W.W. Chu, Configurable indexing and ranking for XML information retrieval, in: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR2004), 2004, pp. 88–95.
[9] S. Amer-Yahia, L.V.S. Lakshmanan, S. Pandit, FlexPath: flexible structure and full-text querying for XML, in: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD2004), 2004, pp. 83–94.
[10] G. Li, J. Feng, J. Wang, L. Zhou, Effective keyword search for valuable LCAs over XML documents, in: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM2007), 2007, pp. 31–40.
[11] Z. Liu, Y. Chen, Reasoning and identifying relevant matches for XML keyword search, PVLDB 1 (1) (2008) 921–932.
[12] M. Theobald, R. Schenkel, G. Weikum, An efficient and versatile query engine for TopX search, in: Proceedings of the 31st International Conference on Very Large Data Bases (VLDB2005), 2005, pp. 625–636.
[13] V. Hristidis, Y. Papakonstantinou, A. Balmin, Keyword proximity search on XML graphs, in: Proceedings of the 19th International Conference on Data Engineering (ICDE2003), 2003, pp. 367–378.
[14] M. Barg, R.K. Wong, Structural proximity searching for large collections of semi-structured data, in: Proceedings of the 2001 ACM CIKM International Conference on Information and Knowledge Management (CIKM2001), 2001, pp. 175–182.
[15] J. Feng, G. Li, J. Wang, L. Zhou, Finding and ranking compact connected trees for effective keyword proximity search in XML documents, Inform. Syst. 35 (2) (2010) 186–203.
[16] Initiative for the Evaluation of XML Retrieval (INEX), <http://inex.is.informatik.uni-duisburg.de/>.
[17] V. Hristidis, N. Koudas, Y. Papakonstantinou, D. Srivastava, Keyword proximity search in XML trees, IEEE Trans. Knowl. Data Eng. 18 (4) (2006) 525–539.
[18] W.-S. Li, K.S. Candan, Q. Vu, D. Agrawal, Retrieving and organizing web pages by "information unit", in: Proceedings of the 10th International World Wide Web Conference (WWW2001), 2001, pp. 230–244.
[19] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, S. Sudarshan, Keyword searching and browsing in databases using BANKS, in: Proceedings of the 18th International Conference on Data Engineering (ICDE2002), 2002, pp. 431–440.
[20] H. He, H. Wang, J. Yang, P.S. Yu, BLINKS: ranked keyword searches on graphs, in: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD2007), 2007, pp. 305–316.
[21] Z. Liu, Y. Chen, Identifying meaningful return information for XML keyword search, in: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD2007), 2007, pp. 329–340.
[22] G. Li, C. Li, J. Feng, L. Zhou, SAIL: structure-aware indexing for effective and progressive top-k keyword search over XML documents, Inform. Sci. 179 (21) (2009) 3745–3762.
[23] G. Koutrika, A. Simitsis, Y.E. Ioannidis, Précis: the essence of a query answer, in: Proceedings of the 22nd International Conference on Data Engineering (ICDE2006), 2006, pp. 69–78.
[24] J. Li, C. Liu, R. Zhou, W. Wang, Top-k keyword search over probabilistic XML data, in: Proceedings of the 27th International Conference on Data Engineering (ICDE2011), 2011, pp. 673–684.
[25] Y. Tao, S. Papadopoulos, C. Sheng, K. Stefanidis, Nearest keyword search in XML documents, in: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD2011), 2011, pp. 589–600.
[26] G.J. Fakas, A novel keyword search paradigm in relational databases: object summaries, Data Knowl. Eng. 70 (2) (2011) 208–229.
[27] X. Ning, H. Jin, W. Jia, P. Yuan, Practical and effective IR-style keyword search over semantic web, Inform. Process. Manag. 45 (2) (2009) 263–271.
[28] J. Shao, Z. Cao, X. Liang, H. Lin, Proxy re-encryption with keyword search, Inform. Sci. 180 (13) (2010) 2576–2587.
[29] J. Pound, I.F. Ilyas, G.E. Weddell, Expressive and flexible access to web-extracted data: a keyword-based structured query language, in: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD2010), 2010, pp. 423–434.