
Using TREC for cross-comparison between classic IR and ontology-based search models at a Web scale

Miriam Fernández¹, Vanessa López², Marta Sabou², Victoria Uren², David Vallet¹, Enrico Motta², Pablo Castells¹

  

Semantic Search 2009 Workshop (SemSearch 2009)
18th International World Wide Web Conference (WWW 2009)

21st April 2009, Madrid

¹ Escuela Politécnica Superior, Universidad Autónoma de Madrid

² Knowledge Media Institute, The Open University


Table of contents

Motivation
Part I. The proposal: a novel evaluation benchmark
• Reusing the TREC Web track document collection
• Introducing the semantic layer
  – Reusing ontologies from the Web
  – Populating ontologies from Wikipedia
  – Annotating documents
Part II. Analyzing the evaluation benchmark
• Using the benchmark to compare an ontology-based search approach against traditional IR baselines
  – Experimental conditions
  – Results
• Applications of the evaluation benchmark
Conclusions


Motivation (I)

Problem: How can semantic search systems be evaluated and compared with standard IR systems to study whether and how semantic search engines offer competitive advantages?

Traditional IR evaluation
• Evaluation methodologies generally based on the Cranfield paradigm (Cleverdon, 1967)
  – Documents, queries and judgments
• Well-known retrieval performance metrics
  – Precision, Recall, P@10, Average Precision (AP), Mean Average Precision (MAP) (a small sketch of the two metrics used later in the talk follows below)
• Broad initiatives such as TREC create and use standard evaluation collections, methodologies and metrics
• The evaluation methods are systematic, easily reproducible, and scalable
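As a concrete reference for the two figures reported later in the talk, here is a minimal sketch of P@10 and (mean) average precision, assuming a plain TREC-style setting where each topic has one ranked result list and one set of judged-relevant document ids; this is illustrative code, not part of the benchmark or of trec_eval.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k retrieved documents that are judged relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision(ranking, relevant):
    """Average of the precision values at the ranks where relevant documents
    are retrieved; relevant documents that are never retrieved contribute 0."""
    hits, score = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, qrels):
    """MAP over a dict topic_id -> ranked doc list, given topic_id -> set of
    relevant doc ids for that topic."""
    return sum(average_precision(rankings[t], qrels[t]) for t in rankings) / len(rankings)
```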


Motivation (II)

Ontology-based search evaluation
• Ontology-based search approaches
  – Introduction of a new semantic search space (ontologies and KBs)
  – Change in the IR vision (input, output, scope)
• The evaluation methods rely on user-centered studies, and therefore they tend to be high-cost, non-scalable and difficult to reproduce
• There is still a long way to go before standard evaluation benchmarks exist for assessing the quality of ontology-based search approaches

Goal: develop a new reusable evaluation benchmark for cross-comparison between classic IR and ontology-based models on a significant scale


Part I. The proposal

Motivation
Part I. The proposal: a novel evaluation benchmark
• Reusing the TREC Web track document collection
• Introducing the semantic layer
  – Reusing ontologies from the Web
  – Populating ontologies from Wikipedia
  – Annotating documents
Part II. Analyzing the evaluation benchmark
• Using the benchmark to compare an ontology-based search approach against traditional IR baselines
  – Experimental conditions
  – Results
Part III. Applications of the evaluation benchmark
Conclusions


The evaluation benchmark (I)

A benchmark collection for cross-comparison between classic IR and ontology-based search models at a large scale should comprise five main components (an illustrative sketch of how these pieces fit together follows below):
• a set of documents,
• a set of topics or queries,
• a set of relevance judgments (or lists of relevant documents for each topic),
• a set of semantic resources, ontologies and KBs, which provide the needed semantic information for ontology-based approaches,
• a set of annotations that associate the semantic resources with the document collection (not needed for all ontology-based search approaches).
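As a rough illustration of how these five components could be bundled for an experiment, here is a hypothetical sketch; the field names and types are ours, not a format defined by the benchmark.

```python
from dataclasses import dataclass, field

@dataclass
class SemanticSearchBenchmark:
    documents: dict   # doc_id -> document text (here, the WT10g pages)
    topics: dict      # topic_id -> query or natural-language question
    judgments: dict   # topic_id -> set of relevant doc_ids (the qrels)
    ontologies: list  # files or URIs of the ontologies and KBs (RDF/OWL/DAML)
    annotations: list = field(default_factory=list)  # (entity_id, doc_id, weight) triples
```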


The evaluation benchmark (II)

Start from a well-known standard IR evaluation benchmark
• Reuse of the TREC Web track collection used in the TREC 9 and TREC 2001 editions of the TREC conference
  – Document collection: WT10g (Bailey, Craswell, & Hawking, 2003). About 10 GB in size, 1.69 million Web pages
  – The TREC topics and judgments for this text collection are provided with the TREC 9 and TREC 2001 datasets

Example of a TREC topic
<num> 494
<title> nirvana
<desc> Find information on members of the rock group Nirvana
<narr> Descriptions of members' behavior at various concerts and their performing style is relevant. Information on who wrote certain songs or a band member's role in producing a song is relevant. Biographical information on members is also relevant.


The evaluation benchmark (III)

Construct the semantic search space
• In order to fulfill Web-like conditions, all the semantic search information should be available online
• The selected semantic information should cover, or partially cover, the domains involved in the TREC query set
• The selected semantic resources should be completed with a larger set of random ontologies and KBs to approximate a fair scenario
• If the semantic information available online has to be extended in order to cover the TREC queries, this must be done with information sources which are completely independent from the document collection, and available online


The evaluation benchmark (IV)

Document collection
• TREC WT10g

Queries and judgments
• TREC 9 and TREC 2001 test corpora
  – 100 queries with their corresponding judgments
• 20 queries selected and adapted to be used by an NLP QA query processing module

Ontologies
• 40 public ontologies covering a subset of the TREC domains and queries (370 files comprising 400 MB of RDF, OWL and DAML)
• 100 additional repositories (2 GB of RDF and OWL)

Knowledge Bases
• Some of the 40 selected ontologies have been semi-automatically populated from Wikipedia

Annotations
• 1.2 · 10⁸ non-embedded annotations generated and stored in a MySQL database


The evaluation benchmark (V)

Selecting TREC queries
• Queries have to be formulated in a way suitable for ontology-based search systems (informational queries)
  – E.g., queries such as "discuss the financial aspects of retirement planning" (topic 514) are not selected
• Ontologies must be available for the domain of the query
  – We selected 20 queries

Adapting TREC queries

Example of a TREC topic extension
<num> 494
<title> nirvana
<desc> Find information on members of the rock group Nirvana
<narr> Descriptions of members' behavior at various concerts and their performing style is relevant. Information on who wrote certain songs or a band member's role in producing a song is relevant. Biographical information on members is also relevant.
<adaptation> What are the members of nirvana?
<ontologies> Tapfull, music
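A minimal sketch of how such an extended topic could be parsed, assuming the tag-without-closing-tag layout shown in the example above; the field list and function name are ours, not an official format specification.

```python
import re

FIELDS = ("num", "title", "desc", "narr", "adaptation", "ontologies")

def parse_extended_topic(text):
    """Extract each tagged field of one extended topic into a dict."""
    topic = {}
    tag_alternatives = "|".join(FIELDS)
    for name in FIELDS:
        # capture everything after <name> up to the next known tag or the end
        match = re.search(rf"<{name}>\s*(.*?)\s*(?=<(?:{tag_alternatives})>|\Z)",
                          text, flags=re.DOTALL)
        if match:
            topic[name] = match.group(1)
    if "ontologies" in topic:
        topic["ontologies"] = [o.strip() for o in topic["ontologies"].split(",")]
    return topic

# For the example above, parse_extended_topic(raw_topic)["adaptation"]
# would return "What are the members of nirvana?".
```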


The evaluation benchmark (VI)

Populating ontologies from Wikipedia
• The semantic resources available online are still scarce and incomplete (Sabou, Gracia, Angeletou, d'Aquin, & Motta, 2007)
• Generation of a simple semi-automatic ontology-population mechanism (a rough sketch follows below)
  – Populates ontology classes with new individuals
  – Extracts ontology relations for a specific ontology individual
  – Uses Wikipedia lists and tables to extract this information
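A rough sketch of the kind of list-harvesting step this describes, using requests and BeautifulSoup to pull the linked entries of a Wikipedia list page as candidate individuals; the page title and target class in the example are illustrative, and the authors' actual heuristics (including table handling and relation extraction) are not reproduced here.

```python
import requests
from bs4 import BeautifulSoup

def list_page_entries(list_page_title):
    """Return the titles of pages linked from the list items of a Wikipedia
    list page, as candidate individuals for an ontology class."""
    url = f"https://en.wikipedia.org/wiki/{list_page_title}"
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    candidates = []
    for item in soup.select("div.mw-parser-output ul li"):
        link = item.find("a")
        if link is not None and link.get("title"):
            candidates.append(link["title"])
    return candidates

# e.g. list_page_entries("List_of_grunge_bands") could propose individuals for
# a class such as music:Band, subject to manual confirmation (semi-automatic).
```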


The evaluation benchmark (VII)

Annotating documents with ontology entities
• Identify ontology entities (classes, properties, instances or literals) within the documents to generate new annotations
• Do not populate ontologies, but identify already available semantic knowledge within the documents
• Support annotation in open domain environments (any document can be associated or linked to any ontology without any predefined restriction)

This brings scalability limitations. To address them we propose to:
– Generate ontology indices
– Generate document indices
– Construct an annotation database which stores non-embedded annotations (see the sketch after the example table below)

Entity ID    Doc ID       Weight
182904817    361452228    0.54
182904817    361452228    0.21
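For illustration, a self-contained sketch of such a non-embedded annotation store; the paper mentions MySQL, but sqlite3 is used here only to keep the example runnable, and the column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect("annotations.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS annotation (
           entity_id TEXT NOT NULL,  -- ontology entity (class, property, instance or literal)
           doc_id    TEXT NOT NULL,  -- WT10g document identifier
           weight    REAL NOT NULL   -- annotation weight
       )"""
)
conn.execute("CREATE INDEX IF NOT EXISTS ann_by_entity ON annotation(entity_id)")
conn.execute("CREATE INDEX IF NOT EXISTS ann_by_doc ON annotation(doc_id)")
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)",
                 [("182904817", "361452228", 0.54)])  # example row from the table above
conn.commit()
```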


The evaluation benchmark (VIII)

Annotation based on contextual semantic information (example with ontology o1):

1. Select the next semantic entity
   E1 = Individual: Maradona, Labels = {"Maradona", "Diego Armando Maradona", "pelusa"}
2. Search the entity's terms in the document index
   Keyword → Documents: Maradona → D1, D2, D87; Pelusa → D95, D140; football_player → D87, D61, D44, D1; Argentina → D43, D32, D2
   Potential documents to annotate = {D1, D2, D87, D95, D140}
3. Select the semantic context
   E34 = Class: football_player, Labels = {"football player"}; I22 = Individual: Argentina, Labels = {"Argentina"}
4. Search the contextualized terms in the document index
   Contextualized documents = {D1, D2, D32, D43, D44, D61, D87}
5. Select the semantically contextualized documents
   Documents to annotate = {D1, D2, D87}
6. Annotations creator: weighted annotations
   (Ontology Entity, Document, Weight): (E1, D1, 0.5), (E1, D2, 0.2), (E1, D87, 0.67)

• Ambiguities: exploit ontologies as background knowledge (increasing precision but reducing the number of annotations)

A condensed code sketch of this context-filtering step follows below.
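The following sketch shows the context-filtering idea behind steps 1-6, under assumed data structures (a label-to-documents index and plain dicts for entities); it is not the authors' implementation, and the weighting function is left as a placeholder.

```python
def contextual_annotations(entity, context_entities, doc_index, weight_fn):
    """Return weighted annotations (entity_id, doc_id, weight).

    doc_index maps a lower-cased label to the set of documents containing it;
    weight_fn stands in for whatever weighting scheme is actually applied.
    """
    # Steps 1-2: documents matching any lexical label of the entity itself.
    candidates = set()
    for label in entity["labels"]:
        candidates |= doc_index.get(label.lower(), set())

    # Steps 3-4: documents matching labels of the entity's semantic context
    # (classes and individuals related to it in the ontology).
    contextualized = set()
    for ctx in context_entities:
        for label in ctx["labels"]:
            contextualized |= doc_index.get(label.lower(), set())

    # Step 5: keep only candidates that the context also supports; this is the
    # conservative filter that raises precision but drops some annotations.
    supported = candidates & contextualized

    # Step 6: emit weighted, non-embedded annotations.
    return [(entity["id"], doc, weight_fn(entity, doc)) for doc in supported]
```

With the Maradona example, the candidate set is {D1, D2, D87, D95, D140} and the contextualized set is {D1, D2, D32, D43, D44, D61, D87}, so only D1, D2 and D87 receive annotations.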


Part II. Analyzing the evaluation benchmark

Motivation
Part I. The proposal: a novel evaluation benchmark
• Reusing the TREC Web track document collection
• Introducing the semantic layer
  – Reusing ontologies from the Web
  – Populating ontologies from Wikipedia
  – Annotating documents
Part II. Analyzing the evaluation benchmark
• Using the benchmark to compare an ontology-based search approach against traditional IR baselines
  – Experimental conditions
  – Results
• Applications of the evaluation benchmark
Conclusions


Experimental conditions

• Keyword-based search (Lucene)
• Best TREC automatic search
• Best TREC manual search
• Semantic-based search (Fernandez, et al., 2008)

[Architecture of the semantic-based search system: a natural-language query entered through the NL interface is handled by the query processing component, using pre-processed semantic knowledge gathered from the Semantic Web through a Semantic Web gateway and indexed as semantic entities; instance searching over the semantic index of weighted annotations (which link the semantic entities to the unstructured Web contents) returns unordered documents, and the ranking component produces the final ranked documents.]


Results (I)

• Figures in bold correspond to the best result for each topic, excluding the best TREC manual approach (because of the way it constructs the query)

MAP: mean average precision
P@10: precision at 10

MAP per topic

Topic   Semantic   Lucene   TREC automatic   TREC manual
451     0.42       0.29     0.58             0.54
452     0.04       0.03     0.20             0.33
454     0.26       0.26     0.56             0.48
457     0.05       0.00     0.12             0.22
465     0.13       0.00     0.00             0.61
467     0.10       0.12     0.09             0.21
476     0.13       0.28     0.41             0.52
484     0.19       0.12     0.05             0.36
489     0.09       0.11     0.06             0.41
491     0.08       0.08     0.00             0.70
494     0.41       0.22     0.57             0.57
504     0.13       0.08     0.38             0.64
508     0.15       0.03     0.06             0.10
511     0.07       0.15     0.23             0.15
512     0.25       0.12     0.30             0.28
513     0.08       0.06     0.12             0.11
516     0.07       0.03     0.07             0.74
523     0.29       0.00     0.23             0.29
524     0.11       0.00     0.01             0.22
526     0.09       0.06     0.07             0.20
Mean    0.16       0.10     0.20             0.38

P@10 per topic

Topic   Semantic retrieval   Lucene   TREC automatic   TREC manual
451     0.70                 0.50     0.90             0.80
452     0.20                 0.20     0.30             0.90
454     0.80                 0.80     0.90             0.80
457     0.10                 0.00     0.10             0.80
465     0.30                 0.00     0.00             0.90
467     0.40                 0.40     0.30             0.80
476     0.50                 0.30     0.10             1.00
484     0.20                 0.30     0.00             0.30
489     0.20                 0.00     0.10             0.40
491     0.20                 0.30     0.00             0.90
494     0.90                 0.80     1.00             1.00
504     0.20                 0.20     0.50             1.00
508     0.50                 0.10     0.30             0.30
511     0.40                 0.50     0.70             0.20
512     0.40                 0.20     0.30             0.30
513     0.10                 0.40     0.00             0.40
516     0.10                 0.00     0.00             0.90
523     0.90                 0.00     0.40             0.90
524     0.20                 0.00     0.00             0.40
526     0.10                 0.00     0.00             0.50
Mean    0.37                 0.25     0.30             0.68


Results (II)

By P@10, the semantic retrieval outperforms the other two approaches
– It provides maximal quality for 55% of the queries and is outperformed by both Lucene and TREC automatic in only one query (511)
– Semantic retrieval provides better results than Lucene for 60% of the queries and equal results for another 20%
– Compared to the best TREC automatic engine, our approach improves on 65% of the queries and produces comparable results in 5%

By MAP, there is no clear winner
– The average performance of TREC automatic is greater than that of semantic retrieval
– Semantic retrieval outperforms TREC automatic in 50% of the queries and Lucene in 75%
– Bias in the MAP measure (illustrated by the small sketch below)
  – More than half of the documents retrieved by the semantic retrieval approach have not been rated in the TREC judgments
  – The annotation technique used for the semantic retrieval approach is very conservative (missing potentially correct annotations)
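A small numeric sketch of this bias, with made-up rankings and judgments: standard AP (as computed by trec_eval) treats unjudged documents as non-relevant, while scoring only the judged documents (a condensed-list view, in the spirit of measures such as bpref) shows how much missing judgments can depress MAP.

```python
def average_precision(ranking, relevant):
    """AP with unretrieved relevant documents contributing zero."""
    hits, score = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0

ranking  = ["d1", "u1", "d2", "u2", "u3", "d3"]   # u* were never judged (illustrative data)
relevant = {"d1", "d3"}                           # judged relevant
judged   = {"d1", "d2", "d3"}                     # judged at all (relevant or not)

standard  = average_precision(ranking, relevant)                              # ~0.67
condensed = average_precision([d for d in ranking if d in judged], relevant)  # ~0.83
```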


Results (III)

For some queries for which the keyword-based search (Lucene) approach finds no relevant documents, the semantic search does: queries 457 (Chevrolet trucks), 523 (facts about the five main clouds) and 524 (how to erase scars)

For the queries in which the semantic retrieval did not outperform the keyword baseline, the semantic information obtained by the query processing module was scarce. Still, overall, the keyword baseline only rarely provides significantly better results than semantic search

TREC Web search evaluation topics are conceived for keyword-based search engines
– With complex structured queries (involving relationships), the performance of semantic retrieval would improve significantly compared to the keyword-based baseline
– The full capabilities of the semantic retrieval model for formal semantic queries were not exploited in this set of experiments


Results (IV)

Studying the impact of retrieved non-evaluated documents
• 66% of the results returned by semantic retrieval were not judged

• P@10 is not affected: results in the first positions have a higher probability of having been judged

• MAP: evaluating the impact

– Informal evaluation of the first 10 unevaluated results returned for every query

– 89% of these results occur in the first 100 positions for their respective query

– A significant portion, 31.5%, of the documents we judged turned out to be relevant

– Even though this cannot be generalized to all the unevaluated results returned by the semantic retrieval approach (the probability of being relevant drops around the first 100 results and then varies very little), we believe that the lack of judgments for all the results returned by the semantic retrieval impairs its MAP value


Applications of the Benchmark

Goal: How can this benchmark be applied to evaluate other ontology-based search approaches?

Criteria          Approaches
Scope             Web search; limited domain repositories; desktop search
Input (query)     Keyword query; natural language query; controlled natural language query; ontology query languages
Output            Data retrieval; information retrieval
Content ranking   No ranking; keyword-based ranking; semantic-based ranking


Conclusions (I)

In the semantic search community, there is a need for standard evaluation benchmarks to evaluate and compare ontology-based approaches against each other, and against traditional IR models

In this work, we have addressed two issues:

1. Construction of a potentially widely applicable ontology-based evaluation benchmark from traditional IR datasets, such as the TREC Web track reference collection

2. Use of the benchmark to evaluate a specific ontology-based search approach (Fernandez, et al., 2008) against different traditional IR models at a large scale


Conclusions (II)

Potential limitations of the above benchmark are:

• The need for ontology-based search systems to participate in the pooling methodology to obtain a better set of document judgments

• The use of queries with a low level of expressiveness in terms of relations, which are more oriented towards traditional IR models

• The scarcity of publicly available semantic information covering the meanings involved in the document search space

A common understanding of ontology-based search in terms of inputs, outputs and scope should be reached before a real standardization of the evaluation of ontology-based search models can be achieved


Thank you!
http://nets.ii.uam.es/miriam/thesis.pdf (chapter 6)

http://nets.ii.uam.es/publications/icsc08.pdf