TREC 2009 Review Lanbo Zhang



7 tracks

• Web track
• Relevance Feedback track (RF)
• Entity track
• Blog track
• Legal track
• Million Query track (MQ)
• Chemical IR track


67 Participating Groups


The new dataset: ClueWeb09

• 1 billion web pages, in 10 languages, half are in English
• Crawled by CMU in Jan. and Feb. 2009
• 5 TB (compressed), 25 TB (uncompressed)
• Subset B
  – 50 million English pages
  – Includes all Wikipedia pages
• The original dataset and the Indri index of subset B are available on our lab machines


Web Track

• Two tasks
  – Adhoc Retrieval Task
  – Diversity Task
    • Return a ranked list of pages that together provide complete coverage for a query, while avoiding excessive redundancy in the return list.


Web Track
• Topic type 1: ambiguous


Web Track
• Topic type 2: faceted


Web Track
• Results of adhoc task


Web Track
• Results of diversity task


Waterloo at Web track

• Two runs
  – Top 10000 docs in the entire collection
  – Top 10000 docs in the Wikipedia set
• Wikipedia docs as pseudo relevance feedback
• Machine learning methods to re-rank the top 20000 docs, and return the top 1000
• Diversity task
  – A Naïve Bayes classifier designed to re-rank the top 20000 to exclude duplicates


MSRA at Web track

• Mining subtopics for a query by
  – Anchor texts
  – Search results clusters
  – Sites of search results
• Search results diversification
  – A greedy algorithm to iteratively select the next best document
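The slide does not say what "next best" means, so the following is only a sketch of the general greedy idea, not MSRA's actual objective. It assumes each document has a relevance score and a set of mined subtopics, and that the gain of a document is a weighted mix of its relevance and the number of subtopics it newly covers. The function name `greedy_diversify` and the `trade_off` parameter are hypothetical.

```python
from typing import Dict, List, Set


def greedy_diversify(
    relevance: Dict[str, float],
    doc_subtopics: Dict[str, Set[str]],
    k: int,
    trade_off: float = 0.5,  # assumed weight between relevance and novelty
) -> List[str]:
    """Select up to k documents, one per round, by highest marginal gain."""
    selected: List[str] = []
    covered: Set[str] = set()
    remaining = set(relevance)

    def gain(doc: str) -> float:
        # Only subtopics not yet covered by earlier picks count as novelty.
        new_subtopics = doc_subtopics.get(doc, set()) - covered
        return trade_off * relevance[doc] + (1 - trade_off) * len(new_subtopics)

    while remaining and len(selected) < k:
        # Sorting first makes tie-breaking deterministic (lowest doc id wins).
        best = max(sorted(remaining), key=gain)
        selected.append(best)
        covered |= doc_subtopics.get(best, set())
        remaining.remove(best)
    return selected
```

With two highly relevant documents on the same subtopic and a weaker one on a different subtopic, the second pick goes to the weaker but novel document.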


Relevance Feedback Track

• Tasks
  – Phase 1: find a set of 5 documents that are good for relevance feedback.
  – Phase 2: develop an RF algorithm to do retrieval based on the relevance judgments of 5 docs.
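As a point of reference for Phase 2, here is a minimal sketch of the classic Rocchio update, which is one well-known RF algorithm and not necessarily what any participant used. Queries and documents are assumed to be sparse term-weight dictionaries; the `alpha`, `beta`, and `gamma` defaults are common textbook values chosen for illustration.

```python
from collections import Counter
from typing import Dict, List


def rocchio(
    query: Dict[str, float],
    relevant: List[Dict[str, float]],
    non_relevant: List[Dict[str, float]],
    alpha: float = 1.0,
    beta: float = 0.75,
    gamma: float = 0.15,
) -> Dict[str, float]:
    """Move the query toward judged-relevant docs and away from non-relevant ones."""
    updated: Counter = Counter()
    for term, weight in query.items():
        updated[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            # Average over the relevant set (the centroid).
            updated[term] += beta * weight / len(relevant)
    for doc in non_relevant:
        for term, weight in doc.items():
            updated[term] -= gamma * weight / len(non_relevant)
    # Terms pushed to zero or below are dropped from the expanded query.
    return {term: weight for term, weight in updated.items() if weight > 0}
```

In the track setting, the 5 judged documents would be split into the `relevant` and `non_relevant` lists according to their judgments.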


Results of RF track: Phase 1


Results of RF track: Phase 2


UCSC at RF track

• Phase 1: document selection
  – Clustering top ranked documents
  – Transductive Experimental Design (TED)
• Phase 2: RF algorithm
  – Combining different document representations
    • Title, anchor, heading, document
  – Incorporating term position information
    • Phrase match, text window match
  – Incorporating document similarities to labeled docs


UMass at RF track

• A supervised method to estimate the weights of expanded terms for RF
• Training collection: WT10g
• Term features given a query:
  – Term frequency in FB docs and entire collection
  – Co-occurrence with query terms
  – Term proximity to query terms
  – Document frequency


UMass at RF track

• Model: Boosting


Entity Track

• Task
  – Given an input entity, find the related entities
    • Return 100 related entities and their homepages


Results of Entity track


Purdue at Entity track

• Entity Extraction
  – Hierarchical Relevance Model
  – Three levels of relevance: document, passage, entity


Purdue at Entity track

• Homepage Finding for Entities
  – Logistic Regression model


Blog Track

• Tasks
  – Faceted Blog Distillation
  – Top Stories Identification
• Collection: Blogs08
  – Crawled between 01/14/2008 and 02/10/2009
  – 1.3 million unique blogs


Blog Track

• Task 1: Faceted Blog Distillation
  – Given a topic and the faceted restriction, find the relevant blogs.
  – Facets
    • Opinionated vs. Factual
    • Personal vs. Official
    • In-depth vs. Shallow
  – Topic example


Blog Track

• Task 2: Top Stories Identification
  – Given a date, find the hottest news headlines for that day and select the relevant and diverse blog posts for those headlines
  – News headlines from New York Times used
  – Topic example


Results of Blog track

• Faceted Blog Distillation


Results of Blog track

• Top Stories Identification
  – Find the hottest news headlines
  – Identify the related blog posts


BUPT at Blog track

• Faceted Blog Distillation
  – Scoring function:
    $$\mathrm{score}_r(b, q) = \frac{1}{N(b)} \sum_{i=1}^{N(b)} \mathrm{score}(p_i, q)$$
    • The title section of a topic plus automatically selected terms from the DESC and NARR sections
    • Phrase match
  – Facets Analysis
    • Opinionated vs. Factual: a sentiment analysis model
    • Personal vs. Official: the maximum frequency of an organization entity occurring in a blog (Stanford Named Entity Recognizer)
    • In-depth vs. Shallow: post length
  – Linear combination of the above two parts
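The scoring function on this slide appears to average per-post scores over a blog, and the final score is described as a linear combination of the relevance part and the facet part. A minimal sketch under those assumptions follows, reading N(b) as the number of posts in blog b. The function names and the mixing weight `weight` are hypothetical, since the slide gives no coefficients.

```python
from typing import List


def blog_relevance(post_scores: List[float]) -> float:
    """Average the per-post scores score(p_i, q) over the N(b) posts of a blog."""
    if not post_scores:
        return 0.0  # assumed convention for a blog with no scored posts
    return sum(post_scores) / len(post_scores)


def combined_score(relevance: float, facet: float, weight: float = 0.7) -> float:
    """Linear combination of the relevance score and the facet score.

    `weight` is an illustrative value, not taken from the slides.
    """
    return weight * relevance + (1.0 - weight) * facet
```

For example, a blog whose posts score 1.0, 2.0, and 3.0 gets a relevance score of 2.0, which is then mixed with whichever facet score applies to the topic.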


Univ. of Glasgow at Blog track

• Top Stories Identification
  – The model:
    $$\mathrm{score}(h, d) = \sum_{p \in C_{\mathrm{top}1000}(h, d)} \exp(\mathrm{score}(p, h))$$
  – Incorporating the following days
  – Using Wikipedia to enrich news headline terms and keep the top 10 terms for each headline
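One reading of the model on this slide is that a headline h is scored for day d by summing exp(score(p, h)) over the blog posts p from that day that appear among the top 1000 posts retrieved for the headline. The sketch below follows that assumption. The tuple layout and the function name `headline_score` are hypothetical.

```python
import math
from typing import List, Tuple


def headline_score(
    ranked_posts: List[Tuple[str, float]],
    day: str,
    top_k: int = 1000,
) -> float:
    """Sum exp(score(p, h)) over posts from `day` within the top_k results.

    ranked_posts holds (post_date, score) pairs for one headline,
    already sorted by descending retrieval score.
    """
    return sum(
        math.exp(score)
        for post_date, score in ranked_posts[:top_k]
        if post_date == day
    )
```

Headlines for a given day can then be ranked by this value, so that headlines attracting many highly ranked posts on that day come first.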


Legal Track

• Tasks
  – Interactive task (Enron email collection)
    • Retrieval with topic authorities involved: participants can ask topic authorities to clarify topics and judge the relevance of sample docs
  – Batch task (IIT CDIP 1.0)
    • Retrieval with relevance evidence (RF)


Results of Legal track


Waterloo at Legal track

• Interactive task
  – Phase 1: interactive search and judging
    • To find a large and diverse set of training examples
  – Phase 2: interactive learning
    • To find more potentially relevant documents
• Batch task
  – Run three spam filters on every document:
    • An on-line logistic regression filter
    • A Naïve Bayes spam filter
    • An on-line version of BM25 RF method


Million Query Track

• Tasks
  – Adhoc retrieval for 40000 queries
  – Predict query types
    • Query intent: Precision-oriented vs. Recall-oriented
    • Query difficulty: Hard vs. Easy
• Precision-oriented
  – Navigational: Find a specific URL or web page.
  – Closed: Find a short, unambiguous answer to a specific question.
  – Resource: Locate a web-based resource or download.
• Recall-oriented
  – Open: Answer an open-ended question, or find all available information about a topic.
  – Advice: Find advice or ideas regarding a general question or problem.
  – List: Find a list of results that will help satisfy an open-ended goal.


Results of Million Query track

Precision vs. Recall

Hard vs. Easy


Northeastern Univ. at MQ track

• Query-specific learning to rank
  – Learn different ranking functions for queries in different classes
• Using SVM to classify queries
  – Training data: MQ 2008 dataset
• Features
  – Document features: document length, TF, IDF, TF*IDF, normalized TF, Robertson’s TF, Robertson’s IDF, BM25, Language Models (Laplace, Dirichlet, JM).
  – Field features: title, heading, anchor text, and URL
  – Web graph features
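To make a few of the document features above concrete, here is a sketch of Robertson's TF, Robertson's IDF, and the BM25 score built from them, using the standard textbook formulas. The parameter values `k1 = 1.2` and `b = 0.75` are common defaults assumed for illustration; the slides do not say which values or variants the Northeastern runs used.

```python
import math
from typing import Dict, List


def robertson_idf(num_docs: int, doc_freq: int) -> float:
    """Robertson/Sparck Jones IDF: log((N - df + 0.5) / (df + 0.5))."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def robertson_tf(tf: float, doc_len: float, avg_doc_len: float,
                 k1: float = 1.2, b: float = 0.75) -> float:
    """Saturating, length-normalized term frequency used inside BM25."""
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + norm)


def bm25(query_terms: List[str], doc_tf: Dict[str, int], doc_len: float,
         avg_doc_len: float, num_docs: int, doc_freqs: Dict[str, int],
         k1: float = 1.2, b: float = 0.75) -> float:
    """Sum Robertson TF times Robertson IDF over query terms found in the doc."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        idf = robertson_idf(num_docs, doc_freqs[term])
        score += idf * robertson_tf(tf, doc_len, avg_doc_len, k1, b)
    return score
```

Each of these values could serve as one entry in the per-document feature vector passed to the learning-to-rank model.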


Chemical IR Track

• Tasks
  – Technical Survey Task
    • Retrieve documents in response to each topic given by chemical patent experts
  – Prior Art Search Task
    • Find relevant patents with respect to a set of 1000 existing patents


Results of Chemical track


Geneva at Chemical track

• Document Representation:
  – Title, Description, Abstract, Claims, Applicants, Inventors, IPC codes, Patent references
• Exploiting Citation Networks
• Query expansion using chemical annotations
• Filtering based on IPC codes
• Re-ranking based on claims