TREC 2009 Review
Lanbo Zhang
7 tracks
• Web track
• Relevance Feedback track (RF)
• Entity track
• Blog track
• Legal track
• Million Query track (MQ)
• Chemical IR track
67 Participating Groups
The new dataset: ClueWeb09
• 1 billion web pages in 10 languages; half are in English
• Crawled by CMU in Jan. and Feb. 2009
• 5 TB (compressed), 25 TB (uncompressed)
• Subset B
– 50 million English pages
– Includes all Wikipedia pages
• The original dataset and the Indri index of subset B are available on our lab machines
Web Track
• Two tasks
– Adhoc Retrieval Task
– Diversity Task
• Diversity task: return a ranked list of pages that together provide complete coverage for a query, while avoiding excessive redundancy in the returned list.
Web Track
• Topic type 1: ambiguous
• Topic type 2: faceted
• Results of adhoc task
• Results of diversity task
Waterloo at Web track
• Two runs
– Top 10000 docs in the entire collection
– Top 10000 docs in the Wikipedia set
• Wikipedia docs as pseudo relevance feedback
• Machine learning methods to re-rank the top 20000 docs, and return the top 1000
• Diversity task
– A Naïve Bayes classifier designed to re-rank the top 20000 to exclude duplicates
MSRA at Web track
• Mining subtopics for a query by
– Anchor texts
– Search result clusters
– Sites of search results
• Search results diversification
– A greedy algorithm to iteratively select the next best document
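The slides do not give MSRA's exact selection criterion, so the following is a generic sketch of such a greedy diversification step: at each iteration, pick the document with the best combination of relevance and coverage of not-yet-covered subtopics. The `alpha` weight and the input format are assumptions for illustration, not MSRA's actual settings.

```python
def greedy_diversify(docs, k, alpha=0.5):
    """Greedily build a diverse ranking of k documents.

    docs: list of (doc_id, relevance, subtopics) tuples, where
    subtopics is a set of mined subtopic labels for the document.
    At each step, pick the doc maximizing a blend of its relevance
    and the number of subtopics it covers that are still uncovered.
    """
    selected, covered = [], set()
    remaining = list(docs)
    while remaining and len(selected) < k:
        def gain(d):
            _, rel, subs = d
            novel = len(subs - covered)
            return alpha * rel + (1 - alpha) * novel
        best = max(remaining, key=gain)
        remaining.remove(best)
        selected.append(best[0])
        covered |= best[2]
    return selected

docs = [("d1", 0.9, {"a"}), ("d2", 0.8, {"a"}), ("d3", 0.5, {"b"})]
# d3 outranks the more relevant d2 because it covers a new subtopic
print(greedy_diversify(docs, 2))
```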
Relevance Feedback Track
• Tasks
– Phase 1: find a set of 5 documents that are good for relevance feedback.
– Phase 2: develop an RF algorithm to do retrieval based on the relevance judgments of the 5 docs.
Results of RF track: Phase 1
Results of RF track: Phase 2
UCSC at RF track
• Phase 1: document selection
– Clustering top-ranked documents
– Transductive Experimental Design (TED)
• Phase 2: RF algorithm
– Combining different document representations
• Title, anchor, heading, document
– Incorporating term position information
• Phrase match, text window match
– Incorporating document similarities to labeled docs
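The combination described on this slide can be sketched as a weighted sum of per-representation scores plus a bonus for similarity to the labeled relevant documents. The weights and `beta` below are illustrative values, not UCSC's actual parameters.

```python
def ucsc_style_score(field_scores, weights, sims_to_labeled, beta=0.5):
    """Sketch: combine per-representation retrieval scores
    (title, anchor, heading, document) with a similarity bonus.

    field_scores: {"title": s1, "anchor": s2, ...}
    weights: matching per-field weights (hypothetical values)
    sims_to_labeled: similarities of this doc to each labeled relevant doc
    """
    base = sum(weights[f] * s for f, s in field_scores.items())
    bonus = max(sims_to_labeled) if sims_to_labeled else 0.0
    return base + beta * bonus

# a document scoring 2.0 on title and 1.0 on body text,
# with 0.8 similarity to its closest labeled relevant doc
s = ucsc_style_score({"title": 2.0, "document": 1.0},
                     {"title": 0.5, "document": 0.5},
                     [0.4, 0.8], beta=0.5)
print(s)
```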
UMass at RF track
• A supervised method to estimate the weights of expanded terms for RF
• Training collection: wt10g
• Term features given a query:
– Term frequency in FB docs and in the entire collection
– Co-occurrence with query terms
– Term proximity to query terms
– Document frequency
UMass at RF track
• Model: Boosting
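The feature extraction for candidate expansion terms can be sketched as follows, assuming documents are token lists; the function name and feature encodings are illustrative, and the boosting model that learns the term weights from these features is not shown.

```python
import math

def term_features(term, query_terms, fb_docs, coll_df, n_coll_docs):
    """Features for a candidate expansion term, following the slide:
    TF in feedback docs, co-occurrence with query terms,
    proximity to query terms, and (collection) document frequency.

    fb_docs: feedback documents as lists of tokens
    coll_df: collection document frequencies, term -> df
    """
    # term frequency across the feedback documents
    tf_fb = sum(doc.count(term) for doc in fb_docs)
    # number of feedback docs where the term co-occurs with a query term
    cooc = sum(1 for doc in fb_docs
               if term in doc and any(q in doc for q in query_terms))
    # proximity: minimum token distance to any query term occurrence
    dists = []
    for doc in fb_docs:
        pos_t = [i for i, w in enumerate(doc) if w == term]
        pos_q = [i for i, w in enumerate(doc) if w in query_terms]
        dists += [abs(i - j) for i in pos_t for j in pos_q]
    prox = min(dists) if dists else None
    # document frequency expressed as an IDF-style value
    idf = math.log(n_coll_docs / (1 + coll_df.get(term, 0)))
    return {"tf_fb": tf_fb, "cooc": cooc, "min_dist": prox, "idf": idf}
```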
Entity Track
• Task
– Given an input entity, find the related entities
– Return 100 related entities and their homepages
Results of Entity track
Purdue at Entity track
• Entity Extraction
– Hierarchical Relevance Model
– Three levels of relevance: document, passage, entity
Purdue at Entity track
• Homepage Finding for Entities
– Logistic Regression model
Blog Track
• Tasks
– Faceted Blog Distillation
– Top Stories Identification
• Collection: Blogs08
– Crawled between 01/14/2008 and 02/10/2009
– 1.3 million unique blogs
Blog Track
• Task 1: Faceted Blog Distillation
– Given a topic and a facet restriction, find the relevant blogs
– Facets
• Opinionated vs. Factual
• Personal vs. Official
• In-depth vs. Shallow
– Topic example
Blog Track
• Task 2: Top Stories Identification
– Given a date, find the hottest news headlines for that day and select the relevant and diverse blog posts for those headlines
– News headlines from the New York Times were used
– Topic example
Results of Blog track
• Faceted Blog Distillation
Results of Blog track
• Top Stories Identification
– Find the hottest news headlines
– Identify the related blog posts
BUPT at Blog track
• Faceted Blog Distillation
– Scoring function:
• The title section of a topic plus automatically selected terms from the DESC and NARR sections
• Phrase match
– Facet analysis
• Opinionated vs. Factual: a sentiment analysis model
• Personal vs. Official: the maximum frequency of an organization entity occurring in a blog (Stanford Named Entity Recognizer)
• In-depth vs. Shallow: post length
– Linear combination of the above two parts
score(b, q) = (1 / N(b)) · Σ_i score(p_i, q)

where p_i are the posts of blog b and N(b) is the number of posts.
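Under the reading that a blog's topical score averages its posts' query scores, and that this is then linearly combined with the facet score as the slide describes, a minimal sketch (the combination weight `w` is an assumed value):

```python
def blog_score(post_scores):
    """score(b, q) = (1 / N(b)) * sum_i score(p_i, q):
    a blog's topical relevance as the mean of its posts' scores."""
    if not post_scores:
        return 0.0
    return sum(post_scores) / len(post_scores)

def faceted_score(topical, facet, w=0.7):
    # linear combination of the topical score and the facet score,
    # per the slide; the weight w is a hypothetical value
    return w * topical + (1 - w) * facet

print(faceted_score(blog_score([1.0, 2.0, 3.0]), facet=0.5))
```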
Univ. of Glasgow at Blog track
• Top Stories Identification
– The model:
– Incorporating the following days
– Using Wikipedia to enrich news headline terms and keep the top 10 terms for each headline
score(h, d) = Σ_{p ∈ C_top1000(h, d)} exp(score(p, h))

where C_top1000(h, d) is the set of the top 1000 posts retrieved for headline h on day d.
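A sketch of this headline-scoring model: sum exp(score(p, h)) over the day's top-ranked posts. The input format (a list of per-post retrieval scores for the headline) is an assumption for illustration.

```python
import math

def headline_score(post_scores):
    """score(h, d) = sum of exp(score(p, h)) over the top 1000 posts
    retrieved for headline h on day d."""
    top = sorted(post_scores, reverse=True)[:1000]
    return sum(math.exp(s) for s in top)
```

The exp() makes the sum dominated by the strongest matching posts, so a headline with a few highly relevant posts outscores one with many weak matches.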
Legal Track
• Tasks
– Interactive task (Enron email collection)
• Retrieval with topic authorities involved: participants can ask topic authorities to clarify topics and to judge the relevance of sample docs
– Batch task (IIT CDIP 1.0)
• Retrieval with relevance evidence (RF)
Results of Legal track
Waterloo at Legal track
• Interactive task
– Phase 1: interactive search and judging
• To find a large and diverse set of training examples
– Phase 2: interactive learning
• To find more potentially relevant documents
• Batch task
– Run three spam filters on every document:
• An on-line logistic regression filter
• A Naïve Bayes spam filter
• An on-line version of the BM25 RF method
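A minimal on-line logistic-regression filter in the spirit of the batch-task approach can be sketched as below; the feature representation (a bag of string features per document) and the learning rate are illustrative choices, not Waterloo's actual configuration.

```python
import math

class OnlineLogisticFilter:
    """On-line logistic regression: score a document, then update the
    weights with one stochastic-gradient step per labeled example."""

    def __init__(self, lr=0.1):
        self.w = {}   # feature -> weight, updated one document at a time
        self.lr = lr

    def score(self, features):
        # estimated probability that the document is relevant
        z = sum(self.w.get(f, 0.0) for f in features)
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, label):
        # label is 1 (relevant) or 0 (not relevant)
        g = self.lr * (label - self.score(features))
        for f in features:
            self.w[f] = self.w.get(f, 0.0) + g
```

The on-line setting matters here: the filter never revisits past documents, so it can stream over a collection far larger than memory.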
Million Query Track
• Tasks
– Adhoc retrieval for 40000 queries
– Predict query types
• Query intent: Precision-oriented vs. Recall-oriented
• Query difficulty: Hard vs. Easy
• Precision-oriented
– Navigational: Find a specific URL or web page.
– Closed: Find a short, unambiguous answer to a specific question.
– Resource: Locate a web-based resource or download.
• Recall-oriented
– Open: Answer an open-ended question, or find all available information about a topic.
– Advice: Find advice or ideas regarding a general question or problem.
– List: Find a list of results that will help satisfy an open-ended goal.
Results of Million Query track
Precision vs. Recall
Hard vs. Easy
Northeastern Univ. at MQ track
• Query-specific learning to rank
– Learn different ranking functions for queries in different classes
• Using SVM to classify queries
– Training data: MQ 2008 dataset
• Features
– Document features: document length, TF, IDF, TF*IDF, normalized TF, Robertson's TF, Robertson's IDF, BM25, Language Models (Laplace, Dirichlet, JM)
– Field features: title, heading, anchor text, and URL
– Web graph features
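The routing idea can be sketched as follows. The real system classifies queries with an SVM trained on the MQ 2008 data using the features listed above; the rule-based classifier below is only a toy stand-in so the routing step is runnable.

```python
def classify_query(query):
    """Toy stand-in for the SVM query classifier: flag queries with
    navigational-looking cue words as precision-oriented."""
    nav_cues = {"homepage", "www", "login", "download", "site"}
    return "precision" if set(query.lower().split()) & nav_cues else "recall"

def query_specific_rank(query, rankers):
    # route the query to the ranking function learned for its class
    return rankers[classify_query(query)](query)
```

Usage: `query_specific_rank(q, {"precision": rank_p, "recall": rank_r})`, where `rank_p` and `rank_r` are the two class-specific ranking functions.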
Chemical IR Track
• Tasks
– Technical Survey Task
• Retrieve documents in response to each topic given by chemical patent experts
– Prior Art Search Task
• Find relevant patents with respect to a set of 1000 existing patents
Results of Chemical track
Geneva at Chemical track
• Document Representation:
– Title, Description, Abstract, Claims, Applicants, Inventors, IPC codes, Patent references
• Exploiting Citation Networks
• Query expansion using chemical annotations
• Filtering based on IPC codes
• Re-ranking based on claims