Building a Domain-Specific Document Collection for Evaluating Metadata Effects on Information Retrieval
Walid Magdy, Jinming Min, Johannes Leveling, Gareth Jones
School of Computing, Dublin City University, Ireland
20 May 2010, LREC 2010


Page 1: Walid Magdy, Jinming Min, Johannes Leveling, Gareth Jones

Building a Domain-Specific Document Collection for Evaluating Metadata Effects on Information Retrieval

Walid Magdy, Jinming Min, Johannes Leveling, Gareth Jones

School of Computing, Dublin City University, Ireland

20 May 2010

LREC 2010

Page 2

Outline

CNGL

Objective

Data collection preparation and overview

IR test collection design

Baseline Experiments

Summary

Page 3

CNGL

Centre for Next Generation Localisation (CNGL)

4 Universities: DCU, TCD, UCD, and UL

Team: 120 PhD students, PostDocs, and PIs

Supported by Science Foundation Ireland (SFI)

9 Industrial Partners: IBM, Microsoft, Symantec, …

Objective: Automation of the localisation process

Technologies: MT, AH, IR, NLP, Speech, and Dev.

Page 4

Objective

Create a collection of data that is:

1. Suitable for IR tasks

2. Suitable for other research fields (AH, NLP)

3. Large enough to produce conclusive results

4. Associated with defined evaluation strategies

Prepare the collection from freely available data: YouTube

Domain specific (Basketball)

Build standard IR test collection (document set + topics set + relevance assessment)

Page 5

YouTube Videos Features

Document:
- Video URL
- Video title
- Tags
- Posting user
- Posting date
- Description
- Category
- Number of views
- Length
- Responded videos
- Related videos
- Comments
- Number of ratings
- Number of times favorited

Page 6

Methodology for Crawling Data

50 NBA related queries used to search YouTube

First 700 results per query crawled with related videos

Crawled pages parsed and metadata extracted.

Extracted data represented in XML format

Non-sport category results filtered out

Used queries: NBA, NBA Highlights, NBA All Stars, NBA fights

Top ranked 15 NBA players in 2008 + Jordan + Shaq

29 NBA teams
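The last two crawling steps (representing extracted metadata in XML and filtering out non-sport results) could look roughly like the sketch below. It is only an illustration: the element names (`video`, `url`, `title`, `category`, `description`, `tags`, `tag`), the category label `"Sports"`, and the example URLs are assumptions, not the schema or values used for the actual collection.

```python
import xml.etree.ElementTree as ET

def video_to_xml(video: dict) -> ET.Element:
    """Serialise one parsed video page into a <video> element (hypothetical schema)."""
    root = ET.Element("video")
    for field in ("url", "title", "category", "description"):
        ET.SubElement(root, field).text = video.get(field, "")
    tags = ET.SubElement(root, "tags")
    for tag in video.get("tags", []):
        ET.SubElement(tags, "tag").text = tag
    return root

def keep_sport_only(videos, sport_label="Sports"):
    """Drop every video whose category is not the (assumed) sport label."""
    return [v for v in videos if v.get("category") == sport_label]

# Toy records standing in for parsed video pages.
videos = [
    {"url": "http://example.com/v1", "title": "Jordan dunks",
     "category": "Sports", "description": "Career dunks", "tags": ["NBA", "dunk"]},
    {"url": "http://example.com/v2", "title": "Music video",
     "category": "Music", "description": "", "tags": []},
]

kept = keep_sport_only(videos)
xml_docs = [ET.tostring(video_to_xml(v), encoding="unicode") for v in kept]
print(xml_docs[0])
```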

Page 7

Data Collection Overview

Crawled video pages: 61,340 pages

Max crawled related/responded video pages: 20

Max crawled comments for a given video page: 500

Comments associated with contributing user's ID

Crawled user profiles ≈ 250k

Page 8

XML sample

Page 9

Topics Creation

<title>Michael Jordan best dunks</title>

<description>Find the best dunks through the career of Michael Jordan in NBA. It can be a collection of dunks in matches, or dunk contest he participated in. </description>

<narrative>A relevant video should contain at least one dunk for Jordan. Videos of dunks for other players are not relevant. And other plays for Jordan other than dunks are not relevant as well</narrative>

40 topics (queries) created

Specific topics related to NBA

TREC topic = query (title) + description + narrative
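As a small illustration of the topic structure, the sketch below parses a topic into its three fields and takes the title as the query. The enclosing `<topic>` element is an assumption added so the fragment is well-formed XML, and the description and narrative are shortened from the example topic.

```python
import xml.etree.ElementTree as ET

# The <topic> wrapper is hypothetical; the three child fields follow the slide.
TOPIC_XML = """<topic>
<title>Michael Jordan best dunks</title>
<description>Find the best dunks through the career of Michael Jordan in NBA.</description>
<narrative>A relevant video should contain at least one dunk for Jordan.</narrative>
</topic>"""

def parse_topic(xml_text: str) -> dict:
    """Return a mapping from field name to its stripped text."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}

topic = parse_topic(TOPIC_XML)
query = topic["title"]  # only the title is used as the query
print(query)
```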

Page 10

Relevance Assessment

4 indexes created:
- Title
- Title + Tags
- Title + Tags + Description
- Title + Tags + Description + Related videos titles

5 different retrieval models used

20 different result lists (4 indexes × 5 retrieval models), each containing 60 documents

Result lists merged with random ranking

122 to 466 documents assessed per topic

1 to 125 relevant documents per topic (avg. = 23)
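The pooling procedure described above can be sketched as follows. With 20 lists of 60 documents the pool can hold at most 1,200 documents per topic; overlap between the lists is what brings the assessed count down to the reported 122 to 466. The function name, the fixed seed, and the toy runs are assumptions for illustration only.

```python
import random

def build_pool(result_lists, depth=60, seed=0):
    """Merge the top `depth` documents of every run, deduplicate, and shuffle."""
    pool = set()
    for ranked in result_lists:
        pool.update(ranked[:depth])
    merged = sorted(pool)                # fixed order before shuffling, for reproducibility
    random.Random(seed).shuffle(merged)  # random ranking hides which run found a document
    return merged

# 20 toy runs (standing in for 4 indexes x 5 retrieval models) with overlapping results.
runs = [[f"doc{(i * 7 + j) % 300}" for j in range(100)] for i in range(20)]
pool = build_pool(runs)
print(len(pool), "documents to assess for this topic")
```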

Page 11

Baseline Experiments

Search 4 different indexes:
- Title
- Title + Tags
- Title + Tags + Description
- Title + Tags + Description + Related videos titles

Indri retrieval model used to rank results

1000 results retrieved for each search

Mean average precision (MAP) used to compare the results
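For reference, MAP can be computed from ranked result lists and relevance judgments as in the minimal sketch below. This is the standard definition rather than the evaluation script used for the experiments; the function names and the toy data are illustrative.

```python
def average_precision(ranked, relevant):
    """Average of precision@k over the ranks k at which a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)

def mean_average_precision(runs, qrels):
    """Mean of the per-topic average precision over all judged topics."""
    scores = [average_precision(runs.get(topic, []), rel) for topic, rel in qrels.items()]
    return sum(scores) / len(scores)

# Toy example with two topics.
runs = {"t1": ["a", "b", "c", "d"], "t2": ["x", "y", "z"]}
qrels = {"t1": {"a", "c"}, "t2": {"y"}}
map_score = mean_average_precision(runs, qrels)
print(round(map_score, 4))
```

In the toy example, topic `t1` has relevant documents at ranks 1 and 3, giving (1/1 + 2/3) / 2, and topic `t2` has one at rank 2, giving 1/2.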

Page 12

Results

[Figure: bar chart of MAP (y-axis, 0.00 to 0.45) for the four indexes: Title, Title+Tags, Title+Tags+Desc, All text fields]

Page 13

Summary (new language resource)

Resources:
- 61,340 XML docs
- 40 topics + rel. assess.
- 250,000 user profiles
- Comments
- Ratings
- # Views
- Metadata
- Videos
- Tags

Possible uses:
- IR test set
- AH/Personalisation
- Sentiment analysis
- Multimedia processing
- Reranking using ML
- NER

Top bigrams in "Tags" field:
- Kobe Bryant
- NBA Basketball
- Lebron James
- Michael Jordan
- Los Angeles
- All Star
- Chicago Bulls
- Boston Celtics
- Allen Iverson
- Angeles Lakers
- Slam Dunk
- Basketball NBA
- Dwight Howard
- Vince Carter
- Dwyane Wade
- Kevin Garnett
- Toronto Raptors
- Houston Rockets
- Miami Heat
- O'Neal
- Phoenix Suns
- Detroit Pistons
- Tracy Mcgrady
- Yao Ming
- Chris Paul
- Amazing Highlights
- New York
- Pau Gasol
- Cleveland Cavaliers
- NBA Amazing
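A bigram ranking over the "Tags" field could be produced along the lines of the sketch below. It assumes each tags field is a whitespace-separated string and that a bigram is any pair of adjacent tokens, which would explain why both "Los Angeles" and "Angeles Lakers" appear; the slides do not state how the bigrams were actually extracted, and the input data here is invented.

```python
from collections import Counter

def top_bigrams(tag_fields, k=3):
    """Count adjacent token pairs across all tag fields and return the k most frequent."""
    counts = Counter()
    for field in tag_fields:
        tokens = field.split()
        counts.update(zip(tokens, tokens[1:]))
    return [(" ".join(pair), n) for pair, n in counts.most_common(k)]

# Toy tag fields standing in for the "Tags" metadata of three videos.
tag_fields = [
    "Kobe Bryant Los Angeles Lakers",
    "Kobe Bryant dunk",
    "Los Angeles Lakers highlights",
]
top = top_bigrams(tag_fields, k=3)
print(top)
```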

Page 14

Questions & Answers

Q: Is this collection available for free?

A: No

Q: Nothing could be provided?

A: Scripts + Topics + Rel. assess. (needs updating)

Q: Any other questions?

A: …

Page 15

Thank you