
Hsin-Hsi Chen 3-1

Chapter 3 Retrieval Evaluation

Hsin-Hsi Chen

Department of Computer Science and Information Engineering

National Taiwan University

Hsin-Hsi Chen 3-2

Evaluation

• Function analysis
• Time and space
  – The shorter the response time, the smaller the space used, the better the system is
• Performance evaluation (for data retrieval)
  – Performance of the indexing structure
  – The interaction with the operating systems
  – The delays in communication channels
  – The overheads introduced by software layers
• Performance evaluation (for information retrieval)
  – Besides time and space, retrieval performance is an issue

Hsin-Hsi Chen 3-3

Retrieval Performance Evaluation

• Retrieval task
  – Batch mode
    • The user submits a query and receives an answer back
    • How the answer set is generated
  – Interactive mode
    • The user specifies his information need through a series of interactive steps with the system
• Aspects
  – User effort
  – Characteristics of interface design
  – Guidance provided by the system
  – Duration of the session

Hsin-Hsi Chen 3-4

Recall and Precision

• Recall
  – the fraction of the relevant documents which has been retrieved

    Recall = |Ra| / |R|

• Precision
  – the fraction of the retrieved documents which is relevant

    Precision = |Ra| / |A|

(Figure: the collection contains the relevant docs |R| and the answer set |A|; |Ra| is the set of relevant docs in the answer set.)
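A minimal sketch (not from the slides) of the two definitions above, computing them directly from sets of document ids; the set names follow the figure, and the example ids are hypothetical.

```python
# Sketch: recall and precision from the sets R (relevant), A (retrieved) and Ra.
def recall_precision(relevant, answer):
    ra = relevant & answer                                  # Ra: relevant docs in the answer set
    recall = len(ra) / len(relevant) if relevant else 0.0   # |Ra| / |R|
    precision = len(ra) / len(answer) if answer else 0.0    # |Ra| / |A|
    return recall, precision

# Hypothetical example
R = {"d3", "d5", "d9"}
A = {"d3", "d7", "d9", "d11"}
print(recall_precision(R, A))   # (0.666..., 0.5)
```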

Hsin-Hsi Chen 3-5

Precision versus recall curve

• The user is not usually presented with all the documents in the answer set A at once
• Example
  Rq={d3,d5,d9,d25,d39,d44,d56,d71,d89,d123}

  Ranking for query q by a retrieval algorithm:
  1. d123   6. d9     11. d38
  2. d84    7. d511   12. d48
  3. d56    8. d129   13. d250
  4. d6     9. d187   14. d113
  5. d8     10. d25   15. d3

  (precision, recall): (100%,10%) (66%,20%) (50%,30%) (40%,40%) (33%,50%)
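A small sketch (not the course's code) that walks down the example ranking above and prints the (precision, recall) pair each time a new relevant document is seen; Rq and the ranking are taken from the slide.

```python
# Sketch: observed (precision, recall) pairs as the user walks down the ranking.
Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187",
           "d25", "d38", "d48", "d250", "d113", "d3"]

seen_relevant = 0
for rank, doc in enumerate(ranking, start=1):
    if doc in Rq:
        seen_relevant += 1
        precision = seen_relevant / rank       # relevant docs seen / docs seen
        recall = seen_relevant / len(Rq)       # relevant docs seen / |Rq|
        print(f"{doc}: precision={precision:.0%}, recall={recall:.0%}")
# d123: 100%,10%   d56: 67%,20%   d9: 50%,30%   d25: 40%,40%   d3: 33%,50%
```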

Hsin-Hsi Chen 3-6

11 standard recall levels for a query

• precision versus recall based on 11 standard recall levels: 0%, 10%, 20%, …, 100%

(Figure: interpolated precision versus recall curve; x-axis recall (%), y-axis precision (%).)

Hsin-Hsi Chen 3-7

11 standard recall levels for several queries

• average the precision figures at each recall level

  P(r) = (1 / Nq) * Σ_{i=1..Nq} Pi(r)

• P(r): the average precision at the recall level r
• Nq: the number of queries used
• Pi(r): the precision at recall level r for the i-th query
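A minimal sketch of the averaging step, assuming each query's precision values at the 11 standard recall levels have already been computed (e.g. with the interpolation procedure two slides below).

```python
# Sketch: average precision at each of the 11 standard recall levels over Nq queries.
# per_query[i][j] holds Pi(rj), the precision of query i at recall level j (j = 0..10).
def average_over_queries(per_query):
    nq = len(per_query)
    return [sum(p[j] for p in per_query) / nq for j in range(11)]
```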

Hsin-Hsi Chen 3-8

Necessity of interpolation procedure

• Rq={d3,d56,d129}

  Ranking for query q:
  1. d123   6. d9     11. d38
  2. d84    7. d511   12. d48
  3. d56    8. d129   13. d250
  4. d6     9. d187   14. d113
  5. d8     10. d25   15. d3

  (precision, recall): (33.3%,33.3%) (25%,66.6%) (20%,100%)

• How about the precision figures at the recall levels 0, 0.1, 0.2, 0.3, …, 1?

Hsin-Hsi Chen 3-9

Interpolation procedure

• rj (j ∈ {0,1,2,…,10}): a reference to the j-th standard recall level (e.g., r5 references the recall level 50%)

• P(rj) = max_{rj ≤ r ≤ rj+1} P(r)

• Example
  observed (precision, recall): d56 (33.3%,33.3%), d129 (25%,66.6%), d3 (20%,100%)

  interpolated precision:
  r0: (33.33%,0%)   r1: (33.33%,10%)  r2: (33.33%,20%)
  r3: (33.33%,30%)  r4: (25%,40%)     r5: (25%,50%)
  r6: (25%,60%)     r7: (20%,70%)     r8: (20%,80%)
  r9: (20%,90%)     r10: (20%,100%)
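A minimal sketch (not the course's code) of the interpolation step: it takes the maximum observed precision at any recall level at or above rj, which is the common "ceiling" reading of the rule and reproduces the interpolated values in the example above.

```python
# Sketch: 11-point interpolated precision for one query.
def interpolated_11pt(relevant, ranking):
    R = len(relevant)
    observed = []                              # (recall, precision) at each relevant doc
    seen = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            seen += 1
            observed.append((seen / R, seen / rank))
    interp = []
    for j in range(11):                        # standard recall levels 0.0, 0.1, ..., 1.0
        rj = j / 10
        candidates = [p for r, p in observed if r >= rj]
        interp.append(max(candidates) if candidates else 0.0)
    return interp

ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187",
           "d25", "d38", "d48", "d250", "d113", "d3"]
print(interpolated_11pt({"d3", "d56", "d129"}, ranking))
# [0.333, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.2, 0.2, 0.2, 0.2]
```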

Hsin-Hsi Chen 3-10

Precision versus recall figures
compare the retrieval performance of distinct retrieval algorithms over a set of example queries

• The curve of precision versus recall which results from averaging the results for various queries

(Figure: averaged precision versus recall curve; x-axis recall (%), y-axis precision (%).)


Hsin-Hsi Chen 3-11

Average Precision at given Document Cutoff Values

• Compute the average precision when 5, 10, 15, 20, 30, 50 or 100 relevant documents have been seen.

• Provide additional information on the retrieval performance of the ranking algorithm
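A minimal sketch of precision at several document cutoff values, assuming the cutoff is applied to the number of documents seen in the ranking; the cutoff list follows the slide.

```python
# Sketch: precision after the top k documents have been examined, for several cutoffs.
def precision_at_cutoffs(relevant, ranking, cutoffs=(5, 10, 15, 20, 30, 50, 100)):
    return {k: sum(1 for d in ranking[:k] if d in relevant) / k for k in cutoffs}
```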

Hsin-Hsi Chen 3-12

Single Value Summaries
compare the retrieval performance of a retrieval algorithm for individual queries

• Average precision at seen relevant documents
  – Generate a single value summary of the ranking by averaging the precision figures obtained after each new relevant document is observed
  – Example
    1. d123 (1)     6. d9 (0.5)     11. d38
    2. d84          7. d511         12. d48
    3. d56 (0.66)   8. d129         13. d250
    4. d6           9. d187         14. d113
    5. d8           10. d25 (0.4)   15. d3 (0.3)

    (1+0.66+0.5+0.4+0.3)/5 = 0.57

  – Favors systems which retrieve relevant documents quickly
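A minimal sketch of the measure just described: average the precision values obtained at each retrieved relevant document. (The slide rounds the individual precisions and obtains 0.57; the exact values for the same example give about 0.58.)

```python
# Sketch: average precision at seen relevant documents for one query.
def avg_precision_seen_relevant(relevant, ranking):
    precisions = []
    seen = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            seen += 1
            precisions.append(seen / rank)     # precision right after this relevant doc
    return sum(precisions) / len(precisions) if precisions else 0.0
```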

Hsin-Hsi Chen 3-13

Single Value Summaries (Continued)

• R-Precision
  – Generate a single value summary of the ranking by computing the precision at the R-th position in the ranking, where R is the total number of relevant documents for the current query

  Example 1:
  1. d123   6. d9
  2. d84    7. d511
  3. d56    8. d129
  4. d6     9. d187
  5. d8     10. d25
  R=10 and # relevant=4, so R-precision = 4/10 = 0.4

  Example 2:
  1. d123
  2. d84
  3. d56
  R=3 and # relevant=1, so R-precision = 1/3 = 0.33
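A minimal sketch of R-precision as defined above; with the first example (R=10, four relevant docs in the top 10) it returns 0.4.

```python
# Sketch: precision at position R, where R = total number of relevant documents.
def r_precision(relevant, ranking):
    R = len(relevant)
    return sum(1 for d in ranking[:R] if d in relevant) / R if R else 0.0
```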

Hsin-Hsi Chen 3-14

Single Value Summaries (Continued)

• Precision Histograms
  – An R-precision graph for several queries
  – Compare the retrieval history of two algorithms

    RP_A/B(i) = RP_A(i) - RP_B(i)

    where RP_A(i) and RP_B(i) are the R-precision values of retrieval algorithms A and B for the i-th query

  – RP_A/B(i) = 0: both algorithms have equivalent performance for the i-th query
  – RP_A/B(i) > 0: A has better retrieval performance for query i
  – RP_A/B(i) < 0: B has better retrieval performance for query i
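A short sketch of the values plotted in a precision histogram, given per-query R-precision scores for two systems; the query ids and scores in the example are hypothetical.

```python
# Sketch: RP_A/B(i) = RP_A(i) - RP_B(i) for each query i.
def rp_differences(rp_a, rp_b):
    """rp_a, rp_b: dicts mapping query id -> R-precision."""
    return {q: rp_a[q] - rp_b[q] for q in rp_a}

# Hypothetical values: positive bars favour A, negative bars favour B.
print(rp_differences({1: 0.4, 2: 0.1}, {1: 0.3, 2: 0.5}))   # {1: 0.1, 2: -0.4}
```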

Hsin-Hsi Chen 3-15

Single Value Summaries (Continued)

(Figure: precision histogram of RP_A/B(i) for queries 1-10; x-axis query number, y-axis R-precision difference from -1.5 to 1.5.)

Hsin-Hsi Chen 3-16

Summary Table Statistics

• Statistical summary regarding the set of all the queries in a retrieval task
  – the number of queries used in the task

– the total number of documents retrieved by all queries

– the total number of relevant documents which were effectively retrieved when all queries are considered

– the total number of relevant documents which could have been retrieved by all queries

– …


Hsin-Hsi Chen 3-17

Precision and Recall Appropriateness

• Estimation of maximal recall requires knowledge of all the documents in the collection

• Recall and precision capture different aspects of the set of retrieved documents

• Recall and precision measure the effectiveness over queries in batch mode

• Recall and precision are defined under the enforcement of linear ordering of the retrieved documents

Hsin-Hsi Chen 3-18

The Harmonic Mean

• harmonic mean F(j) of recall and precision

  F(j) = 2 / (1/R(j) + 1/P(j)) = 2 P(j) R(j) / (P(j) + R(j))

• R(j): the recall for the j-th document in the ranking
• P(j): the precision for the j-th document in the ranking
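A minimal sketch of the harmonic mean just defined; the sample call uses the recall and precision at position 8 from the worked example on the next slide.

```python
# Sketch: harmonic mean F of recall and precision at a given ranking position.
def f_measure(recall_j, precision_j):
    if recall_j == 0 or precision_j == 0:
        return 0.0
    return 2.0 / (1.0 / recall_j + 1.0 / precision_j)

print(round(f_measure(0.667, 0.25), 2))   # 0.36, as in F(8) on the next slide
```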

Hsin-Hsi Chen 3-19

Example
  Rq={d3,d56,d129}
  1. d123   6. d9     11. d38
  2. d84    7. d511   12. d48
  3. d56    8. d129   13. d250
  4. d6     9. d187   14. d113
  5. d8     10. d25   15. d3

  (precision, recall): (33.3%,33.3%) (25%,66.6%) (20%,100%)

  F(3)  = 2 / (1/0.33 + 1/0.33) = 0.33
  F(8)  = 2 / (1/0.67 + 1/0.25) = 0.36
  F(15) = 2 / (1/1 + 1/0.20)    = 0.33

Hsin-Hsi Chen 3-20

The E Measure

• E evaluation measure
  – Allow the user to specify whether he is more interested in recall or precision

  E(j) = 1 - (1 + b²) / (b²/R(j) + 1/P(j))

  equivalently, E(j) = 1 - F_b(j), with F_b = (1 + b²) P R / (b² P + R)
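A minimal sketch of the E measure as reconstructed above; with b = 1 it reduces to 1 minus the harmonic mean, and the sample call reuses the F(8) example from the previous slides.

```python
# Sketch: the E measure with user-specified parameter b (b = 1 gives 1 - F).
def e_measure(recall_j, precision_j, b=1.0):
    if recall_j == 0 or precision_j == 0:
        return 1.0
    return 1.0 - (1.0 + b * b) / (b * b / recall_j + 1.0 / precision_j)

print(round(e_measure(0.667, 0.25), 2))   # 0.64 = 1 - F(8)
```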

Hsin-Hsi Chen 3-21

User-oriented measures

• Basic assumption of previous evaluation
  – The set of relevant documents for a query is the same, independent of the user
• User-oriented measures
  – coverage ratio
  – novelty ratio
  – relative recall
  – recall effort

Hsin-Hsi Chen 3-22

(Figure: within the collection, the relevant docs |R| and the answer set |A| overlap; |U| is the set of relevant docs known to the user, |Rk| the relevant docs known to the user which were retrieved, and |Ru| the relevant docs previously unknown to the user which were retrieved.)

  coverage = |Rk| / |U|
  novelty  = |Ru| / (|Ru| + |Rk|)

• high coverage ratio: the system finds most of the relevant documents the user expected to see
• high novelty ratio: the system reveals many new relevant documents which were previously unknown

  relative recall = (|Ru| + |Rk|) / |U|   (relevant docs proposed by the system / relevant docs known to the user)

  recall effort = # of relevant docs the user expected to find / # of docs examined in order to find the expected relevant docs
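A minimal sketch of the set-based user-oriented measures above; the argument names map onto U (relevant docs known to the user), Rk and Ru from the figure.

```python
# Sketch: coverage, novelty and relative recall from sets of document ids.
def user_oriented_measures(known_relevant, retrieved, all_relevant):
    rk = known_relevant & retrieved                     # known relevant docs retrieved
    ru = (all_relevant - known_relevant) & retrieved    # previously unknown relevant docs retrieved
    coverage = len(rk) / len(known_relevant) if known_relevant else 0.0
    novelty = len(ru) / (len(ru) + len(rk)) if (ru or rk) else 0.0
    relative_recall = (len(ru) + len(rk)) / len(known_relevant) if known_relevant else 0.0
    return coverage, novelty, relative_recall
```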

Hsin-Hsi Chen 3-23

Test Collections

• Components
  – Document set (document collection)
  – Queries (topics)
  – Relevance judgments
• Uses
  – Design and development: system testing
  – Evaluation: measurement of system effectiveness
  – Comparison: between different systems and different techniques
• Benchmarking
  – Different evaluation items depending on the purpose
  – Quantitative measures, e.g., precision and recall

Hsin-Hsi Chen 3-24

Test Collections (Continued)

• Small test collections
  – Early: Cranfield
  – English: SMART Collections, OHSUMED, Cystic Fibrosis, LISA, …
  – Japanese: BMIR-J2
• Large-scale evaluation environments: provide test collections and a forum for discussion
  – USA: TREC
  – Japan: NTCIR, IREX
  – Europe: AMARYLLIS, CLEF

Hsin-Hsi Chen 3-25

Basic statistics of the test collections

Test collection | # Docs | Size (MB) | Avg words/doc | # Queries | Avg words/query | Avg relevant docs/query | Subject domain | Judgment levels | Language
Cranfield II | 1,400 | 1.6 | 53.1 | 225 | 9.2 | 7.2 | aerodynamics | 4, 1 | English
ADI | 82 | 0.04 | 27.1 | 35 | 14.6 | 9.5 | documentation | N/A | English
MEDLARS | 1,033 | 1.1 | 51.6 | 30 | 10.1 | 23.2 | medicine | 2, 2 | English
TIME | 423 | 1.5 | 570 | 24 | 16.0 | 8.7 | world affairs | N/A | English
CACM | 3,204 | 2.2 | 24.5 | 64 | 10.8 | 15.3 | Communications of the ACM | N/A | English
CISI | 1,460 | 2.2 | 46.5 | 112 | 28.3 | 49.8 | information science | N/A | English
NPL | 11,429 | 3.1 | 20.0 | 100 | 7.2 | 22.4 | electronics, computing, physics, geography | N/A | English
INSPEC | 12,684 | N/A | 32.5 | 84 | 15.6 | 33.0 | physics, electronics, control | 2, 1 | English
ISILT | 800 | N/A | N/A | 63 | N/A | 8.4 | documentation | 1, 1 | English
UKCIS | 27,361 | N/A | 182 | 193 | N/A | 57 | biochemistry | 2, 2 | English
UKAEA | 12,765 | N/A | N/A | 60 | N/A | N/A | nuclear science | 2, 1 | English
LISA | 6,004 | 3.4 | N/A | 35 | N/A | 10.8 | N/A | N/A | English
Cystic Fibrosis | 1,239 | N/A | 49.7 | 100 | 6.8 | 6.4-31.9 | medicine | 6, 1 | English
OHSUMED | 348,566 | N/A | 250 | 101 | 10 | 17/19.4 | N/A | 2, 1 | English
BMIR-J2 | 5,080 | N/A | 621.8 | 60 | 102.2 | 10.6/28.4 | economics, engineering | 2, 1 | Japanese
TREC (TREC-1~6) | 1,754,896 | ~5GB | 481.6 | 350 | 105.8 | 185.3 | multiple topics | 1, 1 | English
AMARYLLIS | 336,000 | 201 | N/A | 56 | N/A | N/A | multiple topics | N/A | French
NTCIR | 300,000 | N/A | N/A | 100 | N/A | N/A | multiple topics | 2, 1 | Japanese
IREX | N/A | N/A | N/A | N/A | N/A | N/A | multiple topics | 2, 1 | Japanese

Early test collections: (1) short bibliographic records, e.g., titles, abstracts, keywords; (2) specialized subject domains.
Recent test collections: (1) multi-topic full text and detailed queries; (2) large scale.

Hsin-Hsi Chen 3-26

Cranfield II
(ftp://ftp.cs.cornell.edu/pub/smart/cran/)

• Compared the retrieval effectiveness of 33 different indexing methods
• Collected 1,400 documents (abstracts) on aerodynamics; each author was asked to pose queries based on these documents and the topic of their research at the time; after screening, more than 200 queries were produced

  .I 001
  .W
  what similarity laws must be obeyed when constructing aeroelastic models of heated high speed aircraft?

Hsin-Hsi Chen 3-27

Cranfield II (Continued)

• Relevance judgments in the Cranfield II test collection were built in four steps
  – First, the constructors of the queries judged the relevance of the citations and references attached to the documents.
  – Next, five graduate students in the field examined every query against every document, spending 1,500 hours on more than 500,000 relevance judgments, in the hope of finding all relevant documents.
  – To avoid documents missed by the preceding steps, bibliographic coupling was used to compute the relatedness between documents and uncover more potentially relevant documents. Two or more documents are coupled when they cite one or more papers in common.
  – Finally, all documents found above were sent back to the original authors for judgment.

  (Steps 1-3: finding potentially relevant documents; step 4: verification.)

Hsin-Hsi Chen 3-28

TREC: Overview

• TREC: Text REtrieval Conference
• Sponsors: NIST and DARPA; a subproject of the TIPSTER text program
• Leader: Donna Harman (Manager of the Natural Language Processing and Information Retrieval Group, Information Access and User Interfaces Division, NIST)
• Document collection
  – more than 5 GB
  – millions of documents

Hsin-Hsi Chen 3-29

History

• TREC-1 (Text REtrieval Conference) Nov 1992
• TREC-2 Aug 1993
• TREC-3
• TREC-7
  January 16, 1998 -- submit application to NIST.
  Beginning February 2 -- document disks distributed to those new participants who have returned the required forms.
  June 1 -- 50 new test topics for ad hoc task distributed.
  August 3 -- ad hoc results due at NIST.
  September 1 -- latest track submission deadline.
  September 4 -- speaker proposals due at NIST.
  October 1 -- relevance judgments and individual evaluation scores due back to participants.
  Nov. 9-11 -- TREC-7 conference at NIST in Gaithersburg, Md.

TREC-8 (1999) TREC-9 (2000) TREC-10 (2001) …


Hsin-Hsi Chen 3-30

The Test Collection

• the documents

• the example information requests (called topics in TREC)

• the relevant judgments (right answers)

Hsin-Hsi Chen 3-31

The Documents

• Disk 1 (1GB)
  – WSJ: Wall Street Journal (1987, 1988, 1989)
  – AP: AP Newswire (1989)
  – ZIFF: Articles from Computer Select disks (Ziff-Davis Publishing)
  – FR: Federal Register (1989)
  – DOE: Short abstracts from DOE publications
• Disk 2 (1GB)
  – WSJ: Wall Street Journal (1990, 1991, 1992)
  – AP: AP Newswire (1988)
  – ZIFF: Articles from Computer Select disks
  – FR: Federal Register (1988)

Hsin-Hsi Chen 3-32

The Documents (Continued)

• Disk 3 (1GB)
  – SJMN: San Jose Mercury News (1991)
  – AP: AP Newswire (1990)
  – ZIFF: Articles from Computer Select disks
  – PAT: U.S. Patents (1993)
• Statistics
  – document lengths
    DOE (very short documents) vs. FR (very long documents)
  – range of document lengths
    AP (similar in length) vs. WSJ and ZIFF (wider range of lengths)

Hsin-Hsi Chen 3-33

TREC document collections

Volume | Revised | Sources | Size (MB) | Docs | Median # Terms/Doc | Mean # Terms/Doc
1 | March 1994 | Wall Street Journal, 1987-1989 | 267 | 98,732 | 245 | 434.0
1 | March 1994 | Associated Press newswire, 1989 | 254 | 84,678 | 446 | 473.9
1 | March 1994 | Computer Selects articles, Ziff-Davis | 242 | 75,180 | 200 | 473.0
1 | March 1994 | Federal Register, 1989 | 260 | 25,960 | 391 | 1315.9
1 | March 1994 | Abstracts of U.S. DOE publications | 184 | 226,087 | 111 | 120.4
2 | March 1994 | Wall Street Journal, 1990-1992 (WSJ) | 242 | 74,520 | 301 | 508.4
2 | March 1994 | Associated Press newswire, 1988 (AP) | 237 | 79,919 | 438 | 468.7
2 | March 1994 | Computer Selects articles, Ziff-Davis (ZIFF) | 175 | 56,920 | 182 | 451.9
2 | March 1994 | Federal Register, 1988 (FR88) | 209 | 19,860 | 396 | 1378.1
3 | March 1994 | San Jose Mercury News, 1991 | 287 | 90,257 | 379 | 453.0
3 | March 1994 | Associated Press newswire, 1990 | 237 | 78,321 | 451 | 478.4
3 | March 1994 | Computer Selects articles, Ziff-Davis | 345 | 161,021 | 122 | 295.4
3 | March 1994 | U.S. patents, 1993 | 243 | 6,711 | 4445 | 5391.0
4 | May 1996 | The Financial Times, 1991-1994 (FT) | 564 | 210,158 | 316 | 412.7
4 | May 1996 | Federal Register, 1994 (FR94) | 395 | 55,630 | 588 | 644.7
4 | May 1996 | Congressional Record, 1993 (CR) | 235 | 27,922 | 288 | 1373.5
5 | April 1997 | Foreign Broadcast Information Service (FBIS) | 470 | 130,471 | 322 | 543.6
5 | April 1997 | Los Angeles Times (1989, 1990) | 475 | 131,896 | 351 | 526.5
Routing Test Data | | Foreign Broadcast Information Service (FBIS) | 490 | 120,653 | 348 | 581.3

DOE (very short documents) vs. FR (very long documents)
AP (similar in length) vs. WSJ and ZIFF (wider range of lengths)

Hsin-Hsi Chen 3-34

Document Format
(in Standard Generalized Mark-up Language, SGML)

<DOC>
<DOCNO>WSJ880406-0090</DOCNO>
<HL>AT&T Unveils Services to Upgrade Phone Networks Under Global Plan</HL>
<AUTHOR>Janet Guyon (WSJ staff)</AUTHOR>
<DATELINE>New York</DATELINE>
<TEXT>
American Telephone & Telegraph Co. introduced the first of a new generation of phone services with broad implications for computer and communications . . .
</TEXT>
</DOC>

Hsin-Hsi Chen 3-35

TREC document markup

<DOC>

<DOCNO>FT911-3</DOCNO>

<PROFILE>AN-BE0A7AAIFT</PROFILE>

<DATE>910514

</DATE>

<HEADLINE>

FT 14 MAY 91 / International Company News: Contigas plans DM900m east German project

</HEADLINE>

<BYLINE>

By DAVID GOODHART

</BYLINE>

<DATELINE>

BONN

</DATELINE>

<TEXT>

CONTIGAS, the German gas group 81 per cent owned by the utility Bayernwerk, said yesterday that it intends to invest DM900m (Dollars 522m) in the next four years to build a new gas distribution system in the east German state of Thuringia. …

</TEXT>

</DOC>

Hsin-Hsi Chen 3-36

The Topics

• Issue 1
  – allow a wide range of query construction methods
  – keep the topic (user need) distinct from the query (the actual text submitted to the system)
• Issue 2
  – increase the amount of information available about each topic
  – include with each topic a clear statement of what criteria make a document relevant
• TREC
  – 50 topics/year, 400 topics (TREC-1~TREC-7)

Hsin-Hsi Chen 3-37

Sample Topics used in TREC-1 and TREC-2

<top>
<head>Tipster Topic Description
<num>Number: 066
<dom>Domain: Science and Technology
<title>Topic: Natural Language Processing

<desc>Description: (one sentence description)
Document will identify a type of natural language processing technology which is being developed or marketed in the U.S.

<narr>Narrative: (complete description of document relevance for assessors)
A relevant document will identify a company or institution developing or marketing a natural language processing technology, identify the technology, and identify one or more features of the company's product.

<con>Concepts: (a mini-knowledge base about the topic, such as a real searcher might possess)
1. natural language processing
2. translation, language, dictionary, font
3. software applications

Hsin-Hsi Chen 3-38

<fac>Factors: (allow easier automatic query building by listing specific items from the narrative that constrain the documents that are relevant)
<nat>Nationality: U.S.
</fac>
<def>Definition(s):
</top>

Hsin-Hsi Chen 3-39

TREC-1 and TREC-2 topics (sample)

<top>

<head> Tipster Topic Description

<num> Number: 037

<dom> Domain: Science and Technology

<title> Topic: Identify SAA components

<desc> Description:

Document identifies software products which adhere to IBM's SAA standards.

<narr> Narrative:

To be relevant, a document must identify a piece of software which is considered a Systems Application Architecture (SAA) component or one which conforms to SAA.

<con> Concept(s):

1. SAA

2. OfficeVision

3. IBM

4. Standards, Interfaces, Compatibility

<fac> Factor(s):

<def> Definition(s):

OfficeVision - A series of integrated office automation applications from IBM that runs across all of its major computer families.

Systems Application Architecture (SAA) - A set of IBM standards that provide consistent user interfaces, programming interfaces, and communications protocols among all IBM computers from micro to mainframe.

</top>

Hsin-Hsi Chen 3-40

TREC-3 topics (sample)

<top>

<num> Number: 177

<title> Topic: English as the Official Language in U.S.

<desc> Description:

Document will provide arguments supporting the making of English the standard language of the U.S.

<narr> Narrative:

A relevant document will note instances in which English is favored as a standard language. Examples are the positive results achieved by immigrants in the areas of acceptance, greater economic opportunity, and increased academic achievement. Reports are also desired which describe some of the language difficulties encountered by other nations and groups of nations, e.g., Canada, Belgium, European Community, when they have opted for the use of two or more languages as their official means of communication. Not relevant are reports which promote bilingualism or multilingualism.

</top>

Hsin-Hsi Chen 3-41

Sample Topics used in TREC-3

<num>Number: 168
<title>Topic: Financing AMTRAK

<desc>Description:
A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK).

<narr>Narrative:
A relevant document must provide information on the government's responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.

Hsin-Hsi Chen 3-42

Features of topics in TREC-3

• The topics are shorter.

• The topics miss the complex structure of the earlier topics.

• The concept field has been removed.

• The topics were written by the same group of users that did the assessments.

• Summary:
  – TREC-1 and 2 (1-150): suited to the routing task
  – TREC-3 (151-200): suited to the ad-hoc task

Hsin-Hsi Chen 3-43

TREC-4 topics (sample)

<top>

<num> Number: 217

<desc> Description:

Reporting on possibility of and search for extra-terrestrial life/intelligence.

</top>

TREC-4 topics kept only the description field; TREC-5 topics were adjusted back to a structure similar to TREC-3, but with a shorter average length.

Hsin-Hsi Chen 3-44

TREC: Topics

Word counts (including stop words)

Topics | Field | Min words | Max words | Avg words
TREC-1 (51-100) | Total | 44 | 250 | 107.4
TREC-1 (51-100) | Title | 1 | 11 | 3.8
TREC-1 (51-100) | Description | 5 | 41 | 17.9
TREC-1 (51-100) | Narrative | 23 | 209 | 64.5
TREC-1 (51-100) | Concepts | 4 | 111 | 21.2
TREC-2 (101-150) | Total | 54 | 231 | 130.8
TREC-2 (101-150) | Title | 2 | 9 | 4.9
TREC-2 (101-150) | Description | 6 | 41 | 18.7
TREC-2 (101-150) | Narrative | 27 | 165 | 78.8
TREC-2 (101-150) | Concepts | 3 | 88 | 28.5
TREC-3 (151-200) | Total | 49 | 180 | 103.4
TREC-3 (151-200) | Title | 2 | 20 | 6.5
TREC-3 (151-200) | Description | 9 | 42 | 22.3
TREC-3 (151-200) | Narrative | 26 | 146 | 74.6
TREC-4 (201-250) | Total | 8 | 33 | 16.3
TREC-4 (201-250) | Description | 8 | 33 | 16.3
TREC-5 (251-300) | Total | 29 | 213 | 82.7
TREC-5 (251-300) | Title | 2 | 10 | 3.8
TREC-5 (251-300) | Description | 6 | 40 | 15.7
TREC-5 (251-300) | Narrative | 19 | 168 | 63.2
TREC-6 (301-350) | Total | 47 | 156 | 88.4
TREC-6 (301-350) | Title | 1 | 5 | 2.7
TREC-6 (301-350) | Description | 5 | 62 | 20.4
TREC-6 (301-350) | Narrative | 17 | 142 | 65.3

• Topic structure and length
• Topic construction
• Topic screening
  – pre-search
  – judging the number of relevant documents

Hsin-Hsi Chen 3-45

TREC-6 topic screening procedure

(Flow chart)
• Enter keywords into the PRISE system and run retrieval.
• How many of the top 25 documents are relevant?
  – 0: reject the topic
  – 1-5: continue reading retrieved documents 26-100 and judge their relevance
  – 6-20: using relevance feedback and similar means, issue more queries, run retrieval again, and judge the relevance of the top 100 documents
  – ≧20: reject the topic
• Record the number of relevant documents.

Hsin-Hsi Chen 3-46

The Relevance Judgments

• For each topic, compile a list of relevant documents.
• Approaches
  – full relevance judgments (impossible)
    judge over 1M documents for each topic, resulting in 100M judgments
  – random sample of documents (insufficient relevance sample)
    relevance judgments done on the random sample only
  – TREC approach (pooling method)
    make relevance judgments on the sample of documents selected by the various participating systems
    assumption: the vast majority of relevant documents have been found, and documents that have not been judged can be assumed to be not relevant
• Pooling method
  – Take the top 100 documents retrieved by each system for a given topic.
  – Merge them into a pool for relevance assessment.
  – The sample is given to human assessors for relevance judgments.
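A minimal sketch (not NIST's code) of the pooling step just described: merge the top-k documents of each submitted run for one topic into a single duplicate-free pool, which is then judged manually.

```python
# Sketch of the pooling method for a single topic.
def build_pool(runs, k=100):
    """runs: list of rankings (lists of doc ids) submitted by participating systems."""
    pool = set()
    for ranking in runs:
        pool.update(ranking[:k])          # top-k documents of each run
    return pool                           # docs outside the pool are assumed not relevant
```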

Hsin-Hsi Chen 3-47

TREC: Relevance Judgments

• Judgment method
  – pooling method
  – manual judgment
• Judgment criterion: binary, relevant vs. not relevant
• Quality of the relevance judgments
  – completeness
  – consistency

Hsin-Hsi Chen 3-48

The Pooling Method

• For each topic, take the top n documents from the results returned by each participating system and merge them into a pool.
• The pool is treated as the set of candidate relevant documents for the topic; after duplicates are removed, it is sent back to the original constructor of the topic for relevance judgment.
• The idea is to gather as many potentially relevant documents as possible through many different systems and retrieval techniques, thereby reducing the burden of manual judgment.

Hsin-Hsi Chen 3-49

Overlap of Submitted Results

• TREC-1 (TREC-2): top 100 documents for each run (33 runs & 40 runs)
• TREC-3: top 100 (200) documents for each run (48 runs)
• After pooling, each topic was judged by a single assessor to ensure the best consistency of judgment.

TREC-1 and TREC-2 differ by 7 runs, yet the proportion of unique documents retrieved differs little (39% vs. 28%), as does the proportion judged relevant (22% vs. 19%). TREC-3 doubled the number of documents supplied for judgment; the unique portion again differs little (21% vs. 20%), as does the proportion judged relevant (15% vs. 10%).

Hsin-Hsi Chen 3-50

TREC: pool candidates vs. actually relevant documents

Adhoc | Total docs sent to the pool by all systems | Actual docs in the pool (duplicates removed) | Actually relevant docs
TREC-1 | 8800 | 1279 (39%) | 277 (22%)
TREC-2 | 4000 | 1106 (28%) | 210 (19%)
TREC-3 | 2700 | 1005 (37%) | 146 (15%)
TREC-4 | 7300 | 1711 (24%) | 130 (8%)
TREC-5 | 10100 | 2671 (27%) | 110 (4%)
TREC-6 | 8480 | 1445 (42%) | 92 (6.4%)

Routing | Total docs sent to the pool by all systems | Actual docs in the pool (duplicates removed) | Actually relevant docs
TREC-1 | 2200 | 1067 (49%) | 371 (35%)
TREC-2 | 4000 | 1466 (37%) | 210 (14%)
TREC-3 | 2300 | 703 (31%) | 146 (21%)
TREC-4 | 3800 | 957 (25%) | 132 (14%)
TREC-5 | 3100 | 955 (31%) | 113 (12%)
TREC-6 | 4400 | 1306 (30%) | 140 (11%)

Hsin-Hsi Chen 3-51

TREC relevance judgment records

54 0 FB6-F004-0059 0
54 0 FB6-F004-0073 1
54 0 FB6-F004-0077 1
54 0 FB6-F004-0078 1
54 0 FB6-F004-0080 1
54 0 FB6-F004-0083 1
54 0 FB6-F004-0087 1
54 0 FB6-F004-0089 1
54 0 FB6-F004-0090 1
54 0 FB6-F004-0092 1
54 0 FB6-F004-0094 1
54 0 FB6-F004-0095 1
54 0 FB6-F004-0096 1
54 0 FB6-F004-0098 1
54 0 FB6-F004-0100 1
54 0 FB6-F004-0102 1
54 0 FB6-F004-0104 1
54 0 FB6-F004-0105 1

First column: topic number; third column: document identifier; fourth column: judgment (decision level).
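A minimal sketch of reading these judgment records into a lookup structure. The slide labels only the topic number, document identifier and judgment columns; treating the second column as an unused field is an assumption (in standard TREC qrels it is the feedback iteration and is always 0).

```python
# Sketch: parse 4-column judgment records (topic, unused, docno, judgment).
def load_qrels(lines):
    qrels = {}
    for line in lines:
        topic, _, docno, judgment = line.split()
        qrels.setdefault(int(topic), {})[docno] = int(judgment)
    return qrels

sample = ["54 0 FB6-F004-0059 0", "54 0 FB6-F004-0073 1"]
print(load_qrels(sample))   # {54: {'FB6-F004-0059': 0, 'FB6-F004-0073': 1}}
```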

Hsin-Hsi Chen 3-52

TREC: evaluation tasks and tracks

(Table: which tasks and tracks ran in TREC-1 through TREC-7; the per-year marks are not recoverable here.)

• Main tasks: Routing, Adhoc
• Tracks: Confusion, Spoken Document Retrieval, Database Merging, Filtering, High Precision, Interactive, Cross Language, Multilingual (Spanish, Chinese), Natural Language Processing, Query, Very Large Corpus

Hsin-Hsi Chen 3-53

TREC-7

• Ad hoc task
  – Participants will receive 5 gigabytes of data for use in training their systems.

– The 350 topics used in the first six TREC workshops and the relevance judgments for those topics will also be available.

– The 50 new test topics (351-400) will be distributed in June and will be used to search the document collection consisting of the documents on TREC disks 4 and 5.

– Results will be submitted to NIST as the ranked top 1000 documents retrieved for each topic.

Hsin-Hsi Chen 3-54

TREC-7 (Continued)

• Track tasks
  – Filtering Track
    • A task in which the topics are stable (and some relevant documents are known) but there is a stream of new documents.
    • For each document, the system must make a binary decision as to whether the document should be retrieved (as opposed to forming a ranked list).
  – Cross-Language Track
    • An ad hoc task in which some documents are in English, some in German, and others in French.
    • The focus of the track will be to retrieve documents that pertain to the topic regardless of language.

Hsin-Hsi Chen 3-55

TREC-7 (Continued)

• High Precision User Track
  – An ad hoc task in which participants are given five minutes per topic to produce a retrieved set using any means desired (e.g., through user interaction, completely automatically).
• Interactive Track
  – A task used to study user interaction with text retrieval systems.
• Query Track
  – A track designed to foster research on the effects of query variability and analysis on retrieval performance.
  – Participants each construct several different versions of existing TREC topics, some versions as natural language topics and some as structured queries in a common format.
  – All groups then run all versions of the topics.

Hsin-Hsi Chen 3-56

TREC-7 (Continued)

• Spoken Document Retrieval Track
  – An ad hoc task that investigates a retrieval system's ability to retrieve spoken documents (recordings of speech).
• Very Large Corpus (VLC)
  – An ad hoc task that investigates the ability of retrieval systems to handle larger amounts of data. The current target corpus size is approximately 100 gigabytes.

Hsin-Hsi Chen 3-57

Categories of Query Construction

• AUTOMATIC
  completely automatic initial query construction
• MANUAL
  manual initial construction
• INTERACTIVE
  use of interactive techniques to construct the queries

Hsin-Hsi Chen 3-58

Levels of Participation

• Category A: full participation
• Category B: full participation using a reduced database
• Category C: evaluation only
• Submit up to two runs for the routing task, the adhoc task, or both
• Send in the top 1000 documents retrieved for each topic for evaluation


Hsin-Hsi Chen 3-59

TREC-3 Participants(14 companies, 19 universities)

Hsin-Hsi Chen 3-60

TREC-6

Apple Computer
AT&T Labs Research
Australian National Univ.
Carnegie Mellon Univ.
CEA (France)
Center for Inf. Res., Russia
Duke Univ./Univ. of Colorado/Bellcore
ETH (Switzerland)
FS Consulting, Inc.
GE Corp./Rutgers Univ.
George Mason Univ./NCR Corp
Harris Corp.
IBM T.J. Watson Res. (2 groups)
ISS (Singapore)
ITI (Singapore)
APL, Johns Hopkins Univ.
……………

Hsin-Hsi Chen 3-61

Evaluation Measures at TREC

• Summary table statistics
  – The number of topics used in the task
  – The number of documents retrieved over all topics
  – The number of relevant documents which were effectively retrieved for all topics
• Recall-precision averages
• Document level averages
  – Average precision at specified document cutoff values (e.g., 5, 10, 20, 100 relevant documents)

• Average precision histogram

Hsin-Hsi Chen 3-62

TREC: Criticisms and Negative Evaluations

• Regarding the test collection
  – Topics
    • not real user needs; too artificial
    • lack a description of the context of the need
  – Relevance judgments
    • binary relevance judgments are unrealistic
    • the pooling method misses relevant documents, making recall figures inaccurate
    • quality and consistency
• Regarding effectiveness measurement
  – focuses only on quantitative measures
  – problems with recall
  – suitable for comparing systems, but not for evaluating a single system

Hsin-Hsi Chen 3-63

TREC: Criticisms and Negative Evaluations (Continued)

• Regarding the evaluation procedure
  – Interactive retrieval
    • lack of user involvement
    • static information needs are unrealistic

Hsin-Hsi Chen 3-64

NTCIR: Overview

• NTCIR: NACSIS Test Collections for IR
• Organizer: NACSIS (National Center for Science Information Systems, Japan)
• Background
  – the need for a large Japanese benchmark test collection
  – the needs of research and development in cross-language retrieval
• Document collection
  – source: NACSIS Academic Conference Papers Database
  – mainly abstracts of conference papers
  – more than 330,000 documents, over half of which are Japanese-English paired documents
  – some include part-of-speech tags

Hsin-Hsi Chen 3-65

NTCIR: Topics

• Source: real user needs were collected and then revised and rewritten
• 100 topics so far, from different subject fields
• Structure

  <TOPIC q=nnnn> topic number
  <title> title </title>
  <description> short description of the information need </description>
  <narrative> detailed description of the information need, including further explanation, definitions of terms, background knowledge, the purpose of the search, the expected number of relevant documents, the desired document types, criteria for relevance judgment, etc. </narrative>
  <concepts> keywords for related concepts </concepts>

Hsin-Hsi Chen 3-66

NTCIR: Relevance Judgments

• Judgment method
  – first screen with the pooling method
  – judgments made by subject experts and by the constructors of the topics
• Judgment levels
  – A: relevant
  – B: partially relevant
  – C: not relevant
• Precision computation: differs by evaluation item
  – Relevant qrels: both B and C are treated as not relevant
  – Partially relevant qrels: both A and B are treated as relevant

Hsin-Hsi Chen 3-67

NTCIR: Evaluation Tasks

• Ad-hoc Information Retrieval Task
• Cross-lingual Information Retrieval Task
  – retrieve English documents using Japanese topics
  – 21 topics in total; the relevance judgments cover both English and Japanese documents
  – systems may construct queries automatically or manually
  – systems must return the top 1000 retrieved documents
• Automatic Term Extraction and Role Analysis Task
  – Automatic Term Extraction: extract technical terms from titles and abstracts
  – Role Analysis Task

Hsin-Hsi Chen 3-68

NTCIR Workshop 2

• Organizers
  – Hsin-Hsi Chen (Chinese IR track)
  – Noriko Kando (Japanese IR track)
  – Sung-Hyon Myaeng (Korean IR track)
• Chinese test collection
  – developer: Professor Kuang-hua Chen (LIS, NTU)
  – document collection: 132,173 news stories
  – topics: 50

Hsin-Hsi Chen 3-69

NTCIR 2 schedule

• Someday in April, 2000: Call for Participation
• May or later: Training set will be distributed
• August, 2000: Test documents and topics will be distributed
• Sept. 10-30, 2000: Results submission
• Jan., 2001: Evaluation results will be distributed
• Feb. 1, 2001: Paper submission for working notes
• Feb. 19-22, 2001 (or Feb. 26-March 1): Workshop (in Tokyo)
• March, 2001: Proceedings

Hsin-Hsi Chen 3-70

IREX: Overview

• IREX: Information Retrieval and Extraction Exercise
• Organizer: Information Processing Society of Japan
• Participants: about 20 teams (or more)
• Preliminary test: used topics from the BMIR-J2 test collection
• Document collection
  – Mainichi Shimbun newspaper, 1994-1995
  – participants must purchase the news corpus

Hsin-Hsi Chen 3-71

IREX: Topics

• Structure

  <topic_id> topic number </topic_id>
  <description> a short statement of the information need, mainly a noun phrase consisting of a noun and its modifiers </description>
  <narrative> a detailed statement of the information need in natural language, usually 2 to 3 sentences, also including explanations of terms, synonyms, or examples. </narrative>

  – the words in the description field must be contained in the narrative field

Hsin-Hsi Chen 3-72

IREX: Relevance Judgments

• Basis for judgment: all fields of the test topic
• Judgment method: two students make the judgments
  – if the two judgments agree, the relevance judgment is complete
  – if they disagree or are uncertain, the final decision is made by three judges
• Judgment levels
  – students: 6 levels
    • A: relevant        A?: unsure whether relevant
    • B: partially relevant   B?: unsure whether partially relevant
    • C: not relevant    C?: unsure whether not relevant

Hsin-Hsi Chen 3-73

IREX: Relevance Judgments (Continued)

  – final judges: 3 levels
    • A: relevant
    • B: partially relevant
    • C: not relevant
• Revision of relevance judgments

Hsin-Hsi Chen 3-74

IREX: Evaluation

• Tasks
  – Named Entity Task (NE)
    • similar to MUC; tests a system's ability to automatically extract proper names such as organization names, person names, and place names
    • extraction from general-domain documents vs. extraction from special-domain documents
  – Information Retrieval (IR)
    • similar to TREC
• Evaluation rules
  – documents returned: top 300

Hsin-Hsi Chen 3-75

BMIR-J2: Overview

• The first Japanese test collection for information retrieval systems
  – BMIR-J1: 1996
  – BMIR-J2: March 1998
• Developed by: IPSG-SIGDS
• Document collection: mainly news articles
  – Mainichi Shimbun: 5,080 articles
  – economics and engineering
• Topics: 60

Hsin-Hsi Chen 3-76

BMIR-J2: Relevance Judgments

• Keywords combined with Boolean logic were used to search with one or two IR systems
• Database searchers then made further relevance judgments
• The builders of the test collection checked the judgments again

Hsin-Hsi Chen 3-77

BMIR-J2: Topics

Q: F=oxoxo: "Utilizing solar energy"
Q: N-1: Retrieve texts mentioning use of solar energy
Q: N-2: Include texts concerning generating electricity and drying things with solar heat.

• Classification of topics
  – purpose: mark the characteristics of each test topic so that systems can select suitable ones
  – notation: o (necessary), x (unnecessary)
  – categories
    • The basic function
    • The numeric range function
    • The syntactic function
    • The semantic function
    • The world knowledge function

Hsin-Hsi Chen 3-78

AMARYLLIS: Overview

• Organizer: INIST (Institute for Scientific and Technical Information)
• Participants: nearly 10 teams
• Document collection
  – news articles: The World, more than 20,000 articles
  – titles and abstracts extracted from the Pascal (1984-1995) and Francis (1992-1995) databases, more than 300,000 records

Hsin-Hsi Chen 3-79

AMARYLLIS: Topics

• Structure

  <num> topic number </num>
  <dom> subject field </dom>
  <suj> title </suj>
  <que> short description of the information need </que>
  <cinf> detailed description of the information need </cinf>
  <ccept><c> concepts, descriptors </c></ccept>

Hsin-Hsi Chen 3-80

AMARYLLIS: Relevance Judgments

• Original relevance judgments
  – built by the owners of the document collections
• Revision of the answer keys
  – additions
    • documents not in the initial answer key but retrieved by more than half of the participants
    • the top 10 documents of the results returned by participants
  – removals
    • documents that appear in the original answer key but do not appear in the results returned by participants

Hsin-Hsi Chen 3-81

AMARYLLIS ~評比

• 系統需送回檢索結果的前 250 篇• 系統可選擇採取自動或人工的方式建立 query

• 評比項目– Routing Task

– Adhoc Task


Hsin-Hsi Chen 3-82

An Evaluation of Query Processing Strategies Using the Tipster Collection

(SIGIR 1993: 347-355)

James P. Callan and W. Bruce Croft


Hsin-Hsi Chen 3-83

INQUERY Information Retrieval System

• Documents are indexed by the word stems and numbers that occur in the text.

• Documents are also indexed automatically by a small number of features that provide a controlled indexing vocabulary.

• When a document refers to a company by name, the document is indexed by the company name and the feature #company.

• INQUERY includes company, country, U.S. city, number and date, and person name recognizer.

Hsin-Hsi Chen 3-84

INQUERY Information Retrieval System

• feature operators
  the #company operator matches the #company feature
• proximity operators
  require their arguments to occur either in order, within some distance of each other, or within some window
• belief operators
  use the maximum, sum, or weighted sum of a set of beliefs

• synonym operators

• Boolean operators


Hsin-Hsi Chen 3-85

Query Transformation in INQUERY

• Discard stop phrases.

• Recognize phrases by stochastic part of speech tagger.

• Look for word “not” in the query.

• Recognize proper names by assuming that a sequence of capitalized words is a proper name.

• Introduce synonyms by a small set of words that occur in the Factors field of TIPSTER topics.

• Introduce controlled vocabulary terms (feature operators).

Hsin-Hsi Chen 3-86

Techniques for Creating Ad Hoc Queries

• Simple Queries (description-only approach)
  – Use the contents of the Description field of TIPSTER topics only.
  – Explore how the system behaves with very short queries.
• Multiple Sources of Information (multiple-field approach)
  – Use the contents of the Description, Title, Narrative, Concept(s) and Factor(s) fields.
  – Explore how a system might behave with an elaborate user interface or very sophisticated query processing.
• Interactive Query Creation
  – Automatic query creation followed by simple manual modifications.
  – Simulate simple user interaction with the query processing.


Hsin-Hsi Chen 3-87

Simple Queries

• A query is constructed automatically by employing all the query processing transformations on Description field.

• The remaining words and operators are enclosed in a weighted sum operator.

• 11-point average precision


Hsin-Hsi Chen 3-88


Hsin-Hsi Chen 3-89

Multiple Sources of Information

• Q-1: Created automatically, using the T, D, N, C and F fields. Everything except the synonym and concept operators was discarded from the Narrative field. (baseline model)

• Q-3: The same as Q-1, except that recognition of phrases and proper names was disabled. (words-only query)
  To determine whether phrase and proximity operators were helpful.

• Q-4: The same as Q-1, except that recognition of phrases was applied to the Narrative field.
  To determine whether the simple query processing transformation would be effective on the abstract descriptions in the Narrative field.

Hsin-Hsi Chen 3-90

Multiple Sources of Information (Continued)

• Q-6: The same as Q-1, except that only the T, C, and F fields were used.
  Narrow in on the set of fields that appeared most useful.

• Q-F: The same as Q-1, with 5 additional thesaurus words or phrases added automatically to each query.
  An approach to automatically discovering thesaurus terms.

• Q-7: A combination of Q-1 and Q-6.
  Whether combining the results of two relatively similar queries could yield an improvement.
  At first glance, Q-6 looks like a subset of Q-1, so there would seem to be no need to combine them; on closer inspection they differ: if terms are selected according to some criterion, Q-1 may take only a small portion of the T, C, and F fields, whereas Q-6 does not.

Hsin-Hsi Chen 3-91

A Comparison of Six Automatic Methods of Constructing Ad Hoc Queries

• Phrases improved performance at low recall.
• Phrases from the Narrative were not helpful.
• Discarding the Description and Narrative fields did not hurt performance appreciably.
• It is possible to automatically construct a useful thesaurus for a collection.
• Q-1 and Q-6, which are similar, retrieve different sets of documents.

Hsin-Hsi Chen 3-92

Interactive Query Creation

• The system created a query using method Q-1, and then a person was permitted to modify the resulting query.
• Modifications
  – add words from the Narrative field
  – delete words or phrases from the query
  – indicate that certain words or phrases should occur near each other within a document
• Q-M
  Manual addition of words or phrases from the Narrative, and manual deletion of words or phrases from the query
• Q-O
  The same as Q-M, except that the user could also indicate that certain words or phrases must occur within 50 words of each other

Hsin-Hsi Chen 3-93

• Recall levels (10%-60%) are acceptable because users are not likely to examine all documents retrieved.
• Paragraph retrieval (within 50 words) significantly improves effectiveness.

Hsin-Hsi Chen 3-94

The effects of thesaurus terms and phrases on queries that were created automatically and modified manually

• Cf. Q-O (42.7): thesaurus words and phrases were added after the query was modified, so they were not used in unordered window operators.
• Inclusion of unordered window operators.