Redeeming Relevance for Subject Search in Citation Indexes Shannon Bradshaw The University of Iowa shannon- [email protected]


Page 1: Redeeming Relevance for Subject Search in Citation Indexes

Redeeming Relevance for Subject Search in Citation Indexes

Shannon Bradshaw

The University of Iowa

[email protected]

Page 2: Redeeming Relevance for Subject Search in Citation Indexes

Citation Indexes

Valuable tools for research

Examples: SCI, CiteSeer, arXiv, CiteBase

Permit traversal of citation networks

Identify significant contributions

Subject search is often the entry point

Page 3: Redeeming Relevance for Subject Search in Citation Indexes

Subject search

Query similarity

Citation frequency

Page 4: Redeeming Relevance for Subject Search in Citation Indexes

Citation frequency

PageRank

Example: 2 papers similar in terms of relevance, published at roughly the same time

Paper A cited only by its author

Paper B cited 10 times by other authors

Paper B likely to have greater priority for reading

Page 5: Redeeming Relevance for Subject Search in Citation Indexes

Problem

Boolean retrieval metrics

Many top documents are not relevant

Effective for Web searches

Any one of several popular pages will do

Not so for users of citation indexes

Page 6: Redeeming Relevance for Subject Search in Citation Indexes

Reference Directed Indexing (RDI)

Objective: To combine strong measures of both relevance and significance in a single metric

Intuition: The opinions of authors who cite a document effectively distinguish both what a document is about and how important a contribution it makes

Similar to the use of anchor text to index Web documents

Page 7: Redeeming Relevance for Subject Search in Citation Indexes

Example

Paper by Ron Azuma and Gary Bishop

On tracking the heads of users in augmented reality systems

Head tracking is necessary in order to generate the correct perspective view

Page 8: Redeeming Relevance for Subject Search in Citation Indexes

A single reference to Azuma

Azuma et al. [2] developed a 6DOF tracking system using linear accelerometers and rate gyroscopes to improve the dynamic registration of an optical beacon ceiling tracker.

Page 9: Redeeming Relevance for Subject Search in Citation Indexes

Summarizes Azuma paper as…

A six degrees of freedom tracking system

With additional details:

Improves dynamic registration

Optical beacon ceiling tracker

Linear accelerometers

Rate gyroscopes

Page 10: Redeeming Relevance for Subject Search in Citation Indexes

Leveraging multiple citations

For any document cited more than once…

We can compare the words of all authors

Terms used by many referrers make good index terms for a document

Page 11: Redeeming Relevance for Subject Search in Citation Indexes

Repeated use of “tracking” and “augmented reality”

Whereas several augmented reality environments are known (cf. State et al. [1], Azuma and Bishop [3])

… e.g. landmark tracking for determining head pose in augmented reality [2, 3, 4, 5]

Azuma and Holloway analyze sources of registration and tracking errors in AR systems [2, 11, 12].

Azuma et al. [2] developed a 6DOF tracking system using linear accelerometers
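The idea of comparing the words of all referrers can be sketched in a few lines of Python. This is an illustration only, not the Rosetta implementation: the four passages are the citing sentences quoted on this slide with citation markers trimmed, and the stop list and the `terms` and `referrer_counts` helpers are hypothetical.

```python
import re
from collections import Counter

# Citing passages from the slide, with bracketed citation markers trimmed.
referrers = [
    "Whereas several augmented reality environments are known",
    "landmark tracking for determining head pose in augmented reality",
    "Azuma and Holloway analyze sources of registration and tracking errors in AR systems",
    "Azuma et al. developed a 6DOF tracking system using linear accelerometers",
]

# A tiny, hypothetical stop list; a real system would use a fuller one.
STOP_WORDS = {"a", "and", "are", "in", "of", "for", "et", "al", "using"}

def terms(text):
    """Return the set of lowercase words in one citing passage, minus stop words."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}

def referrer_counts(passages):
    """For each term, count how many distinct referrers use it."""
    counts = Counter()
    for passage in passages:
        counts.update(terms(passage))  # a set, so each referrer counts once per term
    return counts

counts = referrer_counts(referrers)
print(counts.most_common(5))
```

On these four passages "tracking" is used by three referrers and "augmented" and "reality" by two each, which is the kind of agreement the slide points to.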

Page 12: Redeeming Relevance for Subject Search in Citation Indexes

A voting technique

RDI treats each citing document as a voter

The presence of a query term in referential text is a vote of “yes”

The absence of that term, a “no”

The documents with the most votes for the query terms rank highest
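A minimal sketch of the voting idea, assuming referential text has already been reduced to one set of terms per citing document. The function name `vote_rank`, the tie-break by document id, and the toy data are all hypothetical; the slides do not say how Rosetta stores or tallies votes.

```python
def vote_rank(query_terms, referential_text):
    """Rank documents by the number of "yes" votes from their citing documents.

    referential_text maps a document id to a list of term sets, one set per
    citing document (the words that referrer used near its citation).
    """
    scores = {}
    for doc_id, referrers in referential_text.items():
        votes = 0
        for referrer_terms in referrers:
            for term in query_terms:
                if term in referrer_terms:  # presence is a "yes"; absence adds nothing
                    votes += 1
        scores[doc_id] = votes
    # Most votes first; ties broken by document id so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical toy index: three cited papers and the terms their referrers used.
index = {
    "paper_a": [{"head", "tracking", "augmented", "reality"},
                {"tracking", "system", "accelerometers"},
                {"augmented", "reality", "environments"}],
    "paper_b": [{"tracking", "radar"}],
    "paper_c": [{"augmented", "reality"}, {"virtual", "reality"}],
}

ranking = vote_rank(["tracking", "augmented"], index)
print(ranking)
```

A document cited often, by authors who repeatedly use the query terms, collects the most votes, so one tally reflects both relevance and citation frequency.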

Page 13: Redeeming Relevance for Subject Search in Citation Indexes

Related Work

McBryan – World Wide Web Worm

Brin & Page – Google

Chakrabarti et al. – CLEVER

Mendelzon et al. – TOPIC

Bharat et al. – Hilltop

Craswell et al. – Effective Site Finding

Page 14: Redeeming Relevance for Subject Search in Citation Indexes

Contributions

Application to scientific literature

“Anchor text” for unrestricted subject search

“Anchor text” for combining measures of relevance and significance

Page 15: Redeeming Relevance for Subject Search in Citation Indexes

Rosetta

Experimental system in which we implemented RDI

Term weighting metric:

$$ w_{id} = \frac{n_{id}}{1 + \log N_i} $$

Ranking metric:

$$ s_d = n_d + \sum_{i=1}^{q} w_{id} $$
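The sketch below shows how the two metrics on this slide might be computed. It rests on assumptions the slide does not confirm: that the term weight has the form w_id = n_id / (1 + log N_i), that the ranking score has the form s_d = n_d + sum of w_id over the q query terms, that n_id counts the referrers to document d that use term i, that N_i is a collection-wide count for term i, that n_d is the number of query terms matched by d, and that the logarithm is natural. The function and parameter names are hypothetical.

```python
import math

def term_weight(n_id, N_i):
    """Assumed form of the term weight: w_id = n_id / (1 + log N_i)."""
    return n_id / (1.0 + math.log(N_i))

def score(query_terms, doc_counts, collection_counts):
    """Assumed form of the ranking score: s_d = n_d + sum of w_id over matched query terms.

    doc_counts: term -> number of referrers to this document that use the term
    collection_counts: term -> collection-wide count for the term (assumed meaning of N_i)
    """
    matched = [t for t in query_terms if doc_counts.get(t, 0) > 0]
    n_d = len(matched)  # assumed meaning of n_d: query terms matched by this document
    return n_d + sum(term_weight(doc_counts[t], collection_counts[t]) for t in matched)

# Hypothetical counts for a single document.
doc_counts = {"tracking": 3, "augmented": 2}
collection_counts = {"tracking": 10, "augmented": 100, "radar": 5}

s = score(["tracking", "augmented", "radar"], doc_counts, collection_counts)
print(round(s, 3))
```

Under these assumptions a term that few documents are indexed by contributes more per referrer than a common one, and every matched query term adds a full point through n_d.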

Page 16: Redeeming Relevance for Subject Search in Citation Indexes
Page 17: Redeeming Relevance for Subject Search in Citation Indexes

Experiments

10,000 research papers

Gathered from CiteSeer

Each document cited at least once

Evaluated:

Retrieval precision

Impact of search results

Page 18: Redeeming Relevance for Subject Search in Citation Indexes

Comparison system

We compared Rosetta to a traditional content-based retrieval system

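Comparison system uses TFIDF for term weighting:

$$ w_{ik} = tf_{ik} \cdot (\log_2 N - \log_2 df_k) $$

And the Cosine ranking metric:

$$ \mathrm{COSINE}(Doc_i, Query_j) = \frac{\sum_{k=1}^{t} (TERM_{ik} \cdot QTERM_{jk})}{\sqrt{\sum_{k=1}^{t} (TERM_{ik})^2 \cdot \sum_{k=1}^{t} (QTERM_{jk})^2}} $$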
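For readers who want to see the baseline concretely, here is a small sketch of TFIDF weighting with base-2 logarithms and of cosine ranking over sparse term vectors. It follows the textbook form of both metrics; the function names and the dictionary representation are assumptions, not details of the comparison system used in the study.

```python
import math

def tfidf_weight(tf, df, n_docs):
    """Weight of a term in a document: tf * (log2 N - log2 df)."""
    return tf * (math.log2(n_docs) - math.log2(df))

def cosine(doc_vec, query_vec):
    """Cosine similarity between two sparse vectors given as term -> weight dicts."""
    dot = sum(w * query_vec.get(term, 0.0) for term, w in doc_vec.items())
    norm = math.sqrt(
        sum(w * w for w in doc_vec.values()) * sum(w * w for w in query_vec.values())
    )
    return dot / norm if norm else 0.0

# A term occurring twice in a document and in 4 of 16 documents overall.
print(tfidf_weight(tf=2, df=4, n_docs=16))
# Vectors pointing in the same direction have cosine 1 regardless of length.
print(cosine({"tracking": 1.0}, {"tracking": 2.0}))
```

Both quantities depend only on the words inside the documents, which is why this baseline carries no signal about how often a paper is cited.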

Page 19: Redeeming Relevance for Subject Search in Citation Indexes

Indexing

Indexed collection in both Rosetta and the TFIDF/Cosine system

Rosetta indexed documents based on references to them

The TFIDF/Cosine system indexed documents based on words used within them

Required that each document was cited at least once to ensure that both systems indexed the same set of documents

Page 20: Redeeming Relevance for Subject Search in Citation Indexes

As referential text, Rosetta used CiteSeer’s “contexts of citation”

Page 21: Redeeming Relevance for Subject Search in Citation Indexes

As referential text, Rosetta used CiteSeer’s “contexts of citation”

Page 22: Redeeming Relevance for Subject Search in Citation Indexes

Queries

32 queries in our test set

Queries were key terms extracted from “Keywords” sections of documents

Queries extracted from sample of 24 documents

Document from which key term was extracted established the topic of interest

Page 23: Redeeming Relevance for Subject Search in Citation Indexes

Queries

Page 24: Redeeming Relevance for Subject Search in Citation Indexes

Relevance assessments

The topic of interest for a query was the idea identified by the corresponding key term

Relevant documents directly addressed this same topic

Example:

Query: “force feedback”

Relevant: Work on providing a sense of touch in VR applications or other computer simulations

Page 25: Redeeming Relevance for Subject Search in Citation Indexes

Retrieval interface

Meta-interface

Queried both systems

Used top 10 search results from each system

Integrated all 20 search results

Presented them in random order

No way to determine the source of a retrieved document

Page 26: Redeeming Relevance for Subject Search in Citation Indexes

Experimental summary

32 queries drawn from document key terms

Document identified the topic of interest

Relevant documents addressed the same topic

Used a meta-search interface

Evaluated top 10 from both systems

Origin of search results hidden

Page 27: Redeeming Relevance for Subject Search in Citation Indexes

Precision at top 10

On average RDI provided a 16.6% improvement over TFIDF/Cosine

1 or 2 more relevant documents in the top 10

Result is significant

t-test of the mean paired difference

Test statistic = 3.227

Significant at a confidence level of 99.5%
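A paired t-test of this kind compares the two systems query by query. The sketch below computes precision at 10 and the paired t statistic from scratch; the relevant-document counts are made-up illustration data, not the study's results, and the function names are hypothetical.

```python
import math

def precision_at_k(relevant_flags, k=10):
    """Fraction of the top-k results judged relevant (flags are 1 or 0, in rank order)."""
    return sum(relevant_flags[:k]) / k

def paired_t(xs, ys):
    """t statistic for the mean of the paired differences xs[i] - ys[i]."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(variance / n)

# Hypothetical relevant-document counts in the top 10 for four queries.
system_one = [5, 7, 9, 6]
system_two = [4, 5, 6, 6]

t = paired_t(system_one, system_two)
print(round(t, 3))
print(precision_at_k([1, 0, 1, 1, 0, 0, 0, 1, 0, 0]))
```

With n paired queries the statistic is compared against a t distribution with n - 1 degrees of freedom, which would be 31 if all 32 queries entered the test.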

Page 28: Redeeming Relevance for Subject Search in Citation Indexes

Precision at top 10 (cont’d)

[Figure: precision at top 10 (y-axis, 0 to 1) for each query (x-axis), comparing RDI with TFIDF/Cosine]

Page 29: Redeeming Relevance for Subject Search in Citation Indexes

Many retrieval errors avoided

Example: software architecture diagrams

Most papers about software architecture frequently use the term “diagrams”

Few are about tools for diagramming

TFIDF/Cosine system: 0/10 relevant

Rosetta: 4/10 relevant (3 in top 5)

Rosetta made the correct distinction more often

Page 30: Redeeming Relevance for Subject Search in Citation Indexes

Rosetta Shortcomings

Retrieval metric sorts search results by number of query terms matched

Some authors reuse portions of text in which other documents are cited

Page 31: Redeeming Relevance for Subject Search in Citation Indexes

Impact of search results

A look at the number of citations to documents retrieved for each query

Compared RDI to a baseline provided by the TFIDF/Cosine system

TFIDF/Cosine includes no measure of impact

Seeking only a measure of the relative impact of documents retrieved by RDI on a given topic

Page 32: Redeeming Relevance for Subject Search in Citation Indexes

Experiment

For each query…

Calculated the average citations/year for each document

Average publication year: Rosetta – 1994, TFIDF/Cosine – 1995

Found the median number of citations/year for each set of search results

Found the difference between the median for Rosetta and the median for TFIDF/Cosine
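The per-query calculation can be sketched as below. How the study counted years is not stated on the slide, so the sketch assumes citations are divided by the number of years since publication, with a minimum of one year; the function names, the reference year, and the citation counts are hypothetical.

```python
from statistics import median

def citations_per_year(total_citations, pub_year, current_year):
    """Average citations per year since publication (assumed definition)."""
    years = max(current_year - pub_year, 1)  # avoid dividing by zero for new papers
    return total_citations / years

def median_impact(results, current_year):
    """Median citations/year over one set of search results.

    results: list of (total_citations, publication_year) pairs
    """
    return median(citations_per_year(c, y, current_year) for c, y in results)

# Hypothetical search results for one query: (citations, publication year).
rosetta_results = [(27, 1994), (16, 1995), (90, 1993)]
baseline_results = [(9, 1994), (4, 1995), (20, 1993)]

difference = median_impact(rosetta_results, 2003) - median_impact(baseline_results, 2003)
print(difference)
```

Using the median rather than the mean keeps one very highly cited paper from dominating the figure for a query.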

Page 33: Redeeming Relevance for Subject Search in Citation Indexes

Difference in impact

On average the median citations/year…

8.9 for Rosetta

1.5 for the baseline

Page 34: Redeeming Relevance for Subject Search in Citation Indexes

Difference in impact (cont’d)

[Figure: median citations per year (y-axis, 0 to 25) for the search results of each query (x-axis), comparing RDI with TFIDF/Cosine]

Page 35: Redeeming Relevance for Subject Search in Citation Indexes

Summary of Experiments

Small study – results are tentative

Surpassed retrieval precision of a widely used relevance-based approach

Consistently retrieved documents that have had a significant impact

Page 36: Redeeming Relevance for Subject Search in Citation Indexes

Future Work

Retrieval metric that eliminates Boolean component

Large scale implementation with CiteSeer data

Studies with more sophisticated relevance-based retrieval systems

Comparison with popularity-based retrieval techniques

Page 37: Redeeming Relevance for Subject Search in Citation Indexes

Contact

Shannon Bradshaw

The University of Iowa

[email protected]

www.biz.uiowa.edu/sbradshaw