Cornell CS 502
Scholarly Communication II: Citation Analysis and Reference Linking
CS 502 – 20020430
Carl Lagoze – Cornell University
Recalling the themes
• Basic assumptions are broken
  – Expensive distribution
  – Distinctions between publishers, authors, readers
• Basic assumptions remain
  – Need for quality
  – Need for people to make money
  – Reward system: tenure and promotion
• Changing context
  – Readers' habits
  – Increase of scholarly output
  – Computing and network power
Acks: P. Ginsparg
Unbundling content from services
Signs of Change - Readers
… there’s a sense in which the journal articles prior to the inception of the electronic abstracting and indexing database may as well not exist, because they are so difficult to find. Now that we are starting to see … full-text showing up online, I think we are very shortly going to cross a sort of critical mass boundary where those publications that are not instantly available in full-text will become kind of second-rate in a sense, not because their quality is low, but just because people will prefer the accessibility of things they can get right away.
Clifford Lynch 1997
Signs of Change - Publishers
• Electronic versions of existing journals
• Licensing arrangements to libraries
  – http://campusgw.library.cornell.edu/cgi-bin/dj.cgi?section=ejournal&URL=SerialsSearch
• Problems
  – License bundling
    • Inflate costs and maintain economic model
    • Force libraries to subscribe regardless of interest
  – Longevity dependent on license continuity
Signs of Change - Publishers
• Electronic Journals
  – D-Lib Magazine – http://www.dlib.org
  – Journal of Digital Information (JODI) – http://journals.ecs.soton.ac.uk/jodi/
  – Journal of Electronic Publishing (JEP) – http://www.press.umich.edu/jep/
• The economic models are not established
Signs of Change – Publishers and Libraries
• JSTOR – http://www.jstor.org
• Recognition of reality
  – Archival journal storage is expensive for libraries
    • Shelf space crisis forces libraries to choose among
      – Keeping archival issues of serials
      – Continuing subscriptions for new issues
      – Building expensive new buildings
  – Archival copies have limited economic value to publishers
• Cooperative non-profit model among publishers/foundation (Mellon)/libraries
• Sliding window to digitize old issues of serials and provide ready access services
Signs of Change – Libraries & Professional Societies
• HighWire Press – http://highwire.stanford.edu
• Realities
  – Many professional societies and journals are “Mom & Pop” operations
  – Technical and economic cost of electronic publishing is often prohibitively high
• Solution
  – HighWire acts as a brokering service to provide electronic publishing technology for small professional societies and journals
  – Pooling technology allows creation of higher-level services (e.g., reference linking among journals)
Signs of Change - Scholars
• Eprint repositories
  – Author self-archiving gives scholars control over their intellectual output
  – Harnad’s “subversive proposal”
  – Direct descendant of traditional pre-print sharing in print form among scholars
• Examples
  – arXiv – http://arxiv.org
  – ePrints – http://www.eprints.org
  – California Digital Library scholarly publishing archive – http://repositories.cdlib.org/
• Related Issues
  – Publisher agreements – some journals refuse to publish anything that has been posted as an eprint
Signs of Change – Computer Scientists
• Automatic creation of traditional journal services
  – Citation analysis
  – Reviewing
Concepts – References and Citations
[Figure: diagram of Doc1, Doc2, and Doc3]
Doc1 references: (Doc2, Doc3)
Doc1 citations: (Doc2)
Concepts – References and Citations
• # of references of a document is finite, stable, and easy to determine/compute
• # of citations of a document is dynamic, impossible to compute (infinite)
• Generally, references are at the work or manifestation level, NOT at the item level
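The asymmetry between references and citations can be sketched in a few lines of Python. This is an illustration rather than anything from the original slides: the `references` dictionary encodes one reading of the three-document example (Doc1 references Doc2 and Doc3, and Doc2 references Doc1, which is how Doc1 acquires Doc2 as a citation), and `citations_of` is a hypothetical helper name. References are read directly from each document, while citations can only be derived by inverting the reference lists of whatever corpus happens to be known.

```python
from collections import defaultdict

# Outgoing references: fixed once a document is published.
references = {
    "Doc1": ["Doc2", "Doc3"],
    "Doc2": ["Doc1"],
    "Doc3": [],
}


def citations_of(references):
    """Invert the reference lists: map each document to the documents citing it."""
    cited_by = defaultdict(list)
    for source, targets in references.items():
        for target in targets:
            cited_by[target].append(source)
    return dict(cited_by)


cited_by = citations_of(references)
# Citations of Doc1 within this corpus only; a larger corpus may add more.
print(cited_by.get("Doc1", []))
```

Adding a new document to `references` never changes any existing reference list, but it can change the citation list of every document it references.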
Citation Analysis
• Understanding citation patterns among scholarly journals
  – Quality metric
  – Cost/benefit analysis – what “basic” journals should a library have in its holdings
• Eugene Garfield – “Father of citation analysis”
• Science Citation Index
  – Origins circa 1950s
  – Hand analysis of printed journals showing patterns of citations into and out from journals
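The journal-level tally described above, citations into and out from each journal, might look like the following when automated. This is a minimal sketch with made-up data: the journal names and the `links` list are hypothetical, and nothing here reproduces how the Science Citation Index was actually compiled.

```python
from collections import Counter

# Hypothetical (citing journal, cited journal) pairs, one per reference.
links = [
    ("Journal A", "Journal B"),
    ("Journal A", "Journal C"),
    ("Journal B", "Journal C"),
    ("Journal C", "Journal C"),
]

citations_out = Counter(source for source, _ in links)
citations_in = Counter(target for _, target in links)

for journal in sorted(set(citations_out) | set(citations_in)):
    print(journal, "out:", citations_out[journal], "in:", citations_in[journal])
```

A library weighing which "basic" journals to hold could rank journals by the `citations_in` counts, under the assumption that being cited often is a usable proxy for importance.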
Results of citation analysis
acks. Garfield, Science, 1972
Citation analysis in the digital age
• Automatic citation linking among papers in arXiv
  – Citebase (Open Citations Project)
  – http://citebase.eprints.org/cgi-bin/search?submit=1&author=Hawking%2C%20S%20W%20
• Scientometrics – Automation of methods reveals lots of data
  – Longevity of interest in paper
  – Journal and ePrint citation patterns
• Automatic citation analysis as a reviewing tool?
Are papers downloaded then cited or cited then downloaded?(2)
• If all these time differences are plotted, the graph below is produced.
[Figure: “What came first the Citation or the Download”. x-axis: Age of Paper at Download minus Age of Paper at Citation (-300 to 2700); y-axis: Frequency (0 to 7000)]
Acks: S. Harnad
Citation Latencies
• The raw data show that the latency of the citation peak has been reducing over the period of the archive
[Figure: “Frequency of Citation Latencies: 1992-1999”. x-axis: Time Difference/Months (0 to 96); y-axis: Citations (0 to 5000); one series per year, 92 through 99]
Acks: S. Harnad
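A latency histogram of this kind can be computed from pairs of dates. The sketch below is only an illustration of the idea: the date pairs are invented, `months_between` is a hypothetical helper, and the original study may have defined latency differently.

```python
from collections import Counter
from datetime import date


def months_between(earlier, later):
    """Whole calendar months from `earlier` to `later` (days are ignored)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


# Hypothetical (deposit date of cited paper, date of citing paper) pairs.
pairs = [
    (date(1995, 1, 10), date(1995, 7, 2)),
    (date(1995, 1, 10), date(1996, 1, 20)),
    (date(1998, 3, 5), date(1998, 9, 1)),
]

# Map each latency in months to the number of citations with that latency.
latencies = Counter(months_between(cited, citing) for cited, citing in pairs)
for months in sorted(latencies):
    print(months, latencies[months])
```

Grouping the pairs by the deposit year of the cited paper before counting would give one curve per year, as in the figure.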
Author Impact Quartiles
• High impact authors update more than medium or low
• High and medium impact authors deposit more papers than low

Quartile   Total    % Total   Citations   Papers   Citations/Author/Paper   Deposits   Mean Updates/Author
High 25%   798      2.09%     240,092     2,732    0.11                     6,720      0.48
Med 50%    9,262    24.20%    733,272     37,318   0.00212                  93,671     0.37
Low 25%    28,211   73.71%    251,925     67,951   0.000131                 165,971    0.27
Acks: S. Harnad
Citation Quality
• Papers generally cite papers of like impact
[Figure: “Do Papers Cite Papers of Like Impact”. Bar chart of No of Citations (0 to 140000) by Source Impact (Low, Medium, High) and Dest. Impact (Low, Medium, High)]
Acks: S. Harnad
Citation Spread
• A small number of papers receive a very large number of citations
[Figure: “Histogram of Citations per Paper (author impact)”, annotated “30,000 papers were by authors with no citation”. x-axis bins: No citations, 1 Citation, 2/3 Citations, 4/5/6 Citations, 7/8/9/10 Citations, 11 or more Citations; y-axis: Papers (0 to 40000); series: High (2.53%), Medium (34.55%), Low (62.92%)]
Acks: S. Harnad
How Paper Impact Affects Usage
• Higher impact papers have a longer download life expectancy.
[Figure: “All Papers”. x-axis: Age of paper (days), 0 to 2398; y-axis: Frequency Density (0 to 0.0025); series: High (2.0%), Medium (7.7%), Low (46.5%), Unknown (39.6%)]
Acks: S. Harnad
What is the correlation between citations and downloads?
• There is a significant positive correlation between citations and downloads for high impact papers.
Download type                  r          n
All Papers                     0.11155    63671
High Impact Papers (2.0%)      0.27293    1981
Medium Impact Papers (7.7%)    0.01288    5937
Low Impact Papers (46.5%)      -0.01412   30163
Acks: S. Harnad
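The slide reports a correlation coefficient r without naming the method. Assuming it is the usual Pearson product-moment coefficient, it could be computed per paper as sketched below. The `citations` and `downloads` lists are invented sample data, not values from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical per-paper counts; each position is one paper.
citations = [0, 1, 3, 10, 25]
downloads = [5, 8, 6, 40, 90]
r = pearson_r(citations, downloads)
print(round(r, 5))
```

Computing `r` separately for each impact group, as the table does, only requires filtering the two lists to the papers in that group first.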
full text
reference linking
Who is Who
Books in Print
Amazon.com
extended services
Static Linking
• Fixed URLs in references
• All the associated problems with URLs
• Persistent link through document “footprint”
  – Robust URLs – Berkeley
General Idea: Enhance URLs with “signatures”
• Add to a URL a “signature”, a small piece of document content.
• When “traditional” (i.e., address-based) dereferencing fails, do signature-based (i.e., content-based) dereferencing:
  – Pass the signature to some search service, and hope that the target will be prominent among a very small result set.
• Two issues:
  – Computing small, yet effective and robust signatures
  – Adding them innocuously to hyperlinks
Acks: Phelps/Wilensky
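The fallback described above, address first and content second, can be sketched as follows. Several pieces are assumptions made for illustration: the query parameter name `lexical-signature`, the caller-supplied `fetch` and `search` functions, and the `old.example` / `new.example` hosts. It is a sketch of the idea, not the Robust URLs implementation.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SIGNATURE_PARAM = "lexical-signature"  # assumed parameter name, for illustration only


def add_signature(url, words):
    """Append the signature words to the URL as one extra query parameter."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((SIGNATURE_PARAM, " ".join(words)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def split_signature(url):
    """Return (URL without the signature, list of signature words)."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    words = [value for key, value in query if key == SIGNATURE_PARAM]
    rest = [(key, value) for key, value in query if key != SIGNATURE_PARAM]
    plain = urlunsplit(parts._replace(query=urlencode(rest)))
    return plain, (words[0].split() if words else [])


def dereference(url, fetch, search):
    """Try the address first; fall back to a content-based search on failure.

    fetch(url) returns page content or None; search(words) returns candidate
    URLs, best first. Both are supplied by the caller (hypothetical interfaces).
    """
    plain, words = split_signature(url)
    content = fetch(plain)
    if content is not None:
        return plain, content
    for candidate in search(words):
        content = fetch(candidate)
        if content is not None:
            return candidate, content
    return None, None


# Toy demonstration with an in-memory "web" in which the page has moved.
pages = {"http://new.example/katz.html": "home page"}
link = add_signature(
    "http://old.example/katz.html",
    ["Californa", "ISRG", "Culler", "rimmed", "gaunt"],
)
found = dereference(link, pages.get, lambda words: list(pages) if words else [])
print(link)
print(found)
```

Carrying the signature in the query string keeps the link usable by software that knows nothing about signatures, which is one way to read "adding them innocuously to hyperlinks".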
Computing Small, Robust Signatures
• “Lexical” signatures: The top n words of a document chosen for rarity, subject to heuristic filters to aid robustness.
  – “a TF-IDF-like” measure
• Easy to compute and use.
• Question: How big a signature is needed to locate a document more or less uniquely on the Web?
  – Inktomi says there are approximately 1 billion web pages now.
Acks: Phelps/Wilensky
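One way to pick "the top n words chosen for rarity" is a plain TF-IDF ranking, sketched below. This is an assumption-laden stand-in: the slides only say the measure is "TF-IDF-like", the heuristic robustness filters are not implemented, and the function names, the smoothing, and the toy corpus are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(doc, corpus, n=5):
    """Pick the n words of `doc` that score highest on a TF-IDF-like measure."""
    # Document frequency: in how many corpus documents each word appears.
    doc_freq = Counter()
    for other in corpus:
        doc_freq.update(set(tokenize(other)))
    term_freq = Counter(tokenize(doc))
    total_docs = len(corpus)

    def score(word):
        # +1 smoothing so words unseen in the corpus do not divide by zero.
        idf = math.log((1 + total_docs) / (1 + doc_freq[word]))
        return term_freq[word] * idf

    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(term_freq, key=lambda word: (-score(word), word))
    return ranked[:n]


corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the gaunt culler rimmed the mat",
]
signature = lexical_signature(corpus[2], corpus, n=3)
print(signature)
```

Words that appear in every document, such as "the", score zero and drop out, while words unique to the target rise to the top.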
Answer: 5 words!
• I.e., a signature of 5 words will, in most cases, cause search engines to return the target document within the top few hits.
• Actually, a smaller signature will probably do just to locate exact matches, but length helps provide robustness and room for growth.
• Martin and Holte (1998) and Bharat and Broder (1998) demonstrate summary queries and strong queries, resp., which use rarity of words (and, possibly, phrases) to locate specific documents.
– Our variation focuses on robustness + rarity.
Acks: Phelps/Wilensky
Some Examples
• Signature for Randy Katz’s home page was
  – “Californa ISRG Culler rimmed gaunt”
• Here is what happens when we feed this signature to HotBot:
Gives same result after correcting typo.
Acks: Phelps/Wilensky
Another Example
• Signature for Endeavour home page is
  – “amplifies Endeavour leverages Charting Expedition”
• Here is what happens when we feed this signature to Google:
Acks: Phelps/Wilensky
Why Does This Work?
• If terms were distributed independently, the probability of 5 even moderately common terms occurring in more than one document is very small.
  – In fact, picking 3 terms restricted to those occurring in 100,000 documents works pretty well.
  – Many documents contain very infrequently used words.
  – There is lots of room for independence to be off, and to play with term selection for robustness, etc.
Acks: Phelps/Wilensky
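The independence argument can be checked with back-of-the-envelope arithmetic. The sketch below uses the two figures quoted in the slides, about 1 billion web pages and terms that each occur in 100,000 documents; the function name and the simple model (every term equally common, fully independent) are assumptions for illustration.

```python
def expected_matches(total_pages, docs_per_term, num_terms):
    """Expected number of pages containing all terms, assuming independence.

    Each term appears in a random page with probability docs_per_term / total_pages,
    so all num_terms terms co-occur with that probability raised to num_terms.
    """
    probability_per_term = docs_per_term / total_pages
    return total_pages * probability_per_term ** num_terms


for num_terms in (3, 5):
    print(num_terms, expected_matches(10**9, 10**5, num_terms))
```

Under this model, 3 such terms are expected to co-occur by chance in roughly one thousandth of a page across the whole Web, and 5 terms in far fewer, so nearly every hit is the intended document. Real term distributions are not independent, which is the slack the last bullet refers to.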
Persistent Linking
• CrossRef
  – Uses Digital Object Identifiers (DOIs)
    • A type of URN (Handle)
  – Cooperative agreement among publishers
• Publishers control the resolution mechanism
  – Can go to full-text, other services, or charging mechanism
• Example
  – http://www.crossref.org/demos/springer/springer.htm
Dynamic & Context Sensitive Linking
• Problem – Link behavior should not be the same for all
• Solution – Link contains metadata rather than an identifier
• OpenURL – standard for incorporating metadata into a URL
  – http://sfx.aaa.edu/menu?genre=article&issn=1234-5678&volume=12&issue=3&spage=1&epage=8&date=1998&aulast=Smith&aufirst=Paul
• SFX
  – System for locally resolving an OpenURL to extended and localized services
  – http://www.sfxit.com/
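The OpenURL example above is an ordinary URL whose query string carries bibliographic metadata, so it can be built and read back with standard URL tools. In the sketch below, `build_openurl`, `parse_openurl`, and `local_services` are hypothetical names, and the resolver policy (offer full text only when the ISSN is licensed locally) is an invented stand-in for what a resolver such as SFX might decide.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_openurl(resolver_base, **metadata):
    """Encode bibliographic metadata as query parameters on a resolver URL."""
    return resolver_base + "?" + urlencode(metadata)


def parse_openurl(url):
    """Recover the metadata fields (first value of each parameter)."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


def local_services(metadata, licensed_issns):
    """Hypothetical local policy: offer full text only for licensed journals."""
    services = ["catalog lookup"]
    if metadata.get("issn") in licensed_issns:
        services.insert(0, "full text")
    return services


url = build_openurl(
    "http://sfx.aaa.edu/menu",
    genre="article", issn="1234-5678", volume="12", issue="3",
    spage="1", epage="8", date="1998", aulast="Smith", aufirst="Paul",
)
metadata = parse_openurl(url)
print(url)
print(local_services(metadata, {"1234-5678"}))  # institution with a license
print(local_services(metadata, set()))          # institution without one
```

The same link yields different services at different institutions because the decision is made by the local resolver from the metadata, not baked into the link.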
Researchindex – automatic interlinking on the web
• http://researchindex.org
• Selective web crawling to gather CS resources
• Heuristics and AI techniques to establish services
  – Searching
  – Reference linking
• Why do we need metadata for textual documents?
Automatic Reviewing Techniques
• Traditional Collaborative Filtering
  – Estimate what score a reviewer might give to an item that he/she has not scored yet
  – Frequently used by recommender systems
• Collaborative quality filtering
  – http://www.cs.berkeley.edu/~tracyr/project/
  – Attempts to automatically determine which reviewers are "good" in an open reviewing system, in order to provide the same (or better) benefits as peer review
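For the traditional collaborative filtering bullet above, one common formulation predicts a missing score as a similarity-weighted average of other reviewers' scores. The sketch below uses cosine similarity over co-scored items; this is one textbook variant chosen for brevity, not a method named in the slides, and the reviewer names and scores are invented.

```python
import math


def cosine(a, b):
    """Cosine similarity of two score dictionaries over the items both scored."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = math.sqrt(sum(a[item] ** 2 for item in shared))
    norm_b = math.sqrt(sum(b[item] ** 2 for item in shared))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(scores, reviewer, item):
    """Estimate the score `reviewer` would give `item`, or None if unknown."""
    weighted_sum = 0.0
    total_weight = 0.0
    for other, their_scores in scores.items():
        if other == reviewer or item not in their_scores:
            continue
        weight = cosine(scores[reviewer], their_scores)
        weighted_sum += weight * their_scores[item]
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None


# Hypothetical review scores on a 1 to 5 scale.
scores = {
    "ann": {"p1": 5, "p2": 1},
    "bob": {"p1": 5, "p2": 1, "p3": 4},
    "cat": {"p1": 1, "p2": 5, "p3": 2},
}
prediction = predict(scores, "ann", "p3")
print(prediction)
```

Because ann's past scores match bob's exactly and differ from cat's, the estimate for "p3" lands closer to bob's score than to cat's.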
Collaborative Quality Filtering Algorithm
• Assume true value of an item is the asymptotic average of review scores
• Good reviewers are those who consistently predict this average
• Normalize according to # of reviews of an item, # of reviews by reviewer, review latency
• Adjust by “expertise”
  – Use similarity of term vectors of items reviewed
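The first two steps of the algorithm above can be sketched as follows. This is a simplified illustration under stated assumptions: the current mean of an item's scores stands in for the asymptotic average, reviewer quality is taken as 1 / (1 + mean absolute error) purely for convenience, and the normalization and expertise adjustments are left out. The function name and review data are hypothetical.

```python
from collections import defaultdict


def reviewer_quality(reviews):
    """Score reviewers by how closely they track each item's average score.

    reviews: list of (reviewer, item, score) tuples.
    Returns a dict mapping reviewer to a quality value in (0, 1].
    """
    # Step 1: approximate each item's "true value" by its current mean score.
    scores_by_item = defaultdict(list)
    for _, item, score in reviews:
        scores_by_item[item].append(score)
    item_mean = {item: sum(s) / len(s) for item, s in scores_by_item.items()}

    # Step 2: a reviewer is good if their scores sit close to those means.
    errors = defaultdict(list)
    for reviewer, item, score in reviews:
        errors[reviewer].append(abs(score - item_mean[item]))
    return {
        reviewer: 1.0 / (1.0 + sum(errs) / len(errs))
        for reviewer, errs in errors.items()
    }


# Hypothetical reviews on a 1 to 5 scale.
reviews = [
    ("ann", "p1", 4), ("bob", "p1", 4), ("cat", "p1", 1),
    ("ann", "p2", 2), ("bob", "p2", 2), ("cat", "p2", 5),
]
quality = reviewer_quality(reviews)
print(quality)
```

Note that each reviewer's own score is included in the item mean here, which flatters reviewers of rarely reviewed items; that is one reason the slide calls for normalizing by the number of reviews.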