Leabharlann UCD
An Coláiste Ollscoile, Baile Átha Cliath, Belfield, Baile Átha Cliath 4, Eire
UCD Library
University College Dublin, Belfield, Dublin 4, Ireland
Joseph Greene
Research Repository Librarian
University College Dublin
joseph.greene@ucd.ie
http://researchrepository.ucd.ie
How accurate are IR usage statistics?
Open Repositories 2016, Dublin, 16 June
Usage statistics are important for OA repositories
• How is the service used overall?
• Advocacy
– Connects with authors on what is most important to them: the use of their research
• KPI for return on investment
– Usage of a Library service
– Visibility of university’s research
Monthly email sent to all depositors
Infographic distributed semi-annually by College Liaison Librarians
How accurate are they? Web robots
• Some follow rules
– Search engines, Internet Archive, link checkers, Twitterbot, etc.
– robots.txt, naming themselves in the user agent string
• Others do not
– Email spammers, comment spammers, dictionary attackers, phishers, etc.
– Often mimic human users
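As a rough illustration of what "following the rules" means, a well-behaved robot checks robots.txt before requesting a URL. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `ExampleBot/1.0` user agent and the `repository.example.org` URLs are hypothetical and are not taken from the UCD repository.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for a repository (not UCD's actual file).
ROBOTS_TXT = """\
User-agent: *
Disallow: /bitstream/
""".splitlines()


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow this agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT)
    return parser.can_fetch(user_agent, url)


# A rule-following robot would skip the first URL and fetch the second.
pdf_allowed = may_fetch(
    "ExampleBot/1.0", "https://repository.example.org/bitstream/1/paper.pdf"
)
page_allowed = may_fetch("ExampleBot/1.0", "https://repository.example.org/handle/1")
print(pdf_allowed, page_allowed)
```

Robots in the second group simply never make this check, which is why robots.txt alone cannot keep them out of the download counts.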
Experimental study
• Simple random sample of 2 years of UCD repository’s download data
– n=341, N=3.3 million; 96.20% certainty
• Manually checked to determine if robot or human
• Compared findings against our robot detection technique
– U. Minho DSpace Stats Add-on
– Monthly outlier exclusion (manual)
Greene, J. Web robot detection in scholarly Open Access institutional repositories. Library Hi Tech, July 2016
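A minimal sketch of how a simple random sample of this size could be drawn and how a margin of error for a sampled proportion is commonly estimated. The function names, the fixed seed and the normal-approximation formula with z = 1.96 are assumptions for illustration; the slides do not state how the "96.20% certainty" figure was derived.

```python
import math
import random


def draw_sample(population_size: int, sample_size: int, seed: int = 0) -> list[int]:
    """Pick sample_size distinct row indices uniformly at random (hypothetical helper)."""
    rng = random.Random(seed)
    return rng.sample(range(population_size), sample_size)


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation margin of error for a proportion p observed in n items.

    z = 1.96 corresponds to a 95% confidence level (an assumption, not from the slides).
    """
    return z * math.sqrt(p * (1 - p) / n)


# Indices of the download records that would be checked by hand.
rows = draw_sample(population_size=3_300_000, sample_size=341)
print(len(rows), round(margin_of_error(0.85, 341), 4))
```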
First finding
85% of the Research Repository UCD’s unfiltered downloads come from robots
• This is confirmed by a 2013 IRUS-UK white paper on 20 IRs, which also found 85% of downloads to be from robots
Catching more robots improves stats (but how much depends on the number of robots)
[Figure: accuracy of download stats (inverse precision) plotted against recall (robots), both axes from 0 to 1; arrows labelled "Catch more robots" and "Get better stats". One curve per type of site:]
• Typical website, 15% robot traffic
• OA journal, 40% robot
• Internet Archive, 91% robot
• OA repositories, 85% robot
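The relationship on this slide can be sketched as a small function: the share of reported downloads that are human depends on both the recall of robot detection and the share of robots in the raw traffic. This is an illustrative model, not the author's calculation; the `human_kept` parameter (the fraction of human downloads that survive filtering) is an assumption and defaults to 1.0, i.e. no humans mislabelled.

```python
def inverse_precision(robot_share: float, recall: float, human_kept: float = 1.0) -> float:
    """Share of reported (unfiltered-out) downloads that are human.

    robot_share: fraction of raw downloads made by robots
    recall:      fraction of robot downloads that detection removes
    human_kept:  fraction of human downloads left in the stats (assumed 1.0)
    """
    humans = (1 - robot_share) * human_kept
    missed_robots = robot_share * (1 - recall)
    return humans / (humans + missed_robots)


# Robot shares from the slide; compare no detection with 90% recall.
for label, share in [
    ("Typical website", 0.15),
    ("OA journal", 0.40),
    ("OA repositories", 0.85),
    ("Internet Archive", 0.91),
]:
    print(label, round(inverse_precision(share, 0.0), 3), round(inverse_precision(share, 0.9), 3))
```

With little robot traffic the statistics are already mostly human, so extra recall gains little; with 85% robot traffic each missed robot weighs much more heavily.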
How did we do at UCD?
• What proportion of robot downloads did we catch? (Recall)
– Our method catches 94% of all robots
• How often were we correct: how many of the downloads we label robots really are robots, not humans? (Precision)
– 98.9% of downloads that we label robots really are robots
• How accurate are the download stats: how many are actually made by human beings? (Inverse precision)
– 73% of the download statistics as reported are human
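The three measures above can be written out from the four possible outcomes of labelling a download. This is a minimal sketch; the `Counts` class and the example numbers are hypothetical and are not the study's data.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    robot_flagged: int  # robots correctly labelled as robots
    human_flagged: int  # humans wrongly labelled as robots
    robot_missed: int   # robots left in the download stats
    human_kept: int     # humans left in the download stats


def recall(c: Counts) -> float:
    """Proportion of all robot downloads that were caught."""
    return c.robot_flagged / (c.robot_flagged + c.robot_missed)


def precision(c: Counts) -> float:
    """Proportion of downloads labelled robot that really are robots."""
    return c.robot_flagged / (c.robot_flagged + c.human_flagged)


def inverse_precision(c: Counts) -> float:
    """Proportion of reported downloads that really are human."""
    return c.human_kept / (c.human_kept + c.robot_missed)


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
example = Counts(robot_flagged=90, human_flagged=2, robot_missed=10, human_kept=40)
print(recall(example), precision(example), inverse_precision(example))
```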
How does that compare?
• Who knows? There are no other studies like this on repositories!
• Applied DSpace's and EPrints' web robot detection algorithms to our data
– Experimental
– Real data
– Same dataset used for each ‘system’
– Algorithms easy to mimic in vitro
– But SEO, crawl behaviour may be different for different systems
Robot detection techniques used
Systems compared: DSpace, EPrints, Minho DSpace Statistics Add-on

• Rate of requests: ✓³
• User agent string: ✓ ✓ ✓
• robots.txt access: ✓
• Volume of requests: ✓² ✓³
• List of known robot IP addresses: ✓ ✓
• Reverse DNS name lookup: ✓¹
• Trap file: ✓
• User agents per IP address: (none)
• Width of traversal in the URL space: ✓³

¹ Only implemented nominally or experimentally
² Via the repeat download or ‘double-click’ filter
³ Data available as a configurable report for manual decision making
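To make two of the listed techniques concrete, here is a minimal sketch combining a user agent string check with a rate-of-requests check. It is not the code of DSpace, EPrints or the Minho add-on; the pattern list, the threshold of 30 requests per minute and the log record layout are all assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Hypothetical pattern for self-identifying robots; real lists are far longer.
ROBOT_UA = re.compile(r"bot|crawl|spider|slurp", re.IGNORECASE)
# Hypothetical threshold: more requests than this per minute looks automated.
MAX_REQUESTS_PER_MINUTE = 30


def flag_robots(log):
    """Return the set of IPs flagged as robots.

    log: iterable of (ip, unix_time, user_agent) tuples (assumed layout).
    """
    flagged = set()
    per_minute = defaultdict(int)
    for ip, timestamp, user_agent in log:
        # Technique 1: the user agent string names itself as a robot.
        if ROBOT_UA.search(user_agent):
            flagged.add(ip)
        # Technique 2: too many requests from one IP within one minute.
        bucket = (ip, int(timestamp) // 60)
        per_minute[bucket] += 1
        if per_minute[bucket] > MAX_REQUESTS_PER_MINUTE:
            flagged.add(ip)
    return flagged


sample_log = [("10.0.0.1", 0, "Mozilla/5.0 (compatible; Googlebot/2.1)")]
sample_log += [("10.0.0.2", second, "Mozilla/5.0") for second in range(31)]
sample_log += [("10.0.0.3", 5, "Mozilla/5.0")]
print(sorted(flag_robots(sample_log)))
```

Robots that mimic human users defeat the first check, which is one reason the systems compared here combine several techniques.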
Results
Robots detected (Recall)
• DSpace: 0.897
• EPrints: 0.911
• Minho (no manual outlier checking): 0.890
• Minho plus monthly manual checking (UCD): 0.942

Accuracy of detection (Precision)
• DSpace: 1.000
• EPrints: 0.940
• Minho (no manual outlier checking): 0.989
• Minho plus monthly manual checking (UCD): 0.989

Accuracy of download stats (Inverse precision)
• DSpace: 0.620
• EPrints: 0.552
• Minho (no manual outlier checking): 0.590
• Minho plus monthly manual checking (UCD): 0.730
• Without filtration: 0.144
I.e. 38% of DSpace’s reported downloads are made by robots, etc.
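The same arithmetic applies to every system in the chart: the share of reported downloads still made by robots is one minus the inverse precision. The sketch below uses the inverse precision values shown on this slide; the variable names are illustrative.

```python
# Inverse precision values as reported on the slide.
INVERSE_PRECISION = {
    "DSpace": 0.620,
    "EPrints": 0.552,
    "Minho (no manual outlier checking)": 0.590,
    "Minho plus monthly manual checking (UCD)": 0.730,
    "Without filtration": 0.144,
}

# Share of each system's reported downloads that are still robots.
robot_share = {name: round(1 - value, 3) for name, value in INVERSE_PRECISION.items()}

for name, share in robot_share.items():
    print(f"{name}: {share:.1%} of reported downloads are robots")
```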
Robot detection in OA IR systems
[Figure: recall, precision and negative precision (accuracy of download stats) for DSpace, EPrints, Minho, Minho with monthly manual checking (UCD) and no robot detection; vertical axis from 0 to 1]
Thank you!