The MESUR project: an overview and update · network), we should probably leverage network science...

Preview:

Citation preview

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

The MESUR project: an overview and update

Johan Bollen

Indiana University

School of Informatics and Computing

Center for Complex Networks and System Research

jbollen@indiana.edu

Acknowledgements:

Herbert Van de Sompel (LANL), Marko A. Rodriguez (LANL), Ryan Chute (LANL),

Lyudmila L. Balakireva (LANL), Aric Hagberg (LANL), Luis Bettencourt (LANL)

Research supported by the NSF and Andrew W. Mellon Foundation.

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

The scientific process: the importance of early indicators

(Egghe & Rousseau, 2000; Wouters, 1997)

(Brody, Harnad, & Carr 2006),

Citation: final products

• Publication delays

• Focus on publications

• Focus on authors

Usage data

• Scale, cf. Elsevier downloads

(+1B) vs. Wos citations (650M)

• Immediate, early stages

• Variety of resources and actors

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

The Promise of Usage Data

* Interactions be recorded for all digital scholarly content, i.e. papers,

journals, preprints, blog postings, datasets, chemical structures,

software, …

• Not just for ~ 10,000 journals

• Interactions reflects the activities of all users of scholarly information,

not only of scholarly authors

• Interactions are recorded starting immediately after publication

• Not once read and cited (think publication delays)

• Rapid indicator of scholarly trends

• So the interest in usage data from projects such as COUNTER,

Citebase, IKS and MESUR should not come as a surprise!

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Impact metrics: why more is not necessarily better…

So you want to know who’s best?

83M > 50K

Silly?

We do the same in scholarly evaluation!

• Count citations, calculate IFs

• More citations > less citations

BUT that is so last century:

• relationships matter more than counts

(cf Google)

• Web 2.0: social network thinking

? REM, Teenage

Fanclub, Placebo, This

Mortal Coil, Wilco

Who, Kinks,

Byrds, Beatles

Crucial distinction: data vs. statistics

• Data: who, what, when, how, …. Keep info on sequence and context.

• Statistics: what, how much. Loss of most sequence and context.

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Network-Based Metrics

We have 50 years of network science available to us

• A wide variety of metrics has been proposed to characterize

networks, and to assess the importance of nodes in a network

• E.g. social network analysis, small world graphs, graph theory,

social modeling

• So when defining metrics for scholarly communication (clearly a

network), we should probably leverage network science

• Cf. Google’s PageRank versus Alta Vista’s statistical ranking

• A network (and hence a network-based metric) takes context into

account; a statistical count does not.

• Readings:

• Barabasi (2003) Linked.

• Wasserman (1994). Social network analysis.

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Classes of metrics:

• Degree

• Shortest path

• Random walk

• Distribution

Degree

• In-degree

• Out-degree

Shortest path

• Closeness

• Betweenness

• Newman

Random walk

• PageRank

• Eigenvector

Distribution

• In-degree entropy

• Out-degree entropy

• Bucket Entropy

Network-Based Metrics

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

PageRank computed on Citation Network

2003 JCR, Science Edition

5709 journals

Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69(3), December 2006 (DOI:10.1007/s11192-006-0176-z)

Philip Ball. Prestige is into journal ratings. Nature 439, 770-771, February 2006 (DOI:10.1038/439770a)

Cf: http://www.eigenfactor.org/

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

1. Create very large-scale reference data set: representative sample?

a) Usage, citation and bibliographic data combined

b) Various communities, various collections

2. Investigate validity of usage data and usage-based metrics – focus

on journals:

a) Is there any significant structure in usage data?

b) Compute a variety of journal metrics for usage data & cross-

validate with other journal metrics, e.g. citation-based IF

3. Deploy tools to explore usage-based journal metrics

MESUR: A Thorough, Scientific Approach

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Timeline and development

• 2006-2008:

o Andrew W. Mellon Foundation

o Digital Library Research and Prototyping team, Los Alamos National

Laboratory

o Collection of large-scale usage data from some of world’s most

significant publishers, aggregators and institutional consortia

o Feasibility: Usage data, usage-based network models of science,

usage-based impact metrics

• 2009 – infinity and beyond:

o NSF funding (SciSIP, 2009-2012)

o Indiana University, School of Informatics and Computing

o Transfer process underway

o Continuation of MESUR data collection and scientific work

o Focus on studying longitudinal phenomena: innovation, trends,

community-acceptance of novel metrics

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Presentation structure

1. MESUR’s Usage reference data set

2. Mapping scientific activity

3. Metrics survey

4. Future research

5. Discussion

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Creating the MESUR usage reference data set

2006-2008: Collaborating publishers, aggregators and institutional consortia:

• BMC, Blackwell, UC, CSU (23), EBSCO, ELSEVIER, EMERALD, INGENTA, JSTOR,

LANL, MIMAS/ZETOC, THOMSON, UPENN (9), UTEXAS

• Scale:

o > 1,000,000,000 usage events, and growing…

o +50M articles, +-100,000 serials

• Period: 2002-2007, but mostly 2006

1B

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Data normalization and ingestion

Minimal requirements for all usage data

• Unique usage events (article level)

• Fields: unique session ID, date/time, unique document ID and/or metadata, request type

• Note difference with usage statistics

2007 9 1 0 0 1 CFA cffoe A172080.N1.Vanderbilt.Edu unknown AST A 1996SPIE.2828..64S http://foe.edu/abs/1996SPIE.2828..64S http://www.google.com

2007 9 1 0 0 1 CFA cffoe 210.94.41.89 unknown PHY A 2007ApPhL.90a2120C http://foe.edu/abs/2007ApPhL.90a2120C http://www.google.co.kr

2007 9 1 0 0 1 CFA cffoe 24-196-228-125.dhcp.gwnt.ga.charter.com unknown AST A 2000ASPC.213.333S http://foe.edu/abs/2000bioa.conf.333S http://scholar.google.com

2007 9 1 0 0 4 CFA cffoe 163.152.35.114 4700387eae PHY A 1993WRR..29.133S http://foe.edu/abs/1993WRR..29.133S http://scholar.google.com

2007 9 1 0 0 6 CFA cffoe pd9e980fc.dip0.t-ipconnect.de 45f0c69881 AST X 2007AN..328.841H http://arXiv.org/abs/0708.1863 http://foe.edu

2007 9 1 0 0 1 CFA cffoe A172080.N1.Vanderbilt.Edu unknown AST A 1996SPIE.2828..64S http://foeabs.edu/abs/1996SPIE.2828..64S http://www.google.com

2007 9 1 0 0 1 CFA cffoe 210.94.41.89 unknown PHY A 2007ApPhL.90a2120C http://foeabs.edu/abs/2007ApPhL.90a2120C http://www.google.co.kr

2007 9 1 0 0 1 CFA cffoe 24-196-228-125.dhcp.gwnt.ga.charter.com unknown AST A 2000ASPC.213.333S http://foeabs.edu/abs/2000bioa.conf.333S http://scholar.google.com

2007 9 1 0 0 4 CFA cffoe 163.152.35.114 4700387eae PHY A 1993WRR..29.133S http://foeabs.edu/abs/1993WRR..29.133S http://scholar.google.com

2007 9 1 0 0 6 CFA cffoe pd9e980fc.dip0.t-ipconnect.de 45f0c69881 AST X 2007AN..328.841H http://arXiv.org/abs/0708.1863 http://foeabs.edu

2007 9 1 0 0 6 CFA cffoe foel25144.4u.com.gh 47002f8eda PHY A 2002AGUFM.S21A0965M http://foeabs.edu/abs/2002AGUFM.S21A0965M http://www.google.com

2007 9 1 0 0 6 CFA cffoe 66-215-171-214.dhcp.ccmn.ca.charter.com 4681d22a6f AST A 2001P&SS..49.657R http://foeabs.edu/cgi-bin/bib_query?bibcode=2001P%26SS..49.657R http://cfa

2007 9 1 0 0 7 CFA cffoe nat-ptouser3.uspto.gov unknown PHY A 2005ApPhL.86g2106M http://foeabs.edu/abs/2005ApPhL.86g2106M http://www.google.com

2007 9 1 0 0 7 CFA cffoe cpe-71-65-25-115.ma.res.rr.com unknown PHY A 1980SPIE.205.153S http://foeabs.edu/abs/1980SPIE.205.153S http://www.google.com

2007 9 1 0 0 7 CFA cffoe customer3491.pool1.unallocated-106-0.orangehomedsl.co.uk unknown PHY A 1983ElL..19.883V http://foeabs.edu/abs/1983ElL..19.883V http://www.google.co.uk

2007 9 1 0 0 8 CFA cffoe Uranus.seas.ucla.edu 46672d96b2 PHY A 1966Phy..32.385K http://foeabs.edu/abs/1966Phy..32.385K http://www.google.com

2007 9 1 0 0 9 CFA cffoe 75-121-173-37.dyn.centurytel.net 46cf1fd8a6 AST D 1984ApJS..56.257J http://vizier.cfa.edu/viz-bin/VizieR?-source=III/92/ http://foeabs.edu

2007 9 1 0 0 13 CFA cffoe foel17-18.kln.forthnet.gr unknown AST A 1987cosm.book...C http://foeabs.edu/abs/1987cosm.book...C http://www.google.gr

2007 9 1 0 0 15 CFA cffoe hades.astro.uiuc.edu 46f707564d PRE A 2007arXiv0707.3146N http://foeabs.edu/abs/2007arXiv0707.3146N http://foeabs.edu

2007 9 1 0 0 17 CFA cffoe ool-43554752.dyn.optonline.net unknown PHY A 2000PhTea.38.132K http://foeabs.edu/abs/2000PhTea.38.132K http://www.google.com

2007 9 1 0 0 17 CFA cffoe c-68-33-176-222.hsd1.md.comcast.net unknown GEN A 1994RSPSB.256.177M http://foeabs.edu/abs/1994RSPSB.256.177M http://www.google.com

2007 9 1 0 0 19 CFA cffoe 74-36-139-46.dr02.brvl.mn.frontiernet.net unknown AST T 2002SPIE.4767.114W http://foeabs.edu/cgi-bin/nph-abs_connect?bibcode=2002SPIE.4767&db_key=ALL&sort=BIBCODE&a

2007 9 1 0 0 19 CFA cffoe c-76-16-53-120.hsd1.il.comcast.net 46f667b71b AST F 1916PA...24.613L http://articles.foeabs.edu/cgi-bin/nph-iarticle_query?1916PA...24.613L&data_type=PDF_HIGH&w

2007 9 1 0 0 20 CFA cffoe 74-39-37-62.nas03.roch.ny.frontiernet.net unknown PHY E 2007JSTEd.tmp..29B http://dx.doi.org/10.1007/s10972-007-9067-2 http://foeabs.edu

2007 9 1 0 0 22 ANU bio-mirror uatu-virtual1.anu.edu.au 46f9e8f87f AST A 2006ApJ..647.128E http://foe.grangenet.net/abs/2006ApJ..647.128E http://foe.grangenet.net

2007 9 1 0 0 22 CFA cffoe fw.hia.nrc.ca 46f1531d59 AST A 2002P&SS..50.745H http://foeabs.edu/abs/2002P%26SS..50.745H http://foeabs.edu

2007 9 1 0 0 22 CFA cffoe 24-117-0-220.cpe.cableone.net unknown AST A 1984BITA..15.268S http://foeabs.edu/abs/1984BITA..15.268S http://www.google.com

2

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Presentation structure

1. MESUR’s Usage reference data set

2. Mapping scientific activity

3. Metrics survey

4. Future research

5. Discussion

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Data set: subset of MESUR

• Common time period:

o March 1st 2006 - February 1st 2007

o Thomson Scientific (Web of Science),

Elsevier (Scopus), JSTOR, Ingenta,

University of Texas (9 campuses, 6

health institutions), and California State

University (23 campuses)

• 346,312,045 usage events

• 97,532 serials (many of which not

journals)

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

How to generate a usage network.

Same session ~ documents relatedness

• Same session, same user: common interest

• Frequency of co-occurrence = degree of

relationship

• Normalized: conditional probability

Usage data is on article level:

• Works for journals and articles

• Anything for which usage was recorded

Note: not something we invented: association rule

learning in data mining.

Beer and diapers!

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Johan Bollen, Herbert Van de Sompel, Aric Hagberg,Luis

Bettencourt, Ryan Chute, Marko A. Rodriguez, Lyudmila

Balakireva. Clickstream data yields high-resolution maps

of science. PLoS One, February 2009.

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Network science for impact metrics.

: Number of geodesics between vi and vj

Betweenness centrality

PR(vi): PageRank of node vi

O(vj): out-degree of journal vj

N: number of nodes in network

L: dampening factor

PageRank

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Presentation structure

1. MESUR’s Usage reference data set

2. Mapping scientific activity

3. Metrics survey

4. Future research

5. Discussion

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

A variety of impact metrics

Note:

• Metrics can be calculated

both on citation and

usage data

• “Frequentist”

o Citation and usage

rates

• “Structural”

o Citation graph, e.g.

2005 JCR

o Usage graph, e.g.

created by MESUR

• H-index, G-index, SJR,

etcWhat do they MEAN?

What facets of impact do they represent?

Which are best suited?

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Set of metrics calculated on MESUR data set

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

The MESUR Metrics Map

RATE METRICS

TOTAL CITES

PAGERANK(S)

BETWEENNESS

USAGE METRICS

Johan Bollen, Herbert Van de Sompel, Aric

Hagberg and Ryan Chute. A Principal

Component Analysis of 39 Scientific Impact

Measures. PLoS ONE, June 2009. URL:

http://dx.plos.org/10.1371/journal.pone.0006

022.

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Presentation structure

1. MESUR’s Usage reference data set

2. Mapping scientific activity

3. Metrics survey

4. Future research

5. Discussion

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Samples of future work (can be skipped)

• Longitudinal studies:

o Network changes over time: collaboration with Carl Bergstrom (UW)

o Prediction of innovation using random walk models

• Logistics:

o Expand existing data set: focus on standardization, repeatability

o Establish continued funding, good home for project

o “Center” model: rather than data->scientists, scientists->data

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Animated maps: tracing bursts of scientific activity

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Coordinated bursts

1

23

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Coordinated bursts

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

MESUR Mapping and ranking services

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

MESUR Mapping and ranking services

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

MESUR Mapping and ranking services

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

MESUR Mapping and ranking services

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Presentation structure

1. MESUR’s Usage reference data set

2. Mapping scientific activity

3. Metrics survey

4. Future research

5. Discussion

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

MESUR: the good ...

After 3 years of MESUR:

• Scientific exploration of metrics for scholarly evaluation

• Creation of large-scale reference data set

• Mapping science from the viewpoint of users: there is structure!

• Variety of metrics that cover various aspects of scholarly impact and prestige

• MESUR dataset contains many more pearls for future research

• Foundation for future continued research program:

• Longitudinal studies

• Models of collective behavior of scientists

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Scalability of the approach:

• Lengthy negotiations to obtain log data

• No infrastructure standards (yet): Recording, aggregating, normalization, ingestion, de-duplication,…

• No generally accepted policies: privacy, property, …

• No census data: when is a sample large and representative enough?

Quality control:

• Bots, Crawlers (detectable but never perfect)

• Cheating, manipulation (easier with usage statistics than network metrics)

Acceptance:

• Network-based usage metrics require session information. This is overlooked! As a result, will we end up with usage-based statistics only?

• “As simple as possible, but not more simple!”

MESUR: the bad and the ugly …

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Registration is now open for "Scholarly Evaluation Metrics: Opportunities and Challenges", a one-day NSF-funded workshop that will take place in the Renaissance Washington DC Hotel on Wednesday, December 16th 2009. Participation in this workshop is limited to 50 people. Registration is free at http://informatics.indiana.edu/scholmet09/registration.html.

The topic of the workshop is the future of scholarly assessment approaches, including organizational, infrastructural, and community issues. The overall goal is to identify requirements for novel assessment approaches, several of which have been proposed in recent years, to become acceptable to community stakeholders including scholars, academic and research institutions, and funding agencies. The impressive group of speakers and panelists for the workshop includes representatives from each of these constituencies.

Further details are available at http://informatics.indiana.edu/scholmet09/announcement.html

Workshop organizers:

Johan Bollen (jbollen@indiana.edu),

Herbert Van de Sompel (hvdsomp@gmail.com) and

Ying Ding (dingying@indiana.edu)

Increasing community involvement

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Planning process underway to establish sustainable, open, community supported infrastructure.

New support from Andrew W. Mellon foundation to figure it all out.

Moving towards community involvement

Logistics:

Data aggregation

Normalization

Data-related services

Data management

Science:

Metrics

Analysis

Prediction

Services

Ranking

Assessment

Mapping

=More than sum of parts:

• Each component supports the other

• Various business and funding models

• Generate added value on all levels

Can fundamentally change scholarly

communication

Indiana University

SPARC/LIBER: June 29th, 2010

School of Informatics and Computing

Some relevant publications.

Johan Bollen, Herbert Van de Sompel, Aric Hagberg, Luis Bettencourt, Ryan Chute, Marko A. Rodriguez, Lyudmila Balakireva. Clickstream data yields high-resolution maps of science. PLoS One, March 2009 (In Press)

Johan Bollen, Herbert Van de Sompel, Aric HagBerg, Ryan Chute. A principal component analysis of 39 scientific impact measures.arXiv.org/abs/0902.2183

Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69(3), December 2006 (arxiv.org:cs.DL/0601030)

Johan Bollen, Herbert Van de Sompel, and Marko A. Rodriguez. Towards usage-based impact metrics: first results from the MESUR project. In Proceedings of the Joint Conference on Digital Libraries, Pittsburgh, June 2008

Marko A. Rodriguez, Johan Bollen and Herbert Van de Sompel. A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage, In Proceedings of the Joint Conference on Digital Libraries, Vancouver, June 2007

Johan Bollen and Herbert Van de Sompel. Usage Impact Factor: the effects of sample characteristics on usage-based impact metrics. (cs.DL/0610154)

Johan Bollen and Herbert Van de Sompel. An architecture for the aggregation and analysis of scholarly usage data. In Joint Conference on Digital Libraries (JCDL2006), pages 298-307, June 2006.

Johan Bollen and Herbert Van de Sompel. Mapping the structure of science through usage. Scientometrics, 69(2), 2006.

Johan Bollen, Herbert Van de Sompel, Joan Smith, and Rick Luce. Toward alternative metrics of journal impact: a comparison of download and citation data. Information Processing and Management, 41(6):1419-1440, 2005.

Recommended