Data Reuse, Sharing, and Production: An article-centric investigation of data citation practices in prominent journals Sarah Judson DataONE Summer 2010

Motivation

Many journals have data citation, or at least data sharing, policies.
– Most are “recommendations”
– Many will soon be mandatory
– But are they enacted?

Multiple depositories exist for data sharing
– Allow browsing for available data
– Provide space for data storage
– Recommend how data reuse should be properly credited
– But are they utilized?

Intentions

Report current status of data citation and sharing to relevant journals

Recommend best practices
– Increase ability to retrace and reuse data
– Ease transition to mandatory policies
– Promote appropriate credit to data author

Background

Advent of data sharing/citation policies

Continued expression of the need for increased data sharing, esp. for meta-analysis and global change studies

Similar studies in biomedical journals* or focused on GenBank**, but few in ecological/evolutionary journals

*Piwowar and Chapman. 2010. Public sharing of research datasets: A pilot study of associations. Journal of Informetrics 4(2):148-156

**Noor et al. 2006. Data Sharing: How Much Doesn't Get Submitted to GenBank? PLoS Biol. 4(7):228

Research Questions

What are current practices for data citation within articles?
– Do authors tend to cite the dataset itself or a related paper?
– How does the author obtain the dataset?

How do these practices vary across discipline, journal, data type, data source?
– Are data citation practices influenced more by attitude of the discipline towards data sharing or journal policy?

How have these practices varied across time?
– Does increased data reuse/sharing correlate with changes in journal policy?
– Does data reuse/sharing simply increase with time since the advent of the internet?

Angles of Attack

“Snapshot” approach
– 1st issue in 2010 for journals of interest
    • To assess “current state”
    • To evaluate utility of a particular journal for more detailed “Time Series” investigation

“Time Series” approach
– Random sample of 25 articles per journal per year
    • To investigate trends over time, especially considering changes in journal data/citation policies

Nitty-Gritty Methods

Random sampling
– Export all articles and accompanying metadata
    • 2005-2010
    • Journal-specific
– Assign record number to each article
– Generate random numbers to select 25 articles
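The sampling steps above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used for the study: the field names (`record`, `journal`, `year`), the fixed seed, and the toy export are all assumptions made for the example.

```python
import random

def sample_articles(records, per_group=25, seed=2010):
    """Pick up to `per_group` random records for each (journal, year) pair.

    `records` is a list of dicts with hypothetical keys
    "record", "journal" and "year"; the seed is arbitrary and only
    makes the draw repeatable.
    """
    rng = random.Random(seed)
    groups = {}
    for rec in records:
        groups.setdefault((rec["journal"], rec["year"]), []).append(rec)
    sample = {}
    for key in sorted(groups):
        recs = groups[key]
        # Sample without replacement; take everything if fewer than per_group exist
        sample[key] = rng.sample(recs, min(per_group, len(recs)))
    return sample

# Toy export (made-up record numbers): 40 articles in one journal-year, 10 in another
records = [{"record": i, "journal": "SysBio", "year": 2005} for i in range(1, 41)]
records += [{"record": i, "journal": "AmNat", "year": 2005} for i in range(41, 51)]
picked = sample_articles(records)
```

Fixing the seed means the same 25 record numbers can be regenerated later, which helps if the selection ever needs to be audited.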

Data Extraction
– Recorded on Excel spreadsheet, uploaded weekly to GoogleDocs
– Read Journal Citation/Data Policy in preparation for extraction
– Read through articles manually
    • Special attention to the Methods and Acknowledgements sections
    • Identify instances of data reuse and sharing
– Copy relevant excerpts
    • Code according to established fields
– Record additional metadata
    • Open access, Discipline, Submission to Publication duration, etc.

Extracted Fields

ISI metadata
– DOI
– Author and affiliation
– Abstract and keywords
– Journal and ISSN

For each instance of Data Reuse, Sharing, or Production
– Depository
– Type of InText and Bibliographic Citation
    • Author-Year, URL, Accession #
– How dataset acquired
    • Is depository clearly referenced?
    • Was it obtained from a colleague?
    • Is it previous work by one of the authors?
– Where citation occurs
– Type of Dataset
    • Gene Sequence, Phylogenetic Tree, Ecological, etc.

Selected Journals

Dryad “Top Three”
– Justification:
    1. Most datasets currently posted (but are they really being reused?)
    2. Known “High Impact” journals
    3. Cover target disciplines and depositories
– Systematic Biology (Systematics, Phylogenetics/geography)
– American Naturalist (Behavior, Natural History, Ecology)
– Molecular Ecology (Genetics, Molecular Evolution)

Other options: ESA family, Discipline-specific, Broad

Limitations

Only looking at a few journals and disciplines

Relying only on the main text
– Not looking at supplementary material unless article extremely unclear
– Have to assume if it wasn’t stated, it wasn’t reused/shared

Would have developed automated extraction, text coding if time permitted
– Process more articles
– Remove bias
– Standardization

Unresolved Problems (suggestions please!)

Data Type Classification
– Easy: Gene Sequences and Phylogenetic Trees
– Biology vs. Ecology
– Subdivisions in Biology, Earth, etc.
    • Bio: Morphology, Behavior
    • Eco: Competition, Community
    • Earth: Soils, GIS

“Articles” according to ISI
– AmNat:
    • High % are models
    • Notes and Comments
    • Natural History Miscellany
– SysBio: Points of View

Author Recurrence
– SysBio: only 50 articles per year and multiple publications/accreditations to the same people (Wiens, Sullivan)
– AmNat: less pronounced problem (Abrams)

Findings

Qualitative observations

Good citation, bad citation

Journal Comparison

Time Series
– % Reuse
– % Sharing (results not presented)
– Data type
– Depository

Qualitative Observations

– Internal (journal) supplementary depositories used more as a dump than for reusable data
    • Additional or color figures and tables
    • Statistical outputs
– In-text citations allude to a raw data supplement, but it often ends up being raw results
– Defunct data storage
    • Personal URLs
    • Problem retrieving supplementary data (SysBio 2005-8)
– More data produced than shared
– Alignments and trees often not posted to TreeBASE
– Ecological datasets grossly under-shared

Haphazard citation practices

Accessions cited in Text vs. Table

Author vs. Accession

Only depository referenced
– Especially with large datasets

Some in Methods, some in Results
– Majority of reuses cited in Methods
– Sharing cited roughly 50/50 between Methods and Results

Crediting self before others
– Bibliographic citations not given or only for same author
– Give article citation for self, but not accession; accession for others but not their article

Disparate citation formats within a single paper

Good Citation

“Previously published sequence data were used for V. velella 18S (Collins, 2002, GenBank AF358087), P. porpita 18S (Collins, 2002, AF358086), Staurocladia wellingtoni 18S (Collins, 2002, AF358084), S. wellingtoni 16S (Schuchert, 2005, AJ580934), Hydra circumcincta 18S (Medina et al., 2001, AF358080), and H. vulgaris 16S (Pont-Kingdon et al., 2000, AF100773).”

Taxon

Gene region

Author-Year
– accompanying bibliographic citation

GenBank Accession

Bad Citation

Incomplete
– “The sequences, which were all produced in our previous studies (Aceto et al. 1999; Cozzolino et al. 2001) and are available in GenBank”
– Usually missing accession, sometimes author and depository
– Sometimes the info is buried in tables or not given for large compilations

Unclear
– “During annual aerial surveys, observers sketch the extent of defoliation from the air on paper or digital maps (Ciesla 2000) that are then compiled as a series of polygons in a geographical information system (GIS) (Liebhold et al. 1997).”

Who is the original data author?
– Are these theoretical, methodological or data citations?
– Bibliographic citations occasionally shed light

Where is the data stored?

What is a good citation?

Data easily retraceable

Proper credit given

Criteria
– Depository mentioned in text
– Accession mentioned in text
– Author credit given in Bibliography
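As a rough illustration, the three criteria above can be coded as a simple check. The class and function names here are hypothetical, and treating a citation as “good” only when all three criteria are met is an assumption of this sketch, not a rule stated on the slide.

```python
from dataclasses import dataclass

@dataclass
class DataCitation:
    """One coded instance of data reuse or sharing (hypothetical field names)."""
    depository_in_text: bool
    accession_in_text: bool
    author_in_bibliography: bool

def missing_criteria(c):
    """Return the names of the criteria that this citation fails."""
    checks = [
        ("depository in text", c.depository_in_text),
        ("accession in text", c.accession_in_text),
        ("author credit in bibliography", c.author_in_bibliography),
    ]
    return [name for name, met in checks if not met]

def is_good(c):
    """Assumption: 'good' means all three criteria are satisfied."""
    return not missing_criteria(c)

# A citation that names the depository and credits the authors
# but leaves out the accession number
incomplete = DataCitation(True, False, True)
print(missing_criteria(incomplete))  # ['accession in text']
```

Reporting which criteria are missing, rather than a single label, keeps the coding usable if the “good” criteria are later refined per journal or depository.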

Citations: Systematic Biology

[Chart: counts of citations rated Bad, Good, or Ok; pivot labels: Count of Year, IdealCitationAll]

Citations: American Naturalist

[Chart: counts of citations rated Bad, Good, or Ok; pivot labels: Count of Year, IdealCitationAll]

Journal Comparison: Snapshot

Systematic Biology
– Data reuse: frequent use of GenBank; occasional use of TreeBASE
– Data sharing: often post to TreeBASE, but often unclear about GA vs. PT; internal depository difficult (no unique accession [generic URL]; not accessible pre-2008)

American Naturalist
– Data reuse: varied data (biological); often extracted from literature or used to validate a model
– Data sharing: occasional sharing (Dryad, TreeBASE, GenBank, internal)

Molecular Ecology
– Data reuse: frequent use of GenBank, but steadily drops off after 2009; some morphological data matrices
– Data sharing: posting to GenBank, but alternatively given in Methods and Results; level of accessibility varies widely

Ecology
– Data reuse: minor datasets
– Data sharing: extensive datasets rarely shared; Ecological Archives (accessible but used for excess figures and results)

Journal Comparison: % Reuse and Sharing in 2010

Journal     Data Reuse   Data Sharing   “Good” Citations
AmNat       47%          13%            0%
Ecology     48%          5%             0%
MolecEco    90%          70%            30%
SysBio      86%          62%            14%

Percent Reuse over time

Year        AmNat   SysBio
2005        20%     73%
2006        20%     50%
2007        24%     67%
2008        32%     75%
2009        28%     64%
2010        47%     83%
All Years   27%     68%

Depository: Systematic Biology

[Chart: depository categories: Genbank, Treebase, Database, URL, Extraction, Not Indicated, Other]

Depository: American Naturalist

[Chart: depository categories: Genbank, Treebase, Database, URL, Extraction, Not Indicated, Other]

Data Types: Systematic Biology

[Chart: data type categories: Gene Sequence, Organism Biology, Phylogenetic Tree, G.I.S., Unclassified]

Data Types: American Naturalist

[Chart: data type categories: Gene Sequence, Organism Biology, Phylogenetic Tree, G.I.S., Unclassified]

Back to the big picture

Inform journals and depositories about current practices vs. policy

Best Practices recommendations

Continued research on trends in data citation

Suggested Best Practices

Accession numbers and Authors of each dataset (reused and shared) given in the Methods or in a Supplementary Table referenced in the Methods
– Authors not charged extra page/online fees
– Authors allowed to exceed Reference limit to credit data

Editorial enforcement
– Checklist

Internal Depositories made more accessible
– Usable formats
– Unique and Stable URLs

Long Term Best Practices

Separate Supplementary Data Section
– Example: Molecular Ecology
    • SysBio added a separate section but it is defunct
    • AmNat has an “Online enhancements” header
– For both internal and external deposits
– Distinguish from “data-dump” (extra figures, outputs)
– Accompanying References section
    • Unlimited length

DATA cited, in addition to publication
– Could combine into a new reference type: Author. Year. Title. Journal. Pages. Depository. Accession.

Track on par with publications in ISI, etc.
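To make the proposed combined reference type concrete, here is a minimal sketch that assembles the seven fields in the order suggested above. The function name is hypothetical and the example values are placeholders, not a real reference.

```python
def format_data_reference(author, year, title, journal, pages, depository, accession):
    """Join the fields as 'Author. Year. Title. Journal. Pages. Depository. Accession.'"""
    parts = [author, str(year), title, journal, pages, depository, accession]
    if any(not p.strip() for p in parts):
        # Every field is required so the dataset stays retraceable and credited
        raise ValueError("all seven fields must be filled in")
    # Strip any trailing period first so fields like 'et al.' do not end in '..'
    return " ".join(p.strip().rstrip(".") + "." for p in parts)

# Placeholder values only
ref = format_data_reference(
    "Author A", 2010, "Article title", "Journal name",
    "1-10", "Depository name", "ACCESSION000",
)
print(ref)
# Author A. 2010. Article title. Journal name. 1-10. Depository name. ACCESSION000.
```

Requiring every field mirrors the criteria for a good citation: the depository and accession make the data retraceable, and the author and article give credit.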

Continued Research

Snapshot and time series of Molecular Ecology
– Possibly Ecology if time permits
– Alternative (suggestions please!): Just snapshots

Trends over time
– Has reuse and sharing increased?
– Have citation practices improved over time?
    • Is this influenced by journal/depository recommendations on citations?

Correlation with influential factors
– Is there more data reuse in articles that are also open access or share data?
– Are certain dataset types or article disciplines more inclined to reuse/share data?

Data shared vs. data produced

Sync with Journal/Depository Metadata (Nic) and Search Findings (Valerie)
– Refine “Good” citation criteria
    • Journal and depository specific

Additional Exploration

Track the cited or shared datasets

Look at supplemental data alone
– Internal (journal repositories)
    • Additional data not cited in text?
    • Data dumps?
– Ease of access
    • Accuracy of accession numbers

Actual data reusability
– Method/processing metadata
– File format

Software/model reuse and sharing
– R-packages, GUIs
– Encouraged by American Naturalist

Databases
– Independent databases vs. depositories
    • % utilized out of available
– Caching/stability options, linking metadata to depositories

Final products

Reports to requisite journals/depositories

Potential Manuscripts
– Journal Comparison: Citation Practices
– Treebase: Shared vs. Produced
– Best Practices recommendations

Shared dataset!

Thanks for listening!

Questions? Suggestions?

– Unresolved problems
– Continued Research

Hurdles

Determining extracted fields

Coding data now vs. later

In light of data/citation policies….

Compare the “performance” of SysBio and AmNat against their depository and journal policies (do they meet the requirements?), or state this in the future research section

OWW: Nic – do “editor” instructions or other sections of policy indicate how data/citation policies are enforced?
