Data Reuse, Sharing, and Production: An article-centric investigation of data citation practices in prominent journals
Sarah Judson
DataONE
Summer 2010
Motivation
Many journals have data citation policies, or at least data sharing policies.
– Most are “recommendations”
– Many will soon be mandatory
– But are they enacted?
Multiple depositories exist for data sharing
– Allow browsing for available data
– Provide space for data storage
– Recommend how data reuse should be properly credited
– But are they utilized?
Intentions
Report current status of data citation and sharing to relevant journals
Recommend best practices
– Increase ability to retrace and reuse data
– Ease transition to mandatory policies
– Promote appropriate credit to data author
Background
Advent of data sharing/citation policies
Continued expression of the need for increased data sharing, esp. for meta-analysis and global change studies
Similar studies in biomedical journals* or focused on GenBank**, but few in ecological/evolutionary journals
*Piwowar and Chapman. Public sharing of research datasets: A pilot study of associations. Journal of Informetrics April 2010 4(2):148-156
**Noor et al. 2006. Data Sharing: How Much Doesn't Get Submitted to GenBank? PLoS Biol. 4(7):228
Research Questions
What are current practices for data citation within articles?
– Do authors tend to cite the dataset itself or a related paper?
– How does the author obtain the dataset?
How do these practices vary across discipline, journal, data type, data source?
– Are data citation practices influenced more by the discipline's attitude towards data sharing or by journal policy?
How have these practices varied across time?
– Does increased data reuse/sharing correlate with changes in journal policy?
– Does data reuse/sharing simply increase with time since the advent of the internet?
Angles of Attack
“Snapshot” approach
– 1st issue in 2010 for journals of interest
    – To assess “current state”
    – To evaluate utility of a particular journal for more detailed “Time Series” investigation
“Time Series” approach
– Random sample of 25 articles per journal per year
    – To investigate trends over time, especially considering changes in journal data/citation policies
Nitty-Gritty Methods
Random sampling
– Export all articles and accompanying metadata
    – 2005-2010
    – Journal-specific
– Assign record number to each article
– Generate random numbers to select 25 articles
Data Extraction
– Recorded on Excel spreadsheet, uploaded weekly to GoogleDocs
– Read journal citation/data policy in preparation for extraction
– Read through articles manually
    – Special attention to the Methods and Acknowledgements sections
    – Identify instances of data reuse and sharing
– Copy relevant excerpts
    – Code according to established fields
– Record additional metadata
    – Open access, Discipline, Submission to Publication duration, etc.
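The random-sampling steps above can be sketched in Python roughly as follows. This is an illustrative sketch only, not the procedure actually used for the study: the record layout (dicts with `journal` and `year` keys), the function name `sample_articles`, and the fixed seed are all assumptions. The slides only say that record numbers were assigned and random numbers were generated to select 25 articles per journal per year.

```python
import random
from collections import defaultdict

def sample_articles(records, per_group=25, seed=2010):
    """Pick up to `per_group` record numbers at random per (journal, year).

    `records` is a list of dicts with the hypothetical keys "journal" and
    "year". Each record's number is its 1-based position in the export,
    mirroring the "assign record number" step.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) so the draw is repeatable
    groups = defaultdict(list)
    for record_number, rec in enumerate(records, start=1):
        groups[(rec["journal"], rec["year"])].append(record_number)

    selected = {}
    for key, numbers in groups.items():
        k = min(per_group, len(numbers))  # groups smaller than 25 are taken whole
        selected[key] = sorted(rng.sample(numbers, k))
    return selected

# Toy export with made-up sizes: 40 SysBio and 30 AmNat articles for 2005.
export = [{"journal": "SysBio", "year": 2005} for _ in range(40)]
export += [{"journal": "AmNat", "year": 2005} for _ in range(30)]
picked = sample_articles(export)
print({key: len(numbers) for key, numbers in picked.items()})
```

Sampling without replacement inside each journal-year group keeps the selection balanced across years, which matters for the time-series comparison.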
Extracted Fields
ISI metadata
– DOI
– Author and affiliation
– Abstract and keywords
– Journal and ISSN
For each instance of Data Reuse, Sharing, or Production
– Depository
– Type of InText and Bibliographic Citation
    – Author-Year, URL, Accession #
– How dataset acquired
    – Is depository clearly referenced?
    – Was it obtained from a colleague?
    – Is it previous work by one of the authors?
– Where citation occurs
– Type of Dataset
    – Gene Sequence, Phylogenetic Tree, Ecological, etc.
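One way to picture the extracted fields is as a small record structure. The sketch below is a hypothetical Python layout, not the actual spreadsheet schema: the class names, attribute names, and example values (including the placeholder DOI and ISSN) are assumptions chosen to mirror the field list above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DataInstance:
    """One instance of data reuse, sharing, or production in an article."""
    kind: str                                  # "reuse", "sharing", or "production"
    depository: Optional[str] = None           # e.g. "GenBank"; None if not indicated
    intext_citation: List[str] = field(default_factory=list)         # e.g. ["Author-Year", "Accession #"]
    bibliographic_citation: List[str] = field(default_factory=list)  # same vocabulary as above
    how_acquired: Optional[str] = None         # e.g. "depository", "colleague", "own previous work"
    citation_location: Optional[str] = None    # e.g. "Methods"
    dataset_type: Optional[str] = None         # e.g. "Gene Sequence"

@dataclass
class ArticleRecord:
    """ISI metadata for one sampled article plus its coded data instances."""
    doi: str
    authors: List[str]
    affiliations: List[str]
    abstract: str
    keywords: List[str]
    journal: str
    issn: str
    instances: List[DataInstance] = field(default_factory=list)

# Placeholder values only; none of this is real study data.
article = ArticleRecord(
    doi="10.0000/placeholder",
    authors=["Example Author"],
    affiliations=["Example University"],
    abstract="",
    keywords=[],
    journal="Example Journal",
    issn="0000-0000",
)
article.instances.append(
    DataInstance(kind="reuse", depository="GenBank",
                 intext_citation=["Author-Year", "Accession #"],
                 how_acquired="depository", citation_location="Methods",
                 dataset_type="Gene Sequence")
)
print(len(article.instances), article.instances[0].depository)
```

Keeping instances as a list per article reflects that a single article can reuse, share, and produce several datasets.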
Selected Journals
Dryad “Top Three”
– Justification:
    1. Most currently posted datasets... but are they really being reused?
    2. Known "High Impact" journals
    3. Cover target disciplines and depositories
– Systematic Biology (Systematics, Phylogenetics/geography)
– American Naturalist (Behavior, Natural History, Ecology)
– Molecular Ecology (Genetics, Molecular Evolution)
Other options: ESA family, Discipline-specific, Broad
Limitations
Only looking at a few journals and disciplines
Relying only on the main text
– Not looking at supplementary material unless article extremely unclear
– Have to assume if it wasn’t stated, it wasn’t reused/shared
Would have developed automated extraction, text coding if time permitted
– Process more articles
– Remove bias
– Standardization
Unresolved Problems (suggestions please!)
Data Type Classification
– Easy: Gene Sequences and Phylogenetic Trees
– Biology vs. Ecology
– Subdivisions in Biology, Earth, etc.
    – Bio: Morphology, Behavior
    – Eco: Competition, Community
    – Earth: Soils, GIS
“Articles” according to ISI
– AmNat:
    – High % are models
    – Notes and Comments
    – Natural History Miscellany
– SysBio: Points of View
Author Recurrence
– SysBio: only 50 articles per year and multiple publications/accreditations to the same people (Wiens, Sullivan)
– AmNat: less pronounced problem (Abrams)
Findings
Qualitative observations
Good citation, bad citation
Journal Comparison
Time Series
– % Reuse
– % Sharing (results not presented)
– Data type
– Depository
Qualitative Observations
– Internal (journal) supplementary depositories used more as a dump than for reusable data
    – Additional or color figures and tables
    – Statistical outputs
– In-text citations allude to a raw data supplement, but it often ends up being raw results
– Defunct data storage
    – Personal URLs
    – Problem retrieving supplementary data (SysBio 2005-8)
– More data produced than shared
– Alignments and Trees often not posted to TreeBASE
– Ecological datasets grossly under-shared
Haphazard citation practices
Accessions cited in Text vs. Table
Author vs. Accession
Only depository referenced
– Especially with large datasets
Some in Methods, some in Results
– Majority of reuses cited in Methods
– Sharing cited roughly 50/50 between Methods and Results
Crediting self before others
– Bibliographic citations not given or only for same author
– Give article citation for self, but not accession; accession for others but not their article
Disparate citation formats within a single paper
Good Citation
“Previously published sequence data were used for V. velella 18S (Collins, 2002, GenBank AF358087), P. porpita 18S (Collins, 2002, AF358086), Staurocladia wellingtoni 18S (Collins, 2002, AF358084), S. wellingtoni 16S (Schuchert, 2005, AJ580934), Hydra circumcincta 18S (Medina et al., 2001, AF358080), and H. vulgaris 16S (Pont-Kingdon et al., 2000, AF100773).”
Taxon
Gene region
Author-Year
– accompanying bibliographic citation
GenBank Accession
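Because the quoted citation follows a regular "(Author, Year, Accession)" shape, its parts can be pulled out mechanically. The snippet below is a rough sketch under stated assumptions: the regular expression, and in particular the accession shape of one or two capital letters followed by five or six digits, is only fitted to the accessions quoted above and is not a full GenBank accession specification.

```python
import re

# Assumed pattern: "(Author, Year, [GenBank] Accession)". The accession shape
# is an assumption that matches the quoted examples, nothing more general.
CITATION = re.compile(
    r"\(([^(),]+),\s*(\d{4}),\s*(?:GenBank\s+)?([A-Z]{1,2}\d{5,6})\)"
)

excerpt = (
    "Previously published sequence data were used for V. velella 18S "
    "(Collins, 2002, GenBank AF358087), P. porpita 18S (Collins, 2002, "
    "AF358086), Staurocladia wellingtoni 18S (Collins, 2002, AF358084), "
    "S. wellingtoni 16S (Schuchert, 2005, AJ580934), Hydra circumcincta 18S "
    "(Medina et al., 2001, AF358080), and H. vulgaris 16S "
    "(Pont-Kingdon et al., 2000, AF100773)."
)

# Each match yields the author string, the year, and the accession number.
citations = [
    {"author": author.strip(), "year": int(year), "accession": accession}
    for author, year, accession in CITATION.findall(excerpt)
]

for c in citations:
    print(c["author"], c["year"], c["accession"])
```

A citation that omits the accession, or gives it only in a table, simply produces no match here, which is the practical sense in which such citations are harder to retrace.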
Bad Citation
Incomplete
– “The sequences, which were all produced in our previous studies (Aceto et al. 1999; Cozzolino et al. 2001) and are available in GenBank”
– Usually missing accession, sometimes author and depository
– Sometimes the info is buried in tables or not given for large compilations
Unclear
– “During annual aerial surveys, observers sketch the extent of defoliation from the air on paper or digital maps (Ciesla 2000) that are then compiled as a series of polygons in a geographical information system (GIS) (Liebhold et al. 1997).”
– Who is the original data author?
    – Are these theoretical, methodological or data citations?
    – Bibliographic citations occasionally shed light
– Where is the data stored?
What is a good citation?
Data easily retraceable
Proper credit given
Criteria
– Depository mentioned in text
– Accession mentioned in text
– Author credit given in Bibliography
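The three criteria can be treated as a simple checklist. The sketch below is a hypothetical helper, not the coding scheme used in the study: the function name and argument names are assumptions, and it deliberately reports only which criteria are missing, since the slides do not state how the Good, Ok, and Bad ratings were assigned.

```python
CRITERIA = (
    "depository mentioned in text",
    "accession mentioned in text",
    "author credit given in bibliography",
)

def missing_criteria(depository_in_text, accession_in_text, author_in_bibliography):
    """Return the criteria a citation fails; an empty list means all three are met."""
    observed = (depository_in_text, accession_in_text, author_in_bibliography)
    return [name for name, met in zip(CRITERIA, observed) if not met]

# A citation that names the depository and credits the author but gives no accession.
gaps = missing_criteria(True, False, True)
print(gaps)
print("meets all criteria" if not gaps else "incomplete")
```

Returning the list of gaps rather than a single label keeps the record useful if the criteria are later refined per journal or depository.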
Citations: Systematic Biology
[Chart: Systematic Biology citations rated Good, Ok, or Bad (chart fields: Count of Year, IdealCitationAll)]
Citations: American Naturalist
[Chart: American Naturalist citations rated Good, Ok, or Bad (chart fields: Count of Year, IdealCitationAll)]
Journal Comparison: Snapshot
Data Reuse Data Sharing
Systematic Biology
~ Frequent use of Genbank
~ Occasional use of Treebase
~ Often post to Treebase, but often unclear about GA vs. PT
~ Internal: difficult (no unique accession [generic URL]; not accessible pre-2008)
American Naturalist
~ Varied data (biological)
~ Often extracted from literature or used to validate a model
~ Occasional sharing: Dryad, Treebase, Genbank, internal
Molecular Ecology
~ Frequent use of Genbank, but steadily drops off after 2009
~ Some morphological data matrices
~ Posting to Genbank, but alternatively given in Methods and Results
~ Level of accessibility varies widely
Ecology
~ Minor datasets
~ Extensive datasets rarely shared
~ Ecological Archives (accessible but used for excess figures and results)
Journal Comparison: % Reuse and Sharing in 2010
Journal     Data Reuse   Data Sharing   "Good" Citations
AmNat       47%          13%            0%
Ecology     48%          5%             0%
MolecEco    90%          70%            30%
SysBio      86%          62%            14%
Percent Reuse over time
Year        AmNat   SysBio
2005        20%     73%
2006        20%     50%
2007        24%     67%
2008        32%     75%
2009        28%     64%
2010        47%     83%
All Years   27%     68%
Depository: Systematic Biology
[Chart: depositories for Systematic Biology data. Categories: GenBank, TreeBASE, Database, URL, Extraction, Not Indicated, Other]
Depository: American Naturalist
[Chart: depositories for American Naturalist data. Categories: GenBank, TreeBASE, Database, URL, Extraction, Not Indicated, Other]
Data Types: Systematic Biology
[Chart: data types in Systematic Biology. Categories: Gene Sequence, Organism Biology, Phylogenetic Tree, G.I.S., Unclassified]
Data Types: American Naturalist
[Chart: data types in American Naturalist. Categories: Gene Sequence, Organism Biology, Phylogenetic Tree, G.I.S., Unclassified]
Back to the big picture
Inform journals and depositories about current practices vs. policy
Best Practices recommendations
Continued research on trends in data citation
Suggested Best Practices
Accession numbers and Authors of each dataset (reused and shared) given in the Methods or Supplementary Table referenced in the Methods
– Authors not charged extra page/online fees
– Authors allowed to exceed Reference limit to credit data
Editorial enforcement
– Checklist
Internal Depositories made more accessible
– Usable formats
– Unique and Stable URLs
Long Term Best Practices
Separate Supplementary Data Section
– Example: Molecular Ecology
    – SysBio added a separate section but it is defunct
    – AmNat has an “Online enhancements” header
– For both internal and external deposits
– Distinguish from “data-dump” (extra figures, outputs)
– Accompanying References section
    – Unlimited length
DATA cited, in addition to publication
– Could combine into a new reference type: Author. Year. Title. Journal. Pages. Depository. Accession.
Track on par with publications in ISI, etc.
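The combined reference type proposed above (Author. Year. Title. Journal. Pages. Depository. Accession.) is easy to generate from structured fields. The following is a minimal sketch: the function name `format_data_reference` is an assumption, and the example values are placeholders rather than a real article or accession.

```python
def format_data_reference(author, year, title, journal, pages, depository, accession):
    """Join the seven fields in the proposed order, each terminated by a period."""
    parts = [author, str(year), title, journal, pages, depository, accession]
    # Strip any trailing period first so "et al." does not end up as "et al..".
    return " ".join(part.rstrip(".") + "." for part in parts)

# Placeholder values only; this is not a real article or accession number.
reference = format_data_reference(
    "Doe J", 2010, "Example title", "Example Journal", "1-10", "GenBank", "XX000000"
)
print(reference)  # Doe J. 2010. Example title. Example Journal. 1-10. GenBank. XX000000.
```

Placing the depository and accession at the end leaves the familiar article citation intact while making the dataset itself retraceable.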
Continued Research
Snapshot and time series of Molecular Ecology
– Possibly Ecology if time permits
– Alternative (suggestions please!): Just snapshots
Trends over time
– Has reuse and sharing increased?
– Have citation practices improved over time?
    – Is this influenced by journal/depository recommendations on citations?
Correlation with influential factors
– Is there more data reuse in articles that are also open access or share data?
– Are certain dataset types or article disciplines more inclined to reuse/share data?
Data shared vs. data produced
Sync with Journal/Depository Metadata (Nic) and Search Findings (Valerie)
– Refine “Good” citation criteria
    – Journal and depository specific
Additional Exploration
Track the cited or shared datasets
Look at supplemental data alone
– Internal (journal repositories)
    – Additional data not cited in text?
    – Data dumps?
– Ease of access
    – Accuracy of accession numbers
Actual data reusability
– Method/processing metadata
– File format
Software/model reuse and sharing
– R-packages, GUIs
– Encouraged by American Naturalist
Databases
– Independent databases vs. depositories
    – % utilized out of available
– Caching/stability options, linking metadata to depositories
Final products
Reports to requisite journals/depositories
Potential Manuscripts
– Journal Comparison: Citation Practices
– TreeBASE: Shared vs. Produced
– Best Practices recommendations
Shared dataset!
Thanks for listening!
Questions? Suggestions?
– Unresolved problems
– Continued Research
Hurdles
Determining extracted fields
Coding data now vs. later
In light of data/citation policies….
Compare the “performance” of SysBio and AmNat against their own depository and journal policies (do they meet the requirements?), or state this in the future research section
OWW: Nic – do “editor” instructions or other sections of policy indicate how data/citation policies are enforced?