What have Scientists Planned for Data Sharing and Reuse? A Content Analysis of NSF Awardees’ Data Management Plans
Renata Curty, Youngseek Kim & Dr. Jian Qin
Baltimore, 4-5 April 2013
Motivation
While the NSF mandate gives researchers plenty of flexibility to define their own DMPs, and many academic institutions provide DMP writing support, little is known about how scientists describe their strategies in their DMPs.
Study Design
Online Survey: 20 questions
Target Population: NSF awardees from January 18, 2011 to November 5, 2012 (Standard Grants), 16,065 in total
Random Sample: 1,606 cases
Pilot Study: 100 awardees (survey reformulation)
Final Deployment: 966 awardees, 169 responses (17.5%), and 68 DMPs
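The sampling fraction and response rate follow directly from the counts on this slide. The short sketch below only redoes that arithmetic; the variable names and the `pct` helper are illustrative and not part of the study.

```python
# Counts taken from the Study Design slide.
population = 16065     # NSF awardees with Standard Grants in the period
random_sample = 1606   # randomly sampled cases
invited = 966          # awardees reached in the final deployment
responses = 169        # completed survey responses


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal (hypothetical helper)."""
    return round(100 * part / whole, 1)


sampling_fraction = pct(random_sample, population)
response_rate = pct(responses, invited)

print(f"Sampling fraction: {sampling_fraction}%")
print(f"Response rate: {response_rate}%")
```

The response rate computed this way agrees with the 17.5% reported on the slide, and the random sample is roughly a tenth of the target population.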
Awards Info
[Charts: NSF Directorate and Amount Awarded (N = 166 each). Directorates: BIO, CISE, EHR, ENG, GEO, MPS, SBE. Percentage labels, in the order extracted: 10%, 16%, 12%, 18%, 16%, 14%, 13%.]
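Awardees Info
[Charts: Age and Organization Type (N = 150 and N = 151).]
Age groups: 25-34, 35-44, 45-54, 55-64, 65+. Percentage labels, in the order extracted: 7%, 41%, 26%, 19%, 7%.
Organization type: Academia 93%, Commercial 3%, Non-profit 3%, Government 1%.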
Awardees Info
Position in Academia
[Charts: N = 143 and N = 138.]
Assistant Professor: 23%
Associate Professor: 28%
Full Professor: 40%
Researcher: 6.77%
Others: Dean (3), Professor Emeritus (1), Professor of Practice (1), Lecturer/Instructor (1), Post-Doctoral Fellow (1), Emeritus Senior Scientist, Director, Expert Consultant, Administrative Faculty Position, Chair.
Tenured: 62%
On Tenure Track: 25%
Non-Tenure Track: 11%
Retired: 2%
Geographical Distribution
[Map created with Google Fusion Tables; N = 109.]
[Chart: level of agreement with three statements about DMPs, on a 7-point scale from Strongly disagree to Strongly agree.]

Statement                                                          N     m      SD
DMP is difficult to execute                                        167   3.79   1.51
Writing a DMP for NSF proposal is a challenging task               167   3.89   1.45
DMP is important to formalize data sharing practices in science    166   4.93   1.62

Responses (%)                 Difficult to execute   Challenging task   Important to formalize
Strongly disagree             4.79                   0.40               3.01
Disagree                      22.75                  21.56              10.24
Somewhat disagree             11.38                  13.77              6.63
Neither agree nor disagree    25.75                  25.75              10.84
Somewhat agree                23.35                  23.35              22.89
Agree                         8.98                   10.18              33.13
Strongly agree                2.99                   2.99               13.25
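As a rough consistency check, a reported mean can be recomputed from a response distribution. The sketch below assumes the scale is coded 1 (Strongly disagree) to 7 (Strongly agree), and that the percentages 3.01%, 10.24%, 6.63%, 10.84%, 22.89%, 33.13%, 13.25% belong, in scale order, to the statement reported with N=166 and m=4.93. Both the coding and that pairing are assumptions read off the chart labels, not something the slides state.

```python
import math

# Assumption: these seven chart labels are the distribution for the
# statement reported with N=166, listed from Strongly disagree to Strongly agree.
N = 166
percentages = [3.01, 10.24, 6.63, 10.84, 22.89, 33.13, 13.25]

# Assumption: responses are coded 1..7 in scale order.
scores = range(1, 8)

# Convert each percentage back to a whole number of respondents.
counts = [round(p * N / 100) for p in percentages]
total = sum(counts)

# Weighted mean and population standard deviation of the coded responses.
mean = sum(s * c for s, c in zip(scores, counts)) / total
variance = sum(c * (s - mean) ** 2 for s, c in zip(scores, counts)) / total
sd = math.sqrt(variance)

print(f"counts={counts}, total={total}")
print(f"mean={mean:.2f}, sd={sd:.2f}")
```

Under these assumptions the recovered counts add up to 166, the mean rounds to 4.93, and the population standard deviation comes out at about 1.62, in line with the values on the slide.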
Types of Data
3D Models: 13.01% (19)
Audio Files: 12.33% (18)
Curriculum Materials: 21.23% (31)
Data Models: 27.40% (40)
Field Notes: 26.03% (38)
Experimental Data: 63.70% (93)
Images: 36.99% (54)
Interview Transcripts: 17.12% (25)
Patient Records: 0.68% (1)
Samples: 20.55% (30)
Software: 35.62% (52)
Spreadsheets: 40.41% (59)
Video Files: 21.23% (31)
Others: Computational Models, Surveys, DNA Sequences, Computer Codes, Crowdsourcing Data (Reviews)

Documentation of Data
Will follow:
46% - Disciplinary practices
37% - Research project’s needs
17% - Institutional recommendations/guidelines
N = 158
Challenges Encountered
N = 169
None: 26%
Lack of guidance from my institution: 29%
Lack of guidance from NSF: 36%
Appropriate infrastructure to archive/preserve data: 41%
Level of granularity of data: 25%
Data Description & Documentation: 30%
Which stage(s) of research to share the data: 25%
Others:
Some projects do not generate data
Conflict between DMP requirements and IRB requirements regarding social and behavioral research data
Conflicts between intellectual property and data protection
Long-term preservation issues
Conflicts between individual/group and institutional strategies
Data Access & Availability
N = 167
Others: “Publications”, “Available to NSF only”
Open: 45%
Available with some restrictions: 51%
Restricted: 5%
By email request: 45.52% (61)
Personal website: 17.91% (24)
Research Group/Project Website: 51.49% (69)
Institutional Repository: 20.15% (27)
Disciplinary Repository: 32.84% (44)
N = 164
Barriers for Data Reuse
Reuse Issues - Privacy, Anonymity & Confidentiality
“IRB restrictions on ability to share even deidentified data. Concern that sharing even deidentified data will discourage participation in the study.”
“For myself, no. But for others to use my data, yes: for qualitative data, under IRB requirements for the protection of human subjects around confidentiality and anonymity, DMPs are nearly impossible to implement without perhaps some kind of temporal restriction on them (like, ‘This archive can only be opened in 20 - 30 - 40 years’ or something like that)”
“The project involves human subject; so protections have to be put in place that may limit reuse applications in the future.”
“HIPAA [Health Insurance Portability and Accountability Act] issues - obtaining self reporting data on human subjects.”
Reuse Issues - Context, Time Factor & Documentation
“My past data was collected on a unique system built specifically for the research project. Need lots of context to reuse the data.”
“The only problems I see is that data can be taken out of context in a way that produces results that might not be correct.”
“Data is specific to testing scenarios. The insight gleaned from our experimental data is of more importance than the data itself.”
“My data is for specific purposes and it is hard to conceive of how someone would use it for something else/different. Even with a significant amount of metadata it would be difficult for someone to know all the circumstances under which the data was collected and why it was collected.”
“All scientific data is collected in particular context. Mechanisms that facilitate the description of that context are lacking. The creation of metadata that provides this information is a cumbersome, boring task and there are few resources available to ease the burden.”
Reuse Issues - Format, Tools, Infrastructure Interoperability & Standards
“Systems are always changing...It would be best if we could upload data to NSF so that it will be publicly available in the same way NIST [National Institute of Standards and Technology] publishes data.”
“Our raw data formats are extremely large, and need to be compressed into reduced, on-line archives for sharing. It is not possible for me as an individual PI to archive the raw data for others to examine.”
“My data is generally related to large software artifacts, so using it could involve quite a bit of work to get those artifacts running. This is something that I explicitly try to come up with solutions for in my DMPs.”
“Until NSF provides a free national repository for data archiving, we will not make progress in this area. If such an archive was available, it would be sensible to require researchers to place data there at the end of a grant and would allow other researchers to take advantage of it in a practical way.”
DMPs – Preliminary Content Analysis
• Coding Scheme
Used both deductive and inductive approaches: 35 codes
Deductive: NSF DMP Policy and University of Virginia's Guideline
Inductive: emerged from DMP statements
• Data Analysis Procedure
A total of 766 utterances were identified
642 unique utterances
DMPs’ Content
<Wordle Cloud Generated Based on Numbers of Each Code across the 68 DMPs>
Coding Scheme
Types of Data
• What to Generate
• What Data Types
• How to Create
• Where to Get Existing Data
Metadata Standards
• Data Format
• Metadata Form
• How to Create
• Which Metadata Standard
• Contextual Details Needed
• Discoverability of the Data
Data Access & Sharing Process
• When Available
• How Available
• What Available
• Process for Gaining Access
• How Long Retain the Right
• Embargo Period
• Ethical/Privacy Issues
• Compliance with IRB Protocol
• Whose Intellectual Property
Data Archiving Plan
• Strategy for Archiving Data
• Which Repository
• Procedures for Long-Term Storage
• Data Preservation Period
• What Data Preserved for Long-Term
• Transformation Required
• Data Documentation
• Related Information
Data Reuse Plan
• Reusability of the Data
• Restrictions to Access
• Groups Interested In
• Foreseeable Uses/Users
Others
• Data Lifecycle
• Data Curation
• Budget
Types of Data
Codes                          Freq.   Examples
What to Generate               58      Geochemical Data, Physical Samples, Mathematica (programming) Code, Course Materials
What Data Types                37      Gene Sequences, Experimental Data, Interview Transcript, Video Recordings
How to Create Data             25      Experimental Setup, Field Observation, Simulation, Survey, Interviews
Where to Get Existing Data     13      Moore Laboratory of Zoology, ArcView/GIS Inventories, Prior Study’s Database
Metadata Standard
Codes                          Freq.   Examples
Data Format                    38      CSV file, TEMPO data file, XML format, SPSS file, plain text
Metadata Form                  31      ArcGIS Metadata file, XML-based standard file, GIS database file
How to Create Metadata         14      Use existing metadata standards, or develop their own metadata standards
Which Metadata Standard        15      Dublin Core, DNA Sequence Metadata, EML (Ecological Metadata Language)
Contextual Details Needed      10      All aspects of the development project documented, experimental procedure record
Data Discoverability           7       Searches Built into Library, Searchable through Project Website
Data Access & Sharing Process
Codes                          Freq.   Examples
When Available                 28      Post-Publication, Post-Project, After Data Collection
How Available                  37      Upon Request, Project Website, GMOD CHADO databases, Institutional Repository
What Available                 33      Original research data (genome assemblies), survey data, educational materials
Process for Gaining Access     25      Email Request, Material Transfer Agreement, Direct Access from Web or Repository
How Long Retain the Right      18      Withhold until Publication, Years after Project Ends, Years after Data Production
Embargo Period                 5       Years after data collection, Period for commercialization
Ethical/Privacy Issues         21      Private information is not available to the public
Compliance with IRB Protocol   13      IRB application submission for human subject research
Whose Intellectual Property    17      Property of the PI and Co-PIs, Institutions, Open-Access
Data Archiving
Codes                               Freq.   Examples
Strategy for Archiving Data         31      Hosted on the Web Servers at (university), ICPSR, disciplinary data repository
Which Repository                    55      Organization website, institutional or discipline data repository
Procedures for Long-Term Storage    33      Submitted to databanks including NCBI GEO, GenBank, DataONE, Dryad
Data Preservation Period            11      Minimum of five years post-grant funding, long-term preservation through disciplinary data repositories
What Data Preserved for Long-Term   7       All data and materials generated by this award, Genome Sequencing Data
Transformation Required             4       Keeping raw image data in its uncompressed form, transferred to IRI format
Data Documentation Submitted        11      Contextual details about experimental procedures, all aspects of the development project
Related Information Submitted       3       Metadata files, proposed study information, companion web page
Data Reuse Plan
Codes                      Freq.   Examples
Reusability of the Data    6       Descriptions about reusable methods (Used by a research community to follow-up)
Restrictions to Access     6       Access allowed for a certain group of researchers
Groups Interested In       8       Wider research community studying the Great Lakes, academic geography organizations, and geography teacher associations
Foreseeable Uses/Users     10      Available to engineers, clinicians, and medical researchers, sociologists and psychologists working in relevant sub-fields
Others
Codes             Freq.   Examples
Data Lifecycle    1       Application of the Life Cycle Inventory databases
Data Curation     4       Curation (Consortiums and Partnerships)
Budget            9       Institution will absorb costs, no incremental costs, marginal costs
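To compare how much attention each part of the coding scheme received, the frequencies in the tables above can be summed per category. The sketch below is only a tally of the reported numbers; the dictionary layout and variable names are illustrative, not the authors' analysis procedure.

```python
# Code frequencies transcribed from the six tables above.
code_frequencies = {
    "Types of Data": {
        "What to Generate": 58, "What Data Types": 37,
        "How to Create Data": 25, "Where to Get Existing Data": 13,
    },
    "Metadata Standard": {
        "Data Format": 38, "Metadata Form": 31, "How to Create Metadata": 14,
        "Which Metadata Standard": 15, "Contextual Details Needed": 10,
        "Data Discoverability": 7,
    },
    "Data Access & Sharing Process": {
        "When Available": 28, "How Available": 37, "What Available": 33,
        "Process for Gaining Access": 25, "How Long Retain the Right": 18,
        "Embargo Period": 5, "Ethical/Privacy Issues": 21,
        "Compliance with IRB Protocol": 13, "Whose Intellectual Property": 17,
    },
    "Data Archiving": {
        "Strategy for Archiving Data": 31, "Which Repository": 55,
        "Procedures for Long-Term Storage": 33, "Data Preservation Period": 11,
        "What Data Preserved for Long-Term": 7, "Transformation Required": 4,
        "Data Documentation Submitted": 11, "Related Information Submitted": 3,
    },
    "Data Reuse Plan": {
        "Reusability of the Data": 6, "Restrictions to Access": 6,
        "Groups Interested In": 8, "Foreseeable Uses/Users": 10,
    },
    "Others": {"Data Lifecycle": 1, "Data Curation": 4, "Budget": 9},
}

# Sum the code frequencies within each category.
category_totals = {
    category: sum(codes.values()) for category, codes in code_frequencies.items()
}

# Rank categories from most to least frequently coded.
ranked = sorted(category_totals.items(), key=lambda item: item[1], reverse=True)

for category, total in ranked:
    print(f"{category}: {total}")
```

On these numbers, Data Access & Sharing Process is coded most often and Data Reuse Plan and Others least often.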
Data Available -
[Bar chart; bar values as labeled: 3, 3, 10, 3, 8, 1, 27, 13.]
Types of Data Repositories for Long-Term Archiving
[Bar chart. Categories: Disciplinary Repository, External/Commercial Storage, Institutional Repository, Internal/Institutional Storage, Journal Repository/Supplement, Lab/Organization Website, Not mentioned/Specified. Bar values, in the order extracted: 11, 4, 14, 11, 2, 13, 13.]
Some insights – DMPs’ Preliminary Analysis
More informal/personal data sharing procedures than formal/institutionalized data sharing and management plans
Most DMPs lack content on “Metadata Standard” and “Data Reuse Plan”
Few have plans for long-term archiving; plans and ideas about long-term use of their data are very vague
Many DMPs addressed data archiving in institutional repositories that do not exist yet but are expected to be created
A few DMPs mentioned that interview transcripts will be available, but without addressing IRB issues
Future Directions
Survey a larger number of Awardees
More exhaustive coding analysis and in-depth exploration of the DMPs’ content
Analysis of DMPs to identify patterns, common challenges and best practices across and within different disciplinary communities