Managing data responsibly to enable research integrity


Heather Coates | Digital Scholarship & Data Management Librarian
http://ulib.iupui.edu/digitalscholarship/datasupport/

Introduction to Research Ethics | Quaid G504 (Fall 2016)

What is research integrity?

security

privacy

trust

honesty

accuracy

efficiency

objectivity

personal responsibility

ownership

stewardship

governance

Why does RDM matter?

RDM as a component of RCR

Roles & Responsibilities

Practical RDM

WHY DOES RDM MATTER?

The value of data increases with their use.

– Paul Uhlir

The World of Data Around Us

[Chart: information created vs. available storage worldwide, in petabytes, 2005–2010; the growing gap represents transient information, or unfilled demand for storage. Source: John Gantz, IDC Corporation, The Expanding Digital Universe.]

The World of Data Around Us: Data Loss

• Natural disaster
• Facilities infrastructure failure
• Storage failure
• Server hardware/software failure
• Application software failure
• External dependencies (e.g., PKI failure)
• Format obsolescence
• Legal encumbrance
• Human error
• Malicious attack by human or automated agents
• Loss of staffing competencies
• Loss of institutional commitment
• Loss of financial stability
• Changes in user expectations and requirements

[CC images by Sharyn Morrow and momboleum on Flickr]

Poor Data Management Affects Everyone

“MEDICARE PAYMENT ERRORS NEAR $20B” (CNN) December 2004

Miscoding and billing errors from doctors and hospitals totaled $20 billion in FY2003, a 9.3% error rate. The error rate measured claims that were paid despite being medically unnecessary, inadequately documented, or improperly coded. In some instances, Medicare asked health care providers for medical records to back up their claims and got no response. The survey did not document instances of alleged fraud. This error rate was actually an improvement over the previous fiscal year (9.8%).

“AUDIT: JUSTICE STATS ON ANTI-TERROR CASES FLAWED” (AP) February 2007

The Justice Department Inspector General found that only two of the 26 sets of data on anti-terrorism cases were accurate. The Justice Department uses these statistics to argue for its budget. The Inspector General said the inaccuracies “appear to be the result of decentralized and haphazard methods of collections … and do not appear to be intentional.”

“OOPS! TECH ERROR WIPES OUT ALASKA INFO” (AP) March 2007

A technician managed to delete both the data and the backup for the $38 billion Alaska oil revenue fund – money received by residents of the state. Correcting the error cost the state an additional $220,700, which of course was deducted from the payments to Alaska residents.

"Good data management practice allows reliable verification of results and permits new and innovative research built on existing information. This is important if the full value of public investment in research is to be realized."

Managing and Sharing Data: Best Practices for Researchers, UK Data Archive

Benefits: Good Data Practices & Open Data

• Open data addresses social justice issues

• Open data enhances social welfare

• Open data supports effective governance and policy making

• Open data grows the economy

• Open data improves the integrity of the scholarly record

• Open data facilitates the education and training of new generations

• Open data enables validation or replication to support published results

• Open data accelerates the pace of discovery

• Good data practices (GDP) increase the impact of your work when you share your data, code, and other products

• Good data practices improve the quality and consistency of the research data you produce (saving money)

• Good data practices improve the efficiency of your research (saving time)

Personal Experience

“Please forgive my paranoia about protocols, standards, and data review. I'm in the latter stages of a long career with USGS (30 years, and counting), and have experienced much. Experience is the knowledge you get just after you needed it.

Several times, I've seen colleagues called to court in order to testify about conditions they have observed. Without a strong tradition of constant review and approval of basic data, they would've been in deep trouble under cross-examination. Instead, they were able to produce field notes, data approval records, and the like, to back up their testimony.

It's one thing to be questioned by a college student who is working on a project for school. It's another entirely to be grilled by an attorney under oath with the media present.”

– Nelson Williams, Scientist, US Geological Survey

RDM AS A COMPONENT OF RCR

Concepts of Data Management

• Data ownership

• Data collection

• Data storage

• Data protection

• Data retention

• Data analysis

• Data sharing

• Data reporting

Steneck, 2004

DataONE Data Life Cycle

The purpose of data management planning is to ensure that research data produced by a project are high quality, well organized, thoroughly documented, preserved, and accessible so that the validity of the data can be determined at any time.

ORI Guidelines for Responsible Data Management

The goal of data management is to produce self-describing data sets.

DataONE Primer on Data Management

Why is good data management so challenging?

ambiguity effect

availability heuristic

confirmation bias

experimenter’s or expectation bias

framing effect

hindsight bias

neglect of probability

optimism bias

planning fallacy

well-traveled road effect

ROLES & RESPONSIBILITIES

Funder progress towards openness

• 1985: National Research Council
• 1999: Office of Mgmt & Budget, Circular A-110 revisions
• 2003: NIH Data Sharing Policy
• 2008: NIH Public Access Policy
• 2011: NSF DMP requirement
• 2012: NEH Office of Digital Humanities DMP requirement
• 2013: NSF biosketch change
• 2013: OSTP Memo on Public Access to Results of Federally-Funded Research
• 2014: OSTP Memo on Improving the Mgmt of & Access to Scientific Collections
• 2014: OMB Circular A-81 (Uniform Guidance) takes effect
• 2015: Federal funding agencies release plans responding to the 2013 OSTP memo
• 2016: Federal funding agency plans take effect (DMP requirement)

Funder Policies: DMP & data sharing

• Agency for Healthcare Research & Quality

• Centers for Disease Control & Prevention

• National Institutes of Health

• National Science Foundation

More agency policies at datasharing.sparcopen.org

Publisher Policies: Data availability

• DataDryad Publishers

• PLoS Journals

• Nature Publishing Group

• American Economic Review

• BioMedCentral

• JORD - Social Science Journals with a research data policy

• Data policies of Economic Journals

https://ulib.iupui.edu/digitalscholarship/datasupport/publisher_policies

Institutional Policies

• Vary greatly, with lots of gaps

• Distributed – address specific local or state requirements for specific types of data

• Often focus on institutional data rather than research data

• Do not provide practical guidance

• Do not distinguish between institutional and personal responsibilities

Roles/Responsibilities of project personnel

• On your own

– Fill in the team members responsible for key data activities

• In small groups (2-3 people), share and discuss

– What kind of training is provided for team members to complete these tasks accurately?

• Whole group discussion

– What barriers do you face in tracking roles and responsibilities?

– What barriers do you face in providing training?

Worksheet columns: Team Member Name | Project Role | Activity | Description

Project design [+ documentation]
• Determining the aims of the project, the methods used to achieve those aims, and identifying the products resulting from the project.
• Translating the aims of the project into measurable research questions or hypotheses.

Instrument/measure/data collection tool design [+ documentation]
• Creating tools that adapt the research questions or hypotheses into questions that can be addressed by discrete data points.
• Validating tools through external review or pilot testing.

Data collection [+ documentation]
• Conducting surveys, interviews, experiments, and other project procedures according to the protocol in order to generate data.

Data processing [+ documentation]: entry, proofing/cleaning, preparation for analysis
• Entering analog data into a spreadsheet or database. Documenting procedures, date, and person responsible.
• Checking data entry for accuracy and completeness. Documenting procedures, date, and person responsible.
• Checking data for missing data, errors, and outliers. Documenting procedures, date, and person responsible.
• Deciding what data to include/exclude. Documenting the decision-making process and criteria used.

Data analysis [+ documentation]
• Selecting the analytical tools to be used. Documenting the decision-making process and criteria used.
• Conducting data characterization and screening tests, running analyses, and generating results. Documenting the process and files generated.
• Deciding what data are relevant to the project aims and objectives. Documenting the decision-making process and criteria used.

Data reporting
• Creating summary tables, graphs, and other visuals to represent the data.
• Writing up the project details and relevant results in the package/format requested by the client, as specified by the deliverables agreed upon in the contract.

PRACTICAL RDM

Basic Principles: Good Research Data Practices

1. Have a plan & use it
2. Follow the 3-2-1 rule of data storage
3. Document
4. Be consistent; when you aren’t, document the deviation
5. Use common, standardized terminology
6. Monitor the quality of the data as it is being created
7. Report enough detail about your research so that others in your field can reproduce it and others outside your field can evaluate it
8. Be as open as possible
9. Think about how your research might be useful to others

1: Functional data management plans support teams

• A tool for planning all the key activities related to data before you have a messy pile of bits on your hands

• A working document that reflects how a study is conducted

• Communication device for the team

• Documents the team members and their roles

• Customized to address the issues most relevant to your research

1: Planning…learning from Good Clinical Data Management Practices

• Begin with the end in mind OR Produce report-ready outputs

• Plan, test, revise, plan, test, revise…implement

• Include all stakeholders in the design of the protocol, data collection tools, data management plan, etc.

• Document, document, document

– Specify documents required for reproducible research

– Facilitates clear communication and shared understanding throughout the project

– Specify roles and responsibilities from the beginning

2: Follow the 3-2-1 Rule

The accepted best practice for backups is the three-two-one rule: if you’re backing something up, you should have:

• At least three copies (in different places),

• in two different formats,

• with one of those copies off-site.
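To make the 3-2-1 rule concrete, here is a minimal Python sketch (not from the original slides) that checks whether a dataset file satisfies the rule; the file paths, location names, and the checksum comparison are assumptions you would adapt to your own storage setup.

```python
# Minimal sketch of a 3-2-1 backup check (hypothetical paths; adapt to your setup).
# Verifies that a file exists in at least three locations, that the copies match
# byte-for-byte (via SHA-256), and that at least one location is marked off-site.
import hashlib
from pathlib import Path

COPIES = {
    "workstation": Path("/home/me/project/data/survey_2016.csv"),   # local working copy
    "lab_server":  Path("/mnt/lab-share/project/survey_2016.csv"),  # second medium
    "cloud":       Path("/mnt/cloud-sync/project/survey_2016.csv"), # off-site copy
}
OFFSITE = {"cloud"}

def sha256(path: Path) -> str:
    """Return the SHA-256 digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

present = {name: p for name, p in COPIES.items() if p.exists()}
digests = {name: sha256(p) for name, p in present.items()}

assert len(present) >= 3, "Fewer than three copies exist"
assert len(set(digests.values())) == 1, "Copies differ -- investigate before trusting backups"
assert OFFSITE & present.keys(), "No off-site copy found"
print("3-2-1 check passed:", ", ".join(sorted(present)))
```

A check like this can be run on a schedule so that silent backup failures are caught while the data are still recoverable.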

3: Document: How much?

More than you think you will need BUT less than everything

Information Entropy (from Michener et al., 1997)

[Chart: the detail available about a dataset declines over time after the data are developed. Specific details about problems with individual items or specific dates are lost relatively rapidly; general details about datasets are lost through time; accident or technology change may make data unusable; retirement or career change makes access to “mental storage” difficult or unlikely; loss of the data developer leads to loss of the remaining information.]

3: Document, document, document

Documentation should capture the crucial details needed for post-publication peer review or validation of results.

• Study: research questions/aims, IRB protocol, informed consents/authorizations, etc.

• Data collection instruments or tools OR data sources

• Data collection process or workflow

• Can take many forms, but should be consistent with standards or norms of practice for your field (e.g., data dictionary, data model, codebook, readme.txt, lab notebook)

3A: Know what you have - Data Inventory

• On your own

– Fill in as much of the data inventory as you can

• In small groups (2-3 people), share and discuss

– Benefits of knowing exactly what data you have?

– How hard would it be to complete this fully and accurately?

• Whole group discussion

– How might this be helpful throughout various phases of the project?

– How might it be helpful to have an inventory for completed and active projects?

3A: Know what you have - Data Inventory Example

• Funding source

• Program or initiative

• Project title

• PI First Name

• PI Surname

• Other Researchers/Data Contacts

• Project Start Date dd-mm-yyyy

• Project End Date dd-mm-yyyy

• New datasets created?

• How many datasets created

• Data location(s)

• Dataset Type (qualitative, quantitative, mixed methods, model)

• Sharing data?
– Deposit location?
– Licensing?
– Embargo?

http://www.data-archive.ac.uk/create-manage/strategies-for-centres/data-inventory
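One lightweight way to keep such an inventory is a spreadsheet or CSV file with one row per dataset. The Python sketch below is illustrative only; the column names simply mirror the example fields above, and the output file name is made up.

```python
# Illustrative sketch: write an empty data-inventory template as a CSV file,
# with columns that mirror the example fields listed above.
import csv

COLUMNS = [
    "funding_source", "program_or_initiative", "project_title",
    "pi_first_name", "pi_surname", "other_researchers_data_contacts",
    "project_start_date_dd-mm-yyyy", "project_end_date_dd-mm-yyyy",
    "new_datasets_created", "number_of_datasets", "data_locations",
    "dataset_type",  # qualitative, quantitative, mixed methods, model
    "sharing_data", "deposit_location", "licensing", "embargo",
]

with open("data_inventory_template.csv", "w", newline="") as f:
    csv.writer(f).writerow(COLUMNS)
print("Wrote data_inventory_template.csv with", len(COLUMNS), "columns")
```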

3B: Documentation Strategies

• Lab notebooks (print or electronic)

• Codebooks

• Data Dictionaries

• Procedures Manuals

• Protocols

• Readme.txt
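Any of these strategies can start small. As one illustration, the sketch below writes a skeleton readme.txt; the headings are generic suggestions rather than a required standard, so adapt them to the norms of your field.

```python
# Illustrative sketch: generate a skeleton readme.txt to document a dataset.
# The headings are generic suggestions, not a required standard.
README_TEMPLATE = """\
PROJECT TITLE:
PRINCIPAL INVESTIGATOR / CONTACT:
DATE(S) OF DATA COLLECTION:
DESCRIPTION OF THE DATA (who, what, when, where, how, why):
FILE LIST AND NAMING CONVENTION:
VARIABLE DEFINITIONS / CODEBOOK LOCATION:
INSTRUMENTS, PROTOCOLS, AND PROCESSING STEPS:
MISSING DATA CODES AND KNOWN ISSUES:
LICENSE / ACCESS AND SHARING CONDITIONS:
"""

with open("readme.txt", "w") as f:
    f.write(README_TEMPLATE)
print("Wrote readme.txt skeleton")
```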

3C: Structured documentation [metadata] is crucial for discovery, reuse, and interoperability

• Metadata describes the who, what, when, where, how, why of the data

• Metadata = documentation for machines (standardized, structured)

• Purpose is to enable evaluation, discovery, organization, management, re-use, authority/identification, and preservation

• Standards are commonly agreed upon terms and definitions in a structured format

• Good documentation builds trust in your data – provenance, data integrity, transparency, audit trail, etc.
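As an illustration of “documentation for machines,” the sketch below builds a minimal structured metadata record and serializes it to JSON; the field names loosely echo Dublin Core-style elements, and the dataset, creator, and identifier values are placeholders rather than real records or a specific required standard.

```python
# Minimal sketch of a structured, machine-readable metadata record.
# Field names loosely echo Dublin Core-style elements; all values are placeholders.
import json

record = {
    "title": "Survey of study habits, Fall 2016",     # hypothetical dataset
    "creator": ["Surname, Given Name"],
    "date_created": "2016-11-15",                     # when (ISO 8601)
    "coverage": "Indianapolis, IN, USA",              # where
    "methodology": "Online survey, convenience sample (see protocol v2)",  # how
    "purpose": "Course project for a research ethics class",              # why
    "format": "text/csv",
    "rights": "CC BY 4.0",
    "identifier": "doi:10.xxxx/example",              # placeholder, not a real DOI
}

with open("dataset_metadata.json", "w") as f:
    json.dump(record, f, indent=2)
print(json.dumps(record, indent=2))
```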

4: Be consistent

• We’re human – recognize the challenge

• Prevention – design research instruments & processes to prevent mistakes

• Pilot everything to identify potential problem areas

• When you aren’t, document the deviation

• Train your project personnel to be consistent & monitor performance

• Do internal audits, quality checks, data screening periodically to detect inconsistencies

5: Use common, standardized terminology

• For things/concepts

– Diagnoses

– Species/cell lines

– Locations

– Variable names

– Samples & materials

• For formats, too

– Dates

– Codes

– Identifiers
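A small normalization step can enforce these conventions at data entry or during cleaning. In the sketch below, the accepted date formats and the tiny diagnosis vocabulary are made-up assumptions for illustration; a real project would use the standard terminologies of its field.

```python
# Illustrative sketch: normalize hand-entered values to shared conventions.
# The date formats and the tiny controlled vocabulary below are assumptions
# made up for this example.
from datetime import datetime

KNOWN_DATE_FORMATS = ["%m/%d/%Y", "%d-%m-%Y", "%Y-%m-%d", "%B %d, %Y"]

def to_iso8601(raw: str) -> str:
    """Convert a date written in any known format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

# A tiny, made-up controlled vocabulary mapping free text to project codes.
DIAGNOSIS_CODES = {"type 2 diabetes": "T2D", "type ii diabetes": "T2D", "hypertension": "HTN"}

print(to_iso8601("3/07/2016"))                      # -> 2016-03-07
print(DIAGNOSIS_CODES["Type II Diabetes".lower()])  # -> T2D
```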

6: Monitor the quality of the data

• Don’t wait until data collection is over

• Quality Assurance

• Quality Control

• Build it into the project timeline

• Make it someone’s job

• Document what you find and how it was corrected
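A lightweight way to build this monitoring into the workflow is a screening script run each time new records are entered. The sketch below is illustrative: the file name, columns, and allowed ranges are assumptions, and in practice the findings would be logged as part of the project documentation rather than just printed.

```python
# Illustrative QC sketch: screen incoming data for missing and out-of-range
# values each time new records are entered. The file name, columns, and
# allowed ranges are made-up assumptions.
import csv

ALLOWED_RANGES = {"age": (18, 100), "systolic_bp": (70, 250)}

problems = []
with open("survey_2016.csv", newline="") as f:
    for i, row in enumerate(csv.DictReader(f), start=2):  # row 1 is the header
        for col, (lo, hi) in ALLOWED_RANGES.items():
            value = row.get(col, "").strip()
            if not value:
                problems.append(f"row {i}: {col} is missing")
            else:
                try:
                    if not lo <= float(value) <= hi:
                        problems.append(f"row {i}: {col}={value} outside {lo}-{hi}")
                except ValueError:
                    problems.append(f"row {i}: {col}={value!r} is not numeric")

# Document what you find and how it was corrected (printed here; log to a file in practice).
print(f"{len(problems)} potential problems found")
for p in problems:
    print(" -", p)
```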

7: Better reporting

• Report enough detail about your research so that others in your field can reproduce it and others outside your field can evaluate it

• This includes ALL aspects of the study: study design, data collection methods, sampling, population, data screening & processing, QA/QC procedures, analytical procedures, visualization procedures, etc.

I know you can’t fit all of this into a journal article, but you can write up pieces of it as the study is conducted to support publications and reporting to the funder. Plus, it makes writing those products much easier.

8: Be as open as possible (Open Science)

Open isn’t an all-or-nothing choice:
• Study registration
• Open notebook science
• Data sharing (raw data, processed data, data supporting published results)
• Open Data
• Open Access publishing (deposit a pre/post print in a repository, choose an OA journal, choose a Gold OA option)

Want to learn more? See the Center for Open Science and “Why Open Research?”

9: Think ahead

How might your research be useful to others? To yourself in 5/15/50 years? To your students or trainees? To historians?

• Consider what you will forget in that time and document it

• Consider whether your data will be useful beyond the life of the project. If so, put it somewhere safe like an institutional or subject repository to share it and ensure long-term access.

Stakeholders in (Academic) RDM

• Research Administration

• Research Compliance

• University IT

• University Libraries

• University Archives

• Consortia (e.g., CIC)

• NIH CTSA Hubs/NCATS

• Research & Technology Corporation: http://iurtc.iu.edu/

Case studies: Discussion

1. http://retractionwatch.com/2015/11/05/got-the-blues-you-can-still-see-blue-after-all-paper-on-sadness-and-color-perception-retracted/

2. https://ori.hhs.gov/content/case-summary-anderson-david

Resources

1. Uhlir, P. F. (2010). Information Gulags, Intellectual Straightjackets, and Memory Holes. Data Science Journal, 9, ES1-ES5.
2. DataONE Education Module: Data Management. DataONE. Retrieved December 2013, from http://www.dataone.org/sites/all/documents/L01_DataManagement.pptx
3. Scientists are hoarding data and it’s ruining medical research: http://www.buzzfeed.com/bengoldacre/deworming-trials
4. Losing data from the National Centre for E-Social Science (NCESS) Portal: http://datastories.jiscinvolve.org/wp/2015/08/10/losing-data-from-the-national-centre-for-e-social-science-ncess-portal/
5. Biomedical data sharing and reuse: Attitudes and practices of clinical and scientific research staff: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0129506
6. Over half of psychology studies fail reproducibility test: http://www.nature.com/news/over-half-of-psychology-studies-fail-reproducibility-test-1.18248
7. Value of Open Data Sharing: https://www.fosteropenscience.eu/sites/default/files/pdf/2536.pdf
8. Michener, W. K., Brunt, J. W., Helly, J. J., Kirchner, T. B., & Stafford, S. G. (1997). Nongeospatial metadata for the ecological sciences. Ecological Applications, 7(1), 330-342.
9. Society for Clinical Data Management. (2013). Good Clinical Data Management Practices. Washington, D.C.
10. UK Data Archive. (2015). Prepare and manage data. From http://ukdataservice.ac.uk/manage-data