Data Management for Research Aaron Collie, MSU Libraries Lisa Schmidt, University Archives


Introductions

Please tell us:
• Your name and department
• A brief description of your primary research area
• What do you consider to be your research data?

Optional:
• Experience managing research data?
• Experience writing a data management plan?

cc http://www.flickr.com/photos/quinnanya/

Agenda

• Introductions
  – Background
  – Definitions
  – Upfront Decisions
  – Data Sharing Impacts
• Fundamental Practices
  – File Organization
  – Data Documentation
  – Reliable Backup
• Data Lifecycle Strategy

Why are we here?

But why are we really here?

An Impetus: NSF recently released a mandate that all grant applications submitted on or after January 18, 2011 must include a supplemental “Data Management Plan.”

An Effect: This mandate from NSF has had a domino effect, and many funders now require or state guidelines for data management of grant-funded research.

A Challenge: Data management (and oftentimes research methods in general) is an area that has not traditionally received a full treatment in most graduate and doctoral curricula.

What is meant by “data management”?

Fundamental Practices
• File Organization
• Data Documentation
• Reliable Backups

Data Lifecycle
• Digital Sustainability
• Scholarly Communication
• Data Publishing
• Research Impact

• Effective January 18, 2011
• NSF will not evaluate any proposal missing a DMP
• May be up to two pages long
• PI may state that project will not generate data or samples
• DMP is reviewed as part of intellectual merit or broader impacts of application, or both
• Costs to implement DMP may be included in proposal’s budget

NSF’s Data Management Guidelines
• Policies for re-use, re-distribution, and creation of derivatives
• Plans for archiving data, samples, and other research outcomes, maintaining access
• Types of data, samples, physical collections, software generated
• Standards for data and metadata format and content
• Access and sharing policies, with stipulations for privacy, confidentiality, security, intellectual property, or other rights or requirements

Other Federal Policies

NASA “promotes the full and open sharing of all data”

“requires that data…be submitted to and archived by designated national data centers.”

“expects the timely release and sharing of final research data”

“IMLS encourages sharing of research data.”

“…should describe how the project team will manage and disseminate data generated by the project”

Upfront Decisions for Researchers
• What is the expected lifespan of the data?
• Besides the researcher(s) on the project, who else should be given access to the data?
• Does the dataset include any sensitive information?
• Who owns or controls the research data?
• Should any restrictions be placed on the dataset?
• How are the data stored and preserved?

Upfront Decisions for Researchers (cont.)
• How might the data be used, reused, and repurposed?
• How is the data described and organized?
• Who are the expected and potential audiences for the datasets?
• What publications or discoveries have resulted from the datasets?
• How should the data be made accessible?

Data Sharing Impacts
• Reinforces open scientific inquiry
• Encourages diversity of analysis and opinion
• Promotes new research, testing of new or alternative hypotheses and methods of analysis
• Supports studies on data collection methods and measurement

Cc http://www.flickr.com/photos/pinchof_10/

Data Sharing Impacts (cont.)

Facilitates education of new researchers

Enables exploration of topics not envisioned by initial investigators

Permits creation of new datasets by combining data from multiple sources


File Organization Practices: Overview

1. Create a file plan for your research project

2. Design a file naming convention that works for your project

3. Agree on a version control method to assist with file synchronization

4. Carefully choose file formats to maximize usefulness

“When I was a freshmen I named my assignments Paper Paperr Paperrr Paperrrr”-Undergrad

1. Create a file plan for your research project

File plan as a classification system
• Indexed – makes it easier to locate folders/files
• Primary subjects – main functions of research project
• Secondary subjects – more specific activities of project, including research data
• Tertiary subjects – limit by date or equivalent
  – File Name (naming conventions)

1. Create a file plan for your research project (cont.)

Example documentation of Directory Hierarchy: /[Project]/[Grant Number]/[Event]/[Date]

Example documentation of File Naming Convention: [investigator]_[method]_[descriptor]_[YYYYMMDD]_[version].[ext]
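The two templates above can be turned into a small helper so that every project member builds paths the same way. This is only a sketch: the function names, the example values (krillStudy, GRANT0000, cruise01), and the choice of YYYYMMDD for the [Date] folder are assumptions made for illustration, not part of the documented convention.

```python
from datetime import date
from pathlib import PurePosixPath


def build_directory(project, grant_number, event, day):
    """Return /[Project]/[Grant Number]/[Event]/[Date] as a POSIX-style path.

    Assumption: the [Date] folder is written as YYYYMMDD, matching the
    date style used in the file naming convention.
    """
    return PurePosixPath("/", project, grant_number, event, day.strftime("%Y%m%d"))


def build_file_name(investigator, method, descriptor, day, version, ext):
    """Return [investigator]_[method]_[descriptor]_[YYYYMMDD]_[version].[ext]."""
    return f"{investigator}_{method}_{descriptor}_{day:%Y%m%d}_{version}.{ext}"


# Hypothetical values, for illustration only.
collected = date(2011, 1, 17)
folder = build_directory("krillStudy", "GRANT0000", "cruise01", collected)
name = build_file_name("sharpeW", "backscatter", "krillMicrograph", collected, "v001", "tif")
print(folder / name)
```

Generating names from one function, rather than typing them by hand, keeps the order of the parts consistent across the project.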

2. Design a file naming convention that works for your project

Why file naming conventions?
• Enable better access/retrieval of files
• Create logical sequences for file sorting
• More easily identify what you’re searching for

• Meaningful but short (255 character limit)
• Descriptive while still making sense
• Capital letters or underscores differentiate between words
• Surname first followed by initials of first name
• More on handout

2. Design a file naming convention that works for your project (cont.)


This: sharpeW_krillMicrograph_backscatter3_20110117.tif
Not This: KrillData2011.tif

This: borgesJ_collocation_20080414.xml
Not This: Borges_Textbase.xml
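A naming convention is easier to keep when file names can be checked automatically. The sketch below tests a name against a few of the rules from these slides (255 character limit, surname followed by initials, a YYYYMMDD date). The function name and the exact rules it encodes are assumptions for illustration; a real project would adapt them to its own documented convention.

```python
import re
from datetime import datetime


def check_file_name(name):
    """Return a list of problems found in a file name (empty list = looks fine)."""
    problems = []
    if len(name) > 255:
        problems.append("longer than 255 characters")

    stem, dot, ext = name.rpartition(".")
    if not dot or not stem or not ext:
        return problems + ["missing file extension"]

    parts = stem.split("_")
    # Assumed rule: lowercase surname followed by uppercase initials, e.g. sharpeW.
    if not re.fullmatch(r"[a-z]+[A-Z]+", parts[0]):
        problems.append("does not start with surname followed by initials")

    # Look for an 8-digit part and confirm it is a real calendar date.
    dates = [part for part in parts if re.fullmatch(r"\d{8}", part)]
    if not dates:
        problems.append("no YYYYMMDD date")
    else:
        try:
            datetime.strptime(dates[0], "%Y%m%d")
        except ValueError:
            problems.append("date is not a valid calendar date")
    return problems


for candidate in ("sharpeW_krillMicrograph_backscatter3_20110117.tif", "KrillData2011.tif"):
    print(candidate, check_file_name(candidate))
```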

3. Agree on a version control method to assist with file synchronization

Version number of record is indicated in the file name with “v” followed by the version number

Letter “d” indicates draft

Examples of simple version control:
waltM_lakeLansing_fieldNotes_20091012_v002.doc
petersK_OrgChart2009_d001.svg
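Version suffixes like the ones in these examples can be incremented by a script instead of by hand. The sketch below assumes the suffix is always an underscore, then “v” or “d”, then three digits, placed directly before the extension; the function name is hypothetical.

```python
import re

# Assumed suffix shape: _v002 or _d001 directly before the file extension.
VERSION_SUFFIX = re.compile(r"_(?P<kind>[vd])(?P<number>\d{3})(?P<ext>\.[^.]+)$")


def next_version(file_name):
    """Return the file name with its version (or draft) number increased by one."""
    match = VERSION_SUFFIX.search(file_name)
    if match is None:
        raise ValueError(f"no version suffix found in {file_name!r}")
    number = int(match.group("number")) + 1
    base = file_name[:match.start()]
    return f"{base}_{match.group('kind')}{number:03d}{match.group('ext')}"


print(next_version("waltM_lakeLansing_fieldNotes_20091012_v002.doc"))
print(next_version("petersK_OrgChart2009_d001.svg"))
```

Saving each revision under a new version number, rather than overwriting, keeps earlier states of the file recoverable.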

4. Carefully choose file formats to maximize usefulness

• Non-proprietary
• Open, documented standard
• Common usage by research community
• Standard representation (ASCII, Unicode)
• Unencrypted
• Uncompressed

Documentation Practices: Overview

1. At minimum create a README file that you can use to document your project

2. Utilize standards for describing data including Metadata Standards

3. If applicable, use in-line code commentary to explain code

(cc) Will Scullin

1. At minimum create a README file that you can use to document your project

At minimum, store documentation in readme.txt file or equivalent, with data

Resource: http://libraries.mit.edu/guides/subjects/data-management/metadata.html
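One way to make sure a README exists for every project is to generate a skeleton when the project folder is created. The sketch below is a hypothetical template: the headings and the function name are assumptions for illustration, and a project should adjust them to whatever its funder or discipline expects.

```python
README_TEMPLATE = """\
Project title: {title}
Creator(s): {creators}
Funder / grant number: {funder}
Dates of data collection: {dates}

Methodology:
{methodology}

File naming convention:
{naming_convention}
"""


def render_readme(title, creators, funder, dates, methodology, naming_convention):
    """Fill in the README skeleton and return it as plain text."""
    return README_TEMPLATE.format(
        title=title,
        creators=", ".join(creators),
        funder=funder,
        dates=dates,
        methodology=methodology,
        naming_convention=naming_convention,
    )


# Hypothetical project details, for illustration only.
text = render_readme(
    title="Krill micrograph survey",
    creators=["Sharpe, W."],
    funder="(funder name and grant number)",
    dates="2011-01-17 to 2011-02-28",
    methodology="(how the data were collected and processed)",
    naming_convention="[investigator]_[method]_[descriptor]_[YYYYMMDD]_[version].[ext]",
)
print(text)
```

The returned text can then be saved as readme.txt in the same folder as the data it describes.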

2. Utilize standards for describing data including Metadata Standards

• “Data about data”
• Standardized way of describing data
• Explains who, what, where, when of data creation and methods of use
• Provides the essential tools for discovery, such as a bibliographic citation

2. Utilize standards for describing data including Metadata Standards (cont.)

Basic project metadata:

• Title • Language • File Formats

• Creator • Dates • File Structure

• Identifier • Location • Variable List

• Subject • Methodology • Code Lists

• Funders • Data Processing • Versions

• Rights • Sources • Checksums

• Access Information

• List of File Names
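Several of the items above (list of file names, file formats, checksums) can be collected by a script rather than typed by hand. The following sketch walks a data folder and builds a small metadata record. The function names and the dictionary keys are assumptions for illustration; SHA-256 is used here as one common checksum choice, not one the slides prescribe.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_metadata(title, creator, data_dir):
    """Return basic project metadata for every file under data_dir."""
    data_dir = Path(data_dir)
    files = sorted(p for p in data_dir.rglob("*") if p.is_file())
    names = [p.relative_to(data_dir).as_posix() for p in files]
    return {
        "title": title,
        "creator": creator,
        "file_names": names,
        "file_formats": sorted({p.suffix.lstrip(".").lower() for p in files if p.suffix}),
        "checksums": {name: sha256_of(p) for name, p in zip(names, files)},
    }


def to_json(metadata):
    """Serialize the record so it can be stored next to the data."""
    return json.dumps(metadata, indent=2, sort_keys=True)
```

A call such as to_json(build_metadata("Krill survey", "Sharpe, W.", "data")) produces text that could be saved alongside the README; the title, creator, and folder name here are placeholders.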

Documentation Practices: Example Metadata Standards

• Dublin Core – Easy-to-create-and-maintain descriptive format to facilitate cross-domain resource discovery on the Web
• Darwin Core – Facilitates reference and sharing of biological diversity datasets
• Data Documentation Initiative (DDI) – Methodology for content, presentation, transport, and preservation of metadata about datasets in the social and behavioral sciences
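As a concrete illustration of one of these standards, the sketch below writes a few Dublin Core elements as XML using Python's standard library. The element names (title, creator, date, format, rights) and the namespace come from the Dublin Core Metadata Element Set; the wrapping "metadata" element, the function name, and the example values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Return an XML string with one dc: element per (name, value) pair."""
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for name, value in fields.items():
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical dataset description, for illustration only.
record = dublin_core_record({
    "title": "Krill micrographs",
    "creator": "Sharpe, W.",
    "date": "2011-01-17",
    "format": "image/tiff",
    "rights": "(access and reuse terms)",
})
print(record)
```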

Documentation Practices: Example Metadata Standards (cont.)

• Directory Interchange Format – Descriptive format for exchanging information about earth science data
• ISO 19115:2003 – Describes geographic data such as maps and charts
• PBCore – Supports description and exchange of media assets, including both individual clips and full, edited, aired productions

Documentation Practices: Example Metadata Standards (cont.)

• Science Data Literacy Project – Metadata for astronomy, biology, ecology and oceanography
• VRA Core – Data standard for description of works of visual culture as well as images that document them

3. If applicable, use in-line code commentary to explain code

Example of R code commentary

# Cumulative normal density
pnorm(c(-1.96,0,1.96))
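The same practice applies in any language. As a rough Python counterpart to the R line above (an assumption about how one might translate it, using only the standard library), each comment says what the next statement computes and why:

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist()

# Cumulative probability P(Z <= z) at each z value,
# the same quantity R's pnorm() returns.
cumulative = [standard_normal.cdf(z) for z in (-1.96, 0, 1.96)]
print(cumulative)
```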

Backup Practices: Overview

1. Avoid single points of failure
2. Understand the different types of storage
3. Ensure data redundancy
4. Aim for geographic distribution of data

1. Avoid single points of failure

A single point of failure occurs when it would only take one event to destroy all data on a device (e.g. dropped hard drive)

Good practices for avoiding single points of failure:
• Use managed networked storage whenever possible
• Move data off of portable media
• Never rely on one copy of data
• Do not rely on CD or DVD copies to be readable
• Be wary of software lifespans (e.g. ANGEL)

2. Understand the different types of storage

• Flash Drives
• Internal Hard Drives
• External Hard Drives
• Server and Web Storage
• Managed Networked Storage
• Cloud Storage

3. Ensure data redundancy

Backup Do’s:
• Make 3 copies
  – E.g. original + external/local + external/remote
  – E.g. original + 2 formats on 2 drives in 2 locations
• Geographically distribute and secure
  – Local vs. remote, depending on needed recovery time
• Personal computer, external hard drives, departmental, or university servers may be used
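The “original + 2 copies” rule can be sketched as a short script that copies a file to two other locations and confirms each copy by checksum. The function names are hypothetical, and the destination folders stand in for whatever local and remote storage is available (for example a mounted external drive and a mounted network share); this is an illustration, not a replacement for managed backup software.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(original, destinations):
    """Copy original into each destination folder and verify every copy.

    Returns the list of verified copies; raises OSError if a copy's
    checksum does not match the original.
    """
    original = Path(original)
    expected = sha256_of(original)
    copies = []
    for directory in destinations:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / original.name
        shutil.copy2(original, target)  # copy2 also preserves timestamps
        if sha256_of(target) != expected:
            raise OSError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies
```

Calling back_up with one local and one remote destination yields the three copies described above; verifying the checksum guards against a copy that was silently truncated or corrupted.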

3. Ensure data redundancy (cont.)

Backup Don’ts:
• Do not rely on one copy
• Do not use CDs and DVDs
• Do not rely on ANGEL

(cc) George Ornbo

3. Ensure data redundancy (cont.)

Backup Maybe: Cloud storage
• Amazon S3
• Google
• MS Azure
• DuraCloud
• Rackspace

Note that many enterprise cloud storage services include a charge for in/out of data transfers



Research is…

[Figure: the scientific method drawn as a sequence of steps: Define a question, Gather information, Form a hypothesis, Test the hypothesis, Analyze the data, Interpret the data, Publish results, Retest]

The scientific method “is often misrepresented as a fixed sequence of steps,” rather than being seen for what it truly is, “a highly variable and creative process” (AAAS 2000:18).

Gauch, Hugh G. Scientific Method in Practice. New York: Cambridge University Press, 2010. Print. (Emphasis added)


The Research Depth Chart

From more generic to more specific:
• Scientific Method
• Research Design
• Research Method
• Research Tasks


Source: DDI Structural Reform Group. “Overview of the DDI Version 3.0 Conceptual Model.” DDI Alliance. 2004. http://opendatafoundation.org/ddi/srg/Papers/DDIModel_v_4.pdf

The Data Management Depth Chart

• Research Data Lifecycle Model
• ???
• Data Management Plan
• Research Data Management Tasks

Data are brainstormed

[Lifecycle diagram, stages shown: Study Concept]

DMP
• Data type, purpose & value

MSU
• University Research Council guidelines
• Research Facilitation and Dissemination
• Lifecycle Data Management Planning
• Research Data Management Guidance

YOU
• Start your Data Management Plan!

Data are collected and secured

[Lifecycle diagram, stages shown: Study Concept, Data Collection]

DMP
• Data format, size & short term storage

MSU
• ATS Andrew File System (AFS)
• Institute for Cyber Enabled Research
• MSU Libraries Data Services
• MSU Libraries Campus Data Resources

YOU
• File Plan, File Naming, Backup Plan

Data are normalized and processed

[Lifecycle diagram, stages shown: Study Concept, Data Collection, Data Processing]

DMP
• Data transformations & structures

MSU
• LCT Computing Courses
• High Performance Computing Center
• Consortium of Research Consulting Services

YOU
• Documentation, Methodology

Data are distributed

[Lifecycle diagram, stages shown: Study Concept, Data Collection, Data Processing, Data Distribution]

DMP
• Data sharing, security & rights

MSU
• Human Research Protection Program
• University Research Council guidelines
• MSU Libraries Copyright Permissions Center
• MSU Google Apps

YOU
• Roles, Responsibilities, Resources

Data are discoverable

[Lifecycle diagram, stages shown: Study Concept, Data Collection, Data Processing, Data Distribution, Data Discovery]

DMP
• Data publishing & metadata

MSU
• Development of Copyrighted Materials
• MSU Libraries Data Citation Guide

YOU
• README, Metadata Standard

Data are analyzed

[Lifecycle diagram, stages shown: Study Concept, Data Collection, Data Processing, Data Distribution, Data Discovery, Data Analysis]

DMP
• Standards & workflow documentation

MSU
• Center for Statistical Training and Consulting
• Statistical Consulting Services

YOU
• Code Commentary, Documentation

Data are stored and preserved

[Lifecycle diagram, stages shown: Study Concept, Data Collection, Data Processing, Data Distribution, Data Discovery, Data Analysis, Data Archiving]

DMP
• Long term storage & management

MSU
• VPRGS Repositories and Archives
• Lifecycle Data Management Planning
• Databib.org!

YOU
• Embrace stewardship

Data can be used and reused

[Lifecycle diagram, stages shown: Study Concept, Data Collection, Data Processing, Data Distribution, Data Discovery, Data Analysis, Data Archiving, Repurposing]

DMP
• Broader impact

MSU
• Research Data Management CAFE
• MSU Research Centers and Institutes
• MSU Libraries Data Citation Guide

YOU
• Publish your data!

Research Data Management Guidance

Face-to-face Advising
• Writing Data Management Plans
• Planning for Digital Projects
• Managing Digital Information

Group Training
• New Faculty Orientation
• Faculty Seminars
• Classroom Instruction

lib.msu.edu/about/rdmg

In Conclusion…
• Upfront Decisions Researchers Need to Make
• General Good Practices for Managing Research Data
• NSF, NIH, IMLS and Other Funders’ Requirements
• Lifecycle of Research Data

Contact

Lisa M. Schmidt
Electronic Records Archivist
University Archives & Historical Collections
lschmidt@ais.msu.edu

Aaron Collie
Digital Curation Librarian
MSU Libraries
collie@msu.edu
