Preventing Data Loss Optimizing Storage & Organization Data Topics Workshop Series: Fall 2014


Meet & Greet

• First Name
• Program or Department
• Current role in a research project

Heather Coates
Digital Scholarship & Data Management Librarian
Liaison to the Fairbanks School of Public Health
[email protected]

Timeline

• The Big Picture

• Practical Strategies

• Activities

• Presentation: 5 minutes
• Activity 1: Storage & Backup Plan
• Presentation: 5 minutes
• Activity 2: Folder Structure & File Names
• Review | Q&A

Agenda

Scenario

Four years after your article is published, a researcher in your field contacts you with questions about the integrity of the data.

• Can you find the files supporting your published findings?

• Can you access and view the files?

• Can you justify your rationale for the procedures based on your documentation?

• Can someone pick up your research and build on it?

Goals

• Develop a consistent and coherent file organization and naming convention scheme for all project files.

• Select appropriate non-proprietary hardware and software formats for storing data.

• Create protected copies of files at crucial points in your study.

• Use versioning software or documentation for tracking changes to files over time (if necessary).

• Natural disaster
• Facilities infrastructure failure
• Storage failure
• Server hardware/software failure
• Application software failure
• External dependencies (e.g. PKI failure)
• Format obsolescence
• Legal encumbrance
• Human error
• Malicious attack by human or automated agents
• Loss of staffing competencies
• Loss of institutional commitment
• Loss of financial stability
• Changes in user expectations and requirements

The World of Data Around Us: Data Loss

CC image by Sharyn Morrow on Flickr

CC image by momboleum on Flickr

Data Loss

Vines et al., 2014

Backup Plan [Write it down, Do it, Check it]

• Rule of 3
  • Local copy (ex: desktop or laptop)
  • Semi-local copy (ex: IU cloud storage)
  • Remote copy (ex: IU cloud storage)

• Backup frequency
  • How much data can you risk losing?

• Backup procedure
  • Manual or automatic?
  • Full or incremental?
  • Verification/testing?
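The "Do it, Check it" part of a backup plan can be sketched in a few lines of Python. This is only an illustration, not a recommended tool: the function names `sha256_of` and `back_up` and the destination folders are hypothetical, and the sketch assumes each backup location is reachable as an ordinary folder. It copies one file to every destination and then verifies each copy against the original's checksum.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source, destinations):
    """Copy `source` into each destination folder and verify every copy.

    Returns a dict mapping each copy's path to True (checksum matches)
    or False (the copy differs from the original).
    """
    source = Path(source)
    expected = sha256_of(source)
    results = {}
    for folder in destinations:
        folder = Path(folder)
        folder.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(source, folder))  # copy2 keeps timestamps
        results[copy] = sha256_of(copy) == expected
    return results
```

Calling `back_up("survey_raw.csv", ["D:/semi_local", "R:/remote"])` (hypothetical paths) would give one verified copy per location; any `False` in the result means that copy should be redone.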

Security & Encryption

• Use IU systems
  • Strong authentication protocols

• Encryption
  • Useful for portable devices (e.g., laptops, external hard drives, flash drives, smartphones, etc.)
  • Use for highly sensitive data
  • IU recommendations
    • http://kb.iu.edu/data/ayzi.html
    • http://kb.iu.edu/data/bcnh.html

Activity 1: Storage & Backup Plan

• On your own – Jot down
  • Describe your current plan or practice
  • What challenges do you face?

• Whole group discussion
  • What can you change in your own practices to reduce the risk of data loss?
  • What difficulties do you face using IU storage resources?

File Names

Courtesy of PhD Comics

Elements of a File Name

• Project/grant name and/or number
• Date of creation/modification
• Name of creator/investigator: last name first followed by (initials of) first name
• Research team/department associated with the data
• Content or subject descriptor
• Data collection method (instrument, site, etc.)
• Version number
• Project phase
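One way to keep these elements consistent is to generate file names rather than type them. The sketch below is an assumption-laden example: the function `build_file_name`, the choice of four elements, the underscore separator, and the `vNN` version style are all illustrative choices, not a convention prescribed in this workshop.

```python
import re
from datetime import date


def build_file_name(project, descriptor, created, version, extension):
    """Join file-name elements with underscores, date first (YYYYMMDD)."""
    parts = [created.strftime("%Y%m%d"), project, descriptor, f"v{version:02d}"]
    # Inside an element, replace runs of anything other than letters, digits,
    # or hyphens with a hyphen, so underscores only ever separate elements.
    cleaned = [re.sub(r"[^A-Za-z0-9-]+", "-", part).strip("-") for part in parts]
    return "_".join(cleaned) + "." + extension.lstrip(".")


print(build_file_name("diss", "surveyB raw", date(2011, 1, 3), 1, "csv"))
# prints: 20110103_diss_surveyB-raw_v01.csv
```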

Naming Strategies

• Date first
  • 20110103_diss_surveyB_raw
  • 20110118_diss_surveyB_raw
  • 20110119_diss_inter_trans
  • 20110204_diss_surveyB_quest-B

• Subject first
  • diss_surveyB_raw_20110103
  • diss_surveyB_raw_20110118
  • diss_inter_trans_20110119
  • diss_surveyB_quest_20110204

• Type first
  • surveyB_raw_diss_20110103
  • surveyB_raw_diss_20110118
  • inter_trans_diss_20110119
  • surveyB_quest_diss_20110204

• Numbered (Forced ordering)
  • 01_diss_survey_raw_20110103
  • 01_diss_survey_raw_20110118
  • 02_diss_inter_trans_20110119
  • 04_diss_survey_quest-B_20110204
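The practical difference between these strategies is how the files line up in an ordinary alphabetical listing. A short Python illustration, using the example names above (the variable names are made up for this sketch):

```python
date_first = [
    "20110204_diss_surveyB_quest-B",
    "20110103_diss_surveyB_raw",
    "20110119_diss_inter_trans",
    "20110118_diss_surveyB_raw",
]
subject_first = [
    "diss_surveyB_quest_20110204",
    "diss_surveyB_raw_20110103",
    "diss_inter_trans_20110119",
    "diss_surveyB_raw_20110118",
]

# Alphabetical order of YYYYMMDD prefixes is also chronological order.
by_date = sorted(date_first)

# With the subject first, the same sort groups files by content instead,
# and the trailing date orders files within each group.
by_subject = sorted(subject_first)

for name in by_date + by_subject:
    print(name)
```

The date-first list comes out oldest to newest, while the subject-first list puts the interview transcript first and keeps the two raw survey files side by side.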

Whitmire, 2014

A data horror story

http://retractionwatch.com/2014/10/17/this-situation-left-me-ashamed-and-infuriated-with-myself-scientist-retracts-two-papers/

A Portuguese group has retracted two papers in the Journal of Bacteriology after mislabeled computer files led to the wrong images being used.

And, we’ve learned in a heartfelt email, the first author was devastated.

“A problem with a malfunctioning computer and image storage and mislabeling led to the assembling by one of the co-authors of images that were previously published by our research group. I didn’t detect the problem when the manuscript was sent for publication. Although the conclusions were not compromised in any of the two papers, we retract the papers precisely because some images were wrongly used.”

A data horror story

“I the 2011 paper (http://jb.asm.org/content/196/22/3980), it was first submitted to other 2 journal (JBC and RNA Biology), whom requested a lot of modifications, and therefore, we accumulated a lot of processed data files. In between the process, the hard-drive of the computer that was used to store the data files (which is shared by 5 research groups) stopped working due data overloading. Nonetheless, we were able to retrieve the original data, or so we thought. At the time, I was responsible for composing the final figures of each paper that we produced, and asked the team members to give me the files. In Figure 8 of this paper, it seemed that there has been a labeling error in the source files, and I did not realize that some images where duplicated in the experiment that was being represented, neither that parts of the image had already been published. I should stress that that the images were produced in our lab and represent our data.”

Activity 2: Folder Structure & File Names

• On your own
  • Describe your current plan or habits

• Take Home - Think About This
  • What are your current strategies or habits for naming files?
  • How do you organize your folders?
  • Does it work for you?
  • What is one thing you can do to improve your current system?

Master Files & Data Locks

• Provide snapshots of key phases in the data life cycle
  • Raw
  • Cleaned
  • Phases of processing

• In combination with detailed documentation, these files make write-up easier and support reproducibility and reuse

• Demonstrate provenance (i.e., an audit trail)
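A data lock can be as simple as a read-only copy plus a recorded checksum. The following Python sketch shows one possible approach under stated assumptions: the function `lock_master_copy`, the `MANIFEST.txt` file name, and the phase labels are hypothetical, and removing write permission only guards against accidental edits, not deliberate ones.

```python
import hashlib
import shutil
import stat
from pathlib import Path


def lock_master_copy(working_file, archive_dir, phase):
    """Copy a file into `archive_dir`, mark it read-only, and log its checksum."""
    working_file, archive_dir = Path(working_file), Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    master = archive_dir / f"{working_file.stem}_{phase}{working_file.suffix}"
    if master.exists():
        # Never overwrite an existing snapshot of a phase.
        raise FileExistsError(f"{master} is already locked; use a new phase label")
    shutil.copy2(working_file, master)
    # Read-only for owner, group, and others.
    master.chmod(stat.S_IREAD | stat.S_IRGRP | stat.S_IROTH)
    # The manifest is the audit trail: checksum plus file name, one line per lock.
    checksum = hashlib.sha256(master.read_bytes()).hexdigest()
    with (archive_dir / "MANIFEST.txt").open("a") as manifest:
        manifest.write(f"{checksum}  {master.name}\n")
    return master
```

Locking `survey.csv` with the phase label `raw` would produce `survey_raw.csv` in the archive folder; recomputing the checksum later and comparing it with the manifest shows whether the master file has changed.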

Version Control

• Manual – file names
  • Sequential numbered system
  • Dated

• Automatic – version control software
  • Mercurial
  • TortoiseSVN
  • GitHub

• Keep log files, supplement with documentation (e.g., readme.txt, comments, etc.)
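For the manual route, a small script can take care of the numbering and the log entry together. This is a minimal sketch, assuming a `_vNN` suffix and a `changelog.txt` kept next to the file; both conventions and the function name `save_new_version` are illustrative choices rather than part of the workshop material.

```python
import re
import shutil
from datetime import datetime
from pathlib import Path


def save_new_version(path, note):
    """Copy `path` to the next `_vNN` file name and record the change in a log."""
    path = Path(path)
    pattern = re.compile(re.escape(path.stem) + r"_v(\d+)" + re.escape(path.suffix))
    # Collect the version numbers already present in the same folder.
    numbers = []
    for candidate in path.parent.iterdir():
        match = pattern.fullmatch(candidate.name)
        if match:
            numbers.append(int(match.group(1)))
    next_number = max(numbers, default=0) + 1
    versioned = path.with_name(f"{path.stem}_v{next_number:02d}{path.suffix}")
    shutil.copy2(path, versioned)
    # One log line per version: timestamp, file name, and what changed.
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    with (path.parent / "changelog.txt").open("a") as log:
        log.write(f"{stamp}  {versioned.name}  {note}\n")
    return versioned
```

Each call leaves the working file untouched and adds a numbered snapshot, so `analysis.txt` gains `analysis_v01.txt`, `analysis_v02.txt`, and so on, with a matching line in the log.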

Activity 3: Master Copies & Data Locks

• Take Home - Think About This
  • What key files or versions of your data, analyses, images, etc. do you need to validate your published results?
  • Are they saved somewhere safe?
  • Can they be accidentally edited?

• Develop a plan to create locked copies of key files just in case
  • Store them someplace else…not with your working files

Resources

1. DataONE Education Module: Data Management. DataONE. Retrieved December 2013. From http://www.dataone.org/sites/all/documents/L01_DataManagement.pptx

2. Vines et al. (2014). The availability of research data declines rapidly with article age. Current Biology. http://dx.doi.org/10.1016/j.cub.2013.11.014

3. Whitmire, A. (2014). Research Data Management – Organizing Your Data. From http://guides.library.oregonstate.edu/grad521lectures