Managing the Impacts of Programmatic Scale and Enhancing Incentives for Data Archiving

A Presentation for “International Workshop on Strategies for Preservation of and Open Access to Scientific Data”
June 22, 2004
Beijing, China

Raymond McCord
Oak Ridge National Laboratory*
Oak Ridge, Tennessee, USA

*Oak Ridge National Laboratory is operated by UT-Battelle, LLC, for the U.S. Department of Energy under contract DE-AC05-00OR22725



Credits
• Concepts presented here are derived from 25+ years of managing data for environmental projects.
• Variations of the concepts have been observed in these disciplines:
  • plant community research
  • impact assessment in marine systems
  • acid rain surveys
  • environmental monitoring and cleanup projects at DOE facilities
  • land use assessment
  • climate change research (atmospheric research)
• These concepts are believed to extend to other scientific disciplines.

Presentation Strategy
• Archiving and science
  • Making connections
• Enhancing incentives for archiving
• Impacts of scale
  • Volume (files and bytes)
  • Diversity
  • Timing
  • Longevity

Source: American Scientist, Vol 886 p 525.

You can’t keep running in here and demanding data every two years

Challenge: engage scientists in the process of archiving their data and provide the mechanism for archiving.

Quotes from Raymond
• “Storing data is easy. Finding and using data later is not.”
• “Systematically and consistently organized data does not occur without cost. Consider the results from previous science projects with no extra effort for data archiving.”
• “The natural tendency over time for data and information is chaos. Effort must be exerted to overcome this.”
• “Successfully managed data by projects may not be ready to be archived.”

Archive Functions
• Store data
  • Submitted by others
  • Build a catalog and structure
  • Maintain storage across technology generations
  • Review new data (QA, metadata)
• “Advertise” contents
• Find data for users
  • Query and browse logic
• Distribute data
  • Provide access to data
  • References to documentation

Presumptions about Archiving
• Information sharing is important.
• Multidisciplinary data access will foster more robust scientific discoveries.
• Archiving can be improved.
• The “neurons” of archives are metadata.
• The limited number of permanent data archives will increase.
  • An expectation from “the Internet”

Why Archive??

“I am doing Science. Trust me.”

Cycles of Research: “An Information View”

[Diagram: a research cycle. Labels: Problem Definition (Research Objectives), Planning, Measurement Collection, Automation and review, Information review, Analysis and modeling, Selection and extraction, Archive of Data, Publications, Original Observations, Secondary Observations. Time labels: 200 yrs, 25 yrs.]

“Why Don’t I Archive My Data?”
• No incentives - What’s in it for me?
• No acknowledgment - Does a dataset = a paper?
• Give up publication rights - Will somebody scoop me?
• Poor planning - It was not in “the Plan”.
• No resources - Who’s going to pay for it?
• No future - Who will support this later?
• Lack of training - What do I do first?
• Unsure about metadata content - How much is enough?

“Why Should I Archive My Data?” (management hints!)
• Career advancement (give them credit)
  • Scientists need to get some recognition for archiving.
  • Consider scientific journals that also provide companion “data publications”.
  • “It may help me do science with a broader view.”
• Good scientific practice (create peer pressure)
• Professional development (give them training)
  • Provide daily interactions between scientific and information specialists.
  • Allow a reasonable time for initial discovery.
  • Provide support for long-term “stewardship”. (Who will answer the questions after the project is completed?)

“Why Should I Archive My Data?” (more management hints!!)
• Institutional incentives (Have plans AND expectations)
  • Archiving should be required by the sponsor.
  • Data archiving is “in the plan” and resources are available to support it.
  • Interweave archiving with the planning and publication processes.
• Technological advances (Give them hardware and software)
  • It is technically easier now and there are more options.
  • Consistent “self-discipline” is still challenging.

“Why Should I Archive My Data?” (still more management hints!!!)
• “Change” will be managed. (Have standards AND flexibility!!??)
  • Change is inherent in research.
  • Managing change without prior planning can become consumptive.
  • Changes may cause confusion and diminish data usefulness.
  • A BIG issue - more details during tomorrow’s panel discussion on “Management and Technical Issues”

Archiving Supports Better Science
• The metadata required for archiving will improve data quality.
• Archiving extends data usefulness.
• Archived data increases your information base for doing research:
  • More data volume and diversity
• Proper archives permit the replication of results.
  • A KEY concept of Science

The Effects of Project Scale on Archives

“Metadata are archive neurons??”

Metadata Depends on Your “World View”
• Investigator
  • Doesn’t need extensive formal metadata
• Project
  • Metadata needed for project integration and modeling activities may be limited
  • Project data manager may help write metadata
• Data archive
  • More detailed metadata (e.g., spatial coordinates)
  • More standardization (e.g., keywords) to communicate clearly with future users
  • Who writes the metadata?

An Initial View of Data…

[Diagram: a single box labeled Measurement.]

Single Experiment View

[Diagram: Measurement, described by date, sample ID, parameter name, and location.]

Research Project View

[Diagram: Measurement, described by QA flag, media, date, sample ID, parameter name, and location.]

Long-term or Multidisciplinary View

[Diagram: Measurement, described by QA flag, media, generator, method, date, sample ID, parameter name, location, records, and units.]
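The growth of the field list across the three views above can be sketched as a simple completeness check. This is only an illustration: the field names are taken from the diagrams, while the snake_case spellings, the `missing_fields` helper, and the sample record are assumptions made for the example.

```python
# Field names follow the slide diagrams; the snake_case spellings and the
# missing_fields helper are illustrative assumptions, not part of the talk.
SINGLE_EXPERIMENT = ["measurement", "date", "sample_id", "parameter_name", "location"]
RESEARCH_PROJECT = SINGLE_EXPERIMENT + ["qa_flag", "media"]
LONG_TERM = RESEARCH_PROJECT + ["generator", "method", "records", "units"]


def missing_fields(record, view):
    """Return the fields of `view` that are absent or empty in `record`."""
    return [name for name in view if record.get(name) in (None, "")]


# A made-up record that is complete from the single-experiment point of view.
record = {
    "measurement": 7.2,
    "date": "2004-06-22",
    "sample_id": "S-001",
    "parameter_name": "pH",
    "location": "Plot A",
}

print(missing_fields(record, SINGLE_EXPERIMENT))  # nothing missing at this scale
print(missing_fields(record, LONG_TERM))          # six fields still undocumented
```

The same record that satisfies an individual investigator leaves six fields blank once it is judged against the long-term view.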

Integrated System & Archive View

[Diagram: the Measurement record (QA flag, media, generator, method, date, sample ID, parameter name, location, records, units) linked to definition tables: Sample def., Method def., Parameter def., QA def., Units def., Record system, and GIS. Attribute labels shown on the linked tables: type, date, location, generator, lab, field, words, units, method, org., name, custodian, address, coord., elev., depth.]
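One way to read the integrated view is that a measurement stores short codes, and the definition tables carry the descriptions. The sketch below assumes that reading. All class names, codes, and table contents are hypothetical, and only three of the definition tables from the diagram are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitsDef:
    code: str
    description: str


@dataclass(frozen=True)
class MethodDef:
    code: str
    description: str
    units_code: str  # assumption: each method reports in one unit


@dataclass(frozen=True)
class Measurement:
    value: float
    sample_id: str
    parameter_name: str
    method_code: str
    qa_flag: str


# Hypothetical definition tables, keyed by code.
UNITS = {"degC": UnitsDef("degC", "degrees Celsius")}
METHODS = {"THERM-1": MethodDef("THERM-1", "thermistor probe", "degC")}
QA_FLAGS = {"OK": "passed review", "SUSPECT": "needs investigation"}


def describe(m):
    """Resolve a measurement's codes against the definition tables."""
    method = METHODS[m.method_code]
    units = UNITS[method.units_code]
    return (f"{m.parameter_name} = {m.value} {units.code} "
            f"({method.description}; QA: {QA_FLAGS[m.qa_flag]})")


m = Measurement(21.5, "S-001", "air temperature", "THERM-1", "OK")
print(describe(m))
```

Keeping definitions in shared tables means a future user can interpret a code without contacting the original investigator.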

Another View of Scale

Project Scale and Recorded Metadata

[Chart: recorded metadata against increasing user scope. Labels: Program, PI, Metadata, Group, Archive.]
• Units
• Method
• QA flag
• Media
• Parameter name
• Measurement
• Date
• Sample ID
• Location
• Generator
• Records

Data Maturation and Scale
• Individual Investigators
  • collect data, quality assure, document, analyze, publish
• Groups or Science Teams
  • collate data, enhance, synthesize, model, publish
• Project Information System
  • collate data, review completeness, maintain data for project
• Data Distribution and Archive Center
  • long-term archive, distribute freely to users
• Master Data Directory
  • searchable index with pointers to data
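The last item, a searchable index with pointers to data, can be sketched in a few lines. This is a minimal illustration, not a description of any existing directory: the `MasterDirectory` class, the dataset identifiers, and the locations are all hypothetical.

```python
from collections import defaultdict


class MasterDirectory:
    """Keyword index that points to datasets held in other archives."""

    def __init__(self):
        self._index = defaultdict(set)  # keyword -> dataset ids
        self._pointers = {}             # dataset id -> where the data lives

    def register(self, dataset_id, location, keywords):
        """Record where a dataset is stored and which keywords describe it."""
        self._pointers[dataset_id] = location
        for kw in keywords:
            self._index[kw.lower()].add(dataset_id)

    def search(self, *keywords):
        """Return sorted (dataset_id, location) pairs matching all keywords."""
        if not keywords:
            return []
        sets = [self._index.get(kw.lower(), set()) for kw in keywords]
        hits = set.intersection(*sets)
        return sorted((ds, self._pointers[ds]) for ds in hits)


directory = MasterDirectory()
directory.register("ds-001", "archive-a/ds-001", ["Acid Rain", "pH"])
directory.register("ds-002", "archive-b/ds-002", ["land use", "pH"])

print(directory.search("ph"))
print(directory.search("ph", "acid rain"))
```

The directory holds only descriptions and pointers. The data stay with the distribution and archive centers, which is why standardized keywords matter at this scale.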

Preparing for Archiving

I will not wait. I will not wait. I will not wait. I will not …

Generic Environmental Data Model (Which Piece Is First…?)

[Diagram: the Measurement record (QA flag, media, generator, method, date, sample ID, parameter name, location, records, units) linked to definition tables: Sample def., Method def., Parameter def., QA def., Units def., Record system, and GIS.]

Sequence of Information Birth

[Diagram: the Measurement record (QA flag, media, generator, method, date, sample ID, parameter name, location, records, units) linked to definition tables: Sample def., Method def., Parameter def., QA def., Units def., Record system, and GIS.]

Research ~ Publishing ~ Metadata
• Metadata design can be a “checklist” for research planning.
• Metadata preparation can be integrated with the publication process.
• Metadata are an investment in current and future science.

Summary Points
• Incentives to archive data are a “management responsibility”.
• “Management” should understand the “Big Picture”
  • The impacts of scale on archiving.
• Archives need structure and standards.
• Solutions include more than additional technology.
  • New behavior is also VERY important.
• Metadata are the “neurons” of Archives.
  • Early metadata are better than later.
• The planning and decisions about archiving need to be intentional and not accidental.

Future Thoughts
• Will we be able to know “Where are we?” as the capacity of information technology continues to expand?
  • How many 30 KB files are on a 100 GB tape cartridge?
• The future limits will not be technology
  • But our minds…
• We need to plan NOW about how to best leverage the future.
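The tape cartridge question above has a quick back-of-the-envelope answer. The sketch assumes the cartridge is filled completely and ignores filesystem and tape-format overhead. The function name is illustrative, and both decimal and binary readings of KB and GB are shown because the slide does not say which is meant.

```python
def files_per_cartridge(cartridge_bytes, file_bytes):
    """Whole files that fit, ignoring filesystem and tape-format overhead."""
    return cartridge_bytes // file_bytes


# Decimal units: 1 KB = 10**3 bytes, 1 GB = 10**9 bytes.
decimal_count = files_per_cartridge(100 * 10**9, 30 * 10**3)

# Binary units: 1 KB = 2**10 bytes, 1 GB = 2**30 bytes.
binary_count = files_per_cartridge(100 * 2**30, 30 * 2**10)

print(f"{decimal_count:,} files (decimal units)")
print(f"{binary_count:,} files (binary units)")
```

Either way the answer is well over three million files on a single cartridge, which is the point of the question: the hard part is knowing what is in them, not storing them.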

Looking Forward to a Future With Archives!!