A Grand Challenge for the Information Age

Dr. Francine Berman
Director, San Diego Supercomputer Center
Professor and High Performance Computing Endowed Chair, UC San Diego


Page 1:

A Grand Challenge for the Information Age

Dr. Francine Berman, Director, San Diego Supercomputer Center; Professor and High Performance Computing Endowed Chair, UC San Diego

Page 2:

The Fundamental Driver of the Information Age is Digital Data

Shopping

Entertainment

Information

Business

Education

Health

Page 3:

Digital Data Critical for Research and Education

Data at multiple scales in the Biosciences: atoms, bio-polymers, organelles, cells, organs, and organisms, spanning genomics, proteomics, medicinal chemistry, cell biology, anatomy, and physiology. Disciplinary databases support data access and use, and data integration, so users can ask: What genes are associated with cancer? What parts of the brain are responsible for Alzheimer’s?

Data from multiple sources in the Geosciences: geologic, geo-chemical, geo-physical, geo-chronologic, and foliation maps, integrated through complex “multiple-worlds” mediation to answer questions such as: Where should we drill for oil? What is the impact of global warming? How are the continents shifting?

Page 4:

Today’s Presentation

• Data Cyberinfrastructure Today – designing and developing infrastructure to enable today’s data-oriented applications

• Challenges in Building and Delivering Capable Data Infrastructure

• Sustainable Digital Preservation – a grand challenge for the Information Age

Page 5:

Data Cyberinfrastructure Today – Designing and Developing Infrastructure for Today’s Data-Oriented Applications

Page 6:

Today’s Data-oriented Applications Span the Spectrum

[Figure: applications plotted on axes COMPUTE (more FLOPS), DATA (more BYTES), and NETWORK (more BW): home/lab/campus/desktop applications; compute-intensive HPC applications; data-intensive applications; data-intensive and compute-intensive HPC applications; Grid applications; and Data Grid applications.]

Designing Infrastructure for Data:
• Data and High Performance Computing
• Data and Grids
• Data and Cyberinfrastructure Services

Page 7:

Data and High Performance Computing

• For many applications, “balanced systems” are needed to support codes that are both data-intensive and compute-intensive: codes for which
 • Grid platforms are not a strong option
 • Data must be local to computation
 • I/O rates exceed WAN capabilities
 • Continuous and frequent I/O is latency-intolerant
• Scalability is key
• High-bandwidth, large-capacity local parallel file systems and archival storage are needed

[Figure: the FLOPS/BYTES application spectrum, highlighting applications that are both data-intensive and compute-intensive.]

Page 8:

Earthquake Simulation at Petascale – better prediction accuracy creates greater data-intensive demands

Estimated figures for a simulated 240-second period, 100-hour run time:

                             TeraShake domain          PetaShake domain
                             (600x300x80 km^3)         (800x400x100 km^3)
Fault system interaction     NO                        YES
Inner scale                  200 m                     25 m
Resolution of terrain grid   1.8 billion mesh points   2.0 trillion mesh points
Magnitude of earthquake      7.7                       8.1
Time steps                   20,000 (.012 sec/step)    160,000 (.0015 sec/step)
Surface data                 1.1 TB                    1.2 PB
Volume data                  43 TB                     4.9 PB

Information courtesy of the Southern California Earthquake Center

Page 9:

Data and HPC: What you see is what you’ve measured

FLOPS alone are not enough. Appropriate benchmarks are needed to rank, and bring visibility to, the more balanced machines critical for today’s applications.

• Three systems using the same processor and number of processors: 64 AMD Opteron processors at 2.2 GHz
 • Cray XD1 – custom interconnect
 • Dalco Linux Cluster – Quadrics interconnect
 • Sun Fire Cluster – Gigabit Ethernet interconnect
• The difference is in the way the processors are interconnected
• HPC Challenge benchmarks measure different machine characteristics:
 • Linpack and matrix multiply are computationally intensive
 • PTRANS (matrix transpose), RandomAccess, bandwidth/latency tests, and other tests begin to reflect stress on the memory system

Information courtesy of Jack Dongarra
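The compute-bound vs. memory-bound distinction these benchmarks capture can be sketched with a toy microbenchmark. This is only an illustration, not the actual HPC Challenge code; the kernel sizes and names are arbitrary:

```python
# Illustrative sketch (not the HPC Challenge suite): a compute-bound kernel
# (dense matmul, as in Linpack) vs. a memory-bound kernel (random gather,
# in the spirit of RandomAccess). Interconnect/memory-stressed machines
# separate on the second kind of kernel, not the first.
import random
import time

def time_kernel(fn, repeats=3):
    """Best wall-clock time over a few repeats."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

n = 96
a = [[random.random() for _ in range(n)] for _ in range(n)]
b = [[random.random() for _ in range(n)] for _ in range(n)]

def matmul():
    # ~2*n^3 floating-point operations: stresses the arithmetic units
    bt = list(zip(*b))  # transpose once so inner loops are stride-1
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

m = 1 << 18
data = [random.random() for _ in range(m)]
idx = [random.randrange(m) for _ in range(m)]

def gather():
    # pure data movement at unpredictable addresses: stresses memory
    return [data[i] for i in idx]

t_compute = time_kernel(matmul)
t_memory = time_kernel(gather)
gflops = 2 * n**3 / t_compute / 1e9
print(f"matmul {t_compute:.3f}s ({gflops:.3f} GFLOP/s), gather {t_memory:.3f}s")
```

A real suite reports both kinds of numbers per machine, which is exactly why the three Opteron clusters above rank differently depending on the test chosen.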

Page 10:

Data and Grids

• Data applications were some of the first applications that
 • required Grid environments
 • could naturally tolerate longer latencies
• The Grid model supports key data application profiles:
 • Compute at site A with data from site B
 • Store a data collection at site A with copies at sites B and C
 • Operate an instrument at site A, move data to site B for storage, post-processing, etc.

CERN data is providing a key driver for grid technologies

Page 11:

Data Services Key for TeraGrid Science Gateways

• Science Gateways provide a common application interface for science communities on TeraGrid

• Data services are key for Gateway communities:

• Analysis

• Visualization

• Management

• Remote access, etc.

LEAD

GridChem

NVO

Information and images courtesy of Nancy Wilkins-Diehr

Page 12:

Unifying Data over the Grid – the TeraGrid GPFS-WAN Effort

• User wish list:
 • Unlimited data capacity (everyone’s aggregate storage almost looks like this)
 • Transparent, high-speed access anywhere on the Grid
 • Automatic archiving and retrieval
 • No latency
• The TeraGrid GPFS-WAN effort focuses on providing “infinite” (SDSC) storage over the grid:
 • Looks like local disk to grid sites
 • Uses automatic migration with a large cache to keep files always “online” and accessible
 • Data automatically archived without user intervention

Information courtesy of Phil Andrews

Page 13:

Data Services – Beyond Storage to Use

What services do users want?

• What are the trends and what is the noise in my data?
• How should I display my data?
• How should I organize my data?
• How can I make my data accessible to my collaborators?
• How can I combine my data with my colleague’s data?
• My data is confidential; how do I make sure that it is seen and used only by the right people?
• How do I make sure that my data will be there when I want it?

Page 14:

Integrated Infrastructure Services: Integrated Environment Key to Usability

[Figure: many data sources (computers, sensor-nets, instruments) feed a pipeline of data storage; data management (file systems, database systems, collection management, data integration, etc.); data manipulation (simulation, analysis, visualization, modeling); and data access.]

• Database selection and schema design

• Portal creation and collection publication

• Data analysis

• Data mining

• Data hosting

• Preservation services

• Domain-specific tools• Biology Workbench

• Montage (astronomy mosaicking)

• Kepler (Workflow management)

• Data visualization

• Data anonymization, etc.

Page 15:

Data Hosting: SDSC DataCentral – A Comprehensive Facility for Research Data

• Broad program to support research and community data collections and databases
• DataCentral services include:
 • Public data collection and database hosting
 • Long-term storage and preservation (tape and disk)
 • Remote data management and access (SRB, portals)
 • Data analysis, visualization, and data mining
 • Professional, qualified 24/7 support
• DataCentral resources include:
 • 1 PB on-line disk
 • 25 PB StorageTek tape library capacity
 • 540 TB storage-area network (SAN)
 • DB2, Oracle, MySQL
 • Storage Resource Broker
 • GPFS-WAN with 700 TB

PDB – 28 TB, with Web-based portal access

Page 16:

DataCentral Allocated Collections include

Seismology 3D Ground Motion Collection for the LA Basin

Atmospheric Sciences 50-year Downscaling of Global Analysis over California Region

Earth Sciences NEXRAD Data in Hydrometeorology and Hydrology

Elementary Particle Physics

AMANDA data

Biology AfCS Molecule Pages

Biomedical Neuroscience BIRN

Networking Backbone Header Traces

Networking Backscatter Data

Biology Bee Behavior

Biology Biocyc (SRI)

Art C5 landscape Database

Geology Chronos

Biology CKAAPS

Biology DigEmbryo

Earth Science Education ERESE

Earth Sciences UCI ESMF

Earth Sciences EarthRef.org

Earth Sciences ERDA

Earth Sciences ERR

Biology Encyclopedia of Life

Life Sciences Protein Data Bank

Geosciences GEON

Geosciences GEON-LIDAR

Geochemistry Kd

Biology Gene Ontology

Geochemistry GERM

Networking HPWREN

Ecology HyperLter

Networking IMDC

Biology Interpro Mirror

Biology JCSG Data

Government Library of Congress Data

Geophysics Magnetics Information Consortium data

Education UC Merced Japanese Art Collections

Geochemistry NAVDAT

Earthquake Engineering NEESIT data

Education NSDL

Astronomy NVO

Government NARA

Anthropology GAPP

Neurobiology Salk data

Seismology SCEC TeraShake

Seismology SCEC CyberShake

Oceanography SIO Explorer

Networking Skitter

Astronomy Sloan Digital Sky Survey

Geology Sensitive Species Map Server

Geology SD and Tijuana Watershed data

Oceanography Seamount Catalogue

Oceanography Seamounts Online

Biodiversity WhyWhere

Ocean Sciences Southeastern Coastal Ocean Observing and Prediction Data

Structural Engineering TeraBridge

Various TeraGrid data collections

Biology Transporter Classification Database

Biology TreeBase

Art Tsunami Data

Education ArtStor

Biology Yeast regulatory network

Biology Apoptosis Database

Cosmology LUSciD

Page 17:

Data Visualization is key

SCEC Earthquake simulations

Visualization of Cancer Tumors

Prokudin-Gorskii historical images

Information and images courtesy of Amit Chourasia, SCEC, Steve Cutchin, Moores Cancer Center, David Minor, U.S. Library of Congress

Page 18:

Building and Delivering Capable Data Cyberinfrastructure

Page 19:

Infrastructure Should be Non-memorable

• Good infrastructure should be:
 • Predictable
 • Pervasive
 • Cost-effective
 • Easy-to-use
 • Reliable
 • Unsurprising

• What’s required to build and provide useful, usable, and capable data Cyberinfrastructure?

Page 20:

Building Capable Data Cyberinfrastructure: Incorporating the “ilities”

• Scalability
• Interoperability
• Reliability
• Capability
• Sustainability
• Predictability
• Accessibility
• Responsibility
• Accountability
• …

Page 21:

Reliability: What can go wrong

Entity at risk   What can go wrong                                          Frequency
File             Corrupted media, disk failure                              1 year
Tape             + Simultaneous failure of 2 copies                         5 years
System           + Systemic errors in vendor SW, or malicious user, or
                 operator error that deletes multiple copies                15 years
Archive          + Natural disaster, obsolescence of standards              50-100 years

• How can we maximize data reliability? Replication, UPS systems, heterogeneity, etc.
• How can we measure data reliability? Network availability = 99.999% uptime (“5 nines”); what is the equivalent number of “9’s” for data reliability?

Information courtesy of Reagan Moore
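Replication only protects data if replicas are routinely verified against each other; silent corruption on one copy must be detected before the other copies fail too. A minimal sketch of checksum-based replica verification (illustrative only; the file names and repair policy are hypothetical, not SDSC’s actual tooling):

```python
# Toy sketch of checksum-based replica verification (hypothetical paths;
# not SDSC production software). Detects a silently corrupted replica.
import hashlib
import os
import tempfile

def sha256_of(path, chunk=1 << 20):
    """Stream a file through SHA-256 so large replicas don't fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def verify_replicas(paths):
    """Return (ok, digests): ok is True only when every replica matches."""
    digests = {p: sha256_of(p) for p in paths}
    return len(set(digests.values())) == 1, digests

# Demo: three replicas of one object, one silently corrupted.
tmp = tempfile.mkdtemp()
paths = [os.path.join(tmp, name) for name in ("a.dat", "b.dat", "c.dat")]
for p in paths:
    with open(p, "wb") as f:
        f.write(b"important collection bytes")
with open(paths[2], "r+b") as f:  # flip one byte: simulated media corruption
    f.write(b"X")

ok, digests = verify_replicas(paths)
print("all replicas consistent:", ok)
```

In practice the healthy majority copy would then be used to re-seed the bad replica; how often such audits run is one input to the "number of 9's" question above.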

Page 22:

Responsibility and Accountability

• Who owns the data?

• Who takes care of the data?

• Who pays for the data?

• Who can access the data?

• What are reasonable expectations between users and repositories?

• What are reasonable expectations between federated partner repositories?

• What are appropriate models for evaluating repositories?

• What incentives promote good stewardship? What should happen if/when the system fails?

Page 23:

Good Data Infrastructure Incurs Real Costs

• Most valuable data must be replicated
• SDSC research collections have been doubling every 15 months
• SDSC storage is 25 PB and counting; data comes from supercomputer simulations, digital library collections, etc.

Capacity Costs

[Figure: archival storage (TB, log scale), June 1997 – June 2009: Model A (8-yr, 15.2-mo doubling), TB stored, and planned capacity.]

Capability Costs

• Reliability is increased by up-to-date and robust hardware and software for:
 • Replication (disk, tape, geographic)
 • Backups, updates, syncing
 • Audit trails
 • Verification through checksums, physical media, network transfers, copies, etc.
• Data professionals are needed to facilitate:
 • Infrastructure maintenance
 • Long-term planning
 • Restoration and recovery
 • Access, analysis, preservation, and other services
 • Reporting, documentation, etc.

Information courtesy of Richard Moore
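A 15-month doubling time compounds quickly, which is why the "infinite, increasing mortgage" framing on the next slide matters. A quick sketch of the arithmetic (the starting size is an assumed round number, not an SDSC figure):

```python
# Illustrative arithmetic for exponential collection growth with a fixed
# doubling time (~15 months, the SDSC rate quoted above). The 1 PB
# starting point is an assumption for illustration.
def projected_tb(start_tb, months, doubling_months=15):
    """Capacity needed after `months`, doubling every `doubling_months`."""
    return start_tb * 2 ** (months / doubling_months)

start = 1000.0  # 1 PB expressed in TB (assumed starting point)
for years in (1, 5, 10):
    print(f"after {years:2d} years: {projected_tb(start, 12 * years):,.0f} TB")
```

At this rate a collection grows roughly 28-fold in five years and nearly 800-fold in ten, so storage budgets must grow continuously just to stand still.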

Page 24:

Economic Sustainability

• Making infinite funding finite:
 • Difficult to support infrastructure for data preservation as an infinite, increasing mortgage
 • Creative partnerships help create sustainable economic models:
  • Relay funding
  • Consortium support
  • User fees, recharges
  • Endowments
  • Hybrid solutions

Geisel Library at UCSD

Page 25:

Preserving Digital Information Over the Long Term

Page 26:

How much Digital Data is there?

Kilo = 10^3, Mega = 10^6, Giga = 10^9, Tera = 10^12, Peta = 10^15, Exa = 10^18, Zetta = 10^21

U.S. Library of Congress manages 295 TB of digital data, 230 TB of which is “born digital”

SDSC HPSS tape archive = 25+ PetaBytes

1 novel = 1 MegaByte

Source: “The Expanding Digital Universe: A forecast of Worldwide Information Growth through 2010” IDC Whitepaper, March 2007

• 5 exabytes of digital information produced in 2003
• 161 exabytes of digital information produced in 2006
 • 25% of the 2006 digital universe is “born digital” (digital pictures, keystrokes, phone calls, etc.)
 • 75% is replicated (emails forwarded, backed-up transaction records, movies in DVD format)
• 1 zettabyte of aggregate digital information projected for 2010

iPod (up to 20K songs) = 80 GB
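The unit prefixes and equivalences on this slide make the scale concrete with some back-of-the-envelope arithmetic (illustrative only, using the slide’s own figures):

```python
# Back-of-the-envelope conversions using the slide's equivalences:
# 1 novel = 1 MB, an iPod = 80 GB, SDSC's HPSS archive = 25+ PB,
# and 161 EB of digital information produced in 2006.
KB, MB, GB, TB, PB, EB, ZB = (10**e for e in (3, 6, 9, 12, 15, 18, 21))

novels_in_2006_universe = 161 * EB / MB       # novels' worth of 2006 output
ipods_for_sdsc_archive = 25 * PB / (80 * GB)  # iPods to hold the tape archive
print(f"{novels_in_2006_universe:.2e} novels, "
      f"{ipods_for_sdsc_archive:,.0f} iPods")
```

That is on the order of 10^14 novels produced in one year, and over three hundred thousand iPods just to mirror one center’s tape archive.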

Page 27:

How much Storage is there?

• 2007 is the “crossover year” where the amount of digital information is greater than the amount of available storage

• Given the projected rates of growth, we will never have enough space again for all digital information

Source: “The Expanding Digital Universe: A forecast of Worldwide Information Growth through 2010” IDC Whitepaper, March 2007

Page 28:

Focus for Preservation: the “most valuable” data

• What is “valuable”?
 • Community reference data collections (e.g., UniProt, PDB)
 • Irreplaceable collections
 • Official collections (e.g., census data, electronic federal records)
 • Collections which are very expensive to replicate (e.g., CERN data)
 • Longitudinal and historical data
 • and others …

[Figure: value and cost of data collections plotted over time.]

Page 29:

The Data Pyramid: A Framework for Digital Stewardship

• Preservation efforts should focus on collections deemed “most valuable”
• Key issues:
 • What do we preserve?
 • How do we guard against data loss?
 • Who is responsible?
 • Who pays? Etc.

[Figure: two pyramids. Digital data collections, by increasing value and trust: personal data collections (local scale); key research and community data collections (“regional” scale); reference, nationally important, and irreplaceable data collections (national/international scale). Repositories/facilities, by increasing stability and infrastructure and decreasing risk/responsibility: private repositories; “regional”-scale libraries and targeted data centers; national/international-scale data repositories, archives, and libraries.]

Page 30:

Digital Collections of Community Value

[Figure: the Data Pyramid again, spanning local, “regional”, and national/international scales.]

• Key techniques for preservation: replication, heterogeneous support

Page 31:

The Chronopolis Model: A Conceptual Model for Preservation Data Grids

• Geographically distributed preservation data grid that supports long-term management, stewardship of, and access to digital collections
• Implemented by developing and deploying a distributed data grid, and by supporting its human, policy, and technological infrastructure
• Integrates targeted technology forecasting and migration to support long-term life-cycle management and preservation

[Figure: digital information of long-term value flows into a distributed production preservation environment, supported by technology forecasting and migration, and by administration, policy, and outreach.]

Page 32:

Chronopolis Focus Areas and Demonstration Project Partners

• Chronopolis R&D, Policy, and Infrastructure Focus areas:

• Assessment of the needs of potential user communities and development of appropriate service models

• Development of formal roles and responsibilities of providers, partners, users

• Assessment and prototyping of best practices for bit preservation, authentication, metadata, etc.

• Development of appropriate cost and risk models for long-term preservation

• Development of appropriate success metrics to evaluate usefulness, reliability, and usability of infrastructure

Two prototypes:
• National Demonstration Project
• Library of Congress Pilot Project

Partners: SDSC/UCSD, U Maryland, UCSD Libraries, NCAR, NARA, Library of Congress, NSF, ICPSR, Internet Archive, NVO

Demonstration Project information courtesy of Robert McDonald

Page 33:

National Demonstration Project – Large-scale Replication and Distribution

• Focus on supporting multiple, geographically distributed copies of preservation collections:

• “Bright copy” – Chronopolis site supports ingestion, collection management, user access

• “Dim copy” – Chronopolis site supports remote replica of bright copy and supports user access

• “Dark copy” – Chronopolis site supports reference copy that may be used for disaster recovery but no user access

• Each site may play different roles for different collections

[Figure: Chronopolis Federation architecture, with SDSC, U Md, and NCAR sites each holding bright, dim, and dark copies of collections C1 and C2.]

Demonstration collections included:
• National Virtual Observatory (NVO) [1 TB Digital Palomar Observatory Sky Survey]
• Copy of Inter-university Consortium for Political and Social Research (ICPSR) data [1 TB Web-accessible data]
• NCAR observational data [3 TB of observational and re-analysis data]
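The bright/dim/dark role assignment can be modeled very simply. This is a hypothetical sketch, not Chronopolis software; the particular site-to-role mapping below is invented for illustration (the source says only that each site may play different roles for different collections):

```python
# Hypothetical model of Chronopolis-style copy roles (not actual
# Chronopolis code). Bright = ingestion + access; dim = remote replica
# with access; dark = disaster-recovery reference copy, no user access.
from enum import Enum

class Role(Enum):
    BRIGHT = "bright"  # ingestion, collection management, user access
    DIM = "dim"        # remote replica of the bright copy, user access
    DARK = "dark"      # reference copy, disaster recovery only

# Invented assignment: each site plays different roles per collection.
federation = {
    "C1": {"SDSC": Role.BRIGHT, "NCAR": Role.DIM, "UMd": Role.DARK},
    "C2": {"NCAR": Role.BRIGHT, "UMd": Role.DIM, "SDSC": Role.DARK},
}

def access_sites(collection):
    """Sites users may read from: bright and dim copies only."""
    return sorted(site for site, role in federation[collection].items()
                  if role is not Role.DARK)

def recovery_sites(collection):
    """Sites holding any copy usable for disaster recovery."""
    return sorted(federation[collection])

print(access_sites("C1"))  # the dark copy at UMd is excluded
```

The useful property of the model is that losing any single site leaves at least one accessible copy and one recovery copy of every collection.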

Page 34:

SDSC/ UCSD Libraries Pilot Project with U.S. Library of Congress

Goal: To “… demonstrate the feasibility and performance of current approaches for a production digital Data Center to support the Library of Congress’ requirements.”

Library of Congress Pilot Project information courtesy of David Minor

Prokudin-Gorskii Photographs

(Library of Congress Prints and Photographs Division)

http://www.loc.gov/exhibits/empire/

(also collection of web crawls from the Internet Archive)

• Historically important 600 GB Library of Congress image collection
• Images over 100 years old, with red, blue, and green components (kept as separate digital files)
• SDSC stores 5 copies, with a dark archival copy at NCAR
• Infrastructure must support an idiosyncratic file structure; special logging and monitoring software was developed so that both SDSC and the Library of Congress could access information

Page 35:

Pilot Projects Provided Invaluable Experience with Key Issues

• Technical issues: how to address integrity, verification, provenance, authentication, etc.
• Legal/policy issues: who is responsible? Who is liable?
• Social issues: what formats and standards are acceptable to the community? How do we formalize trust?
• Infrastructure issues: what kinds of resources (servers, storage, networks) are required? How should they operate?
• Evaluation issues: what is reliable? What is successful?
• Cost issues: what is cost-effective? How can support be sustained over time?

Page 36:

It’s Hard to be Successful in the Information Age without Reliable, Persistent Information

• Inadequate/unrealistic general solution: “Let X do it”, where X is:
 • The Government
 • The Libraries
 • The Archivists
 • Google
 • The private sector
 • Data owners
 • Data generators, etc.

• Creative partnerships needed to provide preservation solutions with:
 • Trusted stewards
 • Feasible costs for users
 • Sustainable costs for infrastructure
 • Very low risk of data loss, etc.

Page 37:

October 31, 2006, Office of CyberInfrastructure

Blue Ribbon Task Force to Focus on Economic Sustainability

[Image: a user surrounded by stakeholders at every level: federal, state, local, international, non-profit, college, university, and commercial. Image courtesy of Chris Greer]

• International Blue Ribbon Task Force (BRTF-SDPA) to begin in 2008 to study issues of economic sustainability of digital preservation and access

• Support from:
 • National Science Foundation
 • Library of Congress
 • Mellon Foundation
 • Joint Information Systems Committee
 • National Archives and Records Administration
 • Council on Library and Information Resources

Page 38:

BRTF-SDPA

Charge to the Task Force:

1. To conduct a comprehensive analysis of previous and current efforts to develop and/or implement models for sustainable digital information preservation; (First year report)

2. To identify and evaluate best practice regarding sustainable digital preservation among existing collections, repositories, and analogous enterprises;

3. To make specific recommendations for actions that will catalyze the development of sustainable resource strategies for the reliable preservation of digital information; (Second Year report)

4. To provide a research agenda to organize and motivate future work.

How you can be involved:

• Contribute your ideas (oral and written “testimony”)

• Suggest readings (website will serve as a community bibliography)

• Write an article on the issues for a new community (Important component will be to educate decision makers and the public about digital preservation)

Website to be launched this Fall; it will be linked from www.sdsc.edu

Page 39:

Many Thanks

• Phil Andrews, Reagan Moore, Ian Foster, Jack Dongarra, the authors of the IDC report, Ben Tolo, Richard Moore, David Moore, Robert McDonald, the Southern California Earthquake Center, David Minor, Amit Chourasia, the U.S. Library of Congress, Moores Cancer Center, the National Archives and Records Administration, NSF, Chris Greer, Nancy Wilkins-Diehr, and many others …

www.sdsc.edu

[email protected]