Page 1: Mike Smorul Saurabh Channan

Mike Smorul

Saurabh Channan

Digital Preservation and Archiving at the Institute for Advanced Computer Studies, University of Maryland, College Park

Page 2

Overview

• Digital Preservation Research
  – ADAPT Project and Components
  – Pilot Persistent Archive
• Digital Library and Production Data Distribution
  – Global Land Cover Facility
• Conclusion & Questions

Page 3

A Digital Approach to Preservation Technology (ADAPT)

• Premise:
  – Preservation of digital entities as self-describing objects
    • OAIS Information Package model as a framework
  – Separation of management into three layers: bitstream, semantic, and access/discovery
  – Distributed and Secure Infrastructure
    • Automatic ingestion and replication
    • Policy-Driven Management of Preservation Processes
    • Global Format Registry
    • Separate Peer-to-Peer Deep Archive

Page 4

ADAPT Architecture

[Diagram. Labels: Data Management; Metadata Management; Descriptive Metadata; Preservation Metadata; Administrative Metadata; Deep Archive; Data Grid; Conventional Archive; PAWN; Metadata; Data; CAN; Management of Preservation Processes]

Page 5

ADAPT Components

• Ingestion
  – Producer-Archive Workflow Network (PAWN)
• Management of Preservation Processes
  – Lightweight Preservation Environment (LPE)
• Access and Discovery
  – Grid Retrieval and Search Platform (GRASP)
  – EAP Collection browser

Page 6

Overall Principles (PAWN)

• Distributed, secure ingestion
• OAIS-based Information Package creation
• Use of web/grid technologies (platform independent)
• Minimal client-side requirements
• Ease of integration with archive and data grid systems
• Designed to satisfy data integrity requirements of scientific collections and digital preservation

Page 7

Distributed Ingestion (PAWN)

[Diagram. Labels: Producer (repeated four times); Distributed Archive]

Page 8

Ingestion Workflow (PAWN)

1. Negotiate Submission Agreement.

2. Workflow Initialization and Submission Information Package (SIP) creation.

3. Transfer of SIPs to Data Grid site.

4. Validation of SIP transfer.

5. Organization of data into collections and transfer into Data Grid.
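
Steps 2 through 4 can be illustrated with a minimal sketch. It assumes that a SIP carries a checksum manifest and that transfer validation means recomputing those checksums at the receiving site. The slides do not say how PAWN validates a transfer, so the function names, the manifest layout, and the choice of SHA-256 are all hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest of a bitstream (SHA-256 is an assumption, not PAWN's documented choice)."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(files: dict[str, bytes]) -> dict[str, str]:
    """Producer side (step 2): record one checksum per file in the package."""
    return {name: sha256_hex(data) for name, data in files.items()}


def validate_transfer(received: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """Archive side (step 4): return the names of files that are missing or altered."""
    failures = []
    for name, expected in manifest.items():
        data = received.get(name)
        if data is None or sha256_hex(data) != expected:
            failures.append(name)
    return sorted(failures)
```

An empty list from `validate_transfer` would let the workflow proceed to step 5. A non-empty list names the files the producer would need to resend.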

Page 9

Component Overview (PAWN)

[Diagram. Labels: CRL check; Success/Failure notification of ingestion; Metadata registration/retrieval; Producer Management Interface; Data Grid Management Interface; Producer data suppliers; Bitstream Validation Service; Data Grid]

Page 10

Target Collections (PAWN)

• Digital Image Collection
  – Rich metadata in various formats
• Web site crawling
  – Online and interactive content
• GLCF Landsat data
  – Spatial and temporal metadata
  – Large quantity (over 15,000 objects)

Page 11

Lightweight Preservation Environment (LPE)

• The Lightweight Preservation Environment is an archival system based on a modular design using grid and web services.
• The current implementation relies mostly on Globus technologies.
• Primarily, we’ve focused on wrapping logic around those components.

Page 12

Developed Components (LPE)

• Data Manager (DM): Organizes data and queries between the user and the other components
• Policy Manager (PM): Ensures that a minimum number of copies exist for any given file
• Transformation Manager (TM): Executes specific transformations on a named file on a given storage node and returns the results
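
The Policy Manager's minimum-copies rule can be sketched as a small planning function. This is an illustration only: the slides do not describe the LPE interfaces, so `replication_plan`, the shape of the replica map, and the node names below are hypothetical.

```python
def replication_plan(
    replicas: dict[str, set[str]],
    nodes: list[str],
    min_copies: int,
) -> dict[str, list[str]]:
    """For each under-replicated file, choose nodes that should receive a new copy.

    replicas maps a file name to the set of nodes that already hold it.
    nodes lists all storage nodes in preference order.
    """
    plan = {}
    for name, holders in replicas.items():
        missing = min_copies - len(holders)
        if missing <= 0:
            continue  # policy already satisfied for this file
        # Only nodes that do not yet hold the file are useful targets.
        candidates = [node for node in nodes if node not in holders]
        plan[name] = candidates[:missing]
    return plan
```

With `min_copies=2`, a file held only on `node-a` would be scheduled for one more copy on the first other node in the list, while a file already on two nodes would not appear in the plan. If fewer candidate nodes exist than copies are missing, the plan simply lists the nodes that are available.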

Page 13

Grid Retrieval and Search Platform (GRASP)

• Based on concepts from the Earth Science Data Interface (ESDI), developed at the UMIACS GLCF.

• Provides a graphical interface into data grid holdings.

• Access to entire GLCF holdings through the Storage Resource Broker (SRB)

Page 14

GRASP Architecture

[Diagram. Labels: I/O Abstraction Layer; Data Grid; Clients; Data download; Query Abstraction; Browse / Display; Spatial Information; Textual Information; Data Discovery]

Page 15

GRASP Architecture

• GRASP uses a data grid as an abstract storage repository.
• Metadata in the grid is mined from the grid itself or from external sources and published into a browsable form.
  – Data grids may allow for platform-independent metadata, but may not be optimal for access
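
The two ideas on this slide, an abstract storage repository and metadata mined into a browsable form, can be sketched as follows. None of these names come from GRASP or SRB. `StorageBackend`, `InMemoryBackend`, and `build_catalog` are hypothetical, and the in-memory backend stands in for a real data grid.

```python
from abc import ABC, abstractmethod
from typing import Callable


class StorageBackend(ABC):
    """Abstract view of a storage repository; a data grid would be one implementation."""

    @abstractmethod
    def list_objects(self) -> list[str]:
        """Return the names of all stored objects."""

    @abstractmethod
    def read(self, name: str) -> bytes:
        """Return the bitstream stored under the given name."""


class InMemoryBackend(StorageBackend):
    """Stand-in backend used only for illustration."""

    def __init__(self, objects: dict[str, bytes]) -> None:
        self._objects = dict(objects)

    def list_objects(self) -> list[str]:
        return sorted(self._objects)

    def read(self, name: str) -> bytes:
        return self._objects[name]


def build_catalog(
    backend: StorageBackend,
    extract: Callable[[str, bytes], dict],
) -> dict[str, dict]:
    """Mine metadata from every stored object into an index that can be browsed
    without going back to the storage layer."""
    return {name: extract(name, backend.read(name)) for name in backend.list_objects()}
```

Keeping the catalog separate from the storage backend reflects the slide's caveat: the grid can remain the platform-independent store of record while browsing and search run against a form better suited to access.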

Page 16

GRASP Screenshot

Page 17

Global Land Cover Facility

• Mission: “The GLCF Mission is to encourage the use of remotely sensed imagery, derived products and applications within a broad range of science communities in a manner that improves comprehension of the nature and causes of land cover change and its impact on the Earth.”

• Goal: “The GLCF Goal is to provide free access to an integrated collection of critical land cover and Earth science data through systems that are designed to maximize user outreach and that promote development of novel tools for ordering, visualizing and manipulating spatial data.”

Page 18

Data Collections

The majority of the holdings are Landsat and MODIS data.

Page 19

Data Distribution

• Data at the GLCF
  – Approximately 5.1 TB compressed
  – Approximately 13 TB uncompressed
• Anticipated Production Rate
  – Triple or quadruple current data holdings within the next two years

[Chart: Data Traffic. Horizontal axis: Month. Vertical axis: Megabytes, 0 to 12,000,000]
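
The holdings figures on this slide imply a compression ratio and a projected range, worked out in the short sketch below. The two input values are taken from the slide. The function names are illustrative, and the projection simply multiplies current holdings by the stated factors of three and four.

```python
COMPRESSED_TB = 5.1     # approximate compressed holdings, from the slide
UNCOMPRESSED_TB = 13.0  # approximate uncompressed holdings, from the slide


def compression_ratio(uncompressed_tb: float, compressed_tb: float) -> float:
    """Uncompressed size divided by compressed size."""
    return uncompressed_tb / compressed_tb


def projected_holdings(current_tb: float, factors: tuple[int, ...] = (3, 4)) -> dict[int, float]:
    """Holdings in TB after growth by each factor, rounded to one decimal place."""
    return {factor: round(current_tb * factor, 1) for factor in factors}
```

On these numbers the ratio is roughly 2.5 to 1. Tripling or quadrupling would put the compressed holdings at about 15.3 to 20.4 TB and the uncompressed holdings at about 39 to 52 TB.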

Page 20

Data Discovery Applications

ESDI

Web Interface

User friendly

Search

Retrieve

Discover

Scalable

Over 9 TB a month!

Page 21

GLCF Archive

[Diagram. Labels: SunFire V100 (repeated for each server); ProFTPd servers; File Servers; RAIDs; ftp://ftp.glcf…./../something.tif]

Scalable and Reliable

Page 22

Participation Possibilities

• PAWN ingestion component
  – Minimal geospatial metadata support planned; can be expanded to support NGDA endpoint
• GRASP display component
  – Solid core components; end-user interfaces need additional polishing
• GLCF data holdings
  – Additional hardware required if additional data and access mechanisms (grid, etc.) are required
• Other possibilities include: grid infrastructure, GSI security, format registry, etc.

Page 23

Questions