San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure Integration of Data Grids, Digital Libraries, and Persistent Archives (Storage Resource Broker - SRB) Arcot Rajasekar Michael Wan Reagan W. Moore (sekar, mwan, moore)@sdsc.edu

Integration of Data Grids, Digital Libraries, and Persistent

Page 1: Integration of Data Grids, Digital Libraries, and Persistent

San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure

Integration of Data Grids, Digital Libraries, and Persistent Archives

(Storage Resource Broker - SRB)

Arcot Rajasekar
Michael Wan
Reagan W. Moore
(sekar, mwan, moore)@sdsc.edu

Page 2: Integration of Data Grids, Digital Libraries, and Persistent

SDSC SRB Team
• Reagan Moore
• Michael Wan
• Arcot Rajasekar
• Wayne Schroeder
• Arun Jagatheesan
• Charlie Cowart
• Lucas Gilbert
• George Kremenek
• Sheau-Yen Chen
• Bing Zhu
• Roman Olschanowsky (BIRN)
• Vicky Rowley (BIRN)
• Marcio Faerman (SCEC)
• Antoine De Torcy (IN2P3)
• Students & emeritus
  – Erik Vandekieft
  – Reena Mathew
  – Xi (Cynthia) Sheng
  – Allen Ding
  – Grace Lin
  – Qiao Xin
  – Daniel Moore
  – Ethan Chen
  – Jon Weinburg

Page 3: Integration of Data Grids, Digital Libraries, and Persistent

Topics

• Concepts behind data management

• Production data grid examples

• Integration of data grids with digital libraries and persistent archives

Page 4: Integration of Data Grids, Digital Libraries, and Persistent

Data Grid

• Support data sharing between institutions
  – Discover relevant data without knowing the file name
  – Access data without knowing the storage location or storage access protocol
  – Retrieve data using your preferred API
• Organize distributed data in a collection hierarchy
• Manage latency in wide-area networks
• Manage PetaBytes of data and hundreds of millions of files

Page 5: Integration of Data Grids, Digital Libraries, and Persistent

Digital Library

• Provide curation services
  – Organization, description, and management of data
  – Support schema extension
• Provide access services
  – Discovery, browsing, presentation, and manipulation of data
• Federate semantics across collections
  – Digital library crosswalks

Page 6: Integration of Data Grids, Digital Libraries, and Persistent

Persistent Archive

• Support archival processes
  – Appraisal, accession, arrangement, description, preservation, and access
• Manage technology evolution while preserving integrity and authenticity of data
• Minimize risk of data loss
  – Preserve collections for hundreds of years
  – Data replication

Page 7: Integration of Data Grids, Digital Libraries, and Persistent

Challenges

• Each community assigns different meanings to terms used to describe their requirements
• Data grid community
  – Persistent Archive is the infrastructure that manages storage technology evolution while preserving a collection
• Archivist community
  – Persistent Archive is the collection that is being preserved in some choice of infrastructure
• Together they define a preservation environment

Page 8: Integration of Data Grids, Digital Libraries, and Persistent

Challenges

• Preservation community traditionally views technology evolution as the problem rather than the solution
  – Preservation requires the ability to manipulate old formats
• Digital library community attempts to assert exact meaning for semantics
  – Metadata Encoding and Transmission Standard is one approach towards the creation of a metadata framework with the ability to support extension schema
• Data grid community has not chosen standards for distributed data management
  – Computer science is just starting to understand how to characterize and manage data, information, and knowledge

Page 9: Integration of Data Grids, Digital Libraries, and Persistent

To Make Progress

• Develop the simplest possible description of data, information, and knowledge management
• Identify common infrastructure components
• Apply in production settings
  – Iterate, based on new expectations for data management

Page 10: Integration of Data Grids, Digital Libraries, and Persistent

Common Requirements for Data Management

• Distributed data sources
  – Management across administrative domains
• Heterogeneity
  – Multiple types of storage repositories
• Scalability
  – Support for billions of digital entities, PetaBytes of data
• Preservation
  – Management of technology evolution

Page 11: Integration of Data Grids, Digital Libraries, and Persistent

SRB Collections at SDSC

                        As of 12/22/2000       As of 5/17/2002        As of 3/3/2004
Project Instance   Data_size       Count Data_size       Count Data_size       Count  Users
                     (in GB)     (files)   (in GB)     (files)   (in GB)     (files)
Data Grid
  Digsky            7,599.00   3,630,300 17,800.00   5,139,249 45,939.00   8,685,572     80
  NPACI               329.63      46,844  1,972.00   1,083,230 13,700.00   4,050,863    379
  Hayden                                  6,800.00      41,391  7,835.00      60,001    168
  SLAC-JCSG                                 514.00      77,168  3,432.00     446,613     43
  LDAS/SALK                                 239.00       1,766  2,002.00      14,427     66
  TeraGrid                                                     22,563.00     452,868  2,585
  BIRN                                                            892.00   2,472,299    160
Digital Library
  DigEmbryo           124.30       2,479    433.00      31,629    720.00      45,365     23
  HyperLter            28.94          69    158.00       3,596    215.00       5,110     29
  Portal                                     33.00       5,485  1,610.00      46,278    374
  AfCS                                       27.00       4,007    236.00      42,987     21
  NSDL/SIO Exp                               19.20         383  1,217.00     193,888     26
  Transana                                    5.80          92     92.00       2,387     26
  SCEC                                                         12,311.00   1,730,432     47
  UCSDLib                                                         127.00     202,445     29
Persistent Archive
  NARA/Collection                             7.00       2,455     72.00      82,192     58
  NSDL/CI                                                       1,529.00  12,658,072    116
TOTAL                   8 TB 3.7 million     28 TB 6.4 million    114 TB  31 million 4230 **

** Does not cover data brokered by SRB spaces administered outside SDSC. Does not cover databases; covers only files stored in file systems and archival storage systems. Does not cover shadow-linked directories.

Page 12: Integration of Data Grids, Digital Libraries, and Persistent

Data Management Concepts (Elements)

• Collection
  – The organization of digital entities to simplify management and access.
• Context
  – The information that describes the digital entities in a collection.
• Content
  – The digital entities in a collection

Page 13: Integration of Data Grids, Digital Libraries, and Persistent

Types of Context Metadata

• Descriptive
  – Provenance information, discovery attributes
• Administrative
  – Location, ownership, size, time stamps
• Structural
  – Data model, internal components
• Behavioral
  – Display and manipulation operations
• Authenticity
  – Audit trails, checksums, access controls

Page 14: Integration of Data Grids, Digital Libraries, and Persistent

Metadata Standards

• METS - Metadata Encoding and Transmission Standard
  – Defines standard structure and schema extension
• OAIS - Open Archival Information System
  – Preservation packages for submission, archiving, distribution
• OAI - Open Archives Initiative
  – Metadata retrieval based on Dublin Core provenance attributes

Page 15: Integration of Data Grids, Digital Libraries, and Persistent

Data Management Concepts (Mechanisms)

• Curation
  – The process of creating the context
• Closure
  – Assertion that the collection has global properties, including completeness and homogeneity under specified operations
• Consistency
  – Assertion that the context represents the content

Page 16: Integration of Data Grids, Digital Libraries, and Persistent

Information Technologies

• Data collecting
  – Sensor systems, object ring buffers and portals
• Data organization
  – Collections, manage data context
• Data sharing
  – Data grids, manage heterogeneity
• Data publication
  – Digital libraries, support discovery
• Data preservation
  – Persistent archives, manage technology evolution
• Data analysis
  – Processing pipelines, manage knowledge extraction

Page 17: Integration of Data Grids, Digital Libraries, and Persistent

Assertion

• Data Grids provide the underlying abstractions required to support
  – Digital libraries
    • Curation processes
    • Distributed collections
    • Discovery and presentation services
  – Persistent archives
    • Management of technology evolution
    • Preservation of authenticity
• The management of data requires the use of information (semantic labels).
• The management of information requires the use of knowledge (relationships).

Page 18: Integration of Data Grids, Digital Libraries, and Persistent

Data Grid Terms

• Data
  – Bits - zeros and ones
• Digital Entity
  – The bits that form an image of reality (file, object, image, data, metadata, string of bits, structured sets of strings of bits)
• Information
  – Semantic labels applied to data
• Metadata
  – Semantic label and the associated data (attribute name and attribute value)
• Knowledge
  – Relationships between semantic labels applied to data
  – Relationships used to assert the application of a semantic label
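
The terms above can be made concrete with a small sketch. The class names (`DigitalEntity`, `Metadata`, `Relationship`) and the example labels are illustrative assumptions, not SRB or MCAT data structures.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Metadata:
    """Information: a semantic label (attribute name) with its attribute value."""
    name: str
    value: str


@dataclass
class DigitalEntity:
    """Data: a string of bits, plus the metadata that has been applied to it."""
    bits: bytes
    metadata: list[Metadata] = field(default_factory=list)

    def label(self, name: str, value: str) -> None:
        self.metadata.append(Metadata(name, value))


@dataclass(frozen=True)
class Relationship:
    """Knowledge: a relationship between two semantic labels."""
    subject: str
    predicate: str
    obj: str


def labels_related_to(label: str, knowledge: list[Relationship]) -> set[str]:
    """Return every label directly related to `label`."""
    related = set()
    for r in knowledge:
        if r.subject == label:
            related.add(r.obj)
        elif r.obj == label:
            related.add(r.subject)
    return related


# Hypothetical example values
entity = DigitalEntity(bits=b"\x00\x01\x01\x00")
entity.label("instrument", "seismometer")
knowledge = [Relationship("seismometer", "is_a", "sensor")]
```

Here the bits carry no meaning until a label is attached, and the relationship list is what lets a query for "sensor" reach an entity labeled "seismometer".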

Page 19: Integration of Data Grids, Digital Libraries, and Persistent

Data Grid Components

• Federated client-server architecture
  – Servers can talk to each other independently of the client
• Infrastructure independent naming
  – Logical names for users, resources, files, applications
• Collective ownership of data
  – Collection-owned data, with infrastructure independent access control lists
• Context management
  – Record state information in a metadata catalog from data grid services such as replication
• Abstractions for dealing with heterogeneity

Page 20: Integration of Data Grids, Digital Libraries, and Persistent

Data Grid Abstractions

• Logical name space for files
  – Global persistent identifier
• Storage repository virtualization
  – Standard operations supported on storage systems
• Information repository virtualization
  – Standard operations to manage collections in databases
• Access virtualization
  – Standard interface to support alternate APIs
• Latency management mechanisms
  – Aggregation, parallel I/O, replication, caching
• Security interoperability
  – GSSAPI, inter-realm authentication, collection-based authorization

Page 21: Integration of Data Grids, Digital Libraries, and Persistent

Storage Repository Virtualization

Archive Database File System

User Application

Page 22: Integration of Data Grids, Digital Libraries, and Persistent

Storage Repository Virtualization

Archive | Database | File System

Common set of operations for interacting with every type of storage repository

User Application

Remote operations
• Unix file system
• Latency management
• Procedures
• Transformations
• Third party transfer
• Filtering
• Queries

Page 23: Integration of Data Grids, Digital Libraries, and Persistent

Mappings on Resource Name Space

• Define logical resource name
  – List of physical resources
• Replication
  – Write to a logical resource completes when all physical resources have a copy
• Load balancing
  – Write to a logical resource completes when a copy exists on the next physical resource in the list
• Fault tolerance
  – Write to a logical resource completes when copies exist on “k” of “n” physical resources
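
The three completion rules can be sketched as follows. This is a toy model under stated assumptions: the `LogicalResource` class, its method names, and the in-memory `store` dictionary are hypothetical and do not reflect the SRB interface.

```python
class LogicalResource:
    """A logical resource name mapped to an ordered list of physical resources."""

    def __init__(self, physical, store):
        self.physical = list(physical)
        self.store = store   # hypothetical backing: resource name -> {path: bytes}
        self._cursor = 0     # index of the next resource used for load balancing

    def _put(self, resource, path, data):
        """Try to store a copy; a resource missing from `store` counts as offline."""
        if resource not in self.store:
            return False
        self.store[resource][path] = data
        return True

    def write_replicated(self, path, data):
        """Replication: complete only when all physical resources have a copy."""
        results = [self._put(r, path, data) for r in self.physical]
        return all(results)

    def write_load_balanced(self, path, data):
        """Load balancing: write one copy to the next resource in the list."""
        resource = self.physical[self._cursor % len(self.physical)]
        self._cursor += 1
        return resource if self._put(resource, path, data) else None

    def write_fault_tolerant(self, path, data, k):
        """Fault tolerance: complete when at least k of the n resources have a copy."""
        copies = sum(self._put(r, path, data) for r in self.physical)
        return copies >= k


# "tape-c" is deliberately absent from the store, so it behaves as offline.
store = {"disk-a": {}, "disk-b": {}}
lr = LogicalResource(["disk-a", "disk-b", "tape-c"], store)
```

With one of three resources offline, a replicated write reports failure while a fault-tolerant write with k = 2 succeeds, which is the trade-off the slide describes.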

Page 24: Integration of Data Grids, Digital Libraries, and Persistent

Containers

• Archivists store hardcopy in “cardboard boxes”

• A container is the digital equivalent, the aggregation of digital files into a single file, with an associated “packing list”

• Containers are used to minimize access latency, keep similar digital entities together
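
A minimal sketch of the container idea follows, assuming a simple layout in which files are concatenated and the packing list records each file's offset and length. The function names and layout are illustrative; the SRB container format is not described in the slides.

```python
import io


def pack(files):
    """Aggregate files into one container and build its packing list.

    `files` maps file name -> bytes. The packing list maps
    file name -> (offset, length) inside the container.
    """
    buf = io.BytesIO()
    packing_list = {}
    for name, data in files.items():
        packing_list[name] = (buf.tell(), len(data))
        buf.write(data)
    return buf.getvalue(), packing_list


def extract(container, packing_list, name):
    """Read one file back out of the container using the packing list."""
    offset, length = packing_list[name]
    return container[offset:offset + length]


# Hypothetical example files
container, packing_list = pack({"a.txt": b"alpha", "b.txt": b"beta"})
```

Storing one large container instead of many small files means a single archive request retrieves the whole group, which is where the latency saving comes from.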

Page 25: Integration of Data Grids, Digital Libraries, and Persistent

Data Stored at SDSC

• HPSS archive
  – Stores 1 Petabyte of data
  – Stores 17 million files
• Storage Resource Broker data grid
  – Stores 114 Terabytes of data
  – Stores 31 million files
  – Containers are used to aggregate files before loading into HPSS

Page 26: Integration of Data Grids, Digital Libraries, and Persistent

SDSC Storage Resource Broker & Meta-data Catalog

[Architecture diagram; labels grouped by layer]

Application
Access APIs: C, C++, Libraries; Linux I/O; Unix Shell; DLL / Python; Java, NT Browsers; GridFTP; OAI, WSDL
SRB Server: Consistency Management / Authorization-Authentication; Logical Name Space; Latency Management; Data Transport; Metadata Transport
Catalog Abstraction: Databases (DB2, Oracle, Sybase, SQLServer)
Storage Abstraction, Drivers: Archives (HPSS, ADSM, UniTree, DMF); HRM; Databases (DB2, Oracle, Postgres); File Systems (Unix, NT, Mac OSX)

Page 27: Integration of Data Grids, Digital Libraries, and Persistent

Production Data Grid

• SDSC Storage Resource Broker
  – Federated client-server system, managing
    • Over 100 TBs of data at SDSC
    • Over 25 million files
  – Manages data collections stored in
    • Archives (HPSS, UniTree, ADSM, DMF)
    • Hierarchical Resource Managers
    • Tapes, tape robots
    • File systems (Unix, Linux, Mac OS X, Windows)
    • FTP sites
    • Databases (Oracle, DB2, Postgres, SQLserver, Sybase, Informix)
    • Virtual Object Ring Buffers

Page 28: Integration of Data Grids, Digital Libraries, and Persistent

Data Virtualization

Archive at SDSC | Database at U Md | File System at U Texas

User Application

Page 29: Integration of Data Grids, Digital Libraries, and Persistent

Data Virtualization

Archive at SDSC | Database at U Md | File System at U Texas

Common naming convention and set of attributes for describing digital entities

User Application

• Logical name space
• Location independent identifier
• Persistent identifier
• Collection owned data
• Access controls
• Audit trails
• Checksums
• Descriptive metadata
• Inter-realm authentication
• Single sign-on system

Page 30: Integration of Data Grids, Digital Libraries, and Persistent

Logical Name Space

• Persistent, location-independent identifiers for digital entities
  – Organized as collection hierarchy
  – Attributes mapped to logical name space
• Attributes managed in a database
• Types of administrative metadata
  – Physical location of file
  – Owner, size, creation time, update time
  – Access controls

Page 31: Integration of Data Grids, Digital Libraries, and Persistent

File Identifiers

• Logical file name
  – Infrastructure independent
  – Used to organize files into a collection hierarchy
• Globally unique identifier
  – GUID for asserting equivalence across collections
• Descriptive metadata
  – Support discovery
• Physical file name
  – Location of file
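
The four identifiers can be tied together in a small catalog sketch. The `Catalog` class, its methods, and the example paths are assumptions for illustration only; they are not the MCAT schema or API.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    logical_name: str                # infrastructure-independent name
    guid: str                        # globally unique identifier
    physical_names: list[str]        # current locations of the file
    descriptive: dict[str, str] = field(default_factory=dict)


class Catalog:
    def __init__(self):
        self._entries = {}

    def register(self, logical_name, physical_name, **descriptive):
        """Create a catalog entry for a file already held in storage."""
        entry = CatalogEntry(logical_name, str(uuid.uuid4()),
                             [physical_name], dict(descriptive))
        self._entries[logical_name] = entry
        return entry

    def resolve(self, logical_name):
        """Map a logical file name to its physical file names."""
        return list(self._entries[logical_name].physical_names)

    def discover(self, **conditions):
        """Find logical names whose descriptive metadata matches all conditions."""
        return sorted(
            name for name, e in self._entries.items()
            if all(e.descriptive.get(k) == v for k, v in conditions.items())
        )

    def migrate(self, logical_name, old, new):
        """Record a move to new storage; logical name and GUID do not change."""
        names = self._entries[logical_name].physical_names
        names[names.index(old)] = new


# Hypothetical names and locations
catalog = Catalog()
entry = catalog.register("/demo/run1.dat", "old-archive:/vault/0001", project="demo")
guid_before = entry.guid
catalog.migrate("/demo/run1.dat", "old-archive:/vault/0001", "new-archive:/vault/9f3a")
```

After the migration, users still retrieve the file by the same logical name or by a metadata query, which is what makes the identifier persistent.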

Page 32: Integration of Data Grids, Digital Libraries, and Persistent

Information Repository Virtualization

Choice of database for Metadata Catalog

Common operations for managing a catalog in a database

User Application

Operations used to manage administrative, descriptive, user-defined metadata
• Import from XML file
• Export to XML file
• Bulk load
• Bulk unload
• Schema extension
• Access controls
• Dynamic SQL generation

Page 33: Integration of Data Grids, Digital Libraries, and Persistent

Access Virtualization

Application

Access APIs: C, C++, Libraries; Linux I/O; Unix Shell; DLL / Python; Java, NT Browsers; GridFTP; OAI, WSDL

Common operations performed on all storage repositories

Map from API to remote operations
• Unix file system
• Latency management
• Procedures
• Transformations
• Third party transfer
• Filtering
• Queries

Page 34: Integration of Data Grids, Digital Libraries, and Persistent

Technology Evolution

• All components of the “Persistent Archive” will evolve
  – Hardware systems
  – Software systems
  – Protocols
  – Access methods
  – Encoding syntax for digital entities
• Create drivers for each new storage repository protocol
  – Migrate data to each new storage system
• Manage evolution of the encoding syntax through either transformative migration or emulation

Page 35: Integration of Data Grids, Digital Libraries, and Persistent

Are Repeated Media Migrations Feasible?

• At SDSC, cartridge capacity has increased from 200 Mbytes to 200 Gbytes for the same cartridge cost
• Only migrate to new technology when the cost per Gigabyte is a factor of two lower
• Then the total media cost, summed over all migrations, is bounded: (1 + 1/2 + 1/4 + 1/8 + 1/16 + 1/32 + …) = 2 times the initial media cost
• SDSC migrates to new media to reduce cost
  – All tapes are stored in robots to minimize labor costs
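
The cost argument is a geometric series, and a short calculation shows how quickly it approaches its limit. The function below is an illustrative sketch; its name and parameters are assumptions, with the initial media cost normalized to 1.

```python
def cumulative_media_cost(migrations, initial_cost=1.0, factor=2.0):
    """Total media cost after the initial purchase plus `migrations` migrations,

    assuming each migration happens only once the cost per Gigabyte has
    dropped by `factor` relative to the previous generation.
    """
    return sum(initial_cost / factor ** i for i in range(migrations + 1))


# Cost after 0, 1, and 5 migrations, in units of the initial media cost
costs = [cumulative_media_cost(n) for n in (0, 1, 5)]
```

Five migrations already bring the total to 1.96875 times the initial cost, and no number of migrations exceeds 2. For scale, the capacity growth quoted above (200 Mbytes to 200 Gbytes at the same cartridge cost) is a factor of 1000, or roughly ten such halvings of cost per Gigabyte.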

Page 36: Integration of Data Grids, Digital Libraries, and Persistent

Transformative Migration versus Emulation versus Digital Ontology

• Transformative Migration
  – Transform the encoding format to a new standard
  – Can combine encoding format transformation with media migration
• Emulation
  – Create a transportable parser for the original encoding format
  – Migrate emulator forward in time
  – Example - Multivalent Browser (written in Java) for parsing PDF, LaTeX, …
• Digital ontology
  – Characterize the structures and relationships present within the digital entity
  – Migrate the characterization forward in time

Page 37: Integration of Data Grids, Digital Libraries, and Persistent

Persistent Archives

• When migrating from an old technology to a new technology, both versions are available.

• Virtualization mechanisms used for federation across space can be used to manage migration over time

• Persistent archives can be built on data grid infrastructure

Page 38: Integration of Data Grids, Digital Libraries, and Persistent

Automation of Archival Processes

Archival Process   Functionality
Appraisal          Assessment of digital entities
Accession          Import of digital entities
Description        Assignment of preservation metadata
Arrangement        Logical organization of digital entities
Preservation       Long-term storage
Access             Discovery and retrieval

Page 39: Integration of Data Grids, Digital Libraries, and Persistent

Data Grid Core Capabilities

Storage repository abstraction

Storage interface to at least one repository

Standard data access mechanism

Standard data movement protocol support

Containers for data

Logical name space

Registration of files in logical name space

Retrieval by logical name

Logical name space structural independence from physical file

Persistent handle

Page 40: Integration of Data Grids, Digital Libraries, and Persistent

Collection owned data

Collection hierarchy for organizing logical name space

Standard metadata attributes (controlled vocabulary)

Attribute creation and deletion

Scalable metadata insertion

Access control lists for logical name space

Attributes for mapping from logical file name to physical file

Encoding format specification attributes

Data referenced by catalog query

Containers for metadata

Information Repository Abstraction

Page 41: Integration of Data Grids, Digital Libraries, and Persistent

Distributed Resilient Architecture

Specification of system availability

Standard error messages

Status checking

Authentication mechanism

Specification of reliability against permanent data loss

Specification of mechanism to validate integrity of data

Specification of mechanism to assure integrity of data

Page 42: Integration of Data Grids, Digital Libraries, and Persistent

Virtual Data Grid

Knowledge repositories for managing collection properties

Characterization of the application of transformative migrations on encoding format

Characterization of the application of archival processes

Page 43: Integration of Data Grids, Digital Libraries, and Persistent

Federated SRB server model

[Diagram: a Read Application presents a Logical Name or Attribute Condition; SRB servers, each with an SRB agent, and the MCAT handle the request in numbered steps 1 to 6, ending with parallel data access to replicas R1 and R2 (steps 5/6)]

MCAT
1. Logical-to-Physical mapping
2. Identification of Replicas
3. Access & Audit Control

Other diagram labels: Peer-to-peer Brokering; Server(s) Spawning; Data Access; Parallel Data Access

Page 44: Integration of Data Grids, Digital Libraries, and Persistent

Latency Management - Bulk Operations

• Bulk register
  – Create a logical name for a file
  – Load context (metadata)
• Bulk load
  – Create a copy of the file on a data grid storage repository
• Bulk unload
  – Provide containers to hold small files and pointers to each file location
• Bulk delete
• Requests for bulk operations for access control, …
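
Why bulk operations help with latency can be seen from a simple cost model: every request pays one wide-area round trip, so sending many files per request amortizes that cost. The functions and the timing values below are assumptions chosen for illustration, not SRB measurements.

```python
def batches(items, size):
    """Split a list of items into consecutive batches of at most `size`."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def registration_time(n_files, round_trip_s, per_file_s, batch_size=1):
    """Estimated seconds to register n_files when batch_size files share a request."""
    requests = -(-n_files // batch_size)  # ceiling division
    return requests * round_trip_s + n_files * per_file_s


# Hypothetical workload: 1000 files, 100 ms round trip, 1 ms catalog work per file
one_by_one = registration_time(1000, 0.1, 0.001, batch_size=1)
in_bulk = registration_time(1000, 0.1, 0.001, batch_size=100)
```

Under these assumed numbers the per-file approach is dominated by round trips, while the bulk approach spends about as much time on network latency as on the catalog work itself.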

Page 45: Integration of Data Grids, Digital Libraries, and Persistent

SRB Latency Management

[Diagram: latency management mechanisms placed along the path Source, Network, Destination]

• Replication, Server-initiated I/O
• Streaming, Parallel I/O
• Caching, Client-initiated I/O
• Remote Proxies, Staging
• Data Aggregation, Containers
• Prefetch

Page 46: Integration of Data Grids, Digital Libraries, and Persistent

Southern California Earthquake Center

• Build community digital library
• Manage simulation and observational data
  – Anelastic wave propagation output
  – 10 TBs, 1.5 million files
• Provide web-based interface
  – Support standard services on digital library
• Manage data distributed across multiple sites
  – USC, SDSC, UCSB, SDSU, SIO
• Provide standard metadata
  – Community based descriptive metadata
  – Administrative metadata
  – Application specific metadata

Page 47: Integration of Data Grids, Digital Libraries, and Persistent

SCEC Digital Library Technologies

• Portals – Knowledge interface to the library, presenting a coherent view of the services

• Knowledge Management Systems– Organize relationships between SCEC concepts and semantic labels

• Process management systems – Data processing pipelines to create derived data products

• Web services – Uniform capabilities provided across SCEC collections

• Data grid – Management of collections of distributed data

• Computational grid – Access to distributed compute resources

• Persistent archive – Management of technology evolution

Page 48: Integration of Data Grids, Digital Libraries, and Persistent

Metadata Organization (Domain View versus Run View)

Domain List Formatting

Output

Run

Provenance

Velocity Model Fault Model

Physical Numerical

Spatial Temporal

Domain ...

Simulation Model Program Computer System

Page 49: Integration of Data Grids, Digital Libraries, and Persistent

Page 50: Integration of Data Grids, Digital Libraries, and Persistent

Page 51: Integration of Data Grids, Digital Libraries, and Persistent

Zone SRB Federation

• Mechanisms to impose consistency and access constraints when sharing:
  – Resources
    • Controls on which zones may use a resource
  – User names (user-name / domain / SRB-zone)
    • Users may be registered into another domain, but retain their home zone, similar to Shibboleth
  – Data files
    • Controls on who specifies replication of data
  – Context metadata
    • Controls on who manages updates to metadata

Page 52: Integration of Data Grids, Digital Libraries, and Persistent

Data Grid Federation - zoneSRB

[Architecture diagram; labels grouped by layer]

Application
Access APIs: C, C++, Java Libraries; Linux I/O; Unix Shell; DLL / Python, Perl; Java, NT Browsers; OAI, WSDL, OGSA; HTTP
Federation Management
Consistency & Metadata Management / Authorization-Authentication Audit
Logical Name Space; Latency Management; Data Transport; Metadata Transport
Catalog Abstraction: Databases (DB2, Oracle, Sybase, Postgres, mySQL, Informix)
Storage Repository Virtualization: ORB; Archives - Tape (HPSS, ADSM, UniTree, DMF, CASTOR, ADS); Databases (DB2, Oracle, Sybase, SQLserver, Postgres, mySQL, Informix); File Systems (Unix, NT, Mac OSX)

Page 53: Integration of Data Grids, Digital Libraries, and Persistent

Peer-to-Peer Federation

1. Occasional Interchange - for specified users

2. Replicated Catalogs - entire state information replication

3. Resource Interaction - data replication

4. Replicated Data Zones - no user interactions between zones

5. Master-Slave Zones - slaves replicate data from master zone

6. Snow-Flake Zones - hierarchy of data replication zones

7. User / Data Replica Zones - user access from remote to home zone

8. Nomadic Zones “SRB in a Box” - synchronize local zone to parent zone

9. Free-floating “myZone” - synchronize without a parent zone

10. Archival “BackUp Zone” - synchronize to an archive

SRB Version 3.0.1 released December 19, 2003

Page 54: Integration of Data Grids, Digital Libraries, and Persistent

Principal peer-to-peer federation approaches (1536 possible combinations)

Zone SRB | Zone organization | Zone interaction control | Consistency management | User connection point to access files | Data access control setting | Metadata synchronization | Resource sharing | User-ID sharing between zones
(applies to) | Zones | Zones | Collections | Files | Files | Metadata | Resources | User names
Free Floating Zones | Peer-to-Peer | Local Admin | User-specified data publication | From home zone | User set access controls | User controlled synchronization | None | None
Occasional Interchange | Peer-to-Peer | Local Admin | User specified | From home zone | User set access controls | User controlled synchronization | None | Partial
Replicated Data Zones | Peer-to-Peer | Local Admin | User-specified replication | From home zone | User set local access controls | User controlled synchronization | Partial | Partial, user establishes own accounts
Resource Interaction | Peer-to-Peer | Local Admin | User-specified replication | From home zone | User set access controls | None | Partial shared resource for replication | Partial
User and Data Replica Zones | Peer-to-Peer | Local Admin | User-specified replication | From home zone | System set access controls | System controlled complete synchronization | Partial | Complete
Replicated Catalog | Peer-to-Peer | Local Admin | System managed name conflict resolution | From any zone | System replicated access controls | System controlled complete synchronization | All zones share resources | Complete
Snow Flake Zones | Hierarchical | Local Admin | System managed replication in hierarchy of zones | From home zone | System set access controls | System controlled partial synchronization | None | One
Master-Slave Zones | Hierarchical | Super Admin | System-managed replication to slave | From home zone | System set access controls | System controlled partial synchronization | None | One
Archival zones | Hierarchical | Super Admin | System-managed versioning to parent zone | From home zone | System set access controls | System controlled complete synchronization | None | Complete
Nomadic Zones | Hierarchical | Local Admin | User-managed replication to parent zone | From home zone | User set access controls | User controlled synchronization | Partial | One

Page 55: Integration of Data Grids, Digital Libraries, and Persistent

[Diagram: federation approaches grouped into Peer-to-Peer Zones, Replication Zones, and Hierarchical Zones, connected by the constraints that distinguish them]

Zone types shown:
• Free Floating
• Occasional Interchange
• Replicated Data
• Resource Interaction
• User and Data Replica
• Replicated Catalog
• Nomadic
• Snow Flake
• Master Slave
• Archival

Constraint labels shown:
• Partial User-ID Sharing
• Partial Resource Sharing
• No Metadata Synch
• Hierarchical Zone Organization
• One Shared User-ID
• System Managed Replication
• Connection From Any Zone
• Complete Resource Sharing
• System Set Access Controls
• System Controlled Complete Synch
• Complete User-ID Sharing
• System Controlled Partial Synch
• No Resource Sharing
• Super Administrator Zone Control

Page 56: Integration of Data Grids, Digital Libraries, and Persistent

Deep Archive

• Impose sharing constraints:
  – Only system administrator access
  – Selected replication of files
  – Write once, with versions created on changes to data
• Impose consistency constraints
  – Coordinate update of preservation metadata with file replication
• Manage replication of both data and metadata
• Use federation to guarantee preservation against
  – Local hardware and software failures
  – Local operation errors
  – Local disasters
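
The "write once, with versions created on changes" constraint can be sketched with a small in-memory store that also records a checksum per version. The `WriteOnceStore` class is a hypothetical illustration and not part of SRB; SHA-256 is chosen here only as an example checksum.

```python
import hashlib


class WriteOnceStore:
    """Every put appends a new version; stored versions are never overwritten."""

    def __init__(self):
        self._versions = {}  # logical name -> list of (sha256 hex digest, bytes)

    def put(self, name, data):
        """Store data as the next version and return its 1-based version number."""
        versions = self._versions.setdefault(name, [])
        versions.append((hashlib.sha256(data).hexdigest(), data))
        return len(versions)

    def get(self, name, version=None):
        """Return a version (latest by default) after verifying its checksum."""
        versions = self._versions[name]
        digest, data = versions[-1] if version is None else versions[version - 1]
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError("checksum mismatch for " + name)
        return data


# Hypothetical record name and contents
archive = WriteOnceStore()
v1 = archive.put("record-001", b"original")
v2 = archive.put("record-001", b"revised")
```

A change to the data produces version 2 while version 1 stays retrievable, and the checksum comparison on read is one simple way to validate integrity.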

Page 57: Integration of Data Grids, Digital Libraries, and Persistent

Research

• Information (semantic label) is an assertion that some criteria were met for the application of the label
  – Need to describe and manage the assertions (rules and relationships) used to apply semantic labels
• Information (semantic label) expresses a context-related meaning that should be associated with a digital entity
  – Meaning is determined by the context
• Characterization of information requires the ability to describe
  – The context that defines the assertions for assigning the label
  – The context that explains the meaning of the label
• Organization of information requires the use of relationships (knowledge)

Page 58: Integration of Data Grids, Digital Libraries, and Persistent

Knowledge Based Data Grid Roadmap

[Diagram: three layers (Knowledge, Information, Data), each spanning Ingest Services, Management, and Access Services]

Layer         Ingest Services                  Management                           Access Services
Knowledge     Relationships Between Concepts   Knowledge Repository for Rules       Knowledge or Topic-Based Query / Browse
Information   Attributes, Semantics            Information Repository               Attribute-based Query
Data          Fields, Containers, Folders      Storage (Replicas, Persistent IDs)   Feature-based Query

Other diagram labels: (Model-based Access); (Data Handling System); XTM DTD; Rules - KQL; XML DTD; SDLIP; MCAT/HDF; Grids

Page 59: Integration of Data Grids, Digital Libraries, and Persistent

For More Information

Reagan W. Moore
San Diego Supercomputer Center

moore@sdsc.edu

http://www.npaci.edu/DICE

http://www.npaci.edu/DICE/SRB

http://www.npaci.edu/dice/srb/mySRB/mySRB.html