A Taxonomy for Distributed Data Sharing, Management and Processing
Chris Sosa
VCGR 2007

Data Grid Taxonomies

Page 1: Data Grid Taxonomies

A Taxonomy for Distributed Data Sharing, Management and Processing

Chris Sosa
VCGR 2007

Page 2: Data Grid Taxonomies

Overview

► Discussion of Data Grids
    - What they are
    - Why they are useful
    - Why they are difficult
► Taxonomies
► Classification of Data Grids using Taxonomies
► Attempted Classification of Genesis II

Page 3: Data Grid Taxonomies

What are Data Grids?

► Aggregation of geographically distributed, heterogeneous computing, storage, and network resources to provide unified, secure, and pervasive access.
► Large data sets that can be shared worldwide
► Data is a 1st-class resource

Page 4: Data Grid Taxonomies

Why are Data Grids Important?

► Proliferation of Data: Seeing GB -> PB
► Geographical Distribution: They’re everywhere!!!
► Sharing with Site Autonomy
► A Single Source for a variety of data
► Want to be able to search and discover suitable resources

Page 5: Data Grid Taxonomies

Issues Related to Data Grids

► Site Autonomy
► Heterogeneity
► Limited Resources
► Single Source
► Access Restrictions
► Unified Namespace

Page 6: Data Grid Taxonomies

A Taxonomy for Data Grids

► What is a Taxonomy?
    - Technique for classifying something into groups.
    - Technique used here is making a Graph (looks like a Tree but things can be multi-classified into different leaves).
► Taxonomy broken into four sub-taxonomies
    - Organization
    - Data Transport
    - Data Replication
    - Scheduling
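The graph idea above (tree-shaped categories, with one item allowed under several leaves) can be sketched as follows. This is a minimal illustration, not the structure used in the paper: the `Taxonomy` class, the leaf names "Block I/O" and "Stream I/O", and the item "ExampleSystem" are all hypothetical.

```python
from collections import defaultdict


class Taxonomy:
    """Classification graph: categories nest like a tree, but a single
    item may be attached to several leaves at once."""

    def __init__(self):
        self.children = defaultdict(set)  # category -> sub-categories
        self.members = defaultdict(set)   # leaf category -> classified items

    def add_category(self, parent, child):
        self.children[parent].add(child)

    def classify(self, item, *leaves):
        # Multi-classification: the same item is recorded under every leaf given.
        for leaf in leaves:
            self.members[leaf].add(item)

    def leaves_of(self, item):
        return sorted(leaf for leaf, items in self.members.items() if item in items)


taxonomy = Taxonomy()
# The four sub-taxonomies named on the slide.
for sub in ("Organization", "Data Transport", "Data Replication", "Scheduling"):
    taxonomy.add_category("Data Grid", sub)

# Hypothetical leaves and a hypothetical system, for illustration only.
taxonomy.add_category("Data Transport", "Block I/O")
taxonomy.add_category("Data Transport", "Stream I/O")
taxonomy.classify("ExampleSystem", "Block I/O", "Stream I/O")
```

Calling `taxonomy.leaves_of("ExampleSystem")` returns both leaves, which is what separates this graph from a strict tree.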

Page 7: Data Grid Taxonomies

Organization Sub-Taxonomy

Page 8: Data Grid Taxonomies

Data Transport Sub-Taxonomy

Page 9: Data Grid Taxonomies

Data Replication and Storage Sub-Taxonomy

Page 10: Data Grid Taxonomies

Replication Architecture Sub-sub-Taxonomy

Page 11: Data Grid Taxonomies

Replication Strategy Sub-sub-Taxonomy (cont’d)

Page 12: Data Grid Taxonomies

Resource Allocation and Scheduling Sub-Taxonomy

Page 13: Data Grid Taxonomies

Classification Time

► For complete classification see section 5 in the paper.
► Next few slides highlight interesting aspects of classifying technologies
► End with discussion of Genesis II classification

Page 14: Data Grid Taxonomies

Classification Time: Organization

► HEP – hierarchical and shared facilities for computing and storage (collaborative)
► Astronomy – organizing VOs to find single source data. Federated model.
► Bio-Informatics – Federated model (over DBs) and providing common data formats.
► Earth Sciences (NEESgrid) – bottom-up model.

Page 15: Data Grid Taxonomies

Classification Time: Data Transport

► GASS (Globus Toolkit)
    - Data access mechanism.
    - Goal is to provide uniform access.
    - Remote I/O mechanism for Grid apps.
    - Fetches the entire file into a “cache”. Can use prestaging etc.
► IBP (Internet Backplane Protocol)
    - Optimizes data transfer with a “store-and-forward” protocol.
    - Fixed-size byte arrays in a global addressing space
    - Security is capabilities-based.
► GridFTP (contrary to a common misconception, it does not require the Globus Toolkit)
    - Extension of the standard FTP protocol to provide additional Grid functionality.
    - Allows GSI- and Kerberos-based authentication.
    - Supports multiple TCP streams over the same channel and handles data striping.
    - Restart capability
► Kangaroo
    - Like IBP, but hidden from the user (it cannot be explicitly told how to route)
    - Reads and writes happen in the background (non-blocking) unless told otherwise
    - Uses hops
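The GridFTP points about data striping over multiple streams and restart capability can be illustrated with a small planning sketch. It is a local simulation under stated assumptions, not the GridFTP wire protocol or any Globus API: `plan_stripes`, `remaining_blocks`, the round-robin assignment, and the sizes are all hypothetical.

```python
def plan_stripes(file_size, block_size, num_streams):
    """Assign fixed-size blocks of a file to streams round-robin.

    Returns {stream_index: [(offset, length), ...]}.
    """
    plan = {stream: [] for stream in range(num_streams)}
    for i, offset in enumerate(range(0, file_size, block_size)):
        length = min(block_size, file_size - offset)  # last block may be short
        plan[i % num_streams].append((offset, length))
    return plan


def remaining_blocks(plan, completed):
    """Restart support: keep only the blocks not yet received."""
    return {
        stream: [block for block in blocks if block not in completed]
        for stream, blocks in plan.items()
    }


# A 1000-byte file, 300-byte blocks, two parallel streams (illustrative sizes).
plan = plan_stripes(file_size=1000, block_size=300, num_streams=2)

# Suppose the transfer was interrupted after the first block on each stream.
completed = {(0, 300), (300, 300)}
todo = remaining_blocks(plan, completed)
```

After a restart only the blocks listed in `todo` need to be sent again, so an interrupted transfer does not start from byte zero.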

Page 16: Data Grid Taxonomies

Classification Time: Data Transport (cont’d)

► Legion I/O
    - Object-oriented middleware
    - Single system image (distributed file system)
    - Transparent access by native and legacy apps
    - Uses X.509 proxies to handle security for file transfers (data is not encrypted while in transit)
► SRB I/O (Storage Resource Broker)
    - Uniform and transparent interface to heterogeneous storage systems
    - Parallel I/O and 3rd-party transfers
    - Fine-grained security.
    - Remote procedures
► Stork
    - Scheduler for data placement jobs
    - Can translate between mutually incompatible transfer protocols
    - Can create DAGs (directed acyclic graphs) to plan higher-level transfers (data pipelines)
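The DAG idea mentioned for Stork can be sketched with Python's standard `graphlib` module (Python 3.9+). This shows only the general technique of ordering dependent data placement jobs; it is not Stork's job description format or scheduler, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it depends on (hypothetical job names).
pipeline = {
    "stage_in": set(),
    "transfer": {"stage_in"},
    "verify": {"transfer"},
    "register_replica": {"verify"},
    "cleanup_source": {"verify"},
}


def execution_order(dag):
    """Return the jobs in an order that respects every dependency.

    TopologicalSorter raises graphlib.CycleError if the graph has a cycle.
    """
    return list(TopologicalSorter(dag).static_order())


order = execution_order(pipeline)
```

Jobs with no dependency between them (here `register_replica` and `cleanup_source`) may come out in either order, which is where a scheduler is free to run them in parallel.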

Page 17: Data Grid Taxonomies

Classification Time: Data Replication and Storage

► GFarm (Grid DataFarm) – for data-intensive programs
    - GFarm’s (parallel) file system unifies the file addressing space over all nodes
    - Replica management is dynamic and coupled with scheduling
    - Data in a file can be broken into fragments on multiple disks
    - Files are write-once
► Giggle (GIGa-scale Global Location Engine) – architecture framework for a replica location service (RLS).
    - Data is represented by a logical file name (LFN)
    - Physical location is identified by a physical file name (PFN), a URL
    - Local Replica Catalogs (LRCs) hold the mappings between LFNs and PFNs
    - Replica Location Index (RLI) creates an index of replica catalogs (pointers from LFNs to LRCs). Periodically updated via polling.
    - Aimed at write-once, read-many. Only provides indexing.
► GDMP – provides secure and high-speed file transfer services.
    - Based on pub-sub model.
    - GSI as security model (auth + authz).
    - HEP uses it.
    - Client replicates from central storage
► SRB – enables creation of shared collections
    - Unified view of data files, analogous to the UNIX fs structure.
    - Static replication, with replication managed at the container / dataset level
    - SRB focuses on preservation of the data
    - Decentralized model with hybrid scheme – tree for naming, ring for replication
    - Replication is organized with a DB
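The two-level lookup described for Giggle (RLI points from an LFN to catalogs, each LRC maps the LFN to PFNs) can be sketched as below. This is a minimal in-memory model, not the Giggle/RLS implementation or its API: the class names, the `refresh` polling method, and the example LFN and `example.org` URLs are assumptions made for illustration.

```python
class LocalReplicaCatalog:
    """LRC: maps logical file names (LFNs) to physical file names (PFNs)."""

    def __init__(self, name):
        self.name = name
        self.mappings = {}  # LFN -> set of PFN URLs

    def register(self, lfn, pfn):
        self.mappings.setdefault(lfn, set()).add(pfn)

    def lookup(self, lfn):
        return sorted(self.mappings.get(lfn, ()))


class ReplicaLocationIndex:
    """RLI: maps each LFN to the catalogs that know about it."""

    def __init__(self):
        self.index = {}  # LFN -> set of LRC names

    def refresh(self, catalogs):
        # Rebuild the index by polling every catalog (assumed update scheme).
        self.index = {}
        for lrc in catalogs:
            for lfn in lrc.mappings:
                self.index.setdefault(lfn, set()).add(lrc.name)

    def catalogs_for(self, lfn):
        return sorted(self.index.get(lfn, ()))


def resolve(lfn, rli, catalogs_by_name):
    """Ask the RLI which LRCs hold the LFN, then ask each LRC for its PFNs."""
    pfns = []
    for name in rli.catalogs_for(lfn):
        pfns.extend(catalogs_by_name[name].lookup(lfn))
    return pfns


# Hypothetical sites, LFN, and URLs.
site_a = LocalReplicaCatalog("site-a")
site_b = LocalReplicaCatalog("site-b")
site_a.register("lfn://run42/events", "gsiftp://a.example.org/data/events")
site_b.register("lfn://run42/events", "gsiftp://b.example.org/store/events")

rli = ReplicaLocationIndex()
rli.refresh([site_a, site_b])
replicas = resolve("lfn://run42/events", rli, {"site-a": site_a, "site-b": site_b})
```

The index stores only pointers to catalogs, never the PFNs themselves, which matches the slide's point that the service only provides indexing.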

Page 18: Data Grid Taxonomies

Classification Time: Allocation and Scheduling

Page 19: Data Grid Taxonomies

Genesis II Classification

► Organization – Federated, Interdomain, Collaborative, Stable, Managed
► Data Transport – File I/O (RNS), Cryptographic Keys (WS-Security), SSL, Fine-grained (through delegation), Restart, Block + Stream (ByteIO)
► Replica Architecture and Strategy – TBD
► Scheduling – Process-Oriented, Individual, Makespan, Spatial.

Page 20: Data Grid Taxonomies

Questions?