Data Grids

Darshan R. Kapadia
Gregor von Laszewski

http://grid.rit.edu

Grids

We've seen computational grids: collections of computing clusters together with the protocols and software used to submit jobs, distribute work, schedule jobs, monitor status, etc.

But how do we manage collections of data on a grid, not just the computations and programs themselves?

Data Grid

Lothar A T Bauerdick (2003). Grid Tools and the LHC Data Challenges. LHC Symposium. May 3, 2003.

Why data grids?

The immense computational demands of many scientific applications are often coupled with massive amounts of data.
These data sets must be shared by a virtual organization (VO), or by multiple VOs, for a variety of computations.
Distributing jobs to geographically diverse computing resources also requires distributing data collections for processing and storing the output.

Data Grid Challenges

Storage capacity for massive quantities of data
Distributing data sets to dispersed geographic locations to complete jobs in a grid
Maximizing the computation-to-communication ratio
Aggregation of results and data coherency
Knowing who holds a copy of each data set
Doing all of this securely and robustly

Functions of a Data Grid

Data Access
    How do we access and manage data?
    Storage Resource Brokers
    UNIX file systems, distributed file systems, HTTP servers, etc.
    How do we transfer data?
Metadata Access
    Data about data!
Replica Management
    Create/delete copies of data
    Replica catalogs
Replica Selection
    Locating the best data replica to use for an application
    Determining the subset of data required for a job

Earth System Grid

The Earth System Grid (ESG) integrates supercomputers with large-scale data and analysis servers located at numerous national labs and research centers to create a powerful environment for next-generation climate research.

Participating Organizations
    Argonne National Laboratory
    Lawrence Berkeley National Laboratory
    Lawrence Livermore National Laboratory
    Los Alamos National Laboratory
    National Center for Atmospheric Research
    Oak Ridge National Laboratory
    University of Southern California/Information Sciences Institute

http://www.earthsystemgrid.org/

High Energy Physics Application

Allcock, B., Bester, J., Bresnahan, J., Chervenak, A. L., Foster, I., Kesselman, C., Meder, S., Nefedova, V., Quesnel, D., & Tuecke, S. (2002). Data Management and Transfer in High Performance Computational Grid Environments. Parallel Computing Journal, 28(5), 749-771.

Data Grid Architecture

Chervenak, A., Deelman, E., Kesselman, C., Allcock, B., Foster, I., & Nefedova, V., et al. (2003). High-performance remote access to climate simulation data: a challenge problem for data grid technologies. Parallel Comput., 29(10), 1335-1356.

Data Grid Design

Mechanism Neutrality
Policy Neutrality
Compatibility with Grid Infrastructure
Uniformity of Information Infrastructure

Core Data Grid Services

Storage System and Data Access
    Data Abstraction: Storage System
    Data Access

Metadata Services

High-Level Data Grid Components

Replica Management
Replica Selection and Data Filtering

GASS

Globus Access to Secondary Storage [5]
NOT a distributed file system
UNIX (C-style) fopen/fclose interface
Default behavior is to transfer the entire file from the remote site into a local cache when the file is opened
GASS also provides finer-grained control:
    Pre-stage/post-stage file accesses
    Cache management
No cache coherency (changes made to the remote file are not propagated to caches)

GASS (contd.)

Commands
    globus_gass_fopen
    globus_gass_fclose

File names are URLs
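
The default GASS behavior described on these slides (copy the whole remote file into a local cache when it is opened, with no cache coherency afterwards) can be illustrated with a small Python sketch. This is not the Globus API: the class name `CachingOpener`, the `cache_dir` parameter, and the use of `file://` URLs through `urllib` are assumptions chosen only so that the example runs on a single machine.

```python
import hashlib
import tempfile
import urllib.request
from pathlib import Path


class CachingOpener:
    """Hypothetical stand-in for a GASS-style open: fetch whole file, read locally."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _cache_path(self, url):
        # One cache entry per URL, named by a hash of the URL.
        return self.cache_dir / hashlib.sha256(url.encode()).hexdigest()

    def open(self, url, mode="r"):
        path = self._cache_path(url)
        if not path.exists():
            # Cache miss: transfer the entire remote file before returning.
            with urllib.request.urlopen(url) as remote:
                path.write_bytes(remote.read())
        # Cache hit or freshly filled: later remote changes are NOT seen.
        return open(path, mode)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        remote = Path(tmp) / "remote.txt"   # plays the role of the remote file
        remote.write_text("version 1")
        opener = CachingOpener(Path(tmp) / "cache")
        url = remote.as_uri()               # file names are URLs

        with opener.open(url) as f:
            first = f.read()

        remote.write_text("version 2")      # the "remote" copy changes
        with opener.open(url) as f:
            second = f.read()               # still served from the stale cache
        return first, second


if __name__ == "__main__":
    print(demo())
```

Both reads return the first version of the file, which is the practical meaning of "no cache coherency": an application that needs fresh data has to manage the cache explicitly.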

GridFTP

GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth, wide-area networks.
Based on FTP (RFC 959)
Extended for higher performance, flexibility, and robustness:
    Parallel data sources, parallel transfers
    Partial file transfers
    Transfer restart capabilities
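
How partial transfers, parallel transfers, and restart fit together can be sketched in a few lines of Python. The sketch below is a simulation under stated assumptions, not the GridFTP protocol: the "network" is an in-memory byte copy, and the names `byte_ranges`, `transfer`, `done`, and `fail_at` are hypothetical. It only shows the idea of splitting a file into byte ranges, moving the ranges with several workers, and resending just the missing ranges after a failure.

```python
from concurrent.futures import ThreadPoolExecutor


def byte_ranges(size, chunk):
    """Split [0, size) into (offset, length) pairs of at most `chunk` bytes."""
    return [(off, min(chunk, size - off)) for off in range(0, size, chunk)]


def transfer(source, dest, done, chunk=4, streams=3, fail_at=None):
    """Copy the missing ranges of `source` into `dest` using parallel workers.

    `done` is a set of offsets already transferred (the restart marker).
    `fail_at` simulates a dropped connection for the range at that offset.
    """
    def copy_range(rng):
        off, length = rng
        if off == fail_at:
            return None  # simulated failure: range is left for a later restart
        dest[off:off + length] = source[off:off + length]  # partial transfer
        return off

    pending = [r for r in byte_ranges(len(source), chunk) if r[0] not in done]
    with ThreadPoolExecutor(max_workers=streams) as pool:
        for off in pool.map(copy_range, pending):
            if off is not None:
                done.add(off)
    return done


def demo():
    source = b"grid data set payload"
    dest = bytearray(len(source))
    done = set()
    transfer(source, dest, done, fail_at=8)   # first attempt loses one range
    incomplete = bytes(dest) != source
    transfer(source, dest, done)              # restart: only offset 8 is resent
    return incomplete, bytes(dest) == source


if __name__ == "__main__":
    print(demo())
```

Keeping the set of completed offsets outside the transfer function is what makes the restart cheap: the second call skips every range that already arrived.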

GridFTP (contd.)

Can use GSI (Grid Security Infrastructure) for security.
TeraGrid has three clients that use GridFTP:
    UberFTP (recommended)
    globus-url-copy (preferred for scripting)
    tgcp (deprecated)

Amazon Simple Storage Service (Amazon S3)

Amazon S3 is storage for the Internet.

Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web.

AWS S3 Functionalities

Write, read, and delete objects containing from 1 byte to 5 gigabytes of data each (the per-object limit at the time of this presentation; it has since been raised). The number of objects you can store is unlimited.
Each object is stored in a bucket and retrieved via a unique, developer-assigned key.
Authentication mechanisms are provided to ensure that data is kept secure from unauthorized access. Objects can be made private or public, and rights can be granted to specific users.
Uses standards-based REST and SOAP interfaces designed to work with any Internet-development toolkit.

Replica Management
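
Replica management and replica selection, as listed among the data grid functions earlier in the deck, can be sketched as a catalog that maps a logical file name to the sites holding a copy, plus a selection step that picks the cheapest copy. The following Python sketch is illustrative only: `ReplicaCatalog`, `select_replica`, the `example.org` site URLs, and the transfer-time estimates are all made up for the example and do not correspond to any particular grid middleware.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaCatalog:
    """Maps a logical file name to the physical locations holding a copy."""
    entries: dict = field(default_factory=dict)

    def create_replica(self, logical_name, site_url):
        self.entries.setdefault(logical_name, set()).add(site_url)

    def delete_replica(self, logical_name, site_url):
        self.entries.get(logical_name, set()).discard(site_url)

    def replicas(self, logical_name):
        return sorted(self.entries.get(logical_name, set()))


def select_replica(catalog, logical_name, cost):
    """Pick the replica with the lowest estimated cost (e.g. transfer time)."""
    candidates = catalog.replicas(logical_name)
    if not candidates:
        raise LookupError(f"no replica registered for {logical_name!r}")
    return min(candidates, key=cost)


def demo():
    catalog = ReplicaCatalog()
    name = "climate/run1.dat"  # hypothetical logical file name
    catalog.create_replica(name, "gsiftp://site-a.example.org/run1.dat")
    catalog.create_replica(name, "gsiftp://site-b.example.org/run1.dat")
    est_seconds = {  # made-up transfer-time estimates per replica
        "gsiftp://site-a.example.org/run1.dat": 42.0,
        "gsiftp://site-b.example.org/run1.dat": 7.5,
    }

    def cost(url):
        return est_seconds.get(url, float("inf"))

    best = select_replica(catalog, name, cost)
    catalog.delete_replica(name, best)           # replica removed from the catalog
    fallback = select_replica(catalog, name, cost)
    return best, fallback


if __name__ == "__main__":
    print(demo())
```

Passing the cost function in as a parameter keeps the catalog independent of how "best" is defined; network distance, load, or storage cost could be substituted without changing the catalog.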

Srikumar Venugopal, Rajkumar Buyya, and Kotagiri Ramamohanarao. A Taxonomy of Data Grids for Distributed Data Sharing, Management, and Processing.

Conclusion

A data grid must maintain large amounts of data, which gives it a distinct architecture.
Data grids are very important for the future, as future applications will require large amounts of data.

References

http://www.earthsystemgrid.org/
Chervenak, A., Deelman, E., Kesselman, C., Allcock, B., Foster, I., & Nefedova, V., et al. (2003). High-performance remote access to climate simulation data: a challenge problem for data grid technologies. Parallel Comput., 29(10), 1335-1356.
Allcock, W., Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., & Tuecke, S. (2001). The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets. Journal of Network and Computer Applications, 23, 187-200.
Bester, J., Foster, I., Kesselman, C., Tedesco, J., & Tuecke, S. (1999). GASS: A Data Movement and Access Service for Wide Area Computing Systems. Paper presented at the Proceedings of IOPADS'99.
Allcock, B., Bester, J., Bresnahan, J., Chervenak, A. L., Foster, I., Kesselman, C., Meder, S., Nefedova, V., Quesnel, D., & Tuecke, S. (2002). Data Management and Transfer in High Performance Computational Grid Environments. Parallel Computing Journal, 28(5), 749-771.