Data Grids for HPC: Geographical Information System Grids
Marlon Pierce
Geoffrey Fox
Indiana University
December 7, 2004
Internet Seminar
Overview from Previous Lectures
Parallel Computing
Parallel processing is built on breaking problems up into parts and simulating each part on a separate computer node.
There are several ways of expressing this breakup into parts with software:
  Message passing as in MPI, or
  The OpenMP model for annotating traditional languages
  Explicitly parallel languages like High Performance Fortran
And several computer architectures are designed to support this breakup:
  Distributed memory with or without custom interconnect
  Shared memory with or without good cache
  Vectors with usually good memory bandwidth
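The breakup into parts can be illustrated with a minimal Python sketch. It is not MPI or OpenMP: a thread pool from the standard library stands in for separate computer nodes, and the function names (split_into_parts, parallel_sum_of_squares) are hypothetical, chosen only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_parts(values, n_parts):
    """Split a list into n_parts contiguous chunks of nearly equal size."""
    size, extra = divmod(len(values), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks take one additional element each.
        end = start + size + (1 if i < extra else 0)
        parts.append(values[start:end])
        start = end
    return parts


def parallel_sum_of_squares(values, n_workers=4):
    """Compute each part's partial result on a worker, then combine them."""
    parts = split_into_parts(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda part: sum(x * x for x in part), parts))
    # The final reduction plays the role of the combining message exchange.
    return sum(partials)


data = list(range(1000))
total = parallel_sum_of_squares(data)
```

In a real message-passing code each part would live in the memory of a different node and the partial results would be combined with explicit messages; the sketch only shows the decompose, compute, combine pattern.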
What are Web Services?
Web Services are distributed computer programs that can be in any language (Fortran .. Java .. Perl .. Python).
The simplest implementations involve XML messages (SOAP) and programs written in net-friendly languages like Java and Python.
[Figure: a typical e-commerce use]
What Is the Connection?
Both MPI and Web Services rely upon messaging to interact.
But the difference is in the speed of message transmission.
MPI is useful for microsecond communication speeds.
  Clusters, traditional parallel computing
Web Services communicate at Internet speeds.
  Millisecond communication times at best.
This implies that we have (at least) a two-level programming model.
  Level 1: MPI within science applications on clusters and HPC.
  Level 2: Programming between science applications.
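A rough cost model shows why the latency gap forces two levels. The sketch below assumes 1 microsecond per MPI message and 1 millisecond per Web Service message (order-of-magnitude figures taken from the slide); the workload of one million messages and 10 seconds of computation is a made-up example, and the model ignores bandwidth and overlap of communication with computation.

```python
MPI_LATENCY_S = 1e-6  # assumed: about 1 microsecond per message
WS_LATENCY_S = 1e-3   # assumed: about 1 millisecond per message (best case)


def messaging_overhead(n_messages, latency_s):
    """Total time spent waiting on message latency alone."""
    return n_messages * latency_s


def overhead_fraction(n_messages, latency_s, compute_s):
    """Fraction of total run time that is messaging latency."""
    comm = messaging_overhead(n_messages, latency_s)
    return comm / (comm + compute_s)


# Hypothetical fine-grained workload: one million messages, 10 s of compute.
n_messages = 1_000_000
compute_s = 10.0

mpi_fraction = overhead_fraction(n_messages, MPI_LATENCY_S, compute_s)
ws_fraction = overhead_fraction(n_messages, WS_LATENCY_S, compute_s)

print(f"MPI: {mpi_fraction:.1%} of run time is messaging")
print(f"Web Services: {ws_fraction:.1%} of run time is messaging")
```

Under these assumptions the same message count costs about 1 second with MPI but about 1000 seconds with Web Services, so fine-grained exchanges belong inside the application (level 1) and only coarse-grained exchanges belong between applications (level 2).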
Two-level Programming I
The Web Service (Grid) paradigm implicitly assumes a two-level programming model.
We make a Service (same as a distributed object or computer program running on a remote computer) using conventional technologies:
  C++, Java or Fortran Monte Carlo module, perhaps running with MPI on a parallel machine
  Data streaming from a sensor or satellite
  Specialized (JDBC) database access
Such services accept and produce data from other services, files and databases.
The Grid is used to coordinate such services, assuming we have solved the problem of programming the service.
Two-level Programming II
The Grid addresses the composition of distributed services, with runtime interfaces to the Grid taking the place of UNIX pipes/data streams.
This is familiar from the use of UNIX shell, Perl or Python scripts to produce real applications from core programs.
Such interpretative environments are the single-processor analog of Grid programming.
Some projects, like GrADS from Rice University, are looking at integration between the service and composition levels, but the dominant effort looks at each level separately.
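The two levels can be sketched in a few lines of Python. This is a single-process analogy only: plain functions that accept and return dictionaries stand in for remote services, and every name here (monte_carlo_pi_service, error_report_service, compose) is hypothetical. No Grid or Web Service toolkit is involved.

```python
import math
import random


def monte_carlo_pi_service(request):
    """Level 1: a conventional computation wrapped as a 'service'."""
    rng = random.Random(request["seed"])
    n = request["samples"]
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return {"estimate": 4.0 * hits / n, "samples": n}


def error_report_service(request):
    """Another 'service' that consumes the output of the first."""
    return {
        "estimate": request["estimate"],
        "abs_error": abs(request["estimate"] - math.pi),
    }


def compose(*services):
    """Level 2: chain services so each one's output feeds the next."""
    def pipeline(request):
        message = request
        for service in services:
            message = service(message)
        return message
    return pipeline


workflow = compose(monte_carlo_pi_service, error_report_service)
result = workflow({"seed": 42, "samples": 10000})
print(result)
```

The composition function knows nothing about how each service is programmed internally, which is the separation of levels the slides describe; on a Grid the dictionaries would be messages passed between remote services.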
3 Layer Programming Model
[Diagram labels:]
  Application (level 1 programming): MPI, Fortran, C++ etc.
  Application Semantics (Metadata, Ontology) (level 2 programming): Semantic Web
  Basic Web Service Infrastructure: Web Service 1, WS 2, WS 3, WS 4
  Workflow (level 3) programming: BPEL
The Semantic Web adds another layer between workflow and the services representing traditional applications.
Data and Science Applications
Two- (or three-) level programming applies to all applications.
Typically we need to bind together HPC and non-HPC parts.
  How do you provide data to your application?
  How do you share data between applications?
  How do you communicate results to analysis and visualization programs?
This is particularly important as the size and quality of observational data are growing rapidly.
Q: How do you easily bind together science apps and remote data sources?
A: Web Services (and Grids) provide the unifying architecture.
Grid Libraries
Programming the Grid has many similarities with programming in conventional languages.
  In HPSearch you use similar scripting languages.
Grids are particularly good at supporting user interfaces, as the browser is a particular service.
  Portal technology is an important gift of Grids for HPC.
Most promising (and not often exploited) is building Grid Libraries, which are collections of services that can be re-used in several applications.
  The Mastercard service is a typical business Grid library.
  Visualization, sensor processing and GIS are naturally distributed components of an HPC application that can be developed as Grid libraries.
Data Grids for HPC
Data Deluged Science
In the past, we worried about data in the form of parallel I/O or MPI-IO, but we didn't consider it as an enabler of new algorithms and new ways of computing.
  Data assimilation was not central to HPC.
  CASC set up because didn't want test data!
Now particle physics will get 100 petabytes from CERN.
  Nuclear physics (Jefferson Lab) is in the same situation.
  Use around 30,000 CPUs simultaneously, 24x7.
Weather forecasting, climate, solid earth (EarthScope, Earth System Grid, GEON)
  We discussed our project SERVOGrid in the October 2004 lecture.
Bioinformatics curated databases (Biocomplexity only 1000s of data points at present)
Virtual Observatory and SkyServer in Astronomy
Environmental sensor nets
Data Deluge @ Home
In 2003, all of Marion County, IN (including Indianapolis) was surveyed using Light Detection and Ranging (LiDAR) sensing.
GRW, Inc flew a Cessna 337 airplane over the entire county to produce digitized maps.
  1 point per square meter.
  495 square miles total.
  Can be used to create high resolution contour maps.
But what do you do with all of the data?
LiDAR data represents a 3 orders of magnitude increase in data resolution over what is used today in conventional flood prediction (B. Engles, Purdue).
Flood modeling codes thus must become HPC codes to handle the size of newly available data.
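The size of the survey follows from the two figures above (1 point per square meter, 495 square miles). The short calculation below works it out. The storage estimate is an assumption that is not in the original: it supposes each point is kept as three 8-byte floating point values (x, y, z), which real LiDAR file formats need not match.

```python
METERS_PER_MILE = 1609.344          # international mile, exact by definition
SQ_METERS_PER_SQ_MILE = METERS_PER_MILE ** 2

AREA_SQ_MILES = 495                 # from the survey description
POINTS_PER_SQ_METER = 1             # from the survey description
BYTES_PER_POINT = 3 * 8             # assumption: x, y, z as 8-byte floats

area_sq_meters = AREA_SQ_MILES * SQ_METERS_PER_SQ_MILE
n_points = area_sq_meters * POINTS_PER_SQ_METER
raw_gigabytes = n_points * BYTES_PER_POINT / 1e9

# Three orders of magnitude coarser, as in conventional flood prediction.
n_points_conventional = n_points / 1000

print(f"LiDAR points:        {n_points:.3e}")
print(f"Raw size (assumed):  {raw_gigabytes:.1f} GB")
print(f"Conventional points: {n_points_conventional:.3e}")
```

That is roughly 1.3 billion elevation points for a single county, against roughly 1.3 million at the conventional resolution.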
Example Data Grid: The Earth System Grid
U.S. DOE SciDAC funded R&D effort
Build an Earth System Grid that enables management, discovery, distributed access, processing, & analysis of distributed terascale climate research data
A Collaboratory Pilot Project
Build upon ESG-I, Globus Toolkit, DataGrid technologies, and deploy
Potential broad application to other areas
http://www.earthsystemgrid.org
ESG Data Sets
Community Climate Systems Model data
  This is data that is compatible with the National Center for Atmospheric Research (NCAR) global climate model, CCSM.
  Couples atmospheric, land surface, ocean, and sea ice models.
  This is a US government model for climate modeling and prediction.
  http://www.ccsm.ucar.edu/
Parallel Climate Model data
  Data compatible with extensions to CCSM.
  Uses the same atmospheric model but different ocean and sea ice models.
ESG Challenges
By the end of 2003, DOE-sponsored climate change research had produced 100 TB of scientific data.
  Stored across several DOE sites and NCAR.
  A consequence of HPC; this will only escalate as models simulate global weather patterns at increasingly fine resolution.
Basic problems in data management:
  What is in the data files (metadata)?
  How were data created and by whom (provenance)?
  How can data be stored and moved between sites efficiently?
  How can data be delivered to the scientific community?
ESG web portal
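To see why moving data between sites efficiently is a basic problem, consider how long the 100 TB archive would take to copy. The link speeds below are hypothetical round numbers, not ESG's actual network capacity, and the calculation assumes a fully used link with no protocol overhead (the efficiency parameter is a hypothetical knob for relaxing that).

```python
SECONDS_PER_DAY = 86400.0
ARCHIVE_BYTES = 100e12  # 100 TB, using decimal terabytes


def transfer_time_days(n_bytes, link_bits_per_s, efficiency=1.0):
    """Days needed to move n_bytes over a link at the given usable fraction."""
    seconds = n_bytes * 8 / (link_bits_per_s * efficiency)
    return seconds / SECONDS_PER_DAY


# Hypothetical link speeds, chosen only for illustration.
link_speeds = {
    "100 Mbit/s": 100e6,
    "1 Gbit/s": 1e9,
    "10 Gbit/s": 10e9,
}

days_by_link = {
    label: transfer_time_days(ARCHIVE_BYTES, rate)
    for label, rate in link_speeds.items()
}

for label, days in days_by_link.items():
    print(f"{label:>10}: {days:6.1f} days")
```

Even at a sustained 1 Gbit/s a full copy takes over nine days, which suggests why selective access driven by metadata matters more than bulk replication.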
Example Data Grid: GEON
Project Goal: Prototype interpretive environments of the future in Earth Sciences.
Use advanced information technologies to facilitate collaborative, inter-disciplinary science efforts.
Scientists will be able to discover data, tools, and models via portals, using advanced, semantics-based search engines and query tools, in a uniform authentication environment that provides controlled access to a wide range of resources.
  A prototype Semantic Grid
A services-based environment facilitates creation of scientific workflows that are executed in the distributed environment.
Advanced GIS mapping, 3D, and 4D visualization tools allow scientists to interact with the data.
www.geongrid.org
GEON Grid Application: SYNSEIS
SYNSEIS is a grid application that provides an opportunity for seismologists and other earth science partners to compute and study 3D seismic records to understand complex subsurface structures.
SYNSEIS is built using a service-based architecture. While it provides users an easy-to-use GUI to access data, models and compute resources, it also provides connectors (APIs) for developers, should they choose to utilize any of its components in other applications.
SYNSEIS Architecture
[Diagram labels: GASS, GRAM, GridFTP, GSI, SYNSEIS (FLASH GUI), IRIS DMC, SynSeis Engine, GEON Portal, Cornell Map Server, CORBA, Web service]
Waveform and seismic event catalogs: www.iris.edu
GEON SYNSEIS Conclusions
Using Grid technology, the GEON team was able to bring an extremely complex and cumbersome seismic data analysis procedure to a level where anyone can use it efficiently and effectively; hence SYNSEIS is a first step towards faster discovery.
Democratization of community resources allows not only GEON researchers but also external community members to access state-of-the-art software and tools.
Although the tool was developed for GEON applications, it holds tremendous potential for projects like EarthScope. SYNSEIS can be used by EarthScope researchers to conduct timely analysis of collected data.
SYNSEIS also has high potential for use in educational environments, allowing students to experiment with data and make their own earthquakes.
SYNSEIS has allowed us to practice building distributed data and computational resources.
SERVOGrid Example: GeoFEST
SERVOGrid was discussed in more detail in the October lecture of this series.
  But worth another mention in this context.
GeoFEST is Geophysical Finite Element Simulation Tool
Geo