Distributed Applications: Examining the Past Understanding the Present Preparing for the Future(Grid) Shantenu Jha Director, Cyber-Infrastructure Development, CCT Computer Science e-Science Institute, Edinburgh http://www.cct.lsu.edu/~sjha http://saga.cct.lsu.edu


Distributed Applications: Examining the Past

Understanding the Present Preparing for the Future(Grid)

Shantenu Jha

Director, Cyber-Infrastructure Development, CCT

Computer Science

e-Science Institute, Edinburgh

http://www.cct.lsu.edu/~sjha

http://saga.cct.lsu.edu


Outline

Critical Perspective on Large-Scale Distributed Applications and Production Cyber-Infrastructure (CI)

Understanding Distributed Applications (DA)
• How DA differ from HPC or parallel (||) applications
• Challenges of DA
• DA development objectives (IDEAS)

Understanding SAGA

Using SAGA to develop Distributed Applications
• Frameworks
• Abstractions for Dynamic Execution
• Data-Intensive Applications

Discuss how IDEAS are met

Derive (initial) user requirements/requests for FutureGrid


Critical Perspectives

Distributed CI: Is the whole greater than the sum of the parts?

Several BIG projects have success stories on the TeraGrid (TG)
• But REAL science happens at ALL SCALES
• Tools for individual users to innovate and develop?

Infrastructure capabilities and policy determine application development, deployment and execution:
• Proportion of applications that utilize multiple distributed sites sequentially, concurrently or asynchronously is low (~5%)
• Not referring to tightly-coupled applications across multiple sites
• TG (exclusively) supported legacy, static execution models

Move data to computing, or compute where the data is?
• Distributed Data/Jobs vs. bringing it all into the Cloud

What novel applications & science has Distributed CI fostered?


Understanding Distributed Applications Development Challenges

• Fundamentally a hard problem:
• Dynamic resources, heterogeneous resources
• Variable control (or lack thereof)
• Add to it: complex underlying infrastructure provisioning
• Programming Systems for Distributed Applications:
• Incomplete? Customization? Extensibility?
• Computational Models of Distributed Computing
• Design Points: More than (peak) performance
• Primary role of Usage Modes
• Range of DA, no clear taxonomy


Understanding Distributed Applications Development Challenges

Distributed Applications require coordination over multiple, distributed sites:
• Scale-up and scale-out
• Logically or physically distributed

1st generation of Peta/Exa/Zetta/Yotta-scale: applications requiring multiple runs, ensembles, workflows...

Core characteristics and challenges of logically and physically distributed applications are the SAME

Interplay of Requirements, Infrastructure, Usage Mode

The ability to develop simple, novel or effective distributed applications lags behind other aspects of CI

General-purpose Distributed Application development is lacking in NSF/OCI's portfolio...


Understanding Distributed Applications Development Objectives

Interoperability: Ability to work across multiple distributed resources

Distributed Scale-Out: The ability to utilize multiple distributed resources concurrently

Extensibility: Support new patterns/abstractions, different programming systems, functionality & Infrastructure

Adaptivity: Respond to fluctuations in dynamic resources and in the availability of dynamic data

Simplicity: Accommodate above distributed concerns at different levels easily…

Challenge: How to develop DA effectively and efficiently with the above as first-class objectives?


SAGA: Basic Philosophy

There is a lack of programmatic approaches that:
• Provide general-purpose, common grid functionality for applications and thus hide underlying complexity, varying semantics...
• Offer the building blocks upon which to construct “consistent” higher levels of functionality and abstractions
• Hide “bad” heterogeneity and give the means to address “good” heterogeneity
• Meet the needs of a broad spectrum of applications: simple scripts, Gateways, Smart Applications and production-grade tooling, workflow…

Simple, integrated, stable, uniform and high-level interface:
• Simple and Stable: 80:20 restricted scope and Standard
• Integrated: Similar semantics & style across
• Uniform: Same interface for different distributed systems

SAGA: Provides Application* developers with the basic units required to compose high-level functionality across (distinct) distributed systems

(*) One Person’s Application is another Person’s Tool


SAGA: The Standard Landscape


SAGA: In a thousand words..


SAGA: Job Submission
Role of Adaptors (middleware binding)
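The adaptor idea can be illustrated with a small, self-contained Python sketch: one uniform job-submission front end, with the middleware binding chosen at runtime from the URL scheme. This is not the SAGA API. The class names (`JobService`, `ForkAdaptor`, `BatchAdaptor`), the `fork` and `batch` schemes and the host name are all hypothetical, and the adaptors only return a string instead of launching anything.

```python
from urllib.parse import urlparse


class ForkAdaptor:
    """Hypothetical adaptor: stands in for running the job locally."""

    def submit(self, executable, arguments):
        return "fork: " + " ".join([executable, *arguments])


class BatchAdaptor:
    """Hypothetical adaptor: stands in for a remote batch-queue binding."""

    def __init__(self, host):
        self.host = host

    def submit(self, executable, arguments):
        return f"batch@{self.host}: " + " ".join([executable, *arguments])


class JobService:
    """Uniform front end; the URL scheme selects the middleware binding."""

    def __init__(self, url):
        parts = urlparse(url)
        if parts.scheme == "fork":
            self._adaptor = ForkAdaptor()
        elif parts.scheme == "batch":
            self._adaptor = BatchAdaptor(parts.hostname)
        else:
            raise ValueError(f"no adaptor for scheme {parts.scheme!r}")

    def run(self, executable, arguments=()):
        # Application code calls run() the same way for every backend.
        return self._adaptor.submit(executable, list(arguments))
```

The application-facing call stays identical (`JobService(url).run("/bin/date")`); only the URL changes when the target middleware changes, which is the property the adaptor layer is meant to provide.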


SAGA Job API: Example


SAGA: Other Packages


SAGA and Distributed Applications


SAGA-based Frameworks: Types

Frameworks: Logical structure for capturing application requirements, characteristics & patterns
• Runtime and/or Application Framework

Application Frameworks are designed around either:
• Pattern: Commonly recurring modes of computation (programming, deployment, execution, data-access...), e.g. MapReduce, Master-Worker, H-J Submission
• Abstraction: Mechanism to support patterns and application characteristics

Runtime Frameworks: Load-Balancing – Compute and Data Distribution

SAGA-based Framework: Infrastructure-independent


Abstractions for Dynamic Execution (1): Container Task

Adaptive:

Type A: Fix number of replicas; vary cores assigned to each replica.

Type B: Fix the size of replica, vary number of replicas (Cool Walking)
-- Same temperature range (adaptive sampling)
-- Greater temperature range (enhanced dynamics)
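The difference between the two adaptive modes comes down to which quantity is held fixed when the available core count changes. The following Python sketch makes that arithmetic explicit. The function names and the integer-division policy are illustrative assumptions, not taken from the talk.

```python
def type_a(total_cores, n_replicas):
    """Type A: replica count is fixed; cores per replica track the allocation."""
    return {"replicas": n_replicas,
            "cores_per_replica": total_cores // n_replicas}


def type_b(total_cores, cores_per_replica):
    """Type B: replica size is fixed; replica count tracks the allocation."""
    return {"replicas": total_cores // cores_per_replica,
            "cores_per_replica": cores_per_replica}
```

When the allocation doubles from 256 to 512 cores, `type_a(…, 8)` keeps 8 replicas and gives each twice the cores, while `type_b(…, 32)` keeps 32-core replicas and doubles their number.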


Abstractions for Dynamic Execution (2): SAGA Pilot-Job (BigJob)
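A pilot-job is generally understood as a container job that acquires a block of resources once and then schedules application sub-jobs onto it itself, so the sub-jobs do not each wait in the batch queue. The following Python sketch models only that bookkeeping. It is not the BigJob API: the `Pilot` class, its method names and the first-in-first-out policy are hypothetical.

```python
from collections import deque


class Pilot:
    """Toy pilot: owns a fixed number of cores and schedules sub-jobs FIFO."""

    def __init__(self, cores):
        self.cores = cores      # size of the pilot's allocation
        self.free = cores       # cores not used by a running sub-job
        self.queue = deque()    # sub-jobs waiting inside the pilot
        self.running = {}       # sub-job name -> cores in use

    def submit(self, name, cores):
        if cores > self.cores:
            raise ValueError("sub-job is larger than the pilot")
        self.queue.append((name, cores))
        self._schedule()

    def complete(self, name):
        # Return the finished sub-job's cores to the pool and reschedule.
        self.free += self.running.pop(name)
        self._schedule()

    def _schedule(self):
        # Start queued sub-jobs for as long as the head of the queue fits.
        while self.queue and self.queue[0][1] <= self.free:
            name, cores = self.queue.popleft()
            self.running[name] = cores
            self.free -= cores
```

With an 8-core pilot, two 4-core sub-jobs start immediately, a third 2-core sub-job waits inside the pilot, and it starts as soon as one of the first two completes.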


Coordinate Deployment & Scheduling of Multiple Pilot-Jobs


Distributed Adaptive Replica Exchange (DARE): Scale-Out, Dynamic Resource Allocation and Aggregation


Multi-Physics Runtime Frameworks: Extensibility

Coupled multi-physics applications require two distinct but concurrent simulations

Can co-scheduling be avoided?

Adaptive execution model: Yes

Load-balancing required. Pilot-Job facilitates LB! Across sites? (open Q)

First demonstrated multi-platform Pilot-Job:

MPI-based TG – Condor GI


Dynamic Execution Reduced Time to Solution


Ensemble Kalman Filters: Heterogeneous Sub-Tasks

Ensemble Kalman filters (EnKF) are recursive filters for handling large, noisy data; here the EnKF is used for history matching and reservoir characterization

EnKF is a particularly interesting case of irregular, hard-to-predict run time characteristics:


Results: Scale-Out Performance

Using more machines decreases the time to completion (TTC) and the variation between experiments

Using batch queue prediction (BQP) decreases the TTC and the variation between experiments further

Lowest time to completion achieved when using BQP and all available resources

(Khamra & Jha, GMAC, ICAC’09)
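One way queue-wait predictions can feed resource selection is sketched below in Python: given a predicted wait per resource, submit to the resources expected to start soonest. The function name and the wait times are made up for illustration, and the talk does not state that this is the exact selection rule used in the experiments.

```python
def pick_resources(predicted_wait_s, k):
    """Return the k resource names with the smallest predicted queue wait."""
    return sorted(predicted_wait_s, key=predicted_wait_s.get)[:k]


# Hypothetical predictions in seconds; not measured values.
PREDICTED_WAIT_S = {"ranger": 1200, "abe": 300, "qb": 600}
```

Calling `pick_resources(PREDICTED_WAIT_S, 2)` selects the two resources with the shortest predicted wait, so sub-tasks are not left idle behind a long queue.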


But Why does BQP Help? The Case for System Sensors


Autonomic Integration of HPC Grids-Clouds
EnKF: Extensibility and Interoperability

(work with M. Parashar et al. Accepted for e-Science 2009)

• Application Objectives:
• Acceleration
• Resilience
• Conservation

• Pull vs Push Task map


Application-level Interoperability: Cloud-Cloud; Cloud-Grid

Application-level (ALI) vs. System-level Interoperability (SLI)
• Infrastructure independence is a prerequisite for ALI

The case for both Grids AND Clouds:
• Hybrid & heterogeneous workloads: data-compute affinity differs
• Availability zone, data-transfer cost...
• Complex data-flow dependency: needs runtime determination

Just because you can use Grids AND Clouds, should you?

Important research question: When should you?
• Runtime decision: mechanism to determine when/if?
• Should be influenced by Application Objectives
• Programming model should be infrastructure-independent

Same application, priced differently, for same performance
Same application, priced same, for different performance
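A runtime decision driven by application objectives could look like the Python sketch below. It rests on two assumptions that are not from the talk: "acceleration" is read as minimising time to solution and "conservation" as minimising monetary cost. The estimates are invented numbers, and "resilience" is left out because the slides do not say how it would be measured.

```python
# Hypothetical per-infrastructure estimates; the numbers are illustrative only.
ESTIMATES = {
    "grid": {"time_to_solution_s": 3600, "cost_usd": 0.0},
    "cloud": {"time_to_solution_s": 1800, "cost_usd": 12.0},
}


def choose_infrastructure(estimates, objective):
    """Pick an infrastructure name according to the application objective."""
    if objective == "acceleration":      # assumed: minimise time to solution
        metric = "time_to_solution_s"
    elif objective == "conservation":    # assumed: minimise monetary cost
        metric = "cost_usd"
    else:
        raise ValueError(f"unsupported objective: {objective!r}")
    return min(estimates, key=lambda name: estimates[name][metric])
```

The application code that consumes the result does not change with the answer, which is why an infrastructure-independent programming model is a precondition for making this choice at runtime.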


SAGA-based Frameworks: Examples

SAGA-based Pilot-Job Framework (FAUST)
• Extend to support load-balancing for multi-components

SAGA MapReduce Framework:
• Control the distribution of Tasks (workers)
• Master-Worker: File-Based &/or Stream-Based
• Data-locality optimization using SAGA’s replica API

SAGA NxM Framework:
• Compute Matrix Elements, each is a Task
• All-to-All Sequence comparison
• Control the distribution of Tasks and Data
• Data-locality optimization via external (runtime) module
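The NxM pattern can be sketched in Python as follows: every (row, column) pair of an all-to-all comparison becomes one independent task, and the tasks are then handed out to workers. The helper names and the round-robin policy are illustrative assumptions. The actual framework is described as controlling placement with data locality in mind, which this sketch does not attempt.

```python
from itertools import product


def nxm_tasks(rows, cols):
    """One task per matrix element: compare rows[i] against cols[j]."""
    return list(product(rows, cols))


def round_robin(tasks, workers):
    """Distribute tasks across workers in turn (no locality awareness)."""
    plan = {worker: [] for worker in workers}
    for index, task in enumerate(tasks):
        plan[workers[index % len(workers)]].append(task)
    return plan
```

For 3 row sequences and 2 column sequences this yields 6 tasks; with two workers, each receives 3.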


Distributed Data Intensive Applications: Research Challenges

Goal: Develop DDI scientific applications to utilize a broad range of distributed systems, without vendor lock-in, or disruption, yet with the flexibility and performance that scientific applications demand.

Frameworks as possible solutions

Frameworks address some primary challenges in developing Distributed DI Applications:
• Coordination of distributed data & computing
• Runtime (dynamic) scheduling, placement
• Fault-tolerance

Many challenges in developing such Frameworks:
• What are the components? How are they coupled?
• Functionality expressed/exposed?
• Coordination?
• Layering, ordering, encapsulation of components

“Just because you can’t use MPI (on distributed systems), doesn’t mean you can’t use other approaches”


Frameworks: Logical ordering

SAGA


Frameworks: Logical ordering


SAGA-MapReduce (Miceli, Jha et al., CCGrid’09; Merzky, Jha et al., GPC’09)

Interoperability: Use multiple infrastructures concurrently

Control the NW placement

Simple staging of data

SAGA-Sphere-Sector: Open Cloud Consortium
• Stream processing model
• Apply to all elements (files) in a data-set (stream)
• Ongoing work

Ts: Time-to-solution, including data-staging for SAGA-MapReduce (simple file-based mechanism)
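For readers unfamiliar with the pattern, the following Python sketch shows the three MapReduce phases (map, shuffle, reduce) on a word count, with each input string standing in for one file chunk. It runs in a single process and is not the SAGA-MapReduce implementation; all function names are illustrative.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk."""
    return [(word, 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def map_reduce(chunks):
    # In a distributed run each chunk would be mapped by a separate worker.
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(pairs))
```

In a distributed setting the map calls are what get farmed out to workers, which is where control over worker placement and data staging affects the time-to-solution.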


Controlling Relative Compute-Data Placement


SAGA All-Pairs: Runtime Data Placement

Classical: Place task on 4 LONI machines (512px Dell Clusters); simple data staging

“Intelligent”: Map a task to a resource based upon Cost

Cost = Data Dependency + transfer times (latency)

“Ignoring Intelligent mapping is no longer an option” (quote from undergraduate Miceli)
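A cost-based mapping of this kind can be sketched in Python as below. The concrete cost model is an assumption: for each input file not already on the candidate resource, the cost adds the link latency plus file size divided by bandwidth. The site names, file names, sizes and link figures are invented for illustration.

```python
def transfer_cost_s(task_inputs, resource, location, size_mb, links):
    """Seconds needed to bring every input not already on `resource` to it."""
    total = 0.0
    for name in task_inputs:
        source = location[name]
        if source != resource:
            latency_s, bandwidth_mb_s = links[(source, resource)]
            total += latency_s + size_mb[name] / bandwidth_mb_s
    return total


def place(task_inputs, resources, location, size_mb, links):
    """Map the task to the resource with the lowest transfer cost."""
    return min(
        resources,
        key=lambda r: transfer_cost_s(task_inputs, r, location, size_mb, links),
    )


# Illustrative numbers only: (latency in s, bandwidth in MB/s) per link.
LOCATION = {"seq_a": "site1", "seq_b": "site2"}
SIZE_MB = {"seq_a": 1000.0, "seq_b": 10.0}
LINKS = {("site1", "site2"): (0.05, 100.0), ("site2", "site1"): (0.05, 100.0)}
```

Here a task that needs both files is placed on `site1`, because moving the 10 MB file there is far cheaper than moving the 1000 MB file the other way. Computing this mapping is itself work, which the talk's chart accounts for separately as "Intelligence" time.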

[Figure: bar chart comparing Classical and Intelligent placement; bars show Processing Time and "Intelligence" Time on an axis from 0 to 600]


Understanding Distributed Applications Development Objectives Redux

Interoperability: Ability to work across multiple distributed resources
• SAGA: Middleware agnostic

Distributed Scale-Out: The ability to utilize multiple distributed resources concurrently
• Support multiple Pilot-Jobs: Ranger, Abe, QB

Extensibility: Support new patterns/abstractions, different programming systems, functionality & Infrastructure
• Pilot-Job also Coupled CFD-MD, Integrated BQP

Adaptivity: Response to fluctuations in dynamic resource and availability of dynamic data

Simplicity: Accommodate above distributed concerns at different levels easily…


Does SAGA Provide A Fresh Perspective?


Early User: An Environment that Supports

Echo what Andrew Grimshaw said!!
• e.g., test-bed for Standards interoperation

Trivial remarks:
• Not obsessed with system utilization like TG
• Policies that support IDEAS as first-class concerns

Support dynamic, first-principles, explicitly distributed applications

Dynamic, Adaptive Applications:
• Dynamic Resource Utilization: e.g., BQP (Jha et al., GMAC, ICAC Barcelona 2009)
• Grid Observatory (EGEE): all kinds of traces

Dynamic Adaptive Data:
• Network Aware Application (Jha et al., IEEE eScience ’07)
• Data Scheduler: Big Data, Frequent Data


Early User: An Environment that Supports

Autonomic Computational Science Applications
• Support the tuning of and by applications

Platform for developing (SAGA) AF and RT Frameworks
• Design, stand up and experiment with Frameworks
• e.g., load-balancer for dynamic resource allocation
• SAGA-MapReduce, NxM: e.g., control relative placement of data/compute

Supporting Distributed Abstractions at the development, deployment and execution levels

A controlled but realistic environment
• RAIN: Dynamic provisioning (provide clean API)
• (Reproducible) Experimental Manager, VAMPIR
• [Connection with Grid Observatory]


SAGA-based Tools and Projects: One person’s Tool is another person’s Application

DESHL DEISA-based Shell and Workflow library

JSAGA from IN2P3 (Lyon) http://grid.in2p3.fr/jsaga/index.html

GANGA-DIANE gLite

XtreemOS (Based upon SAGA for the Distribution)

NAREGI/KEK SD Specification

With gLite adaptors

Advantage of Standards


Acknowledgements

SAGA Team and DPA Team and the UK-EPSRC (UK EPSRC: DPA, OMII-UK, OMII-UK PAL)

People:

SAGA D&D: Hartmut Kaiser, Ole Weidner, Andre Merzky, Joohyun Kim, Lukasz Lacinski, João Abecasis, Chris Miceli, Bety Rodriguez-Milla

SAGA Users: Andre Luckow, Yaakoub el-Khamra, Kate Stamou, Cybertools (Abhinav Thota, Jeff, N. Kim), Owain Kenway

Google SoC: Michael Miceli, Saurabh Sehgal, Miklos Erdelyi

Collaborators and Contributors: Steve Fisher & Group, Sylvain Renaud (JSAGA), Go Iwai & Yoshiyuki Watase (KEK)

DPA: Dan Katz, Murray Cole, Manish Parashar, Omer Rana, Jon Weissman


Abstractions for Distributed Applications and Systems: A Computational Science Perspective

Authors: S Jha, D Katz, M Parashar, O Rana, J Weissman

Upcoming Book by Wiley (Summer 2010)


SAGA: Building the abstractions to Bridge the Infrastructure-Applications Gap

Focus on Application Development and Characteristics, not infrastructure details


Interoperability


DAG based Workflow Applications: Extensibility Approach

Application Development Phase

Generation & Exec. Planning Phase

Execution Phase


SAGA-based DAG Execution: Preserving Performance
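The execution phase of a DAG-based workflow amounts to running each task only after all of its predecessors have finished. The Python sketch below uses the standard-library `graphlib` module (Python 3.9 or later) to do that. The task names and the `run_task` callback are hypothetical, and the tasks of one wave run sequentially here, whereas a real enactor could dispatch them concurrently.

```python
from graphlib import TopologicalSorter


def run_dag(dependencies, run_task):
    """Run tasks in dependency order; return the order in which they ran.

    `dependencies` maps each task name to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    order = []
    while sorter.is_active():
        # Every task returned here has all of its predecessors completed.
        for task in sorted(sorter.get_ready()):
            run_task(task)
            order.append(task)
            sorter.done(task)
    return order


# Hypothetical workflow: stage data, run two simulations, merge the results.
WORKFLOW = {
    "stage": set(),
    "simA": {"stage"},
    "simB": {"stage"},
    "merge": {"simA", "simB"},
}
```

Calling `run_dag(WORKFLOW, print)` prints the tasks wave by wave; `simA` and `simB` become ready together once `stage` is done, which is the point where independent tasks could be placed on different resources.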