A Grid Parallel Application Framework
Jeremy Villalobos
PhD studentDepartment of Computer Science
University of North Carolina Charlotte
Overview
Parallel Applications on the Grid
Latency Hiding by Redundant Processing (LHRP)
PGAFramework
Related work
Conclusion
Parallel Applications on the Grid
Advantages:
- Access to more resources
- Lower costs
- Future profits from Grid Economy?

Challenges:
- I/O problem
- Need for an easy-to-use interface
- Heterogeneous hardware
Latency Hiding by Redundant Processing
- Latency Hiding problem
- LHRP Algorithm
  - CPU types and the task assigned to each CPU type
  - Versioning system
- Mathematical model to describe LHRP
- Results
LHRP: Latency Hiding by Redundantly Processing
LHRP Algorithm
- Internal: only communicates with LAN CPUs.
- Border: communicates with LAN CPUs and one Buffer CPU.
- Buffer: communicates with the LAN Border CPU and receives data from the WAN Border CPU.
Computation and Communication Stages
Internal:
1. Computes borders
2. Transfers borders (non-blocking)
3. Computes core matrix
4. Waits for transfer ACK
Computation and Communication Stages
Border:
1. Computes borders
2. Transfers borders (non-blocking)
3. Sends far border
4. Computes core matrix
5. Waits for transfer ACK
6. Checks on far border transfer ACK (waits if it is the last iteration)
Computation and Communication Stages
Buffer:
1. Computes borders
2. Transfers borders (non-blocking)
3. Receives far border
4. Computes core matrix
5. Waits for transfer ACK
6. Checks on far border transfer ACK (waits if it is the last iteration)
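The staged loop above can be sketched in Java as follows. This is an illustrative sketch only: the method names and the trace strings are stand-ins, not the framework's actual API. The key point it encodes is the ordering — the core matrix is computed while both transfers are in flight, and the far (WAN) ACK is merely polled except on the last iteration.

```java
public class BorderNodeSketch {
    // One LHRP iteration on a Border node, returned as a trace of stages.
    static String iteration(boolean lastIteration) {
        StringBuilder t = new StringBuilder();
        t.append("borders;");   // 1. compute the border cells neighbors need
        t.append("xferAsync;"); // 2. non-blocking LAN border transfer
        t.append("farSend;");   // 3. non-blocking WAN far-border send
        t.append("core;");      // 4. overlap: compute the core matrix
        t.append("ackWait;");   // 5. block on the LAN transfer ACK
        // 6. the far (WAN) ACK is only checked; the node blocks on it
        //    solely on the last iteration, hiding grid latency otherwise
        t.append(lastIteration ? "farWait;" : "farCheck;");
        return t.toString();
    }

    public static void main(String[] args) {
        System.out.println(iteration(false));
        System.out.println(iteration(true));
    }
}
```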
Buffer Node Versioning Algorithm

[Figure, built up across seven animation steps: a version table for Grid to Local Latency Ratio = 3. Rows are row coordinates 1-14; the columns are the column coordinates held across Nodes 2-5. Each cell holds the version of the data at that position. Final state:]

row  versions across Nodes 2-5
 1   1 1 1 1 0 0 1 1 1 1
 2   2 2 2 1 0 0 1 2 2 2
 3   3 3 2 1 1 1 1 2 3 3
 4   3 3 2 2 2 2 2 2 3 3
 5   3 3 3 3 3 3 3 3 3 3
 6   4 4 4 4 3 3 4 4 4 4
 7   5 5 5 4 3 3 4 5 5 5
 8   6 6 5 4 4 4 4 5 6 6
 9   6 6 5 5 5 5 5 5 6 6
10   6 6 6 6 6 6 6 6 6 6
11   7 7 7 7 6 6 7 7 7 7
12   8 8 8 7 6 6 7 8 8 8
13   9 9 8 7 7 7 7 8 9 9
14   9 9 8 8 8 8 8 8 9 9

[The middle columns — the redundantly processed buffer region — lag behind the versions held by the surrounding nodes until the far border transfer catches them up.]
LHRP Algorithm Review
- Node types: Internal, Border, Buffer
- Far Border transfer
- Buffer Node Versioning system
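The deck names a Buffer Node Versioning system but does not give its data structure, so the following is only a hypothetical sketch of the idea: the buffer node keeps old copies of a boundary column so it can keep computing (redundantly, on stale data) while the far border for the current iteration is still crossing the WAN. All class and method names here are invented for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

public class ColumnVersionStore {
    // Stored boundary columns, keyed by the iteration that produced them.
    private final TreeMap<Integer, double[]> versions = new TreeMap<>();

    // Keep a copy of the boundary column produced at this iteration.
    void store(int iteration, double[] column) {
        versions.put(iteration, column.clone());
    }

    // Newest stored column not newer than the requested iteration:
    // when the far (WAN) border for the current iteration has not
    // arrived, the buffer node falls back to an older version and
    // keeps computing redundantly instead of idling.
    double[] newestUpTo(int iteration) {
        Map.Entry<Integer, double[]> e = versions.floorEntry(iteration);
        return e == null ? null : e.getValue();
    }

    public static void main(String[] args) {
        ColumnVersionStore store = new ColumnVersionStore();
        store.store(1, new double[]{1.0});
        store.store(2, new double[]{2.0});
        // Iteration 4's far border is still in flight: reuse version 2.
        System.out.println(store.newestUpTo(4)[0]);
    }
}
```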
Estimated Algorithm Performance
- G: Grid latency
- I: Internal latency
- B: Amount of data tuples used by the Buffer Node
- W: Total amount of work for all CPUs
- C: Amount of CPUs doing non-redundant work
Estimated Algorithm Performance
[Chart: LH vs LHRP — estimated total time to compute one cycle as a function of compute time in nanoseconds per subcell]

Estimated Algorithm Performance

[Chart: LH vs LHRP — calculated total time as a function of Grid latency]
Process   LH      LHRP
0         35724   41856
1         7408    8228
2         7408    10932
3         7408    10928
4         7412    8528
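The table's absolute times can be turned into relative overheads directly. A quick sketch that just redoes the arithmetic (values copied from the table above) shows LHRP costing roughly 11-17% extra on most processes, with processes 2 and 3 paying around 47% more:

```java
public class LhrpOverhead {
    // Per-process times copied from the slide's LH vs LHRP table.
    static final int[] LH   = {35724, 7408, 7408, 7408, 7412};
    static final int[] LHRP = {41856, 8228, 10932, 10928, 8528};

    // Percentage increase of LHRP over plain Latency Hiding for process p.
    static double overheadPercent(int p) {
        return 100.0 * (LHRP[p] - LH[p]) / LH[p];
    }

    public static void main(String[] args) {
        for (int p = 0; p < LH.length; p++) {
            System.out.printf("process %d: +%.1f%%%n", p, overheadPercent(p));
        }
    }
}
```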
Experimental Result: Memory Footprint
21% increase in memory use over the conventional form of Latency Hiding. Causes:
- Extra matrix in the Buffer Node to store old column versions
- Extra far border buffers
Experimental Results: Performance
[Chart: LH vs LHRP — total compute time (sec) as a function of Grid latency (ms), showing average, min, and max for each]

Experimental Results: Performance

[Chart: LHRP vs LH compute time — total compute time as a function of compute time in ns per subcell, showing average, min, and max for each]
PGAFramework
- Objective
- Design Requirements
- Implementation technology choices
- API Design
- API Workpool Example
- Other API features
  - Synchronization option
  - Recursive option
PGAFramework
Objective: to create an efficient parallel application framework for the Grid that allows the user programmer easy interaction with Grid resources.

Design Requirements:
- Platform independence
- Self-deployment
- Easy-to-use interface
- Provide the following services without requiring extra effort on the part of the user programmer:
  - Load balancing
  - Scheduling
  - Fault tolerance
  - Latency/bandwidth tolerance
Design: PGAFramework

[Diagram: layered architecture. The user's applications sit on top of the API (interface); beneath it the PGAFramework provides Load Balancing, Scheduling, Fault Tolerance, and Latency/Bandwidth Tolerance services; these run over Globus and a job scheduler such as Condor, which manage the hardware resources.]
Deployment

[Diagram: a Job Submit Node running Scheduling and Resource Discovery services (possibly via GridWay) submits through Globus to heterogeneous resources — desktop PCs, cluster computer nodes, and a supercomputer — each managed by a local scheduler such as Condor, PBS, or SGE.]
Implementation
- Java: platform independence
- JXTA (JXSE): peer-to-peer API
  - Provides tools to work around NATs and firewalls
  - Provides library and module runtime loading
API
Motivation for API Design
- Video codecs follow an interface
- What happens inside the codec does not matter
- Only the input and output for the codec need to be specified

[Diagram: a video player (displays a GUI, loads a file, outputs video to the screen) feeds an MPEG-encoded stream through an interchangeable codec — mpeg, ogg, or h.264 — to produce raw video data.]
PGAFramework API
- There may be multiple "template" APIs
- Each API has interfaces that the user implements
- The user "inserts" his module into the framework
API
[Diagram: division of labor across the API. The user's module: gets data from the framework, computes on the data, returns the processed data, and optionally requests a sync; at the edges it gives data to the framework and gets data back to store or pipe. The framework: schedules processes on the resources, loads user data, creates the network, determines topology and network behavior, sends the user process to the compute nodes, gets data from the user class, sends it to the master node, and repeats the process in a loop until done.]
API Sample Code
public interface GridAppTemplate {
    public Data Compute(Data input);
    public Data DiffuseData(long segment);
    public void GatherData(long segment, Data dat);
    public long getDataCount();
}
API Sample Code

public class myModule implements GridAppTemplate {
    double x1, x2, y1, y2;
    double total;
    long random_samples;
    final long data_count = 100;

    public myModule(double x1_arg, double x2_arg,
                    double y1_arg, double y2_arg, long rad_smp) {
        x1 = x1_arg;
        x2 = x2_arg;
        y1 = y1_arg;
        y2 = y2_arg;
        total = 0;
        random_samples = rad_smp;
    }

    @Override
    public Data Compute(Data data) {
        // convert generic object to my object
        MyData dat = (MyData) data;
        MyOutputData output = new MyOutputData();
        output.inside = 0;
        // compute
        double dist = Math.sqrt(dat.random_x * dat.random_x
                + dat.random_y * dat.random_y);
        if (dist < 1.0) {
            output.inside = 1L;
        }
        return output;
    }

    @Override
    public Data DiffuseData(long segment) {
        MyData d = new MyData();
        d.random_x = Math.random() * (x2 - x1) + x1;
        d.random_y = Math.random() * (y2 - y1) + y1;
        return d;
    }

    @Override
    public void GatherData(long segment, Data dat) {
        MyOutputData data = (MyOutputData) dat;
        total += data.inside;
    }

    @Override
    public long getDataCount() {
        return random_samples;
    }

    public double getPi() {
        double pi = (total / random_samples) * 4;
        return pi;
    }
}
API
API Sample Code
public class UserApplication {
    public static void main(String[] args) throws UserModNotSetException {
        // Instantiate the custom module that follows
        // the GridAppTemplate interface.
        // The user can use the constructor, or some other way to get the
        // parameters set, such as file paths and options to tweak an algorithm.
        myModule mod = new myModule(0.0, 1.0, 0.0, 1.0, 10000);
        // submit the module to the network
        NetworkDeployer deployer = new NetworkDeployer(mod);
        // start the network
        deployer.startNetwork();
        double pi = mod.getPi();
        System.out.println("PI is: " + pi);
    }
}
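The module above is a Monte Carlo estimate of pi: points are drawn uniformly in the unit square and the fraction landing inside the unit circle approximates pi/4. Since the framework classes (Data, NetworkDeployer, etc.) are not available outside the PGAFramework, here is a self-contained sequential sketch of the same computation, with the roles of DiffuseData (draw a point), Compute (the hit test), and GatherData (accumulate hits) collapsed into one loop. The seed and sample count are arbitrary choices for the example:

```java
import java.util.Random;

public class SequentialPi {
    // Sequential Monte Carlo estimate of pi over the unit square.
    static double estimate(long samples, long seed) {
        Random rng = new Random(seed); // fixed seed for repeatability
        long inside = 0;
        for (long i = 0; i < samples; i++) {
            double x = rng.nextDouble();           // DiffuseData: one point
            double y = rng.nextDouble();
            if (Math.sqrt(x * x + y * y) < 1.0) {  // Compute: hit test
                inside++;                          // GatherData: accumulate
            }
        }
        return 4.0 * inside / samples;             // getPi
    }

    public static void main(String[] args) {
        System.out.println("PI is about: " + estimate(100000, 42));
    }
}
```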
Synchronization option
- RemoteEventHandler provides an interface to synchronize data
- Data is synced non-blocking
- The user creates blocking procedures if needed

public class myRemoteHandler implements RemoteEventHandler {
    mySyncData chunk;

    public myRemoteHandler(Data chunk) {
        this.chunk = (mySyncData) chunk;
    }

    @Override
    public void SyncDone(SyncData piece) {
        chunk.setPiece(piece);
    }
}
Recursive Feature
Allows multiple levels of parallelization (granularity).

[Diagram: a video task decomposed at successively finer granularities — Decode Video, Cut Raw Video Into Pictures, Blur Pictures, Blur Portion of Picture — mapped onto Pipeline, Work Pool, and Synchronous patterns.]
Related Work
- MPI implementations for the Grid: MPICH-G2, GridMPI, MPICH-V2 (MPICH-V1)
- Peer-to-peer parallel frameworks: P2PMPI (for cluster computing), P3 (for cluster computing)
- Self-deploying frameworks: Jojo
Conclusions
Parallel Applications on the Grid
Latency Hiding by Redundant Processing (LHRP)
PGAFramework
Related work