A Grid Parallel Application Framework
Jeremy Villalobos
PhD studentDepartment of Computer Science
University of North Carolina Charlotte
Overview
Parallel Applications on the Grid
Latency Hiding by Redundant Processing (LHRP)
PGAFramework
Related work
Conclusion
Parallel Applications on the Grid
Advantages:
- Access to more resources
- Lower costs
- Future profits from Grid Economy?

Challenges:
- I/O problem
- Need for an easy-to-use interface
- Heterogeneous hardware
Latency Hiding by Redundant Processing
- Latency Hiding problem
- LHRP Algorithm
  - CPU types and the task assigned to each CPU type
  - Versioning system
- Mathematical model to describe LHRP
- Results
LHRP: Latency Hiding by Redundantly Processing
LHRP Algorithm
- Internal: only communicates with LAN CPUs.
- Border: communicates with LAN CPUs and one Buffer CPU.
- Buffer: communicates with the LAN Border CPU and receives data from the WAN Border CPU.
Computation and Communication Stages
Internal:
1. Computes borders
2. Transfers borders (non-blocking)
3. Computes core matrix
4. Waits for transfer ACK
Computation and Communication Stages
Border:
1. Computes borders
2. Transfers borders (non-blocking)
3. Sends far border
4. Computes core matrix
5. Waits for transfer ACK
6. Checks on far border transfer ACK (waits if it is the last iteration)
Computation and Communication Stages
Buffer:
1. Computes borders
2. Transfers borders (non-blocking)
3. Receives far border
4. Computes core matrix
5. Waits for transfer ACK
6. Checks on far border transfer ACK (waits if it is the last iteration)
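The staged loop above can be sketched in Java as follows. This is an illustrative sketch only: the method names and the trace strings are stand-ins, not the framework's actual API. The key point it encodes is the ordering — the core matrix is computed while both transfers are in flight, and the far (WAN) ACK is merely polled except on the last iteration.

```java
public class BorderNodeSketch {
    // One LHRP iteration on a Border node, returned as a trace of stages.
    static String iteration(boolean lastIteration) {
        StringBuilder t = new StringBuilder();
        t.append("borders;");   // 1. compute the border cells neighbors need
        t.append("xferAsync;"); // 2. non-blocking LAN border transfer
        t.append("farSend;");   // 3. non-blocking WAN far-border send
        t.append("core;");      // 4. overlap: compute the core matrix
        t.append("ackWait;");   // 5. block on the LAN transfer ACK
        // 6. the far (WAN) ACK is only checked; the node blocks on it
        //    solely on the last iteration, hiding grid latency otherwise
        t.append(lastIteration ? "farWait;" : "farCheck;");
        return t.toString();
    }

    public static void main(String[] args) {
        System.out.println(iteration(false));
        System.out.println(iteration(true));
    }
}
```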
Buffer Node Versioning Algorithm

[Figure, built up across seven animation steps: a version table for Grid to Local Latency Ratio = 3. Rows are row coordinates 1-14; the columns are the column coordinates held across Nodes 2-5. Each cell holds the version of the data at that position. Final state:]

row  versions across Nodes 2-5
 1   1 1 1 1 0 0 1 1 1 1
 2   2 2 2 1 0 0 1 2 2 2
 3   3 3 2 1 1 1 1 2 3 3
 4   3 3 2 2 2 2 2 2 3 3
 5   3 3 3 3 3 3 3 3 3 3
 6   4 4 4 4 3 3 4 4 4 4
 7   5 5 5 4 3 3 4 5 5 5
 8   6 6 5 4 4 4 4 5 6 6
 9   6 6 5 5 5 5 5 5 6 6
10   6 6 6 6 6 6 6 6 6 6
11   7 7 7 7 6 6 7 7 7 7
12   8 8 8 7 6 6 7 8 8 8
13   9 9 8 7 7 7 7 8 9 9
14   9 9 8 8 8 8 8 8 9 9

[The middle columns — the redundantly processed buffer region — lag behind the versions held by the surrounding nodes until the far border transfer catches them up.]
LHRP Algorithm Review
- Node types: Internal, Border, Buffer
- Far Border transfer
- Buffer Node Versioning system
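The deck names a Buffer Node Versioning system but does not give its data structure, so the following is only a hypothetical sketch of the idea: the buffer node keeps old copies of a boundary column so it can keep computing (redundantly, on stale data) while the far border for the current iteration is still crossing the WAN. All class and method names here are invented for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

public class ColumnVersionStore {
    // Stored boundary columns, keyed by the iteration that produced them.
    private final TreeMap<Integer, double[]> versions = new TreeMap<>();

    // Keep a copy of the boundary column produced at this iteration.
    void store(int iteration, double[] column) {
        versions.put(iteration, column.clone());
    }

    // Newest stored column not newer than the requested iteration:
    // when the far (WAN) border for the current iteration has not
    // arrived, the buffer node falls back to an older version and
    // keeps computing redundantly instead of idling.
    double[] newestUpTo(int iteration) {
        Map.Entry<Integer, double[]> e = versions.floorEntry(iteration);
        return e == null ? null : e.getValue();
    }

    public static void main(String[] args) {
        ColumnVersionStore store = new ColumnVersionStore();
        store.store(1, new double[]{1.0});
        store.store(2, new double[]{2.0});
        // Iteration 4's far border is still in flight: reuse version 2.
        System.out.println(store.newestUpTo(4)[0]);
    }
}
```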
Estimated Algorithm Performance
- G: Grid latency
- I: Internal latency
- B: Amount of data tuples used by the Buffer Node
- W: Total amount of work for all CPUs
- C: Amount of CPUs doing non-redundant work
Estimated Algorithm Performance
[Chart: LH vs LHRP — estimated total time to compute one cycle as a function of compute time in nanoseconds per subcell]

Estimated Algorithm Performance

[Chart: LH vs LHRP — calculated total time as a function of Grid latency]
Process   LH      LHRP
0         35724   41856
1         7408    8228
2         7408    10932
3         7408    10928
4         7412    8528
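The table's absolute times can be turned into relative overheads directly. A quick sketch that just redoes the arithmetic (values copied from the table above) shows LHRP costing roughly 11-17% extra on most processes, with processes 2 and 3 paying around 47% more:

```java
public class LhrpOverhead {
    // Per-process times copied from the slide's LH vs LHRP table.
    static final int[] LH   = {35724, 7408, 7408, 7408, 7412};
    static final int[] LHRP = {41856, 8228, 10932, 10928, 8528};

    // Percentage increase of LHRP over plain Latency Hiding for process p.
    static double overheadPercent(int p) {
        return 100.0 * (LHRP[p] - LH[p]) / LH[p];
    }

    public static void main(String[] args) {
        for (int p = 0; p < LH.length; p++) {
            System.out.printf("process %d: +%.1f%%%n", p, overheadPercent(p));
        }
    }
}
```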
Experimental Result: Memory Footprint
21% increase in memory use over the conventional form of Latency Hiding. Causes:
- Extra matrix in the Buffer Node to store old column versions
- Extra far border buffers
Experimental Results: Performance
[Chart: LH vs LHRP — total compute time (sec) as a function of Grid latency (ms), showing average, min, and max for each]

Experimental Results: Performance

[Chart: LHRP vs LH compute time — total compute time as a function of compute time in ns per subcell, showing average, min, and max for each]
PGAFramework
- Objective
- Design Requirements
- Implementation technology choices
- API Design
- API Workpool Example
- Other API features
  - Synchronization option
  - Recursive option
PGAFramework
Objective: to create an efficient parallel application framework for the Grid that allows the user programmer easy interaction with Grid resources.

Design Requirements:
- Platform independence
- Self-deployment
- Easy-to-use interface
- Provide the following services without requiring extra effort on the part of the user programmer:
  - Load balancing
  - Scheduling
  - Fault tolerance
  - Latency/bandwidth tolerance
Design: PGAFramework

[Diagram: layered architecture. The user's applications sit on top of the API (interface); beneath it the PGAFramework provides Load Balancing, Scheduling, Fault Tolerance, and Latency/Bandwidth Tolerance services; these run over Globus and a job scheduler such as Condor, which manage the hardware resources.]
Deployment

[Diagram: a Job Submit Node running Scheduling and Resource Discovery services (possibly via GridWay) submits through Globus to heterogeneous resources — desktop PCs, cluster computer nodes, and a supercomputer — each managed by a local scheduler such as Condor, PBS, or SGE.]
Implementation
- Java: platform independence
- JXTA (JXSE): peer-to-peer API
  - Provides tools to work around NATs and firewalls
  - Provides library and module runtime loading
API
Motivation for API Design
- Video codecs follow an interface
- What happens inside the codec does not matter
- Only the input and output for the codec need to be specified

[Diagram: a video player (displays a GUI, loads a file, outputs video to the screen) feeds an MPEG-encoded stream through an interchangeable codec — mpeg, ogg, or h.264 — to produce raw video data.]
PGAFramework API
- There may be multiple "template" APIs
- Each API has interfaces that the user implements
- The user "inserts" his module into the framework
API
[Diagram: division of labor across the API. The user's module: gets data from the framework, computes on the data, returns the processed data, and optionally requests a sync; at the edges it gives data to the framework and gets data back to store or pipe. The framework: schedules processes on the resources, loads user data, creates the network, determines topology and network behavior, sends the user process to the compute nodes, gets data from the user class, sends it to the master node, and repeats the process in a loop until done.]
API Sample Code
public interface GridAppTemplate {
    public Data Compute(Data input);
    public Data DiffuseData(long segment);
    public void GatherData(long segment, Data dat);
    public long getDataCount();
}
API Sample Code

public class myModule implements GridAppTemplate {
    double x1, x2, y1, y2;
    double total;
    long random_samples;
    final long data_count = 100;

    public myModule(double x1_arg, double x2_arg,
                    double y1_arg, double y2_arg, long rad_smp) {
        x1 = x1_arg;
        x2 = x2_arg;
        y1 = y1_arg;
        y2 = y2_arg;
        total = 0;
        random_samples = rad_smp;
    }

    @Override
    public Data Compute(Data data) {
        // convert generic object to my object
        MyData dat = (MyData) data;
        MyOutputData output = new MyOutputData();
        output.inside = 0;
        // compute
        double dist = Math.sqrt(dat.random_x * dat.random_x
                + dat.random_y * dat.random_y);
        if (dist < 1.0) {
            output.inside = 1L;
        }
        return output;
    }

    @Override
    public Data DiffuseData(long segment) {
        MyData d = new MyData();
        d.random_x = Math.random() * (x2 - x1) + x1;
        d.random_y = Math.random() * (y2 - y1) + y1;
        return d;
    }

    @Override
    public void GatherData(long segment, Data dat) {
        MyOutputData data = (MyOutputData) dat;
        total += data.inside;
    }

    @Override
    public long getDataCount() {
        return random_samples;
    }

    public double getPi() {
        double pi = (total / random_samples) * 4;
        return pi;
    }
}
API
API Sample Code
public class UserApplication {
    public static void main(String[] args) throws UserModNotSetException {
        // Instantiate the custom module that follows
        // the GridAppTemplate interface.
        // The user can use the constructor, or some other way to get the
        // parameters set, such as file paths and options to tweak an algorithm.
        myModule mod = new myModule(0.0, 1.0, 0.0, 1.0, 10000);
        // submit the module to the network
        NetworkDeployer deployer = new NetworkDeployer(mod);
        // start the network
        deployer.startNetwork();
        double pi = mod.getPi();
        System.out.println("PI is: " + pi);
    }
}
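The module above is a Monte Carlo estimate of pi: points are drawn uniformly in the unit square and the fraction landing inside the unit circle approximates pi/4. Since the framework classes (Data, NetworkDeployer, etc.) are not available outside the PGAFramework, here is a self-contained sequential sketch of the same computation, with the roles of DiffuseData (draw a point), Compute (the hit test), and GatherData (accumulate hits) collapsed into one loop. The seed and sample count are arbitrary choices for the example:

```java
import java.util.Random;

public class SequentialPi {
    // Sequential Monte Carlo estimate of pi over the unit square.
    static double estimate(long samples, long seed) {
        Random rng = new Random(seed); // fixed seed for repeatability
        long inside = 0;
        for (long i = 0; i < samples; i++) {
            double x = rng.nextDouble();           // DiffuseData: one point
            double y = rng.nextDouble();
            if (Math.sqrt(x * x + y * y) < 1.0) {  // Compute: hit test
                inside++;                          // GatherData: accumulate
            }
        }
        return 4.0 * inside / samples;             // getPi
    }

    public static void main(String[] args) {
        System.out.println("PI is about: " + estimate(100000, 42));
    }
}
```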
Synchronization option
- RemoteEventHandler provides an interface to synchronize data
- Data is synced non-blocking
- The user creates blocking procedures if needed

public class myRemoteHandler implements RemoteEventHandler {
    mySyncData chunk;

    public myRemoteHandler(Data chunk) {
        this.chunk = (mySyncData) chunk;
    }

    @Override
    public void SyncDone(SyncData piece) {
        chunk.setPiece(piece);
    }
}
Recursive Feature
Allows multiple levels of parallelization (granularity).

[Diagram: a video task decomposed at successively finer granularities — Decode Video, Cut Raw Video Into Pictures, Blur Pictures, Blur Portion of Picture — mapped onto Pipeline, Work Pool, and Synchronous patterns.]
Related Work
- MPI implementations for the Grid: MPICH-G2, GridMPI, MPICH-V2 (MPICH-V1)
- Peer-to-peer parallel frameworks: P2PMPI (for cluster computing), P3 (for cluster computing)
- Self-deploying frameworks: Jojo
Conclusions
Parallel Applications on the Grid
Latency Hiding by Redundant Processing (LHRP)
PGAFramework
Related work