Computer Science Overview
Laxmikant (Sanjay) Kale
Computer Science Projects: Posters
Computer Science Projects: Talks
Jiao:
Migratable Objects and Charm++
Rocket Center Collaborations
It was clear that Charm++ would not be adopted by the whole application community
It was equally clear to us that it was a unique technology that would improve programmer productivity substantially
Led to the development of AMPI
Adaptive MPI
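With AMPI, an ordinary MPI program runs unchanged, but each MPI rank becomes a migratable user-level thread, a virtual processor (VP). A minimal sketch; the launch line at the end is illustrative and the exact flags may vary across Charm++/AMPI versions:

    // A plain MPI program; under AMPI each rank is a migratable virtual
    // processor, so it can be launched with more ranks than physical CPUs.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        std::printf("virtual rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }
    // Illustrative launch (flags may vary by version):
    //   charmrun ./pgm +p4 +vp16   -> 16 virtual ranks on 4 physical processors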
Processor Virtualization
Programmer: [over]decomposition into virtual processors
Runtime: assigns VPs to physical processors
Enables adaptive runtime strategies and benefits such as: software engineering (separate VPs for different modules), message-driven execution, automatic checkpointing, and automatic dynamic load balancing
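As a concrete illustration of over-decomposition, here is a minimal Charm++ sketch (hypothetical module name "overdecomp", not taken from any of the applications above): a 1D chare array with several times more elements than physical processors, which the runtime is then free to place and migrate.

    // overdecomp.ci -- Charm++ interface file
    mainmodule overdecomp {
      readonly CProxy_Main mainProxy;
      readonly int numElements;
      mainchare Main {
        entry Main(CkArgMsg* m);
        entry void done();
      };
      array [1D] Worker {
        entry Worker();
        entry void doWork();
      };
    };

    // overdecomp.C
    #include "overdecomp.decl.h"

    CProxy_Main mainProxy;
    int numElements;

    class Main : public CBase_Main {
      int finished;
    public:
      Main(CkArgMsg* m) {
        delete m;
        numElements = 8 * CkNumPes();     // over-decompose: ~8 VPs per processor
        mainProxy = thisProxy;
        finished = 0;
        CProxy_Worker workers = CProxy_Worker::ckNew(numElements);
        workers.doWork();                 // broadcast to all array elements
      }
      void done() { if (++finished == numElements) CkExit(); }
    };

    class Worker : public CBase_Worker {
    public:
      Worker() {}
      Worker(CkMigrateMessage*) {}        // allows the runtime to migrate this VP
      void doWork() {
        // ... compute on this object's piece of the problem ...
        mainProxy.done();
      }
    };

    #include "overdecomp.def.h"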
Highly agile dynamic load balancing
Needed, for example, to handle the advent of plasticity around a crack
A simple example: plasticity in a bar
Optimizing all-to-all via Mesh
Phase 1: each processor sends its data along its own row, combined into one message per destination column
Phase 2: the intermediate processors forward the data along their columns to the final destinations
A message from (x1,y1) to (x2,y2) thus travels via (x1,y2)
Each processor sends 2(√P − 1) messages instead of P − 1 (see the sketch below)
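A small self-contained sketch (illustrative, not the Charm++ library code) that checks the routing rule above for P = k × k processors and shows the message count:

    // Simulate the two-phase mesh all-to-all for P = k*k processors and verify
    // that every item reaches its destination via the (x1,y2) intermediate.
    #include <cassert>
    #include <cstdio>

    int main() {
        const int k = 4, P = k * k;                  // P processors on a k x k mesh
        auto row = [&](int p) { return p / k; };
        auto col = [&](int p) { return p % k; };
        auto id  = [&](int r, int c) { return r * k + c; };

        for (int src = 0; src < P; ++src)
            for (int dst = 0; dst < P; ++dst) {
                // Phase 1: (x1,y1) -> (x1,y2): travel along the source's row.
                int via = id(row(src), col(dst));
                // Phase 2: (x1,y2) -> (x2,y2): travel along the destination's column.
                int finalHop = id(row(dst), col(via));
                assert(finalHop == dst);             // every item arrives correctly
            }

        // Phase 1 needs one combined message per other column, phase 2 one per
        // other row: 2*(k-1) messages total, versus P-1 for a direct all-to-all.
        std::printf("messages per processor: 2*(k-1) = %d vs P-1 = %d\n",
                    2 * (k - 1), P - 1);
        return 0;
    }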
Optimized All-to-all “Surprise”
76-byte all-to-all on Lemieux
Completion time vs. computation overhead
Led to the development of asynchronous collectives, now supported in AMPI
The CPU is free during most of the time taken by a collective operation
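The remedy is to overlap that idle time with useful computation. A minimal sketch of the idea, written against the standard MPI-3 nonblocking collective interface (MPI_Ialltoall); AMPI's asynchronous collectives predate MPI-3, so the call names in AMPI itself may differ:

    // Overlapping computation with an all-to-all using a nonblocking collective.
    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        const int n = 19;                          // 19 ints per destination (~76 bytes)
        std::vector<int> sendbuf(n * p, rank), recvbuf(n * p);

        MPI_Request req;
        MPI_Ialltoall(sendbuf.data(), n, MPI_INT,
                      recvbuf.data(), n, MPI_INT, MPI_COMM_WORLD, &req);

        // ... computation that does not depend on recvbuf goes here,
        //     keeping the CPU busy while the collective progresses ...

        MPI_Wait(&req, MPI_STATUS_IGNORE);         // recvbuf is ready after this
        MPI_Finalize();
        return 0;
    }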
[Chart: all-to-all completion time on Lemieux vs. number of processors (16–2048), comparing native MPI with the mesh and hypercube strategies]
Latency Tolerance: Multi-Cluster Jobs
Job co-scheduled to run across two clusters to provide access to large numbers of processors
But cross-cluster latencies are large!
Virtualization within Charm++ masks high inter-cluster latency by allowing overlap of communication with computation
Hypothetical Timeline of a Multi-Cluster Computation
Processors A and B are on one cluster, Processor C on a second cluster
Communication between clusters via high-latency WAN
Processor Virtualization allows latency to be masked
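A deliberately tiny illustration of the underlying idea (schematic only, not Charm++ internals): with several virtual processors per physical processor, a message-driven scheduler runs whichever VP has data available, so one VP's slow cross-cluster message is hidden behind the others' work.

    // Toy message-driven scheduler: messages are processed as they become
    // available, so the processor is not idle while the WAN message is in flight.
    #include <algorithm>
    #include <cstdio>
    #include <queue>
    #include <string>
    #include <vector>

    struct Message {
        double arrival;      // time the message becomes available locally
        int    vp;           // target virtual processor on this physical processor
        std::string what;
    };
    struct Later {
        bool operator()(const Message& a, const Message& b) const {
            return a.arrival > b.arrival;
        }
    };

    int main() {
        // Two intra-cluster messages and one high-latency cross-cluster message.
        std::priority_queue<Message, std::vector<Message>, Later> pending;
        pending.push({0.1, 1, "halo from a VP on the same cluster"});
        pending.push({0.2, 2, "halo from a VP on the same cluster"});
        pending.push({3.5, 0, "halo from a VP on the remote cluster (WAN)"});

        double now = 0.0;
        while (!pending.empty()) {             // run whichever VP has data ready
            Message m = pending.top(); pending.pop();
            now = std::max(now, m.arrival);
            std::printf("t=%.1f  VP %d runs: %s\n", now, m.vp, m.what.c_str());
            // ... this VP's computation would execute here ...
        }
        return 0;
    }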
Multi-cluster Experiments
Experimental environment
Artificial latency environment: VMI “delay device” adds a pre-defined latency between arbitrary pairs of nodes
TeraGrid environment: Experiments run between NCSA and ANL machines (~1.725 ms one-way latency)
Experiments
Five-point stencil (2D Jacobi) for matrix sizes 2048x2048 and 8192x8192 (see the kernel sketch after this list)
LeanMD molecular dynamics code running a 30,652 atom system
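For reference, the core of the 2D Jacobi benchmark is the five-point stencil sweep below; this serial sketch omits the parallel details (in the benchmark the grid is partitioned across virtual processors, which exchange boundary rows and columns each iteration):

    // One Jacobi sweep with the five-point stencil on an n x n grid
    // stored row-major; boundary values are held fixed.
    #include <vector>

    void jacobi_sweep(const std::vector<double>& a, std::vector<double>& b, int n) {
        for (int i = 1; i < n - 1; ++i)
            for (int j = 1; j < n - 1; ++j)
                b[i * n + j] = 0.25 * (a[(i - 1) * n + j] + a[(i + 1) * n + j] +
                                       a[i * n + j - 1]  + a[i * n + j + 1]);
    }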
Five-Point Stencil Results (P=64)
Fault Tolerance
Migrate objects to disk!
Now available in distribution version of AMPI and Charm++
New work
In-memory checkpointing
In-memory Double Checkpoint
Double checkpointing: each checkpoint is stored in two places
Local physical processor
Remote “buddy” processor
Use local disks!
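A conceptual sketch of the buddy scheme (illustrative only, not the Charm++/AMPI implementation; the (p+1) mod P buddy pairing is an assumption for the example): each processor keeps one copy of its checkpoint locally and places a second copy on a buddy, so any single failure leaves at least one copy intact.

    // Conceptual double-checkpoint sketch: copy 1 stays local, copy 2 goes
    // to a buddy processor; recovery after a failure reads the buddy's copy.
    #include <map>
    #include <vector>

    using Checkpoint = std::vector<char>;                 // serialized object state

    struct Processor {
        Checkpoint localCopy;                             // copy 1: local memory/disk
        std::map<int, Checkpoint> buddyCopies;            // copies held for others
    };

    int buddyOf(int p, int numProcs) { return (p + 1) % numProcs; }  // assumed pairing

    void checkpoint(std::vector<Processor>& procs,
                    const std::vector<Checkpoint>& state) {
        int P = (int)procs.size();
        for (int p = 0; p < P; ++p) {
            procs[p].localCopy = state[p];                        // copy 1
            procs[buddyOf(p, P)].buddyCopies[p] = state[p];       // copy 2, on buddy
        }
    }

    // On failure of processor f, its state comes from buddyOf(f)'s copy;
    // surviving processors restart from their own local copies.
    Checkpoint recover(const std::vector<Processor>& procs, int f) {
        return procs[buddyOf(f, (int)procs.size())].buddyCopies.at(f);
    }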
Scalable Fault Tolerance
Motivation:
When one processor out of 100,000 fails, the other 99,999 shouldn't have to roll back to their checkpoints!
How?
Latency tolerance mitigates costs
Restart can be sped up by spreading the failed processor's objects across other processors
Long term project
General purpose implementation in progress
Only the failed processor's objects recover from their checkpoints, while the others “continue”
Parallel Objects, Adaptive Runtime System, Libraries and Tools
The enabling CS technology of parallel objects and intelligent runtime systems has led to several collaborative applications in CSE:
Molecular Dynamics
Crack Propagation
Space-time meshes
Computational Cosmology
Rocket Simulation
Protein Folding
Dendritic Growth
Next…