Upload
phungthuan
View
217
Download
2
Embed Size (px)
Citation preview
Have Your Cake In Parallel And Eat It Sequen6ally Too! Seman&cally Sequen&al, Parallel Execu&on of Mul&processor Programs
Gagan Gupta
Mul&processors are ubiquitous, but programming them con&nues to be challenging Our Goal: Simplify mul&processor programming without compromising performance
Benefits: • Simplified programming; Simplified system design; BeCer reliability • Performance at par or beCer (5% to 288%) than conven&onal methods
Summary
Conven6onal Wisdom Our Approach
Order in programs obstructs parallelism Order can help to expose parallelism!
Use non-‐determinis&c programs, or make dataflow in programs explicit
Use ordered programs; maintain precise program-‐order execu&on seman6cs
Programmer should expose parallelism Use run-‐6me dataflow and specula6ve techniques to expose parallelism
Sequencer
• Unrolls dynamic instances of tasks
• Computes data set dynamically (user assisted)
1 for (i=0; i<n; i++) { 2 call F (wr, rd) 3 }
Fn Write Set Read Set
F1: {B, C} {A} F2: {D} {A} F3: {?} {?} F4: {B} {D} F5: {B} {D} F6: {G} {H}
Example code and dynamic task instances
Dataflow Engine
Precise-‐restart Engine
• Tracks tasks and their order in a Reorder List • Checkpoints mod set in History Buffer • Re&res task in (total) program order
Order-‐aware Load
-‐balan
cing Task
Sche
duler
F2
t1 t2 t3 t4 t5 t6
Execute Declares dataset Execute Re-execute out-of-order => misspeculated out-of-order speculatively => rolled back non-speculatively using History Buffer
Example speculative dataflow execution on 3 processors
F1
F2
F6
F2
F3
F4 F5
F3
Time
P1
P2
P3 F3 F3
Mul&p
rocessor
Pro
gram
F1 F2 F3 F4 F5 F6
t1 t5
F1 F2
F3
F6 F5
F4 Time
?? F3
ParaKram speedup (harmonic mean) is 20% higher than non-
deterministic Pthreads (excludes Cholesky)
0
1
2
3
4
5
6
7
8
9
8x Core i7-965
16x Opteron 8350
32x Opteron 8356
Spee
dup
Harmonic Mean of Achieved Speedups
Non-deterministic
ParaKram
ParaKram speedup is 288% higher than non-
deterministic OpenMP, 75% over Cilk
Applications: Barneshut Blackscholes Pbzip2 Dedup Histogram RevereseIndex Swaptions Mergesort RE WordCount ConjugateGradient
0
1
2
3
4
5
6
7
Genome Mergesort Labyrinth
Spee
dup
Speculative Execution (8x Intel Xeon)
ParaKram
TL2
Cilk
ParaKram scales with system size;
Non-deterministic method does not scale
ParaKram speedup is up to 77% higher than non-
deterministic Cilk and TL2 STM
0
2
4
6
8
10
12
14
1 2 4 8 12 16 20 24
Spee
dup
# Processors
Cholesky Decomposition (24x Intel Xeon)
ParaKram Speculative
ParaKram Non-speculative
OpenMP
Cilk
0
10
20
30
40
50
60
70
1 2 4 8 12 16 20 24
Spee
dup
# Processors
Tolerating Exceptions
Conventional Checkpoint-and-Recovery
ParaKram
Exploi6ng Parallelism Conven6onal
Develop Parallel Algorithm
Schedule Execu&on of Tasks
Ensure Independence between Parallel Tasks
Program text ≠> Order
F1
F2
F3
F4 F5
• Execu&on has to respect programmer-‐exposed parallelism
• Cannot schedule F5 in t2
F5
F1
F2
F3
F4 F5
Programmer’s Role
Reason About Inter-‐task Data Access Conflicts
Our Approach
Develop Parallel Algorithm
Program text => Order
Reason About Task-‐local Data Accesses
Conven6onal Our Approach
• If dependences are known, distant parallelism can be exposed
• Can schedule F5 in t2
Task Dependence Graph of Cholesky Decomposi&on
• Run-‐&me parallel execu&on manager (C++ library)
• Performs out-‐of-‐order superscalar processor-‐like execu&on on mul&processors
ParaKram
F1
F1
F1, F6, F2
F1
F2
F1 F2 F3 F4 F5 F6
Epoch Reorder List Entries Completed Retired
t1
F2 F3 F4 F5 F6 t2
F2 F3 F4 F5 F6 t3
F3 F4 F5 F6 t4
Precisely restarting misspeculated task (F3 from above)
t1
t2
t3
t4
t1
t2
t3
t4
Uncovers parallelism past blocked tasks in the program • Constructs dynamic data dependence graph using write and read sets
• Executes tasks out-‐of-‐order • If task dependences/order are unknown, speculates tasks are independent
• Detects and rec&fies misspecula&on
Multiprocessor program
Task Scheduler
Dataflow Engine
Precise-restart Engine
Sequencer
Multiprocessor System
ParaKram