Toward Efficient and Robust Software Speculative Parallelization on Multiprocessors

Marcelo Cintra and Diego R. Llanos
University of Edinburgh, http://www.inf.ed.ac.uk/home/mc
Universidad de Valladolid, http://www.infor.uva.es/~diego
Symp. on Principles and Practice of Parallel Programming - June 2003
Speculative parallelization on SMP

for(i=0; i<100; i++) {
  ... = A[L[i]];
  A[K[i]] = ...;
}

Assume there are no dependences and execute iterations in parallel:

  Iteration J:    ... = A[4];   A[5] = ...
  Iteration J+1:  ... = A[2];   A[2] = ...
  Iteration J+2:  ... = A[5];   A[6] = ...

Accesses to shared data must be tracked at run time. Here iteration J+2 reads A[5] before iteration J writes it: a RAW dependence violation. If a violation is detected, the offending threads are squashed.
Hardware vs. Software schemes

Hardware schemes:
+ High performance
– Changes to processor, caches, and coherence controller

Software schemes:
+ No hardware changes
– Poorer performance: software management overhead, suboptimal scheduling, and contention due to the need for synchronization
Wish List

- To reduce software overhead: use efficient speculative data structures and optimized operations
- To achieve efficient scheduling: minimize memory overhead while maximizing tolerance to load imbalance and violations
- To reduce contention: avoid synchronization as much as possible
- To avoid performance degradation: a squash contention mechanism
Outline

- Motivation
- Our software-only scheme
- Evaluation
- Related Work
- Conclusions
Speculative Access Structures

Versions of the shared data structure are used: each speculative thread (e.g., Thread A running iteration J, Thread B running iteration J+1) keeps its own version copy of the user array.

A speculative access structure holds the state (NA, M, EL, ELM) of each version of the elements.

[Figure: shared user array A[0..n] and the per-thread version copies; the access structure entries are mostly NA, with some entries marked EL or M where elements were accessed.]
Speculative Access Structure I: Simple Array

An array of access states directly mapped onto a shadow copy of the user data array.

[Figure: speculative access structure (NA EL M NA NA NA EL NA NA) alongside the corresponding version copy.]

States: NA = not accessed; EL = exposed loaded; M = modified; ELM = exposed loaded and modified.
Speculative Access Structure I: Simple Array

Cheap to look up on speculative memory operations: a load such as ... = A[2] indexes the access array directly and sets the entry to EL.

Expensive to search on commits: the entire access array must be scanned.
Speculative Access Structure II: Indirection Array

An array of indices that indicates which elements of the shadow data array were touched (e.g., an indirection array holding 1, 6, 4).
Speculative Access Structure II: Indirection Array

Cheap to look up on speculative memory operations: a load such as ... = A[2] sets the access-array entry to EL and appends index 2 to the indirection array.

Cheap to search on commits: only the entries recorded in the indirection array are scanned.
Scheduling Threads

Static: assign a chunk of N/P iterations to each processor.
+ Only P active threads: little memory overhead
– Poor tolerance to load imbalance and dependence violations

Dynamic: dynamically assign each of the N iterations.
– N active threads: bigger memory structures
+ Better tolerance to load imbalance and dependence violations

Our solution: a software version of an aggressive sliding-window mechanism [Cintra, Martinez, and Torrellas; ISCA 2000].
Sliding Window

Schedule a window of W iterations at a time, with dynamic assignment of iterations inside the window.

[Figure: N = 8 iterations and a sliding window of W iterations; two threads dynamically pick up iterations 1, 2, 3, ... inside the window over time.]

When the non-speculative thread finishes, the window is advanced (iteration 4 becomes assignable).

Tradeoff between load balancing and the size of the version structures.
Memory Operations

Load operation (... = A[K[i]]):
L1: Update the state of the element to EL
L2: Scan the access array backwards for a version
L3: Obtain the most up-to-date version

Store operation (A[K[i]] = ...):
S1: Perform the store of the new version
S2: Update the state of the element to M or ELM
S3: Scan the access array forwards for violations

Correctness is guaranteed if these steps are globally performed in program order. But program order may not be respected: the compiler may reorder operations, and relaxed memory consistency models allow the hardware to do the same.
Race Conditions

Certain interleavings of operations may lead to incorrect execution. Consider a store in iteration J (steps S1-S3) racing with a load of the same element in iteration J+K (steps L1-L3), when reordering breaks program order:

- If S2 (the state update to M or ELM) is performed before S1 (the store of the value itself), the load's L2 scan can select iteration J's version and L3 can read it before the new value is in place: the load obtains an incorrect value.
- If S3 (the forward scan for violations) is performed before the load's L1 (the state update to EL), the exposed load is not yet visible to the scan: the violation is not detected.
Conservative Solution

Enclose the operations in a critical section:

Load operation:
# lock A
L1: Update the state of the element to EL
L2: Scan the access array backwards for a version
L3: Obtain the most up-to-date version
# unlock A

Store operation:
# lock A
S1: Perform the store of the new version
S2: Update the state of the element to M or ELM
S3: Scan the access array forwards for violations
# unlock A

Drawback: contention.
Our Solution: Memory Fences

Load operation:
L1: Update the state of the element to EL
# memory fence
L2: Scan the access array backwards for a version
L3: Obtain the most up-to-date version

Store operation:
S1: Perform the store of the new version
# memory fence
S2: Update the state of the element to M or ELM
# memory fence
S3: Scan the access array forwards for violations

All pending memory operations must complete before execution passes a memory fence. This is the minimum set of memory fences needed. Critical sections are still necessary to protect the structures on thread starts, commits, and squashes.
Outline

- Motivation
- Our software-only scheme
- Evaluation
- Related Work
- Conclusions
Evaluation Environment

Experiments executed on a real machine: a Sun Fire 6800 SMP with 24 UltraSPARC III processors, using OpenMP 2.0.

Applications with non-analyzable loops studied: TREE, WUPWISE, and MDG (no dependences); LUCAS and AP3M (dependences).
Speedups of Loops: TREE

Very close to the "ideal" DOALL speedup.
Speedups of Loops: WUPWISE

Not so close to the "ideal" DOALL speedup: huge speculative data size.
Importance of Indirection Array
Cost of Violation Checks

Systems evaluated:
- Baseline: our scheme, with violation checks upon stores
- sys2: same as Baseline, but with violation checks upon commits
Cost of Violation Checks
May outperform checks atcommit on sparse accesses
Checks upon loads and storesare not too expensive
Effects of Scheduling Schemes

Systems evaluated:
- Baseline: sliding window moved when the non-speculative thread finishes
- sys3: sliding window moved when all threads finish (solution adopted by Dang et al. [IPDPS 2002])
- sys4: dynamic scheduling, no partial commits (solution adopted by Rundberg et al. [WSSMM 2000])
Effects of Scheduling Schemes (P = 4 processors)

A fully dynamic schedule is not always feasible. Best performance is obtained for W = 2*P to 4*P.
Wish List Revisited

- To reduce software overhead: access and indirection arrays; early violation detection (on stores instead of during commits)
- To achieve efficient scheduling: aggressive sliding-window mechanism
- To reduce contention: memory fences instead of critical sections
- To avoid performance degradation: squash monitor with feedback
Outline

- Motivation
- Our software-only scheme
- Evaluation
- Related Work
- Conclusions
Software-only Speculative Parallelization Schemes

- SW-R-LRPD at Texas A&M University (IPDPS 2002): less aggressive window (moved when all threads finish); violation checks when threads commit
- Chalmers University (WSSMM 2000): dynamic scheme; violation checks upon stores
- IBM Research (SC 1998): series of tests for various specific behaviors
- TLDS at Carnegie Mellon University (tech. rep. 2001): speculation in a software DSM engine
Outline

- Motivation
- Our software-only scheme
- Evaluation
- Related Work
- Conclusions
Conclusions

Systematic consideration of the design space and of cost/performance issues.

A new, efficient, and robust software-only speculative parallelization scheme:
– Fine-tuned data structures
– Aggressive sliding window
– Reduced synchronization requirements
– Overhead monitors and feedback

Very good performance:
– 7% to 25% faster than previous schemes
– 71% of the speedup of hand-made, manual parallelization
Data Structures Implementation

[Figure: user array (elements 0..n), the per-thread access structures holding each element's state (mostly NA, with some EL and M entries), and the corresponding version copies.]
Squashing Threads

Violations are detected by looking up the speculative access structure.

On every store:
+ Only the element being accessed is checked
+ Earlier violation detection
± Frequent checks; some form of synchronization is needed

At commit:
– All elements must be checked
+ Faster speculative memory operations
Squash Contention Mechanism

Goal: to avoid performance degradation in the presence of dependences.

Implemented with commit and squash monitors: once squashes exceed a given threshold, subsequent invocations of the same loop are executed sequentially.
Importance of Squash Monitors
Application Characteristics

Application | Loops                       | % of Seq. Time | Spec data size (KB)
TREE        | accel_10                    | 94             | < 1
MDG         | interf_1000                 | 86             | < 1
WUPWISE     | muldeo_200' / muldoe_200'   | 41             | 12,000
AP3M        | Shgravll_700                | 78             | 3,000
LUCAS       | mers_mod_square (line 444)  | 20             | 4,000
Speedups of Loops: MDG

Very close to the "ideal" DOALL speedup.
Overall Speedups: TREE
Overall Speedups: WUPWISE
Overall Speedups: MDG
Constrained Memory Overheads

Mixed results: either Baseline or sys4 performs best.
Related Work

Hardware-based speculative parallelization schemes:
– I-ACOMA at the University of Illinois
– Hydra at Stanford
– Multiplex at Purdue
– Multiscalar at Wisconsin
– Clustered Speculative Multithreading at UPC
– TLDS at Carnegie Mellon

Inspector-executor schemes:
– Leung and Zahorjan (PPoPP 1993)
– Saltz, Mirchandaney, and Crowley (IEEE ToC 1991)
Related Work

Optimistic concurrency control schemes:
– E.g., Herlihy (ACM TODS 1990); Kung and Robinson (ACM TODS 1981)
– Only need to enforce that accesses to objects in critical sections do not overlap: no total order required
– Applied to explicitly parallel applications