Toward Efficient and Robust Software Speculative Parallelization on Multiprocessors

Marcelo Cintra and Diego R. Llanos
University of Edinburgh, http://www.inf.ed.ac.uk/home/mc
Universidad de Valladolid, http://www.infor.uva.es/~diego
Symp. on Principles and Practice of Parallel Programming - June 2003
Speculative parallelization on SMP

for(i=0; i<100; i++) {
  ... = A[L[i]];
  A[K[i]] = ...;
}

Assume there are no dependences and execute iterations in parallel:

  Iteration J:    ... = A[4];   A[5] = ...
  Iteration J+1:  ... = A[2];   A[2] = ...
  Iteration J+2:  ... = A[5];   A[6] = ...

Accesses to shared data must be tracked at run time. Here iteration J+2 reads A[5] before iteration J writes it: a RAW dependence violation. If a violation is detected, the offending threads are squashed.
Hardware vs. Software schemes

Hardware schemes:
+ High performance
– Changes to processor, caches, and coherence controller

Software schemes:
+ No hardware changes
– Poorer performance: software management overhead, suboptimal scheduling, and contention due to the need for synchronization
Wish List

- To reduce software overhead: use efficient speculative data structures and optimized operations
- To achieve efficient scheduling: minimize memory overhead while maximizing tolerance to load imbalance and violations
- To reduce contention: avoid synchronization as much as possible
- To avoid performance degradation: a squash contention mechanism
Outline

- Motivation
- Our software-only scheme
- Evaluation
- Related Work
- Conclusions
Speculative Access Structures

Versions of the shared data structure are used: each speculative thread (e.g., Thread A running iteration J, Thread B running iteration J+1) keeps its own version copy of the user array.

A speculative access structure holds the state (NA, M, EL, ELM) of each version of the elements.

[Figure: shared user array A[0..n] and the per-thread version copies; the access structure entries are mostly NA, with some entries marked EL or M where elements were accessed.]
Speculative Access Structure I: Simple Array

An array of access states directly mapped onto a shadow copy of the user data array.

[Figure: speculative access structure (NA EL M NA NA NA EL NA NA) alongside the corresponding version copy.]

States: NA = not accessed; EL = exposed loaded; M = modified; ELM = exposed loaded and modified.
Speculative Access Structure I: Simple Array

Cheap to look up on speculative memory operations: a load such as ... = A[2] indexes the access array directly and sets the entry to EL.

Expensive to search on commits: the entire access array must be scanned.
Speculative Access Structure II: Indirection Array

An array of indices that indicates which elements of the shadow data array were touched (e.g., an indirection array holding 1, 6, 4).
Speculative Access Structure II: Indirection Array

Cheap to look up on speculative memory operations: a load such as ... = A[2] sets the access-array entry to EL and appends index 2 to the indirection array.

Cheap to search on commits: only the entries recorded in the indirection array are scanned.
Scheduling Threads

Static: assign a chunk of N/P iterations to each processor.
+ Only P active threads: little memory overhead
– Poor tolerance to load imbalance and dependence violations

Dynamic: dynamically assign each of the N iterations.
– N active threads: bigger memory structures
+ Better tolerance to load imbalance and dependence violations

Our solution: a software version of an aggressive sliding-window mechanism [Cintra, Martinez, and Torrellas; ISCA 2000].
Sliding Window

Schedule a window of W iterations at a time, with dynamic assignment of iterations inside the window.

[Figure: N = 8 iterations and a sliding window of W iterations; two threads dynamically pick up iterations 1, 2, 3, ... inside the window over time.]

When the non-speculative thread finishes, the window is advanced (iteration 4 becomes assignable).

Tradeoff between load balancing and the size of the version structures.
Memory Operations

Load operation (... = A[K[i]]):
L1: Update the state of the element to EL
L2: Scan the access array backwards for a version
L3: Obtain the most up-to-date version

Store operation (A[K[i]] = ...):
S1: Perform the store of the new version
S2: Update the state of the element to M or ELM
S3: Scan the access array forwards for violations

Correctness is guaranteed if these steps are globally performed in program order. But program order may not be respected: the compiler may reorder operations, and relaxed memory consistency models allow the hardware to do the same.
Race Conditions

Certain interleavings of operations may lead to incorrect execution. Consider a store in iteration J (steps S1-S3) racing with a load of the same element in iteration J+K (steps L1-L3), when reordering breaks program order:

- If S2 (the state update to M or ELM) is performed before S1 (the store of the value itself), the load's L2 scan can select iteration J's version and L3 can read it before the new value is in place: the load obtains an incorrect value.
- If S3 (the forward scan for violations) is performed before the load's L1 (the state update to EL), the exposed load is not yet visible to the scan: the violation is not detected.
Conservative Solution

Enclose the operations in a critical section:

Load operation:
# lock A
L1: Update the state of the element to EL
L2: Scan the access array backwards for a version
L3: Obtain the most up-to-date version
# unlock A

Store operation:
# lock A
S1: Perform the store of the new version
S2: Update the state of the element to M or ELM
S3: Scan the access array forwards for violations
# unlock A

Drawback: contention.
Our Solution: Memory Fences

Load operation:
L1: Update the state of the element to EL
# memory fence
L2: Scan the access array backwards for a version
L3: Obtain the most up-to-date version

Store operation:
S1: Perform the store of the new version
# memory fence
S2: Update the state of the element to M or ELM
# memory fence
S3: Scan the access array forwards for violations

All pending memory operations must complete before execution passes a memory fence. This is the minimum set of memory fences needed. Critical sections are still necessary to protect the structures on thread starts, commits, and squashes.
Outline

- Motivation
- Our software-only scheme
- Evaluation
- Related Work
- Conclusions
Evaluation Environment

Experiments executed on a real machine: a Sun Fire 6800 SMP with 24 UltraSPARC III processors, using OpenMP 2.0.

Applications with non-analyzable loops studied: TREE, WUPWISE, and MDG (no dependences); LUCAS and AP3M (dependences).
Speedups of Loops: TREE

Very close to the "ideal" DOALL speedup.
Speedups of Loops: WUPWISE

Not so close to the "ideal" DOALL speedup: huge speculative data size.
Importance of Indirection Array
Cost of Violation Checks

Systems evaluated:
- Baseline: our scheme, with violation checks upon stores
- sys2: same as Baseline, but with violation checks upon commits
Cost of Violation Checks
May outperform checks atcommit on sparse accesses
Checks upon loads and storesare not too expensive
Effects of Scheduling Schemes

Systems evaluated:
- Baseline: sliding window moved when the non-speculative thread finishes
- sys3: sliding window moved when all threads finish (solution adopted by Dang et al. [IPDPS 2002])
- sys4: dynamic scheduling, no partial commits (solution adopted by Rundberg et al. [WSSMM 2000])
Effects of Scheduling Schemes (P = 4 processors)

A fully dynamic schedule is not always feasible. Best performance is obtained for W = 2*P to 4*P.
Wish List Revisited

- To reduce software overhead: access and indirection arrays; early violation detection (on stores instead of during commits)
- To achieve efficient scheduling: aggressive sliding-window mechanism
- To reduce contention: memory fences instead of critical sections
- To avoid performance degradation: squash monitor with feedback
Outline

- Motivation
- Our software-only scheme
- Evaluation
- Related Work
- Conclusions
Software-only Speculative Parallelization Schemes

- SW-R-LRPD at Texas A&M University (IPDPS 2002): less aggressive window (moved when all threads finish); violation checks when threads commit
- Chalmers University (WSSMM 2000): dynamic scheme; violation checks upon stores
- IBM Research (SC 1998): series of tests for various specific behaviors
- TLDS at Carnegie Mellon University (tech. rep. 2001): speculation in a software DSM engine
Outline

- Motivation
- Our software-only scheme
- Evaluation
- Related Work
- Conclusions
Conclusions

Systematic consideration of the design space and of cost/performance issues.

A new, efficient, and robust software-only speculative parallelization scheme:
– Fine-tuned data structures
– Aggressive sliding window
– Reduced synchronization requirements
– Overhead monitors and feedback

Very good performance:
– 7% to 25% faster than previous schemes
– 71% of the speedup of hand-made, manual parallelization
Data Structures Implementation

[Figure: user array (elements 0..n), the per-thread access structures holding each element's state (mostly NA, with some EL and M entries), and the corresponding version copies.]
Squashing Threads

Violations are detected by looking up the speculative access structure.

On every store:
+ Only the element being accessed is checked
+ Earlier violation detection
± Frequent checks; some form of synchronization is needed

At commit:
– All elements must be checked
+ Faster speculative memory operations
Squash Contention Mechanism

Goal: to avoid performance degradation in the presence of dependences.

Implemented with commit and squash monitors: once squashes exceed a given threshold, subsequent invocations of the same loop are executed sequentially.
Importance of Squash Monitors
Application Characteristics

Application | Loops                       | % of Seq. Time | Spec data size (KB)
TREE        | accel_10                    | 94             | < 1
MDG         | interf_1000                 | 86             | < 1
WUPWISE     | muldeo_200' / muldoe_200'   | 41             | 12,000
AP3M        | Shgravll_700                | 78             | 3,000
LUCAS       | mers_mod_square (line 444)  | 20             | 4,000
Speedups of Loops: MDG

Very close to the "ideal" DOALL speedup.
Overall Speedups: TREE
Overall Speedups: WUPWISE
Overall Speedups: MDG
Constrained Memory Overheads

Mixed results: either Baseline or sys4 performs best.
Related Work

Hardware-based speculative parallelization schemes:
– I-ACOMA at the University of Illinois
– Hydra at Stanford
– Multiplex at Purdue
– Multiscalar at Wisconsin
– Clustered Speculative Multithreading at UPC
– TLDS at Carnegie Mellon

Inspector-executor schemes:
– Leung and Zahorjan (PPoPP 1993)
– Saltz, Mirchandaney, and Crowley (IEEE ToC 1991)
Related Work

Optimistic concurrency control schemes:
– E.g., Herlihy (ACM TODS 1990); Kung and Robinson (ACM TODS 1981)
– Only need to enforce that accesses to objects in critical sections do not overlap: no total order required
– Applied to explicitly parallel applications