Toward Efficient and Robust Software Speculative Parallelization on Multiprocessors

Marcelo Cintra and Diego R. Llanos

University of Edinburgh
http://www.inf.ed.ac.uk/home/mc

Universidad de Valladolid
http://www.infor.uva.es/~diego

Symp. on Principles and Practice of Parallel Programming - June 2003

Speculative parallelization on SMP

for(i=0; i<100; i++) {
  ... = A[L[i]];
  A[K[i]] = ...
}

Assume no dependences and execute iterations in parallel

Iteration J:    ... = A[4];    A[5] = ...
Iteration J+1:  ... = A[2];    A[2] = ...
Iteration J+2:  ... = A[5];    A[6] = ...

RAW: iteration J writes A[5], which iteration J+2 reads

Access to shared data should be tracked at runtime

If a violation is detected, offending threads are squashed
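The cross-iteration dependence in the example can be checked offline with a short sketch. This is only an illustration: the function name `find_raw_pairs` is hypothetical, it is not the runtime tracking mechanism of the talk, and it ignores intervening writes for brevity.

```python
def find_raw_pairs(reads, writes):
    """List (writer, reader, element) triples in which a later iteration
    reads an element that an earlier iteration writes."""
    pairs = []
    for j, written in enumerate(writes):
        for k in range(j + 1, len(reads)):
            if reads[k] == written:
                pairs.append((j, k, written))
    return pairs

# Index values from the slide, with iteration J mapped to position 0.
L = [4, 2, 5]   # J reads A[4], J+1 reads A[2], J+2 reads A[5]
K = [5, 2, 6]   # J writes A[5], J+1 writes A[2], J+2 writes A[6]
raw = find_raw_pairs(L, K)
print(raw)      # [(0, 2, 5)]
```

Only the pair (J, J+2) on A[5] is reported. Iteration J+1 reads and writes A[2] within the same iteration, so it creates no cross-iteration dependence.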

Hardware vs. Software schemes

Hardware schemes
+ High performance
– Changes to processor, caches, and coherence controller

Software schemes
+ No hardware changes
– Poorer performance:
  Software management overhead
  Suboptimal scheduling
  Contention due to the need for synchronization

Wish List

To reduce software overhead: use efficient speculative data structures and optimized operations

To have efficient scheduling: minimize memory overhead while maximizing tolerance to load imbalance and violations

To reduce contention: avoid synchronization as much as possible

To avoid performance degradation: squash contention mechanism

Outline

Motivation
Our software-only scheme
Evaluation
Related Work
Conclusions

Speculative Access Structures

Use of versions of the shared data structure

A speculative access structure holds the state (na, m, el, elm) of each version of elements

[Figure: the shared structure A[0] ... A[n] and one version copy of it per thread (Thread A running iteration J, Thread B running iteration J+1), with an access state attached to each version element]

Speculative Access Structure I: Simple Array

Array of access states directly mapped to shadow copy of the user data array

NA: not accessed
EL: exposed loaded
M: modified
ELM: exposed loaded and modified

[Figure: speculative access structure holding one state per element, mapped one-to-one onto the version copy]

Speculative Access Structure I: Simple Array

Cheap to look up on speculative memory operations

[Figure: a load ... = A[2] sets the state of the corresponding entry of the access array to EL]

Expensive to search on commits

[Figure: a commit scans the whole access array to find the touched elements of the version copy and user array]

Speculative Access Structure II: Indirection Array

Array of indices that indicate the elements of the shadow data array that were touched

[Figure: indirection array (entries 1, 6, 4) pointing into the access array, which maps onto the user array]

Speculative Access Structure II: Indirection Array

Cheap to look up on speculative memory operations

[Figure: a load ... = A[2] sets the access array entry to EL and appends index 2 to the indirection array]

Cheap to search on commits

[Figure: a commit scans only the entries listed in the indirection array]
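The two structures can be sketched together as follows. This is a minimal illustration under stated assumptions: the class name `VersionCopy` and its methods are hypothetical, and a load reads directly from the shared array instead of scanning predecessor versions.

```python
NA, EL, M, ELM = "NA", "EL", "M", "ELM"

class VersionCopy:
    """One thread's version of the user array plus its access structures."""

    def __init__(self, n):
        self.values = [None] * n
        self.state = [NA] * n   # access array: one state per element
        self.touched = []       # indirection array: indices of touched elements

    def _mark(self, i, new_state):
        if self.state[i] == NA:         # first touch: remember the index
            self.touched.append(i)
        self.state[i] = new_state

    def load(self, i, shared):
        if self.state[i] == NA:         # exposed load: no local version yet
            self.values[i] = shared[i]
            self._mark(i, EL)
        return self.values[i]

    def store(self, i, value):
        self.values[i] = value
        self._mark(i, ELM if self.state[i] in (EL, ELM) else M)

    def commit(self, shared):
        """Write back modified elements, visiting only touched indices."""
        for i in self.touched:
            if self.state[i] in (M, ELM):
                shared[i] = self.values[i]
        return len(self.touched)

shared = [10, 20, 30, 40, 50, 60, 70, 80, 90]
v = VersionCopy(len(shared))
v.load(1, shared)
v.store(2, 99)
v.load(6, shared)
v.store(6, 7)
visited = v.commit(shared)   # visits 3 entries instead of all 9
```

With the access array alone, `commit` would have to scan every state entry. The indirection array trades a small append on first touch for a commit whose cost depends only on the number of touched elements.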

Scheduling Threads

Static: assign a chunk of N/P iterations to each processor
+ Only P active threads: little memory overhead
– Poor tolerance to load imbalance and dependence violations

Dynamic: dynamically assign each of N iterations
– N active threads: bigger memory structures
+ Better tolerance to load imbalance and dependence violations

Our solution: software version of an aggressive sliding window mechanism †

† Cintra, Martinez and Torrellas; ISCA 2000
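The load-imbalance argument can be made concrete with a toy cost model. The sketch below is an assumption-laden illustration: the function names and the per-iteration costs are made up, and speculation, commits and squashes are ignored entirely.

```python
import heapq

def static_makespan(costs, p):
    """Finish time when each of p threads gets one contiguous chunk of ~N/P iterations."""
    base, extra = divmod(len(costs), p)
    finish, start = 0, 0
    for t in range(p):
        size = base + (1 if t < extra else 0)
        finish = max(finish, sum(costs[start:start + size]))
        start += size
    return finish

def dynamic_makespan(costs, p):
    """Finish time when each iteration goes to whichever thread becomes free first."""
    free_at = [0] * p
    heapq.heapify(free_at)
    for c in costs:
        t = heapq.heappop(free_at)
        heapq.heappush(free_at, t + c)
    return max(free_at)

costs = [8, 1, 1, 1, 1, 1, 1, 1]         # hypothetical costs: one long iteration
static_time = static_makespan(costs, 2)    # 11
dynamic_time = dynamic_makespan(costs, 2)  # 8
```

With one expensive iteration, the static chunk that contains it dominates the run, while dynamic assignment lets the other thread absorb the short iterations.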

Sliding window

Schedule a window of W iterations at a time

Dynamic assignment of iterations inside the window

When the non-spec thread finishes, the window is advanced

Tradeoff between load balancing and size of version structures

[Figure: timeline of Thread 1 and Thread 2 executing iterations 1 to 8 (of N) through a window of size W that slides forward as the non-speculative iteration finishes]
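A minimal sketch of the window bookkeeping is given below. The class name `SlidingWindow` and its methods are hypothetical, and the sketch leaves out version copies, squashes and thread management.

```python
class SlidingWindow:
    """Hands out iterations inside a window of size w over n iterations."""

    def __init__(self, n, w):
        self.n, self.w = n, w
        self.base = 0            # oldest uncommitted (non-speculative) iteration
        self.next = 0            # next iteration not yet handed out
        self.finished = set()    # finished but not yet committed

    def dispatch(self):
        """Return the next iteration if it fits in the window, else None."""
        if self.next < min(self.base + self.w, self.n):
            it = self.next
            self.next += 1
            return it
        return None

    def finish(self, it):
        """Record completion and advance the window past every finished
        iteration at its front; return the iterations committed."""
        self.finished.add(it)
        committed = []
        while self.base in self.finished:
            self.finished.remove(self.base)
            committed.append(self.base)
            self.base += 1
        return committed

win = SlidingWindow(n=8, w=3)
first = [win.dispatch() for _ in range(4)]   # [0, 1, 2, None]: window is full
early = win.finish(1)                        # []: iteration 1 is still speculative
moved = win.finish(0)                        # [0, 1]: non-spec finished, window advances
more = [win.dispatch() for _ in range(3)]    # [3, 4, None]
```

At most W version copies are live at any time, which is the memory side of the tradeoff, while free threads can still pick up any undispatched iteration inside the window.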

Memory operations

Load operation: ... = A[K[i]]
L1: Update state of the element to EL
L2: Scan backwards access array for version
L3: Obtain most up-to-date version

Store operation: A[K[i]] = ...
S1: Perform the store of the new version
S2: Update state of the element to M or ELM
S3: Scan forwards access array for violations

Correctness guaranteed if globally performed in program order

But program order may not be respected…
Compiler reordering
Use of relaxed memory consistency models
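The load and store steps can be sketched sequentially as follows. This is an illustration under assumptions rather than the authors' implementation: the class `SpecWindow` is hypothetical, slots are ordered by iteration, a violation squashes the offending slot and all its successors, and the forward scan stops at a successor in state M because later readers would have picked up that successor's version.

```python
NA, EL, M, ELM = "NA", "EL", "M", "ELM"

class SpecWindow:
    def __init__(self, shared, slots):
        n = len(shared)
        self.shared = shared
        self.state = [[NA] * n for _ in range(slots)]    # access arrays
        self.value = [[None] * n for _ in range(slots)]  # version copies
        self.squashed = set()

    def load(self, slot, i):
        if self.state[slot][i] != NA:            # a local version already exists
            return self.value[slot][i]
        self.state[slot][i] = EL                 # L1
        v = self.shared[i]
        for pred in range(slot - 1, -1, -1):     # L2: scan backwards
            if self.state[pred][i] != NA:
                v = self.value[pred][i]          # L3: most up-to-date version
                break
        self.value[slot][i] = v
        return v

    def store(self, slot, i, v):
        self.value[slot][i] = v                  # S1
        exposed = self.state[slot][i] in (EL, ELM)
        self.state[slot][i] = ELM if exposed else M   # S2
        for succ in range(slot + 1, len(self.state)):  # S3: scan forwards
            if self.state[succ][i] in (EL, ELM):  # successor consumed a stale value
                self.squashed.update(range(succ, len(self.state)))
                break
            if self.state[succ][i] == M:          # assumption: stop at an overwrite
                break

# Store first, load later: the value is forwarded and nothing is squashed.
w = SpecWindow([0] * 8, slots=3)
w.store(0, 5, 11)
forwarded = w.load(2, 5)

# Load first, store later: the store detects the RAW violation.
w2 = SpecWindow([0] * 8, slots=3)
stale = w2.load(2, 5)
w2.store(0, 5, 11)
```

The sketch is only correct when the steps of each operation take effect in the order written, which is exactly the condition stated on the slide.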

Race Conditions

Certain interleavings of operations may lead to incorrect execution

Iteration J: Store Operation
S1: Perform the store of the new version
S2: Update state of the element to M or ELM
S3: Scan forwards access array for violations

Iteration J+K: Load Operation
L1: Update state of the element to EL
L2: Scan backwards access array for version
L3: Obtain most up-to-date version

[Figure: timeline interleaving steps S1 to S3 of the thread executing iteration J with steps L1 to L3 of the thread executing iteration J+K. Outcome: the load obtains an incorrect value and the violation is not detected]
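One interleaving that produces this outcome can be replayed step by step. The order used below is a hypothetical example in which L1 takes effect after L2 and L3, as a compiler or a relaxed memory model might allow; the function `run` and its single-element model are assumptions made for illustration.

```python
NA, EL, M = "NA", "EL", "M"

def run(order):
    """Replay the six steps on one element in the given global order.
    Returns (value seen by the load in J+K, whether J detected a violation)."""
    state = {"J": NA, "J+K": NA}
    version = {"J": None, "J+K": None}
    shared_value = 0
    ctx = {"src": None, "violation": False}

    def l1():
        state["J+K"] = EL
    def l2():
        ctx["src"] = "J" if state["J"] != NA else None
    def l3():
        version["J+K"] = version[ctx["src"]] if ctx["src"] else shared_value
    def s1():
        version["J"] = 42
    def s2():
        state["J"] = M
    def s3():
        ctx["violation"] = state["J+K"] == EL

    steps = {"L1": l1, "L2": l2, "L3": l3, "S1": s1, "S2": s2, "S3": s3}
    for name in order:
        steps[name]()
    return version["J+K"], ctx["violation"]

store_first = run(["S1", "S2", "S3", "L1", "L2", "L3"])  # (42, False): value forwarded
load_first = run(["L1", "L2", "L3", "S1", "S2", "S3"])   # (0, True): stale, but detected
racy = run(["L2", "L3", "S1", "S2", "S3", "L1"])         # (0, False): stale and undetected
```

In the last order the load reads the old value before the store is visible, and the store scans forward before the load has marked the element EL, so neither side notices the dependence.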

Conservative Solution

Enclose the operations in a critical section

Load Operation
# lock A
L1: Update state of the element to EL
L2: Scan backwards access array for version
L3: Obtain most up-to-date version
# unlock A

Store Operation
# lock A
S1: Perform the store of the new version
S2: Update state of the element to M or ELM
S3: Scan forwards access array for violations
# unlock A

Drawback: contention
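The critical-section variant can be sketched with a lock per shared structure. The class `LockedElement` below is hypothetical and models a single element with two iterations; `threading.Lock` stands in for "lock A".

```python
import threading

class LockedElement:
    """One element of A, with the whole load and store bodies under one lock."""

    def __init__(self, initial, slots):
        self.lock = threading.Lock()     # plays the role of "lock A"
        self.shared = initial
        self.state = ["NA"] * slots
        self.value = [None] * slots
        self.squashed = set()

    def load(self, slot):
        with self.lock:
            if self.state[slot] == "NA":
                self.state[slot] = "EL"                  # L1
                v = self.shared
                for pred in range(slot - 1, -1, -1):     # L2
                    if self.state[pred] != "NA":
                        v = self.value[pred]             # L3
                        break
                self.value[slot] = v
            return self.value[slot]

    def store(self, slot, v):
        with self.lock:
            self.value[slot] = v                         # S1
            exposed = self.state[slot] in ("EL", "ELM")
            self.state[slot] = "ELM" if exposed else "M"  # S2
            for succ in range(slot + 1, len(self.state)):  # S3
                if self.state[succ] in ("EL", "ELM"):
                    self.squashed.add(succ)
                    break

elem = LockedElement(initial=0, slots=2)
seen = []
writer = threading.Thread(target=elem.store, args=(0, 42))
reader = threading.Thread(target=lambda: seen.append(elem.load(1)))
reader.start()
writer.start()
reader.join()
writer.join()
# Whichever thread wins the lock, the load either sees 42 or is squashed.
ok = seen[0] == 42 or 1 in elem.squashed
```

Because every access to the array serializes on the same lock, correctness comes at the cost of the contention named on the slide.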

Our Solution: Memory Fences

Load Operation
L1: Update state of the element to EL
# memory fence
L2: Scan backwards access array for version
L3: Obtain most up-to-date version

Store Operation
S1: Perform the store of the new version
# memory fence
S2: Update state of the element to M or ELM
# memory fence
S3: Scan forwards access array for violations

All pending operations should be performed before passing the memory fence

This is the minimum set of memory fences needed

Critical sections are still necessary to protect structures on thread starts, commits and squashes.

Evaluation Environment

Execution of experiments on a real machine
– Sun SunFire 6800 SMP with 24 UltraSPARC-III processors
– OpenMP 2.0

Study of applications with non-analyzable loops
– TREE, WUPWISE, MDG: no dependences
– LUCAS, AP3M: dependences

Speedups of Loops: TREE

Very close to “ideal” DOALL speedup

Speedups of Loops: WUPWISE

Not so close to “ideal” DOALL speedup: huge spec data size

Importance of Indirection Array

Cost of Violation Checks

Systems evaluated:
– Baseline: our scheme, with violation checks upon stores
– sys2: same as Baseline, but violation checks upon commits

Cost of Violation Checks

Checks upon loads and stores are not too expensive

May outperform checks at commit on sparse accesses

Effects of Scheduling Schemes

Systems evaluated:
– Baseline: Sliding window moved when non-speculative thread finishes
– sys3: Sliding window moved when all threads finish (solution adopted by Dang et al. [IPDPS 2002])
– sys4: Dynamic scheduling, no partial commits (solution adopted by Rundberg et al. [WSSMM 2000])

Effects of Scheduling Schemes

P = 4 processors

Fully dynamic schedule is not always feasible

Best performance for W=2*P to 4*P

Wish List Revisited

To reduce software overhead
– Access and Indirection Arrays
– Early violation detection (on stores instead of during commit)

To have efficient scheduling
– Aggressive Sliding Window mechanism

To reduce contention
– Use of memory fences instead of critical sections

To avoid performance degradation
– Squash monitor with feedback

Software-only speculative parallelization schemes

SW-R-LRPD at Texas A&M University (IPDPS 2002)
– Less aggressive window (moved when all threads finish)
– Violation checks when threads commit

Chalmers University (WSSMM 2000)
– Dynamic scheme
– Violation checks upon stores

IBM Research (SC 1998)
– Series of tests for various specific behaviors

TLDS at Carnegie Mellon University (tech. rep. 2001)
– Speculation in software DSM engine

Conclusions

Systematic consideration of the design space and cost/performance issues

New efficient and robust software-only speculative parallelization scheme:
– Fine-tuned data structures
– Aggressive sliding window
– Reduced synchronization requirements
– Overhead monitors and feedback

Very good performance:
– 7 to 25% faster than previous schemes
– 71% of hand-made, manual parallelization speedup

Data Structures Implementation

[Figure: the user array (elements 0 to n), the per-thread access structures holding NA, EL and M states, and the corresponding version copies]

Squashing Threads

Violations are detected by looking up speculative access structure

On every store
+ Check only the element being accessed
+ Earlier violation detections
± Frequent checks
  Need some form of synchronization

At commit
  Check all elements
+ Faster speculative memory operations

Squash contention mechanism

Goal: to avoid performance degradation in the presence of dependences

Implemented with commit and squash monitors

After a given threshold, following invocations of the same loop will be executed sequentially
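A monitor of this kind might look like the sketch below. The slides do not define the threshold, so the squash ratio used here, the default of 0.5, the class name `SquashMonitor` and the loop identifiers are all assumptions.

```python
class SquashMonitor:
    """Per-loop commit and squash counters with a sequential fallback."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold   # assumed meaning: maximum tolerated squash ratio
        self.stats = {}              # loop id -> [commits, squashes]

    def record(self, loop_id, commits, squashes):
        counters = self.stats.setdefault(loop_id, [0, 0])
        counters[0] += commits
        counters[1] += squashes

    def run_speculatively(self, loop_id):
        """False once the observed squash ratio of this loop exceeds the threshold."""
        commits, squashes = self.stats.get(loop_id, (0, 0))
        total = commits + squashes
        return total == 0 or squashes / total <= self.threshold

monitor = SquashMonitor(threshold=0.5)
monitor.record("loop_a", commits=100, squashes=0)   # hypothetical counts
monitor.record("loop_b", commits=10, squashes=30)   # hypothetical counts
keep_a = monitor.run_speculatively("loop_a")        # True
keep_b = monitor.run_speculatively("loop_b")        # False: run sequentially from now on
```

Once a loop falls back to sequential execution it records no further squashes, so in this sketch the decision is permanent for later invocations of that loop.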

Importance of Squash Monitors

Application Characteristics

Application   Loops                         % of Seq. Time   Spec data Size (KB)
TREE          accel_10                      94               < 1
MDG           interf_1000                   86               < 1
WUPWISE       muldeo_200’, muldoe_200’      41               12,000
AP3M          Shgravll_700                  78               3,000
LUCAS         mers_mod_square (line 444)    20               4,000

Speedups of Loops: MDG

Very close to “ideal” DOALL speedup

Overall Speedups: TREE

Overall Speedups: WUPWISE

Overall Speedups: MDG

Constrained Memory Overheads

Mixed results: either Baseline or sys4 performs best

Related Work

Hardware-based speculative parallelization schemes:
– I-ACOMA at University of Illinois
– HYDRA at Stanford
– Multiplex at Purdue
– Multiscalar at Wisconsin
– Clustered Speculative Multithreading at UPC
– TLDS at Carnegie Mellon

Inspector-Executor scheme:
– Leung and Zahorjan (PPoPP 1993)
– Saltz, Mirchandaney, and Crowley (IEEE ToC 1991)

Related Work

Optimistic Concurrency Control schemes:
– E.g., Herlihy (ACM TDBS 1990); Kung and Robinson (ACM TDBS 1981)
– Only need to enforce that accesses to objects in critical sections do not overlap (no total order)
– Applied to explicitly parallel applications