
IPDPS 2002

Parasol Laboratory Texas A&M University

The R-LRPD Test: Speculative Parallelization of Partially Parallel Loops

Francis Dang, Hao Yu, and Lawrence Rauchwerger

Department of Computer Science

Texas A&M University


Motivation

To maximize performance, extract the maximum available parallelism from loops.

Static compiler methods may be insufficient:
– Access patterns may be too complex.
– Required information is only available at run time.

Run-time methods are needed to extract loop parallelism:
– Inspector/Executor
– Speculative parallelization


Speculative Parallelization: LRPD Test

Main idea (sketched below):
– Execute a loop as a DOALL.
– Record memory references during execution.
– Check for data dependences.
– If there was a dependence, re-execute the loop sequentially.

Disadvantages:
– One data dependence can invalidate the speculative parallelization.
– Slowdown is proportional to the speculative parallel execution time.
– Partial parallelism is not exploited.
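A minimal sketch of the record-and-check idea, assuming a loop in which iteration i reads A(K(i)) and writes A(L(i)), as in the example on the next slide. The shadow arrays Ar/Aw and the OpenMP directives are illustrative, not the paper's implementation; the actual LRPD test also handles privatizable and reduction accesses and is far less conservative than this:

! Sketch only: shadow arrays mark which elements of A were read or
! written during the speculative DOALL.  Races on the shadow stores
! are benign: threads only ever store the value 1.
integer :: Ar(m), Aw(m), i      ! m = size of A; declarations of
logical :: fail                 ! n, K, L omitted
Ar = 0
Aw = 0
!$omp parallel do
do i = 1, n
   Ar(K(i)) = 1                 ! iteration i reads  A(K(i))
   Aw(L(i)) = 1                 ! iteration i writes A(L(i))
end do
! Analysis: if any element was both read and written, a loop-carried
! dependence may exist, so the speculation is rejected.  (The real
! test is per-iteration and much less conservative.)
fail = any(Ar == 1 .and. Aw == 1)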


Partially Parallel Loop Example

do i = 1, 8
   z = A(K(i))
   A(L(i)) = z + C(i)
end do

K(1:8) = [1,2,3,1,4,2,1,1]
L(1:8) = [4,5,5,4,3,5,3,3]

iter    1   2   3   4   5   6   7   8
A(1)    R           R           R   R
A(2)        R               R
A(3)            R       W       W   W
A(4)    W           W   R
A(5)        W   W           W

The writes to A(4) in iterations 1 and 4 followed by the read in iteration 5 (and similarly for A(3)) make the loop only partially parallel.


The Recursive LRPD

Main idea:
– Transform a partially parallel loop into a sequence of fully parallel, block-scheduled loops.
– Iterations before the first data dependence are correct and committed.
– Re-apply the LRPD test on the remaining iterations.

Worst case:
– Sequential time plus testing overhead.


Algorithm

Initialize → Checkpoint → Execute as DOALL → Analyze (driver sketched below):
– On success: Commit.
– If failure: Restore, Reinitialize, and Restart on the remaining iterations.
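A Fortran-flavored control-flow sketch of this driver; every routine name below is illustrative, not the paper's actual interface:

first = 1
do while (first <= n)
   call initialize_test()                ! reset shadow structures
   call checkpoint_arrays()              ! back up arrays under test
   call speculative_doall(first, n)      ! block-scheduled DOALL + marking
   call analyze(ok, first_bad)           ! earliest cross-block dependence
   if (ok) then
      call commit(first, n)              ! all remaining iterations correct
      first = n + 1
   else
      call commit(first, first_bad - 1)  ! prefix before the dependence
      call restore_arrays()              ! roll back the rest and retry
      first = first_bad
   end if
end do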


Implementation

Implemented as a run-time pass in Polaris, plus additional hand-inserted code:
– Privatization with copy-in/copy-out for arrays under test.
– Replicated buffers for reductions (see the sketch below).
– Backup arrays for checkpointing.
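A minimal sketch of the replicated reduction buffers under an assumed OpenMP-style runtime; partial, test_passed, and the merge step are illustrative:

use omp_lib
real    :: partial(nprocs)
integer :: tid, i
partial = 0.0
!$omp parallel private(tid)
tid = omp_get_thread_num() + 1
!$omp do
do i = 1, n
   partial(tid) = partial(tid) + B(i)    ! each processor owns a copy
end do
!$omp end parallel
if (test_passed) s = s + sum(partial)    ! merge only on success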


Recursive LRPD Example

do i = 1, 8
   z = A(K(i))
   A(L(i)) = z + C(i)
end do

K(1:8) = [1,2,3,1,4,2,1,1]
L(1:8) = [4,5,5,4,2,5,3,3]

First stage (block-scheduled):

proc    P1    P2    P3    P4
iter    1-2   3-4   5-6   7-8
A(1)    R     R           R
A(2)    R           W
A(3)          R           W
A(4)    W     W     R
A(5)    W     W     W

A(4) is written by P1 and P2 but read by P3, so iterations 1-4 are committed and iterations 5-8 are re-executed.

Second stage:

proc    P1    P2    P3    P4
iter                5-6   7-8
A(1)                      R
A(2)                W
A(3)                      W
A(4)                R
A(5)                W

No cross-processor dependences remain, so the second stage succeeds.


Heuristics

– Work redistribution
– Sliding window approach
– Data dependence graph extraction


Work Redistribution

Redistribute the remaining iterations across processors, so the execution time of each stage decreases (a sketch follows the figure).

Disadvantages:
– May uncover new dependences across processors.
– May incur remote cache misses from data redistribution.

[Figure: iterations remaining on p1-p4 after the first and second stages, with and without redistribution.]
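A sketch of evenly redistributing the uncommitted iterations first..n over the processors for the next stage (all variable names are assumptions):

n_rem = n - first + 1
chunk = (n_rem + nprocs - 1) / nprocs    ! ceiling division
do p = 1, nprocs
   lo(p) = first + (p - 1) * chunk       ! new block for processor p
   hi(p) = min(lo(p) + chunk - 1, n)
end do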


Work Redistribution Example

do i = 1, 8
   z = A(K(i))
   A(L(i)) = z + C(i)
end do

K(1:8) = [1,2,3,1,4,2,1,1]
L(1:8) = [4,5,5,4,2,5,3,3]

First stage (block-scheduled):

proc    P1    P2    P3    P4
iter    1-2   3-4   5-6   7-8
A(1)    R     R           R
A(2)    R           W
A(3)          R           W
A(4)    W     W     R
A(5)    W     W     W

Second stage (iterations 5-8 redistributed, one per processor):

proc    P1    P2    P3    P4
iter    5     6     7     8
A(1)                R     R
A(2)    W     R
A(3)                W     W
A(4)    R
A(5)          W

Redistribution uncovers a new dependence: A(2) is written by P1 and read by P2, so only iteration 5 commits.

Third stage (iterations 6-8 redistributed):

proc    P1    P2    P3    P4
iter    6     7     8
A(1)          R     R
A(2)    R
A(3)          W     W
A(4)
A(5)    W

Only writes remain on A(3) (an output dependence handled by privatization with copy-out), so the third stage succeeds.


Redistribution Model

Redistribution may not always be beneficial. Stop redistributing when the cost of data redistribution outweighs the benefit of work redistribution (see the sketch below).

A synthetic loop was used to model this adaptive method.
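One plausible form of the cut-off, as a sketch under assumed names rather than the paper's exact model: stop once the measured data-movement cost exceeds the stage time the rebalancing is expected to save.

saving = t_stage_unbalanced - t_stage_balanced   ! estimated benefit
if (t_move_data >= saving) then
   redistribute = .false.                        ! keep the old layout
end if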

[Figure: time breakdown of the synthetic model on 8 processors (speculative loop time, synchronization, redistribution overhead) for the Never, Always, and Adaptive policies over stages 1-4, and time progression over LRPD stages for each policy.]


Sliding Window R-LRPD

R-LRPD can generate a nearly sequential schedule for long dependence distributions.

Strip-mine the speculative execution: apply the R-LRPD test to a contiguous block (window) of iterations at a time (driver sketched after the figure).

Only dependences within the window cause failures.

Adds more global synchronizations and test overhead.

[Figure: sliding-window execution on two processors; iterations remaining after the first and second windowed stages.]
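A sketch of the strip-mined driver; the window size w and the routine rlrpd_window are assumptions:

lo = 1
do while (lo <= n)
   hi = min(lo + w - 1, n)
   call rlrpd_window(lo, hi)   ! speculate, test, and commit this window
   lo = hi + 1                 ! dependences past hi cannot fail this window
end do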


DDG Extraction

R-LRPD can generate sequential schedules for complex dependence distributions.

Use the SW R-LRPD scheme to extract the data dependence graph (DDG).

Generate an optimized schedule from the DDG.

Obtains the DDG for loops from which a proper inspector cannot be extracted (edge recording sketched after the figure).

[Figure: windowed stages on two processors and the extracted DDG over iterations 1-5 with its dependence edges.]
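A sketch of how flow edges could be recorded for the example loop; last_writer and add_edge are illustrative names, not the paper's interface:

! last_writer(j) = last iteration observed to write A(j); 0 if none
last_writer = 0
do i = 1, n
   if (last_writer(K(i)) /= 0) then
      call add_edge(last_writer(K(i)), i)   ! flow edge into iteration i
   end if
   last_writer(L(i)) = i
end do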


Performance Issues

Performance issues:
– Blocked scheduling is a potential cause of load imbalance.
– Checkpointing can be expensive.

Feedback-guided blocked scheduling:
– Use the timing information from the previous instantiation (Bull, EuroPar '98).
– Estimate the processor chunk sizes that minimize load imbalance (see the sketch below).

On-demand checkpointing:
– Checkpoint only the data modified during execution.
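A sketch of the feedback-guided chunk estimate in the spirit of Bull's scheme (array names are assumptions): each processor's next chunk is made proportional to its measured execution rate from the previous instantiation.

! chunk(p) = iterations given to processor p last time; t(p) = its time
rate  = real(chunk) / t                     ! iterations per second
chunk = nint(n_iters * rate / sum(rate))    ! rebalanced chunk sizes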


Experiments

Setup:
– 16-processor HP V-Class
– 4 GB memory
– HP-UX 11.0

Codes and loops:

TRACK:      FPTRAK_do300, EXTEND_do400, NLFILT_do300
FMA3D:      Quadrilateral Loop
SPICE 2G6:  BJT, DCDCMP_do70, DCDCMP_do15


Experimental Results – Input Profiles

[Figure: TRACK input profile, percent of execution time spent in FPTRAK_do300, EXTEND_do400, NLFILT_do300, and other loops for inputs 15-250, 16-400, 16-450, 5-400, and 50-100. SPICE input profile, percent of execution time spent in DCDCMP_DO70, DCDCMP_DO15, BJT, and other loops for the 128-bit adder and Extended Reference inputs.]


Experimental Results - TRACK

[Figure: NLFILT_do300 speedup on up to 16 processors for inputs 15-250, 16-400, 16-450, 5-400, and 50-100.]

Parallelism ratio = (number of instantiations - number of restarts) / (number of instantiations)

[Figure: NLFILT_do300 parallelism ratio on up to 16 processors for the same inputs.]


[Figure: EXTEND_do400 speedup and parallelism ratio on up to 16 processors for inputs 15-250, 16-400, 16-450, 5-400, and 50-100.]


[Figure: FPTRAK_do300 speedup and parallelism ratio on up to 16 processors for the same inputs.]


[Figure: TRACK whole-program speedup for all five inputs, and NLFILT_do300 optimization contributions on input 16-400: no optimizations, FB, RD, RD and FB, and RD, FB, and ODC (FB = feedback-guided scheduling, RD = redistribution, ODC = on-demand checkpointing).]


Experimental Results – Sliding Window

[Figure: NLFILT_do300 speedup and parallelism ratio on input 15-250: R-LRPD with all optimizations vs. sliding-window (SW) block sizes 256 and 512.]


[Figure: the same comparison for NLFILT_do300 on input 16-400.]


Experimental Results – FMA3D

[Figure: FMA3D quadrilateral loop speedup on up to 16 processors.]


Experimental Results – SPICE 2G6

[Figure: SPICE 2G6 speedup on up to 8 processors for DCDCMP_DO15, DCDCMP_DO70, BJT, and the whole application (ALL), on the Extended Reference and 128-bit adder inputs.]


Conclusion

Contribution:
– Any loop can now be speculatively parallelized.
– The concern is now optimizing the parallelization, not deciding when to parallelize.

Future work:
– Use dependence distribution information for adaptive redistribution and scheduling.