

Topic 14: Parallelism

COS 320

Compiling Techniques

Princeton University Spring 2015

Prof. David August


Final Exam!

•  Friday, May 22 at 1:30 PM in FRIEND 006
•  Closed book
•  One front/back 8.5x11 sheet


Moore’s Law

Source: Intel/Wikipedia

Single-Threaded Performance Not Improving


What about Parallel Programming? – or – What is Good About the Sequential Model?

❖  Sequential is easier
»  People think about programs sequentially
»  Simpler to write a sequential program

❖  Deterministic execution
»  Reproducing errors for debugging
»  Testing for correctness

❖  No concurrency bugs
»  Deadlock, livelock, atomicity violations
»  Locks are not composable

❖  Performance extraction
»  Sequential programs are portable
•  Are parallel programs? Ask GPU developers ☺
»  Performance debugging of sequential programs is straightforward


Compilers are the Answer? - Proebsting’s Law

❖  “Compiler Advances Double Computing Power Every 18 Years”

❖  Run your favorite set of benchmarks with your favorite state-of-the-art optimizing compiler. Run the benchmarks both with and without optimizations enabled. The ratio of those numbers represents the entirety of the contribution of compiler optimizations to speeding up those benchmarks. Let's assume that this ratio is about 4X for typical real-world applications, and let's further assume that compiler optimization work has been going on for about 36 years. Therefore, compiler optimization advances double computing power every 18 years. QED.

[Chart: speedup (log scale, 1 to 100,000,000) vs. year, 1971–2003]

Conclusion – Compilers not about performance!

A Step Back in Time: Old Skool Parallelization

Are We Doomed?


Parallelizing Loops In Scientific Applications

for(i=1; i<=N; i++)   // C
  a[i] = a[i] + 1;    // X

Scientific Codes (FORTRAN-like)
Independent Multithreading (IMT) Example: DOALL parallelization

[Schedule: cycles 0–5; Core 1 runs C:1/X:1, C:3/X:3, C:5/X:5 while Core 2 runs C:2/X:2, C:4/X:4, C:6/X:6; all iterations are independent]
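For concreteness, a minimal sketch (not from the original slides) of the code a DOALL transformation effectively produces, written here with OpenMP; it assumes a has at least n+1 elements:

#include <omp.h>

void increment_all(int *a, int n) {
    /* No cross-iteration dependences: iteration i touches only a[i],
       so any assignment of iterations to threads is legal. */
    #pragma omp parallel for
    for (int i = 1; i <= n; i++)   /* C */
        a[i] = a[i] + 1;           /* X */
}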


What Information is Needed to Parallelize?

❖  Dependences within iterations are fine
❖  Identify the presence of cross-iteration data dependences
»  Traditional analysis is inadequate for parallelization. For instance, it does not distinguish between different executions of the same statement in a loop.
❖  Array dependence analysis enables optimization for parallelism in programs involving arrays.
»  Determine the pairs of iterations between which there is a data dependence
»  Want to know all dependences, not just yes/no

for(i=1; i<=N; i++)     // C
  a[i] = a[i] + 1;      // X   (no cross-iteration dependence)

for(i=1; i<=N; i++)     // C
  a[i] = a[i-1] + 1;    // X   (cross-iteration dependence)


Affine/Linear Functions

❖  f(x1, x2, …, xn) is affine if it can be expressed as a sum of a constant plus constant multiples of the variables, i.e., f = c0 + ∑_{i=1}^{n} ci·xi

❖  Array subscript expressions are usually affine functions involving loop induction variables.

❖  Examples:
»  a[ i ]              affine
»  a[ i+j-1 ]          affine
»  a[ i*j ]            non-linear, not affine
»  a[ 2*i+1, i*j ]     linear/non-linear, not affine
»  a[ b[i]+1 ]         non-linear (indexed subscript), not affine
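The definition transcribes directly into code; a tiny illustration (mine, not from the slides):

/* f(x_1..x_n) = c[0] + sum_i c[i] * x[i-1], the affine form above.
   a[i+j-1] fits this shape with c = {-1, 1, 1}; a[i*j] cannot,
   since no constant coefficients produce the product i*j. */
long affine(const long *c, const long *x, int n) {
    long f = c[0];
    for (int i = 1; i <= n; i++)
        f += c[i] * x[i - 1];
    return f;
}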


Array Dependence Analysis

for (i = 1; i < 10; i++) {
  X[i] = X[i-1];
}

To find all the data dependences, we check whether:

1.  X[i-1] and X[i] refer to the same location;
2.  different instances of X[i] refer to the same location.

»  For 1, we solve for i and i’ in: 1 ≤ i ≤ 10, 1 ≤ i’ ≤ 10, and i – 1 = i’
»  For 2, we solve for i and i’ in: 1 ≤ i ≤ 10, 1 ≤ i’ ≤ 10, i = i’ and i ≠ i’ (between different dynamic accesses)

There is a dependence since there exist integer solutions to 1, e.g. (i=2, i’=1), (i=3, i’=2); 9 solutions exist. There are no dependences among different instances of X[i] because 2 has no solutions!
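The dependence system above is small enough to check by brute-force enumeration; a throwaway sketch (mine, not the course's tooling) that confirms the counts just stated:

#include <stdio.h>

int main(void) {
    /* Condition 1: read X[i-1] vs. write X[i'] -- solve i - 1 = i'. */
    int n1 = 0;
    for (int i = 1; i <= 10; i++)
        for (int ip = 1; ip <= 10; ip++)
            if (i - 1 == ip) {
                printf("flow dependence: (i=%d, i'=%d)\n", i, ip);
                n1++;
            }
    /* Condition 2: two different instances of X[i] -- i = i' and i != i'. */
    int n2 = 0;
    for (int i = 1; i <= 10; i++)
        for (int ip = 1; ip <= 10; ip++)
            if (i == ip && i != ip)   /* contradictory: never true */
                n2++;
    printf("condition 1: %d solutions, condition 2: %d solutions\n", n1, n2);
    return 0;   /* prints 9 and 0, as claimed above */
}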

Array Dependence Analysis - Summary

❖  Array data dependence analysis basically requires finding integer solutions to a system (often referred to as a dependence system) consisting of equalities and inequalities.
❖  Equalities are derived from the array accesses.
❖  Inequalities are derived from the loop bounds.
❖  It is an integer linear programming problem.
❖  ILP is an NP-complete problem.
❖  Several heuristics have been developed.
»  Omega – U. Maryland
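One of the simplest such heuristics is the classical GCD test (my example; the slide names only Omega): the dependence equation x·i + b = y·i′ + d has an integer solution only if gcd(x, y) divides d − b. A sketch, ignoring loop bounds:

#include <stdio.h>
#include <stdlib.h>

static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

/* GCD test for accesses a[x*i + b] and a[y*i' + d]:
   returns 0 if provably independent, 1 if a dependence MAY exist
   (loop bounds are ignored, so "may" is all it can conclude). */
int may_depend(int x, int b, int y, int d) {
    return (d - b) % gcd(abs(x), abs(y)) == 0;
}

int main(void) {
    printf("%d\n", may_depend(2, 0, 2, 1));   /* a[2i] vs a[2i'+1]: 0 (independent) */
    printf("%d\n", may_depend(1, 0, 1, -1));  /* a[i]  vs a[i'-1]:  1 (may depend)  */
    return 0;
}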


Loop Parallelization Using Affine Analysis Is Proven Technology

❖  DOALL Loop
»  No loop-carried dependences for a particular nest
»  Loop interchange to move parallel loops to outer scopes

❖  Other forms of parallelization possible
»  DOAcross, DOpipe

❖  Optimizing for the memory hierarchy
»  Tiling, skewing, etc.

❖  Real compilers available – KAP, Portland Group, gcc
❖  For better information, see
»  http://gcc.gnu.org/wiki/Graphite?action=AttachFile&do=get&target=graphite_lambda_tutorial.pdf

Back to the Present – Parallelizing C and C++ Programs


Loop Level Parallelization

[Figure: iterations i = 0–39 split into chunks of 10; Thread 0 runs i = 0–9 and i = 20–29, Thread 1 runs i = 10–19 and i = 30–39]

Loop Chunk
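A minimal sketch (mine, plain pthreads) of the chunked code a compiler might emit for the schedule above:

#include <pthread.h>

#define N 40
#define NTHREADS 2
static int a[N];

/* Each thread claims alternating chunks of 10 iterations,
   matching the Thread 0 / Thread 1 assignment in the figure. */
static void *worker(void *arg) {
    long tid = (long)arg;
    for (int chunk = (int)tid * 10; chunk < N; chunk += NTHREADS * 10)
        for (int i = chunk; i < chunk + 10; i++)
            a[i] = a[i] + 1;
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long tid = 0; tid < NTHREADS; tid++)
        pthread_create(&t[tid], NULL, worker, (void *)tid);
    for (int tid = 0; tid < NTHREADS; tid++)
        pthread_join(t[tid], NULL);
    return 0;
}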

Bad news: limited number of parallel loops in general purpose applications

– 1.3x speedup for SpecINT2000 on 4 cores


DOALL Loop Coverage

[Bar chart: fraction of sequential execution spent in DOALL loops for each benchmark in SPEC FP, SPEC INT, Mediabench, and Utilities; y-axis 0–1]


What’s the Problem?

for (i=0; i<100; i++) {
  . . . = *p;
  *q = . . .;
}

1. Memory dependence analysis

Memory dependence profiling and speculative parallelization


DOALL Coverage – Provable and Profiled

[Bar chart: fraction of sequential execution in provably DOALL loops, per benchmark (SPEC FP, SPEC INT, Mediabench, Utilities)]

[Bar chart: fraction of sequential execution in DOALL loops, provable vs. memory-profiled, per benchmark]

Still not good enough!


What’s the Next Problem?

2. Data dependences

while (ptr != NULL) {
  . . .
  ptr = ptr->next;
  sum = sum + foo;
}

Compiler transformations


We Know How to Break Some of These Dependences – Recall ILP Optimizations

[Figure: the single accumulator "sum += x" is expanded into "sum1 += x" on Thread 0 and "sum2 += x" on Thread 1, followed by "sum = sum1 + sum2"]

Apply accumulator variable expansion!
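A sketch of the expanded code on two threads (pthreads; my stand-in for the compiler's output, not the course's code generator):

#include <pthread.h>

#define N 1000
static double x[N];
static double partial[2];   /* one private accumulator per thread */

/* Accumulator expansion: each thread sums its own half into a
   private accumulator, removing the cross-iteration dependence on
   "sum"; the two partials are combined once at the end. */
static void *sum_half(void *arg) {
    long tid = (long)arg;
    double s = 0.0;
    for (int i = (int)tid * (N / 2); i < ((int)tid + 1) * (N / 2); i++)
        s += x[i];
    partial[tid] = s;
    return NULL;
}

double parallel_sum(void) {
    pthread_t t[2];
    for (long tid = 0; tid < 2; tid++)
        pthread_create(&t[tid], NULL, sum_half, (void *)tid);
    for (int tid = 0; tid < 2; tid++)
        pthread_join(t[tid], NULL);
    return partial[0] + partial[1];   /* sum = sum1 + sum2 */
}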


Data Dependences Inhibit Parallelization

❖  Accumulator, induction, and min/max expansion only capture a small set of dependences

❖  2 options
»  1) Break more dependences – new transformations
»  2) Parallelize in the presence of dependences – more than DOALL parallelization
❖  We will talk about both, but for now ignore this issue


What’s the Next Problem?

3. C/C++ too restrictive

char *memory;
void * alloc(int size);

void * alloc(int size) {
  void * ptr = memory;
  memory = memory + size;
  return ptr;
}

Low Level Reality
[Figure: calls alloc1–alloc6 across Cores 1–3 serialize over time, because each call depends on the previous bump of "memory"]


char *memory;
void * alloc(int size);

void * alloc(int size) {
  void * ptr = memory;
  memory = memory + size;
  return ptr;
}

Low Level Reality
[Figure: the same alloc1–alloc6 serialization]

Loops cannot be parallelized even if computation is independent


Commutative Extension

❖  Interchangeable call sites
»  Programmer doesn’t care about the order in which a particular function is called
»  Multiple different orders are all defined as correct
»  Impossible to express in C
❖  Prime example is a memory allocation routine
»  Programmer does not care which address is returned on each call, just that the proper space is provided
❖  Enables the compiler to break the dependences that flow from one invocation to the next and force sequential behavior
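A sketch of what this permits (the @Commutative annotation follows the slides; the lock is my addition, assuming each call must still execute atomically even though calls may be reordered):

#include <pthread.h>

char *memory;   /* bump-pointer heap, as on the slide */
static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

/* @Commutative: callers accept ANY interleaving of alloc() calls,
   so the compiler may run them concurrently from parallel iterations.
   Each individual call is still made atomic with a lock. */
void *alloc(int size) {
    pthread_mutex_lock(&alloc_lock);
    void *ptr = memory;
    memory = memory + size;
    pthread_mutex_unlock(&alloc_lock);
    return ptr;
}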


char *memory;
void * alloc(int size);   @Commutative

void * alloc(int size) {
  void * ptr = memory;
  memory = memory + size;
  return ptr;
}

Low Level Reality
[Figure: with @Commutative, alloc1–alloc6 run concurrently across Cores 1–3]


char *memory;
void * alloc(int size);   @Commutative

void * alloc(int size) {
  void * ptr = memory;
  memory = memory + size;
  return ptr;
}

Implementation dependences should not cause serialization.

Low Level Reality
[Figure: alloc1–alloc6 overlap across cores]


What is the Next Problem?

❖  4. C does not allow any prescribed non-determinism
»  Thus sequential semantics must be assumed, even though they are not necessary
»  Restricts parallelism (useless dependences)
❖  Non-deterministic branch → programmer does not care about individual outcomes
»  They attach a probability to control how statistically often the branch should be taken
»  Allows the compiler to trade off ‘quality’ (e.g., compression rate) for performance
•  When to create a new dictionary in a compression scheme


[Figure: the sequential program restarts its dictionary at unpredictable points (after 280, 3000, 70, 520 chars); the parallel program restarts at regular points (100, 100, 80, 100, 20 chars)]

dict = create_dict();
while((char = read(1))) {
  profitable = compress(char, dict);
  if (!profitable) {
    dict = restart(dict);
  }
}
finish_dict(dict);

#define CUTOFF 100
dict = create_dict();
count = 0;
while((char = read(1))) {
  profitable = compress(char, dict);
  if (!profitable)
    dict = restart(dict);
  if (count == CUTOFF) {
    dict = restart(dict);
    count = 0;
  }
  count++;
}
finish_dict(dict);

[Timeline: resets occur at !profitable points in the sequential version and at fixed cutoff points in the parallel version]


[Figure: dictionary restart points for a 2-core parallel program vs. a 64-core parallel program (chunk sizes of 1500, 2500, 800, 1200, 1700, 1000, 1300 chars)]

dict = create_dict();
while((char = read(1))) {
  profitable = compress(char, dict);
  @YBRANCH(probability=.01)
  if (!profitable) {
    dict = restart(dict);
  }
}
finish_dict(dict);

Compilers are best situated to make the tradeoff between output quality and performance.

[Timeline: one schedule resets every 2500 characters, the other every 100 characters]


Capturing Output/Performance Tradeoff: Y-Branches in 164.gzip

dict = create_dict();
while((char = read(1))) {
  profitable = compress(char, dict);
  if (!profitable) {
    dict = restart(dict);
  }
}
finish_dict(dict);

#define CUTOFF 100000
dict = create_dict();
count = 0;
while((char = read(1))) {
  profitable = compress(char, dict);
  if (!profitable)
    dict = restart(dict);
  if (count == CUTOFF) {
    dict = restart(dict);
    count = 0;
  }
  count++;
}
finish_dict(dict);

dict = create_dict();
while((char = read(1))) {
  profitable = compress(char, dict);
  @YBRANCH(probability=.00001)
  if (!profitable) {
    dict = restart(dict);
  }
}
finish_dict(dict);


unsigned char *block;
int last_written;

compressStream(in, out) {
  while (True) {
    loadAndRLEsource(in);
    if (!last) break;
    doReversibleTransform();
    sendMTFValues(out);
  }
}

doReversibleTransform() {
  ...
  sortIt();
  ...
}

sortIt() {
  ...
  printf(...);
  ...
}

256.bzip2

Parallelization techniques must look inside function calls to expose operations that cause synchronization.


char *memory;
void * xalloc(int size) {
  void * ptr = memory;
  memory = memory + size;
  return ptr;
}

batch_process() {
  while(True) {
    sentence = read();
    if (!sentence) break;
    parse(sentence);
    print(sentence);
  }
}

High-Level View: parsing a sentence is independent of any other sentence.
Low-Level Reality: implementation dependences inside functions called by parse lead to large sequential regions.

197.parser


164.gzip      26   x  x  x
175.vpr        1   x  x  x
176.gcc       18   x  x  x  x
181.mcf        0   x
186.crafty     9   x  x  x  x  x
197.parser     3   x  x
253.perlbmk    0   x  x  x
254.gap        3   x  x  x
255.vortex     0   x  x  x
256.bzip2      0   x  x
300.twolf      1   x  x  x

Modified only 60 LOC out of ~500,000 LOC

Lack of an Aggressive Compilation Framework

What prevents the automatic extraction of parallelism?

Sequential Programming Model

Performance Potential


What About Non-Scientific Codes???

for(i=1; i<=N; i++)        // C
  a[i] = a[i] + 1;         // X

while(ptr = ptr->next)     // LD
  ptr->val = ptr->val + 1; // X

Scientific Codes (FORTRAN-like)
General-purpose Codes (legacy C/C++)

[Schedules: DOACROSS alternates LD:i and X:i between Core 1 and Core 2, one iteration in flight at a time; DOALL runs independent C:i/X:i iterations on both cores every cycle]

Cyclic Multithreading (CMT) Example: DOACROSS [Cytron, ICPP 86]
Independent Multithreading (IMT) Example: DOALL parallelization


Alternative Parallelization Approaches

while(ptr = ptr->next)     // LD
  ptr->val = ptr->val + 1; // X

[Schedules: CMT ping-pongs LD:i / X:i between Core 1 and Core 2; PMT runs every LD on Core 1 and every X on Core 2, overlapping them in pipeline fashion]

Cyclic Multithreading (CMT)
Pipelined Multithreading (PMT) Example: DSWP [PACT 2004]


Comparison: IMT, PMT, CMT

[Schedules: the same loop under IMT (independent iterations per core), PMT (LD stage on Core 1, X stage on Core 2), and CMT (iterations alternating between cores)]


Comparison: IMT, PMT, CMT

[Schedules with lat(comm) = 1: IMT, PMT, and CMT each sustain 1 iter/cycle]


Comparison: IMT, PMT, CMT

[Schedules with lat(comm) = 2: IMT and PMT still sustain 1 iter/cycle; CMT drops to 0.5 iter/cycle, since the communication latency lands on its cross-core recurrence]


Comparison: IMT, PMT, CMT

[The same three schedules, annotated with the key tradeoff]

Cross-thread Dependences ⇒ Wide Applicability
Thread-local Recurrences ⇒ Fast Execution


Our Objective: Automatic Extraction of Pipeline Parallelism using DSWP

[Pipeline for 197.parser: Find English Sentences → Parse Sentences (95%) → Emit Results]
PS-DSWP (Spec DOALL Middle Stage)

Decoupled Software Pipelining


Decoupled Software Pipelining (DSWP) [MICRO 2005]

A: while(node)
B:   ncost = doit(node);
C:   cost += ncost;
D:   node = node->next;

Inter-thread communication latency is a one-time cost.

[Figure: dependence graph over {A, B, C, D} with intra-iteration and loop-carried register/control edges; its DAG of SCCs; Thread 1 gets the {A, D} recurrence, Thread 2 gets {B, C}, linked by a communication queue]
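A minimal runnable sketch (mine, not the actual MTCG output) of this two-stage partition: Thread 1 runs the {A, D} pointer-chasing recurrence and streams nodes through a software queue to Thread 2, which runs {B, C}; doit is a stub and the queue is a stand-in for the communication queue in the figure:

#include <pthread.h>
#include <stdlib.h>

typedef struct node { struct node *next; int val; } node_t;

/* Tiny blocking single-producer/single-consumer queue. */
#define QSIZE 64
static node_t *q[QSIZE];
static int qhead, qtail, qcount;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t notfull  = PTHREAD_COND_INITIALIZER,
                      notempty = PTHREAD_COND_INITIALIZER;

static void produce(node_t *n) {
    pthread_mutex_lock(&qlock);
    while (qcount == QSIZE) pthread_cond_wait(&notfull, &qlock);
    q[qtail] = n; qtail = (qtail + 1) % QSIZE; qcount++;
    pthread_cond_signal(&notempty);
    pthread_mutex_unlock(&qlock);
}

static node_t *consume(void) {
    pthread_mutex_lock(&qlock);
    while (qcount == 0) pthread_cond_wait(&notempty, &qlock);
    node_t *n = q[qhead]; qhead = (qhead + 1) % QSIZE; qcount--;
    pthread_cond_signal(&notfull);
    pthread_mutex_unlock(&qlock);
    return n;
}

static int doit(node_t *n) { return n->val; }   /* stub for B's work */

/* Thread 1: the critical-path recurrence {A, D} -- pointer chasing only. */
static void *stage1(void *list) {
    for (node_t *node = list; node; node = node->next)   /* A, D */
        produce(node);
    produce(NULL);                    /* end-of-stream marker */
    return NULL;
}

/* Thread 2: the off-critical-path work {B, C}. */
static long cost;
static void *stage2(void *arg) {
    (void)arg;
    node_t *node;
    while ((node = consume()) != NULL)
        cost += doit(node);           /* B, C */
    return NULL;
}

long dswp_loop(node_t *list) {
    pthread_t t1, t2;
    cost = 0;
    pthread_create(&t1, NULL, stage1, list);
    pthread_create(&t2, NULL, stage2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return cost;
}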


Implementing DSWP

[Figure: loop L1 split into stages with an auxiliary loop Aux; DFG edges classified as intra-iteration vs. loop-carried and register vs. memory vs. control]


Optimization: Node Splitting To Eliminate Cross-Thread Control

[Figure: duplicating the branch node into both threads (L1, L2) removes a cross-thread control edge]


Optimization: Node Splitting To Reduce Communication

[Figure: splitting a node so fewer values cross the L1/L2 boundary]


Constraint: Strongly Connected Components

Consider: [Figure: a dependence cycle split across threads]

Splitting an SCC across threads eliminates the pipelined/decoupled property.

Solution: DAGSCC

2 Extensions to the Basic Transformation

❖  Speculation
»  Break statistically unlikely dependences
»  Form better-balanced pipelines
❖  Parallel Stages
»  Execute multiple copies of certain “large” stages
»  Stages that contain inner loops are perfect candidates


Why Speculation?

A: while(node)
B:   ncost = doit(node);
C:   cost += ncost;
D:   node = node->next;

[Figure: dependence graph over A–D and its DAGSCC]


Why Speculation?

A: while(cost < T && node)
B:   ncost = doit(node);
C:   cost += ncost;
D:   node = node->next;

[Figure: the early-exit condition makes A depend on B and C as well, collapsing A, B, C, D into a single SCC; the loop-exit dependences, however, are predictable]


Why Speculation?

A: while(cost < T && node)
B:   ncost = doit(node);
C:   cost += ncost;
D:   node = node->next;

[Figure: speculating the predictable loop-exit dependences breaks the large SCC; the DAGSCC again separates A, B, C, and D]


Execution Paradigm

[Figure: the pipeline runs speculatively; when misspeculation is detected in iteration 4, recovery reruns iteration 4]


Understanding PMT Performance

[Schedules: a balanced two-stage pipeline sustains 1 iter/cycle; an unbalanced one whose slowest stage takes 2 cycles/iter leaves a core idle and sustains only 0.5 iter/cycle]

The slowest thread determines the iteration rate: T ∝ max(tᵢ)

1.  Rate tᵢ is at least as large as the longest dependence recurrence.
2.  It is NP-hard to find the longest recurrence.
3.  Large loops make the problem difficult in practice.


Selecting Dependences To Speculate

A: while(cost < T && node)
B:   ncost = doit(node);
C:   cost += ncost;
D:   node = node->next;

[Figure: dependence graph and DAGSCC partitioned onto Threads 1–4; the speculated loop-exit dependences are the ones removed]


Detecting Misspeculation

Thread 1:
A1: while(consume(4))
D :   node = node->next;
      produce({0,1}, node);

Thread 2:
A2: while(consume(5))
B :   ncost = doit(node);
      produce(2, ncost);
D2:   node = consume(0);

Thread 3:
A3: while(consume(6))
B3:   ncost = consume(2);
C :   cost += ncost;
      produce(3, cost);

Thread 4:
A : while(cost < T && node)
B4:   cost = consume(3);
C4:   node = consume(1);
      produce({4,5,6}, cost < T && node);


Detecting Misspeculation

Thread 1:
A1: while(TRUE)
D :   node = node->next;
      produce({0,1}, node);

Thread 2:
A2: while(TRUE)
B :   ncost = doit(node);
      produce(2, ncost);
D2:   node = consume(0);

Thread 3:
A3: while(TRUE)
B3:   ncost = consume(2);
C :   cost += ncost;
      produce(3, cost);

Thread 4:
A : while(cost < T && node)
B4:   cost = consume(3);
C4:   node = consume(1);
      produce({4,5,6}, cost < T && node);


Detecting Misspeculation

Thread 1:
A1: while(TRUE)
D :   node = node->next;
      produce({0,1}, node);

Thread 2:
A2: while(TRUE)
B :   ncost = doit(node);
      produce(2, ncost);
D2:   node = consume(0);

Thread 3:
A3: while(TRUE)
B3:   ncost = consume(2);
C :   cost += ncost;
      produce(3, cost);

Thread 4:
A : while(cost < T && node)
B4:   cost = consume(3);
C4:   node = consume(1);
      if(!(cost < T && node)) FLAG_MISSPEC();


Breaking False Memory Dependences

[Figure: memory versioning; each speculative iteration writes its own memory version (e.g., version 3), the oldest version is committed by the recovery thread, and false memory dependences in the dependence graph are ignored]


Adding Parallel Stages to DSWP

while(ptr = ptr->next)      // LD
  ptr->val = ptr->val + 1;  // X

LD = 1 cycle, X = 2 cycles, Comm. Latency = 2 cycles

Throughput:
DSWP: 1/2 iteration/cycle
DOACROSS: 1/2 iteration/cycle
PS-DSWP: 1 iteration/cycle

[Schedule: Core 1 streams LD:1–LD:8; Cores 2 and 3 each run the replicated X stage on alternating iterations]
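These numbers follow from the slowest-stage bound T ∝ max(tᵢ); a worked check (my arithmetic, matching the figures above):

$$\text{DSWP: } \frac{1}{\max(t_{LD}, t_X)} = \frac{1}{2}\ \text{iter/cycle}, \qquad \text{PS-DSWP: } \frac{k}{t_X} = \frac{2}{2} = 1\ \text{iter/cycle, with the X stage replicated } k = 2 \text{ times.}$$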


p = list;
sum = 0;
A: while (p != NULL) {
B:   id = p->id;
E:   q = p->inner_list;
C:   if (!visited[id]) {
D:     visited[id] = true;
F:     while (foo(q))
G:       q = q->next;
H:     if (q != NULL)
I:       sum += p->value;
     }
J:   p = p->next;
   }

[PDG with profile weights 10, 10, 10, 10, 5, 5, 50, 50, 5, 3; the sum in I is a reduction]

Thread Partitioning


Thread Partitioning: DAGSCC


Thread Partitioning

Merging Invariants
•  No cycles
•  No loop-carried dependence inside a doall node

[DAGSCC node weights: 20, 10, 15, 5, 100, 5, 3]


[After merging: node weights 20, 10, 15, 113; the 113 node is treated as sequential]

Thread Partitioning


[Final partition: a 45 stage and a 113 stage]

Thread Partitioning

❖  Modified MTCG [Ottoni, MICRO ’05] to generate code from the partition


Discussion Point 1 – Speculation

❖  How do you decide what dependences to speculate?
»  Look solely at profile data?
»  How do you ensure enough profile coverage?
»  What about code structure?
»  What if you are wrong? Undo speculation decisions at run-time?
❖  How do you manage speculation in a pipeline?
»  Traditional definition of a transaction is broken
»  Transaction execution is spread out across multiple cores


Discussion Point 2 – Pipeline Structure

❖  When is a pipeline a good/bad choice for parallelization?
❖  Is pipelining good or bad for cache performance?
»  Is DOALL better/worse for cache?
❖  Can a pipeline be adjusted when the number of available cores increases/decreases?

CFGs, PCs, and Cross-Iteration Deps


1.  r1 = r1 + 1

2.  r2 = MEM[r1]

3.  r2 = r2 + 1

4.  MEM[r1] = r2

5.  Branch r1 < 1000

1.  r1 = 10

No register live outs

Loop-Level Parallelization: DOALL


1.  r1 = r1 + 1

2.  r2 = MEM[r1]

3.  r2 = r2 + 1

4.  MEM[r1] = r2

5.  Branch r1 < 1000

1.  r1 = 10

No register live outs

1.  r1 = r1 + 2

2.  r2 = MEM[r1]

3.  r2 = r2 + 1

4.  MEM[r1] = r2

5.  Branch r1 < 999

1.  r1 = 9

1.  r1 = r1 + 2

2.  r2 = MEM[r1]

3.  r2 = r2 + 1

4.  MEM[r1] = r2

5.  Branch r1 < 1000

1.  r1 = 10
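In source-level terms (my paraphrase of the register-level transformation above, not code from the slides): the iteration space is split by stride, so each core runs every other iteration with its own induction-variable start and bound, exactly as the adjusted initializations (r1 = 9 vs. r1 = 10) and bounds (999 vs. 1000) do above:

/* Original loop: for (i = 11; i <= 1000; i++) a[i] += 1; */
void coreA(int *a) { for (int i = 11; i <= 999;  i += 2) a[i] += 1; }  /* odd i  */
void coreB(int *a) { for (int i = 12; i <= 1000; i += 2) a[i] += 1; }  /* even i */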


Another Example


1.  r1 = r1 + 1

2.  r2 = MEM[r1]

3.  r2 = r2 + 1

4.  MEM[r1] = r2

5.  Branch r2 == 10

1.  r1 = 10

No register live outs

Another Example


1.  r1 = r1 + 1

2.  r2 = MEM[r1]

3.  r2 = r2 + 1

4.  MEM[r1] = r2

5.  Branch r2 == 10

1.  r1 = 10

No register live outs

1.  r1 = r1 + 2

2.  r2 = MEM[r1]

3.  r2 = r2 + 1

4.  MEM[r1] = r2

5.  Branch r2 == 10

1.  r1 = 9

1.  r1 = r1 + 2

2.  r2 = MEM[r1]

3.  r2 = r2 + 1

4.  MEM[r1] = r2

5.  Branch r2 == 10

1.  r1 = 10

Speculation


No register live outs

1.  r1 = r1 + 2

2.  r2 = MEM[r1]

3.  r2 = r2 + 1

4.  MEM[r1] = r2

5.  Branch r2 == 10

1.  r1 = 9

1.  r1 = r1 + 2

2.  r2 = MEM[r1]

3.  r2 = r2 + 1

4.  MEM[r1] = r2

5.  Branch r2 == 10

1.  r1 = 10

Speculation, Commit, and Recovery


No register live outs

1.  r1 = r1 + 2

2.  r2 = MEM[r1]

3.  r2 = r2 + 1

4.  Send{1} r2

5.  Jump

1.  r1 = 9

1.  r1 = r1 + 2

2.  r2 = MEM[r1]

3.  r2 = r2 + 1

4.  MEM[r1] = r2

5.  Jump

1.  r1 = 10

1.  r2 = Receive{1}

2.  Branch r2 != 10

3.  MEM[r1] = r2

4.  r2 = Receive{2}

5.  Branch r2 != 10

6.  MEM[r1] = r2

7.  Jump

1.  Kill and Continue


Difficult Dependences


1.  r1 = MEM[r1]
2.  Branch r1 == 0
3.  r2 = MEM[r1 + 4]
4.  r3 = Work (r2)

5.  Print ( r3 )

6.  Jump

1.  r1 = Head

No register live outs

DOACROSS


1.  r1 = MEM[r1]
2.  Branch r1 == 0
3.  r2 = MEM[r1 + 4]
4.  r3 = Work (r2)

5.  Print ( r3 )

6.  Jump

1.  r1 = Head

No register live outs

PS-DSWP


1.  r1 = MEM[r1]
2.  Branch r1 == 0
3.  r2 = MEM[r1 + 4]
4.  r3 = Work (r2)

5.  Print ( r3 )

6.  Jump

1.  r1 = Head

No register live outs

[Chart: SPEC CINT performance (log scale) vs. year, 1992–2012, across CPU92, CPU95, CPU2000, and CPU2006 – the golden era of computer architecture]

[Same chart, annotated: single-threaded performance is now ~3 years behind the historical trend]

Era of DIY:
•  Multicore
•  Reconfigurable
•  GPUs
•  Clusters

10 Cores! 10-Core Intel Xeon: “Unparalleled Performance”


P6 SUPERSCALAR ARCHITECTURE (CIRCA 1994)
•  Automatic Speculation
•  Automatic Pipelining
•  Parallel Resources
•  Automatic Allocation/Scheduling
•  Commit

MULTICORE ARCHITECTURE (CIRCA 2010)
•  Automatic Speculation
•  Automatic Pipelining
•  Parallel Resources
•  Automatic Allocation/Scheduling
•  Commit

[Figure: realizable parallelism – threads vs. time for parallel library calls; credit: Jack Dongarra]


“Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law

[Schedule: Spec-PS-DSWP on four cores – Core 1 runs LD:1–LD:5, Cores 2 and 3 run the replicated W stage, Core 4 commits (C:1–C:3)]

Spec-PS-DSWP
P6 SUPERSCALAR ARCHITECTURE

Example

A: while (node) {
B:   node = node->next;
C:   res = work(node);
D:   write(res);
   }

Program Dependence Graph (PDG): nodes A, B, C, D with control-dependence and data-dependence edges

[Schedule: iterations A1/B1/C1/D1, A2/B2/C2/D2, … spread over Cores 1–3]

Example: Spec-DOALL

A: while (node) {
B:   node = node->next;
C:   res = work(node);
D:   write(res);
   }

[Schedule and PDG as before; Spec-DOALL speculates the loop-carried dependences]


Example: Spec-DOALL

A: while (node) {
B:   node = node->next;
C:   res = work(node);
D:   write(res);
   }

[Schedule: with speculation, Cores 1–3 each run whole iterations (A1–D1, A2–D2, A3–D3) in parallel]

Example: Spec-DOALL

A: while (node) {      →  while (true) {
B:   node = node->next;
C:   res = work(node);
D:   write(res);
   }

[Schedule: the loop-exit branch A is speculated away; Cores 1–3 run B/C/D for consecutive iterations]

[Chart: on 197.parser, Spec-DOALL produces a slowdown]

Comparison: Spec-DOACROSS and Spec-DSWP

[Schedules on 3 cores, Comm. Latency = 1: both Spec-DOACROSS and Spec-DSWP sustain 1 iter/cycle]

[Schedules on 3 cores, Comm. Latency = 2: Spec-DOACROSS drops to 0.5 iter/cycle (the latency problem); Spec-DSWP, after its pipeline fill time, still sustains 1 iter/cycle]


[Chart: Performance Speedup (X), 0–50, vs. (Number of Total Cores, Number of Nodes), comparing TLS and Spec-PS-DSWP]

Spec-DOACROSS vs. Spec-DSWP [MICRO 2010]

Geomean of 11 benchmarks on the same cluster; performance relative to best sequential; 128 cores in 32 nodes with Intel Xeon processors [MICRO 2010].

Restoration of Trend

“Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law

Compiler Technology

Architecture/Devices

Era of DIY:
•  Multicore
•  Reconfigurable
•  GPUs
•  Clusters

Compiler technology inspired class of architectures?


CFGs and PCs


Dependence Graphs and Parallelism
