84
P A R A L L E L C O M P U T I N G L A B O R A T O R Y EECS Electrical Engineering and Computer Sciences 1 BERKELEY PAR LAB The Roofline Model: A pedagogical tool for program analysis and optimization ParLab Summer Retreat Samuel Williams, David Patterson [email protected]

The Roofline Model: A pedagogical tool for program analysis and optimization

Embed Size (px)

DESCRIPTION

The Roofline Model: A pedagogical tool for program analysis and optimization. ParLab Summer Retreat Samuel Williams, David Patterson [email protected]. Motivation. Performance and scalability of multicore architectures can be extremely non-intuitive to novice programmers - PowerPoint PPT Presentation

Citation preview

Page 1: The Roofline Model: A pedagogical tool for program analysis and optimization

P A R A L L E L C O M P U T I N G L A B O R A T O R Y

EECSElectrical Engineering and

Computer Sciences

1

BERKELEY PAR LAB

The Roofline Model:A pedagogical tool for program analysis and

optimization

ParLab Summer Retreat

Samuel Williams, David Patterson

[email protected]

Page 2: The Roofline Model: A pedagogical tool for program analysis and optimization

2

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Motivation

Performance and scalability of multicore architectures can be extremely non-intuitive to novice programmers

Success of the multicore paradigm should be premised on augmenting the abilities of the world’s programmers

Page 3: The Roofline Model: A pedagogical tool for program analysis and optimization

3

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Goals & Audience

Focused on:rates and efficiencies (Gflop/s, % of peak),

Goals for Roofline: Provide everyone with a graphical aid that provides:

realistic expectations of performance and productivity

Show inherent hardware limitations for a given kernel Show potential benefit and priority of optimizations

Who’s not the audience for the Roofline: Not for those interested in fine tuning (+5%) Not for those challenged by parallel kernel correctness

Page 4: The Roofline Model: A pedagogical tool for program analysis and optimization

P A R A L L E L C O M P U T I N G L A B O R A T O R Y

EECSElectrical Engineering and

Computer Sciences

4

BERKELEY PAR LAB

Principal Components of Performance

Page 5: The Roofline Model: A pedagogical tool for program analysis and optimization

5

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Components

There are three principal components to performance: Computation Communication Locality

Each architecture has a different balance between these Each kernel has a different balance between these

Performance is a question of how well an kernel’s characteristics map to an architecture’s characteristics

Page 6: The Roofline Model: A pedagogical tool for program analysis and optimization

6

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Computation

For us, floating point performance (Gflop/s) is the metric of interest (typically double precision)

Peak in-core performance can only be attained if: fully exploit ILP, DLP, FMA, etc… non-FP instructions don’t sap instruction bandwidth threads don’t diverge (GPUs) transcendental/non pipelined instructions are used sparingly branch mispredictions are rare

To exploit a form of in-core parallelism, it must be: Inherent in the algorithm Expressed in the high level implementation Explicit in the generated code

Page 7: The Roofline Model: A pedagogical tool for program analysis and optimization

7

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Communication

For us, DRAM bandwidth (GB/s) is the metric of interest

Peak bandwidth can only be attained if certain optimizations are employed: Few unit stride streams NUMA allocation and usage SW Prefetching Memory Coalescing (GPU)

Page 8: The Roofline Model: A pedagogical tool for program analysis and optimization

8

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Locality

Computation is free, Communication is expensive. Maximize locality to minimize communication There is a lower limit to communication: compulsory traffic

Hardware changes can help minimize communication Larger cache capacities minimize capacity misses Higher cache associativities minimize conflict misses Non-allocating caches minimize compulsory traffic

Software optimization can also help minimize communication Padding avoids conflict misses Blocking avoids capacity misses Non-allocating stores minimize compulsory traffic

Page 9: The Roofline Model: A pedagogical tool for program analysis and optimization

P A R A L L E L C O M P U T I N G L A B O R A T O R Y

EECSElectrical Engineering and

Computer Sciences

9

BERKELEY PAR LAB

Roofline Model

Page 10: The Roofline Model: A pedagogical tool for program analysis and optimization

10

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Integrating Components…

Goal: integrate in-core performance, memory bandwidth, and locality into a single readily understandable performance figure

Also, must graphically show the penalty associated with not including certain software optimizations

Roofline model will be unique to each architecture Coordinates of a kernel are ~unique to each architecture

Page 11: The Roofline Model: A pedagogical tool for program analysis and optimization

11

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

What relates GB/s to GFlop/s ?

Through dimensional analysis, its clear that Flops:Bytes is the parameter that allows us to convert bandwidth (GB/s) to performance (GFlop/s)

This is a well known quantity: Arithmetic Intensity (discussed later)

When we measure total bytes, we incorporate all cache behavior (the 3C’s) and Locality

Page 12: The Roofline Model: A pedagogical tool for program analysis and optimization

12

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Basic Roofline

Performance is upper bounded by both the peak flop rate, and the product of streaming bandwidth and the flop:byte ratio

Gflop/s = min Peak Gflop/s

Stream BW * actual flop:byte ratio

Page 13: The Roofline Model: A pedagogical tool for program analysis and optimization

13

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

notes

Bandwidth #’s collected via micro benchmarks

Computation #’s derived from optimization manuals (pencil and paper)

Assume complete overlap of either communication or computation

Page 14: The Roofline Model: A pedagogical tool for program analysis and optimization

14

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for Opteron(adding ceilings)

2

1/8

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/41/2 1 2 4 8 16

AMD Opteron 2356(Barcelona)

Peak roofline performance based on manual for

single precision peak and a hand tuned stream

read for bandwidth

1

peak SP

peak

stre

am b

andw

idth

Lo

g s

cale

!L

og

sca

le !

Log scale !Log scale !

Page 15: The Roofline Model: A pedagogical tool for program analysis and optimization

15

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for Opteron(adding ceilings)

2

1/8

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/41/2 1 2 4 8 16

AMD Opteron 2356(Barcelona)

Peak roofline performance based on manual for

single precision peak and a hand tuned stream

read for bandwidth

1

peak SP

peak

stre

am b

andw

idth

Page 16: The Roofline Model: A pedagogical tool for program analysis and optimization

16

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for Opteron(adding ceilings)

2

1/8

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/41/2 1 2 4 8 16

AMD Opteron 2356(Barcelona)

Opterons have separate multipliers and adders

‘functional unit parallelism’ This is a ceiling beneth the

roofline

1

peak SP

mul / add imbalance

peak

stre

am b

andw

idth

Page 17: The Roofline Model: A pedagogical tool for program analysis and optimization

17

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for Opteron(adding ceilings)

2

1/8

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/41/2 1 2 4 8 16

AMD Opteron 2356(Barcelona)

In single precision, SIMD is 4x32b.

If only the _ss versions are used, performance is 1/4

1

peak SP

mul / add imbalance

peak

stre

am b

andw

idth

w/out SIMD

Page 18: The Roofline Model: A pedagogical tool for program analysis and optimization

18

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for Opteron(adding ceilings)

2

1/8

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/41/2 1 2 4 8 16

AMD Opteron 2356(Barcelona)

If 4 independent instructions are kept in the pipeline, performance will fall

1

peak SP

mul / add imbalance

w/out ILP

peak

stre

am b

andw

idth

w/out SIMD

Page 19: The Roofline Model: A pedagogical tool for program analysis and optimization

19

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for Opteron(adding ceilings)

2

1/8

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/41/2 1 2 4 8 16

AMD Opteron 2356(Barcelona)

If SW prefetching is not used, performance will degrade

These act as ceilings below the bandwidth roofline

1

peak SP

mul / add imbalance

w/out ILP

peak

stre

am b

andw

idth

w/out

SW

pre

fetch

ing

w/out SIMD

Page 20: The Roofline Model: A pedagogical tool for program analysis and optimization

20

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for Opteron(adding ceilings)

2

1/8

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/41/2 1 2 4 8 16

AMD Opteron 2356(Barcelona)

Without NUMA optimizations, the memory controllers on the second socket can’t be used.

1

peak SP

mul / add imbalance

w/out ILP

peak

stre

am b

andw

idth

w/out

SW

pre

fetch

ing

w/out

NUM

A opt

imiza

tions

w/out SIMD

Page 21: The Roofline Model: A pedagogical tool for program analysis and optimization

21

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for Opteron(adding ceilings)

2

1/8

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/41/2 1 2 4 8 16

AMD Opteron 2356(Barcelona)

Bandwidth is much lower without unit stride streams

1

peak SP

mul / add imbalance

w/out ILP

peak

stre

am b

andw

idth

w/out

SW

pre

fetch

ing

w/out

NUM

A opt

imiza

tions

w/out SIMD

w/out

unit

strid

e str

eam

s

Page 22: The Roofline Model: A pedagogical tool for program analysis and optimization

22

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for Opteron(adding ceilings)

2

1/8

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/41/2 1 2 4 8 16

AMD Opteron 2356(Barcelona)

Its difficult for any architecture to reach the raw DRAM bandwidth

1

peak SP

mul / add imbalance

w/out ILP

peak

stre

am b

andw

idth

w/out

SW

pre

fetch

ing

w/out

NUM

A opt

imiza

tions

w/out SIMD

w/out

unit

strid

e str

eam

s

raw D

RAM b

andw

idth

Page 23: The Roofline Model: A pedagogical tool for program analysis and optimization

23

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for Opteron(adding ceilings)

2

1/8

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/41/2 1 2 4 8 16

Partitions the regions of expected performance into three optimization regions: Compute only Memory only Compute+Memory

1

peak SPAMD Opteron 2356

(Barcelona)

mul / add imbalance

w/out ILP

peak

stre

am b

andw

idth

w/out

SW

pre

fetch

ing

w/out

NUM

A opt

imiza

tions

w/out SIMD

w/out

unit

strid

e str

eam

s

Page 24: The Roofline Model: A pedagogical tool for program analysis and optimization

24

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Uniqueness

There is no single ordering or roofline model The order of ceilings is generally (bottom up):

What is inherent in algorithm What a compiler is likely to provide What a programmer could provide What can never be exploited for this kernel

For example, FMA or mul/add balance is inherent in many linear algebra

routines and should be placed at the bottom. However, many stencils are dominated by adds, and thus the

multipliers and FMA go underutilized.

Page 25: The Roofline Model: A pedagogical tool for program analysis and optimization

25

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Arithmetic Intensity in HPC

Arithmetic Intensity (AI) ~ Total Flops / Total DRAM Bytes Some HPC kernels have an arithmetic intensity that’s constant, but

on others it scales with with problem size (increasing temporal locality)

Actual arithmetic intensity is capped by cache/local store capacity

A r i t h m e t i c I n t e n s i t y

O( N )O( log(N) )O( 1 )

SpMV, BLAS1,2

Stencils (PDEs)

Lattice Methods

FFTsDense Linear Algebra

(BLAS3)Particle Methods

Page 26: The Roofline Model: A pedagogical tool for program analysis and optimization

26

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Accurately Determining the true Flop:DRAM Byte ratio

Remember the 3C’s of caches

Calculating the Flop:DRAM byte ratio is: Compulsory misses: straightforward Capacity misses: pencil and paper (maybe performance counters) Conflict misses: must use performance counters

Flop:actual DRAM Byte ratio < Flop:compulsory DRAM Byte ratio

One might place a range on the arithmetic intensity ratio Thus performance is limited to an area between the ceilings and

between the upper (compulsory) and lower bounds on arithmetic intensity

Page 27: The Roofline Model: A pedagogical tool for program analysis and optimization

27

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for Opteron(powerpoint doodle)

2

1/8

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/41/2 1 2 4 8 16

Final Roofline

1

peak SPAMD Opteron 2356

(Barcelona)

mul / add imbalance

w/out ILP

peak

stre

am b

andw

idth

w/out

SW

pre

fetch

ing

w/out

NUM

A opt

imiza

tions

w/out SIMD

Page 28: The Roofline Model: A pedagogical tool for program analysis and optimization

28

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for Opteron(powerpoint doodle)

2

1/8

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/41/2 1 2 4 8 16

Some arbitrary kernel has a flop:compulsory byte ratio of 4

Overlaid on the roofline Defines upper bound on

range of expected performance

Also shows which optimizations are likely

1

peak SPAMD Opteron 2356

(Barcelona)

mul / add imbalance

w/out ILP

peak

stre

am b

andw

idth

w/out

SW

pre

fetch

ing

w/out

NUM

A opt

imiza

tions

w/out SIMD

compulsory m

isses

Page 29: The Roofline Model: A pedagogical tool for program analysis and optimization

29

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for Opteron(powerpoint doodle)

2

1/8

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/41/2 1 2 4 8 16

Capacity misses reduce the actual flop:byte ratio

Also reduces attainable performance

AI is unique to each combination of kernel and architecture

1

peak SPAMD Opteron 2356

(Barcelona)

mul / add imbalance

w/out ILP

peak

stre

am b

andw

idth

w/out

SW

pre

fetch

ing

w/out

NUM

A opt

imiza

tions

w/out SIMD

compulsory m

isses

capacity

Page 30: The Roofline Model: A pedagogical tool for program analysis and optimization

30

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for Opteron(powerpoint doodle)

2

1/8

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/41/2 1 2 4 8 16

Conflict misses may destroy performance

AI is unique to each combination of kernel and architecture

1

peak SPAMD Opteron 2356

(Barcelona)

mul / add imbalance

w/out ILP

peak

stre

am b

andw

idth

w/out

SW

pre

fetch

ing

w/out

NUM

A opt

imiza

tions

w/out SIMD

compulsory m

isses

capacity

conflict

Page 31: The Roofline Model: A pedagogical tool for program analysis and optimization

31

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for Opteron(powerpoint doodle)

2

1/8

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/41/2 1 2 4 8 16

Conflict misses may destroy performance

1

peak SPAMD Opteron 2356

(Barcelona)

mul / add imbalance

w/out ILP

peak

stre

am b

andw

idth

w/out

SW

pre

fetch

ing

w/out

NUM

A opt

imiza

tions

w/out SIMD

compulsory m

isses

capacity

conflict

Doubling c

ache

capac

ity

doesn’t

double th

e

arith

met

ic in

tensi

ty!

Doubling c

ache

capac

ity

doesn’t

double th

e

arith

met

ic in

tensi

ty!

Page 32: The Roofline Model: A pedagogical tool for program analysis and optimization

P A R A L L E L C O M P U T I N G L A B O R A T O R Y

EECSElectrical Engineering and

Computer Sciences

32

BERKELEY PAR LAB

Three Categories of Software Optimization

Page 33: The Roofline Model: A pedagogical tool for program analysis and optimization

33

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Maximizing Attained in-core Performance

2

1/8

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/41/2 1 2 4 8 16

Software optimizations such as explicit SIMDization can punch through the horizontal ceilings (what can be expected from a compiler)

Other examples include loop unrolling, reordering, and long running loops

1

peak SPAMD Opteron 2356

(Barcelona)

mul / add imbalance

w/out ILP

peak

stre

am b

andw

idth

w/out

SW

pre

fetch

ing

w/out

NUM

A opt

imiza

tions

w/out SIMD

compulsory m

isses

capacity

conflict

Page 34: The Roofline Model: A pedagogical tool for program analysis and optimization

34

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Maximizing AttainedMemory Bandwidth

2

1/8

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/41/2 1 2 4 8 16

Compilers won’t give great out-of-the box bandwidth

Punch through bandwidth ceilings: Maximize MLP long unit stride accesses NUMA aware allocation and

parallelization SW prefetching

1

peak SPAMD Opteron 2356

(Barcelona)

mul / add imbalance

w/out ILP

peak

stre

am b

andw

idth

w/out

SW

pre

fetch

ing

w/out

NUM

A opt

imiza

tions

w/out SIMD

compulsory m

isses

capacity

conflict

Page 35: The Roofline Model: A pedagogical tool for program analysis and optimization

35

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Minimizing Memory Traffic

2

1/8

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/41/2 1 2 4 8 16

Use performance counters to measure flop:byte ratio (AI)

Out-of-the-box code may have an AI ratio much less than the compulsory ratio Be cognizant of cache capacities,

associativities, and threads sharing it Pad structures to avoid conflict

misses Use cache blocking to avoid capacity

misses These optimizations can be

imperative

1

peak SPAMD Opteron 2356

(Barcelona)

mul / add imbalance

w/out ILP

peak

stre

am b

andw

idth

w/out

SW

pre

fetch

ing

w/out

NUM

A opt

imiza

tions

w/out SIMD

capacity

conflict

compulsory m

isses com

pulsory misses

Page 36: The Roofline Model: A pedagogical tool for program analysis and optimization

36

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

compulsory m

isses

Effective Roofline (before)

2

1/8

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/41/2 1 2 4 8 16

Before optimization, traffic, and limited bandwidth optimization limits performance to a very narrow window

1

peak SPAMD Opteron 2356

(Barcelona)

mul / add imbalance

w/out ILP

peak

stre

am b

andw

idth

w/out

SW

pre

fetch

ing

w/out

NUM

A opt

imiza

tions

w/out SIMD

capacity

conflict

Page 37: The Roofline Model: A pedagogical tool for program analysis and optimization

37

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Effective Roofline (after)

2

1/8

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/41/2 1 2 4 8 16

After optimization, ideally, performance is significantly better

1

peak SPAMD Opteron 2356

(Barcelona)

mul / add imbalance

w/out ILP

peak

stre

am b

andw

idth

w/out

SW

pre

fetch

ing

w/out

NUM

A opt

imiza

tions

w/out SIMD

capacity

conflict

compulsory m

isses

Page 38: The Roofline Model: A pedagogical tool for program analysis and optimization

P A R A L L E L C O M P U T I N G L A B O R A T O R Y

EECSElectrical Engineering and

Computer Sciences

38

BERKELEY PAR LAB

Applicable to Other Architectural Paradigms ?

Page 39: The Roofline Model: A pedagogical tool for program analysis and optimization

39

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Four Architectures

667MHz DDR2 DIMMs667MHz DDR2 DIMMs

10.66 GB/s

2x64b memory controllers2x64b memory controllersH

yper

Tra

nsp

ort

Hyp

erT

ran

spor

tOpteronOpteron OpteronOpteron OpteronOpteron OpteronOpteron

667MHz DDR2 DIMMs667MHz DDR2 DIMMs

10.66 GB/s

2x64b memory controllers2x64b memory controllers

OpteronOpteron OpteronOpteron OpteronOpteron OpteronOpteron

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

2MB Shared quasi-victim (32 way)2MB Shared quasi-victim (32 way)

SRI / crossbarSRI / crossbar

2MB Shared quasi-victim (32 way)2MB Shared quasi-victim (32 way)

SRI / crossbarSRI / crossbarHyp

erT

ran

spor

tH

yper

Tra

nsp

ort

4G

B/s

(eac

h di

rect

ion

)

667MHz FBDIMMs 667MHz FBDIMMs

21.33 GB/s 10.66 GB/s

4MB Shared L2 (16 way)(64b interleaved)

4MB Shared L2 (16 way)(64b interleaved)

4 Coherency Hubs4 Coherency Hubs

2x128b controllers2x128b controllers

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

CrossbarCrossbar

179 GB/s 90 GB/s

667MHz FBDIMMs 667MHz FBDIMMs

21.33 GB/s 10.66 GB/s

4MB Shared L2 (16 way)(64b interleaved)

4MB Shared L2 (16 way)(64b interleaved)

4 Coherency Hubs4 Coherency Hubs

2x128b controllers2x128b controllers

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

CrossbarCrossbar

179 GB/s 90 GB/s

8 x

6.4

GB

/s(1

pe

r h

ub p

er

dire

ctio

n)

inte

rcon

nect

inte

rcon

nect

86.4 GB/s

768MB 900MHz GDDR3 Device DRAM768MB 900MHz GDDR3 Device DRAM

ThreadCluster

ThreadCluster

ThreadCluster

ThreadCluster

ThreadCluster

ThreadCluster

ThreadCluster

ThreadCluster

192KB L2 (Textures only)192KB L2 (Textures only)

24 ROPs24 ROPs

6 x 64b memory controllers6 x 64b memory controllers

BIFBIF

512MB XDR DRAM512MB XDR DRAM

25.6 GB/s

EIB (ring network)EIB (ring network)

XDR memory controllersXDR memory controllers

VMTPPEVMTPPE

512KL2

512KL2

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC B

IFBIF

512MB XDR DRAM512MB XDR DRAM

25.6 GB/s

EIB (ring network)EIB (ring network)

XDR memory controllersXDR memory controllers

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

VMTPPEVMTPPE

512KL2

512KL2

<20

GB

/s(e

ach

dire

ctio

n)

Sun Victoria FallsAMD Barcelona

NVIDIA G80IBM Cell Blade

Page 40: The Roofline Model: A pedagogical tool for program analysis and optimization

40

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Four Architectures

667MHz DDR2 DIMMs667MHz DDR2 DIMMs

10.66 GB/s

2x64b memory controllers2x64b memory controllersH

yper

Tra

nsp

ort

Hyp

erT

ran

spor

tOpteronOpteron OpteronOpteron OpteronOpteron OpteronOpteron

667MHz DDR2 DIMMs667MHz DDR2 DIMMs

10.66 GB/s

2x64b memory controllers2x64b memory controllers

OpteronOpteron OpteronOpteron OpteronOpteron OpteronOpteron

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

2MB Shared quasi-victim (32 way)2MB Shared quasi-victim (32 way)

SRI / crossbarSRI / crossbar

2MB Shared quasi-victim (32 way)2MB Shared quasi-victim (32 way)

SRI / crossbarSRI / crossbarHyp

erT

ran

spor

tH

yper

Tra

nsp

ort

4G

B/s

(eac

h di

rect

ion

)

667MHz FBDIMMs 667MHz FBDIMMs

21.33 GB/s 10.66 GB/s

4MB Shared L2 (16 way)(64b interleaved)

4MB Shared L2 (16 way)(64b interleaved)

4 Coherency Hubs4 Coherency Hubs

2x128b controllers2x128b controllers

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

CrossbarCrossbar

179 GB/s 90 GB/s

667MHz FBDIMMs 667MHz FBDIMMs

21.33 GB/s 10.66 GB/s

4MB Shared L2 (16 way)(64b interleaved)

4MB Shared L2 (16 way)(64b interleaved)

4 Coherency Hubs4 Coherency Hubs

2x128b controllers2x128b controllers

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

CrossbarCrossbar

179 GB/s 90 GB/s

8 x

6.4

GB

/s(1

pe

r h

ub p

er

dire

ctio

n)

inte

rcon

nect

inte

rcon

nect

86.4 GB/s

768MB 900MHz GDDR3 Device DRAM768MB 900MHz GDDR3 Device DRAM

ThreadCluster

ThreadCluster

ThreadCluster

ThreadCluster

ThreadCluster

ThreadCluster

ThreadCluster

ThreadCluster

192KB L2 (Textures only)192KB L2 (Textures only)

24 ROPs24 ROPs

6 x 64b memory controllers6 x 64b memory controllers

BIFBIF

512MB XDR DRAM512MB XDR DRAM

25.6 GB/s

EIB (ring network)EIB (ring network)

XDR memory controllersXDR memory controllers

VMTPPEVMTPPE

512KL2

512KL2

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC B

IFBIF

512MB XDR DRAM512MB XDR DRAM

25.6 GB/s

EIB (ring network)EIB (ring network)

XDR memory controllersXDR memory controllers

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

VMTPPEVMTPPE

512KL2

512KL2

<20

GB

/s(e

ach

dire

ctio

n)

Sun Victoria FallsAMD Barcelona

NVIDIA G80IBM Cell Blade

One

Thr

ead

/ Cor

e

One

Thr

ead

/ Cor

e

Mul

tithr

eade

d C

ores

Mul

tithr

eade

d C

ores

Page 41: The Roofline Model: A pedagogical tool for program analysis and optimization

41

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Four Architectures

667MHz DDR2 DIMMs667MHz DDR2 DIMMs

10.66 GB/s

2x64b memory controllers2x64b memory controllersH

yper

Tra

nsp

ort

Hyp

erT

ran

spor

tOpteronOpteron OpteronOpteron OpteronOpteron OpteronOpteron

667MHz DDR2 DIMMs667MHz DDR2 DIMMs

10.66 GB/s

2x64b memory controllers2x64b memory controllers

OpteronOpteron OpteronOpteron OpteronOpteron OpteronOpteron

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

2MB Shared quasi-victim (32 way)2MB Shared quasi-victim (32 way)

SRI / crossbarSRI / crossbar

2MB Shared quasi-victim (32 way)2MB Shared quasi-victim (32 way)

SRI / crossbarSRI / crossbarHyp

erT

ran

spor

tH

yper

Tra

nsp

ort

4G

B/s

(eac

h di

rect

ion

)

667MHz FBDIMMs 667MHz FBDIMMs

21.33 GB/s 10.66 GB/s

4MB Shared L2 (16 way)(64b interleaved)

4MB Shared L2 (16 way)(64b interleaved)

4 Coherency Hubs4 Coherency Hubs

2x128b controllers2x128b controllers

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

CrossbarCrossbar

179 GB/s 90 GB/s

667MHz FBDIMMs 667MHz FBDIMMs

21.33 GB/s 10.66 GB/s

4MB Shared L2 (16 way)(64b interleaved)

4MB Shared L2 (16 way)(64b interleaved)

4 Coherency Hubs4 Coherency Hubs

2x128b controllers2x128b controllers

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

CrossbarCrossbar

179 GB/s 90 GB/s

8 x

6.4

GB

/s(1

pe

r h

ub p

er

dire

ctio

n)

inte

rcon

nect

inte

rcon

nect

86.4 GB/s

768MB 900MHz GDDR3 Device DRAM768MB 900MHz GDDR3 Device DRAM

ThreadCluster

ThreadCluster

ThreadCluster

ThreadCluster

ThreadCluster

ThreadCluster

ThreadCluster

ThreadCluster

192KB L2 (Textures only)192KB L2 (Textures only)

24 ROPs24 ROPs

6 x 64b memory controllers6 x 64b memory controllers

BIFBIF

512MB XDR DRAM512MB XDR DRAM

25.6 GB/s

EIB (ring network)EIB (ring network)

XDR memory controllersXDR memory controllers

VMTPPEVMTPPE

512KL2

512KL2

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC B

IFBIF

512MB XDR DRAM512MB XDR DRAM

25.6 GB/s

EIB (ring network)EIB (ring network)

XDR memory controllersXDR memory controllers

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

VMTPPEVMTPPE

512KL2

512KL2

<20

GB

/s(e

ach

dire

ctio

n)

Sun Victoria FallsAMD Barcelona

NVIDIA G80IBM Cell Blade

Cache-basedCache-based

Local Store-based

Local Store-based

Page 42: The Roofline Model: A pedagogical tool for program analysis and optimization

42

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

32b Rooflines for the Four(in-core parallelism)

1/8

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s (

32

b)

1/41/2 1 2 4 8 16

1/8

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s (

32

b)

1/41/2 1 2 4 8 16 1/8

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s (

32

b)

1/41/2 1 2 4 8 16

Sun Victoria FallsAMD Barcelona

NVIDIA G80

Single Precision Roofline models for the SMPs used in this work.

Based on micro-benchmarks, experience, and manuals

Ceilings =

in-core parallelism Can the compiler find all this

parallelism ? NOTE:

log-log scale Assumes perfect SPMD

41/8

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s (

32

b)

8

16

32

64

128

256

1/41/2 1 2 4 8 16

IBM Cell Blade

4

8

16

32

64

128

256

4

8

16

32

64

128

256

4

8

16

32

64

128

256

512

512

512

512

peak SP

mul / add imbalance

w/out SIMD

w/out ILP

peak SP

peak SP

w/out FMA

w/out SIMD

w/out ILP

peak SP

w/out FMA

w/out

mem

ory c

oales

cing

w/out

NUM

A

w/out

SW

pre

fetc

h

w/out

NUM

A

w/out

SW

pre

fetc

h

w/out

NUM

A

w/out

DM

A con

curre

ncy

Page 43: The Roofline Model: A pedagogical tool for program analysis and optimization

43

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

32b Rooflines for the Four(diverged threads)

1/8

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s (

32

b)

1/41/2 1 2 4 8 16

1/8

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s (

32

b)

1/41/2 1 2 4 8 16 1/8

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s (

32

b)

1/41/2 1 2 4 8 16

Sun Victoria FallsAMD Barcelona

NVIDIA G80

G80 dynamically finds DLP (shared instruction fetch)

SIMT If threads of a warp diverge

from SIMD execution, performance is limited by instruction issue bandwidth

Ceilings on G80 =

number of unique PCs when threads diverge

41/8

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s (

32

b)

8

16

32

64

128

256

1/41/2 1 2 4 8 16

IBM Cell Blade

4

8

16

32

64

128

256

4

8

16

32

64

128

256

4

8

16

32

64

128

256

512

512

512

512

peak SP

mul / add imbalance

w/out SIMD

w/out ILP

peak SP

peak SP

w/out FMA

w/out SIMD

w/out ILP

w/out

NUM

A

w/out

SW

pre

fetc

h

w/out

NUM

A

w/out

SW

pre

fetc

h

w/out

NUM

A

w/out

DM

A con

curre

ncy

peak SP

4 PCs

w/out

mem

ory c

oales

cing8 PCs

16 PCs

32 PCs

Page 44: The Roofline Model: A pedagogical tool for program analysis and optimization

44

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

32b Rooflines for the Four(FP fraction of dynamic instructions)

1/8

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s (

32

b)

1/41/2 1 2 4 8 16

1/8

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s (

32

b)

1/41/2 1 2 4 8 16 1/8

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s (

32

b)

1/41/2 1 2 4 8 16

Sun Victoria FallsAMD Barcelona

NVIDIA G80

Some kernels have large numbers of non FP instructions

Saps instruction issue bandwidth

Ceilings = FP fraction of dynamic instruction mix

NOTE: Assumes perfect in-core

parallelism

41/8

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s (

32

b)

8

16

32

64

128

256

1/41/2 1 2 4 8 16

IBM Cell Blade

4

8

16

32

64

128

256

4

8

16

32

64

128

256

4

8

16

32

64

128

256

512

512

512

512

w/out

SW

pre

fetc

h

w/out

SW

pre

fetc

h

peak SP

w/out

NUM

A

FP = 25%

FP = 12%

FP = 6%

peak SP

w/out

NUM

A

w/out

DM

A con

curre

ncy

FP = 25%

FP = 12%

FP = 6%

peak SP

w/out

NUM

A FP = 25%

FP = 12%

peak SP

w/out

mem

ory c

oales

cing

FP = 50%

FP = 25%

FP = 12%

FP = 6%

Page 45: The Roofline Model: A pedagogical tool for program analysis and optimization

45

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

32b Rooflines for the Four(ridge point)

1/8

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s (

32

b)

1/41/2 1 2 4 8 16

1/8

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s (

32

b)

1/41/2 1 2 4 8 16 1/8

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s (

32

b)

1/41/2 1 2 4 8 16

Sun Victoria FallsAMD Barcelona

NVIDIA G80

Some architectures have drastically different

ridge points VF may be compute bound

on many kernels Clovertown has 1/3 the BW

of Barcelona = ridge point to the right

41/8

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s (

32

b)

8

16

32

64

128

256

1/41/2 1 2 4 8 16

IBM Cell Blade

4

8

16

32

64

128

256

4

8

16

32

64

128

256

4

8

16

32

64

128

256

512

512

512

512

w/out

SW

pre

fetc

h

w/out

SW

pre

fetc

h

peak SP

w/out

NUM

A

FP = 25%

FP = 12%

FP = 6%

peak SP

w/out

NUM

A

w/out

DM

A con

curre

ncy

FP = 25%

FP = 12%

FP = 6%

peak SP

w/out

NUM

A FP = 25%

FP = 12%

peak SP

w/out

mem

ory c

oales

cing

FP = 50%

FP = 25%

FP = 12%

FP = 6%

Page 46: The Roofline Model: A pedagogical tool for program analysis and optimization

P A R A L L E L C O M P U T I N G L A B O R A T O R Y

EECSElectrical Engineering and

Computer Sciences

46

BERKELEY PAR LAB

Using Roofline when Auto-tuningHPC Kernels

Page 47: The Roofline Model: A pedagogical tool for program analysis and optimization

47

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Multicore SMPs Used

667MHz DDR2 DIMMs667MHz DDR2 DIMMs

10.66 GB/s

2x64b memory controllers2x64b memory controllers

Hyp

erT

ran

spor

tH

yper

Tra

nsp

ortOpteronOpteron OpteronOpteron OpteronOpteron OpteronOpteron

667MHz DDR2 DIMMs667MHz DDR2 DIMMs

10.66 GB/s

2x64b memory controllers2x64b memory controllers

OpteronOpteron OpteronOpteron OpteronOpteron OpteronOpteron

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

2MB Shared quasi-victim (32 way)2MB Shared quasi-victim (32 way)

SRI / crossbarSRI / crossbar

2MB Shared quasi-victim (32 way)2MB Shared quasi-victim (32 way)

SRI / crossbarSRI / crossbarHyp

erT

ran

spor

tH

yper

Tra

nsp

ort

4G

B/s

(eac

h di

rect

ion

)

667MHz FBDIMMs 667MHz FBDIMMs

21.33 GB/s 10.66 GB/s

4MB Shared L2 (16 way)(64b interleaved)

4MB Shared L2 (16 way)(64b interleaved)

4 Coherency Hubs4 Coherency Hubs

2x128b controllers2x128b controllers

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

CrossbarCrossbar

179 GB/s 90 GB/s

667MHz FBDIMMs 667MHz FBDIMMs

21.33 GB/s 10.66 GB/s

4MB Shared L2 (16 way)(64b interleaved)

4MB Shared L2 (16 way)(64b interleaved)

4 Coherency Hubs4 Coherency Hubs

2x128b controllers2x128b controllers

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

MT

SP

AR

CM

T S

PA

RC

CrossbarCrossbar

179 GB/s 90 GB/s

8 x

6.4

GB

/s(1

pe

r h

ub p

er

dire

ctio

n)

BIFBIF

512MB XDR DRAM512MB XDR DRAM

25.6 GB/s

EIB (ring network)EIB (ring network)

XDR memory controllersXDR memory controllers

VMTPPEVMTPPE

512KL2

512KL2

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC B

IFBIF

512MB XDR DRAM512MB XDR DRAM

25.6 GB/s

EIB (ring network)EIB (ring network)

XDR memory controllersXDR memory controllers

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

SP

ES

PE

256

K2

56K

MF

CM

FC

VMTPPEVMTPPE

512KL2

512KL2

<20

GB

/s(e

ach

dire

ctio

n)

AMD Opteron 2356 (Barcelona)Intel Xeon E5345 (Clovertown)

IBM QS20 Cell BladeSun T2+ T5140 (Victoria Falls)

667MHz FBDIMMs667MHz FBDIMMs

Chipset (4x64b controllers)Chipset (4x64b controllers)

10.66 GB/s(write)21.33 GB/s(read)

10.66 GB/s

CoreCore

FSB

CoreCore CoreCore CoreCore

10.66 GB/s

CoreCore

FSB

CoreCore CoreCore CoreCore

4MBshared L2

4MBshared L2

4MBshared L2

4MBshared L2

4MBshared L2

4MBshared L2

4MBshared L2

4MBshared L2

Page 48: The Roofline Model: A pedagogical tool for program analysis and optimization

P A R A L L E L C O M P U T I N G L A B O R A T O R Y

EECSElectrical Engineering and

Computer Sciences

48

BERKELEY PAR LAB

Sparse Matrix-Vector Multiplication (SpMV)

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.

Page 49: The Roofline Model: A pedagogical tool for program analysis and optimization

49

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Sparse MatrixVector Multiplication

Sparse Matrix Most entries are 0.0 Performance advantage in only

storing/operating on the nonzeros Requires significant meta data

Evaluate y=Ax A is a sparse matrix x & y are dense vectors

Challenges Difficult to exploit ILP(bad for superscalar), Difficult to exploit DLP(bad for SIMD) Irregular memory access to source vector Difficult to load balance Very low arithmetic intensity (often <0.166 flops/byte)

= likely memory bound

A x y

Page 50: The Roofline Model: A pedagogical tool for program analysis and optimization

50

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for SpMV

Double precision roofline models

FMA is inherent in SpMV (place at bottom)

Clovertown

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

Cell BladeVictoria Falls

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

w/out SIMD

peak DP

w/out ILP

w/out FMA

w/out

NUM

Aun

align

ed D

MA

25% FP

peak DP

12% FP

w/out

SW

pre

fetc

h

w/out

NUM

A

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

fits w

ithin

snoo

p filt

er

Barcelonapeak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out

SW

pre

fetc

h

w/out

NUM

A

No naïve Cellimplementation

Page 51: The Roofline Model: A pedagogical tool for program analysis and optimization

51

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for SpMV

Two unit stride streams Inherent FMA No ILP No DLP FP is 12-25% Naïve compulsory

flop:byte < 0.166

Clovertown

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

Cell BladeVictoria Falls

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

w/out SIMD

peak DP

w/out ILP

w/out FMA

w/out

NUM

Aun

align

ed D

MA

25% FP

peak DP

12% FP

w/out

SW

pre

fetc

h

w/out

NUM

A

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

fits w

ithin

snoo

p filt

er

Barcelonapeak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out

SW

pre

fetc

h

w/out

NUM

A

No naïve Cellimplementation

Page 52: The Roofline Model: A pedagogical tool for program analysis and optimization

52

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for SpMV(out-of-the-box parallel)

Two unit stride streams Inherent FMA No ILP No DLP FP is 12-25% Naïve compulsory

flop:byte < 0.166 For simplicity: dense

matrix in sparse format

Clovertown

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

Cell BladeVictoria Falls

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

w/out SIMD

peak DP

w/out ILP

w/out FMA

w/out

NUM

Aun

align

ed D

MA

25% FP

peak DP

12% FP

w/out

SW

pre

fetc

h

w/out

NUM

A

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

fits w

ithin

snoo

p filt

er

Barcelonapeak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out

SW

pre

fetc

h

w/out

NUM

A

No naïve Cellimplementation

Page 53: The Roofline Model: A pedagogical tool for program analysis and optimization

53

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for SpMV(NUMA & SW prefetch)

compulsory flop:byte ~ 0.166

utilize all memory channels

Clovertown

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

Cell BladeVictoria Falls

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

w/out SIMD

peak DP

w/out ILP

w/out FMA

w/out

NUM

Aun

align

ed D

MA

25% FP

peak DP

12% FP

w/out

SW

pre

fetc

h

w/out

NUM

A

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

fits w

ithin

snoo

p filt

er

Barcelonapeak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out

SW

pre

fetc

h

w/out

NUM

A

No naïve Cellimplementation

Page 54: The Roofline Model: A pedagogical tool for program analysis and optimization

54

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for SpMV(matrix compression)

Inherent FMA Register blocking improves

ILP, DLP, flop:byte ratio, and FP% of instructions

Clovertown

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

Cell BladeVictoria Falls

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

w/out SIMD

peak DP

w/out ILP

w/out FMA

w/out

NUM

Aun

align

ed D

MA

25% FP

peak DP

12% FP

w/out

SW

pre

fetc

h

w/out

NUM

A

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

fits w

ithin

snoo

p filt

er

Barcelonapeak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out

SW

pre

fetc

h

w/out

NUM

A

Page 55: The Roofline Model: A pedagogical tool for program analysis and optimization

P A R A L L E L C O M P U T I N G L A B O R A T O R Y

EECSElectrical Engineering and

Computer Sciences

55

BERKELEY PAR LAB

Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)

Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008.

Best Paper, Application Track

Page 56: The Roofline Model: A pedagogical tool for program analysis and optimization

56

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

LBMHD

Plasma turbulence simulation via Lattice Boltzmann Method Two distributions:

momentum distribution (27 scalar components) magnetic distribution (15 vector components)

Three macroscopic quantities: Density Momentum (vector) Magnetic Field (vector)

Must read 73 doubles, and update 79 doubles per point in space Requires about 1300 floating point operations per point in space Just over 1.0 flops/byte (ideal)

momentum distribution

14

4

13

16

5

8

9

21

12

+Y

2

25

1

3

24

23

22

26

0

18

6

17

19

7

10

11

20

15

+Z

+X

magnetic distribution

14

13

16

21

12

25

24

23

22

26

18

17

19

20

15

+Y

+Z

+X

macroscopic variables

+Y

+Z

+X

Page 57: The Roofline Model: A pedagogical tool for program analysis and optimization

57

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for LBMHD

Huge datasets NUMA allocation/access Little ILP No DLP Far more adds than

multiplies (imbalance) Essentially random

access to memory Flop:byte ratio ~0.7 High conflict misses

Clovertown

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

Cell BladeVictoria Falls

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

25% FP

peak DP

12% FP

w/out

SW

pre

fetc

h

w/out

NUM

A

Barcelona

w/out FMA

peak DP

w/out ILP

w/out SIMD

w/out

NUM

Aun

align

ed D

MA

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

fits w

ithin

snoo

p filt

er

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out

SW

pre

fetc

h

w/out

NUM

A

No naïve Cellimplementation

Page 58: The Roofline Model: A pedagogical tool for program analysis and optimization

58

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for LBMHD(out-of-the-box code)

Huge datasets NUMA allocation/access Little ILP No DLP Far more adds than multiplies

(imbalance) Essentially random access to

memory Flop:byte ratio ~0.7 High conflict misses Peak VF performance with 64

threads (our of 128) - high conflict misses

Clovertown

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

Cell BladeVictoria Falls

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

25% FP

peak DP

12% FP

w/out

SW

pre

fetc

h

w/out

NUM

A

Barcelona

w/out FMA

peak DP

w/out ILP

w/out SIMD

w/out

NUM

Aun

align

ed D

MA

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

fits w

ithin

snoo

p filt

er

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out

SW

pre

fetc

h

w/out

NUM

A

No naïve Cellimplementation

Page 59: The Roofline Model: A pedagogical tool for program analysis and optimization

59

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for LBMHD(Padding, Vectorization, Unrolling, Reordering)

Vectorize the code to eliminate TLB capacity misses

Ensures unit stride access (bottom bandwidth ceiling)

Tune for optimal VL Clovertown pinned to

lower BW ceiling

Clovertown

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

Cell BladeVictoria Falls

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

25% FP

peak DP

12% FP

w/out

SW

pre

fetc

h

w/out

NUM

A

Barcelona

w/out FMA

peak DP

w/out ILP

w/out SIMD

w/out

NUM

Aun

align

ed D

MA

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

fits w

ithin

snoo

p filt

er

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out

SW

pre

fetc

h

w/out

NUM

A

No naïve Cellimplementation

Page 60: The Roofline Model: A pedagogical tool for program analysis and optimization

60

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for LBMHD(SIMDization + cache bypass)

Make SIMDization explicit Technically, this swaps ILP

and SIMD ceilings Use cache bypass

instruction: movntpd Increases flop:byte ratio to

~1.0 on x86/Cell

Clovertown

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

Cell BladeVictoria Falls

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

25% FP

peak DP

12% FP

w/out

SW

pre

fetc

h

w/out

NUM

A

Barcelona

w/out FMA

peak DP

w/out ILP

w/out SIMD

w/out

NUM

Aun

align

ed D

MA

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

fits w

ithin

snoo

p filt

er

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out

SW

pre

fetc

h

w/out

NUM

A

Page 61: The Roofline Model: A pedagogical tool for program analysis and optimization

P A R A L L E L C O M P U T I N G L A B O R A T O R Y

EECSElectrical Engineering and

Computer Sciences

61

BERKELEY PAR LAB

The Heat Equation Stencil

Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, Katherine Yelick, “Stencil Computation Optimization and Autotuning on State-of-the-Art Multicore Architecture”, submitted to Supercomputing (SC), 2008.

Page 62: The Roofline Model: A pedagogical tool for program analysis and optimization

62

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

The Heat Equation Stencil

PDE grid

+Y

+Z

+X

stencil for heat equation PDE

y+1

y-1

x-1

z-1

z+1

x+1x,y,z

Explicit Heat equation on a regular grid Jacobi One double per point in space 7-point nearest neighbor stencil Must:

read every point from DRAM perform 8 flops (linear combination) write every point back to DRAM

Just over 0.5 flops/byte (ideal) Cache locality is important

Page 63: The Roofline Model: A pedagogical tool for program analysis and optimization

63

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for Stencil(out-of-the-box code)

Large datasets 2 unit stride streams No NUMA Little ILP No DLP Far more adds than multiplies

(imbalance) Ideal flop:byte ratio 1/3

High locality requirements Capacity and conflict misses will

severely impair flop:byte ratio

Clovertown

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

Cell BladeVictoria Falls

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

25% FP

peak DP

12% FP

w/out

SW

pre

fetc

h

w/out

NUM

A

Barcelona

w/out FMA

peak DP

w/out ILP

w/out SIMD

w/out

NUM

Aun

align

ed D

MA

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

fits w

ithin

snoo

p filt

er

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out

SW

pre

fetc

h

w/out

NUM

A

No naïve Cellimplementation

Page 64: The Roofline Model: A pedagogical tool for program analysis and optimization

64

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for Stencil(out-of-the-box code)

Large datasets 2 unit stride streams No NUMA Little ILP No DLP Far more adds than multiplies

(imbalance) Ideal flop:byte ratio 1/3

High locality requirements Capacity and conflict misses will

severely impair flop:byte ratio

Clovertown

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

Cell BladeVictoria Falls

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

25% FP

peak DP

12% FP

w/out

SW

pre

fetc

h

w/out

NUM

A

Barcelona

w/out FMA

peak DP

w/out ILP

w/out SIMD

w/out

NUM

Aun

align

ed D

MA

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

fits w

ithin

snoo

p filt

er

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out

SW

pre

fetc

h

w/out

NUM

A

No naïve Cellimplementation

Page 65: The Roofline Model: A pedagogical tool for program analysis and optimization

65

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for Stencil(NUMA, cache blocking, unrolling, prefetch, …)

Cache blocking helps ensure flop:byte ratio is as close as possible to 1/3

Clovertown has huge caches but is pinned to lower BW ceiling

Cache management is essential when capacity/thread is low

Clovertown

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

Cell BladeVictoria Falls

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

25% FP

peak DP

12% FP

w/out

SW

pre

fetc

h

w/out

NUM

A

Barcelona

w/out FMA

peak DP

w/out ILP

w/out SIMD

w/out

NUM

Aun

align

ed D

MA

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

fits w

ithin

snoo

p filt

er

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out

SW

pre

fetc

h

w/out

NUM

A

No naïve Cellimplementation

Page 66: The Roofline Model: A pedagogical tool for program analysis and optimization

66

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for Stencil(SIMDization + cache bypass)

Make SIMDization explicit Technically, this swaps ILP

and SIMD ceilings Use cache bypass

instruction: movntpd Increases flop:byte ratio to

~0.5 on x86/Cell

Clovertown

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

Cell BladeVictoria Falls

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

25% FP

peak DP

12% FP

w/out

SW

pre

fetc

h

w/out

NUM

A

Barcelona

w/out FMA

peak DP

w/out ILP

w/out SIMD

w/out

NUM

Aun

align

ed D

MA

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

fits w

ithin

snoo

p filt

er

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out

SW

pre

fetc

h

w/out

NUM

A

Page 67: The Roofline Model: A pedagogical tool for program analysis and optimization

P A R A L L E L C O M P U T I N G L A B O R A T O R Y

EECSElectrical Engineering and

Computer Sciences

67

BERKELEY PAR LAB

Refining the Roofline

Page 68: The Roofline Model: A pedagogical tool for program analysis and optimization

68

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Other performance metrics

There is no reason either floating point (Gflop/s) must be the performance metric

Could also use: Graphics (Pixels, Vertices, Textures) Crypto Integer Bitwise etc…

Page 69: The Roofline Model: A pedagogical tool for program analysis and optimization

69

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

For our kernels, DRAM bandwidth is the key communication component.

For other kernels, other bandwidths might be more appropriate L2 bandwidth (e.g. DGEMM) PCIe bandwidth (offload to GPU) Network bandwidth

The example below shows zero overhead double buffered transfers to/from a GPU over PCIe x16 How bad is a SP stencil ? What about SGEMM ?

No overlap / high overhead tends

to smooth performance Performance is half at ridge point

Other bandwidths

1

flop:PCIe byte ratio

att

ain

ab

le G

flop

/s (

32

b)

2 4 8 16 32 64 128

NVIDIA G80

4

8

16

32

64

128

256

512 peak SP

FP = 50%

FP = 25%

FP = 12%

FP = 6%

full d

uplex

half d

uplex

Page 70: The Roofline Model: A pedagogical tool for program analysis and optimization

70

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Mix and match

In general, you can mix and match as the kernel/architecture requires:

e.g. all posibilities is the cross product of performance metrics with bandwidths

{Gflop/s, GIPS, crypto, …} {L2, DRAM, PCIe, Network}

Page 71: The Roofline Model: A pedagogical tool for program analysis and optimization

71

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Conclusions

The Roofline Model provides an intuitive graph for kernel analysis and optimization

Easily extendable to other architectural paradigms

Easily extendable to other communication or computation metrics

Page 72: The Roofline Model: A pedagogical tool for program analysis and optimization

P A R A L L E L C O M P U T I N G L A B O R A T O R Y

EECSElectrical Engineering and

Computer Sciences

72

BERKELEY PAR LAB

Questions ?

Page 73: The Roofline Model: A pedagogical tool for program analysis and optimization

P A R A L L E L C O M P U T I N G L A B O R A T O R Y

EECSElectrical Engineering and

Computer Sciences

73

BERKELEY PAR LAB

BACKUP SLIDES

Page 74: The Roofline Model: A pedagogical tool for program analysis and optimization

74

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Trends

ILP is decreasing (shorter pipelines, multithreading)

SIMD is becoming wider

Bandwidth isn’t keeping up with #cores

Application flop:byte is decreasing

Cache/Local Store management is becoming critical1.6%

1/8

flop:DRAM byte ratio

% o

f pea

k G

flop/

s

3.1%

6.4%

12.5%

25.0%

50.0%

100%

1/41/2 1 2 4 8 16

1

peak SP

mul / add imbalance

w/out SIMD

peak

stre

am b

andw

idth

w/out

mem

ory o

ptim

izatio

ns

w/out ILP

Page 75: The Roofline Model: A pedagogical tool for program analysis and optimization

75

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Trends

ILP is decreasing (shorter pipelines, multithreading)

SIMD is becoming wider

Bandwidth isn’t keeping up with #cores

Application flop:byte is decreasing

Cache/Local Store management is becoming critical1.6%

1/8

flop:DRAM byte ratio

% o

f pea

k G

flop/

s

3.1%

6.4%

12.5%

25.0%

50.0%

100%

1/41/2 1 2 4 8 16

1

peak SP

mul / add imbalance

w/out SIMD

peak

stre

am b

andw

idth

w/out

mem

ory o

ptim

izatio

ns

w/out ILP

Page 76: The Roofline Model: A pedagogical tool for program analysis and optimization

76

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Trends

ILP is decreasing (shorter pipelines, multithreading)

SIMD is becoming wider

Bandwidth isn’t keeping up with #cores

Application flop:byte is decreasing

Cache/Local Store management is becoming critical1.6%

1/8

flop:DRAM byte ratio

% o

f pea

k G

flop/

s

3.1%

6.4%

12.5%

25.0%

50.0%

100%

1/41/2 1 2 4 8 16

1

peak SP

mul / add imbalance

w/out SIMD

peak

stre

am b

andw

idth

w/out

mem

ory o

ptim

izatio

ns

w/out ILP

Page 77: The Roofline Model: A pedagogical tool for program analysis and optimization

77

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Trends

ILP is decreasing (shorter pipelines, multithreading)

SIMD is becoming wider

Bandwidth isn’t keeping up with #cores

Application flop:byte is decreasing

Cache/Local Store management is becoming critical1.6%

1/8

flop:DRAM byte ratio

% o

f pea

k G

flop/

s

3.1%

6.4%

12.5%

25.0%

50.0%

100%

1/41/2 1 2 4 8 16

1

peak SP

mul / add imbalance

w/out SIMD

peak

stre

am b

andw

idth

w/out

mem

ory o

ptim

izatio

ns

w/out ILP

Page 78: The Roofline Model: A pedagogical tool for program analysis and optimization

78

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Trends

ILP is decreasing (shorter pipelines, multithreading)

SIMD is becoming wider

Bandwidth isn’t keeping up with #cores

Application flop:byte is decreasing

Cache/Local Store management is becoming critical1.6%

1/8

flop:DRAM byte ratio

% o

f pea

k G

flop/

s

3.1%

6.4%

12.5%

25.0%

50.0%

100%

1/41/2 1 2 4 8 16

1

peak SP

mul / add imbalance

w/out SIMD

peak

stre

am b

andw

idth

w/out

mem

ory o

ptim

izatio

ns

w/out ILP

Page 79: The Roofline Model: A pedagogical tool for program analysis and optimization

79

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Alternate View (Barcelona)

2

1/2

attainable GB/s

atta

inab

le G

flop/

s (3

2b)

4

8

16

32

64

128

1 2 4 8 16 32 64

Same formalism can be plotted on different axis

Difficult to plot AI

1

peak SP

mul / add imbalance

w/out ILP

peak stream bandw

idth

w/out S

W prefetching

w/out N

UM

A optim

izations

w/out SIMD

w/out unit stride stream

s

Page 80: The Roofline Model: A pedagogical tool for program analysis and optimization

80

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

No Overlap

What if computation or communication isn’t totally overlapped At ridgepoint 50% of the time is spent in each, so performance is cut

in half In effect, the curves are smoothed Common for bulk synchonous MPI communication, atypical for

DRAM access on modern architectures

Page 81: The Roofline Model: A pedagogical tool for program analysis and optimization

81

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for Opteron(non-overlapped)

AMD Opteron (rev.F) Not typical of multi-thread/core architectures

Not typical of architectures with ooo or HW prefetchers

More common in network accesses

0.5

1.0

1/8

flop:DRAM byte ratio

atta

inab

le G

flop/

s

2.0

4.0

8.0

16.0

32.0

64.0

1/41/2 1 2 4 8 16

peak DP

mul / add imbalance

w/outILP or SIMDpe

ak st

ream

ban

dwidt

h

w/out

SW

pre

fetch

ing

w/out

NUM

A opt

imiza

tions

Page 82: The Roofline Model: A pedagogical tool for program analysis and optimization

82

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Auto-tuned Performance(+Cell/SPE version)

Wrote a double precision Cell/SPE version

DMA, local store blocked, NUMA aware, etc…

Only 2x1 and larger BCOO Only the SpMV-proper routine

changed

About 12x faster (median) than using the PPEs alone.

0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

9.0

10.0

11.0

12.0

DenseProteinFEM-SphrFEM-Cant

Tunnel

FEM-Harbor

QCD

FEM-Ship

EconEpidem

FEM-Accel

CircuitWebbase

LP

Median

GFlop/s

0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

9.0

10.0

11.0

12.0

DenseProteinFEM-SphrFEM-Cant

Tunnel

FEM-Harbor

QCD

FEM-Ship

EconEpidem

FEM-Accel

CircuitWebbase

LP

Median

GFlop/s

0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

9.0

10.0

11.0

12.0

DenseProteinFEM-SphrFEM-Cant

Tunnel

FEM-Harbor

QCD

FEM-Ship

EconEpidem

FEM-Accel

CircuitWebbase

LP

Median

GFlop/s

IBM Cell Blade (SPEs)Sun Niagara2 (Huron)

AMD Opteron (rev.F)Intel Xeon (Clovertown)

+More DIMMs(opteron), +FW fix, array padding(N2), etc…

+Cache/TLB Blocking

+Compression

+SW Prefetching

+NUMA/Affinity

Naïve Pthreads

Naïve

Page 83: The Roofline Model: A pedagogical tool for program analysis and optimization

83

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Auto-tuned Performance(Local Store Implementation)

First attempt at cell implementation.

VL, unrolling, reordering fixed No NUMA Exploits DMA and double

buffering to load vectors Straight to SIMD intrinsics. Despite the relative

performance, Cell’s DP implementation severely impairs performance

0.0

2.0

4.0

6.0

8.0

10.0

12.0

14.0

16.0

18.0

1 2 4 1 2 4

64^3 128^3

GFlop/s

0.0

2.0

4.0

6.0

8.0

10.0

12.0

14.0

16.0

18.0

1 2 4 8 1 2 4 8

64^3 128^3

GFlop/s

0.0

2.0

4.0

6.0

8.0

10.0

12.0

14.0

16.0

18.0

1 2 4 8 1 2 4 8

64^3 128^3

GFlop/s

0.0

2.0

4.0

6.0

8.0

10.0

12.0

14.0

16.0

18.0

1 2 4 8 16 1 2 4 8 16

64^3 128^3

GFlop/s

Intel Xeon (Clovertown) AMD Opteron (rev.F)

Sun Niagara2 (Huron) IBM Cell Blade (SPEs)*

*collision() only

+SIMDization

+SW Prefetching

+Unrolling

+Vectorization

+Padding

Naïve+NUMA

Page 84: The Roofline Model: A pedagogical tool for program analysis and optimization

84

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Where do GPUs fit in?

GPUs discover data level parallelism from thread level parallelism at runtime

VLIW

SIMD,Vector

superscalar,SMT,etc…

G80

Expressed atcompile time

Discovered atrun time

Inst

ruct

ion

Leve

lP

aral

lelis

mD

ata

Leve

lP

aral

lelis

m