
Page 1: Introduction to the Roofline Model

Introduction to the Roofline Model

Samuel Williams
Computational Research Division
Lawrence Berkeley National Lab

[email protected]

Page 2: Introduction to the Roofline Model

You just spent 6 months porting your application to GPUs

Are you done?

Page 3: Introduction to the Roofline Model

What is “Good” Performance?

§ Imagine profiling the mix of loop nests in an application when running on the GPU
o GFLOP/s alone may not be particularly insightful
o speedup relative to a Xeon may seem random

[Figure: bar chart of GFLOP/s for each loop nest (kernel)]

Page 4: Introduction to the Roofline Model

What is “Good” Performance?

§ Two fundamental aspects to “Good” performance…
1. Operating in the throughput-limited regime (not sensitive to Amdahl effects, D2H/H2D transfers, launch overheads, etc.)
2. Making good use of the GPU’s compute and/or bandwidth capabilities

Ø Ultimately, we need a quantitative model rather than qualitative statements like “good”

Page 5: Introduction to the Roofline Model

Roofline Model

§ Roofline Model is a throughput-oriented performance model
§ Tracks rates, not times
§ Independent of ISA and architecture: applies to CPUs, GPUs, Google TPUs¹, FPGAs, etc.
§ Helps quantify “Good” performance

¹ Jouppi et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit”, ISCA, 2017.

https://crd.lbl.gov/departments/computer-science/PAR/research/roofline

Page 6: Introduction to the Roofline Model

Reduced Model

§ Superscalar architectures can be complex
§ Don’t model / simulate the full architecture
§ Create a simplified processor architecture

https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)

Page 7: Introduction to the Roofline Model

Reduced Model

§ Superscalar architectures can be complex
§ Don’t model / simulate the full architecture
§ Create a simplified processor architecture
§ Make assumptions on performance and usage…
o Cores can attain peak GFLOP/s on local data
o Cores execute load-balanced SPMD code
o NoC bisection bandwidth is sufficient
o There is sufficient cache bandwidth and capacity such that they do not affect performance
Ø Basis for the DRAM Roofline Model

[Diagram: simplified machine model, i.e. compute (Peak GFLOP/s) with perfect caches connected to DRAM over a DRAM GB/s link]

Page 8: Introduction to the Roofline Model

Data Movement or Compute?

§ Which takes longer?
o Data Movement?
o Compute?

Time = max { #FP ops / Peak GFLOP/s
           , #Bytes  / Peak GB/s }

Page 9: Introduction to the Roofline Model

Data Movement or Compute?

§ Which takes longer?
o Data Movement?
o Compute?
§ Is performance limited by compute or data movement?

Time = #FP ops * max { 1 / Peak GFLOP/s
                     , (#Bytes / #FP ops) / Peak GB/s }

Page 10: Introduction to the Roofline Model

Data Movement or Compute?

§ Which takes longer?
o Data Movement?
o Compute?
§ Is performance limited by compute or data movement?

#FP ops / Time = min { Peak GFLOP/s
                     , (#FP ops / #Bytes) * Peak GB/s }

Page 11: Introduction to the Roofline Model

Data Movement or Compute?

§ Which takes longer?
o Data Movement?
o Compute?
§ Is performance limited by compute or data movement?

GFLOP/s = min { Peak GFLOP/s
              , AI * Peak GB/s }

where AI (Arithmetic Intensity) = FLOPs / Bytes (as presented to DRAM)
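A minimal sketch of this bound in C. The function name roofline_gflops and the machine peaks used in main are illustrative assumptions, not numbers from the slides:

#include <stdio.h>

/* Attainable GFLOP/s for a kernel of arithmetic intensity ai (FLOPs/byte)
 * on a machine with the given peak compute and memory bandwidth.          */
static double roofline_gflops(double ai, double peak_gflops, double peak_gbs)
{
    double mem_bound = ai * peak_gbs;               /* bandwidth-limited rate */
    return (mem_bound < peak_gflops) ? mem_bound    /* memory-bound           */
                                     : peak_gflops; /* compute-bound          */
}

int main(void)
{
    /* Hypothetical machine: 7000 GFLOP/s peak, 900 GB/s HBM bandwidth */
    double peak_gflops = 7000.0, peak_gbs = 900.0;
    for (double ai = 0.0625; ai <= 64.0; ai *= 2.0)
        printf("AI = %6.3f FLOP/byte -> bound = %8.1f GFLOP/s\n",
               ai, roofline_gflops(ai, peak_gflops, peak_gbs));
    return 0;
}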

Page 12: Introduction to the Roofline Model

Arithmetic Intensity

§ Measure of data locality (data reuse)
§ Ratio of total FLOPs performed to total Bytes moved
§ For the DRAM Roofline…
o Total Bytes to/from DRAM
o Includes all cache and prefetcher effects
o Can be very different from total loads/stores (bytes requested)
o Equal to the ratio of sustained GFLOP/s to sustained GB/s (time cancels; see the sketch below)
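Because time cancels, the same AI falls out whether you start from totals or from sustained rates. A small C sketch, assuming hypothetical measured values (the totals and runtime below are illustrative, not from the slides):

#include <stdio.h>

int main(void)
{
    /* Hypothetical measurements for one kernel, e.g. from a profiler:
     * total FLOPs executed and total bytes moved to/from DRAM.        */
    double total_flops = 4.0e11;   /* 400 GFLOP                 */
    double total_bytes = 1.6e12;   /* 1.6 TB of DRAM traffic    */
    double runtime_sec = 2.0;

    double gflops = total_flops / runtime_sec / 1e9;   /* sustained GFLOP/s */
    double gbs    = total_bytes / runtime_sec / 1e9;   /* sustained GB/s    */

    /* Both forms give the same arithmetic intensity (time cancels). */
    printf("AI = %.3f FLOP/byte (from totals)\n", total_flops / total_bytes);
    printf("AI = %.3f FLOP/byte (from rates)\n",  gflops / gbs);
    return 0;
}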

Page 13: Introduction to the Roofline Model

(DRAM) Roofline Model

§ Plot the Roofline bound using Arithmetic Intensity as the x-axis
§ Log-log scale makes it easy to doodle, extrapolate performance along Moore’s Law, etc.

GFLOP/s = min { Peak GFLOP/s
              , AI * Peak GB/s }

where AI (Arithmetic Intensity) = FLOPs / Bytes (moved to/from DRAM)

[Figure: log-log plot of Attainable FLOP/s vs. Arithmetic Intensity (FLOP:Byte), bounded by the HBM GB/s slope and the Peak GFLOP/s ceiling]

Page 14: Introduction to the Roofline Model

(DRAM) Roofline Model

§ Plot the Roofline bound using Arithmetic Intensity as the x-axis
§ Log-log scale makes it easy to doodle, extrapolate performance along Moore’s Law, etc.
Ø Transition @ AI == Peak GFLOP/s / Peak GB/s == ‘Machine Balance’

GFLOP/s = min { Peak GFLOP/s
              , AI * Peak GB/s }

where AI (Arithmetic Intensity) = FLOPs / Bytes (moved to/from DRAM)

[Figure: same Roofline plot, marking the ridge point where the HBM GB/s slope meets the Peak GFLOP/s ceiling]

Page 15: Introduction to the Roofline Model

(DRAM) Roofline Model

§ Roofline tessellates this 2D view of performance into 5 regions…

GFLOP/s = min { Peak GFLOP/s
              , AI * Peak GB/s }

where AI (Arithmetic Intensity) = FLOPs / Bytes (moved to/from DRAM)

[Figure: Roofline plot with regions labeled: unattainable performance (greater than peak GFLOP/s), unattainable performance (insufficient bandwidth), and poor performance below the roofline]

Page 16: Introduction to the Roofline Model

Roofline Example #1

§ Typical machine balance is 5-10 FLOPs per byte…
o 40-80 FLOPs per double to exploit compute capability
o Artifact of technology and money
o Unlikely to improve

§ Consider STREAM Triad…
o 2 FLOPs per iteration
o Transfer 24 bytes per iteration (read X[i], Y[i], write Z[i])
o AI = 0.083 FLOPs per byte == memory bound (a worked sketch follows below)

#pragma omp parallel for
for(i=0;i<N;i++){
  Z[i] = X[i] + alpha*Y[i];
}

[Figure: Roofline plot placing TRIAD at AI = 0.083, far left of a machine balance near 5.0, in the bandwidth-limited region where GFLOP/s ≤ AI * HBM GB/s]
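A worked version of that bound in C. The 900 GB/s HBM bandwidth is a hypothetical figure, not one from the slides:

#include <stdio.h>

int main(void)
{
    /* STREAM Triad per iteration: 1 multiply + 1 add, and three 8-byte
     * doubles moved (read X[i], read Y[i], write Z[i]).                */
    double flops_per_iter = 2.0;
    double bytes_per_iter = 3.0 * 8.0;
    double ai = flops_per_iter / bytes_per_iter;     /* ~0.083 FLOP/byte */

    double hbm_gbs = 900.0;                          /* hypothetical HBM GB/s */
    printf("TRIAD AI    = %.3f FLOP/byte\n", ai);
    printf("TRIAD bound = %.1f GFLOP/s (AI * HBM GB/s)\n", ai * hbm_gbs);
    return 0;
}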

Page 17: Introduction to the Roofline Model

Roofline Example #2

§ Conversely, consider a 7-point constant-coefficient stencil…

#pragma omp parallel for
for(k=1;k<dim+1;k++){
for(j=1;j<dim+1;j++){
for(i=1;i<dim+1;i++){
  new[k][j][i] = -6.0*old[k  ][j  ][i  ]
               +      old[k  ][j  ][i-1]
               +      old[k  ][j  ][i+1]
               +      old[k  ][j-1][i  ]
               +      old[k  ][j+1][i  ]
               +      old[k-1][j  ][i  ]
               +      old[k+1][j  ][i  ];
}}}

Page 18: Introduction to the Roofline Model

Roofline Example #2

§ Conversely, consider a 7-point constant-coefficient stencil (same loop nest as above)…
o 7 FLOPs per point
o 8 memory references (7 reads, 1 store) per point
o AI = 7 / (8*8) = 0.11 FLOPs per byte (measured at the L1)

Page 19: Introduction to the Roofline Model

Roofline Example #2

§ Conversely, consider a 7-point constant-coefficient stencil…
o 7 FLOPs per point
o 8 memory references (7 reads, 1 store) per point
o Ideally, the cache will filter all but 1 read and 1 write per point

Page 20: Introduction to the Roofline Model

Roofline Example #2

§ Conversely, consider a 7-point constant-coefficient stencil…
o 7 FLOPs per point
o 8 memory references (7 reads, 1 store) per point
o Ideally, the cache will filter all but 1 read and 1 write per point
Ø AI = 7 / (8+8) = 0.44 FLOPs per byte (DRAM)

Page 21: Introduction to the Roofline Model

Roofline Example #2

§ Conversely, consider a 7-point constant-coefficient stencil…
o 7 FLOPs per point
o 8 memory references (7 reads, 1 store) per point
o Ideally, the cache will filter all but 1 read and 1 write per point
Ø AI = 7 / (8+8) = 0.44 FLOPs per byte (DRAM)
Ø Still memory bound, but 5x the FLOP rate of TRIAD (see the sketch below)

[Figure: Roofline plot placing TRIAD at AI = 0.083 and the 7-point stencil at AI = 0.44, both under the HBM GB/s slope where GFLOP/s ≤ AI * HBM GB/s]
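A small sketch of the two intensities above in C, assuming 8-byte doubles as in the slides:

#include <stdio.h>

int main(void)
{
    double flops = 7.0;     /* 1 multiply + 6 adds per point */
    double dbl   = 8.0;     /* bytes per double              */

    /* As presented to the L1: all 8 references (7 loads + 1 store) move data. */
    double ai_l1   = flops / (8.0 * dbl);

    /* As presented to DRAM: an ideal cache filters all but 1 read + 1 write.  */
    double ai_dram = flops / ((1.0 + 1.0) * dbl);

    printf("7-point stencil AI (L1)   = %.2f FLOP/byte\n", ai_l1);    /* ~0.11 */
    printf("7-point stencil AI (DRAM) = %.2f FLOP/byte\n", ai_dram);  /* ~0.44 */
    return 0;
}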

Page 22: Introduction to the Roofline Model

What is “Good” Performance?

§ Think back to our mix of loop nests…

[Figure: bar chart of FLOP/s for each loop nest (kernel)]

Page 23: Introduction to the Roofline Model

What is “Good” Performance?

§ We can sort kernels by arithmetic intensity…

[Figure: kernels plotted by Arithmetic Intensity (FLOP:Byte) vs. Attainable FLOP/s]

Page 24: Introduction to the Roofline Model

What is “Good” Performance?

§ We can sort kernels by arithmetic intensity…
§ …and compare performance relative to machine capabilities

[Figure: Roofline plot (Peak GFLOP/s ceiling, HBM GB/s slope) with kernels plotted by Arithmetic Intensity (FLOP:Byte)]

Page 25: Introduction to the Roofline Model

What is “Good” Performance?

§ Kernels near the roofline are making good use of computational resources

[Figure: Roofline plot with additional 50%-of-Peak and 50%-of-STREAM ceilings; kernels near the roofline]

Page 26: Introduction to the Roofline Model

What is “Good” Performance?

§ Kernels near the roofline are making good use of computational resources
Ø kernels can have low performance (GFLOP/s), but make good use (%STREAM) of a machine

Page 27: Introduction to the Roofline Model

What is “Good” Performance?

§ Kernels near the roofline are making good use of computational resources
Ø kernels can have low performance (GFLOP/s), but make good use (%STREAM) of a machine
Ø kernels can have high performance (GFLOP/s), but still make poor use of a machine (%peak); see the sketch below
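One way to turn that observation into a number is to report performance as a fraction of the kernel's own roofline bound rather than of peak. A hedged C sketch; the function name roofline_fraction and all figures below are hypothetical, not from the slides:

#include <stdio.h>

/* Fraction of the attainable (roofline) performance a kernel achieved. */
static double roofline_fraction(double measured_gflops, double ai,
                                double peak_gflops, double peak_gbs)
{
    double bound = ai * peak_gbs;
    if (bound > peak_gflops) bound = peak_gflops;
    return measured_gflops / bound;
}

int main(void)
{
    /* Hypothetical machine and kernel (e.g. the stencil above) */
    double peak_gflops = 7000.0, peak_gbs = 900.0;
    double ai = 0.44, measured = 300.0;

    printf("%%peak     = %.1f%%\n", 100.0 * measured / peak_gflops);
    printf("%%roofline = %.1f%%\n",
           100.0 * roofline_fraction(measured, ai, peak_gflops, peak_gbs));
    return 0;
}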

Page 28: Introduction to the Roofline Model

Roofline is made of two components

§ Machine Model
o Lines defined by peak GB/s and GFLOP/s (benchmarking)
o Unique to each architecture
o Common to all apps on that architecture

Page 29: Introduction to the Roofline Model

Roofline is made of two components

§ Machine Model
o Lines defined by peak GB/s and GFLOP/s (benchmarking)
o Unique to each architecture
o Common to all apps on that architecture

§ Application Characteristics
o Dots defined by application GFLOPs and GBs (application instrumentation)
o Unique to each application
o Unique to each architecture

Page 30: Introduction to the Roofline Model

General Performance Optimization Strategy

§ Get to the Roofline

Page 31: Introduction to the Roofline Model

General Performance Optimization Strategy

§ Get to the Roofline
§ Increase Arithmetic Intensity when bandwidth-limited
o Reducing data movement increases AI (see the loop-fusion sketch below)
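As one illustration (not from the slides) of how reducing data movement raises AI, consider fusing two streaming loops so a large array is read from DRAM once instead of twice; write-allocate traffic is ignored for simplicity:

#include <stddef.h>

/* Unfused: for large N, A is streamed from DRAM twice, so per element we
 * move 4 doubles (2 reads of A, writes of B and C) for 2 FLOPs:
 * AI = 2 / (4*8) ~= 0.063 FLOP/byte.                                     */
void unfused(double *B, double *C, const double *A, double s, size_t N)
{
    for (size_t i = 0; i < N; i++) B[i] = s * A[i];
    for (size_t i = 0; i < N; i++) C[i] = A[i] + s;
}

/* Fused: A is read once, so per element we move 3 doubles for the same
 * 2 FLOPs: AI = 2 / (3*8) ~= 0.083 FLOP/byte, a higher roofline bound.   */
void fused(double *B, double *C, const double *A, double s, size_t N)
{
    for (size_t i = 0; i < N; i++) {
        B[i] = s * A[i];
        C[i] = A[i] + s;
    }
}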

Page 32: Introduction to the Roofline Model

How can performance ever be below the Roofline?

Page 33: Introduction to the Roofline Model

Performance Below the Roofline?

§ Insufficient cache bandwidth and data locality
Ø addressed by the Hierarchical Roofline Model: Charlene Yang, Thorsten Kurth, Samuel Williams, "Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC-9 Perlmutter system", Concurrency and Computation: Practice and Experience (CCPE), August 2019.

§ "Lack of Parallelism"
o Thread Divergence (idle threads)
o Insufficient Occupancy (idle warp schedulers)
o Insufficient #Thread Blocks (idle SMs)
Ø addressed by Roofline Scaling Trajectories: Khaled Ibrahim, Samuel Williams, Leonid Oliker, "Performance Analysis of GPU Programming Models using the Roofline Scaling Trajectories", International Symposium on Benchmarking, Measuring and Optimizing (Bench), Best Paper Award, November 2019.
[Figure: Roofline scaling trajectories (Classes A/B/C, concurrency c1 to c64) against concurrency-dependent DRAM (14.3 to 128 GFLOP/s) and ADD/VFMA (9.2 to 1229 GFLOP/s) ceilings]

§ Integer-heavy Codes
o Non-FP instructions impair FP performance
o No FP instructions… AI = 0
Ø addressed by the Instruction Roofline Model: Nan Ding, Samuel Williams, "An Instruction Roofline Model for GPUs", Performance Modeling, Benchmarking, and Simulation (PMBS), Best Paper Award, November 2019.
[Figure: Instruction Roofline, warp GIPS vs. Instruction Intensity (warp instructions per transaction), with L1 (437.5 GTXN/s), L2 (93.6 GTXN/s), and HBM (25.9 GTXN/s) ceilings and a theoretical peak of 489.6 warp GIPS]

§ Instruction Mix
o Lack of FMA
o Mixed Precision effects
o Lack of Tensor Core operations
Ø addressed by additional FP ceilings: Charlene Yang, Thorsten Kurth, Samuel Williams, "Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC-9 Perlmutter system", CCPE, August 2019.

Page 34: Introduction to the Roofline Model

Summary

Page 35: Introduction to the Roofline Model

Why We Use Roofline…

1. Determine when we’re done optimizing code
o Assess performance relative to machine capabilities
o Track progress towards optimality
o Motivate the need for algorithmic changes

2. Identify performance bottlenecks & motivate software optimizations

3. Understand performance differences between architectures, programming models, implementations, etc.
o Why do some architectures/implementations move more data than others?
o Why do some compilers outperform others?

4. Predict performance on future machines / architectures
o Set realistic performance expectations
o Drive Architecture-Computer Science-Applied Math co-design

Page 36: Introduction to the Roofline Model

Take away

§ Roofline helps understand application performance relative to machine capabilities
o just the beginning of the optimization process
o other bottleneck- or architecture-specific tools can be used to refine the process

§ Roofline helps frame the conversation between…
o Application Developers
o Computer Scientists
o Applied Mathematicians
o Processor Vendors
…providing a common mental model and optimization language

Page 37: Introduction to the Roofline Model

Questions