
Page 1

Do theoretical FLOPs matter for real application's performance?

By [email protected] Abstract: The most intelligent answer to this question is “it depends on the application”. To proof that, we will show a few examples from both theoretical and practical point of view. In order to validate experimentally it, a modified AMD processor named “Fangio” (AMD Opteron 6275 Processor) will be used which has limited floating point capability to 2 FLOPs/clk/BD unit, delivering less (-8% in avergage) but close to the performance of AMD Opteron 6276 Processor with 4 times more floating point capability , ie. 8 FLOPs/clk/BD unit. The intention of this work is threefold: i) to demonstrate that the FLOPs/clk/core of microprocessor architectures isn’t necessarily a good performance metric indicator despite it is heavily used by the industry (eg. HPL). ii) to expose that code vectorization technology of compilers is fundamental in order to extract as much as possible real application performance but it has a long way to go in extracting it. iii) It would not be fair to blame exclusively on compiler technology: algorithms are not well designed and written for the compilers to exploit vector instructions (ie. SSE, AVX and FMA).

Saudi Arabia HPC, KAUST, Thuwal, 2012

Page 2: Do Theoretical Flo Ps Matter For Real Application’S Performance Kaust 2012

Agenda

• Concepts

– Kinds of FLOPs/clk

– Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD)

• AMD Interlagos processor FPU

– FPU; see "Understanding Interlagos architecture through HPL" (HPC Advisory Council workshop, ISC 2012)

– Roofline model

• AMD Fangio processor

– FPU capping, roofline model

• Results/Conclusions within roofline model for Interlagos and Fangio.

– Benchmarks: HPL, STREAM, CFD apps, SPEC fp.


Page 3

• A brief list of floating point instructions and examples supported by AMD Interlagos

• Scalar, Packed

• SP: Single precision, DP: Double precision

Concepts: kinds of FLOPs/clk

Ins. Type | Examples | FLOPs/clk | Reg. size
X87 | FADD, FMUL | 1 | 32, 64 bits
SSE (SP) | Scalar: ADDSS, MULSS; Packed: ADDPS, MULPS | 8 | 128 bits
SSE2 (DP) | Scalar: ADDSD, MULSD; Packed: ADDPD, MULPD | 4 | 128 bits
AVX (SP) | Scalar: VADDSS, VMULSS; Packed: VADDPS, VMULPS | 4, 8 | 128, 256 bits
AVX (DP) | Scalar: VADDSD, VMULSD; Packed: VADDPD, VMULPD | 2, 4 | 128, 256 bits
FMA4 (SP) | Scalar: VFMADDSS; Packed: VFMADDPS | 8, 16 | 128, 256 bits
FMA4 (DP) | Scalar: VFMADDSD; Packed: VFMADDPD | 4, 8 | 128, 256 bits
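Where these FLOPs/clk figures come from (a worked example, not on the original slide): a 256-bit packed FMA4 instruction such as VFMADDPD operates on 4 doubles and performs a multiply and an add on each of them, i.e. 4 x 2 = 8 DP FLOPs, which gives the 8 in the FMA4 (DP) row when one such instruction completes per clock; the 128-bit form handles 2 doubles and yields 4. The scalar variants touch a single element and deliver correspondingly fewer FLOPs per clock.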


Page 4

SISD (Single Instruction Single Data): a single data input and a single result per clock, stored in scalar format.

SIMD (Single Instruction Multiple Data): streams of input data and results, stored in vector (packed) format; SSE, AVX and FMA are the SIMD instruction families.

SIMD allows processing of more data per instruction, but the data needs to be formatted/packed to fit the vectors. THAT IS THE CHALLENGE.


Current CPU cores can crunch 8 DP numbers at a time. GPU streaming cores can crunch 2-4 DP numbers each, but there are several thousand streaming cores per GPU. Empty slots in the scalar pipeline are bubbles (no work).
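A minimal sketch (not from the slides) of what this packing challenge looks like to a compiler; the function names are illustrative only. The first loop has independent iterations and can be packed into SSE/AVX/FMA vectors; the second carries a dependency on the previous iteration and is stuck in scalar (SISD) form.

#include <stddef.h>

/* SIMD-friendly: iterations are independent, so the compiler can pack
   2 (SSE2), 4 (AVX) or more doubles into each instruction. */
void axpy(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}

/* SISD-bound: y[i] depends on y[i-1], so iterations cannot be packed
   into vectors and the loop proceeds one element at a time. */
void running_sum(size_t n, double alpha, const double *x, double *y)
{
    y[0] = alpha * x[0];
    for (size_t i = 1; i < n; i++)
        y[i] = y[i - 1] + alpha * x[i];
}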

Pages 5-10

Few slides from the presentation "Understanding Interlagos architecture through HPL" (HPC Advisory Council workshop, ISC 2012). [Image-only slides.]

Page 11

Roofline for an AMD Interlagos system: 2P, 2.3 GHz, 32 cores (16 Bulldozer compute units), 1600 MHz DDR3.

Real GFLOP/s in double precision (Linpack benchmark):

2 procs x 8 core-pairs x 2.3 GHz x 8 DP FLOP/clk/core-pair x 0.85 eff ≈ 250 DP GF/s

Very high arithmetic intensity, i.e. (FLOP/s) / (Byte/s):

- use of the AMD Core Math Library (FMA4 instructions)
- cache friendly
- high reuse of data
- DGEMM is Level-3 BLAS with arithmetic intensity of order N (the problem size)

Real GB/s: 72 GB/s (STREAM benchmark)

Low arithmetic intensity:

- use of non-temporal stores (the write-combining buffer is used instead of evicting data through L2 -> L3 -> RAM, which speeds up writes to RAM)
- not cache friendly
- no reuse of data
- most of the time the cores are waiting for data (low FLOPs/clk despite using SSE2/FMA4)
- STREAM is Level-1 BLAS with arithmetic intensity of order 1 (independent of problem size)
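A minimal sketch (not from the slides) of the roofline bound used in these plots: attainable GF/s = min(peak GF/s, arithmetic intensity x bandwidth). The peak and bandwidth values are the per-node numbers quoted above; the printed numbers are model bounds, not the measured points shown later.

#include <stdio.h>

/* Roofline model: attainable GF/s = min(peak GF/s, AI * memory bandwidth). */
static double roofline(double peak_gflops, double bw_gbs, double ai)
{
    double mem_bound = ai * bw_gbs;              /* GF/s when limited by memory */
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

int main(void)
{
    double peak = 250.0;   /* DP GF/s, 2P Interlagos 6276 (HPL number above) */
    double bw   = 72.0;    /* GB/s, STREAM number above */

    /* Ridge point: the arithmetic intensity where the two limits meet. */
    printf("ridge point: %.1f FLOP/byte\n", peak / bw);            /* ~3.5 */
    printf("AI = 10.6 (HPL-like):   %.0f GF/s\n", roofline(peak, bw, 10.6));
    printf("AI = 0.08 (Triad-like): %.1f GF/s\n", roofline(peak, bw, 0.08));
    return 0;
}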


Page 12

AMD Fangio, FPU capping

• Fangio (Opteron 6275) is the Interlagos 6276 processor with its FPU capped from 8 DP FLOP/clk to 2 DP FLOP/clk, achieved by slowing down the retirement of FPU instructions.

• It supports the same instruction set architecture as Interlagos.

• System: 2P, 2.3 GHz, 32 cores (16 Bulldozer compute units), 1600 MHz DDR3.

Performance impact depends on the workload:

Real GFLOP/s in double precision (Linpack benchmark):

2 procs x 8 core-pairs x 2.3 GHz x 2 DP FLOP/clk/core-pair ≈ 75 DP GF/s

(HPL efficiency is effectively 100% here; see the runs on the next page.)

Real GB/s: 72 GB/s (STREAM benchmark) - memory throughput is unchanged!
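Put differently (this is arithmetic implied by the capping rather than a statement on the slide): going from 8 to 2 DP FLOP/clk per Bulldozer unit removes 75% of the theoretical peak FLOPs. This is the 75% reduction that the roofline and SPECfp slides below refer to.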


Page 13

HPL runs to confirm FPU capping

2P x 16 cores @ 2.3GHz (6276 Interlagos)
==============================================================================

T/V N NB P Q Time Gflops

--------------------------------------------------------------------------------

WR01R2L4 86400 100 4 8 1774.55 2.423e+02

--------------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0022068 ...... PASSED

==============================================================================

2P x 16 cores @ 2.3GHz (6275 Interlagos) Fangio
==============================================================================

T/V N NB P Q Time Gflops

--------------------------------------------------------------------------------

WR01R2L4 86400 100 4 8 5494.39 7.826e+01

--------------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0022316 ...... PASSED

==============================================================================

3x longer time


Fangio measured: 78.26 GF/s vs. a theoretical peak of 16 CU x 2 DP FLOP/clk/CU x 2.3 GHz = 73.6 GF/s, i.e. 106% HPL efficiency!! (Possible because 2.3 GHz is the nominal, not the effective, frequency: the cores boost above it.)

Page 14

STREAM runs on 6275 to confirm no drop in memory throughput
-------------------------------------------------------------

Function Rate (MB/s) Avg time Min time Max time

Copy: 73089.4045 0.0443 0.0438 0.0449

Scale: 68952.3038 0.0469 0.0464 0.0472

Add: 66289.3072 0.0729 0.0724 0.0734

Triad: 66301.0957 0.0730 0.0724 0.0734

-------------------------------------------------------------

Scale, Add and Triad perform FLOPs in double precision.

Triad is the kernel plotted in the roofline model, since it is the one with the most FLOPs per iteration: an add and a multiply, implemented with FMA4.

#pragma omp parallel for
for (j = 0; j < N; j++)
    a[j] = b[j] + scalar*c[j];
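A quick sanity check (not stated on the slide): each Triad iteration performs 2 FLOPs (the multiply and add fused by FMA4) and moves 24 bytes (load b[j] and c[j], store a[j]), so its arithmetic intensity is 2/24 ≈ 0.08 FLOP/byte, matching the AI value listed for STREAM Triad in the summary table on the next page.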

Page 15

Summary of measurements per node, plotted in the roofline model

Workload | 6276 GB/s | 6276 DP GF/s | 6276 AI = F/B | 6275 "Fangio" GB/s | 6275 DP GF/s | 6275 AI = F/B
HPL | 6 x 4 = 24 | 7.8 x 32 = 250 | 10.6 | 1.8 x 4 = 6.8 | 2.3 x 32 = 75 | 11.7
STREAM Triad | 17 x 4 = 68 | 0.5 x 32 = 16 | 0.08 | 17 x 4 = 68 | 0.5 x 32 = 16 | 0.08
OpenFOAM | 15 x 4 = 60 | 0.8 x 32 = 25 | 0.41 | 14 x 4 = 56 | 0.7 x 32 = 22 | 0.39

• 1 computer has 2 processors, with a total of 4 NUMA nodes and 32 cores in 16 compute units.
• 1 NUMA node has a total of 4 compute units.
• Memory bandwidth in GB/s is measured per NUMA node (the first factor in each GB/s cell, shown in red on the slide).
• Double precision floating point is measured per core (the first factor in each GF/s cell, shown in red on the slide).


Per compute unit HPL efficiency: 6276: (2 cores x 7.8 GF/s) / (2.3 GHz x 8 FLOP/clk) = 85%; Fangio: (2 cores x 2.3 GF/s) / (2.3 GHz x 2 FLOP/clk) = 100%!! (2.3 GHz is the nominal frequency, not the effective one, because of boost.)

Page 16

Roofline for Interlagos and Fangio

[Figure: roofline plot, GF/s vs. arithmetic intensity (GF/s)/(GB/s), both on log2 scales. Two compute roofs: AMD Interlagos (250 GF/s) and AMD Fangio (75 GF/s); both processors have the same memory bandwidth, i.e. the same sloped part of the roof. Measured and plotted points: HPL (Interlagos), HPL (Fangio) and STREAM Triad. Annotations: HPL shows a 75% performance drop on Fangio; Triad shows a 0% drop; sparse-algebra codes such as CFD apps (OpenFOAM, FLUENT, STARCCM, ...) drop by ~6-8%; SPECfp drops by 3-20%, 8% on average. Codes with data dependencies and scalar code get no benefit from vectorization (low-AI, memory-bound region); Level-3 BLAS (e.g. DGEMM) benefits from vectorization with FMA, AVX and SSE (high-AI, compute-bound region).]


Page 17

Performance impact on SPEC fp 2006 rate peak

• SPEC website: www.spec.org
• Runs were done with the peak-flags configuration, in order to make optimal use of the compiler technology.
• In this case the Open64 compiler was used.
• Runs were done with only 1 copy per Bulldozer unit, to allow each process/copy to fully utilize the available computing resources without constraints originating from the shared resources of the Bulldozer compute unit (e.g. L2, FPU, instruction scheduler).

[Figure: resource utilization]


Page 18

Performance impact on SPEC fp 2006 rate peak (cont.)

Benchmark | Application area | Brief description | % perf. drop
Bwaves | Fluid Dynamics | 3D transonic transient laminar viscous flow. | 0.09%
Gamess | Quantum Chemistry | Self-consistent field calculations using the Restricted Hartree-Fock method. | -10.51%
Milc | Quantum Chromodynamics | A gauge field generating program for lattice gauge theory programs with dynamical quarks. | 0.10%
Zeusmp | Fluid Dynamics | NCSA code, CFD simulation of astrophysical phenomena. | -7.47%
Gromacs | Biochemistry / Molecular Dynamics | Newtonian equations of motion for hundreds to millions of particles. | -32.17%
cactusADM | General Relativity | Solves the Einstein evolution equations using a staggered-leapfrog numerical method. | -2.01%
Leslie3d | Fluid Dynamics | CFD, Large Eddy Simulation. | -0.44%
Namd | Biology / Molecular Dynamics | Large biomolecular systems. The test case has 92,224 atoms of apolipoprotein A-I. | -24.23%


GPU candidate

GPU candidate

Page 19

Benchmark | Application area | Brief description | % perf. drop
dealII | Finite Element Analysis | Adaptive finite elements and error estimation. Helmholtz-type equation. | -9.09%
Soplex | Linear Programming, Optimization | Simplex algorithm and sparse linear algebra. Test cases include railroad planning and military airlift models. | 1.86%
Povray | Image Ray-tracing | Image rendering. The test case is a 1280x1024 anti-aliased image of a landscape. | -12.15%
Calculix | Structural Mechanics | Finite element code for linear and nonlinear 3D structural applications. | -26.82%
GemsFDTD | Computational Electromagnetics | Solves the Maxwell equations in 3D using the finite-difference time-domain (FDTD) method. | -0.67%
Tonto | Quantum Chemistry | Molecular Hartree-Fock wavefunction calculation to better match experimental X-ray diffraction data. | -14.43%
Lbm | Fluid Dynamics | "Lattice-Boltzmann Method" to simulate incompressible fluids in 3D. | 0.58%
Wrf | Weather | Weather modeling from scales of meters to thousands of kilometers. | -0.95%
Sphinx3 | Speech recognition | Speech recognition system from Carnegie Mellon University. | -3.00%

AVERAGE REAL PERFORMANCE DROP WHEN THE THEORETICAL FLOPs ARE REDUCED BY 75%: -8.94%


GPU candidate

Page 20

Performance impact on CFD apps

• Most CFD applications with an Eulerian formulation use sparse linear algebra to represent the linearized Navier-Stokes equations on unstructured grids (a minimal sketch of such a kernel is shown after this list).
• The higher the order of the discretization scheme, the higher the arithmetic intensity.
• Data dependencies in both space and time prevent vectorization.
• Large datasets have low cache reuse.
• Cores spend most of their time waiting for new data to arrive in the caches.
• Once the data is in the caches, the floating point instructions used are mostly scalar instead of packed.
• Compilers have a hard time finding opportunities to vectorize loops.
• Loop unrolling and partial vectorization of independent data help very little, because the cores are still waiting for that data.
• Overall, low performance from a FLOP/s point of view.
• Therefore, capping the FPU in terms of FLOPs/clk has little impact on the performance of most of these applications.
• Theoretical FLOP/s is therefore not a good indicator of how applications such as CFD codes (and many more) will perform.
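As an illustrative sketch (not taken from the slides), a sparse matrix-vector product in CSR format, the kind of kernel at the heart of such CFD solvers, shows why this happens: the indirect access x[col[k]] (a gather) and the short, irregular inner loop give the compiler almost nothing to pack into SSE/AVX/FMA vectors, and the kernel stays memory-bound regardless of the FPU's peak. Function and variable names are illustrative only.

#include <stddef.h>

/* y = A*x with A stored in CSR format (row_ptr, col, val).
   The indirect load x[col[k]] and the variable-length inner loop keep
   this kernel mostly scalar and memory-bound: roughly 2 FLOPs for every
   ~20 bytes of data moved, far below the processor's peak FLOPs/clk. */
void spmv_csr(size_t n, const size_t *row_ptr, const size_t *col,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col[k]];
        y[i] = sum;
    }
}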


Page 21

What should we do, moving forward?

• Multidisciplinary teams to work on:

– Algorithm research and development, to make algorithms more hardware aware.

– Software research and development, to implement the algorithms efficiently (e.g. communication avoidance, dynamic task scheduling, work stealing, locality, power awareness, resilience, ...).

– Interaction between domain scientists and computer (HW+SW) scientists, to develop new formulations of the equations that lead to algorithms better suited to new computer architectures.

– Research and development in compiler and programming-language technology, to detect algorithm properties and exploit hardware features.

• Supercomputing datacenter institutions to work on:

– Enabling science through proper exploitation of computational resources.

– Multidisciplinary teams educating scientists on how to use the resources.

– Supercomputing investments should be funded and measured in terms of the number and quality of scientific projects, not in terms of CPU utilization (e.g. CPU utilization isn't CPU efficiency, just as theoretical FLOPs aren't real application performance).