joshua-mora
Do theoretical FLOPs matter for real application performance?
By [email protected]
Abstract: The most intelligent answer to this question is “it depends on the application”. To prove that, we will show a few examples from both a theoretical and a practical point of view. To validate it experimentally, we use a modified AMD processor named “Fangio” (AMD Opteron 6275 Processor), whose floating point capability is limited to 2 FLOPs/clk/BD unit. It delivers less performance (-8% on average) than, but close to, the AMD Opteron 6276 Processor, which has 4 times more floating point capability, ie. 8 FLOPs/clk/BD unit. The intention of this work is threefold:
i) To demonstrate that the FLOPs/clk/core of a microprocessor architecture isn’t necessarily a good performance indicator, even though it is heavily used by the industry (eg. HPL).
ii) To show that the code vectorization technology of compilers is fundamental to extracting as much real application performance as possible, but that it still has a long way to go.
iii) To point out that it would not be fair to blame compiler technology exclusively: algorithms are not well designed and written for compilers to exploit vector instructions (ie. SSE, AVX and FMA).
Saudi Arabia HPC, KAUST, Thuwal, 2012
Agenda
• Concepts
– Kinds of FLOPs/clk
– Single Instruction Single Data, Single Instruction Multiple Data
• AMD Interlagos processor FPU
– FPU, see Understanding Interlagos arch through HPL (HPC Advisory Council workshop, ISC 2012)
– Roofline model
• AMD Fangio processor
– FPU capping, roofline model
• Results/Conclusions within roofline model for Interlagos and Fangio.
– Benchmarks: HPL, stream, CFD apps, SPEC fp benchmarks.
Concepts: kinds of FLOPs/clk
• A brief list of the floating point instruction types supported by AMD Interlagos, with examples
• Scalar, Packed
• SP: Single precision, DP: Double precision
Ins. type  | Examples                                            | FLOPs/clk | Reg. size
-----------------------------------------------------------------------------------------
X87        | FADD, FMUL                                          | 1         | 32, 64 bits
SSE (SP)   | Scalar: ADDSS, MULSS; Packed: ADDPS, MULPS          | 8         | 128 bits
SSE2 (DP)  | Scalar: ADDSD, MULSD; Packed: ADDPD, MULPD          | 4         | 128 bits
AVX (SP)   | Scalar: VADDSS, VMULSS; Packed: VADDPS, VMULPS      | 4, 8      | 128, 256 bits
AVX (DP)   | Scalar: VADDSD, VMULSD; Packed: VADDPD, VMULPD      | 2, 4      | 128, 256 bits
FMA4 (SP)  | Scalar: VFMADDSS; Packed: VFMADDPS                  | 8, 16     | 128, 256 bits
FMA4 (DP)  | Scalar: VFMADDSD; Packed: VFMADDPD                  | 4, 8      | 128, 256 bits
SISD: Single Instruction Single Data. A single data input and a single result at each clock, stored in scalar format.
SIMD: Single Instruction Multiple Data. Streams of input data and results, stored in vectors (packed format): SSE, AVX, FMA.
SIMD allows processing of more data per clock, but the data needs to be formatted / packed to fit the vectors. THAT IS THE CHALLENGE
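The packing challenge can be illustrated with a minimal C sketch. It assumes a hypothetical particle record stored as an array of structures; the names `particle_aos`, `pack_x` and `scale_packed` are illustrative and do not come from the presentation. The x values are first copied into a contiguous array, and only then does the compute loop have the unit-stride, dependency-free form that a vectorizing compiler can map to packed instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: one particle as an array-of-structures element.
   The x values of consecutive particles are 24 bytes apart in memory. */
typedef struct {
    double x, y, z;
} particle_aos;

/* Pack the x components into a contiguous array, so that consecutive
   values sit next to each other as packed loads expect. */
void pack_x(const particle_aos *p, double *x, size_t n)
{
    assert(p != NULL && x != NULL);
    for (size_t i = 0; i < n; ++i)
        x[i] = p[i].x;
}

/* Unit-stride loop with independent iterations: a candidate for packed
   multiplies (eg. MULPD / VMULPD) when the compiler vectorizes it. */
void scale_packed(double *x, double a, size_t n)
{
    assert(x != NULL);
    for (size_t i = 0; i < n; ++i)
        x[i] *= a;
}
```

The packing step is itself memory traffic with no FLOPs, which is one reason vectorization does not always pay off.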
Current CPU cores can crunch 8 DP numbers at a time. GPU streaming cores can crunch 2-4 DP numbers each, and there are several thousand streaming cores per GPU.
[Figure label: bubbles (no work)]
Roofline for AMD Interlagos
System: 2P, 2.3GHz, 16 cores, 1600MHz DDR3.
Real GFLOP/s in double precision (Linpack benchmark):
2 procs x 8 core-pairs x 2.3GHz x 8 DP FLOP/clk/core-pair x 0.85 eff = 250 DP GF/s
Very high numerical intensity, ie. (FLOP/s) / (Byte/s)
- use of AMD Core Math Library (FMA4 instructions)
- cache friendly
- data is reused
- DGEMM is L3 BLAS, with arithmetic intensity of order N (problem size)
Real GB/s: 72 GB/s (stream benchmark)
Low numerical intensity
- use of non temporal stores (the write combining buffer is used instead of evicting data through L2 -> L3 -> RAM, to speed up the write into RAM)
- not cache friendly
- no reuse of data
- most of the time the cores are waiting for data (low FLOP/clk despite using SSE2, FMA4)
- stream is L1 BLAS, with arithmetic intensity of order 1 (size independent)
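The "order N" versus "order 1" contrast can be made concrete with a small C sketch. It assumes an idealized traffic count in which each matrix of an N x N DGEMM is moved once; real traffic depends on blocking and cache sizes, so the result is an upper bound on intensity. The function names are illustrative.

```c
#include <assert.h>

/* Flops per byte for an n x n double precision matrix multiply, counting
   each of the three matrices once (idealized traffic, not a measurement). */
double dgemm_intensity(double n)
{
    assert(n > 0.0);
    double flops = 2.0 * n * n * n;     /* n^3 multiplies + n^3 adds */
    double bytes = 3.0 * n * n * 8.0;   /* A, B and C at 8 bytes/element */
    return flops / bytes;               /* = n / 12, grows with n */
}

/* Flops per byte for the triad a[j] = b[j] + scalar*c[j]: 2 flops per
   element against two 8-byte loads and one 8-byte store. */
double triad_intensity(void)
{
    return 2.0 / (3.0 * 8.0);           /* = 1/12, independent of n */
}
```

Under these assumptions the DGEMM intensity grows linearly with the problem size, while the triad stays at 1/12 flops per byte whatever the array length.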
AMD Fangio, FPU capping
• Fangio is the Interlagos processor model 6276 with its FPU capped from 8 DP FLOP/clk to 2 DP FLOP/clk by slowing down the FPU retirement of instructions.
• Supports the same instruction set architecture as Interlagos.
• System: 2P, 2.3GHz, 16 cores, 1600MHz DDR3.
Performance impact depends on workload:
Real GFLOP/s in double precision (Linpack benchmark):
2 procs x 8 core-pairs x 2.3GHz x 2 DP FLOP/clk/core-pair = 73.6 DP GF/s theoretical, about 75 DP GF/s real (applying the 0.85 efficiency factor used for Interlagos would give only 62.6 DP GF/s)
Real GB/s: 72 GB/s (stream benchmark): unmodified memory throughput performance!
HPL runs to confirm FPU capping
2P 16cores @ 2.3GHz (6276 Interlagos)
==============================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR01R2L4 86400 100 4 8 1774.55 2.423e+02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0022068 ...... PASSED
==============================================================================
2P 16cores @ 2.3GHz (6275 Interlagos) Fangio
==============================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR01R2L4 86400 100 4 8 5494.39 7.826e+01
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0022316 ...... PASSED
==============================================================================
3x longer time
HPL efficiency on Fangio = 78.26 GF/s / (16 CU * 2 FLOP/clk/CU * 2.3GHz) = 78.26 GF/s / 73.6 GF/s = 106% HPL eff !!
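The two HPL runs can be cross-checked with a short C sketch. It assumes HPL's usual operation count of 2/3*N^3 + 3/2*N^2 for an N x N system and the nominal 2.3 GHz clock; the function names are illustrative, not part of HPL.

```c
#include <assert.h>

/* Theoretical peak in GFLOP/s: compute units x FLOP/clk per unit x GHz. */
double peak_gflops(int compute_units, double flop_per_clk, double ghz)
{
    assert(compute_units > 0);
    return compute_units * flop_per_clk * ghz;
}

/* GFLOP/s for an N x N HPL run that took `seconds`, using the operation
   count 2/3*N^3 + 3/2*N^2 (assumed to match what the benchmark reports). */
double hpl_gflops(double n, double seconds)
{
    assert(seconds > 0.0);
    double flop = (2.0 / 3.0) * n * n * n + 1.5 * n * n;
    return flop / seconds * 1e-9;
}

/* Efficiency = measured rate / theoretical peak. */
double efficiency(double measured_gflops, double peak)
{
    assert(peak > 0.0);
    return measured_gflops / peak;
}
```

With N = 86400, the reported times of 1774.55 s and 5494.39 s give roughly 242 and 78 GF/s, in line with the Gflops column. Against a 16 CU x 2 FLOP/clk x 2.3 GHz peak of 73.6 GF/s the Fangio run lands at about 106%, which is possible only because the nominal clock understates the boosted one.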
Stream runs on 6275 to confirm no drop in memory throughput
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 73089.4045 0.0443 0.0438 0.0449
Scale: 68952.3038 0.0469 0.0464 0.0472
Add: 66289.3072 0.0729 0.0724 0.0734
Triad: 66301.0957 0.0730 0.0724 0.0734
-------------------------------------------------------------
Scale, Add and Triad perform FLOPs in double precision.
Triad is the one plotted in the roofline model because it performs the most FLOPs per element: an add and a multiply, using FMA4.
#pragma omp parallel for
for (j=0; j<N; j++) a[j] = b[j]+scalar*c[j];
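For reference, the triad loop can be wrapped into a self-contained C function together with the bandwidth arithmetic. This is a minimal sketch: the function names are illustrative, and the byte count assumes STREAM's convention of two 8-byte loads and one 8-byte store per element. The OpenMP pragma is guarded so the code also builds without OpenMP.

```c
#include <assert.h>

/* STREAM-style triad: a[j] = b[j] + scalar * c[j]. */
void triad(double *a, const double *b, const double *c,
           double scalar, long n)
{
    assert(n >= 0);
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (long j = 0; j < n; ++j)
        a[j] = b[j] + scalar * c[j];
}

/* Bandwidth in MB/s (1 MB = 1e6 bytes) for one triad pass over n
   elements, counting 3 x 8 bytes of traffic per element. */
double triad_mb_per_s(long n, double seconds)
{
    assert(seconds > 0.0);
    return 3.0 * 8.0 * (double)n / seconds * 1e-6;
}
```

Each element costs 24 bytes of traffic for 2 flops, so the FPU width hardly matters: the loop finishes when the memory system does.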
Summary of measurements per node plotted in roofline model
            | Interlagos 6276                    | Interlagos 6275 “Fangio”
Workload    | GB/s      | DP GF/s      | AI=F/B  | GB/s       | DP GF/s     | AI=F/B
---------------------------------------------------------------------------------
HPL         | =6*4=24   | =7.8*32=250  | 10.6    | =1.8*4=6.8 | =2.3*32=75  | 11.7
STR. TRIAD  | =17*4=68  | =0.5*32=16   | 0.08    | =17*4=68   | =0.5*32=16  | 0.08
OPENFOAM    | =15*4=60  | =0.8*32=25   | 0.41    | =14*4=56   | =0.7*32=22  | 0.39
• 1 computer has 2 processors with a total of 4 numanodes and 32 cores in 16 compute units.
• 1 numanode has a total of 4 compute units.
• Memory bandwidth in GB/s is measured per numanode (in red).
• Double precision floating point is measured per core (in red).
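HPL efficiency per compute unit (2 cores):
Interlagos: (2*7.8)/(2.3GHz*8) = 85% eff
Fangio: (2*2.3)/(2.3GHz*2) = 100% eff !!
2.3GHz is the nominal frequency, not the effective one, because of boost.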
Roofline for Interlagos and Fangio
[Figure: roofline plot. Vertical axis: GF/s (log 2 scale, 8 to 256). Horizontal axis: arithmetic intensity, (GF/s)/(GB/s) (log 2 scale, 0.125 to 32). Compute roofs: AMD Interlagos (250GF/s) and AMD Fangio (75GF/s). Measured and plotted points: HPL (Interlagos), HPL (Fangio), TRIAD.]
Annotations on the plot:
• HPL: 75% perf drop. L3 BLAS (eg. DGEMM) benefits from vectorization: FMA, AVX, SSE.
• TRIAD: 0% perf drop.
• Sparse algebra such as CFD apps (OpenFOAM, FLUENT, STARCCM, ..): ~6-8% perf drop.
• SPECfp: 3-20%, 8% average perf drop.
• Data dependencies, scalar code: no benefits from vectorization.
• Both processors have the same memory bandwidth, ie. the same BW slope.
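The roofline bound itself is a one-line formula, sketched here in C. The function names are illustrative; the inputs are the compute roof in GF/s, the memory bandwidth in GB/s and the arithmetic intensity in flops per byte. The model is an idealized upper bound and does not predict the smaller drops seen in latency-bound or partly scalar codes.

```c
#include <assert.h>

/* Attainable GF/s under the roofline model: the lower of the compute roof
   and the bandwidth roof at a given arithmetic intensity (flops/byte). */
double roofline_gflops(double peak_gflops, double bw_gbytes, double intensity)
{
    assert(peak_gflops > 0.0 && bw_gbytes > 0.0 && intensity >= 0.0);
    double bw_bound = bw_gbytes * intensity;
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}

/* Arithmetic intensity at which the two roofs meet (the ridge point). */
double ridge_point(double peak_gflops, double bw_gbytes)
{
    assert(bw_gbytes > 0.0);
    return peak_gflops / bw_gbytes;
}

/* Fractional drop the model predicts when the compute roof is lowered
   from peak_hi to peak_lo at unchanged bandwidth. */
double predicted_drop(double peak_hi, double peak_lo,
                      double bw_gbytes, double intensity)
{
    double hi = roofline_gflops(peak_hi, bw_gbytes, intensity);
    double lo = roofline_gflops(peak_lo, bw_gbytes, intensity);
    return hi > 0.0 ? 1.0 - lo / hi : 0.0;
}
```

With roofs of 250 and 75 GF/s over 72 GB/s, the ridge points sit near 3.5 and 1.04 flops per byte. A kernel below about 1 flop per byte is bandwidth bound on both processors and the model predicts no drop, while a kernel well above 3.5 is predicted to lose 70%.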
Performance impact on SPEC fp 2006 rate peak
• SPEC website link: www.spec.org
• Runs were done with the peak flags configuration in order to make optimal use of compiler technology.
• In this case the Open64 compiler was used.
• Runs were done with only 1 copy per Bulldozer unit, so that each process/copy can fully utilize the available computing resources without constraints from the resources shared within the Bulldozer compute unit (eg. L2, FPU, instruction scheduler).
[Figure: resource utilization]
Performance impact on SPEC fp 2006 rate peak (cont)
Benchmark | Application area | Brief description | % perf. drop
---------------------------------------------------------------
Bwaves | Fluid Dynamics | 3D transonic transient laminar viscous flow. | 0.09%
Gamess | Quantum Chemistry | Self-consistent field calculations are performed using the Restricted Hartree Fock method. | -10.51%
Milc | Quantum Chromodynamics | A gauge field generating program for lattice gauge theory programs with dynamical quarks. | 0.10%
Zeusmp | Fluid Dynamics | NCSA code, CFD simulation of astrophysical phenomena. | -7.47%
Gromacs | Biochemistry / Molecular Dynamics | Newtonian equations of motion for hundreds to millions of particles. | -32.17%
cactusADM | General Relativity | Solves the Einstein evolution equations using a staggered-leapfrog numerical method. | -2.01%
Leslie3d | Fluid Dynamics | CFD, Large Eddy Simulation. | -0.44%
Namd | Biology / Molecular Dynamics | Large biomolecular systems. The test case has 92,224 atoms of apolipoprotein A-I. | -24.23%
[Slide annotation: two of the benchmarks on this slide are marked "GPU candidate".]
Benchmark | Application area | Brief description | % perf. drop
---------------------------------------------------------------
dealII | Finite Element Analysis | Adaptive finite elements and error estimation. Helmholtz-type equation. | -9.09%
Soplex | Linear Programming, Optimization | Simplex algorithm and sparse linear algebra. Test cases include railroad planning and military airlift models. | 1.86%
Povray | Image Ray-tracing | Image rendering. The test case is a 1280x1024 anti-aliased image of a landscape. | -12.15%
Calculix | Structural Mechanics | Finite element code for linear and nonlinear 3D structural applications. | -26.82%
GemsFDTD | Computational Electromagnetics | Solves the Maxwell equations in 3D using the finite-difference time-domain (FDTD) method. | -0.67%
Tonto | Quantum Chemistry | Molecular Hartree-Fock wavefunction calculation to better match experimental X-ray diffraction data. | -14.43%
Lbm | Fluid Dynamics | "Lattice-Boltzmann Method" to simulate incompressible fluids in 3D. | 0.58%
Wrf | Weather | Weather modeling from scales of meters to thousands of kilometers. | -0.95%
Sphinx3 | Speech recognition | Speech recognition system from Carnegie Mellon University. | -3.00%
AVERAGE REAL PERFORMANCE DROP WHEN THE THEORETICAL FLOPs ARE REDUCED BY 75% | -8.94%
[Slide annotation: one of the benchmarks on this slide is marked "GPU candidate".]
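The reported average can be checked against the 17 per-benchmark values in the two SPEC fp tables. The sketch below is an assumption about how the average was formed, not a statement from the presentation: it treats each value as a percentage change in score and compares an arithmetic mean with a geometric mean of the ratios, the kind of averaging SPEC uses for its overall metrics. The arithmetic mean comes out near -8.3%, while the geometric mean comes out near -8.9%, which is consistent with the -8.94% on the slide. Function names are illustrative, and the n-th root is written as a small Newton iteration only so that the sketch does not depend on libm.

```c
#include <assert.h>
#include <stddef.h>

/* n-th root of p > 0 by Newton iteration on f(r) = r^n - p. */
static double nth_root(double p, int n)
{
    double r = p > 1.0 ? p : 1.0;   /* start at or above the root */
    for (int it = 0; it < 200; ++it) {
        double rn1 = 1.0;           /* r^(n-1) */
        for (int k = 0; k < n - 1; ++k)
            rn1 *= r;
        r -= (rn1 * r - p) / (n * rn1);
    }
    return r;
}

/* Arithmetic mean of percentage changes. */
double mean_change(const double *pct, size_t n)
{
    assert(n > 0);
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += pct[i];
    return s / (double)n;
}

/* Geometric mean of the ratios (1 + pct/100), returned as a percentage. */
double geomean_change(const double *pct, size_t n)
{
    assert(n > 0);
    double prod = 1.0;
    for (size_t i = 0; i < n; ++i)
        prod *= 1.0 + pct[i] / 100.0;
    return (nth_root(prod, (int)n) - 1.0) * 100.0;
}
```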
Performance impact on CFD apps
• Most CFD apps with an Eulerian formulation use sparse linear algebra to represent the linearized Navier-Stokes equations on unstructured grids.
• The higher the order of the discretization scheme, the higher the arithmetic intensity.
• Data dependencies in both space and time prevent vectorization.
• Large datasets have low cache reuse.
• Cores spend most of the time waiting for new data to arrive in the caches.
• Once the data is in the caches, the floating point instructions are mostly scalar instead of packed.
• Compilers have a hard time finding opportunities to vectorize loops.
• Loop unrolling and partial vectorization of independent data help very little, because the cores are still waiting for that data.
• Overall, low performance from the FLOP/s point of view.
• Therefore, capping the FPU in terms of FLOPs/clk has very little impact on application performance (a measured drop of about 6-8%).
• Theoretical FLOP/s is therefore not a good indicator of how applications such as CFD ones (and many more) will perform.
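The points above show up in the kernel at the heart of most sparse solvers, a sparse matrix-vector product. The following is a minimal C sketch in compressed sparse row (CSR) storage; it is a generic textbook kernel, not code from OpenFOAM, FLUENT or STARCCM, and the function name is illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* y = A*x for a sparse matrix in CSR storage. row_ptr has n_rows + 1
   entries; col_idx and val hold the column and value of each nonzero. */
void spmv_csr(size_t n_rows, const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y)
{
    assert(row_ptr != NULL && y != NULL);
    for (size_t i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];   /* indirect load: a gather */
        y[i] = sum;
    }
}
```

Each nonzero contributes 2 flops but requires loading a value, an index and an entry of x at a data-dependent address. The short, irregular inner loop and the indirect access are what keep the arithmetic scalar and the intensity well below 1 flop per byte.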
What should we do, moving forward?
• Multidisciplinary teams to work on
– Algorithm research and development to make algorithms more hardware aware.
– Software research and development to implement the algorithms efficiently (eg. communication avoidance, dynamic task scheduling, work stealing, locality, power awareness, resilience, …).
– Interaction between domain scientists and computer (HW+SW) scientists to develop new formulations of the equations that deliver algorithms better suited to new computer architectures.
– Research and development on compiler and programming language technology to detect algorithm properties and exploit hardware features.
• Supercomputing datacenter institutions to work on
– Enabling science by proper exploitation of computational resources.
– Multidisciplinary teams educating scientists on how to use the resources.
– Supercomputing investments should be funded and measured in terms of the number and quality of scientific projects, not in terms of CPU utilization (eg. CPU utilization isn’t CPU efficiency, just as theoretical FLOPs aren’t real application performance).