40
Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering LPC 2016 Open Source Technology Center

Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

  • Upload
    others

  • View
    22

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

Improving Linux Performance with GCC latest technologies

Victor Rodriguez SW Engineer OTC– Advance System Engineering LPC 2016

Open Source Technology Center

Page 2: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

Agenda •  Unused resources •  AVX technology •  Function Multi Versioning •  Auto FDO and Profiling

2

Page 3: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

Unused resources….

https://i.ytimg.com/vi/X0ymEx64RLw/maxresdefault.jpg

Page 4: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

Lets start with something simple

#define MAX 1000000 int a[256], b[256], c[256]; int main () { int i,j; for (j=0;j<MAX;j++){ for (i=0; i<256; i++){ a[i] = b[i] + c[i]; } return 0; }

Page 5: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

Vectorization … what is it ?

Page 6: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

Vectorization …this is it ?

Page 7: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

Vectorization … How to enable? $ gcc -fopt-info-vec sanity.c –O2 –ftree-vectorize

$ gcc -fopt-info-vec sanity.c –O3

HW/SW characteristics

https://github.com/VictorRodriguez/autofdo_tutorial/blob/master/sort.c

Page 8: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

Why don’t we extend the register more ?

8

Page 9: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

AVX2 $ gcc -O3 sanity.c -fopt-info-vec -mavx2 -o sanity

511 -> 256 255 -> 128 127 - > 0

AVX 512 AVX2/AVX SSE ZMM0 YMM0 XMM0 ZMM1 YMM1 XMM1 ZMM2 YMM2 XMM2 ZMM3 YMM3 XMM3 ZMM4 YMM4 XMM4 ZMM5 YMM5 XMM5 ZMM6 YMM6 XMM6

Page 10: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

AVX technology advantage

603

38

26

0 100 200 300 400 500 600 700

None

Vectorization

AVX2

Time ( ms )

Execution time of array addition( Lower is better ) HW/SW

characteristics

Page 11: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

11

Intel AVX-512 will be first implemented in the future Intel® Xeon Phi™ processor and coprocessor known by the code name Knights Landing, and will also be supported by some future Xeon processors scheduled to be introduced after Knights Landing

Westmere SandyBridge IvyBridge Haswell Broadwell Intel® Xeon Phi™

Future Xeon processors

AVX AVX2 AVX-512

Page 12: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

12

How do I apply that to Linux ?

Multiple Libraries

GLIBC

One Libraries

FMV

Page 13: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

GLIBC patch : https://github.com/clearlinux-pkgs/glibc/blob/master/0001-Add-avx2-fake-capability-like-tls.patch

13

Page 14: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

14

+ /* Add 'avx2' capability on x86_64 */

Page 15: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

15

Make it work in ldconfig

Page 16: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

In spec files:

16

https://github.com/clearlinux-pkgs/openblas/blob/master/openblas.spec

Page 17: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

17

Page 18: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

18

Page 19: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

Real case : example 1

R language

Open BLAS

%configure --disable-static --without-x --with-system-zlib --with-system-bzlib --with-system-pcre --with-system-xz --enable-BLAS-shlib --enable-R-shlib --with-blas="-lopenblas" --with-cairo --enable-lto

Page 20: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

Real case : example 1

Page 21: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

Problem ?

21

AVX2

SSE

AVX

AMD

How many binaries should I

deploy ?

Page 22: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

Why don’t we put everything in the same box

22

avx2

sse

avx

Page 23: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

FMV How does it works ?

23

http://lwn.net/Articles/691932/

In the old times and only in C++ ( before GCC 6)

The target() directives will compile the functions for instruction-set extensions (e.g. sse4.2) or for specific architectures (e.g. arch=atom).X

Page 24: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

24

•  Here, for each function, the developer needed to create specific functions and code for each target.

•  That would have required extra overhead in the code; increasing the

number of LOC in a program for FMV makes it more clunky to manage and maintain.

Clunky

Page 25: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

25

•  Fortunately, GCC 6 solves this problem: it supports FMV in both C and C++ code with a single attribute to define the minimum set of architectures to support.

•  This makes it easier to develop Linux applications that can take advantage of enhanced instructions, without the overhead of replicating functions for each target.

Image taken of : http://www.picgifs.com/graphics/t/tux/graphics-tux-510908.gif

Thanks to !!! Evgeny Stupachenko

Page 26: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

Function Multiversioning (FMV)

avx2

sse

avx

#defineMAX1000000inta[256],b[256],c[256];__attribute__((target_clones("avx2","arch=atom","default")))voidfoo(){inti,x;

for(x=0;x<MAX;x++){for(i=0;i<256;i++){ a[i]=b[i]+c[i];}}

}intmain(){foo();return0;}

Page 27: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

FMV We will build that with: $ gcc -O3 sanity.c -o sanity

And you can watch the magic with objdump like this: gcc -O3 sanity.c -g -c

objdump -S sanity.o

Page 28: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

FMV

Page 29: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

29

CPUID selection •  In GCC 4.8, FMV had a dispatch priority rather than a CPUID

selection. •  Function versions with more advanced features got higher priority. •  For example, a version targeted for AVX2 would have a higher

dispatch priority than a version targeted for SSE2.

•  In GCC 6, the resolver checks the CPUID and then calls the corresponding function. It does this once per binary execution.

•  So when there are multiple calls to the FMV function, only the first call will execute the CPUID comparison; the subsequent calls will find the required version by a pointer.

•  This technique is already used for almost all glibc functions. For example, glibc has memcpy() optimized for each architecture, so when it is called, glibc will call the proper optimized memcpy().

Page 30: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

30

Results

Page 31: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

Profiling… Helping the compiler find the most efficient path...

31

Page 32: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

Profiling.. Learning on the road

Invasive Profiling:

PGO

Advantages:

Very precise

Disadvantages:

High overhead

Non-Invasive Profiling: AutoFDO

Advantages:

Small overhead

Disadvantages:

not precise : statistical data

Page 33: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

MariaDB + PGO

Page 34: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

MariaDB + PGO

34

Page 35: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

MariaDB + PGO + Open Stack

Page 36: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

36 https://gcc.gnu.org/wiki/AutoFDO/Tutorial

Page 37: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

AWK and AutoFDO ?

0 100 200 300 400 500 ms

AWK benchmark ( lower is better )

With AutoFDO Without AutoFDO

37

Thanks to !!! Andi Kleen

https://gcc.gnu.org/wiki/AutoFDO/Tutorial

Page 38: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

Bugs .. Not everything was pretty

38

LTO AutoFDO

FMV Macros

Page 39: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

Unused resources….

https://i.ytimg.com/vi/X0ymEx64RLw/maxresdefault.jpg

Page 40: Improving Linux Performance with GCC latest technologies€¦ · Improving Linux Performance with GCC latest technologies Victor Rodriguez SW Engineer OTC– Advance System Engineering

Thank you

40