Enabling POWER 8 advanced features on Linux

Sébastien Chabrolles, Julien Limodin, Fabrice Moyen
PowerSystem Linux Center, IBM Montpellier

© Copyright IBM Corporation 2016. Technical University/Symposia materials may not be reproduced in whole or in part without the prior written permission of IBM.

IBM Systems Technical Events | ibm.com/training/events

POWER8 Hardware Accelerators

On-Chip Accelerators (NX):

- Symmetric crypto
- Compression engine
- Random number generator
- One NX complex per chip
- A given NX can access all memory in the SMP
- A given NX can be accessed by any core
- Can be accessed via a PowerVM hypervisor call

In-Core Accelerators:

- Symmetric crypto
- Private per core
- Leverage the vector unit (VMX)
- Direct access for guest/VM (including KVM)

IBM POWER8:

- 12 cores per socket (from 3 to 4 GHz)
- 8 HW threads / core (SMT technology)
- Large cache (96 MB : 8 MB / core)
- High memory bandwidth (~200 GB/s)


1. Transparent Memory Compression

2. -

3. Power8 Split-Core

Enable POWER 8 advanced features on Linux


Transparent Memory Compression

Transparent Memory Compression is a feature provided by the operating system (kernel) that dynamically compresses process memory without the process being aware of it.

PowerVM with AIX provides this functionality via AME (Active Memory Expansion).

Unfortunately, AME does not exist for Linux.

Linux has an alternative solution named zswap.

Zswap is a feature that hooks into the read and write sides of the swap code and acts as a compressed cache for pages going to and from the swap device.

Like AME, zswap can use the Power NX compression accelerator (842) to improve compression performance.

But unlike AME, zswap has some restrictions:

Paging devices are needed, with enough space to store the uncompressed data.

but still the real one.

Application processes must allow themselves to be swapped out.


P8 NX (on-chip) block diagram

Second generation Nest Accelerator complex*

- Encryption engine
- Random number generator
- Two 842 compression / decompression engines
  - Proprietary IBM Research algorithm
  - SRAM based dictionary compression
  - Used by AME
  - Good compression ratio at high bandwidth
  - 106% of LZO on 190+ benchmarks
  - 158% of compression ratio of software DEFLATE with FHT on Canterbury corpus
  - Only available via PowerVM or BareMetal Linux

* "...-chip accelerators for cryptography and active ...", IBM J. Res. & Dev., vol. 57, no. 4, Nov./Dec. 2013.

[Figure: P8 NX block diagram showing the on-chip SMP interconnect interface, the DMA controller, the 842, RNG, AES and SHA engines with their I/O buffers (IOB) on channels 0 to 3, and the ingress and egress arrays.]


Zswap!

To test zswap, we use a well-known Java benchmark (SPECjbb) and run it several times while increasing the JVM heap size.

Test system: 1 POWER8 core, 10 GB of physical memory, Ubuntu 16.04. JVM heap size: from 9 GB to 18 GB.

1- Baseline test with zswap deactivated

2- Test with zswap and software compression (default)

3- Test with zswap and Power HW compression (842)


Memory Over-Allocation test with SPECjbb2005 (Baseline)

[Figure: SPECjbb2005 performance and memory over-allocation, 1 P8 core, SMT8, 10 GB of memory. X axis: JVM heap size (9 to 18 GB). Y axis: % bops vs nominal (0 to 120). Series: zswap off. Region: memory over-commitment.]

Annotated result: 10% of nominal performance (due to memory thrashing).


SWAP / Paging Activity

1- Swap Out / Page Out: When the memory is full, a process (LRUD) scans memory and moves the [...] device. Asynchronous background task => No impact on [...]

2- Swap In / Page In: When a page fault occurs and the pages are located in the paging device, those pages must be moved back to memory. As physical disks are much slower than memory

=> THIS HURTS PERFORMANCE !!!

[Figure: system memory and swap device, with the swap out and swap in flows between them.]
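The swap-in and swap-out activity described above can be observed from the kernel counters. The following is a minimal sketch, assuming a Linux system that exposes the cumulative `pswpin` and `pswpout` page counters in `/proc/vmstat`; the function name `swap_counters` is only illustrative.

```shell
# swap_counters [VMSTAT_FILE]
# Prints the cumulative number of pages swapped in (pswpin) and
# swapped out (pswpout) since boot, one "name=value" pair per line.
# The file argument defaults to /proc/vmstat and exists mainly so the
# function can be tried against a saved copy.
swap_counters() {
    awk '$1 == "pswpin" || $1 == "pswpout" { print $1 "=" $2 }' \
        "${1:-/proc/vmstat}"
}
```

Calling `swap_counters` twice a few seconds apart and comparing the values gives a rough idea of the swap-in rate, which is the part that hurts performance.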


Memory Over-Allocation test with SPECjbb2005 (Swap I/O)

[Figure: Swap I/O activity, SPECjbb2005 memory over-allocation, 1 P8 core, SMT8, 10 GB of memory. X axis: JVM heap size (9 to 18 GB). Y axis: swap I/O (MB/s, 0 to 120). Series: zswap off. Region: memory over-commitment.]

A single SAS disk is used as the swap device. It reaches its limit at ~100 MB/s (50% read).


In the memory thrashing case, the non-deterministic latency and performance degradation that I/O introduces could be fatal to your [...]

The I/O storm could even prevent you from connecting to your system or starting any [...]

We need a way to smooth out this I/O storm and performance cliff as memory demand meets memory capacity.

Zswap!


ZSWAP requirement

1. Zswap is directly available in the Linux Kernel since v3.11

RedHat 7, CentOS 7, Fedora 19

Suse 12

Ubuntu 14.04

Enable zswap at boot time by adding the option zswap.enabled=1 to the kernel command line in your boot loader.

2. Power NX (on-chip) acceleration (842) is only available for PowerVM and BareMetal Linux.

Not Available today for PowerKVM guest

cat /proc/device-tree/ibm,platform-facilities/ibm,compression-v1/status should return okay

Note: Ubuntu needs kernel 4.2 or above to get access to the Power NX hardware (starting with Ubuntu 15.10)

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1488495

Enable zswap HW compression with zswap.compressor=842 in your boot loader.
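The device-tree check above can be wrapped in a small helper. This is a sketch only: the function name `nx842_status` and its optional root argument are illustrative, and the path is the one quoted on this slide, so it applies to systems that expose `ibm,platform-facilities` in their device tree.

```shell
# nx842_status [DEVICE_TREE_ROOT]
# Prints the status string of the NX compression node ("okay" when the
# accelerator is usable) or "unavailable" when the node cannot be read.
nx842_status() {
    root=${1:-/proc/device-tree}
    node="$root/ibm,platform-facilities/ibm,compression-v1/status"
    if [ -r "$node" ]; then
        # Device-tree strings are NUL-terminated: strip the terminator.
        tr -d '\0' < "$node"
        echo
    else
        echo "unavailable"
    fi
}
```

A boot script could then switch to zswap.compressor=842 only when `nx842_status` prints `okay`, and keep the default software compressor otherwise.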


Enabling POWER HW compression engine (842) with zswap

RedHat : 1- Enable Zswap with 842 compressor at boot time.

vi /etc/sysconfig/grub

add zswap.enabled=1 zswap.compressor=842 to GRUB_CMDLINE_LINUX

2- Regenerate your grub.cfg file.

grub2-mkconfig > /boot/grub2/grub.cfg

3- Add 842 kernel modules to your ramdisk

echo 842 > /etc/modules-load.d/842.conf

dracut -f

4- reboot and verify with dmesg | grep zswap

[ 1.064790] zswap: loaded using pool 842/zbud


Enabling POWER HW compression engine (842) with zswap

Ubuntu:

1- Enable zswap with the 842 compressor at boot time.

vi /etc/default/grub

add zswap.enabled=1 zswap.compressor=842 to GRUB_CMDLINE_LINUX

2- Regenerate your grub.cfg file.

update-grub

3- Add the 842 kernel module to your ramdisk.

echo 842 > /etc/modules-load.d/842.conf

vi /usr/share/initramfs-tools/hooks/842

Add the following lines:

#!/bin/sh -e
PREREQS=""
case $1 in
prereqs) echo "${PREREQS}"; exit 0;;
esac
. /usr/share/initramfs-tools/hook-functions
force_load 842

Make the hook executable and rebuild the initramfs:

chmod +x /usr/share/initramfs-tools/hooks/842

update-initramfs -u

4- Reboot and verify with dmesg | grep zswap

[ 1.064790] zswap: loaded using pool 842/zbud
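The manual edit of GRUB_CMDLINE_LINUX in step 1 can also be scripted. The sketch below is an assumption-laden helper, not part of the original procedure: `add_kernel_args` is a hypothetical name, it expects a double-quoted `GRUB_CMDLINE_LINUX="..."` line to already exist in the file, and it relies on GNU `sed -i`.

```shell
# add_kernel_args FILE ARG...
# Appends each ARG to the GRUB_CMDLINE_LINUX="..." line of FILE,
# skipping arguments that are already present on that line.
add_kernel_args() {
    file=$1
    shift
    for arg in "$@"; do
        # Fixed-string match, so the dots in "zswap.enabled" stay literal.
        if grep '^GRUB_CMDLINE_LINUX=' "$file" | grep -qF -- "$arg"; then
            continue
        fi
        sed -i "s|^GRUB_CMDLINE_LINUX=\"\(.*\)\"|GRUB_CMDLINE_LINUX=\"\1 ${arg}\"|" "$file"
    done
}
```

For example, `add_kernel_args /etc/default/grub zswap.enabled=1 zswap.compressor=842` (as root, after taking a backup of the file), followed by the grub.cfg regeneration step for your distribution.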


Zswap parameters and monitoring

Zswap parameters are located in /sys/module/zswap/parameters. You can change:

- compressor: [ lzo or 842 ], default lzo. Compression algorithm to use.
- enabled: [ Y or N ]. Enable zswap.
- max_pool_percent: [ 1 to 100 ], default 20. Compressed pool size limit (in % of RAM).
- zpool: [ zbud or zsmalloc ], default zbud. Compressed pool allocator.

zbud:

- stores 2 pages in one slot (compression ratio 2:1)
- evicts the oldest pages to disk when full

zsmalloc:

- can store more pages per slot than zbud (compression ratio ~ 3:1)
- but unlike zbud, redirects new allocations to the paging device when full (does not recycle old pages)

You can monitor zswap activity by looking at counters located in /sys/kernel/debug/zswap
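As an illustration of what can be derived from those counters, here is a minimal sketch that computes the current compression ratio. It assumes the debugfs directory contains `stored_pages` (number of pages held by zswap) and `pool_total_size` (compressed pool size in bytes); counter names have changed between kernel versions, so check what your kernel exposes. The function names are illustrative.

```shell
# zswap_ratio STORED_PAGES POOL_BYTES [PAGE_SIZE]
# Prints uncompressed size / compressed pool size with two decimals,
# or "n/a" when the pool is empty. PAGE_SIZE defaults to the system
# page size (often 64 KB on POWER, 4 KB on many other systems).
zswap_ratio() {
    stored_pages=$1
    pool_bytes=$2
    page_size=${3:-$(getconf PAGESIZE)}
    if [ "$pool_bytes" -eq 0 ]; then
        echo "n/a"
        return 0
    fi
    awk -v s="$stored_pages" -v p="$pool_bytes" -v ps="$page_size" \
        'BEGIN { printf "%.2f\n", (s * ps) / p }'
}

# zswap_report [DEBUGFS_DIR]
# Reads the two counters and prints the ratio; fails when they are not
# readable (debugfs not mounted, or not running as root).
zswap_report() {
    dir=${1:-/sys/kernel/debug/zswap}
    if [ ! -r "$dir/stored_pages" ] || [ ! -r "$dir/pool_total_size" ]; then
        echo "zswap counters not readable in $dir" >&2
        return 1
    fi
    zswap_ratio "$(cat "$dir/stored_pages")" "$(cat "$dir/pool_total_size")"
}
```

With zbud the value should stay at or below 2, while zsmalloc can go higher, in line with the 2:1 and ~3:1 figures quoted above.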


Zswap

1- Compress/Uncompress (zbud by default): Scanning and compressing use extra CPU cycles, but when a page fault occurs, it is much faster to get pages from the compressed pool in memory than from disk.

2- Swap Out / Page Out: When the compressed zpool is full, zbud moves the oldest compressed pages to the swap device.

3- Swap In / Page In: When a page fault occurs and the pages are located in the paging device, those pages must be moved back to memory. THIS HURTS PERFORMANCE !!!

[Figure: uncompressed memory, the zswap zpool (zbud) and the swap device.]


ZSWAP Memory Over-Allocation test with SPECjbb2005

[Figure: Testing zswap (zbud) with SPECjbb2005, 1 P8 core, SMT8, 10 GB of memory, max_pool_percent=40. X axis: JVM heap size (9 to 18 GB). Y axis: % bops vs nominal (0 to 120). Series: zswap off, zswap 842 (HW). Regions: memory over-commitment, zpool over-commitment.]

75% of nominal performance at 140% memory


ZSWAP HW vs Soft. compression

[Figure: Testing zswap (zbud) with SPECjbb2005, 1 P8 core, SMT8, 10 GB of memory, max_pool_percent=40. X axis: JVM heap size (9 to 18 GB). Y axis: % bops vs nominal (0 to 120). Series: zswap off, zswap lzo, zswap 842 (HW). Regions: memory over-commitment, zpool over-commitment. Annotation: x1.5.]


ZSWAP Memory Over-Allocation test with SPECjbb2005

[Figure: Testing zswap (zbud) with SPECjbb2005, 1 P8 core, SMT8, 10 GB of memory, max_pool_percent=40. X axis: JVM heap size (9 to 18 GB). Y axis: % bops vs nominal (0 to 120). Series: zswap 842 (HW). Regions: memory over-commitment, zpool over-commitment. Markers 1, 2 and 3 on the curve.]


Case 1 : Zswap with Memory not Over-Committed

- Enough memory available for the application
- No/little swap I/O occurring
- Zswap is idle (no CPU overhead)

=> You can use almost all the memory before zswap starts working

100% CPU user: best performance for the application

[Figure: memory used (uncompressed, up to 100%), free memory and the swap device.]


Case 2 : Zswap with Memory Over-Committed

- The application needs more memory than available
- Zswap starts working, compressing pages in/out of the zpool
- The zpool is increasing
- No/little swap I/O occurring

Below nominal performance due to memory scanning and unmapping. Compression/decompression are offloaded to NX 842.

25% CPU system due to page scanning: 75% of nominal performance on a CPU-bound application (worst case)

[Figure: memory used (uncompressed), the zswap zpool (zbud) and the swap device.]


Zswap with 842(HW) vs LZO(Soft)

Zswap HW compression 842

10GB RAM, 14GB Java Heap Size

25% of System CPU (overhead) due to memory page scanning.

Compression offloaded to NX 842

75% of nominal performance

Zswap Soft. Compression LZO

10GB RAM, 14GB Java Heap Size

50% of system CPU (overhead) due to memory page scanning and compression

50% of nominal performance

50% better CPU usage with POWER HW compression


ZSWAP Memory Over-Allocation (Swap IO activity)

[Figure: Testing zswap (zbud) with SPECjbb2005, 1 P8 core, SMT8, 10 GB of memory, max_pool_percent=40. X axis: JVM heap size (9 to 18 GB). Y axis: swap I/O (kB/s, 0 to 120). Series: zswap off, zswap on. Regions: memory over-commitment, zpool over-commitment. Markers 1, 2 and 3 on the curve.]

Little or no paging when running [...]


Case 3 : Zswap with Memory Over-Committed and Zpool Full

- The application needs more memory than available
- Zswap is working, compressing pages in/out of the zpool
- The zpool reaches the max_pool_percent limit (max_pool_percent=40): the compressed pool is full
- Some space needs to be freed in the zpool

=> Swapping in/out !!! Performance degradation

75% CPU wait I/O; only 10% CPU user. 10% of nominal performance due to waiting for pages on the swap device (swap in)

[Figure: memory used (uncompressed), the full zswap zpool (zbud) and swap in/out traffic with the swap device.]


Zswap Conclusion

Zswap is not AME, but it can really help to reduce the impact of paging activity and secure your production system at no cost and with no penalty:

Power8 NX 842 compression engines are available for PowerVM and BareMetal Linux

No impact when memory demand is below the installed RAM capacity.

Can maintain your system at 75% of nominal performance in the 100% CPU case (the worst scenario) and [...]

Zswap zbud: x1.4 memory expansion ratio (with max_pool_percent=40)

Need more? Then you can try zswap with the zsmalloc allocator.
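The x1.4 figure can be reproduced with simple arithmetic. The sketch below assumes an idealized model that is not spelled out in the slides: the zpool is completely full, it occupies exactly max_pool_percent of RAM, and every stored page compresses at the same ratio (2:1 for zbud, about 3:1 measured for zsmalloc). The function name is illustrative.

```shell
# expansion_percent MAX_POOL_PERCENT RATIO
# Effective memory capacity as a percentage of physical RAM:
#   uncompressed part : (100 - MAX_POOL_PERCENT) % of RAM
#   compressed part   : MAX_POOL_PERCENT % of RAM holding RATIO times
#                       its own size in application pages
expansion_percent() {
    pool_pct=$1
    ratio=$2
    echo $(( (100 - pool_pct) + pool_pct * ratio ))
}

expansion_percent 40 2   # zbud:     60 + 40 * 2 = 140, i.e. x1.4
expansion_percent 40 3   # zsmalloc: 60 + 40 * 3 = 180, i.e. x1.8
```

On the 10 GB test system this corresponds to roughly 14 GB of usable heap with zbud and 18 GB with zsmalloc before the zpool itself overflows to disk.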


Zswap with Zsmalloc compress pool (vs zbud)

1- Compress/Uncompress: Scanning and compressing use extra CPU cycles, but when a page fault occurs, it is much faster to get pages from the compressed pool in memory than from disk.

2- Swap In / Out: Unlike zbud, zsmalloc has no page replacement algorithm. When the zpool is full, paging out occurs directly from main memory to the paging device.

Zsmalloc can store more pages per slot than zbud (3:1 measured), resulting in a higher memory [...]

[Figure: uncompressed memory, the zswap zpool (zsmalloc) and the swap device.]


ZSWAP (zsmalloc) Memory Over-Allocation test with SPECjbb2005

[Figure: Testing zswap (zbud vs zsmalloc) with SPECjbb2005, 1 P8 core, SMT8, 10 GB of memory, max_pool_percent=40. X axis: JVM heap size (9 to 33 GB). Y axis: % bops vs nominal (0 to 120). Series: zswap off, zswap 842 (HW), zswap zsmalloc 842 (HW). Markers: memory over-commitment, zpool (zbud) limit, zpool (zsmalloc) limit. Annotation: x2.]

75% of nominal performance @ x1.8 memory size

50% of nominal performance @ x2 memory size


Monitor Zswap (zsmalloc) activity on 10GB VM with Grafana

[Figure: Grafana dashboard, with labels from 10 GB to 40 GB in 5 GB steps.]


Symmetric vs Asymmetric encryption

Asymmetric encryption:

- SLOW/Complex operation
- Private key never distributed
- Used to send the AES secret key
- Not optimized by Power8

Symmetric encryption (AES):

- FAST/Simple operation
- Secret key must be distributed
- Optimized by Power8


Anatomy of an SSL/HTTPS request

SSL handshake (executed only once): asymmetric encryption, secret key exchange

Data exchange: symmetric encryption

[Figure: exchange between the client browser and the server.]

The majority of the exchange uses symmetric encryption


POWER8 Hardware Accelerators

On-Chip Accelerators (NX):

- Symmetric crypto: AES, SHA
- True random number generator
- Must be used through a hypervisor call for guest/VM
- Better single-thread performance, larger bandwidth
- Symmetric crypto currently not available for PowerKVM guests

In-Core Accelerators:

- Symmetric crypto: AES, SHA
- Cyclic Redundancy Check
- Private per core
- Leverage the vector unit (VMX)
- Direct access for guest/VM


AES Symmetric Cryptography / SHA Hash Engine

AES Key lengths: 128b,192b,256b

Combination AES-SHA / SHA-AES supported

Move the data once to encrypt/decrypt and/then authenticate

I/O buffer (IOB) provides function

8.9Gbps throughput per engine for AES 128 CBC Encrypt at 2.4GHz, 256B message

7Gbps engine throughput for SHA-512 at 2.4GHz, 256B message

Supports byte aligned source and target data buffers, scatter/gather

AES modes supported:

- Electronic Codebook (ECB)
- Cipher Block Chaining (CBC)
- Counter (CTR)
- Counter with CBC-MAC (CCM)
- Galois Counter Mode (GCM)
- XCBC-MAC-96 (XMAC)

Hash modes supported:

- SHA1
- SHA2 SHA-256
- SHA2 SHA-512
- Keyed-hash MAC (HMAC)
- MD5


POWER8 Hardware Encryption

Source: Performance Characteristics of the POWER8 Processor, Alex Mericas, IBM Corporation

Algorithm   POWER7+ On-Chip   POWER8 On-Chip   POWER8 In-Core
AES-GCM     X                 X                X
AES-CTR     X                 X                X
AES-CBC     X                 X                X
AES-ECB     X                 X                X
SHA-256     X                 X                X
SHA-512     X                 X                X
RNG         X                 X
CRC                                            X

Cycles per Byte (1 core and in-core crypto)

Algorithm     POWER7+ (SW)   POWER8 (HW) Single Thread   POWER8 (HW) Multi Thread
SHA-512       35             10.7 (x3)                   2.6 (x13)
AES-128-ENC   17             4 (x4)                      0.8 (x21)
AES-256-ENC   21             5.5 (x3.8)                  1.1 (x19)

On-Chip Hardware Accelerators introduced with POWER7+

- POWER8 has the same accelerators
- Offload encryption for OS-based large messages (encrypted file systems, etc)
- On a virtualized system, access to the On-Chip (NX) Hardware Accelerators needs to be made through a hypervisor call.

In-Core acceleration is directly accessible to virtualized guests (no hypervisor call needed).

includes user-mode instructions to accelerate common algorithms


Linux on Power hypervisor compatibility matrix

Columns: Accelerator, Features, Baremetal, PowerVM guest, PowerKVM guest

On-chip: Compression (842), AES, RNG

In-core: AES, SHA, CRC


P8 Hardware Encryption Acceleration

Combination of on-chip accelerators for CPU offload with larger blocks of encryption work, and in-core instructions for small data sizes.

Exploitation available transparently under OS services and APIs

[Figure: software stack with four layers: hardware, hypervisor (H_COP calls), kernel and user space. Components shown: on-chip crypto, in-core crypto, random number generation and physical TPM; /dev/random and /dev/urandom; IPsec, TCP/IP and encrypted file system; OpenSSL 1.0.2 libcrypto, GSKit, standard crypto APIs, standard library and cryptographic library in C; applications and custom application use/libs. Use cases: strong keys and OpenSSL key generation, encrypted data in flight, encrypted data at rest. A legend marks where the acceleration can be exploited.]


How to enable the in-core crypto accelerator:

In Java, starting with IBM Java 7.1, AES is accelerated by using POWER8 in-core AES instructions by specifying -Dcom.ibm.crypto.provider.doAESInHardware=true on the JVM command line.

OpenSSL 1.0.2 and later use the POWER8 in-core VMX instructions and optimizations for AES/SHA.

All applications based on these versions of OpenSSL benefit from P8 encryption acceleration.

- Ubuntu: OpenSSL 1.0.2 in Ubuntu 15.10 and 16.04
- RedHat: still OpenSSL 1.0.1 => crypto not accelerated
- Fedora 23: OpenSSL 1.0.2
- Suse 12, OpenSuse 13: still OpenSSL 1.0.1 => crypto not accelerated

What can you do if you do not have OpenSSL 1.0.2?

Recompile your code with the Advance Toolchain (v9).

The Advance Toolchain is a GCC-based toolchain (provided by IBM for free) that provides POWER-optimized libraries (such as libcrypto).

You can then enable HW crypto acceleration in your application even if your Linux distribution does not provide the latest libcrypto (OpenSSL 1.0.2).
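To find out which situation applies to a given system, the installed OpenSSL version can be compared against 1.0.2. The helper names below are illustrative, and the sketch assumes GNU `sort -V` and the usual `OpenSSL <version> <date>` layout of the `openssl version` output.

```shell
# version_at_least MINIMUM ACTUAL
# Succeeds when ACTUAL is greater than or equal to MINIMUM,
# using the version ordering of GNU sort.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# openssl_version_field "OpenSSL 1.0.2g  1 Mar 2016"  prints 1.0.2g
openssl_version_field() {
    echo "$1" | awk '{ print $2 }'
}

# check_openssl: reports whether the installed OpenSSL is recent enough
# to contain the POWER8 in-core AES/SHA code paths.
check_openssl() {
    if ! command -v openssl >/dev/null 2>&1; then
        echo "openssl not found" >&2
        return 2
    fi
    ver=$(openssl_version_field "$(openssl version)")
    if version_at_least 1.0.2 "$ver"; then
        echo "OpenSSL $ver: 1.0.2 or later"
    else
        echo "OpenSSL $ver: older than 1.0.2, crypto not accelerated"
    fi
}
```

A version check only tells you the code is present; running `openssl speed -evp aes-128-gcm` before and after an upgrade is one way to see whether throughput actually changes on your machine.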


IBM Advance Toolchain for PowerLinux

URLs:

IBM Advance Toolchain for PowerLinux Documentation

Improving performance with IBM Advance Toolchain for PowerLinux

Description:

The IBM Advance Toolchain for PowerLinux is a set of open source development tools and runtime libraries which allows users to take leading edge advantage of IBM's latest POWER hardware features on Linux.

Over time, these libraries and latest compiler technologies are integrated into the shipping distributions.

However, the IBM Advance Toolchain for PowerLinux contains the latest tested and supported GNU Compiler Collection (GCC) compiler versions, tailored for Power systems, and packaged together with an expanding set of processor-tuned libraries, allowing you to take advantage of the latest technology without waiting.


Example of Apache and wget compiled with Advance Toolchain (1/3)

Idea was to recompile Apache and wget with Advance Toolchain to use the Power8 HW in-core cryptography in order to improve the performance.

Recompile on PowerLinux:

Get source code of Apache and wget from community

Install Advance Toolchain AT9

Recompile out-of-the-box with the following flags, no source code changes at all required.

export CFLAGS="-O3 -m64 -mcpu=power8 -mtune=power8"

export PATH=/opt/at9.0/bin/:$PATH

Configure, make and make install

Simple test: download a 10 GB file with wget from the Apache web server over HTTPS.

[Figure: wget downloading a 10 GB file from Apache (httpd) over SSL on the loopback interface.]


Example of Advance Toolchain with Apache and wget (2/3)

Standard Apache and wget provided by the repo

Transfer done in 3m10s

Compiled Apache and wget with Advance Toolchain

Transfer done in 23s (about 8 times faster than the 3m10s of the standard build)


Example of Advance Toolchain with Apache and wget (3/3)

[Figure: profiles of the standard build and of the Advance Toolchain build.]

Profiling shows that the AT version uses the P8-accelerated versions of ghash and aes


Example 2 : J2EE Application benchmark (DayTrader application)

60% better CPU Utilisation with Power in-core encryption

With P8 HW Crypto Without P8 HW Crypto


Enabling SMT on PowerKVM guests (1/2)

How do we schedule guest VCPUs onto physical CPU cores? Introduce the notion of "virtual core" (vcore).

VCPUs are allocated to vcores before being dispatched by the PowerKVM host to a real core.

- By default, 1 vcpu = 1 vcore
- Can be modified to x vcpus = 1 vcore to enable SMT

Example: PowerKVM with 2 P8 cores

Guest1, 2 vcpus. Default: 2 vcores, 1 thread each.

Guest2, 4 vcpus. Manually defined: 1 vcore, 4 threads (guest2.xml):

<vcpu>4</vcpu>
<cpu>
<topology sockets='1' cores='1' threads='4'/>
</cpu>

WAIT: No free core available. The vcore cannot be dispatched and waits for the next dispatch (time sharing).

An SMT level different from 1 will slow down guest dispatching.


Enabling SMT on PowerKVM guests (2/2)

In order to configure a KVM Guest, the number of VCPUs on a guest must be set to the product of cores and threads per core assigned to the guest, and the number of threads per core must be explictly set.

vcpu = sockets x cores x threads

For example, when using libvirt, the following settings give a guest with SMT=8 and 2 cores (16 vcpus in total):

<vcpu>16</vcpu>
<cpu>
  <topology sockets='1' cores='2' threads='8'/>
</cpu>

With that configuration, the guest OS is able to enable SMT=8 (the default) and use the 16 threads across the two assigned cores. It also allows the guest to control the SMT level dynamically, directly from the OS (ppc64_cpu --smt=x).
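The rule vcpu = sockets x cores x threads can be checked mechanically before editing a guest definition. The sketch below builds the two XML elements with Python's standard library; the helper name `topology_xml` is hypothetical, and the restriction of threads to 1, 2, 4 or 8 is an assumption based on the SMT modes mentioned in these slides.

```python
import xml.etree.ElementTree as ET


def topology_xml(sockets, cores, threads):
    """Build the <vcpu> and <cpu><topology/></cpu> fragment for a guest."""
    if threads not in (1, 2, 4, 8):
        # Assumption: only the POWER8 SMT levels are accepted here.
        raise ValueError("threads per core must be 1, 2, 4 or 8")
    vcpus = sockets * cores * threads  # vcpu = sockets x cores x threads

    vcpu = ET.Element("vcpu")
    vcpu.text = str(vcpus)

    cpu = ET.Element("cpu")
    ET.SubElement(cpu, "topology", sockets=str(sockets),
                  cores=str(cores), threads=str(threads))

    return (ET.tostring(vcpu, encoding="unicode") + "\n"
            + ET.tostring(cpu, encoding="unicode"))


# SMT=8 guest with 2 cores, as in the example above: 16 vcpus in total.
fragment = topology_xml(sockets=1, cores=2, threads=8)
print(fragment)
```

Computing the vcpu count from the topology, rather than typing both by hand, avoids definitions where the two values disagree.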

Page 45: Enabling POWER 8 advanced features on Linux


Enabling SMT topology with Kimchi on PowerKVM 3.1

Page 46: Enabling POWER 8 advanced features on Linux


Default guest SMT mode is 1 VCPU/vcore:
  Inefficient use of resources in whole-core mode (1 thread/core)
  Often chosen by users who are not familiar with POWER
  Often chosen by management agents (e.g. OpenStack)

Setting the topology is too complex in big cloud environments.

Up to now, the default core-split mode was whole-core:
  Good for single-thread performance
  Allows users to run SMT1, SMT2, SMT4 and SMT8 guests
  Hits overcommitment early, especially with SMT1 guests
  With a 20-core P8, at most 20 vcpus are dispatched in parallel by default

PowerKVM 3.1 addresses these points with 2 features:
1. (Sub)core sharing (piggybacking)
2. Dynamic micro-threading (split-core)

[Diagram: PowerKVM with 2 P8 cores, shown twice: both cores running Guest 1 (2 vcpus), and both cores running with Guest 1 and Guest 2.]

Page 47: Enabling POWER 8 advanced features on Linux


PowerKVM Micro-Threading (Split-Core)

No split-core: 1 full core available, with up to 8 parallel threads. Only 1 guest running at a time. (This is the only mode available with PowerVM.)

Split-core by 2: 2 sub-cores available, each with up to 4 parallel threads. Up to 2 guests running at a time.

Split-core by 4: 4 sub-cores available, each with up to 2 parallel threads. Up to 4 guests running at a time.
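The three modes follow one piece of arithmetic: the 8 hardware threads of a core are divided evenly among its sub-cores, and each sub-core can host one guest at a time. A minimal sketch of that arithmetic follows; the function names are hypothetical, and piggybacking and time sharing are deliberately ignored.

```python
THREADS_PER_CORE = 8  # POWER8: 8 hardware threads per core


def split_mode(subcores):
    """Describe a split level: 1 (no split), 2 or 4 sub-cores per core."""
    if subcores not in (1, 2, 4):
        raise ValueError("a core can be split by 1 (no split), 2 or 4")
    return {
        "subcores": subcores,
        "threads_per_subcore": THREADS_PER_CORE // subcores,
        "max_guests_at_a_time": subcores,
    }


def max_parallel_vcores(physical_cores, subcores):
    """Vcores that can run at the same instant on the whole host."""
    return physical_cores * split_mode(subcores)["subcores"]


for level in (1, 2, 4):
    print(split_mode(level),
          "| 20-core host:", max_parallel_vcores(20, level), "vcores in parallel")
```

For a 20-core host this gives 20, 40 and 80 vcores dispatched in parallel, which is why splitting delays overcommitment for guests with few threads per vcore.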

[Diagram: one core of an IBM Power8 chip, shown whole (1), split into 2 sub-cores (1, 2) and split into 4 sub-cores (1 to 4).]

Page 48: Enabling POWER 8 advanced features on Linux


PowerKVM Micro-Threading (Split-Core)

[Diagram: full POWER8 core (thr1 to thr8) over time: VM1, VM2, VM3 and VM4 run one after another, with context switching (hypervisor overhead) between them.]

Each Power8 core has 8 hardware threads. All threads share the same MMU(1) context and therefore must be in the same partition. Guests in single-thread (SMT1) mode cannot use the full core capacity.

Micro-Threading benefits:
  Better use of CPU resources
  More virtual machines per core
  Reduced over-commitment overhead (context switching)

Micro-Threading limitations:
  Guest SMT is limited to 2 or 4, depending on the split-core level (half core, quarter core)
  All threads run in SMT8 mode (lower single-thread performance)

PowerKVM introduces the possibility to split a Power8 core into 2 or 4 subcores: Micro-Threading (static in PowerKVM 2.1, dynamic in PowerKVM 3.1). Each subcore has its own MMU(1) and can be dispatched independently to a different guest (VM).

(1) The MMU (Memory Management Unit) is the hardware memory decoder that maps virtual addresses to physical addresses.

[Diagram: POWER8 core split into subcore1 to subcore4 (thr1 to thr8) over time: VM1, VM2, VM3 and VM4 each run on a subcore.]

Page 49: Enabling POWER 8 advanced features on Linux


PowerKVM 3.1 Dynamic Micro-Threading (SubCores)

With PowerKVM 3.1, the hypervisor may dynamically choose to split each core by two or by four in order to match vcpu needs to the available hardware resources.

First case: PowerKVM 3 with 1 P8 core. Splitting by 2 is optimum.

Guest1, 2 vcpus, manually defined as 1 vcore, 2 threads: <topology sockets='1' cores='1' threads='2'/>

Guest2, 4 vcpus, manually defined as 1 vcore, 4 threads: <topology sockets='1' cores='1' threads='4'/>

Second case: PowerKVM 3 with 1 P8 core. Splitting by 4 is optimum.

Guest1, 2 vcpus, manually defined as 1 vcore, 2 threads: <topology sockets='1' cores='1' threads='2'/>

Guest2, 2 vcpus, manually defined as 1 vcore, 2 threads: <topology sockets='1' cores='1' threads='2'/>

Guest3, 2 vcpus, manually defined as 1 vcore, 2 threads: <topology sockets='1' cores='1' threads='2'/>
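One way to reason about which split is optimum is sketched below. This is a hypothetical heuristic written for illustration, not the algorithm used by the PowerKVM hypervisor: it keeps only split levels whose sub-cores are wide enough for the widest vcore, then picks the smallest split that gives every vcore its own sub-core.

```python
THREADS_PER_CORE = 8  # POWER8: 8 hardware threads per core


def choose_split(vcore_threads):
    """Pick a split level (1, 2 or 4) for one physical core.

    vcore_threads: thread count of each vcore that wants to run on the core.
    Hypothetical heuristic, not the actual hypervisor logic.
    """
    if not vcore_threads:
        return 1  # nothing to run: leave the core whole
    widest = max(vcore_threads)
    # Split levels whose sub-cores can still hold the widest vcore.
    fitting = [s for s in (1, 2, 4) if THREADS_PER_CORE // s >= widest]
    for split in fitting:
        if split >= len(vcore_threads):
            return split  # every vcore gets its own sub-core
    return fitting[-1]  # not enough sub-cores: use the finest valid split


print(choose_split([2, 4]))     # Guest1 (2 threads) + Guest2 (4 threads)
print(choose_split([2, 2, 2]))  # three guests with 2 threads each
```

Under these assumptions the first call returns 2 and the second returns 4, which matches the two cases on the slide.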

To set the level of subcoring manually and statically, use the following at PowerKVM host level (all VMs must be offline):

ppc64_cpu --subcores-per-core      # Get number of subcores per core
ppc64_cpu --subcores-per-core=X    # Set subcores per core to X (1, 2 or 4)
ppc64_cpu --threads-per-core       # Get threads per core
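When these commands are driven from a script, it helps to validate the requested value before touching the host. The sketch below only builds the argument lists for the ppc64_cpu options quoted above and does not execute anything; the function names are hypothetical.

```python
def subcores_command(value=None):
    """argv to get (value is None) or set the number of subcores per core."""
    if value is None:
        return ["ppc64_cpu", "--subcores-per-core"]
    if value not in (1, 2, 4):
        raise ValueError("subcores per core must be 1, 2 or 4")
    return ["ppc64_cpu", "--subcores-per-core={}".format(value)]


def threads_command():
    """argv to get the number of threads per core."""
    return ["ppc64_cpu", "--threads-per-core"]


# Print the command lines instead of running them.
for argv in (subcores_command(), subcores_command(4), threads_command()):
    print(" ".join(argv))
```

On a real PowerKVM host, each list could be passed to `subprocess.run(argv, check=True)`, assuming sufficient privileges and that all VMs are offline as required above.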

Page 50: Enabling POWER 8 advanced features on Linux


PowerKVM 3.1 Micro-Threading (Subcore) DEMO

Page 51: Enabling POWER 8 advanced features on Linux


PowerKVM 3.1 Dynamic Micro-Threading (SubCores) DEMO

The demonstration is done with:

4 guests (virtual machines), all pinned onto one single core of a 20-core S822L Power8 server.

PowerKVM 3.1 virtualization. Each guest is defined with a manual topology of 1 vcore and 2 threads.

[Diagram: PowerKVM 3 with 1 P8 core running the four guests below.]

split1, 2 vcpus, manually defined as 1 vcore, 2 threads: <topology sockets='1' cores='1' threads='2'/>

split2, 2 vcpus, manually defined as 1 vcore, 2 threads: <topology sockets='1' cores='1' threads='2'/>

split3, 2 vcpus, manually defined as 1 vcore, 2 threads: <topology sockets='1' cores='1' threads='2'/>

split4, 2 vcpus, manually defined as 1 vcore, 2 threads: <topology sockets='1' cores='1' threads='2'/>

Page 52: Enabling POWER 8 advanced features on Linux


PowerKVM 3.1 Dynamic Micro-Threading (SubCores) DEMO (guest topology is 1 vcore, 2 threads)

[Figure: three panels plotting core threads 1 to 8 against time slices, showing how guests split1 to split4 are dispatched. Panel titles: No Micro-Threading allowed; Micro-Threading with 2 sub-cores max; Micro-Threading with 4 sub-cores max.]

Page 53: Enabling POWER 8 advanced features on Linux


400 VMs on a (small) 20-core S822LC?

Thanks to split-core (and piggybacking), even 400 VMs on a small but nevertheless powerful IBM S822LC is OK (even if definitely extreme).

Guest = 2 vcpus (default: 2 vcores, 1 thread each)

[Chart: pgbench PostgreSQL workload (tps) against number of VMs. Annotations: no need to split (thanks to piggyback with 20 VMs); split-core helps optimize core utilization; almost like PowerKVM 2.1 (piggyback not available with PowerKVM 2.1); PowerKVM 3.1 split-core benefits.]
