
© 2015 IBM Corporation

LinuxCon Japan 2015

OpenPOWER Technology Innovation

Paul Mackerras
Linux on Power Kernel Architect
PowerKVM Architect

June 5, 2015



Agenda

1. OpenPOWER Foundation Overview

2. POWER8 with NVIDIA GPU Integration

3. Coherent Accelerator Processor Interface (CAPI)


OpenPOWER Foundation Overview

© 2015 OpenPOWER Foundation

• Moore’s law alone no longer delivers sufficient performance gains

• Growing workload demands

• Numerous IT consumption models

• Mature Open software ecosystem

OpenPOWER, a catalyst for Open Innovation

• Rich software ecosystem

• Spectrum of power servers

• Multiple hardware options

• Derivative POWER chips

OpenPOWER is an open development community, using the POWER Architecture to serve the evolving needs of customers.

Performance of POWER architecture

amplified capability

Open Development

open software, open hardware

Collaboration of thought leaders

simultaneous innovation, multiple disciplines

Feeds back … resulting in client choice



New Chips & Components

Components & Systems

New Systems & Platforms

Bringing It All Together

• First Open server specification and motherboard combining OpenPOWER, OpenCompute and OpenStack (mock-up)

• First GPU-accelerated OpenPOWER developer platform

• Prototype of Firestone, a new high-performance server on the path to exascale

• First commercially available OpenPOWER server

• RedPower, the first China OpenPOWER 2-socket system coming to market in 2015

• Inspur 2-socket POWER8 Server

• ChuangHe China-branded OpenPOWER systems with POWER8

• Data Engine for NoSQL with 40TB CAPI-attached flash
– 24:1 Server consolidation for 3x lower cost per user
– Open Source Redis
– Clustering 192 Vcores + CAPI
– 40TB in 2U

• First China “local” POWER derivative chip, CP1

• Convey’s CAPI developer kit based on the company’s Xilinx-based co-processors

• CAPI shared virtual memory between an Altera Stratix V FPGA accelerator and a POWER8 CPU

• First commercially available OpenPOWER third-party server

• New CAPI-based solution: the ConnectX-4 adapter card by Mellanox

• Nallatech’s OpenPOWER CAPI Developer Kit


New POWER8 products are already appearing

http://www.enterprisetech.com/2014/04/28/inside-google-tyan-power8-server-boards/

The Tyan reference (ATX) board, SP010, measures 12” by 9.6”
➢ one single-chip module (SCM)
➢ four DDR3 memory slots
➢ two Gigabit Ethernet network interfaces
➢ keyboard and video
➢ intended for developers

The Google reference board
➢ two single-chip modules (SCM)
➢ four modified SATA ports
➢ Google use only

Available from October 2014: TYAN GN70-BP010 customer reference system
http://www.tyan.com/campaign/openpower/


Glimpses of things to come... OpenPower Systems Coming In Mid-2015

See http://www.enterprisetech.com/2014/12/09/openpower-systems-coming-mid-2015/ for more details

A Wistron-built system:
➢ Two-socket POWER8
➢ Eight Wistron-built DDR3 memory slots
➢ 2 SFF SATA disks
➢ 5 PCI-E gen 3 slots – half-height, half-length
➢ 4 CAPI-enabled PCI slots
➢ 1 TB memory max


POWER8 Processor

Caches
• 512 KB SRAM L2 / core
• 96 MB eDRAM shared L3
• Up to 128 MB eDRAM L4 (off-chip)

Cores
• 12 cores (SMT8)
• 8 dispatch, 10 issue, 16 exec pipe
• 2X internal data flows/queues
• Enhanced prefetching
• 64K data cache, 32K instruction cache

Accelerators
• Crypto & memory expansion
• Transactional Memory
• VMM assist
• Data Move / VM Mobility

Energy Management
• On-chip Power Management Micro-controller
• Integrated Per-core VRM
• Critical Path Monitors

Technology
• 22nm SOI, eDRAM, 15 ML, 650 mm²

Memory
• Up to 230 GB/s sustained bandwidth

Bus Interfaces
• Durable open memory attach interface
• Integrated PCIe Gen3
• SMP Interconnect
• CAPI (Coherent Accelerator Processor Interface)

ComputerWorld: To make the chip faster, IBM has turned to a more advanced manufacturing process, increased the clock speed and added more cache memory, but perhaps the biggest change heralded by the Power8 cannot be found in the specifications. After years of restricting Power processors to its servers, IBM is throwing open the gates and will be licensing Power8 to third-party chip and component makers.

The Register: the Power8 is so clearly engineered for midrange and enterprise systems for running applications on a giant shared memory space, backed by lots of cores and threads. Power8 does not belong in a smartphone unless you want one the size of a shoebox that weighs 20 pounds. But it most certainly does belong in a badass server, and Power8 is by far one of the most elegant chips that Big Blue has ever created, based on the initial specs.

PCWorld: With Power8, IBM has more than doubled the sustained memory bandwidth from the Power7 and Power7+, to 230 GB/s, as well as I/O speed, to 48 GB/s. Put another way, Watson’s ability to look up and respond to information has more than doubled as well.

Microprocessor Report: Called Power8, the new chip delivers impressive numbers, doubling the performance of its already powerful predecessor, Power7+. Oracle currently leads in server-processor performance, but IBM’s new chip will crush those records. The Power8 specs are mind boggling.

Source: Hotchips presentation


OpenPOWER Ecosystem

System Operating Environment Software Stack

A modern development environment is emerging based on tools and services

[Figure: Power Open Source Software Stack Components]
Software: Cloud Software; Standard Operating Environment (System Mgmt); xCAT; Operating System / KVM
Firmware: OpenPOWER Firmware
Hardware: OpenPOWER Technology; POWER8; CAPP; PCIe; CAPI over PCIe
Communities: Existing Open Source Software Communities; New OSS Community

Multiple Options to Design with POWER Technology Within OpenPOWER

• “Standard POWER Products” – 2014

• “Custom POWER SoC” – Future
– Customizable
– Framework to Integrate System IP on Chip

• Industry IP License Model


OpenPOWER firmware available under Apache V2 license on github since July 2014

Community: https://github.com/open-power

One stop place to build all firmware
– Includes cross compiler tool set, Hostboot, OPAL, and OCC
– Project will be used for all planned updates

Discussion Forum
– Ask a question


“Little endian” is making it easier to port applications

● POWER8 processors support execution in both big endian (BE) and little endian (LE) modes

● Most compiled open source software is designed (de facto) to run in little endian mode.

● Linux on Power has chosen to exploit little endian (LE) processor mode based on OpenPOWER partner feedback.

– Eases the migration of applications from Linux on x86.
– Enables simple data migration from Linux on x86.
– Simplifies data sharing (interoperability) with Linux on x86.
– Improves Power I/O offerings with modern I/O adapters and devices, e.g. GPUs.

● LE distributions for Linux on Power do NOT mean x86 applications magically run: applications must still be compiled for Power.

● LE enablement is facilitating discussions with new partners and software providers for Linux on Power

● AIX and IBM i will remain BE

BE and LE are simply different ways of ordering how data is stored. Important in 1980. Not as important in 2015. Market has moved.
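
To make the byte-ordering point concrete, here is a minimal C sketch (not from the talk) that decodes the same four bytes under each convention. The function names are illustrative only. Code that reads binary data through helpers like these behaves the same on BE and LE hosts, which is one reason sharing data files between x86 and Power is simpler when both sides use LE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when the CPU running this code stores the least
 * significant byte of an integer at the lowest address. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Decode 4 bytes stored most-significant-byte first (big endian). */
uint32_t load_be32(const unsigned char *p)
{
    assert(p != NULL);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Decode 4 bytes stored least-significant-byte first (little endian). */
uint32_t load_le32(const unsigned char *p)
{
    assert(p != NULL);
    return  (uint32_t)p[0]        | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

The bytes 12 34 56 78 (hex) decode to 0x12345678 with load_be32 and to 0x78563412 with load_le32; the stored data is identical and only the interpretation differs.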


Where to find more information? http://openpowerfoundation.org/



POWER8 with NVIDIA GPU Integration


IBM and NVIDIA deliver new acceleration capabilities for analytics, big data, and Java

✔ Bare-metal system, runs Ubuntu directly on hardware (no hypervisor)

✔ First step toward GPU-based solutions for analytics and HPC

Runs pattern extraction analytic workloads faster

Provides new acceleration capability for analytics, big data, Java, and other technical computing workloads

Delivers faster results and lower energy costs by accelerating processor intensive applications

Power System S824L
• Up to 24 POWER8 cores
• Up to 1 TB of memory
• 384 GB/s max memory bandwidth
• Up to 2 NVIDIA K40 GPU accelerators
• Ubuntu Linux running bare metal


Harnessing the power of GPUs

With their massively parallel architecture, GPUs can far exceed the computational throughput of CPUs when operating on large sets of floating point numbers.


GPU Programming with CUDA – A simple example

    /* N is the vector length; it is assumed to be defined elsewhere. */

    /* Kernel, declared before its first use. Each block adds the single
       element whose index matches its block number. */
    __global__ void add(int *a, int *b, int *c) {
        int tid = blockIdx.x;          /* identify your block number */
        c[tid] = a[tid] + b[tid];
    }

    void cuda_add(int *a, int *b, int *c) {
        int *dev_a, *dev_b, *dev_c;
        int len = N * sizeof(int);

        /* Allocate device memory for the three vectors */
        cudaMalloc((void**)&dev_a, len);
        cudaMalloc((void**)&dev_b, len);
        cudaMalloc((void**)&dev_c, len);

        /* Copy the inputs to the device */
        cudaMemcpy(dev_a, a, len, cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, b, len, cudaMemcpyHostToDevice);

        /* Language extension to invoke our kernel with N blocks
           executing in parallel, one thread per block */
        add<<<N,1>>>(dev_a, dev_b, dev_c);

        /* Copy the result back to the host */
        cudaMemcpy(c, dev_c, len, cudaMemcpyDeviceToHost);

        cudaFree(dev_a);
        cudaFree(dev_b);
        cudaFree(dev_c);
    }

[Figure: vector addition a + b = c, shown on the CPU ... and GPU]
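
To illustrate how the launch maps one block to one element, here is a hedged CPU-only sketch in plain C. It is an emulation for reading purposes, not CUDA, and the function names are hypothetical. A loop visits each block index in turn, where the GPU would be free to run all blocks in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Body of the kernel for one block: add the single element whose
 * index matches the block number (blockIdx.x in the CUDA version). */
void add_one_block(size_t block_idx, const int *a, const int *b, int *c)
{
    c[block_idx] = a[block_idx] + b[block_idx];
}

/* CPU stand-in for add<<<n, 1>>>: visits the n block indices one
 * after another instead of running them concurrently. */
void add_emulated(size_t n, const int *a, const int *b, int *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t block_idx = 0; block_idx < n; ++block_idx)
        add_one_block(block_idx, a, b, c);
}
```

Because every block writes a different element of c and reads nothing written by another block, the order in which blocks run does not affect the result. That independence is what lets the GPU execute them in parallel.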


Bringing GPU programming into Java

● CUDA4J library provides a Java API for managing and accessing GPU devices, libraries, kernels, and memory.

– Available with IBM Java on POWER

– There are times when you want this low level GPU control from Java

– API reflects the concepts familiar in CUDA programming

– Make use of Java exceptions, automatic resource management, etc.

– Handles copying data to/from the GPU, flow of control from Java to GPU and back, etc

New Java APIs:
CudaDevice – a CUDA capable GPU device
CudaBuffer – a region of memory on the GPU
CudaModule – user library of kernels to load into GPU
CudaKernel – for launching a device function
CudaFunction – a kernel's entry point
CudaEvent – for timing and synchronization
CudaException – for when something goes wrong


Fundamental types in CUDA4J

[Figure: fundamental types in CUDA4J and the relationships between them]
Types shown: CudaDevice, CudaModule, CudaLinker, CudaFunction, CudaKernel, CudaGrid, CudaStream, CudaEvent, CudaBuffer, CudaGlobal, CudaSurface, CudaTexture
Legend: relationship for generating an instance; relationship as an argument; corresponds to a HW feature in GPU
Annotation: used to combine multiple cubin/fatbin/PTXs into a single module
Other labels: Java; PTX (.func add { … }, .func foo { … }, .func bar { … }); execution engine; device memory; device events


GPU-enabling standard Java SE APIs

● Example: java.util.Arrays.sort(int[] a)

● Java employs heuristics that determine if the work should be off-loaded to the GPU

– Overhead of moving data to GPU, invoking kernel, and returning results means small sorts (<~20k elements) are faster on the CPU.

– Host may have multiple GPUs. Are any available for the task?

– Is there space for conducting the sort on the device?
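
The three checks above can be sketched as a single decision function. This is a hypothetical illustration in C, not IBM's actual heuristic. The cut-off constant is taken from the "~20k elements" figure on the slide, and the assumption that the device needs room for exactly one copy of the array is mine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cut-off, based on the "~20k elements" figure above. */
#define MIN_ELEMENTS_FOR_GPU 20000u

/* Decide whether sorting n ints should be off-loaded to a GPU.
 * free_gpus:         number of GPUs currently available for the task
 * device_free_bytes: free memory on the candidate device
 * Assumption: the device needs room for one copy of the array. */
bool should_offload_sort(size_t n, unsigned free_gpus, size_t device_free_bytes)
{
    if (n < MIN_ELEMENTS_FOR_GPU)
        return false;   /* transfer and launch overhead dominates */
    if (free_gpus == 0)
        return false;   /* no device available for the task */
    if (n > device_free_bytes / sizeof(int))
        return false;   /* array does not fit on the device */
    return true;
}
```

When the function returns false, the caller would simply run the ordinary CPU sort, so the result of the sort is the same either way and only the execution path changes.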


GPU-enabled array sort method performance

IBM Power 8 with Nvidia K40m GPU


JIT optimized GPU acceleration

[Figure: JIT pipeline. bytecodes → intermediate representation → optimizer → CPU code generator (CPU native) and GPU code generator (PTX ISA)]

● As the JIT compiles a stream expression it can identify candidates for GPU off-loading
– Arrays copied to and from the device implicitly
– Java operations mapped to GPU kernel operations
– Preserves the standard Java syntax and semantics

● Early steps

– Recognize a limited set of operations within the lambda expressions

– Redundant/pessimistic data transfer between host and device

– Limited heuristics about when to invoke the GPU and when to generate CPU instructions

● Java 8 streams allow developers to express computation as aggregate parallel operations on data

IntStream.range(0, N).parallel().forEach(i -> c[i] = a[i] + b[i]);


JIT / GPU optimization of Lambda expression

JIT recognized Java code for matrix multiplication using Java 8 parallel stream

Speed-up factor when run on a GPU enabled host

IBM Power 8 with Nvidia K40m GPU


NVLINK: Enhanced bandwidth interconnect

● High-speed interconnect between CPUs and GPU

● 5x to 12x higher bandwidth than PCIe Gen 3

● Planned for POWER8 machines in 2016


POWER8 Coherent Accelerator Processor Interface


Coherent Accelerator Processor Interface (CAPI) overview

[Figure labels: Custom Hardware Application; POWER8; CAPP; Coherence Bus; PSL; FPGA or ASIC; PCIe Gen 3]

Customizable Hardware Application Accelerator
• Specific system SW, middleware, or user application
• Written to durable interface provided by PSL

PCIe Gen 3
• Transport for encapsulated messages

POWER Service Layer (PSL)
• Present robust, durable interfaces to applications
• Offload complexity / content from CAPP

Virtual Addressing
• Accelerator can work with same memory addresses that the processors use
• Pointers de-referenced same as the host application
• Removes OS & device driver overhead

Hardware Managed Cache Coherence
• Enables the accelerator to participate in “Locks” as a normal thread

Lowers latency compared with the I/O communication model


Coherent Accelerator Processor Interface (CAPI) overview

[Figure: POWER8 processor with CAPP, connected over PCIe to an FPGA containing the IBM Supplied POWER Service Layer (CAPI) and accelerator functions 0, 1, 2, … n]

Typical I/O Model Flow:
DD Call → Copy or Pin Source Data → MMIO Notify Accelerator → Acceleration → Poll / Int Completion → Copy or Unpin Result Data → Ret. From DD Completion

Flow with a Coherent Model:
Shared Mem. Notify Accelerator → Acceleration → Shared Memory Completion


Programming CAPI

● FPGA is programmed in VHDL or Verilog; program has two parts:

– Power Service Layer (PSL) is supplied by IBM, implements cache coherence protocol and address translation (MMU)

– Accelerator Function Unit (AFU) is application-specific, supplied by card vendor, solution integrator or end-user

● Two types of AFU:

– “Dedicated” AFU can only be used by a single application

– “Directed” AFU supports multiple contexts and can be used by several applications concurrently

● CXL driver in Linux kernel on POWER8 host

– Creates /dev/cxl/afu0.0d etc. device nodes

● libcxl library provides user-level API

– Open-source, available at https://github.com/ibm-capi/libcxl.


Example CAPI Application Program

    /* Get first physical AFU and open it */
    afu = cxl_afu_next(NULL);
    afu_d = cxl_afu_open_h(afu, CXL_VIEW_DEDICATED);

    /* Map mmio and set queue size */
    cxl_mmio_map(afu_d, CXL_MMIO_LITTLE_ENDIAN);
    cxl_mmio_write64(afu_d, QUEUE_SIZE_OFFSET, QUEUE_SIZE);

    /* Setup Work Element Descriptor (WED) and start AFU */
    wed = &workitem_queue;
    cxl_afu_attach(afu_d, wed);

    /* Send a work request to the AFU */
    workitem_queue[next_req].buffer = mybuf;
    write_memory_barrier();
    workitem_queue[next_req].valid = 1;

    /* Read an event sent by the AFU */
    cxl_read_event(afu_d, &event);

    /* Wait till last queue entry done */
    while (!workitem_queue[last_req].done) {}

    /* Stop and close afu */
    cxl_afu_free(afu_d);
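
The "send a work request" step depends on ordering: the payload must be visible before the valid flag is. The following is a minimal sketch of that publish pattern using C11 atomics, with a CPU-side consumer standing in for the AFU. The struct layout and function names are hypothetical, and write_memory_barrier() in the slide is replaced here by a release store. Whether that is sufficient for a particular AFU is an assumption that would need checking against the accelerator's documentation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical work item layout; the real layout is defined by the AFU. */
struct workitem {
    void *buffer;       /* payload the accelerator should process */
    atomic_int valid;   /* set to 1 by the host once buffer is ready */
    atomic_int done;    /* set to 1 by the consumer when finished */
};

/* Publish one request: store the payload first, then set valid with
 * release ordering so the payload store cannot move after the flag. */
void submit_work(struct workitem *item, void *buf)
{
    assert(item != NULL);
    item->buffer = buf;
    atomic_store_explicit(&item->valid, 1, memory_order_release);
}

/* Consumer side of the same protocol (a CPU thread standing in for
 * the AFU): returns the payload once valid is set, otherwise NULL. */
void *poll_work(struct workitem *item)
{
    assert(item != NULL);
    if (atomic_load_explicit(&item->valid, memory_order_acquire) == 0)
        return NULL;
    return item->buffer;
}
```

Without the release/acquire pairing, a consumer could observe valid == 1 while still reading a stale buffer pointer, which is the failure the barrier in the original example guards against.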


Demonstrating the value of CAPI attached flash storage


Summary:

1. The OpenPOWER Foundation provides a framework for open innovation on hardware and software

2. The system software structure embraces many key open source technologies – KVM, OpenStack, and others

3. Opportunities abound for hardware innovation with the POWER8 processor as a foundation

OpenPOWER is transforming IBM, Power, and the industry!


Legal Statement

This work represents the view of the author and does not necessarily represent the view of IBM.

IBM, IBM (logo), OpenPower, POWER, POWER8, Power Systems, are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries.

Linux is a registered trademark of Linus Torvalds.

Other company, product and service names may be trademarks or service marks of others.
