© 2015 IBM Corporation
LinuxCon Japan 2015
OpenPOWER Technology Innovation
Paul Mackerras
Linux on Power Kernel Architect
PowerKVM Architect
June 5, 2015
Agenda
1. OpenPOWER Foundation Overview
2. POWER8 with NVIDIA GPU Integration
3. Coherent Accelerator Processor Interface (CAPI)
OpenPOWER Foundation Overview
© 2015 OpenPOWER Foundation
• Moore’s law no longer satisfies performance gain
• Growing workload demands
• Numerous IT consumption models
• Mature Open software ecosystem
OpenPOWER, a catalyst for Open Innovation
• Rich software ecosystem
• Spectrum of power servers
• Multiple hardware options
• Derivative POWER chips
OpenPOWER is an open development community, using the POWER Architecture to serve the evolving needs of customers.
Performance of POWER architecture
amplified capability
Open Development
open software, open hardware
Collaboration of thought leaders
simultaneous innovation, multiple disciplines
Feeds back … resulting in client choice
New Chips & Components
Components & Systems
New Systems & Platforms
Bringing It All Together
• First Open server specification and motherboard combining OpenPOWER, OpenCompute and OpenStack (mock-up)
• First GPU-accelerated OpenPOWER developer platform
• Prototype of Firestone, a new high-performance server on the path to exascale
• First commercially available OpenPOWER server
• RedPower, the first China OpenPOWER 2-socket system coming to market in 2015
• Inspur 2-socket POWER8 Server
• ChuangHe China-branded OpenPOWER systems with POWER8
• Data Engine for NoSQL with 40TB CAPI-attached flash
• 24:1 Server consolidation for 3x lower cost per user
– Open Source Redis
– Clustering 192 Vcores + CAPI
– 40TB in 2U
• First China “local” POWER derivative chip, CP1
• Convey’s CAPI developer kit based on the company’s Xilinx-based co-processors
• CAPI shared virtual memory between an Altera Stratix V FPGA accelerator and a POWER8 CPU
• First commercially available OpenPOWER third-party server
• New CAPI-based solution: the ConnectX-4 adapter card by Mellanox
• Nallatech’s OpenPOWER CAPI Developer Kit
New POWER8 products are already appearing
http://www.enterprisetech.com/2014/04/28/inside-google-tyan-power8-server-boards/
The Tyan reference (ATX) board, SP010, measures 12” by 9.6”
➢ one single-chip module (SCM)
➢ four DDR3 memory slots
➢ two Gigabit Ethernet network interfaces
➢ keyboard and video
➢ intended for developers
The Google reference board
➢ two single-chip modules (SCMs)
➢ four modified SATA ports
➢ Google use only
Available from October 2014:
TYAN GN70-BP010 Customer reference system
http://www.tyan.com/campaign/openpower/
Glimpses of things to come...
OpenPOWER Systems Coming In Mid-2015
See http://www.enterprisetech.com/2014/12/09/openpower-systems-coming-mid-2015/ for more details
A Wistron-built system:
➢ Two-socket POWER8
➢ Eight Wistron-built DDR3 memory slots
➢ 2 SFF SATA disks
➢ 5 PCIe Gen 3 slots – half-height, half-length
➢ 4 CAPI-enabled PCI slots
➢ 1 TB memory max
POWER8 Processor
Caches
• 512 KB SRAM L2 / core
• 96 MB eDRAM shared L3
• Up to 128 MB eDRAM L4 (off-chip)
Cores
• 12 cores (SMT8)
• 8 dispatch, 10 issue, 16 exec pipe
• 2X internal data flows/queues
• Enhanced prefetching
• 64K data cache, 32K instruction cache
Accelerators
• Crypto & memory expansion
• Transactional Memory
• VMM assist
• Data Move / VM Mobility
Energy Management
• On-chip Power Management Micro-controller
• Integrated Per-core VRM
• Critical Path Monitors
Technology
• 22nm SOI, eDRAM, 15 ML, 650mm²
Memory
• Up to 230 GB/s sustained bandwidth
Bus Interfaces
• Durable open memory attach interface
• Integrated PCIe Gen3
• SMP Interconnect
• CAPI (Coherent Accelerator Processor Interface)
ComputerWorld: To make the chip faster, IBM has turned to a more advanced manufacturing process, increased the clock speed and added more cache memory, but perhaps the biggest change heralded by the Power8 cannot be found in the specifications. After years of restricting Power processors to its servers, IBM is throwing open the gates and will be licensing Power8 to third-party chip and component makers.
The Register: the Power8 is so clearly engineered for midrange and enterprise systems for running applications on a giant shared memory space, backed by lots of cores and threads. Power8 does not belong in a smartphone unless you want one the size of a shoebox that weighs 20 pounds. But it most certainly does belong in a badass server, and Power8 is by far one of the most elegant chips that Big Blue has ever created, based on the initial specs.
PCWorld: With Power8, IBM has more than doubled the sustained memory bandwidth from the Power7 and Power7+, to 230 GB/s, as well as I/O speed, to 48 GB/s. Put another way, Watson’s ability to look up and respond to information has more than doubled as well.
Microprocessor Report: Called Power8, the new chip delivers impressive numbers, doubling the performance of its already powerful predecessor, Power7+. Oracle currently leads in server-processor performance, but IBM’s new chip will crush those records. The Power8 specs are mind boggling.
Source: Hotchips presentation
OpenPOWER Ecosystem
System Operating Environment Software Stack
A modern development environment is emerging based on tools and services
[Figure: Power Open Source Software Stack Components. Layers, top to bottom: Software (Cloud Software; Standard Operating Environment (System Mgmt), including xCAT; Operating System / KVM), Firmware (OpenPOWER Firmware), Hardware (OpenPOWER Technology: POWER8, CAPP, CAPI over PCIe). Labels: Existing Open Source Software Communities; New OSS Community.]
Multiple Options to Design with POWER Technology Within OpenPOWER
• “Standard POWER Products” – 2014
• “Custom POWER SoC” – Future
– Customizable
– Framework to Integrate System IP on Chip
• Industry IP License Model
OpenPOWER firmware available under Apache v2 license on GitHub since July 2014
Community: https://github.com/open-power
One-stop place to build all firmware
– Includes cross compiler tool set, Hostboot, OPAL, and OCC
– Project will be used for all planned updates
Discussion Forum
– Ask a question
“Little endian” is making it easier to port applications
● POWER8 processors support execution in both big endian (BE) and little endian mode (LE)
● Most compiled open source software is designed (de facto) to run in little endian mode.
● Linux on Power has chosen to exploit little endian (LE) processor mode based on OpenPOWER partner feedback.
– Eases the migration of applications from Linux on x86.
– Enables simple data migration from Linux on x86.
– Simplifies data sharing (interoperability) with Linux on x86.
– Improves Power I/O offerings with modern I/O adapters and devices, e.g. GPUs.
● LE distributions for Linux on Power do NOT mean x86 applications magically run: applications must still be compiled for Power.
● LE enablement is facilitating discussions with new partners and software providers for Linux on Power
● AIX and IBM i will remain BE
BE and LE are simply different ways of ordering how data is stored. Important in 1980. Not as important in 2015. Market has moved.
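To make the byte-ordering point concrete, here is a minimal C sketch (not from the talk; the function names host_is_little_endian, load_le32 and load_be32 are hypothetical). It shows how the same four bytes decode to different 32-bit values under LE and BE, and how portable code can read a fixed on-disk or on-wire order regardless of the host's mode.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 when the host stores the least significant byte first (LE),
 * 0 when it stores the most significant byte first (BE). */
int host_is_little_endian(void)
{
    uint32_t probe = 1;
    uint8_t first_byte;

    memcpy(&first_byte, &probe, 1);
    return first_byte == 1;
}

/* Decode a 32-bit value stored least significant byte first.
 * Works identically on BE and LE hosts because it never reinterprets
 * memory as an integer; it only combines individual bytes. */
uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] |
           ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}

/* Decode a 32-bit value stored most significant byte first. */
uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) |
           (uint32_t)p[3];
}
```

Code that casts a byte buffer straight to an integer type silently depends on the host's byte order, which is the usual reason data written on x86 needs care when read on a BE system. Running Linux on Power in LE mode sidesteps that class of problem for data shared with x86.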
Where to find more information? http://openpowerfoundation.org/
POWER8 with NVIDIA GPU Integration
IBM and NVIDIA deliver new acceleration capabilities for analytics, big data, and Java
✔ Bare-metal system, runs Ubuntu directly on hardware (no hypervisor)
✔ First step toward GPU-based solutions for analytics and HPC
Runs pattern extraction analytic workloads faster
Provides new acceleration capability for analytics, big data, Java, and other technical computing workloads
Delivers faster results and lower energy costs by accelerating processor intensive applications
Power System S824L
• Up to 24 POWER8 cores
• Up to 1 TB of memory
• 384 GB/s max memory bandwidth
• Up to 2 NVIDIA K40 GPU accelerators
• Ubuntu Linux running bare metal
Harnessing the power of GPUs
With their massively parallel architecture, GPUs far exceed the computational power of CPUs when operating on large sets of floating point numbers.
GPU Programming with CUDA – A simple example
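/* Kernel: runs on the GPU, one block per array element.
   Defined before cuda_add() so the launch below can see it. */
__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;
    c[tid] = a[tid] + b[tid];
}

/* Host wrapper; N is the number of elements, defined elsewhere. */
void cuda_add(int *a, int *b, int *c) {
    int *dev_a, *dev_b, *dev_c;
    int len = N * sizeof(int);

    /* Allocate device memory for the inputs and the result */
    cudaMalloc((void**)&dev_a, len);
    cudaMalloc((void**)&dev_b, len);
    cudaMalloc((void**)&dev_c, len);

    /* Copy the inputs from host to device */
    cudaMemcpy(dev_a, a, len, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, len, cudaMemcpyHostToDevice);

    /* Launch N blocks of one thread each */
    add<<<N,1>>>(dev_a, dev_b, dev_c);

    /* Copy the result back to the host */
    cudaMemcpy(c, dev_c, len, cudaMemcpyDeviceToHost);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
}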
Notes on the code:
• add<<<N,1>>>: language extension to invoke our kernel with 'N' blocks executing in parallel, one thread per block
• blockIdx.x: identify your block number
• c[tid] = a[tid] + b[tid]: add the single element whose index matches your block number
[Figure: Vector addition a + b = c, element by element. On the CPU ... and GPU]
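For comparison with the GPU kernel, the "On the CPU ..." half of the picture can be sketched in plain C as a sequential loop. This is an illustrative sketch, not code from the talk; the function name cpu_add and the explicit length parameter n are assumptions.

```c
#include <stddef.h>

/* CPU reference for the vector addition: a single thread walks the
 * arrays, so iteration i does the work that GPU block i does in the
 * CUDA kernel. */
void cpu_add(const int *a, const int *b, int *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

The loop index i plays the same role as blockIdx.x in the kernel. On the GPU the N iterations are spread across N blocks instead of being executed one after another.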
Bringing GPU programming into Java
● CUDA4J library provides a Java API for managing and accessing GPU devices, libraries, kernels, and memory.
– Available with IBM Java on POWER
– There are times when you want this low level GPU control from Java
– API reflects the concepts familiar in CUDA programming
– Make use of Java exceptions, automatic resource management, etc.
– Handles copying data to/from the GPU, flow of control from Java to GPU and back, etc
New Java APIs:
CudaDevice – a CUDA capable GPU device
CudaBuffer – a region of memory on the GPU
CudaModule – user library of kernels to load into GPU
CudaKernel – for launching a device function
CudaFunction – a kernel's entry point
CudaEvent – for timing and synchronization
CudaException – for when something goes wrong
Fundamental types in CUDA4J
[Figure: Fundamental types in CUDA4J and how they relate, from Java down to the GPU's execution engine, device memory and device events. Types shown: CudaDevice, CudaModule, CudaKernel, CudaFunction, CudaGrid, CudaBuffer, CudaStream, CudaEvent, CudaGlobal, CudaLinker, CudaSurface, CudaTexture. A CudaModule is loaded from PTX containing functions such as add, foo and bar. Legend: relationship for generating an instance; relationship as an argument; corresponds to a HW feature in GPU. Annotation: used to combine multiple cubin/fatbin/PTXs into a single module.]
GPU-enabling standard Java SE APIs
● Example: java.util.Arrays.sort(int[] a)
● Java employs heuristics that determine if the work should be off-loaded to the GPU
– Overhead of moving data to GPU, invoking kernel, and returning results means small sorts (<~20k elements) are faster on the CPU.
– Host may have multiple GPUs. Are any available for the task?
– Is there space for conducting the sort on the device?
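The three questions above amount to a small decision function. The sketch below is written in C rather than Java and is purely illustrative: it is not IBM's actual heuristic, and the names (choose_sort_target, GPU_SORT_MIN_ELEMENTS) and the exact 20,000-element cut-off are assumptions derived from the "~20k elements" figure on the slide.

```c
#include <stddef.h>

enum sort_target { SORT_ON_CPU, SORT_ON_GPU };

/* Assumed crossover point, taken from the slide's "~20k elements". */
#define GPU_SORT_MIN_ELEMENTS 20000u

/* Decide where to run a sort, checking the three conditions in order:
 * problem size, GPU availability, and free device memory. */
enum sort_target choose_sort_target(size_t n_elements,
                                    size_t elem_size,
                                    int gpus_available,
                                    size_t free_device_bytes)
{
    /* Small sorts: transfer and launch overhead outweighs the gain. */
    if (n_elements < GPU_SORT_MIN_ELEMENTS)
        return SORT_ON_CPU;

    /* No GPU is free for this task. */
    if (gpus_available <= 0)
        return SORT_ON_CPU;

    /* The array must fit in device memory; dividing instead of
     * multiplying avoids overflow in n_elements * elem_size. */
    if (elem_size == 0 || n_elements > free_device_bytes / elem_size)
        return SORT_ON_CPU;

    return SORT_ON_GPU;
}
```

A real runtime would likely tune the threshold per device and data type. The point here is only that falling back to the CPU is always safe, so every failed check returns SORT_ON_CPU.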
GPU-enabled array sort method performance
[Chart: IBM POWER8 with NVIDIA K40m GPU]
JIT optimized GPU acceleration
[Figure: JIT pipeline. bytecodes → intermediate representation → optimizer → two code generators: CPU (CPU native) and GPU (PTX ISA)]
● Java 8 streams allow developers to express computation as aggregate parallel operations on data
IntStream.range(0, N).parallel().forEach(i -> c[i] = a[i] + b[i]);
● As the JIT compiles a stream expression it can identify candidates for GPU off-loading
– Arrays copied to and from the device implicitly
– Java operations mapped to GPU kernel operations
– Preserves the standard Java syntax and semantics
● Early steps
– Recognize a limited set of operations within the lambda expressions
– Redundant/pessimistic data transfer between host and device
– Limited heuristics about when to invoke the GPU and when to generate CPU instructions
JIT / GPU optimization of Lambda expression
JIT recognized Java code for matrix multiplication using Java 8 parallel stream
Speed-up factor when run on a GPU enabled host
[Chart: IBM POWER8 with NVIDIA K40m GPU]
NVLINK: Enhanced bandwidth interconnect
● High-speed interconnect between CPUs and GPUs
● 5x to 12x higher bandwidth than PCIe Gen 3
● Planned for POWER8 machines in 2016
POWER8 Coherent Accelerator Processor Interface
Coherent Accelerator Processor Interface (CAPI) overview
[Figure: POWER8 (with CAPP on the coherence bus) connected over PCIe Gen 3 to an FPGA or ASIC containing the PSL and a custom hardware application]
Customizable Hardware Application Accelerator
• Specific system SW, middleware, or user application
• Written to durable interface provided by PSL
PCIe Gen 3
• Transport for encapsulated messages
Power Service Layer (PSL)
• Presents robust, durable interfaces to applications
• Offloads complexity / content from CAPP
Virtual Addressing
• Accelerator can work with the same memory addresses that the processors use
• Pointers de-referenced same as the host application
• Removes OS & device driver overhead
Hardware Managed Cache Coherence
• Enables the accelerator to participate in “Locks” as a normal thread
• Lowers latency over IO communication model
Coherent Accelerator Processor Interface (CAPI) overview
[Figure: POWER8 processor with CAPP, connected over PCIe to an FPGA holding the IBM-supplied POWER Service Layer (CAPI) and accelerator functions 0, 1, 2, ... n]
Typical I/O Model Flow
1. DD Call
2. Copy or Pin Source Data
3. MMIO Notify Accelerator
4. Acceleration
5. Poll / Int Completion
6. Copy or Unpin Result Data
7. Ret. From DD Completion
Flow with a Coherent Model
1. Shared Mem. Notify Accelerator
2. Acceleration
3. Shared Memory Completion
Programming CAPI
● FPGA is programmed in VHDL or Verilog; program has two parts:
– Power Service Layer (PSL) is supplied by IBM, implements cache coherence protocol and address translation (MMU)
– Accelerator Function Unit (AFU) is application-specific, supplied by card vendor, solution integrator or end-user
● Two types of AFU:
– “Dedicated” AFU can only be used by a single application
– “Directed” AFU supports multiple contexts and can be used by several applications concurrently
● CXL driver in Linux kernel on POWER8 host
– Creates /dev/cxl/afu0.0d etc. device nodes
● libcxl library provides user-level API
– Open-source, available at https://github.com/ibm-capi/libcxl.
Example CAPI Application Program
/* Get first physical AFU and open it */
afu = cxl_afu_next(NULL);
afu_d = cxl_afu_open_h(afu, CXL_VIEW_DEDICATED);

/* Map mmio and set queue size */
cxl_mmio_map(afu_d, CXL_MMIO_LITTLE_ENDIAN);
cxl_mmio_write64(afu_d, QUEUE_SIZE_OFFSET, QUEUE_SIZE);

/* Setup Work Element Descriptor (WED) and start AFU */
wed = &workitem_queue;
cxl_afu_attach(afu_d, wed);

/* Send a work request to the AFU */
workitem_queue[next_req].buffer = mybuf;
write_memory_barrier();
workitem_queue[next_req].valid = 1;

/* Read an event sent by the AFU */
cxl_read_event(afu_d, &event);

/* Wait till last queue entry done */
while (!workitem_queue[last_req].done) {}

/* Stop and close afu */
cxl_afu_free(afu_d);
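The "send a work request" step relies on ordering: the buffer pointer must be visible before the valid flag is set, which is what write_memory_barrier() enforces. The sketch below illustrates the same publish pattern with C11 atomics and no CAPI hardware. It is an assumption-laden illustration, not the talk's code: the struct layout, the function names and the software stand-in for the AFU are all hypothetical, and a real AFU defines its own work-element format.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical work-queue entry shared between host and accelerator. */
struct workitem {
    void *buffer;       /* data the accelerator should process */
    atomic_int valid;   /* set by the host once buffer is ready */
    atomic_int done;    /* set by the accelerator when finished */
};

/* Put every entry into the "empty" state. */
void queue_init(struct workitem *q, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        q[i].buffer = NULL;
        atomic_init(&q[i].valid, 0);
        atomic_init(&q[i].done, 0);
    }
}

/* Host side: fill in the request, then publish it. The release store
 * plays the role of write_memory_barrier() followed by valid = 1. */
void submit_request(struct workitem *q, size_t slot, void *buf)
{
    q[slot].buffer = buf;
    atomic_store_explicit(&q[slot].valid, 1, memory_order_release);
}

/* Software stand-in for the AFU: if the entry is valid, "process" it
 * by incrementing the int the buffer points to, then mark it done.
 * Returns 1 if a request was processed, 0 if the slot was empty. */
int afu_stand_in_poll(struct workitem *q, size_t slot)
{
    if (!atomic_load_explicit(&q[slot].valid, memory_order_acquire))
        return 0;
    *(int *)q[slot].buffer += 1;
    atomic_store_explicit(&q[slot].done, 1, memory_order_release);
    return 1;
}

/* Host side: check for completion, as the polling loop does. */
int request_done(struct workitem *q, size_t slot)
{
    return atomic_load_explicit(&q[slot].done, memory_order_acquire);
}
```

Because the accelerator shares the host's virtual address space and is cache coherent, the host hands over work by writing ordinary memory. No copy or pin step is needed, and an MMIO write is used only for one-time setup such as the queue size.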
Demonstrating the value of CAPI attached flash storage
Summary:
1. The OpenPOWER Foundation provides a framework for open innovation on hardware and software
2. The system software structure embraces many key open source technologies – KVM, OpenStack, and others
3. Opportunities abound for hardware innovation with the POWER8 processor as a foundation
OpenPOWER is transforming IBM, Power, and the industry!
Legal Statement
This work represents the view of the author and does not necessarily represent the view of IBM.
IBM, IBM (logo), OpenPower, POWER, POWER8, Power Systems, are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries.
Linux is a registered trademark of Linus Torvalds.
Other company, product and service names may be trademarks or service marks of others.