Join the Conversation #OpenPOWERSummit
OpenCAPI Technology
Myron Slota
OpenCAPI Consortium
OpenCAPI Topics
Industry Background
Where/How OpenCAPI Technology is used
Technology Overview and Advantages
Demonstrations
OpenCAPI Consortium – Where it all Happens
Key Messages Throughout
Open IO Standard
High Performance – No OS/Hypervisor/FW Overhead with Low Latency and High Bandwidth
Not tied to Power – Architecture Agnostic
Very Low Accelerator Design Overhead
Programming Ease
Ideal for Accelerated Computing and SCM
Supports heterogeneous environment – Use Cases
Optimized for within a single system node
Products exist today!
Industry Background that Defined OpenCAPI
• Growing computational demand due to emerging workloads (e.g., AI, cognitive computing)
• Moore's Law no longer supported by traditional silicon scaling
• Driving increased dependence on hardware acceleration for performance
• Hyperscale datacenters and HPC need much higher network bandwidth
• 100 Gb/s -> 200 Gb/s -> 400 Gb/s are emerging
• Deep learning and HPC require more bandwidth between accelerators and memory
• Emerging memory/storage technologies are driving the need for bandwidth with low latency

Hardware accelerators are defining the attributes of a high-performance bus
• Growing demand for network performance and network offload
• Introduction of device coherency requirements (IBM's introduction in 2013)
• Emergence of complex storage and memory solutions
• Various form factors, with no single one able to address everything (e.g., GPUs, FPGAs, ASICs)
Computation and Data Access: all relevant to modern data centers
Use Cases - A True Heterogeneous Architecture Built Upon OpenCAPI
OpenCAPI 3.0
OpenCAPI 3.1
OpenCAPI specifications are downloadable from the website at www.opencapi.org
• Register
• Download
OpenCAPI Key Attributes
[Diagram: any OpenCAPI-enabled processor connected over a TL/DL 25Gb I/O link to an accelerated OpenCAPI device containing TLx/DLx logic and the accelerated function]
1. Architecture-agnostic bus – applicable to any system/microprocessor architecture
2. Optimized for high bandwidth and low latency
3. High-performance 25 Gbps PHY design with zero protocol 'overhead'
4. Coherency – attached devices operate natively within the application's user space and coherently with the host microprocessor
5. Virtual addressing enables low overhead with no kernel, hypervisor, or firmware involvement
6. Wide range of use cases and access semantics
7. CPU-coherent device memory (Home Agent Memory)
8. Architected for both classic memory and emerging advanced storage class memory
9. Minimal OpenCAPI design overhead (less than 5% of an FPGA)
[Diagram: host processor with caches and application; OpenCAPI-attached devices – storage/compute/network ASIC/FPGA/FFSA, SoC, or GPU accelerators with load/store or block access; standard system memory, buffered system memory, OpenCAPI memory buffers, advanced SCM solutions, and device memory]
POWER9 IO Features

8 and 16 Gbps PHY – protocols supported:
• PCIe Gen3 x16 and PCIe Gen4 x8
• CAPI 2.0 on PCIe Gen4

25 Gbps PHY – protocols supported:
• OpenCAPI 3.0
• NVLink 2.0

Silicon die offered in various packages (scale-out, scale-up)

POWER9 IO Leading the Industry
• PCIe Gen4
• CAPI 2.0 (Power)
• NVLink 2.0
• OpenCAPI 3.0
Virtual Addressing and Benefits

An OpenCAPI device operates in the virtual address spaces of the applications that it supports
• Eliminates kernel and device driver software overhead
• Allows device to operate on application memory without kernel-level data copies/pinned pages
• Simplifies programming effort to integrate accelerators into applications
• Improves accelerator performance
The Virtual-to-Physical Address Translation occurs in the host CPU
• Reduces design complexity of OpenCAPI-attached devices
• Makes it easier to ensure interoperability between OpenCAPI devices and different CPU architectures
• Security – since the OpenCAPI device never has access to a physical address, a defective or malicious device cannot access memory belonging to the kernel or to applications it is not authorized to access
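The host-side translation flow above can be sketched as a toy model. This is hypothetical Python for illustration only, not a real OpenCAPI API: the attached device presents only virtual addresses, and the host refuses anything outside the application's mappings, which is exactly the security property the last bullet describes.

```python
# Toy model of OpenCAPI-style host-side address translation (hypothetical,
# for illustration only -- not a real OpenCAPI API).
class HostTranslator:
    """The host CPU owns virtual-to-physical translation; the attached
    device only ever sees virtual addresses."""

    def __init__(self):
        self.page_tables = {}  # process id -> {virtual page -> physical page}

    def map_page(self, pid, vpage, ppage):
        self.page_tables.setdefault(pid, {})[vpage] = ppage

    def translate(self, pid, vaddr, page_size=4096):
        vpage, offset = divmod(vaddr, page_size)
        ppage = self.page_tables.get(pid, {}).get(vpage)
        if ppage is None:
            # Unmapped address: the device cannot reach kernel memory or
            # other applications' pages, mirroring the security bullet.
            raise PermissionError(f"pid {pid}: no mapping for vaddr {vaddr:#x}")
        return ppage * page_size + offset

host = HostTranslator()
host.map_page(pid=42, vpage=0x10, ppage=0x9A)
print(hex(host.translate(42, 0x10008)))  # virtual 0x10008 -> physical 0x9a008
```

Because the mapping lives only on the host, a buggy device request simply faults in the translator instead of reaching another process's memory.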
Acceleration Paradigms with Great Performance
[Diagram for each paradigm: a processor chip connected over DLx/TLx to an accelerator (Acc) moving data]

Egress Transform
Examples: Encryption, Compression, Erasure prior to delivering data to the network or storage

Bi-Directional Transform
Examples: NoSQL such as Neo4J with Graph Node Traversals, etc.

Memory Transform
Example: Basic work offload
Examples: Machine or Deep Learning such as Natural Language Processing, sentiment analysis, or other Actionable Intelligence using OpenCAPI-attached memory

Needle-in-a-Haystack Engine
Examples: Database searches, joins, intersections, merges
Only the needles are sent to the processor from a large haystack of data

Ingress Transform
Examples: Video Analytics, Network Security, Deep Packet Inspection, Data Plane Accelerator, Video Encoding (H.265), High Frequency Trading, etc.
OpenCAPI is ideal for acceleration due to its bandwidth to/from accelerators, best-of-breed latency, and the flexibility of an open architecture
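The needle-in-a-haystack paradigm can be made concrete with a short sketch (illustrative only, no OpenCAPI calls involved): the accelerator scans the full haystack on the device side and ships only the matching records back across the link, so link traffic scales with the result size rather than the data size.

```python
# Sketch of the needle-in-a-haystack paradigm (illustrative only): the
# accelerator scans the full haystack locally and returns only matching
# records ("needles"), so only those cross the OpenCAPI link to the host.
def accelerator_filter(haystack, predicate):
    """Stands in for the device-side scan: the full pass stays local."""
    return [record for record in haystack if predicate(record)]

haystack = list(range(1_000_000))            # large data set on the device
needles = accelerator_filter(haystack, lambda r: r % 100_000 == 0)

# Only the needles traverse the link to the processor.
print(len(needles), "of", len(haystack), "records sent to the host")
```

With a selective predicate like this one, ten records cross the link instead of a million, which is the whole point of pushing the search to the accelerator.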
Comparison of Memory Paradigms
Main Memory
Example: basic DDR4/5 attach over DLx/TLx

Emerging Storage Class Memory
Storage Class Memories have the potential to be the next disruptive technology; examples include ReRAM, MRAM, and Z-NAND, all racing to become the de facto standard
Ultra-low-latency ASIC buffer chip adds only ~5 ns on top of native DDR direct connect

Tiered Memory
Storage Class Memory tiered with traditional DDR memory, all built upon the OpenCAPI 3.1 and 3.0 architecture
Still retains the ability to use Load/Store semantics

OpenCAPI 3.1 Architecture
• Common physical interface between non-memory and memory devices
• OpenCAPI protocol was architected to minimize latency; excellent for classic DRAM memory
• Extreme bandwidth beyond the classical DDR memory interface
• Agnostic interface will handle evolving memory technologies in the future (e.g., compute-in-memory)
• Ability to use a memory buffer to decouple raw memory from the host interface to optimize power, cost, and performance
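A back-of-the-envelope model shows what tiering means for average latency. The ~5 ns buffer-chip adder comes from the slide; the DDR and SCM latencies below are assumed round numbers for illustration, not measured values.

```python
# Toy model of tiered-memory latency (illustrative; the +5 ns buffer-chip
# adder is from the slide, while ddr_ns and scm_ns are assumed round
# numbers, not measurements).
def effective_latency_ns(ddr_hit_ratio, ddr_ns=80.0, scm_ns=400.0, buffer_ns=5.0):
    """Average load latency when hot data lives in buffered DDR and cold
    data spills to a storage-class-memory tier."""
    ddr_total = ddr_ns + buffer_ns  # OpenCAPI buffer chip adds ~5 ns
    return ddr_hit_ratio * ddr_total + (1.0 - ddr_hit_ratio) * scm_ns

for hit in (0.5, 0.9, 0.99):
    print(f"{hit:.0%} DDR hits -> {effective_latency_ns(hit):.1f} ns average")
```

Even with slow SCM behind it, a hot DDR tier keeps the average close to DRAM latency once the hit ratio is high, which is why the small buffer-chip adder matters more than the SCM tail for well-tiered workloads.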
CAPI and OpenCAPI Performance
                    CAPI 1.0          CAPI 2.0          OpenCAPI 3.0
                    PCIe Gen3 x8      PCIe Gen4 x8      25 Gb/s x8
                    Measured @8Gb/s   Measured @16Gb/s  Measured @25Gb/s
128B DMA Read       3.81 GB/s         12.57 GB/s        22.1 GB/s
128B DMA Write      4.16 GB/s         11.85 GB/s        21.6 GB/s
256B DMA Read       N/A               13.94 GB/s        22.1 GB/s
256B DMA Write      N/A               14.04 GB/s        22.0 GB/s
POWER8 - CAPI 1.0, introduced in 2013
POWER9 - CAPI 2.0, second generation
POWER9 - OpenCAPI 3.0, an open architecture with a clean slate focused on bandwidth and latency
Xilinx KU60/VU3P FPGA used for the measurements
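A quick sanity check on the measured numbers above: comparing them against the raw unidirectional link bandwidth (lane rate times lane count, ignoring encoding overhead, so the "raw" figure is an optimistic upper bound) shows how little of the link OpenCAPI's protocol consumes.

```python
# Sanity check of the table above: measured DMA bandwidth as a fraction of
# raw unidirectional link bandwidth (encoding overhead ignored, so "raw"
# is an optimistic upper bound).
def link_efficiency(measured_gbytes, lane_gbits, lanes=8):
    raw_gbytes = lane_gbits * lanes / 8.0   # bits per lane -> total bytes
    return measured_gbytes / raw_gbytes

# OpenCAPI 3.0: 25 Gb/s x8 = 25 GB/s raw; 22.1 GB/s measured (256B reads)
print(f"OpenCAPI 3.0: {link_efficiency(22.1, 25):.0%} of raw link bandwidth")
# CAPI 2.0: PCIe Gen4 x8 = 16 GB/s raw (pre-encoding); 14.04 GB/s measured
print(f"CAPI 2.0:     {link_efficiency(14.04, 16):.0%} of raw link bandwidth")
```

Both links land near 88% of the optimistic raw figure, consistent with the slide's "zero overhead" claim for the 25 Gbps PHY once line encoding is accounted for.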
Latency Test
A simple workload created to simulate communication between the system and an attached FPGA:
1. Copy 512B from the host send buffer to the FPGA
2. Host waits for a 128-byte cache injection from the FPGA and polls on the last 8 bytes
3. Reset the last 8 bytes
4. Repeat from step 1
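The four steps above amount to a ping-pong loop, which can be mocked in software. This is a pure-Python sketch with threads standing in for the host and the FPGA; buffer names and sizes follow the steps above, and nothing here touches real OpenCAPI hardware.

```python
# Software mock of the ping-pong latency test (pure Python; host and FPGA
# are just threads, so this illustrates the protocol, not real timings).
import threading
import time

ITERATIONS = 1000
send_buf = bytearray(512)       # host -> FPGA payload (step 1)
inject_buf = bytearray(128)     # FPGA -> host cache-injection target (step 2)
fpga_go = threading.Event()

def fpga():
    for i in range(ITERATIONS):
        fpga_go.wait()                                    # receive the 512B copy
        fpga_go.clear()
        inject_buf[-8:] = (i + 1).to_bytes(8, "little")   # inject the 128B line

def host():
    for i in range(ITERATIONS):
        send_buf[0] = i & 0xFF
        fpga_go.set()                                     # 1. copy 512B to the FPGA
        while inject_buf[-8:] != (i + 1).to_bytes(8, "little"):
            time.sleep(0)                                 # 2. poll the last 8 bytes
        inject_buf[-8:] = bytes(8)                        # 3. reset the last 8 bytes
                                                          # 4. loop repeats from step 1

t = threading.Thread(target=fpga)
t.start()
host()
t.join()
print("completed", ITERATIONS, "round trips")
```

In the real test the poll spins on a cache line the FPGA injects over the link, so each iteration measures one full host-device round trip.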
OpenCAPI Enabled FPGA Cards
Mellanox Innova2 Accelerator Card
Alpha Data 9v3 Accelerator Card
Typical eye diagram at 25Gb/s using these cards
Barreleye G2 System Demo
Actual Barreleye G2 demo system
Packet Classifier Demonstration using Alpha Data 9v3 Accelerator Card
(early classifier bring-up at 20 Gb/s)
Barreleye G2 System Demo
Actual Barreleye G2 demo system
Packet Classifier Demonstration using Mellanox Innova2 Accelerator Card
(early classifier bring-up at 20 Gb/s)
OpenCAPI Consortium
• Open forum founded by AMD, Google, IBM, Mellanox, and Micron
• Manages the OpenCAPI specification, establishes enablement, and grows the ecosystem
• Currently over 35 members
• Consortium now established
  • Board of Directors: AMD, Google, IBM, Mellanox Technologies, Micron, NVIDIA, Western Digital, Xilinx
  • Governing documents (Bylaws, IPR Policy, Membership) with established membership levels
  • Website: www.opencapi.org
  • Technical Steering Committee with Work Group process established
  • Marketing/Communications Committee
• Work Groups
  • TL Specification, DL Specification, PHY Signaling, PHY Mechanical, Compliance, and Enablement
  • Additional work groups being created: Memory, Software, Accelerator, and more
• OpenCAPI specification available on the web site; contributed to the consortium as a starting point for the Work Groups
• Design enablement available today (reference designs, documentation, simulation environment, exercisers, etc.)

Incorporated September 13, 2016; announced October 14, 2016
OpenCAPI Design Enablement
Item - Availability
OpenCAPI 3.0 TLx and DLx Reference Xilinx FPGA Designs (RTL and Specifications) - Today
Xilinx Vivado Project Build with Memcopy Exerciser - Today
Device Discovery and Configuration Specification and RTL - Today
AFU Interface Specification - Today
Reference Card Design Enablement Specification - 2Q18
25Gbps PHY Signal Specification - Today
25Gbps PHY Mechanical Specification - Today
OpenCAPI Simulation Environment (OCSE) Tech Preview - Today
Memcopy and Memory Home Agent Exercisers - Today
Reference Driver - 2Q18
Membership Entitlement Details
Strategic level - $25K
• Draft and final specifications and enablement
• License for product development
• Work group participation and voting
• TSC participation
• Vote on new Board members
• Nominate and/or run for officer election
• Prominent listing in appropriate materials

Contributor level - $15K
• Draft and final specifications and enablement
• License for product development
• Work group participation and voting
• TSC participation
• Submit proposals

Observing level - $5K
• Final specifications and enablement
• License for product development

Academic and Non-Profit level - Free
• Final specifications and enablement
• Work group participation and voting
Current Members
Strategic Membership level
Contributor Membership level
Academic Membership Level
Observing Membership Level
Cross Industry Collaboration and Innovation
OpenCAPI Protocol
Welcoming new members in all areas of the ecosystem
• Systems and Software
• Accelerator Solutions
• SW
• Research & Academic
• Products and Services
• Deployment
• SOC