GTC 2015 – Session S5429
Creating Dense Mixed GPU and FPGA Systems With Tegra K1s Using OpenCL & CUDA
Lance Brown, Director - HPC
ColoradoEngineering.com
719-641-7287 Cell
27 March 2015 ColoradoEngineering.com - Public Release 1
We Can Solve Really Cool Problems Now
• Heterogeneous computing is more than CPU + GPU
• ARM processors changed the game
  • NVIDIA - GPU + ARM - CUDA
  • TI - DSP + ARM - OpenCL
  • Altera - FPGA + ARM - OpenCL
• Scalable from handheld to Enterprise & HPC
Why Listen to CEI?
• Been using FPGAs since 1985
• Been solving massively parallel problems for over 30 years
• We have designed, and are designing, multiple 24- and 32-layer boards featuring Altera FPGAs & NVIDIA GPUs
• Early adopter of new technologies and experts at marrying existing technologies in new ways
Game Changer #1
Altera’s Hard Floating Point Unit IP & OpenCL
• FPGAs have traditionally supported soft floating point
• Altera introduced IEEE 754 Hard Floating Point with Arria 10
• Arria 10 FPGAs are rated from 140 GigaFLOPS (GFLOPS) to 1.5 TeraFLOPS (TFLOPS)
• Details at: https://www.altera.com/en_US/pdfs/literature/po/bg-floating-point-fpga.pdf
• OpenCV & Suricata Implementations Using OpenCL
• Partial Reconfiguration for Streamlined OpenCL Development
• On Intel’s 14 nm FinFET Fab
Game Changer #2
NVIDIA Makes Tegra K1 Available
• GPU + ARM @ low power
• Very important – camera interfaces galore
• Can do significant processing at each edge node now
• Jetson Kit – awesome eval kit & affordable
• More importantly – chipset available through Arrow!
• Details at: https://developer.nvidia.com/hardware-design-and-development
CEI’s Epiphany – Ultimate CV Platform
Altera Arria 10 & NVIDIA Tegra K1?
Altera Arria 10 (1500 GFLOPS) + NVIDIA Tegra K1 (326 GFLOPS)
First Union – Dual TK1s + Arria 10
HPC-A10-K1GPU
[Block diagram: HPC-A10-K1GPU board. Labeled blocks and interfaces: HPC-A10 (Arria 10), K61 health monitoring, PCIe switch, x8 PCIe Gen3, x4 PCIe Gen2, extra x4 PCIe Gen2, GigE, 2/4 GB Micron HMC, two QDR II+ (144 Mb, 1334 MT/s), two QSFP+ (each 1 x 40 GbE or 4 x 10 GbE), USB Blaster, DisplayPort source, DisplayPort sink, two USB 3.0, SMA, SMA CLK-IN, optional VITA 57 FMC HPC, and two Tegra K1 System-On-Modules (TK1-SOM), each with 16/32/64 GB eMMC, 2/4/8 Gbit DDR3, USB, GigE, and HDMI.]
[Figure: TK1-SOM, the Tegra K1 System-On-Module, available stand-alone. 2 inches x 2 inches; 16/32/64 GB eMMC; 1/2/4 GB DDR3L; USB 2.0; GigE; HDMI; external power; x4 PCIe Gen2, clocks, I2C; JTAG; UART.]
HPC-A10-K1GPU
Design Details
• NVIDIA GPUDirect Support
• TK1s are root nodes
• TK1s can be field upgraded
• 8 high-speed 10 GbE ports
• CUDA on TK1
• OpenCL on Arria 10
• 2 GB/s to each TK1
• HMC is 17X faster than DDR3
• 12 to 25 Camera/Sensor I/Os
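The 2 GB/s figure for each TK1 link can be reproduced from the PCIe Gen2 line rate. The sketch below assumes the standard Gen2 parameters (5 GT/s per lane, 8b/10b encoding) and the x4 link width stated elsewhere in this deck. It ignores protocol overhead, so sustained throughput will be lower, and the function name is purely illustrative.

```python
def pcie_bandwidth_gb_per_s(lanes, gt_per_s, payload_bits, encoded_bits):
    """Raw one-direction PCIe bandwidth in GB/s, before protocol overhead."""
    usable_gbit_per_s = lanes * gt_per_s * payload_bits / encoded_bits
    return usable_gbit_per_s / 8  # bits -> bytes


# PCIe Gen2: 5 GT/s per lane with 8b/10b encoding (8 payload bits per 10 sent).
tk1_link = pcie_bandwidth_gb_per_s(lanes=4, gt_per_s=5.0, payload_bits=8, encoded_bits=10)

# Eight 10 GbE ports at raw line rate, converted from Gbit/s to GB/s.
ethernet_total = 8 * 10 / 8

print(f"x4 Gen2 PCIe per TK1: {tk1_link:.1f} GB/s")
print(f"Two TK1 links combined: {2 * tk1_link:.1f} GB/s")
print(f"8 x 10 GbE aggregate: {ethernet_total:.1f} GB/s")
```

On these raw numbers the Ethernet side can carry more than the two TK1 links combined, which suggests part of the incoming stream would need to be reduced on the FPGA before it reaches the Tegras.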
Single Node
• 1 to 21 Cameras/Sensors
• Makes dumb cameras smart
• 10/40 GbE Sensors
• OpenCL on FPGA
• CUDA on Tegra
[Diagram: a single node ringed by camera (C) inputs, with an FMC, two 4 x 10 GbE links, DisplayPort, and two USB/GigE connections.]
Tesla K80s + HPC-A10-K1GPU
[Diagram: an HPC-A10-K1GPU node (camera (C) inputs, FMC, two 4 x 10 GbE links, DisplayPort, two USB/GigE connections) alongside four Tesla K80s, with the link between them labeled GPUDirect.]
Sensor Gateway
Smart Host Bus Adapter (HBA)
[Diagram: a sensor cloud (radar, MRI, PET, camera, EW, etc.) and two gateway boards, each labeled with 40 GbE ports and an FMC, connecting to two Tesla K80 clusters.]
Programming FPGAs with OpenCL
• Easy to do now
• https://youtu.be/o5WtYiY5Hao
• Proficient in a day or two
• CAPI support too
• 95% to 99% as efficient as VHDL
EDGE Node Processing
• Process on the EDGE using GRID
• Distributed deep learning node
• Low cost
• 4G enabled
• Fusion of Radar, EO, IO and Sound
• Download apps from Google Play
• Feedback to Tesla K80s via GRID
• SmartCity Ready
• Military Level Device Security Built-in
[Block diagram: edge node. Labeled blocks: NVIDIA Tegra K1/X1 (computer vision, video compression); four 5 MP cameras; 24 GHz radar system (motion detection, camera queuing); COMMS (alerts, streaming video, 4G LTE, WiFi, Bluetooth, USB); Altera Cyclone V (appliance security); four patch antennas; four directional mics.]
Distributed Aperture System
Distributed Sensors
• Large vehicle/Military ADAS
• SA360 systems
• Retrofit casino camera systems
• Make any sensor system smart
• Tegra K1/X1s are scalable
• Mixture of CUDA & OpenCL
[Block diagram labels: an Altera Arria 10 SoC (x2 ARM, OpenCL) with 4/8 GB HMC and QDR-II+ or QDR-IV memory; removable SATA storage; 40/10 GbE ports; a main display GPU; and nine NVIDIA Tegra X1 modules (x4 ARM, CUDA/Linux, OpenCV, H.264/H.265), each with an x4 Gen2 PCIe link at 2 GB/s, 64 GB eMMC, 8 GB DDR4, USB3 or GigE, and HDMI.]
Challenges
Hardware, Interconnects & Software
• FPGA + GPU
  • CUDA, OpenCL or CUDA + OpenCL
  • Working with MDA & AFRL on solutions
• Bandwidth
  • Tegra K1/X1 are x4 Gen2 PCIe, which limits the number and resolution of sensors attached to the Tegra
  • More processing has to be done on the Tegra, but that is okay since Tegras keep increasing in power every year
  • Gen3 PCIe would be awesome
  • PCIe backplane: using 40 GbE ports eliminates the PCIe bottleneck
• Root Nodes
  • Tegra wants to be the root complex, so non-transparent switches need to be used
  • If Tegra could be an endpoint, a whole new world would open up
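To make the bandwidth limit concrete, a rough budget for uncompressed camera streams over one x4 Gen2 link (2 GB/s) can be sketched as follows. The 5 MP resolution echoes the cameras shown on the edge-node slide, but the bytes per pixel and frame rate are illustrative assumptions, not measurements from this design, and a real link delivers less than its raw rate.

```python
def stream_gb_per_s(megapixels, bytes_per_pixel, frames_per_s):
    """Uncompressed data rate of one camera stream in GB/s."""
    return megapixels * 1e6 * bytes_per_pixel * frames_per_s / 1e9


def max_streams(link_gb_per_s, per_stream_gb_per_s):
    """Whole number of streams that fit in the link's raw bandwidth."""
    return int(link_gb_per_s // per_stream_gb_per_s)


LINK_GB_PER_S = 2.0  # x4 Gen2 PCIe, as stated on the slides

# Assumed sensor (hypothetical): 5 MP, 2 bytes per pixel, 30 frames per second.
per_camera = stream_gb_per_s(megapixels=5, bytes_per_pixel=2, frames_per_s=30)

print(f"One camera: {per_camera:.2f} GB/s")
print(f"Cameras per x4 Gen2 link: {max_streams(LINK_GB_PER_S, per_camera)}")
```

Under these assumptions only a handful of raw streams fit on one link, and raising the resolution or frame rate shrinks that number further, which is consistent with the slide's point that the link bounds both sensor count and resolution.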
Future Architectures
Even Cooler Designs Possible
• Altera
  • Arria 10 SoC
    • Eliminates need for x86 CPU to run OpenCL
    • Truly stand-alone appliances
    • 100 GbE interfaces
  • Stratix 10 and Stratix 10 SoC
    • >10 TFLOPS for 100 W
    • Details: https://www.altera.com/products/fpga/stratix-series/stratix-10/overview.html
• NVIDIA VOLTA
  • Looking for NVLink intermingling with FPGAs
• Virtual FPGAs + Virtual GPUs
  • Allow instant scaling and data protection
Summary
• GPU + FPGA can solve amazing and fun problems
• Tegra K1/X1 provide incredible capability at low cost, which reduces the size of the FPGA needed
• OpenCL and Hard Floating Point IP make the Altera FPGAs a great partner with NVIDIA GPUs
• CEI is making scalable solutions to allow application developers to deploy from handheld to enterprise/HPC
Hardware & Software Capabilities
• Enterprise & Embedded SW
• Net Centric, SOA, web services, J2EE,SQL
• C/C++
• CUDA & OpenCL
• Embedded real time code, RTOS, hardware drivers, Fault Detection / Fault Isolation, etc.
• Simulations, APIs, and GUIs
• Cognitive Software
• Device Drivers
• National Instruments LabVIEW
• DO-178C
• FPGA designs (VHDL/Verilog/Simulink)
• RF Design
▪ System / Subsystem Designs
▪ 30+ complex board designs
▪ 32 layer PCBs with blind and buried vias
▪ High speed (100s MHz x GHz)
▪ Analog (RF & I/Q Receivers)
▪ Digital (FPGAs, DSPs, general purpose)
▪ ADC and DAC
▪ Standard and custom IO (busses, fabrics, SerDes, etc.)
▪ Ruggedization and thermal management
▪ CSWaP
▪ Serial I/O (e.g. PCIe, Serdes)
▪ DO-254
For More Information on Standard Products and Custom Engineering Services
Call Us – 719-388-8582 Office
Email Us – [email protected]
Visit Us – Colorado Springs, CO (Sunny 300+ Days)
Browse Us – www.ColoradoEngineering.com