Upload
others
View
7
Download
0
Embed Size (px)
Citation preview
© Copyright 2018 Xilinx
HW/SW Programmable Engine:
Domain Specific Architecture for Project Everest
Juanjo Noguera, Goran Bilski, Jan Langer, Baris Ozgul, Tim Tuan, David Clarke, Peter McColgan, Sneha Date, Zachary Dickman, Pedro Duarte, Stephan Munz, Jose Marques, Sridhar Subramanian, Saurabh Mathur, MalavP. Shah, Chris Dick, Paul Newson, Kaushik Barman, Ramon Uribe, Helen Tarn, Engin Tunali, Richard Walke, Vinod Kathail, Shail Aditya Gupta, Akella Sastry, Mukund Sivaraman, Rishi Surendran, Abraham Lee, Chia-Jui Hsu, Kumud Bhandari, Jack Lo, Kristof Denolf, Phil James-Roxby, Samuel Bayliss, Kees Vissers, Ralph Wittig, Gaurav Singh
21st August 2018
© Copyright 2018 Xilinx
Heterogenous Computing Architecture
>> 2
Programmable
Logic
SW
Programmable
Engine
Processing
SystemI/O
(GT, AMS)
7nm Everest is the First Product to Integrate this New ArchitectureTape Out in 2018
ML Inference 5G Wireless Power
Xilinx 7nm Everest Xilinx 16nm UltraScale+20x
40%Less
Power
4x
Application-Level Performance Enabled
by SW Programmable Engine Everest Heterogeneous Architecture
˃ Compute Efficiency
˃ Reduced Power
˃ Software Programmable
© Copyright 2018 Xilinx
Everest Architecture Block Diagram
>> 3
Programmable Logic
• ‘Soft’ logic
• Flexibility
• Custom memory hierarchy
SW Programmable Engine
• Domain Specific Architecture
• Hardened 7nm technology
• Throughput-oriented, low-latency
Programmable Logic
Processing
System
I/O (GT, AMS)
SW Programmable Engine
SW PE SW PE SW PE
SW PE SW PE SW PE
LUT BRAM
DSP URAM
Application
Processor
Real-Time
Processor
OCM
SMMU
Transceivers
PCIe
DDR
HBM
AMS
Chips with Different Implementations of the Architecture will be Announced Shortly
© Copyright 2018 Xilinx
Market Requirements and Trends: Data Center
>> 4
Source: Canziani et. al, “An Analysis of Deep Neural Network Models for Practical Applications”
https://arxiv.org/abs/1605.07678
Increasing Compute Lower Latency Requirements
Video + ML
Genomics + ML
Risk Modelling + ML
Database + ML
Network IPS + ML
Storage + ML
Heterogeneous WorkloadsML Becoming an Essential
Part of Data Processing
© Copyright 2018 Xilinx
ML Inference on Heterogenous Architecture
>> 5
Programmable
Logic
SW
Programmable Engine
Processing
SystemI/O
(GT, AMS)
*Figure credit: https://en.wikipedia.org/wiki/Convolutional_neural_network
Custom
Memory
Hierarchy
Feature
Map
Data
Volume*
Convolutions Activations
yi = 0
yi = xi
y
x
ReLU ReLU/PReLU
yi = xi
y
yi = ai xi
Poolingsingle depth slice
Y
1 2 2 4
2 31 0
4 6 6 8
3 1 1 0
X
6 8
3 4
Fully Connected Layers
• Video
• Genomics
• Storage
• Database
• Network IPS
• Risk modeling
© Copyright 2018 Xilinx
Market Requirements and Trends: Wireless 5G
>> 6
5G Complexity is 100X that of 4GStill Evolving Standard
ETRI RWS-150029,
5G Vision and Enabling Technologies: ETRI Perspective 3GPP RAN Workshop
Phoenix, Dec. 2015
http://www.3gpp.org/ftp/tsg_ran/TSG_RAN/TSGR_70/Docs
New Technologies in 5G
Transport
& CTRLL2–L7
Pa
cke
t P
roce
ssin
g
an
d W
ired
Ba
ckh
au
l
Hig
he
r L
aye
r P
roce
ssin
g
Sw
itch
ing
Be
am
Fo
rmin
g &
MM
IO
+ S
om
e B
ase
ba
nd
Tra
nsfo
rms
Dig
ita
l Rad
io
AD
C /
DA
C
An
alo
gu
e R
ad
io
An
ten
na
Arr
ay
Ba
se
ba
nd
Pro
ce
ssin
g
Modulation
& FEC
IQ
Switch
Linear
Algebra
iFFT/
FFT
DUC, CFR,
DPD, DDC
PA, LNA,
Diplexer
˃ Massive MIMO
˃ Multiple antenna, frequency bands
˃ Changing functional partitioning
© Copyright 2018 Xilinx
SW Programmable Engine
5G Wireless on Heterogenous Architecture
>> 7
1: DUC: Digital Up Converter
2: DPD: Digital Pre-Distortion
3: AMS: Analog Mix-Signal (ADC/DAC)
4: CPRI: Common Public Radio Interface
Mapping Example
5G Wireless Infrastructure (i.e., base-station)
Packet P
rocessin
g
and W
ired B
ackhaul
Hig
her
Layer
Pro
cessin
g
Sw
itchin
g
Beam
Form
ing &
MM
IO
+ S
om
e B
aseb
an
d
Tra
nsfo
rms
Dig
ital R
adio
AD
C /
DA
C
Analo
gue R
adio
Ante
nna A
rray
Baseband P
rocessin
g
Programmable Logic
Processing
System
I/O
DPD Update
Control
Maps to PS
AMSDPDDUCCPRI
DPD Update
Digital Radio
with ADC/DAC
DUC DPD
Compute Maps
to Software PE
CPRI
AMS
I/O Maps to PL
© Copyright 2018 Xilinx
Architecture Overview
© Copyright 2018 Xilinx
Tile-Based Architecture
>> 9
Interconnect
ISA-based
Vector Processor
Local
Memory
ML Vector
Extensions
5G Wireless Vector
Extensions
ISA-based
Vector ProcessorSoftware Programmable
(e.g., C/C++)
Data
Mover
Data MoverNon-neighbor data communication
Integrated synchronization primitives
Non-Blocking InterconnectUp to 200+ GB/s bandwidth per tile
PL
PS I/O
Local MemoryMulti-bank implementation
Shared across neighbor cores
Cascade InterfacePartial results to next core
© Copyright 2018 Xilinx
Tile-Based Architecture (Continued)
>> 10
Mem
ory
Vector
Core
Mem
ory
Vector
Core
Mem
ory
Vector
Core
Mem
ory
Vector
Core
Mem
ory
Vector
Core
Mem
ory
Vector
Core
Mem
ory
Vector
Core
Mem
ory
Vector
Core
Mem
ory
Vector
Core
Architectural Focus: Deterministic Performance & Low Latency
*Device Specific
Modular and scalable architecture
• Instantiate multiple tiles for more compute
• 10’s-100’s Instances*
Distributed memory hierarchyMaximize memory bandwidth
Massive multi-core engines
• Increase in compute, memory and communication bandwidth
PL
PS I/O
© Copyright 2018 Xilinx
Dataflow Graph
Mem
Mem
Mem
Software
PE
2
Data Movement Architecture: Examples (1/2)
>> 11
Software
PE0
Software
PE1
Dataflow Pipeline
Software
PE2 Compute
Comm
Overlap Compute and CommunicationMemory
B0
B1
Memory
B2
B3Compute Compute
Comm Comm
1
MemSoftware
PE
Software
PE
Software
PE
MemSoftware
PE
Software
PE
MemSoftware
PE
Streaming Multicast
Software
PE0
Software
PE1
Software
PE2
Software
PE3
3
© Copyright 2018 Xilinx
Data Movement Architecture: Examples (2/2)
>> 12
Software
PE0
Memory Memory
Software
PE1
Memory Memory
4 Non-neighbor communication
Software
PE2
Software
PE3
5 Cascade streaming
Mem
ory
Vector
Core
Mem
ory
Vector
Core
Mem
ory
Vector
Core
Mem
ory
PE0
Mem
ory
Vector
Core
Mem
ory
Vector
Core
Mem
ory
Mem
ory
Mem
ory
PE1
Physical Mapping
Vector
CoreVector
CorePE3PE2
© Copyright 2018 Xilinx
Programmable Logic
LUT BRAMDSP URAM
Device Integration
>> 13
Mem
ory
Vector
Core
Mem
ory
Vector
Core
Mem
ory
Vector
Core
Mem
ory Vector
Core
Mem
ory
Vector
Core
Mem
ory
CDC CDC CDC
Clock-domain
crossing (CDC)
between PL &
Software PE clocks
Vector
Core
Up to TB/s of
bandwidth between
Software PE and PL*
*Device Specific
PL
PS I/O
© Copyright 2018 Xilinx
Data Movement Architecture: Examples PL Integration
>> 14
Mem Mem
Custom On-chip
Memory Hierarchy
PL
Accelerator
Software
PE0
Software
PE1
PL Accelerator
• Pre/Post Processing
• Co-processing
Custom On-Chip Memory Hierarchy
• Leverages on-chip memory resources in
PL (e.g., BRAM, URAM)
• High on-chip memory bandwidth and lower
memory access latency
© Copyright 2018 Xilinx
Programming Environment
© Copyright 2018 Xilinx
Programming Environment
>> 16
Full Device Programming SolutionFully Integrated with Tools for Other Everest Fabrics
CPU
Compiler
SW PE
CompilerPL Compiler
Device-Level Programming
C/C++
Programmable Logic
SW
Programmable Engine
PS I/O
AMSDPDDUCCPRI
DPD Update
High Level Programming Languages
(e.g., C/C++)
C/C++
© Copyright 2018 Xilinx
Complete Software Development Stack
>> 17
CPU CompilerSW PE
CompilerPL Compiler
Device-Level Programming
C/C++C/C++
IDE
Compiler
Debugger
Performance Analysis
Full SW Programming
Tool Chain(Single-Core and Multi-Core)
1
Wireless 4G/5G Library
ML (BLAS) Library
Vision Library
Performance-Optimized
Software Libraries(Examples)
2
Error Management
Memory Management
Boot + Configuration
Power/Thermal Management
Run-Time
Software(Examples)
3
© Copyright 2018 Xilinx
High-level Software Compiler for ML Inference
>> 18
Deep Learning Frameworks
Xilinx 16nm
UltraScale+
Xilinx ML Compiler
Xilinx Everest
w/ Software PE
Seamless Migration
• Same experience as 16nm devices
• Current models continue to work
Users Working at ML Framework Level
Users don’t directly program Software PE
© Copyright 2018 Xilinx
Application Results
© Copyright 2018 Xilinx
Kernel-Level Performance: ML and Wireless 5G
>> 20
High Compute Efficiency for Key Functions in ML and Wireless 5G
95%
80%
98%
ML Convolutions FFT DPD
Vector Processor Efficiency
Peak Theoretical Performance
Block-based Matrix Multiplication
(32×64) × (64×32)
1024-pt FFT/iFFT Volterra-based forward-path DPD
© Copyright 2018 Xilinx
Massive MIMO Radio(DUC, DDC, CFR, DPD)
Image Classification(GoogleNet CNN)
ML Inference 5G Wireless Power Reduction
Xilinx Everest 7nm Xilinx UltraScale+ 16nm
Software Programmable Engine: Summary
>> 23
Heterogenous Architecture
High-throughput, low-latency
PL flexibility
Custom memory hierarchy
SW Programmable
SW programmable (e.g., C/C++)
Compile, execute, debug
Optimized software libraries
Multiple Applications
ML Inference for Cloud DC
Wireless 5G: Radio, Baseband
ADAS/AD embedded vision
Wired: DOCSIS cable access
Compute Efficiency
Domain specific architecture
Increase in compute density
Xilinx 7nm Everest
Application-level Performance Enabled by
SW Programmable Engine
20x
40%Less
Power
4x
Sub 1ms Latency TX Sample Rate
Up to 2Gsps
© Copyright 2018 Xilinx>> 22
KEYNOTE
Adaptable Intelligence:
The Next Computing EraVictor Peng, Xilinx CEO
AUGUST 21ST @11:45 A.M.
Connect Learn Share
Silicon Valley October 1-2
XDF connects software developers and system designers to the deep expertise of Xilinx engineers, partners, and industry leaders.
Learn More www.xilinx.com/xdf
© Copyright 2018 Xilinx
Adaptable.
Intelligent.