© Copyright 2018 Xilinx

HW/SW Programmable Engine:

Domain Specific Architecture for Project Everest

Juanjo Noguera, Goran Bilski, Jan Langer, Baris Ozgul, Tim Tuan, David Clarke, Peter McColgan, Sneha Date, Zachary Dickman, Pedro Duarte, Stephan Munz, Jose Marques, Sridhar Subramanian, Saurabh Mathur, Malav P. Shah, Chris Dick, Paul Newson, Kaushik Barman, Ramon Uribe, Helen Tarn, Engin Tunali, Richard Walke, Vinod Kathail, Shail Aditya Gupta, Akella Sastry, Mukund Sivaraman, Rishi Surendran, Abraham Lee, Chia-Jui Hsu, Kumud Bhandari, Jack Lo, Kristof Denolf, Phil James-Roxby, Samuel Bayliss, Kees Vissers, Ralph Wittig, Gaurav Singh

21st August 2018


Heterogeneous Computing Architecture

>> 2

[Block diagram: Programmable Logic; SW Programmable Engine; Processing System; I/O (GT, AMS)]

7nm Everest is the First Product to Integrate this New Architecture

Tape Out in 2018

Application-Level Performance Enabled by SW Programmable Engine

(Xilinx 7nm Everest vs. Xilinx 16nm UltraScale+)

• ML Inference: 20x

• 5G Wireless: 4x

• Power: 40% Less Power

Everest Heterogeneous Architecture

˃ Compute Efficiency

˃ Reduced Power

˃ Software Programmable


Everest Architecture Block Diagram

>> 3

Programmable Logic

• ‘Soft’ logic

• Flexibility

• Custom memory hierarchy

SW Programmable Engine

• Domain Specific Architecture

• Hardened 7nm technology

• Throughput-oriented, low-latency

[Block diagram]

SW Programmable Engine: array of SW PEs

Programmable Logic: LUT, BRAM, DSP, URAM

Processing System: Application Processor, Real-Time Processor, OCM, SMMU

I/O (GT, AMS): Transceivers, PCIe, DDR, HBM, AMS

Chips with Different Implementations of the Architecture will be Announced Shortly


Market Requirements and Trends: Data Center

>> 4

Increasing Compute, Lower Latency Requirements

Source: Canziani et al., “An Analysis of Deep Neural Network Models for Practical Applications”

https://arxiv.org/abs/1605.07678

Heterogeneous Workloads

• Video + ML

• Genomics + ML

• Risk Modelling + ML

• Database + ML

• Network IPS + ML

• Storage + ML

ML Becoming an Essential Part of Data Processing


ML Inference on Heterogeneous Architecture

>> 5

[Block diagram: Programmable Logic; SW Programmable Engine; Processing System; I/O (GT, AMS)]

Custom Memory Hierarchy

Feature Map Data Volume*

*Figure credit: https://en.wikipedia.org/wiki/Convolutional_neural_network

Convolutions

Activations

• ReLU: yi = xi for xi > 0, yi = 0 otherwise

• PReLU: yi = xi for xi > 0, yi = ai xi otherwise

Pooling (single depth slice)

• 4×4 input values: 1 2 2 4 / 2 3 1 0 / 4 6 6 8 / 3 1 1 0

• 2×2 output values: 6 8 / 3 4

Fully Connected Layers

• Video

• Genomics

• Storage

• Database

• Network IPS

• Risk modeling
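
The activation and pooling operators named on this slide can be sketched in a few lines. This is a minimal illustration in plain Python, not the Software PE implementation. The function names are hypothetical, the pooling window (2×2, stride 2) is an assumption inferred from the 4×4 input and 2×2 output, and the row order of the input is also an assumption: the slide's values are arranged so that the pooled result matches the slide's output.

```python
def relu(x):
    """ReLU: pass positive inputs through, clamp everything else to zero."""
    return x if x > 0 else 0.0


def prelu(x, a):
    """PReLU: like ReLU, but negative inputs are scaled by a slope a."""
    return x if x > 0 else a * x


def max_pool_2x2(x):
    """2x2 max pooling with stride 2 over one depth slice (a list of rows)."""
    rows, cols = len(x), len(x[0])
    return [
        [max(x[r][c], x[r][c + 1], x[r + 1][c], x[r + 1][c + 1])
         for c in range(0, cols, 2)]
        for r in range(0, rows, 2)
    ]


# Assumed row order: the slide's 4x4 values, arranged so that the pooled
# result equals the slide's 2x2 output.
X = [
    [4, 6, 6, 8],
    [3, 1, 1, 0],
    [1, 2, 2, 4],
    [2, 3, 1, 0],
]
Y = max_pool_2x2(X)
```

Each output element depends only on a small, fixed neighborhood of the input, which is why these layers suit a throughput-oriented vector engine fed from nearby memory.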


Market Requirements and Trends: Wireless 5G

>> 6

5G Complexity is 100X that of 4G

Still Evolving Standard

ETRI RWS-150029,

5G Vision and Enabling Technologies: ETRI Perspective 3GPP RAN Workshop

Phoenix, Dec. 2015

http://www.3gpp.org/ftp/tsg_ran/TSG_RAN/TSGR_70/Docs

New Technologies in 5G

[Figure: 5G base-station processing chain]

Stages: Packet Processing and Wired Backhaul; Higher Layer Processing; Switching; Beam Forming & MMIO + Some Baseband Transforms; Digital Radio; ADC / DAC; Analogue Radio; Antenna Array; Baseband Processing

Function blocks: Transport & CTRL (L2–L7); Modulation & FEC; IQ Switch; Linear Algebra; iFFT/FFT; DUC, CFR, DPD, DDC; PA, LNA, Diplexer

˃ Massive MIMO

˃ Multiple antenna, frequency bands

˃ Changing functional partitioning



5G Wireless on Heterogeneous Architecture

>> 7

1: DUC: Digital Up Converter

2: DPD: Digital Pre-Distortion

3: AMS: Analog Mixed-Signal (ADC/DAC)

4: CPRI: Common Public Radio Interface

Mapping Example

5G Wireless Infrastructure (i.e., base-station)

[Figure: 5G base-station processing chain: Packet Processing and Wired Backhaul; Higher Layer Processing; Switching; Beam Forming & MMIO + Some Baseband Transforms; Digital Radio; ADC / DAC; Analogue Radio; Antenna Array; Baseband Processing]

Digital Radio with ADC/DAC: CPRI, DUC, DPD, AMS, DPD Update

• Control Maps to PS: DPD Update

• Compute Maps to Software PE: DUC, DPD

• I/O Maps to PL: CPRI, AMS

[Target blocks: Programmable Logic; Processing System; I/O]


Architecture Overview


Tile-Based Architecture

>> 9

[Tile diagram: Interconnect; ISA-based Vector Processor with ML Vector Extensions and 5G Wireless Vector Extensions; Local Memory; Data Mover; tile array alongside PL, PS, and I/O]

ISA-based Vector Processor

• Software Programmable (e.g., C/C++)

Data Mover

• Non-neighbor data communication

• Integrated synchronization primitives

Non-Blocking Interconnect

• Up to 200+ GB/s bandwidth per tile

Local Memory

• Multi-bank implementation

• Shared across neighbor cores

Cascade Interface

• Partial results to next core


Tile-Based Architecture (Continued)

>> 10

[Diagram: array of nine tiles, each pairing a Vector Core with its local Memory, alongside PL, PS, and I/O]

Architectural Focus: Deterministic Performance & Low Latency

Modular and scalable architecture

• Instantiate multiple tiles for more compute

• 10s to 100s of instances*

Distributed memory hierarchy

• Maximize memory bandwidth

Massive multi-core engines

• Increase in compute, memory and communication bandwidth

*Device Specific


Data Movement Architecture: Examples (1/2)

>> 11

1 Dataflow Pipeline

[Diagram: Software PE0, PE1, and PE2 chained through memory buffers B0/B1 and B2/B3; each stage alternates Compute and Comm phases]

Overlap Compute and Communication

2 Dataflow Graph

[Diagram: Software PEs connected through shared Mem blocks]

3 Streaming Multicast

[Diagram: Software PE0, PE1, PE2, PE3]
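
The dataflow pipeline on this slide overlaps compute and communication by pairing memory buffers (B0/B1, B2/B3) between stages. A minimal sketch of that ping-pong schedule is shown here, assuming one producer stage and one consumer stage. The function `run_ping_pong` and its `produce`/`consume` callbacks are hypothetical names, and the two halves of each step run sequentially in this simulation, whereas on the device they would proceed concurrently on different buffers.

```python
def run_ping_pong(blocks, produce, consume):
    """Simulate a two-buffer (ping-pong) schedule between two stages.

    In every step the producer fills one buffer while the consumer drains
    the other; the buffers swap roles on the next step.
    """
    buffers = [None, None]              # B0 and B1
    results, trace = [], []
    for step in range(len(blocks) + 1):
        fill, drain = step % 2, (step + 1) % 2
        filled = drained = None
        if step > 0:                    # consumer works on the previous block
            results.append(consume(buffers[drain]))
            drained = f"B{drain}"
        if step < len(blocks):          # producer writes the next block
            buffers[fill] = produce(blocks[step])
            filled = f"B{fill}"
        trace.append((filled, drained))
    return results, trace


results, trace = run_ping_pong(
    blocks=[1, 2, 3],
    produce=lambda b: [b, b],           # stand-in for the upstream PE's compute
    consume=sum,                        # stand-in for the downstream PE's compute
)
```

The trace records which buffer is written and which is read in each step. The two never coincide, which is the property that lets both stages stay busy.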


Data Movement Architecture: Examples (2/2)

>> 12

4 Non-neighbor communication

[Diagram: Software PE0 and Software PE1, each with its own Memory blocks]

5 Cascade streaming

[Diagram: Software PE2 and Software PE3]

Physical Mapping

[Diagram: PE0, PE1, PE2, and PE3 placed onto tiles of the Vector Core / Memory array]
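
Cascade streaming passes partial results from one core to the next. One way to picture it is a dot product split across several cores, where each core accumulates its own slice on top of the value received from its predecessor. The sketch below assumes exactly that workload; `cascade_stage` and `cascaded_dot` are hypothetical names, and the even split of the vectors across cores is an illustrative choice rather than the device's mapping.

```python
def cascade_stage(coeffs, samples, cascade_in=0):
    """One core's share: multiply-accumulate its slice and add the partial
    result received from the previous core in the cascade."""
    return cascade_in + sum(c * s for c, s in zip(coeffs, samples))


def cascaded_dot(coeffs, samples, num_cores):
    """Split a dot product across num_cores stages linked by a cascade."""
    chunk = -(-len(coeffs) // num_cores)    # ceiling division
    acc = 0
    for core in range(num_cores):
        lo, hi = core * chunk, (core + 1) * chunk
        acc = cascade_stage(coeffs[lo:hi], samples[lo:hi], acc)
    return acc


coeffs = [1, 2, 3, 4, 5, 6, 7, 8]
samples = [8, 7, 6, 5, 4, 3, 2, 1]
two_core = cascaded_dot(coeffs, samples, num_cores=2)
three_core = cascaded_dot(coeffs, samples, num_cores=3)
```

Because only the running accumulator moves between stages, the result is the same for any number of cores, while the traffic between neighbors stays at one partial value per output.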


Device Integration

>> 13

[Diagram: array of tiles (Vector Core + Memory) connected through CDC blocks to Programmable Logic (LUT, BRAM, DSP, URAM), alongside PL, PS, and I/O]

• Clock-domain crossing (CDC) between PL & Software PE clocks

• Up to TB/s of bandwidth between Software PE and PL*

*Device Specific


Data Movement Architecture: Examples PL Integration

>> 14

[Diagram: Software PE0 and Software PE1 connected to a PL Accelerator and to a Custom On-chip Memory Hierarchy (Mem, Mem)]

PL Accelerator

• Pre/Post Processing

• Co-processing

Custom On-Chip Memory Hierarchy

• Leverages on-chip memory resources in PL (e.g., BRAM, URAM)

• High on-chip memory bandwidth and lower memory access latency


Programming Environment


Programming Environment

>> 16

Full Device Programming Solution

Fully Integrated with Tools for Other Everest Fabrics

[Diagram: High Level Programming Languages (e.g., C/C++) feed Device-Level Programming, which drives the CPU Compiler, the SW PE Compiler, and the PL Compiler; targets are PS, SW Programmable Engine, Programmable Logic, and I/O; example application blocks: CPRI, DUC, DPD, AMS, DPD Update]


Complete Software Development Stack

>> 17

[Diagram: C/C++ feeds Device-Level Programming, which drives the CPU Compiler, the SW PE Compiler, and the PL Compiler]

1 Full SW Programming Tool Chain (Single-Core and Multi-Core)

• IDE

• Compiler

• Debugger

• Performance Analysis

2 Performance-Optimized Software Libraries (Examples)

• Wireless 4G/5G Library

• ML (BLAS) Library

• Vision Library

3 Run-Time Software (Examples)

• Error Management

• Memory Management

• Boot + Configuration

• Power/Thermal Management


High-level Software Compiler for ML Inference

>> 18

[Diagram: Deep Learning Frameworks feed the Xilinx ML Compiler, which targets both Xilinx 16nm UltraScale+ and Xilinx Everest w/ Software PE]

Seamless Migration

• Same experience as 16nm devices

• Current models continue to work

Users Working at ML Framework Level

Users don’t directly program Software PE


Application Results


Kernel-Level Performance: ML and Wireless 5G

>> 20

High Compute Efficiency for Key Functions in ML and Wireless 5G

Vector Processor Efficiency (relative to Peak Theoretical Performance)

• ML Convolutions: 95% (Block-based Matrix Multiplication, (32×64) × (64×32))

• FFT: 80% (1024-pt FFT/iFFT)

• DPD: 98% (Volterra-based forward-path DPD)
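
To make the ML convolution benchmark concrete, the sketch below shows a block-based (32×64) × (64×32) matrix multiplication in plain Python, followed by the arithmetic that links an efficiency figure to a cycle count. It is an illustration only: the 8×8×8 block size, the function names, and the peak rate of 128 multiply-accumulates per cycle are all assumptions and are not taken from the slides.

```python
import math


def blocked_matmul(a, b, block=8):
    """Multiply a (m x k) by b (k x n) one block at a time, so each inner
    loop nest touches only a small working set of both operands."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            for k0 in range(0, k, block):
                for i in range(i0, min(i0 + block, m)):
                    for kk in range(k0, min(k0 + block, k)):
                        a_ik = a[i][kk]
                        for j in range(j0, min(j0 + block, n)):
                            c[i][j] += a_ik * b[kk][j]
    return c


def cycles_at_efficiency(macs, peak_macs_per_cycle, efficiency):
    """Cycles needed when only a fraction of the peak MAC rate is sustained."""
    return math.ceil(macs / (peak_macs_per_cycle * efficiency))


# Deterministic test operands with the shapes quoted on the slide.
A = [[(i + j) % 5 for j in range(64)] for i in range(32)]
B = [[(i * j) % 7 for j in range(32)] for i in range(64)]
C = blocked_matmul(A, B)

total_macs = 32 * 64 * 32   # one multiply-accumulate per (i, k, j) triple
# Hypothetical peak rate of 128 MACs per cycle, used only for the arithmetic.
cycles = cycles_at_efficiency(total_macs, peak_macs_per_cycle=128, efficiency=0.95)
```

Under these assumed numbers the kernel needs 65,536 multiply-accumulates, which is 512 cycles at peak and 539 cycles at 95% efficiency. The gap between those two counts is what an efficiency figure measures.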


Application-level Performance Enabled by Xilinx 7nm Everest

Xilinx Everest 7nm vs. Xilinx UltraScale+ 16nm

• ML Inference: 20x, Image Classification (GoogleNet CNN), Sub 1ms Latency

• 5G Wireless: 4x, Massive MIMO Radio (DUC, DDC, CFR, DPD), TX Sample Rate Up to 2Gsps

• Power Reduction: 40% Less Power


Software Programmable Engine: Summary

>> 23

Heterogeneous Architecture

• High-throughput, low-latency

• PL flexibility

• Custom memory hierarchy

SW Programmable

• SW programmable (e.g., C/C++)

• Compile, execute, debug

• Optimized software libraries

Multiple Applications

• ML Inference for Cloud DC

• Wireless 5G: Radio, Baseband

• ADAS/AD embedded vision

• Wired: DOCSIS cable access

Compute Efficiency

• Domain specific architecture

• Increase in compute density



Adaptable.

Intelligent.
