
Two FPGA-DNN Projects


Page 1: Two FPGA-DNN Projects

Two FPGA-DNN Projects:

1. Low Latency Multi-Layer Perceptrons using FPGAs

2. Acceleration of CNN Training on FPGA-based Clusters

Presented by Martin Herbordt

Work by Ahmed Sanaullah, Tong Geng, Tianqi Wang+,

Ethan Yang, Rushi Patel, Yuri Alexeev*, Kaz Yoshii*

*Argonne National Lab

+BU & USTC

Page 2: Two FPGA-DNN Projects


Part 1:

Low Latency Multi-Layer Perceptrons using FPGAs

Outline for part 1

Background Problems

Multi-Layer Perceptrons

FPGA OpenCL

FPGA Implementation

Evaluation


Publications, part 1

1. A. Sanaullah, C. Yang, Y. Alexeev, K. Yoshii, M.C. Herbordt (In Press): Real-Time Data Analysis for Medical Diagnosis Using FPGA-Accelerated Neural Networks, BMC Bioinformatics

2. A. Sanaullah, C. Yang, Y. Alexeev, K. Yoshii, M.C. Herbordt (2018): Application-Aware Tuning of Reconfigurable Multi-Layer Perceptron Architectures, High Performance Extreme Computing

3. A. Sanaullah, C. Yang, Y. Alexeev, K. Yoshii, M.C. Herbordt (2017): Boosting Curative Surgery Success Rates using FPGAs, Computational Approaches for Cancer

4. A. Sanaullah, C. Yang, Y. Alexeev, K. Yoshii, M.C. Herbordt (2017): TRIP: An Ultra-Low Latency, TeraOps/s Reconfigurable Inference Processor for Multi-Layer Perceptrons, SC17 (poster & extended abstract)

Page 3: Two FPGA-DNN Projects


Problem 1: Proper Performance Metrics

Throughput is the typical performance metric for DNNs

Larger MMM units = Higher FLOPS = Better performance

Make compute units as large as possible

Google TPU has a 64K-MAC array

92 TOPS peak performance

But… Latency is often more important for inference

Want to get individual results faster

High throughput does not imply low latency

Many questions have not been answered in literature:

1. How do we size components to minimize latency?

Do we still make compute units as large as possible?

2. For large components, what is the impact of loading and unloading data?

When do their latencies become comparable with the compute latency?

What about handling smaller computations?


Google TPU – Chip Layout

Page 4: Two FPGA-DNN Projects


Problem 2: Domain Specific ASICs (DSICs)

DNNs have a large pool of use cases

Each with varying number of layers and dimensions

ASICs for DNNs must be able to evaluate all possible models

Hence, ASICs are designed with a certain level of generality

No longer application specific – rather, domain specific

Drawbacks of Domain Specific ASICs

Trained models must be stored off-chip

Batch processing increases compute latency – wait times

TPU Microcode must be fetched from host

Tied to specific host APIs

e.g. TensorFlow is needed to use TPU

Fixed quantization

e.g. TPU uses 8-bit multipliers and 32-bit activations

New ASICs needed frequently to keep up with technology

2016: TPU

Today: TPU v2, TPU v2 Pod, TPU v3


Google TPU – Block Diagram

Page 5: Two FPGA-DNN Projects


Multi-layer Perceptrons (MLP)

Utility

Fully connected – often used where there are no direct dependencies between pixels or where shift invariance is not needed. Ex: processing sensor inputs.

Layers in CNNs

Logical Characteristics (a)

• Fully connected layers of neurons

• Layer outputs are non-linear functions of the sum of scaled/weighted neuron outputs of the previous layer

• Memory bound: no weight reuse for a test vector

• Inference can be performed in fixed point without loss of accuracy


Page 6: Two FPGA-DNN Projects


Multi-layer Perceptrons (MLP)

Compute Model (b)

Test cases (inputs)

Output vector

Weights and biases (precomputed)

Computation is a Matrix-Vector MADD and … (a minimal sketch follows below)

Computational Interest (Why examine MLPs?)

Memory Bound

Parameter sizing is tractable: for this design, only input and output sizes affect overall latency
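Below is a minimal C sketch of the per-layer computation just described: a matrix-vector multiply-add (out = W·x + b), with the non-linear activation applied afterwards. The function and variable names are illustrative, not the actual TRIP implementation.

#include <stddef.h>

/* One MLP layer as a matrix-vector multiply-add:
 * out[j] = bias[j] + sum_i W[j*n_in + i] * in[i].
 * Illustrative sketch only, not the TRIP design. */
void mlp_layer(const float *W, const float *bias, const float *in,
               float *out, size_t n_in, size_t n_out)
{
    for (size_t j = 0; j < n_out; j++) {
        float acc = bias[j];
        for (size_t i = 0; i < n_in; i++)
            acc += W[j * n_in + i] * in[i];  /* each weight is read exactly once: memory bound */
        out[j] = acc;                        /* activation applied after the sum */
    }
}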


Page 7: Two FPGA-DNN Projects


Why FPGAs?

• FPGAs enable module sizing to be application specific to balance latency and throughput

• FPGAs enable use of on-chip memory to store weights and biases

Architecture designed for a very specific use case

data can be initialized on-chip as part of the bit-stream

• No instructions needed for our FPGA design

• only a start trigger is required.

• FPGAs allow for variable quantization based on application

• FPGA designs can be implemented using Off-The-Shelf components

• Use of OpenCL reduces programming effort for FPGAs


Page 8: Two FPGA-DNN Projects


Intel® FPGA SDK for OpenCL™

What is OpenCL: Unified programming model (C-based)

Acceleration on heterogeneous systems (CPU, GPU, DSP, FPGA)

Architecture: Host – CPU functions for managing and delegating tasks to available resources

Kernel – Device functions that correspond to application offloads

FPGA features: C99 code and pragmas translated to specialized architecture

Direct connectivity between kernel functions

Support for HDL integration

Design and verify components in HDL that cannot be efficiently expressed in C99

e.g. ring buffers, interleaved memory, arbitrary precision data
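To make the host/kernel split concrete, here is an illustrative OpenCL C (C99-based) kernel for a fully connected layer, where each work-item computes one output neuron. This is a sketch for exposition only, not the kernel used in this work, and all names and parameters are assumptions.

/* Illustrative OpenCL C kernel: one work-item per output neuron.           */
/* The host (CPU) side builds this kernel and enqueues it on the FPGA.      */
__kernel void fc_layer(__global const float *W,
                       __global const float *bias,
                       __global const float *in,
                       __global float *out,
                       const int n_in)
{
    int j = get_global_id(0);                 /* output-neuron index      */
    float acc = bias[j];
    for (int i = 0; i < n_in; i++)
        acc += W[j * n_in + i] * in[i];       /* dot product with weights */
    out[j] = fmax(0.0f, acc);                 /* ReLU activation          */
}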


Page 9: Two FPGA-DNN Projects


Part 1 Outline

Background Problems

Multi-Layer Perceptrons

FPGA OpenCL

FPGA Implementation

Evaluation


Page 10: Two FPGA-DNN Projects


Architecture Overview


[Block diagram: Scalar Product (vector multiply + add), Accumulate, Max Search, Activation & Re-Quantization (Leading 1), Buffer, and Control modules.
M: Number of Scalar Product Module Inputs; N: Number of Scalar Product Module Outputs]

Page 11: Two FPGA-DNN Projects


Vector Multiplication


Scalar Product Module with N units and M 8-bit multipliers per unit

Accumulator used for bias values and partial sums

(Figure callouts: one DSP; uses the custom RTL feature)
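The behavior of this stage can be sketched in C as below: N parallel units, each performing M 8-bit multiplies per step and adding the result to a 32-bit accumulator that was preloaded with the bias (and that holds partial sums when the input vector is wider than M). The constants and names are illustrative assumptions, not the RTL.

#include <stdint.h>

#define M 16   /* 8-bit multipliers per unit (illustrative value) */
#define N 8    /* number of parallel units   (illustrative value) */

/* One step of the Scalar Product Module: each unit n consumes the same
 * M-wide slice of the input vector with its own weights and accumulates. */
void scalar_product_step(const int8_t x[M], const int8_t w[N][M],
                         int32_t acc[N])
{
    for (int n = 0; n < N; n++) {
        int32_t sum = 0;
        for (int m = 0; m < M; m++)
            sum += (int32_t)w[n][m] * (int32_t)x[m];  /* one 8-bit MAC */
        acc[n] += sum;      /* acc preloaded with bias; holds partial sums */
    }
}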

Page 12: Two FPGA-DNN Projects


Max Search


Find maximum value from N inputs in log(N) stages
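A software sketch of the log2(N)-stage search, assuming N is a power of two (purely illustrative; in hardware the comparisons of each stage run in parallel):

#include <stdint.h>

/* Tournament-style max search: each stage halves the number of candidates. */
int32_t max_search(int32_t v[], int n)      /* n assumed to be a power of two */
{
    for (int stride = n / 2; stride >= 1; stride /= 2)   /* log2(n) stages        */
        for (int i = 0; i < stride; i++)                 /* parallel in hardware  */
            if (v[i + stride] > v[i])
                v[i] = v[i + stride];
    return v[0];                            /* note: overwrites the input array */
}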

Page 13: Two FPGA-DNN Projects


Activation & Re-Quantization


Leading 1

Highest set bit search in log2(32) = 5 steps

Activation

ReLU: x = max(0,x)

Re-quantization using Truncation:

Input data: 32-bit

Output: 8-bit

Lower complexity than division
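A C sketch of this stage under my assumptions about how the pieces connect (the leading-1 position of the layer maximum from the Max Search unit sets the truncation shift; the exact wiring in the actual design may differ):

#include <stdint.h>

/* Position of the highest set bit, found in log2(32) = 5 steps; -1 if x == 0. */
static int leading_one(uint32_t x)
{
    if (x == 0) return -1;
    int pos = 0;
    if (x >> 16) { pos += 16; x >>= 16; }
    if (x >> 8)  { pos += 8;  x >>= 8;  }
    if (x >> 4)  { pos += 4;  x >>= 4;  }
    if (x >> 2)  { pos += 2;  x >>= 2;  }
    if (x >> 1)  { pos += 1; }
    return pos;
}

/* ReLU followed by truncation-based re-quantization from 32 to 8 bits:
 * a right shift keeps the top bits instead of dividing by a scale factor.
 * Keeping 7 magnitude bits (an assumption) keeps the result in a signed byte. */
int8_t activate_requantize(int32_t acc, uint32_t layer_max)
{
    int32_t relu  = acc > 0 ? acc : 0;                 /* ReLU: max(0, x) */
    int     msb   = leading_one(layer_max);
    int     shift = msb > 6 ? msb - 6 : 0;
    return (int8_t)(relu >> shift);
}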

Page 14: Two FPGA-DNN Projects


Buffer


Implemented using RTL

Two-bank design using 32-bit registers

Data In

N values per cycle, Stores result of current layer under evaluation

Data Out

M values per cycle, Stores result of previous layer

Transfer Enable

Triggered at the end of every layer
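A behavioral sketch of the two-bank (ping-pong) arrangement described above; the depth, types, and names are illustrative assumptions, not the RTL:

#include <stdint.h>

#define DEPTH 1024                 /* illustrative bank depth */

typedef struct {
    int32_t bank[2][DEPTH];        /* 32-bit registers, two banks */
    int     wr;                    /* bank collecting the current layer's results */
} layer_buffer;

/* Data In: store one result of the layer under evaluation. */
void buffer_write(layer_buffer *b, int idx, int32_t value)
{
    b->bank[b->wr][idx] = value;
}

/* Data Out: read one result of the previous layer. */
int32_t buffer_read(const layer_buffer *b, int idx)
{
    return b->bank[b->wr ^ 1][idx];
}

/* Transfer Enable: swap bank roles at the end of every layer. */
void buffer_transfer(layer_buffer *b)
{
    b->wr ^= 1;
}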

Page 15: Two FPGA-DNN Projects


Control


Implemented using a state machine (instructionless)

Application-specific triggers computed offline and initialized on-chip

[Figures: flow chart and actual implementation]
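As a rough illustration of an instruction-less controller, the sketch below steps a small state machine through per-layer trigger counts that would be computed offline and initialized on-chip. The states, counts, and names are assumptions for exposition, not the actual design.

#include <stdbool.h>

#define NUM_LAYERS 3
static const int cycles_per_layer[NUM_LAYERS] = { 64, 32, 4 };  /* computed offline (illustrative) */

enum state { IDLE, COMPUTE, TRANSFER, DONE };

/* Runs the whole network once a start trigger is seen; no instruction stream. */
void control(bool start)
{
    enum state s = start ? COMPUTE : IDLE;
    int layer = 0, count = 0;

    while (s == COMPUTE || s == TRANSFER) {
        if (s == COMPUTE) {
            if (++count == cycles_per_layer[layer]) {   /* layer finished        */
                count = 0;
                s = TRANSFER;
            }
        } else {                                        /* TRANSFER state        */
            /* pulse the buffer's Transfer Enable here */
            s = (++layer == NUM_LAYERS) ? DONE : COMPUTE;
        }
    }
}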

Page 16: Two FPGA-DNN Projects


Latency Models

Persistent Critical Path vs Variable Critical Path


(One critical-path component decreases with increasing values of M, N; the other increases.)
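This trade-off can be read against a simple first-order model (an assumed form for exposition, not the paper's exact expressions): the per-layer compute term shrinks as M and N grow, while the depth of the persistent pipeline stages (multiplier/adder and max trees) grows with them.

T_{\mathrm{compute}} \approx \sum_{l} \left\lceil \tfrac{I_l}{M} \right\rceil \left\lceil \tfrac{O_l}{N} \right\rceil \ \text{cycles} \quad (\text{decreases as } M, N \text{ grow}), \qquad
T_{\mathrm{pipeline}} \approx \sum_{l} \left( \log_2 M + \log_2 N + c \right) \quad (\text{increases as } M, N \text{ grow})

where layer l has I_l inputs and O_l outputs, and c is a fixed per-layer overhead.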

Page 17: Two FPGA-DNN Projects


Part 1 Outline

Background Problems

Multi-Layer Perceptrons

FPGA OpenCL

FPGA Implementation

Evaluation


Page 18: Two FPGA-DNN Projects


Benchmark & Platforms


Benchmarks

GPU: Nvidia Tesla K80 – 4992 CUDA cores, 480 GB/s global-memory BW; cuBLAS, CUDA 8.0

FPGA: Intel Arria 10 10AX115H3F34I2SG (20 nm) – 400K ALMs, 1.5K DSPs, ~6 MB on-chip RAM; Intel FPGA OpenCL SDK 16.0

Page 19: Two FPGA-DNN Projects


Design Parameter Selection – Impact of M&N


Page 20: Two FPGA-DNN Projects


Results


FPGA outperforms the high-end GPU by an average of 1.47x

Execution time compared for evaluation of test cases

GPU evaluates test vectors with batch processing

FPGA speedup over GPU will increase if batch size is restricted

Page 21: Two FPGA-DNN Projects


Part 2: Acceleration of CNN Training on FPGA-based Clusters

Outline

Background


Publications, part 2

1. T. Geng, T. Wang, A. Sanaullah, C. Yang, R. Patel, M.C. Herbordt (2018): A Framework for Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters with Work and Weight Load Balancing, Field Programmable Logic and Applications

2. T. Geng, T. Wang, A. Sanaullah, C. Yang, R. Xu, R. Patel, M.C. Herbordt (2018): FPDeep: Acceleration and Load Balancing of CNN Training on FPGA Clusters, Field-Programmable Custom Computing Machines


Page 22: Two FPGA-DNN Projects


Training is a big problem for CNNs

Inference (AlexNet): 720 million FLOPs per image

Training: 2.2 billion FLOPs per image × batch size × iterations × epochs → can be 550 trillion FLOPs (rough arithmetic below)

To train a CNN, clusters/clouds are necessary

Problem: How to map training logic to multiple devices efficiently?
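As a rough consistency check on the quoted total (my illustrative split of batch size and iteration count, not the slide's):

2.2 \times 10^{9}\ \tfrac{\mathrm{FLOPs}}{\mathrm{image}} \;\times\; \underbrace{250 \times 1000}_{\text{batch} \,\times\, \text{iterations} \;=\; 2.5 \times 10^{5}\ \text{images}} \;=\; 5.5 \times 10^{14} \;\approx\; 550\ \text{trillion FLOPs}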

“Tensorflow Alexnet benchmark,” https://www.leadergpu.com/articles/428tensorflow-alexnet-benchmark

Ben-Nun, Tal, and Torsten Hoefler. “Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis.” arXiv preprint arXiv:1802.09941 (2018).

Page 23: Two FPGA-DNN Projects


How to map training logic to multiple devices?

Most widely used: Data parallelism – the entire device is allocated to an image, which is processed layer-by-layer

Features need to remain available while waiting for back-propagation → large storage demand

Weights are broadcast between all clients and the server (centralized topology), or all-to-all (decentralized topology) → high bandwidth requirement from weight broadcast and update

Requires a large batch size (8K used by Facebook on 256 GPUs) → large batch size limits scalability

Goyal, Priya, et al. "Accurate, large minibatch SGD: training imagenet in 1 hour." arXiv preprint arXiv:1706.02677 (2017).

Page 24: Two FPGA-DNN Projects


How to map training logic to multi-devices?

Layer parallelism: weights are distributed & stored; load balancing is hard – workloads of different layers vary greatly

Model parallelism: weights are distributed & stored; too much data exchange among devices to combine intermediate results

• Solves batch size problem – But …

• GPU is not a good platform for either:

• No direct inter-SM communication

• Inefficient inter-device communication

• FPGA is the right choice?

• If Layer parallelism: How to balance the workload?

• If Model parallelism: How to reduce the massive data exchange?

FPDeep is proposed!

Page 25: Two FPGA-DNN Projects


Solution – Hybrid Model/Layer Parallelism


Weights are stored in a distributed fashion – not necessarily with their layer (more later!)

Less data exchange compared with Model Parallelism

Better load balance than Layer Parallelism

Small batch size is always supported even if using hundreds of devices;

Page 26: Two FPGA-DNN Projects


FPDeep – Overview: What it is, Why it works

A framework to map CNN to FPGA

clusters

Achieves High Performance and

Energy Efficiency

Workload balancing

Full utilization of compute elements

Storage balancing for parameters

Only on-chip memory for CONV layers

Good portability: 1-D network is

sufficient

Good scalability: up to 83 FPGAs

with 5 transceivers

Includes HDL generator


Page 27: Two FPGA-DNN Projects


FPDeep Method – Heuristic Partitioning, Pipeline, Adjust

1. Start with layer-based partitioning; use model-based partitioning as needed. If all layers had equal load, layer-based alone would be a good starting point.

2. Measure work per layer. Inter-layer metric used = (FL)Ops >> first heuristic. This makes sense because we have nearly 100% utilization of compute resources.

3. Group layers for mapping to FPGAs. Balance work per FPGA – more or fewer layers per FPGA as indicated by layer workload; split (high-workload) layers among FPGAs as necessary; map multiple layers per FPGA as necessary. Intra-layer metrics: complex … >> second heuristic.

4. Pipeline layer execution >> third set of heuristics.

5. Adjust the mapping to improve performance. If all weights of the work mapped to an FPGA don't fit, transfer weights from a neighbor (using MGTs) rather than loading them from memory.


Page 28: Two FPGA-DNN Projects


How to partition the workload? (1)


Inter-Layer Partition

Resources are allocated according to workload of each layer

Example: 7 FPGAs. According to the number of FLOPs, Layer 1 needs 4.8 FPGAs and Layer 2 needs 2.2 FPGAs (proportional-allocation sketch below).
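A minimal C sketch of this proportional allocation: each layer gets a fractional share of the FPGAs equal to its fraction of the total FLOPs. The per-layer FLOP counts below are made-up numbers chosen only to reproduce the 4.8 / 2.2 split of this example.

#include <stdio.h>

int main(void)
{
    double flops[] = { 4.8e9, 2.2e9 };   /* per-layer workload (illustrative numbers) */
    const int n_layers = 2;
    const int n_fpgas  = 7;

    double total = 0.0;
    for (int l = 0; l < n_layers; l++)
        total += flops[l];

    for (int l = 0; l < n_layers; l++) {
        double share = n_fpgas * flops[l] / total;   /* fractional FPGAs for layer l */
        printf("Layer %d: %.1f FPGAs\n", l + 1, share);
    }
    return 0;   /* prints 4.8 and 2.2 for these inputs */
}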

Page 29: Two FPGA-DNN Projects


Intra-Layer Partition: Additional heuristic: Each FPGA evaluates a part of input features

How to partition the workload? (2)

Page 30: Two FPGA-DNN Projects


Integration of Inter- and Intra-Layer Partitioning: each FPGA does its part of the workload and forwards the result to the next FPGA

For example: 4.8 FPGAs for Layer 1, 2.2 FPGAs for Layer 2 (with the IFP method)

How to partition the workload? (3)

Page 31: Two FPGA-DNN Projects


Layer Fusion: fine-grained pipelining


• Activations propagate faster; no need to wait for the completion of the whole layer

• The time features must be cached while waiting for backward propagation is reduced → reduced storage demand

Page 32: Two FPGA-DNN Projects


Architecture


Page 33: Two FPGA-DNN Projects


Weight Balancing


• Weight balancing allows weights to fit on chip

• Does not overload the network – weight-transfer BW replaces activation BW from later layers

Page 34: Two FPGA-DNN Projects


Overall Resource Allocation Optimization



Page 36: Two FPGA-DNN Projects


Resource Utilization

[Figure: per-FPGA BRAM utilization (left) and DSP utilization & throughput in TOps (right) when mapping to 15 FPGAs – panels (A) AlexNet BRAM, (B) AlexNet DSP & Throughput, (C) VGG-16 BRAM, (D) VGG-16 DSP & Throughput, (E) VGG-19 BRAM, (F) VGG-19 DSP & Throughput; x-axis: FPGA 1–15, y-axis: utilization percentage and throughput (TOps)]

Example: mapping AlexNet / VGG-16 / VGG-19 to 15 FPGAs

AlexNet (without weight balancing): unbalanced BRAM utilization

VGG-16/19 (with weight balancing): balanced BRAM utilization

Page 37: Two FPGA-DNN Projects


Results


• Scalability: 80+ FPGAs

• Utilization: >98%

Page 38: Two FPGA-DNN Projects


Performance

Power Efficiency (GOPS/J):

Versus K80 GPU: 5.5x

Versus Titan X GPU: 8.8x

Versus other FPGA designs: 5.7x
