Two FPGA-DNN Projects:
1. Low Latency Multi-Layer Perceptrons using FPGAs
2. Acceleration of CNN Training on FPGA-based Clusters
Presented by Martin Herbordt
Work by Ahmed Sanaullah, Tong Geng, Tianqi Wang+,
Ethan Yang, Rushi Patel, Yuri Alexeev*, Kaz Yoshii*
*Argonne National Lab
+BU & USTC
Part 1:
Low Latency Multi-Layer Perceptrons using FPGAs
Outline for part 1
Background: Problems
Multi-Layer Perceptrons
FPGA OpenCL
FPGA Implementation
Evaluation
Publications, part 1
1. A. Sanaullah, C. Yang, Y. Alexeev, K. Yoshii, M.C. Herbordt (In Press): Real-Time Data Analysis for Medical Diagnosis Using FPGA-Accelerated Neural Networks, BMC Bioinformatics
2. A. Sanaullah, C. Yang, Y. Alexeev, K. Yoshii, M.C. Herbordt (2018): Application-Aware Tuning of Reconfigurable Multi-Layer Perceptron Architectures, High Performance Extreme Computing
3. A. Sanaullah, C. Yang, Y. Alexeev, K. Yoshii, M.C. Herbordt (2017): Boosting Curative Surgery Success Rates using FPGAs, Computational Approaches for Cancer
4. A. Sanaullah, C. Yang, Y. Alexeev, K. Yoshii, M.C. Herbordt (2017): TRIP: An Ultra-Low Latency, TeraOps/s Reconfigurable Inference Processor for Multi-Layer Perceptrons, SC17 (poster & extended abstract)
Problem 1: Proper Performance Metrics
Throughput is the typical performance metric for DNNs
Larger MMM units = Higher FLOPS = Better performance
Make compute units as large as possible
Google TPU has a 64K-MAC array
92 TOPS peak performance
But… Latency is often more important for inference
Want to get individual results faster
High throughput does not imply low latency
Many questions have not been answered in the literature:
1. How do we size components to minimize latency?
Do we still make compute units as large as possible?
2. For large components, what is the impact of loading and unloading data?
When do their latencies become comparable with the compute latencies?
What about handling smaller computations?
Google TPU – Chip Layout
Problem 2: Domain-Specific ASICs (DSICs)
DNNs have a large pool of use cases
Each with varying number of layers and dimensions
ASICs for DNNs must be able to evaluate all possible models
Hence, ASICs are designed with a certain level of generality
No longer application-specific – rather, domain-specific
Drawbacks of Domain Specific ASICs
Trained models must be stored off-chip
Batch processing increases compute latency – wait times
TPU Microcode must be fetched from host
Tied to specific host APIs
e.g. TensorFlow is needed to use TPU
Fixed quantization
e.g. TPU uses 8-bit multipliers and 32-bit activations
New ASICs needed frequently to keep up with technology
2016: TPU
Today: TPU v2, TPU v2 Pod, TPU v3
Google TPU – Block Diagram
Multi-layer Perceptrons (MLP)
Utility
Fully connected – often used where there are no direct
dependencies between pixels or where shift invariance is
not needed. Ex: Processing sensor inputs.
Layers in CNNs
Logical Characteristics (a)
• Fully connected layers of neurons
• Layer outputs are non-linear functions of the sum of
scaled/weighted neuron outputs of the previous layer
• Memory bound: no weight reuse for a test vector
• Inference can be performed in fixed point without loss of
accuracy
Multi-layer Perceptrons (MLP)
Compute Model (b)
Test cases (inputs)
Output vector
Weights and biases (precomputed)
Computation is a Matrix-Vector MADD and …
Computational Interest (Why examine MLPs?)
Memory Bound
Parameter sizing is tractable: for this design, only
input and output sizes affect overall latency
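To make the compute model concrete, here is a minimal C sketch of one layer's matrix-vector MADD followed by activation and re-quantization; the fixed-point types follow the 8-bit/32-bit quantization discussed elsewhere in the deck, and all names and the shift parameter are illustrative, not the authors' code.

```c
#include <stdint.h>

// One MLP layer: y = act(W*x + b), evaluated in fixed point.
// in_dim, out_dim, weights, and biases are per-layer constants
// (precomputed, as on the slide); names here are illustrative.
void mlp_layer(int out_dim, int in_dim,
               const int8_t W[/* out_dim * in_dim */],
               const int32_t bias[/* out_dim */],
               const int8_t x[/* in_dim */],
               int8_t y[/* out_dim */], int shift)
{
    for (int o = 0; o < out_dim; o++) {
        int32_t acc = bias[o];                        // start from the bias
        for (int i = 0; i < in_dim; i++)
            acc += (int32_t)W[o * in_dim + i] * x[i]; // MADD
        if (acc < 0) acc = 0;                         // ReLU activation
        y[o] = (int8_t)(acc >> shift);                // re-quantize 32b -> 8b
    }
}
```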
Why FPGAs?
• FPGAs enable module sizing to be application specific to balance latency and
throughput
• FPGAs enable use of on-chip memory to store weights and biases
Architecture designed for a very specific use case:
data can be initialized on-chip as part of the bitstream
• No instructions needed for our FPGA design
• only a start trigger is required
• FPGAs allow for variable quantization based on application
• FPGA designs can be implemented using Off-The-Shelf components
• Use of OpenCL reduces programming effort for FPGAs
Intel® FPGA SDK for OpenCL™
What is OpenCL: Unified programming model (C-based)
Acceleration on heterogeneous systems (CPU, GPU, DSP, FPGA)
Architecture: Host – CPU functions for managing and delegating tasks to available resources
Kernel – Device functions that correspond to application offloads
FPGA features: C99 code and pragmas translated to specialized architecture
Direct connectivity between kernel functions
Support for HDL integration
Design and verify components in HDL that cannot be efficiently expressed in C99
e.g. ring buffers, interleaved memory, arbitrary precision data
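To give a flavor of the programming model, here is a minimal OpenCL C kernel of the kind the SDK compiles into a specialized hardware pipeline; the kernel name, arguments, and unroll factor are illustrative sketches, not the authors' code.

```c
// Minimal OpenCL C kernel sketch (illustrative): the SDK translates this
// C99-style loop nest plus pragmas into a specialized datapath.
__kernel void vec_madd(__global const char *restrict weights,
                       __global const char *restrict x,
                       __global const int  *restrict bias,
                       __global char       *restrict y,
                       const int in_dim)
{
    int o = get_global_id(0);        // one work-item per output neuron
    int acc = bias[o];
    #pragma unroll 8                 // unroll pragma guides hardware replication
    for (int i = 0; i < in_dim; i++)
        acc += weights[o * in_dim + i] * x[i];
    y[o] = acc > 0 ? (char)(acc >> 8) : 0;   // ReLU + truncation
}
```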
Part 1 Outline
Background: Problems
Multi-Layer Perceptrons
FPGA OpenCL
FPGA Implementation
Evaluation
Architecture Overview
Modules: Vector Multiply + Add (Scalar Product), Accumulate, Max Search, Activation & Re-Quantization (Leading-1, Activation and Quantize), Buffer, Control
M: number of Scalar Product Module inputs
N: number of Scalar Product Module outputs
[Figure: block diagram of the pipeline, with M-wide and N-wide datapaths connecting the modules]
Vector Multiplication
Scalar Product Module with N units and M 8-bit multipliers per unit
Accumulator used for bias values and partial sums
One DSP per multiply-add; uses the custom RTL feature
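A software sketch of one pass through this stage: N units each fold M 8-bit products into a running 32-bit accumulator that was preloaded with the bias. Sizes and names are illustrative assumptions, not the OpenCL/RTL source.

```c
#include <stdint.h>

#define M 16   // multipliers per scalar-product unit (illustrative sizing)
#define N 8    // number of scalar-product units (illustrative sizing)

// One step of the Scalar Product Module: each of the N units consumes
// the same M-element input slice with its own M weights, then adds the
// partial sum into its accumulator (bias values were preloaded there).
void scalar_product_step(const int8_t x[M],
                         const int8_t w[N][M],
                         int32_t acc[N])
{
    for (int n = 0; n < N; n++) {
        int32_t partial = 0;
        for (int m = 0; m < M; m++)
            partial += (int32_t)w[n][m] * x[m];  // M 8-bit multiplies
        acc[n] += partial;                       // accumulate partial sum
    }
}
```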
Max Search
Find the maximum value from N inputs in log2(N) stages
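A pairwise reduction tree mirrors this hardware structure; the C sketch below resolves N inputs in log2(N) stages. It assumes N is a power of two and overwrites its buffer (illustrative, not the authors' code).

```c
#include <stdint.h>

// Max search over n values in log2(n) stages: each stage halves the
// number of candidates by keeping the larger of each pair.
int32_t max_search(int32_t buf[], int n)
{
    for (int stride = n / 2; stride >= 1; stride /= 2)  // log2(n) stages
        for (int i = 0; i < stride; i++)
            if (buf[i + stride] > buf[i])
                buf[i] = buf[i + stride];
    return buf[0];
}
```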
Activation & Re-Quantization
Leading 1
Search for the highest set bit in log2(32) = 5 steps
Activation
ReLU: x = max(0,x)
Re-quantization using Truncation:
Input data: 32-bit
Output: 8-bit
Lower complexity than division
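A C sketch of both steps follows. The binary leading-1 search is standard; deriving the truncation shift from the layer-wide maximum (as found by Max Search) is our reading of the design, so that scaling rule should be treated as an assumption.

```c
#include <stdint.h>

// Leading-1 search in log2(32) = 5 binary steps: locate the highest
// set bit of a 32-bit value (used to pick the truncation shift).
static int leading_one(uint32_t v)
{
    int pos = 0;
    if (v >> 16) { v >>= 16; pos += 16; }
    if (v >> 8)  { v >>= 8;  pos += 8;  }
    if (v >> 4)  { v >>= 4;  pos += 4;  }
    if (v >> 2)  { v >>= 2;  pos += 2;  }
    if (v >> 1)  {           pos += 1;  }
    return pos;  // bit index of the highest 1 (0 if v is 0 or 1)
}

// ReLU then truncate 32-bit to 8-bit. The shift is chosen so values up
// to layer_max keep their top 7 magnitude bits (assumption, see above).
static int8_t activate_requantize(int32_t x, int32_t layer_max)
{
    if (x < 0) return 0;                    // ReLU: max(0, x)
    int msb = leading_one((uint32_t)layer_max);
    int shift = msb > 6 ? msb - 6 : 0;
    return (int8_t)(x >> shift);            // truncation, no division
}
```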
Buffer
Implemented using RTL
Two-bank design using 32-bit registers
Data In: N values per cycle; stores the result of the layer currently under evaluation
Data Out: M values per cycle; holds the result of the previous layer
Transfer Enable: triggered at the end of every layer
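A software sketch of this ping-pong behavior, assuming a depth and access pattern we chose for illustration (the RTL details are not in the slides):

```c
#include <stdint.h>

#define BUF_DEPTH 1024  // illustrative capacity per bank

// Two-bank (ping-pong) layer buffer: one bank collects the current
// layer's results (Data In) while the other serves the previous
// layer's results (Data Out).
typedef struct {
    int32_t bank[2][BUF_DEPTH];
    int     write_bank;          // bank receiving current-layer results
} layer_buffer;

// Data In: store one result of the layer under evaluation.
static void buf_write(layer_buffer *b, int idx, int32_t v)
{
    b->bank[b->write_bank][idx] = v;
}

// Data Out: read one value of the previous layer's results.
static int32_t buf_read(const layer_buffer *b, int idx)
{
    return b->bank[b->write_bank ^ 1][idx];
}

// Transfer Enable, triggered at the end of every layer: swap bank roles.
static void transfer_enable(layer_buffer *b)
{
    b->write_bank ^= 1;
}
```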
Control
Implemented using a state machine (instructionless)
Application-specific triggers computed offline and initialized on-chip
[Figure: flow chart vs. actual implementation]
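A minimal sketch of an instructionless controller as a state machine; the state names and trigger signals here are hypothetical, chosen to match the pipeline stages above, and the per-layer trigger values would be baked into the bitstream.

```c
// Instructionless control: a state machine stepping through triggers
// computed offline. States and inputs are illustrative assumptions.
typedef enum { IDLE, LOAD, COMPUTE, ACTIVATE, TRANSFER, DONE } ctrl_state;

ctrl_state next_state(ctrl_state s, int start, int layer_done, int last_layer)
{
    switch (s) {
    case IDLE:     return start ? LOAD : IDLE;   // only a start trigger needed
    case LOAD:     return COMPUTE;               // stream weights into units
    case COMPUTE:  return layer_done ? ACTIVATE : COMPUTE;
    case ACTIVATE: return TRANSFER;              // ReLU + re-quantize
    case TRANSFER: return last_layer ? DONE : LOAD;  // swap buffer banks
    default:       return IDLE;
    }
}
```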
Latency Models
Persistent Critical Path vs. Variable Critical Path
Variable critical path: decreases with increasing values of M, N
Persistent critical path: increases with increasing values of M, N
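A plausible first-order form of the two terms for a layer with I_l inputs and O_l outputs; the papers give the exact model, so this is only our reading:

```latex
% Our first-order reading of the two latency terms (not the paper's
% exact model) for a layer with I_l inputs and O_l outputs:
T_l \;\approx\;
\underbrace{\left\lceil \tfrac{I_l}{M} \right\rceil
            \left\lceil \tfrac{O_l}{N} \right\rceil}_{\text{variable: loop iterations, decreases with } M,N}
\;+\;
\underbrace{\log_2 M + \log_2 N + c}_{\text{persistent: pipeline depth, increases with } M,N}
```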
Part 1 Outline
Background: Problems
Multi-Layer Perceptrons
FPGA OpenCL
FPGA Implementation
Evaluation
Benchmarks & Platforms
Benchmarks
GPU: Nvidia Tesla K80 – 4,992 CUDA cores, 480 GB/s global memory bandwidth
cuBLAS, CUDA 8.0
FPGA: Intel Arria 10 10AX115H3F34I2SG (20 nm) – 400K ALMs, 1.5K DSPs, 6 MB RAM
Intel FPGA OpenCL SDK 16.0
Design Parameter Selection – Impact of M & N
Results
FPGA outperforms the high-end GPU by an average of 1.47x
Execution time compared for evaluation of test cases
GPU evaluates test vectors with batch processing
FPGA speedup over GPU will increase if batch size is restricted
Part 2: Acceleration of CNN Training on FPGA-based Clusters
Outline
Background
Publications, part 2
1. T. Geng, T. Wang, A. Sanaullah, C. Yang, R. Patel, M.C. Herbordt (2018): A Framework for Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters with Work and Weight Load Balancing, Field Programmable Logic and Applications
2. T. Geng, T. Wang, A. Sanaullah, C. Yang, R. Xu, R. Patel, M.C. Herbordt (2018): FPDeep: Acceleration and Load Balancing of CNN Training on FPGA Clusters, Field-Programmable Custom Computing Machines
Training is a big problem for CNNs
Inference (AlexNet): 720 million FLOPs per image
Training: 2.2 billion FLOPs per image × batch size × epochs × iterations → can be 550 trillion FLOPs
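One illustrative instantiation of that total; the image count and its batch/iteration split below are assumptions chosen only to reproduce the quoted figure:

```latex
% Illustrative arithmetic only: the 250,000-image count (e.g., batch 250
% x 1,000 iterations) is an assumption chosen to reproduce the total.
2.2\times10^{9}\ \tfrac{\text{FLOPs}}{\text{image}}
\;\times\; 250{,}000\ \text{images}
\;=\; 5.5\times10^{14}\ \text{FLOPs}
\;=\; 550\ \text{trillion FLOPs}
```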
To train a CNN, clusters/clouds are necessary
Problem: How to map training logic to multiple devices efficiently?
“Tensorflow AlexNet benchmark,” https://www.leadergpu.com/articles/428tensorflow-alexnet-benchmark
Ben-Nun, Tal, and Torsten Hoefler. “Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis.” arXiv preprint arXiv:1802.09941 (2018).
How to map training logic to multiple devices?
Most widely used: data parallelism – an entire device is allocated to an image, which is processed layer-by-layer
Drawbacks:
Features need to remain available while waiting for back-propagation → large storage demand
Weights are broadcast between all clients and the server in a centralized topology, or all-to-all in a decentralized topology → high bandwidth requirement from weight broadcast and update
Requires a large batch size (8K used by Facebook with 256 GPUs) → large batch size limits scalability
Goyal, Priya, et al. “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.” arXiv preprint arXiv:1706.02677 (2017).
How to map training logic to multiple devices?
Layer parallelism:
Weights are distributed and stored
Load balancing is hard; workloads of different layers vary greatly
Model parallelism:
Weights are distributed and stored
Too much data exchange among devices to combine intermediate results
• Solves batch size problem – But …
• GPU is not a good platform for either:
• No direct inter-SM communication
• Inefficient inter-device communication
• Is the FPGA the right choice?
• If Layer parallelism: How to balance the workload?
• If Model parallelism: How to reduce the massive data exchange?
FPDeep is proposed!
Solution – Hybrid Model/Layer Parallelism
Weights are stored in a distributed fashion – not necessarily with their layer (more later!)
Less data exchange compared with model parallelism
Better load balance than layer parallelism
Small batch sizes are supported even when using hundreds of devices
FPDeep – Overview: What it is, Why it works
A framework for mapping CNNs to FPGA clusters
Achieves high performance and energy efficiency
Workload balancing: full utilization of compute elements
Storage balancing for parameters: only on-chip memory needed for CONV layers
Good portability: a 1-D network is sufficient
Good scalability: up to 83 FPGAs with 5 transceivers
Includes an HDL generator
FPDeep Method – Heuristic Partitioning, Pipeline, Adjust
1. Start with layer-based partitioning; use model-based as needed. If all layers had equal load, layer-based alone would be a good starting point.
2. Measure work per layer. Inter-layer metric used: (FL)Ops >> first heuristic
Makes sense because we have nearly 100% utilization of compute resources
3. Group layers for mapping to FPGAs. Balance work per FPGA – more or fewer layers per FPGA as indicated by layer workload
Split (high-workload) layers among FPGAs as necessary
Map multiple layers per FPGA as necessary
Intra-layer metrics: complex … >> second heuristic
4. Pipeline layer execution >> third set of heuristics
5. Adjust the mapping to improve performance. If all weights of the work mapped to an FPGA don't fit, transfer weights from a neighbor (using MGTs) rather than loading them from memory
How to partition the workload? (1)
Inter-Layer Partition
Resources are allocated according to workload of each layer
Example: 7 FPGAs. According to FLOP counts, Layer 1 needs 4.8 FPGAs and Layer 2 needs 2.2 FPGAs
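A small C sketch of this proportional allocation, reproducing the 4.8/2.2 split from the example; this is illustrative code, not the FPDeep generator itself.

```c
#include <stdio.h>

// Inter-layer partition sketch: allocate FPGAs to layers in proportion
// to their FLOP counts. Fractional shares mean a layer may be split
// across FPGAs, or several layers may share one FPGA.
void allocate_fpgas(int n_layers, const double flops[], double n_fpgas)
{
    double total = 0.0;
    for (int l = 0; l < n_layers; l++) total += flops[l];
    for (int l = 0; l < n_layers; l++)
        printf("layer %d -> %.1f FPGAs\n", l + 1,
               n_fpgas * flops[l] / total);
}

int main(void)
{
    // Two layers whose FLOP ratio reproduces the slide's 4.8 : 2.2 split.
    double flops[2] = { 4.8, 2.2 };
    allocate_fpgas(2, flops, 7.0);
    return 0;
}
```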
How to partition the workload? (2)
Intra-Layer Partition – additional heuristic: each FPGA evaluates a part of the input features
How to partition the workload? (3)
Integration of Inter- and Intra-Layer Partitioning: each FPGA does its part of the workload and forwards the result to the next FPGA
For example: 4.8 FPGAs for Layer 1, 2.2 FPGAs for Layer 2 (with the IFP method)
Layer Fusion: fine-grained pipelining
• Activations propagate faster; no need to wait for the completion of a whole layer
• The time features must be cached while waiting for backward propagation is reduced → reduced storage demand
Architecture
Weight Balancing
• Weight balancing allows weights to fit on-chip
• Does not overload the network – weight-transfer bandwidth replaces activation bandwidth from later layers
Overall Resource Allocation Optimization
Resource Utilization
[Figure: per-FPGA resource utilization and throughput for the 15-FPGA mappings. Panels: (A) AlexNet BRAM, (B) AlexNet DSP & Throughput, (C) VGG-16 BRAM, (D) VGG-16 DSP & Throughput, (E) VGG-19 BRAM, (F) VGG-19 DSP & Throughput. X-axis: FPGA 1–15; Y-axis: utilization percentage, with throughput (TOps) on a secondary axis in the DSP panels.]
Example: mapping AlexNet/VGG-16/VGG-19 to 15 FPGAs (BRAM utilization; DSP utilization & throughput)
AlexNet (without weight balancing): unbalanced BRAM utilization
VGG-16/19 (with weight balancing): balanced BRAM utilization
Results
• Scalability: 80+ FPGAs
• Utilization: >98%
Performance
Power Efficiency (GOPS/J):
Versus K80 GPU: 5.5x
Versus Titan X GPU: 8.8x
Versus other FPGA designs: 5.7x