LP-BNN: Ultra-low-Latency BNN Inference with Layer Parallelism
Tong Geng1,2, Tianqi Wang1, Chunshu Wu1, Chen Yang1, Shuaiwen Leon Song2
Ang Li2, Martin Herbordt1
1 Boston University, 2 Pacific Northwest National Laboratory
CNN: the most popular algorithm in ML
▪ Widely used in computer vision tasks such as object classification, detection, and recognition
Birth of BNN
▪ Limitations of CNN
  ▪ High computation and memory intensity
  ▪ Long latency
▪ Quantization path: 32-bit floating-point -> 8-bit fixed-point -> binary
▪ Computation and memory access are no longer intensive
▪ Potentially, much lower latency
▪ Accuracy is still lower than CNN, but it keeps improving
DNN: BNN vs CNN
✓ Easier computation: floating-point multiply-add (FMA) operations ➔ single-bit XNOR/POPCOUNT (see the sketch below)
✓ Less storage: floating-point parameters and activations ➔ a single bit each
✓ Energy efficient: ideal for edge devices
“Accelerating Neural Networks with Binary Arithmetic,” https://ai.intel.com/accelerating-neural-networks-binary-arithmetic.
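To make the FMA ➔ XNOR/POPCOUNT point concrete, here is a minimal check (illustrative code, not from the paper or the cited article) that a dot product over +1/-1 values computed with multiply-accumulate equals the bit-packed XNOR/POPCOUNT form, encoding +1 as bit 1 and -1 as bit 0:

```python
import random

def fma_dot(x, w):
    """Reference multiply-accumulate over +1/-1 vectors (the CNN-style FMA path)."""
    return sum(xi * wi for xi, wi in zip(x, w))

def xnor_popcount_dot(xbits, wbits, n):
    """Same dot product on bit-packed operands: matches minus mismatches."""
    matches = bin(~(xbits ^ wbits) & ((1 << n) - 1)).count("1")  # XNOR + POPCOUNT
    return 2 * matches - n

random.seed(1)
n = 64
xbits, wbits = random.getrandbits(n), random.getrandbits(n)
x = [1 if (xbits >> i) & 1 else -1 for i in range(n)]
w = [1 if (wbits >> i) & 1 else -1 for i in range(n)]
assert fma_dot(x, w) == xnor_popcount_dot(xbits, wbits, n)
```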
Inference of BNN on different platforms
• Hardware utilization when accelerating AlexNet inference
  • Batch size of 1:
    × GPU: ~1%   × CPU: ~6%
  • Batch size of 10:
    × GPU: ~7%   × CPU: ~10%
  ✓ FPGA: >60%, thanks to the millions of one-bit ALUs on a single device
Eriko Nurvitadhi, David Sheffield, Jaewoong Sim, Asit Mishra, Ganesh Venkatesh, and Debbie Marr. 2016. Accelerating binarized neural networks: comparison of FPGA, CPU, GPU, and ASIC. In 2016 International Conference on Field-Programmable Technology (FPT). IEEE, 77–84.
Challenges
• Challenges in achieving truly low-latency inference on FPGA:
  1. The critical Normalization Layer (NL) uses full-precision floating point (i.e., 2 FP MUL/DIV + 3 FP ADD/SUB).
  2. Existing works process layers sequentially, so their latencies accumulate with no overlap.
  3. Optimal designs for all layers must be configured on the FPGA simultaneously, with no reconfiguration.
Intra-layer Fusion
• Addressing the 1st challenge: intra-layer fusion
  • 3-to-1 fusion: ACT, NL, and BL are fused into a single Comparison layer.
  • Simplified computation: the 2 comparisons from ACT and BL and the 5 floating-point operations from NL become one integer comparison (see the sketch below).
  • Less storage: the 4 floating-point variables in NL become 1 integer.
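A minimal numerical sketch of the 3-to-1 fusion (not the authors' code; it assumes the standard inference-time batch-norm form y = gamma*(x - mu)/sigma + beta followed by sign binarization): because gamma, beta, mu, and sigma are constants at inference time, the whole NL + ACT + BL chain collapses into comparing the integer convolution sum against one precomputed integer threshold.

```python
import math

def fused_threshold(gamma, beta, mu, sigma):
    """Collapse inference-time batch-norm + binarization into one integer threshold.
    gamma*(x - mu)/sigma + beta >= 0  <=>  x >= t (gamma > 0)  or  x <= t (gamma < 0),
    with t = mu - beta*sigma/gamma."""
    t = mu - beta * sigma / gamma
    return (math.ceil(t), ">=") if gamma > 0 else (math.floor(t), "<=")

def fused(x_int, threshold, op):
    """Fused layer: one integer comparison replaces 5 FP ops and 2 comparisons."""
    hit = x_int >= threshold if op == ">=" else x_int <= threshold
    return 1 if hit else -1

def unfused(x_int, gamma, beta, mu, sigma):
    """Reference path: full-precision normalization, then binarization."""
    y = gamma * (x_int - mu) / sigma + beta
    return 1 if y >= 0 else -1

# Check that the fused and unfused paths agree over a range of integer sums
gamma, beta, mu, sigma = -0.8, 0.3, 12.0, 4.5
thr, op = fused_threshold(gamma, beta, mu, sigma)
assert all(fused(x, thr, op) == unfused(x, gamma, beta, mu, sigma)
           for x in range(-256, 257))
```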
Intra-layer Fusion for Networks with Shortcuts
• ResNet
[Figure: (A) the original ResNet block, with CONV1/CONV2/CONV3 followed by POPCOUNT (POP1/POP3), normalization (NORM1/NORM3), and binarization (BIN) stages around the shortcut, is simplified into (B) an optimized ResNet block in which the NORM/BIN/POP chain is replaced by a Threshold Lookup (TL).]
Inter-layer Fusion
• Addressing the 2nd challenge: inter-layer fusion
  • Fine-grained pipelining fuses the CONV layers and the 1st FC layer.
  • An image is processed according to the data dependencies between layers.
  • Layers are processed in parallel ➔ their latencies overlap (see the sketch below).
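A toy latency model (illustrative numbers only, not the paper's measurements) of why the fine-grained pipeline overlaps latencies: assume every layer emits one output row per fixed time unit, and that a layer can compute its j-th row as soon as the K input rows that row depends on have been produced.

```python
def sequential_latency(row_time, rows, n_layers):
    """Every layer finishes its whole feature map before the next one starts."""
    return n_layers * rows * row_time

def pipelined_latency(row_time, rows, n_layers, k=3):
    """Layer i+1 computes its j-th row once layer i has emitted rows up to j+k-1."""
    ready = [0.0] * rows                           # the input image is available at t = 0
    for _ in range(n_layers):
        done, out = 0.0, []
        for j in range(rows):
            dep = ready[min(j + k - 1, rows - 1)]  # wait for the k-row input window
            done = max(done, dep) + row_time       # then produce one row per row_time
            out.append(done)
        ready = out
    return ready[-1]

# Illustrative: 13 layers, 32 output rows each, equal per-row time of 1 unit
print(sequential_latency(1.0, 32, 13))  # 416.0 -- latencies simply accumulate
print(pipelined_latency(1.0, 32, 13))   # 68.0  -- latencies largely overlap
```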
Workload Balancing
• Workload of each layer: NIC * NOC * K * K
  • NIC = Number of Input Channels
  • NOC = Number of Output Channels
  • K = Kernel size
• Terms: PIC, POC, SIC, SOC
  • PIC = Parallelism across Input Channels
  • POC = Parallelism across Output Channels
  • SIC = Sequentially processed Input Channel groups
  • SOC = Sequentially processed Output Channel groups
• Match the data production and consumption rates of adjacent layers by adjusting PIC, POC, SIC, and SOC (see the sketch below).
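A simplified cost sketch of the balancing idea (an assumed model, not the paper's exact formula): treating SIC = NIC/PIC and SOC = NOC/POC as the sequential passes that remain after parallelization, a layer needs roughly H_out * W_out * SIC * SOC * K * K cycles per image, and PIC/POC are chosen so adjacent layers end up with similar cycle counts, i.e., matched production and consumption rates. The layer dimensions below are illustrative only.

```python
def cycles_per_image(h_out, w_out, nic, noc, k, pic, poc):
    """Assumed cost model: the sequential work left after PIC x POC parallelization."""
    sic, soc = nic // pic, noc // poc
    return h_out * w_out * sic * soc * k * k

# Two hypothetical adjacent layers: tune PIC/POC until their cycle counts match
producer = cycles_per_image(h_out=32, w_out=32, nic=128, noc=128, k=3, pic=64, poc=32)
consumer = cycles_per_image(h_out=16, w_out=16, nic=128, noc=256, k=3, pic=32, poc=32)
print(producer, consumer)  # 73728 73728 -- balanced, so neither layer stalls the other
```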
Parameterized Architecture
• Addressing the 3rd challenge: a parameterized architecture
  • The architecture is flexible enough to support workload balancing.
  • All layers are configured on the FPGA simultaneously using model parallelism.
[Architecture figure: #POC Processing Elements (PEs), each containing #PIC XNOR units, a POPCOUNT tree, an accumulator (ACC), a comparator, and a Threshold Buffer (TB); Shared Input Data Shift Registers (SIDSR) organized as K bank sets of #PIC/32 banks (32 channels per bank); a Shared Weight Memory (SWM) with K^2 bank sets; #POC/32 POOLING Engines with Vertical and Horizontal Pooling Buffers (VPB/HPB); and a Control Processor (CP) sequencing the #SIC x #SOC iterations.]
Abbreviations: PE = Processing Element; MAC = Multiply-Accumulate Unit; TB = Threshold Buffer; SWM = Shared Weight Memory; VPB = Vertical Pooling Buffer; HPB = Horizontal Pooling Buffer; SIDSR = Shared Input Data Shift Registers; CP = Control Processor.
Parameters: K = filter kernel size; W = width of the input feature map; P = pooling kernel size.
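As a behavioral illustration of the PE datapath shown above (a software sketch, not HDL and not the authors' code), assume activations and weights are bit-packed 32 channels per word and that all SIC input-channel groups and K x K kernel positions are accumulated before the threshold compare supplied by the TB:

```python
import random

def pe(packed_window, packed_weights, threshold):
    """One PE, one output pixel of one output channel: XNOR the packed input
    window with the packed weights, POPCOUNT, accumulate, then compare against
    the Threshold Buffer (TB) value to produce the binarized activation."""
    acc = 0
    for a, w in zip(packed_window, packed_weights):
        acc += bin(~(a ^ w) & 0xFFFFFFFF).count("1")  # XNOR + POPCOUNT
    total_bits = 32 * len(packed_window)
    dot = 2 * acc - total_bits        # matches - mismatches = the +1/-1 dot product
    return 1 if dot >= threshold else 0

# Hypothetical shapes: 128 input channels (4 words per position) x 3x3 kernel = 36 words
random.seed(0)
acts = [random.getrandbits(32) for _ in range(36)]
wgts = [random.getrandbits(32) for _ in range(36)]
print(pe(acts, wgts, threshold=0))
```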
Evaluation: Latency Reduction from Intra-layer Fusion
[Figure: per-layer latency breakdown (0-100%) into POPCOUNT, XNOR, BN & SUM-FP, COMPARE, and OTHER operations, illustrating the latency reduction from intra-layer fusion.]
Evaluation: Single-Layer Latency vs. Whole-Network Latency
[Figure: latency (μs, 0-400) of each individual layer (CONV1-CONV13, FC1) compared against the whole VGG network.]
Evaluation: Hardware Resource Saving from Workload Balancing
Evaluation: BNN Inference Latency using LP-BNN