LP-BNN: Ultra-low-Latency BNN Inference with Layer Parallelism
Tong Geng1,2, Tianqi Wang1, Chunshu Wu1, Chen Yang1, Shuaiwen Leon Song2
Ang Li2, Martin Herbordt1
1 Boston University, 2 Pacific Northwest National Laboratory
CNN: the most popular algorithm in ML
▪ Widely used in computer vision tasks such as object classification, detection, and recognition
Birth of BNN
▪ Limitations of CNN
  ▪ High computation and memory intensity
  ▪ Long latency
▪ Quantization path: 32-bit floating-point -> 8-bit fixed-point -> binary
▪ Computation and memory access are no longer intensive
▪ Potentially, much lower latency
▪ Accuracy is still lower than CNN, but it keeps improving
DNN: BNN vs CNN
✓ Easier computation: floating-point multiply-add (FMA) operations ➔ single-bit XNOR/POPCOUNT (see the sketch below)
✓ Less storage: floating-point parameters and activations ➔ a single bit each
✓ Energy efficient: ideal for edge devices
“Accelerating Neural Networks with Binary Arithmetic,” https://ai.intel.com/accelerating-neural-networks-binary-arithmetic.
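To make the FMA ➔ XNOR/POPCOUNT point concrete, here is a minimal check (illustrative code, not from the paper or the cited article) that a dot product over +1/-1 values computed with multiply-accumulate equals the bit-packed XNOR/POPCOUNT form, encoding +1 as bit 1 and -1 as bit 0:

```python
import random

def fma_dot(x, w):
    """Reference multiply-accumulate over +1/-1 vectors (the CNN-style FMA path)."""
    return sum(xi * wi for xi, wi in zip(x, w))

def xnor_popcount_dot(xbits, wbits, n):
    """Same dot product on bit-packed operands: matches minus mismatches."""
    matches = bin(~(xbits ^ wbits) & ((1 << n) - 1)).count("1")  # XNOR + POPCOUNT
    return 2 * matches - n

random.seed(1)
n = 64
xbits, wbits = random.getrandbits(n), random.getrandbits(n)
x = [1 if (xbits >> i) & 1 else -1 for i in range(n)]
w = [1 if (wbits >> i) & 1 else -1 for i in range(n)]
assert fma_dot(x, w) == xnor_popcount_dot(xbits, wbits, n)
```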
Inference of BNN on different platforms
• Hardware utilization when accelerating AlexNet inference
  • Batch size of 1:
    × GPU: ~1%   × CPU: ~6%
  • Batch size of 10:
    × GPU: ~7%   × CPU: ~10%
  ✓ FPGA: >60%, thanks to the millions of one-bit ALUs on a single device
Eriko Nurvitadhi, David Sheffield, Jaewoong Sim, Asit Mishra, Ganesh Venkatesh, and Debbie Marr. 2016. Accelerating binarized neural networks: comparison of FPGA, CPU, GPU, and ASIC. In 2016 International Conference on Field-Programmable Technology (FPT). IEEE, 77–84.
Challenges
• Challenges in achieving truly low-latency inference on FPGA:
  1. The critical Normalization Layer (NL) uses full-precision floating point (i.e., 2 FP MUL/DIV + 3 FP ADD/SUB).
  2. Existing works process layers sequentially, so their latencies accumulate with no overlap.
  3. Optimal designs for all layers must be configured on the FPGA simultaneously, with no reconfiguration.
Intra-layer Fusion
• Addressing the 1st challenge: intra-layer fusion
  • 3-to-1 fusion: ACT, NL, and BL are fused into a single Comparison layer.
  • Simplified computation: the 2 comparisons from ACT and BL and the 5 floating-point operations from NL become one integer comparison (see the sketch below).
  • Less storage: the 4 floating-point variables in NL become 1 integer.
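A minimal numerical sketch of the 3-to-1 fusion (not the authors' code; it assumes the standard inference-time batch-norm form y = gamma*(x - mu)/sigma + beta followed by sign binarization): because gamma, beta, mu, and sigma are constants at inference time, the whole NL + ACT + BL chain collapses into comparing the integer convolution sum against one precomputed integer threshold.

```python
import math

def fused_threshold(gamma, beta, mu, sigma):
    """Collapse inference-time batch-norm + binarization into one integer threshold.
    gamma*(x - mu)/sigma + beta >= 0  <=>  x >= t (gamma > 0)  or  x <= t (gamma < 0),
    with t = mu - beta*sigma/gamma."""
    t = mu - beta * sigma / gamma
    return (math.ceil(t), ">=") if gamma > 0 else (math.floor(t), "<=")

def fused(x_int, threshold, op):
    """Fused layer: one integer comparison replaces 5 FP ops and 2 comparisons."""
    hit = x_int >= threshold if op == ">=" else x_int <= threshold
    return 1 if hit else -1

def unfused(x_int, gamma, beta, mu, sigma):
    """Reference path: full-precision normalization, then binarization."""
    y = gamma * (x_int - mu) / sigma + beta
    return 1 if y >= 0 else -1

# Check that the fused and unfused paths agree over a range of integer sums
gamma, beta, mu, sigma = -0.8, 0.3, 12.0, 4.5
thr, op = fused_threshold(gamma, beta, mu, sigma)
assert all(fused(x, thr, op) == unfused(x, gamma, beta, mu, sigma)
           for x in range(-256, 257))
```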
Intra-layer Fusion for Networks with Shortcuts
• ResNet
[Figure: (A) the original ResNet block, with CONV1/CONV2/CONV3 followed by POPCOUNT (POP1/POP3), normalization (NORM1/NORM3), and binarization (BIN) stages around the shortcut, is simplified into (B) an optimized ResNet block in which the NORM/BIN/POP chain is replaced by a Threshold Lookup (TL).]
Inter-layer Fusion
• Addressing the 2nd challenge: inter-layer fusion
  • Fine-grained pipelining fuses the CONV layers and the 1st FC layer.
  • An image is processed according to the data dependencies between layers.
  • Layers are processed in parallel ➔ their latencies overlap (see the sketch below).
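A toy latency model (illustrative numbers only, not the paper's measurements) of why the fine-grained pipeline overlaps latencies: assume every layer emits one output row per fixed time unit, and that a layer can compute its j-th row as soon as the K input rows that row depends on have been produced.

```python
def sequential_latency(row_time, rows, n_layers):
    """Every layer finishes its whole feature map before the next one starts."""
    return n_layers * rows * row_time

def pipelined_latency(row_time, rows, n_layers, k=3):
    """Layer i+1 computes its j-th row once layer i has emitted rows up to j+k-1."""
    ready = [0.0] * rows                           # the input image is available at t = 0
    for _ in range(n_layers):
        done, out = 0.0, []
        for j in range(rows):
            dep = ready[min(j + k - 1, rows - 1)]  # wait for the k-row input window
            done = max(done, dep) + row_time       # then produce one row per row_time
            out.append(done)
        ready = out
    return ready[-1]

# Illustrative: 13 layers, 32 output rows each, equal per-row time of 1 unit
print(sequential_latency(1.0, 32, 13))  # 416.0 -- latencies simply accumulate
print(pipelined_latency(1.0, 32, 13))   # 68.0  -- latencies largely overlap
```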
Workload Balancing
• Workload of each layer: NIC * NOC * K * K
  • NIC = Number of Input Channels
  • NOC = Number of Output Channels
  • K = Kernel size
• Terms: PIC, POC, SIC, SOC
  • PIC = Parallelism across Input Channels
  • POC = Parallelism across Output Channels
  • SIC = Sequentially processed Input Channel groups
  • SOC = Sequentially processed Output Channel groups
• Match the data production and consumption rates of adjacent layers by adjusting PIC, POC, SIC, and SOC (see the sketch below).
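A simplified cost sketch of the balancing idea (an assumed model, not the paper's exact formula): treating SIC = NIC/PIC and SOC = NOC/POC as the sequential passes that remain after parallelization, a layer needs roughly H_out * W_out * SIC * SOC * K * K cycles per image, and PIC/POC are chosen so adjacent layers end up with similar cycle counts, i.e., matched production and consumption rates. The layer dimensions below are illustrative only.

```python
def cycles_per_image(h_out, w_out, nic, noc, k, pic, poc):
    """Assumed cost model: the sequential work left after PIC x POC parallelization."""
    sic, soc = nic // pic, noc // poc
    return h_out * w_out * sic * soc * k * k

# Two hypothetical adjacent layers: tune PIC/POC until their cycle counts match
producer = cycles_per_image(h_out=32, w_out=32, nic=128, noc=128, k=3, pic=64, poc=32)
consumer = cycles_per_image(h_out=16, w_out=16, nic=128, noc=256, k=3, pic=32, poc=32)
print(producer, consumer)  # 73728 73728 -- balanced, so neither layer stalls the other
```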
Parameterized Architecture
• Addressing the 3rd challenge: a parameterized architecture
  • The architecture is flexible enough to support workload balancing.
  • All layers are configured on the FPGA simultaneously using model parallelism.
[Architecture figure: #POC Processing Elements (PEs), each containing #PIC XNOR units, a POPCOUNT tree, an accumulator (ACC), a comparator, and a Threshold Buffer (TB); Shared Input Data Shift Registers (SIDSR) organized as K bank sets of #PIC/32 banks (32 channels per bank); a Shared Weight Memory (SWM) with K^2 bank sets; #POC/32 POOLING Engines with Vertical and Horizontal Pooling Buffers (VPB/HPB); and a Control Processor (CP) sequencing the #SIC x #SOC iterations.]
Abbreviations: PE = Processing Element; MAC = Multiply-Accumulate Unit; TB = Threshold Buffer; SWM = Shared Weight Memory; VPB = Vertical Pooling Buffer; HPB = Horizontal Pooling Buffer; SIDSR = Shared Input Data Shift Registers; CP = Control Processor.
Parameters: K = filter kernel size; W = width of the input feature map; P = pooling kernel size.
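As a behavioral illustration of the PE datapath shown above (a software sketch, not HDL and not the authors' code), assume activations and weights are bit-packed 32 channels per word and that all SIC input-channel groups and K x K kernel positions are accumulated before the threshold compare supplied by the TB:

```python
import random

def pe(packed_window, packed_weights, threshold):
    """One PE, one output pixel of one output channel: XNOR the packed input
    window with the packed weights, POPCOUNT, accumulate, then compare against
    the Threshold Buffer (TB) value to produce the binarized activation."""
    acc = 0
    for a, w in zip(packed_window, packed_weights):
        acc += bin(~(a ^ w) & 0xFFFFFFFF).count("1")  # XNOR + POPCOUNT
    total_bits = 32 * len(packed_window)
    dot = 2 * acc - total_bits        # matches - mismatches = the +1/-1 dot product
    return 1 if dot >= threshold else 0

# Hypothetical shapes: 128 input channels (4 words per position) x 3x3 kernel = 36 words
random.seed(0)
acts = [random.getrandbits(32) for _ in range(36)]
wgts = [random.getrandbits(32) for _ in range(36)]
print(pe(acts, wgts, threshold=0))
```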
Evaluation: Latency Reduction from Intra-layer Fusion
[Figure: per-layer latency breakdown (0-100%) into POPCOUNT, XNOR, BN & SUM-FP, COMPARE, and OTHER operations, illustrating the latency reduction from intra-layer fusion.]
Evaluation: Single-Layer Latency vs. Whole-Network Latency
[Figure: latency (μs, 0-400) of each individual layer (CONV1-CONV13, FC1) compared against the whole VGG network.]
Evaluation: Hardware Resource Saving from Workload Balancing
Evaluation: BNN Inference Latency using LP-BNN