
A ½ mWatt, 128‐MAC Sparsity Aware Neural Processing Unit


Page 1: A ½ mWatt, 128‐MAC Sparsity Aware Neural Processing Unit

Samsung Semiconductor, Neural Processing Lab

San Jose, CA

Joseph Hassoun

A ½ mWatt, 128‐MAC Sparsity Aware Neural Processing Unit for Classification and Semantic Segmentation

Page 2: A ½ mWatt, 128‐MAC Sparsity Aware Neural Processing Unit


Contents

1. Motivation: Demand for Edge‐Neural Processing

2. High‐Performance Mobile NPU Architecture in several Samsung Products

3. HW/SW Co‐Designing for Edge‐NPU

a) HW Down-Scaling with Multi-Dimensional Parallelism

b) SW Algorithm: Binarization of Neural Network

4. Edge‐NPU Hardware & Inference Performance

Page 3: A ½ mWatt, 128‐MAC Sparsity Aware Neural Processing Unit


Demand for Edge‐Neural Processing

• Solving the challenges of edge computing in the Internet-of-Things (IoT) era

[Figure: edge-AI application domains — retail, healthcare, smart building, travel & hospitality, manufacturing]

Page 4: A ½ mWatt, 128‐MAC Sparsity Aware Neural Processing Unit


High-Performance Mobile NPU

• Butterfly-structure NPU with 1024 MACs (ISSCC 2019)
• 2 NPU cores & NPU controller
• Achieved 6.9 TOPS & 11.5 TOPS/W (8b) with 5.5 mm² of area
• Power = 39 mW @ 0.5 V

[Chip micrograph: 5.5 mm²]

Source: Song et al., “7.1 An 11.5TOPS/W 1024-MAC Butterfly Structure Dual-Core Sparsity-Aware Neural Processing Unit in 8nm Flagship Mobile SoC”, ISSCC 2019.
AnandTech.com: https://www.anandtech.com/show/14072/the-samsung-galaxy-s10plus-review/4

Page 5: A ½ mWatt, 128‐MAC Sparsity Aware Neural Processing Unit


Demand for Edge‐Neural Processing

[Figure: Samsung edge devices — SmartThings, fitness tracker, smartphone, Galaxy Watch]

• Solving the challenges of edge computing in the Internet-of-Things (IoT) era

• Task: Edge-NPU for supporting wearable devices from Samsung

Page 6: A ½ mWatt, 128‐MAC Sparsity Aware Neural Processing Unit


Convolution with 3D-Parallelism

[Figure: an input feature map (4 input channels, 16 pixels in the spatial dimension) is convolved with filters (16 output channels) to produce the partial/output feature map (16 output channels, 16 pixels in the spatial dimension)]

• 3D-data-balanced parallelism for the convolution operation:
  1. Input-channel dimension: reduces partial-sum store/reload
  2. Output-channel dimension: reuses the IFM, mitigating SRAM energy per access
  3. Spatial (X/Y) unrolling of the pixel dimension: shares weight parameters

• High-performance NPU with 1024 MACs = (4 input channels) × (16 output channels) × (16 pixels); see the loop-nest sketch below

Source: Song et al., “7.1 An 11.5TOPS/W 1024-MAC Butterfly Structure Dual-Core Sparsity-Aware Neural Processing Unit in 8nm Flagship Mobile SoC”, ISSCC 2019.
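To make the 3D unrolling concrete, here is a minimal NumPy sketch (tile sizes and array names are illustrative, not the actual RTL) of how one cycle of the 1024-MAC array could cover 4 input channels × 16 output channels × 16 pixels:

    import numpy as np

    ICH_PAR, OCH_PAR, PIX_PAR = 4, 16, 16   # unrolled dims: 4 x 16 x 16 = 1024 MACs per cycle

    def mac_array_cycle(ifm_tile, w_tile, psum_tile):
        # ifm_tile : (ICH_PAR, PIX_PAR)  activations, shared across all 16 output channels
        # w_tile   : (OCH_PAR, ICH_PAR)  weights, shared across all 16 pixels
        # psum_tile: (OCH_PAR, PIX_PAR)  partial sums kept on-chip across input channels
        psum_tile += np.einsum('oi,ip->op', w_tile, ifm_tile)   # 1024 multiply-accumulates
        return psum_tile

Sharing the activation tile over output channels and the weight tile over pixels is what cuts SRAM energy per access, while accumulating psum_tile across input channels avoids partial-sum store/reload.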

Page 7: A ½ mWatt, 128‐MAC Sparsity Aware Neural Processing Unit


Architecture

Source: Song et al., “7.1 An 11.5TOPS/W 1024-MAC Butterfly Structure Dual-Core Sparsity-Aware Neural Processing Unit in 8nm Flagship Mobile SoC”, ISSCC 2019.

[Figure: NPU core block diagram]

DRU: Data Return Unit
DSU: Data Staging Unit
MAA: MAC Array

Page 8: A ½ mWatt, 128‐MAC Sparsity Aware Neural Processing Unit


Contents

1. Motivation: Demand for Edge‐Neural Processing

2. High‐Performance Mobile NPU Architecture in Galaxy S10

3. HW/SW Co‐Designing for Edge‐NPU

1) HW Down-Scaling with Multi-Dimensional Parallelism

2) SW Algorithm: Binarization of Neural Network

4. Edge‐NPU Hardware & Inference Performance

Page 9: A ½ mWatt, 128‐MAC Sparsity Aware Neural Processing Unit


Scaling‐Down to Edge‐NPU

Parallel dimension    High-Performance NPU (1024 MACs)    Edge-NPU (128 MACs)
Input channel         4  (2 cores × 2 DSUs)               2  (1 core w/ 2 DSUs)
Output channel        16 (16 MAAs/core)                   16 (16 MAAs/core)
Spatial dimension     16 (16 dual-MACs/MAA)               4  (4 dual-MACs/MAA)

• Scaling down the high-performance NPU (1K MACs) to an Edge-NPU with 128 MACs, while keeping the benefit of multi-dimensional parallelism (a quick consistency check follows below)
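A throwaway Python check that both configurations are the same three-way unrolling at different scales (the dictionaries simply mirror the table above):

    hp_npu   = {"input_ch": 4, "output_ch": 16, "spatial": 16}  # 2 cores x 2 DSUs, 16 MAAs, 16 dual-MACs
    edge_npu = {"input_ch": 2, "output_ch": 16, "spatial": 4}   # 1 core w/ 2 DSUs, 16 MAAs, 4 dual-MACs

    def macs(cfg):
        # total MACs per cycle = product of the three unrolled dimensions
        return cfg["input_ch"] * cfg["output_ch"] * cfg["spatial"]

    assert macs(hp_npu) == 1024 and macs(edge_npu) == 128   # an 8x down-scaling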

Page 10: A ½ mWatt, 128‐MAC Sparsity Aware Neural Processing Unit


Edge-NPU Architecture

[Figure: Edge-NPU core block diagram — scratchpad, two data staging units (DSU 0 / DSU 1) feeding weights and activations to 16 MAC arrays (MMA0–MMA15), each MMA containing 4 dual-MACs with W/Act inputs, results returned through DRU0–DRU15, and an interface to the SoC & external memory. Weights for input channel 0 come from DSU 0 and for input channel 1 from DSU 1.]

1. Each dispatcher in a DSU sends 1 weight and 2×2 (= 4) pixels of activation
2. Each MMA computes one output channel
3. Each MMA has 4 dual-MACs that add the partial results from the 2 input channels (i.e., the 2 DSUs); a minimal dataflow sketch follows below
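A minimal Python sketch of the dataflow in the three callouts above (function and tile names are illustrative): each of the 16 MMAs produces one output channel, and each of its 4 dual-MACs fuses the two input-channel products for one of the 2×2 pixels.

    import numpy as np

    N_DSU, N_MMA, N_PIX = 2, 16, 4           # 2 input ch x 16 output ch x 4 pixels = 128 MACs per cycle

    def dual_mac(w0, a0, w1, a1):
        # one dual-MAC: adds the partial products coming from the two DSUs (input channels)
        return w0 * a0 + w1 * a1

    def edge_core_cycle(weights, acts):
        # weights: (N_DSU, N_MMA)  one weight per (input channel, output channel) dispatcher
        # acts   : (N_DSU, N_PIX)  the 2x2 activation pixels broadcast from each DSU
        out = np.zeros((N_MMA, N_PIX))
        for och in range(N_MMA):             # each MMA handles one output channel
            for px in range(N_PIX):          # each dual-MAC handles one pixel
                out[och, px] = dual_mac(weights[0, och], acts[0, px],
                                        weights[1, och], acts[1, px])
        return out                           # results go to the DRUs / scratchpad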

Page 11: A ½ mWatt, 128‐MAC Sparsity Aware Neural Processing Unit


Contents

1. Motivation: Demand for Edge‐Neural Processing

2. High‐Performance Mobile NPU Architecture in Galaxy S10

3. HW/SW Co‐Designing for Edge‐NPU

1) HW Down-Scaling with Multi-Dimensional Parallelism

2) SW Algorithm: Binarization of Neural Network

4. Edge‐NPU Hardware & Inference Performance

Page 12: A ½ mWatt, 128‐MAC Sparsity Aware Neural Processing Unit


Neural Network Binarization

[Figure: floating-point conv layer — IFM convolved with a filter to produce an output pixel of the OFM — versus a binarized conv layer]

• Naïve quantization to binary weights and low-bit activations greatly reduces accuracy

• An algorithmic solution is required to preserve both performance and accuracy

• Fortunately, Group-Net (Zhuang et al., CVPR 2019) addresses this issue well using structure approximation


Page 13: A ½ mWatt, 128‐MAC Sparsity Aware Neural Processing Unit


Group‐Net Layer‐Wise Decomposition 

Source: Zhuang et al., “Structured Binary Neural Networks for Accurate Image Classification and Semantic Segmentation”, CVPR 2019.

Layer‐wise decomposition with two bases

• Layer‐wise decomposition is the simplest form of Group‐Net decomposition

• For layer‐wise decomposition, replace each layer with a set of binarized bases

• Take a weighted average of the binarized layer outputs to generate the layer output
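A minimal plain-Python sketch of the layer-wise decomposition (the scale names alphas and the conv helper are placeholders, not the paper's exact formulation): each floating-point conv layer becomes K binary-weight bases whose outputs are combined with learned scales.

    import numpy as np

    def binarize(w):
        # sign() binarization to {-1, +1}; training would use a straight-through estimator
        return np.where(w >= 0, 1.0, -1.0)

    def layerwise_groupnet_conv(x, weight_bases, alphas, conv):
        # x            : low-bit input feature map
        # weight_bases : K float weight tensors, binarized on the fly
        # alphas       : K learned scaling factors
        # conv         : the underlying convolution routine (e.g. an im2col/GEMM helper)
        out = 0.0
        for w_k, a_k in zip(weight_bases, alphas):
            out = out + a_k * conv(x, binarize(w_k))   # each base is a 1-bit-weight conv
        return out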


Page 14: A ½ mWatt, 128‐MAC Sparsity Aware Neural Processing Unit


Group‐Wise Decomposition 

• Instead of decomposing each layer separately, group-wise structural decomposition offers more flexibility and better accuracy

• Group-structure decomposition can reduce gradient deviation during backpropagation

• Optimal group structures can be learned using neural architecture search
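Following the same pattern, a hedged sketch of the group-wise case (helper names are hypothetical; the real group structure and connections are learned): each base binarizes a whole group of consecutive layers, and only the group outputs are combined, which relaxes the per-layer approximation.

    def groupwise_groupnet(x, group_bases, lambdas):
        # group_bases: K callables, each a binarized copy of the whole group of layers
        #              (every conv inside uses 1-bit weights and low-bit activations)
        # lambdas    : K learned combination weights
        return sum(l_k * base_k(x) for base_k, l_k in zip(group_bases, lambdas))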

Page 15: A ½ mWatt, 128‐MAC Sparsity Aware Neural Processing Unit


Model             Weight width (bits)   Activation width (bits)   Group-Net bases   Top-1            Top-5
ResNet-18         32                    32                        n/a               69.7%            89.4%
Fully Binarized   1                     1                         1                 56.4% (-13.3%)   79.5% (-9.9%)
Group-Net         1                     4                         1                 61.5% (-8.2%)    83.2% (-6.2%)
Group-Net         1                     4                         3                 68.5% (-1.2%)    88.7% (-0.7%)
Group-Net         1                     4                         5                 70.1% (+0.4%)    89.5% (+0.1%)

Representative Accuracy Results

Source: Zhuang et al., “Structured Binary Neural Networks for Accurate Image Classification and Semantic Segmentation”, CVPR 2019.

Page 16: A ½ mWatt, 128‐MAC Sparsity Aware Neural Processing Unit


Contents

1. Motivation: Demand for Edge‐Neural Processing

2. High‐Performance Mobile NPU Architecture in Galaxy S10

3. HW/SW Co‐Designing for Edge‐NPU

1) HW Down-Scaling with Multi-Dimensional Parallelism

2) Algorithm Choice: Group-Wise Binarization of Neural Network

4. Edge‐NPU Hardware & Inference Performance

Page 17: A ½ mWatt, 128‐MAC Sparsity Aware Neural Processing Unit


Edge‐NPU: Scaling Factors & Power

• With the help of software (group convolution): reduced computation per conv layer → increased fps, plus low-precision reduction

• With down-scaled hardware resources: 1024 MACs → 128 MACs

• Inference performance: power reduced by 73x & energy efficiency enhanced by 6.9x

                      HP-NPU      Edge-NPU    Reduction (Edge vs. HP-NPU)
MACs                  1024        128         8x
Frequency             67 MHz      50 MHz      -25%
Multiplier prec.      8 × 8b      4 × 1b      ~16x
Accumulation prec.    32b         10b         ~3x
Low-prec. reduction   1.0         0.15        6.7x
Memory                1568 kB     784 kB      2x
Power                 39 mW       0.53 mW     73x
TOPS/W                3.52        24.1        6.9x
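The reduction column follows directly from the two hardware columns; a quick arithmetic check (values copied from the table above):

    hp   = {"macs": 1024, "freq_mhz": 67, "mem_kb": 1568, "power_mw": 39.0, "tops_per_w": 3.52}
    edge = {"macs": 128,  "freq_mhz": 50, "mem_kb": 784,  "power_mw": 0.53, "tops_per_w": 24.1}

    print(hp["macs"] / edge["macs"])               # 8.0    -> 8x fewer MACs
    print(1 - edge["freq_mhz"] / hp["freq_mhz"])   # ~0.25  -> ~25% lower clock
    print(hp["mem_kb"] / edge["mem_kb"])           # 2.0    -> 2x less on-chip memory
    print(hp["power_mw"] / edge["power_mw"])       # ~73.6  -> ~73x lower power
    print(edge["tops_per_w"] / hp["tops_per_w"])   # ~6.85  -> ~6.9x better energy efficiency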

Page 18: A ½ mWatt, 128‐MAC Sparsity Aware Neural Processing Unit


Edge-NPU: Inference Performance

                                             HP-NPU & ResNet-18   Edge-NPU & GroupNet   Efficiency gain
Model optimization      Model precision      8b                   A4b & W1b (3 bases)   4X
                        Top-1 accuracy       69.7%                68.5%                 -
                        Operations/frame     4.30E+9              4.20E+8               10.3X
Hardware optimization   OPS                  1.37E+11             1.3E+10               -
                        Power                39 mW                0.53 mW               73X
Inference performance   Frames per second    31.9 fps             30.4 fps              95%
                        Energy per frame     1.2 mJ               5.8 µJ                210X

• Co-optimization of both SW and HW for edge computing: HP-NPU with ResNet-18 → Edge-NPU with GroupNet

• Inference performance: near the HP-NPU frame rate (95%) at 210X lower energy per frame

Page 19: A ½ mWatt, 128‐MAC Sparsity Aware Neural Processing Unit


[Chart: HP Mobile NPU vs. Edge NPU — Model: ResNet-18 vs. GroupNet ResNet-18; Top-1 accuracy: 69.7% vs. 68.5%; Energy per frame: 1.2 mJ vs. 5.8 µJ; Frames per second: 31.9 vs. 30.4]

Page 20: A ½ mWatt, 128‐MAC Sparsity Aware Neural Processing Unit


Acknowledgements

• NPL: Ali Shafiee Ardestani, Jong Hoon Shin, David Thorsley, Hamzah Abdelaziz
• SAIT: Sehwan Lee, Jun-Woo Jang, Joon-Ho Song, Eunsoo Shim
• S.LSI: Jinook Song, Yunkyo Cho, Jun-Seok Park, Inyup Kang