14.2: DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks
Dongjoo Shin, Jinmook Lee, Jinsu Lee, and Hoi-Jun Yoo
Semiconductor System Laboratory, School of EE, KAIST
© 2017 IEEE International Solid-State Circuits Conference



Page 2

Outline
– Introduction
– Processor Architecture
– Key Features
  1. Heterogeneous architecture (configurability)
  2. Mixed workload division method (configurability)
  3. Dynamic fixed-point with on-line adaptation (low power)
  4. Quantization-table-based multiplier (low power)
– Implementation Results
– Conclusion

Page 3

Why Deep Learning?
– Overwhelming performance
– Unlimited applicability
  – Image classification, action recognition, sign recognition, …
  – Super resolution, depth map generation, …
  – Translation, natural language processing, …
  – Text generation, image captioning, composition, …

[Chart: computer-vision benchmark results by year, 2010–2016, comparing conventional computer vision, deep learning, and human-level performance]

Introduction

Page 4

Types of Deep Learning
– MLP (multi-layer perceptron): fully-connected (hidden) layers; major applications: simple classification, handwritten letter recognition, …; 3~10 layers
– CNN (convolutional): convolutional layers (convolution + pooling); major applications: vision processing, face recognition, image classification, …; 5~100s of layers
– RNN (recurrent): feedback path, internal state; major applications: sequential data processing, translation, speech recognition, …; 3~5 layers

Introduction

Page 5

Why both CNN & RNN?
– CNN: visual feature extraction and recognition (face recognition, image classification, …)
– RNN: sequential data recognition and generation (translation, speech recognition, …)
– CNN + RNN: CNN-extracted features become the RNN input
– Previous works
  – Optimized for convolution layers only: [1], [2], [3]
  – Optimized for FC layers and RNN only: [4]

Introduction

[1] Y. Chen, ISSCC 2016  [2] B. Moons, SOVC 2016  [3] J. Sim, ISSCC 2016  [4] S. Han, ISCA 2016

[Examples: image captioning ("Two dogs playing in the water"): CNN for visual feature extraction, RNN for sentence generation; action recognition: CNN for visual feature extraction, RNN for temporal information recognition]

Page 6

Deep Learning-dedicated SoC: DNPU
Introduction

[Diagram: CPU (control, ALUs, cache) vs. dedicated accelerator]
– CPU: low compute density, high programmability, floating-point data type, complex control logic
– Dedicated accelerator: high compute density, fixed computation pattern, dedicated memory architecture, parallel computation, dedicated data type
– Parallel processing and a highly dedicated architecture trade programmability for higher energy efficiency (operations/W)

Page 7


Deep learning itself has high adaptability for various applications

Page 8

Design Challenges
– Heterogeneous characteristics
  – Conv. layer (CNN): computation >> parameters
  – FC layer (CNN), RNN: computation ~= parameters

Introduction
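The compute-vs-parameter contrast above is easy to verify with quick arithmetic. Below is a sketch using the standard VGG-16 conv1_1 and fc6 layer shapes; the counting helpers are my own illustration, not from the paper:

```python
def conv_stats(h, w, cin, cout, k):
    """MACs and parameters of a conv layer with same-size output."""
    macs = h * w * cout * cin * k * k      # one MAC per output pixel per weight
    params = cout * cin * k * k + cout     # weights + biases
    return macs, params

def fc_stats(n_in, n_out):
    """MACs and parameters of a fully-connected layer."""
    return n_in * n_out, n_in * n_out + n_out

# VGG-16 conv1_1: 224x224x3 input, 64 filters of 3x3
conv_macs, conv_params = conv_stats(224, 224, 3, 64, 3)
# VGG-16 fc6: 25088 inputs -> 4096 outputs
fc_macs, fc_params = fc_stats(25088, 4096)

# Conv: ~86.7M MACs vs. only 1,792 parameters  (computation >> parameters)
# FC:   ~102.8M MACs vs. ~102.8M parameters    (computation ~= parameters)
```

This is why a convolution engine is dominated by data reuse while an FC-RNN engine is dominated by weight bandwidth.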

Page 9

Design Challenges
– Accelerating conv. layers with FC-RNN-dedicated HW
  – High memory transaction volume
  – Low PE utilization
  – Mismatch of computational patterns
– Accelerating FC-RNN with conv.-dedicated HW
  – Cannot exploit reusability (image, kernel, convolution)
  – Cannot achieve the required throughput

Introduction

Page 10

Design Challenges
– Various configurations of layers
  – Image size, channel size, kernel (filter) size
– Example: VGG-16 convolution layers

Introduction

Page 11

Design Challenges
– Large amount of data & massive numbers of MACs

Introduction

Page 12

Proposed Processor: DNPU
– For high configurability:
  1. Heterogeneous architecture
  2. Mixed workload division method
– For low-power operation:
  3. Dynamic fixed-point with on-line adaptation
  4. Quantization-table-based multiplication

An 8.1TOPS/W CNN-RNN processor for general-purpose DNNs

Page 13

Overall Architecture
Feature 1. Heterogeneous Architecture

[Block diagram]
– Convolution clusters 0–3: convolution layers (CNN)
– Aggregation core
– FC-RNN datapath (input buffer, weight buffer, quantization-table-based matrix multiplier, accumulation registers, activation-function LUT): fully-connected layers (CNN) and recurrent neural networks
– Custom instruction-set-based controller with instruction memory and data memory
– External interfaces

Page 14

Convolution Processor
Feature 1. Heterogeneous Architecture
– Distributed memory-based architecture

[Diagram: 16 convolution cores connected by a data NoC, an instruction network, and a partial-sum network; each core contains a 16KB image memory, a 1KB weight memory, a 12×4 PE array, partial-sum registers, ReLU/pooling logic, and an FSM controller]

Page 15

FC Layer and RNN Hardware Sharing
Feature 1. Heterogeneous Architecture
– FC layer: mapped to matrix multiplication
– RNN: mapped to matrix multiplication

o_m = W_0m × i_0 + W_1m × i_1 + … + W_nm × i_n + b_m

[Diagram: weight matrix with bias column (b_0 … b_m) multiplied by the input vector (i_0, i_1, …, i_n, 1) to produce outputs (o_0, o_1, …, o_m)]
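The mapping of an FC layer onto a single matrix multiplication can be sketched in plain Python (a minimal sketch with made-up sizes): fold the bias b in as an extra matrix column and append a constant 1 to the input vector.

```python
def fc_layer(W, b, i):
    """o_m = W_0m*i_0 + W_1m*i_1 + ... + W_nm*i_n + b_m, row by row."""
    return [sum(w * x for w, x in zip(row, i)) + bm for row, bm in zip(W, b)]

def fc_layer_augmented(W, b, i):
    """Same result as one matrix-vector multiply: bias folded in as a column."""
    W_aug = [row + [bm] for row, bm in zip(W, b)]   # append bias column
    i_aug = i + [1.0]                               # append constant 1
    return [sum(w * x for w, x in zip(row, i_aug)) for row in W_aug]

W = [[1.0, 2.0], [3.0, 4.0]]   # hypothetical 2x2 weights
b = [0.5, -1.0]
i = [10.0, 20.0]
assert fc_layer(W, b, i) == fc_layer_augmented(W, b, i) == [50.5, 109.0]
```

An RNN time step reduces to the same product, with the previous hidden state concatenated onto the input vector.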

Page 16

Limited On-chip Memory Size
Feature 2. Mixed Workload Division Method
– Convolution operation with limited on-chip memory (100s KB ~ 1s MB)
– Trade-off between on-chip memory and on-chip processing elements
– The workload should be divided to reduce the required on-chip memory size

Page 17

Image Division
Feature 2. Mixed Workload Division Method
– Workload division along the image direction
– W × H = W' × H' × Div.#

[Diagram: the input image (Ci channels) split into W' × H' tiles, each convolved with Wf × Hf filters into Co output channels]

Page 18

Channel Division
Feature 2. Mixed Workload Division Method
– Workload division along the channel direction

[Diagram: input channels (Ci) split into groups, each convolved with the corresponding filter slices (f) and accumulated into the output channels (Co)]

Page 19

Mixed Division
Feature 2. Mixed Workload Division Method
– Workload division along both directions

[Diagram: input image tiles (i), filter slices (f), and output tiles (o) divided in both the image and channel directions]

Page 20

Variation Through Layers
Feature 2. Mixed Workload Division Method
– VGG-16 convolution layer example

[Chart: early layers have a large image size; later layers have a large number of channels]

Page 21

Off-chip Access Analysis
Feature 2. Mixed Workload Division Method
– Mixed division can reach lower off-chip access than either pure division

[Chart: off-chip memory access (MB) vs. convolutional layer number (1–13) for image division, channel division, and mixed division]
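The shape of this trade-off can be reproduced with a toy cost model (my own simplification, not the paper's accounting): image division re-fetches the weights once per image tile, channel division spills and re-reads partial sums once per extra channel group, and a mixed split can pick the cheaper balance per layer. All byte counts below are made up for illustration.

```python
import itertools

def fits(img_b, wgt_b, out_b, idiv, cdiv, cap):
    """One image tile + one channel group of weights + one tile of partial sums on chip."""
    return img_b / idiv + wgt_b / cdiv + out_b / idiv <= cap

def traffic(img_b, wgt_b, out_b, idiv, cdiv):
    """Off-chip bytes: input once, weights once per image tile,
    partial sums written and re-read once per extra channel group."""
    return img_b + idiv * wgt_b + out_b * (1 + 2 * (cdiv - 1))

def best_division(img_b, wgt_b, out_b, cap, max_div=128):
    """Exhaustively search (image divisions, channel divisions) for minimum traffic."""
    cands = [(traffic(img_b, wgt_b, out_b, i, c), i, c)
             for i, c in itertools.product(range(1, max_div + 1), repeat=2)
             if fits(img_b, wgt_b, out_b, i, c, cap)]
    return min(cands) if cands else None

# Early layer (big image, few weights) vs. late layer (small image, many weights)
early = best_division(3_200_000, 40_000, 3_200_000, cap=512_000)
late = best_division(200_000, 9_400_000, 200_000, cap=512_000)
```

Under this model the early layer settles on a pure image split (14 tiles, 1 channel group) while the late layer's optimum is genuinely mixed (2 tiles, 31 channel groups); the numbers are illustrative only.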

Page 22

Layer-by-Layer Data Distribution
Feature 3. Dynamic Fixed-point with On-line Adaptation
– Data distribution differs in each layer
– Floating-point implementation: large data range, but high cost
– Fixed-point implementation: low cost, but limited data range

[Plots: per-layer data density and maximum value]

Page 23

Layer-by-Layer Dynamic Fixed-point
– Each layer has a different WL and FL: floating-point-like characteristics across layers
– WL and FL are fixed within a layer: enables fixed-point operation
– Word length (WL: 4 ~ 16) = integer bits + fractional bits
– Fraction length (FL: -15 ~ 16) sets the position of the fraction point
– Example) Conv. layer 1: WL 8, FL 1; Conv. layer 2: WL 8, FL 2; Conv. layer 3: WL 4, FL -2
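The WL/FL encoding can be sketched as follows (a minimal sketch; the helper names and the saturating-rounding choice are mine, not the chip's exact datapath). A value x is stored as round(x · 2^FL) in a WL-bit two's-complement integer, so a negative FL, as in the Conv. layer 3 example, simply scales the representable range up.

```python
def to_fixed(x, wl, fl):
    """Quantize x into a wl-bit two's-complement integer with fl fractional bits."""
    q = round(x * 2 ** fl)
    lo, hi = -(2 ** (wl - 1)), 2 ** (wl - 1) - 1
    return max(lo, min(hi, q))          # saturate instead of wrapping

def from_fixed(q, fl):
    """Recover the real value represented by the integer q."""
    return q / 2 ** fl

# WL 8, FL 2: step 0.25, range [-32, 31.75]
assert from_fixed(to_fixed(3.3, 8, 2), 2) == 3.25
# WL 4, FL -2: step 4, range [-32, 28]; negative FL scales up
assert from_fixed(to_fixed(12.0, 4, -2), -2) == 12.0
```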

Feature 3. Dynamic Fixed-point with On-line Adaptation

Page 24

Previous Approach
Feature 3. Dynamic Fixed-point with On-line Adaptation
– Off-line (off-chip) learning-based FL selection
  – FL is trained to fit a given image data set
  – The selected FL is then used for every image at run time
– Find the FL set with the minimum error on the given image data set
  – Example FL sets (L: layer): Set1: L1: 0, L2: 0, L3: -4, …; Set2: L1: 1, L2: -1, L3: -5, …; Set3: L1: 1, L2: -3, L3: -6, …

Feature 3. Dynamic Fixed-point with On-line Adaptation

Page 25

Proposed On-line Adaptation
– On-line (on-chip) learning-based FL selection
– FL is dynamically fitted to the current input image
– No off-chip learning, and a lower WL is required

[Diagram: threshold-based FL adjustment]
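The slide shows a threshold in the adaptation diagram but does not spell out the update rule, so the following is a hypothetical sketch of on-line FL adaptation: watch how many of a layer's values would overflow at the current FL and nudge FL down (more integer range) or up (more precision) for the next input. The function name, the overflow-ratio criterion, and the threshold value are all my assumptions.

```python
def adapt_fl(values, fl, wl, thresh=0.01):
    """Return the FL to use for this layer on the next input (hypothetical rule)."""
    hi = 2 ** (wl - 1) - 1
    scale = 2 ** fl
    # fraction of values that would saturate at the current FL
    overflow = sum(abs(round(v * scale)) > hi for v in values) / len(values)
    if overflow > thresh:
        return fl - 1        # too many overflows: grow the integer range
    if all(abs(round(v * scale * 2)) <= hi for v in values):
        return fl + 1        # headroom to spare: gain one fractional bit
    return fl

assert adapt_fl([100.0] * 10, fl=2, wl=8) == 1   # overflows -> shrink FL
assert adapt_fl([0.5] * 10, fl=2, wl=8) == 3     # lots of headroom -> grow FL
assert adapt_fl([20.0] * 10, fl=2, wl=8) == 2    # just right -> keep FL
```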

Feature 3. Dynamic Fixed-point with On-line Adaptation

Page 26

Performance Comparisons
Feature 3. Dynamic Fixed-point with On-line Adaptation
– Image classification results (model: VGG-16, dataset: ImageNet, baseline: 32-bit floating point)

[Chart: classification performance vs. word length for fixed-point, dynamic fixed-point, and the proposed dynamic fixed-point with on-line adaptation]

Page 27

LUT-based Reconfigurable Multiplier
– Convolution partial sum: O(x, y) = Σ_i Σ_j I(x + i − 1, y + j − 1) × k(i, j)
– 50,176 multiplications for each kernel weight (input image I: 224 × 224, kernel k: 3 × 3)

Feature 3. Dynamic Fixed-point with On-line Adaptation
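For reference, the per-output-pixel multiply-accumulate pattern this slide describes can be written out naively (a plain-Python sketch; zero-padded borders assumed):

```python
def conv2d_same(I, k):
    """O[y][x] = sum_ij I[y+i-1][x+j-1] * k[i][j] for a 3x3 kernel, zero padding."""
    H, W = len(I), len(I[0])
    O = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            for i in range(3):
                for j in range(3):
                    yy, xx = y + i - 1, x + j - 1
                    if 0 <= yy < H and 0 <= xx < W:
                        O[y][x] += I[yy][xx] * k[i][j]
    return O

# A centered identity kernel reproduces the input
ident = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
assert conv2d_same([[1, 2], [3, 4]], ident) == [[1, 2], [3, 4]]
```

On a 224 × 224 input every kernel weight enters 224 × 224 = 50,176 of these multiplications, which is what makes per-weight precomputation attractive.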

Page 28

LUT-based Reconfigurable Multiplier
Feature 3. Dynamic Fixed-point with On-line Adaptation

Page 29

Weight Quantization
Feature 4. Quantization Table-based Multiplier

Page 30

Quantization Table Construction
Feature 4. Quantization Table-based Multiplier
– Pre-computation with each quantized weight
– A weight mapping table maps 4-bit indices (0–15) to quantized weights w0 … w15
– For each input register I0 … I7, a Q-table holds the pre-computed products Ii × w0 … Ii × w15
– Multiplications between the input and the quantized weights: the results are likewise quantized to 16 values
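The construction can be sketched in a few lines (a minimal sketch; the 16 weight-table values and the 8 input values are illustrative, matching the slide's 8 input registers and 4-bit weight indices):

```python
def build_qtables(inputs, wtable):
    """One Q-table per input register: its pre-computed product with each quantized weight."""
    return [[i * w for w in wtable] for i in inputs]

def qmul(qtables, input_reg, weight_idx):
    """'Multiply' by decoding the 4-bit weight index into a table lookup."""
    return qtables[input_reg][weight_idx]

wtable = [w / 8 for w in range(-8, 8)]                  # 16 quantized weight levels
inputs = [0.5, -1.25, 2.0, 3.0, 0.0, 1.0, -0.5, 0.75]   # 8 input registers
qt = build_qtables(inputs, wtable)

# Looking up (I3, w10) matches multiplying directly
assert qmul(qt, 3, 10) == inputs[3] * wtable[10]
```

Building 8 tables of 16 entries costs 128 real multiplications, after which every further use of those inputs is a lookup.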

Page 31

Multiplication with Quantization Table
– Decode the index to load the pre-computed result

Feature 4. Quantization Table-based Multiplier

Page 32

Quantization Table-based Multiplier
Feature 4. Quantization Table-based Multiplier
– Off-chip memory BW
  – Quantization: 16-bit weight → 4-bit index
  – Zero skipping: skip the weight load for zero inputs
– Multiplications
  – 99% of multiplications become table look-ups

[Diagram: quantization (4-bit) with zero skipping]
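The bandwidth saving can be illustrated with a small sketch (illustrative helpers, not the chip's encoder): weights are shipped as 4-bit nearest-entry indices instead of 16-bit values, and the weight fetch is skipped entirely when the corresponding input is zero.

```python
def quantize_weight(w, wtable):
    """4-bit index of the nearest entry in the 16-entry weight table."""
    return min(range(len(wtable)), key=lambda i: abs(wtable[i] - w))

def weight_fetch_bits(weights, inputs):
    """Off-chip bits for the weight stream: 4-bit indices, skipping zero inputs."""
    return sum(4 for w, x in zip(weights, inputs) if x != 0)

wtable = [i / 8 for i in range(-8, 8)]
weights = [0.3, -0.6, 0.1, 0.9]
inputs = [1.0, 0.0, 2.0, 0.0]     # half the inputs are zero

assert quantize_weight(0.3, wtable) == 10      # nearest level is 0.25
# Raw 16-bit weights: 4 * 16 = 64 bits; 4-bit indices with zero skipping: 8 bits
assert weight_fetch_bits(weights, inputs) == 8
```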

Page 33

Chip Photograph and Summary
Implementation Results

Page 34

Measurement Results

[Charts: clock frequency (MHz), power (mW), and energy efficiency (TOPS/W) measurements]

Implementation Results

Page 35

Measurement Results
– Dynamic fixed-point with on-line adaptation

Implementation Results

Page 36

Demonstration Video
Implementation Results

Page 37

Performance Comparison
– Comparison with convolution processors

[1] Y. Chen, et al., ISSCC 2016 [2] B. Moons, et al., SOVC 2016

Implementation Results

Page 38

Performance Comparison
– Comparison with FC layer processor

[4] S. Han, et al., ISCA 2016

Implementation Results

The proposed work is the only one that can support both CNN and RNN.

Page 39

Conclusion
– An energy-efficient CNN-RNN processor SoC is proposed for general-purpose DNNs
– Key features:
  – Supports both CNN and RNN
  – Mixed workload division method
  – Dynamic fixed-point with on-line adaptation
  – Quantization table-based multiplier

DNPU: An 8.1TOPS/W reconfigurable CNN-RNN processor SoC