Copyright © 2015 Synopsys Inc. 1
Bruno Lavigueur
12 May 2015
Tailoring CNNs for Low-cost,
Low-power Implementations
Synopsys at a Glance
• Embedded vision subsystem, built from many silicon-proven IPs
• DesignWare: ARC HS processor, AXI, DMA, Memory Compiler, …
• HAPS FPGA-based rapid prototyping system
• >5,300 Masters/PhD degrees
• >2,300 IP designers
• >1,500 applications engineers
• >$2.2B FY14 revenue
• 32% of revenue spent on R&D
• >9,300 employees
CNN on Embedded Devices
• Convolutional Neural Network (CNN)
• Wide range of detection and classification tasks possible
• The majority of published CNN graphs are not tailored for embedded use
• Memory requirements
• Number of floating-point operations (# of MACs)
• Yet CNNs have nice properties for parallelization on embedded devices
• Regular processing, feed-forward dataflow, no data-dependent computation
• Key questions
• Can the size and complexity of the graph be reduced with minimal impact on detection rates?
• Number of layers, connectivity, size of convolutions
• What is the impact of moving from floating point to fixed point?
How CNN Works (Once Trained)
• Multiple feature extraction layers
• Progressive refinement process
• Each successive layer extracts more complex features (higher level)
• Last layer performs classification
• Same computation (neuron) replicated multiple times
[Figure: input image → Layer 1 (low-level feature extraction, pooling & downsampling) → Layer 2 (mid-level features, partially connected) → Layer 3 (high-level features) → fully connected classification]
Visualising a CNN
• Each layer of convolutions extracts progressively higher-level features
• Subsampling / max pooling to “zoom out” and detect bigger objects with smaller convolutions
• Non-linear function on each neuron to activate it
[Figure: sample outputs of Layer 1, Layer 2, Layer 3 and Layer 4]
CNN Computation
• Convolution of multiple inputs together
• Fixed kernel size
• Optional subsampling
• 1, 2, 4x
• Optional max-pooling
• Very regular, repetitive computation
• Dominated by MAC
• Deterministic
• Non-linear activation function (sigmoid, hyperbolic tangent, rectifier)
[Figure: M inputs I_0 … I_{M-1} (X_I × Y_I each) are convolved with Z kernels (K × K) with associated weights, then passed through an activation (tanh, ReLU, …) to produce N outputs O_0 … O_{N-1} (X_O × Y_O each)]
$O_j = \mathrm{act}\left(B_j + (I_v \otimes K_w) + \dots\right)$, where $\otimes$ denotes convolution and $\mathrm{act}$ the activation function
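The per-output formula on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the talk's implementation: it assumes every output map is connected to every input map, uses no subsampling or pooling, and computes the sliding dot product without flipping the kernel (the usual CNN convention). The names `conv_layer`, `kernels` and `relu` are hypothetical.

```python
import math

def relu(x):
    """Rectifier activation: max(0, x)."""
    return max(0.0, x)

def conv_layer(inputs, kernels, biases, act=math.tanh):
    """Compute O_j = act(B_j + sum over v of (I_v conv K_jv)).

    inputs  : list of M 2-D maps (lists of rows), all the same size
    kernels : kernels[j][v] is the K x K kernel from input v to output j
    biases  : list of N bias values B_j
    Assumption: full connectivity and 'valid' output size (no padding).
    """
    k = len(kernels[0][0])
    out_h = len(inputs[0]) - k + 1
    out_w = len(inputs[0][0]) - k + 1
    outputs = []
    for j, bias in enumerate(biases):
        acc = [[bias] * out_w for _ in range(out_h)]
        for v, image in enumerate(inputs):
            kern = kernels[j][v]
            for r in range(out_h):
                for c in range(out_w):
                    for i in range(k):
                        for jj in range(k):
                            # One multiply-accumulate (MAC) per innermost step
                            acc[r][c] += image[r + i][c + jj] * kern[i][jj]
        outputs.append([[act(x) for x in row] for row in acc])
    return outputs

# One 2x2 input, one 2x2 kernel, one output map with bias 0.5
result = conv_layer([[[1, 2], [3, 4]]], [[[[1, 0], [0, 1]]]], [0.5], act=relu)
print(result)  # [[[5.5]]]
```

The four nested loops make the "very regular, dominated by MAC" property visible: the only data-dependent step is the activation applied at the end.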
Moving Towards Embedded CNN
• Given the nature of the algorithm, there are many ways to accelerate CNNs, including:
• Vector / SIMD unit
• Systolic array / streaming
• GPU
• Performance / power / area trade-offs will vary depending on the architecture
• In all cases the main limitations will be
• Amount of closely coupled memory available
• Maximum number of Giga-MAC/s that can be sustained
• I/O bandwidth required & available
• Optimized data movement, efficient streaming
[Figure: EV Processor block diagram: a RISC CPU with four 32-bit cores and a CNN Engine built from an array of PEs, connected to shared memory and DMA through an interconnect]
Moving CNN to Embedded Systems
• Graph complexity
• Number of layers (depth)
• Size of the convolution filters
• Number of connections between the layers
[Figure: a graph from Input through Layer 1 to Layer 4, annotated with the labels "Compute requirements", "ALU width/cost", "Memory size", "Data precision" and "# Coefficients", next to a small worked example in which a 3x3 image is convolved with a 2x2 filter and passed through an activation to give a 2x2 feature map]
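To make the link between graph complexity, compute and memory concrete, the sketch below counts MACs and coefficients for one convolution layer. It is an illustration under stated assumptions: every output map is connected to every input map, one bias per output map, and 2 bytes per coefficient. The function name `conv_layer_cost` and the layer dimensions in the example are hypothetical and do not come from the talk.

```python
def conv_layer_cost(in_maps, out_maps, kernel, out_w, out_h, bytes_per_coeff=2):
    """Return (macs, coefficients, coefficient_bytes) for one conv layer.

    Assumes full connectivity between input and output maps and one bias
    per output map; bytes_per_coeff models the chosen data precision.
    """
    weights = in_maps * out_maps * kernel * kernel
    coefficients = weights + out_maps          # weights plus biases
    macs = weights * out_w * out_h             # one MAC per weight per output pixel
    return macs, coefficients, coefficients * bytes_per_coeff

# Hypothetical layer: 3 input maps, 8 output maps, 5x5 kernels, 28x28 outputs
macs, coeffs, nbytes = conv_layer_cost(3, 8, 5, 28, 28)
print(macs, coeffs, nbytes)  # 470400 608 1216
```

Halving the number of connections or shrinking the kernel reduces both the MAC count and the coefficient storage, while the data precision only changes the storage (and, in hardware, the ALU width).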
Example of a Big & Small CNN Application
• Starting point:
• Multicoreware generated ~10 million face/non-face samples from over 200 Hollywood and Bollywood full-length movies
• Trained a CNN to detect faces in those movies

Metric        AlexNet-like         Embedded version
Weight space  400 MB               0.5 MB
Layers        10 (7 Cv + 3 FC)     5 (3 Cv + 2 FC)
Compute       200x                 1x
Bandwidth     400x                 1x
F1-Score      .963                 .905
Accuracy      .993                 .981
VGA 30 FPS    4800 GOPS            24 GOPS

• Cv: Convolution layers (partially connected)
• FC: Fully connected layers
Reducing Complexity of the Graph
• Used standard open-source projects to train networks with floating point and GPU acceleration while exploring network topology
• Cuda-convnet, Caffe, Theano
• Didn’t worry initially about numerical precision, as the literature has shown CNNs are robust to reduced precision
• From scratch: small networks can be trained very fast
• Enables lots of shots on goal:
• Using scripting and many GPUs
• Number of network layers, convolutions, subsampling & pooling
• Explored a huge space and quickly converged on a graph with good learning
• From an existing graph: also worked backwards from a high-accuracy large graph
• Iteratively reduced it and retrained the best ones
• Ended up with similar networks in both cases
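A scripted sweep of the kind described here usually starts by enumerating candidate topologies. The sketch below shows only that enumeration step; the parameter names and value ranges are hypothetical, and the training itself (done in the talk with Cuda-convnet, Caffe or Theano) is deliberately left out.

```python
from itertools import product

def candidate_topologies(conv_layers=(2, 3, 4), kernel_sizes=(3, 5, 7),
                         subsampling=(1, 2), pooling=(False, True)):
    """Yield one configuration dict per combination of swept parameters.

    All ranges are illustrative assumptions, not the ones used in the talk.
    """
    for n, k, s, p in product(conv_layers, kernel_sizes, subsampling, pooling):
        yield {"conv_layers": n, "kernel": k, "subsample": s, "max_pool": p}

candidates = list(candidate_topologies())
print(len(candidates))  # 36
print(candidates[0])    # {'conv_layers': 2, 'kernel': 3, 'subsample': 1, 'max_pool': False}
```

Each dictionary would then be handed to a separate training job, and the resulting scores compared to pick the graphs worth refining.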
Training Optimizations
• Improve the F-1 score with classic techniques such as
• Data normalization
• Hard negative mining (boosting)
• Annealing the learning rate
• Data augmentation: flip, random cropping, color space, ...
• Moved the initial system from an F1 of ~.74 to ~.90
• Once the graph topology and training are satisfactory, look at the impact of moving to fixed point
• Tests below were done with 31,437 positive and 263,145 negative samples

                Initial   Optimized
True positive   19706     27093
False positive  1769      1335
False negative  11731     4344
F-1 Score       0.7449    0.9051
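The F-1 scores in the table follow directly from the true positive, false positive and false negative counts. The short check below uses the standard definition (harmonic mean of precision and recall); the function name `f1_score` is just a local helper.

```python
def f1_score(tp, fp, fn):
    """F-1 score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

initial = f1_score(tp=19706, fp=1769, fn=11731)
optimized = f1_score(tp=27093, fp=1335, fn=4344)
print(f"{initial:.4f} {optimized:.4f}")  # 0.7449 0.9051
```

In both columns true positives plus false negatives add up to the 31,437 positive samples, so most of the gain comes from recall: far fewer faces are missed after the training optimizations.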
Moving to Fixed Point: Empirical Approach
• Compare the output of every layer with the reference floating-point version
• Differences may grow after each layer
• Detection threshold might need to be tweaked to achieve similar results
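The layer-by-layer comparison can be prototyped by quantizing reference values and measuring the error. The sketch below assumes a signed 16-bit format with 13 fractional bits, which is one reading of the "Q2S13" example on this slide (1 sign bit, 2 integer bits, 13 fractional bits); that interpretation and the helper names `to_fixed`, `to_float` and `max_abs_error` are assumptions.

```python
def to_fixed(x, frac_bits=13, total_bits=16):
    """Quantize a float to a signed fixed-point integer with saturation."""
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, round(x * (1 << frac_bits))))

def to_float(q, frac_bits=13):
    """Convert a fixed-point integer back to a float."""
    return q / (1 << frac_bits)

def max_abs_error(reference, frac_bits=13, total_bits=16):
    """Largest difference between float values and their quantized versions."""
    return max(abs(v - to_float(to_fixed(v, frac_bits, total_bits), frac_bits))
               for v in reference)

layer_output = [0.5, -1.25, 3.9999, 0.0001]   # hypothetical reference values
print(to_fixed(0.5))                           # 4096
print(max_abs_error(layer_output) <= 2 ** -14)  # True
```

For values inside the representable range the rounding error is bounded by half a step (2^-14 here); values outside the range saturate, which is why the format has to be chosen from the observed range of each layer.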
[Figure: worked fixed-point example]
Image (greyscale, 8-bit pixels):
200  64   1
150  50   1
  1  10 220
Filter (convert to fixed point based on range, e.g. 16 bit (Q2S13)):
 4   0
 0  -1
Convolution result in the accumulator (make sure the accumulator is wide enough, e.g. 32 bit signed):
750 255
590 -20
After the non-linear function (ReLU):
750 255
590   0
Feature map after shift + saturate:
255 127
255   0
• Shift-right values to avoid overflow: x = max(0, x) >> N
• Choose ‘N’ according to the dynamic range of the ‘x’ values
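The numbers in this figure can be reproduced step by step. The sketch below assumes a shift of N = 1 and saturation to the unsigned 8-bit maximum of 255, which is what the values on the slide imply; the kernel is applied as a sliding dot product without flipping. The function names are hypothetical.

```python
def correlate2d_valid(image, kernel):
    """Sliding dot product of kernel over image, 'valid' region only."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def relu_shift_saturate(acc, shift, max_value=255):
    """Apply x = max(0, x) >> shift, then clamp to max_value."""
    return [[min(max(0, v) >> shift, max_value) for v in row] for row in acc]

image = [[200, 64, 1], [150, 50, 1], [1, 10, 220]]  # 8-bit greyscale pixels
kernel = [[4, 0], [0, -1]]                          # filter from the slide

acc = correlate2d_valid(image, kernel)
feature_map = relu_shift_saturate(acc, shift=1)     # N = 1 is an assumption
print(acc)          # [[750, 255], [590, -20]]
print(feature_map)  # [[255, 127], [255, 0]]
```

Python integers never overflow, so the accumulator width does not show up here; on real hardware the sums 750 and 590 already exceed 8 bits, which is why the slide asks for a wider accumulator before the shift.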
Results for Face Detection Application
• FDDB: Face Detection Data Set and Benchmark
• Results shown for the embedded small & fixed-point graph
• Localization can be improved with pre/post-processing
• Impacts scores
• Not done here

Type                 F-1
Best (CascadeCNN)    0.91
Middle 10 average    0.85
Embedded – 40%       0.84
Embedded – 50%       0.82
(Figure annotation: fixed point, 8 bit)
Low-cost, Low-power, Flexible CNN
• Design-time configurable
• Number of CNN Processing Elements (2 to 8)
• Streaming interconnection network configured for the number of cores
• Runtime reconfigurable
• Flexible point-to-point connections between all cores
• CNN-optimized instruction set
• Convolutions, MAC, LUT, …
• Micro-DMA & stream interface for data movement
• Programmable
• Using the generated C compiler
• Each CNN PE has a local data & program memory
[Figure: block diagram: a multi-core 32-bit RISC cluster with sync, DMA, shared data memory and a CNN Engine (PE 1 … PE 8 linked by a reconfigurable streaming interconnect), all attached to the subsystem interconnect]
Mapping Example and Performance
[Figure: 4 PE, 5 FIFO configuration: layers L1 to L4 mapped onto four PEs (L1&4, L2, L3a, L3b), linked by FIFOs and attached to the subsystem interconnect]
• Input image read only once
• 30 cycles on average to do 8 convolutions of 5x5 in parallel
• Including all data movement & contention
• Over 85% MAC resource utilization (8 MACs / CNN PE)
• ~15 mW per PE @ 28nm HPM
• With memory & interconnect
• Mapping on 4 processing elements
• Smaller layers merged together
Demonstrator
[Figure: demonstrator setup. A workstation with a webcam runs the host application, streaming video frames to DDR over the UMRBus and back. The HAPS 70-S12 prototyping system, clocked at 50 MHz (10% of real-time), hosts an ARC HS core and the ARC EV52 processor (RISC multi-core with Core 1 and Core 2, shared data memory, DMA, and a CNN Engine with PE 1 to PE 8 running the CNN graph), connected to DDR through AXI interconnects]
ARC HS Core
• Read in frame
• Pyramid (scaling)
• Non-max suppression
• Softmax
• Display the result
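Two of the post-processing steps listed for the demonstrator, softmax and non-max suppression, are easy to sketch in plain Python. This is a generic greedy version under assumptions: boxes are `(x0, y0, x1, y1)` tuples, the 0.3 overlap threshold is arbitrary, and none of the names or values come from the demonstrator itself.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def non_max_suppression(detections, iou_threshold=0.3):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, k[1]) <= iou_threshold for k in kept):
            kept.append((score, box))
    return kept

detections = [(0.9, (10, 10, 50, 50)),
              (0.8, (12, 12, 52, 52)),      # overlaps the first box heavily
              (0.7, (100, 100, 140, 140))]
kept = non_max_suppression(detections)
print([score for score, _ in kept])  # [0.9, 0.7]
```

Because the same face is typically detected at several neighbouring window positions and pyramid scales, a suppression step of this kind is what turns raw CNN outputs into one box per face.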
Lessons Learned
• CNN compute requirements can be dramatically reduced with a small impact on the detection rates
• Works well when the number of object classes to detect is kept small
• Offline training is the critical step to obtain good performance
• Specialized and programmable hardware can be used to efficiently implement many different CNN graphs
• Low power and area
• Some pre- and post-processing is needed to have a complete and useful application
• CNN accelerator coupled with quad-core RISC cluster
• Useful to couple CNN with other processing steps to improve performance
• Shrinking the image when it doesn’t impact detection rates
• Sliding a detection window on an image
• Region of interest
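The effect of shrinking the image before sliding a detection window over it can be estimated by counting window positions. The sketch below is illustrative only: the 32-pixel window, 8-pixel stride and 0.5 scale factor are hypothetical values, not parameters from the talk.

```python
def sliding_windows(width, height, window=32, stride=8):
    """Yield the top-left corner of every window that fits in the image."""
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            yield x, y

def pyramid_sizes(width, height, scale=0.5, min_size=32):
    """Yield successively smaller image sizes until the window no longer fits."""
    while width >= min_size and height >= min_size:
        yield width, height
        width, height = int(width * scale), int(height * scale)

levels = list(pyramid_sizes(640, 480))       # start from a VGA frame
counts = [sum(1 for _ in sliding_windows(w, h)) for w, h in levels]
print(levels)  # [(640, 480), (320, 240), (160, 120), (80, 60)]
print(counts)  # [4389, 999, 204, 28]
```

Each halving of the image cuts the number of CNN evaluations by more than a factor of four, so running the detector on a reduced image or on a region of interest saves compute whenever detection rates allow it.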
Resources
• Selected CNN papers
• Embedded facial image processing with Convolutional Neural Networks
• http://liris.cnrs.fr/Documents/Liris-6072.pdf
• Memory-Centric Accelerator Design for Convolutional Neural Networks
• http://parse.ele.tue.nl/system/attachments/58/original/iccdMP17.pdf?1381908921
• CNN tutorial & courses
• Stanford CNN course: http://cs231n.github.io/
• Neural network intro and visualization: http://colah.github.io/
• Synopsys DesignWare Embedded Vision Processors
• http://www.synopsys.com/ev
• More information and demo available at the Technology Showcase (Mission City Ballroom, Tables 3 & 4)