Introduction to machine learning on FPGAs
Arthur Ruder ¦ Enclustra GmbH ¦ AI seminar EPFL Lausanne & ZHAW Winterthur ¦ 19 & 21/11/2019
Quick reminder: neural network
[Figure: fully connected network with an input layer (e.g. pixels), hidden layers 1 and 2, and an output layer (e.g. probability). Each neuron computes a weighted sum of its inputs passed through an activation function: a = σ(w₁x₁ + w₂x₂ + w₃x₃).]
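The neuron computation above can be sketched in a few lines of Python (a minimal sketch with a sigmoid activation; the input and weight values are illustrative, not from the slides):

```python
import math

def neuron(x, w, bias=0.0):
    """One artificial neuron: weighted sum of inputs through a sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid activation

# Three inputs x1..x3 and weights w1..w3 (illustrative values)
a = neuron([0.5, 0.2, 0.8], [0.4, -0.6, 0.9])
print(a)  # activation between 0 and 1
```

A full layer is simply many such neurons sharing the same inputs, which is why inference reduces almost entirely to multiply-accumulate (MAC) operations.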
Machine learning concepts: training phase
• Inputs: training set (labelled data)
• Goal: obtain trained weights
• Forward-propagation through the untrained network yields a classification probability, e.g. 40 % dog, 60 % cat
• But the label says 100 % dog: the error drives back-propagation, which updates the weights
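The forward/back-propagation loop above can be sketched for a single sigmoid neuron with a squared-error loss (a minimal sketch; the weights, inputs, and learning rate are made-up values):

```python
import math

def forward(x, w):
    """Forward-propagation: weighted sum through a sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def train_step(x, w, target, lr=0.5):
    """One forward/back-propagation pass for a single sigmoid neuron."""
    a = forward(x, w)                        # forward-propagation
    # dLoss/dw for L = (a - target)^2, via the chain rule
    delta = 2 * (a - target) * a * (1 - a)
    return [wi - lr * delta * xi for wi, xi in zip(w, x)]  # gradient step

w = [0.1, -0.2]
x, label = [1.0, 2.0], 1.0   # label: "100 % dog"
for _ in range(100):
    w = train_step(x, w, label)
print(forward(x, w))  # output moves toward the label, 1.0
```

Real training repeats this over millions of images and millions of weights, which is what makes the compute budget (next slides) so large.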
Machine learning concepts: inference
• Inputs: e.g. photographs
• Forward-propagation through the trained network
• Outputs: classification probability, e.g. 99.07 % dog, 0.93 % cat
Quick reminder: Deep Learning
[Figure: classification error [%] of image recognition challenge winners, 2010–2015, versus human error: shallow models at first, then AlexNet (8 layers), VGG (19 layers), GoogleNet (22 layers), ResNet (152 layers).]
Hardware platform
What hardware do we need for this? CPUs, GPUs, FPGAs, ASICs?
• What are the requirements for…?
a) training
b) inference
• What type of hardware is best suited for each task?
Neural network training: computational complexity
For one picture (image classification, untrained ResNet50):
• Forward-propagation: 7.7 billion operations, ~35 MB parameter storage*
• Back-propagation: 23 billion operations, ~380 MB parameter storage
• Result: 50 % cat, 50 % dog; label: 100 % dog
For the whole training process:
• ImageNet: 1.2 million labelled pictures
• 1 epoch: 1.2M × 30.7B ≈ 37 × 10¹⁵ operations (majority MAC)
• ResNet50 needs ~100 epochs for training…
* for forward propagation only; backward propagation similar
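The epoch arithmetic above is easy to verify directly (using only the per-image numbers quoted on the slide):

```python
# Back-of-the-envelope training cost for ResNet50 on ImageNet,
# using the per-image figures from the slide
forward_ops  = 7.7e9     # operations per image, forward-propagation
backward_ops = 23e9      # operations per image, back-propagation
images       = 1.2e6     # ImageNet training pictures
epochs       = 100

ops_per_epoch = images * (forward_ops + backward_ops)
total_ops     = epochs * ops_per_epoch
print(f"{ops_per_epoch:.2e} ops per epoch")    # ~3.7e16, i.e. 37 * 10^15
print(f"{total_ops:.2e} ops for full training")
```

At 100 epochs the total lands in the exaflop range, which is why training is done offline on dedicated hardware rather than at the edge.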
Requirements breakdown: training
• Typically not time-critical
• Compute billions of floating point calculations
• Handle large data sets (GBs to hundreds of GBs)
• Flexibility to train a wide variety of neural networks
Clear answer (for now): GPUs do the heavy lifting of neural network training
Requirements: inference
• Edge requirements
  • Low (deterministic) latency (e.g. real-time object detection)
  • Power efficiency (limited battery capacity)
  • Sensor fusion (e.g. industrial surveillance)
  • Robustness (e.g. temperature)
• Cloud requirements
  • Low latency (e.g. search engines)
  • Power efficiency (heat dissipation / cooling cost)
Resource requirements overview
[Figure: compute vs. memory requirements of inference workloads: Image Classification, Object Detection, Semantic Segmentation, OCR, Speech Recognition.]
Main takeaway points:
• Inference is challenging
• Huge variation in compute and memory requirements (even within subgroups)
• Models typically don’t fit into local memory
Inference Accelerator: architectural challenges
[Figure: accelerator datapath: DMA transfers inputs and weights from external memory into on-chip buffers (input buffer, weight buffer); a compute array accumulates partial sums, which pass through activation functions before the result is written back.]
• Huge amount of computations (the compute array)
• Memory bandwidth (on both the input path and the weight path)
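The role of the weight buffer and the partial-sum accumulators can be illustrated with a tiled matrix-vector product (a minimal sketch; the tile size and matrices are made-up, and a real compute array would process the MACs in parallel rather than in Python loops):

```python
def tiled_matvec(weights, x, tile=4):
    """Matrix-vector product computed tile by tile, the way an
    accelerator with a small on-chip weight buffer would: each pass
    loads one tile of weights/inputs and accumulates partial sums."""
    n_out, n_in = len(weights), len(x)
    partial = [0.0] * n_out                   # partial-sum registers
    for start in range(0, n_in, tile):        # one buffer refill per tile
        for o in range(n_out):
            for i in range(start, min(start + tile, n_in)):
                partial[o] += weights[o][i] * x[i]   # MAC operation
    return partial

W = [[1, 2, 3, 4, 5, 6], [0.5, 0, -1, 2, 0, 1]]
print(tiled_matvec(W, [1, 1, 1, 1, 1, 1]))
```

Every tile boundary is a trip to external memory, which is exactly where the memory-bandwidth bottleneck flagged above comes from.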
Qualitative hardware comparison
[Figure: devices positioned on two axes: Performance & Power Efficiency vs. Flexibility & Ease of Use.]
Qualitative hardware comparison
[Table: GPU, FPGA and ASIC rated against the requirements below; the per-device ratings were graphical and are not recoverable from this export.]
• Low (deterministic) latency
• High throughput
• Power efficiency
• Sensor fusion
• Robustness
• Programmability
• Flexibility
• Ease-of-use
• (Development) cost
FPGA ML workflow
Challenge: efficient mapping of the floating point model to an FPGA implementation without losing accuracy
• Trained network: FP32 floating point model
• Compression:
  • Pruning → pruned network
  • Quantization → fixed point
• Compilation → FPGA implementation
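The quantization step can be sketched as symmetric linear quantization of FP32 weights to int8 (a minimal illustration with made-up weight values; production toolchains also calibrate activations and may use per-channel scales):

```python
def quantize_int8(values):
    """Symmetric linear quantization of FP32 weights to int8,
    a common step when mapping a model to fixed-point FPGA logic."""
    scale = max(abs(v) for v in values) / 127.0   # one scale for the tensor
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 values from the int8 codes."""
    return [qi * scale for qi in q]

weights = [0.82, -0.31, 0.05, -1.27, 0.4]
q, s = quantize_int8(weights)
print(q)                    # int8 codes
print(dequantize(q, s))     # close to the original FP32 weights
```

On an FPGA the int8 codes map directly onto narrow fixed-point multipliers, cutting both DSP usage and the memory bandwidth per weight by roughly 4x compared with FP32.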
Impact of compression
https://www.hotchips.org/hc30/0tutorials/T2_Part_2_Song_Hanv3.pdf
Compression allows using significantly fewer resources when deploying a neural network, with minimal impact on network accuracy.
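The pruning half of compression can be sketched as magnitude pruning (an illustrative sketch with made-up weights; in practice pruning is interleaved with fine-tuning to recover accuracy):

```python
def prune_by_magnitude(weights, sparsity=0.5):
    """Magnitude pruning: zero out the smallest-magnitude weights."""
    k = int(len(weights) * sparsity)                 # weights to drop
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:k])                         # indices of smallest |w|
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]

w = [0.9, -0.02, 0.4, 0.01, -0.7, 0.05]
print(prune_by_magnitude(w))  # half the weights become zero
```

Zeroed weights cost no MACs and no storage when the hardware exploits sparsity, which is where the resource savings quoted above come from.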
Hardware implementation architectures
• Streaming architecture: the FPGA contains one dedicated block per layer (CONV → POOL → CONV → … → FC), fed by the host CPU and memory; data streams through the pipeline
• Single computation engine: one generic engine (CONV/FC, POOL, NL) with a control unit and DMA; the layers (CONV, ACTIVATION, POOL, …, FC) are executed on it sequentially
[Table: the two approaches compared on customizability, flexibility and power efficiency; the ratings were graphical and are not recoverable from this export.]
Toolchains for AI on FPGAs

Provider | Edge: computer vision | Edge: language processing | Cloud
Xilinx   | DNNDK (Deep Neural Network Development Kit) | – | ML (Machine Learning) Suite
Intel    | – | – | OpenVINO
Omnitek  | DPU (Deep Learning Processing Unit) + software framework (edge and cloud)
Lattice  | sensAI | – | –
Summary
• Neural network inference is viable on FPGAs
  • Low power (~mW – W)
  • Sensor integration
  • Flexibility
  • Low deterministic latency
• Edge examples
  • Xnor.ai: solar-powered person detection
  • CERN: sensor data filtering and classification
• Cloud examples
  • Microsoft: Azure cloud AI