21
1 © 2015 The MathWorks, Inc. Deploying Deep Learning Networks to Embedded GPUs and CPUs Pierre Nowodzienski

Deploying Deep Learning Networks to Embedded GPUs and CPUs€¦ · Deep Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Deploying Deep Learning Networks to Embedded GPUs and CPUs€¦ · Deep Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature

1© 2015 The MathWorks, Inc.

Deploying Deep Learning Networks

to Embedded GPUs and CPUs

Pierre Nowodzienski

Page 2: Deploying Deep Learning Networks to Embedded GPUs and CPUs€¦ · Deep Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature

2

Deep Learning enablers

Labeled public datasets

Increased GPU acceleration

World-class models to be leveraged

AlexNetPRETRAINED MODEL

CaffeM O D E L S

ResNet PRETRAINED MODEL

TensorFlow/Keras M O D E L S

VGG-16PRETRAINED MODEL

GoogLeNet PRETRAINED MODEL

Human

Accuracy

Page 3: Deploying Deep Learning Networks to Embedded GPUs and CPUs€¦ · Deep Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature

3

Deep Learning Applications:

Image classification, speech recognition, autonomous driving, etc…

Rain Detection and Removal1Detection of cars and road in autonomous driving systems

1. Deep Joint Rain Detection and Removal from a Single Image

Traffic Sign Recognition

Page 4: Deploying Deep Learning Networks to Embedded GPUs and CPUs€¦ · Deep Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature

5

GPUs and CUDA programming

Ease of programming(expressivity, algorithmic, …)

Perf

orm

ance

easier

faster

MATLAB

Python

CUDA

OpenCL

C/C++

GPUs are “hardware on steroids”, … but, programming them is hard

GPU Coder

Page 5: Deploying Deep Learning Networks to Embedded GPUs and CPUs€¦ · Deep Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature

6

Deep learning workflow in MATLAB

▪ Design in MATLAB

▪ Manage large data sets

▪ Automate data labeling

▪ Easy access to models

▪ Training in MATLAB

▪ Acceleration with GPU’s

▪ Scale to clusters

Train in MATLAB

Model

importer

Trained

DNN

Model

importer

Deep Neural Network

Design + Training

Page 6: Deploying Deep Learning Networks to Embedded GPUs and CPUs€¦ · Deep Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature

7

Deep learning workflow in MATLAB

Train in MATLAB

Model

importer

Trained

DNN

Application

logic

Model

importer

Deep Neural Network

Design + TrainingApplication

designDeep learning

Page 7: Deploying Deep Learning Networks to Embedded GPUs and CPUs€¦ · Deep Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature

8

Deep learning workflow in MATLAB

Train in MATLAB

Model

importer

Trained

DNN

Model

importer

Standalone

DeploymentDeep Neural Network

Design + TrainingApplication

design

Application

logic

C++/CUDA

Page 8: Deploying Deep Learning Networks to Embedded GPUs and CPUs€¦ · Deep Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature

10

GPU Coder for Deployment

Deep Neural Networks

Deep Learning, machine learning

Image Processing and

Computer Vision

Image filtering, feature detection/extraction

Signal Processing and

Communications FFT, filtering, cross correlation,

5x faster than TensorFlow

2x faster than MXNet

60x faster than CPUs

for stereo disparity

20x faster than

CPUs for FFTs

GPU Coder

Accelerated implementation of

parallel algorithms on GPUs & CPUs

ARM Compute

Library

Intel

MKL-DNN

Library

Page 9: Deploying Deep Learning Networks to Embedded GPUs and CPUs€¦ · Deep Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature

11

GPUs and CUDA

CUDA

kernelsC/C++

ARM

Cortex

GPU

CUDA Cores

C/C++

CUDA Kernel

C/C++

CUDA Kernel

GPU Memory

Space

CPU Memory

Space

Page 10: Deploying Deep Learning Networks to Embedded GPUs and CPUs€¦ · Deep Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature

12

Challenges of Programming in CUDA for GPUs

▪ Learning to program in CUDA

– Need to rewrite algorithms for parallel processing paradigm

▪ Creating CUDA kernels

– Need to analyze algorithms to create CUDA kernels that maximize parallel

processing

▪ Allocating memory

– Need to deal with memory allocation on both CPU and GPU memory spaces

▪ Minimizing data transfers

– Need to minimize while ensuring required data transfers are done at the

appropriate parts of your algorithm

Page 11: Deploying Deep Learning Networks to Embedded GPUs and CPUs€¦ · Deep Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature

13

GPU Coder Helps You Deploy to GPUs Faster

GPU Coder

CUDA Kernel creation

Memory allocation

Data transfer minimization

• Library function mapping

• Loop optimizations

• Dependence analysis

• Data locality analysis

• GPU memory allocation

• Data-dependence analysis

• Dynamic memcpy reduction

Page 12: Deploying Deep Learning Networks to Embedded GPUs and CPUs€¦ · Deep Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature

19

GPU Coder speeds up MATLAB for Image Processing and

Computer Vision

8x speedup

Distance

transform

5x speedup

Fog removal

700x speedup

SURF feature

extraction

18x speedup

Ray tracing

3x speedup

Frangi filter

Page 13: Deploying Deep Learning Networks to Embedded GPUs and CPUs€¦ · Deep Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature

20

GPU Coder speeds up MATLAB at least 2x for inference

AlexNet ResNet-50 VGG-16

Images / S

ec

Single image prediction using Intel® Xeon® CPU - 3.6 GHz, NVIDIA libraries: CUDA8 - cuDNN 7

TensorFlow 1.6.0, MXNet 1.1.0, MATLAB 18a

GPU Coder

MATLAB

Page 14: Deploying Deep Learning Networks to Embedded GPUs and CPUs€¦ · Deep Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature

21

With GPU Coder, MATLAB is faster than other frameworksIm

ages / S

ec

AlexNet ResNet-50 VGG-16

2x

TensorFlow

MXNet

GPU Coder

Single image prediction using Intel® Xeon® CPU - 3.6 GHz, NVIDIA libraries: CUDA8 - cuDNN 7

TensorFlow 1.6.0, MXNet 1.1.0, MATLAB 18a

Page 15: Deploying Deep Learning Networks to Embedded GPUs and CPUs€¦ · Deep Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature

22

Embedded GPU Benchmarking: Jetson TX2

60

62

64

66

68

70

72

74

76

0

200

400

600

800

1000

1200

1400

1600

1800

GPU Coder

(cuDNN v7)

GPU Coder

(TensorRT v3.0.4)Caffe

Images / s

ec

Mem

ory

Usage (

MB

)

GPU Coder

(cuDNN v7)

GPU Coder

(TensorRT v3.0.4)Caffe

Page 16: Deploying Deep Learning Networks to Embedded GPUs and CPUs€¦ · Deep Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature

23

Algorithm Design to Embedded Deployment Workflow

MATLAB algorithm

(functional reference)

Functional test1 Deployment

unit-test

2

Desktop

GPU

C++

Deployment

integration-test

3

Desktop

GPU

C++

Real-time test4

Embedded GPU

.mex .lib Cross-compiled

.lib

Build type

Call CUDA

from MATLAB

directly

Call CUDA from

(C++) hand-

coded main()

Call CUDA from (C++)

hand-coded main().

Page 17: Deploying Deep Learning Networks to Embedded GPUs and CPUs€¦ · Deep Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature

24

Demo: Alexnet Deployment with ‘mex’ Code Generation

Page 18: Deploying Deep Learning Networks to Embedded GPUs and CPUs€¦ · Deep Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature

25

Algorithm Design to Embedded Deployment on Tegra GPU

MATLAB algorithm

(functional reference)

Functional test1

(Test in MATLAB on host)

Deployment

unit-test

2

(Test generated code in

MATLAB on host + GPU)

Tesla

GPU

C++

Deployment

integration-test

3

(Test generated code within

C/C++ app on host + GPU)

Tesla

GPU

C++

Real-time test4

(Test generated code within

C/C++ app on Tegra target)

Tegra GPU

.mex .lib Cross-compiled

.lib

Build type

Call CUDA

from MATLAB

directly

Call CUDA from

(C++) hand-

coded main()

Call CUDA from (C++)

hand-coded main().

Cross-compiled on host

with Linaro toolchain

Page 19: Deploying Deep Learning Networks to Embedded GPUs and CPUs€¦ · Deep Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature

26

Alexnet Deployment to Tegra: Cross-Compiled with ‘lib’

Two small changes

1. Change build-type to ‘lib’

2. Select cross-compile toolchain

Page 20: Deploying Deep Learning Networks to Embedded GPUs and CPUs€¦ · Deep Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature

27

End-to-End Application: Lane Detection

Transfer Learning

Alexnet

Lane detection

CNN

Post-processing

(find left/right lane

points)Image

Image with

marked lanes

Left lane coefficients

Right lane coefficients

Output of CNN is lane parabola coefficients according to: y = ax^2 + bx + c

GPU coder generates code for whole application

Page 21: Deploying Deep Learning Networks to Embedded GPUs and CPUs€¦ · Deep Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature

35

Deep learning workflow in MATLAB

Train in MATLAB

Model

importer

Trained

DNN

Model

importer

Standalone

DeploymentDeep Neural Network

Design + TrainingApplication

design

NVIDIA TensorRT

cuDNN Libraries

ARM Compute

Library

Intel

MKL-DNN

LibraryApplication

logic