clCaffe*: Unleashing the Power of Intel Graphics for Deep Learning Acceleration

Jingyi Jin, Software Architect, Intel Corp.


2

clCaffe*: Unleashing the Power of Intel Graphics for Deep Learning Acceleration

Speaker:

• Jingyi Jin, Ph.D, Software Architect

• Visual & Parallel Computing Group, Intel Corp

Abstract:

• In this work, I present OpenCL™ acceleration of a well-known deep learning framework, Caffe*, focusing on the convolution layer, which has been optimized with three different approaches: GEMM, spatial domain, and frequency domain. This work, clCaffe, greatly enhances the ability to leverage deep learning use cases on all types of OpenCL™ devices, particularly on small form factor devices in which discrete GPUs are rare and integrated GPUs are far more common. We present performance results of clCaffe running on Intel Graphics. Our benchmark shows a 4.5x speedup for AlexNet on the ImageNet* 1K category dataset, and 4.0x for GoogleNet* classification, running on Intel Graphics in 5th or 6th generation Intel® Core™ processors, compared to the default CPU implementation in Caffe.

*Other names and brands may be claimed as the property of others.

3

Agenda

Background & motivation

clCaffe* framework

Development

Optimization

Results & Use case

Conclusion & future extension

*Other names and brands may be claimed as the property of others.

4

Neural Network

[Figure: a perceptron with inputs x1, x2, weights w1, w2, activation f, and output y; transition from a fully connected NN to a convolutional NN using a convolution kernel]
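To make the perceptron in the figure concrete, here is a minimal sketch (not from the original slides; the step activation and example values are illustrative): the output is the activation f applied to the weighted sum of the inputs.

```python
# Minimal perceptron sketch matching the figure: y = f(w1*x1 + w2*x2).
# The step activation and the example inputs/weights are illustrative assumptions.
def perceptron(x1, x2, w1, w2, f=lambda s: 1.0 if s > 0 else 0.0):
    return f(w1 * x1 + w2 * x2)

print(perceptron(x1=0.5, x2=1.0, w1=0.8, w2=-0.3))  # 0.4 - 0.3 = 0.1 > 0 -> 1.0
```

In a convolutional NN, the same idea applies, but the weights form a small convolution kernel that is slid across the input instead of connecting every input to every output.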

5

Convolutional Neural Network

[Figure: example AlexNet* topology - a stack of convolution layers (feature extraction) followed by fully connected (fc) layers (classification), producing the label "cat"]

ILSVRC: ImageNet Large Scale Visual Recognition Challenge
*Other names and brands may be claimed as the property of others.

6

ImageNet* Large Scale Visual Recognition Challenge (ILSVRC)

[Chart: ILSVRC classification error rate (%) by year - 28% (2010), 26% (2011), 16% (2012, AlexNet*), 12% (2013), 6.60% (2014), 3.57% (2015); human error rate: 5.1%]

*Other names and brands may be claimed as the property of others.

7

Motivation

Medical Image Analysis

Augmented Reality

Video Surveillance

Autonomous Driving

Military Combat and Tracking

Image-based Search Engine

Deep Learning

Now we build machines that can recognize!

8

Deep learning race

LeNet* (1998): digit recognition, 4 layers

AlexNet* (2012): 16.4% error rate, 8 layers

GoogleNet* (2014): 6.75% error rate, 22 layers

VGG-16 (2014): 7.5% error rate, 19 layers

ResNet* (2015): 3.57% error rate, 152 layers

[Chart: error rate (%) vs. number of primitive layers, 1998-2015]

Call for Intel: How to best support this burst in compute demand?

*Other names and brands may be claimed as the property of others.

9

Intel’s products for deep learning

Training | Scoring/classification

Example Products with Processor Graphics - the graphics architecture for many OEM DT, LT, 2:1, and tablet products:

Apple* Macbook* Pro 13’’
Apple* Macbook* Pro 15’’
Apple* iMac* 21.5’’
Asus* Zenbook* Infinity
Gigabyte* Brix* Pro
Zotac* ZBOX* EI730
Sony* Vaio* Tap 21
JD.com – Terran Force
Clevo* Niagara*
Microsoft* Surface Pro* 3
Asus* MeMO Pad* 7
Asus* Transformer Pad*
Lenovo* Miix* 2
Toshiba* Encore* 2 Tablet

*Other names and brands may be claimed as the property of others.

11

Example Chip Level Architecture: Intel® Core™ M

[Diagram: Intel® Processor Graphics Gen8 (graphics, compute, & media), two CPU cores, shared LLC, and a system agent on one die]

Key Takeaway: Intel® Processor Graphics are valuable compute resources in client platforms to be unleashed!

Many different processor products, with different processor graphics configurations

Multiple CPU cores, shared LLC, system agent

Multiple clock domains, target power where it’s needed

12

Chip Level Architecture: 4 CPU cores & Intel® Iris™ Pro Graphics (48 EUs) & EDRAM

13

Typical System Design

Phase I: train on servers
– Offline tuning of the model's weights
– Takes weeks or months

Phase II: classify on clients
– Deployment of the model with weights
– Typically real-time

Deep learning frameworks: Caffe*, Theano, Torch, TensorFlow, CNTK, …

*Other names and brands may be claimed as the property of others.

14

Caffe*

Open source framework for CNN

Written in C++, CUDA* for GPU, with command line, Python*, MATLAB* interfaces

Provides a complete toolkit for training, testing, benchmarking, fine-tuning and deploying models

Feature highlights

– Expressive: build nets through plaintext schemas, not code.

– Speedy: fast implementations of state-of-the-art modules.

– Modular: easy extension to new data formats and network layers.

– Open: common code and reference models for reproducibility.

– Wide test coverage: every module has an attached unit test.

– Large community: an active developer base and a large pool of users.

caffe.berkeleyvision.org

github.com/BVLC/caffe
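As a brief illustration of deploying a trained model through Caffe's Python interface (a hedged sketch: the file names, input shape, and output blob name are placeholders, and it assumes pycaffe is installed):

```python
import numpy as np
import caffe  # Caffe's Python interface (pycaffe)

caffe.set_mode_cpu()  # CPU backend; GPU backends depend on the build

# Load a deployed network: topology from a plaintext schema, weights from training.
# "deploy.prototxt" and "model.caffemodel" are placeholder file names.
net = caffe.Net('deploy.prototxt', 'model.caffemodel', caffe.TEST)

# Feed one preprocessed image; the 227x227 shape assumes an AlexNet-style data layer.
image = np.random.rand(1, 3, 227, 227).astype(np.float32)
net.blobs['data'].reshape(*image.shape)
net.blobs['data'].data[...] = image

out = net.forward()              # run classification
print(out['prob'].argmax())      # top-scoring class, assuming the output blob is named 'prob'
```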

*Other names and brands may be claimed as the property of others.

15

Caffe*

[Stack diagram: CNN Applications → CNN framework (Caffe) → CNN primitives library (cuDNN) → Language / Math library (C++ with ATLAS, OpenBLAS, or MKL BLAS on the CPU; CUDA* with cuBLAS and cuFFT on NVidia GPUs) → HW (CPU and NVidia GPUs)]

*Other names and brands may be claimed as the property of others.

16

clCaffe*

[Stack diagram: clCaffe* (Caffe* + OpenCL™) adds an OpenCL™ code path - C++/OpenCL with ViennaCL, clBLAS, and ISAAC - targeting Intel processor graphics (pGfx) and AMD GPUs, alongside the existing Caffe* code paths (MKL BLAS, ATLAS, or OpenBLAS on the CPU; CUDA* with cuBLAS, cuFFT, and cuDNN on NVidia GPUs)]

BLAS: Basic Linear Algebra Subprograms
*Other names and brands may be claimed as the property of others.
OpenCL and the OpenCL logo are trademarks of Apple Inc., used by permission by Khronos.

17

clCaffe* Development

1. Enabling OpenCL™ extension to Caffe*

2. CPU/pGfx memory synchronization

– Take advantage of the integrated SoC: zero-copy memory buffers (see the sketch after this list)

3. Implementation of primitive layers

4. Passing conformance tests

5. More testing

6. Performance optimization
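A minimal sketch of the zero-copy idea from step 2, using pyopencl (an illustration of the concept, not clCaffe's actual code; the alignment and size guidance is the usual recommendation for Intel® Processor Graphics and may vary by driver):

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

# On an integrated SoC, CPU and GPU share physical memory. For the driver to use
# the host allocation directly (zero-copy), the pointer is typically expected to
# be page (4096-byte) aligned and the size a multiple of a cache line.
host = np.zeros((1, 3, 227, 227), dtype=np.float32)

# CL_MEM_USE_HOST_PTR asks the runtime to back the buffer with the host allocation,
# so no explicit copy is needed between the CPU and the integrated GPU.
buf = cl.Buffer(ctx, mf.READ_WRITE | mf.USE_HOST_PTR, hostbuf=host)

# Host access then goes through map/unmap instead of read/write copies.
mapped, _ = cl.enqueue_map_buffer(queue, buf, cl.map_flags.WRITE,
                                  0, host.shape, host.dtype)
mapped[...] = 1.0   # fill the input in place
del mapped          # releasing the mapped array unmaps the buffer
```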

*Other names and brands may be claimed as the property of others.
OpenCL and the OpenCL logo are trademarks of Apple Inc., used by permission by Khronos.

18

clCaffe* initial profiling

[Chart: profiling breakdown of layer execution time - convolution dominates the runtime]

Convolution approaches:

GEMM (General Matrix Multiply) based

Spatial domain based

FFT (Fast Fourier Transform) based

Optimization

Optimizing convolution is the key!

*Other names and brands may be claimed as the property of others.
OpenCL and the OpenCL logo are trademarks of Apple Inc., used by permission by Khronos.

19

GEMM based convolution

Flatten the input data and kernels, then solve the convolution as a matrix multiplication problem

<Image source: http://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/>

Step 1: data flattening (im2col)

Step 2: matrix multiply - usually mapped into a BLAS (Basic Linear Algebra Subprogram) call

Step 3: data unflattening (col2im)
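A compact NumPy sketch of the GEMM approach (an illustration under simplified assumptions - single image, stride 1, no padding - not clCaffe's optimized OpenCL™ path):

```python
import numpy as np

def im2col(x, kh, kw):
    """Flatten all kh x kw patches of x (C, H, W) into columns: (C*kh*kw, OH*OW)."""
    c, h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, oh * ow), dtype=x.dtype)
    idx = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                cols[idx] = x[ci, i:i + oh, j:j + ow].reshape(-1)
                idx += 1
    return cols, oh, ow

def conv_gemm(x, weights):
    """weights: (num_output, C, kh, kw). Convolution as one matrix multiply."""
    m, c, kh, kw = weights.shape
    cols, oh, ow = im2col(x, kh, kw)         # step 1: data flattening
    out = weights.reshape(m, -1) @ cols      # step 2: GEMM (a BLAS sgemm call)
    return out.reshape(m, oh, ow)            # reshape to the output feature map

x = np.random.rand(3, 8, 8).astype(np.float32)
w = np.random.rand(4, 3, 3, 3).astype(np.float32)
print(conv_gemm(x, w).shape)   # (4, 6, 6)
```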

20

Spatial domain convolution

Direct application of convolution in the spatial domain: a dot product of the input window with the convolution kernel at each output position

<Image source: https://developer.apple.com/library/ios/documentation/Performance/Conceptual/vImage/Art/kernel_convolution.jpg>
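For comparison, a direct spatial-domain sketch of the same operation (naive Python loops for clarity; clCaffe's spatial path is hand-optimized, auto-tuned OpenCL™ code, not this):

```python
import numpy as np

def conv_spatial(x, weights):
    """Naive direct convolution. x: (C, H, W), weights: (M, C, kh, kw), stride 1, no padding."""
    m, c, kh, kw = weights.shape
    _, h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    out = np.zeros((m, oh, ow), dtype=x.dtype)
    for mi in range(m):
        for oi in range(oh):
            for oj in range(ow):
                # Dot product of the input window with the kernel.
                out[mi, oi, oj] = np.sum(x[:, oi:oi + kh, oj:oj + kw] * weights[mi])
    return out

x = np.random.rand(3, 8, 8).astype(np.float32)
w = np.random.rand(4, 3, 3, 3).astype(np.float32)
print(conv_spatial(x, w).shape)   # (4, 6, 6), same result as the GEMM version
```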

21

FFT based convolution

Convert the input into the Fourier domain and apply element-wise multiplication, reducing complexity from O(N²K²) to O(N²log₂N), where N is the data size and K is the kernel size.

[Figure: FFT convolution flow for AlexNet's first layer - the input (227x227) and the kernel (11x11) are zero-padded to 256x256 in the spatial domain, transformed by the FFT (256x258 packed real-to-complex layout), multiplied element-wise in the frequency domain, summed, and transformed back with an inverse FFT to produce the 55x55 output]
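A NumPy sketch of the frequency-domain idea for a single channel (stride 1, "valid" output region, real FFTs; a simplification of the flow in the figure, not clCaffe's implementation):

```python
import numpy as np

def conv_fft(x, k):
    """Single-channel 2-D convolution via FFT. x: (H, W), k: (kh, kw)."""
    h, w = x.shape
    kh, kw = k.shape
    fh, fw = h + kh - 1, w + kw - 1          # linear-convolution size (zero padding)

    # Element-wise multiply in the frequency domain == convolution in space.
    fx = np.fft.rfft2(x, s=(fh, fw))
    fk = np.fft.rfft2(k, s=(fh, fw))
    full = np.fft.irfft2(fx * fk, s=(fh, fw))

    # Keep the "valid" region so the result matches direct convolution.
    return full[kh - 1:h, kw - 1:w]

x = np.random.rand(16, 16).astype(np.float32)
k = np.random.rand(3, 3).astype(np.float32)

# Cross-check against a direct spatial computation (kernel flipped, since the FFT
# implements true convolution while CNN layers typically compute cross-correlation).
direct = np.zeros((14, 14))
kf = k[::-1, ::-1]
for i in range(14):
    for j in range(14):
        direct[i, j] = np.sum(x[i:i + 3, j:j + 3] * kf)
print(np.allclose(conv_fft(x, k), direct))   # True
```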

22

Analysis of Convolution Approaches

1. GEMM (default)

Pros:
– Generic and stable
– Easy to implement (the problem is mapped into a BLAS call)
– Optimized solution if a good BLAS is provided

Cons:
– Additional memory to store the intermediate data
– Relies heavily on an optimized BLAS

2. Spatial domain

Pros:
– Avoids the additional memory copy
– Fast with optimized code

Cons:
– Relies on individually optimized kernels for the given parameters, or even the given HW architecture

3. FFT domain

Pros:
– Lower computational complexity

Cons:
– Additional memory to store the FFT data
– Large overhead for small kernel sizes or large strides

23

Spatial Convolution Auto-tuning

Performed the first time a convolution is called on the machine, and cached for future use

Finds the optimal kernel parameters and instantiates the fastest OpenCL™ kernel (see the sketch below)
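A schematic of the auto-tuning idea (a pure-Python sketch of the concept - time each candidate configuration once and cache the winner keyed by the layer shape; the candidate parameters, cache format, and dummy workload are illustrative, not clCaffe's actual tuner):

```python
import time

_tuning_cache = {}   # layer signature -> best candidate parameters

def autotune(signature, candidates, run):
    """Benchmark each candidate config once and cache the fastest for this signature."""
    if signature in _tuning_cache:
        return _tuning_cache[signature]
    best, best_t = None, float("inf")
    for params in candidates:
        t0 = time.perf_counter()
        run(params)                      # e.g. enqueue the OpenCL kernel built with `params`
        t = time.perf_counter() - t0
        if t < best_t:
            best, best_t = params, t
    _tuning_cache[signature] = best      # clCaffe persists the choice; here it is in-memory only
    return best

# Illustrative use: pick a work-group/blocking configuration for a conv layer shape.
signature = ("conv1", 3, 227, 227, 96, 11, 4)       # (name, C, H, W, M, kernel, stride)
candidates = [{"block": (8, 8)}, {"block": (16, 4)}, {"block": (4, 16)}]
best = autotune(signature, candidates, run=lambda p: sum(range(10000)))  # dummy workload
print(best)
```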

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos

25

Hardware Configuration

5th Generation Intel® Core™ processor system:
– CPU: Intel® Xeon® CPU E3-1200 v4 @ 3.40GHz
– GPU: Intel® Iris™ Pro 6200 (48 EUs)
– OS: CentOS* 7.1, kernel 3.10.0-229
– OpenCL™: OpenCL Linux driver

14nm Intel® Atom™ processor system:
– CPU: Intel® Atom™ x7 processor @ 1.60GHz
– GPU: Intel® HD Graphics (16 EUs)
– OS: Windows* 10
– OpenCL™: OpenCL Windows driver

Copyright © 2016, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
OpenCL and the OpenCL logo are trademarks of Apple Inc., used by permission by Khronos.

26

clCaffe* on 5th Generation Intel® Core™ Processors - AlexNet* classification

[Chart: images/sec, higher is better]

Caffe* on CPU with different BLAS libraries for GEMM convolution:
– CPU + ATLAS: 8
– CPU + OpenBLAS: 10
– CPU + MKL: 65

clCaffe* on Intel GEN with different convolution approaches:
– Spatial Convolution: 290
– FFT Convolution: 60
– GEMM Convolution: 89

AlexNet* benchmark: forward only, batch size = 256. Experiment system: 5th Gen Intel® Core™ Processor 4+3e with Intel® Iris™ Pro 6200 (GT3e).

Copyright © 2016, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

27

clCaffe* on 5th Generation Intel® Core™ Processors - AlexNet* training

[Chart: images/sec, higher is better]

Caffe* on CPU with different BLAS libraries for GEMM convolution:
– CPU + ATLAS: 4
– CPU + OpenBLAS: 5
– CPU + MKL: 28

clCaffe* on Intel GEN with different convolution approaches:
– Spatial Convolution: 56
– FFT Convolution: 19
– GEMM Convolution: 28

AlexNet* benchmark: forward only, batch size = 256. Experiment system: 5th Gen Intel® Core™ Processor 4+3e with Intel® Iris™ Pro 6200 (GT3e).

Copyright © 2016, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

28

Other Topologies - classification

clCaffe* on 5th Gen Intel® Core™ Processor GT3e using spatial convolution

[Chart: img/sec, higher is better]
– Overfeat: 91
– VGG-A: 55
– GoogLeNet: 77
– AlexNet: 290

Copyright © 2016, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

29

clCaffe* on 14nm Intel® Atom™ processor - AlexNet* classification

[Chart: img/sec, higher is better]

Caffe* on CPU with MKL BLAS for GEMM convolution:
– CPU + MKL: 6

clCaffe* on Intel GEN with different convolution approaches:
– Spatial Convolution: 50
– GEMM Convolution: 17

Copyright © 2016, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

30

Conclusion

Intel not only provides the silicon, but also builds a SW ecosystem around its HW to support deep learning

clCaffe* is an optimized, user-friendly DL solution on Intel® Processor Graphics

clCaffe* delivers a 4.5x – 8.3x speedup over the default CPU implementation on the same system for classification based on AlexNet* (290 vs. 65 images/sec on the Intel® Iris™ Pro 6200 system; 50 vs. 6 images/sec on the Intel® Atom™ system)

Intel® Processor Graphics is a valuable compute resource to be unleashed on client platforms

*Other names and brands may be claimed as the property of others.

31

clCaffe* release status

Handed over to the open source team at Intel

Externally available: https://github.com/01org/caffe/wiki/clCaffe

Progressively optimized

Further optimization plan:

GEMM convolution (for back propagation)

Winograd convolution

Call for trial and open source contribution!

*Other names and brands may be claimed as the property of others.

32

clCaffe*

[Stack diagram (recap): clCaffe (Caffe* + OpenCL™) adds an OpenCL™ code path - C++/OpenCL™ with ViennaCL, clBLAS, and ISAAC - targeting Intel processor graphics (pGfx) and AMD GPUs, alongside the existing Caffe* code paths (MKL BLAS, ATLAS, or OpenBLAS on the CPU; CUDA* with cuBLAS, cuFFT, and cuDNN on NVidia GPUs)]

*Other names and brands may be claimed as the property of others.
OpenCL and the OpenCL logo are trademarks of Apple Inc., used by permission by Khronos.

33

Future extension

[Stack diagram: the Caffe* framework extended to additional Intel hardware - alongside the existing CUDA* path (cuBLAS, cuFFT, cuDNN on Nvidia* GPUs) and the CPU BLAS path (MKL BLAS, ATLAS, OpenBLAS), the OpenCL™ path (C++/OpenCL™ with ViennaCL, clBLAS, ISAAC) and Intel® MKL-DNN target Intel processor graphics (pGfx) and Intel FPGA]

*Other names and brands may be claimed as the property of others.
OpenCL and the OpenCL logo are trademarks of Apple Inc., used by permission by Khronos.

34

Future extension

[Stack diagram: extending beyond Caffe* - CNN frameworks such as TensorFlow*, Caffe*, Torch*, and others sit on top of Intel® MKL-DNN and the OpenCL™ path (C++/OpenCL™ with MKL BLAS, ViennaCL, ISAAC), targeting the CPU, Intel processor graphics (pGfx), and Intel FPGA]

Copyright © 2016, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.

35

Reference

clCaffe*: OpenCL™ Accelerated Caffe* for Convolutional Neural Networks. J. Bottleson, S. Kim, J. Andrews, P. Bindu, D. N. Murthy, J. Jin. 25th International Heterogeneity in Computing Workshop, 2016.

Caffe* OpenCL™ branch: https://github.com/BVLC/caffe/tree/opencl

clCaffe* wiki: https://github.com/01org/caffe/wiki/clCaffe

Intel® MKL-DNN tech preview: https://software.intel.com/en-us/articles/deep-neural-network-technical-preview-for-intel-math-kernel-library-intel-mkl

Intel® Processor Graphics: https://software.intel.com/sites/default/files/Compute%20Architecture%20of%20Intel%20Processor%20Graphics%20Gen8.pdf

Copyright © 2016, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
OpenCL and the OpenCL logo are trademarks of Apple Inc., used by permission by Khronos.

Legal Notices and Disclaimers

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.

All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

OpenCL and the OpenCL logo are trademarks of Apple Inc., used by permission by Khronos.

© 2016 Intel Corporation. Intel, the Intel logo, and others are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.