clCaffe*: Unleashing the Power of Intel Graphics for Deep Learning Acceleration Speaker:
• Jingyi Jin, Ph.D, Software Architect
• Visual & Parallel Computing Group, Intel Corp
Abstract: In this work, I present OpenCL™ acceleration of a well-known deep learning framework, Caffe*, focusing on the convolution layer, which has been optimized with three different approaches: GEMM, spatial domain, and frequency domain. This work, clCaffe, greatly enhances the ability to leverage deep learning use cases on all types of OpenCL™ devices, particularly on small form factor devices in which discrete GPUs are rare and integrated GPUs are far more common. We present performance results of clCaffe running on Intel Graphics. Our benchmark shows a 4.5x speedup on Intel Graphics, compared to the default CPU implementation in Caffe, for AlexNet on the ImageNet* 1K-category dataset, and a 4.0x speedup for GoogleNet* classification, on 5th and 6th generation Intel® Core™ processors.
*Other names and brands may be claimed as the property of others.
Agenda
Background & motivation
clCaffe* framework
Development
Optimization
Results & Use case
Conclusion & future extension
Neural Network
[Figure: a perceptron, with inputs x1, x2 weighted by w1, w2 producing output y]
From a fully connected NN to a convolutional NN (convolution kernel)
Convolutional Neural Network
Example AlexNet* topology: a series of convolution layers (feature extraction) followed by fully connected layers (classification), e.g. classifying an image as "cat".
ILSVRC: ImageNet Large Scale Visual Recognition Challenge
ImageNet* Large Scale Visual Recognition Challenge (ILSVRC)
[Chart: ILSVRC classification error rate (%) by year. 2010: 28%, 2011: 26%, 2012: 16% (AlexNet*), 2013: 12%, 2014: 6.60%, 2015: 3.57%. Human error rate: 5.1%.]
Motivation
Medical Image Analysis
Augmented Reality
Video Surveillance
Autonomous Driving
Military Combat and Tracking
Image-based Search Engine
Deep Learning
Now we build machines that can recognize!
Deep learning race
[Chart: error rate (%) vs. number of primitive layers]
1998: LeNet*, digit recognition, 4 layers
2012: AlexNet*, 16.4% error rate, 8 layers
2014: VGG-16, 7.5% error rate, 19 layers
2014: GoogleNet*, 6.75% error rate, 22 layers
2015: ResNet*, 3.57% error rate, 152 layers
A call for Intel: how to best support this burst in compute demand?
Example Products with Processor Graphics
The graphics architecture for many OEM desktop, laptop, 2:1, and tablet products:
Apple* Macbook* Pro 13'', Apple* Macbook* Pro 15'', Apple* iMac* 21.5'', Asus* Zenbook* Infinity, Gigabyte* Brix* Pro, Zotac* ZBOX* EI730, Sony* Vaio* Tap 21, JD.com Terran Force, Clevo* Niagara*, Microsoft* Surface Pro* 3, Asus* MeMO Pad* 7, Asus* Transformer Pad*, Lenovo* Miix* 2, Toshiba* Encore* 2 Tablet
Example Chip Level Architecture: Intel® Core™ M
[Diagram: Intel® Processor Graphics Gen8 (graphics, compute & media), multiple CPU cores, shared LLC, and system agent on one die]
Many different processor products, with different processor graphics configurations
Multiple CPU cores, shared LLC, system agent
Multiple clock domains, target power where it's needed
Key takeaway: Intel® Processor Graphics is a valuable compute resource in client platforms to be unleashed!
Typical System Design
Phase I: train on servers. Offline tuning of the model's weights; takes weeks or months.
Phase II: classify on clients. Deployment of the model with weights; typically real-time.
Deep learning frameworks: Caffe*, Theano, Torch, TensorFlow, CNTK, …
Caffe*
Open source framework for CNNs
Written in C++ (CUDA* for GPU), with command line, Python*, and MATLAB* interfaces
Provides a complete toolkit for training, testing, benchmarking, fine-tuning, and deploying models
Feature highlights
– Expressive: build nets through plaintext schemas, not code.
– Speedy: fast implementations of state-of-the-art modules.
– Modular: easy extension to new data formats and network layers.
– Open: common code and reference models for reproducibility.
– Wide test coverage: every module has an attached unit test.
– Large community: active developers and a large pool of users.
caffe.berkeleyvision.org
github.com/BVLC/caffe
Caffe*
[Software stack diagram]
CNN Applications
CNN framework: Caffe
CNN primitives library: cuDNN
Language / Math library: C++ with ATLAS / OpenBLAS / MKL BLAS (CPU); CUDA* with cuBLAS / cuFFT (GPU)
HW: CPU; NVidia GPUs
clCaffe*
[Software stack diagram]
CNN Applications
CNN framework: clCaffe* (Caffe* + OpenCL™)
CNN primitives library: cuDNN
Language / Math library: C++ with ATLAS / OpenBLAS / MKL BLAS; CUDA* with cuBLAS / cuFFT; OpenCL™ with ViennaCL / clBLAS / ISAAC
HW: CPU; NVidia GPUs; Intel pGfx; AMD GPU; …
Code path enabled by clCaffe: OpenCL™ on Intel pGfx and AMD GPU. Existing code paths unchanged.
BLAS: Basic Linear Algebra Subprograms
clCaffe* Development
1. Enabling the OpenCL™ extension to Caffe*
2. CPU/pGfx memory synchronization
– Take advantage of the integrated SoC: zero-copy on memory buffers
3. Implementation of primitive layers
4. Passing conformance tests
5. More testing
6. Performance optimization
clCaffe* Initial Profiling
Profiling shows convolution dominates the runtime: optimizing convolution is the key!
Convolution approaches:
GEMM (General Matrix Multiply) based
Spatial domain based
FFT (Fast Fourier Transform) based
GEMM based convolution
Flatten the input data and kernels, and solve the convolution as a matrix multiplication problem.
<Image source: http://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/>
Step 1: data flattening (im2col)
Step 2: matrix multiply, usually mapped into a BLAS (Basic Linear Algebra Subprogram) call
Step 3: data unflattening (col2im)
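The three steps above can be sketched in a few lines of NumPy. This is an illustrative sketch only (the helper names `im2col` and `conv2d_gemm` are ours, not Caffe's; clCaffe's real im2col/col2im are C++/OpenCL kernels), assuming a single-channel input and stride 1:

```python
import numpy as np

def im2col(x, k):
    """Step 1: flatten each k x k patch of x into one column."""
    H, W = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((k * k, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + k, j:j + k].ravel()
    return cols

def conv2d_gemm(x, kernels):
    """GEMM-based convolution: im2col, one matrix multiply, reshape."""
    k = kernels.shape[-1]
    out_h = x.shape[0] - k + 1
    out_w = x.shape[1] - k + 1
    cols = im2col(x, k)                         # Step 1: data flattening
    flat_k = kernels.reshape(len(kernels), -1)  # each kernel becomes a row
    out = flat_k @ cols                         # Step 2: a single GEMM call
    return out.reshape(len(kernels), out_h, out_w)  # Step 3: un-flatten
```

Note how all the convolution arithmetic collapses into the single `flat_k @ cols` product, which is why a well-optimized BLAS matters so much for this approach, and why the `cols` buffer is the extra memory cost discussed below.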
Spatial domain convolution
Direct application of convolution in the spatial domain: dot product of the input with the convolution kernel at each position.
<Image source: https://developer.apple.com/library/ios/documentation/Performance/Conceptual/vImage/Art/kernel_convolution.jpg>
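The direct approach is just the sliding dot product described above. A minimal Python sketch (our illustrative code, not the clCaffe implementation, which uses hand-tuned OpenCL™ kernels):

```python
import numpy as np

def conv2d_direct(x, w):
    """Direct spatial convolution: slide the kernel over the input and take
    the dot product (sum of element-wise products) at each position."""
    k = w.shape[0]
    out_h = x.shape[0] - k + 1
    out_w = x.shape[1] - k + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w)
    return out
```

No intermediate buffer is needed, which is the memory advantage of this approach; the cost is that fast versions need kernels tuned per parameter set and hardware.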
FFT based convolution
Convert the input into the Fourier domain and apply element-wise multiplication, reducing complexity from O(N²K²) to O(N²log₂N), where N is the data size and K is the kernel size.
[Diagram: input In [227x227] and kernel W [11x11] are zero-padded in the spatial domain to In_padded and W_padded [256x256], transformed to FFT(In_padded) and FFT(W_padded) [256x258], multiplied element-wise in the frequency domain, and an inverse FFT produces the output Out [55x55]. In the spatial domain the same output is a sum of element-wise multiplications.]
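The pad / transform / multiply / inverse-transform pipeline above can be sketched with NumPy's FFT. This is an illustrative sketch under simplifying assumptions (square input, square kernel, stride 1, valid output); note that CNN "convolution" is really cross-correlation, so the kernel is pre-flipped to cancel the flip in the convolution theorem:

```python
import numpy as np

def conv2d_fft(x, w):
    """FFT-based valid cross-correlation, as used in CNN layers."""
    k = w.shape[0]
    size = x.shape[0]  # assume square input and square kernel
    # Zero-pad both input and kernel to a common size, as in the diagram
    fx = np.fft.rfft2(x, (size, size))
    fw = np.fft.rfft2(w[::-1, ::-1], (size, size))  # flip => correlation
    # Element-wise multiply in the frequency domain, then invert
    full = np.fft.irfft2(fx * fw, (size, size))
    # The first k-1 rows/cols suffer circular wraparound; crop to the
    # valid region, which is unaffected
    return full[k - 1:, k - 1:]
```

The padded FFT buffers (`fx`, `fw`) are the extra memory cost noted below, and for a small kernel like 11x11 inside a 256x256 transform the padding overhead is exactly why this approach loses to the spatial kernels in the benchmarks that follow.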
Analysis of Convolution Approaches
1. GEMM (default)
Pros:
• Generic and stable
• Easy to implement (the problem maps into a BLAS call)
• Optimized solution if a good BLAS is provided
Cons:
• Additional memory to store the intermediate data
• Relies heavily on an optimized BLAS
2. Spatial domain
Pros:
• Avoids the additional memory copy
• Speedy with optimized code
Cons:
• Relies on kernels individually optimized for the given parameters, or even the given HW architecture
3. FFT domain
Pros:
• Lower computational complexity
Cons:
• Additional memory to store FFT data
• Large overhead for small kernel sizes or large strides
Spatial Convolution Auto-tuning
Performed the first time a convolution is called on the machine, then cached for future use
Finds the optimal kernel parameters and instantiates the fastest OpenCL™ kernel
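The tuning flow can be sketched as follows. This is a simplified, hypothetical picture of the idea only (`autotune_conv` and `run_conv` are our names): clCaffe generates and benchmarks real OpenCL™ kernel variants and caches the winner for reuse.

```python
import itertools
import time

def autotune_conv(run_conv, candidate_params, _cache={}):
    """First-call auto-tuning sketch: time every candidate kernel
    configuration, remember the fastest, and reuse it afterwards.
    run_conv(params) must execute the convolution with that config.
    (_cache is a deliberate module-lifetime cache for this sketch.)"""
    key = tuple((name, tuple(vals))
                for name, vals in sorted(candidate_params.items()))
    if key in _cache:                     # already tuned on this machine
        return _cache[key]
    names = list(candidate_params)
    best, best_t = None, float("inf")
    # Try every combination of tunable parameters (e.g. workgroup sizes)
    for combo in itertools.product(*candidate_params.values()):
        params = dict(zip(names, combo))
        t0 = time.perf_counter()
        run_conv(params)                  # launch the candidate kernel
        t = time.perf_counter() - t0
        if t < best_t:
            best, best_t = params, t
    _cache[key] = best                    # cache for future calls
    return best
```

The first call pays the full sweep cost; every later call with the same candidate set returns the cached winner immediately, which matches the "performed once, then cached" behavior described above.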
Hardware Configuration
5th Generation Intel® Core™ processor: CPU Intel® Xeon® CPU E3-1200 v4 @ 3.40GHz; GPU Intel® Iris™ Pro 6200 with 48 execution units; OS CentOS* 7.1, kernel 3.10.0-229; OpenCL™ Linux driver
14nm Intel® Atom™ processor: CPU Intel® Atom™ x7 @ 1.60GHz; GPU Intel® HD Graphics with 16 execution units; OS Windows* 10; OpenCL™ Windows driver
Copyright © 2016, Intel Corporation. All rights reserved.
clCaffe* on 5th Generation Intel® Core™ Processors: AlexNet* classification
[Chart: images/sec, higher is better]
Caffe* on CPU with different BLAS libraries (GEMM convolution):
CPU + ATLAS: 8
CPU + OpenBLAS: 10
CPU + MKL: 65
clCaffe* on Intel GEN with different convolution approaches:
Spatial convolution: 290
FFT convolution: 60
GEMM convolution: 89
AlexNet benchmark: forward only, batch size = 256. Experiment system: 5th Gen Intel® Core™ Processor 4+3e with Intel® Iris™ Pro 6200 (GT3e).
clCaffe* on 5th Generation Intel® Core™ Processors: AlexNet* training
[Chart: images/sec, higher is better]
Caffe* on CPU with different BLAS libraries (GEMM convolution):
CPU + ATLAS: 4
CPU + OpenBLAS: 5
CPU + MKL: 28
clCaffe* on Intel GEN with different convolution approaches:
Spatial convolution: 56
FFT convolution: 19
GEMM convolution: 28
AlexNet benchmark: batch size = 256. Experiment system: 5th Gen Intel® Core™ Processor 4+3e with Intel® Iris™ Pro 6200 (GT3e).
Other Topologies: classification
clCaffe* on 5th Gen Intel® Core™ Processor GT3e using spatial convolution [images/sec, higher is better]:
Overfeat: 91
VGG-A: 55
GoogLeNet: 77
AlexNet: 290
clCaffe* on 14nm Intel® Atom™ processor: AlexNet* classification
[Chart: images/sec, higher is better]
Caffe* on CPU (GEMM convolution):
CPU + MKL: 6
clCaffe* on Intel GEN with different convolution approaches:
Spatial convolution: 50
GEMM convolution: 17
Conclusion
Intel not only provides the silicon, but also builds a software ecosystem around its hardware for deep learning support
clCaffe* is an optimized, user-friendly deep learning solution on Intel® Processor Graphics
clCaffe* delivers 4.5x to 8.3x speedups over the default CPU implementation on the same system for AlexNet*-based classification
Intel® Processor Graphics is a valuable compute resource to be unleashed on client platforms
clCaffe* release status
Handed over to Intel's open source team
Externally available: https://github.com/01org/caffe/wiki/clCaffe
Progressively optimized
Further optimization plans:
GEMM convolution (for back propagation)
Winograd convolution
We call for trials and open source contributions!
Future extension
[Software stack diagram]
CNN Applications
CNN framework: Caffe*
CNN primitives library: cuDNN; Intel® MKL-DNN
Language / Math library: C++ with ATLAS / OpenBLAS / MKL BLAS; CUDA* with cuBLAS / cuFFT; OpenCL™ with ViennaCL / clBLAS / ISAAC
HW: CPU; Nvidia* GPUs; Intel pGfx; Intel FPGA
Code path enabled by clCaffe: OpenCL™ on Intel pGfx and Intel FPGA. Existing code paths unchanged.
Future extension
[Software stack diagram]
CNN Applications
CNN frameworks: Caffe*, TensorFlow*, Torch*, …
CNN primitives library: Intel® MKL-DNN
Language / Math library: C++ with MKL BLAS; OpenCL™ with ViennaCL / ISAAC
HW: CPU; Intel pGfx; Intel FPGA; …
Code path enabled by clCaffe; existing code paths unchanged.
Reference
clCaffe*: OpenCL™ accelerated Caffe* for Convolutional Neural Networks. J. Bottleson, S. Kim, J. Andrews, P. Bindu, D. N. Murthy, J. Jin. 25th International Heterogeneity in Computing Workshop, 2016.
Caffe* OpenCL™ branch: https://github.com/BVLC/caffe/tree/opencl
clCaffe* wiki: https://github.com/01org/caffe/wiki/clCaffe
Intel® MKL-DNN tech preview: https://software.intel.com/en-us/articles/deep-neural-network-technical-preview-for-intel-math-kernel-library-intel-mkl
Intel® Processor Graphics: https://software.intel.com/sites/default/files/Compute%20Architecture%20of%20Intel%20Processor%20Graphics%20Gen8.pdf
Legal Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.
© 2016 Intel Corporation. Intel, the Intel logo, and others are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.