November 2016 - Intel · Configuration Info - Versions: Intel® Data Analytics Acceleration Library 2016 U2, scikit-learn 0.16.1; Hardware: Intel Xeon E5-2680 v3 @ 2.50GHz, 24 cores,

November 2016

Copyright © 2016, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Optimization Notice2

Machine Learning Software Challenges

Machine learning open source frameworks and libraries are often not well-optimized for newer Intel-based systems

Frameworks can be difficult to configure and use

Need to target heterogeneous hardware from model training in the datacenter to deployment on endpoint systems Datacenter Endpoint



strategy provides the

foundation for success using AI

Machine Learning: Your Path to Deeper Insight

Intel® Math Kernel Library (Intel® MKL &

MKL-DNN)

Intel® Data Analytics Acceleration Library

(Intel® DAAL)

Trusted Analytics Platform

+Network

+Memory

+StorageDatacenter Endpoint

Solutionsfor reference across industries

Tools/Platformsto accelerate deployment

Optimized Frameworksto simplify development

Libraries/Languagesfeaturing optimized building blocks

Hardware Technologyportfolio that is broad and cross-

compatible

Driving increasing innovation and competitive advantage across industries

Intel® Deep Learning SDK for Training &

Deployment

Intel® Distribution for Python*

http://deeplearning.net/software/theano/




• Overview of Intel® Math Kernel Library (Intel® MKL)

• Neural network primitives in Intel MKL 2017 for deep learning

• Introduction to open source Intel MKL-DNN

• Intel® Data Analytics Acceleration Library (Intel® DAAL)

• Example using Intel DAAL: handwritten digit recognition

• Introduction to Intel® Deep Learning SDK

Agenda



Highly optimized threaded math routines

Performance, Performance, Performance!

Industry’s leading math library

Widely used in science, engineering, data processing

Tuned for Intel® processors – current and next generation

Intel® Math Kernel Library (Intel® MKL) Introduction

More math library users depend on MKL than any other library

EDC North America Development Survey

2016, Volume I

Be multiprocessor aware

• Cross-Platform Support

• Be vectorised , threaded, and distributed multiprocessor aware

Intel Engineering Algorithm Experts

Optimized code path dispatch auto

Intel®CompatibleProcessors

Past Intel®Processors

Current Intel®

Processors

Future Intel®

Processors


Optimization Notice

Intel MKL unleashes the performance benefits of Intel architectures

0

50

100

150

200

64 80 96 104 112 120 128 144 160 176 192 200 208 224 240 256 384Pe

rfo

rma

nce

(G

Flo

ps)

Matrix size (M = 10000, N = 6000, K = 64,80,96, …, 384)

Intel® Core™ Processor i7-4770K

Intel MKL - 1 thread Intel MKL - 2 threads Intel MKL - 4 threadsATLAS - 1 thread ATLAS - 2 threads ATLAS - 4 threads

0

500

1000

1500

256 300 450 800 1000 1500 2000 3000 4000 5000 6000 7000 8000

Pe

rfo

rma

nce

(G

Flo

ps)

Matrix size (M = N)

Intel® Xeon® Processor E5-2699 v3

Intel MKL - 1 thread Intel MKL - 18 threads Intel MKL - 36 threadsATLAS - 1 thread ATLAS - 18 threads ATLAS - 36 threads

Configuration Info - Versions: Intel® Math Kernel Library (Intel® MKL) 11.3, ATLAS* 3.10.2; Hardware: Intel® Xeon® Processor E5-2699v3, 2 Eighteen-core CPUs (45MB LLC, 2.3GHz), 64GB of RAM; Intel® Core™ Processor i7-4770K, Quad-core CPU (8MB LLC, 3.5GHz), 8GB of RAM; Operating System: RHEL 6.4 GA x86_64;

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. * Other brands and names are the property of their respective owners. Benchmark Source: Intel Corporation

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 .

DGEMM Performance Boost by using Intel® MKL vs. ATLAS*

7



0

200

400

600

800

1000

1200

1400

Pe

rfo

rma

nce

(G

Flo

ps)

Matrix size (M = N)

Intel MKL - 1 thread Intel MKL - 22 threads

Intel MKL - 44 threads

DGEMM PerformanceOn Intel® Xeon® Processor E5-2699 v4

Configuration Info - Versions: Intel® Math Kernel Library (Intel® MKL) 2017; Hardware:Hardware: Intel® Xeon® Processor E5-2699 v4, 2 Twenty-two-core CPU (55MB smart cache, 2.2GHz), 64GB of RAM; Intel® Xeon Phi™ Processor 7250, 68 cores (34MB L2 cache, 1.4GHz), 96 GB of DDR4 RAM and 16 GB MCDRAM; Operating System: RHEL 7.2 GA x86_64;



0

500

1000

1500

2000

2500

Pe

rfo

rma

nce

(G

Flo

ps)

Matrix size (M = N)

Intel MKL - 16 threads Intel MKL - 34 threads

Intel MKL - 68 threads

DGEMM PerformanceOn Intel® Xeon Phi™ Processor 7250

Intel MKL 2017 Performance



Components of Intel MKL 2017

Linear Algebra

• BLAS• LAPACK• ScaLAPACK• Sparse BLAS• Sparse Solvers• Iterative • PARDISO*• Cluster Sparse

Solver

Fast Fourier Transforms

• Multidimensional• FFTW interfaces• Cluster FFT

Vector Math

• Trigonometric• Hyperbolic • Exponential• Log• Power• Root• Vector RNGs

Summary Statistics

• Kurtosis• Variation

coefficient• Order statistics• Min/max• Variance-

covariance

And More…

• Splines• Interpolation• Trust Region• Fast Poisson

Solver

Deep Neural Networks

• Convolution• Pooling• Normalization• ReLU• Softmax

New


Optimization Notice

Xeon Xeon Phi FPGA

Deep Learning FrameworksCaffe

Intel® MKL-DNN

Intel® Math Kernel Library

(Intel® MKL)

* GEMM matrix multiply building blocks are binary

Intel® Math Kernel Library and Intel® MKL-DNN for Deep Learning Framework Optimization

Intel® MKL Intel® MKL-DNN

DNN primitives + wide variety of other math

functions

DNN primitives

C DNN APIs C/C++ DNN APIs

Binary distribution Open source DNN code*

Free community license. Premium support

available as part of Parallel Studio XE

Apache 2.0 license

Broad usage DNN primitives; not specific to

individual frameworks

Multiple variants of DNN primitives as required for framework integrations

Quarterly update releasesRapid development ahead of Intel MKL

releases





Deep Neural Network (DNN) Primitives in Intel MKL 2017

Operations Algorithms

Activation ReLU

Normalization batch, local response

Pooling max, min, average

Convolutional fully connected, direct 3D batched convolution

Inner product forward/backward propagation of inner product computation

Data manipulation layout conversion, split, concat, sum, scale



DNN Primitives in Intel® MKL Highlights

A plain C API to be used in the existing DNN frameworks

Brings IA-optimized performance to popular image recognition topologies:

– AlexNet, Visual Geometry Group (VGG), GoogleNet, and ResNet

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance . *Other names and brands may be property of othersConfigurations: • 2 socket system with Intel® Xeon® Processor E5-2699 v4 (22 Cores, 2.2 GHz,), 128 GB memory, Red Hat* Enterprise Linux 6.7, BVLC Caffe, Intel Optimized Caffe framework, Intel® MKL 11.3.3, Intel® MKL 2017• Intel® Xeon Phi™ Processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM), 128 GB memory, Red Hat* Enterprise Linux 6.7, Intel® Optimized Caffe framework, Intel® MKL 2017All numbers measured without taking data manipulation into account.

http://www.intel.com/performance

https://github.com/BVLC/caffe

https://github.com/intelcaffe/caffe



Optimization Notice

Intel® Data Analytics Acceleration Library(Intel® DAAL)

An industry leading Intel® Architecture based data analytics acceleration library of fundamental algorithms covering all machine learning stages.

(De-)CompressionOutlier DetectionNormalization

PCAStatistical momentsVariance matrixPp-QR, SVD, CholeskyAprioriSorting

Ridge linear regressionNaïve BayesSVMClassifier boosting

KmeansEM GMM

Collaborative filtering

Neural Networks

Pre-processing Transformation Analysis Modeling Decision Making

Sci

en

tifi

c/E

ng

ine

eri

ng

We

b/S

oci

al

Bu

sin

ess

Validation

14



Big data frameworks: Hadoop, Spark, Cassandra, etc.

SQ

L

sto

res

No

SQ

L

sto

res

In-m

em

ory

sto

res

Co

nn

ecto

rs

Data

mining

Recommendation

engines

Customer

behavior

modeling

BI

analytics

Real time

analytics

Big data analyticsCurrent common practice

• Limited performance

• Many layers of dependencies

• Low ROI on HW investment

• Run on state-of-art hardware

• Built with a patchwork of math libs

• Under-exploiting hardware performance features

Data sources

Finance

Social

media

Marketing

IoT

Mfg

…

Problem Statement

Spark* MLLib

Breeze

Netlib-Java JVM

JNI Netlib BLAS




SQ

L

sto

res

No

SQ

L

sto

res

In-m

em

ory

sto

res

Co

nn

ecto

rs

Data

mining

Recommendation

engines

Customer

behavior

modeling

BI

analytics

Real time

analytics

Big data analyticsDesired practice • Optimized

performance• Simpler

development & deployment

• High ROI on HW investment

• Run on state-of-art hardware

• Single library to cover all stages of data analytics

• Fully optimized for underlying hardware

Data sources

Finance

Social

media

Marketing

IoT

Mfg

…

Desired Solution




SQ

L

sto

res

No

SQ

L

sto

res

In-m

em

ory

sto

res

Co

nn

ecto

rs

Data

mining

Recommendation

engines

Customer

behavior

modeling

BI

analytics

Real time

analyticsData sources

Finance

Social

media

Marketing

IoT

Mfg

…

Where Intel DAAL Fits?

Intel® Data Analytics Acceleration Library

• Building end-to-end data applications• C++, Java, and Python APIs


Optimization Notice


Optimization Notice

Intel DAAL Components

Data Processing Algorithms

Optimized analytics building blocks for all data analysis stages, from data

acquisition to data mining and machine learning

Data Modeling

Data structures for model representation, and operations to

derive model-based predictions and conclusions

Data Management

Interfaces for data representation and access. Connectors to a variety of data sources and data formats, such HDFS,

SQL, CSV, KDB+, and user-defined data source/format

Data Sources Numeric Tables

Compression / Decompression

Serialization / Deserialization


Optimization Notice

Algorithms Batch Distributed Online

Descriptive statisticsLow order moments √ √ √

Quantiles/sorting √

Statistical relationships Correlation / Variance-Covariance √ √ √

(Cosine, Correlation) distance matrices √

Matrix decomposition

SVD √ √ √

Cholesky √

QR √ √ √

Regression Linear/ridge regression √ √ √

Classification

Multinomial Naïve Bayes √ √ √

SVM (two-class and multi-class) √

Boosting (Ada, Brown, Logit) √

Unsupervised learning

Association rules mining (Apriori) √

Anomaly detection (uni-/multi-variate) √

PCA √ √ √

KMeans √ √

EM for GMM √

Recommender systems ALS √ √

Deep learningFully connected, convolution,normalization, activation layers, model, NN, optimization solvers,

√

20



Processing modes in Intel DAAL

Distributed Processing

Online Processing

D1D2D3

R = F(R1,…,Rk)

Si+1 = T(Si,Di)Ri+1 = F(Si+1)

R1

Rk

D1

D2

Dk

R2 R

Si,Ri

Batch Processing

D1Dk-

1

Dk…

Append

R = F(D1,…,Dk)



Key Features in Intel DAAL 2017

• Neural Networks API

• Python API (a.k.a. PyDAAL)

– Easy installation through Anaconda

– Intel® Distribution for Python 2017

• Open source project on GitHub

Fork me on GitHub: https://github.com/01org/daal

https://github.com/01org/daal



Handwritten Digit Recognition



Handwritten Digit Recognition

Training multi-class SVM for 10 digits recognition.

3,823 pre-processed training data.

– available at http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

99.6% accuracy with 1,797 test data from the same data provider.Confusion matrix: 177.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 181.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 2.000 173.000 0.000 0.000 0.000 0.000 1.000 1.000 0.000 0.000 0.000 0.000 176.000 0.000 1.000 0.000 0.000 3.000 3.000 0.000 1.000 0.000 0.000 179.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 180.000 0.000 0.000 0.000 2.000 0.000 0.000 0.000 0.000 0.000 0.000 180.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 170.000 1.000 8.000 0.000 3.000 0.000 0.000 0.000 0.000 0.000 0.000 166.000 5.000 0.000 0.000 0.000 2.000 0.000 1.000 0.000 0.000 2.000 175.000 Average accuracy: 0.996 Error rate: 0.004 Micro precision: 0.978 Micro recall: 0.978 Micro F-score: 0.978 Macro precision: 0.978 Macro recall: 0.978 Macro F-score: 0.978

http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits


Optimization Notice

Training Handwritten Digitsvoid trainModel() { /* Initialize FileDataSource<CSVFeatureManager> to retrieve input data from .csv file */ FileDataSource<CSVFeatureManager> trainDataSource(trainDatasetFileName, DataSource::doAllocateNumericTable, DataSource::doDictionaryFromContext); /* Load data from the data files */ trainDataSource.loadDataBlock(nTrainObservations); /* Create algorithm object for multi-class SVM training */ multi_class_classifier::training::Batch<> algorithm; algorithm.parameter.nClasses = nClasses; algorithm.parameter.training = training; /* Pass training dataset and dependent values to the algorithm */ algorithm.input.set(classifier::training::data,trainDataSource.getNumericTable()); /* Build multi-class SVM model */ algorithm.compute(); /* Retrieve algorithm results */ trainingResult = algorithm.getResult(); /* Serialize the learned model into a disk file */ ModelFileWriter writer("./model"); writer.serializeToFile(trainingResult->get(classifier::training::model)); }

25

Create a numeric table

Create an alg. Obj.

Set input and parameters

Compute

Serialize the learned model


Optimization Notice

Handwritten Digit Prediction

26

void testDigit() { /* Initialize FileDataSource<CSVFeatureManager> to retrieve the test data from .csv file */ FileDataSource<CSVFeatureManager> testDataSource(testDatasetFileName, DataSource::doAllocateNumericTable, DataSource::doDictionaryFromContext); testDataSource.loadDataBlock(1); /* Create algorithm object for prediction of multi-class SVM values */ multi_class_classifier::prediction::Batch<> algorithm; algorithm.parameter.prediction = prediction; /* Deserialize a model from a disk file */ ModelFileReader reader("./model"); services::SharedPtr<multi_class_classifier::Model> pModel(new multi_class_classifier::Model()); reader.deserializeFromFile(pModel); /* Pass testing dataset and trained model to the algorithm */ algorithm.input.set(classifier::prediction::data, testDataSource.getNumericTable()); algorithm.input.set(classifier::prediction::model, pModel); /* Predict multi-class SVM values */ algorithm.compute(); /* Retrieve algorithm results */ predictionResult = algorithm.getResult(); /* Retrieve predicted labels */ predictedLabels = predictionResult->get(classifier::prediction::prediction); }

Deserializelearned model


Optimization Notice

SVM Performance Boosts Using Intel® DAAL vs. Scikit-learnon Intel® CPU

27

Configuration Info - Versions: Intel® Data Analytics Acceleration Library 2016 U2, scikit-learn 0.16.1; Hardware: Intel Xeon E5-2680 v3 @ 2.50GHz, 24 cores, 30 MB L3 cache per CPU, 256 GB RAM; Operating System: Red Hat Enterprise Linux Server release 6.6, 64-bit.



0

2

4

6

8

10

12

Training Prediction

Sp

ee

du

p

11.25X

6.77X



Intel® DAAL+ Intel® MKL = Complementary Big Data Libraries Solution

Intel MKL Intel DAAL

C and Fortran APIPrimitive level

Python, Java & C++ API High-level

Processing of homogeneous data in single or double precision

Processing heterogeneous data (mix of integers and floating point), internal conversions are hidden in the library

Type of intermediate computations is defined by type of input data (in some library domains higher precision can be used)

Type of intermediate computations can be configured independently of the type of input data

Most of MKL supports batch computation modeonly

3 computation modes: Batch, streaming and distributed

Cluster functionality uses MPI internally Developer chooses communication method for distributed computation (e.g. Spark, MPI, etc.) Code samples provided.

30

Intel Deep Learning Software Stack

*Other names and brands may be claimed as property of others.

Xeon Xeon Phi FPGA

Intel libraries as path to bring optimized ML/DL frameworks to Intel hardware

Intel MKL extracts max Intel hardware performance and provide common interface to all Intel accelerators

Intel DAAL provides optimized building blocks for machine learning applications on Intel platforms

Intel Optimized Frameworks

Intel® Deep Learning SDK – free tools to accelerate design, training and deployment of deep networks

Intel® Deep Learning SDK

Intel® Data Analytics Acceleration

Library (Intel® DAAL)

Intel® Math Kernal Library

(Intel® MKL)

Intel optimized frameworks – single-node and multi-node CPU optimizations of the most widely used deep learning frameworks

Caffe ** * *

*



31

Intel® Deep Learning SDK

Deep Learning Training Tool Deep Learning Deployment Tool

• IMPORT Trained Model (trained on Intel or 3rd Party HW)

• COMPRESS Model for Inference on Target Intel HW

• GENERATE faster inference with HW-Specific runtime

• INTEGRATE with System SW / Application Stack & TUNE

• EVALUATE Results and ITERATE

• INSTALL / SELECT IA-Optimized Frameworks

• PREPARE / CREATE Dataset with Ground-truth

• DESIGN / TRAIN Model(s) with IA-Opt. Hyper-Parameters

• MONITOR Training Progress across Candidate Models

• EVALUATE Results and ITERATE

configure_nn(fpga/cve,…)

allocate_buffer(…)

fpga_conv(input,output);

fpga_conv(…);

mkl_SoftMax(…);

mkl_SoftMax(…);

…

Technical preview available at software.intel.com/deep-learning-sdk

http://software.intel.com/deep-learning-sdk

Increased Productivity

Faster Time-to-market for training and inference,

Improve model accuracy, Reduce total cost of

ownership

Maximum Performance

Optimized performance for training and inference on

Intel® Architecture

Intel® Deep Learning SDK BenefitsAccelerate Your Deep Learning Solution

“Plug & Train/Deploy”

Simplify installation & preparation of deep learning models using popular deep

learning frameworks on Intel hardware

32



Summary

• Intel MKL and Intel DAAL provide high performance and optimized building blocks for data analytics and machine learning algorithms on Intel platforms.

• Deep learning applications can benefit from DNN primitives in Intel MKL 2017 and Neural Networks API in Intel DAAL 2017

• Intel Deep Learning SDK offers tools to facilitate develop, train, and deploy deep learning solutions.



Resources

Intel machine learning strategy

https://software.intel.com/en-us/machine-learning

Intel MKL available in Intel® Parallel Studio XE and Intel® System Studio

https://software.intel.com/en-us/intel-mkl-support

Intel DAAL available as standalone, in Intel® Parallel Studio XE, as well as open source

https://software.intel.com/en-us/intel-daal-support


Free community License for Intel libraries

https://registrationcenter.intel.com/en/forms/?productid=2558&licensetype=2

Caffe* enabled by Intel MKL


https://software.intel.com/en-us/machine-learning


https://software.intel.com/en-us/intel-daal-support


https://registrationcenter.intel.com/en/forms/?productid=2558&licensetype=2



Optimization Notice

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance ofthat product when combined with other products.

Copyright © 2016, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

3535

Documents

November 2016 - Intel · Configuration Info - Versions: Intel® Data Analytics Acceleration Library 2016 U2, scikit-learn 0.16.1; Hardware: Intel Xeon E5-2680 v3 @ 2.50GHz, 24 cores,