A Platform for Accelerating Machine Learning Applications
Ben Chandler, Hewlett Packard Labs
April 6th, 2016
Optimized HW/SW Platforms
HPE Big Data and HPC portfolio strategy
Design and deliver comprehensive solutions with purpose-built platforms
1. Innovate, design and deliver best-in-class hardware and software to support the foundational infrastructure needs of Big Data customers
2. Provide vertical solutions by building a software stack and partner ecosystem
3. Enable Advisory Services to help manage the customer's technology journey
Drive HPC and Big Data across all Enterprises
Modernize your datacenter for massively parallel processing innovation
Deliver automated intelligence, real-time insights and optimized performance
Optimized performance · Real-time insights · Automated intelligence
Extreme performance capabilities to process, manage and analyze data for I/O- and storage-intensive application workloads with high speed, scale and efficiency, while enabling flexibility for open infrastructure innovation.
Navigate the data-driven transformation journey across all enterprises with new HPC and Big Data capabilities that accelerate time-to-value for increased competitive differentiation.
− Deep Learning Innovation
− HPC Compute & Storage Solution
− HPE Vertica for SQL on Hadoop
− Integrity MC990 X for Database Processing
− Risk Compliant Archive Solution
− Trade & Match Server Solution
− HPC for Trader Workstation
Platforms: Apollo 6500, Apollo 4520, Apollo 2000, Apollo 4510, HPE Moonshot, Apollo 4000 Series
Deliver automated intelligence in real-time for Deep Learning
Unprecedented performance and scale with HPE Apollo 6500 high density GPU solution
Customer benefits
HPE Apollo 6500 is an ideal HPC and Deep Learning platform, providing unprecedented performance with 8 GPUs, a high-bandwidth fabric, and a configurable GPU topology to match deep learning workloads
− Up to 8 high powered GPUs per tray (node), 2P Intel E5-2600 v4 support
− Choice of high-speed, low latency fabrics with 2x IO expansion
− Workload optimized using flexible configuration capabilities
Use Cases
− Video, image, text, audio, and time-series pattern recognition
− Large, highly complex, unstructured simulation and modeling
− Real-time and near real-time analytics
− Faster model training time, better fusion of data*
Transform to a hybrid infrastructure · Enable workplace productivity · Protect your digital enterprise · Empower a data-driven organization
Automated Intelligence delivered by HPE Apollo 6500 and Deep Learning software solutions
* Benchmarking results provided at or shortly after announcement
HPE Apollo 6500 solution innovation
System Design Innovation to maximize GPU capacity and performance with lower TCO
New technologies, products
− HPE Apollo 6500: dense GPU server optimized for Deep Learning and HPC workloads
− Density optimization
− High performance fabrics
− Cluster management enhancements (massive scaling, open APIs, tight integration, multiple user interfaces)
− Deep Learning and HPC software platform enablement (HPE CCTK, Caffe, CUDA, Google TensorFlow, HPE IDOL)
Unique solution differentiators
− GPU density
− Configurable GPU topologies
− More network bandwidth
− Power and cooling optimization
− Manageability
− Better productivity
Roadmap
–Motivating evidence
–The CogX project and vision
–Open-source availability
A simple data-intensive program
val movie1 = ...
val movie2 = ...
val average = (movie1 + movie2) / 2
[Dataflow graph: movie1 and movie2 feed a + node; the sum is divided by 2 to produce average]
Simplified architecture diagram
[Diagram: a CPU with its own memory alongside a GPU with separate device memory]
Naïve data flow in practice
val average = (movie1 + movie2) / 2
[Diagram: each operator triggers data movement between CPU memory and GPU memory]
Optimized data flow in practice
val average = fusedOp(movie1, movie2, 2)
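The fused version avoids both the extra kernel launch and the intermediate buffer. A minimal NumPy sketch of the difference (the names `naive_average` and `fused_average` are illustrative, not CogX API):

```python
import numpy as np

def naive_average(movie1, movie2):
    # Two separate "kernels": the sum is materialized as a full
    # intermediate buffer before the division runs.
    tmp = movie1 + movie2          # kernel 1, extra buffer
    return tmp / 2                 # kernel 2

def fused_average(movie1, movie2):
    # What a fused kernel does: each element is read once, combined,
    # and written once, with no intermediate buffer.
    out = np.empty_like(movie1)
    for i in range(movie1.size):
        out.flat[i] = (movie1.flat[i] + movie2.flat[i]) / 2
    return out

frame1 = np.array([2.0, 4.0, 6.0])
frame2 = np.array([4.0, 8.0, 10.0])
print(fused_average(frame1, frame2))  # [3. 6. 8.]
```

On a GPU the same transformation also saves a round trip through device memory, which is where most of the win comes from.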
Performance portability on GPUs
Roadmap
–Motivating evidence
–The CogX project and vision
–Open-source availability
Vision
performance-portable, high-productivity programming for accelerators
What is CogX?
• Domain-specific embedded language with an associated optimizing compiler and runtime
• Array programming language embedded in a state-machine execution model
• Targets advanced analytics workloads on massively parallel distributed systems
• Design goals
– Optimal deployment on parallel hardware
– Fast design iterations
– Enforce scalability
– Broad COTS hardware support
– Compatible with shared infrastructure
– High productivity for analysts and algorithm engineers
CogX compute model
• Compute Graphs
– Fields
– Operators
– Sensors/Actuators
– Feedback/Time
[Diagram: Compute Graph]
CogX compute model
val movie = ColorMovie("courtyard.mp4")
val background = VectorField(movie.fieldShape, Shape(3))
val nextBackground = 0.999f * background + 0.001f * movie
background <== nextBackground
val suspicious = reduceSum(abs(movie - background))
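The Hello World model maintains a running background estimate as an exponential moving average of the video frames and scores each frame by how far it departs from that estimate. A hedged NumPy sketch of one tick of the same computation (the tiny random "field" stands in for a real video frame; `step` is illustrative, not CogX API):

```python
import numpy as np

def step(background, frame, decay=0.999):
    """One tick of the Hello World model: blend the new frame into the
    running background estimate and score how anomalous the frame is."""
    next_background = decay * background + (1.0 - decay) * frame
    suspicious = float(np.sum(np.abs(frame - background)))
    return next_background, suspicious

# Illustrative stand-in for a courtyard.mp4 frame: a tiny 4 x 4 RGB field.
background = np.zeros((4, 4, 3), dtype=np.float32)
background, suspicious = step(background, np.ones((4, 4, 3), dtype=np.float32))
```

The `0.999f`/`0.001f` blend above comes straight from the CogX snippet; anything that moves against a static background drives `suspicious` up.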
Demo: Hello World application
CogX compute model
val movie = ColorMovie("courtyard.mp4")
[Compute graph: a ColorMovie sensor produces movie_t]
CogX compute model
val background = VectorField(movie.fieldShape, Shape(3))
[Compute graph: ColorMovie → movie_t, plus a new background_t field]
CogX compute model
val nextBackground = 0.999f * background + 0.001f * movie
[Compute graph: background_t scaled by 0.999f and movie_t scaled by 0.001f are summed into nextBackground_t]
CogX compute model
background <== nextBackground
[Compute graph: nextBackground_t feeds back through <== to become background_t+1]
CogX compute model
val suspicious = reduceSum(abs(movie - background))
[Compute graph: movie_t minus background_t → abs → reduceSum → suspicious_t]
CogX compute model
[Compute graph unrolled in time: starting from background_0 = 0, each tick t consumes movie_t, computes suspicious_t = reduceSum(abs(movie_t - background_t)), and produces background_t+1 = 0.999f * background_t + 0.001f * movie_t]
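The key semantics in the unrolled view is that `<==` feedback takes effect only at the next tick: every read within a tick sees the time-t value, and the new value becomes visible at t+1. A scalar sketch of that state-machine stepping (illustrative, not CogX API):

```python
# CogX's <== feedback commits at the end of a tick: both uses of
# `background` below read the time-t value, and the updated value is
# only visible at time t+1.
def tick(background, frame):
    next_background = 0.999 * background + 0.001 * frame   # nextBackground_t
    suspicious = abs(frame - background)                   # reads time-t value
    return next_background, suspicious                     # commit at t+1

background = 0.0                       # background_0
history = []
for frame in [1.0, 1.0, 1.0]:          # movie_0, movie_1, movie_2
    background, suspicious = tick(background, frame)
    history.append(suspicious)         # suspicious_0, suspicious_1, ...
print(history[0])  # 1.0 -- the first tick still sees background_0 = 0
```

Note that suspicious_0 uses background_0, not the freshly computed background_1, exactly as in the unrolled graph above.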
Opportunities for optimization
[Compute graph: the complete Hello World graph from the preceding slides]
Opportunities for optimization
Initially: 6 separate device kernels.
[Compute graph: each operator (two scalings, add, subtract, abs, reduceSum) runs as its own device kernel]
Opportunities for optimization
After a "single-output" kernel fuser pass: 2 device kernels remain.
[Compute graph: chains of single-consumer kernels have been merged]
Opportunities for optimization
After a "multi-output" kernel fuser pass: only a single device kernel remains.
[Compute graph: the remaining kernels are merged into one multi-output kernel]
CogX compiler: translating CogX to OpenCL with kernel fusion
User CogX model (Scala)
→ parsing → Syntax tree (ops, fields)
→ OpenCL code generation → Kernel circuit (kernels, field buffers)
→ optimizations, including kernel fusion → Optimized kernel circuit (merged kernels)
CogX code snippet
val A = ScalarField(10,10)
val B = ScalarField(10,10)
val C = A * B
val D = ScalarField(10,10)
val E = C + D
[Diagram: initially an OpenCL multiply kernel (A * B → C) feeds an OpenCL add kernel (C + D → E); after fusion, a single fused multiply/add kernel computes E = A * B + D directly]
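The essence of the fusion step is inlining a producer kernel into its sole consumer so the intermediate field (C) is never materialized. A toy NumPy sketch of the idea (all names here are illustrative; the real CogX compiler fuses OpenCL kernels in its kernel circuit):

```python
import numpy as np

# Two unfused "device kernels": multiply writes intermediate C, add reads it.
multiply_kernel = lambda A, B: A * B
add_kernel = lambda C, D: C + D

def run_unfused(A, B, D):
    C = multiply_kernel(A, B)   # launch 1: intermediate buffer C materialized
    return add_kernel(C, D)     # launch 2

# Fusion inlines the producer into its sole consumer: one kernel, no C buffer.
def fuse(producer, consumer):
    return lambda A, B, D: consumer(producer(A, B), D)

fused_kernel = fuse(multiply_kernel, add_kernel)

A = np.full((10, 10), 2.0)
B = np.full((10, 10), 3.0)
D = np.ones((10, 10))
E = fused_kernel(A, B, D)       # E = A * B + D
```

Fusion is only legal when the producer's output has no other consumer, which is why the "single-output" pass above fuses fewer kernels than the "multi-output" pass.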
CogX core functions and operators
• Basic operators
• +, -, *, /, %
• Logical operators
• >, >=, <, <=, ===, !===
• Pointwise functions
• cos, cosh, acos
• sin, sinh, asin
• tan, tanh, atan2
• sq, sqrt, log, signum
• pow, reciprocal
• exp, abs, floor
• Comparison functions
• max, min
• Shape manipulation
• flip, shift, shiftCyclic
• transpose, subfield
• expand, select, stack
• matrixRow, reshape
• subfields, trim
• vectorElement, vectorElements
• transposeMatrices
• transposeVectors
• replicate, slice
• FFT/DCT
• fft, fftInverse
• fftRI, fftInverseRI
• fftRows, fftInverseRows
• fftColumns, fftInverseColumns
• dct, dctInverse, dctTransposed
• dctInverseTransposed
• Complex numbers
• phase, magnitude, conjugate
• realPart, imaginaryPart
• Convolution-like
• crossCorrelate, crossCorrelateSeparable
• convolve, convolveSeparable
• projectFrame, backProjectFrame
• crossCorrelateFilterAdjoint
• convolveFilterAdjoint
• Gradient/divergence
• backwardDivergence
• backwardGradient
• centralGradient
• forwardGradient
• Linear algebra
• dot, crossDot
• reverseCrossDot
• Debugging
• probe
• Type coercion
• toScalarField, toVectorField
• toMatrixField, toComplexField
• toComplexVectorField, toColorField
• toGenericComplexField
• Type construction
• complex, polarComplex
• vectorField, complexVectorField
• matrixField, colorField
• Reductions
• reduceSum, blockReduceSum
• reduceMin, blockReduceMin
• reduceMax, blockReduceMax
• fieldReduceMax, fieldReduceMin
• fieldReduceSum, fieldReduceMedian
• Normalizations
• normalizeL1, normalizeL2
• Resampling
• supersample, downsample, upsample
• Special operators
• winnerTakeAll
• random
• solve
• transform
• warp
• <==
CogX software stack
[Stack diagram: Application; CogX libraries/toolkit (Neural network toolkit, Sandbox toolkit, I/O toolkit, Cluster package); CogX core (CogX debugger, CogX compiler and standard library, Scala CogX runtime, C++ CogX runtime, HDF5 loader); External libraries (JOCL, OpenCL, HDF5, Apache Mesos)]
Applications are written by users
− Introductory and training examples for single-GPU and distributed computation
− Performance benchmarks covering the core and neural network package
− Several larger-scale demo applications integrating multiple CogX functions
CogX toolkit functions
• Computer Vision
• Annotation tools
• Color space transformations
• Polynomial dense optic flow
• Segmentation
• Solvers
• Boundary-gated nonlinear diffusion
• FISTA solver (with sub-variants)
• Golden section solver
• Incremental k-means implementation
• LSQR solver (with sub-variants)
• Poisson solver (with sub-variants)
• Filtering
• Contourlets
• 4 frequency-domain filters
• Mathematical morphology operators
• 27 space-domain filters (from a simple box filter up to local polynomial expansion and steerable Gabor filters)
• Steerable pyramid filter
• Wavelets
• Variants of whitening transforms
• Contrast normalization
• Domain transfer filter
• Gaussian pyramid
• Monogenic phase congruency
• Dynamical Systems
• Kalman filter
• Linear system modeling support
• CPU matrix pseudo-inverse
• Statistics
• Normal and uniform distributions
• Histograms
• Moment calculations
• Pseudo-random number generator sensors
Labeling Dynamic Ordinal Depth
Goal: "direct" readout of "in front of", "behind", "emerging", or "disappearing" in video streams
− Scene segmentation based on motion signals only (not contrast edges, stereo, ...)
− Uses CogX, software from HPE Labs
− Maximizes use of GPUs
− Near real-time processing, ~2 fps on an HP Z820 workstation
− Some processing in CPU kernels
[Pipeline diagram with stages: Video Stream, Optic Flow, Discretized Motion, Motion Onset/Offset, Boundary Ownership, Occlusion Status, Region Properties, Motion Regions, Region Traces, Motion Field; grouped into Preprocessing and Region Processing; outputs: Occluders, Region Completion, Ordinal Depth]
Visualizing ordinal depth and occlusions. Unoccluded moving parts of an object are highlighted. Occluder is marked in red.
Functional Control Flow of CogMO Algorithm
[Flow diagram: Optic Flow → Enumerating motion surfaces → Motion surfaces → Assigning Boundary Ownership → Ordinal Depth]
CogMO – Ordinal Depth
Video: CogMO algorithm
Roadmap
–Motivating evidence
–The CogX project and vision
–Open-source availability
HPE Cognitive Computing Toolkit
[Stack diagram: Application; CogX libraries/toolkit (Neural network toolkit, Sandbox toolkit, I/O toolkit, Cluster package); CogX core (CogX debugger, CogX compiler and standard library, Scala CogX runtime, C++ CogX runtime, HDF5 loader); External libraries (JOCL, OpenCL, HDF5, Apache Mesos)]
Applications are written by users
− Introductory and training examples for single-GPU and distributed computation
− Performance benchmarks covering the core and neural network package
− Several larger-scale demo applications integrating multiple CogX functions
HPE Cognitive Computing Toolkit
[Stack diagram, open-source release: Application; CogX libraries/toolkit (Neural network toolkit, Sandbox toolkit, I/O toolkit); CogX core (CogX debugger, CogX compiler and standard library, Scala CogX runtime, HDF5 loader); External libraries (JOCL, OpenCL, HDF5)]
Applications are written by users
− Introductory and training examples for single-GPU and distributed computation
− Performance benchmarks covering the core and neural network package
− Several larger-scale demo applications integrating multiple CogX functions
High-level comparison

Feature                   | CogX                                                           | TensorFlow
Core data abstraction     | Tensor fields: single precision, restricted dimensions         | Tensors: typed multi-dimensional arrays
Core compute abstraction  | OpenCL functions emitted and compiled at runtime; user kernels | C++/CUDA functions compiled into the TensorFlow project
Graph optimizations       | Kernel fusion                                                  | Not available
Distribution across GPUs  | Simulated annealing placer                                     | Unreleased: graph partitioning, greedy placer
Debugging                 | Single-step runtime debugging; text-based profiler             | Non-interactive log-file parser; better graph visualization; unreleased profiler
Automatic differentiation | Supported as a library for neural-network-specific operations  | Supported by most of the core API
Fault tolerance           | Not yet implemented                                            | Automatic checkpointing and restart of graph
Control flow              | Not yet implemented                                            | Predicated execution
Runtime optimization      | Not yet implemented                                            | Interleaved processing of iterations; placer
TensorFlow plugin: high productivity, high performance operators
Simple Python API
[Pipeline diagram: Python plugin → Protobuf intermediate representation → Optimizer → CUDA generator / C generator → TensorFlow custom op → TensorFlow]
TensorFlow plugin: a familiar programming model
Example: element-wise L2 norm of three 2 x 2 tensors
out[pos] = sqrt(in_0[pos]*in_0[pos] + … + in_2[pos]*in_2[pos])
[Diagram: input tensors, workgroup shape, output tensor]
TensorFlow plugin: high productivity, high performance
High productivity:
def op(in0, in1, in2):
pos = position_in(in0.shape)
out = output_like(in0)
a = in0[pos]
b = in1[pos]
c = in2[pos]
out[pos] = sqrt(a*a + b*b + c*c)
return out
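For reference, the same element-wise L2 norm in plain NumPy; this is just the reference semantics of the op above, whereas the plugin generates a CUDA kernel that evaluates it per work-item:

```python
import numpy as np

def l2norm_elementwise(in0, in1, in2):
    # For every position: out[pos] = sqrt(in0[pos]^2 + in1[pos]^2 + in2[pos]^2)
    return np.sqrt(in0 * in0 + in1 * in1 + in2 * in2)

a = np.full((2, 2), 1.0)
b = np.full((2, 2), 2.0)
c = np.full((2, 2), 2.0)
print(l2norm_elementwise(a, b, c))  # 3.0 at every position
```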
High performance:
Ben Chandler, benjamin.chandler@hpe.com