Machine Learning with New Hardware Challegens

Machine Learningwith New Hardware Challenges

Oscar M.K. Law

High Tech Challenges

Personal

ComputerInternet Smartphone Machine

Learning

1980 1995 2007 2016

Machine Learning

Visual/Audio Applications

Pattern Recognition

Pattern Detection

Voice Recognition

Motion/Movement Control

Self-Driving

Drone Control

Data Mining/Association

Medical

Financial

Legal

Neural Network Convolutional Neural Network (CNN)

Model

Mathematical Based Model

Hardware

Graphics Processing Unit (Nvidia)

Catapult Fabric (Microsoft)

Tensor Processing Unit (Google)

Applications

DCNN, RCNN, Fast-RCNN, Faster-RCN, RFCN

Spike Neural Network (SNN)

Model

Physical Based Model

Hardware

TrueNorth Processor (IBM)

Zeroth Processor (Qualcomm)

Convolutional Neural Network

A. Krizhevsky, I. Sutskever and G.E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS-2012, p.1-p.9.

1000

2048

2048 2048

128

128

192

192

192

192

128

128

48

48

3Max

pooling

Max

pooling

Max

pooling

224

224

5555

55

55

27

27

27

27

1313

1313

1313

1313

1313

1313

2048


Architecture

5 Convolutional Layers

3 Fully Connected Layers

Gaber Filters

650k Neurons

69M Parameters

630M Connections

Runtime

Nvidia GTX 580 3Gb GPU

One week

Results

Top-1 Error Rate: 37.5%

Top-5 Error Rate: 17.0%


Software Developer Platform Interface Hardware

Caffe Berkeley Vision and Learning

Center

Linux, OSX, Windows,

Android

C++. Python, Matlab CPU, GPU

MatConvNet Oxford Visual Geometry

Group

Linux, OSX, Windows Matlab CPU, GPU

Matlab MathWorks Linux, OSX, Windows Matlab CPU, GPU

Tensorflow Google Brain Team Linux, OSX, Windows C++, Python CPU, GPU, TPU

Torch 7 R. Collobert, K. Kavukcuoglu,

C. Farabet

Linux, OSX, iOS,

Android, Windows

Lua, LuaJIT, C CPU, GPU

Theano Universite de Montreal Cross-Platform Python CPU, GPU

CNTK Microsoft Linux, OSX, Windows Network Description

Language

CPU, GPU, FPGA

CPU vs GPUCentral Processing Unit (CPU) Graphics Processing Unit (GPU)

Architecture

Instruction Set Single Instruction Single Data (SISD) Single Instruction Multiple Data (SIMD)

Operation Sequential Parallel

Processor Core Few Many

Datapath Custom Synthesis

Clock Rate High Moderate

Bandwidth Medium Large

Power Moderate High

Temperature Moderate High

Graphics Processing Unit (Nvidia)

Pascal Architecture

Flag Chip GP100

Process TSMC 16nm FinFet Process

Maximum Transistors 15.3B

Stream Multiprocessor (SM) 56 (10SM/GPC)

CUDA Cores 3840 CC (60CU/SM)

Base Clock 1328MHz

Boost Clock 1480MHz

FP32 Performance 10.6 TFlops

FP64 Performance 5.3 TFlops

Memory Interface 4096bit HBM2

Maximum Bandwidth 720 GB/s

Maximum Power 300W

J. Walton, Nvidia Pascal P100 Architecture Deep Dive, PC Gamer, Apr 07, 2016.

Catapult Fabric (Microsoft)

Purpose

Design for Neural Network Classification

Target for power reduction

Architecture

Field Programmable Gate Array (FPGA)

Software configurable engine supports runtime multiple layer

configurations

A spatially distributed array of processing elements can be scaled

up to thousand of units

On-chip redistribution network with efficient data buffer

minimizes off-chip memory traffic

Power dissipation is significantly reduced to 25W only

K. Ovtcharov, O. Ruwase, J.Y. Kim, J. Fowers, K. Strauss, E.S. Chung, Accelerating Deep Convolutional Networks Using Specialized Hardware, Microsoft

Research, Feb 2015.

Tensor Processor Unit (Google)

Purpose

Support Tensorflow algorithm

Target for Neural Network Classification

Architecture

Application Specific Integrated Circuit (ASIC)

Single Instruction Multiple Data (SIMD) Architecture

Low computational precision

Better performance/watt

Hardware Optimization

Algorithm Architecture

Chip Design

Thanks

Engineering

Machine Learning with New Hardware Challegens