Upload
oscar-law
View
73
Download
0
Embed Size (px)
Citation preview
Machine Learningwith New Hardware Challenges
Oscar M.K. Law
High Tech Challenges
Personal
ComputerInternet Smartphone Machine
Learning
1980 1995 2007 2016
Machine Learning
Visual/Audio Applications
Pattern Recognition
Pattern Detection
Voice Recognition
Motion/Movement Control
Self-Driving
Drone Control
Data Mining/Association
Medical
Financial
Legal
Neural Network Convolutional Neural Network (CNN)
Model
Mathematical Based Model
Hardware
Graphics Processing Unit (Nvidia)
Catapult Fabric (Microsoft)
Tensor Processing Unit (Google)
Applications
DCNN, RCNN, Fast-RCNN, Faster-RCN, RFCN
Spike Neural Network (SNN)
Model
Physical Based Model
Hardware
TrueNorth Processor (IBM)
Zeroth Processor (Qualcomm)
Convolutional Neural Network
A. Krizhevsky, I. Sutskever and G.E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS-2012, p.1-p.9.
1000
2048
2048 2048
128
128
192
192
192
192
128
128
48
48
3Max
pooling
Max
pooling
Max
pooling
224
224
5555
55
55
27
27
27
27
1313
1313
1313
1313
1313
1313
2048
Convolutional Neural Network
Architecture
5 Convolutional Layers
3 Fully Connected Layers
Gaber Filters
650k Neurons
69M Parameters
630M Connections
Runtime
Nvidia GTX 580 3Gb GPU
One week
Results
Top-1 Error Rate: 37.5%
Top-5 Error Rate: 17.0%
Convolutional Neural Network
Software Developer Platform Interface Hardware
Caffe Berkeley Vision and Learning
Center
Linux, OSX, Windows,
Android
C++. Python, Matlab CPU, GPU
MatConvNet Oxford Visual Geometry
Group
Linux, OSX, Windows Matlab CPU, GPU
Matlab MathWorks Linux, OSX, Windows Matlab CPU, GPU
Tensorflow Google Brain Team Linux, OSX, Windows C++, Python CPU, GPU, TPU
Torch 7 R. Collobert, K. Kavukcuoglu,
C. Farabet
Linux, OSX, iOS,
Android, Windows
Lua, LuaJIT, C CPU, GPU
Theano Universite de Montreal Cross-Platform Python CPU, GPU
CNTK Microsoft Linux, OSX, Windows Network Description
Language
CPU, GPU, FPGA
CPU vs GPUCentral Processing Unit (CPU) Graphics Processing Unit (GPU)
Architecture
Instruction Set Single Instruction Single Data (SISD) Single Instruction Multiple Data (SIMD)
Operation Sequential Parallel
Processor Core Few Many
Datapath Custom Synthesis
Clock Rate High Moderate
Bandwidth Medium Large
Power Moderate High
Temperature Moderate High
Graphics Processing Unit (Nvidia)
Pascal Architecture
Flag Chip GP100
Process TSMC 16nm FinFet Process
Maximum Transistors 15.3B
Stream Multiprocessor (SM) 56 (10SM/GPC)
CUDA Cores 3840 CC (60CU/SM)
Base Clock 1328MHz
Boost Clock 1480MHz
FP32 Performance 10.6 TFlops
FP64 Performance 5.3 TFlops
Memory Interface 4096bit HBM2
Maximum Bandwidth 720 GB/s
Maximum Power 300W
J. Walton, Nvidia Pascal P100 Architecture Deep Dive, PC Gamer, Apr 07, 2016.
Catapult Fabric (Microsoft)
Purpose
Design for Neural Network Classification
Target for power reduction
Architecture
Field Programmable Gate Array (FPGA)
Software configurable engine supports runtime multiple layer
configurations
A spatially distributed array of processing elements can be scaled
up to thousand of units
On-chip redistribution network with efficient data buffer
minimizes off-chip memory traffic
Power dissipation is significantly reduced to 25W only
K. Ovtcharov, O. Ruwase, J.Y. Kim, J. Fowers, K. Strauss, E.S. Chung, Accelerating Deep Convolutional Networks Using Specialized Hardware, Microsoft
Research, Feb 2015.
Tensor Processor Unit (Google)
Purpose
Support Tensorflow algorithm
Target for Neural Network Classification
Architecture
Application Specific Integrated Circuit (ASIC)
Single Instruction Multiple Data (SIMD) Architecture
Low computational precision
Better performance/watt
Hardware Optimization
Algorithm Architecture
Chip Design
Thanks