Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
DEEP LEARNING TECHNOLOGIES FOR AI
Marc DurantonCEA Tech (Leti and List)
Considérations énergétiques de l’IA et perspectives
Marc DurantonCEA Fellow
22 Septembre 2020
Séminaire INS2I- Systèmes et architectures intégrées matériel-logiciel pour l’intelligence artificielle
| 2
KEY ELEMENTS OF ARTIFICIAL INTELLIGENCE
Traditional AI, symbolic, algorithms
rules…
Analysis of “big data”
Data analytics
ML-based AI:Bayesian, …
* Reinforcement Learning, One-shot Learning, Generative Adversarial Networks, etc…
From Greg. S. Corrado, Google brain team co-founder:– “Traditional AI systems are programmed to be clever– Modern ML-based AI systems learn to be clever.
Deep Learning*
AI“…as soon as it works, no one calls it AI anymore.”
John McCarthy, who coined the term “Artificial Intelligence” in 1956
“AI is whatever hasn't been done yet”
D. Hofstadter (1980)
“Classical” approachesto reduce energy consumption
Optimization of energyin datacenters
Our focus today:- Considerations on
state of the art systems(learning + inference)
- Edge computing( inference, federated learning,neuromorphic)
| 4
• ImageNet classification (Hinton’s team, hired by Google) • 14,197,122 images, 1,000 different classes• Top-5 17% error rate (huge improvement) in 2012 (now ~ 3.5%)
• Facebook’s ‘DeepFace’ Program (labs headed by Y. LeCun) • 4.4 million images, 4,030 identities• 97.35% accuracy, vs. 97.53% human performance
From:Y. Taigman, M. Yang, M.A. Ranzato, “DeepFace: Closing the Gap to Human-Level Performance in Face Verification”
“Supervision” networkYear: 2012650,000 neurons60,000,000 parameters630,000,000 synapses
They give the state-of-the-art performance e.g. in image classification
2012: DEEP NEURAL NETWORKS RISE AGAIN
The 2018 Turing Award recipients are Google VP Geoffrey Hinton, Facebook's Yann LeCun and Yoshua Bengio, Scientific Director of AI research center Mila.
| 6
COMPETITION ON IMAGENET !
Nameofthealgorithm
Date Errorontestset
Supervision 2012 15.3%
Clarifai 2013 11.7%
GoogLeNet 2014 6.66%
Humainlevel(AdrejKarpathy)
5%
Microsoft 05/02/2015 4.94%
Google 02/03/2015 4.82%
Baidu/DeepImage 10/05/2015 4.58%
ShenzhenInstitutesof
AdvancedTechnology,
ChineseAcademyofSciences
10/12/2015
(leCNNa152
couches!)
3.57%
GoogleInception-v3
(Arxiv)
2015 3.5%
WMW(Momenta) 2017 2.2%
Now ?
| 7
SPECIALIZATION LEADS TO MORE EFFICIENCY EFFICIENCY
Source from Bill Dally (nVidia) « Challenges for Future Computing Systems » HiPEAC conference 2015
Type of device Energy / Operation
CPU 1690 pJGPU 140 pJ
Fixed function 10 pJ
| 8
2017: GOOGLE’S CUSTOMIZED HARDWARE…… required to increase energy efficiency
with accuracy adapted to the use (e.g. float 16)
Google’s TPU2 : training and inference in a 180 teraflops16 board(over 200W per TPU2 chip according to the size of the heat sink)
| 9
" The need for TPUs really emerged about six years ago, when we started using computationally expensive deep learning models in more and more places throughout our products. The computational expense of using these models had us worried. If we considered a scenario where people use Google voice search for just three minutes a day and we ran deep neural nets for our speech recognition system on the processing units we were using, we would have had to double the number of Google data centers!"
[https://cloudplatform.googleblog.com/2017/04/quantifying-the-performance-of-the-TPU-our-first-machine-learning-chip.html]
DEEP LEARNING AND VOICE RECOGNITION
| 10
… required to increase energy efficiency with accuracy adapted to the use (e.g. float 16)
Google’s TPU2 : 11.5 petaflops16 of machine learning number crunching (and guessing about 400+ KW…, 100+ GFlops16/W)
Peta = 1015 = million of milliardFrom Google
2017: GOOGLE’S CUSTOMIZED HARDWARE…
| 11
From https://www.nextplatform.com/2018/05/10/tearing-apart-googles-tpu-3-0-ai-coprocessor/
GOOGLE’S CUSTOMIZED TPU (V3) HARDWARE…
GOOGLE’S CUSTOMIZED TPU (V3) HARDWARE…
| 12
2017: THE GAME OF GO
Ke Jie (human world champion in the “Go” game), after being defeated by AlphaGo on May 27th 2017, will work with Deepmind to make a tool from AlphaGo to further help Go players to enhance their game.
| 14
ALPHAZERO: COMPUTING RESOURCES
Peta = 1015 = million of milliard From Google Deepmind
X 5000 = 200 KW*
* https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
X 40 days…
AlphaZero is a generic reinforcement learning algorithm
Follow-up is called MuZero (2019)
| 15
2019: OPENAI WINS DOTA 2 ESPORT GAME
OpenAI Five DefeatsDota 2 World Champions
OpenAI Five is the first AI to beat the world champions in an esports game, having won two back-to-back games versus the world champion Dota 2 team.It is also the first time an AI has beaten esports pros on livestream.
Cooperative mode
OpenAI Five’s ability to play with humans presents a compelling vision for the future of human-AI interaction, one where AI systems collaborate and enhance the human experience. Our testers reported feeling supported by their bot teammates, that they learned from playing alongside these advanced systems, and that it was generally a fun experience overall.
| 16
OpenAI Five DefeatsDota 2 World Champions
In total, the current version of OpenAI Fivehas consumed 800 petaflop/s-days andexperienced about 45,000 years of Dotaself-play over 10 realtime months (up fromabout 10,000 years over 1.5 realtimemonths as of The International), for anaverage of 250 years of simulatedexperience per day. The Finals version ofOpenAI Five has a 99.9% winrate versusthe TI (tournament) version.
2019: OPENAI WINS DOTA 2 ESPORT GAME
Peta = 1015 = million of milliard
| 17
OpenAI Five DefeatsDota 2 World Champions
In total, the current version of OpenAI Fivehas consumed 800 petaflop/s-days andexperienced about 45,000 years of Dotaself-play over 10 realtime months (up fromabout 10,000 years over 1.5 realtimemonths as of The International), for anaverage of 250 years of simulatedexperience per day. The Finals version ofOpenAI Five has a 99.9% winrate versusthe TI (tournament) version.
2019: OPENAI WINS DOTA 2 ESPORT GAME
Giga = 109 Tera = 1012
Green500 List - June 2020
Peta = 1015 = million of milliard
~ 38 MW
| 18
EXPONENTIAL INCREASE OF COMPUTING POWER FOR AI TRAINING
* https://blog.openai.com/ai-and-compute/
“Since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.5 month-doubling time (by comparison, Moore’s Law had an 18-month doubling period)*”
| 19
ALGORITHMIC IMPROVEMENT IS A KEY FACTOR DRIVING THE ADVANCE OF AI
* https://openai.com/blog/ai-and-efficiency/
“Since 2012 the amount of compute needed to train a neural net to the same performance on ImageNet1 classification has been decreasing by a factor of 2 every 16 months. *”
| 20
2020: OPENAI’S GPT-3, AN AUTOREGRESSIVE LANGUAGE MODEL WITH 175 BILLION PARAMETERS
* https://lambdalabs.com/blog/demystifying-gpt-3/
“GPT-3 shows that language model performance scales as a power-law of model size, dataset size, and the amount of computation.GPT-3 demonstrates that a language model trained on enough data can solve NLP tasks that it has never encountered. That is, GPT-3 studies the model as a general solution for many downstream jobs without fine-tuning.The size of state-of-the-art (SOTA) language models is growing by at least a factor of 10 every year.*”GPT-3 175B is trained with 499 Billion tokens:
“GPT-3 175B model required 3.14x1023 FLOPS (100 trilliard (1021) so about 87h of exaflop (1018) machine), 302 days on Tera-1000) for computing for training. Even at theoretical 28 TFLOPS for V100 and lowest 3 year reserved cloud pricing we could find, this will take 355 GPU-years and cost $4.6M for a single training run. Similarly, a single RTX 8000, assuming 15 TFLOPS, would take 665 years to run*”.
| 22
From “Total Consumer Power Consumption Forecast”, Anders S.G. Andrae, October 2017
The problem: IT projected to challenge future electricity supply
Exponential power consumption
| 23
PROCESSOR ARCHITECTURE EVOLUTION
In Memory Computing,Neuromorphic Computing.
Data Centric Interconnect
Moore’s law slow-down
DisruptiveArchitecture for
data management
Memory invades
logic
CPU
Cache
Memory
Bus
NIC(NetworkInterConnect)
Mono-core architecture for
single thread performance
~2006
End of Dennard’s scaling
NoC + LLC
Cache Cache
Memory
CacheCache
Many-core architecture for
parallelism
Memory
NIC
~2016
Close Mem.
Coherent Link
Close Mem
Close Mem
Close Mem
Far Mem. NIC
Generic processing
HW accelerator
Heterogeneousarchitecture for energy efficiency
~2026…
Quantum Computing
From Denis Dutoit
| 24
• Only GAFAM or BAITX able to train those gigantic systems?• Use in transfer learning:• Once learnt, the system can be tuned for a particular application• This can be done with minimal computing resources (only the last layers to
train)• The energy cost of learning is spread by the many usages and instances in
inference• Like the NRE for building chips
• Will we have access to the resulting systems for adaptation?• Is it possible to optimize those systems by post-processing to be able
to use them for inference/transfer learning in embedded system?
OPEN QUESTIONS…
| 26
New services
Smart sensors
Internet of Things
Big Data
Cloud / HPC
Data Analytics / Cognitive computing
Physical Systems
Transforming data into information as early as possible
Cyber Physical Entanglement
Processing,Understanding
as soon as possible
THE COMPUTING CONTINUUM
ENABLING EDGE INTELLIGENCE
True collaboration between edge devices and the HPC/cloud➪ creating a
continuum
Enabling Intelligent data processing at
the edge:Fog computing
Edge computingStream analytics
Fast data…
Intelligent sensor(pre) processing datalocally
| 27
This initiative currently involves:ETP4HPC, ECSO, BDVA, 5GIA,EUMATHS, CLAIRE, HiPEAC…More to come…
Edge processing, Fog, …
| 28
Shou
ld I
brak
e?
Tran
smis
sion
erro
rpl
ease
retry
late
r
System should be autonomous to make good decisions in all conditions
Embedded intelligence needs local high-end computing
Safety will impose that basic autonomous functionsshould not rely on “always connected” or “always available”
And should not consume most power (and space!) of an electric car!
| 29
N2D2, NEURAL NETWORK DESIGN & DEPLOYMENT
Code cross-generation
Code executionon target
COTS• STM (STM32, ASMP)• Nvidia (all GPUs)• Renesas (RCar)• Kalray (MPPA)
Custom accelerators• ASIC (PNeuro)• FPGA (DNeuro)
Software DNN libraries• C/OpenMP (+ custom SIMD)• C++/TensorRT (+ CUDA/CuDNN)• C++/OpenCL• C/C++ (+ custom API)• Assembly
Hardware DNN libraries• C/HLS• Custom RTL
Data conditioning
Learning & Test
databases
Considered criteria•Applicative performance metrics•Memory requirement•Computational complexity
Modeling Learning Test
Optimization
Trained DNN
ONNX model
Quantization
2 to 8 bit integers + rescaling•Based on dataset distribution•Quantized applicative
performance metrics validation
A unique platform for the design and exploration of DNN applications with :+ Easy database access to facilitate the learning phase+ Automatic code generation on various hardware targets thanks to the export modules+ Multiple comparison criteria : latency, hardware cost, memory need, power consumption…
Dumb sensors Smart sensors: Streaming and distributed data analytics
Bandwidth (and cost) will require more local processing
if you need a responsein less than 1ms, the server has to be in less than 150 Km
Fog computingFederated learning
EDGE INTELLIGENCE NEEDS LOCAL HIGH-END COMPUTING
| 31
Privacy will impose that some processing should be done locally
and not be sent to the cloud.
Example: detecting elderly people falling in their home
With minimum power and wiring!
Embedded intelligence needs local high-end computing
| 32
• Ultra low power local processing detecting lying people in a room
• Raw data (before post-processing):
• Standing • Crouching• Lying
Detecting elderly people falling in their home
| 33
ENERGY EFFICIENT PROGRAMMABLE H/W ACCELERATOR FOR DEEP LEARNING: PNEURO
n Designed for Deep Neural Network processing chainsn Also supporting traditional image processing chains (filtering, etc.) –
No need of external image processingn Clustered and modular SIMD architecturen Clustered SIMD architecture
− Optimized operators for MAC & Non-Linearity approximation− Optimized memory accesses to perform efficient data transfers to operators− ISA including ~50 instructions (control + computing)
n Can be sized to fit the best area/performance per applicationn 571 GMACs/s/W/cluster
Target Frequency Performance Energy Efficiency
Quad ARM A7 900 MHz 480 images/s 380 images/s/W
Quad ARM A15 2000 MHz 870 images/s 350 images/s/W
Tegra K1 850 MHz 3 550 images/s 600 images/s/W
PNeuro (FPGA) 100 MHz 5 000 images/s 2 000 images/s/W
PNeuro (ASIC) 500 MHz 25 000 images/s 125 000 images/s/W
PNeuro (ASIC) 1 000 MHz 50 000 images/s 125 000 images/s/WFD-SOI28nm1.8 TOPS/W @500MHz0.4mm², 6mW
Cluster Interconnect
Cluster
NeuroCores
Cluster Controller
ElementaryComputing Units
One cluster of P-Neuro NPU
Performance of NPU with 4 clusters
PNeuro Neural Processing Unit for ASIC @1 GHz is 14 times faster & 200 times
more energy efficient than an embedded GPU
Entirely developed in Europe
| 34
PNEURO IN A ULTRA LOW POWER SOC: SAMURAI
SAMURAI ARCHITECTURE OVERVIEW
WuC
FLL
ULVTP SRAM
On-Demand Sub-SystemAlways Responsive Sub-System
WuRSPI_m GPIO I2C
Interconnect
SynchoDBB
Security IPs
GPIO
UART
CPURISC V
SRAM PNeuro SRAM
• Always responsive running @ 0.4V• Ultra fast wake-up time with ultra-low power modes• WuR with DBB decoder to wake-up on radio signaling• 8KB of ULV TP SRAM• AI accelerator with embedded 256KB SRAM• Embedded security
| 35
• Technology: ST28nm FDSOI• Die area: ~4.52mm² (2564µm x 1764µm)• Memories• 424KB SRAM (incl. 32KB with retention
mode, 264KB in PNeuro)• 8KB TPSRAM
• 5 power domains• 64-pin QFN package• Performances• 100 MHz @ 0.7V (220MHz @ 1,0V)• PNeuro: 6.4GMAC/s @ 100MHz• 5895 image class. per second @100MHz
SAMURAI MAIN FIGURES
Ref: 2020 Symposia on VLSI Technology and Circuits
| 36
SOLVING THE ENERGY CHALLENGE: COST OF MOVING DATA
Source: Bill Dally, « To ExaScale and Beyond »www.nvidia.com/content/PDF/sc_2010/theater/Dally_SC10.pdf
Avoiding data movement:Computing in/near memory
x800 more!
| 37
HOW TO REDUCE ENERGY AND POWER Processor
« sequential »transmission
x 2 resolution ➔ x 2 transmission frequency (f)P = C.V2.f but lower f allows lower V
Using energy efficient technology: FDSOI
Fully Depleted – Silicon on Insulator
LETI’s ISSCC demonstrator 2014:
Fclk < 2.6GHz @ 1.3VFclk < 450MHz @ 0.39V
x 10 less power at 0.39V(at same frequency)
« classical » system
Image sensor
| 38
Simpler processors running at lower f (they have less to do)P = n.C.V2.(f/n) and lower f allows lower V
Image sensor
HOW TO REDUCE ENERGY AND POWER
| 39
• Power/pixel reduced by over 90% compared to best in class low power IR product• (7,03 µW/pixel => 0,54
µW/pixel)• 13 times less power for the
same system functionality
• Dedicated algorithms for• Presence sensing• Localization• People counting
• Privacy respected• Pre-processing on the sensor• Video is not transmitted
IR SENSOR WITH BUILT-IN 1ST LEVEL IMAGE PROCESSING
Image
Adding 128 RISCProcessors
Synthetic data
| 40
SMART RETINA USING 3D STACKING
ProcessingLayer
Lens
Sensor layer (Back Side Illumination)
Passive interposer or PCB
L2
L1
Advantages:Low frequency processorDirect interconnect pixel ➔ processor➪ energy efficient system➪ very fast pixel processing➪ perfect scalability, no size limitation due to bandwidth➪ complex processing in the sensor➪ suitable for privacy sensible applicationsEach pixel can be processed independently➪ opening a new range of algorithms Heterogeneous integration possible
| 41
SMART RETINA USING 3D STACKING
• Demonstrator “Retine”• L1@130 nm / L2@130 nm; • Die size : 160 mm²
• Image sensor: • 192x256 @ 5500 fps or 768x1024 @ 60 fps• 12 µm pixel, 75% fill factor,
• 192 processors (3072 PE)• Processing : 72 GOPS, 11.7 MOPS/mW• 1 k instructions / pixel @ 1000 fps• Distributed memory• Each processor can execute a different
code in a set
X 100 computing power, X 10 energy efficiency,/ 15 processing latency
| 42
RETINE: A 3D STACKED SMART RETINA
Note: theboard is only an interface allowing to drive an HDMI displayand to debug the chip. All processing is done inside the retina
| 43
• The promises of spike-coding NN:• Reduced computing complexity
è replace 32 bits MAC with 8 bits ACC per operation• Natural spatial and temporal parallelism• Simple and efficient performance tunability capabilities• Spiking NN best exploit NVMs,
for massively parallel synaptic memory
SPIKE CODING OF DNN FOR IMPROVED EFFICIENCY
“Standard” MAC Spiking ACCMULT 16 bits 1.17 pJ
(interpolated)ADD 8 bits
0.03 pJ
ADD 32 bits 0.1 pJ
Total 1.27 pJ Total 0.03 pJè 42x reduction
Energy per operation comparison (estimate for 45nm 0.9V)**Mark Horowitz, “Computing’s Energy Problem (and what we can do about it)”,Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International
CorrectOutput
16 kernels4x4 synapses
90 kernels5x5 synapses
Input: digit24x24 pixels
Conv. layer16 maps11x11
neurons
Conv. layer24 maps
4x4 neurons FC la
yer
150 n
euro
nsOutp
ut:10
neur
ons
1) Standard CNN topology, offline learning
layer 1 layer 2 layer 3 layer 4
Pixelbrightness
Spiking frequencyV
t
fMIN
fMAX
Rate-based input
coding
Time
2) Lossless spike transcoding
| 44
• Sparcity: results on MNIST and GTSRB datasets
è Best reported* spike recognition rate with 50% less op. and >40x energy efficiency/op. for MNIST!è Tunable error rate vs compute time
SPIKE-CODING RESULTS
Performance vs computing time tenability on GTSRB (approximated computing)
Standard Spiking
Score MACs ∆𝑺 Score ACCs(spikes/MAC)
MNISTReLU 99.42% 126k
2 98.24% 40.30k (0.32)
3 99.33% 51.32k (0.41)
4 99.42% 63.02k (0.50)
GTSRBReLU 98.42% 818k
5 98.18% 1.63M (2.00)
10 98.38% 2.30M (2.82)
15 98.39% 2.87M (3.51)
N2D2: first public deep learning framework to integrate spiking DNNs transcoding and simulation
00,511,522,533,54
012345678
3 4 5 6 7 8 9 10 11 12 13 14 15
AC
Cs
(spi
ke/M
AC
)
Test
erro
r rat
e (%
)
Decision threshold ΔS
*Previous state-of-the-art: D. Neil, M. Pfeier, and S.-C. Liu. “Learning to be efficient: algorithms for training low-latency, low-compute deep spiking neural networks.” In ACM SAC, 2016
Standard DNN Spiking DNNQuantization Yes YesApprox. computing No YesBase operation Multiply-Accumulate
(MAC)Accumulate only
Activation function Non-linear function Threshold(+ refractory period)*
Parallelism Spatial Spatial and temporalMemory reutilisation Yes No
*Not required for ReLU activation function
| 45
SPIKING NEURAL NETWORKSCOMMUNICATION SPIKING NEURONS UNSUPERVISED LEARNING
ADDRESS EVENT REPRESENTATION
LEAKY INTEGRATE-AND-FIRE
SPIKE TIMING DEPENDENT PLASTICITY
| 46
NEW ELEMENT: RRAM AS SYNAPSES
PCMGSTGeTeGST + HfO2
M.Suri, et. al, IEDM 2011M.Suri, et. al, IMW 2012 , JAP 2012O.Bichler et al. IEEE TED 2012M.Suri et al., EPCOS 2013D.Garbin et al., IEEE Nano 2013
CBRAMAg / GeS2
OXRAM
D.Garbin et al. IEDM 2014D.Garbin et al., IEEE TED 2015
TiN/HfO2/Ti/TiN
Thermal effect
Electrochemicaleffect
Electronic effectoxygen vacancies
Analog computing: using some aspects of physical phenomena to model the problem being solved
| 47
MONOLITHIC 3D PRINCIPLE
CMOS/CMOS: 14nm vs 2D:Area gain=55%Perf gain = 23%Power gain = 12%
LETI, DAC 2014
| 48
Introduction
HfO2Ti/TiN
Ti/TiNHfO2
Ti/TiN
Ti/TiN
Short term structureà RRAM on top level to
avoid contamination issueà Reuse of existing masks
plus ebeam to build 1T1R
à ”Synapses” are integrated in the very fabric of communication
1 base ebeam required for RRAM definitionRRAM based on HfO2/Ti/TiN low temp materials (~ 350°C) à no critical problems to integrate on the top level
3D INTEGRATION COUPLED WITH RRAM FOR SYNAPTIC WEIGHTS
Managing complexity
Cognitive solutions for complex computing systems:• Using AI and optimization
techniques for computing systems• Creating new hardware• Generating code• Optimizing systems
• Similar to Generative designfor mechanical engineering
| 50
USING AI FOR MAKING CPS SYSTEMS: “GENERATIVE DESIGN” APPROACH
Motorcycle swingarm: the piece that hinges the rear wheel to the bike’s frame
The user only states desired goals and constraints-> The complexity wall might prevent explaining the solution
“Autodesk”
| 51
• Ne-XVP project – Follow-up of the TriMedia VLIW (https://en.wikipedia.org/wiki/Ne-XVP )
• 1,105,747,200 heterogeneous multicores in the design space
• 2 millions years to evaluate all design points
• AI inspired techniques allowed to reduce the induction time to only few days
=> x16 performance increase
EXAMPLE: DESIGN SPACE EXPLORATION FOR DESIGN MULTI-CORE PROCESSORS1 (2010)
1 M. Duranton et all., “Rapid Technology-Aware Design Space Exploration for Embedded HeterogeneousMultiprocessors” in Processor and System-on-Chip Simulation, Ed. R. Leupers, 2010
| 52
WHAT’S NEXT FOR DEEP LEANING AND AI?1st Winter: 1970sThe Lighthill report (published in 1973)
2nd Winter: 1987Perceptrons (2nd ed)Minsky & Papert
3rd Winter: 1993SVMVapnik & Cortes (1963)
4th Winter or Plateau of ProductivityOr winter due to energy ?
| 53
CONCLUSION: WE LIVE AN EXCITING TIME!
“The best way to predict the future is to invent it.”Alan Kay