
DEEP LEARNING TECHNOLOGIES FOR AI

Marc Duranton, CEA Tech (Leti and List)

Energy considerations of AI, and perspectives

Marc Duranton, CEA Fellow

22 September 2020

INS2I seminar: Integrated hardware-software systems and architectures for artificial intelligence

| 2

KEY ELEMENTS OF ARTIFICIAL INTELLIGENCE

Traditional AI: symbolic, algorithms, rules…

Analysis of “big data”: data analytics

ML-based AI: Bayesian, …, Deep Learning*

* Reinforcement Learning, One-shot Learning, Generative Adversarial Networks, etc.

From Greg S. Corrado, Google Brain team co-founder:
– “Traditional AI systems are programmed to be clever.”
– “Modern ML-based AI systems learn to be clever.”

AI: “…as soon as it works, no one calls it AI anymore.”

John McCarthy, who coined the term “Artificial Intelligence” in 1956

“AI is whatever hasn't been done yet”

D. Hofstadter (1980)

“Classical” approaches to reduce energy consumption

Optimization of energy in datacenters

Our focus today:
- Considerations on state-of-the-art systems (learning + inference)
- Edge computing (inference, federated learning, neuromorphic)

CONTEXT AND HISTORY: STATE-OF-THE-ART SYSTEMS

| 4

• ImageNet classification (Hinton’s team, hired by Google)
  • 14,197,122 images, 1,000 different classes
  • Top-5 error rate of 17% (a huge improvement) in 2012 (now ~3.5%)

• Facebook’s ‘DeepFace’ program (labs headed by Y. LeCun)
  • 4.4 million images, 4,030 identities
  • 97.35% accuracy, vs. 97.53% human performance

From: Y. Taigman, M. Yang, M.A. Ranzato, “DeepFace: Closing the Gap to Human-Level Performance in Face Verification”

“Supervision” network (2012): 650,000 neurons, 60,000,000 parameters, 630,000,000 synapses

These networks give state-of-the-art performance, e.g. in image classification

2012: DEEP NEURAL NETWORKS RISE AGAIN

The 2018 Turing Award recipients are Google VP Geoffrey Hinton, Facebook's Yann LeCun and Yoshua Bengio, Scientific Director of AI research center Mila.

| 5

| 6

COMPETITION ON IMAGENET !

Algorithm, date, error on the test set:
• Supervision (2012): 15.3%
• Clarifai (2013): 11.7%
• GoogLeNet (2014): 6.66%
• Human level (Andrej Karpathy): 5%
• Microsoft (05/02/2015): 4.94%
• Google (02/03/2015): 4.82%
• Baidu/DeepImage (10/05/2015): 4.58%
• Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (10/12/2015): 3.57% (a CNN with 152 layers!)
• Google Inception-v3, Arxiv (2015): 3.5%
• WMW, Momenta (2017): 2.2%
• Now?

| 7

SPECIALIZATION LEADS TO MORE EFFICIENCY

Source from Bill Dally (nVidia) « Challenges for Future Computing Systems » HiPEAC conference 2015

Type of device: energy / operation
• CPU: 1,690 pJ
• GPU: 140 pJ
• Fixed function: 10 pJ

| 8

2017: GOOGLE’S CUSTOMIZED HARDWARE… required to increase energy efficiency,

with accuracy adapted to the use (e.g. float16)

Google’s TPU2: training and inference on a 180 teraflops (float16) board (over 200 W per TPU2 chip, judging by the size of the heat sink)

| 9

" The need for TPUs really emerged about six years ago, when we started using computationally expensive deep learning models in more and more places throughout our products. The computational expense of using these models had us worried. If we considered a scenario where people use Google voice search for just three minutes a day and we ran deep neural nets for our speech recognition system on the processing units we were using, we would have had to double the number of Google data centers!"

[https://cloudplatform.googleblog.com/2017/04/quantifying-the-performance-of-the-TPU-our-first-machine-learning-chip.html]

DEEP LEARNING AND VOICE RECOGNITION

| 10

… required to increase energy efficiency with accuracy adapted to the use (e.g. float 16)

Google’s TPU2: 11.5 petaflops (float16) of machine learning number crunching (and a guess of about 400+ kW, i.e. 100+ GFLOPS (float16) per W)

Peta = 10^15 = a million billions. From Google.

2017: GOOGLE’S CUSTOMIZED HARDWARE…

| 11

From https://www.nextplatform.com/2018/05/10/tearing-apart-googles-tpu-3-0-ai-coprocessor/

GOOGLE’S CUSTOMIZED TPU (V3) HARDWARE…


| 12

2017: THE GAME OF GO

Ke Jie (human world champion of the game of Go), after being defeated by AlphaGo on May 27th, 2017, will work with DeepMind to turn AlphaGo into a tool that further helps Go players enhance their game.

| 13

ALPHAGO ZERO: SELF-PLAYING TO LEARN

From doi:10.1038/nature24270 (Received 07 April 2017)

| 14

ALPHAZERO: COMPUTING RESOURCES

Peta = 10^15 = a million billions. From Google DeepMind.

× 5,000 (first-generation TPUs) ≈ 200 kW*

* https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu

× 40 days… (a rough energy estimate is sketched below)

AlphaZero is a generic reinforcement learning algorithm

Follow-up is called MuZero (2019)
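As a rough back-of-the-envelope check on the figures above, a minimal sketch (assuming the ~200 kW quoted on the slide is sustained for the full 40 days; actual utilization surely varied):

    # Rough energy estimate for the AlphaGo Zero / AlphaZero self-play figures quoted above.
    power_kw = 200      # ~5,000 first-generation TPUs, as quoted on the slide
    days = 40           # duration of the longest training run quoted above
    energy_kwh = power_kw * 24 * days
    print(f"{energy_kwh:,.0f} kWh")   # ~192,000 kWh (~192 MWh) for one training run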

| 15

2019: OPENAI WINS DOTA 2 ESPORT GAME

OpenAI Five Defeats Dota 2 World Champions

OpenAI Five is the first AI to beat the world champions in an esports game, having won two back-to-back games versus the world champion Dota 2 team. It is also the first time an AI has beaten esports pros on livestream.

Cooperative mode

OpenAI Five’s ability to play with humans presents a compelling vision for the future of human-AI interaction, one where AI systems collaborate and enhance the human experience. Our testers reported feeling supported by their bot teammates, learning from playing alongside these advanced systems, and generally having a fun experience.

| 16

OpenAI Five Defeats Dota 2 World Champions

In total, the current version of OpenAI Five has consumed 800 petaflop/s-days and experienced about 45,000 years of Dota self-play over 10 realtime months (up from about 10,000 years over 1.5 realtime months as of The International), for an average of 250 years of simulated experience per day. The Finals version of OpenAI Five has a 99.9% win rate versus the TI (tournament) version.

2019: OPENAI WINS DOTA 2 ESPORT GAME

Peta = 10^15 = a million billions

| 17


2019: OPENAI WINS DOTA 2 ESPORT GAME

Giga = 10^9, Tera = 10^12

Green500 List - June 2020

Peta = 10^15 = a million billions

~ 38 MW

| 18

EXPONENTIAL INCREASE OF COMPUTING POWER FOR AI TRAINING

* https://blog.openai.com/ai-and-compute/

“Since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.5 month-doubling time (by comparison, Moore’s Law had an 18-month doubling period)*”
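To make the two doubling times easier to compare, a minimal sketch of the implied yearly growth factors (idealized exponentials, no plateau effects):

    # Yearly growth factors implied by the doubling times quoted above.
    doubling_ai_months = 3.5       # largest AI training runs (OpenAI figure)
    doubling_moore_months = 18     # Moore's-law reference used on the slide
    growth_ai = 2 ** (12 / doubling_ai_months)
    growth_moore = 2 ** (12 / doubling_moore_months)
    print(f"AI training compute: x{growth_ai:.1f} per year")    # ~x10.8 per year
    print(f"Moore's law:         x{growth_moore:.1f} per year") # ~x1.6 per year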

| 19

ALGORITHMIC IMPROVEMENT IS A KEY FACTOR DRIVING THE ADVANCE OF AI

* https://openai.com/blog/ai-and-efficiency/

“Since 2012 the amount of compute needed to train a neural net to the same performance on ImageNet1 classification has been decreasing by a factor of 2 every 16 months. *”

| 20

2020: OPENAI’S GPT-3, AN AUTOREGRESSIVE LANGUAGE MODEL WITH 175 BILLION PARAMETERS

* https://lambdalabs.com/blog/demystifying-gpt-3/

“GPT-3 shows that language model performance scales as a power law of model size, dataset size, and the amount of computation. GPT-3 demonstrates that a language model trained on enough data can solve NLP tasks that it has never encountered. That is, GPT-3 studies the model as a general solution for many downstream jobs without fine-tuning. The size of state-of-the-art (SOTA) language models is growing by at least a factor of 10 every year.*” GPT-3 175B was trained on 499 billion tokens:

“The GPT-3 175B model required 3.14×10^23 FLOPs of computing for training (hundreds of 10^21 FLOPs, i.e. about 87 hours on an exaflop (10^18 FLOP/s) machine, or 302 days on Tera-1000). Even at a theoretical 28 TFLOPS for a V100 and the lowest 3-year reserved cloud pricing we could find, this would take 355 GPU-years and cost $4.6M for a single training run. Similarly, a single RTX 8000, assuming 15 TFLOPS, would take 665 years to run.*”
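A minimal sketch of the arithmetic behind the figures quoted above (assuming the stated total compute and sustained, not peak, throughputs):

    # Back-of-the-envelope training-time estimates for GPT-3 175B, from the figures quoted above.
    total_flop = 3.14e23                      # total training compute quoted for GPT-3 175B

    def wall_clock_years(sustained_flop_per_s):
        """Years needed to deliver total_flop at a given sustained throughput."""
        return total_flop / sustained_flop_per_s / (365.25 * 24 * 3600)

    print(f"Exaflop machine:      {total_flop / 1e18 / 3600:.0f} hours")   # ~87 hours, as quoted
    print(f"One V100 (28 TF):     {wall_clock_years(28e12):.0f} GPU-years")  # ~355 GPU-years, as quoted
    print(f"One RTX 8000 (15 TF): {wall_clock_years(15e12):.0f} years")      # close to the ~665 years quoted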

| 21

HOUSTON, WE HAVE A PROBLEM…

| 22

From “Total Consumer Power Consumption Forecast”, Anders S.G. Andrae, October 2017

The problem: IT projected to challenge future electricity supply

Exponential power consumption

| 23

PROCESSOR ARCHITECTURE EVOLUTION

Moore’s law slow-down and the end of Dennard scaling (~2006) drive a progression of architectures (from Denis Dutoit):

• Mono-core architecture for single-thread performance: CPU, cache, memory, bus, NIC (Network InterConnect)
• ~2016: many-core architecture for parallelism: many cached cores, NoC + LLC, memory, NIC
• ~2026…: heterogeneous architecture for energy efficiency: generic processing plus HW accelerators, close and far memory, coherent links, data-centric interconnect, memory invading logic
• Disruptive architectures for data management: In-Memory Computing, Neuromorphic Computing, Quantum Computing

| 24

• Will only GAFAM or BAITX be able to train those gigantic systems?
• Use in transfer learning:
  • Once learnt, the system can be tuned for a particular application
  • This can be done with minimal computing resources (only the last layers to train)
  • The energy cost of learning is spread over the many usages and instances in inference
  • Like the NRE (non-recurring engineering) cost for building chips
• Will we have access to the resulting systems for adaptation?
• Is it possible to optimize those systems by post-processing so that they can be used for inference/transfer learning in embedded systems?

OPEN QUESTIONS…

EDGE COMPUTING

| 26

The computing continuum spans physical systems, smart sensors and the Internet of Things up to Big Data, Cloud/HPC, data analytics and cognitive computing, enabling new services. The key is cyber-physical entanglement: transforming data into information as early as possible, i.e. processing and understanding as soon as possible.

THE COMPUTING CONTINUUM

ENABLING EDGE INTELLIGENCE

True collaboration between edge devices and the HPC/cloud ➪ creating a continuum

Enabling intelligent data processing at the edge: fog computing, edge computing, stream analytics, fast data…

Intelligent sensors (pre)processing data locally

| 27

This initiative currently involves: ETP4HPC, ECSO, BDVA, 5GIA, EUMATHS, CLAIRE, HiPEAC… More to come…

Edge processing, Fog, …

| 28

“Should I brake?”

“Transmission error, please retry later”

The system should be autonomous, able to make good decisions in all conditions

Embedded intelligence needs local high-end computing

Safety will require that basic autonomous functions do not rely on being “always connected” or “always available”

And they should not consume most of the power (and space!) of an electric car!

| 29

N2D2, NEURAL NETWORK DESIGN & DEPLOYMENT

Design flow: data conditioning and learning & test databases → modeling, learning, test → optimization → trained DNN / ONNX model → quantization → code cross-generation → code execution on target.

Code execution on target:
• COTS: STM (STM32, ASMP), Nvidia (all GPUs), Renesas (RCar), Kalray (MPPA)
• Custom accelerators: ASIC (PNeuro), FPGA (DNeuro)
• Software DNN libraries: C/OpenMP (+ custom SIMD), C++/TensorRT (+ CUDA/CuDNN), C++/OpenCL, C/C++ (+ custom API), Assembly
• Hardware DNN libraries: C/HLS, custom RTL

Considered criteria:
• Applicative performance metrics
• Memory requirement
• Computational complexity

Quantization: 2 to 8 bit integers + rescaling
• Based on the dataset distribution
• Validation of the quantized applicative performance metrics

A unique platform for the design and exploration of DNN applications with:
+ Easy database access to facilitate the learning phase
+ Automatic code generation for various hardware targets thanks to the export modules
+ Multiple comparison criteria: latency, hardware cost, memory need, power consumption…
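As an illustration of the post-training quantization step described above, a minimal sketch (not N2D2’s actual API; the symmetric int8 scheme and the percentile-based scale are assumptions):

    import numpy as np

    def quantize_int8(values, calibration_data):
        """Quantize float values to int8 with a scale derived from the dataset distribution."""
        # Assumption: symmetric quantization, range taken from the 99.9th percentile of |x|
        max_abs = np.percentile(np.abs(calibration_data), 99.9)
        scale = max_abs / 127.0
        q = np.clip(np.round(values / scale), -128, 127).astype(np.int8)
        return q, scale   # keep the scale to rescale results after integer MACs

    # Example: quantize a (random, illustrative) weight tensor using its own distribution
    weights = np.random.randn(64, 128).astype(np.float32)
    q_weights, w_scale = quantize_int8(weights, weights.ravel())
    print(q_weights.dtype, round(float(w_scale), 4))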

From dumb sensors to smart sensors: streaming and distributed data analytics

Bandwidth (and cost) will require more local processing

If you need a response in less than 1 ms, the server has to be less than 150 km away
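A minimal sketch of that latency bound (assuming propagation at the speed of light in vacuum and ignoring any processing or queuing time; real networks are slower):

    # Upper bound on server distance for a given round-trip latency budget.
    C_KM_PER_S = 300_000            # speed of light in vacuum, ~3e5 km/s (fibre is ~30% slower)

    def max_distance_km(round_trip_s):
        """One-way distance if the whole budget were spent on propagation (best case)."""
        return C_KM_PER_S * round_trip_s / 2

    print(max_distance_km(1e-3))    # 150.0 km for a 1 ms round trip, as quoted above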

Fog computing, federated learning

EDGE INTELLIGENCE NEEDS LOCAL HIGH-END COMPUTING

| 31

Privacy will require that some processing is done locally, and that the data is not sent to the cloud.

Example: detecting elderly people falling in their home

With minimum power and wiring!

Embedded intelligence needs local high-end computing

| 32

• Ultra-low-power local processing detecting a person lying on the floor in a room

• Raw data (before post-processing): standing, crouching, lying

Detecting elderly people falling in their home

| 33

ENERGY EFFICIENT PROGRAMMABLE H/W ACCELERATOR FOR DEEP LEARNING: PNEURO

• Designed for deep neural network processing chains
• Also supports traditional image processing chains (filtering, etc.): no need for external image processing
• Clustered and modular SIMD architecture
  − Optimized operators for MAC & non-linearity approximation
  − Optimized memory accesses to perform efficient data transfers to operators
  − ISA including ~50 instructions (control + computing)
• Can be sized to fit the best area/performance per application
• 571 GMACs/s/W/cluster

Target, frequency, performance, energy efficiency:
• Quad ARM A7: 900 MHz, 480 images/s, 380 images/s/W
• Quad ARM A15: 2,000 MHz, 870 images/s, 350 images/s/W
• Tegra K1: 850 MHz, 3,550 images/s, 600 images/s/W
• PNeuro (FPGA): 100 MHz, 5,000 images/s, 2,000 images/s/W
• PNeuro (ASIC): 500 MHz, 25,000 images/s, 125,000 images/s/W
• PNeuro (ASIC): 1,000 MHz, 50,000 images/s, 125,000 images/s/W

PNeuro ASIC in 28nm FD-SOI: 1.8 TOPS/W @ 500 MHz, 0.4 mm², 6 mW
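A small sketch of what these throughput and efficiency figures imply in terms of power draw and energy per classified image (derived directly from the table above):

    # Power and energy per image implied by the (images/s, images/s/W) figures above.
    targets = {
        "Quad ARM A7":            (480,    380),
        "Tegra K1":               (3_550,  600),
        "PNeuro (FPGA)":          (5_000,  2_000),
        "PNeuro (ASIC, 500 MHz)": (25_000, 125_000),
    }
    for name, (ips, ips_per_w) in targets.items():
        power_w = ips / ips_per_w        # watts needed to sustain that throughput
        energy_uj = 1e6 / ips_per_w      # microjoules per image = 1 / (images/s/W)
        print(f"{name:>24}: {power_w:5.2f} W, {energy_uj:8.1f} uJ/image")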

One cluster of the P-Neuro NPU: a cluster controller, NeuroCores and their elementary computing units, connected through the cluster interconnect. The table above gives the performance of an NPU with 4 clusters.

The PNeuro Neural Processing Unit (ASIC @ 1 GHz) is 14 times faster and 200 times more energy efficient than an embedded GPU

Entirely developed in Europe

| 34

PNEURO IN A ULTRA LOW POWER SOC: SAMURAI

SAMURAI ARCHITECTURE OVERVIEW

Main blocks: an always-responsive sub-system (wake-up controller (WuC), wake-up radio (WuR) with its DBB decoder and synchronization, ULV TP SRAM, SPI_m, GPIO, I2C) and an on-demand sub-system (RISC-V CPU, FLL, SRAM, PNeuro AI accelerator with its SRAM, security IPs, GPIO, UART), connected through an interconnect.

• Always-responsive sub-system running @ 0.4 V
• Ultra-fast wake-up time with ultra-low-power modes
• WuR with DBB decoder to wake up on radio signaling
• 8 KB of ULV TP SRAM
• AI accelerator with embedded 256 KB SRAM
• Embedded security

| 35

• Technology: ST 28nm FDSOI
• Die area: ~4.52 mm² (2564 µm x 1764 µm)
• Memories:
  • 424 KB SRAM (incl. 32 KB with retention mode, 264 KB in PNeuro)
  • 8 KB TPSRAM
• 5 power domains
• 64-pin QFN package
• Performance:
  • 100 MHz @ 0.7 V (220 MHz @ 1.0 V)
  • PNeuro: 6.4 GMAC/s @ 100 MHz
  • 5,895 image classifications per second @ 100 MHz

SAMURAI MAIN FIGURES

Ref: 2020 Symposia on VLSI Technology and Circuits

| 36

SOLVING THE ENERGY CHALLENGE: COST OF MOVING DATA

Source: Bill Dally, « To ExaScale and Beyond », www.nvidia.com/content/PDF/sc_2010/theater/Dally_SC10.pdf

Avoiding data movement: computing in/near memory

x800 more!

| 37

HOW TO REDUCE ENERGY AND POWER

A “classical” system: image sensor, then “sequential” transmission to a processor.
x2 resolution ➔ x2 transmission frequency f. Dynamic power P = C·V²·f, but a lower f allows a lower V.

Using an energy-efficient technology: FDSOI (Fully Depleted Silicon On Insulator)

LETI’s ISSCC 2014 demonstrator:
• Fclk < 2.6 GHz @ 1.3 V
• Fclk < 450 MHz @ 0.39 V
• x10 less power at 0.39 V (at the same frequency)

| 38

Image sensor with many simpler processors, each running at a lower f (they have less to do): P = n·C·V²·(f/n), and the lower f allows a lower V.
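A minimal numerical sketch of why this parallelization helps (the voltage values are illustrative assumptions, not measured figures; real designs are limited by how far V can actually be lowered):

    # Dynamic power P = C * V^2 * f: one fast processor vs. n slower ones at the same throughput.
    C = 1e-9                     # switched capacitance (F), identical in both cases
    f = 1e9                      # single-processor frequency (Hz)
    V_fast, V_slow = 1.0, 0.6    # assumed supply voltages at f and at f/n
    n = 4                        # number of simpler processors

    p_single   = C * V_fast**2 * f               # one processor at full frequency
    p_parallel = n * C * V_slow**2 * (f / n)     # n processors at f/n, at a lower voltage
    print(p_single, p_parallel, round(p_single / p_parallel, 1))   # ~2.8x less power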

HOW TO REDUCE ENERGY AND POWER

| 39

• Power per pixel reduced by over 90% compared to the best-in-class low-power IR product (7.03 µW/pixel => 0.54 µW/pixel)
• 13 times less power for the same system functionality

• Dedicated algorithms for:
  • Presence sensing
  • Localization
  • People counting

• Privacy respected:
  • Pre-processing on the sensor
  • Video is not transmitted

IR SENSOR WITH BUILT-IN 1ST LEVEL IMAGE PROCESSING

Adding 128 RISC processors to the image sensor (images shown: synthetic data)

| 40

SMART RETINA USING 3D STACKING

Stack: lens, sensor layer (back-side illumination, L1) on top of a processing layer (L2), mounted on a passive interposer or PCB.

Advantages:
• Low-frequency processors
• Direct pixel ➔ processor interconnect
➪ energy-efficient system
➪ very fast pixel processing
➪ perfect scalability, no size limitation due to bandwidth
➪ complex processing in the sensor
➪ suitable for privacy-sensitive applications
• Each pixel can be processed independently ➪ opening a new range of algorithms
• Heterogeneous integration possible

| 41

SMART RETINA USING 3D STACKING

• Demonstrator “Retine”
  • L1 @ 130 nm / L2 @ 130 nm
  • Die size: 160 mm²
• Image sensor:
  • 192x256 @ 5,500 fps or 768x1024 @ 60 fps
  • 12 µm pixel, 75% fill factor
• 192 processors (3,072 PE)
  • Processing: 72 GOPS, 11.7 MOPS/mW
  • 1k instructions / pixel @ 1,000 fps
  • Distributed memory
  • Each processor can execute a different code in a set

x100 computing power, x10 energy efficiency, processing latency divided by 15

| 42

RETINE: A 3D STACKED SMART RETINA

Note: the board is only an interface used to drive an HDMI display and to debug the chip. All processing is done inside the retina.

| 43

• The promises of spike-coding NNs:
  • Reduced computing complexity ➔ replace a 32-bit MAC with an 8-bit ACC per operation
  • Natural spatial and temporal parallelism
  • Simple and efficient performance tunability
  • Spiking NNs best exploit NVMs, for massively parallel synaptic memory

SPIKE CODING OF DNN FOR IMPROVED EFFICIENCY

Energy per operation comparison (estimate for 45 nm, 0.9 V)*:
• “Standard” MAC: 16-bit MULT 1.17 pJ (interpolated) + 32-bit ADD 0.1 pJ = 1.27 pJ total
• Spiking ACC: 8-bit ADD 0.03 pJ total
➔ 42x reduction

* Mark Horowitz, “Computing’s Energy Problem (and what we can do about it)”, Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International

1) Standard CNN topology, offline learning: input digit (24x24 pixels) → conv. layer (16 kernels of 4x4 synapses, 16 maps of 11x11 neurons) → conv. layer (90 kernels of 5x5 synapses, 24 maps of 4x4 neurons) → FC layer (150 neurons) → output (10 neurons), trained against the correct output (layers 1 to 4).

2) Lossless spike transcoding: rate-based input coding, where pixel brightness is mapped to a spiking frequency between fMIN and fMAX over time.
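A minimal sketch of such rate-based input coding (an illustration of the principle only, not the exact transcoding used in N2D2; fMIN, fMAX and the regular spike spacing are assumptions):

    import numpy as np

    def rate_code(pixel, f_min=10.0, f_max=200.0, duration=0.1):
        """Map a pixel brightness in [0, 1] to a regular spike train (spike times in seconds)."""
        freq = f_min + pixel * (f_max - f_min)     # brighter pixel -> higher spiking frequency
        n_spikes = int(freq * duration)
        return np.linspace(0.0, duration, n_spikes, endpoint=False)

    print(len(rate_code(0.0)), len(rate_code(1.0)))   # 1 spike vs. 20 spikes over 100 ms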

| 44

• Sparsity: results on the MNIST and GTSRB datasets

➔ Best reported* spike recognition rate, with 50% fewer operations and >40x energy efficiency per operation for MNIST!
➔ Tunable error rate vs. compute time

SPIKE-CODING RESULTS

Performance vs. computing time tunability on GTSRB (approximate computing)

Standard vs. spiking (score and MACs vs. score and ACCs, with spikes/MAC in parentheses):

MNIST:
• ReLU (standard): 99.42%, 126k MACs
• Spiking, ΔS = 2: 98.24%, 40.30k ACCs (0.32)
• Spiking, ΔS = 3: 99.33%, 51.32k ACCs (0.41)
• Spiking, ΔS = 4: 99.42%, 63.02k ACCs (0.50)

GTSRB:
• ReLU (standard): 98.42%, 818k MACs
• Spiking, ΔS = 5: 98.18%, 1.63M ACCs (2.00)
• Spiking, ΔS = 10: 98.38%, 2.30M ACCs (2.82)
• Spiking, ΔS = 15: 98.39%, 2.87M ACCs (3.51)

N2D2: first public deep learning framework to integrate spiking DNNs transcoding and simulation

[Plot: test error rate (%) and ACCs (spikes/MAC) as a function of the decision threshold ΔS, on GTSRB]

*Previous state-of-the-art: D. Neil, M. Pfeiffer, and S.-C. Liu, “Learning to be efficient: algorithms for training low-latency, low-compute deep spiking neural networks”, in ACM SAC, 2016

Standard DNN vs. spiking DNN:
• Quantization: yes / yes
• Approximate computing: no / yes
• Base operation: multiply-accumulate (MAC) / accumulate only
• Activation function: non-linear function / threshold (+ refractory period*)
• Parallelism: spatial / spatial and temporal
• Memory reuse: yes / no

*Not required for the ReLU activation function

| 45

SPIKING NEURAL NETWORKS
• Communication: Address Event Representation (AER)
• Spiking neurons: Leaky Integrate-and-Fire (LIF)
• Unsupervised learning: Spike Timing Dependent Plasticity (STDP)
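As an illustration of the leaky integrate-and-fire neuron mentioned above, a minimal discrete-time sketch (the time constant, threshold and reset rule are generic textbook choices, not the parameters of any CEA design):

    import numpy as np

    def lif_neuron(input_current, dt=1e-3, tau=20e-3, v_thresh=1.0, v_reset=0.0):
        """Discrete-time leaky integrate-and-fire: leak, integrate the input, fire on threshold."""
        v, spike_times = 0.0, []
        for t, i_in in enumerate(input_current):
            v += dt / tau * (-v + i_in)     # leaky integration of the input current
            if v >= v_thresh:               # threshold crossing -> emit a spike
                spike_times.append(t * dt)
                v = v_reset                 # reset the membrane potential after the spike
        return spike_times

    print(len(lif_neuron(np.full(1000, 1.5))))   # constant supra-threshold input -> regular spiking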

| 46

NEW ELEMENT: RRAM AS SYNAPSES

• PCM (thermal effect): GST, GeTe, GST + HfO2
  M. Suri et al., IEDM 2011; M. Suri et al., IMW 2012, JAP 2012; O. Bichler et al., IEEE TED 2012; M. Suri et al., EPCOS 2013; D. Garbin et al., IEEE Nano 2013

• CBRAM (electrochemical effect): Ag / GeS2

• OXRAM (electronic effect, oxygen vacancies): TiN/HfO2/Ti/TiN
  D. Garbin et al., IEDM 2014; D. Garbin et al., IEEE TED 2015

Analog computing: using some aspects of physical phenomena to model the problem being solved

| 47

MONOLITHIC 3D PRINCIPLE

CMOS/CMOS, 14 nm vs. 2D: area gain = 55%, performance gain = 23%, power gain = 12%

LETI, DAC 2014

| 48


RRAM stacks (HfO2 / Ti / TiN) integrated between metal levels.

Short-term structure:
➔ RRAM on the top level to avoid contamination issues
➔ Reuse of existing masks, plus e-beam, to build the 1T1R cells
➔ “Synapses” are integrated in the very fabric of communication

Only 1 base e-beam level is required for the RRAM definition. The RRAM is based on HfO2/Ti/TiN low-temperature materials (~350°C), so there is no critical problem integrating it on the top level.

3D INTEGRATION COUPLED WITH RRAM FOR SYNAPTIC WEIGHTS

Managing complexity

Cognitive solutions for complex computing systems:
• Using AI and optimization techniques for computing systems
• Creating new hardware
• Generating code
• Optimizing systems
• Similar to generative design in mechanical engineering

| 50

USING AI FOR MAKING CPS SYSTEMS: “GENERATIVE DESIGN” APPROACH

Motorcycle swingarm: the piece that hinges the rear wheel to the bike’s frame

The user only states the desired goals and constraints
-> The complexity wall might prevent explaining the solution

“Autodesk”

| 51

• Ne-XVP project – Follow-up of the TriMedia VLIW (https://en.wikipedia.org/wiki/Ne-XVP )

• 1,105,747,200 heterogeneous multicores in the design space

• 2 million years to evaluate all design points

• AI-inspired techniques reduced the design-space exploration time to only a few days

=> x16 performance increase

EXAMPLE: DESIGN SPACE EXPLORATION FOR DESIGNING MULTI-CORE PROCESSORS1 (2010)

1 M. Duranton et al., “Rapid Technology-Aware Design Space Exploration for Embedded Heterogeneous Multiprocessors”, in Processor and System-on-Chip Simulation, Ed. R. Leupers, 2010

| 52

WHAT’S NEXT FOR DEEP LEARNING AND AI?

1st winter: 1970s. The Lighthill report (published in 1973).

2nd winter: 1987. Perceptrons (2nd ed.), Minsky & Papert.

3rd winter: 1993. SVM, Vapnik & Cortes (1963).

4th winter, or plateau of productivity? Or a winter due to energy?

| 53

CONCLUSION: WE LIVE AN EXCITING TIME!

“The best way to predict the future is to invent it.” Alan Kay

| 54