Marc DurantonHiPEAC vision coordinator
CEA FellowCommissariat à l’énergie atomique et aux énergies alternatives
High-Performance and EmbeddedArchitecture and Compilation
HPC in the Loop and Cyber-Physical
Systems
Post-H202 vision for HPC workshop Sunday, 24 June 2018, Frankfurt
| 3
HiPEAC's mission is to steer and increase the European research in the area of high-performance and embedded computing systems,
And stimulate cooperation between:a) academia and industry b) computer architects and tool builders.
HiPEAC=
High-Performance and Embedded Architecture and Compilation
| 4
HIPEAC HISTORY
2005 2006 2007 2008 2009 2010 2011 2012 2013
HiPEAC1
HiPEAC2
HiPEAC3
2014 2015 2016 2017
HiPEAC4
HiPEAC5
2018 2019
| 5
MEM
BER
SHIP
Associated members: 76 Total: 1496
13 partners, 522 members, 99 associated members, 423 affiliated members and 855 affiliated PhD students from 363 institutions in 40 countries.
Membership is free of charge.
hipeac.net/members/stats/map
| 6
HIPEAC STRUCTURE
WP1 Growing the communities
WP2 Connecting the communities
WP3 Dissemination
WP4 Roadmapping
Management• Membership management• Growing the industrial community• Growing the innovator community• Growing the stakeholder community• Growing the new member states membership
• Conference• ACACES summer school• Computing systems weeks• Stimulating collaboration• HiPEAC Jobs
• Consultation meetings• HiPEAC Vision 2019• Disseminating the HiPEAC Vision
• Project management• Financial management• Industrial Advisory board
• Communications• Road show• Awards• Website
| 7
HIPEAC STRUCTURE
WP1 Growing the communities
WP2 Connecting the communities
WP3 Dissemination
WP4 Roadmapping
Management• Membership management• Growing the industrial community• Growing the innovator community• Growing the stakeholder community• Growing the new member states membership
• Conference• ACACES summer school• Computing systems weeks• Stimulating collaboration• HiPEAC Jobs
• Consultation meetings• HiPEAC Vision 2019• Disseminating the HiPEAC Vision
• Project management• Financial management• Industrial Advisory board
• Communications• Road show• Awards• Website
| 8
The last HiPEAC Vision Document was published in January 2017.The next version is on-going (expect printed version for beginning 2019)One of its aim is to drive the community and to help defining the
next European calls in ICT.
THE HIPEAC VISION
2009 20112008 2013 2015 2017
The 2017 version is available at:http://hipeac.net/vision
| 9
HiPEAC Vision
| 11
“The best way to predict the future is to invent it.”
Alan Kay
| 12
OUTLINE
Algorithm
Hardware
Language
Data
Applications
| 13
OUTLINE
Algorithm
Hardware
Language
Data
Applications
| 14
Entering in Human and Machine collaboration era
| 15
Artificial Intelligence is changing the man-machine interaction – natural interfaces, ”intelligent” behavior
• Image and situation understanding• Voice recognition and synthesis• Direct interfacing with the world• Creating the bridge between cyber and real world• …decision taking…
ENABLED BY ARTIFICIAL INTELLIGENCE (AND DEEP LEARNING)
| 16
New services
Smart sensors
Internet of Things
Big Data
Data Analytics / Cognitive
computing
Cloud / HPC
Physical Systems
Transforming data into information as early as possible
Cyber Physical Entanglement
Processing,Abstracting
Understandingas soon as
possible
C2PS: COGNITIVE ( CYBERNETIC* AND PHYSICAL ) SYSTEMSENABLING EDGE INTELLIGENCE
* As defined by Norbert Wiener: how humans, animals and machines control and communicate with each other.
True collaboration between edge devices and the HPC/cloud ensuring:- Data security
/ Privacy- Lower
bandwidth- Better use of
HPC/cloud
Enabling Intelligent data processing at the edge:
Fog computingEdge computingStream analytics
Fast data…
| 17
LOOKING FORWARD… EXAMPLE OF A CPS SYSTEM
Direct Brain Computer Interface (BCI)
Here allowing a paraplegic to walk again…
One current limitation: Required processing power – need supercomputer in a box
From CEA-Clinatec
| 18
Cyber-physical applications will require high performance level of computing – HPC in a Box – .
System should be autonomous to make good decisions in all conditions
Embedded intelligence needs local high-end computing
Safety will impose that basic autonomous functionsshould not rely on “always connected” or “always available”
| 19
System should be autonomous to make good decisions in all conditions
Safety will impose that basic autonomous functionsshould not rely on “always connected” or “always available”
Embedded intelligence needs local high-end computing
25 years1985 2010
Current embedded systems are HPC systems of few decades ago
| 20
CYBER-PHYSICAL SYSTEMS
Actions
Observations
Imposes the “timing”
Environment
| 21
1948: NORBERT WIENER
| 22
BUT COMPUTING SYSTEMS WERE NOT DESIGNED FOR CPS SYSTEMS
In nearly all hardware and software of computing systems:Time is abstracted or even not present at all
Very few programming languages can express time or timing constraintsAll is done to have the best average performance, not predictable performances
Caches, out of order execution, branch prediction, speculative execution, …(Hidden) compiler optimization, call to (time) unspecified libraries
Energy is also left out of scopeThis can have impact on data movement, optimizationsWith predictability: On-Time processing
Interaction with external world are second priorities vs. computation
Done with interrupts (introduced as an optimization, eliminating unproductive waiting time in polling loops) which were design to be exceptional events…
Etc.
| 23
Simulation Data
MachineLearning
3 PILLARDS OF FUTURE HPC
Intertwined with CPS requirements
| 24
Simulation
Algorithm
SIMULATION
Specialist
Parameters
Environment
Error?
Data
SIMULATION = MODEL FOR PREDICTION
• Human knowledge defines the processing
• Large algorithmic complexity• High precision floating point• Large set of output data• Von Neumann architecture
| 25
FROM CPS TO DIGITAL TWIN
Actions
Observations Environment
Observations
Error?Simulation
SIMULATION SHOULD BE FASTER THAN REALITY FOR PREDICTION
| 26
BIG DATA
Observations
Big DataData Analytics
Specialist
SpecialistModel
Environment
• Human knowledge refines the processing
• Large set of input data• Mappable on Von Neumann
Architecture
| 27
MACHINE LEARNING(DEEP LEARNING)LEARNING PHASE
Learning phase
Labelled data set
Specialist
• Human defines the learning data set, not the algorithm
• Large set of input data for learning phase
• Low precision floating point• Large number of operations• (Stochastic) gradient descent
| 28
MACHINE LEARNING(DEEP LEARNING)
INFERENCE PHASE
Inference phase
• Low precision arithmetic• Medium number of
operations• Co-location computing and
storage (“computing in memory”)
• Should satisfy the application non-functional requirements
Environment
| 29
DEEP LEARNING AND VOICE RECOGNITION
| 31
… required to increase energy efficiency with accuracy adapted to the use (e.g. float 16)
Google’s TPU2 : 11.5 petaflops16 of machine learning number crunching (and guessing about 400+ KW…, 100+ GFlops16/W)
Peta = 1015 = million of milliardFrom Google
2017: GOOGLE’S CUSTOMIZED TPU HARDWARE…
| 33
REINFORCEMENT LEARNING:
Actions
Observations
Respond to action
Rewards
Goal
Environment
Agent
Learns to maximize rewards
DYNAMIC PROGRAMMING + DEEP LEARNING
| 34
ALPHAGO ZERO: SELF-PLAYING TO LEARN
From doi:10.1038/nature24270 (Received 07 April 2017)
| 35
ALPHAZERO: SELF-PLAYING TO LEARN
REINFORCEMENT LEARNING: DYNAMIC PROGRAMMING + DEEP LEARNING
| 36
ALPHAZERO: COMPUTING RESOURCES
Peta = 1015 = million of milliard
From Google Deepmind
| 37
REINFORCEMENT LEARNING: DYNAMIC PROGRAMMING + DEEP LEARNING
Goal
Actions
Observations
Rewards
Respond to action
Environment
Agent
Learns to maximize rewards
• Mixed precision arithmetic• Very high number of operations• Large internal data manipulation• Mainly co-location computing and
storage (“computing in memory”)• High level of parallelism• Minimization of energy functions
| 38
REINFORCEMENT LEARNING: DYNAMIC PROGRAMMING + DEEP LEARNING
Goal
Actions
Observations
Rewards
Respond to action
Agent
Learns to maximize rewards
| 39
OUTLINE
Algorithm
Hardware
Language
Data
Applications
| 40
WHAT WILL BE THE NEXT TECHNOLOGY?
And after CMOS?
| 43
22FD
28nm
14nm
10nm
7nm
5nm
Next Gen
FinFET
Mechanical switches
Hyb
rid
logi
c Steep slope devices
Si Quantum bits
Disruptive scaling
Alternative to scaling and diversification
Monolithic 3D for 3D VLSI
25nm TBOX
20nm LG ISPD SiCRSD
Si channel
2017
2018
25nm TBOX
20nm LG ISPD SiCRSD
Si channel
12FDFDSOI
Technology evolutionSilicon Quantum bits
Non planar / trigate / stacked Nanowires
| 44
New technologies• Quantum computing• Printed/flexible electronics• Carbon nanotubes• Photonics• Neuro inspired (nano) technologies• Reservoir computing• Adiabatic computing• MEMS for computing• Synthetic biology, blob computing• Swarm computing• Symbiotic computing• Analogue/physic/hybrid computing• …
EXPLORE NEW WAYS AS ALTERNATIVE TO SILICON
| 45
NON VOLATILE MEMORIES
PCMGSTGeTeGST + HfO2
CBRAMAg / GeS2
OXRAM
TiN/HfO2/Ti/TiNThermal effect Electrochemical
effectElectronic effect oxygen vacancies
MRAM Magnetic effect• Can change the structure of
memory hierarchy?+ 64/128 addressing scheme⇒ Do we still need files?⇒ Direct access of objects
| 46
From “Total Consumer Power Consumption Forecast”, Anders S.G. Andrae, October 2017
The problem:IT projected to challenge future electricity
supply
| 49
COST OF MOVING DATA -> COMPUTING IN MEMORY
Source: Bill Dally, « To ExaScale and Beyond »www.nvidia.com/content/PDF/sc_2010/theater/Dally_SC10.pdf
| 50
Neuram3 1st
chipIBM True
NorthTechnology 28 nm FDSOI 28nm CMOSSupply Voltage 1 V 0.7VNeuron Type Analog DigitalNeurons per core 256 256Core Area 0.36 mm2 0.094 mm2
Computation Parallel processing
Time multiplexing
Fan In/Out 2k/8k 256/256Synaptic Operation per Second per Watt
300 GSOPS/W*1
46 GSOPS/W
Energy per synaptic event <2 pJ*2 10 pJEnergy per spike <0.375 nJ*3 3.9 nJ
∗ 1 At 100Hz mean firing rate, by appending 4 local-core destinations per spike, 400 k events will be broadcast to 4 cores with 25% connectivity per event. 400 k x 1 k x 25% / 300 μ W = 300 GSOPS/W∗ 2 In case of 25% match in each core, energy per synaptic event = energy per broadcast / (256*25%) =120pJ/64 = 2 pJ∗ 3 Energy per spike = total power consumption / spikes numbers = 300 uW/800 k = 0.375 nJ
NEUROMORPHIC ACCELERATOR:DYNAPS-SL (INI-ZURICH)
| 52
OFF-CHIP PHOTONICS
Off board: AOC, optical modules Off chip: Optical I/O Time
S1
Chip B
Chip C
Opt
ical
Tra
nsce
iver
Chip D
ICSi interposer or laminate substrate
Driver / TIA
IC
Micro-pillars
PIC
FiberFerrule
PCB
Photonics: cost in sending information, nearly nothing in transmission
| 53
IN-PACKAGE PHOTONICS
Off board: AOC, optical modules Off chip: Optical I/O Optical network
in package
S1
Opt
ical
Tra
nsce
iver
Chip C
Chip A
Chip D
Chip B
RAMComputing Cores
Photonic Interposer
Tx/Rx Integr. Rx/Tx
Substrate
photodiode modul
Laser
ThroughSilicon Via
RF Cu pillars
Power Power Power Power
Light source
Primary I/OCu pillars
Digital Cu pillars& proximity lines
Thermal Dissipation
Thermal Dissipation
SignalSignal
S2
| 54
D-Wave System Hardware
Daniel Vert - [email protected]
The system solves only one binary optimization problem :
Embedding a realistic problem instance :Physical qubits on each colored pathrepresent one logical qubit
Problem: Qubit efficiency !Possible solution : Search for an efficient operating path for an adiabatic quantum computer
Adiabatic quantum computation (Farhi et al. in 2000) as an alternative quantum model for solving NP-hard optimization problems considered classically insoluble !
https://web.eecs.utk.edu/~mclennan/Classes/494-594-UC-F16/presentations/D-Wave.pdf
Polynomial transformation of NP-hard problemsfor Quantum annealers
Slide from Christian Gamrat
| 55
• Importance of studying the full stack system architecture to connect quantum devices with conventional computing systems• Hardware level• System stack
QUANTUM HARDWARE ARCHITECTURE AND SYSTEM
Overview of a quantum computer architecture. Pink is quantum, Green is the interface and Blue parts are
conventional
“Fu et al - 2016 - A Heterogeneous Quantum Computer Architecture, Proc. ACM, 2016 .
The system stack from low level quantum hardware to high level algorithms
From work by QuTech @TU-DelftSlide from Christian Gamrat
| 56
Maurand et al, Nature Com., Jul. 2016.
FUSING PARADIGMS AT HARDWARE LEVEL
Coordination EngineVon Neumann styleCMOSTechnology
binary data
GraphicalEngine
DecisionEngine
NeuroEngine
QuantumEngine
SemanticalEngine
NumericalEngine
CMOS Substrate
Neuroengine
Quantumengine
Graphicsengine
Physical and logical interface layer
NumericalEngine
At the hardware level, the good old Von Neumann/ CMOS partnership can act as a computing substrate
• Acting as coordination / communication node• Allowing Hardware / Software integration
D. Roclin et al, IEEE NanoArch, 2014.
Qubits on Silicon
NVM Synapses on Silicon
Slide from Christian Gamrat
| 57
COORDINATE PARADIGMS AT SOFTWARE LEVEL
Valiron et al., “Programming the Quantum Future,” Commun. ACM, vol. 58, no. 8, pp. 52–61, Jul. 2015.
• A sequential program (running on the coordinator) distributes tasks to engines• Accelerator as services (J. Cong)
• Tasks are distributed to computing engines• Instruction streams• Supervised Learning• Unsupervised Learning• Quantum Engine
• Why would we want that?• Domain specific expressiveness• Run hard tasks, speedup NP problems• Provide cognitive functions
Fu et al. Proc. ACM, May 2016
Stefanini et al. Frontiers NeuroinformaticsAugust 2014
| 58
OUTLINE
Algorithm
Hardware
Language
Data
Applications
PARALLELISM AND SPECIALIZATION ARE NOT FOR FREE…
Frequency limit parallelism
Energy efficiency heterogeneity
Ease of programming
PARALLELISM AND SPECIALIZATION ARE NOT FOR FREE…
Frequency limit parallelism
Energy efficiency heterogeneity
Ease of programming
PARALLELISM AND SPECIALIZATION ARE NOT FOR FREE…
Frequency limit parallelism
Energy efficiency heterogeneity
Ease of programming
Managing complexity
Cognitive solutions for complex computing systems:• Using AI and optimization
techniques for computing systems• Creating new hardware• Generating code• Optimizing systems
• Similar to Generative designfor mechanical engineering
| 64
USING AI FOR MAKING CPS SYSTEMS: “GENERATIVE DESIGN” APPROACH
Motorcycle swingarm: the piece that hinges the rear wheel to the bike’s frame
The user only states desired goals and constraints-> The complexity wall might prevent explaining the solution
“Autodesk”
| 65
• Ne-XVP project – Follow-up of the TriMedia VLIW (https://en.wikipedia.org/wiki/Ne-XVP )
• 1,105,747,200 heterogeneous multicores in the design space
• 2 millions years to evaluate all design points
• AI inspired techniques allowed to reduce the induction time to only few days
=> x16 performance increase
EXAMPLE: DESIGN SPACE EXPLORATION FOR DESIGN MULTI-CORE PROCESSORS1 (2010)
1 M. Duranton et all., “Rapid Technology-Aware Design Space Exploration for Embedded HeterogeneousMultiprocessors” in Processor and System-on-Chip Simulation, Ed. R. Leupers, 2010
| 66
AUTOML AND OTHER PROGRAM GENERATORS
| 67
“Neural Architecture Search”, using a recurrent neural network to compose neural network architectures using reinforcement learning on CIFAR-10 (character recognition)
2017: GOOGLE; USING DEEP LEARNING TO DESIGN DEEP LEARNING
From arXiv:1611.01578v2, Barret Zoph, Quoc V. LeGoogle Brain
| 68
• Describing what the program should accomplish, rather than describing how to accomplish it as a sequence of the programming language primitives.
• For example, describe the concurrency of an application, not how to parallelize the code for it.
• (Good) compilers know better about architecture than humans, they are better at optimizing code…
PROGRAMMING 2.0: LET THE COMPUTER DO THE JOB:
| 69
CONCLUSION: WE LIVE AN EXCITING TIME!
| 70
Centre de Grenoble17 rue des Martyrs
38054 Grenoble Cedex
Centre de SaclayNano-Innov PC 172
91191 Gif sur Yvette [email protected]
Thank you for your attention