Download pdf - HPC in the Loop and Cyber-Physical Systems - Post H2020... · Creating the bridge between cyber and real world ... • Human knowledge defines the processing • Large algorithmic

Marc DurantonHiPEAC vision coordinator

CEA FellowCommissariat à l’énergie atomique et aux énergies alternatives

High-Performance and EmbeddedArchitecture and Compilation

HPC in the Loop and Cyber-Physical

Systems

Post-H202 vision for HPC workshop Sunday, 24 June 2018, Frankfurt

| 3

HiPEAC's mission is to steer and increase the European research in the area of high-performance and embedded computing systems,

And stimulate cooperation between:a) academia and industry b) computer architects and tool builders.

HiPEAC=

High-Performance and Embedded Architecture and Compilation

| 4

HIPEAC HISTORY

2005 2006 2007 2008 2009 2010 2011 2012 2013

HiPEAC1

HiPEAC2

HiPEAC3

2014 2015 2016 2017

HiPEAC4

HiPEAC5

2018 2019

https://www.google.be/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&cad=rja&uact=8&ved=0ahUKEwi4mLukxZ7TAhWFfhoKHWhxBkEQjRwIBw&url=https://plus.google.com/118280076455348832041/videos&psig=AFQjCNHYk830EdcZV21vMsDvAjwMqz21Cw&ust=1492073533185058

| 5

MEM

BER

SHIP

Associated members: 76 Total: 1496

13 partners, 522 members, 99 associated members, 423 affiliated members and 855 affiliated PhD students from 363 institutions in 40 countries.

Membership is free of charge.

hipeac.net/members/stats/map

| 6

HIPEAC STRUCTURE

WP1 Growing the communities

WP2 Connecting the communities

WP3 Dissemination

WP4 Roadmapping

Management• Membership management• Growing the industrial community• Growing the innovator community• Growing the stakeholder community• Growing the new member states membership

• Conference• ACACES summer school• Computing systems weeks• Stimulating collaboration• HiPEAC Jobs

• Consultation meetings• HiPEAC Vision 2019• Disseminating the HiPEAC Vision

• Project management• Financial management• Industrial Advisory board

• Communications• Road show• Awards• Website

| 7

HIPEAC STRUCTURE

WP1 Growing the communities

WP2 Connecting the communities

WP3 Dissemination

WP4 Roadmapping

Management• Membership management• Growing the industrial community• Growing the innovator community• Growing the stakeholder community• Growing the new member states membership

• Conference• ACACES summer school• Computing systems weeks• Stimulating collaboration• HiPEAC Jobs

• Consultation meetings• HiPEAC Vision 2019• Disseminating the HiPEAC Vision

• Project management• Financial management• Industrial Advisory board

• Communications• Road show• Awards• Website

| 8

The last HiPEAC Vision Document was published in January 2017.The next version is on-going (expect printed version for beginning 2019)One of its aim is to drive the community and to help defining the

next European calls in ICT.

THE HIPEAC VISION

2009 20112008 2013 2015 2017

The 2017 version is available at:http://hipeac.net/vision

Moderador

Notas de la presentación

Purpose of the document: general outline of future directions in research, Vision document is one of the deliverables of HiPEAC 2-3 project

| 9

HiPEAC Vision

| 11

“The best way to predict the future is to invent it.”

Alan Kay

| 12

OUTLINE

Algorithm

Hardware

Language

Data

Applications

| 13

OUTLINE

Algorithm

Hardware

Language

Data

Applications

| 14

Entering in Human and Machine collaboration era

| 15

Artificial Intelligence is changing the man-machine interaction – natural interfaces, ”intelligent” behavior

• Image and situation understanding• Voice recognition and synthesis• Direct interfacing with the world• Creating the bridge between cyber and real world• …decision taking…

ENABLED BY ARTIFICIAL INTELLIGENCE (AND DEEP LEARNING)

| 16

New services

Smart sensors

Internet of Things

Big Data

Data Analytics / Cognitive

computing

Cloud / HPC

Physical Systems

Transforming data into information as early as possible

Cyber Physical Entanglement

Processing,Abstracting

Understandingas soon as

possible

C2PS: COGNITIVE ( CYBERNETIC* AND PHYSICAL ) SYSTEMSENABLING EDGE INTELLIGENCE

* As defined by Norbert Wiener: how humans, animals and machines control and communicate with each other.

True collaboration between edge devices and the HPC/cloud ensuring:- Data security

/ Privacy- Lower

bandwidth- Better use of

HPC/cloud

Enabling Intelligent data processing at the edge:

Fog computingEdge computingStream analytics

Fast data…

| 17

LOOKING FORWARD… EXAMPLE OF A CPS SYSTEM

Direct Brain Computer Interface (BCI)

Here allowing a paraplegic to walk again…

One current limitation: Required processing power – need supercomputer in a box

From CEA-Clinatec

| 18

Cyber-physical applications will require high performance level of computing – HPC in a Box – .

System should be autonomous to make good decisions in all conditions

Embedded intelligence needs local high-end computing

Safety will impose that basic autonomous functionsshould not rely on “always connected” or “always available”

Moderador


cars

| 19

System should be autonomous to make good decisions in all conditions

Safety will impose that basic autonomous functionsshould not rely on “always connected” or “always available”

Embedded intelligence needs local high-end computing

25 years1985 2010

Current embedded systems are HPC systems of few decades ago

Moderador


cars

| 20

CYBER-PHYSICAL SYSTEMS

Actions

Observations

Imposes the “timing”

Environment

| 21

1948: NORBERT WIENER

| 22

BUT COMPUTING SYSTEMS WERE NOT DESIGNED FOR CPS SYSTEMS

In nearly all hardware and software of computing systems:Time is abstracted or even not present at all

Very few programming languages can express time or timing constraintsAll is done to have the best average performance, not predictable performances

Caches, out of order execution, branch prediction, speculative execution, …(Hidden) compiler optimization, call to (time) unspecified libraries

Energy is also left out of scopeThis can have impact on data movement, optimizationsWith predictability: On-Time processing

Interaction with external world are second priorities vs. computation

Done with interrupts (introduced as an optimization, eliminating unproductive waiting time in polling loops) which were design to be exceptional events…

Etc.

Moderador


Look at the individula picturs

| 23

Simulation Data

MachineLearning

3 PILLARDS OF FUTURE HPC

Intertwined with CPS requirements

| 24

Simulation

Algorithm

SIMULATION

Specialist

Parameters

Environment

Error?

Data

SIMULATION = MODEL FOR PREDICTION

• Human knowledge defines the processing

• Large algorithmic complexity• High precision floating point• Large set of output data• Von Neumann architecture

| 25

FROM CPS TO DIGITAL TWIN

Actions

Observations Environment

Observations

Error?Simulation

SIMULATION SHOULD BE FASTER THAN REALITY FOR PREDICTION

| 26

BIG DATA

Observations

Big DataData Analytics

Specialist

SpecialistModel

Environment

• Human knowledge refines the processing

• Large set of input data• Mappable on Von Neumann

Architecture

| 27

MACHINE LEARNING(DEEP LEARNING)LEARNING PHASE

Learning phase

Labelled data set

Specialist

• Human defines the learning data set, not the algorithm

• Large set of input data for learning phase

• Low precision floating point• Large number of operations• (Stochastic) gradient descent

| 28

MACHINE LEARNING(DEEP LEARNING)

INFERENCE PHASE

Inference phase

• Low precision arithmetic• Medium number of

operations• Co-location computing and

storage (“computing in memory”)

• Should satisfy the application non-functional requirements

Environment

| 29

DEEP LEARNING AND VOICE RECOGNITION

| 31

… required to increase energy efficiency with accuracy adapted to the use (e.g. float 16)

Google’s TPU2 : 11.5 petaflops16 of machine learning number crunching (and guessing about 400+ KW…, 100+ GFlops16/W)

Peta = 1015 = million of milliardFrom Google

2017: GOOGLE’S CUSTOMIZED TPU HARDWARE…

| 33

REINFORCEMENT LEARNING:

Actions

Observations

Respond to action

Rewards

Goal

Environment

Agent

Learns to maximize rewards

DYNAMIC PROGRAMMING + DEEP LEARNING

| 34

ALPHAGO ZERO: SELF-PLAYING TO LEARN

From doi:10.1038/nature24270 (Received 07 April 2017)

| 35

ALPHAZERO: SELF-PLAYING TO LEARN

REINFORCEMENT LEARNING: DYNAMIC PROGRAMMING + DEEP LEARNING

| 36

ALPHAZERO: COMPUTING RESOURCES

Peta = 1015 = million of milliard

From Google Deepmind

| 37


Goal

Actions

Observations

Rewards

Respond to action

Environment

Agent


• Mixed precision arithmetic• Very high number of operations• Large internal data manipulation• Mainly co-location computing and

storage (“computing in memory”)• High level of parallelism• Minimization of energy functions

| 38


Goal

Actions

Observations

Rewards

Respond to action

Agent


| 39

OUTLINE

Algorithm

Hardware

Language

Data

Applications

| 40

WHAT WILL BE THE NEXT TECHNOLOGY?

And after CMOS?

| 43

22FD

28nm

14nm

10nm

7nm

5nm

Next Gen

FinFET

Mechanical switches

Hyb

rid

logi

c Steep slope devices

Si Quantum bits

Disruptive scaling

Alternative to scaling and diversification

Monolithic 3D for 3D VLSI

25nm TBOX

20nm LG ISPD SiCRSD

Si channel

2017

2018

25nm TBOX

20nm LG ISPD SiCRSD

Si channel

12FDFDSOI

Technology evolutionSilicon Quantum bits

Non planar / trigate / stacked Nanowires

| 44

New technologies• Quantum computing• Printed/flexible electronics• Carbon nanotubes• Photonics• Neuro inspired (nano) technologies• Reservoir computing• Adiabatic computing• MEMS for computing• Synthetic biology, blob computing• Swarm computing• Symbiotic computing• Analogue/physic/hybrid computing• …

EXPLORE NEW WAYS AS ALTERNATIVE TO SILICON

| 45

NON VOLATILE MEMORIES

PCMGSTGeTeGST + HfO2

CBRAMAg / GeS2

OXRAM

TiN/HfO2/Ti/TiNThermal effect Electrochemical

effectElectronic effect oxygen vacancies

MRAM Magnetic effect• Can change the structure of

memory hierarchy?+ 64/128 addressing scheme⇒ Do we still need files?⇒ Direct access of objects

| 46

From “Total Consumer Power Consumption Forecast”, Anders S.G. Andrae, October 2017

The problem:IT projected to challenge future electricity

supply

| 49

COST OF MOVING DATA -> COMPUTING IN MEMORY

Source: Bill Dally, « To ExaScale and Beyond »www.nvidia.com/content/PDF/sc_2010/theater/Dally_SC10.pdf

| 50

Neuram3 1st

chipIBM True

NorthTechnology 28 nm FDSOI 28nm CMOSSupply Voltage 1 V 0.7VNeuron Type Analog DigitalNeurons per core 256 256Core Area 0.36 mm2 0.094 mm2

Computation Parallel processing

Time multiplexing

Fan In/Out 2k/8k 256/256Synaptic Operation per Second per Watt

300 GSOPS/W*1

46 GSOPS/W

Energy per synaptic event <2 pJ*2 10 pJEnergy per spike <0.375 nJ*3 3.9 nJ

∗ 1 At 100Hz mean firing rate, by appending 4 local-core destinations per spike, 400 k events will be broadcast to 4 cores with 25% connectivity per event. 400 k x 1 k x 25% / 300 μ W = 300 GSOPS/W∗ 2 In case of 25% match in each core, energy per synaptic event = energy per broadcast / (256*25%) =120pJ/64 = 2 pJ∗ 3 Energy per spike = total power consumption / spikes numbers = 300 uW/800 k = 0.375 nJ

NEUROMORPHIC ACCELERATOR:DYNAPS-SL (INI-ZURICH)

Moderador


Target spec du chip pour depasser True North. Le silicium est attend pour juillet. Tape out novembre dernier. Delaos du aux temps des run industriels ST et qui necessite quelque revision du projet.

| 52

OFF-CHIP PHOTONICS

Off board: AOC, optical modules Off chip: Optical I/O Time

S1

Chip B

Chip C

Opt

ical

Tra

nsce

iver

Chip D

ICSi interposer or laminate substrate

Driver / TIA

IC

Micro-pillars

PIC

FiberFerrule

PCB

Photonics: cost in sending information, nearly nothing in transmission

| 53

IN-PACKAGE PHOTONICS

Off board: AOC, optical modules Off chip: Optical I/O Optical network

in package

S1

Opt

ical

Tra

nsce

iver

Chip C

Chip A

Chip D

Chip B

RAMComputing Cores

Photonic Interposer

Tx/Rx Integr. Rx/Tx

Substrate

photodiode modul

Laser

ThroughSilicon Via

RF Cu pillars

Power Power Power Power

Light source

Primary I/OCu pillars

Digital Cu pillars& proximity lines

Thermal Dissipation

Thermal Dissipation

SignalSignal

S2

| 54

D-Wave System Hardware

Daniel Vert - [email protected]

The system solves only one binary optimization problem :

Embedding a realistic problem instance :Physical qubits on each colored pathrepresent one logical qubit

Problem: Qubit efficiency !Possible solution : Search for an efficient operating path for an adiabatic quantum computer

Adiabatic quantum computation (Farhi et al. in 2000) as an alternative quantum model for solving NP-hard optimization problems considered classically insoluble !

https://web.eecs.utk.edu/~mclennan/Classes/494-594-UC-F16/presentations/D-Wave.pdf

Polynomial transformation of NP-hard problemsfor Quantum annealers

Slide from Christian Gamrat

| 55

• Importance of studying the full stack system architecture to connect quantum devices with conventional computing systems• Hardware level• System stack

QUANTUM HARDWARE ARCHITECTURE AND SYSTEM

Overview of a quantum computer architecture. Pink is quantum, Green is the interface and Blue parts are

conventional

“Fu et al - 2016 - A Heterogeneous Quantum Computer Architecture, Proc. ACM, 2016 .

The system stack from low level quantum hardware to high level algorithms

From work by QuTech @TU-DelftSlide from Christian Gamrat

| 56

Maurand et al, Nature Com., Jul. 2016.

FUSING PARADIGMS AT HARDWARE LEVEL

Coordination EngineVon Neumann styleCMOSTechnology

binary data

GraphicalEngine

DecisionEngine

NeuroEngine

QuantumEngine

SemanticalEngine

NumericalEngine

CMOS Substrate

Neuroengine

Quantumengine

Graphicsengine

Physical and logical interface layer

NumericalEngine

At the hardware level, the good old Von Neumann/ CMOS partnership can act as a computing substrate

• Acting as coordination / communication node• Allowing Hardware / Software integration

D. Roclin et al, IEEE NanoArch, 2014.

Qubits on Silicon

NVM Synapses on Silicon

Slide from Christian Gamrat

| 57

COORDINATE PARADIGMS AT SOFTWARE LEVEL

Valiron et al., “Programming the Quantum Future,” Commun. ACM, vol. 58, no. 8, pp. 52–61, Jul. 2015.

• A sequential program (running on the coordinator) distributes tasks to engines• Accelerator as services (J. Cong)

• Tasks are distributed to computing engines• Instruction streams• Supervised Learning• Unsupervised Learning• Quantum Engine

• Why would we want that?• Domain specific expressiveness• Run hard tasks, speedup NP problems• Provide cognitive functions

Fu et al. Proc. ACM, May 2016

Stefanini et al. Frontiers NeuroinformaticsAugust 2014

| 58

OUTLINE

Algorithm

Hardware

Language

Data

Applications

PARALLELISM AND SPECIALIZATION ARE NOT FOR FREE…

Frequency limit parallelism

Energy efficiency heterogeneity

Ease of programming




Ease of programming




Ease of programming

Managing complexity

Cognitive solutions for complex computing systems:• Using AI and optimization

techniques for computing systems• Creating new hardware• Generating code• Optimizing systems

• Similar to Generative designfor mechanical engineering

| 64

USING AI FOR MAKING CPS SYSTEMS: “GENERATIVE DESIGN” APPROACH

Motorcycle swingarm: the piece that hinges the rear wheel to the bike’s frame

The user only states desired goals and constraints-> The complexity wall might prevent explaining the solution

“Autodesk”

| 65

• Ne-XVP project – Follow-up of the TriMedia VLIW (https://en.wikipedia.org/wiki/Ne-XVP )

• 1,105,747,200 heterogeneous multicores in the design space

• 2 millions years to evaluate all design points

• AI inspired techniques allowed to reduce the induction time to only few days

=> x16 performance increase

EXAMPLE: DESIGN SPACE EXPLORATION FOR DESIGN MULTI-CORE PROCESSORS1 (2010)

1 M. Duranton et all., “Rapid Technology-Aware Design Space Exploration for Embedded HeterogeneousMultiprocessors” in Processor and System-on-Chip Simulation, Ed. R. Leupers, 2010

https://en.wikipedia.org/wiki/Ne-XVP

| 66

AUTOML AND OTHER PROGRAM GENERATORS

| 67

“Neural Architecture Search”, using a recurrent neural network to compose neural network architectures using reinforcement learning on CIFAR-10 (character recognition)

2017: GOOGLE; USING DEEP LEARNING TO DESIGN DEEP LEARNING

From arXiv:1611.01578v2, Barret Zoph, Quoc V. LeGoogle Brain

| 68

• Describing what the program should accomplish, rather than describing how to accomplish it as a sequence of the programming language primitives.

• For example, describe the concurrency of an application, not how to parallelize the code for it.

• (Good) compilers know better about architecture than humans, they are better at optimizing code…

PROGRAMMING 2.0: LET THE COMPUTER DO THE JOB:

| 69

CONCLUSION: WE LIVE AN EXCITING TIME!

| 70

Centre de Grenoble17 rue des Martyrs

38054 Grenoble Cedex

Centre de SaclayNano-Innov PC 172

91191 Gif sur Yvette [email protected]

Thank you for your attention