SoC programming in the era of the Internet of Things, machine learning and emerging ... ·...

Preview:

Citation preview

SoC programming in the era of the Internet of Things, machine learning and emerging technologies

Jeronimo CastrillonChair for Compiler Construction (CCC)TU Dresden, Germany

Groupement De Recherche SOC2: System On Chip, Systémes embarqués et Objets ConnectéMontpellier, France. June 20 2019

Systems on Chip (SoC): Evolution

q SoCs: Long history of specialization and interaction with environmentq Incredible evolution over the last decades

3 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

Panda, P. R., Dutt, N. D., & Nicolau, A. Memory issues in embedded systems-on-chip: optimizations and exploration. Springer Science & Business Media. 1999

Smart health

Aerospace

AI, MLNLP

Smart transport

Infotainment system

Virtual 3D

Control systems

i/po/p

Robotics HCI

DMAs, sema-

phoresPMU

Peripherals

Communicationsupport

HW queues

Network Processor

Packet DMA

MEMsubsystem

NoC

A15L1

A15L1

L2A15L1

A15L1

VLIW DSP

L1,L2

SoC design and programming: Handling complexity

q Advances in design methodologies

q Model-based programming

4 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

Transistors

Gate level

Register transfer level

Electronic system level

Application Architecture

Results

Optimization

SoC design and programming: Handling complexity

q Advances in design methodologies

q Model-based programming

5 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

Transistors

Gate level

Register transfer level

Electronic system level

Application Architecture

Results

Optimization

Exciting innovations in q Modeling languagesq Programming languages and compilers

q Costs models of hardwareq System simulators

q Design space exploration (DSE) methodologies

New challengesq System dynamics (e.g., IoT)q Ubiquity of machine learning workloads

q Complexity of emerging technologies

Dataflow and Hybrid DSE

© Prof. J. Castrillon. GdR SoC2. Montpellier, 20196

Dataflow programming

q Graph representation of applicationsq Implicit repetitive execution of tasksq Good model for streaming applications

q Good match for signal processing & multi-media

q The whyq Explicit parallelism

q Often: Determinism q Better analyzability (scheduling, mapping, optimization)

© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

PnTransformSdfToKpn(D, S);PnTransformToArrayAccess(D, S);CollectChannelAccessRanges(D, S);

PropagateChannelAccessRanges(D, S);PnStreamFactorystreamFactory(BasePath);

switch (transTarget) {case TransM VP:

PnTransformTemplateInstantiate(D, S);

ErasePnProcessTemplates(D);

PnPrintForM VP(D, S);break;

case TransPthread:PnTransformPthreads(D, S, traces);ErasePnDefs(D);break;

case TransSystemC:PrintForSystemC(D, S, traces,

streamFactory);ErasePnDefs(D);

break;case TransVPUtg:

PrintForVPUtg(D, S, streamFactory);

ErasePnDefs(D);break;

case TransVPUmap:PrintForVPUmap(D, S,

strM appingFileName, streamFactory);

ErasePnDefs(D);break;

case TransInvalid:assert(false);break;

}}

#include "PnTransform.h"#include "PnTVPUtg.h"

#include "PnTVPUmap.h"#include "clang/AST/ASTContext.h"#include "PnStreamFactory.h"using namespace clang;

voidclang::PnTransform(TransTarget transTarget, bool traces,

const std::string & strM appingFileName, ASTContext

&Ctx,Sema &S, const

llvm::sys::Path &BasePath) {assert(transTarget != TransInvalid);TranslationUnitDecl *D =

Ctx.getTranslationUnitDecl();PnTransformSizeof(D, S);PnTransformTask(D, S);switch (transTarget) {

case TransSystemC:

case TransVPUtg:case TransVPUmap:

PnTCopying(D, S);break;

default:;

}

PnTransformPthreads(D, S, traces);ErasePnDefs(D);break;

case TransSystemC:PrintForSystemC(D, S, traces,

streamFactory);ErasePnDefs(D);break;

case TransVPUtg:PrintForVPUtg(D, S, streamFactory);ErasePnDefs(D);break;

case TransVPUmap:

PrintForVPUmap(D, S, strM appingFileName, streamFactory);

ErasePnDefs(D);break;

case TransInvalid:assert(false);break;

&BasePath) {assert(transTarget != TransInvalid);

TranslaYonUnitDecl *D = Ctx.getTranslaYonUnitDecl();

PnTransformSizeof(D, S);PnTransformTask(D, S);switch (transTarget) {

case TransSystemC:case TransVPUtg:case TransVPUmap:

PnTCopying(D, S);break;

default:;

}

7

Dataflow compilation

q Plenty of research onq Language, compiler and mapping algorithmsq Hardware modeling, performance estimation

q Runtime systemsq Code generation for heterogeneous multicores

8 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

LM

RISC

LM

RISC

LM

DSP

LM

DSP

LM

DSP

LM

DSP

ARM

SM

AHB Bus

LM: Local memorySM: Shared memory

qnt

src snk

qnt

qntdct

qnt

dctvle

dct

dct init

MJPEG application

[Springer14]

Example: HW acceleration and SW-defined radio

q Application: MIMO OFDM receiverq Hardware

q Platform 1: Baseline software q Platform 2: Optimized software

q Platform 3: Optimized SW + HW

9 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

ARM Mem

VLIW VLIW VLIW

src sink

Library 1: SW implementations

Library 2: SW optimized

MemMem

FFT

ARM Mem

VLIW FFT Viterbi

src sink

Custom Interconnect

Demap Library 3: SW+HW accel.7680

128000

1758241,758

1,00E+00

1,00E+02

1,00E+04

1,00E+06

1)bsp1 (sw,

unoptimized)

2)bsp2 (sw,optimized)

3)bsp3 (hw)

Rate

(bps

)

Achieved rate @ 100 MHz

[SDR’10, ALOG’11]

System dynamics

q Applications not so static anymore!

q Hybrid DSE: a compile and run-time approachq Enable adaptivity: malleable, multi-variant

q Run-time predictability, robustness & isolation

11 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

Smart health

Aerospace

AI, MLNLP

Smart transport

Infotainment system

Virtual 3D

Control systems

i/po/p

Robotics HCI

EUF Pareto front @ compile-time

Data-level parallelism: Scalable and adaptive

q Change parallelism from the application specificationq Static code analysis to identify possible transformations (or via annotations)q Implementation in FIFO library (semantics preserving)

12 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

P1 P2 P3

Meta-information & configurations

Changes @ runtime

P1

P12

P3P22

P32

P1

P12

P3P32

P42

or

P22

[PARMA-DITAM’18]

Exploiting symmetries

q Intuitionq SW: Some tasks/processes/actors may do the sameq HW: Symmetric latencies (CoreX çè CoreY)

q Symmetry: Allows transformations w/o changing the outcome

è No need to analyze all possible mappings (prune search space)

è Work on formalization via inverse semi-groups and efficient algorithms

13 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

=?

15

Symmetries in Odroid: Example

© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

PE7

PE6PE5

PE4PE3

PE2PE1

PE0

PE7

PE6PE5

PE4PE3

PE2PE1

PE0D

A B C

DA

B

C

ϕ

PE1

PE2

PE3

PE4

PE5PE6

PE7

PE0

PE1

PE2

PE3

PE4

PE5

PE6PE7

PE0

Mappings Architecture Subgraphs

Cortex A7 Cortex A15

Equivalent mappings

Graph isomorphism

Mappings Architecture subgraphs

[IESS’15, ACM TACO’17]

Flexible mappings: Generalized Tetris

q Given multiple canonical configs by compiler, select one at run-time

q Exploit mapping equivalences and similarities

16 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

Mapping 3 configuration1Mapping 2

configuration1Mapping configuration1

RuntimeMapping

configuration*

[SCOPES’17b]

Flexible mappings: Run-time analysis

q Linux kernel: symmetry-awareq Target: Odroid XU4 (big.LITTLE)q Multi-application scenarios: audio filter (AF) and MIMO

q 1x AF

q 4 x AF q 2 x AF + 2 x MIMO

q 3 mappings to two processors q T1: Best CPU time

q T2: Best wall-clock timeq T3: GBM heuristic

17 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

●●●

12

14

16

18

CFS Dyn T1 T2 T3

CPU

tim

e [s

]

●●

●●

10

12

14

16

CFS Dyn T1 T2 T3

Ener

gy [J

]

●●●●

●●

4

5

6

7

8

CFS Dyn T1 T2 T3

Wal

l−cl

ock

time

[s]

Single AF

[Castrill12]

[SCOPES’17b]

Flexible mappings: Multi-application results (1)

18 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

●●

●● ●

●●

●●

●●

●●

6.5

7.0

7.5

8.0

8.5

9.0

9.5

CFS Dyn T1 T2 T3

Wal

l−cl

ock

time

[s]

●●

●●●●●●

●●●

●●●●●●●●●●●●●

●●●

●●●●●●

●●●●●●●

10

20

30

40

AF MIMO

CPU

time

[s]

●●●● ●●

●● ●●●●●●●

●●●

●●●●

●● ●●

10

20

30

AF MIMO

Wal

l−clo

ck ti

me

[s] More

predictable performance

Comparable performance to dynamic mapping

●●●●●●

●●●●

●●●

● ● ● ●

● ●

●●

●●

●●

● ● ● ●

12.5

15.0

17.5

20.0

CFS Dyn T1 T2 T3

CPU

tim

e [s

]

[SCOPES’17b]

Robustness

q Static mappings, transformed or not, provide good predictabilityq However: Many things out of control

q Application data, unexpected interrupts, unexpected OS decisions

è Can we reason about robustness of mapping to external factors?

20 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

Mapping 3 configuration1Mapping 2

configuration1Mapping configuration1

RuntimeMapping

configuration* Perturb Modified behavior(correct?)

Design centering

q Design centering: Find a mapping that can better tolerate variations while staying feasible

q Studied field, in e.g., biology, circuit design or manufacturing systems.

q Currentlyq Using a bio-inspired algorithmq Robust against OS changes to the mapping

21 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

optimum

design center in feasible region

x1

x2

feasible region

[SCOPES’17a]

Evaluation

q Analyze how robust the center really isq Perturbate mappings and check how often the constraints are missedq Signal processing applications on clustered ARM manycore and NoC manycore (16)

23 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

MIMO-OFDM

[SCOPES’17a]

Machine Learning & emerging technologies

© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

ML revolution: Frameworks and architectures

q Many existing frameworks, e.g., TVM, Tensor Comprehensions, TensorFlow, …

q Lots of traction in hardware architectures: TPU, V100, …

q Lot’s of resources for training, less on inference on edge-devices!

© Prof. J. Castrillon. GdR SoC2. Montpellier, 201926

[Chen, OSDI’18]Example flow: TVM

Domain-specific abstractions

q Commonality: Tensor expression languages

q Increase programmer’s productivityq From compiler perspective: No abstraction toll

q Easier access to informationq Larger score for optimization

27 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

var input A : matrix &var input u : tensorIN &

v = (A # A # A # u . [[5 8] [3 7] [1 6]])

[RWDSL’18]

vs

Correct by construction

q Especially important in embedded systemsq Correct by designq No abstraction leaks

q Current efforts in formal semantics for safe code generation

28 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

[Array’19]

A = placeholder((m,h), name='A')

B = placeholder((n,h), name='B')

k = reduce_axis((0, h), name='k')

C = compute((m, n), lambda i, j:

sum(A[k, i] * B[k, j], axis=k))

!"# = %&'(

)*&"+&#

TeML: Results

© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

q Extra control allows for new optimization (vs pluto): changing shapesq General tensor semantics allows covering more benchmarks than TensorFlow

29

[GPCE’18]

Emerging technologies: Racetrack memories

q Racetrack memories: one of many future alternativesq Cf. STT/Re-RAM, hybrid architectures

q Predicted extreme density at low latencyq 3D nano-wires with magnetic domains

q One port shared for many bitsq Domains move at high speeds (1000 ms-1)

q Sequential: Game changer for current HW/SW stackq Memory management

q Integration with other memory architectures q Data layout and allocation

[Parkin-Nature’15]

© Prof. J. Castrillon. GdR SoC2. Montpellier, 201930

[TVLSI’18, TVLSI’19]

Compiler research on placement

q Compiler pass to reorganize placement of instructionsq Instruction fetch is naturally sequential!q Layout instructions to reduce shift operations

q Compiler pass to reorganize data for higher-level objectsq Variable allocation

q Tensor allocation and program transformation

31

[ISLPED’19]

[ArXiv’19]

[LCTES’19]

Simulating RTMs

q RTSim: Configurable racetrack simulatorq Allows running software benchmarksq Built on stop of other simulator technology: NVMAIN 2, Gem5, SystemC, …

© Prof. J. Castrillon. GdR SoC2. Montpellier, 201932

https://github.com/tud-ccc/RTSim[IEEE CAL’19]

Architecture and data layout optimization

q Architecture – software co-optimizationq Embedded system for inference: RTM as scratchpad with pre-shifting and other

optimizations

33[LCTES’19]

Architecture and data layout optimization

q Data-layout: Reduce the number of shifts

34[LCTES’19]

n

DB

C:2n

DB

C:2n+1

DB

C:3n-1

R0

R1

Rn-1

C00 C01

n

R0

R1

Rn-1

DB

C:0

DB

C:1

DB

C:n-1

A00 A01 A0n-1

A10 A1n-1

An-1n-1

C0 Cn-1DBC:n DBC:n+1 DBC:2n-1

B00

B10

Bn-10

A00 A01 A0n-1

compulsory shifts

overhead shifts

B00 B10 Bn-10

compulsory shifts

overhead shifts

C1

B01

B11

Bn-11

C (Bank-2)Ã (Bank-0) ~B (Bank-1)

Latency comparison vs SRAM

q Un-optimized and naïve mapping: Even worse latency than SRAMq 24% average improvement (even with very conservative circuit simulation)

35

0

0.5

1

1.5

2

2.5

3

4 8 16 32 64 128 256 512 1024 2048 Avg

Nor

mal

ized

late

ncy

Tensor size

SRAM RTM-naïve RTM-opt RTM-opt-ps

[LCTES’19]

Energy comparison vs SRAM

q Higher savings due to less leakage powerq 74% average improvement

36

0

0.2

0.4

0.6

0.8

1

1.2

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

4 8 16 32 64 128 256 512 1024 2048

Nor

mal

ized

ene

rgy

Tensor size

Leakage Energy Read/Write Energy Shift Energy

[LCTES’19]

Discussion

© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

Summary

q System dynamics: More complex in distributed IoT scenariosq Hybrid compile-runtime methodologies: Difficult balance, new interfacesq Strive at retaining time predictability

q New workloads (e.g., ML) + new techs: Harness domain-specific abstractionsq More complex decision making (e.g., het. Memory systems)

q Expression DSLs to ease high-level manipulation and transformation

q Exciting times for DSE, SW/HW co-design formal languages for emerging platforms (example: RTM-scratchpads)

© Prof. J. Castrillon. GdR SoC2. Montpellier, 201938

© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

References

[Springer’14] J. Castrillon, R. Leupers, "Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap" , Springer, pp. 258, 2014. [SDR’10] J. Castrillon, S. Schu ̈rmans, A. Stulova, W. Sheng, T. Kempf, R. Leupers, G. Ascheid, and H. Meyr, “Component-based waveform development: The nucleus tool flow for efficient and portable SDR,” Wireless Innovation Conference and Product Exposition (SDR), 2010[ALOG’11] J. Castrillon, S. Schu ̈rmans, A. Stulova, W. Sheng, T. Kempf, R. Leupers, G. Ascheid, and H. Meyr, “Component-based waveform development: The nucleus tool flow for efficient and portable software defined radio”, Analog Integrated Circuits and Signal Processing, vol. 69, no. 2–3, pp. 173–190, 2011

[ACM TACO’17] Goens, A. et al. “Symmetry in Software Synthesis”. In: ACM Transactions on Architecture and Code Optimization (TACO) (2017).

[IESS’15] A. Goens and J. Castrillon, “Analysis of Process Traces for Mapping Dynamic KPN Applications to MPSoCs”, In IFIP International Embedded Systems Symposium (IESS), 2015, Foz do Iguaçu, Brazil, 2015.

[PARMA-DITAM’18] R. Khasanov, et al, "Implicit Data-Parallelism in Kahn Process Networks: Bridging the MacQueen Gap" , PARMA-DITAM 18, ACM, pp. 20–25, Jan 2018.[SCOPES’17a] G. Hempel, et al, "Robust Mapping of Process Networks to Many-Core Systems Using Bio-Inspired Design Centering" SCOPES’17.[SCOPES’17b] Goens, A. et al. “TETRiS: a Multi-Application Run-Time System for Predictable Execution of Static Mappings”, SCOPES’17.[Chen, OSDI’18] Chen, Tianqi, et al. "TVM: An Automated End-to-End Optimizing Compiler for Deep Learning." 13th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 18). 2018.[RWDSL’18] N. A. Rink, et al. “CFDlang: High-level code generation for high-order methods in fluid dynamics”. RWDSL’18.[Array’19] N.A. Rink, N. A. and Jeronimo Castrillon. “TeIL: a type-safe imperative Tensor Intermediate Language”,

ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY), ACM, 2019, 57-68[GPCE’17] A. Susungi, et al. "Towards Compositional and Generative Tensor Optimizations" GPCE’17, 169-175.

[GPCE’18] A. Susungi, et al. " "Meta-programming for cross-domain tensor optimizations" GPCE’18, 79-92.

[TVLSI’18] F. Hameed, A. A. Khan, J. Castrillon, "Performance and Energy Efficient Design of STT-RAM Last-Level-Cache" , In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 26, no. 6, pp. 1059–1072, Jun 2018.

[TVLSI’19] F. Hameed, J. Castrillon, "A Novel Hybrid DRAM/STT-RAM Last-Level-Cache Architecture for Performance, Energy and Endurance Enhancement" , In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), pp. 13pp, Jun 2019.

[Parkin-Nature’15] S. Parkin, and See-Hun Yang. "Memory on the racetrack." Nature nanotechnology 10.3 (2015): 195.

[IEEE CAL’19] A. A. Khan, F. Hameed, R. Bläsing, S. Parkin, J.Castrillon, "RTSim: A Cycle-accurate Simulator for Racetrack Memories" In IEEE Computer Architecture Letters, Feb 2019.

[ISPLED’19] J. Multanen, A. A. Khan, P. Jääskeläinen, F. Hameed, J. Castrillon, “SHRIMP: Efficient Instruction Delivery with Domain Wall Memory”, International Symposium on Low Power Electronics and Design (ISLPED), ACM, 2019, 10pp

[LCTES’19] Khan, A. A., Rink, N. A., Hameed, F., Castrillon, J. "Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads”, Proceedings of LCTES’19, ACM, pp. 12pp, New York, NY, USA, Jun 2019

[ArXiv’19] Khan, A. A., Hameed, F., Bläsing, R., Parkin, S., Castrillon, J. “ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0”, ArXiv arXiv:1903.03597, Mar 2019.

41

Recommended