33
SoC programming in the era of the Internet of Things, machine learning and emerging technologies Jeronimo Castrillon Chair for Compiler Construction (CCC) TU Dresden, Germany Groupement De Recherche SOC2: System On Chip, Systémes embarqués et Objets Connecté Montpellier, France. June 20 2019

SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

SoC programming in the era of the Internet of Things, machine learning and emerging technologies

Jeronimo CastrillonChair for Compiler Construction (CCC)TU Dresden, Germany

Groupement De Recherche SOC2: System On Chip, Systémes embarqués et Objets ConnectéMontpellier, France. June 20 2019

Page 2: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Systems on Chip (SoC): Evolution

q SoCs: Long history of specialization and interaction with environmentq Incredible evolution over the last decades

3 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

Panda, P. R., Dutt, N. D., & Nicolau, A. Memory issues in embedded systems-on-chip: optimizations and exploration. Springer Science & Business Media. 1999

Smart health

Aerospace

AI, MLNLP

Smart transport

Infotainment system

Virtual 3D

Control systems

i/po/p

Robotics HCI

DMAs, sema-

phoresPMU

Peripherals

Communicationsupport

HW queues

Network Processor

Packet DMA

MEMsubsystem

NoC

A15L1

A15L1

L2A15L1

A15L1

VLIW DSP

L1,L2

Page 3: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

SoC design and programming: Handling complexity

q Advances in design methodologies

q Model-based programming

4 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

Transistors

Gate level

Register transfer level

Electronic system level

Application Architecture

Results

Optimization

Page 4: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

SoC design and programming: Handling complexity

q Advances in design methodologies

q Model-based programming

5 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

Transistors

Gate level

Register transfer level

Electronic system level

Application Architecture

Results

Optimization

Exciting innovations in q Modeling languagesq Programming languages and compilers

q Costs models of hardwareq System simulators

q Design space exploration (DSE) methodologies

New challengesq System dynamics (e.g., IoT)q Ubiquity of machine learning workloads

q Complexity of emerging technologies

Page 5: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Dataflow and Hybrid DSE

© Prof. J. Castrillon. GdR SoC2. Montpellier, 20196

Page 6: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Dataflow programming

q Graph representation of applicationsq Implicit repetitive execution of tasksq Good model for streaming applications

q Good match for signal processing & multi-media

q The whyq Explicit parallelism

q Often: Determinism q Better analyzability (scheduling, mapping, optimization)

© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

PnTransformSdfToKpn(D, S);PnTransformToArrayAccess(D, S);CollectChannelAccessRanges(D, S);

PropagateChannelAccessRanges(D, S);PnStreamFactorystreamFactory(BasePath);

switch (transTarget) {case TransM VP:

PnTransformTemplateInstantiate(D, S);

ErasePnProcessTemplates(D);

PnPrintForM VP(D, S);break;

case TransPthread:PnTransformPthreads(D, S, traces);ErasePnDefs(D);break;

case TransSystemC:PrintForSystemC(D, S, traces,

streamFactory);ErasePnDefs(D);

break;case TransVPUtg:

PrintForVPUtg(D, S, streamFactory);

ErasePnDefs(D);break;

case TransVPUmap:PrintForVPUmap(D, S,

strM appingFileName, streamFactory);

ErasePnDefs(D);break;

case TransInvalid:assert(false);break;

}}

#include "PnTransform.h"#include "PnTVPUtg.h"

#include "PnTVPUmap.h"#include "clang/AST/ASTContext.h"#include "PnStreamFactory.h"using namespace clang;

voidclang::PnTransform(TransTarget transTarget, bool traces,

const std::string & strM appingFileName, ASTContext

&Ctx,Sema &S, const

llvm::sys::Path &BasePath) {assert(transTarget != TransInvalid);TranslationUnitDecl *D =

Ctx.getTranslationUnitDecl();PnTransformSizeof(D, S);PnTransformTask(D, S);switch (transTarget) {

case TransSystemC:

case TransVPUtg:case TransVPUmap:

PnTCopying(D, S);break;

default:;

}

PnTransformPthreads(D, S, traces);ErasePnDefs(D);break;

case TransSystemC:PrintForSystemC(D, S, traces,

streamFactory);ErasePnDefs(D);break;

case TransVPUtg:PrintForVPUtg(D, S, streamFactory);ErasePnDefs(D);break;

case TransVPUmap:

PrintForVPUmap(D, S, strM appingFileName, streamFactory);

ErasePnDefs(D);break;

case TransInvalid:assert(false);break;

&BasePath) {assert(transTarget != TransInvalid);

TranslaYonUnitDecl *D = Ctx.getTranslaYonUnitDecl();

PnTransformSizeof(D, S);PnTransformTask(D, S);switch (transTarget) {

case TransSystemC:case TransVPUtg:case TransVPUmap:

PnTCopying(D, S);break;

default:;

}

7

Page 7: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Dataflow compilation

q Plenty of research onq Language, compiler and mapping algorithmsq Hardware modeling, performance estimation

q Runtime systemsq Code generation for heterogeneous multicores

8 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

LM

RISC

LM

RISC

LM

DSP

LM

DSP

LM

DSP

LM

DSP

ARM

SM

AHB Bus

LM: Local memorySM: Shared memory

qnt

src snk

qnt

qntdct

qnt

dctvle

dct

dct init

MJPEG application

[Springer14]

Page 8: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Example: HW acceleration and SW-defined radio

q Application: MIMO OFDM receiverq Hardware

q Platform 1: Baseline software q Platform 2: Optimized software

q Platform 3: Optimized SW + HW

9 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

ARM Mem

VLIW VLIW VLIW

src sink

Library 1: SW implementations

Library 2: SW optimized

MemMem

FFT

ARM Mem

VLIW FFT Viterbi

src sink

Custom Interconnect

Demap Library 3: SW+HW accel.7680

128000

1758241,758

1,00E+00

1,00E+02

1,00E+04

1,00E+06

1)bsp1 (sw,

unoptimized)

2)bsp2 (sw,optimized)

3)bsp3 (hw)

Rate

(bps

)

Achieved rate @ 100 MHz

[SDR’10, ALOG’11]

Page 9: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

System dynamics

q Applications not so static anymore!

q Hybrid DSE: a compile and run-time approachq Enable adaptivity: malleable, multi-variant

q Run-time predictability, robustness & isolation

11 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

Smart health

Aerospace

AI, MLNLP

Smart transport

Infotainment system

Virtual 3D

Control systems

i/po/p

Robotics HCI

EUF Pareto front @ compile-time

Page 10: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Data-level parallelism: Scalable and adaptive

q Change parallelism from the application specificationq Static code analysis to identify possible transformations (or via annotations)q Implementation in FIFO library (semantics preserving)

12 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

P1 P2 P3

Meta-information & configurations

Changes @ runtime

P1

P12

P3P22

P32

P1

P12

P3P32

P42

or

P22

[PARMA-DITAM’18]

Page 11: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Exploiting symmetries

q Intuitionq SW: Some tasks/processes/actors may do the sameq HW: Symmetric latencies (CoreX çè CoreY)

q Symmetry: Allows transformations w/o changing the outcome

è No need to analyze all possible mappings (prune search space)

è Work on formalization via inverse semi-groups and efficient algorithms

13 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

=?

Page 12: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

15

Symmetries in Odroid: Example

© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

PE7

PE6PE5

PE4PE3

PE2PE1

PE0

PE7

PE6PE5

PE4PE3

PE2PE1

PE0D

A B C

DA

B

C

ϕ

PE1

PE2

PE3

PE4

PE5PE6

PE7

PE0

PE1

PE2

PE3

PE4

PE5

PE6PE7

PE0

Mappings Architecture Subgraphs

Cortex A7 Cortex A15

Equivalent mappings

Graph isomorphism

Mappings Architecture subgraphs

[IESS’15, ACM TACO’17]

Page 13: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Flexible mappings: Generalized Tetris

q Given multiple canonical configs by compiler, select one at run-time

q Exploit mapping equivalences and similarities

16 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

Mapping 3 configuration1Mapping 2

configuration1Mapping configuration1

RuntimeMapping

configuration*

[SCOPES’17b]

Page 14: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Flexible mappings: Run-time analysis

q Linux kernel: symmetry-awareq Target: Odroid XU4 (big.LITTLE)q Multi-application scenarios: audio filter (AF) and MIMO

q 1x AF

q 4 x AF q 2 x AF + 2 x MIMO

q 3 mappings to two processors q T1: Best CPU time

q T2: Best wall-clock timeq T3: GBM heuristic

17 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

●●●

12

14

16

18

CFS Dyn T1 T2 T3

CPU

tim

e [s

]

●●

●●

10

12

14

16

CFS Dyn T1 T2 T3

Ener

gy [J

]

●●●●

●●

4

5

6

7

8

CFS Dyn T1 T2 T3

Wal

l−cl

ock

time

[s]

Single AF

[Castrill12]

[SCOPES’17b]

Page 15: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Flexible mappings: Multi-application results (1)

18 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

●●

●● ●

●●

●●

●●

●●

6.5

7.0

7.5

8.0

8.5

9.0

9.5

CFS Dyn T1 T2 T3

Wal

l−cl

ock

time

[s]

●●

●●●●●●

●●●

●●●●●●●●●●●●●

●●●

●●●●●●

●●●●●●●

10

20

30

40

AF MIMO

CPU

time

[s]

●●●● ●●

●● ●●●●●●●

●●●

●●●●

●● ●●

10

20

30

AF MIMO

Wal

l−clo

ck ti

me

[s] More

predictable performance

Comparable performance to dynamic mapping

●●●●●●

●●●●

●●●

● ● ● ●

● ●

●●

●●

●●

● ● ● ●

12.5

15.0

17.5

20.0

CFS Dyn T1 T2 T3

CPU

tim

e [s

]

[SCOPES’17b]

Page 16: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Robustness

q Static mappings, transformed or not, provide good predictabilityq However: Many things out of control

q Application data, unexpected interrupts, unexpected OS decisions

è Can we reason about robustness of mapping to external factors?

20 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

Mapping 3 configuration1Mapping 2

configuration1Mapping configuration1

RuntimeMapping

configuration* Perturb Modified behavior(correct?)

Page 17: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Design centering

q Design centering: Find a mapping that can better tolerate variations while staying feasible

q Studied field, in e.g., biology, circuit design or manufacturing systems.

q Currentlyq Using a bio-inspired algorithmq Robust against OS changes to the mapping

21 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

optimum

design center in feasible region

x1

x2

feasible region

[SCOPES’17a]

Page 18: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Evaluation

q Analyze how robust the center really isq Perturbate mappings and check how often the constraints are missedq Signal processing applications on clustered ARM manycore and NoC manycore (16)

23 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

MIMO-OFDM

[SCOPES’17a]

Page 19: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Machine Learning & emerging technologies

© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

Page 20: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

ML revolution: Frameworks and architectures

q Many existing frameworks, e.g., TVM, Tensor Comprehensions, TensorFlow, …

q Lots of traction in hardware architectures: TPU, V100, …

q Lot’s of resources for training, less on inference on edge-devices!

© Prof. J. Castrillon. GdR SoC2. Montpellier, 201926

[Chen, OSDI’18]Example flow: TVM

Page 21: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Domain-specific abstractions

q Commonality: Tensor expression languages

q Increase programmer’s productivityq From compiler perspective: No abstraction toll

q Easier access to informationq Larger score for optimization

27 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

var input A : matrix &var input u : tensorIN &

v = (A # A # A # u . [[5 8] [3 7] [1 6]])

[RWDSL’18]

vs

Page 22: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Correct by construction

q Especially important in embedded systemsq Correct by designq No abstraction leaks

q Current efforts in formal semantics for safe code generation

28 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

[Array’19]

A = placeholder((m,h), name='A')

B = placeholder((n,h), name='B')

k = reduce_axis((0, h), name='k')

C = compute((m, n), lambda i, j:

sum(A[k, i] * B[k, j], axis=k))

!"# = %&'(

)*&"+&#

Page 23: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

TeML: Results

© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

q Extra control allows for new optimization (vs pluto): changing shapesq General tensor semantics allows covering more benchmarks than TensorFlow

29

[GPCE’18]

Page 24: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Emerging technologies: Racetrack memories

q Racetrack memories: one of many future alternativesq Cf. STT/Re-RAM, hybrid architectures

q Predicted extreme density at low latencyq 3D nano-wires with magnetic domains

q One port shared for many bitsq Domains move at high speeds (1000 ms-1)

q Sequential: Game changer for current HW/SW stackq Memory management

q Integration with other memory architectures q Data layout and allocation

[Parkin-Nature’15]

© Prof. J. Castrillon. GdR SoC2. Montpellier, 201930

[TVLSI’18, TVLSI’19]

Page 25: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Compiler research on placement

q Compiler pass to reorganize placement of instructionsq Instruction fetch is naturally sequential!q Layout instructions to reduce shift operations

q Compiler pass to reorganize data for higher-level objectsq Variable allocation

q Tensor allocation and program transformation

31

[ISLPED’19]

[ArXiv’19]

[LCTES’19]

Page 26: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Simulating RTMs

q RTSim: Configurable racetrack simulatorq Allows running software benchmarksq Built on stop of other simulator technology: NVMAIN 2, Gem5, SystemC, …

© Prof. J. Castrillon. GdR SoC2. Montpellier, 201932

https://github.com/tud-ccc/RTSim[IEEE CAL’19]

Page 27: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Architecture and data layout optimization

q Architecture – software co-optimizationq Embedded system for inference: RTM as scratchpad with pre-shifting and other

optimizations

33[LCTES’19]

Page 28: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Architecture and data layout optimization

q Data-layout: Reduce the number of shifts

34[LCTES’19]

n

DB

C:2n

DB

C:2n+1

DB

C:3n-1

R0

R1

Rn-1

C00 C01

n

R0

R1

Rn-1

DB

C:0

DB

C:1

DB

C:n-1

A00 A01 A0n-1

A10 A1n-1

An-1n-1

C0 Cn-1DBC:n DBC:n+1 DBC:2n-1

B00

B10

Bn-10

A00 A01 A0n-1

compulsory shifts

overhead shifts

B00 B10 Bn-10

compulsory shifts

overhead shifts

C1

B01

B11

Bn-11

C (Bank-2)Ã (Bank-0) ~B (Bank-1)

Page 29: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Latency comparison vs SRAM

q Un-optimized and naïve mapping: Even worse latency than SRAMq 24% average improvement (even with very conservative circuit simulation)

35

0

0.5

1

1.5

2

2.5

3

4 8 16 32 64 128 256 512 1024 2048 Avg

Nor

mal

ized

late

ncy

Tensor size

SRAM RTM-naïve RTM-opt RTM-opt-ps

[LCTES’19]

Page 30: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Energy comparison vs SRAM

q Higher savings due to less leakage powerq 74% average improvement

36

0

0.2

0.4

0.6

0.8

1

1.2

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

4 8 16 32 64 128 256 512 1024 2048

Nor

mal

ized

ene

rgy

Tensor size

Leakage Energy Read/Write Energy Shift Energy

[LCTES’19]

Page 31: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Discussion

© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

Page 32: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

Summary

q System dynamics: More complex in distributed IoT scenariosq Hybrid compile-runtime methodologies: Difficult balance, new interfacesq Strive at retaining time predictability

q New workloads (e.g., ML) + new techs: Harness domain-specific abstractionsq More complex decision making (e.g., het. Memory systems)

q Expression DSLs to ease high-level manipulation and transformation

q Exciting times for DSE, SW/HW co-design formal languages for emerging platforms (example: RTM-scratchpads)

© Prof. J. Castrillon. GdR SoC2. Montpellier, 201938

Page 33: SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

References

[Springer’14] J. Castrillon, R. Leupers, "Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap" , Springer, pp. 258, 2014. [SDR’10] J. Castrillon, S. Schu ̈rmans, A. Stulova, W. Sheng, T. Kempf, R. Leupers, G. Ascheid, and H. Meyr, “Component-based waveform development: The nucleus tool flow for efficient and portable SDR,” Wireless Innovation Conference and Product Exposition (SDR), 2010[ALOG’11] J. Castrillon, S. Schu ̈rmans, A. Stulova, W. Sheng, T. Kempf, R. Leupers, G. Ascheid, and H. Meyr, “Component-based waveform development: The nucleus tool flow for efficient and portable software defined radio”, Analog Integrated Circuits and Signal Processing, vol. 69, no. 2–3, pp. 173–190, 2011

[ACM TACO’17] Goens, A. et al. “Symmetry in Software Synthesis”. In: ACM Transactions on Architecture and Code Optimization (TACO) (2017).

[IESS’15] A. Goens and J. Castrillon, “Analysis of Process Traces for Mapping Dynamic KPN Applications to MPSoCs”, In IFIP International Embedded Systems Symposium (IESS), 2015, Foz do Iguaçu, Brazil, 2015.

[PARMA-DITAM’18] R. Khasanov, et al, "Implicit Data-Parallelism in Kahn Process Networks: Bridging the MacQueen Gap" , PARMA-DITAM 18, ACM, pp. 20–25, Jan 2018.[SCOPES’17a] G. Hempel, et al, "Robust Mapping of Process Networks to Many-Core Systems Using Bio-Inspired Design Centering" SCOPES’17.[SCOPES’17b] Goens, A. et al. “TETRiS: a Multi-Application Run-Time System for Predictable Execution of Static Mappings”, SCOPES’17.[Chen, OSDI’18] Chen, Tianqi, et al. "TVM: An Automated End-to-End Optimizing Compiler for Deep Learning." 13th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 18). 2018.[RWDSL’18] N. A. Rink, et al. “CFDlang: High-level code generation for high-order methods in fluid dynamics”. RWDSL’18.[Array’19] N.A. Rink, N. A. and Jeronimo Castrillon. “TeIL: a type-safe imperative Tensor Intermediate Language”,

ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY), ACM, 2019, 57-68[GPCE’17] A. Susungi, et al. "Towards Compositional and Generative Tensor Optimizations" GPCE’17, 169-175.

[GPCE’18] A. Susungi, et al. " "Meta-programming for cross-domain tensor optimizations" GPCE’18, 79-92.

[TVLSI’18] F. Hameed, A. A. Khan, J. Castrillon, "Performance and Energy Efficient Design of STT-RAM Last-Level-Cache" , In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 26, no. 6, pp. 1059–1072, Jun 2018.

[TVLSI’19] F. Hameed, J. Castrillon, "A Novel Hybrid DRAM/STT-RAM Last-Level-Cache Architecture for Performance, Energy and Endurance Enhancement" , In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), pp. 13pp, Jun 2019.

[Parkin-Nature’15] S. Parkin, and See-Hun Yang. "Memory on the racetrack." Nature nanotechnology 10.3 (2015): 195.

[IEEE CAL’19] A. A. Khan, F. Hameed, R. Bläsing, S. Parkin, J.Castrillon, "RTSim: A Cycle-accurate Simulator for Racetrack Memories" In IEEE Computer Architecture Letters, Feb 2019.

[ISPLED’19] J. Multanen, A. A. Khan, P. Jääskeläinen, F. Hameed, J. Castrillon, “SHRIMP: Efficient Instruction Delivery with Domain Wall Memory”, International Symposium on Low Power Electronics and Design (ISLPED), ACM, 2019, 10pp

[LCTES’19] Khan, A. A., Rink, N. A., Hameed, F., Castrillon, J. "Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads”, Proceedings of LCTES’19, ACM, pp. 12pp, New York, NY, USA, Jun 2019

[ArXiv’19] Khan, A. A., Hameed, F., Bläsing, R., Parkin, S., Castrillon, J. “ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0”, ArXiv arXiv:1903.03597, Mar 2019.

41