1
Programming High Performance Embedded Systems:
Tackling the Performance Portability Problem
Alastair Reid
Principal Engineer, R&D
ARM Ltd
2
Programming HP Embedded Systems
High-Performance Energy-Efficient Hardware Example: Ardbeg processor cluster (ARM R&D)
Portable System-level programming Example: SoC-C language extensions (ARM R&D)
Portable Kernel-level programming Example: C+Builtins
Example: Data Parallel Language
Merging System/Kernel-level programming
3
Mobile Consumer Electronics Trends
Mobile Application Requirements Still Growing Rapidly
Still cameras: 2 Mpixel → 10 Mpixel
Video cameras: VGA → HD 1080p …
Video players: MPEG-2 → H.264
2D Graphics: QVGA → HVGA → VGA → FWVGA …
3D Gaming: > 30 Mtriangle/s, antialiasing, …
Bandwidth: HSDPA (14.4 Mbps) → WiMax (70 Mbps) → LTE (326 Mbps)
Feature Convergence Phone + graphics + UI + games + still camera + video camera + music + WiFi + Bluetooth + 3.5G + 3.9G + WiMax + GPS + …
5
Mobile SDR Design Challenges
[Figure: log-log plot of Peak Performance (Gops) against Power (Watts), with lines of constant power efficiency at 1, 10, and 100 Mops/mW; better power efficiency lies toward the upper left. General-purpose processors (e.g. Pentium M) and embedded DSPs (e.g. TI C6x, IBM Cell, high-end DSPs) fall well short of the Mobile SDR Requirements region.]
SDR Design Objectives for 3G and WiFi
Throughput requirements: 40+ Gops peak throughput
Power budget: 100 mW to 500 mW peak power
Slide adapted from M. Woh’s ‘From SODA to Scotch’, MICRO-41, 2008
8
Energy Efficient Systems are “Lumpy”
Drop Frequency 10x: Desktop: 2-4 GHz; Mobile: 200-400 MHz
Increase Parallelism 100x: Desktop: 1-2 cores; Mobile: 32-way SIMD instruction set, 4-8 cores
Match Processor Type to Task: Desktop: homogeneous, general purpose; Mobile: heterogeneous, specialised
Keep Memory Local: Desktop: coherent, shared memory; Mobile: processor-memory clusters linked by DMA
10
Ardbeg PE
[Figure: Ardbeg PE datapath: a 512-bit SIMD register file feeding a 512-bit SIMD multiplier and a 512-bit SIMD ALU with shuffle, a SIMD shuffle network, a 1024-bit SIMD accumulator register file, a SIMD predicate ALU and predicate register file, a SIMD+scalar transfer unit, a scalar ALU+multiplier with scalar register file and accumulator, AGUs with their own register file, L1 program and data memories, and L2 memory, joined by interconnects. Labelled regions: 1. wide SIMD; 2. scalar & AGU; 3. memory.]
Ardbeg System
[Figure: control processor, FEC accelerator, two PEs (each an execution unit plus L1 memory), DMA controller, peripherals, and L2 memory on a 512-bit bus, connected via a 64-bit AMBA 3 AXI interconnect.]
Ardbeg SDR Processor
Application-specific hardware
2-level memory hierarchy
8-, 16-, 32-bit fixed point support; 512-bit SIMD
Sparsely connected VLIW
Slide adapted from M. Woh’s ‘From SODA to Scotch’, MICRO-41, 2008
11
[Figure: log-log plot of Achieved Throughput (Mbps) against Power (Watts) for W-CDMA voice, W-CDMA data, W-CDMA 2Mbps, 802.11a, DVB-T and DVB-H workloads, comparing Ardbeg, SODA, ASICs, Sandblaster, TigerSHARC and Pentium M, including 180nm results for 802.11a and W-CDMA 2Mbps.]
Summary of Ardbeg SDR Processor
• Ardbeg is lower power at the same throughput
• We are getting closer to ASICs
Slide adapted from M. Woh’s ‘From SODA to Scotch’, MICRO-41, 2008
12
How do we program AMP systems?
C doesn’t provide language features to support:
Multiple processors (or multi-ISA systems)
Distributed memory
Multiple threads
13
Use Indirection (Strawman #1)
Add a layer of indirection:
Operating System
Layer of middleware
Device drivers
Hardware support
All impose a cost in Power/Performance/Area
14
Raise Pain Threshold (Strawman #2)
Write efficient code at a very low level of abstraction
Problems:
Hard, slow and expensive to write, test, debug and maintain
Design intent drowns in sea of low level detail
Not portable across different architectures
Expensive to try different points in design space
15
Our Response
Extend C to Support Asymmetric Multiprocessors
SoC-C language raises level of abstraction
… but take care not to hide expensive operations
16
SoC-C Overview
Pocket-Sized Supercomputers
Energy efficient hardware is “lumpy” … and unsupported by C … but supported by SoC-C
SoC-C Extensions by Example: Pipeline Parallelism, Code Placement, Data Placement
SoC-C Conclusion
17
3 steps in mapping an application
1. Decide how to parallelize
2. Choose processors for each pipeline stage
3. Resolve distributed memory issues
18
A Simple Program
int x[100];
int y[100];
int z[100];
while (1) {
get(x);
foo(y,x);
bar(z,y);
baz(z);
put(z);
}
19
Simplified System Architecture
Distributed Memories
Control Processor
Data Engines (SIMD Instruction Set)
Accelerators
Artist’s impression
20
Step 1: Decide how to parallelize
int x[100];
int y[100];
int z[100];
while (1) {
get(x);
foo(y,x);
bar(z,y);
baz(z);
put(z);
}
50% of work
50% of work
21
Step 1: Decide how to parallelize
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x);
FIFO(y);
bar(z,y);
baz(z);
put(z);
}
}
PIPELINE indicates region to parallelize
FIFO indicates boundaries between pipeline stages
22
SoC-C Feature #1: Pipeline Parallelism
Annotations express coarse-grained pipeline parallelism
PIPELINE indicates scope of parallelism
FIFO indicates boundaries between pipeline stages
Compiler splits into threads communicating through FIFOs
23
Step 2: Choose Processors
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x);
FIFO(y);
bar(z,y);
baz(z);
put(z);
}
}
24
Step 2: Choose Processors
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x) @ P0;
FIFO(y);
bar(z,y) @ P1;
baz(z) @ P1;
put(z);
}
}
@ P indicates processor to execute function
25
SoC-C Feature #2: RPC Annotations
Annotations express where code is to execute
Behaves like a Synchronous Remote Procedure Call
Does not change meaning of program
Bulk data is not implicitly copied to processor’s local memory
26
Step 3: Resolve Memory Issues
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x) @ P0;
FIFO(y);
bar(z,y) @ P1;
baz(z) @ P1;
put(z);
}
}
P0 uses x → x must be in M0
P1 uses z → z must be in M1
P0 uses y → y must be in M0
P1 uses y → y must be in M1
Conflict?!
27
Hardware Cache Coherency
[Figure: processors P0 and P1 with caches $0 and $1. P0’s write to x invalidates the copy in $1; P1’s read copies x across; P1’s subsequent write invalidates the copy in $0.]
28
Step 3: Resolve Memory Issues
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x) @ P0;
FIFO(y);
bar(z,y) @ P1;
baz(z) @ P1;
put(z);
}
}
Two versions: y@M0, y@M1
write to y@M0 → y@M1 becomes invalid
read of y@M1 → coherence error
29
Step 3: Resolve Memory Issues
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x) @ P0;
SYNC(y) @ DMA;
FIFO(y);
bar(z,y) @ P1;
baz(z) @ P1;
put(z);
}
}
SYNC(x) @ P copies data from one version of x to another using processor P
read y@M1
y@M1 and y@M0 are valid
30
SoC-C Feature #3: Compile Time Coherency
Variables can have multiple coherent versions
Compiler uses memory topology to determine which version is being accessed
Compiler applies a cache coherency protocol:
Writing to a version makes it valid and the other versions invalid
Dataflow analysis propagates validity
Reading from an invalid version is an error
SYNC(x) copies from a valid version to an invalid version
31
Compiling SoC-C
See paper:
SoC-C: efficient programming abstractions for heterogeneous multicore systems on chip, Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems (CASES) 2008.
(Or view ‘bonus slides’ after talk.)
32
More realistic SoC-C code
DVB-T Inner Receiver
OFDM receiver: 20 tasks, 500-7000 cycles each, 29000 cycles total
adc_t adc;
ADC_Init(&adc,ADC_BUFSIZE_SAMPLES,adc_Re,adc_Im,13);
SOCC_PIPELINE {
ChannelEstimateInit_DVB_simd(TPS_INFO, CrRe, CrIm) @ DEd;
for(int sym = 0; sym<LOOPS; ++sym) {
cbuffer_t src_r, src_i;
unsigned len = Nguard+asC_MODE[Mode];
ADC_AcquireData(&adc,(sym*len)%ADC_BUFSIZE_SAMPLES,len,&src_r, &src_i);
align(sym_Re,&src_r,len*sizeof(int16_t)) @ DMA_512;
align(sym_Im,&src_i,len*sizeof(int16_t)) @ DMA_512;
ADC_ReleaseRoom(&adc,&src_r,&src_i,len);
RxGuard_DVB_simd(sym_Re,sym_Im,TPS_INFO,Nguard,guarded_Re,guarded_Im) @ DEa;
cscale_DVB_simd(guarded_Re,guarded_Im,23170,avC_MODE[Mode],fft_Re,fft_Im) @ DEa;
fft_DVB_simd(fft_Re,fft_Im,TPS_INFO,ReFFTTwid,ImFFTTwid) @ DEa;
SymUnWrap_DVB_simd(fft_Re,fft_Im,TPS_INFO,unwrapped_Re,unwrapped_Im) @ DEb;
DeMuxSymbol_DVB_simd(unwrapped_Re,unwrapped_Im,TPS_INFO,ISymNum,
demux_Re,demux_Im,PilotsRe,PilotsIm,TPSRe,TPSIm) @ DEb;
DeMuxSymbol_DVB_simd(CrRe,CrIm,TPS_INFO,ISymNum,
demux_CrRe,demux_CrIm,CrPilotsRe,CrPilotsIm,CrTPSRe,CrTPSIm) @ DEb;
cfir1_DVB_simd(demux_Re,demux_Im,demux_CrRe,demux_CrIm,avN_DCPS[Mode],equalized_Re,equalized_Im) @ DEc;
cfir1_DVB_simd(TPSRe,TPSIm,CrTPSRe,CrTPSIm,avN_TPSSCPS[Mode],equalized_TPSRe,equalized_TPSIm) @ DEb;
DemodTPS_DVB_simd(equalized_TPSRe,equalized_TPSIm,TPS_INFO,Pilot,TPSRe) @ DEb;
DemodPilots_DVB_simd(PilotsRe,PilotsIm,TPS_INFO,ISymNum,demod_PilotsRe,demodPilotsIm) @ DEb;
cmagsq_DVB_simd(demux_CrRe,demux_CrIm,12612,avN_DCPS[Mode],MagCr) @ DEc;
int Direction = (ISymNum & 1);
Direction ^= 1;
if (Direction) {
Error=SymInterleave3_DVB_simd2(equalized_Re,equalized_Im,MagCr,
DE_vinterleave_symbol_addr_DVB_T_N,
DE_vinterleave_symbol_addr_DVB_T_OFFSET,
TPS_INFO,Direction,sRe,sIm,sCrMag) @ DEc;
pack3_DVB_simd(sRe,sIm,sCrMag,avN_DCPS[Mode],interleaved_Re,interleaved_Im,Range) @ DEc;
} else {
unpack3_DVB_simd(equalized_Re,equalized_Im,MagCr,avN_DCPS[Mode],sRe,sIm,sCrMag) @ DEc;
Error=SymInterleave3_DVB_simd2(sRe,sIm,sCrMag,
DE_vinterleave_symbol_addr_DVB_T_N,
DE_vinterleave_symbol_addr_DVB_T_OFFSET,
TPS_INFO,Direction,interleaved_Re,interleaved_Im,Range) @ DEc;
}
ChannelEstimate_DVB_simd(interleaved_Re,interleaved_Im,Range,TPS_INFO,CrRe2,CrIm2) @ DEd;
Demod_DVB_simd(interleaved_Re,interleaved_Im,TPS_INFO,Range,demod_softBits) @ DEd;
BitDeInterleave_DVB_simd(demod_softBits,TPS_INFO,deint_softBits) @ DEd;
uint_t err=HardDecoder_DVB_simd(deint_softBits,uvMaxCnt,hardbits) @ DEd;
Bytecpy(&output[p],hardbits,uMaxCnt/8) @ ARM;
p += uMaxCnt/8;
ISymNum = (ISymNum+1) % 4;
}
ADC_Fini(&adc);
33
Parallel Speedup
Efficient: same performance as hand-written code
Near Linear Speedup: very efficient use of parallel hardware
[Figure: bar chart of speedup against number of processors (1 to 4), on a 0%-400% axis, rising nearly linearly toward 400% at 4 processors.]
34
What SoC-C Provides
SoC-C language features:
Pipeline to support parallelism
Coherence to support distributed memory
RPC to support multiple processors/ISAs
Non-features:
Does not choose the boundary between pipeline stages
Does not resolve coherence problems
Does not allocate processors
SoC-C is a concise notation to express mapping decisions (not a tool for making them on your behalf)
35
Related Work
Language:
OpenMP: SMP data parallelism using ‘C plus annotations’
StreamIt: pipeline parallelism using a dataflow language
Pipeline parallelism:
J.E. Smith, “Decoupled access/execute computer architectures,” Trans. Computer Systems, 2(4), 1984
Multiple independent reinventions
Hardware:
Woh et al., “From SODA to Scotch: The Evolution of a Wireless Baseband Processor,” Proc. MICRO-41, Nov. 2008
36
More Recent Related Work
Mapping applications onto Embedded SoCs:
Exposing Non-Standard Architectures to Embedded Software using Compile-Time Virtualization, CASES 2009
Pipeline parallelism:
The Paralax Infrastructure: Automatic Parallelization with a Helping Hand, PACT 2010
37
The SoC-C Model
Program as if using an SMP system:
Single multithreaded processor: RPCs provide a “migrating thread” model
Single memory: compiler-managed coherence handles the “bookkeeping”
Annotations change execution, not semantics
Avoid the need to restructure code:
Pipeline parallelism
Compiler-managed coherence
Efficiency:
Avoid abstracting expensive operations, so the programmer can optimize and reason about them
38
Kernel Programming
39
Overview
Example: FIR filter
Hand-vectorized code: optimal performance
Issues
An Alternative Approach
y_i = Σ_{j=0}^{T-1} h_j · x_{i+j}
40
Example Vectorized Code
Very fast, efficient code
Uses 32-wide SIMD: each SIMD multiply performs 32 (useful) multiplies
VLIW compiler overlaps operations: 3 vector operations per cycle
VLIW compiler performs software pipelining: multiplier active on every cycle
void FIR(vint16_t x[], vint16_t y[], int16_t h[]) {
  vint16_t v = x[0];
  for (int i = 0; i < N/SIMD_WIDTH; ++i) {
    vint16_t w = x[i+1];                 // pre-load the next input vector
    vint32L_t acc = vqdmull(v, h[0]);    // acc = v * h[0]
    int16_t s = vget_lane(w, 0);
    v = vdown(v, s);                     // slide v down one lane, inserting s
    for (int j = 1; j < T-1; ++j) {
      acc = vqdmlal(acc, v, h[j]);       // acc += v * h[j]
      s = vget_lane(w, j);
      v = vdown(v, s);
    }
    y[i] = vqrdmlah(acc, v, h[T-1]);     // final tap, round and write back
    v = w;
  }
}
41
Portability Issues
Vendor specific SIMD operations
vqdmull, vdown, vget_lane
SIMD-width specific
Assumes SIMD_WIDTH >= T
Doesn’t work, or performs badly, on:
Many SIMD architectures
GPGPU
SMP
42
Flexibility issues
Improve arithmetic intensity: merge with an adjacent kernel
E.g., if filtering input to an FFT, combine with bit reversal
Parallelize a task across two Ardbeg engines: requires modification to system-level code
43
Summary
Programming directly to the processor:
Produces very high performance code
Kernel is not portable to other processor types
Kernels cannot be remapped to other devices
Kernels cannot be split/merged to improve scheduling or reduce inter-kernel overheads
Often produces a local optimum but misses the global optimum
44
(Towards) Performance-Portable Kernel Programming
45
Outline
The goal
Quick and dirty demonstration
References to (more complete) versions
What still needs to be done
46
An alternative approach
A compiler generates the kernel directly from the specification y_i = Σ_{j=0}^{T-1} h_j · x_{i+j}
47
A simple data parallel language
loop(N) {
V1 = load(a);
V2 = load(b);
V3 = add(V1,V2);
store(c,V3);
}
* Currently implemented as a Haskell EDSL – adapted to C-like notation for presentation.
[Figure: V1 holds a0 a1 a2 a3 …, V2 holds b0 b1 b2 b3 …; V3 is the elementwise sum a0+b0 a1+b1 a2+b2 …, which is stored to c. The loop covers all N elements.]
48
Compiling Vector Expressions
Vector Expression   | Init   | Generate         | Next
V1 = load(a);       | p1=a;  | V1=vld1(p1);     | p1+=32;
V2 = load(b);       | p2=b;  | V2=vld1(p2);     | p2+=32;
V3 = add(V1,V2);    |        | V3=vadd(V1,V2);  |
store(c,V3);        | p3=c;  | vst1(p3,V3);     | p3+=32;

p1=a; p2=b; p3=c;
for (i=0; i<N; i+=32) {
  V1=vld1(p1);
  V2=vld1(p2);
  V3=vadd(V1,V2);
  vst1(p3,V3);
  p1+=32; p2+=32; p3+=32;
}
50
Generating datapath
[Figure: datapath generated from the loop: address counters (+1) index MemA and MemB, their outputs feed an adder, and the result is written through a third address counter (+1) into MemC.]
* Warning: this circuit does not adhere to any ARM quality standards.
51
Adding control
[Figure: the same datapath with enable (en) signals on the counters and registers, plus a loop counter (-1) compared against zero (!=0) to produce an nDone signal that gates execution.]
* Warning: this circuit does not adhere to any ARM quality standards.
52
Fixing timing
[Figure: the same circuit with the enable signals retimed so each pipeline stage is enabled in the correct cycle.]
53
Related Work
NESL: nested data parallelism (CMU); targets Cray vector machines, Connection Machine
DpH: generalization of NESL as a Haskell library (SLPJ++); targets GPGPU
Accelerator: data parallel library in C#/F# (MSR); targets SMP, DirectX9, FPGA
Array Building Blocks: C++ template library (Intel); targets SMP, SSE
Thrust: C++ template library (NVidia); targets GPGPU
(Also: parallel skeletons, map-reduce, etc.)
54
Summary of approach
(Only) use highly structured bulk operations:
Bulk operations reason about vectors, not individual elements
Simple mathematical properties, easy to optimize
Single frontend, multiple backends: SIMD, SMP, GPGPU, FPGA, ...
(Scope for significant platform-dependent optimization)
55
Breaking down boundaries
Hard boundary between system and kernel layers:
Separate languages
Separate tools
Separate people writing/optimizing
Need to soften the boundary:
Allow kernels to be split across processors
Allow kernels to be merged across processors
Allow kernels A and B to agree to use a non-standard memory layout (to run more efficiently)
(This is an open problem)
56
Tackling Performance Portability Problem
High Performance Embedded Systems:
Energy efficient systems are “lumpy”
The hardware is the easy bit
Two level approach:
System programming: stitch kernels together; inter-kernel parallelism, mapping onto processors/memory
Kernel programming: C+builtins is efficient but inflexible and non-portable; a simple DPL in this talk, with references to more substantial efforts; intra-kernel parallelism expressed
The boundary must be softened
57
Fin
58
Language Design Meta Issues
Compiler only uses simple analyses:
Easier to maintain consistency between different compiler versions/implementations
Programmer makes the high-level decisions:
Code and data placement
Inserting SYNC
Load balancing
Implementation by many source-source transforms:
Programmer can mix high- and low-level features
90-10 rule: use high-level features when you can, low-level features when you need to
59
Compiling SoC-C
1. Data Placement
   a) Infer data placement
   b) Propagate coherence
   c) Split variables with multiple placement
2. Pipeline Parallelism
   a) Identify maximal threads
   b) Split into multiple threads
   c) Apply zero copy optimization
3. RPC (see paper for details)
60
Step 1a: Infer Data Placement
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x) @ P0;
SYNC(y) @ DMA;
FIFO(y);
bar(z,y) @ P1;
baz(z) @ P1;
put(z);
}
}
Solve Set of Constraints
61
Step 1a: Infer Data Placement
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x) @ P0;
SYNC(y) @ DMA;
FIFO(y);
bar(z,y) @ P1;
baz(z) @ P1;
put(z);
}
}
Solve Set of Constraints
Memory Topology constrains where variables could live
62
Solve Set of Constraints
Memory Topology constrains where variables could live
Step 1a: Infer Data Placement
int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};
PIPELINE {
while (1) {
get(x@?);
foo(y@M0, x@M0) @ P0;
SYNC(y,?,?) @ DMA;
FIFO(y@?);
bar(z@M1, y@M1) @ P1;
baz(z@M1) @ P1;
put(z@?);
}
}
63
Solve Set of Constraints
Memory Topology constrains where variables could live
Forwards Dataflow propagates availability of valid versions
Step 1b: Propagate Coherence
int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};
PIPELINE {
while (1) {
get(x@?);
foo(y@M0, x@M0) @ P0;
SYNC(y,?,?) @ DMA;
FIFO(y@?);
bar(z@M1, y@M1) @ P1;
baz(z@M1) @ P1;
put(z@?);
}
}
64
Solve Set of Constraints
Memory Topology constrains where variables could live
Forwards Dataflow propagates availability of valid versions
Step 1b: Propagate Coherence
int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};
PIPELINE {
while (1) {
get(x@?);
foo(y@M0, x@M0) @ P0;
SYNC(y,?,M0) @ DMA;
FIFO(y@?);
bar(z@M1, y@M1) @ P1;
baz(z@M1) @ P1;
put(z@M1);
}
}
65
Solve Set of Constraints
Memory Topology constrains where variables could live
Forwards Dataflow propagates availability of valid versions
Backwards Dataflow propagates need for valid versions
Step 1b: Propagate Coherence
int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};
PIPELINE {
while (1) {
get(x@?);
foo(y@M0, x@M0) @ P0;
SYNC(y,?,M0) @ DMA;
FIFO(y@?);
bar(z@M1, y@M1) @ P1;
baz(z@M1) @ P1;
put(z@M1);
}
}
66
Solve Set of Constraints
Memory Topology constrains where variables could live
Forwards Dataflow propagates availability of valid versions
Backwards Dataflow propagates need for valid versions
Step 1b: Propagate Coherence
int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};
PIPELINE {
while (1) {
get(x@M0);
foo(y@M0, x@M0) @ P0;
SYNC(y,M1,M0) @ DMA;
FIFO(y@M1);
bar(z@M1, y@M1) @ P1;
baz(z@M1) @ P1;
put(z@M1);
}
}
67
Step 1c: Split Variables
int x[100] @ {M0};
int y0[100] @ {M0};
int y1[100] @ {M1};
int z[100] @ {M1};
PIPELINE {
  while (1) {
    get(x);
    foo(y0, x) @ P0;
    memcpy(y1,y0,…) @ DMA;
    FIFO(y1);
    bar(z, y1) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
Split variables with multiple locations
Replace SYNC with memcpy
68
Step 2: Implement Pipeline Annotation
int x[100] @ {M0};
int y0[100] @ {M0};
int y1[100] @ {M1};
int z[100] @ {M1};
PIPELINE {
  while (1) {
    get(x);
    foo(y0, x) @ P0;
    memcpy(y1,y0,…) @ DMA;
    FIFO(y1);
    bar(z, y1) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
Dependency Analysis
69 69
Step 2a: Identify Dependent Operations
int x[100] @ {M0};
int y0[100] @ {M0};
int y1[100] @ {M1};
int z[100] @ {M1};
PIPELINE {
  while (1) {
    get(x);
    foo(y0, x) @ P0;
    memcpy(y1,y0,…) @ DMA;
    FIFO(y1);
    bar(z, y1) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
Dependency Analysis
Split use-def chains at FIFOs
70 70
Step 2b: Identify Maximal Threads
int x[100] @ {M0};
int y0[100] @ {M0};
int y1[100] @ {M1};
int z[100] @ {M1};
PIPELINE {
  while (1) {
    get(x);
    foo(y0, x) @ P0;
    memcpy(y1,y0,…) @ DMA;
    FIFO(y1);
    bar(z, y1) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
Dependency Analysis
Split use-def chains at FIFOs
Identify Thread Operations
71 71
Step 2b: Split Into Multiple Threads
int x[100] @ {M0};
int y0[100] @ {M0};
int y1a[100] @ {M1};
int y1b[100] @ {M1};
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      foo(y0, x) @ P0;
      memcpy(y1a,y0,…) @ DMA;
      fifo_put(&f, y1a);
    }
  }
  SECTION {
    while (1) {
      fifo_get(&f, y1b);
      bar(z, y1b) @ P1;
      baz(z) @ P1;
      put(z);
    }
  }
}
Perform Dataflow Analysis
Split use-def chains at FIFOs
Identify Thread Operations
Split into threads
72 72
Step 2c: Zero Copy Optimization
int x[100] @ {M0};
int y0[100] @ {M0};
int y1a[100] @ {M1};
int y1b[100] @ {M1};
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      foo(y0, x) @ P0;
      memcpy(y1a,y0,…) @ DMA;
      fifo_put(&f, y1a);
    }
  }
  SECTION {
    while (1) {
      fifo_get(&f, y1b);
      bar(z, y1b) @ P1;
      baz(z) @ P1;
      put(z);
    }
  }
}
Generate data, copy into FIFO
Copy out of FIFO, consume data
73 73
Step 2c: Zero Copy Optimization
int x[100] @ {M0};
int y0[100] @ {M0};
int y1a[100] @ {M1};
int y1b[100] @ {M1};
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      foo(y0, x) @ P0;
      memcpy(y1a,y0,…) @ DMA;
      fifo_put(&f, y1a);
    }
  }
  SECTION {
    while (1) {
      fifo_get(&f, y1b);
      bar(z, y1b) @ P1;
      baz(z) @ P1;
      put(z);
    }
  }
}
Calculate live range of variables passed through FIFOs
Live range of y1a
Live range of y1b
74 74
Step 2c: Zero Copy Optimization
int x[100] @ {M0};
int y0[100] @ {M0};
int *py1a;
int *py1b;
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      foo(y0, x) @ P0;
      fifo_acquireRoom(&f, &py1a);
      memcpy(py1a,y0,…) @ DMA;
      fifo_releaseData(&f, py1a);
    }
  }
  SECTION {
    while (1) {
      fifo_acquireData(&f, &py1b);
      bar(z, py1b) @ P1;
      fifo_releaseRoom(&f, py1b);
      baz(z) @ P1;
      put(z);
    }
  }
}
Calculate live range of variables passed through FIFOs
Transform FIFO operations to pass pointers instead of copying data:
Acquire empty buffer; generate data directly into buffer; pass full buffer to thread 2
Acquire full buffer from thread 1; consume data directly from buffer; release empty buffer
75
Step 3a: Resolve Overloaded RPC
int x[100] @ {M0};
int y0[100] @ {M0};
int *py1a;
int *py1b;
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      DE32_foo(0, y0, x);
      fifo_acquireRoom(&f, &py1a);
      DMA_memcpy(py1a,y0,…);
      fifo_releaseData(&f, py1a);
    }
  }
  SECTION {
    while (1) {
      fifo_acquireData(&f, &py1b);
      DE32_bar(1, z, py1b);
      fifo_releaseRoom(&f, py1b);
      DE32_baz(1, z);
      put(z);
    }
  }
}
Replace each RPC by an architecture-specific call:
bar(…) @ P1 → DE32_bar(1,…)
76
Step 3b: Split RPCs
int x[100] @ {M0};
int y0[100] @ {M0};
int *py1a;
int *py1b;
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      start_DE32_foo(0, y0, x);
      wait(semaphore_DE32[0]);
      fifo_acquireRoom(&f, &py1a);
      start_DMA_memcpy(py1a,y0,…);
      wait(semaphore_DMA);
      fifo_releaseData(&f, py1a);
    }
  }
  SECTION {
    while (1) {
      fifo_acquireData(&f, &py1b);
      start_DE32_bar(1, z, py1b);
      wait(semaphore_DE32[1]);
      fifo_releaseRoom(&f, py1b);
      start_DE32_baz(1, z);
      wait(semaphore_DE32[1]);
      put(z);
    }
  }
}
RPCs have two phases:
start the RPC
wait for the RPC to complete
DE32_foo(0,…) → start_DE32_foo(0,…); wait(semaphore_DE32[0]);
77
Order of transformations
Dataflow-sensitive transformations go first:
Inferring data placement
Coherence checking within threads
Dependency analysis for parallelism
Parallelism transformations next: they obscure data and control flow
Thread-local optimizations go last:
Zero-copy optimization of FIFO operations
Continuation-passing thread implementation
78
Aside: Why hardware companies are fun
You get to play with cool hardware, often before it has been debugged
You get to play with powerful debugging tools: an incredible level of detail is visible
E.g., Palladium traces on the next slides
79
Unoptimized task scheduling
[Figure: Palladium trace of the fft and demod tasks on DE0 and DE1; marked intervals of 195 cycles and 273 cycles.]
80
Optimized device driver on ARM
[Figure: the same trace with intervals reduced to 155 cycles and 257 cycles.]
81
Task scheduling hardware support
[Figure: the same trace with intervals of 1 cycle, 303 cycles and 183 cycles.]