1
Programming High Performance Embedded Systems:
Tackling the Performance Portability Problem
Alastair Reid
Principal Engineer, R&D
ARM Ltd
2
Programming HP Embedded Systems
High-Performance Energy-Efficient Hardware Example: Ardbeg processor cluster (ARM R&D)
Portable System-level programming Example: SoC-C language extensions (ARM R&D)
Portable Kernel-level programming Example: C+Builtins
Example: Data Parallel Language
Merging System/Kernel-level programming
3
Mobile Consumer Electronics Trends
Mobile Application Requirements Still Growing Rapidly
Still cameras: 2 Mpixel → 10 Mpixel
Video cameras: VGA → HD 1080p …
Video players: MPEG-2 → H.264
2D Graphics: QVGA → HVGA → VGA → FWVGA …
3D Gaming: > 30 Mtriangle/s, antialiasing, …
Bandwidth: HSDPA (14.4 Mbps) → WiMax (70 Mbps) → LTE (326 Mbps)
Feature Convergence Phone + graphics + UI + games + still camera + video camera + music + WiFi + Bluetooth + 3.5G + 3.9G + WiMax + GPS + …
5
Mobile SDR Design Challenges
[Figure: log-log plot of Peak Performance (Gops) against Power (Watts), with lines of constant power efficiency at 1, 10, and 100 Mops/mW; better power efficiency lies toward the upper left. General-purpose processors (e.g. Pentium M) and embedded DSPs (e.g. TI C6x, IBM Cell, high-end DSPs) fall well short of the Mobile SDR Requirements region.]
SDR Design Objectives for 3G and WiFi
Throughput requirements: 40+ Gops peak throughput
Power budget: 100 mW to 500 mW peak power
Slide adapted from M. Woh’s ‘From SODA to Scotch’, MICRO-41, 2008
8
Energy Efficient Systems are “Lumpy”
Drop Frequency 10x: Desktop: 2-4 GHz; Mobile: 200-400 MHz
Increase Parallelism 100x: Desktop: 1-2 cores; Mobile: 32-way SIMD instruction set, 4-8 cores
Match Processor Type to Task: Desktop: homogeneous, general purpose; Mobile: heterogeneous, specialised
Keep Memory Local: Desktop: coherent, shared memory; Mobile: processor-memory clusters linked by DMA
10
Ardbeg PE
[Figure: Ardbeg PE datapath: a 512-bit SIMD register file feeding a 512-bit SIMD multiplier and a 512-bit SIMD ALU with shuffle, a SIMD shuffle network, a 1024-bit SIMD accumulator register file, a SIMD predicate ALU and predicate register file, a SIMD+scalar transfer unit, a scalar ALU+multiplier with scalar register file and accumulator, AGUs with their own register file, L1 program and data memories, and L2 memory, joined by interconnects. Labelled regions: 1. wide SIMD; 2. scalar & AGU; 3. memory.]
Ardbeg System
[Figure: control processor, FEC accelerator, two PEs (each an execution unit plus L1 memory), DMA controller, peripherals, and L2 memory on a 512-bit bus, connected via a 64-bit AMBA 3 AXI interconnect.]
Ardbeg SDR Processor
Application-specific hardware
2-level memory hierarchy
8-, 16-, 32-bit fixed point support; 512-bit SIMD
Sparsely connected VLIW
Slide adapted from M. Woh’s ‘From SODA to Scotch’, MICRO-41, 2008
11
[Figure: log-log plot of Achieved Throughput (Mbps) against Power (Watts) for W-CDMA voice, W-CDMA data, W-CDMA 2Mbps, 802.11a, DVB-T and DVB-H workloads, comparing Ardbeg, SODA, ASICs, Sandblaster, TigerSHARC and Pentium M, including 180nm results for 802.11a and W-CDMA 2Mbps.]
Summary of Ardbeg SDR Processor
• Ardbeg is lower power at the same throughput
• We are getting closer to ASICs
Slide adapted from M. Woh’s ‘From SODA to Scotch’, MICRO-41, 2008
12
How do we program AMP systems?
C doesn’t provide language features to support:
Multiple processors (or multi-ISA systems)
Distributed memory
Multiple threads
13
Use Indirection (Strawman #1)
Add a layer of indirection:
Operating System
Layer of middleware
Device drivers
Hardware support
All impose a cost in Power/Performance/Area
14
Raise Pain Threshold (Strawman #2)
Write efficient code at a very low level of abstraction
Problems:
Hard, slow and expensive to write, test, debug and maintain
Design intent drowns in sea of low level detail
Not portable across different architectures
Expensive to try different points in design space
15
Our Response
Extend C to Support Asymmetric Multiprocessors
SoC-C language raises level of abstraction
… but take care not to hide expensive operations
16
SoC-C Overview
Pocket-Sized Supercomputers
Energy efficient hardware is “lumpy” … and unsupported by C … but supported by SoC-C
SoC-C Extensions by Example: Pipeline Parallelism, Code Placement, Data Placement
SoC-C Conclusion
17
3 steps in mapping an application
1. Decide how to parallelize
2. Choose processors for each pipeline stage
3. Resolve distributed memory issues
18
A Simple Program
int x[100];
int y[100];
int z[100];
while (1) {
get(x);
foo(y,x);
bar(z,y);
baz(z);
put(z);
}
19
Simplified System Architecture
Distributed Memories
Control Processor
Data Engines (SIMD Instruction Set)
Accelerators
Artist’s impression
20
Step 1: Decide how to parallelize
int x[100];
int y[100];
int z[100];
while (1) {
get(x);
foo(y,x);
bar(z,y);
baz(z);
put(z);
}
50% of work
50% of work
21
Step 1: Decide how to parallelize
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x);
FIFO(y);
bar(z,y);
baz(z);
put(z);
}
}
PIPELINE indicates region to parallelize
FIFO indicates boundaries between pipeline stages
22
SoC-C Feature #1: Pipeline Parallelism
Annotations express coarse-grained pipeline parallelism
PIPELINE indicates scope of parallelism
FIFO indicates boundaries between pipeline stages
Compiler splits into threads communicating through FIFOs
23
Step 2: Choose Processors
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x);
FIFO(y);
bar(z,y);
baz(z);
put(z);
}
}
24
Step 2: Choose Processors
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x) @ P0;
FIFO(y);
bar(z,y) @ P1;
baz(z) @ P1;
put(z);
}
}
@ P indicates processor to execute function
25
SoC-C Feature #2: RPC Annotations
Annotations express where code is to execute
Behaves like a Synchronous Remote Procedure Call
Does not change meaning of program
Bulk data is not implicitly copied to processor’s local memory
26
Step 3: Resolve Memory Issues
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x) @ P0;
FIFO(y);
bar(z,y) @ P1;
baz(z) @ P1;
put(z);
}
}
P0 uses x → x must be in M0
P1 uses z → z must be in M1
P0 uses y → y must be in M0
P1 uses y → y must be in M1
Conflict?!
27
Hardware Cache Coherency
[Figure: processors P0 and P1 with caches $0 and $1. P0’s write to x invalidates the copy in $1; P1’s read copies x across; P1’s subsequent write invalidates the copy in $0.]
28
Step 3: Resolve Memory Issues
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x) @ P0;
FIFO(y);
bar(z,y) @ P1;
baz(z) @ P1;
put(z);
}
}
Two versions: y@M0, y@M1
write to y@M0 → y@M1 becomes invalid
read of y@M1 → coherence error
29
Step 3: Resolve Memory Issues
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x) @ P0;
SYNC(y) @ DMA;
FIFO(y);
bar(z,y) @ P1;
baz(z) @ P1;
put(z);
}
}
SYNC(x) @ P copies data from one version of x to another using processor P
read y@M1
y@M1 and y@M0 are valid
30
SoC-C Feature #3: Compile Time Coherency
Variables can have multiple coherent versions
Compiler uses memory topology to determine which version is being accessed
Compiler applies a cache coherency protocol:
Writing to a version makes it valid and the other versions invalid
Dataflow analysis propagates validity
Reading from an invalid version is an error
SYNC(x) copies from a valid version to an invalid version
31
Compiling SoC-C
See paper:
SoC-C: efficient programming abstractions for heterogeneous multicore systems on chip, Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems (CASES) 2008.
(Or view ‘bonus slides’ after talk.)
32
More realistic SoC-C code
DVB-T Inner Receiver
OFDM receiver: 20 tasks, 500-7000 cycles each, 29000 cycles total
adc_t adc;
ADC_Init(&adc,ADC_BUFSIZE_SAMPLES,adc_Re,adc_Im,13);
SOCC_PIPELINE {
ChannelEstimateInit_DVB_simd(TPS_INFO, CrRe, CrIm) @ DEd;
for(int sym = 0; sym<LOOPS; ++sym) {
cbuffer_t src_r, src_i;
unsigned len = Nguard+asC_MODE[Mode];
ADC_AcquireData(&adc,(sym*len)%ADC_BUFSIZE_SAMPLES,len,&src_r, &src_i);
align(sym_Re,&src_r,len*sizeof(int16_t)) @ DMA_512;
align(sym_Im,&src_i,len*sizeof(int16_t)) @ DMA_512;
ADC_ReleaseRoom(&adc,&src_r,&src_i,len);
RxGuard_DVB_simd(sym_Re,sym_Im,TPS_INFO,Nguard,guarded_Re,guarded_Im) @ DEa;
cscale_DVB_simd(guarded_Re,guarded_Im,23170,avC_MODE[Mode],fft_Re,fft_Im) @ DEa;
fft_DVB_simd(fft_Re,fft_Im,TPS_INFO,ReFFTTwid,ImFFTTwid) @ DEa;
SymUnWrap_DVB_simd(fft_Re,fft_Im,TPS_INFO,unwrapped_Re,unwrapped_Im) @ DEb;
DeMuxSymbol_DVB_simd(unwrapped_Re,unwrapped_Im,TPS_INFO,ISymNum,
demux_Re,demux_Im,PilotsRe,PilotsIm,TPSRe,TPSIm) @ DEb;
DeMuxSymbol_DVB_simd(CrRe,CrIm,TPS_INFO,ISymNum,
demux_CrRe,demux_CrIm,CrPilotsRe,CrPilotsIm,CrTPSRe,CrTPSIm) @ DEb;
cfir1_DVB_simd(demux_Re,demux_Im,demux_CrRe,demux_CrIm,avN_DCPS[Mode],equalized_Re,equalized_Im) @ DEc;
cfir1_DVB_simd(TPSRe,TPSIm,CrTPSRe,CrTPSIm,avN_TPSSCPS[Mode],equalized_TPSRe,equalized_TPSIm) @ DEb;
DemodTPS_DVB_simd(equalized_TPSRe,equalized_TPSIm,TPS_INFO,Pilot,TPSRe) @ DEb;
DemodPilots_DVB_simd(PilotsRe,PilotsIm,TPS_INFO,ISymNum,demod_PilotsRe,demodPilotsIm) @ DEb;
cmagsq_DVB_simd(demux_CrRe,demux_CrIm,12612,avN_DCPS[Mode],MagCr) @ DEc;
int Direction = (ISymNum & 1);
Direction ^= 1;
if (Direction) {
Error=SymInterleave3_DVB_simd2(equalized_Re,equalized_Im,MagCr,
DE_vinterleave_symbol_addr_DVB_T_N,
DE_vinterleave_symbol_addr_DVB_T_OFFSET,
TPS_INFO,Direction,sRe,sIm,sCrMag) @ DEc;
pack3_DVB_simd(sRe,sIm,sCrMag,avN_DCPS[Mode],interleaved_Re,interleaved_Im,Range) @ DEc;
} else {
unpack3_DVB_simd(equalized_Re,equalized_Im,MagCr,avN_DCPS[Mode],sRe,sIm,sCrMag) @ DEc;
Error=SymInterleave3_DVB_simd2(sRe,sIm,sCrMag,
DE_vinterleave_symbol_addr_DVB_T_N,
DE_vinterleave_symbol_addr_DVB_T_OFFSET,
TPS_INFO,Direction,interleaved_Re,interleaved_Im,Range) @ DEc;
}
ChannelEstimate_DVB_simd(interleaved_Re,interleaved_Im,Range,TPS_INFO,CrRe2,CrIm2) @ DEd;
Demod_DVB_simd(interleaved_Re,interleaved_Im,TPS_INFO,Range,demod_softBits) @ DEd;
BitDeInterleave_DVB_simd(demod_softBits,TPS_INFO,deint_softBits) @ DEd;
uint_t err=HardDecoder_DVB_simd(deint_softBits,uvMaxCnt,hardbits) @ DEd;
Bytecpy(&output[p],hardbits,uMaxCnt/8) @ ARM;
p += uMaxCnt/8;
ISymNum = (ISymNum+1) % 4;
}
ADC_Fini(&adc);
33
Parallel Speedup
Efficient: same performance as hand-written code
Near Linear Speedup: very efficient use of parallel hardware
[Figure: bar chart of speedup against number of processors (1 to 4), on a 0%-400% axis, rising nearly linearly toward 400% at 4 processors.]
34
What SoC-C Provides
SoC-C language features:
Pipeline to support parallelism
Coherence to support distributed memory
RPC to support multiple processors/ISAs
Non-features:
Does not choose the boundary between pipeline stages
Does not resolve coherence problems
Does not allocate processors
SoC-C is a concise notation to express mapping decisions (not a tool for making them on your behalf)
35
Related Work
Language:
OpenMP: SMP data parallelism using ‘C plus annotations’
StreamIt: pipeline parallelism using a dataflow language
Pipeline parallelism:
J.E. Smith, “Decoupled access/execute computer architectures,” Trans. Computer Systems, 2(4), 1984
Multiple independent reinventions
Hardware:
Woh et al., “From SODA to Scotch: The Evolution of a Wireless Baseband Processor,” Proc. MICRO-41, Nov. 2008
36
More Recent Related Work
Mapping applications onto Embedded SoCs:
Exposing Non-Standard Architectures to Embedded Software using Compile-Time Virtualization, CASES 2009
Pipeline parallelism:
The Paralax Infrastructure: Automatic Parallelization with a Helping Hand, PACT 2010
37
The SoC-C Model
Program as if using an SMP system:
Single multithreaded processor: RPCs provide a “migrating thread” model
Single memory: compiler-managed coherence handles the “bookkeeping”
Annotations change execution, not semantics
Avoid the need to restructure code:
Pipeline parallelism
Compiler-managed coherence
Efficiency:
Avoid abstracting expensive operations, so the programmer can optimize and reason about them
38
Kernel Programming
39
Overview
Example: FIR filter
Hand-vectorized code: optimal performance
Issues
An Alternative Approach
y_i = Σ_{j=0}^{T-1} h_j · x_{i+j}
40
Example Vectorized Code
Very fast, efficient code
Uses 32-wide SIMD: each SIMD multiply performs 32 (useful) multiplies
VLIW compiler overlaps operations: 3 vector operations per cycle
VLIW compiler performs software pipelining: multiplier active on every cycle
void FIR(vint16_t x[], vint16_t y[], int16_t h[]) {
  vint16_t v = x[0];
  for (int i = 0; i < N/SIMD_WIDTH; ++i) {
    vint16_t w = x[i+1];                 // pre-load the next input vector
    vint32L_t acc = vqdmull(v, h[0]);    // acc = v * h[0]
    int16_t s = vget_lane(w, 0);
    v = vdown(v, s);                     // slide v down one lane, inserting s
    for (int j = 1; j < T-1; ++j) {
      acc = vqdmlal(acc, v, h[j]);       // acc += v * h[j]
      s = vget_lane(w, j);
      v = vdown(v, s);
    }
    y[i] = vqrdmlah(acc, v, h[T-1]);     // final tap, round and write back
    v = w;
  }
}
41
Portability Issues
Vendor specific SIMD operations
vqdmull, vdown, vget_lane
SIMD-width specific
Assumes SIMD_WIDTH >= T
Doesn’t work, or performs badly, on:
Many SIMD architectures
GPGPU
SMP
42
Flexibility issues
Improve arithmetic intensity: merge with an adjacent kernel
E.g., if filtering input to an FFT, combine with bit reversal
Parallelize a task across two Ardbeg engines: requires modification to system-level code
43
Summary
Programming directly to the processor:
Produces very high performance code
Kernel is not portable to other processor types
Kernels cannot be remapped to other devices
Kernels cannot be split/merged to improve scheduling or reduce inter-kernel overheads
Often produces a local optimum but misses the global optimum
44
(Towards) Performance-Portable Kernel Programming
45
Outline
The goal
Quick and dirty demonstration
References to (more complete) versions
What still needs to be done
46
An alternative approach
A compiler generates the kernel directly from the specification y_i = Σ_{j=0}^{T-1} h_j · x_{i+j}
47
A simple data parallel language
loop(N) {
V1 = load(a);
V2 = load(b);
V3 = add(V1,V2);
store(c,V3);
}
* Currently implemented as a Haskell EDSL – adapted to C-like notation for presentation.
[Figure: V1 holds a0 a1 a2 a3 …, V2 holds b0 b1 b2 b3 …; V3 is the elementwise sum a0+b0 a1+b1 a2+b2 …, which is stored to c. The loop covers all N elements.]
48
Compiling Vector Expressions
Vector Expression   | Init   | Generate         | Next
V1 = load(a);       | p1=a;  | V1=vld1(p1);     | p1+=32;
V2 = load(b);       | p2=b;  | V2=vld1(p2);     | p2+=32;
V3 = add(V1,V2);    |        | V3=vadd(V1,V2);  |
store(c,V3);        | p3=c;  | vst1(p3,V3);     | p3+=32;

p1=a; p2=b; p3=c;
for (i=0; i<N; i+=32) {
  V1=vld1(p1);
  V2=vld1(p2);
  V3=vadd(V1,V2);
  vst1(p3,V3);
  p1+=32; p2+=32; p3+=32;
}
50
Generating datapath
[Figure: datapath generated from the loop: address counters (+1) index MemA and MemB, their outputs feed an adder, and the result is written through a third address counter (+1) into MemC.]
* Warning: this circuit does not adhere to any ARM quality standards.
51
Adding control
[Figure: the same datapath with enable (en) signals on the counters and registers, plus a loop counter (-1) compared against zero (!=0) to produce an nDone signal that gates execution.]
* Warning: this circuit does not adhere to any ARM quality standards.
52
Fixing timing
[Figure: the same circuit with the enable signals retimed so each pipeline stage is enabled in the correct cycle.]
53
Related Work
NESL: nested data parallelism (CMU); targets Cray vector machines, Connection Machine
DpH: generalization of NESL as a Haskell library (SLPJ++); targets GPGPU
Accelerator: data parallel library in C#/F# (MSR); targets SMP, DirectX9, FPGA
Array Building Blocks: C++ template library (Intel); targets SMP, SSE
Thrust: C++ template library (NVidia); targets GPGPU
(Also: parallel skeletons, map-reduce, etc.)
54
Summary of approach
(Only) use highly structured bulk operations:
Bulk operations reason about vectors, not individual elements
Simple mathematical properties, easy to optimize
Single frontend, multiple backends: SIMD, SMP, GPGPU, FPGA, ...
(Scope for significant platform-dependent optimization)
55
Breaking down boundaries
Hard boundary between system and kernel layers:
Separate languages
Separate tools
Separate people writing/optimizing
Need to soften the boundary:
Allow kernels to be split across processors
Allow kernels to be merged across processors
Allow kernels A and B to agree to use a non-standard memory layout (to run more efficiently)
(This is an open problem)
56
Tackling Performance Portability Problem
High Performance Embedded Systems:
Energy efficient systems are “lumpy”
The hardware is the easy bit
Two level approach:
System programming: stitch kernels together; inter-kernel parallelism, mapping onto processors/memory
Kernel programming: C+builtins is efficient but inflexible and non-portable; a simple DPL in this talk, with references to more substantial efforts; intra-kernel parallelism expressed
The boundary must be softened
57
Fin
58
Language Design Meta Issues
Compiler only uses simple analyses:
Easier to maintain consistency between different compiler versions/implementations
Programmer makes the high-level decisions:
Code and data placement
Inserting SYNC
Load balancing
Implementation by many source-source transforms:
Programmer can mix high- and low-level features
90-10 rule: use high-level features when you can, low-level features when you need to
59
Compiling SoC-C
1. Data Placement
   a) Infer data placement
   b) Propagate coherence
   c) Split variables with multiple placement
2. Pipeline Parallelism
   a) Identify maximal threads
   b) Split into multiple threads
   c) Apply zero copy optimization
3. RPC (see paper for details)
60
Step 1a: Infer Data Placement
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x) @ P0;
SYNC(y) @ DMA;
FIFO(y);
bar(z,y) @ P1;
baz(z) @ P1;
put(z);
}
}
Solve Set of Constraints
61
Step 1a: Infer Data Placement
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x) @ P0;
SYNC(y) @ DMA;
FIFO(y);
bar(z,y) @ P1;
baz(z) @ P1;
put(z);
}
}
Solve Set of Constraints
Memory Topology constrains where variables could live
62
Solve Set of Constraints
Memory Topology constrains where variables could live
Step 1a: Infer Data Placement
int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};
PIPELINE {
while (1) {
get(x@?);
foo(y@M0, x@M0) @ P0;
SYNC(y,?,?) @ DMA;
FIFO(y@?);
bar(z@M1, y@M1) @ P1;
baz(z@M1) @ P1;
put(z@?);
}
}
63
Solve Set of Constraints
Memory Topology constrains where variables could live
Forwards Dataflow propagates availability of valid versions
Step 1b: Propagate Coherence
int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};
PIPELINE {
while (1) {
get(x@?);
foo(y@M0, x@M0) @ P0;
SYNC(y,?,?) @ DMA;
FIFO(y@?);
bar(z@M1, y@M1) @ P1;
baz(z@M1) @ P1;
put(z@?);
}
}
64
Solve Set of Constraints
Memory Topology constrains where variables could live
Forwards Dataflow propagates availability of valid versions
Step 1b: Propagate Coherence
int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};
PIPELINE {
while (1) {
get(x@?);
foo(y@M0, x@M0) @ P0;
SYNC(y,?,M0) @ DMA;
FIFO(y@?);
bar(z@M1, y@M1) @ P1;
baz(z@M1) @ P1;
put(z@M1);
}
}
65
Solve Set of Constraints
Memory Topology constrains where variables could live
Forwards Dataflow propagates availability of valid versions
Backwards Dataflow propagates need for valid versions
Step 1b: Propagate Coherence
int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};
PIPELINE {
while (1) {
get(x@?);
foo(y@M0, x@M0) @ P0;
SYNC(y,?,M0) @ DMA;
FIFO(y@?);
bar(z@M1, y@M1) @ P1;
baz(z@M1) @ P1;
put(z@M1);
}
}
66
Solve Set of Constraints
Memory Topology constrains where variables could live
Forwards Dataflow propagates availability of valid versions
Backwards Dataflow propagates need for valid versions
Step 1b: Propagate Coherence
int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};
PIPELINE {
while (1) {
get(x@M0);
foo(y@M0, x@M0) @ P0;
SYNC(y,M1,M0) @ DMA;
FIFO(y@M1);
bar(z@M1, y@M1) @ P1;
baz(z@M1) @ P1;
put(z@M1);
}
}
67
Step 1c: Split Variables
int x[100] @ {M0};
int y0[100] @ {M0};
int y1[100] @ {M1};
int z[100] @ {M1};
PIPELINE {
  while (1) {
    get(x);
    foo(y0, x) @ P0;
    memcpy(y1,y0,…) @ DMA;
    FIFO(y1);
    bar(z, y1) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
Split variables with multiple locations
Replace SYNC with memcpy
68
Step 2: Implement Pipeline Annotation
int x[100] @ {M0};
int y0[100] @ {M0};
int y1[100] @ {M1};
int z[100] @ {M1};
PIPELINE {
  while (1) {
    get(x);
    foo(y0, x) @ P0;
    memcpy(y1,y0,…) @ DMA;
    FIFO(y1);
    bar(z, y1) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
Dependency Analysis
69 69
Step 2a: Identify Dependent Operations
int x[100] @ {M0};
int y0[100] @ {M0};
int y1[100] @ {M1};
int z[100] @ {M1};
PIPELINE {
  while (1) {
    get(x);
    foo(y0, x) @ P0;
    memcpy(y1,y0,…) @ DMA;
    FIFO(y1);
    bar(z, y1) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
Dependency Analysis
Split use-def chains at FIFOs
70 70
Step 2b: Identify Maximal Threads
int x[100] @ {M0};
int y0[100] @ {M0};
int y1[100] @ {M1};
int z[100] @ {M1};
PIPELINE {
  while (1) {
    get(x);
    foo(y0, x) @ P0;
    memcpy(y1,y0,…) @ DMA;
    FIFO(y1);
    bar(z, y1) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
Dependency Analysis
Split use-def chains at FIFOs
Identify Thread Operations
71 71
Step 2b: Split Into Multiple Threads
int x[100] @ {M0};
int y0[100] @ {M0};
int y1a[100] @ {M1};
int y1b[100] @ {M1};
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      foo(y0, x) @ P0;
      memcpy(y1a,y0,…) @ DMA;
      fifo_put(&f, y1a);
    }
  }
  SECTION {
    while (1) {
      fifo_get(&f, y1b);
      bar(z, y1b) @ P1;
      baz(z) @ P1;
      put(z);
    }
  }
}
Perform Dataflow Analysis
Split use-def chains at FIFOs
Identify Thread Operations
Split into threads
72 72
Step 2c: Zero Copy Optimization
int x[100] @ {M0};
int y0[100] @ {M0};
int y1a[100] @ {M1};
int y1b[100] @ {M1};
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      foo(y0, x) @ P0;
      memcpy(y1a,y0,…) @ DMA;
      fifo_put(&f, y1a);
    }
  }
  SECTION {
    while (1) {
      fifo_get(&f, y1b);
      bar(z, y1b) @ P1;
      baz(z) @ P1;
      put(z);
    }
  }
}
Generate data, copy into FIFO
Copy out of FIFO, consume data
73 73
Step 2c: Zero Copy Optimization
int x[100] @ {M0};
int y0[100] @ {M0};
int y1a[100] @ {M1};
int y1b[100] @ {M1};
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      foo(y0, x) @ P0;
      memcpy(y1a,y0,…) @ DMA;
      fifo_put(&f, y1a);
    }
  }
  SECTION {
    while (1) {
      fifo_get(&f, y1b);
      bar(z, y1b) @ P1;
      baz(z) @ P1;
      put(z);
    }
  }
}
Calculate live range of variables passed through FIFOs
Live range of y1a
Live range of y1b
74 74
Step 2c: Zero Copy Optimization
int x[100] @ {M0};
int y0[100] @ {M0};
int *py1a;
int *py1b;
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      foo(y0, x) @ P0;
      fifo_acquireRoom(&f, &py1a);
      memcpy(py1a,y0,…) @ DMA;
      fifo_releaseData(&f, py1a);
    }
  }
  SECTION {
    while (1) {
      fifo_acquireData(&f, &py1b);
      bar(z, py1b) @ P1;
      fifo_releaseRoom(&f, py1b);
      baz(z) @ P1;
      put(z);
    }
  }
}
Calculate live range of variables passed through FIFOs
Transform FIFO operations to pass pointers instead of copying data:
Acquire empty buffer; generate data directly into buffer; pass full buffer to thread 2
Acquire full buffer from thread 1; consume data directly from buffer; release empty buffer
75
Step 3a: Resolve Overloaded RPC
int x[100] @ {M0};
int y0[100] @ {M0};
int *py1a;
int *py1b;
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      DE32_foo(0, y0, x);
      fifo_acquireRoom(&f, &py1a);
      DMA_memcpy(py1a,y0,…);
      fifo_releaseData(&f, py1a);
    }
  }
  SECTION {
    while (1) {
      fifo_acquireData(&f, &py1b);
      DE32_bar(1, z, py1b);
      fifo_releaseRoom(&f, py1b);
      DE32_baz(1, z);
      put(z);
    }
  }
}
Replace each RPC by an architecture-specific call:
bar(…) @ P1 → DE32_bar(1,…)
76
Step 3b: Split RPCs
int x[100] @ {M0};
int y0[100] @ {M0};
int *py1a;
int *py1b;
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      start_DE32_foo(0, y0, x);
      wait(semaphore_DE32[0]);
      fifo_acquireRoom(&f, &py1a);
      start_DMA_memcpy(py1a,y0,…);
      wait(semaphore_DMA);
      fifo_releaseData(&f, py1a);
    }
  }
  SECTION {
    while (1) {
      fifo_acquireData(&f, &py1b);
      start_DE32_bar(1, z, py1b);
      wait(semaphore_DE32[1]);
      fifo_releaseRoom(&f, py1b);
      start_DE32_baz(1, z);
      wait(semaphore_DE32[1]);
      put(z);
    }
  }
}
RPCs have two phases:
start the RPC
wait for the RPC to complete
DE32_foo(0,…) → start_DE32_foo(0,…); wait(semaphore_DE32[0]);
77
Order of transformations
Dataflow-sensitive transformations go first:
Inferring data placement
Coherence checking within threads
Dependency analysis for parallelism
Parallelism transformations next: they obscure data and control flow
Thread-local optimizations go last:
Zero-copy optimization of FIFO operations
Continuation-passing thread implementation
78
Aside: Why hardware companies are fun
You get to play with cool hardware, often before it has been debugged
You get to play with powerful debugging tools: an incredible level of detail is visible
E.g., Palladium traces on the next slides
79
Unoptimized task scheduling
[Figure: Palladium trace of the fft and demod tasks on DE0 and DE1; marked intervals of 195 cycles and 273 cycles.]
80
Optimized device driver on ARM
[Figure: the same trace with intervals reduced to 155 cycles and 257 cycles.]
81
Task scheduling hardware support
[Figure: the same trace with intervals of 1 cycle, 303 cycles and 183 cycles.]