by Dake Liu: [email protected] © Copyright of Linköping University, all rights reserved ®
10/3/2017 Unit 9 of TSEA26-2017 H1 1
Design of Embedded DSP
Processors
Unit 9: ASIP and
Accelerators
Contents
• How to accelerate an ASIP
• Instruction fusion / magic instructions
• Data-level parallelism → SIMD
• Task-level / high-level parallelism → GPU
How do we accelerate: instruction fusion
& magic instructions
Instruction fusion
& magic instructions
• Instruction fusion
– Merge several instructions into one instruction using the parallel / pipeline-parallel ASIP microarchitecture
• E.g. for i = 0 to 15: S += |A(i) − B(i)| (a sum of absolute differences)
• Magic instruction
– A black-box function in the ASIP that can be configured and called by a magic instruction
• E.g. 1/x, Taylor series, de-blocking
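As a sketch of what fusion buys, the sum-of-absolute-differences loop can be modeled in C. The function names are ours; the body of the `fused` loop is what the ASIP would execute as one instruction per iteration instead of three or four:

```c
#include <stdlib.h>

/* Scalar view: each iteration costs separate subtract, abs and
 * accumulate instructions on a plain RISC machine. */
int sad16_scalar(const int *a, const int *b) {
    int s = 0;
    for (int i = 0; i < 16; i++)
        s += abs(a[i] - b[i]);
    return s;
}

/* Fused view: the whole loop body collapses into one SAD
 * instruction per iteration; this C function models its semantics. */
int sad16_fused(const int *a, const int *b) {
    int s = 0;
    for (int i = 0; i < 16; i++) {
        int d = a[i] - b[i];      /* subtract                     */
        s += (d < 0) ? -d : d;    /* abs + accumulate, same cycle */
    }
    return s;
}
```

Both functions compute the same result; only the cycle cost differs on the target.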
90-10% code locality rule
• If we analyze an application we may find
– that 10% of the instructions take 90% of the run time, and 90% of the instructions take 10% of the run time.
1. ASIP design accelerates the 10% most frequently executed code (the innermost loops).
2. The ASIP still needs basic instructions covering the other 90% of the code (function coverage, i.e. flexibility).
Flexibility first, select a template
A typical ASIP DSP processor assembly instruction set
Move instructions (RISC):
– Load immediate data
– Load or store between memory and registers
– Move between registers

Arithmetic instructions:
– General arithmetic, logic, shift / rotate instructions (RISC)
– Multiplications (CISC)
– MAC and convolution (CISC)
– Division and other vector and iterative instructions (CISC)
– Long arithmetic operations
– Bit and bit-field manipulations

Control instructions:
– Branch and call
– Other program-flow control instructions
– Repeat instruction (CISC)
– Reserved for acceleration extensions
Select an instruction-set template
• An instruction-set template should bring a rich user ecosystem and broad user experience, including:
1. OS: Linux, Android, GLIBC, other libs
2. API: to ease and support programming
3. Applications: to serve as reference designs
• Avoid introducing a brand-new instruction set
How do we accelerate

Design flow:
1. ASIP requirement specification
2. Early manual partition according to application profiling: each function is implemented either as a subroutine (in SW) or as an instruction (design for HW acceleration)
3. Instruction set specification
4. Assembly instruction set simulator
5. Benchmarking of the instruction set
6. Application SW implementation
7. Processor architecture specification
8. Microarchitecture design
9. Processor HW implementation
10. ASIP integration, final function verification, and performance validation
What shall we accelerate
1. Computing: minimize cycle cost by instruction fusion and magic instructions
2. Data access: hide data-access cost behind computing (pipeline and accelerate)
3. Control: minimize control overheads by hiding them or using extra control HW (ch. 14)
4. SoC/NoC: reduce SoC / NoC cost before chip integration (core and NoC co-design)
Instruction fusion example
• The first typical example: convolution
1. Computing, instruction fusion: merge MUL and ACC into a MAC (cc = 1)
2. Data access, modulo addressing: update the DM pointer with automatic wrap between the DM FIFO top and bottom; automatic CM pointer increment
3. Control, hardware loop: no extra jump-taken cost
• Not accelerated: R = SAT(ROUND(ACR))
Single sample FIR code
01 // example 15.1
02 ACR = 0;
03 LCR = m; // LCR is loop counter register;
04 CAR = coefficient_starting_address;
05 DAR = data_starting_address; // for data memory DM;
06 TAR = top_address; //of FIFO in DM;
07 BAR = bottom_address; // of FIFO in DM;
08 DM(DAR) = input_new_data; // from in-buffer or in-port; 7 cycles
09 OPA <= DM(DAR); //1 cycle
10 OPB <= TM(CAR); //1 cycle
11 BFR <= OPA * OPB; //1 cycle
12 ACR <= ACR + BFR; //1 cycle
13 if DAR == BAR then DAR <= TAR //4 cycles
14 else DEC (DAR); //1 cycle
15 INC (CAR); //1 cycle
16 DEC (LCR); //1 cycle
17 if LCR != 0 then jump to 09 //4 cycles
18 else Y <= Truncate (round (ACR)); //1 cycle
19 end
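The accelerated kernel above can be modeled in C as follows. The function name and the 16-tap size are illustrative; in the ASIP the MAC is one fused instruction, the pointer wrap is free (done by the address generator), and the loop has no jump-taken cost. Saturation and rounding of the accumulator are not modeled:

```c
#define M 16  /* number of taps (illustrative) */

/* One output sample of an M-tap FIR with a circular data FIFO.
 * coeff: tap memory (TM); fifo: data memory (DM) ring buffer;
 * dar: write pointer into the ring. */
int fir_sample(const int coeff[M], int fifo[M], int *dar, int new_sample) {
    fifo[*dar] = new_sample;               /* DM(DAR) = input          */
    long acr = 0;                          /* ACR = 0                  */
    int d = *dar;
    for (int i = 0; i < M; i++) {          /* hardware loop            */
        acr += (long)coeff[i] * fifo[d];   /* fused MAC, cc = 1        */
        d = (d == 0) ? M - 1 : d - 1;      /* modulo addressing (free) */
    }
    *dar = (*dar + 1) % M;                 /* advance write pointer    */
    return (int)acr;                       /* SAT/ROUND not modeled    */
}
```

Feeding an impulse followed by zeros returns the coefficients one by one, which is a quick sanity check of the ring-buffer indexing.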
Acceleration examples
• Repeat an innermost loop of M instructions N times:
– a hardwired (HWed) innermost loop

// The PC hardware
If PC != Loop_Stop then PC <= PC + 1
Else
{
  N <= N - 1;
  If N == 0 then goto Loop_Stop + 1
  Else PC <= Loop_Start;
}

// The assembly code
Repeat Loop_Stop, N
Loop_Start:
  loop instruction 1
  loop instruction 2
Loop_Stop:
  loop instruction 3
endrepeat
Loop_Stop + 1:

Here CONV is a simple repeat instruction.
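The PC-update rule above can be written as a small C model (names are illustrative). Each call returns the next PC; the back-edge to `Loop_Start` costs zero extra instructions:

```c
/* State of one hardware loop. */
typedef struct {
    int loop_start;   /* first instruction of the loop body */
    int loop_stop;    /* last instruction of the loop body  */
    int n;            /* remaining iteration count          */
} hwloop_t;

/* Next-PC logic of the hardware loop controller. */
int next_pc(int pc, hwloop_t *lp) {
    if (pc != lp->loop_stop)
        return pc + 1;              /* normal sequential fetch     */
    if (--lp->n == 0)
        return lp->loop_stop + 1;   /* loop done: fall through     */
    return lp->loop_start;          /* repeat: free backward jump  */
}
```

Stepping the model through a 3-instruction body repeated 3 times executes exactly 9 body instructions with no jump overhead.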
Datapath / data access accelerations
(Figure: convolution hardware. A modulo address generator compares DAR with BAR, sets an EQ flag, and reloads DAR from TAR on a match (muxes M1–M4, "+1" update, load-data-to-registers path); the datapath multiplies DM and TM operands and accumulates into ACR.)
Benchmark: single-sample FIR
• C code: 16-tap FIR (one data sample)
• Assembly code
• Comparing the cost: extra jump cost, modulo addressing, …

Processor         | Algorithm  | Total cycle cost | Kernel cycle cost
Junior            | 16-tap FIR | 265              | 256
With acceleration | 16-tap FIR | 30               | 17
Parallel and pipelined execution
example: cryptography ASIP, Y. Huo
(Figure: pipeline stages IF, ID, EXE1–EXE4, WB, with address, control, and data lines; 128-way data access and 128-way permutation (Perm1); 128-way Galois computing and LUT run in pipelined parallel.)
Different instructions / algorithms run with different pipeline depth
Time stationary control: static 9-pipeline management
Design of magic instructions
• A chain of instructions/operations integrated into a black box controlled by one instruction
• The black box is frequently used, and its silicon cost is not too high
• The C function of the box can be configured
• The C function shall be fixed during HW design and called at run time
Examples of Magic instructions
• General hardware functions
– 1/x, Taylor series, other function solvers
• LUT functions (X = address, Y = memory content)
– Y = M(X), when the precision requirement is low
• Graphics functions: rasterization, pixel rendering
• Radio baseband: complex matrix inversion
• Video codec: DCT, YUV ↔ RGB, de-blocking
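A minimal sketch of an LUT-based magic instruction for 1/x, assuming a Q15 fixed-point format and a 64-entry table (both illustrative, as are the names). Configuration fills the table once; the "instruction" itself is a single table read, which is exactly the low-precision Y = M(X) case above:

```c
#include <stdint.h>

#define LUT_BITS 6
static int32_t recip_lut[1 << LUT_BITS];

/* "Configure the black box": tabulate 1/x in Q15 for x in [1.0, 2.0),
 * one entry per interval midpoint. */
void recip_lut_init(void) {
    for (int i = 0; i < (1 << LUT_BITS); i++) {
        double x = 1.0 + (i + 0.5) / (1 << LUT_BITS);
        recip_lut[i] = (int32_t)(32768.0 / x + 0.5);   /* Q15 */
    }
}

/* The "magic instruction": one table read.
 * x_q15 is Q15 with 1.0 <= x < 2.0, i.e. 32768 <= x_q15 < 65536. */
int32_t recip_q15(int32_t x_q15) {
    int idx = (x_q15 - 32768) >> (15 - LUT_BITS);
    return recip_lut[idx];
}
```

With 64 entries the result is only a few-per-mille accurate, which is the intended trade-off: one cycle instead of an iterative divide.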
Implement a magic black box
(a) Behavior model: Function 1 (band-pass filter) → Function 2 (transform filter) → Function 3
(b) HW implementation: input buffer → Function 1 (band-pass filter) → Function 2 (transform filter) → Function 3 → output buffer, controlled by configuration registers, an FSM, and memories
Black box example: R2BF (figure)
FFT/DFT example: 8×R2, 4×R3, 4×R4, 2×R5, 2×R8, 1×R16, S. Liu
Data parallel architecture
- SIMD acceleration
Data-Level Parallelism
SIMD instruction set architecture
• Multiple data lanes run under the control of a single instruction:
For i = 0 to N−1 // N = width of the SIMD
  same operation // innermost loop
Endfor
• Loop transformation exposes more SIMD opportunities
• The most efficient architecture when data parallelism is available
Loop transform opportunities
Original loop (4N cycles on a scalar machine):
For j = 0 to N−1
  D1 = A + B
  D2 = C * D1
  D3 = D2 >> 2
  D4 = D3 + C
End for

Through loop transformation (fission) it becomes four vector loops, each a single SIMD instruction (4 cycles on an N-wide SIMD):
For i = 0 to N−1: D1(i) = A(i) + B(i)
For i = 0 to N−1: D2(i) = C(i) * D1(i)
For i = 0 to N−1: D3(i) = D2(i) >> 2
For i = 0 to N−1: D4(i) = D3(i) + C(i)
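In C, the transform above is classic loop fission. Each fissioned loop maps to one N-wide SIMD instruction; the names and the width N = 8 are illustrative:

```c
#define N 8  /* SIMD width (illustrative) */

/* After fission: four loops, each a single vectorizable operation
 * over all N elements (a SIMD add, mul, shift, add respectively). */
void kernel_fissioned(const int *A, const int *B, const int *C,
                      int *D1, int *D2, int *D3, int *D4) {
    for (int i = 0; i < N; i++) D1[i] = A[i] + B[i];   /* SIMD add   */
    for (int i = 0; i < N; i++) D2[i] = C[i] * D1[i];  /* SIMD mul   */
    for (int i = 0; i < N; i++) D3[i] = D2[i] >> 2;    /* SIMD shift */
    for (int i = 0; i < N; i++) D4[i] = D3[i] + C[i];  /* SIMD add   */
}
```

The fused original carries a serial dependence chain D1 → D2 → D3 → D4 inside every iteration; fission moves that chain between loops, leaving each loop free of cross-lane dependences.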
SIMD instruction set architecture
(Figure: the program memory carries only one instruction at a time; a single I-decoding unit drives all lanes, each lane having its own address and execution unit.)
General purpose SIMD accelerator
• Most accelerations need a SIMD instruction subset (SSE, NEON, AltiVec, etc.)
– Master: x86, ARM, Power, etc.
– Slave SIMD sharing one PC FSM with the master
• Two popular acceleration architectures
– SIMD: acceleration for data-parallel algorithms
– GPU: many parallel task-level algorithms
Custom SIMD accelerator
• Applications
– CNN (2D FIR), baseband (complex data matrices), ISP, video CODEC, V-post (unsigned short filters and matrices)
– Special applications, such as radar (FFT, BP, filters)
• Architectures
1. Instruction subset: (cache) architecture license + tools
2. Master processor + SIMD processor forming an HSA computing platform (challenges: interworking, SPM)
SIMD subset acceleration
• SIMD instruction subset
– SIMD and CPU run sequentially, in order, using the cache
– Reuse the floating-point registers / FRF for SIMD, because the two are not used at the same time
• Load an instruction and dispatch it to the SIMD unit
– The SIMD unit is controlled by the master's PC FSM
– Compilability is the main constraint on acceleration
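As a concrete instance of such a subset, a small C sketch using SSE2 intrinsics (assuming an x86-64 host, where SSE2 is the baseline; the function name is ours). The master CPU's ordinary fetch/decode dispatches one instruction that operates on four 32-bit lanes:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Add four 32-bit integers lane-wise with one SIMD instruction. */
void add4(const int *a, const int *b, int *r) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* load 4 lanes */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vr = _mm_add_epi32(va, vb);                /* one SIMD add */
    _mm_storeu_si128((__m128i *)r, vr);                /* store 4 lanes */
}
```

The intrinsic style is exactly the "intrinsic programming" escape hatch discussed later: the programmer names the SIMD instruction directly instead of relying on auto-vectorization.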
SIMD slave processor
• A SIMD processor running in parallel with the master
– A two-instruction-domain architecture: data-dependent control, i.e. synchronization & cache-coherence control
• Programming on a heterogeneous architecture
– Programming models: OpenMP, OpenCL, CUDA
– More acceleration opportunities; compiler (DSL)
– Intrinsic coding libraries: CUDA, MKL, IPP
Three SIMD architectures
SIMD/vector Reduce 2D array
Three SIMD challenges: 1. alignment
(Figure: permutation hardware (network and address generators) sits between the vector register file / vector datapath and a vector memory of eight 16-bit-wide data memory blocks (1–8), which connect to the off-chip memory block.)
Three SIMD challenges: 1. alignment
(Figure: a 4×4 matrix A(00)…A(33) mapped onto memory blocks MB0–MB3 over time slots 1–4. (a) row-wise conflict-free mapping (CFM); (b) skewed storage giving both row and column CFM.)
Michael Gössel, Memory Architecture and Parallel Access, Elsevier, 1994
Andreas Karlsson, PhD dissertation; Joar Sohl, PhD dissertation
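The row-and-column CFM of panel (b) can be sketched with a skewed block-assignment function: store element A(r, c) in memory block (r + c) mod 4. This is the standard skewing scheme for conflict-free access; the names here are illustrative:

```c
#define NB 4  /* number of memory blocks */

/* Skewed mapping: element (r, c) lives in block (r + c) mod NB. */
int block_of(int r, int c) { return (r + c) % NB; }

/* Returns 1 if every row AND every column of an NB x NB matrix
 * touches all NB blocks exactly once, i.e. access is conflict-free. */
int conflict_free(void) {
    for (int r = 0; r < NB; r++) {
        int row_mask = 0, col_mask = 0;
        for (int c = 0; c < NB; c++) {
            row_mask |= 1 << block_of(r, c);   /* blocks hit by row r    */
            col_mask |= 1 << block_of(c, r);   /* blocks hit by column r */
        }
        if (row_mask != (1 << NB) - 1 || col_mask != (1 << NB) - 1)
            return 0;
    }
    return 1;
}
```

With the plain row-wise mapping of panel (a) (block = c), a column access would hit the same block four times; the skew spreads it across all four blocks.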
Three SIMD challenges: 2. branch
IBM, 1984
Three SIMD challenges: 3. compiling
• Traditional SIMD compiling
– Data-dependency analysis, data alignment, vectorization, unrolling and regrouping… a long way to go
• Intrinsic programming
– FW kernels are designed by the SIMD HW designers
– Set up a library, called through a programming flow
– A compiler (DSL → ASM) can hide the HW complexity, simplifying the compiler to a translator (rule-constrained programming)
Example: GPU video SIMD instruction subset
VABSDIFF2(4) Vector video 2x16-bit (4x8-bit) absolute difference
VADD2(4) Vector video 2x16-bit (4x8-bit) addition
VAVRG2(4) Vector video 2x16-bit (4x8-bit) average
VMAX2(4) Vector video 2x16-bit (4x8-bit) maximum
VMIN2(4) Vector video 2x16-bit (4x8-bit) minimum
VSET2(4) Vector video 2x16-bit (4x8-bit) set
VSUB2(4) Vector video 2x16-bit (4x8-bit) subtraction
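A C model of the per-byte semantics of VABSDIFF4 from the table above, i.e. the 4×8-bit absolute-difference case (the function name is ours; this models the arithmetic, not any particular GPU's encoding):

```c
#include <stdint.h>

/* Per-byte absolute difference of two packed 4x8-bit words. */
uint32_t vabsdiff4(uint32_t a, uint32_t b) {
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        int ai = (a >> (8 * i)) & 0xFF;   /* extract byte lane i */
        int bi = (b >> (8 * i)) & 0xFF;
        int d  = ai - bi;
        r |= (uint32_t)(d < 0 ? -d : d) << (8 * i);  /* |ai - bi| */
    }
    return r;
}
```

One such instruction replaces four subtract/abs pairs, which is why these packed video operations matter for motion-estimation kernels.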
Example: ePUMA SIMD Processors
Task parallel architecture
- GPU acceleration
Data and task parallel
• Data-level parallelism
– Processing independent data in parallel
– The bottom-level vector processing
• Task-level parallelism
– Above the SIMD data-parallel level
– Processing independent tasks in parallel
• Independent tasks: no data or control dependencies
– The code can be partitioned into independent tasks
Task parallel architecture
• Multiple (single-)scalar cores running in parallel
– The scalar cores can run independent tasks
– Branches are handled separately in each scalar core
– Using the fork-join programming/synchronization model
• Usually runs multiple scalar-SIMD cores in parallel
– More flexible acceleration at the architecture level
– Compared to SIMD: longer computing latency
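A minimal fork-join sketch in C with POSIX threads (task count, names, and workload are illustrative; on older toolchains link with -pthread). The master forks independent tasks, each worker follows its own control flow, and join synchronizes before the results are combined:

```c
#include <pthread.h>

#define NTASK 4

static long partial[NTASK];  /* one result slot per task, no sharing */

/* Independent task: sum its own slice of 0..399. */
static void *task(void *arg) {
    long id = (long)arg;
    long s = 0;
    for (long i = id * 100; i < (id + 1) * 100; i++)
        s += i;
    partial[id] = s;
    return NULL;
}

long fork_join_sum(void) {
    pthread_t th[NTASK];
    for (long t = 0; t < NTASK; t++)              /* fork */
        pthread_create(&th[t], NULL, task, (void *)t);
    long total = 0;
    for (long t = 0; t < NTASK; t++) {            /* join */
        pthread_join(th[t], NULL);
        total += partial[t];                      /* combine after sync */
    }
    return total;
}
```

Because each task writes only its own `partial[]` slot and the master reads it only after `pthread_join`, no further synchronization is needed, illustrating the "independent tasks: no data dependencies" point above.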
Challenges of task parallel
The challenge of multi-core SoC design:
1. Programming
– Task partitioning: identify / manage dependencies
– Task balancing: modify and balance task lengths
– Synchronization: stop all tasks, exchange data
2. Memory subsystem
– Can we avoid memory-coherence problems?
– A NoC for both data sharing and control / synchronization
GPU instruction set architecture
• 2 × 16 SIMD functional units per core
• Shared core-local memories
• 16 cores
NVIDIA GPU SoC architecture
• GeForce GTX 580 with 16 cores
• Running 256 scalars in parallel
• It is a rather old architecture
Kepler GK110 block diagram
256 × 7 = 1792 scalars
CUDA (Compute Unified Device Architecture)
toolchain, its CPU-GPU Programming Model
Programming Massively Parallel Processors: A Hands-on Approach, David B. Kirk and Wen-mei W. Hwu, Elsevier
Review the discussion today
• What to accelerate and how to do it
1. Using an available architecture and its ASM instruction set
2. Using custom magic architectures
3. Using a SIMD-based machine to accelerate
4. Using a GPU task-level parallel architecture
• The 90% code must be compilable
• The 10% code (kernels) is written by HW designers
Self reading after the lecture
• The iteration period for HW-SW co-design is too long; easy to say, difficult to do. What can you do?
– Use tools! But the cost of making a tool can be very high
• Read chapter 20
• If you want to make a tool, read dissertation 1347, the PhD thesis on NoGAP by Per Karlström
Exciting time now!
Let us discuss
• Whatever you want to discuss related to HW
• You will have the chance after each lecture (Fö), so do take the chance!
• Prepare your questions for the next time
Dake Liu, Room 556, corridor B, Hus-B, phone 281256, [email protected]
Welcome to ask any questions you want:
• I can answer
• Or we discuss together
• I want to know what you want