by Dake Liu: [email protected] © Copyright of Linköping University, all rights reserved ®
10/3/2017 Unit 9 of TSEA26-2017 H1 1
Design of Embedded DSP
Processors
Unit 9: ASIP and
Accelerators
Contents
• How to accelerate an ASIP
• Instruction fusion / magic instructions
• Data-level parallelism → SIMD
• Task-level / high-level parallelism → GPU
How do we accelerate: instruction fusion
& magic instructions
Instruction fusion
& magic instructions
• Instruction fusion
– Merge several instructions into one instruction using the parallel / pipeline-parallel ASIP microarchitecture
• E.g. for i = 0 to 15: S += |A(i) − B(i)| (a sum of absolute differences)
• Magic instruction
– A black-box function in the ASIP that can be configured and called by a magic instruction
• E.g. 1/x, Taylor series, de-blocking
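As a sketch of what fusion buys, the sum-of-absolute-differences loop can be modeled in C. The function names are ours; the body of the `fused` loop is what the ASIP would execute as one instruction per iteration instead of three or four:

```c
#include <stdlib.h>

/* Scalar view: each iteration costs separate subtract, abs and
 * accumulate instructions on a plain RISC machine. */
int sad16_scalar(const int *a, const int *b) {
    int s = 0;
    for (int i = 0; i < 16; i++)
        s += abs(a[i] - b[i]);
    return s;
}

/* Fused view: the whole loop body collapses into one SAD
 * instruction per iteration; this C function models its semantics. */
int sad16_fused(const int *a, const int *b) {
    int s = 0;
    for (int i = 0; i < 16; i++) {
        int d = a[i] - b[i];      /* subtract                     */
        s += (d < 0) ? -d : d;    /* abs + accumulate, same cycle */
    }
    return s;
}
```

Both functions compute the same result; only the cycle cost differs on the target.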
90-10% code locality rule
• If we analyze an application we may find
– that 10% of the instructions take 90% of the run time, and 90% of the instructions take 10% of the run time.
1. ASIP design accelerates the 10% most frequently executed code (the innermost loops).
2. The ASIP still needs basic instructions covering the other 90% of the code (function coverage, i.e. flexibility).
Flexibility first, select a template
A typical ASIP DSP processor assembly instruction set
Move instructions (RISC):
– Load immediate data
– Load or store between memory and registers
– Move between registers

Arithmetic instructions:
– General arithmetic, logic, shift / rotate instructions (RISC)
– Multiplications (CISC)
– MAC and convolution (CISC)
– Division and other vector and iterative instructions (CISC)
– Long arithmetic operations
– Bit and bit-field manipulations

Control instructions:
– Branch and call
– Other program-flow control instructions
– Repeat instruction (CISC)
– Reserved for acceleration extensions
Select an instruction-set template
• An instruction-set template should bring a rich user ecosystem and broad user experience, including:
1. OS: Linux, Android, GLIBC, other libs
2. API: to ease and support programming
3. Applications: to serve as reference designs
• Avoid introducing a brand-new instruction set
How do we accelerate

Design flow:
1. ASIP requirement specification
2. Early manual partition according to application profiling: each function is implemented either as a subroutine (in SW) or as an instruction (design for HW acceleration)
3. Instruction set specification
4. Assembly instruction set simulator
5. Benchmarking of the instruction set
6. Application SW implementation
7. Processor architecture specification
8. Microarchitecture design
9. Processor HW implementation
10. ASIP integration, final function verification, and performance validation
What shall we accelerate
1. Computing: minimize cycle cost by instruction fusion and magic instructions
2. Data access: hide data-access cost behind computing (pipeline and accelerate)
3. Control: minimize control overheads by hiding them or using extra control HW (ch. 14)
4. SoC/NoC: reduce SoC / NoC cost before chip integration (core and NoC co-design)
Instruction fusion example
• The first typical example: convolution
1. Computing, instruction fusion: merge MUL and ACC into a MAC (cc = 1)
2. Data access, modulo addressing: update the DM pointer with automatic wrap between the DM FIFO top and bottom; automatic CM pointer increment
3. Control, hardware loop: no extra jump-taken cost
• Not accelerated: R = SAT(ROUND(ACR))
Single sample FIR code
01 // example 15.1
02 ACR = 0;
03 LCR = m; // LCR is loop counter register;
04 CAR = coefficient_starting_address;
05 DAR = data_starting_address; // for data memory DM;
06 TAR = top_address; //of FIFO in DM;
07 BAR = bottom_address; // of FIFO in DM;
08 DM(DAR) = input_new_data; // from in-buffer or in-port; 7 cycles
09 OPA <= DM(DAR); //1 cycle
10 OPB <= TM(CAR); //1 cycle
11 BFR <= OPA * OPB; //1 cycle
12 ACR <= ACR + BFR; //1 cycle
13 if DAR == BAR then DAR <= TAR //4 cycles
14 else DEC (DAR); //1 cycle
15 INC (CAR); //1 cycle
16 DEC (LCR); //1 cycle
17 if LCR != 0 then jump to 09 //4 cycles
18 else Y <= Truncate (round (ACR)); //1 cycle
19 end
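The accelerated kernel above can be modeled in C as follows. The function name and the 16-tap size are illustrative; in the ASIP the MAC is one fused instruction, the pointer wrap is free (done by the address generator), and the loop has no jump-taken cost. Saturation and rounding of the accumulator are not modeled:

```c
#define M 16  /* number of taps (illustrative) */

/* One output sample of an M-tap FIR with a circular data FIFO.
 * coeff: tap memory (TM); fifo: data memory (DM) ring buffer;
 * dar: write pointer into the ring. */
int fir_sample(const int coeff[M], int fifo[M], int *dar, int new_sample) {
    fifo[*dar] = new_sample;               /* DM(DAR) = input          */
    long acr = 0;                          /* ACR = 0                  */
    int d = *dar;
    for (int i = 0; i < M; i++) {          /* hardware loop            */
        acr += (long)coeff[i] * fifo[d];   /* fused MAC, cc = 1        */
        d = (d == 0) ? M - 1 : d - 1;      /* modulo addressing (free) */
    }
    *dar = (*dar + 1) % M;                 /* advance write pointer    */
    return (int)acr;                       /* SAT/ROUND not modeled    */
}
```

Feeding an impulse followed by zeros returns the coefficients one by one, which is a quick sanity check of the ring-buffer indexing.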
Acceleration examples
• Repeat an innermost loop of M instructions N times:
– a hardwired (HWed) innermost loop

// The PC hardware
If PC != Loop_Stop then PC <= PC + 1
Else
{
  N <= N - 1;
  If N == 0 then goto Loop_Stop + 1
  Else PC <= Loop_Start;
}

// The assembly code
Repeat Loop_Stop, N
Loop_Start:
  loop instruction 1
  loop instruction 2
Loop_Stop:
  loop instruction 3
endrepeat
Loop_Stop + 1:

Here CONV is a simple repeat instruction.
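The PC-update rule above can be written as a small C model (names are illustrative). Each call returns the next PC; the back-edge to `Loop_Start` costs zero extra instructions:

```c
/* State of one hardware loop. */
typedef struct {
    int loop_start;   /* first instruction of the loop body */
    int loop_stop;    /* last instruction of the loop body  */
    int n;            /* remaining iteration count          */
} hwloop_t;

/* Next-PC logic of the hardware loop controller. */
int next_pc(int pc, hwloop_t *lp) {
    if (pc != lp->loop_stop)
        return pc + 1;              /* normal sequential fetch     */
    if (--lp->n == 0)
        return lp->loop_stop + 1;   /* loop done: fall through     */
    return lp->loop_start;          /* repeat: free backward jump  */
}
```

Stepping the model through a 3-instruction body repeated 3 times executes exactly 9 body instructions with no jump overhead.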
Datapath / data access accelerations
(Figure: convolution hardware. A modulo address generator compares DAR with BAR, sets an EQ flag, and reloads DAR from TAR on a match (muxes M1–M4, "+1" update, load-data-to-registers path); the datapath multiplies DM and TM operands and accumulates into ACR.)
Benchmark: single-sample FIR
• C code: 16-tap FIR (one data sample)
• Assembly code
• Comparing the cost: extra jump cost, modulo addressing, …

Processor         | Algorithm  | Total cycle cost | Kernel cycle cost
Junior            | 16-tap FIR | 265              | 256
With acceleration | 16-tap FIR | 30               | 17
Parallel and pipelined execution
example: cryptography ASIP, Y. Huo
(Figure: pipeline stages IF, ID, EXE1–EXE4, WB, with address, control, and data lines; 128-way data access and 128-way permutation (Perm1); 128-way Galois computing and LUT run in pipelined parallel.)
Different instructions / algorithms run with different pipeline depth
Time stationary control: static 9-pipeline management
Design of magic instructions
• A chain of instructions/operations integrated into a black box controlled by one instruction
• The black box is frequently used, and its silicon cost is not too high
• The C function of the box can be configured
• The C function shall be fixed during HW design and called at run time
Examples of Magic instructions
• General hardware functions
– 1/x, Taylor series, other function solvers
• LUT functions (X = address, Y = memory content)
– Y = M(X), when the precision requirement is low
• Graphics functions: rasterization, pixel rendering
• Radio baseband: complex matrix inversion
• Video codec: DCT, YUV ↔ RGB, de-blocking
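A minimal sketch of an LUT-based magic instruction for 1/x, assuming a Q15 fixed-point format and a 64-entry table (both illustrative, as are the names). Configuration fills the table once; the "instruction" itself is a single table read, which is exactly the low-precision Y = M(X) case above:

```c
#include <stdint.h>

#define LUT_BITS 6
static int32_t recip_lut[1 << LUT_BITS];

/* "Configure the black box": tabulate 1/x in Q15 for x in [1.0, 2.0),
 * one entry per interval midpoint. */
void recip_lut_init(void) {
    for (int i = 0; i < (1 << LUT_BITS); i++) {
        double x = 1.0 + (i + 0.5) / (1 << LUT_BITS);
        recip_lut[i] = (int32_t)(32768.0 / x + 0.5);   /* Q15 */
    }
}

/* The "magic instruction": one table read.
 * x_q15 is Q15 with 1.0 <= x < 2.0, i.e. 32768 <= x_q15 < 65536. */
int32_t recip_q15(int32_t x_q15) {
    int idx = (x_q15 - 32768) >> (15 - LUT_BITS);
    return recip_lut[idx];
}
```

With 64 entries the result is only a few-per-mille accurate, which is the intended trade-off: one cycle instead of an iterative divide.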
Implement a magic black box
(a) Behavior model: Function 1 (band-pass filter) → Function 2 (transform filter) → Function 3
(b) HW implementation: input buffer → Function 1 (band-pass filter) → Function 2 (transform filter) → Function 3 → output buffer, controlled by configuration registers, an FSM, and memories
Black box example: R2BF (figure)
FFT/DFT example: 8×R2, 4×R3, 4×R4, 2×R5, 2×R8, 1×R16, S. Liu
Data parallel architecture
- SIMD acceleration
Data-Level Parallelism
SIMD instruction set architecture
• Multiple data lanes run under the control of a single instruction:
For i = 0 to N−1 // N = width of the SIMD
  same operation // innermost loop
Endfor
• Loop transformation exposes more SIMD opportunities
• The most efficient architecture when data parallelism is available
Loop transform opportunities
Original loop (4N cycles on a scalar machine):
For j = 0 to N−1
  D1 = A + B
  D2 = C * D1
  D3 = D2 >> 2
  D4 = D3 + C
End for

Through loop transformation (fission) it becomes four vector loops, each a single SIMD instruction (4 cycles on an N-wide SIMD):
For i = 0 to N−1: D1(i) = A(i) + B(i)
For i = 0 to N−1: D2(i) = C(i) * D1(i)
For i = 0 to N−1: D3(i) = D2(i) >> 2
For i = 0 to N−1: D4(i) = D3(i) + C(i)
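In C, the transform above is classic loop fission. Each fissioned loop maps to one N-wide SIMD instruction; the names and the width N = 8 are illustrative:

```c
#define N 8  /* SIMD width (illustrative) */

/* After fission: four loops, each a single vectorizable operation
 * over all N elements (a SIMD add, mul, shift, add respectively). */
void kernel_fissioned(const int *A, const int *B, const int *C,
                      int *D1, int *D2, int *D3, int *D4) {
    for (int i = 0; i < N; i++) D1[i] = A[i] + B[i];   /* SIMD add   */
    for (int i = 0; i < N; i++) D2[i] = C[i] * D1[i];  /* SIMD mul   */
    for (int i = 0; i < N; i++) D3[i] = D2[i] >> 2;    /* SIMD shift */
    for (int i = 0; i < N; i++) D4[i] = D3[i] + C[i];  /* SIMD add   */
}
```

The fused original carries a serial dependence chain D1 → D2 → D3 → D4 inside every iteration; fission moves that chain between loops, leaving each loop free of cross-lane dependences.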
SIMD instruction set architecture
(Figure: the program memory carries only one instruction at a time; a single I-decoding unit drives all lanes, each lane having its own address and execution unit.)
General purpose SIMD accelerator
• Most accelerations need a SIMD instruction subset (SSE, NEON, AltiVec, etc.)
– Master: x86, ARM, Power, etc.
– Slave SIMD sharing one PC FSM with the master
• Two popular acceleration architectures
– SIMD: acceleration for data-parallel algorithms
– GPU: many parallel task-level algorithms
Custom SIMD accelerator
• Applications
– CNN (2D FIR), baseband (complex data matrices), ISP, video CODEC, V-post (unsigned short filters and matrices)
– Special applications, such as radar (FFT, BP, filters)
• Architectures
1. Instruction subset: (cache) architecture license + tools
2. Master processor + SIMD processor forming an HSA computing platform (challenges: interworking, SPM)
SIMD subset acceleration
• SIMD instruction subset
– SIMD and CPU run sequentially, in order, using the cache
– Reuse the floating-point registers / FRF for SIMD, because the two are not used at the same time
• Load an instruction and dispatch it to the SIMD unit
– The SIMD unit is controlled by the master's PC FSM
– Compilability is the main constraint on acceleration
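As a concrete instance of such a subset, a small C sketch using SSE2 intrinsics (assuming an x86-64 host, where SSE2 is the baseline; the function name is ours). The master CPU's ordinary fetch/decode dispatches one instruction that operates on four 32-bit lanes:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Add four 32-bit integers lane-wise with one SIMD instruction. */
void add4(const int *a, const int *b, int *r) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* load 4 lanes */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vr = _mm_add_epi32(va, vb);                /* one SIMD add */
    _mm_storeu_si128((__m128i *)r, vr);                /* store 4 lanes */
}
```

The intrinsic style is exactly the "intrinsic programming" escape hatch discussed later: the programmer names the SIMD instruction directly instead of relying on auto-vectorization.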
SIMD slave processor
• A SIMD processor running in parallel with the master
– A two-instruction-domain architecture: data-dependent control, i.e. synchronization & cache-coherence control
• Programming on a heterogeneous architecture
– Programming models: OpenMP, OpenCL, CUDA
– More acceleration opportunities; compiler (DSL)
– Intrinsic coding libraries: CUDA, MKL, IPP
Three SIMD architectures
SIMD/vector Reduce 2D array
Three SIMD challenges: 1. alignment
(Figure: permutation hardware (network and address generators) sits between the vector register file / vector datapath and a vector memory of eight 16-bit-wide data memory blocks (1–8), which connect to the off-chip memory block.)
Three SIMD challenges: 1. alignment
(Figure: a 4×4 matrix A(00)…A(33) mapped onto memory blocks MB0–MB3 over time slots 1–4. (a) row-wise conflict-free mapping (CFM); (b) skewed storage giving both row and column CFM.)
Michael Gössel, Memory Architecture and Parallel Access, Elsevier, 1994
Andreas Karlsson, PhD dissertation; Joar Sohl, PhD dissertation
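The row-and-column CFM of panel (b) can be sketched with a skewed block-assignment function: store element A(r, c) in memory block (r + c) mod 4. This is the standard skewing scheme for conflict-free access; the names here are illustrative:

```c
#define NB 4  /* number of memory blocks */

/* Skewed mapping: element (r, c) lives in block (r + c) mod NB. */
int block_of(int r, int c) { return (r + c) % NB; }

/* Returns 1 if every row AND every column of an NB x NB matrix
 * touches all NB blocks exactly once, i.e. access is conflict-free. */
int conflict_free(void) {
    for (int r = 0; r < NB; r++) {
        int row_mask = 0, col_mask = 0;
        for (int c = 0; c < NB; c++) {
            row_mask |= 1 << block_of(r, c);   /* blocks hit by row r    */
            col_mask |= 1 << block_of(c, r);   /* blocks hit by column r */
        }
        if (row_mask != (1 << NB) - 1 || col_mask != (1 << NB) - 1)
            return 0;
    }
    return 1;
}
```

With the plain row-wise mapping of panel (a) (block = c), a column access would hit the same block four times; the skew spreads it across all four blocks.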
Three SIMD challenges: 2. branch
IBM, 1984
Three SIMD challenges: 3. compiling
• Traditional SIMD compiling
– Data-dependency analysis, data alignment, vectorization, unrolling and regrouping… a long way to go
• Intrinsic programming
– FW kernels are designed by the SIMD HW designers
– Set up a library, called through a programming flow
– A compiler (DSL → ASM) can hide the HW complexity, simplifying the compiler to a translator (rule-constrained programming)
Example: GPU video SIMD instruction subset
VABSDIFF2(4) Vector video 2x16-bit (4x8-bit) absolute difference
VADD2(4) Vector video 2x16-bit (4x8-bit) addition
VAVRG2(4) Vector video 2x16-bit (4x8-bit) average
VMAX2(4) Vector video 2x16-bit (4x8-bit) maximum
VMIN2(4) Vector video 2x16-bit (4x8-bit) minimum
VSET2(4) Vector video 2x16-bit (4x8-bit) set
VSUB2(4) Vector video 2x16-bit (4x8-bit) subtraction
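A C model of the per-byte semantics of VABSDIFF4 from the table above, i.e. the 4×8-bit absolute-difference case (the function name is ours; this models the arithmetic, not any particular GPU's encoding):

```c
#include <stdint.h>

/* Per-byte absolute difference of two packed 4x8-bit words. */
uint32_t vabsdiff4(uint32_t a, uint32_t b) {
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        int ai = (a >> (8 * i)) & 0xFF;   /* extract byte lane i */
        int bi = (b >> (8 * i)) & 0xFF;
        int d  = ai - bi;
        r |= (uint32_t)(d < 0 ? -d : d) << (8 * i);  /* |ai - bi| */
    }
    return r;
}
```

One such instruction replaces four subtract/abs pairs, which is why these packed video operations matter for motion-estimation kernels.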
Example: ePUMA SIMD Processors
Task parallel architecture
- GPU acceleration
Data and task parallel
• Data-level parallelism
– Processing independent data in parallel
– The bottom-level vector processing
• Task-level parallelism
– Above the SIMD data-parallel level
– Processing independent tasks in parallel
• Independent tasks: no data or control dependencies
– The code can be partitioned into independent tasks
Task parallel architecture
• Multiple (single-)scalar cores running in parallel
– The scalar cores can run independent tasks
– Branches are handled separately in each scalar core
– Using the fork-join programming/synchronization model
• Usually runs multiple scalar-SIMD cores in parallel
– More flexible acceleration at the architecture level
– Compared to SIMD: longer computing latency
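A minimal fork-join sketch in C with POSIX threads (task count, names, and workload are illustrative; on older toolchains link with -pthread). The master forks independent tasks, each worker follows its own control flow, and join synchronizes before the results are combined:

```c
#include <pthread.h>

#define NTASK 4

static long partial[NTASK];  /* one result slot per task, no sharing */

/* Independent task: sum its own slice of 0..399. */
static void *task(void *arg) {
    long id = (long)arg;
    long s = 0;
    for (long i = id * 100; i < (id + 1) * 100; i++)
        s += i;
    partial[id] = s;
    return NULL;
}

long fork_join_sum(void) {
    pthread_t th[NTASK];
    for (long t = 0; t < NTASK; t++)              /* fork */
        pthread_create(&th[t], NULL, task, (void *)t);
    long total = 0;
    for (long t = 0; t < NTASK; t++) {            /* join */
        pthread_join(th[t], NULL);
        total += partial[t];                      /* combine after sync */
    }
    return total;
}
```

Because each task writes only its own `partial[]` slot and the master reads it only after `pthread_join`, no further synchronization is needed, illustrating the "independent tasks: no data dependencies" point above.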
Challenges of task parallel
The challenge of multi-core SoC design:
1. Programming
– Task partitioning: identify / manage dependencies
– Task balancing: modify and balance task lengths
– Synchronization: stop all tasks, exchange data
2. Memory subsystem
– Can we avoid memory-coherence problems?
– A NoC for both data sharing and control / synchronization
GPU instruction set architecture
• 2 × 16 SIMD functional units per core
• Shared core-local memories
• 16 cores
NVIDIA GPU SoC architecture
• GeForce GTX 580 with 16 cores
• Running 256 scalars in parallel
• It is a rather old architecture
Kepler GK110 block diagram
256 × 7 = 1792 scalars
CUDA (Compute Unified Device Architecture)
toolchain, its CPU-GPU Programming Model
Programming Massively Parallel Processors: A Hands-on Approach, David B. Kirk and Wen-mei W. Hwu, Elsevier
Review the discussion today
• What to accelerate and how to do it
1. Using an available architecture and its ASM instruction set
2. Using custom magic architectures
3. Using a SIMD-based machine to accelerate
4. Using a GPU task-level parallel architecture
• The 90% code must be compilable
• The 10% code (kernels) is written by HW designers
Self reading after the lecture
• The iteration period for HW-SW co-design is too long; easy to say, difficult to do. What can you do?
– Use tools! But the cost of making a tool can be very high
• Read chapter 20
• If you want to make a tool, read dissertation 1347, the PhD thesis on NoGAP by Per Karlström
Exciting time now!
Let us discuss
• Whatever you want to discuss related to HW
• You will have the chance after each lecture (Fö), so do take the chance!
• Prepare your questions for the next time
Dake Liu, Room 556, corridor B, Hus-B, phone 281256, [email protected]
Welcome to ask any questions you want:
• I can answer
• Or we discuss together
• I want to know what you want