Embedded Processor Architecture


Bart Mesman, Henk Corporaal

5kk73, 2010

Processor Architectures and Program Mapping H. Corporaal and B. Mesman

[Figure: spectrum between flexibility and efficiency: programmable CPU, programmable DSP, application-specific instruction set processor (ASIP), application-specific processor]


[Figure: flexibility (low, medium, high) plotted against efficiency (low, medium, high), positioning ASIC, GP proc, FPGA, DSP and ASIP]


Programmable CPU cores

• introduction
• architecture of the MIPS core
  • discussed as an example
• pipelining
• application examples
• software issues
• comparison between different CPU cores
• towards application specific architectures
• discussion


Introduction

• rationale: general-purpose -> large market
• consequence: often handcrafted design optimised for clock rate
• problem: fast changes in the IC process technology
• examples embedded:
  • MIPS (first one, licensing instruction set architecture)
  • ARM (Advanced RISC Machines; telecom, low power, small code size, most popular one, licensing also the micro-architecture as hard or soft IP)
  • derivatives from general-purpose CPUs: Intel, NEC, Hitachi, National, PowerPC


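Instruction set architectures

[Figure: classification of instruction set architectures by implicit versus explicit operands: stack machines (e.g. ST20), accumulator machines, general-purpose registers; the general-purpose register class splits into register-memory and register-register (= load-store)]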


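Architecture of the MIPS core [Hennessy&Patterson]

[Figure: MIPS datapath. PC (Clk) drives the instruction address of the instruction memory; the instruction supplies Rd, Rs, Rt (5 bits each) and Imm (16 bits); register file of 32 32-bit registers with ports Rw, Ra, Rb (Clk); data memory with data address, data in and data out (Clk); all buses are 32 bits wide]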


MIPS instruction formats (32 bits) [Hennessy&Patterson]

R-type
  bits    31-26    25-21    20-16    15-11    10-6     5-0
  field   op       rs       rt       rd       shamt    funct
  width   6 bits   5 bits   5 bits   5 bits   5 bits   6 bits

I-type
  bits    31-26    25-21    20-16    15-0
  field   op       rs       rt       immediate
  width   6 bits   5 bits   5 bits   16 bits

J-type
  bits    31-26    25-0
  field   op       target address
  width   6 bits   26 bits

  op           operation of the instruction
  rs, rt, rd   source and destination registers
  shamt        shift amount
  funct        operation of the instruction, part 2
  imm          for program constants
  addr         target address of a jump
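The three formats share the same bit boundaries, so one shift-and-mask routine can extract every field. The following is a minimal sketch of that idea; the type `mips_fields` and the function `mips_decode` are hypothetical names introduced only for this illustration.

```c
#include <stdint.h>

/* All fields of a 32-bit MIPS instruction word, named as on the slide.
   Which of them are meaningful depends on the format (R, I or J). */
typedef struct {
    uint32_t op;      /* bits 31..26, all formats */
    uint32_t rs;      /* bits 25..21, R- and I-type */
    uint32_t rt;      /* bits 20..16, R- and I-type */
    uint32_t rd;      /* bits 15..11, R-type */
    uint32_t shamt;   /* bits 10..6,  R-type */
    uint32_t funct;   /* bits 5..0,   R-type */
    uint32_t imm;     /* bits 15..0,  I-type */
    uint32_t target;  /* bits 25..0,  J-type */
} mips_fields;

/* Extract every field by shifting and masking (hypothetical helper). */
mips_fields mips_decode(uint32_t word)
{
    mips_fields f;
    f.op     = (word >> 26) & 0x3Fu;
    f.rs     = (word >> 21) & 0x1Fu;
    f.rt     = (word >> 16) & 0x1Fu;
    f.rd     = (word >> 11) & 0x1Fu;
    f.shamt  = (word >> 6)  & 0x1Fu;
    f.funct  = word & 0x3Fu;
    f.imm    = word & 0xFFFFu;
    f.target = word & 0x03FFFFFFu;
    return f;
}
```

For example, the word 0x00221820 decodes to op = 0, rs = 1, rt = 2, rd = 3, shamt = 0, funct = 0x20.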


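Example 1: R-type: add instruction

Format: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)

[Figure: register file of 32 32-bit registers (ports Rw, Ra, Rb addressed by Rd, Rs, Rt, 5 bits each; write enable RegWr; Clk) driving BusA and BusB (32 bits) into the ALU (control ALUctr); the 32-bit Result returns to the register file on BusW]

add rd, rs, rt
• mem[PC] (fetch the instruction)
• R[rd] = R[rs] + R[rt]
• PC = PC + 4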



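Critical path R-type operation

[Figure: MIPS datapath with PC, instruction memory (instruction address in; Rd, Rs, Rt of 5 bits each and Imm of 16 bits out), register file of 32 32-bit registers (ports Rw, Ra, Rb) and data memory (data address, data in, data out), clocked by Clk and connected by 32-bit buses, showing the critical path of an R-type operation]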



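Example 2: I-type: load word

Format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)

[Figure: load datapath. Register file of 32 32-bit registers (ports Rw, Ra, Rb; Rs on Ra, Rt is a don't care on Rb; RegDst selects Rd or Rt as write register; write enable RegWr; Clk); extender from Imm 16 to 32 bits (ExtOp); ALUSrc selects BusB or the extended immediate; ALU (ALUctr); data memory (WrEn driven by MemWr, Adr, Data In, Clk); MemtoReg selects the value written back on BusW]

lw rs, rt, imm16
• mem[PC] (fetch the instruction)
• addr = R[rs] + ext[imm16]
• R[rt] = mem[addr]
• PC = PC + 4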



Example 3: I-type: branch

beq rs, rt, imm16
• mem[PC] (fetch the instruction)
• cond = R[rs] - R[rt]
• if cond = 0
    PC = PC + 4 + ext(imm16)*4
• else
    PC = PC + 4
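The next-PC rule of beq can be written out directly. This is a sketch under the assumption that ext() means sign extension of the 16-bit immediate; the function names `sign_extend16` and `beq_next_pc` are hypothetical.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of imm16 to 32 bits (assumed meaning of ext()). */
int32_t sign_extend16(uint32_t imm16)
{
    int32_t v = (int32_t)(imm16 & 0xFFFFu);
    if (v & 0x8000)
        v -= 0x10000;           /* bit 15 set: the value is negative */
    return v;
}

/* Next PC after "beq rs, rt, imm16", following the steps on the slide. */
uint32_t beq_next_pc(uint32_t pc, uint32_t rs_val, uint32_t rt_val, uint32_t imm16)
{
    uint32_t cond = rs_val - rt_val;                 /* cond = R[rs] - R[rt] */
    if (cond == 0)
        return pc + 4u + (uint32_t)sign_extend16(imm16) * 4u;  /* taken */
    return pc + 4u;                                  /* not taken */
}
```

The offset is counted in instructions, which is why it is multiplied by 4 before being added to PC + 4; unsigned arithmetic wraps, so a negative offset branches backwards.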



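Example 3: I-type: branch

[Figure: branch datapath. Register file of 32 32-bit registers (ports Rw, Ra, Rb; Rs on Ra, Rt on Rb; RegDst, RegWr, Clk); extender (Imm 16, ExtOp) and ALUSrc in front of the ALU (ALUctr); the ALU Zero output and the Branch signal feed the next address logic together with Imm 16; the next address logic updates PC (Clk), which goes to the instruction memory]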



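Example 3: I-type: branch [Hennessy&Patterson]

[Figure: next address logic. The PC holds the 30-bit word address Addr<31:2>; Addr<1:0> is fixed to "00". An adder adds "1" to the PC; a second adder adds the sign-extended Imm 16 (Instruction<15:0>); a multiplexer controlled by Branch and Zero selects between the two results (inputs 0 and 1). The instruction memory returns the 32-bit Instruction<31:0>]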


Architecture of the MIPS core

• problem: long critical path, defined by the slowest instruction (load)
• solution: pipelining
  • break the instruction into smaller steps
  • all steps have about the same critical path

E.g. load: 5 stages

  cycle 1   cycle 2   cycle 3   cycle 4   cycle 5
  Ifetch    RF read   ALU       dmem      RF write


Pipelining lw instructions

        cycle 1   cycle 2   cycle 3   cycle 4   cycle 5    cycle 6    cycle 7
  lw    Ifetch    RF read   ALU       dmem      RF write
  lw              Ifetch    RF read   ALU       dmem       RF write
  lw                        Ifetch    RF read   ALU        dmem       RF write

• One instruction enters the pipeline every clock cycle
• One instruction leaves the pipeline every clock cycle
=> CPI = 1 (Cycles per Instruction)
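The cycle count behind CPI = 1 can be checked with a small calculation. This sketch assumes an ideal pipeline without stalls, in which the first instruction needs one cycle per stage and every later instruction completes one cycle after its predecessor; the function names are hypothetical.

```c
/* Total cycles for n_instr instructions in an ideal pipeline
   (no stalls assumed): fill the pipeline once, then one
   instruction completes per cycle. */
unsigned long pipeline_cycles(unsigned long n_instr, unsigned stages)
{
    if (n_instr == 0)
        return 0;
    return n_instr + stages - 1;
}

/* Average cycles per instruction; approaches 1 for long programs. */
double pipeline_cpi(unsigned long n_instr, unsigned stages)
{
    if (n_instr == 0)
        return 0.0;
    return (double)pipeline_cycles(n_instr, stages) / (double)n_instr;
}
```

Three lw instructions in the 5-stage pipeline take 3 + 5 - 1 = 7 cycles; for a long instruction stream the fill time becomes negligible and the CPI tends to 1.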



Pipelining lw instructions

[Figure: six overlapping instructions, each passing through the stages I R A M W, with the labels Instructions, Data and Current CPU cycle]


4 stages of R-type instruction

E.g. ADD

  cycle 1   cycle 2   cycle 3   cycle 4
  Ifetch    RF read   ALU       RF write



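Pipelining lw and R-type instructions

[Figure: pipeline diagram (Ifetch, RF read, ALU, RF write) of a lw followed by an add]

Resource conflict on the write port of the register file: a 5-stage lw and a 4-stage add issued one cycle later reach RF write in the same cycle.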



Solution: stretch R-type to 5 stages

[Figure: pipeline diagram of a lw followed by two add instructions; each add gets an extra dummy op (noop) stage] [Hennessy&Patterson]


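[Figure: pipelined datapath with the stages Ifetch, Reg/dec, exec, mem, wr. Program memory (adr, + 4, Next PC); register file (Ra, Rb, Rw, Di; Rs, Rt, Rd; RegDst, RegWr); extender (Imm16, ExtOp); ALU (BusA, BusB, ALUSrc, ALUop, flags, branch); data memory (Din, Dout, MemWr, MemtoReg)]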



Data dependencies: R-type instructions

[Figure: pipeline diagram of five overlapping instructions (stages labelled IM, RF, DM, RF)]

  R1 = ...
  … = R1 + ...
  … = R1 + ...
  … = R1 + ...
  … = R1 + ...



Data dependencies: R-type instructions

[Figure: pipeline diagram of five overlapping instructions (stages labelled IM, RF, DM, RF)]

  R1 = ...
  … = R1 + ...
  … = R1 + ...
  … = R1 + ...
  … = R1 + ...

Solution: bypasses



Bypasses [Hennessy&Patterson]

[Figure: pipelined datapath with bypass paths; labelled blocks: data memory (Datamem, adr)]


Data dependencies: load instruction

[Figure: pipeline diagram of four overlapping instructions (stages labelled IM, RF, DM, RF)]

  R1 = lw...
  … = R1 + ...
  … = R1 + ...
  … = R1 + ...



[Figure: pipeline diagram of four overlapping instructions (stages labelled IM, RF, DM, RF)]

  R1 = lw...
  … = R1 + ...
  … = R1 - ...
  … = R1 - ...

Bypass is no solution for the + instruction



[Figure: pipeline diagram of four overlapping instructions (stages labelled IM, RF, DM, RF)]

  R1 = lw...
  … = R1 + ...
  … = R1 - ...
  … = R1 - ...

Solution: pipeline interlock = detects a data hazard and stalls the pipeline until the hazard is cleared
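The detection part of an interlock can be sketched as a comparison of register numbers. The version below assumes the usual load-use condition for a 5-stage pipeline (the older instruction is a load whose destination is read by the instruction directly behind it); the types `ex_stage` and `dec_stage` and the function `must_stall` are hypothetical names, not taken from the slides.

```c
#include <stdbool.h>

/* Older instruction, currently in the ALU stage (hypothetical record). */
typedef struct {
    bool     is_load;   /* true for lw */
    unsigned dest;      /* register written by the instruction */
} ex_stage;

/* Younger instruction, currently reading the register file. */
typedef struct {
    unsigned src1;      /* first source register */
    unsigned src2;      /* second source register */
} dec_stage;

/* True when the younger instruction needs the value a load has not
   fetched yet, so a bypass cannot help and the pipeline must stall.
   Register 0 is excluded because it always reads as zero in MIPS. */
bool must_stall(ex_stage older, dec_stage younger)
{
    return older.is_load
        && older.dest != 0
        && (older.dest == younger.src1 || older.dest == younger.src2);
}
```

For the sequence on the slide, `R1 = lw...` followed by `… = R1 + ...` makes `must_stall` return true for one cycle; after that the loaded value can be bypassed.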



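Application examples (1)

[Figure: FIR filter with delay line x4 ... x0 (Z-1 elements), coefficients c4 ... c0, one multiplier per tap and an adder producing y]

#define NTAPS 4

int fir(int in)
{
    int i;
    /* indices 0..NTAPS are used, so each array needs NTAPS+1 elements */
    static int state[NTAPS+1];
    static int coeff[NTAPS+1];
    int out[NTAPS+1];

    state[NTAPS] = in;
    out[0] = state[0] * coeff[0];
    for (i = 1; i < NTAPS+1; i++) {
        out[i] = out[i-1] + state[i] * coeff[i];
        state[i-1] = state[i];   /* shift the delay line */
    }
    return out[NTAPS];
}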


.L1000006
    sll   $3, $2, 2        R3=R2<<2                    R3=i-1
    addu  $14, $15, $3     R14=R15+R3
    lw    $24, 0($14)      R24=load(*R14)              R24=coeff[i-1]
    addiu $12, $6, -4      R12=R6-4
    addu  $11, $12, $3     R11=R12+R3
    lw    $13, 0($11)      R13=load(*R11)              R13=state[i-1]
    nop
    mult  $24, $13         R24=R24*R13
    addu  $25, $sp, $3     R25=sp+R3
    lw    $9, -4($25)      R9=load(R25-4)              R9=out[i-1]
    addiu $2, $2, 1        R2=R2+1                     i=i+1
    mflo  $13              R13=move from low mpy reg
    addu  $10, $9, $13     R10=R9+R13                  R10=out[i]
    sw    $10, 0($25)      mem(*R25)=R10
    addu  $25, $7, $3      R25=R7+R3
    sw    $24, 0($25)      mem(*R25)=R24
    slti  $24, $2, 10
    bne   $24, $0, .L100006
    addiu $15, $7, -4

19 instructions per tap!!


Bit level operations: finite field arithmetic

temp1 = input << 1
temp2 = if (bit(input,7) == 1) then 29 else 0
out = temp1 exor temp2

[Figure: in[0] ... in[7] mapped to out[0] ... out[7] through three exor gates]

            r1 = LB input          Load byte
            r2 = SLL r1            Shift left logical
            r3 = ANDI r1, mask     AND immediate
            r4 = ADDI r3, -1       ADD immediate
            BNE ( r4 != r0)        Branch on != to nonzero
            nop
            R5 = XORI(r1, 29)      Exclusive or immediate
            J common               Jump
            nop
  nonzero   r5 = XOR(r1,r0)        Exclusive OR
  common    …

10 instructions!! Very simple in hardware
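The three pseudocode lines translate almost literally into C. This sketch assumes that input and output are 8 bits wide, as the in[0..7] and out[0..7] labels suggest, so the shifted value is truncated to a byte; the function name `gf_times_x` is hypothetical.

```c
#include <stdint.h>

/* One finite field step as on the slide:
   shift left, then exor with 29 when bit 7 of the input was set.
   Assumes 8-bit operands (the shifted-out bit is dropped). */
uint8_t gf_times_x(uint8_t input)
{
    uint8_t temp1 = (uint8_t)(input << 1);          /* temp1 = input << 1 */
    uint8_t temp2 = (input & 0x80u) ? 29u : 0u;     /* bit(input,7) == 1 ? 29 : 0 */
    return (uint8_t)(temp1 ^ temp2);                /* out = temp1 exor temp2 */
}
```

In C the whole operation is one expression, while the assembly listing needs a compare, a branch and delay-slot nops to select between the two exor operands.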


Application examples (2)

Bit level operations: DES example

    srl  $13, $2, 20
    andi $25, $13, 1
    srl  $14, $2, 21
    andi $24, $14, 6
    or   $15, $25, $24
    srl  $13, $2, 22
    andi $14, $13, 56
    or   $25, $15, $14
    sll  $24, $25, 2

[Figure: bits 20, 22, 23, 25, 26, 27 of the source register ($2) are moved to bits 2, 3, 4, 5, 6, 7 of the destination register ($24)]


Bit level operations: A5 example (GSM encryption)

    srl  $24, $5, 18
    srl  $25, $5, 17
    xor  $8, $24, $25
    srl  $9, $5, 16
    xor  $10, $8, $9
    srl  $11, $5, 13
    xor  $12, $10, $11
    andi $13, $12, 1

[Figure: bits 18, 17, 16 and 13 of $5 are combined with xor; the result is bit 0 of $13 (value 0 or 1)]
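The eight-instruction sequence computes a single bit: the xor of four selected bits of one register. A C sketch that mirrors the listing is shown below; the function name `a5_feedback_bit` is hypothetical, and only the bit positions 18, 17, 16 and 13 are taken from the slide.

```c
#include <stdint.h>

/* Exor of bits 18, 17, 16 and 13 of r, returned in bit 0
   (the same shifts, xors and final mask as the assembly listing). */
uint32_t a5_feedback_bit(uint32_t r)
{
    return ((r >> 18) ^ (r >> 17) ^ (r >> 16) ^ (r >> 13)) & 1u;
}
```

The result is 1 when an odd number of the four tapped bits is set, which in hardware is just a small tree of xor gates.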



Application examples: conclusions

• CPUs offer flexibility, but…
  • not efficient in performance
  • not efficient in code size
  • not efficient in power consumption


Power Consumption in microprocessors

Power consumption is (becoming) the limiting factor in processor design

Solution in direction of
• Hardware acceleration
• Instruction Level Parallelism instead of clock speed
• Code size efficiency

source: ISSCC2001, Patrick Gelsinger, Intel


Amdahl's law

• Impact of an improvement on the execution time of a program depends on 2 parameters:
  – f = fraction of the original computation time that is affected by the improvement
  – s = speedup factor (local)
• exec_time_new = exec_time_old * (1-f) + exec_time_old * f / s
• speedup_overall = exec_time_old / exec_time_new = 1 / (1 - f + f / s)
• if s >> 1 then speedup_overall approaches 1 / (1 - f)
• Example: 40 % of program can be executed 10 x faster
  speedup_overall = 1 / (0.6 + 0.4 / 10) = 1.56
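The formula is easy to evaluate for other values of f and s. A minimal sketch, assuming 0 <= f <= 1 and s > 0 (the function name `amdahl_speedup` is hypothetical):

```c
/* Overall speedup according to Amdahl's law.
   f: fraction of the original execution time that is affected,
   s: local speedup of that fraction (assumed to be > 0). */
double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}
```

With f = 0.4 and s = 10 this gives 1 / 0.64 = 1.5625, the 1.56 of the example; even for an arbitrarily large s the result stays below 1 / 0.6, about 1.67.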


Conclusions

• Programmable CPU cores are important for the control parts of the application.
• They are well supported with tools to support the development of end-user software (vs. deeply embedded sw).
• Keep it Simple heuristic (RISC vs. CISC)
  • Make frequent cases fast and rare cases correct.
  • Regular (orthogonal) instruction set
  • No special features that match a high level language construct.
  • At least 16 registers to ease register allocation.
• Embedded cores are often light cores which are a compromise between performance, area and power dissipation (vs. stand-alone CPU cores which are optimised for performance).


Programmable Digital Signal Processors

• real-time worst-case processing = need for more compute power

$$\frac{\text{sec}}{\text{prog}} = \frac{\text{instr}}{\text{prog}} \times \frac{\text{cycles}}{\text{instr}} \times \frac{\text{sec}}{\text{cycle}}$$

  CPI = 1
• instruction level parallelism (ILP)
• hardware support for loop control
• attention for high level data types, e.g. arrays, delay lines (vs. scalars for CPUs)
• difficult to compare architectures
  • e.g. DIT, DIF, radix 2/4, FFT loop unrolling, scaling, shuffling, initialisation … can be included or forgotten
  • benchmarking (Berkeley Design Technology Inc (BDTi)) (compare to SpecInt benchmarks for CPUs)


Outline

• architectures for programmable DSPs
  • multiplier-accumulator
  • modified Harvard architecture
  • extension with an ALU (decision making)
  • controller architectures
• examples: TI, Motorola, Philips
• code generation
• recent developments: VLIW (Very Long Instruction Word); examples: C6 and TM


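Sum of products = basic operation for correlation, filtering, spectral analysis ... (linear expr.)

Goal = 1 cycle per iteration of c(i) * x(i)

[Figure: multiplier-accumulator. Inputs c(i) and x(i) feed the multiplier MPY (Booth, Wallace ..); the product register PR (P_reg, clock) feeds the ADDER, whose result is kept in the accumulator register ACR; a control input steers the unit]

Modifications
• extra inputs/outputs
• position ACR (1 or 2)
• adder/subtractor
• extra pipelines
• asymmetric inputs
• multi-precision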


DSP data types

• not every signal requires 32 bits
• 2 types of DSP: floating point and integer
• advantages FP: most specs are in FP (conversion to int is time consuming since the behaviour may change)
• disadvantage FP: cost (area, speed, power)
• wanted: type of output of an operation = type of input (because both stored in RAM)
  • no problem for FP but for integer
  • integer multiplication doubles the number of bits: n * n => 2n
• What about fractional numbers? 0.9 x 0.9 = 0.81


DSP data types

• integer and fractional numbers are a special case of fixed point: fix <p,q> (ART designer & SystemC)

Example: fix <8,3>, scale factor 1/8

  bit      1      1      1      0      1      1      0      1
  weight   -2^4   2^3    2^2    2^1    2^0    2^-1   2^-2   2^-3

  value = -19/8 = -2.375

[Figure labels: p, q, negative weight (2's complement), quantization error]

• if q=0 then integer, e.g. int <8,0>
• if q=p-1 then fractional, e.g. int <8,7>
• Same alu handles fix <8,1>, fix <8,2>, fix <8,3>, ...
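The fix <p,q> interpretation can be checked numerically: the same bit pattern is read as a two's complement integer and then scaled by 2^-q. The sketch below assumes that p is the total word length and q the number of fractional bits, which is consistent with the fix <8,3> example and its scale factor 1/8; both function names are hypothetical.

```c
#include <stdint.h>

/* Read the low p bits of `bits` as a two's complement integer.
   Assumes 1 <= p <= 30 so every intermediate fits in int32_t. */
int32_t fix_raw_from_bits(uint32_t bits, unsigned p)
{
    uint32_t mask = (1u << p) - 1u;
    int32_t raw = (int32_t)(bits & mask);
    if (bits & (1u << (p - 1)))
        raw -= (int32_t)(1u << p);   /* the msb carries a negative weight */
    return raw;
}

/* Value of a fixed-point word with q fractional bits: raw * 2^-q. */
double fix_to_double(int32_t raw, unsigned q)
{
    return (double)raw / (double)(1u << q);
}
```

The pattern 11101101 (0xED) read as fix <8,3> gives -19 / 8 = -2.375; read with q = 0 the same bits give the integer -19, which illustrates why one ALU can serve all fix <8,q> types.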


DSP data types

• continue (after multiplication) with msb only
  • represents the limit of the accuracy of the result (can not be larger than the accuracy of the inputs)
• more efficient solution
  • continue with msb + lsb
  • sum-of-product operations generate accumulative noise at 32nd vs. 16th bit
• Still overflow for addition = overflow bits
• double precision accumulator + extra overflow bits + shift, round, truncate unit
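The combination of a double precision accumulator, extra overflow bits and a final shift/round/saturate step can be sketched for 16-bit fractional data. The example assumes the fix <16,15> format (often called Q15), a 64-bit C integer standing in for the wide accumulator, and a compiler that shifts negative values arithmetically; the function name `q15_sum_of_products` is hypothetical.

```c
#include <stdint.h>

/* Sum of products of 16-bit fractional numbers (fix <16,15> assumed).
   Full 32-bit products are accumulated in a wide register, and the
   result is rounded and saturated only once at the end. */
int16_t q15_sum_of_products(const int16_t *c, const int16_t *x, int n)
{
    int64_t acc = 0;                       /* product bits plus overflow bits */
    for (int i = 0; i < n; i++)
        acc += (int32_t)c[i] * (int32_t)x[i];   /* 16 x 16 -> 32 bits, 30 fractional bits */

    acc += 1 << 14;                        /* round: add half an output lsb */
    acc >>= 15;                            /* back to 15 fractional bits
                                              (assumes arithmetic right shift) */
    if (acc > INT16_MAX) acc = INT16_MAX;  /* saturate instead of wrapping */
    if (acc < INT16_MIN) acc = INT16_MIN;
    return (int16_t)acc;
}
```

With 0.9 represented as 29491, a single product yields 26542, which is 0.81 to within one lsb. Keeping all product bits until the end means rounding noise is introduced once, not once per tap.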


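[Figure: multiplier-accumulator extended with a shift/round/truncate unit. Inputs c(i) and x(i) feed MPY (Booth, Wallace ..); product register PR; ADDER; accumulator register ACR; SHIFT, ROUND, TRUNCATE unit; clocked registers (clock, P_reg) and a control input]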


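Goal = 1 cycle per iteration of c(i) * x(i)

[Figure: three memory architectures. Von Neumann (sequential): one prog/data memory connected to the EXU. Harvard: separate prog mem. and data mem. connected to the EXU. Modified Harvard: prog mem., data mem. 1 and data mem. 2 connected to the EXU]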


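[Figure: DSP architecture. RAM_A and RAM_B, each with its own address computation unit (ACU_A, ACU_B), address register (AR_A, AR_B) and data register (DR_A, DR_B), feed the MAC; Rfile; controller with PC (+1, interrupt address, stack, reset), program memory and IR driving the control bus]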


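[Figure: FIR filter with delay line x5 ... x1 (Z-1 elements), coefficients c5 ... c1, one multiplier per tap and an adder producing y; the products ci * xi are computed in the filter loop i, nested inside the time loop]

How to update the delay line?

1 cycle/tap?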


Solution 2: indirect addressing

  Memory     output     output     output     output     output
  location   sample 1   sample 2   sample 3   sample 4   sample 5
  1          x1                                          x9
  2          x2         x2
  3          x3         x3         x3
  4          x4         x4         x4         x4
  5          x5         x5         x5         x5         x5
  6                     x6         x6         x6         x6
  7                                x7         x7         x7
  8                                           x8         x8

• use of a pointer to mark the begin of the delay line
• update the pointer instead of moving the data
• problem: trashing of the whole memory
• solution: modulo addressing
• need for a register to store the pointer
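The pointer-based delay line can be written in C as a circular buffer. This is a sketch of the idea, not the code of any particular DSP: the buffer length, the type `delay_line` and the function `fir_circular` are assumptions made for the example, and the modulo operation is done with `%` where a DSP would use its modulo addressing hardware.

```c
#define NTAPS 5   /* filter length, chosen for the example */

/* Delay line kept in place; only the pointer moves. */
typedef struct {
    int x[NTAPS];   /* stored input samples */
    int head;       /* index of the newest sample */
} delay_line;

/* Process one input sample; coeff[0] multiplies the newest sample. */
int fir_circular(delay_line *d, const int coeff[NTAPS], int in)
{
    d->head = (d->head + 1) % NTAPS;   /* update the pointer, modulo the size */
    d->x[d->head] = in;                /* the oldest sample is overwritten */

    int acc = 0;
    int idx = d->head;
    for (int i = 0; i < NTAPS; i++) {
        acc += coeff[i] * d->x[idx];
        idx = (idx + NTAPS - 1) % NTAPS;   /* step back one sample, modulo */
    }
    return acc;
}
```

Starting from a zero-initialised `delay_line`, feeding a single 1 followed by zeros returns the coefficients one after another. No sample is ever copied, which removes the `state[i-1] = state[i]` move of the straightforward C version.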


ACU architecture and instruction set

[Figure: address computation unit with registers A and S, a modulo unit and the output to RAM]

  Instruction   Output   reg A   reg S
  Read_A        A        A       S
  Read_S        S        A       S
  incA          A+1      A+1     S
  decA          A-1      A-1     S
  Step          A+S      A+S     S
  Inc_step      S+1      A       S+1

Modulo can be implemented as a mask operation if the size is 2^k

  16 = 10 000
  23 = 10 111
  (mask = hold)
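The mask variant of modulo addressing can be illustrated with the addresses 16 to 23 from the slide. The sketch assumes a buffer of 2^k words aligned on a multiple of its size, so that the upper address bits are held and only the low k bits are updated; the function name `acu_step_modulo` is hypothetical.

```c
/* Step address a by s inside an aligned buffer of 2^k words.
   mask = 2^k - 1 selects the bits that may change; the remaining
   (upper) bits of a are held. */
unsigned acu_step_modulo(unsigned a, unsigned s, unsigned mask)
{
    return (a & ~mask) | ((a + s) & mask);
}
```

With mask = 7 the address 23 (10 111) steps to 16 (10 000): the upper bits 10 are held and the low three bits wrap from 111 to 000, so no comparison or subtraction is needed.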


Addressing modes

  mode         example             operation
  register     ADD R4, R3          R[R4] = R[R4] + R[R3]
  immediate    ADD R4, #3          R[R4] = R[R4] + #3
  direct       ADD R4, (100)       R[R4] = R[R4] + Mem[100]
  indirect     ADD R4, (R3)        R[R4] = R[R4] + Mem[R[R3]]
  w. inc/dec   ADD R4, (R3)±       R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± 1
  indexed      ADD R4, (R3±R2)     R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± R[R2]

Remarks
• direct = for static data
• indirect = for arrays
• inc/dec = for stepping through arrays, e.g. x(n)
• index = for stepping through arrays, e.g. x(2n)


Addressing modes: extra for DSP

• 8 ARs (address or auxiliary register) available
• extra indirect modes
  • circular: *ARn ± %          post inc/dec by 1, circular
  • circular: *ARn ± AR0 %      post inc/dec by AR0, circular
  • bit reverse: *ARn ± AR0 B   post inc/dec by AR0, bit reversed
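Bit-reversed post-increment can be modelled as an addition whose carry runs from the most significant bit towards the least significant bit. The sketch below implements that by reversing the operands, adding normally and reversing the result; it is an illustration of the addressing pattern only, and the function names are hypothetical rather than part of any DSP toolchain.

```c
/* Reverse the low `bits` bits of v. */
unsigned bit_reverse(unsigned v, unsigned bits)
{
    unsigned r = 0;
    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (v & 1u);
        v >>= 1;
    }
    return r;
}

/* Add step to addr with the carry propagating from msb to lsb,
   restricted to an index of `bits` bits (assumes bits < 32). */
unsigned reverse_carry_add(unsigned addr, unsigned step, unsigned bits)
{
    unsigned mask = (1u << bits) - 1u;
    unsigned sum = (bit_reverse(addr, bits) + bit_reverse(step, bits)) & mask;
    return bit_reverse(sum, bits);
}
```

For an 8-point FFT (3 index bits) and a step of 4, repeated calls starting at 0 visit the indices 0, 4, 2, 6, 1, 5, 3, 7, which is the bit-reversed order of 0 to 7.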


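[Figure: DSP architecture extended with an ALU. Controller with PC (+1, interrupt address, stack, reset), program memory and IR driving the control bus; ACU_A, AR_A, RAM_A, DR_A and ACU_B, AR_B, RAM_B, DR_B; MAC and ALU; Rfile]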


First solution: c(i) * x(i)

  LABEL   ALU                             MPY-ACC                      RAM         ACU
                                          acc = 0                                  init (i=0)
          init counter
  loop                                                                             incr (=i+1)
                                                                       read x(i)
                                          acc(i)=acc(i-1)+x(i)*c(i)
          dec counter
          branch to loop if counter > 0
          nop

(resources horizontally, time (cc) vertically; not shown: coefficient RAM + ACU)

6 clockcycles/sample
limit: pipelines in the controller


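Loopfolding (software pipelining)

Original loop:
  for i = 0 to n
    b(i) = f(a(i))
    c(i) = g(b(i))
    d(i) = h(c(i))

Folded loop:
  for i = 2 to n
    b(i)   = f(a(i))
    c(i-1) = g(b(i-1))
    d(i-2) = h(c(i-2))

[Figure: the chains a(i) -> f -> b(i) -> g -> c(i) -> h -> d(i) drawn for iterations 0, 1 and 2, and the folded loop body in which f, g and h work side by side on a(i), b(i-1) and c(i-2)]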
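The folded loop needs a preamble that starts the first two iterations and a postamble that finishes the last two. The C sketch below shows both versions side by side; the stage functions `f`, `g` and `h` are arbitrary placeholders chosen for the example, and the arrays hold n elements with indices 0 to n-1.

```c
/* Placeholder stage functions (assumptions for the example). */
static int f(int a) { return a + 1; }
static int g(int b) { return b * 2; }
static int h(int c) { return c - 3; }

/* Original loop: the three stages of one iteration run back to back. */
void plain_loop(const int *a, int *d, int n)
{
    for (int i = 0; i < n; i++)
        d[i] = h(g(f(a[i])));
}

/* Folded loop: each pass works on three different iterations. */
void folded_loop(const int *a, int *d, int n)
{
    if (n < 2) {                    /* too short to fold */
        plain_loop(a, d, n);
        return;
    }
    /* preamble */
    int c_prev = g(f(a[0]));        /* c(0) */
    int b_prev = f(a[1]);           /* b(1) */

    for (int i = 2; i < n; i++) {
        int b_cur = f(a[i]);        /* stage f of iteration i   */
        int c_cur = g(b_prev);      /* stage g of iteration i-1 */
        d[i - 2] = h(c_prev);       /* stage h of iteration i-2 */
        b_prev = b_cur;
        c_prev = c_cur;
    }
    /* postamble */
    d[n - 2] = h(c_prev);
    d[n - 1] = h(g(b_prev));
}
```

Inside the folded loop body the three statements do not depend on each other, so a machine with three suitable units could execute them in the same cycle; in the original loop each statement has to wait for the previous one.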


Loopfolding (software pipelining): c(i) * x(i)

  LABEL   ALU                             MPY-ACC                            RAM           ACU
                                          acc(i-1)=0                                       init (i=1)
          init counter                                                       read x(i)     inc (=i+1)
  loop                                    acc(i)=acc(i-1)+x(i)*c(i)          read x(i+1)   incr (=i+2)
          dec counter
          branch to loop if counter > 0
          nop
                                          acc(n-1)=acc(n-2)+x(n-1)*c(n-1)    read x(n)
                                          acc(n)=acc(n-1)+x(n)*c(n)

Pre- and postamble
4 clockcycles/sample


Hardware support for loop control: c(i) * x(i)

  LABEL        ALU            MPY-ACC                            RAM           ACU
                              acc(i-1)=0                                       init (i=1)
               init counter                                      read x(i)     inc (=i+1)
  repeat n-2                  acc(i)=acc(i-1)+x(i)*c(i)          read x(i+1)   incr (=i+2)
                              acc(n-1)=acc(n-2)+x(n-1)*c(n-1)    read x(n)
                              acc(n)=acc(n-1)+x(n)*c(n)

1 clockcycle/sample
repeat instruction and repeat block


TMS320C5000

[Figure: TMS320C5000 datapath. T register; multiplier (17*17) with sign control on its inputs; fractional MUX; adder (40) with ZERO, SAT and ROUND; accumulators A (40) and B (40); MALU (40); barrel shifter; MSW/LSW select; COMP, TRN and TC; multiplexers selecting among A, B, T and the buses C, D, E and P]


Motorola 56K family

[Figure: Motorola 56K block diagram. Address ALU driving the X address, Y address and P address buses; 16-bit address bus through the external address switch; X memory (256-by-24-bit RAM, 256-by-24-bit ROM); Y memory (256-by-24-bit RAM, 256-by-24-bit ROM); 2,048-by-24-bit program memory ROM; 24-bit data buses (X data, Y data, P data, global data) with internal data-bus switch and external data-bus switch; data ALU with 24-by-24 bit multiplier-accumulator producing a 56 bit result; program controller; on-chip peripherals (host, synchronous serial interface, serial communications interface, programmed I/O, bus control); clock, interrupt and I/O port connections (widths shown: 2, 3, 24 and 7 bits)]


R.E.A.L.

[Figure: R.E.A.L. DSP block diagram. X data memory and Y data memory, each on a 16 bit bus; two address computation units; program memory (Z data) on a 16-bit bus; program control unit; instruction decoder with 96-bit instructions; buses for X data, Y data and Z data; two 16-by-16 bit multipliers (inputs X, Y0, Y1; products P0, P1, each with scale); two 40 bit arithmetic-logic units with saturation; four 40 bit accumulators; saturation/scale and shift]


[Figure: compiler flow. source -> Front end (lexical analysis, syntax analysis, semantic analysis) -> intermediate machine independent representation -> Code generation (code selection, register allocation, scheduling) -> code. Annotations at scheduling: 1 instr = // ops; order of instr]


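Intermediate machine independent representation

  t1  := a * b
  t2  := c + d
  t3  := t1 + c
  out := t2 * t3

[Figure: data-flow graph of these operations (inputs a, b, c, d; intermediate values t1, t2, t3) and a control-flow graph of the basic blocks BBi, BBj, BBk]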


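Code selection example

[Figure: ADSP datapath [Analog Devices]. ALU (+, -) with input registers ax, ay and result registers ar, af; MAC (+, -, *) with input registers mx, my and result registers mr, mf; x and y inputs; d memory and p memory]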


Example of code selection = covering of intermediate representation with RTPs

[Figure: the data-flow graph (a * b, c + d, t1 + c, t2 * t3) covered with register transfers, grouped under the labels 1, 2 and 3:
  mx := dmem, my := pmem, ax := dmem, ay := pmem
  mr := dmem
  ar := ax + ay
  my := ar
  mr := mr * my
  mr := mr + (mx * my)]


Problems

• local decisions which have a global impact
• phase coupling: example
  • asap schedule
  • maximal freedom for scheduling
  • code selection during scheduling
  • register allocation comes afterwards
  • can lead to infeasible solutions


phase coupling: discussion

It is very difficult, almost impossible, to develop robust and efficient DSP compilers. Current DSP practice = programming in assembler.

Solution:
1. Solve code generation for DSPs
2. Step back and rethink the architecture: develop an architecture which is still efficient but also a good model for building a compiler

Efficiency = exploit instruction level parallelism (ILP)
Compilation = systematic positioning of registers and regular interconnect
=> VLIW = Very Long Instruction Word


Will embedded CPUs and DSPs converge?

• Converging forces
  • both include a hardware multiplier
  • trend in DSPs towards caches and RTK
  • trend in DSPs towards C/C++
  • common trend towards VLIW
• Diverging forces
  • deeply embedded code (DSP) vs. end-user SW (CPU)
  • different RTKs: SPOX, Virtuoso (DSP) vs. pSOS, WinCE (top down)

Conclusions VLIW
• good balance between hw and sw
• between efficiency (ILP) and cost
• fundamental problems: code size, interruptability
