Embedded Processor Architecture Bart Mesman Henk Corporaal 5kk73 2010

Page 1: Embedded Processor Architecture

Embedded Processor Architecture

Bart Mesman, Henk Corporaal

5kk73, 2010

Page 2: Embedded Processor Architecture

Processor Architectures and Program Mapping H. Corporaal and B. Mesman

Spectrum of processor types, from flexibility to efficiency:

• Programmable CPU
• Programmable DSP
• Application specific instruction set processor (ASIP)
• Application specific processor

Page 3: Embedded Processor Architecture

[Figure: flexibility versus efficiency, each on a low / medium / high scale, positioning ASIC, GP proc, FPGA, DSP and ASIP.]

Page 4: Embedded Processor Architecture

Programmable CPU cores

• introduction
• architecture of the MIPS core
    • discussed as an example
    • pipelining
• application examples
• software issues
• comparison between different CPU cores
• towards application specific architectures
• discussion

Page 5: Embedded Processor Architecture

Introduction

• rationale: general-purpose -> large market
• consequence: often handcrafted design optimised for clock rate
• problem: fast changes in the IC process technology
• examples embedded:
    • MIPS (first one, licensing instruction set architecture)
    • ARM (Advanced RISC Machines; telecom, low power, small code size, most popular one, also licensing the micro-architecture as hard or soft IP)
    • derivatives from general purpose CPUs: Intel, NEC, Hitachi, National, PowerPC

Page 6: Embedded Processor Architecture

Introduction

Instruction set architectures

• implicit operands
    • stack machines (e.g. ST20)
    • accumulator machines
• explicit operands
    • general purpose registers
        • register-memory
        • register-register = load-store

Page 7: Embedded Processor Architecture

Architecture of the MIPS core

[Figure: MIPS datapath. The PC (clocked by Clk) supplies the instruction address to the instruction memory; the instruction provides the fields Rd, Rs, Rt (5 bits each) and Imm (16 bits). A register file of 32 32-bit registers (ports Rw, Ra, Rb) and a data memory (data address, data in, data out) are connected by 32-bit buses.]

[Hennessy&Patterson]

Page 8: Embedded Processor Architecture

MIPS instruction formats (32 bits) [Hennessy&Patterson]

R-type:  op [31:26]  rs [25:21]  rt [20:16]  rd [15:11]  shamt [10:6]  funct [5:0]
         6 bits      5 bits      5 bits      5 bits      5 bits        6 bits

I-type:  op [31:26]  rs [25:21]  rt [20:16]  immediate [15:0]
         6 bits      5 bits      5 bits      16 bits

J-type:  op [31:26]  target address [25:0]
         6 bits      26 bits

op        operation of the instruction
rs,rt,rd  source and destination registers
shamt     shift amount
funct     operation of the instruction, part 2
imm       for program constants
addr      target address of a jump
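The three formats can be illustrated with a small field extractor. This is a minimal sketch: the type `mips_fields` and the function `mips_decode` are hypothetical names introduced here, and the bit positions are the ones given in the formats above.

```c
#include <stdint.h>

/* Fields of a 32-bit MIPS instruction word (R-, I- and J-type). */
typedef struct {
    uint32_t op, rs, rt, rd, shamt, funct;  /* R-type fields       */
    uint32_t imm;                           /* I-type: bits 15..0  */
    uint32_t target;                        /* J-type: bits 25..0  */
} mips_fields;

/* Hypothetical helper: extract every field; the caller uses the ones
   that are meaningful for the format selected by op. */
static mips_fields mips_decode(uint32_t word)
{
    mips_fields f;
    f.op     = (word >> 26) & 0x3Fu;
    f.rs     = (word >> 21) & 0x1Fu;
    f.rt     = (word >> 16) & 0x1Fu;
    f.rd     = (word >> 11) & 0x1Fu;
    f.shamt  = (word >>  6) & 0x1Fu;
    f.funct  =  word        & 0x3Fu;
    f.imm    =  word        & 0xFFFFu;
    f.target =  word        & 0x03FFFFFFu;
    return f;
}
```

For example, the word 0x00221820 (add $3, $1, $2) decodes to op = 0, rs = 1, rt = 2, rd = 3, shamt = 0 and funct = 0x20.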

Page 9: Embedded Processor Architecture

Example 1 : R-type : add instruction

R-type:  op (6 bits)  rs (5 bits)  rt (5 bits)  rd (5 bits)  shamt (5 bits)  funct (6 bits)

add rd, rs, rt
• mem[PC]
• R[rd] = R[rs] + R[rt]
• PC = PC + 4

[Figure: register file of 32 32-bit registers with Rw = Rd, Ra = Rs, Rb = Rt (5 bits each). BusA and BusB (32 bits) feed the ALU, which is controlled by ALUctr; the 32-bit Result returns on BusW and is written under control of RegWr.]

[Hennessy&Patterson]

Page 10: Embedded Processor Architecture

Critical path R-type operation

[Figure: MIPS datapath (PC, instruction memory, register file of 32 32-bit registers, data memory) illustrating the critical path of an R-type operation.]

[Hennessy&Patterson]

Page 11: Embedded Processor Architecture

Example 2 : I-type : load word

I-type:  op (6 bits)  rs (5 bits)  rt (5 bits)  immediate (16 bits)

lw rs, rt, imm16 (standard assembler syntax: lw rt, imm16(rs))
• mem[PC]
• addr = R[rs] + ext(imm16)
• R[rt] = mem[addr]
• PC = PC + 4

[Figure: datapath for load word. RegDst selects Rd or Rt as the write register Rw; Ra = Rs, Rb = don't care (Rt). The extender (ExtOp) widens Imm16 to 32 bits and ALUSrc selects it instead of BusB as ALU input. The ALU (ALUctr) result is the address Adr of the data memory (Data In, WrEn = MemWr); MemtoReg selects the memory output for BusW, written under control of RegWr.]

[Hennessy&Patterson]

Page 12: Embedded Processor Architecture

Example 3 : I-type : branch

I-type:  op (6 bits)  rs (5 bits)  rt (5 bits)  immediate (16 bits)

beq rs, rt, imm16
• mem[PC]
• cond = R[rs] - R[rt]
• if cond = 0
      PC = PC + 4 + ext(imm16)*4
  else
      PC = PC + 4

[Hennessy&Patterson]
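The next-PC rule of the branch can be sketched in C. The function name `beq_next_pc` is hypothetical; the sketch assumes that ext() is sign extension and that converting to `int16_t` wraps in two's complement, as on common compilers.

```c
#include <stdint.h>

/* Next PC for beq: cond = R[rs] - R[rt]; the branch is taken when cond == 0. */
static uint32_t beq_next_pc(uint32_t pc, uint32_t rs_val, uint32_t rt_val,
                            uint16_t imm16)
{
    /* ext(imm16) * 4: sign-extend the word offset, then scale to bytes */
    int32_t offset = (int32_t)(int16_t)imm16 * 4;

    if (rs_val - rt_val == 0)
        return pc + 4u + (uint32_t)offset;
    return pc + 4u;
}
```

With pc = 0x1000 and equal register values, an immediate of 3 gives 0x1010, and an immediate of 0xFFFF (that is, -1) branches back to 0x1000.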

Page 13: Embedded Processor Architecture

Example 3 : I-type : branch

I-type:  op (6 bits)  rs (5 bits)  rt (5 bits)  immediate (16 bits)

[Figure: datapath for branch. The register file (32 32-bit registers) drives BusA and BusB into the ALU (ALUctr), which produces the Zero flag. The next address logic receives Zero, the Branch control signal and Imm16, and updates the PC (clocked by Clk) that addresses the instruction memory.]

[Hennessy&Patterson]

Page 14: Embedded Processor Architecture

Example 3 : I-type : branch

[Figure: next address logic. The PC holds Addr<31:2>; Addr<1:0> is the constant “00”. The 30-bit datapath contains an adder with the constant “1”, the SignExt unit that extends Imm16 (Instruction<15:0>), and a multiplexer (inputs 0 and 1) controlled by Branch and Zero. The instruction memory delivers the 32-bit Instruction<31:0>.]

[Hennessy&Patterson]

Page 15: Embedded Processor Architecture

Architecture of the MIPS core

• problem: long critical path, defined by the slowest instruction (load)
• solution: pipelining
    • break the instruction into smaller steps
    • all steps have about the same critical path

E.g. load, 5 stages:

cycle 1   cycle 2   cycle 3   cycle 4   cycle 5
Ifetch    RF read   ALU       dmem      RF write

Page 16: Embedded Processor Architecture

Pipelining lw instructions

      cycle 1   cycle 2   cycle 3   cycle 4   cycle 5   cycle 6   cycle 7
lw    Ifetch    RF read   ALU       dmem      RF write
lw              Ifetch    RF read   ALU       dmem      RF write
lw                        Ifetch    RF read   ALU       dmem      RF write

• One instruction enters the pipeline every clock cycle
• One instruction leaves the pipeline every clock cycle
=> CPI = 1 (Cycles per Instruction)

[Hennessy&Patterson]
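The cycle count in the diagram follows from a simple formula, sketched below. The function names are hypothetical, and the sketch assumes an ideal pipeline without stalls.

```c
/* Cycles to run n instructions on a pipeline with `stages` stages,
   assuming no stalls: the first instruction needs `stages` cycles,
   every further instruction completes one cycle later. */
static unsigned pipeline_cycles(unsigned n, unsigned stages)
{
    return n == 0 ? 0 : stages + (n - 1);
}

/* The same work without overlap: each instruction goes through all
   of its steps before the next one starts. */
static unsigned sequential_cycles(unsigned n, unsigned stages)
{
    return n * stages;
}
```

Three lw instructions on the 5-stage pipeline take 5 + 2 = 7 cycles, as in the diagram, against 15 steps without overlap; for large n the cost per instruction approaches CPI = 1.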

Page 17: Embedded Processor Architecture

Pipelining lw instructions

[Figure: six overlapping instructions, each passing through the stages I R A M W and each starting one cycle after the previous one; labels "Instructions" and "Data"; a marker indicates the current CPU cycle.]

Page 18: Embedded Processor Architecture

4 stages of R-type instruction

E.g. ADD:

cycle 1   cycle 2   cycle 3   cycle 4
Ifetch    RF read   ALU       RF write

[Hennessy&Patterson]

Page 19: Embedded Processor Architecture

Pipelining lw and R-type instructions

      cycle 1   cycle 2   cycle 3   cycle 4   cycle 5   cycle 6   cycle 7
lw    Ifetch    RF read   ALU       dmem      RF write
add             Ifetch    RF read   ALU       RF write

Resource conflict on the write port of the Rfile: lw and add both write the register file in cycle 5.

[Hennessy&Patterson]

Page 20: Embedded Processor Architecture

Solution: stretch R-type to 5 stages

      cycle 1   cycle 2   cycle 3   cycle 4   cycle 5   cycle 6   cycle 7
lw    Ifetch    RF read   ALU       dmem      RF write
add             Ifetch    RF read   ALU       dmem      RF write
add                       Ifetch    RF read   ALU       dmem      RF write

For an R-type instruction the dmem stage is a dummy op (noop).

[Hennessy&Patterson]

Page 21: Embedded Processor Architecture

[Figure: pipelined datapath with the stages Ifetch, Reg/dec, exec, mem and wr. Labelled elements: program memory (adr), + 4 and Next PC; register file Rfile (Ra = Rs, Rb = Rt, Rw selected from Rt/Rd by RegDst, Di, RegWr); extender (Imm16, ExtOp), ALUSrc, ALU (ALUop) with flags and branch; data memory (Din, Dout, MemWr); MemtoReg; BusA and BusB.]

[Hennessy&Patterson]

Page 22: Embedded Processor Architecture

Data dependencies : R-type instructions

[Figure: five consecutive instructions in the pipeline (IM, RF, ALU, DM, RF), each starting one cycle after the previous one. The first instruction writes R1, the next four read it:]

R1 = ...
… = R1 + ...
… = R1 + ...
… = R1 + ...
… = R1 + ...

[Hennessy&Patterson]

Page 23: Embedded Processor Architecture

Data dependencies : R-type instructions

[Figure: the same five instructions in the pipeline (IM, RF, ALU, DM, RF), now with bypass paths from the producing instruction to its consumers:]

R1 = ...
… = R1 + ...
… = R1 + ...
… = R1 + ...
… = R1 + ...

Solution: bypasses

[Hennessy&Patterson]

Page 24: Embedded Processor Architecture

Bypasses

[Figure: pipelined datapath extended with bypasses; the data memory and its address input adr are labelled.]

[Hennessy&Patterson]

Page 25: Embedded Processor Architecture

Data dependencies : load instruction

[Figure: four consecutive instructions in the pipeline (IM, RF, ALU, DM, RF). The first instruction loads R1, the next three read it:]

R1 = lw...
… = R1 + ...
… = R1 + ...
… = R1 + ...

[Hennessy&Patterson]

Page 26: Embedded Processor Architecture

Data dependencies : load instruction

[Figure: four consecutive instructions in the pipeline (IM, RF, ALU, DM, RF):]

R1 = lw...
… = R1 + ...
… = R1 - ...
… = R1 - ...

Bypass is no solution for the + instruction

[Hennessy&Patterson]

Page 27: Embedded Processor Architecture

Data dependencies : load instruction

[Figure: four consecutive instructions in the pipeline (IM, RF, ALU, DM, RF):]

R1 = lw...
… = R1 + ...
… = R1 - ...
… = R1 - ...

Solution: pipeline interlock = detects a data hazard and stalls the pipeline until the hazard is cleared

[Hennessy&Patterson]
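The condition that a pipeline interlock checks for the load case can be sketched as follows. The struct `instr` and the function `load_use_stall` are hypothetical, and the sketch assumes the 5-stage pipeline above, where only the instruction directly after a load cannot be served by a bypass.

```c
#include <stdbool.h>

typedef struct {
    bool     is_load;   /* instruction is a lw                   */
    unsigned dest;      /* register it writes (rt for a lw)      */
    unsigned src1;      /* first source register                 */
    unsigned src2;      /* second source register                */
} instr;

/* True when `next` has to be stalled because it reads the register
   that the immediately preceding load has not yet fetched. */
static bool load_use_stall(const instr *prev, const instr *next)
{
    if (!prev->is_load || prev->dest == 0)   /* $0 is hard-wired to zero */
        return false;
    return next->src1 == prev->dest || next->src2 == prev->dest;
}
```

In the example above, R1 = lw... followed by … = R1 + ... triggers the stall; the later instructions receive R1 through the bypasses.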

Page 28: Embedded Processor Architecture

Application examples (1)

#define NTAPS 4

int fir(int in)
{
    int i;
    static int state[NTAPS + 1];   /* delay line, indices 0..NTAPS */
    static int coeff[NTAPS + 1];   /* coefficients, indices 0..NTAPS */
    int out[NTAPS + 1];            /* partial sums */

    state[NTAPS] = in;
    out[0] = state[0] * coeff[0];
    for (i = 1; i < NTAPS + 1; i++) {
        out[i] = out[i-1] + state[i] * coeff[i];
        state[i-1] = state[i];     /* shift the delay line */
    }
    return out[NTAPS];
}

[Figure: FIR filter. A delay line of Z-1 elements holds x4 ... x0; each sample is multiplied by its coefficient c4 ... c0 and the products are added to give y.]

Page 29: Embedded Processor Architecture

Application examples (1)

.L1000006
sll    $3, $2, 2          R3=R2<<2            R3=i-1
addu   $14, $15, $3       R14=R15+R3
lw     $24, 0($14)        R24=load(*R14)      R24=coeff[i-1]
addiu  $12, $6, -4        R12=R6-4
addu   $11, $12, $3       R11=R12+R3
lw     $13, 0($11)        R13=load(*R11)      R13=state[i-1]
nop
mult   $24, $13           R24=R24*R13
addu   $25, $sp, $3       R25=sp+R3
lw     $9, -4($25)        R9=load(R25-4)      R9=out[i-1]
addiu  $2, $2, 1          R2=R2+1             i=i+1
mflo   $13                R13=move from low mpy reg
addu   $10, $9, $13       R10=R9+R13          R10=out[i]
sw     $10, 0($25)        mem(*R25)=R10
addu   $25, $7, $3        R25=R7+R3
sw     $24, 0($25)        mem(*R25)=R24
slti   $24, $2, 10
bne    $24, $0, .L100006
addiu  $15, $7, -4

19 instructions per tap!!

Page 30: Embedded Processor Architecture

Application examples (2)

Bit level operations: finite field arithmetic

temp1 = input << 1
temp2 = if (bit(input,7) == 1) then 29 else 0
out = temp1 exor temp2

          r1 = LB input         Load byte
          r2 = SLL r1           Shift left logical
          r3 = ANDI r1, mask    AND immediate
          r4 = ADDI r3, -1      ADD immediate
          BNE (r4 != r0)        Branch on != to nonzero
          nop
          r5 = XORI(r1, 29)     Exclusive or immediate
          J common              Jump
          nop
nonzero   r5 = XOR(r1,r0)       Exclusive OR
common    …

[Figure: hardware implementation. in[0] ... in[7] are wired to out[0] ... out[7] with a shift of one position and three exor gates.]

10 instructions!! Very simple in hardware.
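The three pseudo-code lines translate directly into C. This is a minimal sketch on an 8-bit value; the function name `gf_times_two` is a hypothetical choice, and the constant 29 is the one given on the slide.

```c
#include <stdint.h>

/* out = (input << 1) exor (29 if bit 7 of input is set, else 0),
   with all values kept to 8 bits. */
static uint8_t gf_times_two(uint8_t input)
{
    uint8_t temp1 = (uint8_t)(input << 1);
    uint8_t temp2 = (input & 0x80u) ? 29u : 0u;
    return (uint8_t)(temp1 ^ temp2);
}
```

For example, an input of 0x80 gives 29, and an input of 0x01 gives 0x02.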

Page 31: Embedded Processor Architecture

Application examples (2)

Bit level operations : DES example

srl   $13, $2, 20
andi  $25, $13, 1
srl   $14, $2, 21
andi  $24, $14, 6
or    $15, $25, $24
srl   $13, $2, 22
andi  $14, $13, 56
or    $25, $15, $14
sll   $24, $25, 2

[Figure: bits 20, 22, 23, 25, 26 and 27 of the source register ($2) are moved to bits 2, 3, 4, 5, 6 and 7 of the destination register ($24).]
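The same bit gathering can be written in C, one statement per shift-and-mask pair of the assembly listing. The function name `des_gather` is hypothetical; the shift amounts and masks are those of the listing.

```c
#include <stdint.h>

/* Collect bits 20, 22, 23, 25, 26, 27 of src into bits 2..7 of the result. */
static uint32_t des_gather(uint32_t src)
{
    uint32_t t;
    t  = (src >> 20) & 1u;    /* bit 20      -> bit 0      */
    t |= (src >> 21) & 6u;    /* bits 22..23 -> bits 1..2  */
    t |= (src >> 22) & 56u;   /* bits 25..27 -> bits 3..5  */
    return t << 2;            /* final position: bits 2..7 */
}
```

In hardware this permutation is pure wiring, whereas the processor needs nine instructions for it.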

Page 32: Embedded Processor Architecture

Application examples (2)

Bit level operations : A5 example (GSM encryption)

srl   $24, $5, 18
srl   $25, $5, 17
xor   $8, $24, $25
srl   $9, $5, 16
xor   $10, $8, $9
srl   $11, $5, 13
xor   $12, $10, $11
andi  $13, $12, 1

[Figure: bits 18, 17, 16 and 13 of $5 are combined by xor; the one-bit result is placed in the least significant bit of $13, the other bits of $13 are 0.]
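The eight instructions compute a single parity bit, which is one expression in C. The function name `a5_feedback` is hypothetical; the tap positions 18, 17, 16 and 13 are those of the listing.

```c
#include <stdint.h>

/* xor of bits 18, 17, 16 and 13 of reg, returned in bit 0. */
static uint32_t a5_feedback(uint32_t reg)
{
    return ((reg >> 18) ^ (reg >> 17) ^ (reg >> 16) ^ (reg >> 13)) & 1u;
}
```

The result is 1 when an odd number of the four tapped bits is set.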

Page 33: Embedded Processor Architecture

Application examples: conclusions

• CPUs offer flexibility, but…
    • not efficient in performance
    • not efficient in code size
    • not efficient in power consumption

Page 34: Embedded Processor Architecture

Power Consumption in microprocessors

Power consumption is (becoming) the limiting factor in processor design

Solution in direction of
• Hardware acceleration
• Instruction Level Parallelism instead of clock speed
• Code size efficiency

source: ISSCC2001, Patrick Gelsinger, Intel

Page 35: Embedded Processor Architecture

Amdahl’s law

• Impact of an improvement on the execution time of a program depends on 2 parameters:
    - f = fraction of the original computation time that is affected by the improvement
    - s = speedup factor (local)
• exec_time_new = exec_time_old * (1 - f) + exec_time_old * f / s
• speedup_overall = exec_time_old / exec_time_new = 1 / (1 - f + f / s)
• if s >> 1 then speedup_overall ≈ 1 / (1 - f)
• Example: 40 % of the program can be executed 10 x faster
      speedup_overall = 1 / (0.6 + 0.4 / 10) = 1.56
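The formula is easy to evaluate for other values of f and s. A minimal sketch, with hypothetical function names:

```c
/* Overall speedup when a fraction f of the original execution time
   is made s times faster. */
static double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

/* Upper bound on the overall speedup, reached for s >> 1. */
static double amdahl_limit(double f)
{
    return 1.0 / (1.0 - f);
}
```

For the example above, amdahl_speedup(0.4, 10.0) gives 1.5625, and even an infinitely fast accelerator cannot do better than amdahl_limit(0.4), about 1.67.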

Page 36: Embedded Processor Architecture

Conclusions

• Programmable CPU cores are important for the control parts of the application.
• They are well supported with tools for the development of end-user software (vs. deeply embedded sw).
• Keep it Simple heuristic (RISC vs. CISC)
    • Make frequent cases fast and rare cases correct.
    • Regular (orthogonal) instruction set
    • No special features that match a high level language construct.
    • At least 16 registers to ease register allocation.
• Embedded cores are often light cores which are a compromise between performance, area and power dissipation (vs. stand-alone CPU cores which are optimised for performance).

Page 37: Embedded Processor Architecture

Programmable Digital Signal Processors

• real-time worst-case processing = need for more compute power

      sec / prog = (instr / prog) * (cycles / instr) * (sec / cycle)

• CPI = 1
• instruction level parallelism (ILP)
• hardware support for loop control
• attention for high level data types, e.g. arrays, delay lines (vs. scalars for CPUs)
• difficult to compare architectures
    • e.g. DIT, DIF, radix 2/4, FFT loop unrolling, scaling, shuffling, initialisation … can be included or forgotten
    • benchmarking (Berkeley Design Technology Inc (BDTi)) (compare to SpecInt benchmarks for CPUs)

Page 38: Embedded Processor Architecture

Outline

• architectures for programmable DSPs
    • multiplier-accumulator
    • modified Harvard architecture
    • extension with an ALU (decision making)
    • controller architectures
• examples: TI, Motorola, Philips
• code generation
• recent developments: VLIW (Very Long Instruction Word)
    examples: C6 and TM

Page 39: Embedded Processor Architecture

Sum of products = basic operation for correlation, filtering, spectral analysis ... (linear expr.)

      Σ c(i) * x(i)

Goal = 1 cycle per iteration

[Figure: multiplier-accumulator. c(i) and x(i) enter the multiplier MPY (Booth, Wallace..); the product register PR (P_reg, with clock and control) feeds the ADDER, whose result is held in the accumulator register ACR.]

Modifications
• extra inputs/outputs
• position ACR (1 or 2)
• adder/subtractor
• extra pipelines
• asymmetric inputs
• multi-precision

Page 40: Embedded Processor Architecture

DSP data types

• not every signal requires 32 bits
• 2 types of DSP: floating point and integer
• advantage FP: most specs are in FP (conversion to int is time consuming since the behaviour may change)
• disadvantage FP: cost (area, speed, power)
• wanted: type of output of an operation = type of input (because both are stored in RAM)
    • no problem for FP, but a problem for integer
    • integer multiplication doubles the number of bits: n * n => 2n
• What about fractional numbers ?
      0.9 x 0.9 = 0.81

Page 41: Embedded Processor Architecture

DSP data types

• integer and fractional numbers are a special case of fixed point: fix <p,q> (ART designer & SystemC), with p bits in total of which q are fractional bits

Example: fix <8,3>, 2’s complement (the msb has a negative weight), scale factor 1/8

bit        1      1      1      0      1      1      0      1
weight    -2^4    2^3    2^2    2^1    2^0    2^-1   2^-2   2^-3

value = -19/8 = -2.375 (the figure also marks the quantization error)

• if q = 0 then integer, e.g. fix <8,0>
• if q = p-1 then fractional, e.g. fix <8,7>
• the same ALU handles fix <8,1>, fix <8,2>, fix <8,3>, ...
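The interpretation of a fix <8,q> word can be sketched in C: the stored 8-bit 2's complement integer is simply scaled by 2^-q. The function name `fix8_to_double` is hypothetical.

```c
#include <stdint.h>

/* Value of an 8-bit 2's complement word read as fix<8,q>:
   the raw integer multiplied by the scale factor 1 / 2^q. */
static double fix8_to_double(int8_t raw, unsigned q)
{
    return (double)raw / (double)(1u << q);
}
```

The bit pattern 11101101 is the integer -19; read as fix <8,3> it is -19/8 = -2.375, and read as fix <8,0> it is -19. The adder does not need to know q, which is why one ALU can serve all these formats.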

Page 42: Embedded Processor Architecture

DSP data types

• continue (after multiplication) with msb only
    • represents the limit of the accuracy of the result (can not be larger than the accuracy of the inputs)
• more efficient solution: continue with msb + lsb
    • sum-of-product operations generate accumulative noise at the 32nd vs. the 16th bit
• still overflow for addition = overflow bits
• double precision accumulator + extra overflow bits + shift, round, truncate unit
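The accumulator organisation described above can be imitated in C. This is a sketch under stated assumptions: inputs are fix <16,15> values, a 64-bit integer stands in for the double precision accumulator with overflow bits, all inputs in the usage example are non-negative (right-shifting a negative value is implementation-defined in C), and the name `mac_q15` is hypothetical.

```c
#include <stdint.h>

/* Sum of products of n fix<16,15> coefficients and samples. */
static int16_t mac_q15(const int16_t *c, const int16_t *x, int n)
{
    int64_t acc = 0;                              /* wide accumulator */

    for (int i = 0; i < n; i++)
        acc += (int32_t)c[i] * (int32_t)x[i];     /* full 32-bit product (msb + lsb) */

    acc += 1 << 14;                   /* round: add half an output lsb   */
    acc >>= 15;                       /* shift back to 15 fractional bits */
    if (acc >  32767) acc =  32767;   /* saturate to the 16-bit range     */
    if (acc < -32768) acc = -32768;
    return (int16_t)acc;
}
```

With c = x = {0.5, 0.5} (raw value 16384) the result is raw 16384, that is 0.25 + 0.25 = 0.5. Rounding and truncation happen once, after the last accumulation, so the noise enters at the 32nd bit instead of the 16th.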

Page 43: Embedded Processor Architecture

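[Figure: multiplier-accumulator with post-processing. c(i) and x(i) enter the multiplier MPY (Booth, Wallace..), followed by the product register PR, the ADDER and the accumulator register ACR; a SHIFT / ROUND / TRUNCATE unit processes the result. Clock and control drive the registers (P_reg).]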

Page 44: Embedded Processor Architecture

Σ c(i) * x(i): Goal = 1 cycle per iteration

• Von Neumann (sequential): one combined prog/data memory connected to the EXU
• Harvard: separate prog mem. and data mem. connected to the EXU
• Modified Harvard: prog mem., data mem. 1 and data mem. 2 connected to the EXU

Page 45: Embedded Processor Architecture

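[Figure: DSP architecture. Data path: RAM_A and RAM_B, each addressed by its own ACU (ACU_A, ACU_B) with address register (AR_A, AR_B), deliver operands through the data registers DR_A and DR_B to the MAC; an Rfile is also present. Controller: PC with +1 incrementer, interrupt address, stack and reset inputs, program memory and instruction register IR, which drives the control bus.]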

Page 46: Embedded Processor Architecture

[Figure: FIR filter. A delay line of Z-1 elements holds x5 ... x1; each sample is multiplied by its coefficient c5 ... c1 and the products are added to give y = Σ ci * xi. Two nested loops are indicated: the time loop and the filter loop i.]

• 1 cycle/tap ?
• How to update the delay line ?

Page 47: Embedded Processor Architecture

Solution 2: indirect addressing

Memory     output     output     output     output     output
location   sample 1   sample 2   sample 3   sample 4   sample 5
1          x1                                          x9
2          x2         x2
3          x3         x3         x3
4          x4         x4         x4         x4
5          x5         x5         x5         x5         x5
6                     x6         x6         x6         x6
7                                x7         x7         x7
8                                           x8         x8

• use of a pointer to mark the beginning of the delay line
• update the pointer instead of moving the data
• problem: trashing of the whole memory
• solution: modulo addressing
• need for a register to store the pointer
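A delay line with a moving pointer and modulo addressing can be sketched in C as a circular buffer. All names here are hypothetical; the sketch assumes a buffer size that is a power of two, so that the modulo operation becomes a mask.

```c
#define DELAY_SIZE 8u                  /* must be a power of two */
#define DELAY_MASK (DELAY_SIZE - 1u)   /* modulo DELAY_SIZE as a mask */

typedef struct {
    int      mem[DELAY_SIZE];   /* the delay line memory        */
    unsigned head;              /* pointer to the newest sample */
} delay_line;

/* Store a new sample: advance the pointer instead of moving the data. */
static void delay_push(delay_line *d, int sample)
{
    d->head = (d->head + 1u) & DELAY_MASK;
    d->mem[d->head] = sample;
}

/* Sample that was pushed `age` steps ago (age 0 = newest). */
static int delay_get(const delay_line *d, unsigned age)
{
    return d->mem[(d->head - age) & DELAY_MASK];
}

/* FIR output: sum of coeff[k] * x(n-k) for ntaps <= DELAY_SIZE taps. */
static int fir_circular(const delay_line *d, const int *coeff, unsigned ntaps)
{
    int acc = 0;
    for (unsigned k = 0; k < ntaps; k++)
        acc += coeff[k] * delay_get(d, k);
    return acc;
}
```

After pushing the samples 1 to 9 into an 8-entry buffer, sample 9 overwrites the location that held sample 1, as in the table above, and a 5-tap filter reads the samples 9, 8, 7, 6 and 5 without any data having been moved.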

Page 48: Embedded Processor Architecture

ACU architecture and instruction set

[Figure: ACU with registers A and S and a modulo unit; the output goes to the RAM.]

Instruction   Output   reg A   reg S
Read_A        A        A       S
Read_S        S        A       S
incA          A+1      A+1     S
decA          A-1      A-1     S
Step          A+S      A+S     S
Inc_step      S+1      A       S+1

Modulo can be implemented as a mask operation if the size is 2^k,
e.g. addresses 16 (10 000) to 23 (10 111): mask = hold

Page 49: Embedded Processor Architecture

Addressing modes

• register     ADD R4, R3         R[R4] = R[R4] + R[R3]
• immediate    ADD R4, #3         R[R4] = R[R4] + #3
• direct       ADD R4, (100)      R[R4] = R[R4] + Mem[100]
• indirect     ADD R4, (R3)       R[R4] = R[R4] + Mem[R[R3]]
• w. inc/dec   ADD R4, (R3)±      R[R4] = R[R4] + Mem[R[R3]];  R[R3] = R[R3] ± 1
• indexed      ADD R4, (R3±R2)    R[R4] = R[R4] + Mem[R[R3]];  R[R3] = R[R3] ± R[R2]

Remarks
• direct = for static data
• indirect = for arrays
• inc/dec = for stepping through arrays, e.g. x(n)
• index = for stepping through arrays, e.g. x(2n)

Page 50: Embedded Processor Architecture

Addressing modes: extra for DSP

• 8 ARs (address or auxiliary registers) available
• extra indirect modes
    • circular      *ARn ± %        post inc/dec by 1, circular
    • circular      *ARn ± AR0 %    post inc/dec by AR0, circular
    • bit reverse   *ARn ± AR0 B    post inc/dec by AR0, bit reversed
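The access order that bit-reversed addressing produces, as used for the FFT data shuffling mentioned earlier, can be illustrated in C. This sketch only computes the reversed index in software; how the hardware mode derives it from AR0 is not shown, and the function name `bit_reverse` is hypothetical.

```c
/* Reverse the lowest `bits` bits of index i. */
static unsigned bit_reverse(unsigned i, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1u);   /* move the lowest bit of i into r */
        i >>= 1;
    }
    return r;
}
```

For an 8-point FFT (3 index bits) the indices 0, 1, 2, 3, 4, 5, 6, 7 map to 0, 4, 2, 6, 1, 5, 3, 7. A dedicated addressing mode delivers this order without spending any instructions on it.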

Page 51: Embedded Processor Architecture

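[Figure: DSP architecture extended with an ALU. Data path: RAM_A and RAM_B, each with its own ACU (ACU_A, ACU_B), address register (AR_A, AR_B) and data register (DR_A, DR_B), feeding the MAC and the ALU; an Rfile is also present. Controller: PC with +1 incrementer, interrupt address, stack and reset inputs, program memory and instruction register IR, which drives the control bus.]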

Page 52: Embedded Processor Architecture

Σ c(i) * x(i): first solution

Schedule (resources: ALU, MPY-ACC, RAM, ACU; time in clock cycles, one line per cycle):

        MPY-ACC: acc = 0                           ACU: init (i=0)
        ALU: init counter
loop    ACU: incr (=i+1)
        RAM: read x(i)
        MPY-ACC: acc(i) = acc(i-1) + x(i)*c(i)
        ALU: dec counter
        branch to loop if counter > 0
        nop

• 6 clock cycles/sample
• limit: pipelines in the controller
• not shown: coefficient RAM + ACU

Page 53: Embedded Processor Architecture

Loopfolding (software pipelining)

Original loop:

for i = 0 to n
    b(i) = f(a(i))
    c(i) = g(b(i))
    d(i) = h(c(i))

[Figure: in the original loop each iteration chains f, g and h (a0, b0, c0, d0; a1, b1, c1, d1; a2, b2, c2, d2). After folding, one loop body executes f on a(i), g on b(i-1) and h on c(i-2).]

Folded loop:

for i = 2 to n
    b(i)   = f(a(i))
    c(i-1) = g(b(i-1))
    d(i-2) = h(c(i-2))
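The folded loop needs a preamble and a postamble to start and drain the pipeline. The C sketch below shows both forms side by side; f, g and h are arbitrary stand-in functions chosen for illustration, and N is a hypothetical iteration count.

```c
#define N 6   /* number of iterations; the folded form needs N >= 2 */

static int f(int a) { return a + 1; }   /* stand-ins for the three */
static int g(int b) { return b * 2; }   /* dependent operations    */
static int h(int c) { return c - 3; }

/* Original loop: f, g and h of one iteration depend on each other. */
static void plain_loop(const int a[N], int d[N])
{
    for (int i = 0; i < N; i++)
        d[i] = h(g(f(a[i])));
}

/* Folded loop: the body works on three different iterations, so its
   three operations are independent and can be issued in parallel. */
static void folded_loop(const int a[N], int d[N])
{
    int b[N], c[N];

    b[0] = f(a[0]);                 /* preamble */
    b[1] = f(a[1]);
    c[0] = g(b[0]);

    for (int i = 2; i < N; i++) {   /* steady state */
        b[i]     = f(a[i]);
        c[i - 1] = g(b[i - 1]);
        d[i - 2] = h(c[i - 2]);
    }

    c[N - 1] = g(b[N - 1]);         /* postamble */
    d[N - 2] = h(c[N - 2]);
    d[N - 1] = h(c[N - 1]);
}
```

Both functions produce the same array d. The gain is that a machine with parallel resources can execute the three statements of the folded body in the same cycle.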

Page 54: Embedded Processor Architecture

Loopfolding (software pipelining)

Σ c(i) * x(i)

Schedule (resources: ALU, MPY-ACC, RAM, ACU; one line per cycle):

        MPY-ACC: acc(i-1) = 0                                             ACU: init (i=1)
        ALU: init counter                            RAM: read x(i)       ACU: inc (=i+1)
loop    MPY-ACC: acc(i) = acc(i-1)+x(i)*c(i)         RAM: read x(i+1)     ACU: incr (=i+2)
        ALU: dec counter
        branch to loop if counter > 0
        nop
        MPY-ACC: acc(n-1) = acc(n-2)+x(n-1)*c(n-1)   RAM: read x(n)
        MPY-ACC: acc(n) = acc(n-1)+x(n)*c(n)

• pre- and postamble
• 4 clock cycles/sample

Page 55: Embedded Processor Architecture

hardware support for loop control

Σ c(i) * x(i)

Schedule (resources: ALU, MPY-ACC, RAM, ACU; one line per cycle):

             MPY-ACC: acc(i-1) = 0                                             ACU: init (i=1)
             ALU: init counter                            RAM: read x(i)       ACU: inc (=i+1)
repeat n-2   MPY-ACC: acc(i) = acc(i-1)+x(i)*c(i)         RAM: read x(i+1)     ACU: incr (=i+2)
             MPY-ACC: acc(n-1) = acc(n-2)+x(n-1)*c(n-1)   RAM: read x(n)
             MPY-ACC: acc(n) = acc(n-1)+x(n)*c(n)

• 1 clock cycle/sample
• repeat instruction and repeat block

Page 56: Embedded Processor Architecture

TMS320C5000

[Figure: TMS320C5000 data path. Labelled elements: T register; sign control (Sign ctr) on the operand inputs; multiplier (17*17) with fractional MUX; adder (40) with ZERO, SAT and ROUND; ALU (40); accumulators A (40) and B (40); input multiplexers selecting among T, A, B, 0 and the buses C, D and P; barrel shifter; MSW/LSW select; compare unit COMP with TRN and TC; bus E.]

Page 57: Embedded Processor Architecture

Motorola 56K family

[Figure: Motorola 56K block diagram. Address ALU with X address, Y address and P address on a 16-bit address bus and an external address switch; X memory and Y memory (each 256-by-24-bit RAM and 256-by-24-bit ROM); 2,048-by-24-bit program memory ROM; 24-bit data buses (X data, Y data, P data, global data) with internal data-bus switch and external data-bus switch; data ALU with 24-by-24 bit multiplier-accumulator producing a 56 bit result; program controller; on-chip peripherals (host, synchronous serial interface, serial communications interface, programmed I/O, bus control); clock, interrupt and I/O port connections (widths of 2, 3, 24 and 7 bits appear in the figure).]

Page 58: Embedded Processor Architecture

R.E.A.L.

[Figure: R.E.A.L. DSP block diagram. X data memory and Y data memory, each on a 16-bit bus; program memory (Z data) on a 16-bit bus; buses for X data, Y data and Z data; two address computation units; program control unit with instruction decoder (96-bit instructions); two 16-by-16 bit multipliers (inputs X, Y0, Y1; products P0, P1, each with scale); two 40-bit arithmetic-logic units with saturation; four 40-bit accumulators; saturation/scale and shift units.]

Page 59: Embedded Processor Architecture

[Figure: compiler structure. source -> front end (lexical analysis, syntax analysis, semantic analysis) -> intermediate machine independent representation -> code generation (code selection, register allocation, scheduling) -> code. Annotations at the scheduling phase: 1 instr = // ops; order of instr.]

Page 60: Embedded Processor Architecture

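Intermediate machine independent representation

t1 := a * b
t2 := c + d
t3 := t1 + c
out := t2 * t3

[Figure: the corresponding data flow graph (inputs a, b, c, d; operations *, +, +, *; intermediate values t1, t2, t3), shown as the contents of basic block BBi in a control flow graph with the basic blocks BBj and BBk.]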

Page 61: Embedded Processor Architecture

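Code selection example

[Figure: ADSP [Analog Devices] data path. ALU (+, -) with input registers ax, ay (operands x, y) and result registers ar, af; MAC (+, -, *) with input registers mx, my (operands x, y) and result registers mr, mf; both are connected to the d memory and the p memory.]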

Page 62: Embedded Processor Architecture

Example of code selection = covering of intermediate representation with RTPs

[Figure: the data flow graph (t1 = a * b, t2 = c + d, t3 = t1 + c, result = t2 * t3) covered with three numbered patterns (1, 2, 3).]

Register transfers used in the covering:

mx := dmem    my := pmem    ax := dmem    ay := pmem
mr := dmem
mr := mr + (mx * my)
ar := ax + ay
my := ar
mr := mr * my

Page 63: Embedded Processor Architecture

Problems
• local decisions which have a global impact
• phase coupling: example
    • asap schedule
    • maximal freedom for scheduling
    • code selection during scheduling
    • register allocation comes afterwards
    • can lead to infeasible solutions

Page 64: Embedded Processor Architecture

phase coupling: discussion

It is very difficult and almost impossible to develop robust and efficient DSP compilers. Current DSP practice = programming in assembler.

Solution:
1. Solve code generation for DSPs
2. Step back and rethink the architecture: develop an architecture which is still efficient but also a good model for building a compiler

Efficiency = exploit instruction level parallelism (ILP)
Compilation = systematic positioning of registers and regular interconnect
= VLIW = Very Long Instruction Word

Page 65: Embedded Processor Architecture

Will embedded CPUs and DSPs converge ?
• Converging forces
    • both include a hardware multiplier
    • trend in DSPs towards caches and RTK
    • trend in DSPs towards C/C++
    • common trend towards VLIW
• Diverging forces
    • deeply embedded code (DSP) vs. end-user SW (CPU)
    • different RTKs: SPOX, Virtuoso (DSP) vs. pSOS, WinCE (top down)

Conclusions VLIW
• good balance between hw and sw
• between efficiency (ILP) and cost
• fundamental problems: code size, interruptability