Embedded Processor Architecture
Bart Mesman, Henk Corporaal
5kk73, 2010
Processor Architectures and Program Mapping H. Corporaal and B. Mesman
[Figure: DSP implementation options along an axis running from flexibility to efficiency: programmable CPU, programmable DSP, application specific instruction set processor (ASIP), application specific processor.]
[Figure: flexibility (low, medium, high) plotted against efficiency (low, medium, high) for ASIC, GP proc, FPGA, DSP and ASIP.]
Programmable CPU cores
• introduction
• architecture of the MIPS core
  • discussed as an example
  • pipelining
• application examples
• software issues
• comparison between different CPU cores
• towards application specific architectures
• discussion
Introduction
• rationale: general-purpose -> large market
• consequence: often handcrafted design optimised for clock rate
• problem: fast changes in the IC process technology
• examples (embedded):
  • MIPS (first one, licensing the instruction set architecture)
  • ARM (Advanced RISC Machines; telecom, low power, small code size, most popular one, also licensing the micro-architecture as hard or soft IP)
  • derivatives from general purpose CPUs: Intel, NEC, Hitachi, National, PowerPC
Introduction: instruction set architectures
• implicit operands
  • stack machines (e.g. ST20)
  • accumulator machines
• explicit operands
  • general purpose registers
    • register-memory
    • register-register (= load-store)
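Architecture of the MIPS core [Hennessy&Patterson]
[Figure: MIPS datapath. The PC (clocked by Clk) sends the instruction address to the instruction memory; the instruction supplies Rd, Rs, Rt (5 bits each) and Imm (16 bits). A register file of 32 32-bit registers (ports Rw, Ra, Rb) and a data memory (data address, data in, data out) are connected by 32-bit buses.]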
MIPS instruction formats (32 bits) [Hennessy&Patterson]
R-type: op [31:26] (6 bits) | rs [25:21] (5 bits) | rt [20:16] (5 bits) | rd [15:11] (5 bits) | shamt [10:6] (5 bits) | funct [5:0] (6 bits)
I-type: op [31:26] (6 bits) | rs [25:21] (5 bits) | rt [20:16] (5 bits) | immediate [15:0] (16 bits)
J-type: op [31:26] (6 bits) | target address [25:0] (26 bits)
• op: operation of the instruction
• rs, rt, rd: source and destination registers
• shamt: shift amount
• funct: operation of the instruction, part 2
• imm: for program constants
• addr: target address of a jump
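The three formats share the same field boundaries, so a decoder can be sketched as a handful of shifts and masks. The following is a minimal sketch, not code from the slides: the type `mips_fields` and the function `mips_decode` are hypothetical names introduced here for illustration.

```c
#include <stdint.h>

/* Decoded fields of one 32-bit MIPS instruction word.
   The struct and function names are illustrative assumptions. */
typedef struct {
    uint32_t op, rs, rt, rd, shamt, funct;  /* R-type fields */
    uint32_t imm;                           /* I-type: 16-bit immediate */
    uint32_t target;                        /* J-type: 26-bit target address */
} mips_fields;

/* Extract every field; which ones are meaningful depends on the format. */
mips_fields mips_decode(uint32_t word)
{
    mips_fields f;
    f.op     = (word >> 26) & 0x3Fu;        /* bits 31..26 */
    f.rs     = (word >> 21) & 0x1Fu;        /* bits 25..21 */
    f.rt     = (word >> 16) & 0x1Fu;        /* bits 20..16 */
    f.rd     = (word >> 11) & 0x1Fu;        /* bits 15..11 */
    f.shamt  = (word >> 6)  & 0x1Fu;        /* bits 10..6  */
    f.funct  =  word        & 0x3Fu;        /* bits 5..0   */
    f.imm    =  word        & 0xFFFFu;      /* bits 15..0  */
    f.target =  word        & 0x03FFFFFFu;  /* bits 25..0  */
    return f;
}
```

For example, the word 0x00221820 encodes `add $3, $1, $2`: decoding it yields op 0, rs 1, rt 2, rd 3 and funct 0x20.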
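Example 1: R-type: add instruction [Hennessy&Patterson]
Format: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
add rd, rs, rt
• fetch the instruction: mem[PC]
• R[rd] = R[rs] + R[rt]
• PC = PC + 4
[Figure: the register file (32 32-bit registers; ports Rw, Ra, Rb driven by Rd, Rs, Rt; write enable RegWr; clock Clk) puts two operands on BusA and BusB (32 bits each); the ALU (control ALUctr) produces the 32-bit Result, which returns to the register file on BusW.]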
Critical path R-type operation [Hennessy&Patterson]
[Figure: the MIPS datapath (PC, instruction memory, register file with ports Rw, Ra, Rb, data memory, 32-bit buses), showing the critical path of an R-type operation.]
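Example 2: I-type: load word [Hennessy&Patterson]
Format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
lw rt, imm16(rs)
• fetch the instruction: mem[PC]
• addr = R[rs] + ext(imm16)
• R[rt] = mem[addr]
• PC = PC + 4
[Figure: datapath for the load. Rs drives port Ra, port Rb is a don't care (Rt), and RegDst selects Rd or Rt as the write register Rw. The Extender (ExtOp) widens Imm16 to 32 bits and ALUSrc selects it instead of BusB as the ALU input. The ALU result (ALUctr) is the address Adr of the data memory (write enable MemWr, Data In, Clk); MemtoReg selects the memory output for BusW, which is written under RegWr.]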
Example 3: I-type: branch [Hennessy&Patterson]
Format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
beq rs, rt, imm16
• fetch the instruction: mem[PC]
• cond = R[rs] - R[rt]
• if cond = 0 then PC = PC + 4 + ext(imm16)*4
• else PC = PC + 4
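The register-transfer descriptions of the three example instructions (add, lw, beq) can be written out as a small behavioural model. This is a sketch under stated assumptions: `cpu_state`, `MEM_WORDS` and the `exec_*` functions are hypothetical names, the data memory is a small word array, and the hard-wired zero register of a real MIPS is not modelled.

```c
#include <stdint.h>

#define MEM_WORDS 64  /* hypothetical data memory size, in 32-bit words */

typedef struct {
    uint32_t pc;
    uint32_t r[32];            /* 32 32-bit registers */
    uint32_t dmem[MEM_WORDS];  /* data memory, one entry per word */
} cpu_state;

/* Sign-extend the 16-bit immediate to 32 bits (the ext() of the slides). */
uint32_t sign_ext16(uint32_t imm16)
{
    return (imm16 & 0x8000u) ? (imm16 | 0xFFFF0000u) : (imm16 & 0xFFFFu);
}

/* add: R[rd] = R[rs] + R[rt]; PC = PC + 4 */
void exec_add(cpu_state *s, unsigned rd, unsigned rs, unsigned rt)
{
    s->r[rd] = s->r[rs] + s->r[rt];
    s->pc += 4;
}

/* lw: addr = R[rs] + ext(imm16); R[rt] = mem[addr]; PC = PC + 4 */
void exec_lw(cpu_state *s, unsigned rs, unsigned rt, uint32_t imm16)
{
    uint32_t addr = s->r[rs] + sign_ext16(imm16);
    s->r[rt] = s->dmem[(addr / 4) % MEM_WORDS];  /* byte address -> word index */
    s->pc += 4;
}

/* beq: PC = PC + 4 + ext(imm16)*4 if R[rs] == R[rt], else PC = PC + 4 */
void exec_beq(cpu_state *s, unsigned rs, unsigned rt, uint32_t imm16)
{
    uint32_t cond = s->r[rs] - s->r[rt];
    s->pc = s->pc + 4 + (cond == 0 ? sign_ext16(imm16) * 4 : 0u);
}
```

Unsigned arithmetic wraps modulo 2^32, so a negative branch offset such as 0xFFFE (minus 2 words) moves the PC backwards as expected.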
Example 3: I-type: branch [Hennessy&Patterson]
Format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
[Figure: datapath for the branch. The register file (ports Rw, Ra, Rb) drives BusA and BusB into the ALU (ALUctr), whose Zero flag goes with the Branch control signal and Imm16 to the next address logic; this logic updates the PC, which addresses the instruction memory. Control signals shown: RegWr, RegDst, ALUSrc, ExtOp.]
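Example 3: I-type: branch [Hennessy&Patterson]
[Figure: next address logic. The PC holds the 30-bit word address Addr<31:2>, and Addr<1:0> is fixed at "00". The inputs of the 0/1 multiplexer controlled by Branch and Zero are built from the PC, the constant "1" and SignExt(Imm16), taken from Instruction<15:0>. The instruction memory returns Instruction<31:0>.]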
Architecture of the MIPS core
• problem: long critical path, defined by the slowest instruction (load)
• solution: pipelining
  • break the instruction into smaller steps
  • all steps have about the same critical path
E.g. load, 5 stages:
cycle 1: Ifetch | cycle 2: RF read | cycle 3: ALU | cycle 4: dmem | cycle 5: RF write
Pipelining lw instructions [Hennessy&Patterson]
       cycle 1   cycle 2   cycle 3   cycle 4   cycle 5   cycle 6   cycle 7
lw     Ifetch    RF read   ALU       dmem      RF write
lw               Ifetch    RF read   ALU       dmem      RF write
lw                         Ifetch    RF read   ALU       dmem      RF write
• One instruction enters the pipeline every clock cycle
• One instruction leaves the pipeline every clock cycle
=> CPI = 1 (cycles per instruction)
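The CPI = 1 claim holds in steady state; the fill time of the pipeline adds a fixed overhead. A minimal sketch of that arithmetic, assuming an ideal pipeline without stalls (the function names are hypothetical):

```c
/* Cycles needed to run n_instr instructions through an ideal pipeline
   with n_stages stages and no stalls: n_stages cycles for the first
   instruction, then one more cycle for every further instruction. */
unsigned pipeline_cycles(unsigned n_instr, unsigned n_stages)
{
    if (n_instr == 0)
        return 0;
    return n_stages + (n_instr - 1);
}

/* Average cycles per instruction; approaches 1 as n_instr grows. */
double pipeline_cpi(unsigned n_instr, unsigned n_stages)
{
    if (n_instr == 0)
        return 0.0;
    return (double)pipeline_cycles(n_instr, n_stages) / (double)n_instr;
}
```

With three lw instructions and five stages this gives the 7 cycles of the diagram; for 1000 instructions the average is 1.004 cycles per instruction.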
Pipelining lw instructions
[Figure: six instructions, each with stages I, R, A, M, W, overlapped by one cycle; a marker indicates the current CPU cycle. Labels: Instructions, Data.]
4 stages of R-type instruction [Hennessy&Patterson]
E.g. ADD:
cycle 1: Ifetch | cycle 2: RF read | cycle 3: ALU | cycle 4: RF write
Pipelining lw and R-type instructions [Hennessy&Patterson]
       cycle 1   cycle 2   cycle 3   cycle 4   cycle 5
lw     Ifetch    RF read   ALU       dmem      RF write
add              Ifetch    RF read   ALU       RF write
Resource conflict on the write port of the register file: lw and add both write in cycle 5.
Solution: stretch R-type to 5 stages [Hennessy&Patterson]
       cycle 1   cycle 2   cycle 3   cycle 4   cycle 5   cycle 6   cycle 7
lw     Ifetch    RF read   ALU       dmem      RF write
add              Ifetch    RF read   ALU       dmem      RF write
add                        Ifetch    RF read   ALU       dmem      RF write
For an R-type instruction the dmem stage is a dummy op (noop).
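[Figure: pipelined datapath [Hennessy&Patterson] with stages Ifetch, Reg/dec, exec, mem, wr. Labels: program memory (adr), + 4, Next PC; register file (Ra, Rb, Rw, Di; Rs, Rt, Rd; RegDst, RegWr); ext. of Imm16, ExtOp, ALUSrc, ALUop, BusA, BusB, flags, branch; data memory (Din, Dout, MemWr); MemtoReg.]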
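Data dependencies: R-type instructions [Hennessy&Patterson]
R1 = ...
… = R1 + ...
… = R1 + ...
… = R1 + ...
… = R1 + ...
[Figure: the five instructions above, each drawn as IM, RF, DM, RF and shifted one cycle relative to the previous one; the first writes R1 and the next four read it.]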
Data dependencies: R-type instructions [Hennessy&Patterson]
R1 = ...
… = R1 + ...
… = R1 + ...
… = R1 + ...
… = R1 + ...
Solution: bypasses
[Figure: the same five overlapped instructions (IM, RF, DM, RF), with bypass paths delivering R1 to the dependent instructions.]
Bypasses [Hennessy&Patterson]
[Figure: pipelined datapath with bypass paths added. Labels: data memory, adr.]
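Data dependencies: load instruction [Hennessy&Patterson]
R1 = lw...
… = R1 + ...
… = R1 + ...
… = R1 + ...
[Figure: the four instructions above, each drawn as IM, RF, DM, RF and shifted one cycle relative to the previous one; the load writes R1 and the next three read it.]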
Data dependencies: load instruction [Hennessy&Patterson]
R1 = lw...
… = R1 + ...
… = R1 - ...
… = R1 - ...
A bypass is no solution for the + instruction, the first one after the load.
[Figure: the four overlapped instructions (IM, RF, DM, RF) with the dependencies on R1.]
Data dependencies: load instruction [Hennessy&Patterson]
R1 = lw...
… = R1 + ...
… = R1 - ...
… = R1 - ...
Solution: pipeline interlock = detects a data hazard and stalls the pipeline until the hazard is cleared
[Figure: the four overlapped instructions (IM, RF, DM, RF), with the instructions after the load delayed by the interlock.]
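The interlock condition can be sketched as a check on adjacent instructions. The sketch below assumes the 5-stage pipeline with full bypassing, so that only a load followed immediately by a consumer of its result needs a stall, and then a single cycle; the `instr` type and the function name are hypothetical.

```c
#include <stddef.h>

typedef struct {
    int is_load;   /* 1 for lw, 0 for an R-type instruction */
    int dest;      /* destination register number */
    int src1;      /* first source register number, -1 if unused */
    int src2;      /* second source register number, -1 if unused */
} instr;

/* Count interlock stall cycles, assuming a 5-stage pipeline with full
   bypassing: a load followed directly by an instruction that reads the
   loaded register costs one stall cycle. */
unsigned count_load_use_stalls(const instr *prog, size_t n)
{
    unsigned stalls = 0;
    for (size_t i = 1; i < n; i++) {
        const instr *prev = &prog[i - 1];
        if (prev->is_load &&
            (prog[i].src1 == prev->dest || prog[i].src2 == prev->dest)) {
            stalls++;  /* one bubble; after it the bypass is sufficient */
        }
    }
    return stalls;
}
```

For the sequence on the slide (a load into R1 followed by three users of R1) the function reports one stall: only the first consumer is too close to the load.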
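Application examples (1)
#define NTAPS 4

int fir(int in)
{
    int i;
    static int state[NTAPS + 1];   /* delay line x0..x4, newest sample at index NTAPS */
    static int coeff[NTAPS + 1];   /* coefficients c0..c4 */
    int out[NTAPS + 1];            /* running partial sums */

    state[NTAPS] = in;
    out[0] = state[0] * coeff[0];
    for (i = 1; i < NTAPS + 1; i++) {
        out[i] = out[i-1] + state[i] * coeff[i];
        state[i-1] = state[i];     /* shift the delay line by one position */
    }
    return out[NTAPS];
}
[Figure: FIR filter. A delay line of Z-1 elements holds x0 to x4; each sample is multiplied by its coefficient c0 to c4 and the products are summed into y.]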
Application examples (1)
.L100006:
    sll   $3, $2, 2          R3=R2<<2                    R3=i-1
    addu  $14, $15, $3       R14=R15+R3
    lw    $24, 0($14)        R24=load(*R14)              R24=coeff[i-1]
    addiu $12, $6, -4        R12=R6-4
    addu  $11, $12, $3       R11=R12+R3
    lw    $13, 0($11)        R13=load(*R11)              R13=state[i-1]
    nop
    mult  $24, $13           R24=R24*R13
    addu  $25, $sp, $3       R25=sp+R3
    lw    $9, -4($25)        R9=load(R25-4)              R9=out[i-1]
    addiu $2, $2, 1          R2=R2+1                     i=i+1
    mflo  $13                R13=move from low mpy reg
    addu  $10, $9, $13       R10=R9+R13                  R10=out[i]
    sw    $10, 0($25)        mem(*R25)=R10
    addu  $25, $7, $3        R25=R7+R3
    sw    $24, 0($25)        mem(*R25)=R24
    slti  $24, $2, 10
    bne   $24, $0, .L100006
    addiu $15, $7, -4
19 instructions per tap!!
Application examples (2)
Bit level operations: finite field arithmetic
temp1 = input << 1
temp2 = if (bit(input,7) == 1) then 29 else 0
out = temp1 exor temp2
On the CPU:
          r1 = LB input          Load byte
          r2 = SLL r1            Shift left logical
          r3 = ANDI r1, mask     AND immediate
          r4 = ADDI r3, -1       ADD immediate
          BNE (r4 != r0)         Branch on != to nonzero
          nop
          r5 = XORI(r1, 29)      Exclusive OR immediate
          J common               Jump
          nop
nonzero:  r5 = XOR(r1, r0)       Exclusive OR
common:   …
10 instructions!! Very simple in hardware.
[Figure: hardware version. The bits in[0] to in[7] are wired to out[0] to out[7] through three exor gates.]
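In C, the three-line description of this operation maps onto a shift, a test of bit 7 and an exclusive OR. The following sketch assumes the result is kept to 8 bits, as the in[0..7] and out[0..7] labels of the figure suggest; the function name is hypothetical.

```c
#include <stdint.h>

/* out = (input << 1) exor (29 if bit 7 of input is set, else 0).
   The result is truncated to 8 bits (assumption based on the figure). */
uint8_t gf_shift_reduce(uint8_t input)
{
    uint8_t temp1 = (uint8_t)(input << 1);
    uint8_t temp2 = (input & 0x80u) ? 29u : 0u;
    return (uint8_t)(temp1 ^ temp2);
}
```

For an input of 0x80 only the correction term remains, so the result is 29 (0x1D); for 0x01 the result is simply 0x02.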
Application examples (2)
Bit level operations: DES example
    srl  $13, $2, 20
    andi $25, $13, 1
    srl  $14, $2, 21
    andi $24, $14, 6
    or   $15, $25, $24
    srl  $13, $2, 22
    andi $14, $13, 56
    or   $25, $15, $14
    sll  $24, $25, 2
[Figure: bits 20, 22, 23, 25, 26, 27 of the source register ($2) are moved to bits 2, 3, 4, 5, 6, 7 of the destination register ($24).]
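The nine instructions translate line by line into C shifts and masks, which makes the bit mapping easier to check. This is a sketch; the function name `des_gather` is hypothetical and only the constants come from the listing.

```c
#include <stdint.h>

/* Gather bits 20, 22, 23, 25, 26, 27 of src into bits 2..7 of the result,
   using the same shifts and masks as the assembly listing. */
uint32_t des_gather(uint32_t src)
{
    uint32_t b0   = (src >> 20) & 1u;   /* bit 20           -> bit 0        */
    uint32_t b12  = (src >> 21) & 6u;   /* bits 22, 23      -> bits 1, 2    */
    uint32_t b345 = (src >> 22) & 56u;  /* bits 25, 26, 27  -> bits 3, 4, 5 */
    return (b0 | b12 | b345) << 2;      /* final position: bits 2..7        */
}
```

Bits 21 and 24 of the source are skipped by the masks, so they never reach the destination.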
Application examples (2)
Bit level operations: A5 example (GSM encryption)
    srl  $24, $5, 18
    srl  $25, $5, 17
    xor  $8, $24, $25
    srl  $9, $5, 16
    xor  $10, $8, $9
    srl  $11, $5, 13
    xor  $12, $10, $11
    andi $13, $12, 1
[Figure: bits 18, 17, 16 and 13 of $5 are combined by xor into a single bit (0 or 1) in $13.]
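The eight instructions compute the exclusive OR of four register bits. A minimal C sketch of the same computation (the function name is hypothetical; the tap positions 18, 17, 16 and 13 are taken from the listing):

```c
#include <stdint.h>

/* XOR of bits 18, 17, 16 and 13 of reg, returned as 0 or 1. */
uint32_t a5_feedback_bit(uint32_t reg)
{
    return ((reg >> 18) ^ (reg >> 17) ^ (reg >> 16) ^ (reg >> 13)) & 1u;
}
```

In hardware this is a three-gate xor tree, which is why the slides count it as a case where a general-purpose instruction set is inefficient.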
Application examples: conclusions
• CPUs offer flexibility, but…
  • not efficient in performance
  • not efficient in code size
  • not efficient in power consumption
Power consumption in microprocessors
Power consumption is (becoming) the limiting factor in processor design.
Solution in the direction of:
• hardware acceleration
• instruction level parallelism instead of clock speed
• code size efficiency
source: ISSCC 2001, Patrick Gelsinger, Intel
Amdahl’s law
• Impact of an improvement on the execution time of a program depends on 2 parameters:
  • f = fraction of the original computation time that is affected by the improvement
  • s = speedup factor (local)
• exec_time_new = exec_time_old * (1 - f) + exec_time_old * f / s
• speedup_overall = exec_time_old / exec_time_new = 1 / (1 - f + f / s)
• if s >> 1 then speedup_overall ≈ 1 / (1 - f)
• Example: 40 % of the program can be executed 10 x faster: speedup_overall = 1 / (0.6 + 0.4 / 10) = 1.56
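The formula is easy to evaluate for other values of f and s. A minimal sketch (the function name is hypothetical; f and s are the two parameters defined above):

```c
/* Overall speedup according to Amdahl's law.
   f: fraction of the original execution time that is affected (0..1)
   s: local speedup factor of that fraction (s > 0) */
double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}
```

With f = 0.4 and s = 10 this reproduces the example value of about 1.56; letting s grow without bound only raises it to about 1.67, the limit 1 / (1 - f).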
Conclusions
• Programmable CPU cores are important for the control parts of the application.
• They are well supported with tools for the development of end-user software (vs. deeply embedded sw).
• Keep it Simple heuristic (RISC vs. CISC):
  • Make frequent cases fast and rare cases correct.
  • Regular (orthogonal) instruction set.
  • No special features that match a high level language construct.
  • At least 16 registers to ease register allocation.
• Embedded cores are often light cores, a compromise between performance, area and power dissipation (vs. stand-alone CPU cores, which are optimised for performance).
Programmable Digital Signal Processors
• real-time worst-case processing = need for more compute power
  sec/prog = (instr/prog) * (cycles/instr) * (sec/cycle)
  • CPI = 1
  • instruction level parallelism (ILP)
  • hardware support for loop control
• attention for high level data types, e.g. arrays, delay lines (vs. scalars for CPUs)
• difficult to compare architectures
  • e.g. DIT, DIF, radix 2/4, FFT loop unrolling, scaling, shuffling, initialisation … can be included or forgotten
  • benchmarking (Berkeley Design Technology Inc (BDTi)) (compare to SpecInt benchmarks for CPUs)
Outline
• architectures for programmable DSPs
  • multiplier-accumulator
  • modified Harvard architecture
  • extension with an ALU (decision making)
  • controller architectures
• examples: TI, Motorola, Philips
• code generation
• recent developments: VLIW (Very Long Instruction Word); examples: C6 and TM
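Sum of products (sum over i of c(i) * x(i)) = basic operation for correlation, filtering, spectral analysis ... (linear expr.)
Goal = 1 cycle per iteration
[Figure: multiplier-accumulator. c(i) and x(i) enter the multiplier MPY (Booth, Wallace..); the product passes through the product register PR (clock P_reg) to the ADDER, which accumulates into the accumulator register ACR; a control input steers the unit.]
Modifications:
• extra inputs/outputs
• position ACR (1 or 2)
• adder/subtractor
• extra pipelines
• asymmetric inputs
• multi-precision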
DSP data types
• not every signal requires 32 bits
• 2 types of DSP: floating point and integer
• advantage FP: most specs are in FP (conversion to int is time consuming since the behaviour may change)
• disadvantage FP: cost (area, speed, power)
• wanted: type of output of an operation = type of input (because both are stored in RAM)
  • no problem for FP, but a problem for integer
  • integer multiplication doubles the number of bits: n * n => 2n
• What about fractional numbers? 0.9 x 0.9 = 0.81
DSP data types
• integer and fractional numbers are a special case of fixed point: fix <p,q> (ART designer & SystemC), with p bits in total, of which q are fractional
• if q = 0 then integer, e.g. int <8,0>
• if q = p-1 then fractional, e.g. int <8,7>
Example: fix <8,3>, scale factor 1/8, 2's complement (the msb has a negative weight):
weight:  -2^4   2^3   2^2   2^1   2^0   2^-1   2^-2   2^-3
bit:      1     1     1     0     1     1      0      1
value:   -19/8 = -2.375
Figure labels: p, q, quantization error.
The same ALU handles fix <8,1>, fix <8,2>, fix <8,3>, ...
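The interpretation of a fix <p,q> bit pattern can be sketched as a two's complement conversion followed by the scale factor 1 / 2^q. This is a minimal sketch assuming p counts all bits and q the fractional ones, as in the fix <8,3> example; the function name is hypothetical.

```c
#include <stdint.h>

/* Value of the low p bits of raw, read as two's complement fix<p,q>
   (p bits in total, q of them fractional; 1 <= p <= 32 and q < 63 assumed). */
double fix_to_double(uint32_t raw, unsigned p, unsigned q)
{
    uint32_t mask = (p >= 32) ? 0xFFFFFFFFu : ((1u << p) - 1u);
    int64_t v = (int64_t)(raw & mask);
    if ((v >> (p - 1)) & 1)            /* msb set: its weight is negative */
        v -= (int64_t)1 << p;
    return (double)v / (double)((uint64_t)1 << q);  /* scale factor 1 / 2^q */
}
```

The pattern 1110 1101 (0xED) read as fix <8,3> gives -2.375; the same bits read as an integer (q = 0) give -19, which illustrates why one ALU can serve all positions of the binary point.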
DSP data types
After multiplication:
• continue with msb only
  • represents the limit of the accuracy of the result (can not be larger than the accuracy of the inputs)
• more efficient solution: continue with msb + lsb
  • sum-of-product operations generate accumulative noise at the 32nd vs. the 16th bit
• still overflow for addition = overflow bits
• double precision accumulator + extra overflow bits + shift, round, truncate unit
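The accumulator scheme described above can be sketched for 16-bit fractional data. The sketch assumes the fix <16,15> (Q15) format and uses a 64-bit integer to stand in for the double precision accumulator with overflow bits; rounding and saturation happen once, at the end. The function name and the choice of Q15 are assumptions, not taken from the slides.

```c
#include <stdint.h>
#include <stddef.h>

/* Sum of products of Q15 fractions with a wide accumulator.
   Assumes arithmetic right shift of negative values, which is what
   mainstream compilers provide. */
int16_t q15_dot(const int16_t *c, const int16_t *x, size_t n)
{
    int64_t acc = 0;                    /* 32-bit products plus overflow bits */
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)c[i] * (int32_t)x[i];   /* 16 x 16 -> 32 bits (Q30) */

    acc += 1 << 14;                     /* round to nearest */
    acc >>= 15;                         /* shift from Q30 back to Q15 */
    if (acc >  32767) acc =  32767;     /* saturate instead of wrapping */
    if (acc < -32768) acc = -32768;
    return (int16_t)acc;
}
```

Multiplying 0.5 by 0.5 (0x4000 by 0x4000) returns 0x2000, that is 0.25; a sum that exceeds the representable range is clipped to 32767.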
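[Figure: multiplier-accumulator extended with a SHIFT, ROUND, TRUNCATE unit. c(i) and x(i) enter the multiplier MPY (Booth, Wallace..), followed by the product register PR, the ADDER and the accumulator register ACR; clock P_reg and control inputs are shown.]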
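Goal = 1 cycle per iteration of the sum of c(i) * x(i)
[Figure: three memory organisations around the execution unit (EXU):]
• Von Neumann (sequential): one program/data memory
• Harvard: separate program memory and data memory
• Modified Harvard: program memory, data memory 1 and data memory 2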
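[Figure: DSP architecture. Data path: RAM_A and RAM_B, each with its own ACU (ACU_A, ACU_B), address register (AR_A, AR_B) and data register (DR_A, DR_B), feeding the MAC; register file (Rfile). Controller: PC with +1, interrupt address, stack, reset, program memory, instruction register IR, control bus.]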
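[Figure: FIR filter. A delay line of Z-1 elements holds x1 to x5; each sample is multiplied by its coefficient c1 to c5 and the products ci * xi are summed into y. The time loop runs over the samples, the filter loop over the taps i.]
How to update the delay line?
1 cycle/tap?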
Solution 2: indirect addressing
• use of a pointer to mark the beginning of the delay line
• update the pointer instead of moving the data
• problem: trashing of the whole memory
• solution: modulo addressing
• need for a register to store the pointer
Memory location   output sample 1   output sample 2   output sample 3   output sample 4   output sample 5
1                 x1                                                                      x9
2                 x2                x2
3                 x3                x3                x3
4                 x4                x4                x4                x4
5                 x5                x5                x5                x5                x5
6                                   x6                x6                x6                x6
7                                                     x7                x7                x7
8                                                                       x8                x8
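The pointer-based delay line with modulo addressing can be sketched as a circular buffer. The sketch assumes a buffer of 8 locations and 5 taps, as in the table, and a power-of-two size so that the modulo reduces to a mask; all names are hypothetical.

```c
#include <stdint.h>

#define DELAY_LEN  8                 /* buffer size, assumed to be 2^k */
#define DELAY_MASK (DELAY_LEN - 1)
#define NUM_TAPS   5

typedef struct {
    int32_t  line[DELAY_LEN];   /* delay line; samples are never moved */
    unsigned head;              /* pointer to the newest sample */
} delay_line;

/* Store one input sample and return the sum of coeff[k] * x[n-k]. */
int32_t fir_circular(delay_line *d, const int32_t coeff[NUM_TAPS], int32_t in)
{
    d->head = (d->head + 1) & DELAY_MASK;   /* update the pointer, modulo */
    d->line[d->head] = in;                  /* overwrite the oldest location */

    int32_t acc = 0;
    for (unsigned k = 0; k < NUM_TAPS; k++)
        acc += coeff[k] * d->line[(d->head - k) & DELAY_MASK];
    return acc;
}
```

Feeding a single 1 followed by zeros returns the coefficients one by one, which is a quick way to confirm that the pointer update replaces the data moves of the first FIR version.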
ACU architecture and instruction set
[Figure: ACU with registers A and S, a modulo unit and an output to RAM.]
Instruction   Output to RAM   reg A   reg S
Read_A        A               A       S
Read_S        S               A       S
incA          A+1             A+1     S
decA          A-1             A-1     S
Step          A+S             A+S     S
Inc_step      S+1             A       S+1
Modulo can be implemented as a mask operation if the size is 2^k.
Example: 16 = 10 000, 23 = 10 111; mask = hold
Addressing modes
• register     ADD R4, R3         R[R4] = R[R4] + R[R3]
• immediate    ADD R4, #3         R[R4] = R[R4] + #3
• direct       ADD R4, (100)      R[R4] = R[R4] + Mem[100]
• indirect     ADD R4, (R3)       R[R4] = R[R4] + Mem[R[R3]]
• w. inc/dec   ADD R4, (R3)±      R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± 1
• indexed      ADD R4, (R3±R2)    R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± R[R2]
Remarks
• direct = for static data
• indirect = for arrays
• inc/dec = for stepping through arrays, e.g. x(n)
• index = for stepping through arrays, e.g. x(2n)
Addressing modes: extra for DSP
• 8 ARs (address or auxiliary registers) available
• extra indirect modes:
  • circular      *ARn ± %        post inc/dec by 1, circular
  • circular      *ARn ± AR0 %    post inc/dec by AR0, circular
  • bit reverse   *ARn ± AR0 B    post inc/dec by AR0, bit reversed
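Bit-reversed addressing exists to produce the shuffled index order of an FFT without spending instructions on it. As a software illustration of the index sequence such a mode generates, the sketch below reverses the low bits of an index; it is not a model of the hardware address arithmetic itself, and the function name is hypothetical.

```c
/* Reverse the lowest `bits` bits of index, e.g. 001 -> 100 for bits = 3. */
unsigned bit_reverse(unsigned index, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (index & 1u);   /* move the lsb of index into r */
        index >>= 1;
    }
    return r;
}
```

For an 8-point FFT (3 index bits) the sequence 0, 1, 2, 3, ... maps to 0, 4, 2, 6, ..., so a loop over consecutive counters can address the shuffled data directly.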
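[Figure: DSP architecture extended with an ALU. Controller: PC with +1, interrupt address, stack, reset, program memory, instruction register IR, control bus. Data path: ACU_A, AR_A, RAM_A, DR_A and ACU_B, AR_B, RAM_B, DR_B feeding the MAC and the ALU; register file (Rfile).]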
Sum of c(i) * x(i): first solution
Resources (columns) against time in clock cycles (rows):
LABEL | ALU | MPY-ACC | RAM | ACU
 | | acc = 0 | | init (i=0)
 | init counter | | |
loop | | | | incr (=i+1)
 | | | read x(i) |
 | | acc(i)=acc(i-1)+x(i)*c(i) | |
 | dec counter | | |
 | branch to loop if counter > 0 | | |
 | nop | | |
6 clock cycles/sample; limit: pipelines in the controller
Not shown: coefficient RAM + ACU
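Loopfolding (software pipelining)
Original loop:
for i = 0 to n
    b(i) = f(a(i))
    c(i) = g(b(i))
    d(i) = h(c(i))
[Figure: iterations 0, 1 and 2 of the original loop, each a chain a -> f -> b -> g -> c -> h -> d.]
Folded loop:
for i = 2 to n
    b(i) = f(a(i))
    c(i-1) = g(b(i-1))
    d(i-2) = h(c(i-2))
[Figure: in the folded loop body h, g and f work side by side on c(i-2), b(i-1) and a(i).]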
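The folded loop only covers the steady state, so it needs a preamble that fills the pipeline and a postamble that drains it. The sketch below writes both versions in C and lets them be compared; the trip count `N` and the bodies of the three stage functions are placeholders chosen for illustration.

```c
#define N 8   /* hypothetical trip count: i runs from 0 to N, N >= 2 */

int stage_f(int v) { return v + 1; }   /* placeholder for f */
int stage_g(int v) { return v * 2; }   /* placeholder for g */
int stage_h(int v) { return v - 3; }   /* placeholder for h */

/* Original loop: the three operations of one iteration form a chain. */
void loop_plain(const int a[N + 1], int d[N + 1])
{
    for (int i = 0; i <= N; i++) {
        int b = stage_f(a[i]);
        int c = stage_g(b);
        d[i] = stage_h(c);
    }
}

/* Folded loop: each pass works on three different iterations, so the
   three operations are independent and could be issued in parallel. */
void loop_folded(const int a[N + 1], int d[N + 1])
{
    int b[N + 1], c[N + 1];

    /* preamble: fill the pipeline */
    b[0] = stage_f(a[0]);
    b[1] = stage_f(a[1]);
    c[0] = stage_g(b[0]);

    for (int i = 2; i <= N; i++) {
        b[i]     = stage_f(a[i]);
        c[i - 1] = stage_g(b[i - 1]);
        d[i - 2] = stage_h(c[i - 2]);
    }

    /* postamble: drain the pipeline */
    c[N]     = stage_g(b[N]);
    d[N - 1] = stage_h(c[N - 1]);
    d[N]     = stage_h(c[N]);
}
```

Both functions produce the same d for any input; the folded version trades extra code (preamble and postamble) for a loop body without internal dependencies.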
Loopfolding (software pipelining) for the sum of c(i) * x(i)
LABEL | ALU | MPY-ACC | RAM | ACU
 | | acc(i-1)=0 | | init (i=1)
 | init counter | | read x(i) | inc (=i+1)
loop | | acc(i)=acc(i-1)+x(i)*c(i) | read x(i+1) | incr (=i+2)
 | dec counter | | |
 | branch to loop if counter > 0 | | |
 | nop | | |
 | | acc(n-1)=acc(n-2)+x(n-1)*c(n-1) | read x(n) |
 | | acc(n)=acc(n-1)+x(n)*c(n) | |
Pre- and postamble; 4 clock cycles/sample
Hardware support for loop control, for the sum of c(i) * x(i)
LABEL | ALU | MPY-ACC | RAM | ACU
 | | acc(i-1)=0 | | init (i=1)
 | init counter | | read x(i) | inc (=i+1)
repeat n-2 | | acc(i)=acc(i-1)+x(i)*c(i) | read x(i+1) | incr (=i+2)
 | | acc(n-1)=acc(n-2)+x(n-1)*c(n-1) | read x(n) |
 | | acc(n)=acc(n-1)+x(n)*c(n) | |
1 clock cycle/sample; repeat instruction and repeat block
TMS320C5000
[Figure: TMS320C5000 data path. T register; 17*17 multiplier with sign control on its inputs and a fractional MUX; 40-bit adder with ZERO, SAT and ROUND; 40-bit ALU; accumulators A(40) and B(40); barrel shifter; MSW/LSW select; COMP, TRN, TC; input multiplexers selecting among T, A, B, C, D and P; buses C, D, E.]
Motorola 56K family
[Figure: block diagram. X memory and Y memory (each 256-by-24-bit RAM and 256-by-24-bit ROM); 2,048-by-24-bit program memory ROM; address ALU with X, Y and P address buses (16 bits) and an external address switch; X, Y, P and global data buses (24 bits) with internal and external data-bus switches; data ALU with a 24-by-24 bit multiplier-accumulator producing a 56 bit result; program controller; on-chip peripherals, host, synchronous serial interface, serial communications interface, programmed I/O, bus control. Further labels: 2 bits, clock, 3 bits, interrupt, 24 bits, I/O ports, 7 bits.]
R.E.A.L.
[Figure: block diagram. X data memory and Y data memory, each on a 16 bit bus, with two address computation units; program memory (Z data) on a 16-bit bus with program control unit and instruction decoder (96-bit instructions); buses for X, Y and Z data; two 16-by-16 bit multipliers (inputs X, Y0, Y1; outputs P0, P1) with scaling; two 40 bit arithmetic-logic units with saturation; four 40 bit accumulators; saturation/scale and shift units.]
[Figure: compiler flow from source to code.]
• Front end: lexical analysis, syntax analysis, semantic analysis
• Intermediate machine independent representation
• Code generation: code selection, register allocation, scheduling
  • scheduling decides: 1 instr = // ops, order of instr
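Intermediate machine independent representation
t1 := a * b
t2 := c + d
t3 := t1 + c
out := t2 * t3
[Figure: data flow graph of the four statements above (inputs a, b, c, d; intermediate values t1, t2, t3), next to a control flow graph of basic blocks BBi, BBj, BBk.]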
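Code selection example: ADSP [Analog Devices]
[Figure: ADSP data path. ALU (+, -) with input registers ax, ay and result registers ar, af; MAC (+, -, *) with input registers mx, my and result registers mr, mf; x and y operand inputs; d memory and p memory.]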
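Example of code selection = covering of intermediate representation with RTPs
[Figure: the data flow graph with inputs a, b, c, d and intermediate values t1, t2, t3, covered by three numbered patterns (1:, 2:, 3:).]
Register transfers shown:
mx := dmem, my := pmem, ax := dmem, ay := pmem, mr := dmem
ar := ax + ay
my := ar
mr := mr * my
mr := mr + (mx * my)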
Problems
• local decisions which have a global impact
• phase coupling: example
  • asap schedule
  • maximal freedom for scheduling
  • code selection during scheduling
  • register allocation comes afterwards
  • can lead to infeasible solutions
Phase coupling: discussion
It is very difficult, almost impossible, to develop robust and efficient DSP compilers. Current DSP practice = programming in assembler.
Solution:
1. Solve code generation for DSPs
2. Step back and rethink the architecture: develop an architecture which is still efficient but also a good model for building a compiler
Efficiency = exploit instruction level parallelism (ILP)
Compilation = systematic positioning of registers and regular interconnect
= VLIW = Very Long Instruction Word
Will embedded CPUs and DSPs converge?
• Converging forces
  • both include a hardware multiplier
  • trend in DSPs towards caches and RTK
  • trend in DSPs towards C/C++
  • common trend towards VLIW
• Diverging forces
  • deeply embedded code (DSP) vs. end-user SW (CPU)
  • different RTKs: SPOX, Virtuoso (DSP) vs. pSOS, WinCE (top down)
Conclusions VLIW
• good balance between hw and sw
• between efficiency (ILP) and cost
• fundamental problems: code size, interruptability