Chapter 5: ARM Organization and Implementation
Source: http://www.ece.uah.edu/~milenka 2
ARM organization
Register file: 2 read ports, 1 write port, plus 1 read port and 1 write port reserved for r15 (pc)
Barrel shifter: shifts or rotates one operand by any number of bits
ALU: performs the required arithmetic and logic functions
Memory address register + incrementer
Memory data registers
Instruction decoder and associated control logic
[Figure: ARM datapath block diagram: address register, incrementer, register bank (with PC), multiply, barrel shifter, ALU, data in register, data out register, instruction decode & control; buses A[31:0], D[31:0], A bus, B bus, ALU bus]

Three-stage pipeline
Fetch: the instruction is fetched from memory and placed in the instruction pipeline
Decode: the instruction is decoded and the datapath control signals are prepared for the next cycle; in this stage the instruction owns the decode logic but not the datapath
Execute: the instruction owns the datapath; the register bank is read, an operand is shifted, and the ALU result is generated and written back into a destination register

ARM single-cycle instruction pipeline
[Figure: pipeline timing for instructions 1, 2 and 3; each passes through fetch, decode and execute, starting one cycle after the previous instruction]

ARM single-cycle instruction pipeline
(time runs left to right, one column per clock cycle)
1  add r0,r1,#5   fetch   decode  execute add
2  sub r2,r3,r6           fetch   decode       execute sub
3  cmp r2,#3                      fetch        decode       execute cmp

ARM multi-cycle instruction pipeline
[Figure: pipeline timing for the sequence ADD, STR, ADD, ADD, ADD (instructions 1 to 5); each ADD goes through fetch, decode, execute, while the STR goes through fetch, decode, calc. addr., data xfer]
Decode logic is always generating the control signals for the datapath to use in the next cycle

ARM multi-cycle LDMIA (load multiple) instruction
ldmia r0,{r2,r3}:  fetch, decode, ex ld r2, ex ld r3
sub r2,r3,r6:      fetch, (wait), decode, ex sub
cmp r2,#3:         fetch, decode, ex cmp
Decode stage occupied since ldmia must continue to remember the decoded instruction
sub is fetched at the normal time but not decoded until LDMIA is finishing
Instruction delayed

Control stalls: due to branches
Branches often introduce stalls (branch penalty)
Stall time may depend on whether the branch is taken
May have to squash instructions that have already started executing
Don't know what to fetch until the condition is evaluated

ARM pipelined branch
bne foo:            fetch, decode, ex bne, ex bne, ex bne
sub r2,r3,r6:       fetch, decode
foo add r0,r1,r2:   fetch, decode, ex add
Decision not made until the third clock cycle
Two cycles of work are thrown away if the bne is taken

Pipeline: how it works
All instructions occupy the datapath for one or more adjacent cycles
For each cycle that an instruction occupies the datapath, it occupies the decode logic in the immediately preceding cycle
During the first datapath cycle, each instruction issues a fetch for the next instruction but one
Branch instructions flush and refill the instruction pipeline

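The first three rules above are enough to derive a timing diagram for straight-line code. The following Python sketch is one possible way to do that; the function name `schedule` and the input format (a list of name and datapath-cycle-count pairs) are assumptions made for illustration, and branches and memory wait states are deliberately ignored.

```python
from typing import Dict, List, Tuple


def schedule(program: List[Tuple[str, int]]) -> List[Dict[str, object]]:
    """Return fetch, decode and datapath cycles (1-based) per instruction.

    program: list of (name, number of datapath cycles).
    Assumes straight-line code and no memory wait states.
    """
    rows: List[Dict[str, object]] = []
    next_free = 3  # first instruction: fetch in 1, decode in 2, datapath from 3
    for name, n_cycles in program:
        # Rule 1: the datapath cycles of an instruction are adjacent.
        datapath = list(range(next_free, next_free + n_cycles))
        # Rule 2: decode logic is occupied in the cycle before each datapath cycle.
        decode = [c - 1 for c in datapath]
        rows.append({"name": name, "decode": decode, "datapath": datapath})
        next_free += n_cycles
    for i, row in enumerate(rows):
        # Rule 3: instruction i is fetched during the first datapath cycle
        # of instruction i-2; the first two fetches fill the pipeline.
        row["fetch"] = i + 1 if i < 2 else rows[i - 2]["datapath"][0]
    return rows


if __name__ == "__main__":
    for r in schedule([("ADD", 1), ("STR", 2), ("ADD", 1), ("ADD", 1), ("ADD", 1)]):
        print(r["name"], "fetch", r["fetch"], "decode", r["decode"], "datapath", r["datapath"])
```

With a two-cycle STR in second position, the sketch places the third instruction's decode one cycle later than in the single-cycle case, which is the delay the multi-cycle slides illustrate.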
ARM9TDMI 5-stage pipeline
[Figure: ARM9TDMI pipeline organization: fetch (I-cache, next pc, +4), instruction decode (I decode, register read, immediate fields), execute (reg shift, shift, mul, ALU, forwarding paths, pre-index/post-index), buffer/data (D-cache, load/store address, byte repl., rot/sgn ex), write-back (register write); pc sources B, BL, MOV pc, SUBS pc, LDR pc]
Fetch: the instruction is fetched from memory (I-cache)
Decode: the instruction is decoded and register operands are read (3 read ports)
Execute: an operand is shifted and the ALU result is generated, or an address is computed
Buffer/data: data memory is accessed (load, store)
Write-back: the result is written to the register file

ARM9TDMI Data Forwarding
[Figure: ARM9TDMI pipeline organization with forwarding paths]
Data forwarding (result used by the very next instruction):
  ADD r3, r2, r1, LSL #3     ; r3 := r2 + 8 x r1
  ADD r5, r5, r3, LSL r2     ; r5 := r5 + 2^r2 x r3
Data forwarding (one independent instruction in between):
  ADD r3, r2, r1, LSL #3     ; r3 := r2 + 8 x r1
  ADD r8, r9, r10            ; r8 := r9 + r10
  ADD r5, r5, r3, LSL r2     ; r5 := r5 + 2^r2 x r3
Stall?
  LDR r3, [r2]               ; r3 := mem[r2]
  ADD r1, r2, r3             ; r1 := r2 + r3

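The sequences above differ in where the dependent instruction gets its source operand from. As a rough illustration, the Python sketch below flags which adjacent pairs would need a stall. It assumes a generic 5-stage pipeline in which ALU results are forwarded at no cost and a load followed immediately by a dependent instruction costs one stall cycle; that cost model is an assumption for illustration, not ARM9TDMI timing data, and `Instr` and `stall_cycles` are hypothetical names.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    text: str                 # assembly text, for display only
    dest: str                 # destination register
    sources: Tuple[str, ...]  # registers read by the instruction
    is_load: bool = False     # True if the result comes from data memory


def stall_cycles(program: List[Instr]) -> List[int]:
    """Stall cycles inserted before each instruction after the first.

    Assumed model: an ALU result can be forwarded to the next instruction,
    but a loaded value is only available one cycle later.
    """
    stalls = []
    for prev, cur in zip(program, program[1:]):
        depends = prev.dest in cur.sources
        stalls.append(1 if (prev.is_load and depends) else 0)
    return stalls


alu_pair = [
    Instr("ADD r3, r2, r1, LSL #3", "r3", ("r2", "r1")),
    Instr("ADD r5, r5, r3, LSL r2", "r5", ("r5", "r3", "r2")),
]
load_pair = [
    Instr("LDR r3, [r2]", "r3", ("r2",), is_load=True),
    Instr("ADD r1, r2, r3", "r1", ("r2", "r3")),
]

if __name__ == "__main__":
    print(stall_cycles(alu_pair))   # forwarding covers the dependency
    print(stall_cycles(load_pair))  # load-use pair needs a stall in this model
```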
ARM9TDMI PC generation
[Figure: ARM9TDMI pipeline organization, showing next-pc generation (pc + 4, pc + 8, r15, mux) and the pc sources B, BL, MOV pc, SUBS pc, LDR pc]
3-stage pipeline PC behavior: operands are read in the execute stage, so r15 = PC + 8
5-stage pipeline: operands are read in the decode stage, so r15 = PC + 4?
Incompatibilities between 3-stage and 5-stage implementations => unacceptable
To avoid this, 5-stage pipeline ARMs emulate the behavior of the older 3-stage designs

Data processing instruction datapath activity
[Figure: datapath activity for (a) register-register operations (Rn on the A bus, Rm through the shifter, result to Rd) and (b) register-immediate operations (immediate taken from instruction bits [7:0])]
Reg-Reg: Rd = Rn op Rm; r15 = AR + 4; AR = AR + 4
Reg-Imm: Rd = Rn op Imm; r15 = AR + 4; AR = AR + 4

STR (store register) datapath activity
[Figure: (a) 1st cycle: compute address (Rn plus offset from instruction bits [11:0], lsl #0, ALU = A / A + B / A - B); (b) 2nd cycle: store data and auto-index (Rd to data out, byte?, ALU = A + B / A - B)]
Compute address: AR = Rn op Disp; r15 = AR + 4
Store data: AR = PC; mem[AR] = Rd<x:y>; if auto-indexing => Rn = Rn +/- 4

The first two (of three) cycles of a branch instruction
[Figure: (a) 1st cycle: compute branch target (offset from instruction bits [23:0], lsl #2, ALU = A + B); (b) 2nd cycle: save return address (PC to R14, ALU = A)]
Compute target address: AR = PC + Disp, lsl #2
Save return address (if required): r14 = PC; AR = AR + 4
Third cycle: make a small correction to the value stored in the link register so that it points directly at the instruction which follows the branch

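The address arithmetic above can be checked with a few lines of Python. This is a sketch under the usual ARM conventions: the 24-bit displacement is sign-extended and shifted left by two, and r15 reads as the branch instruction's address plus 8. The helper names `sign_extend`, `branch_target` and `link_value` are hypothetical.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def branch_target(instr_addr: int, offset24: int) -> int:
    """1st cycle: AR = PC + (Disp lsl #2), with PC read as instr_addr + 8."""
    pc = instr_addr + 8
    return (pc + (sign_extend(offset24, 24) << 2)) & 0xFFFFFFFF


def link_value(instr_addr: int) -> int:
    """2nd and 3rd cycles: r14 = PC, then corrected by -4."""
    r14 = instr_addr + 8  # value of PC copied in the 2nd cycle
    return r14 - 4        # 3rd-cycle correction: address of the next instruction


if __name__ == "__main__":
    print(hex(branch_target(0x8000, 0xFFFFFE)))  # offset -2 words: branch to itself
    print(hex(link_value(0x8000)))
```

The correction in `link_value` exists because the value copied from r15 is two instructions ahead of the branch, while the return address must be only one instruction ahead.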
ARM Implementation
Datapath
Control unit (FSM)

2-phase non-overlapping clock scheme
Most ARMs do not operate on edge-sensitive registers
Instead, the design is based around 2-phase non-overlapping clocks which are generated internally from a single clock signal
Data movement is controlled by passing the data alternately through latches which are open during phase 1 and latches which are open during phase 2
[Figure: one clock cycle divided into phase 1 and phase 2]

ARM datapath timing
Register read
  Register read buses are dynamic and precharged during phase 2
  During phase 1, the selected registers discharge the read buses, which become valid early in phase 1
Shift operation
  The second operand passes through the barrel shifter
ALU operation
  The ALU has input latches which are open in phase 1, allowing the operands to begin combining in the ALU as soon as they are valid; the latches close at the end of phase 1 so that the phase 2 precharge does not get through to the ALU
  The ALU processes the operands during phase 2, producing the valid output towards the end of the phase
  The result is latched in the destination register at the end of phase 2

ARM datapath timing (cont’d)
[Figure: timing diagram over phase 1 and phase 2 showing register read time, read bus valid, shift time, shift out valid, ALU operands latched, ALU time, ALU out, register write time, and the precharge that invalidates the buses]
Minimum datapath delay = register read time + shifter delay + ALU delay + register write set-up time + phase 2 to phase 1 non-overlap time

The original ARM1 ripple-carry adder
Carry logic: uses a CMOS AOI (And-Or-Invert) gate
Even bits use the circuit shown below
Odd bits use the dual circuit, with inverted inputs and outputs and the AND and OR gates swapped around
Worst-case path: 32 gates long
[Figure: one bit of the ripple-carry adder, with inputs A, B, Cin and outputs sum, Cout]

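The 32-gate worst case follows from the carry having to pass through every bit position in turn. A minimal behavioral sketch in Python (a functional model only, not the AOI gate circuit; the names `full_adder` and `ripple_add` are chosen for illustration):

```python
from typing import Tuple


def full_adder(a: int, b: int, cin: int) -> Tuple[int, int]:
    """One bit position: returns (sum, carry-out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a | b))
    return s, cout


def ripple_add(a: int, b: int, cin: int = 0, width: int = 32) -> Tuple[int, int]:
    """Add two width-bit values; the carry ripples from bit 0 upwards."""
    total = 0
    carry = cin
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


if __name__ == "__main__":
    print(ripple_add(0xFFFFFFFF, 1))  # carry travels through all 32 bit positions
```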
ARM2 4-bit carry look-ahead scheme
Carry Generate (G)
Carry Propagate (P)
Cout[3] = Cin[0].P + G
Use AOI and alternate AND/OR gates
Worst case: 8 gates long
[Figure: 4-bit adder logic with inputs A[3:0], B[3:0], Cin[0] and outputs sum[3:0], P, G, Cout[3]]

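The group signals can be modeled directly from the equation Cout[3] = Cin[0].P + G. The Python sketch below is a behavioral illustration, not the gate-level circuit: it uses the inclusive-OR form of propagate, lets ordinary integer addition stand in for the 4-bit adder logic, and the names `cla4` and `cla_add32` are hypothetical. Chaining eight 4-bit groups gives the eight-stage carry path mentioned above.

```python
from typing import List, Tuple


def bits(x: int, n: int = 4) -> List[int]:
    """Low n bits of x, least significant first."""
    return [(x >> i) & 1 for i in range(n)]


def cla4(a: int, b: int, cin: int) -> Tuple[int, int]:
    """4-bit group: returns (sum[3:0], Cout[3]) using group G and P."""
    g = bits(a & b)  # per-bit generate
    p = bits(a | b)  # per-bit propagate (inclusive-OR form)
    group_p = p[3] & p[2] & p[1] & p[0]
    group_g = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
               | (p[3] & p[2] & p[1] & g[0]))
    cout = group_g | (group_p & cin)  # Cout[3] = Cin[0].P + G
    total = (a + b + cin) & 0xF       # stands in for the 4-bit adder logic
    return total, cout


def cla_add32(a: int, b: int, cin: int = 0) -> Tuple[int, int]:
    """32-bit add built from eight 4-bit groups; the carry crosses 8 stages."""
    total = 0
    carry = cin
    for k in range(8):
        s, carry = cla4((a >> (4 * k)) & 0xF, (b >> (4 * k)) & 0xF, carry)
        total |= s << (4 * k)
    return total, carry
```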
The ARM2 ALU logic for one result bit
ALU functions:
  data operations (add, sub, ...)
  address computations for memory accesses
  branch target computations
  bit-wise logical operations
  ...
[Figure: ALU logic for one result bit: function select lines fs5 to fs0, inputs NA bus and NB bus, carry logic with G and P, output to the ALU bus]

ARM2 ALU function codes
fs5  fs4  fs3  fs2  fs1  fs0   ALU output
 0    0    0    1    0    0    A and B
 0    0    1    0    0    0    A and not B
 0    0    1    0    0    1    A xor B
 0    1    1    0    0    1    A plus not B plus carry
 0    1    0    1    1    0    A plus B plus carry
 1    1    0    1    1    0    not A plus B plus carry
 0    0    0    0    0    0    A
 0    0    0    0    0    1    A or B
 0    0    0    1    0    1    B
 0    0    1    0    1    0    not B
 0    0    1    1    0    0    zero

The ARM6 carry-select adder scheme
Compute sums of various fields of the word for carry-in of zero and carry-in of one
Final result is selected by using the correct carry-in value to control a multiplexor
[Figure: carry-select adder: 4-bit adders on a,b[3:0] up to a,b[31:28] each produce s and s+1; multiplexors select sum[3:0], sum[7:4], sum[15:8] and sum[31:16]]
Worst case: O(log2[word width]) gates long
Note: Be careful! Fan-out on some of these gates is high, so direct comparison with previous schemes is not applicable.

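The selection principle can be sketched in Python as follows. This is a simplified behavioral model: it uses the field boundaries from the figure (sum[3:0], sum[7:4], sum[15:8], sum[31:16]) but selects the fields in a simple chain rather than reproducing the tree of multiplexors; `FIELDS` and `carry_select_add` are illustrative names.

```python
from typing import Tuple

# (lowest bit, width) of each field: sum[3:0], sum[7:4], sum[15:8], sum[31:16]
FIELDS = [(0, 4), (4, 4), (8, 8), (16, 16)]


def carry_select_add(a: int, b: int, cin: int = 0) -> Tuple[int, int]:
    """32-bit add: each field is summed for carry-in 0 and carry-in 1,
    then the real carry-in picks one of the two precomputed results."""
    total = 0
    carry = cin
    for low, width in FIELDS:
        mask = (1 << width) - 1
        fa = (a >> low) & mask
        fb = (b >> low) & mask
        s0 = fa + fb      # result assuming carry-in = 0
        s1 = fa + fb + 1  # result assuming carry-in = 1
        chosen = s1 if carry else s0  # the multiplexor
        total |= (chosen & mask) << low
        carry = chosen >> width
    return total, carry
```

In hardware both candidate sums for every field are computed in parallel, so only the multiplexor delays depend on the incoming carry.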
The ARM6 ALU organization
Not easy to merge the arithmetic and logic functions => a separate logic unit runs in parallel with the adder, and a multiplexor selects the output
[Figure: ARM6 ALU: A and B operand latches, XOR gates (invert A, invert B), adder with C in, logic functions, result mux (logic/arithmetic select, function), zero detect; outputs result and flags N, Z, C, V]

ARM9 carry arbitration encoding
Carry arbitration adder
A B C u v
0 0 0 0 0
0 1 unknown 1 0
1 0 unknown 1 0
1 1 1 1 1
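Reading the table row by row, u equals A OR B and v equals A AND B, and the carry C is known exactly when u and v agree. A small Python sketch of that reading (the function names are illustrative; the comments interpreting u and v as the carry-out for a carry-in of 1 and 0 respectively follow from the single-bit arithmetic and are not stated on the slide):

```python
from typing import Optional, Tuple


def encode(a: int, b: int) -> Tuple[int, int]:
    """Return (u, v) for one pair of operand bits, as in the table."""
    u = a | b  # carry-out if the carry-in were 1
    v = a & b  # carry-out if the carry-in were 0
    return u, v


def carry_out(a: int, b: int) -> Optional[int]:
    """Carry C if it is already determined by A and B, else None (unknown)."""
    u, v = encode(a, b)
    return u if u == v else None


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, carry_out(a, b), *encode(a, b))
```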
The cross-bar switch barrel shifter
Shifter delay is critical since it contributes directly to the datapath cycle time
Cross-bar switch matrix (32 x 32)
Principle for 4x4 matrix
[Figure: 4 x 4 cross-bar matrix connecting in[0] to in[3] with out[0] to out[3]; diagonals labelled no shift, right 1, right 2, right 3, left 1, left 2, left 3]

The cross-bar switch barrel shifter (cont’d)
Precharged logic is used => each switch is a single NMOS transistor
Precharging sets all outputs to logic 0, so those which are not connected to any input during switching remain at 0, giving the zero filling required by the shift semantics
For rotate right, the right shift diagonal is enabled together with the complementary shift left diagonal (e.g., 'right 1' + 'left 3')
Arithmetic shift right: uses sign-extension => separate logic is used to decode the shift amount and discharge those outputs appropriately

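The diagonal-selection idea can be modeled in Python. This is a behavioral sketch, not a circuit description: `crossbar` connects each output bit to the input bit on the enabled right and/or left diagonal and leaves unconnected outputs at 0, mirroring the precharge behavior described above. All function names are chosen for illustration.

```python
from typing import Optional

WIDTH = 32
MASK = (1 << WIDTH) - 1


def crossbar(value: int, right: Optional[int] = None, left: Optional[int] = None) -> int:
    """Enable the 'right n' and/or 'left n' diagonal of the switch matrix."""
    out = 0
    for i in range(WIDTH):
        bit = 0
        if right is not None and i + right < WIDTH:
            bit |= (value >> (i + right)) & 1  # out[i] <- in[i + right]
        if left is not None and i - left >= 0:
            bit |= (value >> (i - left)) & 1   # out[i] <- in[i - left]
        out |= bit << i                        # unconnected outputs stay 0
    return out


def lsl(value: int, n: int) -> int:
    return crossbar(value, left=n)


def lsr(value: int, n: int) -> int:
    return crossbar(value, right=n)


def ror(value: int, n: int) -> int:
    n %= WIDTH
    # right diagonal plus the complementary left diagonal
    return crossbar(value, right=n, left=WIDTH - n)


def asr(value: int, n: int) -> int:
    out = lsr(value, n)
    if (value >> (WIDTH - 1)) & 1 and n > 0:
        # separate sign-extension logic fills the vacated top n bits
        out |= (MASK << (WIDTH - n)) & MASK
    return out
```

For the 4 x 4 example on the slide, rotate right by 1 corresponds to enabling 'right 1' together with 'left 3'; the 32-bit model pairs 'right n' with 'left 32 - n' in the same way.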
The 2-bit multiplication algorithm, Nth cycle
Carry-in  Multiplier  Shift            ALU    Carry-out
0         x0          LSL #2N          A + 0  0
0         x1          LSL #2N          A + B  0
0         x2          LSL #(2N + 1)    A - B  1
0         x3          LSL #2N          A - B  1
1         x0          LSL #2N          A + B  0
1         x1          LSL #(2N + 1)    A + B  0
1         x2          LSL #2N          A - B  1
1         x3          LSL #2N          A + 0  1

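The table can be exercised with a short Python model. The sketch below takes A to be the running accumulator and B the multiplicand (shifted as the Shift column specifies), processes two multiplier bits per cycle, and keeps the result modulo 2^32. The names `STEP` and `multiply_2bit` are illustrative, and features of real implementations such as early termination are left out.

```python
# (carry-in, 2-bit multiplier digit) -> (extra shift, ALU sign, carry-out)
# extra shift 0 means LSL #2N, 1 means LSL #(2N + 1);
# sign +1 is A + B, -1 is A - B, 0 is A + 0.
STEP = {
    (0, 0): (0, 0, 0),
    (0, 1): (0, +1, 0),
    (0, 2): (1, -1, 1),
    (0, 3): (0, -1, 1),
    (1, 0): (0, +1, 0),
    (1, 1): (1, +1, 0),
    (1, 2): (0, -1, 1),
    (1, 3): (0, 0, 1),
}


def multiply_2bit(b: int, multiplier: int, width: int = 32) -> int:
    """Multiply b by multiplier, two multiplier bits per cycle, modulo 2**width."""
    mask = (1 << width) - 1
    a = 0      # accumulator (A in the table)
    carry = 0  # carry between cycles
    for n in range(width // 2):
        digit = (multiplier >> (2 * n)) & 3
        extra, sign, carry = STEP[(carry, digit)]
        a = (a + sign * (b << (2 * n + extra))) & mask
    return a


if __name__ == "__main__":
    print(multiply_2bit(6, 7))
```

A carry-out of 1 means the current step subtracted (or skipped) in place of adding a multiple of four, which the next cycle repays by treating its digit as one larger.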
Carry-propagate (a) and carry-save (b) adder structures
[Figure: (a) carry-propagate adder: four full adders (inputs A, B, Cin; outputs Cout, S) with each Cout feeding the Cin of the next stage; (b) carry-save adder: four full adders whose Cout outputs are kept as a separate result instead of being propagated]

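The difference between the two structures is that a carry-save stage returns two words, a partial sum and a partial carry, and postpones carry propagation to a single final addition. A minimal Python sketch of this idea (the names `carry_save` and `sum_with_csa` are illustrative):

```python
from typing import Iterable, Tuple


def carry_save(a: int, b: int, c: int) -> Tuple[int, int]:
    """One carry-save stage: three inputs in, (partial sum, partial carry) out.

    Every bit position works independently; no carry crosses bit positions.
    """
    partial_sum = a ^ b ^ c
    partial_carry = ((a & b) | (a & c) | (b & c)) << 1
    return partial_sum, partial_carry


def sum_with_csa(values: Iterable[int]) -> int:
    """Accumulate values through carry-save stages, then one carry-propagate add."""
    s, c = 0, 0
    for v in values:
        s, c = carry_save(s, c, v)
    return s + c  # the only carry-propagate addition


if __name__ == "__main__":
    print(sum_with_csa([3, 5, 7, 9]))
```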
ARM high-speed multiplier organization
[Figure: Rs shifted right 8 bits/cycle; Rm; carry-save adders producing a partial sum and a partial carry; sum and carry rotated 8 bits/cycle; registers initialized for MLA; ALU adds the partials]

ARM2 register cell circuit
[Figure: register cell written from the ALU bus under control of write, and read onto the A bus and B bus under control of read A and read B]

ARM register bank floorplan
[Figure: register bank floorplan: A bus read decoders, B bus read decoders, write decoders, register cells and PC, Vdd and Vss rails; buses ALU bus, PC bus, INC bus, A bus, B bus]

ARM core datapath buses
[Figure: datapath floorplan: address register, incrementer, register bank, multiplier, shifter, ALU, data in, data out, instruction pipe; buses A, B, W, Din, instruction, shift out, PC, Ad, inc]

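ARM control logic structure
[Figure: control logic: the instruction feeds the decode PLA, which together with the cycle count, multiply control and load/store multiple blocks drives address control, register control, ALU control and shifter control; coprocessor interface]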