Chapter 5 Organization and Implementation - ntut.edu.ttylee/Courses/91_2/SOC... · Organization and Implementation ... ARM organization ¾Register file ... associated control logic

Chapter 5: ARM Organization and Implementation

Source: http://www.ece.uah.edu/~milenka

ARM organization

- Register file: 2 read ports and 1 write port, plus 1 read port and 1 write port reserved for r15 (pc)
- Barrel shifter: shifts or rotates one operand by any number of bits
- ALU: performs the required arithmetic and logic functions
- Memory address register and incrementer
- Memory data registers
- Instruction decoder and associated control logic

[Figure: ARM datapath organization. Blocks: address register, incrementer, register bank (including PC), multiply register, barrel shifter, ALU, data out register, data in register, instruction decode & control. Buses and signals: A[31:0], D[31:0], A bus, B bus, ALU bus, PC, control.]


Three-stage pipeline

- Fetch: the instruction is fetched from memory and placed in the instruction pipeline
- Decode: the instruction is decoded and the datapath control signals are prepared for the next cycle; in this stage the instruction owns the decode logic but not the datapath
- Execute: the instruction owns the datapath; the register bank is read, an operand is shifted, and the ALU result is generated and written back into a destination register


ARM single-cycle instruction pipeline

[Figure: instructions 1, 2 and 3 each pass through fetch, decode and execute, each starting one cycle after the previous one; time runs left to right]

instruction        time ->
1  add r0,r1,#5    fetch   decode   execute add
2  sub r2,r3,r6            fetch    decode        execute sub
3  cmp r2,#3                        fetch         decode        execute cmp


ARM multi-cycle instruction pipeline

[Figure: pipeline timing for five instructions; time runs left to right]

1: fetch ADD, decode, execute
2: fetch STR, decode, calc. addr., data xfer
3: fetch ADD, decode, execute
4: fetch ADD, decode, execute
5: fetch ADD, decode, execute

Decode logic is always generating the control signals for the datapath to use in the next cycle


ARM multi-cycle LDMIA (load multiple) instruction

[Figure: pipeline timing; time runs left to right]

ldmia r0,{r2,r3}: fetch, decode, ex ld r2, ex ld r3
sub r2,r3,r6: fetch, decode, ex sub
cmp r2,#3: fetch, decode, ex cmp

- Decode stage stays occupied, since ldmia must continue to remember the decoded instruction
- sub is fetched at the normal time but not decoded until LDMIA is finishing
- Instruction delayed


Control stalls: due to branches

- Branches often introduce stalls (branch penalty)
- Stall time may depend on whether the branch is taken
- May have to squash instructions that have already started executing
- Don't know what to fetch until the condition is evaluated


ARM pipelined branch

[Figure: pipeline timing; time runs left to right]

bne foo: fetch, decode, ex bne, ex bne, ex bne
sub r2,r3,r6: fetch, decode
foo add r0,r1,r2: fetch, decode, ex add

- Decision is not made until the third clock cycle
- Two cycles of work are thrown away if the bne is taken


Pipeline: how it works

- All instructions occupy the datapath for one or more adjacent cycles
- For each cycle that an instruction occupies the datapath, it occupies the decode logic in the immediately preceding cycle
- During the first datapath cycle each instruction issues a fetch for the next instruction but one
- Branch instructions flush and refill the instruction pipeline
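The occupancy rules above can be sketched as a small Python model. This is only an illustration under stated assumptions, not the ARM control logic: the function name pipeline_schedule and its input (a list giving the number of datapath cycles per instruction) are hypothetical, the first instruction is assumed to be fetched in cycle 1, and memory-port conflicts between instruction fetches and data transfers are ignored.

```python
def pipeline_schedule(exec_cycles):
    """Return decode/execute cycle ranges (1-based, inclusive) per instruction.

    exec_cycles[i] is the number of adjacent datapath cycles instruction i
    needs (1 for a simple ADD, 2 for an STR in the slides' example).
    """
    schedule = []
    next_free = 3  # cycle 1 = fetch, cycle 2 = decode of the first instruction
    for n in exec_cycles:
        first, last = next_free, next_free + n - 1
        schedule.append({
            # the decode logic is owned in the cycle before each datapath cycle
            "decode": (first - 1, last - 1),
            "execute": (first, last),
        })
        next_free = last + 1  # the datapath is used by one instruction at a time
    return schedule


if __name__ == "__main__":
    for row in pipeline_schedule([1, 2, 1]):  # e.g. ADD, STR, ADD
        print(row)
```

With this model, n single-cycle instructions finish in n + 2 cycles, and each extra datapath cycle delays every later instruction by one cycle.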


ARM9TDMI 5-stage pipeline

[Figure: ARM9TDMI 5-stage pipeline organization. Stages: fetch (I-cache, next pc, +4), instruction decode (I decode, register read, immediate fields), execute (shift, reg shift, mul, ALU, forwarding paths, load/store address, pre-index/post-index), buffer/data (D-cache, byte repl., rot/sgn ex), write-back (register write). PC sources feeding the next-pc mux: pc + 4, B/BL/MOV pc, SUBS pc, LDR pc, LDM/STM; r15 is read as pc+8.]

- Fetch: the instruction is fetched
- Decode: the instruction is decoded and the register operands are read (3 read ports)
- Execute: an operand is shifted and the ALU result generated, or an address is computed
- Buffer/data: data memory is accessed (load, store)
- Write-back: the result is written to the register file


ARM9TDMI Data Forwarding

[Figure: ARM9TDMI 5-stage pipeline organization (same diagram as on the previous slide), showing the forwarding paths into the execute stage]

Data forwarding

    ADD r3, r2, r1, LSL #3    ; r3 := r2 + 8 x r1
    ADD r5, r5, r3, LSL r2    ; r5 := r5 + 2^r2 x r3

    ADD r3, r2, r1, LSL #3    ; r3 := r2 + 8 x r1
    ADD r8, r9, r10           ; r8 := r9 + r10
    ADD r5, r5, r3, LSL r2    ; r5 := r5 + 2^r2 x r3

Stall?

    LDR r3, [r2]              ; r3 := mem[r2]
    ADD r1, r2, r3            ; r1 := r2 + r3
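The difference between the forwarding cases and the "Stall?" case can be sketched in Python. This is a simplified model, assuming (as the load example suggests) that an ALU result can be forwarded to the next instruction, while a loaded value arrives from the buffer/data stage too late for the immediately following instruction and costs one stall cycle. The Instr type and the function load_use_stalls are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    op: str                 # mnemonic, e.g. "ADD" or "LDR"
    dest: str               # destination register
    srcs: Tuple[str, ...]   # source registers


def load_use_stalls(program):
    """Count stalls caused by using a loaded register in the very next instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        # An ALU result is forwarded; only a load feeding the next
        # instruction is assumed to need a stall cycle.
        if prev.op == "LDR" and prev.dest in cur.srcs:
            stalls += 1
    return stalls


forwarded = [Instr("ADD", "r3", ("r2", "r1")), Instr("ADD", "r5", ("r5", "r3", "r2"))]
load_use = [Instr("LDR", "r3", ("r2",)), Instr("ADD", "r1", ("r2", "r3"))]

if __name__ == "__main__":
    print(load_use_stalls(forwarded), load_use_stalls(load_use))
```

Placing an independent instruction between the load and its use removes the stall in this model.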


ARM9TDMI PC generation

[Figure: ARM9TDMI 5-stage pipeline organization (same diagram as on the previous slides), showing the PC generation paths: pc + 4, pc+8 into r15, and the next-pc mux fed by B/BL/MOV pc, SUBS pc, LDR pc and LDM/STM]

3-stage pipeline
- PC behavior: operands are read in the execute stage, so r15 = PC + 8

5-stage pipeline
- Operands are read in the decode stage, so r15 = PC + 4?
- Incompatibilities between 3-stage and 5-stage implementations => unacceptable
- To avoid this, 5-stage pipeline ARMs emulate the behavior of the older 3-stage designs


Data processing instruction datapath activity

[Figure: datapath activity for data processing instructions. (a) register-register operations: operands Rn and Rm, result Rd. (b) register-immediate operations: operand Rn and immediate field [7:0], result Rd. In both cases the shifter and ALU operate as specified by the instruction, and the address register is incremented.]

Reg-Reg
- Rd = Rn op Rm
- r15 = AR + 4
- AR = AR + 4

Reg-Imm
- Rd = Rn op Imm
- r15 = AR + 4
- AR = AR + 4


STR (store register) datapath activity

[Figure: datapath activity for STR. (a) 1st cycle, compute address: Rn and the offset field [11:0] (lsl #0), ALU = A / A + B / A - B. (b) 2nd cycle, store data and auto-index: Rd is driven to data out (byte?), ALU = A + B / A - B.]

Compute address
- AR = Rn op Disp
- r15 = AR + 4

Store data
- AR = PC
- mem[AR] = Rd<x:y>
- If autoindexing => Rn = Rn +/- 4


The first two (of three) cycles of a branch instruction

[Figure: datapath activity for a branch. (a) 1st cycle, compute branch target: PC and the offset field [23:0] (lsl #2), ALU = A + B. (b) 2nd cycle, save return address: PC is written to R14, ALU = A.]

Third cycle: a small correction is made to the value stored in the link register so that it points directly at the instruction which follows the branch

Compute target address
- AR = PC + Disp, lsl #2

Save return address (if required)
- r14 = PC
- AR = AR + 4
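The target computation AR = PC + Disp, lsl #2 can be illustrated with a short Python sketch. It assumes the usual ARM conventions: the [23:0] field is a signed word offset, and the PC value seen by the instruction is its own address plus 8 (the 3-stage pipeline behavior). The function names branch_target and return_address are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def branch_target(instr_addr, offset24):
    """Target address of a branch located at instr_addr (assumed encoding)."""
    if offset24 & 0x800000:      # sign-extend the 24-bit offset field
        offset24 -= 1 << 24
    pc = instr_addr + 8          # r15 as read by the executing instruction
    return (pc + (offset24 << 2)) & MASK32   # Disp, lsl #2


def return_address(instr_addr):
    """Corrected link register value: the instruction following the branch."""
    return (instr_addr + 4) & MASK32


if __name__ == "__main__":
    print(hex(branch_target(0x8000, 0xFFFFFE)))  # offset of -2 words
    print(hex(return_address(0x8000)))
```

An offset field of 0xFFFFFE (minus two words) therefore branches back to the branch instruction itself, which shows why the link register value needs the third-cycle correction from PC to the address of the next instruction.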


ARM Implementation

- Datapath
- Control unit (FSM)


2-phase non-overlapping clock scheme

- Most ARMs do not operate on edge-sensitive registers
- Instead, the design is based around 2-phase non-overlapping clocks, which are generated internally from a single clock signal
- Data movement is controlled by passing the data alternately through latches which are open during phase 1 and latches which are open during phase 2

[Figure: one clock cycle, made up of phase 1 followed by phase 2]


ARM datapath timing

Register read
- Register read buses are dynamic and are precharged during phase 2
- During phase 1 the selected registers discharge the read buses, which become valid early in phase 1

Shift operation
- The second operand passes through the barrel shifter

ALU operation
- The ALU has input latches which are open in phase 1, allowing the operands to begin combining in the ALU as soon as they are valid; the latches close at the end of phase 1 so that the phase 2 precharge does not get through to the ALU
- The ALU processes the operands during phase 2, producing a valid output towards the end of the phase
- The result is latched in the destination register at the end of phase 2


ARM datapath timing (cont’d)

[Figure: datapath timing across phase 1 and phase 2: register read time (read bus valid), shift time (shift out valid), ALU operands latched, ALU time (ALU out), register write time; the phase 2 precharge invalidates the buses]

Minimum datapath delay = register read time + shifter delay + ALU delay + register write set-up time + phase 2 to phase 1 non-overlap time


The original ARM1 ripple-carry adder

- Carry logic: uses a CMOS AOI (And-Or-Invert) gate
- Even bits use the circuit shown below
- Odd bits use the dual circuit, with inverted inputs and outputs and the AND and OR gates swapped around
- Worst-case path: 32 gates long

[Figure: one bit of the ripple-carry adder; inputs A, B, Cin; outputs sum, Cout]
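The behavior of a ripple-carry adder can be modelled bit by bit in Python. This is a functional sketch only: it does not model the AOI gates or the alternating true/inverted stages of the ARM1 circuit, and the function names are chosen here for illustration.

```python
def full_adder(a, b, cin):
    """One adder bit: returns (sum, carry-out) for single-bit inputs."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a | b))
    return s, cout


def ripple_carry_add(a, b, cin=0, width=32):
    """Add two width-bit values; the carry ripples from bit 0 upwards."""
    result, carry = 0, cin
    for i in range(width):
        # each bit must wait for the carry of the bit below it, which is
        # why the worst-case path grows linearly with the word width
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry


if __name__ == "__main__":
    print(ripple_carry_add(5, 7))
    print(ripple_carry_add(0xFFFFFFFF, 1))
```

The second call is the worst case: a carry generated at bit 0 propagates through all 32 stages.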


ARM2 4-bit carry look-ahead scheme

- Carry Generate (G)
- Carry Propagate (P)
- Cout[3] = Cin[0].P + G
- Use AOI and alternate AND/OR gates
- Worst case: 8 gates long

[Figure: 4-bit adder logic block; inputs A[3:0], B[3:0], Cin[0]; outputs sum[3:0], Cout[3]; group signals P and G]
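The relation Cout[3] = Cin[0].P + G can be checked with a short Python sketch. The per-bit definitions used here (propagate = A or B, generate = A and B) are the standard textbook ones and are an assumption about the ARM2 circuit; the function names are hypothetical.

```python
def group_pg(a, b):
    """Group propagate P and generate G for a 4-bit slice of A and B."""
    p = [((a >> i) | (b >> i)) & 1 for i in range(4)]   # bit propagates a carry
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]   # bit generates a carry
    P = p[3] & p[2] & p[1] & p[0]
    G = (g[3]
         | (p[3] & g[2])
         | (p[3] & p[2] & g[1])
         | (p[3] & p[2] & p[1] & g[0]))
    return P, G


def group_carry_out(a, b, cin):
    """Cout[3] = Cin[0].P + G, computed without rippling through the bits."""
    P, G = group_pg(a, b)
    return G | (P & cin)


if __name__ == "__main__":
    print(group_carry_out(0b1111, 0b0000, 1))  # carry-in propagates through
    print(group_carry_out(0b1000, 0b1000, 0))  # carry generated at bit 3
```

Because the carry only has to pass one gate level per 4-bit group, a 32-bit add needs 8 group stages instead of 32 bit stages.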


The ARM2 ALU logic for one result bit

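ALU functions
- data operations (add, sub, ...)
- address computations for memory accesses
- branch target computations
- bit-wise logical operations
- ...

[Figure: ARM2 ALU logic for one result bit; inputs NA bus and NB bus, function select lines fs[5:0], carry logic with G and P; output to the ALU bus]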


ARM2 ALU function codes

fs5  fs4  fs3  fs2  fs1  fs0   ALU output
 0    0    0    1    0    0    A and B
 0    0    1    0    0    0    A and not B
 0    0    1    0    0    1    A xor B
 0    1    1    0    0    1    A plus not B plus carry
 0    1    0    1    1    0    A plus B plus carry
 1    1    0    1    1    0    not A plus B plus carry
 0    0    0    0    0    0    A
 0    0    0    0    0    1    A or B
 0    0    0    1    0    1    B
 0    0    1    0    1    0    not B
 0    0    1    1    0    0    zero
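The ALU outputs listed in the table can be modelled functionally in Python. This sketch selects operations by the names used in the "ALU output" column rather than by the fs bit patterns, since the gate-level meaning of each fs line is not given here; the function name alu is hypothetical. It also shows why the table has no separate subtract entry: with carry = 1, "A plus not B plus carry" is A - B in two's complement.

```python
MASK32 = 0xFFFFFFFF


def alu(op, a, b, carry=0):
    """Functional model of the listed ALU outputs on 32-bit values."""
    na, nb = ~a & MASK32, ~b & MASK32
    outputs = {
        "A and B": a & b,
        "A and not B": a & nb,
        "A xor B": a ^ b,
        "A plus not B plus carry": a + nb + carry,
        "A plus B plus carry": a + b + carry,
        "not A plus B plus carry": na + b + carry,
        "A": a,
        "A or B": a | b,
        "B": b,
        "not B": nb,
        "zero": 0,
    }
    return outputs[op] & MASK32  # keep only the 32-bit result


if __name__ == "__main__":
    print(alu("A plus not B plus carry", 10, 3, carry=1))       # A - B
    print(hex(alu("not A plus B plus carry", 10, 3, carry=1)))  # B - A
```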


The ARM6 carry-select adder scheme

- Compute sums of various fields of the word for a carry-in of zero and a carry-in of one
- The final result is selected by using the correct carry-in value to control a multiplexor

[Figure: carry-select adder. Base adders on a,b[3:0] up to a,b[31:28] produce both s and s+1 (+ and +1, with carry c); multiplexors combine them into sum[3:0], sum[7:4], sum[15:8] and sum[31:16]]

Worst case: O(log2[word width]) gates long

Note: Be careful! Fan-out on some of these gates is high, so a direct comparison with the previous schemes is not applicable.
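The carry-select idea can be sketched in Python. This is a functional model under stated assumptions: the word is split into the four fields named in the figure (sum[3:0], sum[7:4], sum[15:8], sum[31:16]), each field's two candidate sums are formed with ordinary integer addition, and the multiplexor is a conditional expression. The names FIELDS and carry_select_add are hypothetical.

```python
FIELDS = [(0, 4), (4, 4), (8, 8), (16, 16)]  # (least significant bit, width)


def carry_select_add(a, b, cin=0):
    """32-bit add that precomputes each field for carry-in 0 and carry-in 1."""
    result, carry = 0, cin
    for lsb, width in FIELDS:
        mask = (1 << width) - 1
        fa, fb = (a >> lsb) & mask, (b >> lsb) & mask
        s0 = fa + fb        # candidate sum assuming carry-in = 0
        s1 = fa + fb + 1    # candidate sum assuming carry-in = 1
        chosen = s1 if carry else s0          # the multiplexor
        result |= (chosen & mask) << lsb
        carry = chosen >> width               # selects the next field's sum
    return result, carry


if __name__ == "__main__":
    print(carry_select_add(0xFFFFFFFF, 1))
    print(hex(carry_select_add(0x12345678, 0x0FEDCBA8)[0]))
```

In hardware both candidate sums of every field are computed in parallel, so only the chain of multiplexor selections depends on the true carry.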


The ARM6 ALU organization

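Not easy to merge the arithmetic and logic functions => a separate logic unit runs in parallel with the adder, and a multiplexor selects the output

[Figure: ARM6 ALU organization. The A and B operand latches feed XOR gates (controlled by invert A and invert B), which feed the adder (with C in) and the logic functions block (controlled by function); the result mux (controlled by logic/arithmetic) selects the result; zero detect produces Z, and the flags N, V and C are also generated]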


ARM9 carry arbitration encoding

Carry arbitration adder

A  B  C        u  v
0  0  0        0  0
0  1  unknown  1  0
1  0  unknown  1  0
1  1  1        1  1
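The encoding in the table can be expressed in Python. Reading the rows, u = A or B and v = A and B reproduce the table; the interpretation used below (u is the carry-out if the carry-in is 1, v is the carry-out if the carry-in is 0) is an assumption consistent with the table, and the function names are hypothetical.

```python
def encode(a, b):
    """Return the (u, v) pair for one bit position of the operands."""
    return a | b, a & b


def resolve(u, v, cin):
    """Settle an encoded carry once the incoming carry is known."""
    # (0, 0): carry is 0; (1, 1): carry is 1; (1, 0): unknown, follows cin
    return u if cin else v


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, encode(a, b))
```

Only the (1, 0) case depends on the carry coming from the lower bits, which is what allows the known cases to be resolved immediately.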


The cross-bar switch barrel shifter

- Shifter delay is critical, since it contributes directly to the datapath cycle time
- Cross-bar switch matrix (32 x 32)
- Principle shown for a 4 x 4 matrix

[Figure: 4 x 4 cross-bar switch matrix connecting in[0]..in[3] to out[0]..out[3]; the diagonals are labelled no shift, right 1, right 2, right 3, left 1, left 2, left 3]


The cross-bar switch barrel shifter (cont’d)

- Precharged logic is used => each switch is a single NMOS transistor
- Precharging sets all outputs to logic 0, so those which are not connected to any input during switching remain at 0, giving the zero filling required by the shift semantics
- For rotate right, the right shift diagonal is enabled together with the complementary shift left diagonal (e.g., 'right 1' + 'left 3')
- Arithmetic shift right uses sign-extension => separate logic is used to decode the shift amount and discharge those outputs appropriately
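The 4 x 4 cross-bar principle can be modelled in Python. This is a behavioral sketch with hypothetical names (crossbar, lsl, lsr, ror): each enabled diagonal connects in[j + d] to out[j], outputs start at the precharged value 0, and a rotate enables two complementary diagonals as described above. Arithmetic shift right is not modelled.

```python
WIDTH = 4  # the slides' 4 x 4 example; the real matrix is 32 x 32


def crossbar(value, diagonals):
    """Drive the outputs from the inputs along the enabled diagonals.

    A diagonal d connects in[j + d] to out[j]: d > 0 shifts right,
    d < 0 shifts left, d == 0 is 'no shift'.
    """
    out = 0  # precharge: unconnected outputs stay at logic 0 (zero fill)
    for d in diagonals:
        for j in range(WIDTH):
            i = j + d
            if 0 <= i < WIDTH and (value >> i) & 1:
                out |= 1 << j
    return out


def lsr(value, k):
    return crossbar(value, [k])


def lsl(value, k):
    return crossbar(value, [-k])


def ror(value, k):
    k %= WIDTH
    # 'right k' plus the complementary 'left (WIDTH - k)' diagonal
    return crossbar(value, [k, k - WIDTH] if k else [0])


if __name__ == "__main__":
    print(bin(lsl(0b0011, 1)), bin(lsr(0b1100, 2)), bin(ror(0b0011, 1)))
```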


The 2-bit multiplication algorithm, Nth cycle

Carry-in  Multiplier  Shift          ALU    Carry-out
0         x 0         LSL #2N        A + 0  0
0         x 1         LSL #2N        A + B  0
0         x 2         LSL #(2N + 1)  A - B  1
0         x 3         LSL #2N        A - B  1
1         x 0         LSL #2N        A + B  0
1         x 1         LSL #(2N + 1)  A + B  0
1         x 2         LSL #2N        A - B  1
1         x 3         LSL #2N        A + 0  1
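The table can be exercised with a short Python sketch. It assumes that A is the running accumulator, B is the multiplicand shifted as shown, the multiplier is consumed two bits per cycle starting from the least significant end, and a carry-out of 1 stands for "add four times B at the next step". The names TABLE and multiply_2bit are hypothetical, and early termination is not modelled.

```python
# (carry_in, multiplier digit) -> (extra shift, ALU sign, carry_out)
# extra shift 1 means LSL #(2N + 1); sign 0 means "A + 0"
TABLE = {
    (0, 0): (0, 0, 0),
    (0, 1): (0, +1, 0),
    (0, 2): (1, -1, 1),
    (0, 3): (0, -1, 1),
    (1, 0): (0, +1, 0),
    (1, 1): (1, +1, 0),
    (1, 2): (0, -1, 1),
    (1, 3): (0, 0, 1),
}


def multiply_2bit(rm, rs, bits=32):
    """Multiply rm by rs modulo 2**bits, two multiplier bits per cycle."""
    mask = (1 << bits) - 1
    acc, carry = 0, 0
    for n in range(bits // 2):
        digit = (rs >> (2 * n)) & 3
        extra, sign, carry = TABLE[(carry, digit)]
        acc = (acc + sign * (rm << (2 * n + extra))) & mask
    # a carry left over after the last cycle only affects bits >= 'bits'
    return acc


if __name__ == "__main__":
    print(multiply_2bit(7, 6))
```

For 7 x 6 the multiplier digits are 2 and 1: the first cycle computes A - 14 with carry-out 1, and the second computes A + 56, giving 42.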


Carry-propagate (a) and carry-save (b) adder structures

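[Figure: (a) carry-propagate adder: four full adders (inputs A, B, Cin; outputs Cout, S) with each Cout connected to the Cin of the next bit. (b) carry-save adder: four full adders (inputs A, B, Cin; outputs Cout, S) with no carry connection between bits; the Cout and S outputs are kept as separate partial carry and partial sum words]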
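The difference between the two structures can be sketched in Python. In the carry-save form every bit position is an independent full adder, so three words are reduced to a partial sum and a partial carry with no carry chain; a single carry-propagate addition is needed only at the end. The function name carry_save_add is chosen here for illustration.

```python
def carry_save_add(a, b, c):
    """Reduce three words to (partial sum, partial carry) with no carry chain."""
    partial_sum = a ^ b ^ c                           # S output of every bit
    partial_carry = ((a & b) | (a & c) | (b & c)) << 1  # Cout moved up one bit
    return partial_sum, partial_carry


def add_many(values):
    """Accumulate values in carry-save form, then do one carry-propagate add."""
    s, c = 0, 0
    for v in values:
        s, c = carry_save_add(s, c, v)
    return s + c  # the only addition in which carries propagate


if __name__ == "__main__":
    print(carry_save_add(5, 6, 7))
    print(add_many([5, 6, 7, 8]))
```

This is why a multiplier can sum many partial products quickly: each carry-save step costs one full-adder delay regardless of the word width.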


ARM high-speed multiplier organization

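[Figure: ARM high-speed multiplier organization. Rs is shifted right 8 bits/cycle and, with Rm, drives a row of carry-save adders; the partial sum and partial carry registers (with initialization for MLA) are rotated 8 bits/cycle and fed back; the ALU adds the partials to form the final result]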


ARM2 register cell circuit

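[Figure: ARM2 register cell circuit; the cell is written from the ALU bus under control of write, and read onto the A bus and B bus under control of read A and read B]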


ARM register bank floorplan

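[Figure: ARM register bank floorplan. The register cells and the PC sit between the Vdd and Vss rails, with A bus read decoders, B bus read decoders and write decoders alongside; buses: ALU bus, PC bus, INC bus, A bus, B bus]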


ARM core datapath buses

[Figure: ARM core datapath buses. Blocks: address register, incrementer, register bank, multiplier, shifter, ALU, data in, data out, instruction pipe. Buses: A, B, W, Ad, inc, PC, Din, shift out, instruction]


ARM control logic structure

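[Figure: ARM control logic structure. The instruction feeds the decode PLA, which works with the cycle count, multiply control and load/store multiple blocks and drives the address control, register control, ALU control and shifter control; there is also a coprocessor interface]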
