
Chapter 5: ARM Organization and Implementation

Source: http://www.ece.uah.edu/~milenka 2

ARM organization

Register file – 2 read ports, 1 write port + 1 read, 1 write port reserved for r15 (pc)
Barrel shifter – shift or rotate one operand by any number of bits
ALU – performs the arithmetic and logic functions required
Memory address register + incrementer
Memory data registers
Instruction decoder and associated control logic

[Figure: ARM datapath organization: address register, incrementer, register bank (with PC), multiply register, barrel shifter, ALU, data in register, data out register, instruction decode & control; A bus, B bus, ALU bus; A[31:0], D[31:0]]


Three-stage pipeline

Fetch: the instruction is fetched from memory and placed in the instruction pipeline

Decode: the instruction is decoded and the datapath control signals are prepared for the next cycle; in this stage the instruction owns the decode logic but not the datapath

Execute: the instruction owns the datapath; the register bank is read, an operand is shifted, and the ALU result is generated and written back into a destination register


ARM single-cycle instruction pipeline

[Figure: three successive single-cycle instructions (1, 2, 3), each starting one cycle after the previous one]

time →
add r0,r1,#5    fetch    decode    execute add
sub r2,r3,r6             fetch     decode         execute sub
cmp r2,#3                          fetch          decode         execute cmp
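The overlap in the diagram can be reproduced with a small model. This is only a sketch: `schedule` and `total_cycles` are hypothetical helper names, and the model assumes every instruction is single-cycle and nothing stalls.

```python
STAGES = ["fetch", "decode", "execute"]

def schedule(instructions):
    """Return {instruction text: {cycle: stage}} for single-cycle instructions.

    Assumption: instruction i (0-based) is fetched in cycle i + 1 and moves
    to the next stage every cycle, with no stalls.
    """
    table = {}
    for i, text in enumerate(instructions):
        table[text] = {i + 1 + s: stage for s, stage in enumerate(STAGES)}
    return table

def total_cycles(n):
    """Cycles needed to complete n single-cycle instructions."""
    return n + len(STAGES) - 1 if n else 0

program = ["add r0,r1,#5", "sub r2,r3,r6", "cmp r2,#3"]
for text, stages in schedule(program).items():
    print(text, stages)
```

Once the pipeline is full, one instruction completes per cycle, so the three instructions finish in five cycles rather than nine.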


ARM multi-cycle instruction pipeline

[Figure: pipeline timing for instructions 1 to 5 (ADD, STR, ADD, ADD, ADD) against time. Each ADD passes through fetch, decode, execute; the STR passes through fetch, decode, calc. addr., data xfer.]

Decode logic is always generating the control signals for the datapath to use in the next cycle


ARM multi-cycle LDMIA (load multiple) instruction

time →
ldmia r0,{r2,r3}    fetch    decode    ex ld r2    ex ld r3
sub r2,r3,r6                 fetch                 decode      ex sub
cmp r2,#3                                          fetch       decode    ex cmp

Decode stage occupied since ldmia must continue to remember decoded instruction
sub fetched at normal time but not decoded until LDMIA is finishing
Instruction delayed


Control stalls: due to branches

Branches often introduce stalls (branch penalty)
Stall time may depend on whether branch is taken
May have to squash instructions that already started executing
Don’t know what to fetch until condition is evaluated


ARM pipelined branch

time →
bne foo             fetch    decode    ex bne    ex bne    ex bne
sub r2,r3,r6                 fetch     decode
foo add r0,r1,r2                                 fetch     decode    ex add

Decision not made until the third clock cycle
Two cycles of work are thrown away if the bne is taken
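The cost of taken branches can be estimated with a back-of-the-envelope model. The function name `cycles_with_branches` and its parameters are hypothetical, and the model assumes every instruction would otherwise need exactly one execute cycle.

```python
def cycles_with_branches(n_instructions, n_taken_branches, penalty=2):
    """Estimate total cycles on a 3-stage pipeline.

    Assumptions (illustrative model only): each executed instruction needs
    one execute cycle, and each taken branch throws away `penalty` cycles
    of already fetched/decoded work.
    """
    if n_instructions == 0:
        return 0
    fill = 2  # cycles spent filling fetch and decode before the first execute
    return fill + n_instructions + penalty * n_taken_branches

# bne foo (taken) followed by the add at foo: two executed instructions
print(cycles_with_branches(2, 1))
```

With the slide's example (the bne and the add at foo are the two instructions that execute), the add finishes in cycle 6, two cycles later than it would without the flush.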


Pipeline: how it works

All instructions occupy the datapath for one or more adjacent cycles
For each cycle that an instruction occupies the datapath, it occupies the decode logic in the immediately preceding cycle
During the first datapath cycle each instruction issues a fetch for the next instruction but one
Branch instructions flush and refill the instruction pipeline


ARM9TDMI 5-stage pipeline

[Figure: ARM9TDMI 5-stage pipeline organization. fetch: I-cache, next pc, +4; instruction decode: I decode, register read, immediate fields; execute: reg shift, shift, ALU, mul, forwarding paths, load/store address (pre-index, post-index); buffer/data: D-cache, byte repl., rot/sgn ex; write-back: register write. PC paths: pc + 4, pc + 8, r15; B, BL, MOV pc; SUBS pc; LDR pc; LDM/STM.]

Fetch: instruction is fetched from the instruction memory (I-cache)
Decode: instruction is decoded; register operands read (3 read ports)
Execute: an operand is shifted and the ALU result generated, or address is computed
Buffer/data: data memory is accessed (load, store)
Write-back: write to register file


ARM9TDMI Data Forwarding

[Figure: ARM9TDMI 5-stage pipeline organization, as on the previous slide, highlighting the forwarding paths]

Data Forwarding

ADD r3, r2, r1, LSL #3    ; r3 := r2 + 8 x r1
ADD r5, r5, r3, LSL r2    ; r5 := r5 + 2^r2 x r3

ADD r3, r2, r1, LSL #3    ; r3 := r2 + 8 x r1
ADD r8, r9, r10           ; r8 := r9 + r10
ADD r5, r5, r3, LSL r2    ; r5 := r5 + 2^r2 x r3

Stall?

LDR r3, [r2]              ; r3 := mem[r2]
ADD r1, r2, r3            ; r1 := r2 + r3
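The difference between the forwarding case and the load case can be sketched as a check on two adjacent instructions. The tuple layout and the name `classify` are hypothetical, and the "stall" outcome rests on the assumption that loaded data only becomes available after the buffer/data stage, one stage later than an ALU result.

```python
def classify(prev, curr):
    """Classify the dependency between two adjacent instructions.

    prev and curr are (mnemonic, dest, sources) tuples. Only the
    immediately preceding instruction is examined.
    """
    op, dest, _ = prev
    _, _, sources = curr
    if dest not in sources:
        return "independent"
    # Assumption: load data arrives one stage later than an ALU result,
    # so a forwarding path alone cannot supply it in time.
    return "stall" if op.startswith("LD") else "forward"

add1 = ("ADD", "r3", ("r2", "r1"))
add2 = ("ADD", "r5", ("r5", "r3", "r2"))
load = ("LDR", "r3", ("r2",))
use = ("ADD", "r1", ("r2", "r3"))

print(classify(add1, add2))
print(classify(load, use))
```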


ARM9TDMI PC generation

[Figure: ARM9TDMI 5-stage pipeline organization, as on the previous slides, highlighting the PC generation paths (next pc, +4, pc + 4, pc + 8, r15; B, BL, MOV pc; SUBS pc; LDR pc)]

3-stage pipeline
PC behavior: operands are read in the execute stage, so r15 = PC + 8

5-stage pipeline
Operands are read in the decode stage, which would give r15 = PC + 4
Incompatibilities between 3-stage and 5-stage implementations => unacceptable
To avoid this, 5-stage pipeline ARMs emulate the behavior of the older 3-stage designs


Data processing instruction datapath activity

[Figure: datapath activity for data processing instructions. (a) register – register operations: Rn and Rm are read from the registers, the shifter and ALU operate as the instruction specifies, and the result is written to Rd. (b) register – immediate operations: the second operand is taken from instruction bits [7:0] instead of Rm. In both cases the address register is incremented and the PC updated.]

Reg-Reg
Rd = Rn op Rm
r15 = AR + 4
AR = AR + 4

Reg-Imm
Rd = Rn op Imm
r15 = AR + 4
AR = AR + 4


STR (store register) datapath activity

[Figure: STR datapath activity. (a) 1st cycle – compute address: Rn and instruction bits [11:0] (lsl #0) feed the ALU, which produces A, A + B or A - B. (b) 2nd cycle – store data & auto-index: Rd goes to data out (byte?), while Rn and the shifter feed the ALU, which produces A + B or A - B.]

Compute address
AR = Rn op Disp
r15 = AR + 4

Store data
AR = PC
mem[AR] = Rd<x:y>
If autoindexing => Rn = Rn +/- 4


The first two (of three) cycles of a branch instruction

[Figure: branch datapath activity. (a) 1st cycle – compute branch target: PC and instruction bits [23:0] (lsl #2) feed the ALU, which produces A + B. (b) 2nd cycle – save return address: PC passes through the ALU (= A) and is written to R14.]

Compute target address
AR = PC + Disp, lsl #2

Save return address (if required)
r14 = PC
AR = AR + 4

Third cycle: a small correction is made to the value stored in the link register so that it points directly at the instruction which follows the branch


ARM Implementation

Datapath
Control unit (FSM)


2-phase non-overlapping clock scheme

Most ARMs do not operate on edge-sensitive registers
Instead the design is based around 2-phase non-overlapping clocks which are generated internally from a single clock signal
Data movement is controlled by passing the data alternately through latches which are open during phase 1 and latches which are open during phase 2

[Figure: phase 1 and phase 2 clock waveforms over 1 clock cycle]


ARM datapath timing

Register read
Register read buses – dynamic, precharged during phase 2
During phase 1 selected registers discharge the read buses, which become valid early in phase 1

Shift operation
Second operand passes through barrel shifter

ALU operation
ALU has input latches which are open in phase 1, allowing the operands to begin combining in the ALU as soon as they are valid, but they close at the end of phase 1 so that the phase 2 precharge does not get through to the ALU
ALU processes the operands during phase 2, producing a valid output towards the end of the phase
The result is latched in the destination register at the end of phase 2


ARM datapath timing (cont’d)

[Figure: datapath timing over phase 1 and phase 2: register read time, shift time, ALU time, register write time; events marked: read bus valid, shift out valid, ALU operands latched, ALU out, precharge invalidates buses]

Minimum Datapath Delay = Register read time + Shifter Delay + ALU Delay + Register write set-up time + Phase 2 to phase 1 non-overlap time


The original ARM1 ripple-carry adder

Carry logic: use CMOS AOI (And-Or-Invert) gate
Even bits use the circuit shown below
Odd bits use the dual circuit with inverted inputs and outputs and AND and OR gates swapped around
Worst case path: 32 gates long

[Figure: one bit of the ripple-carry adder, with inputs A, B, Cin and outputs sum, Cout]
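A bit-level sketch shows why the worst-case path grows with the word width: each bit must wait for the carry from the bit below. The functions `full_adder` and `ripple_add` are illustrative names; the sketch models only the logic function, not the alternating-polarity AOI gates of the actual circuit.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a | b))
    return s, cout

def ripple_add(a, b, cin=0, width=32):
    """Add two width-bit values by rippling the carry from bit 0 upwards."""
    result = 0
    carry = cin
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry

print(ripple_add(5, 7))
print(ripple_add(0xFFFFFFFF, 1))
```

Adding 1 to 0xFFFFFFFF is the worst case: the carry propagates through all 32 bit positions.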


ARM2 4-bit carry look-ahead scheme

Carry Generate (G)
Carry Propagate (P)
Cout[3] = Cin[0].P + G
Use AOI and alternate AND/OR gates
Worst case: 8 gates long

[Figure: 4-bit adder logic with inputs A[3:0], B[3:0], Cin[0] and outputs sum[3:0], Cout[3], P, G]
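The group relation Cout[3] = Cin[0].P + G can be checked exhaustively for one 4-bit block. This is a sketch of the logic only; `group_pg` and `group_carry_out` are hypothetical names, and per-bit propagate is taken as A xor B (an assumption; A or B would give the same carry).

```python
def group_pg(a, b, width=4):
    """Return the group propagate P and group generate G of a width-bit block."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # bit i generates a carry
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # bit i propagates a carry
    P = int(all(p))
    G = 0
    for i in range(width):
        # A carry leaves bit i if it is generated there or passed through it
        G = g[i] | (p[i] & G)
    return P, G

def group_carry_out(a, b, cin):
    """Cout[3] = Cin[0].P + G for a 4-bit block."""
    P, G = group_pg(a, b)
    return G | (P & cin)

print(group_pg(0b1111, 0b0000))  # propagates but does not generate
print(group_carry_out(0b1111, 0b0000, 1))
```

Because P and G do not depend on the carry-in, the carry only has to pass one gate level per 4-bit block, which is where the 8-gate worst case for 32 bits comes from.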


The ARM2 ALU logic for one result bit

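ALU functions
data operations (add, sub, ...)
address computations for memory accesses
branch target computations
bit-wise logical operations
...

[Figure: ARM2 ALU logic for one result bit: inputs NA bus and NB bus, function select lines fs5 to fs0, carry logic with G and P, output to the ALU bus]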


ARM2 ALU function codes

fs5  fs4  fs3  fs2  fs1  fs0    ALU output
0    0    0    1    0    0      A and B
0    0    1    0    0    0      A and not B
0    0    1    0    0    1      A xor B
0    1    1    0    0    1      A plus not B plus carry
0    1    0    1    1    0      A plus B plus carry
1    1    0    1    1    0      not A plus B plus carry
0    0    0    0    0    0      A
0    0    0    0    0    1      A or B
0    0    0    1    0    1      B
0    0    1    0    1    0      not B
0    0    1    1    0    0      zero
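The function-code table can be read as a lookup from a 6-bit code to an operation. The sketch below transcribes the table into a dictionary; `ALU_OPS` and `alu` are hypothetical names, fs5 is assumed to be the most significant bit of the code, and results are truncated to 32 bits.

```python
MASK = 0xFFFFFFFF

# Keys are fs5..fs0 written as a binary literal (fs5 is the leftmost bit)
ALU_OPS = {
    0b000100: lambda a, b, c: a & b,                  # A and B
    0b001000: lambda a, b, c: a & ~b,                 # A and not B
    0b001001: lambda a, b, c: a ^ b,                  # A xor B
    0b011001: lambda a, b, c: a + (~b & MASK) + c,    # A plus not B plus carry
    0b010110: lambda a, b, c: a + b + c,              # A plus B plus carry
    0b110110: lambda a, b, c: (~a & MASK) + b + c,    # not A plus B plus carry
    0b000000: lambda a, b, c: a,                      # A
    0b000001: lambda a, b, c: a | b,                  # A or B
    0b000101: lambda a, b, c: b,                      # B
    0b001010: lambda a, b, c: ~b,                     # not B
    0b001100: lambda a, b, c: 0,                      # zero
}

def alu(fs, a, b, carry=0):
    """Apply the operation selected by the 6-bit function code fs."""
    return ALU_OPS[fs](a, b, carry) & MASK

print(alu(0b011001, 5, 3, carry=1))  # A plus not B plus 1 is A - B
```

Subtraction falls out of the adder: "A plus not B plus carry" with carry set to 1 is the two's-complement form of A - B.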


The ARM6 carry-select adder scheme

Compute sums of various fields of the word for carry-in of zero and carry-in of one
Final result is selected by using the correct carry-in value to control a multiplexor

[Figure: carry-select adder: 4-bit blocks a,b[3:0] to a,b[31:28] each compute s and s+1 (+ and +, +1); multiplexors controlled by the carry c select sum[3:0], sum[7:4], sum[15:8], sum[31:16]]

Worst case: O(log2[word width]) gates long

Note: Be careful! Fan-out on some of these gates is high, so direct comparison with previous schemes is not applicable.
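The select step can be sketched as follows. The field widths (4, 4, 8, 16) follow the sum fields named on the slide; `carry_select_add` is a hypothetical name. Note that this sketch chooses the fields one after another, so it shows the select principle but not the hierarchical multiplexing that gives the logarithmic gate depth.

```python
def carry_select_add(a, b, cin=0, widths=(4, 4, 8, 16)):
    """Add two 32-bit values field by field using precomputed s and s+1."""
    result, carry, pos = 0, cin, 0
    for w in widths:
        mask = (1 << w) - 1
        fa = (a >> pos) & mask
        fb = (b >> pos) & mask
        sum0 = fa + fb          # field sum assuming carry-in of zero
        sum1 = fa + fb + 1      # field sum assuming carry-in of one
        chosen = sum1 if carry else sum0   # multiplexor driven by the real carry
        result |= (chosen & mask) << pos
        carry = chosen >> w     # carry out of this field
        pos += w
    return result, carry

print(carry_select_add(0xFFFFFFFF, 1))
```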


The ARM6 ALU organization

Not easy to merge the arithmetic and logic functions => a separate logic unit runs in parallel with the adder, and a multiplexor selects the output

[Figure: ARM6 ALU organization: the A operand latch and B operand latch feed XOR gates (invert A, invert B); the adder (with C in) and the logic functions unit (function) work in parallel; the result mux (logic/arithmetic) selects the result; zero detect; flags N, Z, C, V]


ARM9 carry arbitration encoding

Carry arbitration adder

A    B    C          u    v
0    0    0          0    0
0    1    unknown    1    0
1    0    unknown    1    0
1    1    1          1    1
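The table is consistent with reading u as the carry-out when the carry-in is 1 and v as the carry-out when the carry-in is 0. A minimal sketch under that reading, with hypothetical names `encode` and `carry`:

```python
def encode(a, b):
    """Encode one bit position as the (u, v) pair from the table."""
    u = a | b   # carry-out if the carry-in turns out to be 1
    v = a & b   # carry-out if the carry-in turns out to be 0
    return u, v

def carry(u, v):
    """Return the carry if it is already known, else None ("unknown")."""
    return u if u == v else None

for a in (0, 1):
    for b in (0, 1):
        print(a, b, carry(*encode(a, b)), encode(a, b))
```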


The cross-bar switch barrel shifter

Shifter delay is critical since it contributes directly to the datapath cycle time
Cross-bar switch matrix (32 x 32)
Principle for 4x4 matrix

[Figure: 4x4 cross-bar switch matrix with inputs in[0] to in[3] and outputs out[0] to out[3]; diagonals labeled no shift, right 1, right 2, right 3, left 1, left 2, left 3]


The cross-bar switch barrel shifter (cont’d)

Precharged logic is used => each switch is a single NMOS transistor
Precharging sets all outputs to logic 0, so those which are not connected to any input during switching remain at 0, giving the zero filling required by the shift semantics
For rotate right, the right shift diagonal is enabled + complementary shift left diagonal (e.g., ‘right 1’ + ‘left 3’)
Arithmetic shift right: use sign-extension => separate logic is used to decode the shift amount and discharge those outputs appropriately
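The diagonal principle can be modeled on the 4x4 matrix. This is a behavioral sketch, not a circuit model: `crossbar` and `rotate_right` are hypothetical names, in[0]/out[0] are assumed to be the least significant bits, and outputs start at 0 to mimic the precharged state.

```python
WIDTH = 4

def crossbar(value, diagonals):
    """Connect inputs to outputs along the enabled diagonals.

    diagonals: iterable of ("right" | "left", amount) pairs.
    """
    out = 0  # precharge: every output starts at logic 0
    for direction, k in diagonals:
        for j in range(WIDTH):
            src = j + k if direction == "right" else j - k
            if 0 <= src < WIDTH:          # a switch exists on this diagonal
                out |= ((value >> src) & 1) << j
    return out

def rotate_right(value, k):
    """Rotate by enabling 'right k' plus the complementary 'left WIDTH - k'."""
    k %= WIDTH
    if k == 0:
        return crossbar(value, [("right", 0)])
    return crossbar(value, [("right", k), ("left", WIDTH - k)])

print(bin(crossbar(0b0110, [("right", 1)])))
print(bin(rotate_right(0b0011, 1)))
```

Outputs that no enabled diagonal reaches keep their precharged 0, which is the zero fill for logical shifts.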


The 2-bit multiplication algorithm, Nth cycle

Carry-in    Multiplier    Shift            ALU      Carry-out
0           x 0           LSL #2N          A + 0    0
0           x 1           LSL #2N          A + B    0
0           x 2           LSL #(2N + 1)    A – B    1
0           x 3           LSL #2N          A – B    1
1           x 0           LSL #2N          A + B    0
1           x 1           LSL #(2N + 1)    A + B    0
1           x 2           LSL #2N          A – B    1
1           x 3           LSL #2N          A + 0    1
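The table can be exercised directly: each cycle consumes two multiplier bits, applies the listed shift and ALU operation to the accumulator A, and passes a carry to the next cycle. In this sketch `STEP` and `multiply_2bit` are hypothetical names, the product is truncated to 32 bits, and the carry left after the last cycle is dropped because it lies beyond bit 31.

```python
MASK32 = 0xFFFFFFFF

# (carry-in, multiplier bits) -> (extra shift, sign of B, carry-out),
# transcribed from the table: shift is LSL #(2N + extra), ALU is A + sign * B
STEP = {
    (0, 0): (0, 0, 0),
    (0, 1): (0, +1, 0),
    (0, 2): (1, -1, 1),
    (0, 3): (0, -1, 1),
    (1, 0): (0, +1, 0),
    (1, 1): (1, +1, 0),
    (1, 2): (0, -1, 1),
    (1, 3): (0, 0, 1),
}

def multiply_2bit(b, multiplier):
    """32-bit product of b and multiplier, two multiplier bits per cycle."""
    acc = 0
    carry = 0
    for n in range(16):
        bits = (multiplier >> (2 * n)) & 3
        extra, sign, carry = STEP[(carry, bits)]
        acc = (acc + sign * (b << (2 * n + extra))) & MASK32
    return acc

print(multiply_2bit(7, 3))
```

The carry-out encodes "add 4 x B at this position", which is the same as adding B at the next position; that is why x 3 can be handled as A – B with a carry.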


Carry-propagate (a) and carry-save (b) adder structures

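[Figure: (a) carry-propagate adder: four full adders (inputs A, B, Cin; outputs Cout, S) with each Cout feeding the Cin of the next stage. (b) carry-save adder: the same four full adders with no carry chain; each produces its own S and Cout.]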
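The carry-save idea is that three operands are reduced to a partial sum and a partial carry without any carry rippling between bit positions; only one carry-propagate addition is needed at the very end. A minimal sketch with hypothetical names `carry_save_add` and `sum_operands`:

```python
def carry_save_add(a, b, c):
    """Reduce three operands to (partial_sum, partial_carry), bit-parallel."""
    partial_sum = a ^ b ^ c
    # Majority of the three bits; a carry out of bit i has weight 2^(i+1)
    partial_carry = ((a & b) | (a & c) | (b & c)) << 1
    return partial_sum, partial_carry

def sum_operands(operands):
    """Accumulate many operands in carry-save form, then resolve once."""
    s, c = 0, 0
    for x in operands:
        s, c = carry_save_add(s, c, x)
    return s + c  # the single carry-propagate addition at the end

print(carry_save_add(5, 6, 7))
print(sum_operands([1, 2, 3, 4, 5]))
```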


ARM high-speed multiplier organization

[Figure: ARM high-speed multiplier organization: Rs is shifted right 8 bits/cycle; Rm feeds the carry-save adders; the partial sum and partial carry registers (with initialization for MLA) are rotated 8 bits/cycle; the ALU adds the partials]


ARM2 register cell circuit

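[Figure: ARM2 register cell circuit: the cell is written from the ALU bus under the write signal and drives the A bus and B bus under the read A and read B signals]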


ARM register bank floorplan

[Figure: ARM register bank floorplan: A bus read decoders, B bus read decoders and write decoders; register cells and PC; Vdd and Vss rails; ALU bus, PC bus, INC bus, A bus, B bus]


ARM core datapath buses

[Figure: ARM core datapath buses: address register, incrementer, register bank, multiplier, shifter, ALU, data in, data out, instruction pipe; buses A, B, W, Ad, inc, PC, Din, shift out, instruction]


ARM control logic structure

[Figure: ARM control logic structure: instruction and coprocessor inputs; decode PLA, cycle count, multiply control, load/store multiple; outputs to address control, register control, ALU control, shifter control]