Upload
lamngoc
View
231
Download
3
Embed Size (px)
Citation preview
پيشرفته كامپيوتر معماري كامپيوتر پيشرفته معماري
MIPSMIPS معماريdata path and controldata path and control
Topics
Building a datapath support a subset of the MIPS-I instruction-set
A single cycle processor datapath all instruction actions in one (long) cycle
A multi-cycle processor datapath each instructions takes multiple (shorter) cycles
Exception support
For details see book (ch 5):
2
Datapath and Control
Registers &MemoriesFSM
orMultiplexorsMicro-
programming
Buses
ALUs
DatapathControl
3
The Processor: Datapath & Control Simplified MIPS implementation to contain only:
memory-reference instructions: lw, sw arithmetic-logical instructions: add, sub, and, or, slt control flow instructions: beq, j
Generic Implementation: Generic Implementation: use the program counter (PC) to supply instruction address get the instruction from memory read registers use the instruction to decide exactly what to do
All instructions use the ALU after reading the registersWhy?
memory reference?
4
memory-reference? arithmetic? control flow?
More Implementation Details Abstract / Simplified View:
Data
Register #Registers
Register #
Register #
Datamemory
Address
Register #
PC Instruction ALU
Instructionmemory
Address
e o y
Data
Register #
Two types of functional units: elements that operate on data values (combinational)
5
elements that contain state (sequential)
State Elements Unclocked vs. Clocked
Cl k d i h l i Clocks used in synchronous logic when should an element that contains state be updated?
falling edge
cycle timerising edge
6
An unclocked state element The set-reset (SR) latch
output depends on present inputs and also on past inputs output depends on present inputs and also on past inputs
R Q
SQ
SQ
Truth table: R S QTruth table: 0 0 Q0 1 11 0 01 1 ?
state change
7
1 1 ?
Latches and Flip-flops Output is equal to the stored value inside the element
(don't need to ask for permission to look at the value)(don t need to ask for permission to look at the value)
Change of state (value) is based on the clock Change of state (value) is based on the clock Latches: whenever the inputs change, and the clock is asserted Flip-flop: state changes only on a clock edge
(edge triggered methodology)(edge-triggered methodology)
A clocking methodology defines when signals can be read and written— wouldn't want to read a signal at the same time it was being written
8
wouldn t want to read a signal at the same time it was being written
D-latch Two inputs:
the data value to be stored (D) the data value to be stored (D) the clock signal (C) indicating when to read & store D
Two outputs:p the value of the internal state (Q) and it's complement
Q
CD
C
D
_Q
C
Q
9
D flip-flop Output changes only on the clock edge
_Q
Q
_Q
Dlatch
D
C
Dlatch
DD
C
C
D
C
Q
10
Our Implementation An edge triggered methodology
T i l ti Typical execution: read contents of some state elements, send values through some combinational logic,g g write results to one or more state elements
Stateelement
1Combinational logic
Stateelement
2
11
Clockcycle
Register File 3-ported: one write, two read ports
Read reg. #1 Readdata 1
Read reg.#2
W it #
Readdata 2
WriteWrite reg.#
Write
data
12
Register file: read ports
Read register
• Register file built using D flip-flops
Register 0
Register 1 Mu Read data 1
number 1
Register n – 1
Register n
ux
Read data 1
Read register
M
Read registernumber 2
ux
Read data 2
13Implementation of the read ports
Register file: write port Note: we still use the real clock to determine when to
write
R i t 0C
W rite
0
n -to -1de co der
R e giste r 0
R e giste r 1C
D
R eg is te r n um ber
1
Dn – 1
n
R eg is te r n – 1C
C
D
14
DR e giste r n
R e gister d ata
Simple Implementation Include the functional units we need for each instruction
Instruction
PC
Instructionmemory
Instructionaddress
Instruction Add Sum
16 32
MemWrite
Readdata
Addressy
a. Instructionmemory b. Programcounter c. Adder
16 32Sign
extend
M R d
Datamemory
Writedata
data
ALU control
Registers
Readdata 1
Readregister 1
Readregister 2
ALUALUData
Registernumbers Zero
5
5
5 3b. Sign-extension unit
MemRead
a. Data memory unit
RegWrite
Writeregister Read
data 2Writedata
result
Data
5
15
a. Registers b. ALU
Building the Datapath Use multiplexors to stitch them together
PCSrc
Add ALUresult
Mux
4
PCSrc
Add
PC Read
result
Registers
ReadReadregister 1
Shiftleft 2
ALU operation3 MemWriteALUSrcPC
Instructionmemory
address
InstructionWriteregisterWrite
Readdata 1
Readdata 2
gReadregister 2
Mux
MemtoReg
ALUresult
ZeroALU
Data
Address Readdata M
uxmemory
16 32
dataRegWrite
MemRead
DatamemoryWrite
data
x
Signextend
16
Our Simple Control Structure All of the logic is combinational
We wait for everything to settle down, and the right thing to be done ALU might not produce “right answer” right away
we use write signals along with clock to determine when to write
Cycle time determined by length of the longest path Cycle time determined by length of the longest path
State State
Clock cycle
Stateelement
1Combinational logic
Stateelement
2
17
We are ignoring some details like setup and hold times !Clock cycle
Control Selecting the operations to perform (ALU, read/write, etc.)
Controlling the flow of data (multiplexor inputs)
Information comes from the 32 bits of the instruction
Example:add $8 $17 $18 Instruction Format:add $8, $17, $18 Instruction Format:
000000 10001 10010 01000 00000 100000
op rs rt rd shamt funct
18
ALU's operation based on instruction type and function code
Control: 2 level implementation
31
bit
ter ALUop
Opc
ode 31
26Control 2
6
2
ctio
n re
gis 00: lw, sw
01: beq10: add, sub, and, or, slt
inst
ru
ALUcontrolControl 1 000: and001: or010: add110: sub
3
Func
t.
0
5 ALU111: set on less than
6
19
Datapath with Control0
Add
Add ALUresult
Mux
0
1
ShiftAdd
MemtoRegALUOp
MemReadBranchRegDst
Instruction[31–26]
4
Control
Shiftleft 2
PC Readdd
Instruction[25–21]
MemWrite
RegWriteALUSrc
ReadReadregister 1PC
Instructionmemory
address
Instruction[31–0]
Instruction[20–16]
0Mux
0
1
RegistersWriteregister
Writed t
Readdata1
Readdata2
g
Readregister 2
Mux
ALUresult
Zero
Data
Readdata
Mux
1
Instruction[15–11]
ALUAddress
16 32Instruction[15–0]
01 data
Signextend
1Data
memoryWritedata
x
ALUcontrol
20
Instruction[5–0]
control
ALU Control1 What should the ALU do with this instruction
example: lw $1, 100($2)
35 2 1 100op rs rt 16 bit offset
ALU control input
000 AND001 OR010 add110 subtract111 set-on-less-than
Why is the code for subtract 110 and not 011?
21
ALU Control1 Must describe hardware to compute 3-bit ALU control input
given instruction type 00 = lw, sw01 b ALU Operation class,01 = beq, 10 = arithmetic
function code for arithmeticD ib it i t th t bl ( t i t t )
ALU Operation class, computed from instruction type
Describe it using a truth table (can turn into gates):
ALUO F t fi ld O tioutputsintputs
ALUOp Funct field OperationALUOp1 ALUOp0 F5 F4 F3 F2 F1 F0
0 0 X X X X X X 010X 1 X X X X X X 110X 1 X X X X X X 1101 X X X 0 0 0 0 0101 X X X 0 0 1 0 1101 X X X 0 1 0 0 000
22
1 X X X 0 1 0 0 0001 X X X 0 1 0 1 0011 X X X 1 0 1 0 111
ALU Control1 Simple combinational logic (truth tables)
ALUOp
ALUOp0
ALUOp
ALU control block
Operation2
ALUOp1
Operation2
Operation1Operation
F3
F2F(5–0)
Operation0F1
F0
F (5–0)
23
Deriving Control2 signals
9 control (output) signalsInput6-bits
Instruction RegDst ALUSrcMemto-
RegReg
WriteMem Read
Mem Write Branch ALUOp1 ALUp0
R f 1 0 0 1 0 0 0 1 0
6 bits
R-format 1 0 0 1 0 0 0 1 0lw 0 1 1 1 1 0 0 0 0sw X 1 X 0 0 1 0 0 0beq X 0 X 0 0 0 1 0 1
Determine these control signals directly from the opcodes:g y pR-format: 0lw: 35sw: 43
24
sw: 43beq: 4
Control 2Op5
Inputs
PLA example implementation Op1
Op2Op3Op4
implementationOp0Op1
R-format Iw sw beq
Outputs
RegDst
ALUSrc
MemtoReg
RegWrite
MemRead
MemWrite
Branch
ALUOp1
25
p
ALUOpO
Single Cycle Implementation Calculate cycle time assuming negligible delays except:
memory (2ns), ALU and adders (2ns), register file access (1ns)
Add
4 0
Mux
1
PCSrc
Add ALU
MemWritePC Read
Instruction [25– 21]
RegWrite
4
Readd t 1
Readregister 1
0
Shiftleft 2
Add result
MemtoRegALUSrcPC
Instructionmemory
addressInstruction
[31– 0]
Instruction [20– 16]
Registers
WriteregisterWritedata
data 1
Readdata 2
Readregister 2
ALUresult
Zero
Address Readdata M
ux
1
0
Mux
1
0
Mux
1
Instruction [15– 11]
ALU
MemRead
RegDst16 32Instruction [15– 0]
0gdata
Writedata
Signextend
Datamemory
x00
ALUcontrol
26
ALUOp
Instruction [5– 0]
Single Cycle Implementation Memory (2ns), ALU & adders (2ns), reg. file access (1ns)
Fixed length clock: longest instruction is the ‘lw’ which requires 8 ns
Variable clock length (not realistic, just as exercise): R-instr: 6 ns Load: 8 ns Store: 7 ns Branch: 5 ns Jump: 2 ns
Average depends on instruction mix (see pg 374)
27
Where we are headed Single Cycle Problems:
what if we had a more complicated instruction like floating point? wasteful of area: NO Sharing of Hardware resources
One Solution: use a “smaller” cycle time have different instructions take different numbers of cycles have different instructions take different numbers of cycles a “multicycle” datapath:
PC Address
Instructionregister
DataPC
Memory
Address
Instructionor data
Data
RegistersRegister #
Register #ALU
Memorydata
A
B
ALUOut
IR
28
DataRegister #register
MDR
Multicycle Approach We will be reusing functional units
ALU used to compute address and to increment PC ALU used to compute address and to increment PC Memory used for instruction and data
Add registers after every major functional unit Add registers after every major functional unit
Our control signals will not be determined solely by g y yinstruction e.g., what should the ALU do for a “subtract” instruction?
We’ll use a finite state machine (FSM) or microcode for control
29
Review: finite state machines Finite state machines:
a set of states and next state function (determined by current state and the input) output function (determined by current state and possibly input)
Next-statefunctionCurrent state
Nextstate
ClockInputs
Outputfunction Outputs
30
We’ll use a Moore machine (output based only on current state)
Multicycle Approach Break up the instructions into steps, each step takes a
cyclecycle balance the amount of work to be done restrict each cycle to use only one major functional unit
At the end of a cycle store values for use in later cycles (easiest thing to do) introduce additional “internal” registers introduce additional internal registers
Notice: we distinguish processor state: programmer visible registers internal state: programmer invisible registers (like IR, MDR, A, B,
and ALUout)
31
Multicycle Approach
PC
Memory
MemData
Mux
0
1
R i t
Readdata1
Readregister 1
Readregister 2
0
Instruction[25–21]
Instruction[20–16]
Mux
ALUALUZeroA
ALUOut
0
1
Address
MemData
Writedata
RegistersWriteregister
Writedata
Readdata2
Mux
0
1
0
4
Instruction
Instruction[15–0]
Instructionregister
1 Mux
0
32
ALUresult
U
Instruction[15–11]
B
ALUOut
Shift
Mux
0
1
Instruction[15–0]
Sign3216
3
Memorydata
register left 2extendregister
32
Multicycle Approach Note that previous picture does not include:
branch support jump support Control lines and logic
Tclock > max (ALU delay, Memory access, Regfile access)
See book for complete picture
33
Five Execution Steps Instruction Fetch
Instruction Decode and Register Fetch
E ti M Add C t ti B h Execution, Memory Address Computation, or Branch Completion
Memory Access or R-type instruction completion
Write-back step Write back step
INSTRUCTIONS TAKE FROM 3 5 CYCLES!
34
INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!
Step 1: Instruction Fetch Use PC to get instruction and put it in the Instruction
Register Increment the PC by 4 and put the result back in the PC Can be described succinctly using RTL "Register-Transfer
Language"Language"
IR = Memory[PC];PC = PC + 4;
Can we figure out the values of the control signals? Can we figure out the values of the control signals?
What is the advantage of updating the PC now?
35
Step 2: Instruction Decode and Register Fetch Read registers rs and rt in case we need them
C t th b h dd i th i t ti i
Register Fetch
Compute the branch address in case the instruction is a branch
Previous two actions are done optimistically!!p y RTL:
A = Reg[IR[25-21]];A Reg[IR[25 21]];B = Reg[IR[20-16]];ALUOut = PC+(sign-extend(IR[15-0])<< 2);
We aren't setting any control lines based on the instruction type
(we are busy "decoding" it in our control logic)
36
(we are busy decoding it in our control logic)
Step 3 (instruction dependent) ALU is performing one of four functions, based on instruction type
Memory Reference:y
ALUOut = A + sign-extend(IR[15-0]); R-type:
ALUOut = A op B; Branch:
if (A==B) PC = ALUOut;
Jump:
PC = PC[31-28] || (IR[25-0]<<2)
37
Step 4 (R-type or memory-access) Loads and stores access memory
MDR = Memory[ALUOut];or
Memory[ALUOut] = B;Memory[ALUOut] B;
R-type instructions finish
Reg[IR[15-11]] = ALUOut;
The write actually takes place at the end of the cycle on the edge
38
Write-back step Memory read completion step
Reg[IR[20-16]]= MDR;
What about all the other instructions?
39
Summary execution stepsSteps taken to execute any instruction class
Step nameAction for R-type
instructionsAction for memory-reference
instructionsAction for branches
Action for jumps
Instruction fetch IR = Memory[PC]PC= PC+ 4PC PC 4
Instruction A = Reg [IR[25-21]]decode/register fetch B = Reg [IR[20-16]]
ALUOut = PC + (sign-extend (IR[15-0]) << 2)Execution address ALUOut = A op B ALUOut = A + sign-extend if (A ==B) then PC= PC[31-28] IIExecution, address ALUOut A op B ALUOut A + sign extend if (A B) then PC PC [31 28] IIcomputation, branch/ (IR[15-0]) PC = ALUOut (IR[25-0]<<2)jump completionMemory access or R-type Reg [IR[15-11]] = Load: MDR = Memory[ALUOut]completion ALUOut orcompletion ALUOut or
Store: Memory [ALUOut] = B
Memory read completion Load: Reg[IR[20-16]] = MDR
40
Simple Questions How many cycles will it take to execute this code?
l $t2 0($t3)lw $t2, 0($t3)lw $t3, 4($t3)beq $t2, $t3, L1 #assume not takenadd $t5 $t2 $t3add $t5, $t2, $t3sw $t5, 8($t3)
L1: ...
What is going on during the 8th cycle of execution? In what cycle does the actual addition of $t2 and $t3 takes place?
41
Implementing the Control Value of control signals is dependent upon:
what instruction is being executedhi h t i b i f d which step is being performed
Use the information we have accumulated to specify a finite t t hi (FSM)state machine (FSM) specify the finite state machine graphically, or use microprogramming
Implementation can be derived from specification
42
FSM: high level viewStart/reset
Instruction fetch, decode and register fetch
Memory access R-type Branch Jumpinstructions instructions instruction instruction
43
Graphical Specification of FSM ALUSrcA = 0
MemReadALUSrcA = 0
IorD = 0
Instruction fetchInstruction decode/
register fetch0
1
How many state bits will
d?
ALUSrcB = 11ALUOp = 00
IRWriteALUSrcB = 01ALUOp = 00
PCWritePCSource = 00
='J
')
Start
we need?
PCWritePCSource = 10
ALUSrcA = 1ALUSrcB = 00ALUOp = 01
ALUSrcA =1ALUSrcB = 00
ALUSrcA = 1ALUSrcB = 10
Jumpcompletion
BranchcompletionExecution
Memory addresscomputation (O
p=
9862
PCSource = 10PCWriteCondPCSource = 01
ALUOp = 10ALUOp = 00
Memory Memoryp=
'LW
')
RegDst = 1RegWrite
MemtoReg = 0MemWriteIorD = 1
MemReadIorD = 1
Memoryaccess
Memoryaccess R-type completion(O
p
753
Write-back step4
RegDst = 0RegWrite
MemtoReg = 1
Finite State Machine for ControlImplementation:
P C W rite
PC W riteC on dIo rDM em R eadM W it
M em to R egP C S ou rceA L U O p
IR W riteM em W rite
C on tro l log ic
A L U O pA L U S rcBA L U S rcAR egW riteR egD st
O u tp uts
N S 3N S 2N S 1N S 0Inputs
Op5
Op4
Op3
Op2
Op1
Op0
S3
S2
S1
S0
S ta te regis te rIns tru ct io n re g is ter o pcode fie ld
45
PLA Implemen-
Op5
Op4
Op3
odeImplemen
tation Op2
Op1
Op0(see fig C.14)
opco
S3
S2
S1
( g )
currentstate
S0
IorD
PCWritePCWriteCond da
If I picked a horizontal or vertical line could you explain it ?
IRWrite
MemReadMemWrite
MemtoRegPCSource1PCSource0
atapath c
you explain it ? What type of
FSM is used?Mealy or Moore?
ALUOp1
ALUSrcB0ALUSrcARegWrite
ALUSrcB1ALUOp0
control
46
Mealy or Moore?RegDstNS3NS2NS1NS0
nextstate
ROM Implementation ROM = "Read Only Memory"
values of memory locations are fixed ahead of time values of memory locations are fixed ahead of time A ROM can be used to implement a truth table
if the address is m-bits, we can address 2m entries in the ROMt t th bit f d t th t th dd i t t our outputs are the bits of data that the address points to
0 0 0 0 0 1 10 0 1 1 1 0 0
ROMaddress data
0 0 1 1 1 0 00 1 0 1 1 0 00 1 1 1 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 11 1 0 0 1 1 0
mbits
nbits
1 1 0 0 1 1 01 1 1 0 1 1 1
m is the "heigth", and n is the "width"
47
ROM Implementation How many inputs are there?
6 bits for opcode 4 bits for state = 10 address lines6 bits for opcode, 4 bits for state 10 address lines(i.e., 210 = 1024 different addresses)
H t t th ? How many outputs are there?16 datapath-control outputs, 4 state bits = 20 outputs
ROM is 210 x 20 = 20K bits (very large and a rather unusual size)
Rather wasteful, since for lots of the entries, the outputs are the same
— i e opcode is often ignored
48
i.e., opcode is often ignored
ROM ImplementationCheaper implementation:
Exploit the fact that the FSM is a Moore machine ==> Control outputs only depend on current state and not on other Control outputs only depend on current state and not on other
incoming control signals ! Next state depends on all inputs
Break up the table into two parts4 state bits tell o the 16 o tp ts 24 16 bits of ROM— 4 state bits tell you the 16 outputs, 24 x 16 bits of ROM
— 10 bits tell you the 4 next state bits, 210 x 4 bits of ROM— Total number of bits: 4.3K bits of ROM
49
ROM vs PLA PLA is much smaller
can share product terms (ROM has an entry (=address) for every product tterm
only need entries that produce an active output can take into account don't cares
Size of PLA:(#inputs #product terms) + (#outputs #product terms)(#inputs #product-terms) + (#outputs #product-terms) For this example: (10x17)+(20x17) = 460 PLA cells
PLA cells usually slightly bigger than the size of a ROM cell
50
Exceptions Unexpected events
E t l i t t External: interrupt e.g. I/O request
Internal: exception Internal: exception e.g. Overflow, Undefined instruction opcode, Software trap,
Page fault
How to handle exception? Jump to general entry point (record exception type in status
register) Jump to vectored entry point Address of faulting instruction has to be recorded !
51
g
ExceptionsChanges needed: see fig. 5.48 / 5.49 / 5.50
Extend PC input mux with extra entry with fixed address: “C000000hex”
Add EPC register containing old PC (we’ll use the ALU to decrement PC with 4)
t i t ALU 2 d d ith fi d l 4 extra input ALU src2 needed with fixed value 4
Cause register (one bit in our case) containing: 0: undefined instruction 1: ALU overflow
Add 2 states to FSMd fi d i t t t #10
52
undefined instr. state #10 overflow state #11
Exceptions Legend:IntCause =0/1 type of exceptionCauseWrite write Cause registerCauseWrite write Cause registerALUSrcA = 0 select PCALUSrcB = 01 select constant 4ALUOp = 01 subtract operationEPCWrite write EPC register with current PCPCWrite write PC with exception addressPCSource =11 select exception address: C000000hex
2 New states:PCSource =11 select exception address: C000000hex
#10 d fi d i t ti #11 fl
IntCause =0CauseWrite
ALUSrcA = 0
IntCause =1CauseWrite
ALUS A 0
#10 undefined instruction #11 overflow
ALUSrcA 0ALUSrcB = 01ALUOp = 01
EPCWritePCWrite
PCSource =11
ALUSrcA = 0ALUSrcB = 01ALUOp = 01
EPCWritePCWrite
PCSource 11PCSource 11 PCSource =11
53
To state 0 (begin of next instruction)