How Tomasulo Algorithm works. And why it works.
Understanding the Tomasulo Algorithm
Yichao Cheng Jul 23, 2013
Background
IBM System/360 Model 91
FPU’s add/mul/div take 2/3/13 cycles
Can performance be improved by using multiple execution units (adder, mul/div)?
Major Contributions
Proposed three innovative mechanisms:
Common data bus (CDB)
Register tagging scheme
Reservation stations
which permit:
Out-of-order execution of independent instructions
while preserving the essential precedences in the instruction stream
Doubt
When people talk about the Tomasulo algorithm, they talk about register renaming
However, this term can’t be found in the original paper
How could anyone invent a thing without noticing it?
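Architecture Overview
[Figure: block diagram. The Instruction Unit and Storage sit outside the FPU. Inside the FPU: FLOS (floating-point operation stack), decoder, FLB (floating-point buffers), SDB (store data buffers), FLR (floating-point registers), and the adder and mul/div units.]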
From the FPU’s perspective
All instructions are ‘register-to-register’:
Register-to-register arithmetic
Storage-to-register arithmetic
Load
Store
The Instruction Unit (outside the FPU) is in charge of address generation and memory access.
‘sink’ and ‘source’
Equivalent to destination and source
For example, in AD R1, R2, R1 is both a sink and a source
[Figure: a value flows from the source to the sink.]
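The sink/source vocabulary can be made concrete with a small sketch. The helper name `sink_and_sources` and the string format are assumptions made for illustration; the mnemonics (AD, LD, STD) and operand names are the ones used on these slides.

```python
def sink_and_sources(instr):
    """Split a two-operand FPU instruction into (sink, [sources]).

    Hypothetical helper: it only knows the mnemonics used in this deck.
    """
    op, operands = instr.split(maxsplit=1)
    first, second = [x.strip() for x in operands.split(",")]
    if op == "LD":            # load: the buffer is the source, the register the sink
        return first, [second]
    if op == "STD":           # store: the register is the source, the SDB the sink
        return second, [first]
    # arithmetic: the first operand is read and then overwritten
    return first, [first, second]


print(sink_and_sources("AD R1, R2"))     # R1 is both a sink and a source
print(sink_and_sources("LD R1, FLB1"))
print(sink_and_sources("STD R1, SDB1"))
```

For arithmetic, the first operand appears on both sides, which is why it is both a sink and a source.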
The four instruction types, as the FPU sees them:
1. Reg-to-reg arithmetic: AD R1, R2
2. Storage-to-reg arithmetic: AD R1, FLB
3. Load: LD R1, FLB1
4. Store: STD R1, SDB1
[Figures: the data path of each type through the FLOS, decoder, FLR, FLB, SDB, adder and mul/div units.]
Timing Sequence
1. Reg-to-reg arithmetic
IU: Decode
EU: Decode, 2 operands to ALU, Execute, Write back to FLR
2. Storage-to-reg arithmetic
IU: Decode, Addr Gen, Mem Read
EU: Decode, FLR to ALU and FLB to ALU, Execute, Write back to FLR
3. Load
IU: Decode, Addr Gen, Mem Read
EU: Decode, FLR to ALU and FLB to ALU, Execute, Write back to FLR
4. Store
IU: Decode, Addr Gen, Mem Write
EU: Decode, FLR to ALU, Execute, Write to SDB
A Day in the Life of ‘LD R1, addr’
[Figure sequence, one frame per step:]
1. The Instruction Unit decodes the instruction and generates the address (Decode & Address generation).
2. Storage delivers the data at addr into FLB1, and the instruction enters the FLOS as LD R1, FLB1.
3. The decoder passes the operation (OP) to the adder.
4. The result is written to R1 in the FLR.
An Example of Dependence
LD F0, FLB1
MD F0, FLB2
What if we send them to different execution units at the same time to exploit parallelism?
The result (F0) would not reflect the effect of LD, because MD would use the old value of F0
This is called a true dependence, a.k.a. RAW (read after write)
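A Simple Solution
‘busy’ bit scheme
[Figure: the FLR registers R0 to R3, each with a busy bit (B).]
LD R1, B sets the busy bit of R1: “I’m already the sink of some instruction”
MD R1, A needs R1 as a source (“I need your content”), so it must wait until the bit is cleared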
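A minimal sketch of the busy-bit idea, assuming one bit per floating-point register that is set when an instruction claims the register as its sink and cleared on write-back. The class and method names are hypothetical and not taken from the Model 91 design.

```python
class BusyBitRegisterFile:
    """One busy bit per register (illustrative model only)."""

    def __init__(self, names):
        self.busy = {name: False for name in names}

    def can_issue(self, sink, sources):
        # An instruction may issue only if none of its registers is busy.
        return not any(self.busy[r] for r in (sink, *sources))

    def issue(self, sink):
        self.busy[sink] = True      # "I'm already the sink of some instruction"

    def write_back(self, sink):
        self.busy[sink] = False     # the value has arrived in the register


flr = BusyBitRegisterFile(["R0", "R1", "R2", "R3"])
flr.issue("R1")                                  # LD R1, B is in flight
stalled = not flr.can_issue("R1", ["R1"])        # MD R1, A needs R1
flr.write_back("R1")                             # LD completes
ready = flr.can_issue("R1", ["R1"])
print(stalled, ready)
```

The scheme is correct but conservative: the waiting instruction sits in the decoder, so nothing behind it can proceed either.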
Performance Degrades...
When the code keeps using one register
E.g. MD F0, E
AD F2, F0
AD F4, A
AD F2, F4
Overlap fails because the first AD depends on MD, though the others don’t
The second AD is qualified to issue, but it is stuck behind the first
Cause of the Problem
If one instruction gets stuck (due to a dependence), the following ones can’t be decoded (even if they are qualified to issue)
Solution:
Decouple dependence maintenance from decoding
Look ahead at more instructions for concurrency
Dispatch and Issue Decoupling
MD F0, E
AD F2, F0
AD F4, A
AD F2, F4
Before: the decoder decides “Can issue?” by asking “Is that reg busy?”
After: the decoder dispatches anyway; each instruction waits at its execution unit (e.g. the adder) and asks “Are my operands ready?” to decide “Can issue?”
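The difference between the two policies can be sketched on the four-instruction example. This is a simplified model under stated assumptions: storage operands (E, A) are always ready, no instruction has completed yet, and the function names are hypothetical.

```python
# (mnemonic, sink, register sources); the sink of an arithmetic op is also a source
program = [
    ("MD", "F0", ["F0"]),          # MD F0, E
    ("AD", "F2", ["F2", "F0"]),    # AD F2, F0
    ("AD", "F4", ["F4"]),          # AD F4, A
    ("AD", "F2", ["F2", "F4"]),    # AD F2, F4
]


def issue_in_order(program):
    """Busy-bit decoder: stop at the first instruction touching a busy register."""
    busy, issued = set(), []
    for i, (_, sink, sources) in enumerate(program):
        if busy & {sink, *sources}:
            break                   # this and every later instruction is blocked
        busy.add(sink)
        issued.append(i)
    return issued


def dispatch_then_issue(program):
    """Dispatch everything; an instruction can start once no producer is pending."""
    producer, ready = {}, []
    for i, (_, sink, sources) in enumerate(program):
        waiting_on = [producer[r] for r in sources if r in producer]
        if not waiting_on:
            ready.append(i)
        producer[sink] = i          # later readers of the sink wait for this one
    return ready


print(issue_in_order(program))       # only MD gets through the decoder
print(dispatch_then_issue(program))  # MD and the second AD can both start
```

With decoupling, the second AD no longer waits behind the first, which is the look-ahead the slide asks for.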
An Example of True Dependence
LD F0, FLB1 (F0 as sink)
AD F2, F0 (F0 as source)
Assume the CDB has not been introduced yet
[Figure sequence over the FLB, FLR, adder and mul/div units, one frame per step:]
1. LD F0, FLB1 dispatches to A1. F0 is marked busy (B) and tagged A1: F0 is reserved for some instruction, and its content will be calculated by A1.
2. AD F2, F0 dispatches to A2: “I need the value of F0, but it seems to be busy. Since A1 is the producer, just ask him for it.” The instruction is held as AD F2, A1.
3. LD F0, FLB1 is executing. Operands are ready. Execute!
4. LD F0, FLB1 broadcasts its result: “I’m A1. Who needs my result? Over.”
5. Everything tagged with A1 (the busy register F0 and the waiting AD F2, A1) picks up the value: “I depend on A1!” “Me too!”
The Role of CDB
The Common Data Bus is in charge of value forwarding
In the reg-to-reg model, a value is passed through a register (write & read): F0 is written as a sink (producer) and read as a source (consumer)
Load/Store doesn’t need to go through the ALU
The dependence management is decoupled from execution, as expected
[Figure: the CDB connects the FLB, the SDB, the FLR and the reservation stations (Resv. S) for Add and for Mul.]
Producers, i.e. all units which can alter a register: Add reservation stations (P:3), Mul reservation stations (P:2), FLB (P:6)
Consumers, i.e. all units which may take a register as an operand: FLR (C:4), SDB (C:3), Add reservation stations (C:3*2), Mul reservation stations (C:2*2)
The Implementation of CDB
A consumer recognizes its producer by a tag
Producers take turns putting <tag, value> on the bus (each makes a request first)
If the tag matches, the consumer ingates the value
[Figure sequence: six producers (P) and six consumers (C); three consumers hold the tags X, Y, Y. A producer requests the bus (2 cycles). When <Y, value> is broadcast, both consumers tagged Y ingate it; after another request, <X, value> is ingated by the consumer tagged X.]
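The tag-matching step can be sketched as follows. This is an illustrative model, not the hardware: bus arbitration (the request phase) is left out, and the names `Slot`, `CommonDataBus`, `attach` and `broadcast` are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    """A consumer slot: either waits on a producer tag or holds a value."""
    tag: Optional[str] = None
    value: Optional[float] = None


class CommonDataBus:
    def __init__(self):
        self.consumers = []

    def attach(self, slot):
        self.consumers.append(slot)
        return slot

    def broadcast(self, tag, value):
        """Put <tag, value> on the bus; every slot with a matching tag ingates it."""
        hits = 0
        for slot in self.consumers:
            if slot.tag == tag:
                slot.value, slot.tag = value, None
                hits += 1
        return hits


cdb = CommonDataBus()
a = cdb.attach(Slot(tag="X"))
b = cdb.attach(Slot(tag="Y"))
c = cdb.attach(Slot(tag="Y"))

hits_y = cdb.broadcast("Y", 2.5)   # both Y consumers pick up the value
hits_x = cdb.broadcast("X", 1.0)
print(hits_y, hits_x, a, b, c)
```

One broadcast serves any number of consumers, so the producer never needs to know who is waiting for its result.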
The Principle behind the Scenes
A tag is a pointer to the producer of a value required by the current instruction
The pointers make up the dependency information that is hidden by the reg-reg model (discussed later)
With this information, the order of execution can be resolved
The CDB enables ‘producer-consumer’ style data flow
An Example for False Dependence
LD F0, FLB1
AD F2, F0
LD F0, FLB2
AD F3, F0
The code contains WAW and WAR dependences on F0
[Figure sequence over the FLB, FLR, adder and mul/div units, one frame per step:]
1. LD F0, FLB1 dispatches. F0 is marked busy (B) with tag FLB1.
2. AD F2, F0 dispatches to A1, waiting on tag FLB1.
3. LD F0, FLB2 dispatches. The tag of F0 is overwritten with FLB2.
4. AD F3, F0 dispatches to A2, waiting on tag FLB2.
Keep tracing the source of the value instead of the register holding it
There’s no need to rename a register (naming is just a way of referring to values)
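The four-instruction example can be replayed in a small sketch to see why tags make the false dependences harmless. Everything here is a simplified assumption: the `Machine` class, its method names, the initial register values, and the choice to let the second load finish first.

```python
class Machine:
    def __init__(self):
        self.reg_value = {"F0": 0.0, "F2": 10.0, "F3": 20.0}
        self.reg_tag = {}     # register -> tag of its pending producer
        self.stations = {}    # station name -> operands still being collected

    def read(self, reg):
        # Hand out the producer's tag if the register is pending, else its value.
        if reg in self.reg_tag:
            return ("tag", self.reg_tag[reg])
        return ("val", self.reg_value[reg])

    def dispatch_load(self, sink, buffer):
        self.reg_tag[sink] = buffer

    def dispatch_add(self, station, sink, source):
        self.stations[station] = [self.read(sink), self.read(source)]
        self.reg_tag[sink] = station

    def broadcast(self, tag, value):
        for name, ops in self.stations.items():
            self.stations[name] = [("val", value) if op == ("tag", tag) else op
                                   for op in ops]
        for reg, pending in list(self.reg_tag.items()):
            if pending == tag:          # only the latest producer may write
                self.reg_value[reg] = value
                del self.reg_tag[reg]

    def try_execute(self, station):
        ops = self.stations[station]
        if any(kind == "tag" for kind, _ in ops):
            return None                 # still waiting for a producer
        result = sum(v for _, v in ops)
        del self.stations[station]
        self.broadcast(station, result)
        return result


m = Machine()
m.dispatch_load("F0", "FLB1")       # LD F0, FLB1
m.dispatch_add("A1", "F2", "F0")    # AD F2, F0  waits on FLB1
m.dispatch_load("F0", "FLB2")       # LD F0, FLB2  retags F0
m.dispatch_add("A2", "F3", "F0")    # AD F3, F0  waits on FLB2

m.broadcast("FLB2", 2.0)            # assume the second load finishes first
r2 = m.try_execute("A2")
m.broadcast("FLB1", 1.0)            # the first load arrives late
r1 = m.try_execute("A1")
print(r1, r2, m.reg_value)
```

Each AD receives the value of the load it was dispatched after, and the late FLB1 result does not overwrite F0, because F0 no longer carries that tag.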
Timing Sequence with Busy Bit
[Figure: pipeline timing diagram for LD F0, FLB1; AD F2, F0; LD F0, FLB2; AD F3, F0, with stages D, AG, FLB, T, EX, WB.]
Timing Sequence with Reservation Station
[Figure: pipeline timing diagram for LD F0, FLB1; AD F2, F0; LD F0, FLB2; AD F3, F0, with stages D, AG, FLB, T, EX, WB.]
The Side Effect of Register Machine
What are the differences between a circuit and a register machine?
Register machine: general purpose, control-driven, implicit dependence via registers
Circuit: special purpose, data-driven, exposed dependence
...But registers are scarce
Conclusion
The Tomasulo algorithm has nothing to do with register renaming
It resolves WAR & WAW by eliminating the side effect of using registers to pass values
With the Tomasulo algorithm, the execution of a program is driven by data flow, thus exploiting maximum concurrency