Understanding the Tomasulo Algorithm


How the Tomasulo algorithm works, and why it works.


Yichao Cheng Jul 23, 2013

Background

IBM System/360 Model 91

The FPU’s add/mul/div operations take 2/3/13 cycles

Can performance be improved through utilizing multiple execution units?

[Diagram: the execution units, Adder and Mul/Div]

Major Contributions

Proposed three innovative mechanisms:

Common data busing (CDB)

Register tagging scheme

Reservation stations

which permits:

Out-of-order execution of independent instructions

while preserving the essential precedences in the instruction stream

Doubt

When people talk about the Tomasulo algorithm, they talk about register renaming

However, this term does not appear in the original paper

How could anyone invent a thing without noticing it?

Architecture Overview

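[Diagram: the Instruction Unit and Storage feed the FPU. Inside the FPU: FLOS (floating-point operation stack), Decoder, FLR (floating-point registers), FLB (floating-point buffers), SDB (store data buffers), and the Adder and Mul/Div execution units]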

From the FPU’s perspective

All instructions are ‘register-to-register’

Register-to-register arithmetic

Storage-to-register arithmetic

Load

Store

The Instruction Unit (outside the FPU) is in charge of address generation and memory access.

‘sink’ and ‘source’

Equivalent to destination and source

For example, AD R1, R2

R1 is both a sink and a source; R2 is a source only

[Diagram labels: source, sink, value]

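1. Reg-to-reg arithmetic: AD R1, R2

[Diagram: data path on the architecture overview; labels: FLOS, Decoder, FLR, FLB, SDB, Storage, Adder, Mul/Div]

2. Storage-to-reg arithmetic: AD R1, FLB

[Diagram: data path on the architecture overview; labels: FLOS, Decoder, Storage, FLB, FLR, Adder, Mul/Div, SDB]

3. Load: LD R1, FLB1

[Diagram: data path on the architecture overview; labels: FLOS, Decoder, Storage, FLB, FLR, Adder, Mul/Div, SDB, and a constant 0]

4. Store: STD R1, SDB1

[Diagram: data path on the architecture overview; labels: FLOS, Decoder, Storage, FLB, FLR, Adder, SDB, Mul/Div, and a constant 0]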

Timing Sequence

1. Reg-to-reg arithmetic

[Timing diagram. IU: Decode. EU: Decode, 2 operands to ALU, Execute, Write back to FLR]

2. Storage-to-reg arithmetic

[Timing diagram. IU: Decode, Addr Gen, Mem Read. EU: Decode, FLR to ALU, FLB to ALU, Execute, Write back to FLR]

3. Load

[Timing diagram. IU: Decode, Addr Gen, Mem Read. EU: Decode, FLR to ALU, FLB to ALU, Execute, Write back to FLR]

4. Store

[Timing diagram. IU: Decode, Addr Gen, Mem Write. EU: Decode, FLR to ALU, Execute, Write to SDB]

A Day in the Life of ‘LD R1, addr’

[Animation on the architecture overview, frame by frame:]

1. Instruction Unit: decode and address generation; the operand at addr in Storage is assigned to buffer FLB1

2. The instruction enters the FLOS as LD R1, FLB1

3. The Decoder sends the operation (OP) to the Adder

4. R1 in the FLR receives the result

An Example of Dependence

LD F0, FLB1

MD F0, FLB2

What if we send them to different execution units at the same time to exploit parallelism?

The result (F0) cannot reflect the effect of LD, because MD uses the old value of F0

This is called a true dependence, a.k.a. RAW (read after write)

A Simple Solution

‘busy’ bit scheme

LD R1, B

MD R1, A

[Diagram: registers R0 to R3, each with a busy bit (B); the busy bit of R1 is set]

R1: "I am already the sink of some instruction"

MD R1, A: "I need your content"
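The busy-bit interlock above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and method names (`BusyBitRegisters`, `try_decode`, `write_back`) are hypothetical, and it assumes that an instruction is held at decode whenever any register it names is busy.

```python
class BusyBitRegisters:
    """Hypothetical model of a register file with one busy bit per register."""

    def __init__(self, names):
        self.busy = {name: False for name in names}

    def try_decode(self, sink, sources=()):
        """Decode succeeds only if no named register is busy; then reserve the sink."""
        used = (sink, *sources)
        if any(self.busy.get(reg, False) for reg in used):
            return False          # the decoder has to stall
        self.busy[sink] = True    # the sink is now owned by this instruction
        return True

    def write_back(self, sink):
        """The producing instruction finished: release the register."""
        self.busy[sink] = False


regs = BusyBitRegisters(["R0", "R1", "R2", "R3"])
ld_ok = regs.try_decode("R1")              # LD R1, B: R1 is the sink
md_ok = regs.try_decode("R1", ["R1"])      # MD R1, A: R1 is still busy
regs.write_back("R1")                      # LD completes
md_retry = regs.try_decode("R1", ["R1"])   # MD can now be decoded
```

Storage operands such as A and B are not registers, so they never appear in the busy check.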

Performance Degrades...

When the code keeps using one register

E.g. MD F0, E

AD F2, F0

AD F4, A

AD F2, F4

Overlap fails because the first AD depends on MD, though the others don’t

The second AD is qualified to issue

Cause of the Problem

If one instruction gets stuck (due to a dependence), the following instructions can’t be decoded (even if they are qualified to issue)

Solution:

Decouple dependence maintenance from decoding

Look ahead at more instructions to find concurrency

Dispatch and Issue Decoupling

MD F0, E

AD F2, F0

AD F4, A

AD F2, F4

Before: the Decoder asks "Is that reg busy?" and decides "Can issue?" itself, in front of the Adder

After: the Decoder dispatches anyway; each dispatched instruction (e.g. MD F0, E) waits in front of the execution unit and asks "Are my operands ready?" to decide "Can issue?"
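A minimal sketch of the difference between the two schemes, using the four-instruction example from the slides. The function names (`stalled_decode`, `dispatch_all`) are hypothetical, and the model assumes that every arithmetic instruction reads both its sink and its source.

```python
PROGRAM = [("MD", "F0", "E"), ("AD", "F2", "F0"),
           ("AD", "F4", "A"), ("AD", "F2", "F4")]


def stalled_decode(program):
    """Busy-bit scheme: decoding stops at the first instruction that names a busy register."""
    busy, decoded = set(), []
    for op, sink, source in program:
        if sink in busy or source in busy:
            break                      # everything behind it is blocked too
        busy.add(sink)
        decoded.append((op, sink, source))
    return decoded


def dispatch_all(program):
    """Decoupled scheme: every instruction is dispatched and records whom it waits on."""
    producer = {}                      # register -> index of its latest in-flight writer
    stations = []
    for idx, (op, sink, source) in enumerate(program):
        waits = sorted({producer[r] for r in (sink, source) if r in producer})
        stations.append(((op, sink, source), waits))
        producer[sink] = idx
    return stations


decoded = stalled_decode(PROGRAM)
ready = [i for i, (_, waits) in enumerate(dispatch_all(PROGRAM)) if not waits]
```

With the busy-bit decoder only MD F0, E gets through. With decoupled dispatch, MD F0, E and AD F4, A (indices 0 and 2) are both ready to issue, while the other two wait for their producers.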

An Example of True Dependence

LD F0, FLB1 (F0 as sink)

AD F2, F0 (F0 as source)

Assume CDB has not been introduced yet

[Animation on the Adder, Mul/Div, FLB and FLR diagram, frame by frame:]

1. LD F0, FLB1 dispatches to A1. F0 is marked busy (B) with tag A1: F0 is reserved for some instruction, and its content is calculated by A1

2. AD F2, F0: "I need the value of F0, but he seems to be busy"

3. AD F2, F0 dispatches to A2: "Since A1 is the producer, just ask him for it." The instruction waits in A2 as AD F2, A1

4. LD F0, FLB1 is executing

5. Operands are ready. Execute!

6. LD F0, FLB1 broadcasts its result to the air: "I’m A1. Who needs my result? Over."

7. AD F2, A1: "I depend on A1!"

8. F0 (tagged A1): "Me too!"
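The walkthrough above can be mimicked with a small data structure in which every register or reservation-station operand holds either a value or the tag of its producer. This is a sketch, not the Model 91 hardware: the `Slot` class and the numeric values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    """A register or station operand: a value, or a tag naming the producer."""
    value: Optional[float] = None
    tag: Optional[str] = None

    def capture(self, tag, value):
        """Take a broadcast result if it comes from the producer we wait on."""
        if self.tag == tag:
            self.value, self.tag = value, None


# LD F0, FLB1 was dispatched to A1, so F0 is tagged with A1.
f0 = Slot(tag="A1")
f2 = Slot(value=1.5)                       # hypothetical current content of F2

# AD F2, F0 dispatches to A2: it copies what the registers hold right now,
# which is a value for F2 and the tag A1 for F0 (the "AD F2, A1" on the slide).
a2_sink = Slot(value=f2.value, tag=f2.tag)
a2_source = Slot(value=f0.value, tag=f0.tag)
f2.tag = "A2"                              # F2 will be produced by A2

# A1 finishes and broadcasts; everyone tagged A1 answers "I depend on A1!".
for slot in (f0, f2, a2_sink, a2_source):
    slot.capture("A1", 4.0)                # 4.0 is a made-up loaded value

ready = a2_sink.tag is None and a2_source.tag is None
```

After the broadcast both F0 and the waiting operand in A2 hold the loaded value, and the AD in A2 is ready to execute.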

The Role of CDB

The Common Data Bus is in charge of value forwarding

In the reg-to-reg model, a value is passed through a register (write & read)

[Diagram: F0 is written as a sink (Producer), then read as a source (Consumer)]

Load/Store doesn’t need to go through the ALU

The dependence management is decoupled from execution, as expected

[Diagram: the CDB connects the reservation stations for Add and for Mul, the FLB, the SDB and the FLR]

Producer: all units which can alter a register: Add reservation stations (P:3), Mul reservation stations (P:2), FLB (P:6)

Consumer: all units which may take a register as an operand: FLR (C:4), SDB (C:3), Mul reservation stations (C:2*2), Add reservation stations (C:3*2)

The Implementation of CDB

A consumer recognizes its producer by tagging

Producers put <tag, value> on the bus in turn (each makes a request first)

If the tag matches, the consumer ingates the value

[Animation: six producers (P) and six consumers (C) on the bus; three consumers hold unspecified tags, the others hold tags X, Y, Y. Producer Y makes a request (2 cycles), then puts its value on the bus. Producer X then makes a request and puts its value on the bus]
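The three rules above (tagging, taking turns on the bus, ingating on a tag match) fit in a short sketch. The `CommonDataBus` class is hypothetical; it grants the bus in request order, which is an assumption, and it does not model the two-cycle request lead time shown on the slide.

```python
from collections import deque


class CommonDataBus:
    """Hypothetical single shared bus carrying one <tag, value> pair per cycle."""

    def __init__(self):
        self.requests = deque()   # producers waiting for the bus, in request order
        self.consumers = []       # each consumer is a dict with "tag" and "value"

    def request(self, tag, value):
        """A producer asks for the bus before it may drive its result."""
        self.requests.append((tag, value))

    def cycle(self):
        """Grant the bus to one producer; matching consumers ingate the value."""
        if not self.requests:
            return None
        tag, value = self.requests.popleft()
        for consumer in self.consumers:
            if consumer["tag"] == tag:
                consumer["value"], consumer["tag"] = value, None
        return tag


bus = CommonDataBus()
bus.consumers = [{"tag": t, "value": None} for t in ("X", "Y", "Y")]
bus.request("Y", 2.0)             # made-up results
bus.request("X", 7.0)

first = bus.cycle()
after_first = [c["value"] for c in bus.consumers]
second = bus.cycle()
after_second = [c["value"] for c in bus.consumers]
```

One broadcast of Y serves both consumers tagged Y at once; the consumer tagged X is untouched until X gets its turn.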

The Principle behind the Scene

A tag is a pointer to the producer of the value required by the current instruction

The pointers make up the dependency information, which is hidden by the reg-to-reg model (discussed later)

With this information, the order of execution can be resolved

CDB enables ‘producer-consumer’ style data flow

An Example for False Dependence

LD F0, FLB1

AD F2, F0

LD F0, FLB2

AD F3, F0

WAW: both LDs write F0. WAR: AD F2, F0 reads F0 before LD F0, FLB2 overwrites it

[Animation on the Adder, Mul/Div, FLB and FLR diagram, frame by frame:]

1. LD F0, FLB1 dispatches: F0 is marked busy (B) with tag FLB1

2. AD F2, F0 dispatches to A1

3. LD F0, FLB2 dispatches: the tag of F0 changes to FLB2

4. AD F3, F0 dispatches to A2

Keep tracing the source of the value instead of the register holding it

There’s no need to rename a register (naming is just a way of referring to values)
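To see why tracing the source of a value removes the WAW and WAR hazards, the four-instruction example can be replayed with the two loads completing in either order. This is a simplified sketch: the `run` function, the buffer contents and the initial register values are all assumptions, and execution latency is ignored.

```python
def run(arrival_order):
    """Replay the example; arrival_order lists the buffers as their data returns."""
    memory = {"FLB1": 10.0, "FLB2": 20.0}            # made-up buffer contents
    regs = {"F0": {"value": 0.0, "tag": None},
            "F2": {"value": 1.0, "tag": None},
            "F3": {"value": 2.0, "tag": None}}
    stations = {}

    def load(sink, buf):
        regs[sink]["tag"] = buf                      # overwrites any older tag

    def add(station, sink, source):
        # Copy whatever the registers hold now: a value or a producer tag.
        stations[station] = {"a": dict(regs[sink]), "b": dict(regs[source])}
        regs[sink]["tag"] = station

    def broadcast(tag, value):
        for reg in regs.values():
            if reg["tag"] == tag:
                reg.update(value=value, tag=None)
        for st in stations.values():
            for operand in (st["a"], st["b"]):
                if operand["tag"] == tag:
                    operand.update(value=value, tag=None)

    load("F0", "FLB1")
    add("A1", "F2", "F0")
    load("F0", "FLB2")
    add("A2", "F3", "F0")

    for buf in arrival_order:
        broadcast(buf, memory[buf])
        for name, st in list(stations.items()):
            if st["a"]["tag"] is None and st["b"]["tag"] is None:
                del stations[name]                   # the station issues and frees up
                broadcast(name, st["a"]["value"] + st["b"]["value"])
    return {name: reg["value"] for name, reg in regs.items()}


in_order = run(["FLB1", "FLB2"])
out_of_order = run(["FLB2", "FLB1"])
```

Both orders give the same register contents. When FLB1 arrives late, no register is tagged FLB1 any more, so the stale value only reaches the station that asked for it (A1) and never overwrites F0.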

Timing Sequence with Busy Bit

LD F0, FLB1

AD F2, F0

LD F0, FLB2

AD F3, F0

[Timing diagram: one row per instruction; stage labels D, AG, FLB, T, EX, WB]

Timing Sequence with Reservation Station

[Timing diagram for the same four instructions; stage labels D, AG, FLB, T, EX, WB]

The Side Effect of Register Machine

What are the differences between a circuit and a register machine?

Register machine: general purpose, control-driven, implicit dependence via registers

Circuit: special purpose, data-driven, exposed dependence

...But registers are scarce

Conclusion

The Tomasulo algorithm has nothing to do with register renaming

It resolves WAR and WAW hazards by eliminating the side effect of using registers to pass values

With the Tomasulo algorithm, the execution of a program is driven by data flow, which exposes as much concurrency as the data dependences allow
