Understanding the Tomasulo Algorithm

Yichao Cheng Jul 23, 2013

Background

IBM System/360 Model 91

The FPU’s add/mul/div take 2/3/13 cycles

Can performance be improved through utilizing multiple execution units?

[Figure: Adder and Mul/Div execution units]

Major Contributions

Proposed three innovative mechanisms:

Common data bus (CDB)

Register tagging scheme

Reservation station

which permits:

Out-of-order execution of independent instructions

while preserving the essential precedences in the instruction stream

When people talk about the Tomasulo algorithm, they talk about register renaming

However, this term can’t be found in the original paper

How could anyone invent a thing without noticing it?

Architecture Overview

[Figure: block diagram with Storage, the Instruction Unit, the Decoder, the FLR, and the Adder and Mul/Div units]

From the FPU’s perspective

All instructions are ‘register-to-register’

Register-to-register arithmetic

Storage-to-register arithmetic

The Instruction Unit (outside the FPU) is in charge of address generation and memory access.

‘Sink’ and ‘source’

Equivalent to destination and source

For example, in AD R1, R2

R1 is both a sink and a source; R2 is a source

1. Reg-to-reg arithmetic: AD R1, R2

2. Storage-to-reg arithmetic: AD R1, FLB

3. Load: LD R1, FLB1

4. Store: STD R1, SDB1

[Figures: for each of the four cases, the data path through Storage, the Decoder, the FLR, the FLB or SDB, and the Adder and Mul/Div units]

Timing Sequence

1. Reg-to-reg arithmetic

IU: Decode

EU: Decode, 2 operands to ALU, Execute, Write back to FLR

2. Storage-to-reg arithmetic

IU: Decode, Addr Gen, Mem Read

EU: Decode, FLR to ALU and FLB to ALU, Execute, Write back to FLR

3. Load

IU: Decode, Addr Gen, Mem Read

EU: Decode, FLR to ALU and FLB to ALU, Execute, Write back to FLR

4. Store

IU: Decode, Addr Gen, Mem Write

EU: Decode, FLR to ALU, Execute, Write to SDB

A Day in the Life of ‘LD R1, addr’

[Figure sequence: LD R1, addr passes through Decode & Address generation in the Instruction Unit, reaches the FPU as LD R1, FLB1, and moves through the FLOS and the Decoder; Storage, FLB, SDB, FLR, Adder and Mul/Div units are shown]

An Example of Dependence

LD F0, FLB1

MD F0, FLB2

What if we send them to different execution units at the same time to exploit parallelism?

The result (F0) cannot reflect the effect of LD, because MD uses the old value of F0

This is called a true dependence, a.k.a. RAW (read after write)

A Simple Solution

‘Busy’ bit scheme

LD R1, B

MD R1, A

R1 (busy bit set): “I’m already the sink of some instruction”

MD: “I need your content”
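
The busy bit check can be sketched roughly as follows. This is a minimal illustration, not the Model 91 logic: the names `busy`, `try_decode` and `write_back` are hypothetical, and the sketch assumes an instruction stalls whenever its sink or any source register is still marked busy.

```python
# Hypothetical model of the 'busy' bit scheme (all names are illustrative).
busy = {"R1": False}

def try_decode(sink, sources):
    """Return True if the instruction may proceed, False if it must stall."""
    if any(busy.get(reg, False) for reg in [sink, *sources]):
        return False          # some register is still the sink of an earlier instruction
    busy[sink] = True         # this instruction now owns the sink
    return True

def write_back(sink):
    """The producer has written its result, so the register is free again."""
    busy[sink] = False

first = try_decode("R1", [])        # LD R1, B  : proceeds and marks R1 busy
second = try_decode("R1", ["R1"])   # MD R1, A  : stalls, R1 is busy
write_back("R1")                    # LD completes
third = try_decode("R1", ["R1"])    # MD R1, A  : may proceed now
```

The scheme is correct but conservative: the second instruction can do nothing until the first one has written back.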

Performance Degrades...

When the code keeps reusing one register

E.g.

MD F0, E

AD F2, F0

AD F4, A

AD F2, F4

Overlap fails because the first AD depends on MD and holds up the instructions behind it, though the others do not depend on MD

The second AD is qualified to issue

Cause of the Problem

If one instruction gets stuck (due to a dependence), the following ones can’t be decoded (even if they are qualified to issue)

Solution:

Decouple dependence maintenance from decoding

Look ahead at more instructions for concurrency

Dispatch and Issue Decoupling

MD F0, E

AD F2, F0

AD F4, A

AD F2, F4

Before: Decode asks “Is that reg busy?” to decide “Can issue?”

After: Decode dispatches anyway; each dispatched instruction (e.g. MD F0, E) asks “Are my operands ready?” to decide “Can issue?”
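
The difference between the two schemes can be sketched for the four-instruction example. This is a simplified model under stated assumptions: `startable_behind_first` and `REGISTERS` are hypothetical names, the first instruction (MD F0, E) is assumed to be still executing, storage operands (E, A) are assumed always ready, and the sink of each instruction is treated as one of its sources.

```python
# Hypothetical sketch: which instructions can start while MD F0, E is still
# executing? Tuples are (opcode, sink, sources); the sink is also a source.
REGISTERS = {"F0", "F2", "F4"}   # E and A are storage operands, assumed ready

program = [
    ("MD", "F0", ["F0", "E"]),
    ("AD", "F2", ["F2", "F0"]),
    ("AD", "F4", ["F4", "A"]),
    ("AD", "F2", ["F2", "F4"]),
]

def startable_behind_first(program, decoupled):
    """Return indices of instructions that can begin while instruction 0 runs."""
    pending = {program[0][1]}            # registers whose value is not yet produced
    started = []
    for idx, (_op, sink, sources) in enumerate(program[1:], start=1):
        blocked = any(s in REGISTERS and s in pending for s in sources)
        if blocked and not decoupled:
            break                        # decoder is stuck; nothing behind it is examined
        if not blocked:
            started.append(idx)          # operands ready: can issue
        pending.add(sink)                # dispatched or started: its result is still pending
    return started

coupled = startable_behind_first(program, decoupled=False)
decoupled = startable_behind_first(program, decoupled=True)
```

With coupled decoding nothing behind MD can start; with dispatch decoupled from issue, the second AD (index 2) starts while the first AD waits for F0.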

An Example of True Dependence

LD F0, FLB1 (F0 as sink)

AD F2, F0 (F0 as source)

Assume CDB has not been introduced yet

1. LD F0, FLB1 dispatches to A1

F0 is reserved for some instruction; its content is calculated by A1

2. AD F2, F0 dispatches to A2

“I need the value of F0, but he seems to be busy”

“Since A1 is the producer, just ask him for it” (or let him tell me)

AD F2, F0 becomes AD F2, A1

3. LD F0, FLB1 executing

“Operands are ready. Execute!”

4. LD F0, FLB1 broadcasts its result to the air

A1: “I’m A1. Who needs my result? Over.”

AD F2, A1: “I depend on A1!”

F0: “Me too!”

[Figure sequence: the Adder and Mul/Div units with F0 and the two instructions at each step]
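
The walkthrough above can be sketched in a few lines. This is an illustrative model only: the dictionary layout and the helpers `read_operand`, `reserve` and `broadcast` are hypothetical, and the register contents (10.0 in F2, 3.5 loaded from FLB1) are made-up values.

```python
# Each register holds either a valid value or the tag of the unit producing it.
regs = {"F0": {"value": 0.0, "tag": None}, "F2": {"value": 10.0, "tag": None}}

def read_operand(reg):
    """Return ('value', v) if the register is valid, else ('tag', producer)."""
    r = regs[reg]
    return ("tag", r["tag"]) if r["tag"] else ("value", r["value"])

def reserve(reg, unit):
    """Mark reg as the sink of the instruction sitting in `unit`."""
    regs[reg]["tag"] = unit

def broadcast(unit, value, waiting):
    """Deliver `value` to every register and waiting operand tagged with `unit`."""
    for r in regs.values():
        if r["tag"] == unit:
            r["value"], r["tag"] = value, None
    for station in waiting:
        station["operands"] = [
            ("value", value) if op == ("tag", unit) else op
            for op in station["operands"]
        ]

reserve("F0", "A1")                       # LD F0, FLB1 dispatches to A1
a2 = {"unit": "A2", "operands": [read_operand("F2"), read_operand("F0")]}
reserve("F2", "A2")                       # AD F2, F0 dispatches to A2
seen = list(a2["operands"])               # AD F2, F0 has become AD F2, A1

broadcast("A1", 3.5, [a2])                # A1: "Who needs my result?"
result = sum(v for _, v in a2["operands"])
broadcast("A2", result, [])               # A2 finishes and updates F2
```

Both F0 and the waiting AD pick up the value from the same broadcast, so the AD never has to read F0 itself.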

The Role of CDB

The Common Data Bus is in charge of value forwarding

In the reg-to-reg model, a value is passed through a register (write & read): it is written as a sink (producer) and read as a source (consumer)

Load/Store doesn’t need to go through the ALU

Dependence management is decoupled from execution, as expected

Producers: all units which can alter a register

Consumers: all units which may take a register as an operand

[Figure: producers and consumers attached to the CDB, including the reservation stations for Mul and the FLR; the figure labels read P:3, C:4, C:3 and C:2*2]

The Implementation of CDB

A consumer recognizes its producer by a tag

Producers put <tag, value> on the bus in turn (each makes a request first)

If the tag matches, the consumer ingates the value

[Figure sequence: six producers (P) and six consumers (C), the consumers holding tags such as X and Y; a producer requests the bus (Request, 2 cycles), then the Y value and later the X value are broadcast to the consumers with matching tags]
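
A rough sketch of the bus behaviour described above. The class `CommonDataBus` and its methods are hypothetical names, and two simplifications are assumed: requests are granted in first-come order (the real arbitration policy is not modeled), and the request lead time is ignored.

```python
from collections import deque

class CommonDataBus:
    """Toy CDB: one <tag, value> pair is broadcast per cycle."""

    def __init__(self):
        self.requests = deque()   # producers waiting for the bus
        self.consumers = []       # (wait_tag, slot) pairs still waiting

    def request(self, tag, value):
        """A producer asks for the bus before it can broadcast."""
        self.requests.append((tag, value))

    def listen(self, tag, slot):
        """A consumer registers the tag of the producer it depends on."""
        self.consumers.append((tag, slot))

    def cycle(self):
        """Grant the bus to one producer; matching consumers ingate the value."""
        if not self.requests:
            return None
        tag, value = self.requests.popleft()
        still_waiting = []
        for wait_tag, slot in self.consumers:
            if wait_tag == tag:
                slot["value"] = value
            else:
                still_waiting.append((wait_tag, slot))
        self.consumers = still_waiting
        return tag

bus = CommonDataBus()
c1, c2, c3 = {"value": None}, {"value": None}, {"value": None}
bus.listen("X", c1)
bus.listen("Y", c2)
bus.listen("Y", c3)
bus.request("Y", 2.0)
bus.request("X", 1.0)

first = bus.cycle()                                    # Y is granted first
after_first = (c1["value"], c2["value"], c3["value"])  # only the Y consumers ingate
second = bus.cycle()                                   # then X
```

One broadcast serves every consumer holding the matching tag, which is why a single bus is enough for forwarding.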

The Principle behind the Scene

A tag is a pointer to the producer of the value required by the current instruction

The pointers make up the dependency information that is hidden by the reg-reg model (discussed later)

With the information, the order of execution can be resolved

CDB enables ‘producer-consumer’ style data flow
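
The idea of tags as producer pointers can be illustrated by recovering the pointers from a short program. This is only a sketch: `producer_pointers` is a hypothetical helper, instructions are written as (opcode, sink, register sources) tuples, and the sink of an AD is listed as one of its sources.

```python
def producer_pointers(program):
    """For each instruction, map every register source to the index of its producer."""
    last_writer = {}     # register -> index of the latest instruction writing it
    pointers = []
    for idx, (_op, sink, sources) in enumerate(program):
        # A source points at whoever last wrote that register, if anyone did.
        pointers.append({r: last_writer[r] for r in sources if r in last_writer})
        last_writer[sink] = idx
    return pointers

program = [
    ("LD", "F0", []),            # LD F0, FLB1
    ("AD", "F2", ["F2", "F0"]),  # AD F2, F0
    ("LD", "F0", []),            # LD F0, FLB2
    ("AD", "F3", ["F3", "F0"]),  # AD F3, F0
]

pointers = producer_pointers(program)
```

Only two pointers remain (instruction 1 to 0, instruction 3 to 2), so the two load/add pairs form independent chains even though all four instructions mention F0.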

An Example of False Dependence

WAW, WAR

LD F0, FLB1

AD F2, F0

LD F0, FLB2

AD F3, F0

[Figure sequence: LD F0, FLB1 dispatches and F0 is marked B FLB1; AD F2, F0 moves into a reservation station; LD F0, FLB2 dispatches and F0 is marked B FLB2; AD F3, F0 moves into a reservation station]

Keep tracing the source of the value instead of the register holding it

There’s no need to rename a register (naming is just a way of referring to values)
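
A small sketch of the four-instruction example, showing that tracking producers is enough even when results arrive out of order. The helpers and data layout are hypothetical, loads are assumed to be tagged by their buffer name, and the buffer and register contents are made-up values.

```python
flb = {"FLB1": 1.0, "FLB2": 2.0}                      # assumed buffer contents
regs = {"F0": [0.0, None], "F2": [10.0, None], "F3": [20.0, None]}  # [value, tag]
stations = {}                                          # station name -> two operands

def operand(reg):
    """Return the value if valid, otherwise the tag (a str) to wait for."""
    value, tag = regs[reg]
    return tag if tag else value

def load(sink, buf):
    regs[sink][1] = buf                  # the sink now waits for this buffer

def add(station, sink, source):
    stations[station] = [operand(sink), operand(source)]
    regs[sink][1] = station              # the sink now waits for this station

def broadcast(tag, value):
    for r in regs.values():
        if r[1] == tag:
            r[0], r[1] = value, None
    for ops in stations.values():
        ops[:] = [value if op == tag else op for op in ops]

load("F0", "FLB1")        # LD F0, FLB1
add("A1", "F2", "F0")     # AD F2, F0  waits for FLB1
load("F0", "FLB2")        # LD F0, FLB2  F0 now waits for FLB2 only
add("A2", "F3", "F0")     # AD F3, F0  waits for FLB2

# Assume the second load returns first.
broadcast("FLB2", flb["FLB2"])
broadcast("A2", sum(stations["A2"]))
broadcast("FLB1", flb["FLB1"])           # F0 ignores this: its tag is no longer FLB1
broadcast("A1", sum(stations["A1"]))
```

F2 ends up with the FLB1 value added, F3 with the FLB2 value, and F0 holds the FLB2 value, which is the sequential result.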

Timing Sequence with Busy Bit

LD F0, FLB1

AD F2, F0

LD F0, FLB2

AD F3, F0

[Figure: pipeline timing diagram for the four instructions under the busy bit scheme, with stages labeled D, T, EX and WB]

Timing Sequence with Reservation Station

LD F0, FLB1

AD F2, F0

LD F0, FLB2

AD F3, F0

[Figure: pipeline timing diagram for the four instructions with reservation stations, with stages labeled D, T, EX and WB]

The Side Effect of Register Machine

What are the differences between a circuit and a register machine?

Register machine: general purpose, control-driven, implicit dependence via registers

Circuit: special purpose, data-driven, exposed dependence

...But registers are scarce

Conclusion

The Tomasulo algorithm has nothing to do with register renaming

It resolves WAR and WAW dependences by eliminating the side effect of using registers to pass values

With the Tomasulo algorithm, the execution of a program is driven by data flow, thus exploiting maximum concurrency
