AMD Bulldozer Microarchitecture

Overview

• Two integer cores per module, for high throughput per thread

• Bulldozer module can execute two threads via a combination of shared and dedicated resources.

• AMD’s design focuses on multithreading.

High Level Block Diagram

The figure is taken from [3]

Branch Prediction & Fetch

• Prediction structures - shared between two threads

• Multilevel branch target buffers (BTBs)

• Prediction runs ahead of the instruction fetch (IF) pipeline during fetch misses or other stalls.

• Instructions are prefetched into the L1 cache using the prediction queue.
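
The run-ahead behaviour above can be sketched as a toy model. This is only an illustration under stated assumptions: the function name `run_ahead`, the 64-byte line size, the queue depth of 8, and one prediction per stalled cycle are all hypothetical and not taken from AMD's design.

```python
from collections import deque

LINE_BYTES = 64  # assumed cache-line size, for illustration only


def run_ahead(predicted_targets, l1_lines, stall_cycles, queue_depth=8):
    """Toy model of decoupled prediction and fetch.

    While fetch is stalled, the predictor keeps producing one predicted
    fetch address per cycle. Each address is pushed onto the prediction
    queue, and its cache line is prefetched into L1 if not already present.
    Returns the queue contents and the list of lines that were prefetched.
    """
    queue = deque()
    prefetched = []
    targets = iter(predicted_targets)
    for _ in range(stall_cycles):
        if len(queue) >= queue_depth:
            break  # queue full: the predictor waits for fetch to drain it
        addr = next(targets, None)
        if addr is None:
            break  # no further predictions available
        queue.append(addr)
        line = addr // LINE_BYTES
        if line not in l1_lines:
            l1_lines.add(line)  # model the prefetch filling the L1 cache
            prefetched.append(line)
    return list(queue), prefetched
```

For example, with line 0 already cached and a three-cycle stall, only the first address that touches a new line triggers a prefetch; later addresses in the same line find it already present.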

Decode

• Fetch lines are queued in an instruction byte buffer.

• Decode unit extracts and decodes up to four x86 instructions per cycle.

• Decoded instructions dispatch to one integer core.

Integer Core

• Replicated (2 integer cores)

• Scheduler handles out-of-order execution.

• Core transparency
  o Avoids complexity
  o Lean hardware

Floating Point Unit

• Single floating point unit.

• Shared between the two integer cores.

• Floating point operations are pipelined and hence exploit SMT.

• Interfaces with the decode unit to receive cops (complex operations) and with the load/store unit for data transfer.

Register Renaming

• PRF (physical register file)-based renaming

• A table holds the mappings of register names to locations (tags).

• Issued instructions execute after reading from the PRF.

• Uses snapshots to recover from branch mispredictions and exceptions.

• Separate register files for the integer cores and the floating point unit.
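
The points above can be illustrated with a minimal sketch of PRF-based renaming. It assumes a simple free list and whole-table snapshots; the class `RenameTable` and its methods are hypothetical names chosen for illustration, and freeing registers at retirement is left out.

```python
class RenameTable:
    """Toy PRF-based renamer: a map table of tags plus a value array."""

    def __init__(self, arch_regs, num_phys):
        # Each architectural register starts mapped to its own physical tag.
        self.map = {reg: tag for tag, reg in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_phys))
        self.prf = [0] * num_phys  # values live only in the PRF
        self.snapshots = {}

    def rename(self, dst, srcs):
        """Look up source tags, then allocate a fresh tag for dst."""
        src_tags = [self.map[s] for s in srcs]
        new_tag = self.free.pop(0)
        self.map[dst] = new_tag
        return new_tag, src_tags

    def execute_add(self, dst_tag, src_tags):
        """At issue, operands are read from the PRF by tag."""
        self.prf[dst_tag] = sum(self.prf[t] for t in src_tags)

    def snapshot(self, branch_id):
        """Save the mapping state when a branch is renamed."""
        self.snapshots[branch_id] = (dict(self.map), list(self.free))

    def recover(self, branch_id):
        """On a misprediction, restore the state saved at the branch."""
        self.map, self.free = self.snapshots.pop(branch_id)
```

In this sketch only tags move through the rename stage. The values stay in `prf` and are read when the instruction executes, which is the trade-off described in the advantages and disadvantages that follow.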

Register Renaming

• Advantages
  o Eliminates data replication by not using distributed reservation stations.
  o Less overhead on the common data bus (CDB).

• Disadvantages
  o Increased latency, as tags are fetched instead of values.
  o Complicated recovery mechanism for branch misprediction.

Multithreading

• Shared front end (vertical multithreading)
  o Larger resource in single-thread mode
  o Utilizes fetch bandwidth

• Dedicated integer execution core (single thread)
  o Keeps the integer execution core small and simple
  o Possible to run at a higher frequency

• Shared FPU (SMT)
  o Consumes a great deal of area and power
  o Rarely utilized to full capacity

• Shared L2 (thread agnostic)
  o Good when 2 threads share an instruction/data image
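
The split between dedicated and shared execution resources can be sketched as a routing function. This is a simplified illustration, not the actual dispatch logic: the function `dispatch`, the unit names, and the two-kind operation model (`"int"` or `"fp"`) are assumptions.

```python
from collections import defaultdict


def dispatch(ops):
    """Route decoded operations within one module.

    ops is a list of (thread_id, kind) pairs, with kind "int" or "fp".
    Integer ops go to the core dedicated to their thread; floating point
    ops from both threads land in the single shared FPU queue.
    """
    queues = defaultdict(list)
    for thread_id, kind in ops:
        unit = "fpu" if kind == "fp" else f"int_core{thread_id}"
        queues[unit].append((thread_id, kind))
    return dict(queues)
```

Calling `dispatch([(0, "int"), (1, "fp"), (1, "int"), (0, "fp")])` places one op in each integer core's queue and both floating point ops, from different threads, in the shared `"fpu"` queue.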

Cache Hierarchy


TLB Hierarchy


Conclusion

• Decoupling branch prediction from instruction fetch enables instruction prefetch

• PRF-based renaming makes the design power efficient

• Non-conventional multithreading

References

[1] Bulldozer: An Approach to Multithreaded Compute Performance

http://home.dei.polimi.it/sami/architetture_avanzate/AMDbulldozer.pdf (2011)

[2] AMD Bulldozer Microarchitecture

http://www.realworldtech.com/bulldozer/ (2010)

[3] Bulldozer (microarchitecture)

http://en.wikipedia.org/wiki/Bulldozer_(microarchitecture)

[4] Register Renaming

http://en.wikipedia.org/wiki/Register_renaming
