AMD Bulldozer Microarchitecture


Overview

  • Two cores: to have high throughput per thread

  • A Bulldozer module can execute two threads via a combination of shared and dedicated resources.

  • AMD's design focuses on multithreading.

High Level Block Diagram

The figure is taken from [3].

Branch Prediction & Fetch

  • Prediction structures are shared between the two threads.
  • Multilevel BTBs (branch target buffers).
  • Prediction runs ahead of the instruction fetch (IF) pipeline during fetch misses or other stalls.
  • Instructions are prefetched into the L1 cache using the prediction queue.

Decode

  • Fetch lines are queued in an instruction byte buffer.
  • The decode unit extracts and decodes up to four x86 instructions per cycle.
  • Decoded instructions dispatch to one integer core.

Integer Core

  • Replicated (two integer cores).
  • The scheduler handles out-of-order execution.
  • Core transparency
  • Avoids complexity
  • Lean hardware

Integer Core

The figure is taken from [1].

Floating Point Unit

  • Single floating point unit, shared between the integer cores.
  • Floating point operations are implemented in pipelined fashion and hence exploit SMT.
  • Interfaces with the decode unit to receive cops (complex operations) and with the load/store unit for data transfer.

Floating Point Unit

The figure is taken from [1].

Register Renaming

  • PRF (physical register file)-based renaming.
  • A table contains the mappings of names to locations (tags).
  • Issued instructions execute after reading from the PRF.
  • Uses snapshots to recover from branch mispredictions and exceptions.
  • Separate register files for the integer cores and the floating point unit.

Register Renaming

Advantages

  • Eliminates data replication by not using distributed reservation stations.
  • Less overhead on the CDB (common data bus).

Disadvantages

  • Increase in latency, as the tags are fetched instead of the values.
  • Complicated recovery mechanism for branch misprediction.
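The renaming scheme described on the Register Renaming slides (a name-to-tag table, values held only in the PRF, snapshots for recovery) can be illustrated with a small Python sketch. This is a toy model, not Bulldozer's actual logic: the class name `PrfRenamer`, its methods, and the register counts are all hypothetical, and retirement (returning old tags to the free list) is left out for brevity.

```python
class PrfRenamer:
    """Toy PRF-based renamer (hypothetical model, not AMD's design)."""

    def __init__(self, arch_regs, num_phys):
        if num_phys < len(arch_regs):
            raise ValueError("need at least one physical register per name")
        # Mapping table: architectural name -> physical tag.
        self.map = {name: tag for tag, name in enumerate(arch_regs)}
        # Tags not currently mapped to any name.
        self.free = list(range(len(arch_regs), num_phys))
        # Values live only in the physical register file.
        self.prf = [0] * num_phys
        # branch id -> (copy of mapping table, copy of free list)
        self.snapshots = {}

    def rename(self, dst, srcs):
        """Look up source tags, then give the destination a fresh tag."""
        src_tags = [self.map[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free tag; a real core would stall here")
        new_tag = self.free.pop(0)
        self.map[dst] = new_tag
        return new_tag, src_tags

    def execute_add(self, dst_tag, src_tags):
        """Operands are read from the PRF by tag at execute time."""
        self.prf[dst_tag] = sum(self.prf[t] for t in src_tags)

    def snapshot(self, branch_id):
        """Save the rename state when a branch is renamed."""
        self.snapshots[branch_id] = (dict(self.map), list(self.free))

    def recover(self, branch_id):
        """Restore the saved state after a misprediction or exception.

        Assumes no tags were retired since the snapshot, so restoring
        the saved free list also reclaims wrong-path allocations.
        """
        self.map, self.free = self.snapshots.pop(branch_id)
```

Taking a snapshot before a branch, renaming a few wrong-path instructions, and then calling `recover` returns the table to its pre-branch mapping in one step. The cost of storing and managing those snapshots is one way to read the "complicated recovery mechanism" disadvantage listed above.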

Multithreading

  • Shared front end (vertical multithreading)
      • Larger resource in single-thread mode
      • Utilizes fetch bandwidth
  • Dedicated integer execution core (single thread)
      • Keeps the integer execution core small and simple
      • Possible to run at a higher frequency
  • Shared FPU (SMT)
      • Consumes a great deal of area and power
      • Rarely utilized to full capacity
  • Shared L2 (thread agnostic)
      • Good when the two threads share an instruction/data image
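The split between shared and dedicated resources listed above can be sketched in a few lines of Python. The slides do not state how the front end arbitrates between threads, so the round-robin policy here is purely an assumption, and the function names `fetch_schedule` and `route` are hypothetical.

```python
def fetch_schedule(active_threads, cycles):
    """Return which thread owns the shared front end in each cycle.

    Assumed policy: plain round-robin over the active threads. With a
    single active thread, that thread gets the front end every cycle,
    which matches the "larger resource in single-thread mode" point.
    """
    if not active_threads:
        return [None] * cycles
    return [active_threads[c % len(active_threads)] for c in range(cycles)]


def route(thread_id, op_kind):
    """Pick an execution resource for a decoded operation.

    Integer operations go to the thread's own dedicated core;
    floating point operations go to the one FPU both threads share.
    """
    if op_kind == "fp":
        return "fpu"
    return f"int_core_{thread_id}"
```

With two active threads the schedule alternates between them, while `route` sends each thread's integer work to its own core and all floating point work to the same unit.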

Cache Hierarchy

The figure is taken from [1].

TLB Hierarchy

The figure is taken from [1].

Conclusion

  • Decoupled branch prediction and instruction fetch enable instruction prefetch.
  • Using PRF-based renaming makes the design power efficient.
  • Non-conventional multithreading.
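The first point of the conclusion, that decoupling prediction from fetch enables prefetch, can be seen in a toy cycle-by-cycle model. This is a sketch under stated assumptions rather than a description of the real pipeline: one prediction and one fetch per cycle, a 64-byte line size, a queue depth of 8, and the name `run_front_end` are all illustrative choices not taken from the slides.

```python
from collections import deque

LINE = 64  # assumed cache-line size in bytes


def run_front_end(predicted_addrs, stall_cycles, queue_depth=8):
    """Toy model of a prediction queue feeding instruction fetch.

    The predictor enqueues one fetch address per cycle, even while fetch
    is stalled. Every queued address has its line prefetched into L1.
    Fetch dequeues one address per cycle unless the cycle is a stall.
    """
    pending = deque(predicted_addrs)
    queue = deque()
    prefetched = set()   # line numbers already requested
    fetched = []
    cycle = 0
    while pending or queue:
        # Prediction runs ahead of fetch as long as the queue has room.
        if pending and len(queue) < queue_depth:
            addr = pending.popleft()
            queue.append(addr)
            prefetched.add(addr // LINE)
        # Fetch consumes from the queue when it is not stalled.
        if cycle not in stall_cycles and queue:
            fetched.append(queue.popleft())
        cycle += 1
    return fetched, prefetched, cycle
```

For example, `run_front_end([0, 64, 128], stall_cycles={0, 1})` has two lines already prefetched by the time fetch resumes in cycle 2, which is the benefit the decoupling is meant to provide.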

References

[1] Bulldozer: An Approach to Multithreaded Compute Performance, http://home.dei.polimi.it/sami/architetture_avanzate/AMDbulldozer.pdf (2011)

[2] AMD Bulldozer Microarchitecture, http://www.realworldtech.com/bulldozer/ (2010)

[3] Bulldozer (microarchitecture), http://en.wikipedia.org/wiki/Bulldozer_(microarchitecture)

[4] Register Renaming, http://en.wikipedia.org/wiki/Register_renaming