23
www.compaq.com Simultaneous Multithreading: Simultaneous Multithreading: Multiplying Alpha Performance Multiplying Alpha Performance Dr. Joel Emer Dr. Joel Emer Principal Member Technical Staff Principal Member Technical Staff Alpha Development Group Alpha Development Group Compaq Computer Corporation Compaq Computer Corporation

Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

  • Upload
    others

  • View
    19

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

Simultaneous Multithreading:Simultaneous Multithreading:Multiplying Alpha PerformanceMultiplying Alpha Performance

Dr. Joel EmerDr. Joel EmerPrincipal Member Technical StaffPrincipal Member Technical Staff

Alpha Development GroupAlpha Development GroupCompaq Computer CorporationCompaq Computer Corporation

Page 2: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

OutlineOutline

uu Alpha Processor RoadmapAlpha Processor Roadmap

uu Motivat ion for Introducing SMTMotivat ion for Introducing SMT

uu Implementat ion of an SMT CPUImplementat ion of an SMT CPU

uu Performance Est imatesPerformance Est imates

uu Architectural Abstract ionArchitectural Abstract ion

Page 3: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

Higher Performance

Low

er C

ost

2000 2001 2002 20031998 1999

2126421264EV6EV6

2126421264EV68EV68

0.35µ m

2126421264EV67EV67

0.28µ m

0.18µ m

EV7EV70.18µ m

...

EV8EV80.125µ m

Alpha Microprocessor OverviewAlpha Microprocessor Overview

EV78EV780.125µ m

First System Ship

Page 4: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

EV8 Technology OverviewEV8 Technology Overview

uu Leading edge process technology – 1.2- 2.0GHzLeading edge process technology – 1.2- 2.0GHzll 0.125µm CMOS0.125µm CMOSll SOI- compatibleSOI- compatiblell Cu interconnectCu interconnectll low- k dielectricslow- k dielectrics

uu Chip characterist icsChip characterist icsll ~1.2V ~1.2V VddVdd

ll ~250 Million transistors~250 Million transistors

ll ~1100 signal pins in f lip chip packaging~1100 signal pins in f lip chip packaging

Page 5: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

EV8 Architecture OverviewEV8 Architecture Overview

uu Enhanced out- of- order execut ionEnhanced out- of- order execut ion

uu 8- wide 8- wide superscalarsuperscalar

uu Large on- chip L2 cacheLarge on- chip L2 cache

uu Direct RAMBUS interfaceDirect RAMBUS interface

uu On- chip router for system interconnectOn- chip router for system interconnect

uu GluelessGlueless, directory- based, , directory- based, ccNUMAccNUMA for up to 512- way for up to 512- waySMPSMP

uu 4- way simultaneous mult ithreading (SMT)4- way simultaneous mult ithreading (SMT)

Page 6: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

GoalsGoals

uu Leadership single stream performanceLeadership single stream performance

uu Extra mult istream performance withExtra mult istream performance withmult ithreadingmult ithreading

ll Without major architectural changesWithout major architectural changes

ll Without signif icant addit ional costWithout signif icant addit ional cost

Page 7: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

Instruction IssueInstruction Issue

Reduced function unit utilization due to dependencies

Time

Page 8: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

Superscalar Superscalar IssueIssue

Superscalar leads to more performance, but lower utilization

Time

Page 9: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

Predicated IssuePredicated Issue

Adds to function unit utilization, but results are thrown away

Time

Page 10: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

Chip MultiprocessorChip Multiprocessor

Limited utilization when only running one thread

Time

Page 11: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

Fine Grained MultithreadingFine Grained Multithreading

Intra-thread dependencies still limit performance

Time

Page 12: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

Simultaneous MultithreadingSimultaneous Multithreading

Maximum utilization of function units by independent operations

Time

Page 13: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

Basic Out-of-order PipelineBasic Out-of-order Pipeline

Fetch Decode/ Map

Queue RegRead

Execute

Dcache/ StoreBuffer

RegWrite

Retire

PC

Icache

RegisterMap

DcacheRegs Regs

Thread-blind

Page 14: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

SMT PipelineSMT Pipeline

Fetch Decode/ Map

Queue RegRead

Execute

Dcache/ StoreBuffer

RegWrite

Retire

Icache

Dcache

PC

RegisterMap

Regs Regs

Page 15: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

Changes for SMTChanges for SMT

uu Basic pipeline – unchangedBasic pipeline – unchanged

uu Replicated resourcesReplicated resourcesll Program countersProgram countersll Register mapsRegister maps

uu Shared resourcesShared resourcesll Register f ile (size increased)Register f ile (size increased)ll Instruct ion queueInstruct ion queuell First and second level cachesFirst and second level cachesll Translat ion buffersTranslat ion buffersll Branch predictorBranch predictor

Page 16: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

Multiprogrammed workloadMultiprogrammed workload

0%

50%

100%

150%

200%

250%

SpecInt SpecFP Mixed Int/FP

1T2T3T4T

Page 17: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

Decomposed SPEC95 ApplicationsDecomposed SPEC95 Applications

0%

50%

100%

150%

200%

250%

Turb3d Swm256 Tomcatv

1T2T3T4T

Page 18: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

Multithreaded ApplicationsMultithreaded Applications

0%

50%

100%

150%

200%

250%

300%

Barnes Chess Sort TP

1T2T4T

Page 19: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

Architectural AbstractionArchitectural Abstraction

uu 1 CPU with 4 Thread Processing Units (1 CPU with 4 Thread Processing Units (TPUsTPUs))

uu Shared hardware resourcesShared hardware resources

TPU 0 TPU1 TPU2 TPU3

I cache TLB Dcache

Scache

Page 20: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

System Block DiagramSystem Block Diagram

EV8M

IOEV8

M

IOEV8

M

IO

EV8M

IOEV8

M

IOEV8

M

IO

EV8M

IOEV8

M

IOEV8

M

IO

0 1 2 3

Page 21: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

QuiescingQuiescing Idle Threads Idle Threads

uu Problem:Problem:Spin looping thread consumes resourcesSpin looping thread consumes resources

uu Solut ion:Solut ion:Provide Provide quiescingquiescing operat ion that allows a operat ion that allows aTPU to sleep unt il a memory locat ion changesTPU to sleep unt il a memory locat ion changes

Page 22: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

SummarySummary

uu Alpha will maintain single stream performanceAlpha will maintain single stream performanceleadershipleadership

uu SMT will signif icant ly enhance mult istreamSMT will signif icant ly enhance mult istreamperformanceperformance

ll Across a wide range of applicat ions,Across a wide range of applicat ions,

ll Without signif icant hardware cost, andWithout signif icant hardware cost, and

ll Without major architectural changesWithout major architectural changes

Page 23: Simultaneous Multithreading: Multiplying Alpha Performance€¦ · u "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. u "Exploiting

www.compaq.com

ReferencesReferences

uu ""Simultaneous Multithreading: Maximizing On-Chip ParallelismSimultaneous Multithreading: Maximizing On-Chip Parallelism" by" by Tullsen Tullsen, Eggers, Eggersand Levy in ISCA95.and Levy in ISCA95.

uu ""Exploiting Choice: Instruction Fetch and Issue on anExploiting Choice: Instruction Fetch and Issue on an Implementable Implementable Simultaneous SimultaneousMultithreaded ProcessorMultithreaded Processor" by" by Tullsen Tullsen, Eggers, Emer, Levy, Lo and, Eggers, Emer, Levy, Lo and Stamm Stamm in inISCA96.ISCA96.

uu ““Converting Thread-Level Parallelism to Instruction-Level Parallelism via SimultaneousConverting Thread-Level Parallelism to Instruction-Level Parallelism via SimultaneousMultithreadingMultithreading” by Lo, Eggers, Emer, Levy, ” by Lo, Eggers, Emer, Levy, StammStamm and and Tullsen Tullsen in ACMin ACMTransact ions on Computer Systems, August 1997.Transact ions on Computer Systems, August 1997.

uu “Simultaneous Mult ithreading: A Plat form for Next- Generat ion“Simultaneous Mult ithreading: A Plat form for Next- Generat ionPrcoessorsPrcoessors” by Eggers, Emer, Levy, Lo, ” by Eggers, Emer, Levy, Lo, Stamm Stamm and and Tullsen Tullsen in IEEEin IEEEMicro, October, 1997.Micro, October, 1997.