A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Ayose Falcón · Alex Ramirez · Mateo Valero

HPCA-10 · February 18, 2004


Page 1: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Ayose Falcón, Alex Ramirez, Mateo Valero

HPCA-10

February 18, 2004

Page 2: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

HPCA-10 · A Low-Complexity, High-Performance Fetch Unit for SMT Processors

Simultaneous Multithreading

SMT [Tullsen95] / Multistreaming [Yamamoto95]

Instructions from different threads coexist in each processor stage.

Resources are shared among the different threads. But…

Sharing implies competition: in caches, queues, FUs, …

The fetch policy decides!

Page 3: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Motivation

SMT performance is limited by fetch performance: a superscalar fetch unit is not enough to feed an aggressive SMT core. SMT fetch is a bottleneck [Tullsen96] [Burns99].

Straightforward solution: fetch from several threads each cycle.
a) Multiple fetch units (1 per thread): EXPENSIVE!
b) Shared fetch + fetch policy [Tullsen96]: multiple PCs, multiple branch predictions per cycle, multiple I-cache accesses per cycle.

Does the performance of this fetch organization compensate for its complexity?

Page 4: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Talk Outline

Motivation
Fetch Architectures for SMT
High-Performance Fetch Engines
Simulation Setup
Results
Summary & Conclusions

Page 5: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Fetching from a Single Thread (1.X)

[Diagram: Branch Predictor and Instruction Cache feeding the SHIFT & MASK logic]

Fine-grained, non-simultaneous sharing. Simple: similar to a superscalar fetch unit; no additional HW is needed.

A fetch policy is needed: it decides the fetch priority among threads. Several proposals exist in the literature.
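As an illustration, the thread-selection step of such a policy can be sketched in a few lines of Python. This is a toy model under my own assumptions (the `inflight_counts` and `can_fetch` structures are hypothetical), written in the spirit of ICOUNT [Tullsen96], the policy used later in this deck:

```python
# Toy model of an ICOUNT-style fetch policy: each cycle, fetch from the
# thread with the fewest instructions in the pre-issue pipeline stages,
# so fast-moving threads get fetch priority.

def icount_pick(inflight_counts, can_fetch):
    """Return the id of the thread to fetch this cycle, or None.

    inflight_counts: per-thread count of instructions in decode/rename/queues.
    can_fetch: per-thread flag (False e.g. on an unresolved I-cache miss).
    """
    candidates = [t for t, ok in enumerate(can_fetch) if ok]
    if not candidates:
        return None
    # Fewest in-flight instructions wins; ties are broken by thread id.
    return min(candidates, key=lambda t: (inflight_counts[t], t))

if __name__ == "__main__":
    # Thread 1 has the fewest queued instructions, so it is chosen.
    print(icount_pick([12, 3, 7], [True, True, True]))   # -> 1
    # Thread 1 blocked: the next-lowest count (thread 2) is chosen.
    print(icount_pick([12, 3, 7], [True, False, True]))  # -> 2
```

The same skeleton accommodates other policies from the literature by changing the sort key.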

Page 6: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Fetching from a Single Thread (1.X)

But… a single thread is not enough to fill the fetch BW. A gshare / hybrid branch predictor + BTB limits the fetch width to one basic block per cycle (6-8 instructions).

[Chart: Fetch Throughput (IPFC) per fetch policy, for 1.8 and 1.16]

The fetch BW is heavily underused: avg. 40% wasted with 1.8, avg. 60% wasted with 1.16.
The fetch BW is fully used in only 31% of fetch cycles with 1.8 and 6% with 1.16.

Page 7: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Fetching from Multiple Threads (2.X)

Increases fetch throughput: more threads means more possibilities to fill the fetch BW.

More fetch BW use than 1.X: the fetch BW is fully used in 54% of cycles with 2.8 and 16% of cycles with 2.16.

[Chart: Fetch Throughput (IPFC) for 1.8 vs 2.8 and 1.16 vs 2.16; annotated gains of 28% and 33%]

Page 8: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Fetching from Multiple Threads (2.X)

[Diagram: Branch Predictor and a two-banked Instruction Cache (BANK 1, BANK 2), each bank with its own SHIFT & MASK logic, followed by MERGE logic]

2 predictions per cycle + 2 cache ports
Multibanked + multiported instruction cache
Replication of the SHIFT & MASK logic
New HW to realign and merge cache lines

But… what is the additional HW cost of a 2.X fetch?
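As an illustration of the extra realign-and-merge work, here is a toy Python model (my own sketch, not the hardware design): each bank returns an aligned line for one thread, SHIFT & MASK extracts the useful instructions, and MERGE packs both groups into one fetch group.

```python
# Toy model of 2.X fetch-group assembly: each of the two cache banks
# returns a full line for one thread; SHIFT drops the instructions
# before the fetch PC, MASK cuts after the valid instructions, and
# MERGE packs both groups into a single group of `width` slots.

def shift_and_mask(line, start, valid):
    """Keep `valid` instructions of `line` beginning at offset `start`."""
    return line[start:start + valid]

def merge(group_a, group_b, width):
    """Pack two per-thread fetch groups into one width-limited group."""
    return (group_a + group_b)[:width]

if __name__ == "__main__":
    line_a = [f"A{i}" for i in range(8)]
    line_b = [f"B{i}" for i in range(8)]
    g_a = shift_and_mask(line_a, 2, 4)   # A2..A5
    g_b = shift_and_mask(line_b, 0, 6)   # B0..B5
    print(merge(g_a, g_b, 8))            # 4 from thread A, then 4 from B
```

The point of the slide is that in a 2.X fetch this logic exists twice, plus the MERGE stage, on the critical fetch path.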

Page 9: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Our Goal

Can we take the best of both worlds? The low complexity of a 1.X fetch architecture + the high performance of a 2.X fetch architecture.

That is… can a single thread provide sufficient instructions to fill the available fetch bandwidth?

Page 10: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Talk Outline

Motivation
Fetch Architectures for SMT
High-Performance Fetch Engines
Simulation Setup
Results
Summary & Conclusions

Page 11: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

High-Performance Fetch Engines (I)

We look for high performance: a gshare / hybrid branch predictor + BTB yields low performance, limiting the fetch BW to one basic block per cycle (6-8 instructions).

We look for low complexity: trace caches, the Branch Target Address Cache, the Collapsing Buffer, etc. fetch multiple basic blocks per cycle (12-16 instructions), but at high complexity.
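Why a BTB-style front end stops at one basic block per cycle can be sketched as follows (a toy model under my own assumptions, not any real predictor's interface):

```python
# Toy model: a conventional BTB-based fetch delivers sequential
# instructions but must stop at the first predicted-taken branch,
# because it produces only one branch target per cycle. With basic
# blocks of ~6-8 instructions, a 16-wide fetch port goes underused.

def fetch_one_block(program, pc, width, taken_branches):
    """Fetch up to `width` instructions, stopping after a taken branch."""
    group = []
    for addr in range(pc, min(pc + width, len(program))):
        group.append(program[addr])
        if addr in taken_branches:   # predicted taken: redirect next cycle
            break
    return group

if __name__ == "__main__":
    program = list(range(100))
    # A taken branch at address 5 caps a 16-wide fetch at 6 instructions.
    print(len(fetch_one_block(program, 0, 16, taken_branches={5})))  # -> 6
```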

Page 12: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

High-Performance Fetch Engines (II)

Our alternatives:

Gskew [Michaud97] + FTB [Reinman99]: FTB fetch blocks are larger than basic blocks; 5% speedup over gshare+BTB in superscalars.

Stream Predictor [Ramirez02]: streams are larger than FTB fetch blocks; 11% speedup over gskew+FTB in superscalars.
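The stream predictor's advantage can be sketched with a similar toy model (the interface is hypothetical; see [Ramirez02] for the real design): a stream is a long run of sequential instructions spanning several basic blocks, so a single prediction can cover the whole fetch width.

```python
# Toy model of stream fetch: the predictor returns one (start, length)
# stream per cycle; because streams span multiple basic blocks
# (including not-taken branches), one prediction can fill a 16-wide
# fetch port, where a BTB-based fetch would stop at ~6-8 instructions.

def fetch_stream(program, stream_start, stream_length, width):
    """Fetch up to `width` sequential instructions from one stream."""
    n = min(stream_length, width, len(program) - stream_start)
    return program[stream_start:stream_start + n]

if __name__ == "__main__":
    program = list(range(100))
    # A 22-instruction stream fills all 16 fetch slots in one cycle.
    print(len(fetch_stream(program, 10, 22, 16)))  # -> 16
```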

Page 13: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Talk Outline

Motivation
Fetch Architectures for SMT
High-Performance Fetch Engines
Simulation Setup
Results
Summary & Conclusions

Page 14: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Simulation Setup

Modified version of SMTSIM [Tullsen96]: trace-driven, allowing wrong-path execution. Decoupled fetch (1 additional pipeline stage). Branch predictor sizes of approx. 45KB. Decode & rename width limited to 8 instructions.

Fetch width           8/16 inst.
Fetch buffer          32 inst.
Fetch policy          ICOUNT
RAS /thread           64-entry
FTQ size /thread      4-entry
Functional units      6 int, 4 ld/st, 3 fp
Inst. queues          32 int, 32 ld/st, 32 fp
ROB /thread           256-entry
Physical registers    384 int, 384 fp
L1 I-cache & D-cache  32KB, 2W, 8 banks
L2 cache              1MB, 2W, 8 banks, 10 cyc.
Line size             64B (16 instructions)
TLB                   48 I + 48 D
Mem. lat.             100 cyc.

Page 15: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Workloads

SPECint2000, with code layout optimized using Spike [Cohn97] + profile data from the train input.

Most representative 300M-instruction trace, using the ref input.

Workloads include 2, 4, 6, and 8 threads, classified according to the threads' characteristics:
ILP: only ILP benchmarks
MEM: memory-bounded benchmarks
MIX: mix of ILP and MEM benchmarks

Page 16: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Talk Outline

Motivation
Fetch Architectures for SMT
High-Performance Fetch Engines
Simulation Setup
Results: ILP workloads, MEM & MIX workloads (only for 2 & 4 threads; see paper for the rest)
Summary & Conclusions

Page 17: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

ILP Workloads - Fetch Throughput

With a given fetch bandwidth, fetching from two threads always benefits fetch performance.

The critical point is 1.16: the stream predictor obtains better fetch performance than 2.8, while gshare+BTB and gskew+FTB obtain worse fetch performance than 2.8.

[Chart: Fetch Throughput (IPFC) for 2_ILP and 4_ILP under 1.8, 2.8, 1.16, and 2.16, with gshare+BTB, gskew+FTB, and stream fetch]

Page 18: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

ILP Workloads – 1.X (1.8) vs 2.X (2.8)

ILP benchmarks have few memory problems and high parallelism, so the fetch unit is the real limiting factor: the higher the fetch throughput, the higher the IPC.

[Chart: Commit Throughput (IPC) for 2_ILP and 4_ILP under 1.8 and 2.8, with gshare+BTB, gskew+FTB, and stream fetch]

Page 19: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

ILP Workloads

So… 2.X is better than 1.X in ILP workloads…

But what about 1.2X instead of 2.X? That is, 1.16 instead of 2.8. This maintains single-thread fetch, and cache lines and buses are already 16 instructions wide: we only have to modify the HW to select 16 instead of 8 instructions.

Page 20: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

ILP Workloads – 2.X (2.8) vs 1.2X (1.16)

With 1.16, the stream predictor increases throughput (9% avg): streams are long enough for a 16-wide fetch. Similar or better performance than 2.16!

Fetching a single block per cycle is not enough: gshare+BTB suffers a 10% slowdown, gskew+FTB a 4% slowdown.

[Chart: Commit Throughput (IPC) for 2_ILP and 4_ILP under 2.8, 1.16, and 2.16, with gshare+BTB, gskew+FTB, and stream fetch]

Page 21: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

MEM & MIX Workloads - Fetch Throughput

Same trend as with the ILP fetch throughput: for a given fetch BW, fetching from two threads is better. Stream > gskew+FTB > gshare+BTB.

[Chart: Fetch Throughput (IPFC) for 2_MIX, 2_MEM, 4_MIX, and 4_MEM under 1.8, 2.8, 1.16, and 2.16, with gshare+BTB, gskew+FTB, and stream fetch]

Page 22: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

MEM & MIX Workloads – 1.X (1.8) vs 2.X (2.8)

With memory-bounded benchmarks… overall performance actually decreases!! Memory-bounded threads monopolize resources for many cycles. This problem was previously identified, and new fetch policies flush [Tullsen01] or stall [Luo01, El-Moursy03] the problematic threads.

[Chart: Commit Throughput (IPC) for 2_MIX, 2_MEM, 4_MIX, and 4_MEM under 1.8 and 2.8, with gshare+BTB, gskew+FTB, and stream fetch]
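A minimal sketch of the stall-style policies cited above (my own simplification of fetch gating, not the exact mechanism of [Luo01, El-Moursy03]): threads with an outstanding long-latency miss are simply excluded from fetch.

```python
# Toy fetch-gating model: a thread waiting on an L2 miss is stalled
# (excluded from fetch) so it cannot keep filling the shared queues
# and registers with instructions that will only wait on memory.

def gated_candidates(threads, l2_miss_pending):
    """Return the thread ids still eligible to fetch this cycle."""
    return [t for t in threads if not l2_miss_pending[t]]

if __name__ == "__main__":
    pending = {0: False, 1: True, 2: False, 3: True}
    # Threads 1 and 3 are blocked on L2 misses and are gated off.
    print(gated_candidates([0, 1, 2, 3], pending))  # -> [0, 2]
```

The fetch policy (e.g. ICOUNT) would then pick among the surviving candidates.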

Page 23: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

MEM & MIX Workloads

Fetching from only one thread means fetching only from the first, highest-priority thread:
It allows the highest-priority thread to proceed with more resources.
It prevents low-quality (lower-priority) threads from monopolizing more and more resources (registers, IQ slots, etc.) on cache misses.

Only the highest-priority thread is fetched. When a cache miss is resolved, the instructions from the second thread will be consumed: ICOUNT will give it more priority after the cache-miss resolution.

A powerful fetch unit can be harmful if not well used.

Page 24: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

MEM & MIX Workloads – 1.X (1.8) vs 1.2X (1.16)

[Chart: Commit Throughput (IPC) for 2_MIX, 2_MEM, 4_MIX, and 4_MEM under 1.8, 1.16, and 2.16, with gshare+BTB, gskew+FTB, and stream fetch]

Even 2.16 has worse commit performance than 1.8: more interference is introduced by the low-quality threads.

Overall, 1.16 is the best combination: the low complexity of fetching from one thread, with the high performance of a wide fetch.

Page 25: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Talk Outline

Motivation
Fetch Architectures for SMT
High-Performance Fetch Engines
Simulation Setup
Results
Summary & Conclusions

Page 26: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Summary

The fetch unit is the most significant obstacle to obtaining high SMT performance. However, researchers usually don't care about SMT fetch performance: they care about how to combine threads to sustain the available fetch throughput. A simple gshare/hybrid + BTB is commonly used, and everybody assumes that 2.8 (2.X) is the correct answer.

Fetching from many threads can be counterproductive: sharing implies competing, and low-quality threads monopolize more and more resources.

Page 27: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Conclusions

1.16 (1.2X) is the best fetch option, using a high-width fetch architecture. It's not the prediction accuracy, it's the fetch width.

Beneficial for both ILP and MEM workloads: 1.X is bad for ILP, 2.X is bad for MEM. 1.16 fetches only from the most promising thread (according to the fetch policy), and as much as possible from it.

It offers the best performance/complexity tradeoff. Fetching from a single thread may require revisiting current SMT fetch policies.

Page 28: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Thanks

Questions & Answers

Page 29: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Backup Slides

Page 30: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

SMT Workloads

Workload  Threads
2_ILP     eon, gcc
2_MEM     mcf, twolf
2_MIX     gzip, twolf
4_ILP     eon, gcc, gzip, bzip2
4_MEM     mcf, twolf, vpr, perlbmk
4_MIX     gzip, twolf, bzip2, mcf
6_ILP     eon, gcc, gzip, bzip2, crafty, vortex
6_MIX     gzip, twolf, bzip2, mcf, vpr, eon
8_ILP     eon, gcc, gzip, bzip2, crafty, vortex, gap, parser
8_MIX     gzip, twolf, bzip2, mcf, vpr, eon, gap, parser

Page 31: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Simulation Setup

Fetch policy          ICOUNT
Gshare predictor      64K-entry, 16-bit history
Gskew predictor       3x32K-entry, 15-bit history
BTB/FTB               2K-entry, 4-way assoc.
Stream predictor      1K-entry, 4-way + 4K-entry, 4-way
RAS /thread           64-entry
FTQ size /thread      4-entry
Functional units      6 int, 4 ld/st, 3 fp
Inst. queues          32 int, 32 ld/st, 32 fp
ROB /thread           256-entry
Physical registers    384 int, 384 fp
L1 I-cache & D-cache  32KB, 2W, 8 banks
L2 cache              1MB, 2W, 8 banks, 10 cyc.
Line size             64B (16 instructions)
TLB                   48 I + 48 D
Mem. lat.             100 cyc.