A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Ayose Falcón · Alex Ramirez · Mateo Valero

HPCA-10 · February 18, 2004


Page 1: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Ayose Falcón, Alex Ramirez, Mateo Valero

HPCA-10

February 18, 2004

Page 2: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

HPCA-10 · A Low-Complexity, High-Performance Fetch Unit for SMT Processors

Simultaneous Multithreading

SMT [Tullsen95] / Multistreaming [Yamamoto95]

Instructions from different threads coexist in each processor stage.

Resources are shared among the different threads. But…

Sharing implies competition: in caches, queues, FUs, …

The fetch policy decides!

Page 3: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Motivation

SMT performance is limited by fetch performance: a superscalar fetch unit is not enough to feed an aggressive SMT core. SMT fetch is a bottleneck [Tullsen96] [Burns99].

Straightforward solution: fetch from several threads each cycle.
a) Multiple fetch units (1 per thread): EXPENSIVE!
b) Shared fetch + fetch policy [Tullsen96]: multiple PCs, multiple branch predictions per cycle, multiple I-cache accesses per cycle.

Does the performance of this fetch organization compensate for its complexity?

Page 4: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Talk Outline

Motivation
Fetch Architectures for SMT
High-Performance Fetch Engines
Simulation Setup
Results
Summary & Conclusions

Page 5: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Fetching from a Single Thread (1.X)

[Diagram: Branch Predictor and Instruction Cache feeding the SHIFT & MASK logic]

Fine-grained, non-simultaneous sharing. Simple: similar to a superscalar fetch unit; no additional HW is needed.

A fetch policy is needed: it decides the fetch priority among threads. Several proposals exist in the literature.
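As an illustration, the thread-selection step of such a policy can be sketched in a few lines of Python. This is a toy model under my own assumptions (the `inflight_counts` and `can_fetch` structures are hypothetical), written in the spirit of ICOUNT [Tullsen96], the policy used later in this deck:

```python
# Toy model of an ICOUNT-style fetch policy: each cycle, fetch from the
# thread with the fewest instructions in the pre-issue pipeline stages,
# so fast-moving threads get fetch priority.

def icount_pick(inflight_counts, can_fetch):
    """Return the id of the thread to fetch this cycle, or None.

    inflight_counts: per-thread count of instructions in decode/rename/queues.
    can_fetch: per-thread flag (False e.g. on an unresolved I-cache miss).
    """
    candidates = [t for t, ok in enumerate(can_fetch) if ok]
    if not candidates:
        return None
    # Fewest in-flight instructions wins; ties are broken by thread id.
    return min(candidates, key=lambda t: (inflight_counts[t], t))

if __name__ == "__main__":
    # Thread 1 has the fewest queued instructions, so it is chosen.
    print(icount_pick([12, 3, 7], [True, True, True]))   # -> 1
    # Thread 1 blocked: the next-lowest count (thread 2) is chosen.
    print(icount_pick([12, 3, 7], [True, False, True]))  # -> 2
```

The same skeleton accommodates other policies from the literature by changing the sort key.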

Page 6: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Fetching from a Single Thread (1.X)

But… a single thread is not enough to fill the fetch BW. A gshare / hybrid branch predictor + BTB limits the fetch width to one basic block per cycle (6-8 instructions).

[Chart: Fetch Throughput (IPFC) per fetch policy, for 1.8 and 1.16]

The fetch BW is heavily underused: avg. 40% wasted with 1.8, avg. 60% wasted with 1.16.
The fetch BW is fully used in only 31% of fetch cycles with 1.8 and 6% with 1.16.

Page 7: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Fetching from Multiple Threads (2.X)

Increases fetch throughput: more threads means more possibilities to fill the fetch BW.

More fetch BW use than 1.X: the fetch BW is fully used in 54% of cycles with 2.8 and 16% of cycles with 2.16.

[Chart: Fetch Throughput (IPFC) for 1.8 vs 2.8 and 1.16 vs 2.16; annotated gains of 28% and 33%]

Page 8: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Fetching from Multiple Threads (2.X)

[Diagram: Branch Predictor and a two-banked Instruction Cache (BANK 1, BANK 2), each bank with its own SHIFT & MASK logic, followed by MERGE logic]

2 predictions per cycle + 2 cache ports
Multibanked + multiported instruction cache
Replication of the SHIFT & MASK logic
New HW to realign and merge cache lines

But… what is the additional HW cost of a 2.X fetch?
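As an illustration of the extra realign-and-merge work, here is a toy Python model (my own sketch, not the hardware design): each bank returns an aligned line for one thread, SHIFT & MASK extracts the useful instructions, and MERGE packs both groups into one fetch group.

```python
# Toy model of 2.X fetch-group assembly: each of the two cache banks
# returns a full line for one thread; SHIFT drops the instructions
# before the fetch PC, MASK cuts after the valid instructions, and
# MERGE packs both groups into a single group of `width` slots.

def shift_and_mask(line, start, valid):
    """Keep `valid` instructions of `line` beginning at offset `start`."""
    return line[start:start + valid]

def merge(group_a, group_b, width):
    """Pack two per-thread fetch groups into one width-limited group."""
    return (group_a + group_b)[:width]

if __name__ == "__main__":
    line_a = [f"A{i}" for i in range(8)]
    line_b = [f"B{i}" for i in range(8)]
    g_a = shift_and_mask(line_a, 2, 4)   # A2..A5
    g_b = shift_and_mask(line_b, 0, 6)   # B0..B5
    print(merge(g_a, g_b, 8))            # 4 from thread A, then 4 from B
```

The point of the slide is that in a 2.X fetch this logic exists twice, plus the MERGE stage, on the critical fetch path.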

Page 9: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Our Goal

Can we take the best of both worlds? The low complexity of a 1.X fetch architecture + the high performance of a 2.X fetch architecture.

That is… can a single thread provide sufficient instructions to fill the available fetch bandwidth?

Page 10: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Talk Outline

Motivation
Fetch Architectures for SMT
High-Performance Fetch Engines
Simulation Setup
Results
Summary & Conclusions

Page 11: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

High-Performance Fetch Engines (I)

We look for high performance: a gshare / hybrid branch predictor + BTB yields low performance, limiting the fetch BW to one basic block per cycle (6-8 instructions).

We look for low complexity: trace caches, the Branch Target Address Cache, the Collapsing Buffer, etc. fetch multiple basic blocks per cycle (12-16 instructions), but at high complexity.
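Why a BTB-style front end stops at one basic block per cycle can be sketched as follows (a toy model under my own assumptions, not any real predictor's interface):

```python
# Toy model: a conventional BTB-based fetch delivers sequential
# instructions but must stop at the first predicted-taken branch,
# because it produces only one branch target per cycle. With basic
# blocks of ~6-8 instructions, a 16-wide fetch port goes underused.

def fetch_one_block(program, pc, width, taken_branches):
    """Fetch up to `width` instructions, stopping after a taken branch."""
    group = []
    for addr in range(pc, min(pc + width, len(program))):
        group.append(program[addr])
        if addr in taken_branches:   # predicted taken: redirect next cycle
            break
    return group

if __name__ == "__main__":
    program = list(range(100))
    # A taken branch at address 5 caps a 16-wide fetch at 6 instructions.
    print(len(fetch_one_block(program, 0, 16, taken_branches={5})))  # -> 6
```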

Page 12: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

High-Performance Fetch Engines (II)

Our alternatives:

Gskew [Michaud97] + FTB [Reinman99]: FTB fetch blocks are larger than basic blocks; 5% speedup over gshare+BTB in superscalars.

Stream Predictor [Ramirez02]: streams are larger than FTB fetch blocks; 11% speedup over gskew+FTB in superscalars.
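The stream predictor's advantage can be sketched with a similar toy model (the interface is hypothetical; see [Ramirez02] for the real design): a stream is a long run of sequential instructions spanning several basic blocks, so a single prediction can cover the whole fetch width.

```python
# Toy model of stream fetch: the predictor returns one (start, length)
# stream per cycle; because streams span multiple basic blocks
# (including not-taken branches), one prediction can fill a 16-wide
# fetch port, where a BTB-based fetch would stop at ~6-8 instructions.

def fetch_stream(program, stream_start, stream_length, width):
    """Fetch up to `width` sequential instructions from one stream."""
    n = min(stream_length, width, len(program) - stream_start)
    return program[stream_start:stream_start + n]

if __name__ == "__main__":
    program = list(range(100))
    # A 22-instruction stream fills all 16 fetch slots in one cycle.
    print(len(fetch_stream(program, 10, 22, 16)))  # -> 16
```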

Page 13: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Talk Outline

Motivation
Fetch Architectures for SMT
High-Performance Fetch Engines
Simulation Setup
Results
Summary & Conclusions

Page 14: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Simulation Setup

Modified version of SMTSIM [Tullsen96]: trace-driven, allowing wrong-path execution. Decoupled fetch (1 additional pipeline stage). Branch predictor sizes of approx. 45KB. Decode & rename width limited to 8 instructions.

Fetch width           8/16 inst.
Fetch buffer          32 inst.
Fetch policy          ICOUNT
RAS /thread           64-entry
FTQ size /thread      4-entry
Functional units      6 int, 4 ld/st, 3 fp
Inst. queues          32 int, 32 ld/st, 32 fp
ROB /thread           256-entry
Physical registers    384 int, 384 fp
L1 I-cache & D-cache  32KB, 2W, 8 banks
L2 cache              1MB, 2W, 8 banks, 10 cyc.
Line size             64B (16 instructions)
TLB                   48 I + 48 D
Mem. lat.             100 cyc.

Page 15: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Workloads

SPECint2000, with code layout optimized using Spike [Cohn97] + profile data from the train input.

Most representative 300M-instruction trace, using the ref input.

Workloads include 2, 4, 6, and 8 threads, classified according to the threads' characteristics:
ILP: only ILP benchmarks
MEM: memory-bounded benchmarks
MIX: mix of ILP and MEM benchmarks

Page 16: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Talk Outline

Motivation
Fetch Architectures for SMT
High-Performance Fetch Engines
Simulation Setup
Results: ILP workloads, MEM & MIX workloads (only for 2 & 4 threads; see paper for the rest)
Summary & Conclusions

Page 17: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

ILP Workloads - Fetch Throughput

With a given fetch bandwidth, fetching from two threads always benefits fetch performance.

The critical point is 1.16: the stream predictor obtains better fetch performance than 2.8, while gshare+BTB and gskew+FTB obtain worse fetch performance than 2.8.

[Chart: Fetch Throughput (IPFC) for 2_ILP and 4_ILP under 1.8, 2.8, 1.16, and 2.16, with gshare+BTB, gskew+FTB, and stream fetch]

Page 18: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

ILP Workloads – 1.X (1.8) vs 2.X (2.8)

ILP benchmarks have few memory problems and high parallelism, so the fetch unit is the real limiting factor: the higher the fetch throughput, the higher the IPC.

[Chart: Commit Throughput (IPC) for 2_ILP and 4_ILP under 1.8 and 2.8, with gshare+BTB, gskew+FTB, and stream fetch]

Page 19: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

ILP Workloads

So… 2.X is better than 1.X in ILP workloads…

But what about 1.2X instead of 2.X? That is, 1.16 instead of 2.8. This maintains single-thread fetch, and cache lines and buses are already 16 instructions wide: we only have to modify the HW to select 16 instead of 8 instructions.

Page 20: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

ILP Workloads – 2.X (2.8) vs 1.2X (1.16)

With 1.16, the stream predictor increases throughput (9% avg): streams are long enough for a 16-wide fetch. Similar or better performance than 2.16!

Fetching a single block per cycle is not enough: gshare+BTB suffers a 10% slowdown, gskew+FTB a 4% slowdown.

[Chart: Commit Throughput (IPC) for 2_ILP and 4_ILP under 2.8, 1.16, and 2.16, with gshare+BTB, gskew+FTB, and stream fetch]

Page 21: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

MEM & MIX Workloads - Fetch Throughput

Same trend as with the ILP fetch throughput: for a given fetch BW, fetching from two threads is better. Stream > gskew+FTB > gshare+BTB.

[Chart: Fetch Throughput (IPFC) for 2_MIX, 2_MEM, 4_MIX, and 4_MEM under 1.8, 2.8, 1.16, and 2.16, with gshare+BTB, gskew+FTB, and stream fetch]

Page 22: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

MEM & MIX Workloads – 1.X (1.8) vs 2.X (2.8)

With memory-bounded benchmarks… overall performance actually decreases!! Memory-bounded threads monopolize resources for many cycles. This problem was previously identified, and new fetch policies flush [Tullsen01] or stall [Luo01, El-Moursy03] the problematic threads.

[Chart: Commit Throughput (IPC) for 2_MIX, 2_MEM, 4_MIX, and 4_MEM under 1.8 and 2.8, with gshare+BTB, gskew+FTB, and stream fetch]
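A minimal sketch of the stall-style policies cited above (my own simplification of fetch gating, not the exact mechanism of [Luo01, El-Moursy03]): threads with an outstanding long-latency miss are simply excluded from fetch.

```python
# Toy fetch-gating model: a thread waiting on an L2 miss is stalled
# (excluded from fetch) so it cannot keep filling the shared queues
# and registers with instructions that will only wait on memory.

def gated_candidates(threads, l2_miss_pending):
    """Return the thread ids still eligible to fetch this cycle."""
    return [t for t in threads if not l2_miss_pending[t]]

if __name__ == "__main__":
    pending = {0: False, 1: True, 2: False, 3: True}
    # Threads 1 and 3 are blocked on L2 misses and are gated off.
    print(gated_candidates([0, 1, 2, 3], pending))  # -> [0, 2]
```

The fetch policy (e.g. ICOUNT) would then pick among the surviving candidates.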

Page 23: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

MEM & MIX Workloads

Fetching from only one thread means fetching only from the first, highest-priority thread:
It allows the highest-priority thread to proceed with more resources.
It prevents low-quality (lower-priority) threads from monopolizing more and more resources (registers, IQ slots, etc.) on cache misses.

Only the highest-priority thread is fetched. When a cache miss is resolved, the instructions from the second thread will be consumed: ICOUNT will give it more priority after the cache-miss resolution.

A powerful fetch unit can be harmful if not well used.

Page 24: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

MEM & MIX Workloads – 1.X (1.8) vs 1.2X (1.16)

[Chart: Commit Throughput (IPC) for 2_MIX, 2_MEM, 4_MIX, and 4_MEM under 1.8, 1.16, and 2.16, with gshare+BTB, gskew+FTB, and stream fetch]

Even 2.16 has worse commit performance than 1.8: more interference is introduced by the low-quality threads.

Overall, 1.16 is the best combination: the low complexity of fetching from one thread, with the high performance of a wide fetch.

Page 25: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Talk Outline

Motivation
Fetch Architectures for SMT
High-Performance Fetch Engines
Simulation Setup
Results
Summary & Conclusions

Page 26: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Summary

The fetch unit is the most significant obstacle to obtaining high SMT performance. However, researchers usually don't care about SMT fetch performance: they care about how to combine threads to sustain the available fetch throughput. A simple gshare/hybrid + BTB is commonly used, and everybody assumes that 2.8 (2.X) is the correct answer.

Fetching from many threads can be counterproductive: sharing implies competing, and low-quality threads monopolize more and more resources.

Page 27: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Conclusions

1.16 (1.2X) is the best fetch option, using a high-width fetch architecture. It's not the prediction accuracy, it's the fetch width.

Beneficial for both ILP and MEM workloads: 1.X is bad for ILP, 2.X is bad for MEM. 1.16 fetches only from the most promising thread (according to the fetch policy), and as much as possible from it.

It offers the best performance/complexity tradeoff. Fetching from a single thread may require revisiting current SMT fetch policies.

Page 28: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Thanks

Questions & Answers

Page 29: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Backup Slides

Page 30: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

SMT Workloads

Workload  Threads
2_ILP     eon, gcc
2_MEM     mcf, twolf
2_MIX     gzip, twolf
4_ILP     eon, gcc, gzip, bzip2
4_MEM     mcf, twolf, vpr, perlbmk
4_MIX     gzip, twolf, bzip2, mcf
6_ILP     eon, gcc, gzip, bzip2, crafty, vortex
6_MIX     gzip, twolf, bzip2, mcf, vpr, eon
8_ILP     eon, gcc, gzip, bzip2, crafty, vortex, gap, parser
8_MIX     gzip, twolf, bzip2, mcf, vpr, eon, gap, parser

Page 31: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Simulation Setup

Fetch policy          ICOUNT
Gshare predictor      64K-entry, 16-bit history
Gskew predictor       3x32K-entry, 15-bit history
BTB/FTB               2K-entry, 4-way assoc.
Stream predictor      1K-entry, 4-way + 4K-entry, 4-way
RAS /thread           64-entry
FTQ size /thread      4-entry
Functional units      6 int, 4 ld/st, 3 fp
Inst. queues          32 int, 32 ld/st, 32 fp
ROB /thread           256-entry
Physical registers    384 int, 384 fp
L1 I-cache & D-cache  32KB, 2W, 8 banks
L2 cache              1MB, 2W, 8 banks, 10 cyc.
Line size             64B (16 instructions)
TLB                   48 I + 48 D
Mem. lat.             100 cyc.