CS 7810 Lecture 16 Simultaneous Multithreading: Maximizing On-Chip Parallelism D.M. Tullsen, S.J. Eggers, H.M. Levy Proceedings of ISCA-22 June 1995


Page 1:

CS 7810 Lecture 16

Simultaneous Multithreading: Maximizing On-Chip Parallelism

D.M. Tullsen, S.J. Eggers, H.M. Levy
Proceedings of ISCA-22

June 1995

Page 2:

Processor Under-Utilization

• Wide gap between average processor utilization and peak processor utilization

• Caused by dependences, long-latency instrs, and branch mispredicts

• Results in many idle cycles for many structures

Page 3:

Superscalar Utilization

[Figure: Resources (e.g., FUs) vs. Time for Thread-1, with vertical and horizontal waste marked]

• Suffers from horizontal waste (can’t find enough work in a cycle) and vertical waste (because of dependences, there is nothing to do for many cycles)
• Utilization = 19%
• Vertical:horizontal waste = 61:39
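The waste taxonomy above can be made concrete with a small calculation over an issue-slot schedule. The schedule and numbers below are illustrative, not the paper's measurements:

```python
# Sketch: classifying empty issue slots as vertical vs. horizontal waste.
# A schedule is the number of instrs actually issued in each cycle.

def waste_breakdown(schedule, width):
    """schedule: instrs issued per cycle; width: machine issue width."""
    used = sum(schedule)
    total = len(schedule) * width
    vertical = sum(width for n in schedule if n == 0)       # fully idle cycles
    horizontal = sum(width - n for n in schedule if n > 0)  # unused slots in busy cycles
    return used / total, vertical, horizontal

util, v, h = waste_breakdown([3, 0, 0, 1, 2, 0], width=4)
# utilization = 6/24 = 25%; vertical waste = 12 slots, horizontal = 6 slots
```

Every empty slot is exactly one of the two kinds, so vertical + horizontal waste always equals total slots minus used slots.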

Page 4:

Chip Multiprocessors

[Figure: Resources (e.g., FUs) vs. Time for Thread-1 and Thread-2, with vertical and horizontal waste marked]

• Single-thread performance goes down
• Horizontal waste is reduced

Page 5:

Fine-Grain Multithreading

[Figure: Resources (e.g., FUs) vs. Time, Thread-1 and Thread-2 interleaved across cycles, with vertical and horizontal waste marked]

• Low-cost context-switch at a fine grain
• Reduces vertical waste

Page 6:

Simultaneous Multithreading

[Figure: Resources (e.g., FUs) vs. Time, Thread-1 and Thread-2 sharing issue slots within a cycle, with vertical and horizontal waste marked]

• Reduces vertical and horizontal waste

Page 7:

Pipeline Structure

[Diagram: four front-ends (I-Cache, Bpred, Rename, ROB) feeding one shared execution engine (Regs, IQ, FUs, D-Cache)]

• Private front-ends, shared execution engine
• What about RAS, LSQ?

Page 8:

Chip Multi-Processor

[Diagram: four private front-ends (I-Cache, Bpred, Rename, ROB), each paired with its own private execution engine (Regs, IQ, FUs, D-Cache)]

• Private front-ends, private execution engines

Page 9:

Clustered SMT

[Diagram: four front-ends feeding execution clusters]

Page 10:

Evaluated Models

• Fine-Grained Multithreading

• Unrestricted SMT

• Restricted SMT
X-issue: a thread can only issue up to X instrs in a cycle
Limited connection: each thread is tied to a fixed FU

Page 11:

Results

• SMT nearly eliminates horizontal waste
• In spite of priorities, single-thread performance degrades (cache contention)
• Not much difference between private and shared caches – however, with few threads, the private caches go under-utilized

Page 12:

Comparison of Models


Page 13:

CMP vs. SMT

Page 14:

CS 7810 Lecture 16

Exploiting Choice: Instruction Fetch and Issue on an Implementable SMT Processor

D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.L. Stamm

Proceedings of ISCA-23
June 1996

Page 15:

New Bottlenecks

• Instruction fetch has a strong influence on total throughput

if the execution engine is executing at top speed, it is often hungry for new instrs
some threads are more likely to have ready instrs than others – selection becomes important

Page 16:

SMT Processor

[Diagram: SMT processor additions over a superscalar]

• Multiple PCs
• Multiple Renames and ROBs
• Multiple RAS
• More registers

Page 17:

SMT Overheads

• Large register file – need at least 256 physical registers to support eight threads

increases cycle time/pipeline depth
increases mispredict penalty
increases bypass complexity
increases register lifetime

• Results in 2% performance loss
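The 256-register minimum is simple arithmetic: each thread context needs a full architectural register set before any renaming headroom is added. The figure of 32 architectural registers (typical of an Alpha-like ISA) and the headroom value are assumptions for illustration:

```python
# Sketch of the register-file sizing argument for an 8-thread SMT.
# 32 architectural registers per thread is an assumption (Alpha-like ISA);
# the renaming headroom is an illustrative placeholder, not the paper's value.
threads, arch_regs = 8, 32
min_physical = threads * arch_regs       # 256: just the architectural state
renaming_headroom = 100                  # extra regs for in-flight renamed values
total_physical = min_physical + renaming_headroom
```

The point is that the floor grows linearly with thread count, which is why the register file becomes a cycle-time and bypass-complexity problem.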

Page 18:

Base Design

• Front-end is fine-grain multithreaded, rest is SMT
• Bottlenecks:

Low fetch rate (4.2 instrs/cycle)
IQ is often full, but only half the issue bandwidth is being used

Page 19:

Fetch Efficiency

• Base case uses RoundRobin.1.8 (one thread supplies all eight fetch slots each cycle)

• RR.2.4: fetches four instrs each from two threads
requires a banked organization
requires additional multiplexing logic

• Increases the chances of finding eight instrs without a taken branch

• Yields instrs in spite of an I-cache miss

• RR.2.8: extends RR.2.4 by reading out a larger cache line
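A minimal sketch of the RR.2.4 idea, modeling each thread's fetchable instructions as a plain list (real hardware reads from a banked I-cache and stops at taken branches or misses; the data here is illustrative):

```python
from collections import deque

# Sketch of an RR.2.4-style fetch stage: each cycle, the next two threads
# in round-robin order each supply up to four instructions.

def rr_fetch(threads, num_threads=2, per_thread=4):
    """threads: deque of (tid, pending_instrs); mutated in place each cycle."""
    fetched = []
    for _ in range(min(num_threads, len(threads))):
        tid, pending = threads[0]
        take = min(per_thread, len(pending))
        fetched.extend((tid, i) for i in pending[:take])
        threads[0] = (tid, pending[take:])
        threads.rotate(-1)            # advance the round-robin pointer
    return fetched

q = deque([(0, list("abcdef")), (1, list("xy")), (2, list("pqrs"))])
got = rr_fetch(q)   # 4 instrs from thread 0, then only 2 from thread 1
```

Fetching from two threads per cycle raises the odds of filling all slots: when one thread runs short (here, thread 1 has only two instrs), the loss is capped at that thread's share rather than the whole fetch width.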

Page 20:

Results

Page 21:

Fetch Effectiveness

• Are we picking the best instructions?

• IQ-clog: instrs that sit in the issue queue for many cycles; does it make sense to fetch their dependents?

• Wrong-path instructions waste issue slots

• Ideally, we want useful instructions that have short issue queue lifetimes

Page 22:

Fetch Effectiveness

• Useful instructions: throttle fetch if branch mispredict probability is high
confidence, num-branches (BRCOUNT), in-flight window size

• Short lifetimes: throttle fetch if you encounter a cache miss (MISSCOUNT), give priority to threads that have young instrs (IQPOSN)
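These heuristics all reduce to ranking threads by a per-thread counter, with lower counts meaning higher fetch priority. The `fetch_priority` helper and the counter values below are illustrative, not the paper's implementation:

```python
# Sketch: ranking threads for fetch priority by a per-thread counter.
# Lower counts win: fewer unresolved branches (BRCOUNT) or fewer
# outstanding cache misses (MISSCOUNT).

def fetch_priority(threads, key):
    """threads: dict tid -> counters; returns tids, highest priority first."""
    return sorted(threads, key=lambda t: threads[t][key])

counters = {
    0: {"brcount": 3, "misscount": 0},   # many unresolved branches in flight
    1: {"brcount": 0, "misscount": 2},   # stalled on outstanding misses
    2: {"brcount": 1, "misscount": 0},
}
by_branches = fetch_priority(counters, "brcount")    # [1, 2, 0]
by_misses   = fetch_priority(counters, "misscount")  # [0, 2, 1]
```

Note that the two heuristics disagree on the same workload: BRCOUNT demotes the branchy thread 0, while MISSCOUNT demotes the miss-bound thread 1, which is why the paper evaluates them separately.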

Page 23:

ICOUNT

• ICOUNT: priority is based on number of unissued instrs
everyone gets a share of the issue queue

• Long-latency instructions will not dominate the IQ

• Threads that have high issue rate will also have high fetch rate

• In-flight windows are short and wrong-path instrs are minimized

• Increased fairness → more ready instrs per cycle
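An ICOUNT.2.8-style selection can be sketched as follows, assuming per-thread counts of unissued instructions are available each cycle (the helper and its data are illustrative):

```python
# Sketch of ICOUNT.2.8-style fetch: the two threads with the fewest
# instructions in the pre-issue stages get priority, sharing up to
# eight fetch slots between them.

def icount_fetch(unissued, pending, num_threads=2, total_width=8):
    """unissued: tid -> instrs in decode/rename/IQ; pending: tid -> fetchable instrs."""
    chosen = sorted(unissued, key=unissued.get)[:num_threads]
    fetched, budget = [], total_width
    for tid in chosen:
        take = min(budget, len(pending[tid]))
        fetched.extend((tid, i) for i in pending[tid][:take])
        budget -= take
    return fetched

unissued = {0: 12, 1: 3, 2: 7}   # thread 1 occupies the least of the window
pending = {t: list(range(8)) for t in unissued}
got = icount_fetch(unissued, pending)   # all 8 slots go to thread 1
```

The feedback loop is the point: a thread that issues quickly drains its unissued count, which raises its fetch priority, so fetch bandwidth flows toward threads that are making progress rather than ones clogging the IQ.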

Page 24:

Results

Throughput has gone from 2.2 (single thread) to 3.9 (base SMT) to 5.4 (ICOUNT.2.8)

Page 25:

Reducing IQ-clog

• IQBUF: a buffer before the issue queue

• ITAG: pre-examine the tags to detect I-cache misses and not waste fetch bandwidth

• OPT_last and SPEC_last: lower issue priority for speculative instrs

• These techniques entail overheads and result in minor improvements

Page 26:

Bottleneck Analysis

• The following are not bottlenecks: issue bandwidth, issue queue size, memory throughput

• Doubling fetch bandwidth improves throughput by 8% -- there is still room for improvement

• SMT is more tolerant of branch mispredicts: perfect prediction improves 1-thread by 25% and 8-thread by 9% -- no speculation has a similar effect

• Register file can be a huge bottleneck

Page 27:

IPC vs. Threads vs. Registers

Page 28:

Power and Energy

• Energy is heavily influenced by “work done” and by execution time
compared to a single-thread machine, SMT does not reduce “work done”, but reduces execution time → reduced energy

• Same work, less time → higher power!
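The argument can be put in numbers by splitting energy into a work-dependent (dynamic) part and a time-dependent (static/leakage) part; all values below are made up for illustration:

```python
# Worked example of the energy/power argument with illustrative numbers.
# Dynamic energy tracks work done (same for both machines); static energy
# tracks elapsed time, so finishing sooner saves energy but raises power.

e_dynamic = 80.0             # joules for the instruction mix (identical work)
p_static = 2.0               # watts of leakage/clock power while running
t_single, t_smt = 10.0, 6.0  # seconds; SMT finishes the same work sooner

e_single = e_dynamic + p_static * t_single   # 100 J
e_smt = e_dynamic + p_static * t_smt         # 92 J: less time -> less energy
p_single = e_single / t_single               # 10 W average
p_smt = e_smt / t_smt                        # ~15.3 W: same work, less time -> higher power
```

This is why SMT is usually framed as an energy win but a power (and thus thermal) cost: the energy saving comes only from the time-dependent component, while the shorter runtime concentrates dissipation.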
