Simultaneous Multithreading (SMT)

Page 1: Simultaneous Multithreading (SMT)

EECC722 - Shaaban, Lec # 2, Fall 2000, 9-11-2000

Simultaneous Multithreading (SMT)

• An evolutionary processor architecture, originally introduced in 1996 by Dean Tullsen at the University of Washington, that aims at reducing resource waste in wide-issue processors.

• SMT has the potential to greatly enhance processor computational capabilities by:

– Exploiting thread-level parallelism (TLP): simultaneously executing instructions from different threads during the same cycle.

– Providing multiple hardware contexts, hardware thread scheduling, and context-switching capability.

Page 2: Simultaneous Multithreading (SMT)

Microprocessor Architecture Trends

• CISC Machines: instructions take variable times to complete.

• RISC Machines (microcode): simple instructions, optimized for speed.

• RISC Machines (pipelined): same individual instruction latency, greater throughput through instruction "overlap".

• Superscalar Processors: multiple instructions executing simultaneously.

• Multithreaded Processors: additional HW resources (regs, PC, SP); each context gets the processor for x cycles.

• VLIW: "super instructions" grouped together; decreased HW control complexity.

• Single Chip Multiprocessors: duplicate entire processors (tech soon due to Moore's Law).

• Simultaneous Multithreading: multiple HW contexts (regs, PC, SP); each cycle, any context may execute.

Page 3: Simultaneous Multithreading (SMT)

Performance Increase of Workstation-Class Microprocessors 1987-1997

[Figure: Integer SPEC92 performance]

Page 4: Simultaneous Multithreading (SMT)

Microprocessor Logic Density

[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year (1970 to 2000), with points for the i4004, i8080, i8086, i80286, i80386, i80486, and Pentium.]

• Moore's Law: 2X transistors/chip every 1.5 years.

• Transistor counts:

– Alpha 21264: 15 million

– Alpha 21164: 9.3 million

– PowerPC 620: 6.9 million

– Pentium Pro: 5.5 million

– Sparc Ultra: 5.2 million

Page 5: Simultaneous Multithreading (SMT)

Increase of Capacity of VLSI Dynamic RAM Chips

[Figure: DRAM chip capacity in bits (log scale, 1,000 to 1,000,000,000) versus year (1970 to 2000).]

Year    Size (Megabit)
1980    0.0625
1983    0.25
1986    1
1989    4
1992    16
1996    64
1999    256
2000    1024

1.55X/yr, or doubling every 1.6 years
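The growth figures quoted on these slides can be cross-checked: a constant annual growth factor g corresponds to a doubling time of ln 2 / ln g years. The following is a minimal sketch of that arithmetic; the function name is illustrative and not part of the lecture.

```python
import math

def doubling_time_years(annual_growth_factor):
    """Years needed to double at a constant annual growth factor."""
    return math.log(2) / math.log(annual_growth_factor)

# DRAM capacity growth quoted on the slide: 1.55X per year.
dram_doubling = doubling_time_years(1.55)
print(f"DRAM capacity doubles about every {dram_doubling:.1f} years")

# Moore's Law as quoted earlier (2X every 1.5 years), as an annual factor.
moore_annual_factor = 2 ** (1 / 1.5)
print(f"Moore's Law is roughly {moore_annual_factor:.2f}X per year")
```

The two rates are close, which is consistent with logic density and DRAM capacity tracking the same underlying process improvements.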

Page 6: Simultaneous Multithreading (SMT)

CPU Architecture Evolution:

Single Threaded Pipeline

• Traditional 5-stage pipeline.

• Increases throughput: ideal CPI = 1.

[Diagram: a single pipeline with Fetch, Decode, Execute, Memory, and Writeback stages, one register file, PC, and SP, connected to the memory hierarchy (management).]

Page 7: Simultaneous Multithreading (SMT)

CPU Architecture Evolution:

Superscalar Architectures

• Fetch, decode, execute, etc. more than one instruction per cycle (CPI < 1).

• Limited by instruction-level parallelism (ILP).

[Diagram: parallel pipelines (Fetch, Decode, Execute, Memory, Writeback) for instructions i and i+1, sharing one register file, PC, and SP, connected to the memory hierarchy (management).]

Page 8: Simultaneous Multithreading (SMT)

Advanced CPU Architectures:

Traditional Multithreaded Processor

• Multiple HW contexts (PC, SP, and registers).

• One context gets the CPU for x cycles at a time.

• Limited by thread-level parallelism (TLP).

Page 9: Simultaneous Multithreading (SMT)

Advanced CPU Architectures:

VLIW: Intel/HP Explicitly Parallel Instruction Computing (EPIC)

• Strengths:

– Allows for a high level of instruction-level parallelism (ILP).

– Takes a lot of the dependency analysis out of HW and places the focus on smart compilers.

• Weaknesses:

– Keeping functional units (FUs) busy (control hazards).

– Static FU scheduling limits performance gains.

Page 10: Simultaneous Multithreading (SMT)

Advanced CPU Architectures:

Single Chip Multiprocessor

• Strengths:

– Create a single processor block and duplicate.

– Takes a lot of the dependency analysis out of HW and places the focus on smart compilers.

• Weakness:

– Performance limited by individual thread performance (ILP).

Page 11: Simultaneous Multithreading (SMT)

Advanced CPU Architectures:

Single Chip Multiprocessor

[Diagram: duplicated processors i, i+1, ..., n, each with its own register file, PC, SP, control unit, and two-way superscalar pipeline, all connected to the memory hierarchy (management).]

Page 12: Simultaneous Multithreading (SMT)

SMT: Simultaneous Multithreading

• Multiple hardware contexts running at the same time (HW context: registers, PC, and SP).

• Avoids both horizontal and vertical waste by having multiple threads keep the functional units busy during every cycle.

• Builds on top of current time-proven advancements in CPU design: superscalar, dynamic scheduling, hardware speculation, dynamic HW branch prediction.

• Enabling technology: VLSI logic density on the order of hundreds of millions of transistors/chip.

Page 13: Simultaneous Multithreading (SMT)

SMT

• With multiple threads running, penalties from long-latency operations, cache misses, and branch mispredictions will be hidden.

• Pipelines are separated until the issue stage.

• Functional units are shared among all contexts during every cycle:

– More complicated writeback stage.

• More threads issuing to functional units results in higher resource utilization.

Page 14: Simultaneous Multithreading (SMT)

SMT: Simultaneous Multithreading

[Diagram: hardware contexts i, i+1, ..., n, each with its own register file, PC, and SP, feeding two-way superscalar pipelines i, i+1, ..., n under a single chip-wide control unit, all connected to the memory hierarchy (management).]

Page 15: Simultaneous Multithreading (SMT)

The Power Of SMT

[Figure: issue-slot usage over time (processor cycles) for three organizations: Superscalar, Traditional Multithreaded, and Simultaneous Multithreading.]

• Rows of squares represent instruction issue slots.

• Box with number x: instruction issued from thread x.

• Empty box: slot is wasted.
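The issue-slot picture can be turned into a small accounting exercise. The sketch below assumes the usual definitions from the SMT literature: a cycle in which no slot is used counts entirely as vertical waste, and the unused slots of a partially filled cycle count as horizontal waste. The 4-wide grid is made-up illustrative data, not a transcription of the figure.

```python
def waste_breakdown(issue_grid):
    """Count used, horizontally wasted, and vertically wasted issue slots.

    issue_grid holds one list per cycle; each entry is a thread id,
    or None for an empty issue slot.
    """
    used = horizontal = vertical = 0
    for row in issue_grid:
        filled = sum(slot is not None for slot in row)
        used += filled
        if filled == 0:
            vertical += len(row)            # whole cycle wasted
        else:
            horizontal += len(row) - filled  # partially filled cycle
    return used, horizontal, vertical

# Hypothetical single-threaded superscalar trace (4 issue slots per cycle).
example = [
    [1, 1, None, None],
    [None, None, None, None],
    [1, None, None, None],
    [1, 1, 1, 1],
]
used, horizontal, vertical = waste_breakdown(example)
print(f"used={used}, horizontal waste={horizontal}, vertical waste={vertical}")
```

Traditional multithreading can only attack vertical waste, because a cycle still belongs to a single thread; SMT can fill empty slots in both directions by drawing from any context.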

Page 16: Simultaneous Multithreading (SMT)

SMT Performance Comparison

• Instruction throughput from simulations by Eggers et al. at the University of Washington, using both multiprogramming and parallel workloads:

Multiprogramming workload

Threads   Superscalar   Traditional Multithreading   SMT
1         2.7           2.6                          3.1
2         -             3.3                          3.5
4         -             3.6                          5.7
8         -             2.8                          6.2

Parallel workload

Threads   Superscalar   MP2   MP4   Traditional Multithreading   SMT
1         3.3           2.4   1.5   3.3                          3.3
2         -             4.3   2.6   4.1                          4.7
4         -             -     4.2   4.2                          5.6
8         -             -     -     3.5                          6.1
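To put the throughput numbers in perspective, the sketch below computes the ratio of 8-thread SMT throughput to single-thread superscalar throughput for each workload. The values are copied from the tables; the dictionary and function names are illustrative only.

```python
# Throughput values copied from the tables above.
SUPERSCALAR_1_THREAD = {"multiprogramming": 2.7, "parallel": 3.3}
SMT_8_THREADS = {"multiprogramming": 6.2, "parallel": 6.1}

def smt_speedup(workload):
    """Ratio of 8-thread SMT throughput to single-thread superscalar throughput."""
    return SMT_8_THREADS[workload] / SUPERSCALAR_1_THREAD[workload]

for workload in SUPERSCALAR_1_THREAD:
    print(f"{workload}: {smt_speedup(workload):.2f}x")
```

In the same tables, traditional multithreading peaks at 4 threads and then falls off at 8, while SMT keeps improving as threads are added.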

Page 17: Simultaneous Multithreading (SMT)

SMT Instruction Scheduling Methods

• Round Robin:

– Instruction from Thread 1, then Thread 2, then Thread 3, etc.

• I-Count:

– Highest priority assigned to thread with the lowest number of instructions in static portion of pipeline.

• Other:

– Branch First: Branch instructions issued first

– Spec Last: Speculative instructions given low priority
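The first two policies can be sketched in a few lines. This is a minimal illustration, not the simulator's code: the function names are hypothetical, and the per-thread instruction counts are assumed to be supplied by the pipeline model.

```python
from itertools import cycle

def round_robin(num_threads):
    """Iterator that yields thread ids 0, 1, ..., n-1, 0, 1, ... in fixed rotation."""
    return cycle(range(num_threads))

def icount_pick(in_flight_counts):
    """I-Count: pick the thread with the fewest instructions in the
    static (pre-issue) portion of the pipeline. Ties go to the lowest id."""
    return min(range(len(in_flight_counts)), key=lambda t: in_flight_counts[t])

rr = round_robin(3)
print([next(rr) for _ in range(5)])   # rotation ignores pipeline state
print(icount_pick([5, 2, 7, 2]))      # favors a thread that is not clogging the front end
```

Round robin is oblivious to how each thread is progressing, whereas I-Count gives priority to threads that are moving instructions through the pipeline quickly.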

Page 18: Simultaneous Multithreading (SMT)

SMT Performance Example

Inst   Code            Description       Functional unit
A      LUI R5,100      R5 = 100          Int ALU
B      FMUL F1,F2,F3   F1 = F2 x F3      FP ALU
C      ADD R4,R4,8     R4 = R4 + 8       Int ALU
D      MUL R3,R4,R5    R3 = R4 x R5      Int mul/div
E      LW R6,R4        R6 = (R4)         Memory port
F      ADD R1,R2,R3    R1 = R2 + R3      Int ALU
G      NOT R7,R7       R7 = !R7          Int ALU
H      FADD F4,F1,F2   F4 = F1 + F2      FP ALU
I      XOR R8,R1,R7    R8 = R1 XOR R7    Int ALU
J      SUBI R2,R1,4    R2 = R1 - 4       Int ALU
K      SW ADDR,R2      (ADDR) = R2       Memory port

• 4 integer ALUs (1 cycle latency)

• 1 integer multiplier/divider (3 cycle latency)

• 3 memory ports (2 cycle latency, assume cache hit)

• 2 FP ALUs (5 cycle latency)

• Assume all functional units are fully-pipelined

Page 19: Simultaneous Multithreading (SMT)

SMT Performance Example (continued)

Four issue slots per cycle; "-" marks a cycle in which nothing issues.

Cycle   Superscalar issuing slots       SMT issuing slots
1       LUI (A), FMUL (B), ADD (C)      T1.LUI (A), T1.FMUL (B), T1.ADD (C), T2.LUI (A)
2       MUL (D), LW (E)                 T1.MUL (D), T1.LW (E), T2.FMUL (B), T2.ADD (C)
3       -                               T2.MUL (D), T2.LW (E)
4       -                               -
5       ADD (F), NOT (G)                T1.ADD (F), T1.NOT (G)
6       FADD (H), XOR (I), SUBI (J)     T1.FADD (H), T1.XOR (I), T1.SUBI (J), T2.ADD (F)
7       SW (K)                          T1.SW (K), T2.NOT (G), T2.FADD (H)
8                                       T2.XOR (I), T2.SUBI (J)
9                                       T2.SW (K)

• 2 additional cycles to complete program 2

• Throughput:

– Superscalar: 11 inst/7 cycles = 1.57 IPC

– SMT: 22 inst/9 cycles = 2.44 IPC
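The cycle counts in this example can be reproduced with a small issue model. The sketch below is not the sim-SMT simulator. It assumes in-order issue within each thread, only read-after-write dependences, a result that becomes usable a fixed number of cycles (the unit's latency) after issue, priority for the lower-numbered thread, and cycles counted up to the last issue. All names are illustrative.

```python
from collections import Counter

# (mnemonic, destination register, source registers, functional unit),
# transcribed from the instruction table of the example (A through K).
PROGRAM = [
    ("LUI",  "R5", [],           "int_alu"),
    ("FMUL", "F1", ["F2", "F3"], "fp_alu"),
    ("ADD",  "R4", ["R4"],       "int_alu"),
    ("MUL",  "R3", ["R4", "R5"], "int_muldiv"),
    ("LW",   "R6", ["R4"],       "mem_port"),
    ("ADD",  "R1", ["R2", "R3"], "int_alu"),
    ("NOT",  "R7", ["R7"],       "int_alu"),
    ("FADD", "F4", ["F1", "F2"], "fp_alu"),
    ("XOR",  "R8", ["R1", "R7"], "int_alu"),
    ("SUBI", "R2", ["R1"],       "int_alu"),
    ("SW",   None, ["R2"],       "mem_port"),
]
LATENCY = {"int_alu": 1, "int_muldiv": 3, "mem_port": 2, "fp_alu": 5}
UNITS = {"int_alu": 4, "int_muldiv": 1, "mem_port": 3, "fp_alu": 2}
ISSUE_WIDTH = 4

def last_issue_cycle(num_threads):
    """Issue num_threads copies of PROGRAM; return the cycle of the final issue."""
    next_idx = [0] * num_threads                   # next instruction of each thread
    ready_at = [{} for _ in range(num_threads)]    # register -> first usable cycle
    cycle = 0
    while any(i < len(PROGRAM) for i in next_idx):
        cycle += 1
        free_slots = ISSUE_WIDTH
        busy = Counter()                           # units claimed this cycle (fully pipelined)
        for t in range(num_threads):               # lower thread id has priority
            while free_slots > 0 and next_idx[t] < len(PROGRAM):
                _, dest, sources, unit = PROGRAM[next_idx[t]]
                operands_ready = all(ready_at[t].get(r, 0) <= cycle for r in sources)
                if not operands_ready or busy[unit] == UNITS[unit]:
                    break                          # in-order issue: the thread stalls here
                if dest is not None:
                    ready_at[t][dest] = cycle + LATENCY[unit]
                busy[unit] += 1
                free_slots -= 1
                next_idx[t] += 1
    return cycle

for threads in (1, 2):
    cycles = last_issue_cycle(threads)
    total = threads * len(PROGRAM)
    print(f"{threads} thread(s): {total} inst / {cycles} cycles = {total / cycles:.2f} IPC")
```

Under these assumptions the model issues the single-thread program in 7 cycles and the two-thread mix in 9, matching the throughput figures above. Functional-unit limits never bind for this particular program; the stalls come from operand latencies and the 4-wide issue limit.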

Page 20: Simultaneous Multithreading (SMT)

Simulator (sim-SMT) @ RIT CE

• Execution-driven performance simulator.

• Derived from the SimpleScalar tool set.

• Simulates cache, branch prediction, and five pipeline stages.

• Flexible:

– Configuration file controls cache size, buffer sizes, and number of functional units.

• Cross compiler used to generate SimpleScalar assembly language.

• Binary utilities, compiler, and assembler available.

• Standard C library (libc) has been ported.

Page 21: Simultaneous Multithreading (SMT)

Simulator Memory Address Space

Page 22: Simultaneous Multithreading (SMT)

Alternate Functional Unit Configurations

• New functional unit configurations attempted (by adding one of each type of FU):

– +1 integer multiplier/divider:

• +2.8% IPC and issue rate.

• -74% in the number of times no FU was available.

• Simulator very flexible (only one line in the configuration file had to change).

Page 23: Simultaneous Multithreading (SMT)

Sim-SMT Simulator Limitations

• Does not keep precise exceptions.

• System call instructions are not tracked.

• Limited memory space:

– Four test programs’ memory spaces running on one simulator memory space

– Easy to run out of stack space

Page 24: Simultaneous Multithreading (SMT)

Simulation Runs & Results

• Test programs used:

– Newton interpolation.

– Matrix solver using LU decomposition.

– Integer test program.

– FP test program.

• Simulations of a single program:

– 1, 2, and 4 threads.

• System simulations involve a combination of all programs simultaneously:

– Several different combinations were run.

• From simulation results:

– Performance increase:

• Biggest increase occurs when changing from one to two threads.

– Higher issue rate, functional unit utilization.

Page 25: Simultaneous Multithreading (SMT)

Simulation Results: Performance (IPC)

Page 26: Simultaneous Multithreading (SMT)

Simulation Results: Simulation Time

Page 27: Simultaneous Multithreading (SMT)

Simulation Results: Instruction Issue Rate

Page 28: Simultaneous Multithreading (SMT)

Simulation Results: Performance Vs. Issue BW

Page 29: Simultaneous Multithreading (SMT)

Simulation Results: Functional Unit Utilization

Page 30: Simultaneous Multithreading (SMT)

Simulation Results: No Functional Unit Available

Page 31: Simultaneous Multithreading (SMT)

Simulation Results: Horizontal Waste Rate

Page 32: Simultaneous Multithreading (SMT)

Simulation Results: Vertical Waste Rate

Page 33: Simultaneous Multithreading (SMT)

SMT: Simultaneous Multithreading

• Strengths:

– Overcomes the limitations imposed by low single-thread instruction-level parallelism.

– Multiple threads running will hide individual control hazards (branch mispredictions).

• Weaknesses:

– Additional stress placed on the memory hierarchy.

– Control unit complexity.

– Sizing of resources (cache, branch prediction, etc.).

– Accessing registers (32 integer + 32 FP for each HW context):

• Some designs devote two clock cycles for both register reads and register writes.

Page 34: Simultaneous Multithreading (SMT)

SMT: Simultaneous Multithreading

Kernel Code

• Many, if not all, benchmarks are based upon a limited interaction with kernel code.

• How can the kernel overhead be minimized (context switching, process management, etc.)?

– CHAOS (Context Hardware Accelerated Operating System).

• Introduce a lightweight dedicated kernel context to handle process management:

– When there are 4 contexts, there is a good chance that one of them will continue to run. Why take an (expensive) chance on swapping it out when it will be brought right back in by the swapper (process management)?

Page 35: Simultaneous Multithreading (SMT)

SMT & Technology

• SMT architecture has not been implemented in any existing commercial microprocessor yet (First 4-thread SMT CPU: Alpha EV8 ~2001).

• Current technology has the potential for 4-8 simultaneous threads:

– Based on transistor count and design complexity.

Page 36: Simultaneous Multithreading (SMT)

RIT-CE SMT Project Goals

• Investigate performance gains from exploiting Thread-Level Parallelism (TLP) in addition to current Instruction-Level Parallelism (ILP) in processor design.

• Design and simulate an architecture incorporating Simultaneous Multithreading (SMT).

• Study operating system and compiler modifications needed to support SMT processor architectures.

• Define a standard interface for efficient SMT-processor/OS kernel interaction.

• Modify an existing OS kernel (Linux?) to take advantage of hardware multithreading capabilities.

• Long term: VLSI implementation of an SMT prototype.

Page 37: Simultaneous Multithreading (SMT)

Current Project Status

• Architecture/OS interface definition.

• Study of design alternatives and impact on performance.

• SMT Simulator Development:

– System call development, kernel support, and compiler/assembler changes.

• Development of code (programs and OS kernel) is key to getting results.

Page 38: Simultaneous Multithreading (SMT)

Short-Term Project Chart

[Diagram: components shown are Simulator, Compiler, Linker/Loader, Simulation Results (running program), System Call Proxy (OS specific), Kernel Code, Process Management, Memory Management, and SMT Kernel Simulation.]

• Simulator will represent hardware with a kernel context.

• Kernel code will provide the thread that will be held in the HW kernel context.

• Compiler is simply a hacked version of gcc (using the assembler from the host system).

Page 39: Simultaneous Multithreading (SMT)

Current/Future Project Goals

• SMT simulator completion, refinement, and further testing.

• Development of an SMT-capable OS kernel.

• Extensive performance studies with various workloads using the simulator/OS/compiler:

– Suitability for fine-grained parallel applications?

– Effect on multimedia applications?

• Architectural changes based on benchmarks.

• Investigation of cache impact on SMT performance.

• Investigation of an in-order SMT processor (C or VHDL model).

• MOSIS Tiny Chip (partial/full) implementation.

• Investigate the suitability of SMT processors as building blocks for MPPs.