Advanced Microarchitecture

Transcript
Page 1: Advanced  Microarchitecture

Advanced Microarchitecture: Multi-This, Multi-That, …

Page 2: Advanced  Microarchitecture

Limits on IPC

• Lam92
  – This paper focused on the impact of control flow on ILP
  – Speculative execution can expose 10-400 IPC
    • assumes no machine limitations except for control dependencies and actual dataflow dependencies

• Wall91
  – This paper looked at limits more broadly
    • No branch prediction, no register renaming, no memory disambiguation: 1-2 IPC
    • ∞-entry bpred, 256 physical registers, perfect memory disambiguation: 4-45 IPC
    • Perfect bpred, register renaming and memory disambiguation: 7-60 IPC
  – This paper did not consider “control independent” instructions

Lecture 17: Multi-This, Multi-That, ...

Page 3: Advanced  Microarchitecture

Practical Limits

• Today, 1-2 IPC sustained
  – far from the 10’s-100’s reported by limit studies

• Limited by:
  – branch prediction accuracy
  – underlying DFG
    • influenced by algorithms, compiler
  – memory bottleneck
  – design complexity
    • implementation, test, validation, manufacturing, etc.
  – power
  – die area

Page 4: Advanced  Microarchitecture

Differences Between Real Hardware and Limit Studies?

• Real branch predictors aren’t 100% accurate
• Memory disambiguation is not perfect
• Physical resources are limited
  – can’t have infinite register renaming w/o an infinite PRF
  – would need an infinite-entry ROB, RS and LSQ
  – would need 10’s-100’s of execution units for 10’s-100’s of IPC
• Bandwidth/latencies are limited
  – studies assumed single-cycle execution
  – infinite fetch/commit bandwidth
  – infinite memory bandwidth (perfect caching)

Page 5: Advanced  Microarchitecture

Bridging the Gap

[Figure: IPC on a log scale (1, 10, 100) for Single-Issue Pipelined, Superscalar Out-of-Order (Today), Superscalar Out-of-Order (Hypothetical-Aggressive), and the limit studies. There are diminishing returns w.r.t. a larger instruction window and higher issue-width, and power (Watts) has been growing exponentially as well.]

Page 6: Advanced  Microarchitecture

Past the Knee of the Curve?

[Figure: Performance vs. “Effort” for Scalar In-Order, Moderate-Pipe Superscalar/OOO, and Very-Deep-Pipe Aggressive Superscalar/OOO. It made sense to go Superscalar/OOO (good ROI), but past the knee there is very little gain for substantial effort.]

Page 7: Advanced  Microarchitecture

So how do we get more performance?

• Keep pushing IPC and/or frequency?
  – possible, but too costly
    • design complexity (time to market), cooling (cost), power delivery (cost), etc.

• Look for other parallelism
  – ILP/IPC: fine-grained parallelism
  – Multi-programming: coarse-grained parallelism
    • assumes multiple user-visible processing elements
    • all parallelism up to this point was user-invisible

Page 8: Advanced  Microarchitecture

User Visible/Invisible

• All microarchitecture performance gains up to this point were “free”
  – free in that no user intervention is required beyond buying the new processor/system
    • recompilation/rewriting could provide even more benefit, but you get some even if you do nothing

• Multi-processing pushes the problem of finding the parallelism to above the ISA interface

Page 9: Advanced  Microarchitecture

Workload Benefits

[Figure: runtime for Tasks A and B. One 3-wide OOO CPU runs Task A then Task B; a 4-wide OOO CPU is only slightly faster; two 3-wide OOO CPUs (one task each) finish much sooner, and even two 2-wide OOO CPUs beat the single wide core. This assumes you have two tasks/programs to execute…]

Page 10: Advanced  Microarchitecture

… If Only One Task

[Figure: runtime for Task A alone. A 4-wide OOO CPU is slightly faster than a 3-wide; with two 3-wide OOO CPUs, one sits idle (no benefit over 1 CPU); with two 2-wide OOO CPUs, one sits idle and Task A runs slower (performance degradation!).]

Page 11: Advanced  Microarchitecture

Sources of (Coarse) Parallelism

• Different applications
  – MP3 player in background while you work on Office
  – Other background tasks: OS/kernel, virus check, etc.
  – Piped applications
    • gunzip -c foo.gz | grep bar | perl some-script.pl

• Within the same application
  – Java (scheduling, GC, etc.)
  – Explicitly coded multi-threading
    • pthreads, MPI, etc.

Page 12: Advanced  Microarchitecture

(Execution) Latency vs. Bandwidth

• Desktop processing
  – typically want an application to execute as quickly as possible (minimize latency)

• Server/Enterprise processing
  – often throughput oriented (maximize bandwidth)
  – latency of individual tasks less important
    • ex. Amazon processing thousands of requests per minute: it’s ok if an individual request takes a few seconds more, so long as the total number of requests is processed in time

Page 13: Advanced  Microarchitecture

Benefit of MP Depends on Workload

• Limited number of parallel tasks to run on a PC
  – adding more CPUs than tasks provides zero performance benefit

• Even for parallel code, Amdahl’s law will likely result in sub-linear speedup

[Figure: stacked runtime bars for 1, 2, 3, and 4 CPUs; the serial portion stays fixed while the parallelizable portion shrinks as CPUs are added.]

• In practice, the parallelizable portion may not be evenly divisible
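The uneven-divisibility caveat can be made concrete with a small sketch (the function and its work-chunk model are illustrative, not from the slides): if the parallel work comes in indivisible chunks, the slowest CPU pays for the ceiling of chunks/CPUs.

```python
import math

def speedup(serial_time, parallel_chunks, n_cpus):
    """Runtime speedup vs. 1 CPU when the parallelizable portion is
    parallel_chunks equal, indivisible units of work (1 time unit each)."""
    t_one = serial_time + parallel_chunks            # everything runs serially
    # With n CPUs, the slowest CPU executes ceil(chunks / n) units.
    t_n = serial_time + math.ceil(parallel_chunks / n_cpus)
    return t_one / t_n

# Perfectly divisible: 2 CPUs on 8 chunks, plus 2 units of serial work
speedup(2, 8, 2)   # (2+8)/(2+4) ≈ 1.67
# Not evenly divisible: 3 CPUs still pay for ceil(8/3) = 3 units
speedup(2, 8, 3)   # (2+8)/(2+3) = 2.0, below the "ideal" 10/(2+8/3) ≈ 2.14
```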

Page 14: Advanced  Microarchitecture

Cache Coherency Protocols

• Not covered in this course
  – You should have seen a bunch of this in CS6290

• Many different protocols
  – different numbers of states
  – different bandwidth/performance/complexity tradeoffs
  – current protocols usually referred to by their states
    • ex. MESI, MOESI, etc.

Page 15: Advanced  Microarchitecture

Shared Memory Focus

• Most small-medium multi-processors (these days) use some sort of shared memory
  – shared memory doesn’t scale as well to larger numbers of nodes
    • communications are broadcast based
    • the bus becomes a severe bottleneck
  – or you have to deal with directory-based implementations
  – message passing doesn’t need a centralized bus
    • can arrange the multi-processor like a graph
      – nodes = CPUs, edges = independent links/routes
    • can have multiple communications/messages in transit at the same time

Page 16: Advanced  Microarchitecture

SMP Machines

• SMP = Symmetric Multi-Processing
  – Symmetric = All CPUs are “equal”
  – Equal = any process can run on any CPU
    • contrast with older parallel systems with a master CPU and multiple worker CPUs

[Figure: four CPUs (CPU0-CPU3) in an SMP system; pictures found from Google Images.]

Page 17: Advanced  Microarchitecture

Hardware Modifications for SMP

• Processor
  – mainly support for cache coherence protocols
    • includes caches, write buffers, LSQ
    • control complexity increases, as memory latencies may be substantially more variable

• Motherboard
  – multiple sockets (one per CPU)
  – datapaths between CPUs and the memory controller

• Other
  – Case: larger for a bigger mobo, better airflow
  – Power: bigger power supply for N CPUs
  – Cooling: need to remove N CPUs’ worth of heat

Page 18: Advanced  Microarchitecture

Chip-Multiprocessing

• Simple SMP on the same chip

[Figure: Intel “Smithfield” block diagram and AMD dual-core Athlon FX; pictures found from Google Images.]

Page 19: Advanced  Microarchitecture

Shared Caches

• Resources can be shared between CPUs
  – ex. IBM Power 5

[Figure: CPU0 and CPU1 share the L2 cache, so there is no need to keep two copies coherent; the L3 cache is also shared (only tags are on-chip; data are off-chip).]

Page 20: Advanced  Microarchitecture

Benefits?

• Cheaper than mobo-based SMP
  – all/most interface logic integrated onto the main chip (fewer total chips, single CPU socket, single interface to main memory)
  – less power than mobo-based SMP as well (communication on-die is more power-efficient than chip-to-chip communication)

• Performance
  – on-chip communication is faster

• Efficiency
  – potentially better use of hardware resources than trying to make a wider/more-OOO single-threaded CPU

Page 21: Advanced  Microarchitecture

Performance vs. Power

• 2x CPUs does not necessarily equal 2x performance

• 2x CPUs → ½ power for each
  – maybe a little better than ½ if resources can be shared

• Back-of-the-envelope calculation:
  – 3.8 GHz CPU at 100W
  – Dual-core: 50W per CPU
  – P ∝ V³: V³orig / V³CMP = 100W / 50W → VCMP ≈ 0.8 Vorig
  – f ∝ V: fCMP ≈ 3.0 GHz
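The back-of-the-envelope numbers follow from the cubic power model: P ~ C·V²·f, and f scales roughly linearly with V, so P ∝ V³. A small sketch (the function name is illustrative):

```python
def dual_core_operating_point(f_orig_ghz, p_orig_w, p_target_w):
    """Find the voltage/frequency scaling that brings one core from
    p_orig_w down to p_target_w, assuming P scales with V^3
    (P ~ C*V^2*f, with f roughly proportional to V)."""
    v_ratio = (p_target_w / p_orig_w) ** (1 / 3)   # V_CMP / V_orig
    f_new_ghz = f_orig_ghz * v_ratio               # f scales with V
    return v_ratio, f_new_ghz

# The slide's example: a 3.8 GHz, 100W core scaled to a 50W-per-core budget
v_ratio, f_new = dual_core_operating_point(3.8, 100, 50)
# v_ratio ≈ 0.79 (the slide rounds to 0.8); f_new ≈ 3.0 GHz
```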

Page 22: Advanced  Microarchitecture

Simultaneous Multi-Threading

• Uni-processor: 4-6 wide, lucky if you get 1-2 IPC
  – poor utilization

• SMP: 2-4 CPUs, but need independent tasks
  – else poor utilization as well

• SMT: the idea is to use a single large uni-processor as a multi-processor

Page 23: Advanced  Microarchitecture

SMT (2)

[Figure: issue-slot utilization of a regular CPU, a CMP (2x HW cost), and a 4-thread SMT (approx. 1x HW cost).]

Page 24: Advanced  Microarchitecture

Overview of SMT Hardware Changes

• For an N-way (N threads) SMT, we need:
  – Ability to fetch from N threads
  – N sets of registers (including PCs)
  – N rename tables (RATs)
  – N virtual memory spaces

• But we don’t need to replicate the entire OOO execution engine (schedulers, execution units, bypass networks, ROBs, etc.)

Page 25: Advanced  Microarchitecture

SMT Fetch

• Duplicated fetch logic
  [Figure: per-thread PCs (PC0, PC1, PC2), each with its own fetch unit sharing the I$, feeding Decode, Rename, Dispatch and the RS.]

• Cycle-multiplexed fetch logic
  [Figure: PC0, PC1, PC2 selected by cycle % N into a single fetch unit and the I$, feeding Decode, etc. and the RS.]

• Alternatives
  – Other-multiplexed fetch logic
  – Duplicate the I$ as well
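The cycle-multiplexed option can be sketched in a few lines (a behavioral model, not the actual hardware; the PC values are made up):

```python
def fetch_thread(cycle, pcs):
    """Round-robin (cycle-multiplexed) fetch: on each cycle, fetch from
    thread (cycle % N) using that thread's PC."""
    n = len(pcs)
    tid = cycle % n
    return tid, pcs[tid]

pcs = [0x400000, 0x500000, 0x600000]   # PC0, PC1, PC2 (example addresses)
# cycle 0 -> thread 0, cycle 1 -> thread 1, cycle 2 -> thread 2,
# cycle 3 -> thread 0 again, and so on.
schedule = [fetch_thread(c, pcs)[0] for c in range(6)]
```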

Page 26: Advanced  Microarchitecture

SMT Rename

• Thread #1’s R12 != Thread #2’s R12
  – separate name spaces
  – need to disambiguate

[Figure: two options. (1) Separate RATs: Thread0 and Thread1 each index their own RAT (RAT0, RAT1) with the register #, both mapping into a shared PRF. (2) Single RAT: concatenate the Thread-ID with the register # to form the RAT index into the PRF.]
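The concatenation option can be modeled as a single table indexed by thread-ID and architectural register number (a behavioral sketch; the class and sizes are illustrative, not the actual hardware):

```python
NUM_ARCH_REGS = 32

class SharedRAT:
    """One RAT shared by all threads, indexed by (thread-ID ++ register #),
    so each thread's R12 maps to a different physical register."""
    def __init__(self, num_threads):
        self.table = [None] * (num_threads * NUM_ARCH_REGS)

    def _index(self, tid, arch_reg):
        # Concatenating the thread-ID with the register number is just
        # tid * NUM_ARCH_REGS + arch_reg when flattened.
        return tid * NUM_ARCH_REGS + arch_reg

    def rename(self, tid, arch_reg, phys_reg):
        self.table[self._index(tid, arch_reg)] = phys_reg

    def lookup(self, tid, arch_reg):
        return self.table[self._index(tid, arch_reg)]

rat = SharedRAT(num_threads=2)
rat.rename(0, 12, 40)   # thread 0's R12 -> physical T40
rat.rename(1, 12, 77)   # thread 1's R12 -> physical T77, no conflict
```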

Page 27: Advanced  Microarchitecture

SMT Issue, Exec, Bypass, …

• No change needed

Thread 0:                    Thread 1:
Add R1 = R2 + R3             Add R1 = R2 + R3
Sub R4 = R1 – R5             Sub R4 = R1 – R5
Xor R3 = R1 ^ R4             Xor R3 = R1 ^ R4
Load R2 = 0[R3]              Load R2 = 0[R3]

After renaming:

Thread 0:                    Thread 1:
Add T12 = T20 + T8           Add T17 = T29 + T3
Sub T19 = T12 – T16          Sub T5 = T17 – T2
Xor T14 = T12 ^ T19          Xor T31 = T17 ^ T5
Load T23 = 0[T14]            Load T25 = 0[T31]

The renamed instructions from both threads then occupy shared RS entries.

Page 28: Advanced  Microarchitecture

SMT Cache

• Each process has its own virtual address space
  – TLB must be thread-aware
    • translate (thread-id, virtual page) → physical page
  – Virtual portions of caches must also be thread-aware
    • a VIVT cache must now be (virtual addr, thread-id)-indexed and (virtual addr, thread-id)-tagged
    • similar for a VIPT cache
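A thread-aware TLB amounts to including the thread-ID in the lookup key (a behavioral sketch; the class name and page numbers are made up):

```python
class ThreadAwareTLB:
    """TLB keyed on (thread-id, virtual page number), so identical virtual
    pages from different threads translate to different physical pages."""
    def __init__(self):
        self.entries = {}   # (tid, vpn) -> ppn

    def insert(self, tid, vpn, ppn):
        self.entries[(tid, vpn)] = ppn

    def translate(self, tid, vpn):
        return self.entries.get((tid, vpn))   # None models a TLB miss

tlb = ThreadAwareTLB()
tlb.insert(0, 0x1234, 0x0AAA)
tlb.insert(1, 0x1234, 0x0BBB)   # same virtual page, different thread
# Without the thread-id in the key, thread 1 would wrongly hit thread 0's entry.
```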

Lecture 17: Multi-This, Multi-That, ...

Page 29: Advanced  Microarchitecture

SMT Commit

• One “commit PC” per thread

• Register file management
  – ARF/PRF organization
    • need one ARF per thread
  – Unified PRF
    • need one “architected RAT” per thread

• Need to maintain interrupts, exceptions, faults on a per-thread basis
  – just as OOO needs to appear to the outside world as if it is in-order, SMT needs to appear as if it is actually N CPUs

Page 30: Advanced  Microarchitecture

SMT Design Space

• Number of threads

• Full-SMT vs. hard-partitioned SMT
  – full-SMT: ROB entries can be allocated arbitrarily between the threads
  – hard-partitioned: if only one thread, use all ROB entries; if two threads, each is limited to one half of the ROB (even if the other thread uses only a few entries); possibly similar for RS, LSQ, PRF, etc.

• Amount of duplication
  – Duplicate I$, D$, fetch engine, decoders, schedulers, etc.?
  – There’s a continuum of possibilities between SMT and CMP
    • ex. could have a CMP where the FP unit is shared SMT-style

Page 31: Advanced  Microarchitecture

SMT Performance

• When it works, it fills idle “issue slots” with work from other threads; throughput improves

• But sometimes it can cause performance degradation!
  – Time(finish one task, then do the other) < Time(do both at the same time using SMT)

Page 32: Advanced  Microarchitecture

How?

• Cache thrashing

[Figure: Thread0 just fits in the Level-1 caches (I$, D$), so it executes reasonably quickly due to high cache hit rates; after a context switch, Thread1 also fits nicely in the caches. Run together under SMT, the caches are just big enough to hold one thread’s data but not two threads’ worth, so both threads have significantly higher cache miss rates (spilling to the L2).]

Page 33: Advanced  Microarchitecture

Fairness

• Consider two programs
  – By themselves:
    • Program A: runtime = 10 seconds
    • Program B: runtime = 10 seconds
  – On SMT:
    • Program A: runtime = 14 seconds
    • Program B: runtime = 18 seconds

• Standard Deviation of Speedups (lower = better)
  – A’s speedup: 10/14 ≈ 0.71
  – B’s speedup: 10/18 ≈ 0.56
  – SDS ≈ 0.11
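The SDS number can be reproduced with the sample standard deviation (a sketch; it assumes the slide uses the n-1 form, which is what matches the 0.11 result):

```python
from statistics import stdev

def sds(solo_times, smt_times):
    """Standard Deviation of Speedups: spread of per-program slowdowns
    under SMT (lower = the pain is shared more evenly)."""
    speedups = [solo / smt for solo, smt in zip(solo_times, smt_times)]
    return stdev(speedups)        # sample standard deviation (n-1)

# Programs A and B: 10s alone, 14s and 18s under SMT
sds([10, 10], [14, 18])   # speedups 0.71 and 0.56 -> SDS ≈ 0.11
```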

Lecture 17: Multi-This, Multi-That, ...

Page 34: Advanced  Microarchitecture

Fairness (2)

• SDS encourages everyone to be punished similarly
  – it does not account for actual performance, so if everyone is 1000x slower, it’s still “fair”

• Alternative: Harmonic Mean of Weighted IPCs (HMWIPC)
  – IPCi = achieved IPC for thread i
  – SingleIPCi = IPC when thread i runs alone

  HMWIPC = N / (SingleIPC1/IPC1 + SingleIPC2/IPC2 + … + SingleIPCN/IPCN)
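HMWIPC can be computed directly from the definition above (a sketch; the function name is illustrative):

```python
def hmwipc(ipcs, single_ipcs):
    """Harmonic mean of per-thread weighted IPCs (IPC_i / SingleIPC_i).
    Rewards actual performance rather than evenly shared slowdown."""
    n = len(ipcs)
    return n / sum(s / i for s, i in zip(single_ipcs, ipcs))

# Two threads, each achieving half its standalone IPC -> metric is 0.5
hmwipc([1.0, 0.5], [2.0, 1.0])   # 2 / (2.0/1.0 + 1.0/0.5) = 0.5
```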

Lecture 17: Multi-This, Multi-That, ...

Page 35: Advanced  Microarchitecture

This is all combinable

• Can have a system that supports SMP, CMP and SMT at the same time

• Take a dual-socket SMP motherboard…
• Insert two chips, each a dual-core CMP…
• Where each core supports two-way SMT

• This example provides 8 threads’ worth of execution, shared on 4 actual “cores”, split across two physical packages

Page 36: Advanced  Microarchitecture

OS Confusion

• SMT/CMP is supposed to look like multiple CPUs to the software/OS

[Figure: two 2-way SMT cores (either SMP or CMP) appear as four virtual CPUs (CPU0-CPU3). Say the OS has two tasks to run and schedules A and B onto the two SMT contexts of the same core; the other core sits idle. Performance is worse than if SMT was turned off and only the 2-way SMP was used.]

Page 37: Advanced  Microarchitecture

OS Confusion (2)

• Asymmetries in the MP hierarchy can be very difficult for the OS to deal with
  – need to break the abstraction: the OS needs to know which CPUs are real physical processors (SMP), which are shared in the same package (CMP), and which are virtual (SMT)
  – Distinct applications should be scheduled to physically different CPUs
    • no cache contention, no power contention
  – Cooperative applications (different threads of the same program) should maybe be scheduled to the same physical chip (CMP)
    • reduces the latency of inter-thread communication; possibly reduces duplication if a shared L2 is used
  – Use SMT as the last choice

Page 38: Advanced  Microarchitecture

Multi-* is Happening

• Intel Pentium 4 already had “Hyperthreading” (SMT)
  – went away for a while, but is back in Core i7

• IBM Power 5 and later have SMT

• Dual- and quad-core already here; octo-core soon
  – Intel Core i7: 8 cores, each with 2-thread SMT

• So is single-thread performance dead?
• Is single-thread microarchitecture performance dead?

The following slides are adapted from Mark Hill’s HPCA’08 keynote talk.

Page 39: Advanced  Microarchitecture

Recall Amdahl’s Law

• Begins with a simple software assumption (limit argument)
  – Fraction F of execution time perfectly parallelizable
  – No overhead for
    • Scheduling
    • Synchronization
    • Communication, etc.
  – Fraction 1 – F completely serial

• Time on 1 core = (1 – F)/1 + F/1 = 1

• Time on N cores = (1 – F)/1 + F/N

Page 40: Advanced  Microarchitecture

Recall Amdahl’s Law [1967]

  Amdahl’s Speedup = 1 / ( (1 – F) + F/N )

• For mainframes, Amdahl expected 1 – F = 35%
  – For 4 processors, speedup ≈ 2
  – For infinite processors, speedup < 3
  – Therefore, stay with mainframes with one/few processors

• Do multicore chips repeal Amdahl’s Law?
• Answer: No, but…
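Amdahl’s mainframe numbers check out directly from the formula (a sketch):

```python
def amdahl_speedup(f, n):
    """Amdahl's Law: fraction f perfectly parallel, 1-f serial, n cores."""
    return 1 / ((1 - f) + f / n)

# Amdahl's own mainframe estimate: 1 - F = 35%, i.e. F = 0.65
amdahl_speedup(0.65, 4)        # ≈ 1.95: about 2x on 4 processors
amdahl_speedup(0.65, 10**9)    # ≈ 2.86: under 3x even with "infinite" processors
```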

Page 41: Advanced  Microarchitecture

Designing Multicore Chips Is Hard

• Designers must confront single-core design options
  – Instruction fetch, wakeup, select
  – Execution unit configuration & operand bypass
  – Load/store queue(s) & data cache
  – Checkpoint, log, runahead, commit

• As well as additional degrees of design freedom
  – How many cores? How big is each?
  – Shared caches: how many levels? How many banks?
  – Memory interface: how many banks?
  – On-chip interconnect: bus, switched, ordered?

Page 42: Advanced  Microarchitecture

Want a Simple Multicore Hardware Model to Complement Amdahl’s Simple Software Model

(1) Chip hardware roughly partitioned into
  – Multiple cores (with L1 caches)
  – The Rest (L2/L3 cache banks, interconnect, pads, etc.)
  – Changing core size/number does NOT change The Rest

(2) Resources for multiple cores bounded
  – Bound of N resources per chip for cores
  – Due to area, power, cost ($$$), or multiple factors
  – Bound = power? (but our pictures use area)

Page 43: Advanced  Microarchitecture

Want a Simple Multicore Hardware Model, cont.

(3) Micro-architects can improve single-core performance using more of the bounded resource

• A simple base core
  – Consumes 1 Base Core Equivalent (BCE) of resources
  – Provides performance normalized to 1

• An enhanced core (in the same process generation)
  – Consumes R BCEs
  – Performance as a function Perf(R)

• What does the function Perf(R) look like?

Page 44: Advanced  Microarchitecture

More on Enhanced Cores

• (Performance Perf(R) from consuming R BCEs of resources)

• If Perf(R) > R → always enhance the core
  – cost-effectively speeds up both sequential & parallel code

• Therefore, the equations assume Perf(R) < R

• Graphs assume Perf(R) = √R
  – 2x performance for 4 BCEs, 3x for 9 BCEs, etc.
  – Why? Models diminishing returns with “no coefficients”

• How to speed up the enhanced core?
  – <Insert favorite or TBD micro-architectural ideas here>

Page 45: Advanced  Microarchitecture

How Many (Symmetric) Cores per Chip?

• Each chip is bounded to N BCEs (for all cores)
• Each core consumes R BCEs
• Assume symmetric multicore = all cores identical
• Therefore, N/R cores per chip (so (N/R)*R = N)

• For an N = 16 BCE chip: sixteen 1-BCE cores, four 4-BCE cores, or one 16-BCE core

Page 46: Advanced  Microarchitecture

Performance of Symmetric Multicore Chips

• Serial fraction 1 – F uses 1 core at rate Perf(R)
  – Serial time = (1 – F) / Perf(R)

• Parallel fraction F uses N/R cores at rate Perf(R) each
  – Parallel time = F / (Perf(R) * (N/R)) = F*R / (Perf(R)*N)

• Therefore, w.r.t. one base core:

  Symmetric Speedup = 1 / ( (1 – F)/Perf(R) + F*R/(Perf(R)*N) )

• Enhanced cores speed up both the serial & parallel portions
• Implications?
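The symmetric speedup formula, with Hill’s Perf(R) = √R assumption from the earlier slide, reproduces the key points on the graphs (a sketch):

```python
import math

def perf(r):
    """Assumed single-core performance from r BCEs (the sqrt model)."""
    return math.sqrt(r)

def symmetric_speedup(f, n, r):
    """Speedup of a symmetric multicore: n BCEs total, cores of r BCEs each,
    so n/r identical cores, vs. one 1-BCE base core."""
    serial_time = (1 - f) / perf(r)
    parallel_time = f * r / (perf(r) * n)
    return 1 / (serial_time + parallel_time)

symmetric_speedup(0.5, 16, 16)   # 4.0: F=0.5 is best served by one big 16-BCE core
symmetric_speedup(0.9, 16, 2)    # ≈ 6.7: F=0.9 prefers eight 2-BCE cores
```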

Page 47: Advanced  Microarchitecture

Symmetric Multicore Chip, N = 16 BCEs

F=0.5: optimal speedup S = 4 = 1/(0.5/4 + 0.5*16/(4*16)), at R=16 (Cores=1). Need to increase parallelism to make multicore optimal!

[Figure: symmetric speedup vs. R BCEs per core (16, 8, 4, 2, and 1 cores) for F=0.5; the curve peaks at R=16, Cores=1, Speedup=4.]

Page 48: Advanced  Microarchitecture

Symmetric Multicore Chip, N = 16 BCEs

[Figure: symmetric speedup vs. R BCEs for F=0.9 and F=0.5. F=0.5 peaks at R=16, Cores=1, Speedup=4; F=0.9 peaks at R=2, Cores=8, Speedup=6.7.]

At F=0.9, multicore is optimal, but speedup is limited. Need to obtain even more parallelism!

Page 49: Advanced  Microarchitecture

Symmetric Multicore Chip, N = 16 BCEs

[Figure: symmetric speedup vs. R BCEs for F = 0.5, 0.9, 0.975, 0.99, 0.999. As F → 1 with R=1: Cores=16, Speedup → 16.]

F matters: Amdahl’s Law applies to multicore chips. Researchers should target parallelism F first.

Page 50: Advanced  Microarchitecture

Symmetric Multicore Chip, N = 16 BCEs

As Moore’s Law enables N to go from 16 to 256 BCEs: more core enhancements? More cores? Or both?

[Figure: the N=16 speedup-vs-R curves for F = 0.5, 0.9, 0.975, 0.99, 0.999. Recall F=0.9: R=2, Cores=8, Speedup=6.7.]

Page 51: Advanced  Microarchitecture

Symmetric Multicore Chip, N = 256 BCEs

As Moore’s Law increases N, we often need enhanced core designs. Some researchers should target single-core performance.

[Figure: symmetric speedup vs. R BCEs for N = 256, F = 0.5 … 0.999:
  – F=0.9: R=28 (vs. 2), Cores=9 (vs. 8), Speedup=26.7 (vs. 6.7) → CORE ENHANCEMENTS!
  – F → 1: R=1 (vs. 1), Cores=256 (vs. 16), Speedup=204 (vs. 16) → MORE CORES!
  – F=0.99: R=3 (vs. 1), Cores=85 (vs. 16), Speedup=80 (vs. 13.9) → CORE ENHANCEMENTS & MORE CORES!]