PhD Defense Presentation: Managing Shared Resources in Chip Multiprocessor Memory Systems. 12 October 2010. Magnus Jahre.


Page 1:

1

PhD Defense Presentation

Managing Shared Resources in Chip Multiprocessor Memory Systems

12. October 2010

Magnus Jahre

Page 2:

2

Outline

• Chip Multiprocessors (CMPs)

• CMP Resource Management

• Miss Bandwidth Management
  – Greedy Miss Bandwidth Management
  – Interference Measurement
  – Model-Based Miss Bandwidth Management

• Off-Chip Bandwidth Management

• Conclusion

Page 3:

3

CHIP MULTIPROCESSORS

Page 4:

4

Historical Processor Performance

[Figure: Normalized processor performance and components per chip (Moore's Law), 1978 to 2010, log scale.]

Moore's Law: 50% annual increase in the number of components per chip.

Technology scaling was first used to increase clock frequency (52% annual performance increase) and is now used to add processor cores (20% annual performance increase).

Aggregate performance still follows Moore's Law.

Page 5:

5

Power Dissipation Limits Practical Clock Frequency

[Figure: Thermal design power (W) and clock frequency (GHz) for Intel processor families from the Pentium to the Core i7. Source: Wikipedia, List of CPU Power Dissipation, retrieved 02.06.10.]

Technology scaling increases clock frequency, but power dissipation imposes a Power Wall; technology scaling now adds processor cores instead.

Page 6:

6

Chip Multiprocessors (CMPs)

• CMPs utilize chip resources with a constant power budget

• How does technology scaling impact CMPs?

[Die photo: Intel Nehalem]

Page 7:

7

Projected Number of Cores

[Figure: Projected number of cores per chip for ITRS years of production 2007 to 2016.]

The ITRS expects a 40% annual increase in the number of cores.

Observation 1: Multiprogramming can provide near-term throughput improvement.

Observation 2: Software parallelism is needed in the long term.

Page 8:

8

Processor Memory Gap

[Figure: Relative performance of processors vs. main memory latency, 1978 to 2010, log scale.]

Memory latency improves by only 7% annually, while processor performance improves much faster: the Memory Wall.

Observation 3: Latency hiding techniques are necessary

Page 9:

9

Performance vs. Bandwidth

[Figure: Projected processor performance and off-chip bandwidth, relative to 2007, for ITRS years of production 2007 to 2015.]

Processor performance is projected to grow faster than off-chip bandwidth.

Observation 4: Bandwidth must be used efficiently

Page 10:

10

[Diagram: Application trends (software parallelism, multiprogramming) mean that concurrent applications share hardware; hardware trends (latency hiding, bandwidth efficiency) lead to complex memory systems. Together, these trends motivate shared resource management.]

Page 11:

11

CMP RESOURCE MANAGEMENT

Page 12:

12

Why Manage Shared Resources?

Provide predictable performance

Support OS scheduler assumptions

Cloud: Fulfill Service Level Agreement

Page 13:

13

Performance Variability Metrics

• Fairness
  – The performance reduction due to interference between processes is distributed across all processes in proportion to their priorities
  – Equal priorities: the performance reduction from sharing affects all processes equally (one common formulation is given below)

• Quality of Service
  – The performance of a process never drops below a certain limit, regardless of the behavior of co-scheduled processes
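For concreteness, one widely used way to quantify fairness under equal priorities (my notation; not necessarily the exact metric of Paper B.I) compares per-process slowdowns, with a value of 1 meaning all processes are slowed equally:

```latex
SD_i = \frac{\mathit{IPC}_i^{\mathrm{private}}}{\mathit{IPC}_i^{\mathrm{shared}}},
\qquad
\mathrm{Fairness} = \frac{\min_i SD_i}{\max_i SD_i} \in (0, 1]
```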

Page 14:

14

Performance Variability (Fairness)

[Figure: Lowest fairness value over 1 to 40 workloads for crossbar-based and ring-based designs with 1, 2 and 4 memory channels.]

Paper B.I

Page 15:

15

Resource Management Tasks

Measurement

Allocation (Policy)

Enforcement (Mechanism)

Page 16:

16

Contributions

Miss Bandwidth Management:
– Off-line Interference Measurement
– On-line Interference Measurement
– Dynamic Miss Handling Architecture
– Greedy Miss Bandwidth Allocation
– Performance Model Based Miss Bandwidth Allocation

Prefetch Scheduling:
– Low-Cost Open Page Prefetching
– Opportunistic Prefetch Scheduling

Page 17:

17

GREEDY MISS BANDWIDTH MANAGEMENT

Miss Bandwidth Management

Page 18:

18

Conventional Resource Allocation Implementation

[Diagram: A 4-core CMP. Each CPU has private instruction and data caches (the private memory system) and is connected through a crossbar to a shared cache, a memory controller and a memory bus leading to main memory. In the conventional implementation, measurement, allocation and enforcement are all placed in the shared memory system.]

Page 19:

19

Alternative Resource Allocation Implementation

[Diagram: The same 4-core CMP, but enforcement is moved into the private memory system: a Dynamic Miss Handling Architecture in each private cache limits the number of outstanding misses per CPU (4 in the example).]

Page 20:

20

Dynamic Miss Handling Architecture

[Diagram: A Miss Handling Architecture (MHA) is a table with one entry per outstanding miss, holding the miss address and target information. In the example, accesses A, D, B, E and C miss in the cache; once all MSHR entries are occupied, the cache blocks until an entry becomes free.]

A DMHA controls the number of concurrent shared memory system requests that are allowed for each processor

Page 21:

21

Greedy Miss Bandwidth Management

• Idea: Reduce the number of MSHRs if a metric exceeds a certain threshold (a sketch follows below)

• Metrics:
  – Paper A.II: Memory bus utilization
  – Paper A.III: Simple interference counters (Interference Points)

• Performance feedback avoids excessive performance degradations

Paper A.II and A.III
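Below is a minimal sketch of such a greedy policy, assuming per-epoch samples of bus utilization, per-CPU IPC and interference points, and a hypothetical set_mshrs(cpu, n) hook into the private caches. It only illustrates the idea and mixes the metrics of Papers A.II and A.III rather than reproducing either algorithm.

```python
# Illustrative sketch of greedy miss bandwidth management. Assumptions:
# per-epoch bus utilization, per-CPU IPC and interference-point counters are
# available, and set_mshrs(cpu, n) is a hypothetical hook into the private caches.

MSHR_STEPS = [16, 8, 4, 2, 1]   # allowed MSHR allocations, most to least
UTIL_THRESHOLD = 0.95           # reduce MSHRs above this bus utilization
DEGRADATION_LIMIT = 0.10        # performance-feedback guard (10% slowdown)

class GreedyMissBandwidthManager:
    def __init__(self, num_cpus, set_mshrs):
        self.set_mshrs = set_mshrs
        self.level = {cpu: 0 for cpu in range(num_cpus)}        # index into MSHR_STEPS
        self.baseline_ipc = {cpu: None for cpu in range(num_cpus)}

    def end_of_epoch(self, bus_utilization, ipc, interference_points):
        for cpu in ipc:
            if self.baseline_ipc[cpu] is None:
                self.baseline_ipc[cpu] = ipc[cpu]

        if bus_utilization > UTIL_THRESHOLD:
            # Greedily throttle the CPU that currently interferes the most.
            victim = max(interference_points, key=interference_points.get)
            if self.level[victim] < len(MSHR_STEPS) - 1:
                self.level[victim] += 1
        else:
            # Performance feedback: restore MSHRs to CPUs that earlier
            # reductions have slowed down too much.
            for cpu in ipc:
                degraded = 1.0 - ipc[cpu] / self.baseline_ipc[cpu]
                if degraded > DEGRADATION_LIMIT and self.level[cpu] > 0:
                    self.level[cpu] -= 1

        for cpu, lvl in self.level.items():
            self.set_mshrs(cpu, MSHR_STEPS[lvl])
```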

Page 22:

22

INTERFERENCE MEASUREMENT

Miss Bandwidth Management

Page 23:

23

Resource Allocation Baselines

Baseline = Interference-free configuration

Quantify performance impact from interference

Private Mode and Shared Mode

Page 24:

24

Interference Definition

[Diagram: The shared mode latency of a request is split into the private mode latency plus interference. Because the private mode latency cannot be measured while running in shared mode, it must be estimated, and the estimate differs from a private mode latency measurement by an estimation error.]
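In equation form (notation mine, following the figure labels):

```latex
\text{interference} = L^{\mathrm{shared}} - L^{\mathrm{private}},
\qquad
\text{estimate error} = \hat{L}^{\mathrm{private}} - L^{\mathrm{private}}
```

where L^shared is the measured shared mode latency, L^private the latency the request would have had in private mode, and \hat{L}^private its estimate.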

Page 25:

25

Offline Interference Measurement

Interference Penalty Frequency (IPF) counts the number of requests that experienced an interference latency of i cycles

Interference Impact Factor (IIF) is the interference latency times the probability of it arising, i.e. IIF(i) = i ∙ P(i)

Paper B.I
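A small sketch of how IIF follows from an IPF histogram; summing IIF over all latencies to get an average interference latency per request is my own aggregation for illustration, not necessarily how Paper B.I uses it.

```python
# Sketch: derive the Interference Impact Factor from an Interference Penalty
# Frequency histogram. ipf[i] = number of requests that experienced an
# interference latency of i cycles (as defined on the slide).

def interference_impact(ipf):
    total = sum(ipf.values())
    # P(i): probability that a request experiences i cycles of interference
    p = {i: count / total for i, count in ipf.items()}
    # IIF(i) = i * P(i)
    iif = {i: i * p[i] for i in ipf}
    # Summing IIF over all i gives the average interference latency per
    # request (one possible aggregation; an assumption here).
    return iif, sum(iif.values())

# Hypothetical histogram: 700 requests with no interference, 200 with 40
# cycles and 100 with 120 cycles of interference.
iif, avg_interference = interference_impact({0: 700, 40: 200, 120: 100})
```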

Page 26:

26

Aggregate Interference Impact

Paper B.I

[Figure: Aggregate interference impact for crossbar-based (CB) and ring-based interconnects with 1, 2 and 4 memory channels on 4-core, 8-core and 16-core CMPs, broken down into memory bus, cache and interconnect interference.]

Page 27:

27

Resource Management Baselines

[Diagram: Two baselines. In the Multiprogrammed Baseline (MPB), processors A and B run together and each receives a statically partitioned share of the shared cache, interconnect and memory bus. In the Single Program Baseline (SPB), each program runs alone with the shared cache, interconnect and memory bus to itself.]

Page 28:

28

Baseline Weaknesses

• Multiprogrammed Baseline
  – Only accounts for interference in partitioned resources
  – Static and equal division of DRAM bandwidth does not give equal latency
  – Complex relationship between resource allocation and performance

• Single Program Baseline
  – Does not exist in shared mode

Online Interference Measurement: Dynamic Interference Estimation Framework (DIEF), Paper B.II

Page 29:

29

Online Interference Measurement

• Dynamic Interference Estimation Framework (DIEF)

• Estimates private mode average memory latency

• General, component-based framework

Paper B.II

Page 30:

30

Shared Cache Interference: Auxiliary Tag Directories

[Diagram: Each CPU has an auxiliary tag directory that tracks which blocks the CPU would have had in the cache if it ran alone. Cache accesses from the CPUs are interleaved in the shared cache, so a block can be evicted by another CPU's data. Such an eviction is interference only if the block would still have been resident with the CPU running alone; the interference latency cost is then the miss penalty.]
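A sketch of the auxiliary tag directory idea, assuming LRU sets and a fixed miss penalty; the details are illustrative assumptions rather than the exact Paper B.II mechanism.

```python
# Sketch: auxiliary-tag-directory (ATD) based detection of shared cache
# interference. Assumptions: LRU replacement and a fixed miss penalty.

MISS_PENALTY = 120  # cycles, assumed value

class ATDSet:
    """Tags of one cache set as it would look if this CPU ran alone (LRU)."""
    def __init__(self, associativity):
        self.assoc = associativity
        self.tags = []                      # most recently used first

    def access(self, tag):
        hit = tag in self.tags
        if hit:
            self.tags.remove(tag)
        self.tags.insert(0, tag)
        if len(self.tags) > self.assoc:
            self.tags.pop()                 # block the private cache would evict
        return hit

def on_cpu_access(atd_set, tag, shared_cache_hit):
    """Update the private-mode model and classify the shared-mode outcome."""
    would_hit_alone = atd_set.access(tag)
    if not shared_cache_hit and would_hit_alone:
        # The block was evicted by another CPU's data: the eviction is
        # interference, charged at the miss penalty.
        return MISS_PENALTY
    return 0
```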

Page 31:

31

Bus Interference Requirements

• Out-of-order memory bus scheduling
• Shared mode only cache misses and cache hits
• Shared cache writebacks

Computing private latency based on shared mode queue contents is difficult

Emulate private scheduling in the shared mode

Page 32:

32

Memory Latency Estimation Buffer

[Diagram: The shared bus queue holds requests from all CPUs in arrival order. The Memory Latency Estimation Buffer replays only this CPU's requests in the order a private memory bus scheduler would have issued them, using open page emulation registers per DRAM bank and a latency lookup table (in the example, A maps to bank 0, page 15; B to (0, 19); C to (0, 15); D to (1, 32)). Summing the per-request latencies gives the estimated private mode queue latency, 120 + 40 + 40 in the example.]
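A rough sketch of emulating private scheduling in shared mode; the constant hit/miss latencies stand in for the latency lookup table, so the numbers will not match the slide's example.

```python
# Sketch: estimate the private-mode memory bus queue latency of one CPU by
# replaying only its own requests against an emulated open-page DRAM.
# The timing values are illustrative assumptions, not the thesis parameters.

ROW_HIT_LATENCY = 40     # assumed: requested page already open in the bank
ROW_MISS_LATENCY = 120   # assumed: page must first be activated

def estimate_private_queue_latency(requests):
    """requests: (bank, page) tuples for this CPU only, in arrival order."""
    open_page = {}           # open page emulation registers: bank -> open page
    total = 0
    for bank, page in requests:
        if open_page.get(bank) == page:
            total += ROW_HIT_LATENCY      # row buffer hit in the emulated bank
        else:
            total += ROW_MISS_LATENCY     # emulated activation of a new page
            open_page[bank] = page
    return total

# Example using the slide's bank/page mapping A->(0,15), B->(0,19),
# C->(0,15), D->(1,32); the result depends on the assumed latencies above.
latency = estimate_private_queue_latency([(0, 15), (0, 19), (0, 15), (1, 32)])
```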

Page 33:

33

MODEL-BASED MISS BANDWIDTH MANAGEMENT

Miss Bandwidth Management

Page 34:

34

Model-Based Miss Bandwidth Allocation

DIEF provides accurate estimates of the average private mode memory latency

Can we use the estimates provided by DIEF to choose miss bandwidth allocations?

We need a model that relates average memory latency to performance

Paper A.IV

Page 35:

35

Performance Model

Paper A.IV

Observation: The memory latency performance impact depends on the parallelism of memory requests

The parallelism of memory requests is very similar in private and shared mode, so shared mode measurements can provide private mode performance estimates.
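One simple way to see why such a model can work (my notation; Paper A.IV's model may differ): if misses overlap with an average memory-level parallelism MLP, memory stall time scales roughly linearly with latency, so a private mode stall estimate follows from shared mode measurements:

```latex
T_{\mathrm{stall}} \approx \frac{\text{memory requests}}{\mathrm{MLP}} \cdot L,
\qquad
\hat{T}_{\mathrm{stall}}^{\mathrm{private}} \approx
T_{\mathrm{stall}}^{\mathrm{shared}} \cdot
\frac{\hat{L}^{\mathrm{private}}}{L^{\mathrm{shared}}}
```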

Page 36:

36

Bandwidth Management Flow

Paper A.IV

Measurement: shared mode memory latency, private mode memory latency estimate, committed instructions, number of memory requests, CPU stall time.

Modeling: per-CPU performance models feed a model of the chosen performance metric.

Allocation: find the MSHR allocation that maximizes the chosen performance metric, then set the number of MSHRs for all last-level private caches (a sketch of this search follows below).
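A minimal sketch of the allocation step, assuming a per-CPU model perf_model(cpu, mshrs) built from the measurements above, a small set of allowed MSHR counts and an exhaustive search; all of these are illustrative simplifications, not the Paper A.IV algorithm.

```python
# Sketch: choose per-CPU MSHR allocations that maximize a system-level metric.
# perf_model(cpu, mshrs) -> estimated IPC for that CPU with that many MSHRs;
# both the model interface and the exhaustive search are assumptions.

from itertools import product

MSHR_OPTIONS = [1, 2, 4, 8, 16]

def harmonic_mean_speedup(ipcs, baseline_ipcs):
    # Speedup of CPU i is ipc_i / baseline_i; return the harmonic mean.
    n = len(ipcs)
    return n / sum(b / i for i, b in zip(ipcs, baseline_ipcs))

def choose_allocation(num_cpus, perf_model, baseline_ipcs):
    best_alloc, best_metric = None, float("-inf")
    for alloc in product(MSHR_OPTIONS, repeat=num_cpus):
        ipcs = [perf_model(cpu, mshrs) for cpu, mshrs in enumerate(alloc)]
        metric = harmonic_mean_speedup(ipcs, baseline_ipcs)
        if metric > best_metric:
            best_alloc, best_metric = alloc, metric
    return best_alloc   # one MSHR count per last-level private cache
```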

Page 37:

37

OFF-CHIP BANDWIDTH MANAGEMENT

Page 38:

38

Modern DRAM Interfaces

• Maximize bandwidth with 3D organization

• Repeated requests to the row buffer are very efficient

[Diagram: DRAM organization. A chip is organized as banks of rows and columns; a row address loads one row into the bank's row buffer, and a column address then selects data within the row buffer.]
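Why row buffer hits matter (generic DRAM timing, not parameters from the thesis): a request to the already open row only pays the column access latency, while a request that needs another row also pays precharge and activation:

```latex
L_{\text{row hit}} \approx t_{CL},
\qquad
L_{\text{row conflict}} \approx t_{RP} + t_{RCD} + t_{CL}
```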

Page 39:

39

Low-Cost Open Page Prefetching

• Idea: Piggyback prefetches to open DRAM pages on demand reads (sketched below)

• Performance win if prefetcher accuracy is above ~40%

Paper C.I
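A minimal sketch of the piggybacking idea, assuming a simple sequential candidate generator and visibility of which DRAM pages are currently open; the interfaces and parameters are assumptions, not the Paper C.I design.

```python
# Sketch: piggyback prefetches onto open DRAM pages. When a demand read is
# serviced, also issue prefetch candidates that fall in a currently open page,
# since those row-buffer hits are nearly free. The sequential candidate
# generator and the interfaces are illustrative assumptions.

BLOCK_SIZE = 64
DEGREE = 2                       # prefetch candidates per demand read

def page_of(addr, page_size=4096):
    return addr // page_size

def piggyback_prefetches(demand_addr, open_pages, issue):
    """open_pages: set of currently open DRAM pages; issue(addr) sends a request."""
    for i in range(1, DEGREE + 1):
        candidate = demand_addr + i * BLOCK_SIZE
        if page_of(candidate) in open_pages:
            issue(candidate)     # row-buffer hit: low-cost prefetch
```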

Page 40:

40

Opportunistic Prefetch Scheduling

[Diagram: A Page Vector Table (PVT) tracks prefetch requests to DRAM pages (pages 99 to 102 in the example). Demand accesses are serviced as usual; pending prefetches to a page are issued when that page is about to be closed, giving 8 data transfers for 3 page activations in the example.]

Idea: Issue prefetches when a page is closed.

Increased efficiency: 8 transfers for 3 activations.

Paper C.II
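A sketch of the scheduling idea, assuming the controller parks prefetches per page and gets a callback just before a page is precharged; the interfaces are assumptions, not the Paper C.II implementation.

```python
# Sketch: opportunistic prefetch scheduling. Prefetch requests are parked in a
# per-page vector (a simplified Page Vector Table) and only sent to DRAM right
# before the open page is closed, so they ride on an activation that was
# needed anyway. All interfaces here are illustrative assumptions.

from collections import defaultdict

class PageVectorTable:
    def __init__(self):
        self.pending = defaultdict(list)    # page -> parked prefetch addresses

    def add_prefetch(self, page, addr):
        self.pending[page].append(addr)     # park it; do not issue immediately

    def on_page_close(self, page, issue):
        # The page is about to be precharged: drain its parked prefetches now,
        # while they are still row-buffer hits.
        for addr in self.pending.pop(page, []):
            issue(addr)
```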

Page 41:

41

CONCLUSION

Page 42:

42

Conclusion

• Managing bandwidth allocations can improve CMP system performance

• Miss bandwidth management
  – Greedy allocations
  – Management guided by accurate measurements and performance models

• Off-chip bandwidth management with prefetching

Page 43:

43

Thank You

Visit our website: http://research.idi.ntnu.no/multicore/

Page 44:

44

EXTRA SLIDES

Page 45:

45

Future Work

• Performance-directed management of shared caches and the memory bus

• Improving OS and system software with dynamic measurements

• Combining dynamic MHAs with prefetching to improve system performance

• Managing workloads of single-threaded and multi-threaded benchmarks

Page 46:

46

Example Chip Multiprocessor

[Diagram: Example chip multiprocessor. Four CPUs, each with private instruction and data caches (the private memory system), are connected through an interconnect to a shared cache, memory controller and memory bus leading to main memory (the shared memory system).]