1
PhD Defense Presentation
Managing Shared Resources in Chip Multiprocessor Memory Systems
12 October 2010
Magnus Jahre
2
Outline
• Chip Multiprocessors (CMPs)
• CMP Resource Management
• Miss Bandwidth Management
  – Greedy Miss Bandwidth Management
  – Interference Measurement
  – Model-Based Miss Bandwidth Management
• Off-Chip Bandwidth Management
• Conclusion
3
CHIP MULTIPROCESSORS
4
Historical Processor Performance
[Chart: normalized value (log scale, 1 to 1,000,000) vs. year, 1978–2010, for processor performance and components per chip (Moore's Law)]
Moore's Law: 50% annual increase in the number of components per chip
Technology scaling was first used to increase clock frequency (52% annual performance increase) and is now used to add processor cores (20% annual performance increase)
Aggregate performance still follows Moore's Law
5
Power Dissipation Limits Practical Clock Frequency
[Charts: thermal design power (W) and clock frequency (GHz) for Intel processor families from Pentium through Core i7]
Source: Wikipedia, List of CPU Power Dissipation, Retrieved 02.06.10
Power Wall: technology scaling can no longer be used to increase clock frequency; instead it is used to add processor cores
6
Chip Multiprocessors (CMPs)
• CMPs utilize chip resources with a constant power budget
• How does technology scaling impact CMPs?
Intel Nehalem
7
Projected Number of Cores
[Chart: projected number of cores (0–90) vs. ITRS year of production, 2007–2016]
ITRS expects a 40% annual increase in the number of cores
Observation 1: Multiprogramming can provide near-term throughput improvement
Observation 2: Software parallelism is needed in the long term
8
Processor Memory Gap
[Chart: relative performance (log scale) vs. year, 1978–2010, for processor performance and main memory latency; memory latency improves by only 7% annually, creating the Memory Wall]
Observation 3: Latency hiding techniques are necessary
9
Performance vs. Bandwidth
[Chart: relative performance/bandwidth vs. ITRS year of production, 2007–2015, for processor performance and off-chip bandwidth]
Observation 4: Bandwidth must be used efficiently
10
Application Trends: software parallelism and multiprogramming mean that concurrent applications share hardware
Hardware Trends: latency hiding and bandwidth efficiency lead to complex memory systems
Together, these trends call for Shared Resource Management
11
CMP RESOURCE MANAGEMENT
12
Why Manage Shared Resources?
Provide predictable performance
Support OS scheduler assumptions
Cloud: Fulfill Service Level Agreement
13
Performance Variability Metrics
• Fairness
  – The performance reduction due to interference between processes is distributed across all processes in proportion to their priorities
  – Equal priorities: the performance reduction from sharing affects all processes equally
• Quality of Service
  – The performance of a process never drops below a certain limit, regardless of the behavior of co-scheduled processes
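A minimal sketch of how a fairness value in [0, 1] can be computed, assuming the common formulation where fairness is the ratio between the smallest and largest relative slowdown across processes (the exact metric used in the thesis may differ):

```python
def fairness(shared_ipc, private_ipc):
    """Fairness in [0, 1] for equal-priority processes.

    slowdown_i = private_ipc_i / shared_ipc_i; perfect fairness (1.0)
    means every process is slowed down by the same factor.
    """
    slowdowns = [p / s for p, s in zip(private_ipc, shared_ipc)]
    return min(slowdowns) / max(slowdowns)

# Example: two cores, one slowed 2x and one slowed 4x by sharing
print(fairness(shared_ipc=[0.5, 0.25], private_ipc=[1.0, 1.0]))  # 0.5
```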
14
Performance Variability (Fairness)
[Chart: lowest fairness value (0–1.2) vs. number of workloads (1–40) for crossbar- and ring-based interconnects with 1, 2, and 4 memory channels]
Paper B.I
15
Resource Management Tasks
Measurement
Allocation (Policy)
Enforcement (Mechanism)
16
Contributions
Miss Bandwidth Management:
• Off-line Interference Measurement
• On-line Interference Measurement
• Dynamic Miss Handling Architecture
• Greedy Miss Bandwidth Allocation
• Performance Model Based Miss Bandwidth Allocation
Prefetch Scheduling:
• Low-Cost Open Page Prefetching
• Opportunistic Prefetch Scheduling
17
GREEDY MISS BANDWIDTH MANAGEMENT
Miss Bandwidth Management
18
Conventional Resource Allocation Implementation
[Diagram: four CPUs with private I- and D-caches (private memory system) connect through a crossbar to a shared cache, memory controller, memory bus, and main memory; measurement, allocation, and enforcement are placed in the shared memory system]
19
Alternative Resource Allocation Implementation
[Diagram: the same 4-core CMP, but measurement, allocation, and enforcement are moved to the private memory system, where a Dynamic Miss Handling Architecture limits each core's number of outstanding misses (4 MSHRs in the example)]
20
Dynamic Miss Handling Architecture
[Diagram: accesses A–E reach the cache; each miss allocates an entry (address, target info., used bit) in the Miss Handling Architecture (MHA); when all MSHR entries are in use, the cache blocks until an entry is freed]
A DMHA controls the number of concurrent shared memory system requests that are allowed for each processor
21
Greedy Miss Bandwidth Management
• Idea: Reduce the number of MSHRs if a metric exceeds a certain threshold
• Metrics:
  – Paper A.II: Memory bus utilization
  – Paper A.III: Simple interference counters (Interference Points)
• Performance feedback avoids excessive performance degradation (a sketch of the control loop follows below)
Paper A.II and A.III
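A minimal sketch of how such a greedy controller could operate, assuming a periodic control loop, a bus-utilization metric, and hypothetical thresholds (the names and values are illustrative, not the parameters used in Papers A.II/A.III):

```python
MAX_MSHRS = 16          # hardware MSHRs per private last-level cache
UTIL_THRESHOLD = 0.9    # throttle miss bandwidth above this bus utilization
MAX_SLOWDOWN = 0.05     # roll back if a core loses more than 5% performance

class GreedyMSHRController:
    def __init__(self, num_cores):
        self.mshrs = [MAX_MSHRS] * num_cores

    def adjust(self, bus_utilization, slowdown):
        """Called once per measurement interval.

        bus_utilization: fraction of cycles the memory bus was busy
        slowdown: per-core performance loss relative to the previous interval
        """
        for core in range(len(self.mshrs)):
            if bus_utilization > UTIL_THRESHOLD and self.mshrs[core] > 1:
                # Greedy step: throttle miss bandwidth by halving the MSHRs
                self.mshrs[core] = max(1, self.mshrs[core] // 2)
            elif slowdown[core] > MAX_SLOWDOWN:
                # Performance feedback: give bandwidth back to avoid
                # excessive performance degradation
                self.mshrs[core] = min(MAX_MSHRS, self.mshrs[core] * 2)
        return self.mshrs
```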
22
INTERFERENCE MEASUREMENT
Miss Bandwidth Management
23
Resource Allocation Baselines
Baseline = Interference-free configuration
Quantify performance impact from interference
Private Mode and Shared Mode
24
Interference Definition
Interference is the difference between the shared mode latency and the private mode latency
Since the private mode latency must be estimated online, the private mode latency estimate differs from the private mode latency measurement by an estimate error
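In equation form (the symbols below are introduced here for illustration and are not necessarily the thesis' notation):

```latex
\text{Interference} = L^{\text{shared}} - L^{\text{private}}, \qquad
\widehat{\text{Interference}} = L^{\text{shared}} - \hat{L}^{\text{private}}, \qquad
\text{Estimate error} = \hat{L}^{\text{private}} - L^{\text{private}}
```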
25
Offline Interference Measurement
Interference Penalty Frequency (IPF) counts the number of requests that experienced an interference latency of i cycles
Interference Impact Factor (IIF) is the interference latency times the probability of it arising, i.e. IIF(i) = i ∙ P(i)
Paper B.I
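A minimal sketch of the relationship between the two quantities, assuming P(i) is taken as the fraction of requests that saw interference latency i (illustrative only):

```python
def iif_from_ipf(ipf):
    """ipf: dict mapping interference latency i (cycles) -> request count.
    Returns IIF(i) = i * P(i) for each observed latency."""
    total = sum(ipf.values())
    return {i: i * (count / total) for i, count in ipf.items()}

# Example: 80 requests saw 0 cycles of interference, 20 saw 100 cycles
print(iif_from_ipf({0: 80, 100: 20}))  # {0: 0.0, 100: 20.0}
```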
26
Aggregate Interference Impact
Paper B.I
[Chart: aggregate interference impact (0–300) for crossbar (CB) and ring interconnects with 1, 2, and 4 memory channels on 4-, 8-, and 16-core CMPs, broken down into memory bus, cache, and interconnect interference]
27
Resource Management Baselines
Multiprogrammed Baseline (MPB): each processor runs alone with a static, equal share of the shared cache, interconnect, and memory bus
Single Program Baseline (SPB): each processor runs alone with the full shared cache, interconnect, and memory bus
28
Baseline Weaknesses
• Multiprogrammed Baseline
  – Only accounts for interference in partitioned resources
  – Static and equal division of DRAM bandwidth does not give equal latency
  – Complex relationship between resource allocation and performance
• Single Program Baseline
  – Does not exist in shared mode
Online Interference Measurement: Dynamic Interference Estimation Framework (DIEF)
Paper B.II
29
Online Interference Measurement
• Dynamic Interference Estimation Framework (DIEF)
• Estimates private mode average memory latency
• General, component-based framework
Paper B.II
30
Shared Cache Interference: Auxiliary Tag Directories
[Diagram: the cache accesses of CPU 0 and CPU 1 go to the shared cache and, per CPU, to an auxiliary tag directory that tracks the contents that CPU's cache would have had in private mode]
An eviction may or may not be interference: it is interference only when the evicted block would still have been present in the private mode cache
Interference latency cost = miss penalty
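A minimal sketch of the idea, assuming simple fully associative LRU tag structures (sizes and names are illustrative, not the thesis' implementation):

```python
from collections import OrderedDict

class LRUTags:
    """Tag store with LRU replacement (models either the shared cache as
    seen by one CPU or its private-mode auxiliary tag directory)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tags = OrderedDict()

    def access(self, addr):
        hit = addr in self.tags
        if hit:
            self.tags.move_to_end(addr)
        else:
            if len(self.tags) >= self.capacity:
                self.tags.popitem(last=False)   # evict LRU block
            self.tags[addr] = True
        return hit

def is_interference_miss(shared_hit, atd_hit):
    """A shared-cache miss is interference if the auxiliary tag directory
    (private mode) would have hit; its latency cost is the miss penalty."""
    return (not shared_hit) and atd_hit
```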
31
Bus Interference Requirements
• Out-of-order memory bus scheduling
• Shared mode only cache misses and cache hits
• Shared cache writebacks
Computing private latency based on shared mode queue contents is difficult
Emulate private scheduling in the shared mode
32
Memory Latency Estimation Buffer
[Diagram: requests from the shared bus queue (arrival order E, D, C, B, A with a head pointer) are re-ordered into a private mode execution order; open page emulation registers hold the open row for each bank, a latency lookup table supplies per-request latencies, and a bank/page mapping (A → (0,15), B → (0,19), C → (0,15), D → (1,32)) is used to emulate private mode scheduling; the estimated queue latency is the sum of the emulated per-request latencies]
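A minimal sketch of the emulation idea, assuming a simple open-page timing model with hypothetical latencies (not the thesis' actual DRAM parameters):

```python
# Hypothetical open-page DRAM timings (cycles)
ROW_HIT_LATENCY = 40      # column access to an already-open row
ROW_MISS_LATENCY = 120    # precharge + activate + column access

def estimate_private_queue_latency(requests):
    """requests: list of (bank, row) pairs from one CPU, in private mode
    execution order. Emulates open-page scheduling: a request to the row
    already open in its bank is a row hit, otherwise a row miss."""
    open_rows = {}            # emulation registers: bank -> open row
    total = 0
    for bank, row in requests:
        if open_rows.get(bank) == row:
            total += ROW_HIT_LATENCY
        else:
            total += ROW_MISS_LATENCY
            open_rows[bank] = row
    return total

# Example using the slide's bank/page mapping for A, C, and D
print(estimate_private_queue_latency([(0, 15), (0, 15), (1, 32)]))  # 280
```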
33
MODEL-BASED MISS BANDWIDTH MANAGEMENT
Miss Bandwidth Management
34
Model-Based Miss Bandwidth Allocation
DIEF provides accurate estimates of the average private mode memory latency
Can we use the estimates provided by DIEF to choose miss bandwidth allocations?
We need a model that relates average memory latency to performance
Paper A.IV
35
Performance Model
Paper A.IV
Observation: The memory latency performance impact depends on the parallelism of memory requests
The parallelism of memory requests is very similar in private and shared mode
Shared mode measurements can therefore provide private mode performance estimates
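One way such a latency/parallelism model can be written (an illustrative form, not necessarily the exact model of Paper A.IV): stall time scales with the number of memory requests times the average latency, divided by the average request parallelism.

```python
def estimate_private_ipc(committed_insts, compute_cycles, num_requests,
                         avg_private_latency, avg_parallelism):
    """Illustrative latency/parallelism performance model.

    Shared-mode measurements give committed instructions, compute
    (non-stall) cycles, the number of memory requests, and their average
    parallelism; DIEF supplies the estimated private mode average latency.
    """
    est_stall_cycles = num_requests * avg_private_latency / avg_parallelism
    est_total_cycles = compute_cycles + est_stall_cycles
    return committed_insts / est_total_cycles
```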
36
Bandwidth Management Flow
Paper A.IV
Measurement → Modeling → Allocation
Measurement: shared mode memory latency, private mode memory latency, committed instructions, number of memory requests, CPU stall time
Modeling: per-CPU models feed a performance metric model
Allocation: find the MSHR allocation that maximizes the chosen performance metric, then set the number of MSHRs for all last-level private caches
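A minimal sketch of the allocation step, assuming an exhaustive search over a small set of per-core MSHR counts and a pluggable metric (the names and the search strategy are illustrative, not the thesis' allocation algorithm):

```python
from itertools import product

MSHR_OPTIONS = (1, 2, 4, 8, 16)   # candidate per-core MSHR counts

def choose_allocation(num_cores, predict_ipcs, metric):
    """predict_ipcs(allocation) -> per-core IPC estimates from the
    performance model; metric(ipcs) -> scalar to maximize (e.g. sum for
    system throughput, or a fairness-oriented metric)."""
    best_alloc, best_value = None, float("-inf")
    for alloc in product(MSHR_OPTIONS, repeat=num_cores):
        value = metric(predict_ipcs(alloc))
        if value > best_value:
            best_alloc, best_value = alloc, value
    return best_alloc
```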
37
OFF-CHIP BANDWIDTH MANAGEMENT
38
Modern DRAM Interfaces
• Maximize bandwidth with 3D organization
• Repeated requests to the row buffer are very efficient
[Diagram: DRAM organized as banks of rows and columns; a row address activates one row into the row buffer and a column address selects data within it]
39
Low-Cost Open Page Prefetching
• Idea: Piggyback prefetches to open DRAM pages on demand reads
• Performance win if prefetcher accuracy is above ~40%
Paper C.I
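A minimal sketch of the piggybacking idea, assuming a hook that is called on every demand read and a hypothetical issue_prefetch interface (names are illustrative, not the Paper C.I implementation):

```python
PREFETCH_DEGREE = 1   # extra cache lines fetched from the open row

def on_demand_read(bank, row, column, open_rows, issue_prefetch):
    """Piggyback prefetches on a demand read: if the read targets the row
    that is (or becomes) open in its bank, adjacent columns can be fetched
    without extra activate/precharge commands."""
    open_rows[bank] = row                      # demand read opens/keeps the row
    for i in range(1, PREFETCH_DEGREE + 1):
        issue_prefetch(bank, row, column + i)  # row hit: cheap extra transfer
```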
40
Opportunistic Prefetch Scheduling
Page Vector Table (PVT)
[Diagram: the PVT tracks demand accesses and prefetch requests to pages 99–102; queued prefetches are issued when their page is closed]
Idea: Issue prefetches when a page is closed
Increased efficiency: 8 transfers for 3 activations
Paper C.II
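A minimal sketch of the scheduling idea, assuming a per-page set of pending prefetches and a hook called when the memory controller is about to close a page (all names are illustrative, not the Paper C.II implementation):

```python
from collections import defaultdict

class PageVectorTable:
    """Tracks pending prefetch requests per DRAM page so they can be
    issued opportunistically just before the page is closed."""
    def __init__(self):
        self.pending = defaultdict(set)   # page -> set of pending columns

    def record_prefetch(self, page, column):
        self.pending[page].add(column)

    def on_page_close(self, page, issue_prefetch):
        # Issue all queued prefetches while the page is still open,
        # so they cost only extra transfers, not extra activations.
        for column in sorted(self.pending.pop(page, ())):
            issue_prefetch(page, column)
```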
41
CONCLUSION
42
Conclusion
• Managing bandwidth allocations can improve CMP system performance
• Miss bandwidth management
  – Greedy allocations
  – Management guided by accurate measurements and performance models
• Off-chip bandwidth management with prefetching
43
Thank You
Visit our website: http://research.idi.ntnu.no/multicore/
44
EXTRA SLIDES
45
Future Work
• Performance-directed management of shared caches and the memory bus
• Improving OS and system software with dynamic measurements
• Combining dynamic MHAs with prefetching to improve system performance
• Managing workloads of single-threaded and multi-threaded benchmarks
46
Example Chip Multiprocessor
[Diagram: four CPUs with private I- and D-caches (private memory system) connect through an interconnect to a shared cache, memory controller, memory bus, and main memory (shared memory system)]