
1

PhD Defense Presentation

Managing Shared Resources in Chip Multiprocessor Memory Systems

12 October 2010

Magnus Jahre

2

Outline

• Chip Multiprocessors (CMPs)

• CMP Resource Management

• Miss Bandwidth Management
  – Greedy Miss Bandwidth Management
  – Interference Measurement
  – Model-Based Miss Bandwidth Management

• Off-Chip Bandwidth Management

• Conclusion

3

CHIP MULTIPROCESSORS

4

Historical Processor Performance

[Figure: processor performance and components per chip (Moore's Law), normalized value on a log scale (1 to 1,000,000) vs. year, 1978–2010]

Technology scaling is used to increase clock frequency: 52% annual performance increase

Technology scaling is used to add processor cores: 20% annual performance increase

Moore's Law: 50% annual increase in the number of components per chip

Aggregate performance still follows Moore’s Law

5

Power Dissipation Limits Practical Clock Frequency

[Figure: thermal design power (W, 0–120) and clock frequency (GHz, 0–3.5) across Intel processor families, Pentium through Core i7]

Source: Wikipedia, List of CPU Power Dissipation, Retrieved 02.06.10

Technology scaling increases clock frequency

Technology scaling adds processor cores

Power Wall

6

Chip Multiprocessors (CMPs)

• CMPs utilize chip resources within a constant power budget

• How does technology scaling impact CMPs?

Intel Nehalem

7

Projected Number of Cores

[Figure: projected number of cores vs. ITRS year of production (2007–2016), rising toward roughly 90 cores]

ITRS expects 40% annual increase

Observation 1: Multiprogramming can provide near-term throughput improvement

Observation 2: Software parallelism is needed in the long term

8

Processor Memory Gap

[Figure: processor performance vs. main memory latency, relative performance on a log scale by year, 1978–2010]

7% Annual Memory Latency Improvement

Memory Wall

Observation 3: Latency hiding techniques are necessary

9

Performance vs. Bandwidth

[Figure: projected processor performance vs. off-chip bandwidth, relative performance/bandwidth (0–25) by ITRS year of production, 2007–2015]

Observation 4: Bandwidth must be used efficiently

10

Application Trends: software parallelism and multiprogramming → concurrent applications share hardware

Hardware Trends: latency hiding and bandwidth efficiency → complex memory systems

Together, these trends motivate Shared Resource Management

11

CMP RESOURCE MANAGEMENT

12

Why Manage Shared Resources?

Provide predictable performance

Support OS scheduler assumptions

Cloud: Fulfill Service Level Agreement

13

Performance Variability Metrics

• Fairness
  – The performance reduction due to interference between processes is distributed across all processes in proportion to their priorities
  – Equal priorities: the performance reduction from sharing affects all processes equally (one possible formula is sketched after this list)

• Quality of Service
  – The performance of a process never drops below a certain limit, regardless of the behavior of co-scheduled processes
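The deck does not give the fairness formula itself. As a hedged illustration only, the sketch below computes one common equal-priority fairness metric, the ratio of the smallest to the largest slowdown; this is an assumption for illustration, not necessarily the metric used in Paper B.I.

```python
# Hypothetical sketch: one common equal-priority fairness metric
# (an assumption for illustration, not necessarily the thesis metric).

def fairness(shared_times, private_times):
    """Ratio of smallest to largest slowdown; 1.0 = perfectly fair."""
    slowdowns = [s / p for s, p in zip(shared_times, private_times)]
    return min(slowdowns) / max(slowdowns)

# Process 0 slows down 2x under sharing, process 1 slows down 4x.
print(fairness([200, 400], [100, 100]))  # 0.5
```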

14

Performance Variability (Fairness)

[Figure: lowest fairness value (0–1.2) vs. number of workloads (1–40) for crossbar-based and ring-based interconnects with 1, 2, and 4 memory channels]

Paper B.I

15

Resource Management Tasks

Measurement

Allocation (Policy)

Enforcement (Mechanism)

16

Contributions

Miss Bandwidth Management:
• Off-line Interference Measurement
• On-line Interference Measurement
• Dynamic Miss Handling Architecture
• Greedy Miss Bandwidth Allocation
• Performance Model Based Miss Bandwidth Allocation

Prefetch Scheduling:
• Low-Cost Open Page Prefetching
• Opportunistic Prefetch Scheduling

17

GREEDY MISS BANDWIDTH MANAGEMENT

Miss Bandwidth Management

18

Conventional Resource Allocation Implementation

[Diagram: 4-core CMP with per-CPU instruction and data caches (the private memory system), a crossbar, shared cache, memory controller, memory bus, and main memory; the measurement, allocation, and enforcement tasks are annotated in the shared part of the memory system]

19

Alternative Resource Allocation Implementation

[Diagram: the same 4-core CMP, but measurement, allocation, and enforcement now target the private memory system: a Dynamic Miss Handling Architecture in the private caches enforces the allocation]

20

[Diagram: a Miss Handling Architecture (MHA) is a table of Miss Status Holding Registers, each holding a block address, target information, and a used bit. With few registers usable, a burst of misses (A, D, B, E, C) fills them and the cache blocks until a fill returns]

A DMHA controls the number of concurrent shared memory system requests that are allowed for each processor
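As a minimal sketch of this mechanism (an assumed software model with invented interface names, not the thesis hardware), the class below models an MSHR file whose usable size can be lowered or raised at run time; when all usable registers hold outstanding misses, further misses block the cache.

```python
# Minimal sketch of a Dynamic MHA: an MSHR file with a run-time limit.

class DynamicMHA:
    def __init__(self, total_mshrs):
        self.total = total_mshrs
        self.limit = total_mshrs   # usable MSHRs, set by the allocation policy
        self.outstanding = {}      # block address -> list of miss targets

    def set_limit(self, n):
        self.limit = max(1, min(n, self.total))

    def register_miss(self, addr, target):
        """Returns False when the cache must block (no usable MSHR free)."""
        if addr in self.outstanding:               # miss to an in-flight block:
            self.outstanding[addr].append(target)  # only a new target entry
            return True
        if len(self.outstanding) >= self.limit:
            return False                           # blocked until a fill returns
        self.outstanding[addr] = [target]
        return True

    def fill(self, addr):
        """Data returned from memory; frees the MSHR."""
        return self.outstanding.pop(addr, [])
```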

21

Greedy Miss Bandwidth Management

• Idea: Reduce the number of MSHRs if a metric exceeds a certain threshold

• Metrics:
  – Paper A.II: Memory bus utilization
  – Paper A.III: Simple interference counters (Interference Points)

• Performance feedback avoids excessive performance degradation (see the sketch below)

Paper A.II and A.III
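A hedged sketch of the greedy loop for one core follows; the threshold and back-off values are invented for illustration, and the real metrics and feedback rules are defined in Papers A.II and A.III.

```python
# Sketch of the greedy policy for one core (illustrative numbers only).

def greedy_adjust(limit, bus_utilization, ipc_now, ipc_before,
                  high=0.95, max_slowdown=0.10):
    """Returns the new MSHR limit for this core."""
    if ipc_now < (1.0 - max_slowdown) * ipc_before:
        return limit + 1          # performance feedback: last cut hurt, back off
    if bus_utilization > high:
        return max(1, limit - 1)  # metric above threshold: take miss bandwidth
    return limit
```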

22

INTERFERENCE MEASUREMENT

Miss Bandwidth Management

23

Resource Allocation Baselines

Baseline = Interference-free configuration

Quantify performance impact from interference

Private Mode and Shared Mode

24

Interference Definition

Interference = Shared Mode Latency − Private Mode Latency

Estimate Error = Private Mode Latency Estimate − Private Mode Latency Measurement

The private mode latency cannot be measured while sharing, so it must be estimated; the estimate error separates measurement inaccuracy from true interference.

25

Offline Interference Measurement

Interference Penalty Frequency (IPF) counts the number of requests that experienced an interference latency of i cycles

Interference Impact Factor (IIF) is the interference latency times the probability of it arising, i.e. IIF(i) = i ∙ P(i)

Paper B.I
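Both definitions translate directly into code. The sketch below uses made-up per-request latencies; note that summing IIF(i) over all i gives the average interference latency per request, which is one way to aggregate the impact (compare the next slide).

```python
from collections import Counter

def ipf(interference_latencies):
    """IPF: number of requests per interference latency i."""
    return Counter(interference_latencies)

def iif(ipf_counts):
    """IIF(i) = i * P(i)."""
    total = sum(ipf_counts.values())
    return {i: i * (n / total) for i, n in ipf_counts.items()}

lat = [0, 0, 40, 40, 120]           # made-up per-request interference (cycles)
print(iif(ipf(lat)))                # {0: 0.0, 40: 16.0, 120: 24.0}
print(sum(iif(ipf(lat)).values()))  # 40.0 cycles average interference
```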

26

Aggregate Interference Impact

Paper B.I

[Figure: aggregate interference impact (0–300), broken down into memory bus, cache, and interconnect interference, for crossbar (CB) and ring interconnects with 1, 2, and 4 memory channels on 4-, 8-, and 16-core CMPs]

27

Resource Management Baselines

[Diagram: Multiprogrammed Baseline (MPB): programs A and B run together on one system whose shared cache, interconnect, and memory bus are statically and equally partitioned. Single Program Baseline (SPB): each program runs alone on the complete system]

28

Baseline Weaknesses

• Multiprogrammed Baseline
  – Only accounts for interference in partitioned resources
  – Static and equal division of DRAM bandwidth does not give equal latency
  – Complex relationship between resource allocation and performance

• Single Program Baseline
  – Does not exist in shared mode

→ Online interference measurement: the Dynamic Interference Estimation Framework (DIEF), Paper B.II

29

Online Interference Measurement

• Dynamic Interference Estimation Framework (DIEF)

• Estimates private mode average memory latency

• General, component-based framework

Paper B.II

30

Shared Cache Interference: Auxiliary Tag Directories

[Diagram: per-CPU Auxiliary Tag Directories replay each CPU's cache accesses (CPU 0: A, B, C, D, ...; CPU 1: M, N, ...) to track which blocks the shared cache would hold if that CPU ran alone]

An eviction caused by a co-scheduled CPU may or may not be interference: it is interference only if the evicted block would still be present in private mode (a sketch of this test follows).

Interference latency cost = miss penalty
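The sketch below shows the eviction test, assuming for brevity a fully associative LRU tag store; a real ATD mirrors the shared cache's set and way organization.

```python
from collections import OrderedDict

class ATD:
    """Auxiliary Tag Directory: tags only, as if this CPU ran alone."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tags = OrderedDict()   # LRU order: oldest first

    def access(self, addr):
        """Replays one access; returns True if private mode would hit."""
        hit = addr in self.tags
        if hit:
            self.tags.move_to_end(addr)
        else:
            if len(self.tags) >= self.capacity:
                self.tags.popitem(last=False)   # private-mode LRU eviction
            self.tags[addr] = True
        return hit

def interference_penalty(atd, addr, shared_hit, miss_penalty):
    """Shared-mode miss + private-mode hit = interference (cost: miss penalty)."""
    private_hit = atd.access(addr)
    return miss_penalty if (private_hit and not shared_hit) else 0
```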

31

Bus Interference Requirements

• Out-of-order memory bus scheduling
• Cache misses and hits that occur only in shared mode
• Shared cache writebacks

Computing private latency based on shared mode queue contents is difficult

Emulate private scheduling in the shared mode

32

[Diagram: Memory Latency Estimation Buffer. The shared bus queue holds requests in arrival order (E, D, C, B) and execution order (D, C, B, A, with a head pointer). Per-bank open page emulation registers and a latency lookup table emulate private-mode scheduling. Bank/page mapping: A → (0,15), B → (0,19), C → (0,15), D → (1,32)]

Estimated queue latency = 120 + 40 + 40 = 200 cycles
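A simplified sketch of the emulation, assuming fixed page-hit and page-miss latencies (40 and 120 cycles, echoing the example) and ignoring queuing between banks; the real buffer also handles the ordering issues listed on the previous slide.

```python
PAGE_HIT, PAGE_MISS = 40, 120   # illustrative DRAM latencies (cycles)

def estimate_queue_latency(requests, page_of, bank_of):
    """Replay one CPU's requests in private-mode (arrival) order."""
    open_page = {}   # bank -> open page (the emulation registers)
    total = 0
    for addr in requests:
        bank, page = bank_of(addr), page_of(addr)
        total += PAGE_HIT if open_page.get(bank) == page else PAGE_MISS
        open_page[bank] = page
    return total

# Mapping from the slide: A and C share page 15 in bank 0; B is page 19; D is bank 1.
pages = {"A": 15, "B": 19, "C": 15, "D": 32}
banks = {"A": 0, "B": 0, "C": 0, "D": 1}
print(estimate_queue_latency(["A", "C", "B", "D"],
                             pages.get, banks.get))  # 120+40+120+120 = 400
```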

33

MODEL-BASED MISS BANDWIDTH MANAGEMENT

Miss Bandwidth Management

34

Model-Based Miss Bandwidth Allocation

DIEF provides accurate estimates of the average private mode memory latency

Can we use the estimates provided by DIEF to choose miss bandwidth allocations?

We need a model that relates average memory latency to performance

Paper A.IV

35

Performance Model

Paper A.IV

Observation: The memory latency performance impact depends on the parallelism of memory requests

Memory request parallelism is very similar in private and shared mode

Shared mode measurements can therefore provide private mode performance estimates
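A heavily hedged sketch of the modeling step; the actual model is derived in Paper A.IV. The only assumption encoded here is that, with equal request parallelism in both modes, stall time scales with average memory latency.

```python
# Illustrative-only model: private stall time from shared-mode measurements.

def estimate_private_stall(shared_stall, shared_lat, private_lat):
    # Parallelism is near-identical in both modes, so it cancels out
    # when scaling shared-mode stall time by the latency ratio.
    return shared_stall * (private_lat / shared_lat)

def estimate_private_ipc(committed_insts, compute_cycles,
                         shared_stall, shared_lat, private_lat):
    cycles = compute_cycles + estimate_private_stall(
        shared_stall, shared_lat, private_lat)
    return committed_insts / cycles
```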

36

Bandwidth Management Flow

Paper A.IV

Measurement → Modeling → Allocation

Measurement inputs: shared mode memory latency, private mode memory latency, committed instructions, number of memory requests, CPU stall time

Modeling: per-CPU performance models feed a performance-metric model

Allocation: find the MSHR allocation that maximizes the chosen performance metric (see the search sketch below), then set the number of MSHRs for all last-level private caches
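The allocation step can be pictured as a search over per-CPU MSHR limits; the sketch below does it exhaustively with a toy model (`predict_ipc` and `ToyModel` are invented stand-ins for the per-CPU models, and the metric here is the harmonic mean of IPCs).

```python
import itertools

def choose_allocation(cpu_models, mshr_options, metric):
    """Try every per-CPU MSHR combination; keep the best-scoring one."""
    best_alloc, best_score = None, float("-inf")
    for alloc in itertools.product(mshr_options, repeat=len(cpu_models)):
        ipcs = [m.predict_ipc(n) for m, n in zip(cpu_models, alloc)]
        score = metric(ipcs)
        if score > best_score:
            best_alloc, best_score = alloc, score
    return best_alloc

class ToyModel:                    # placeholder for a real per-CPU model
    def __init__(self, base_ipc): self.base_ipc = base_ipc
    def predict_ipc(self, mshrs):  # more MSHRs -> more of the base IPC
        return self.base_ipc * min(1.0, mshrs / 16)

hmean = lambda xs: len(xs) / sum(1 / x for x in xs)
print(choose_allocation([ToyModel(1.5), ToyModel(0.8)],
                        [1, 2, 4, 8, 16], hmean))   # (16, 16) for this toy
```

A real model would capture the interference between cores, so the search would trade one core's MSHRs against another's instead of trivially maxing out.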

37

OFF-CHIP BANDWIDTH MANAGEMENT

38

Modern DRAM Interfaces

• Maximize bandwidth with 3D organization

• Repeated requests to the row buffer are very efficient

[Diagram: a DRAM bank is a 2D array of rows and columns; the row address moves a row into the row buffer, and the column address selects data within it; several banks per device give the third dimension]
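A toy latency model makes the efficiency argument concrete; the timing values are illustrative only, not from any datasheet.

```python
# Why row-buffer hits are cheap: a hit needs only a column access, while a
# conflict must precharge and re-activate the row first.

T_CAS, T_RCD, T_RP = 5, 5, 5   # column access, activate, precharge (cycles)

def access_latency(open_row, row):
    if open_row == row:
        return T_CAS                  # row-buffer hit
    if open_row is None:
        return T_RCD + T_CAS          # bank idle: activate, then read
    return T_RP + T_RCD + T_CAS       # conflict: precharge, activate, read
```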

39

Low-Cost Open Page Prefetching

• Idea: Piggyback prefetches to open DRAM pages on demand reads

• Performance win if prefetcher accuracy is above ~40% (see the sketch below)

Paper C.I
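A hedged sketch of the piggyback decision; `page_of`, the prefetch queue, and the command tuples are invented interfaces, not the paper's design.

```python
def schedule(demand_addr, prefetch_queue, open_page, page_of):
    """Serve the demand read; append prefetches only if its page is open."""
    cmds = [("read", demand_addr)]
    if page_of(demand_addr) == open_page:
        cmds += [("prefetch_read", a) for a in prefetch_queue
                 if page_of(a) == open_page]   # nearly-free column reads
    return cmds
```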

40

Opportunistic Prefetch Scheduling

Page Vector Table (PVT)

[Diagram: demand accesses to pages 99–102; prefetch requests recorded in the PVT are issued as the page is closed]

Idea: Issue prefetches when a page is closed

Increased efficiency: 8 transfers for 3 activations (see the sketch below)

Paper C.II
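A minimal sketch of the PVT bookkeeping, with invented method names: prefetches accumulate per page and are drained in one batch just before the page is closed, which is how 8 transfers can share 3 activations.

```python
class PageVectorTable:
    def __init__(self):
        self.pending = {}   # DRAM page -> block addresses awaiting prefetch

    def add_prefetch(self, page, addr):
        self.pending.setdefault(page, set()).add(addr)

    def on_page_close(self, page):
        """Issue all pending prefetches while the page is still open."""
        return sorted(self.pending.pop(page, set()))
```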

41

CONCLUSION

42

Conclusion

• Managing bandwidth allocations can improve CMP system performance

• Miss bandwidth management
  – Greedy allocations
  – Management guided by accurate measurements and performance models

• Off-chip bandwidth management with prefetching

43

Thank You

Visit our website: http://research.idi.ntnu.no/multicore/

44

EXTRA SLIDES

45

Future Work

• Performance-directed management of shared caches and the memory bus

• Improving OS and system software with dynamic measurements

• Combining dynamic MHAs with prefetching to improve system performance

• Managing workloads of single-threaded and multi-threaded benchmarks

46

Example Chip Multiprocessor

[Diagram: example 4-core CMP. Each CPU has private instruction and data caches (the private memory system); the interconnect, shared cache, memory controller, memory bus, and main memory form the shared memory system]
