Integrated Memory Controllers with Parallel Coherence Streams
Mainak Chaudhuri    Mark Heinrich
IIT Kanpur    University of Central Florida
Talk in One Slide
Ever-increasing on-die integration
– Faster memory controllers and coherence processors
– Leads to new trade-offs in the domain of programmable coherence engines for scalable directory-based DSM multiprocessors
– We show that multiple coherence engines are unnecessary in such environments
– We develop a useful analytical model to quickly decide the coherence bandwidth requirement of parallel applications
Sketch
Background
Memory controller architecture
Analytical model
Evaluation framework
Simulation results
– Validation of model for directory protocols
– Directory-less broadcast protocols
– Multiprogramming
Summary
Background: Integrated MC
A direct solution to reduce round-trip cache miss latency
– Other advantages related to maintenance and glueless multiprocessing
Widely accepted in high-end industry
– Alpha 21364, IBM Power5, AMD Opteron, Sun UltraSPARC III and IV, Sun Niagara
Shared memory multiprocessors employing iMC are naturally DSMs
– Bandwidth-thrifty directory coherence is the choice
Background: Directory Processing
Home-based coherence protocols
– Each cache block has a home node
– Upper few bits of physical address
– Each coherence request (miss or dirty eviction from the last level of cache) is first sent to the home node of the cache block
– At home node, sharing information of the cache block is maintained in a data structure called directory (can be in SRAM or DRAM)
– Coherence controller of the home looks up directory and takes appropriate actions
Each node has at least one embedded directory coherence controller
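The home-node mapping above (upper few bits of the physical address) is a trivially cheap computation. A minimal sketch, assuming a hypothetical 40-bit physical address and the 16-node configuration used in this work (both widths are illustrative):

```python
NODE_BITS = 4    # up to 16 nodes, as in this work
ADDR_BITS = 40   # hypothetical physical address width

def home_node(paddr: int) -> int:
    """Home node id = upper few bits of the physical address."""
    return (paddr >> (ADDR_BITS - NODE_BITS)) & ((1 << NODE_BITS) - 1)

# An address whose top nibble is 0x5 is homed at node 5.
print(home_node(0x5 << 36))  # → 5
```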
Background: Directory Processing
Two different trends in directory coherence controller architecture
– Hardwired controllers
  Less flexible, tedious verification, often affects project's critical path, but high-performance
  MIT Alewife, KSR1, SGI Origin, Stanford DASH
– Custom programmable controllers
  Executes protocol software on a protocol processor embedded in memory controller
  Flexible in choice of protocol, easier to verify the protocol, loss of performance
  Compaq Piranha, Opteron-Horus, Stanford FLASH, Sequent STiNG, Sun S3.mp
Background: Flexible Processing
Past research reports up to 12% performance loss [Stanford FLASH]
– Main reason why industry is shy of pursuing this option
Coherence controller occupancy has emerged as the most important parameter
– Naturally, hardwired controllers get an upper hand
– Past research has established the importance of multiple hardwired controllers in SMP nodes
Background: Flexible Processing
New technology often changes the trade-offs
– Reconsider programmable directory controllers in the light of increased integration
  Bring the programmable controller on die
  Faster clock rates lead to lowered occupancy
– New research questions:
  Can the integrated programmable controllers offer enough coherence bandwidth?
  Do we need multiple of those?
  Can the integrated controllers cope with the extra pressure of emerging multi-threaded nodes?
Background: Flexible Processing
Executes coherence protocol handlers in software with hardware support
– Does not require interrupts
Two major architectures
– Integrated custom protocol processor(s)
  Can use one or more simple cores (this work considers one or two static dual-issue in-order cores, each with one dedicated level of caches)
– Reserved protocol thread context(s) in a simultaneous multi-threaded (SMT) node
  SMTp: SMT with one or more protocol contexts
  Eliminates the protocol processor
Background: Flexible Processing
[Block diagrams: (left) PP organization, with an OOO SMT core running application threads (ATs) over its IL1/DL1 and L2 cache, plus a separate in-order protocol processor (PP) with its own IL1/DL1, the memory controller (MC), SDRAM banks, and router; (right) SMTp organization, where the OOO SMT core runs application and protocol threads (ATs+PTs) sharing IL1/DL1 and the L2 cache, with MC, SDRAM banks, and router, eliminating the separate PP.]
Aside: A Protocol Handler
Computes directory address from requested address (simple hash)
Loads directory entry into a register
Computes coherence actions based on directory state and header
– Simple integer arithmetic
Sends out coherence messages as needed
– Custom instructions or uncached stores to write header and address to send unit
– May carry cache line data read from DRAM
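The handler steps above can be sketched in a few lines. The directory layout, base address, and state names below are hypothetical placeholders, not the actual protocol code executed on the PP:

```python
DIR_BASE = 0x4000_0000   # hypothetical base address of the directory region
BLOCK_SHIFT = 7          # 128-byte cache blocks

def directory_address(block_addr: int) -> int:
    """Simple hash: one 8-byte directory entry per cache block."""
    return DIR_BASE + ((block_addr >> BLOCK_SHIFT) * 8)

def handle_read(block_addr: int, requester: int, directory: dict) -> list:
    """Load the directory entry, compute actions, emit coherence messages."""
    addr = directory_address(block_addr)
    entry = directory.setdefault(addr, {"state": "UNOWNED", "sharers": 0, "owner": None})
    msgs = []
    if entry["state"] == "DIRTY":
        # Owner holds the only valid copy: forward the request to it.
        msgs.append(("FORWARD_READ", entry["owner"], block_addr))
    else:
        # Home replies with the line (the DRAM read was initiated speculatively);
        # sharer tracking is simple integer arithmetic on a bit vector.
        entry["state"] = "SHARED"
        entry["sharers"] |= 1 << requester
        msgs.append(("DATA_REPLY", requester, block_addr))
    return msgs
```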
Scope of this Work
Distributed shared memory (DSM) NUMA
– Up to 16 nodes, directory-based or directory-less broadcast coherence
– Each node has an SMT processor capable of running four application threads, an integrated memory controller, integrated one or two protocol processors (PPs) or protocol threads (PTs), and an integrated hypercube router
– Six parallel applications and 4-way multiprogrammed workloads
Applicability to multi-node chip-MPs
Contributions
Two primary contributions
– Evaluates two kinds of programmable coherence engines in the light of on-die integration and multi-threading
– Develops a simple and generic analytical model relating the directory protocol occupancy, DRAM latency, and DRAM channel bandwidth
  Introduces the concept of occupancy margin
  Offers valuable insights on hot-spot situations
  Accurately predicts whether an additional coherence engine is helpful
Highlights
Key results
– A single integrated programmable controller offers sufficient coherence bandwidth for directory-based systems (contrast with off-chip controller studies)
  The analytical model helps explain why typical hot-spot situations (involving locks and flags) cannot be improved with parallel coherence stream processing
– Directory-less broadcast systems (e.g., AMD Opteron) enjoy significant benefit from parallel coherence stream processing with multiple programmable "snoop" engines
Sketch
Background
Memory controller architecture
Analytical model
Evaluation framework
Simulation results
– Validation of model for directory protocols
– Directory-less broadcast protocols
– Multiprogramming
Summary
Memory Controller Architecture: PP
[Block diagram: inbound queues from the processor interface (PI), the router virtual channels (NI), and the software queue (SWQ head) feed a round-robin dispatch unit; dispatched messages go to the protocol processor (with its own Icache and Dcache), which accesses the SDRAM banks, performs CAM lookup in the outstanding message buffer (OMB), and drives the send unit feeding PI out, NI out, and the PPWQ.]
Memory Controller Architecture: SMTp
[Block diagram: same organization as the PP case, except that the round-robin dispatch feeds a protocol thread instead of a dedicated protocol processor, so there are no separate protocol caches; the OMB CAM lookup, SDRAM banks, send unit, PI/NI outbound queues, and PPWQ are unchanged.]
Memory Controller Architecture: SMTp
Protocol thread participates in ICOUNT fetching when PC is valid
Shares pipeline resources with application threads, including the entire cache hierarchy
Deadlock avoidance with reserved resources
– Queue buffers, branch checkpoints, integer registers, integer and LS queue buffers, store buffers, MSHRs
Parallel Coherence Streams
Replicated resources
– Multiple protocol processors or protocol threads
– Multiple OMBs
– Multi-ported Icache and Dcache for protocol processor (does not apply to SMTp)
Control flow
– Mutual exclusion in directory access: requires six-ported CAM lookup in OMB and PPWQ
– Schedule a message every cycle
  Protocol processors/threads arbitrate for PPWQ read port; smallest id wins (dynamic priority)
Parallel Coherence Streams
Critical sections
– Conventional LL/SC and test-and-set locks
– For higher throughput, test-and-set locks are maintained in on-chip registers
– Software queue and its related states (e.g., occupancy) are the major shared read/write variables
– Leads to increased average dynamic instruction count per handler
  Trades occupancy of individual handlers for concurrency across handlers
Parallel Coherence Streams
Boot sequence
– Only one protocol processor/thread executes the entire boot sequence, initializing the memory controller and peripheral states
– The other processors/threads only initialize their architectural register states
Out-of-order issue
– Address conflict between six heads and PPWQ/OMB may lead to idle schedule cycles
– Consider all requests in the five queues, not just the heads (SWQ is still FIFO)
– Queues need to be collapsible with address CAMs
Sketch
Background
Memory controller architecture
Analytical model
Evaluation framework
Simulation results
– Validation of model for directory protocols
– Directory-less broadcast protocols
– Multiprogramming
Summary
Analytical Model
Goal of the model is to decide if a second protocol engine can improve performance
The model is applicable to any system exercising directory-based protocols
– Not just limited to systems with integrated controllers
We analyze the time spent by a batch of requests in the memory system
– Focus only on handling of read and read-exclusive at home (most time consuming)
– These require DRAM access (initiated speculatively)
Analytical Model
Three parts in the life of a request after it is dispatched
– DRAM occupancy or access latency (Om)
– Protocol handler occupancy or protocol processing latency (Op)
– DRAM channel occupancy or cache line transfer latency (Oc)
We look at four scenarios involving two concurrently arriving bank-parallel requests (consider only Om > Op)
– Single- and dual-channel DRAM controller with one and two protocol engines
Single-channel DRAM Controller
[Timing diagram: requests R1 and R2, each with protocol occupancy (O1, O2), DRAM access (M1, M2), and channel transfer (C1, C2), scheduled under one protocol engine (1PPU) and two (2PPU); when the protocol occupancies are covered by DRAM and channel time, the second engine saves nothing.]
What if 2Op > Om + 2Oc?
Single-channel DRAM Controller
[Timing diagram for the case 2Op > Om + 2Oc: with 1PPU the second request's protocol occupancy is exposed; with 2PPU, O1 and O2 overlap.]
Saved: 2Op – (Om + 2Oc)
Dual-channel DRAM Controller
[Timing diagram: requests R1 and R2 with protocol occupancy (O1, O2), DRAM access (M1, M2), and channel transfer (C1, C2) under 1PPU and 2PPU; with two channels the line transfers overlap, and when DRAM time still dominates, the second engine again saves nothing.]
What if 2Op > Om + Oc?
Dual-channel DRAM Controller
[Timing diagram for the case 2Op > Om + Oc: with 1PPU the second protocol occupancy is exposed; with 2PPU, O1 and O2 proceed in parallel.]
Saved: 2Op – (Om + Oc)
General Formulation
Burst arrival of requests: k at a time
Single-channel DRAM controller:
– Total protocol occupancy must get exposed if adding a second coherence engine is to be effective
– Required condition: kOp > Om + kOc
– Re-arranging: Op > Om/k + Oc
Dual-channel DRAM controller:
– Required condition: kOp > Om + kOc/2
– Re-arranging: Op > Om/k + Oc/2
Occupancy margin: left-hand side minus right-hand side
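The occupancy margin reduces to a one-line computation. The sample inputs below are the 400 MHz PP measurements reported later on the model-validation slide (single-channel, Oc = 20 ns):

```python
def occupancy_margin(op_ns, om_ns, k, oc_ns, channels=1):
    """Occupancy margin Op - (Om/k + Oc/w) for a w-channel DRAM
    controller; a positive margin means a second coherence engine
    can expose useful concurrency."""
    return op_ns - (om_ns / k + oc_ns / channels)

# 400 MHz PP numbers from the validation slide (Oc = 20 ns):
print(round(occupancy_margin(30.3, 54.6, 5, 20.0), 1))  # FFT:   -0.6
print(round(occupancy_margin(36.6, 54.4, 7, 20.0), 1))  # Ocean:  8.8
```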
Take-away Points
For highly bursty requests (high k)
– Om has a diminishing effect
– Balance between Op and Oc becomes the most important determinant: tension between two competing bandwidths
For small k
– A large occupancy margin is unlikely, as the contribution from Om would be large
– Adding a second coherence engine would not be useful: less concurrency
Extra DRAM bandwidth may convert a negative occupancy margin to positive
Hot-spots and Bank Conflicts
Bank conflicts can delay DRAM accesses of a burst of requests
– Hot-spots often arise from access to the same cache block system-wide: an obvious case of bank conflict
– Can multiple coherence engines help?
Since Om > Op on average, for two conflicting requests the total protocol occupancy (2Op) is hidden under memory access latency (2Om)
– Multiple coherence engines will not improve performance
Hot-spots and Bank Conflicts
What about row buffer hits?
– The first request in a batch will suffer a row buffer miss
– The subsequent ones will enjoy hits with high probability
– Row buffer hits lower the average value of Om and may uncover portions of kOp even in the case of k conflicting requests
– Required condition: kOp > Omiss + (k-1)Ohit + kOc/w for a w-channel DRAM
– Simplifying: Op > Om + Oc/w, which contradicts Om > Op (typical of integrated controllers)
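A quick numeric check of the row-buffer condition. The Omiss = 55 ns and Ohit = 15 ns latencies are hypothetical round numbers (only Op ≈ 8 ns is in the ballpark of the 1.6 GHz measurements, with Oc = 20 ns as elsewhere):

```python
def hotspot_margin(op, o_miss, o_hit, k, oc, w=1):
    """Per-request margin for k bank-conflicting requests where the
    first misses the row buffer and the remaining k-1 hit:
    Op - (Omiss + (k-1)*Ohit)/k - Oc/w."""
    return op - (o_miss + (k - 1) * o_hit) / k - oc / w

# Even with cheap row-buffer hits, Om > Op keeps the margin negative,
# so a second engine cannot speed up a hot lock or flag.
print(hotspot_margin(8.0, 55.0, 15.0, 8, 20.0))  # → -32.0
```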
Sketch
Background
Memory controller architecture
Analytical model
Evaluation framework
Simulation results
– Validation of model for directory protocols
– Directory-less broadcast protocols
– Multiprogramming
Summary
Evaluation Framework
Evaluates both integrated protocol processors (PPs) and threads in SMTp
Depending on the integration level, the memory controller and PP can be clocked at different frequencies
– Explores 400 MHz, 800 MHz, 1.6 GHz with a 1.6 GHz main SMT processor
Protocol threads in SMTp, by design, always run at full frequency (1.6 GHz)
Each node has an SMT processor and runs up to four ATs (64-threaded apps) and two PTs
Flashback: Flexible Processing
[Repeats the earlier block diagrams: the PP organization (separate in-order protocol processor with its own IL1/DL1 beside the OOO SMT core, L2, MC, SDRAM banks, and router) versus SMTp (protocol threads share the OOO SMT core and its caches).]
Sketch
Background
Memory controller architecture
Analytical model
Evaluation framework
Simulation results
– Validation of model for directory protocols
– Directory-less broadcast protocols
– Multiprogramming
Summary
400 MHz Protocol Processors: Execution Time
[Chart: up to 8% execution-time improvement with a second 400 MHz PP.]
400 MHz Protocol Processors: Dispatcher's Wait Cycles
400 MHz Protocol Processors: Occupancy
Model Validation: 400 MHz PP
Oc is fixed at 20 ns (128B @ 6.4 GB/s)
Predict from 1PP measurements:

            Op (ns)  Om (ns)  kmax  OM (ns)
FFT          30.3     54.6      5     -0.6
FFTW         29.4     54.3      7      1.6
LU           25.3     40.1      6     -1.4
Ocean        36.6     54.4      7      8.8
Radix-Sort   33.8     45.8      4      2.3
Water        28.1     40.0      4     -1.9
1.6 GHz Protocol Processors: Execution Time
1.6 GHz Protocol Processors: Dispatcher's Wait Cycles
1.6 GHz Protocol Processors: Occupancy
Model Validation: 1.6 GHz PP
Oc is fixed at 20 ns (128B @ 6.4 GB/s)
Predict from 1PP measurements:

            Op (ns)  Om (ns)  kmax  OM (ns)
FFT           8.4     54.4     12    -16.1
FFTW          7.5     53.9     16    -15.9
LU            6.3     40.2     16    -16.2
Ocean         9.4     54.6     16    -14.0
Radix-Sort    8.8     45.8      8    -16.9
Water         7.2     40.0      8    -17.8
SMTp: Execution Time
SMTp: Dispatcher's Wait Cycles
SMTp: Occupancy
Model Validation: SMTp
Oc is fixed at 20 ns (128B @ 6.4 GB/s)
Predict from 1PP measurements:

            Op (ns)  Om (ns)  kmax  OM (ns)
FFT          15.3     55.2     10    -10.2
FFTW         15.6     57.6     14     -8.5
LU           13.1     42.3     16     -9.5
Ocean        17.5     55.9     16     -6.0
Radix-Sort   16.6     51.8     11     -8.1
Water        14.4     40.1      8    -10.6
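As a sanity check, every published SMTp occupancy margin above is reproduced by the single-channel formula Op - Om/kmax - Oc with Oc = 20 ns:

```python
# (Op, Om, kmax, published OM) per application, from the SMTp table above.
rows = {
    "FFT":        (15.3, 55.2, 10, -10.2),
    "FFTW":       (15.6, 57.6, 14,  -8.5),
    "LU":         (13.1, 42.3, 16,  -9.5),
    "Ocean":      (17.5, 55.9, 16,  -6.0),
    "Radix-Sort": (16.6, 51.8, 11,  -8.1),
    "Water":      (14.4, 40.1,  8, -10.6),
}
for app, (op, om, k, om_published) in rows.items():
    # Single-channel margin: Op - Om/kmax - Oc, rounded to 0.1 ns.
    assert round(op - om / k - 20.0, 1) == om_published
print("all SMTp margins reproduced")
```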
Summary: Execution Time
Take-away Points
Doubling controller frequency is always better than adding a second one
– Reducing individual handler occupancy is more important than reducing the occupancy of a burst
– Ocean shows the importance of burst mix
Increasing frequency has diminishing returns
– Instead of building complex hardwired protocol engines, dedicate a thread or core in the emerging processors
Application to Many-core
Multi-node many-core systems have a natural hierarchy of coherence
– Private last levels (L1 or L2) kept coherent with the next shared level (L2 or L3) via a directory protocol (Niagara, Power5)
– Multiple nodes employ a global directory protocol
– Our model applies to both levels
– The on-chip coherence controllers per bank in a shared NUCA are unique in one sense
  Om and Oc are much smaller than for DRAM: our model may dictate a positive OM depending on burstiness (k) and protocol complexity (Op)
Directory-less Broadcast Protocols
Requester sends request to home, home broadcasts it to all and replies to requester, all snoop local caches and reply to requester, requester picks correct response
– Still NUMA (a la AMD Opteron)
– 26.1 messages per miss (compare with 2.5 in directory protocol)
A lot of concurrency in the coherence processing layer
– 16.1% average improvement when a second coherence engine is introduced (averaged across FFT, FFTW, Ocean, Radix-Sort)
Single-node Multiprogramming
Multiprogrammed workloads have a large data footprint and no sharing across threads
– A lot of outer-level cache misses; exercises the coherence engine a lot more than parallel applications
– Our model correctly dictates that there is no gain in introducing a second controller
  Op is too small to satisfy the inequalities
– Take-away point: coherence bandwidth requirement is not directly related to cache miss rate
Single-node Multiprogramming
Prior Research
Programmable coherence controllers
– Stanford FLASH, Wisconsin Typhoon, Sequent STiNG, Sun S3.mp, Compaq Piranha CMP, Newisys Opteron-Horus
– All controllers are off-chip (and hence lag by at least two generations of process)
Multiple coherence controllers
– Explored with SMP nodes having off-chip directory controllers (IBM, UIUC, Purdue)
– Local/remote address partitioning in Opteron-Horus, STiNG, and S3.mp
Summary
A useful model for coherence layer designers
– Simple and intuitive inequality rules
– DRAM latency contributes little to this decision in the case of highly bursty applications
– Bank-conflicting requests enjoy little or no benefit from parallel coherence stream processing (common case for hot locks/flags)
Two controllers improve performance by up to 8% when the frequency ratio is four
Broadcast protocols enjoy larger benefit
Acknowledgments
Anshuman Gupta (UCSD)
– Preliminary simulations (part of independent study at IIT Kanpur)
Varun Khaneja (AMD)
– Development of directory-less broadcast protocol (part of MTech thesis at IIT Kanpur)
IIT Kanpur Security Center
– For hosting part of simulation infrastructure
Integrated Memory Controllers with Parallel Coherence Streams
Mainak Chaudhuri    Mark Heinrich
IIT Kanpur    University of Central Florida
THANK YOU!
To appear in IEEE TPDS