Integrated Memory Controllers with Parallel Coherence Streams
Mainak Chaudhuri    Mark Heinrich
IIT Kanpur    University of Central Florida
Talk in One Slide
Ever-increasing on-die integration
– Faster memory controllers and coherence processors
– Leads to new trade-offs in the domain of programmable coherence engines for scalable directory-based DSM multiprocessors
– We show that multiple coherence engines are unnecessary in such environments
– We develop a useful analytical model to quickly decide the coherence bandwidth requirement of parallel applications
Sketch
Background
Memory controller architecture
Analytical model
Evaluation framework
Simulation results
– Validation of model for directory protocols
– Directory-less broadcast protocols
– Multiprogramming
Summary
Background: Integrated MC
A direct solution to reduce round-trip cache miss latency
– Other advantages related to maintenance and glueless multiprocessing
Widely accepted in high-end industry
– Alpha 21364, IBM Power5, AMD Opteron, Sun UltraSPARC III and IV, Sun Niagara
Shared memory multiprocessors employing iMC are naturally DSMs
– Bandwidth-thrifty directory coherence is the choice
Background: Directory Processing
Home-based coherence protocols
– Each cache block has a home node
– Upper few bits of physical address
– Each coherence request (miss or dirty eviction from the last level of cache) is first sent to the home node of the cache block
– At home node, sharing information of the cache block is maintained in a data structure called directory (can be in SRAM or DRAM)
– Coherence controller of the home looks up directory and takes appropriate actions
Each node has at least one embedded directory coherence controller
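The home-node mapping above (upper few bits of the physical address) is a trivially cheap computation. A minimal sketch, assuming a hypothetical 40-bit physical address and the 16-node configuration used in this work (both widths are illustrative):

```python
NODE_BITS = 4    # up to 16 nodes, as in this work
ADDR_BITS = 40   # hypothetical physical address width

def home_node(paddr: int) -> int:
    """Home node id = upper few bits of the physical address."""
    return (paddr >> (ADDR_BITS - NODE_BITS)) & ((1 << NODE_BITS) - 1)

# An address whose top nibble is 0x5 is homed at node 5.
print(home_node(0x5 << 36))  # → 5
```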
Background: Directory Processing
Two different trends in directory coherence controller architecture
– Hardwired controllers
  Less flexible, tedious verification, often affects project's critical path, but high-performance
  MIT Alewife, KSR1, SGI Origin, Stanford DASH
– Custom programmable controllers
  Executes protocol software on a protocol processor embedded in memory controller
  Flexible in choice of protocol, easier to verify the protocol, loss of performance
  Compaq Piranha, Opteron-Horus, Stanford FLASH, Sequent STiNG, Sun S3.mp
Background: Flexible Processing
Past research reports up to 12% performance loss [Stanford FLASH]
– Main reason why industry is shy of pursuing this option
Coherence controller occupancy has emerged as the most important parameter
– Naturally, hardwired controllers get an upper hand
– Past research has established the importance of multiple hardwired controllers in SMP nodes
Background: Flexible Processing
New technology often changes the trade-offs
– Reconsider programmable directory controllers in the light of increased integration
  Bring the programmable controller on die
  Faster clock rates lead to lowered occupancy
– New research questions:
  Can the integrated programmable controllers offer enough coherence bandwidth?
  Do we need multiple of those?
  Can the integrated controllers cope with the extra pressure of emerging multi-threaded nodes?
Background: Flexible Processing
Executes coherence protocol handlers in software with hardware support
– Does not require interrupts
Two major architectures
– Integrated custom protocol processor(s)
  Can use one or more simple cores (this work considers one or two static dual-issue in-order cores, each with one dedicated level of caches)
– Reserved protocol thread context(s) in a simultaneous multi-threaded (SMT) node
  SMTp: SMT with one or more protocol contexts
  Eliminates the protocol processor
Background: Flexible Processing
[Block diagrams: (left) PP organization, with an OOO SMT core running application threads (ATs) over its IL1/DL1 and L2 cache, plus a separate in-order protocol processor (PP) with its own IL1/DL1, the memory controller (MC), SDRAM banks, and router; (right) SMTp organization, where the OOO SMT core runs application and protocol threads (ATs+PTs) sharing IL1/DL1 and the L2 cache, with MC, SDRAM banks, and router, eliminating the separate PP.]
Aside: A Protocol Handler
Computes directory address from requested address (simple hash)
Loads directory entry into a register
Computes coherence actions based on directory state and header
– Simple integer arithmetic
Sends out coherence messages as needed
– Custom instructions or uncached stores to write header and address to send unit
– May carry cache line data read from DRAM
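The handler steps above can be sketched in a few lines. The directory layout, base address, and state names below are hypothetical placeholders, not the actual protocol code executed on the PP:

```python
DIR_BASE = 0x4000_0000   # hypothetical base address of the directory region
BLOCK_SHIFT = 7          # 128-byte cache blocks

def directory_address(block_addr: int) -> int:
    """Simple hash: one 8-byte directory entry per cache block."""
    return DIR_BASE + ((block_addr >> BLOCK_SHIFT) * 8)

def handle_read(block_addr: int, requester: int, directory: dict) -> list:
    """Load the directory entry, compute actions, emit coherence messages."""
    addr = directory_address(block_addr)
    entry = directory.setdefault(addr, {"state": "UNOWNED", "sharers": 0, "owner": None})
    msgs = []
    if entry["state"] == "DIRTY":
        # Owner holds the only valid copy: forward the request to it.
        msgs.append(("FORWARD_READ", entry["owner"], block_addr))
    else:
        # Home replies with the line (the DRAM read was initiated speculatively);
        # sharer tracking is simple integer arithmetic on a bit vector.
        entry["state"] = "SHARED"
        entry["sharers"] |= 1 << requester
        msgs.append(("DATA_REPLY", requester, block_addr))
    return msgs
```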
Scope of this Work
Distributed shared memory (DSM) NUMA
– Up to 16 nodes, directory-based or directory-less broadcast coherence
– Each node has an SMT processor capable of running four application threads, an integrated memory controller, integrated one or two protocol processors (PPs) or protocol threads (PTs), and an integrated hypercube router
– Six parallel applications and 4-way multiprogrammed workloads
Applicability to multi-node chip-MPs
Contributions
Two primary contributions
– Evaluates two kinds of programmable coherence engines in the light of on-die integration and multi-threading
– Develops a simple and generic analytical model relating the directory protocol occupancy, DRAM latency, and DRAM channel bandwidth
  Introduces the concept of occupancy margin
  Offers valuable insights on hot-spot situations
  Accurately predicts whether an additional coherence engine is helpful
Highlights
Key results
– A single integrated programmable controller offers sufficient coherence bandwidth for directory-based systems (contrast with off-chip controller studies)
  The analytical model helps explain why typical hot-spot situations (involving locks and flags) cannot be improved with parallel coherence stream processing
– Directory-less broadcast systems (e.g., AMD Opteron) enjoy significant benefit from parallel coherence stream processing with multiple programmable "snoop" engines
Sketch
Background
Memory controller architecture
Analytical model
Evaluation framework
Simulation results
– Validation of model for directory protocols
– Directory-less broadcast protocols
– Multiprogramming
Summary
Memory Controller Architecture: PP
[Block diagram: inbound queues from the processor interface (PI), the router virtual channels (NI), and the software queue (SWQ head) feed a round-robin dispatch unit; dispatched messages go to the protocol processor (with its own Icache and Dcache), which accesses the SDRAM banks, performs CAM lookup in the outstanding message buffer (OMB), and drives the send unit feeding PI out, NI out, and the PPWQ.]
Memory Controller Architecture: SMTp
[Block diagram: same organization as the PP case, except that the round-robin dispatch feeds a protocol thread instead of a dedicated protocol processor, so there are no separate protocol caches; the OMB CAM lookup, SDRAM banks, send unit, PI/NI outbound queues, and PPWQ are unchanged.]
Memory Controller Architecture: SMTp
Protocol thread participates in ICOUNT fetching when PC is valid
Shares pipeline resources with application threads, including the entire cache hierarchy
Deadlock avoidance with reserved resources
– Queue buffers, branch checkpoints, integer registers, integer and LS queue buffers, store buffers, MSHRs
Parallel Coherence Streams
Replicated resources
– Multiple protocol processors or protocol threads
– Multiple OMBs
– Multi-ported Icache and Dcache for protocol processor (does not apply to SMTp)
Control flow
– Mutual exclusion in directory access: requires six-ported CAM lookup in OMB and PPWQ
– Schedule a message every cycle
  Protocol processors/threads arbitrate for PPWQ read port; smallest id wins (dynamic priority)
Parallel Coherence Streams
Critical sections
– Conventional LL/SC and test-and-set locks
– For higher throughput, test-and-set locks are maintained in on-chip registers
– Software queue and its related states (e.g., occupancy) are the major shared read/write variables
– Leads to increased average dynamic instruction count per handler
  Trades occupancy of individual handlers for concurrency across handlers
Parallel Coherence Streams
Boot sequence
– Only one protocol processor/thread executes the entire boot sequence, initializing the memory controller and peripheral states
– The other processors/threads only initialize their architectural register states
Out-of-order issue
– Address conflict between six heads and PPWQ/OMB may lead to idle schedule cycles
– Consider all requests in the five queues, not just the heads (SWQ is still FIFO)
– Queues need to be collapsible with address CAMs
Sketch
Background
Memory controller architecture
Analytical model
Evaluation framework
Simulation results
– Validation of model for directory protocols
– Directory-less broadcast protocols
– Multiprogramming
Summary
Analytical Model
Goal of the model is to decide if a second protocol engine can improve performance
The model is applicable to any system exercising directory-based protocols
– Not just limited to systems with integrated controllers
We analyze the time spent by a batch of requests in the memory system
– Focus only on handling of read and read-exclusive at home (most time consuming)
– These require DRAM access (initiated speculatively)
Analytical Model
Three parts in the life of a request after it is dispatched
– DRAM occupancy or access latency (Om)
– Protocol handler occupancy or protocol processing latency (Op)
– DRAM channel occupancy or cache line transfer latency (Oc)
We look at four scenarios involving two concurrently arriving bank-parallel requests (consider only Om > Op)
– Single- and dual-channel DRAM controller with one and two protocol engines
Single-channel DRAM Controller
[Timing diagram: requests R1 and R2, each with protocol occupancy (O1, O2), DRAM access (M1, M2), and channel transfer (C1, C2), scheduled under one protocol engine (1PPU) and two (2PPU); when the protocol occupancies are covered by DRAM and channel time, the second engine saves nothing.]
What if 2Op > Om + 2Oc?
Single-channel DRAM Controller
[Timing diagram for the case 2Op > Om + 2Oc: with 1PPU the second request's protocol occupancy is exposed; with 2PPU, O1 and O2 overlap.]
Saved: 2Op – (Om + 2Oc)
Dual-channel DRAM Controller
[Timing diagram: requests R1 and R2 with protocol occupancy (O1, O2), DRAM access (M1, M2), and channel transfer (C1, C2) under 1PPU and 2PPU; with two channels the line transfers overlap, and when DRAM time still dominates, the second engine again saves nothing.]
What if 2Op > Om + Oc?
Dual-channel DRAM Controller
[Timing diagram for the case 2Op > Om + Oc: with 1PPU the second protocol occupancy is exposed; with 2PPU, O1 and O2 proceed in parallel.]
Saved: 2Op – (Om + Oc)
General Formulation
Burst arrival of requests: k at a time
Single-channel DRAM controller:
– Total protocol occupancy must get exposed if adding a second coherence engine is to be effective
– Required condition: kOp > Om + kOc
– Re-arranging: Op > Om/k + Oc
Dual-channel DRAM controller:
– Required condition: kOp > Om + kOc/2
– Re-arranging: Op > Om/k + Oc/2
Occupancy margin: left-hand side minus right-hand side
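The occupancy margin reduces to a one-line computation. The sample inputs below are the 400 MHz PP measurements reported later on the model-validation slide (single-channel, Oc = 20 ns):

```python
def occupancy_margin(op_ns, om_ns, k, oc_ns, channels=1):
    """Occupancy margin Op - (Om/k + Oc/w) for a w-channel DRAM
    controller; a positive margin means a second coherence engine
    can expose useful concurrency."""
    return op_ns - (om_ns / k + oc_ns / channels)

# 400 MHz PP numbers from the validation slide (Oc = 20 ns):
print(round(occupancy_margin(30.3, 54.6, 5, 20.0), 1))  # FFT:   -0.6
print(round(occupancy_margin(36.6, 54.4, 7, 20.0), 1))  # Ocean:  8.8
```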
Take-away Points
For highly bursty requests (high k)
– Om has a diminishing effect
– Balance between Op and Oc becomes the most important determinant: tension between two competing bandwidths
For small k
– A large occupancy margin is unlikely, as the contribution from Om would be large
– Adding a second coherence engine would not be useful: less concurrency
Extra DRAM bandwidth may convert a negative occupancy margin to positive
Hot-spots and Bank Conflicts
Bank conflicts can delay DRAM accesses of a burst of requests
– Hot-spots often arise from access to the same cache block system-wide: an obvious case of bank conflict
– Can multiple coherence engines help?
Since Om > Op on average, for two conflicting requests the total protocol occupancy (2Op) is hidden under memory access latency (2Om)
– Multiple coherence engines will not improve performance
Hot-spots and Bank Conflicts
What about row buffer hits?
– The first request in a batch will suffer a row buffer miss
– The subsequent ones will enjoy hits with high probability
– Row buffer hits lower the average value of Om and may uncover portions of kOp even in the case of k conflicting requests
– Required condition: kOp > Omiss + (k-1)Ohit + kOc/w for a w-channel DRAM
– Simplifying: Op > Om + Oc/w, which contradicts Om > Op (typical of integrated controllers)
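A quick numeric check of the row-buffer condition. The Omiss = 55 ns and Ohit = 15 ns latencies are hypothetical round numbers (only Op ≈ 8 ns is in the ballpark of the 1.6 GHz measurements, with Oc = 20 ns as elsewhere):

```python
def hotspot_margin(op, o_miss, o_hit, k, oc, w=1):
    """Per-request margin for k bank-conflicting requests where the
    first misses the row buffer and the remaining k-1 hit:
    Op - (Omiss + (k-1)*Ohit)/k - Oc/w."""
    return op - (o_miss + (k - 1) * o_hit) / k - oc / w

# Even with cheap row-buffer hits, Om > Op keeps the margin negative,
# so a second engine cannot speed up a hot lock or flag.
print(hotspot_margin(8.0, 55.0, 15.0, 8, 20.0))  # → -32.0
```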
Sketch
Background
Memory controller architecture
Analytical model
Evaluation framework
Simulation results
– Validation of model for directory protocols
– Directory-less broadcast protocols
– Multiprogramming
Summary
Evaluation Framework
Evaluates both integrated protocol processors (PPs) and threads in SMTp
Depending on the integration level, the memory controller and PP can be clocked at different frequencies
– Explores 400 MHz, 800 MHz, 1.6 GHz with a 1.6 GHz main SMT processor
Protocol threads in SMTp, by design, always run at full frequency (1.6 GHz)
Each node has an SMT processor and runs up to four ATs (64-threaded apps) and two PTs
Flashback: Flexible Processing
[Repeats the earlier block diagrams: the PP organization (separate in-order protocol processor with its own IL1/DL1 beside the OOO SMT core, L2, MC, SDRAM banks, and router) versus SMTp (protocol threads share the OOO SMT core and its caches).]
Sketch
Background
Memory controller architecture
Analytical model
Evaluation framework
Simulation results
– Validation of model for directory protocols
– Directory-less broadcast protocols
– Multiprogramming
Summary
400 MHz Protocol Processors: Execution Time
[Chart: up to 8% execution-time improvement with a second 400 MHz PP.]
400 MHz Protocol Processors: Dispatcher's Wait Cycles
400 MHz Protocol Processors: Occupancy
Model Validation: 400 MHz PP
Oc is fixed at 20 ns (128B @ 6.4 GB/s)
Predict from 1PP measurements:

            Op (ns)  Om (ns)  kmax  OM (ns)
FFT          30.3     54.6      5     -0.6
FFTW         29.4     54.3      7      1.6
LU           25.3     40.1      6     -1.4
Ocean        36.6     54.4      7      8.8
Radix-Sort   33.8     45.8      4      2.3
Water        28.1     40.0      4     -1.9
1.6 GHz Protocol Processors: Execution Time
1.6 GHz Protocol Processors: Dispatcher's Wait Cycles
1.6 GHz Protocol Processors: Occupancy
Model Validation: 1.6 GHz PP
Oc is fixed at 20 ns (128B @ 6.4 GB/s)
Predict from 1PP measurements:

            Op (ns)  Om (ns)  kmax  OM (ns)
FFT           8.4     54.4     12    -16.1
FFTW          7.5     53.9     16    -15.9
LU            6.3     40.2     16    -16.2
Ocean         9.4     54.6     16    -14.0
Radix-Sort    8.8     45.8      8    -16.9
Water         7.2     40.0      8    -17.8
SMTp: Execution Time
SMTp: Dispatcher's Wait Cycles
SMTp: Occupancy
Model Validation: SMTp
Oc is fixed at 20 ns (128B @ 6.4 GB/s)
Predict from 1PP measurements:

            Op (ns)  Om (ns)  kmax  OM (ns)
FFT          15.3     55.2     10    -10.2
FFTW         15.6     57.6     14     -8.5
LU           13.1     42.3     16     -9.5
Ocean        17.5     55.9     16     -6.0
Radix-Sort   16.6     51.8     11     -8.1
Water        14.4     40.1      8    -10.6
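As a sanity check, every published SMTp occupancy margin above is reproduced by the single-channel formula Op - Om/kmax - Oc with Oc = 20 ns:

```python
# (Op, Om, kmax, published OM) per application, from the SMTp table above.
rows = {
    "FFT":        (15.3, 55.2, 10, -10.2),
    "FFTW":       (15.6, 57.6, 14,  -8.5),
    "LU":         (13.1, 42.3, 16,  -9.5),
    "Ocean":      (17.5, 55.9, 16,  -6.0),
    "Radix-Sort": (16.6, 51.8, 11,  -8.1),
    "Water":      (14.4, 40.1,  8, -10.6),
}
for app, (op, om, k, om_published) in rows.items():
    # Single-channel margin: Op - Om/kmax - Oc, rounded to 0.1 ns.
    assert round(op - om / k - 20.0, 1) == om_published
print("all SMTp margins reproduced")
```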
Summary: Execution Time
Take-away Points
Doubling controller frequency is always better than adding a second one
– Reducing individual handler occupancy is more important than reducing the occupancy of a burst
– Ocean shows the importance of burst mix
Increasing frequency has diminishing returns
– Instead of building complex hardwired protocol engines, dedicate a thread or core in the emerging processors
Application to Many-core
Multi-node many-core systems have a natural hierarchy of coherence
– Private last levels (L1 or L2) kept coherent with the next shared level (L2 or L3) via a directory protocol (Niagara, Power5)
– Multiple nodes employ a global directory protocol
– Our model applies to both levels
– The on-chip coherence controllers per bank in a shared NUCA are unique in one sense
  Om and Oc are much smaller than for DRAM: our model may dictate a positive OM depending on burstiness (k) and protocol complexity (Op)
Directory-less Broadcast Protocols
Requester sends request to home, home broadcasts it to all and replies to requester, all snoop local caches and reply to requester, requester picks correct response
– Still NUMA (a la AMD Opteron)
– 26.1 messages per miss (compare with 2.5 in directory protocol)
A lot of concurrency in the coherence processing layer
– 16.1% average improvement when a second coherence engine is introduced (averaged across FFT, FFTW, Ocean, Radix-Sort)
Single-node Multiprogramming
Multiprogrammed workloads have a large data footprint and no sharing across threads
– A lot of outer-level cache misses; exercises the coherence engine a lot more than parallel applications
– Our model correctly dictates that there is no gain in introducing a second controller
  Op is too small to satisfy the inequalities
– Take-away point: coherence bandwidth requirement is not directly related to cache miss rate
Single-node Multiprogramming
Prior Research
Programmable coherence controllers
– Stanford FLASH, Wisconsin Typhoon, Sequent STiNG, Sun S3.mp, Compaq Piranha CMP, Newisys Opteron-Horus
– All controllers are off-chip (and hence lag by at least two generations of process)
Multiple coherence controllers
– Explored with SMP nodes having off-chip directory controllers (IBM, UIUC, Purdue)
– Local/remote address partitioning in Opteron-Horus, STiNG, and S3.mp
Summary
A useful model for coherence layer designers
– Simple and intuitive inequality rules
– DRAM latency contributes little to this decision in the case of highly bursty applications
– Bank-conflicting requests enjoy little or no benefit from parallel coherence stream processing (common case for hot locks/flags)
Two controllers improve performance by up to 8% when the frequency ratio is four
Broadcast protocols enjoy larger benefit
Acknowledgments
Anshuman Gupta (UCSD)
– Preliminary simulations (part of independent study at IIT Kanpur)
Varun Khaneja (AMD)
– Development of directory-less broadcast protocol (part of MTech thesis at IIT Kanpur)
IIT Kanpur Security Center
– For hosting part of simulation infrastructure
Integrated Memory Controllers with Parallel Coherence Streams
Mainak Chaudhuri    Mark Heinrich
IIT Kanpur    University of Central Florida
THANK YOU!
To appear in IEEE TPDS