An Exploration of Multi-Core Memory Architectures
Presenters: Hemsley Pichardo and Rajeev Verma

Source: meseec.ce.rit.edu/756-projects/fall2013/1-4.pdf


Page 1:

An Exploration of Multi-Core Memory Architectures

Presenters:

Hemsley Pichardo and

Rajeev Verma

Page 2:

Summary
• Introduction
• Current trends in multi-core chips
• Tiled Chip Architecture Designs
  o Hierarchical, Pipelined and Array
• Multi-Core Design Alternatives
  o Shrink, Shrink & Merge, Constant Die and Single Chip
• Cache Coherence for CMP
• Baseline Protocol
• Optimization in Cache Coherence
• CoCCA Architecture and Protocol (CoCCA: Co-designed Coherent Cache Architecture)
• CoCCA Pattern Table
• CoCCA Home Node Management
• CoCCA Transaction Message Model

Page 3:

Introduction

• Designers are turning to multi-core systems on chip.
• They are doing so to counteract problems encountered as today's microprocessors advance:
  o Increased clock rates
  o Complex designs
  o Interconnects
  o Power limits
  o Cache sharing (coherence)
  o Networking issues

Page 4:

Current Technologies
• First: pure logic technology
  o The basic transistor is a low-threshold device that can switch quickly; the natural 6-transistor SRAM that can be built from it serves as the memory basis.
• Second: DRAM-based, called "embedded DRAM" (eDRAM)
  o Higher density, but slower access than SRAM.
• Third: pure DRAM technology
  o The basic transistor is a higher-threshold device that does not leak as much.
  o This is important when the data is stored in a capacitor.

Page 5:

Current Technologies

Variations in Memory Technology

Page 6:

DESIGN CONSTRAINTS AND OVERHEADS

Technology Curves Including Overhead

Page 7:

Tiled Chip Architectural Designs
• Three types of emerging architectures: hierarchical, pipelined and array.

Page 8:

Hierarchical Designs
• Multiple cores share multiple caches in tree-like configurations, with the caches at each level of higher capacity than the prior level.
• The root "cache" has the final off-chip connection to external memory.
• This means the off-chip bandwidth must increase in proportion to the product of the number of cores (density) and the local clock rate.
• If all cores are arranged to appear as a single integrated SMP node, this bandwidth increases only with clock rate.
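The two scaling regimes above can be sketched as a back-of-envelope calculation. The bytes-per-core-per-cycle figure below is purely an assumed illustration, not a number from the slides:

```c
/* Back-of-envelope sketch of the two off-chip bandwidth regimes. */
double tree_offchip_gbs(int cores, double clock_ghz, double bytes_per_cycle) {
    /* every core's traffic crosses the root: scales with cores * clock */
    return cores * clock_ghz * bytes_per_cycle;
}

double smp_offchip_gbs(double clock_ghz, double bytes_per_cycle) {
    /* single integrated SMP node: scales with clock rate only */
    return clock_ghz * bytes_per_cycle;
}
```

With 16 cores at 2 GHz moving an assumed 1 byte per core per cycle, the tree regime demands 32 GB/s off chip while the SMP arrangement demands only 2 GB/s.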

Page 9:

Pipelined Designs
• High-speed data enters from one chip and passes successively through different cores; at each core a different processing step is performed.
• Processed data leaves the last core to proceed off chip.
• Increasing the local clock rate decreases the number of cores needed for a pipeline, increasing the area available for additional pipelines.
• Increasing the number of pipelines increases the number of I/O ports linearly.

Page 10:

Array Designs
• The on-chip memory is physically divided into separate banks, with processing logic nestled next to each bank.
• Common interconnect and control logic is often centralized and provides overall synchronization and interaction.
• For some designs, the off-chip bandwidth needs are relatively independent of the number of cores.
• Other designs have bandwidth needs that vary more like the "surface area" of the array of nodes than the "volume" (total number of nodes).

Page 11:

Multi-Core Design Alternatives
1. Shrink
  • Assumes constant chip contents but advancing technology.
2. Shrink & Merge
  • Similar to 1
  • Conversion of SRAM to DRAM
  • Integration of off-chip memories with on-chip DRAM
3. Constant Die
  • Same as 2
  • Die size constant
  • Basic architecture constant
  • Increases number of cores to fill the die
4. Single Chip
  • Combines memory and cores in ways that reduce overhead structures
  • Maintains a constant storage/performance ratio

Page 12:

Multi-Core Design Results : Shrink

Page 13:

Multi-Core Design Results : Shrink & Merge

Page 14:

Multi-Core Design Results : Constant Die

Page 15:

Multi-Core Design Results : Single Chip

Page 16:

Cache Coherence for CMP Architectures
• Embedded applications: image, video, data stream and workflow processing.
• These applications tend to access data in patterns.
• These patterns can be used to optimize the cache coherence protocol, by prefetching data and reducing the number of memory transactions.
• A scalable network (Network on Chip, NoC), usually based on a mesh topology, is used to connect the cores.
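On such a mesh NoC, the cost of a coherence message grows with hop distance. A minimal sketch, assuming row-major tile numbering and dimension-ordered (XY) routing (a common choice, though the slides do not specify the routing algorithm):

```c
#include <stdlib.h>

/* Hop distance between two tiles on a W-wide 2D mesh NoC with
 * XY routing; tiles are numbered row-major (0 .. W*H-1). */
int mesh_hops(int src, int dst, int width) {
    int dx = abs(src % width - dst % width);  /* horizontal hops */
    int dy = abs(src / width - dst / width);  /* vertical hops   */
    return dx + dy;
}
```

On the 7x7 mesh used later in the evaluation, a message between opposite corners (tiles 0 and 48) takes 12 hops, which is why cutting message counts matters.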

Page 17:

Cache Coherence for CMP Architectures (cont.)
• Cache coherence is either managed directly by the programmer or falls under the control of a cache coherence unit (usually hardware-based).

• Coherence issues occur when:
  o Data are replicated in the cache memories of different cores, due to concurrent read and write operations.
  o Copies of a variable are present in multiple caches.
  o A write by one processor may not become visible to others.
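The visibility problem above, in miniature (the two-element array standing in for two private caches is purely illustrative):

```c
/* Two private "caches" hold copies of x; core 0's write updates its
 * own copy and memory, but with no coherence protocol nothing
 * invalidates core 1's copy, so core 1 keeps reading a stale value. */
int memory_x   = 1;
int cache_x[2] = {1, 1};            /* each core has cached x = 1 */

void core0_write(int v) {
    cache_x[0] = v;                 /* update own cached copy            */
    memory_x   = v;                 /* write back to memory...           */
}                                   /* ...but core 1 is never notified   */

int core1_read(void) {
    return cache_x[1];              /* hit on the stale cached copy */
}
```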

Page 18:

Baseline Protocol
• Basic solution for coherence.
• To maintain consistency, one popular approach is to use a four-state, directory-based cache coherence protocol.

Page 19:

Baseline Protocol (cont.)
● The coherence state field represents four states (MESI):
  ○ M (modified): a single valid copy exists across the whole system, and it has been modified (it is dirty relative to memory); the core owning this copy is called the Owner.
  ○ E (exclusive): a single valid copy exists across the whole system, still consistent with memory; the core owning this copy is also named the Owner.
  ○ S (shared): multiple copies of the data exist, all in read-only mode. Any associated core is named a Sharer.
  ○ I (invalid): the copy is currently invalid, should not be used, and will be discarded.
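The per-line MESI bookkeeping above can be sketched as follows; the exact field layout is an assumption (real protocols pack these bits into tag storage):

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } mesi_state;

typedef struct {
    mesi_state state;
    uint64_t   tag;    /* which line this entry tracks */
} cache_line;

/* A line may be read locally in any valid state (M, E or S). */
bool can_read(const cache_line *l)  { return l->state != MESI_I; }

/* A write needs exclusive ownership: E or M. */
bool can_write(const cache_line *l) {
    return l->state == MESI_E || l->state == MESI_M;
}

/* A write hit in E silently upgrades the line to M (now dirty). */
void write_hit(cache_line *l) {
    if (l->state == MESI_E) l->state = MESI_M;
}
```

Any other transition (e.g. a write in S) requires a message exchange with the home node, which is exactly the traffic CoCCA tries to reduce.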

Page 20:

Optimization in cache coherence

• Working on columns of an image can be accelerated with the help of data access patterns.
• Patterns can be used to speculate on the next accesses, prefetching data to where they will most likely be used in the near future.
• Patterns can also be used to save bandwidth by reducing the number of protocol messages: one transaction can provide access to a whole set of data.
• This is where CoCCA comes into the picture.

• CoCCA : Co-designed Coherent Cache Architecture.

Page 21:

CoCCA Architecture and Protocol
Principles and motivation:
● CoCCA provides support for managing regular memory access patterns.
● CoCCA uses "speculative messages" to manage the patterns.
● CoCCA uses a hardware component which:
  ○ stores patterns,
  ○ controls transactions.
● If the request matches a pattern, the requester sends a speculative message to the Hybrid Home Node (HHN).
● Otherwise, the requester sends a baseline message to the Baseline Home Node (BHN).
● Advantages:
  ○ Reduced message throughput,
  ○ Lower memory access latency.

Page 22:

CoCCA Architecture and Protocol: Principle

Page 23:

CoCCA Architecture and Protocol
Roles of a core:
• Requester: the core asking for data.
• Home Node: the core in charge of tracking the coherence information of a given piece of data in the system.
• Sharer: a core which has a copy of the data in its cache in "shared" mode. Multiple copies of this data can exist on other cores.
• Owner: a core which has an "Exclusive" or "Modified" copy of the data in its cache. Only one copy of this data can exist in the whole system.

Page 24:

CoCCA Pattern Table

● Patterns are used to fetch data according to spatial locality.
● A "trigger" is used to recognize the specific signature of a pattern; the signature can be the base address.
● Fields of a pattern table entry:
  Base address: the address of the first cache line of the pattern
  Size: number of elements
  Stride: distance between two accesses of the pattern
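The three fields above can be sketched as a C struct, together with the expansion a speculative request enables: from the base address alone, the receiver can reconstruct every address the pattern covers. Field names and widths are assumptions based on the slide's description:

```c
#include <stddef.h>
#include <stdint.h>

/* One CoCCA-style pattern table entry (field layout assumed). */
typedef struct {
    uint64_t base;   /* address of the first cache line of the pattern */
    size_t   size;   /* number of elements */
    uint64_t stride; /* distance in bytes between two accesses */
} pattern_entry;

/* Expand a pattern into the addresses it covers; returns the count. */
size_t pattern_expand(const pattern_entry *p, uint64_t *out, size_t max) {
    size_t n = 0;
    for (size_t i = 0; i < p->size && n < max; i++)
        out[n++] = p->base + (uint64_t)i * p->stride;
    return n;
}
```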

Page 25:

CoCCA Pattern Table (cont.)
• The pattern descriptor describes a CoCCA pattern table entry.
• Desc is the pattern descriptor that results from applying the function fn() with the given parameters.
• Example pattern table in C:
• Functions such as PatternNew(), PatternAddLength() and PatternAddStride() can be used to build up the table.

Page 26:

CoCCA Pattern Table (cont.)
C functions used to update the pattern table:
• PatternNew(): creates a pattern,
• PatternAddOffset(): adds an offset entry,
• PatternAddLength(): adds a length entry,
• PatternAddStride(): adds a stride entry,
• PatternFree(): releases the pattern after use.
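One plausible sketch of this API is shown below. The slide names the functions but not their signatures, so the parameter types and the backing struct are guesses:

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t base;
    uint64_t offset;
    uint64_t stride;
    size_t   length;
} Pattern;

Pattern *PatternNew(uint64_t base) {            /* create a pattern */
    Pattern *p = calloc(1, sizeof *p);
    if (p) p->base = base;
    return p;
}
void PatternAddOffset(Pattern *p, uint64_t o) { p->offset = o; }
void PatternAddLength(Pattern *p, size_t len) { p->length = len; }
void PatternAddStride(Pattern *p, uint64_t s) { p->stride = s; }
void PatternFree(Pattern *p)                  { free(p); }
```

A column walk through a 640-pixel-wide image, for example, might be registered with length 480 and a 640-byte stride (assuming one byte per pixel, an illustrative figure).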

Page 27:

CoCCA: Protocol and Home Node Management
Characteristics of the hybrid architecture:
• Distinguishes between baseline and speculative messages.
• Speculative messages permit reading all addresses of a pattern through its base address.
• Speculative message requests use page granularity.
• A round-robin method is used to choose the Home Node (HN).
• The Home Node (HN) is still used for message management.
• The initial step is always to determine the HN for the requested data.
• The coherence information about this data is kept in extra storage called the "Coherence Directory".
• The authors use line granularity to determine the HN and page granularity for the CoCCA protocol.
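Line-granularity home-node selection typically interleaves cache-line addresses across the cores; a minimal sketch, with the 64-byte line size as an assumption and the 49-core count taken from the 7x7 evaluation setup:

```c
#include <stdint.h>

/* Map an address to its home node by interleaving line addresses
 * across the cores (line size and core count are illustrative). */
int home_node(uint64_t addr, uint64_t line_bytes, unsigned num_cores) {
    return (int)((addr / line_bytes) % num_cores);
}
```

Consecutive lines thus land on consecutive home nodes, which spreads directory load but also means a strided pattern touches many home nodes unless the protocol, as here, groups speculative requests at page granularity.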

Page 28:

CoCCA: Transaction Message Model
Read transaction message

Page 29:

CoCCA: Transaction Message Model
• Comparison between the baseline and pattern-based approaches
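The core of the comparison can be reduced to request-message counts; the functions below are an illustrative simplification (response traffic still carries the data either way):

```c
/* Requests to fetch N cache lines: the baseline protocol issues one
 * request per line; a single speculative (pattern) request covers
 * the whole set described by the pattern. */
int baseline_requests(int lines) { return lines; }
int pattern_requests(int lines)  { (void)lines; return 1; }
```

For a 480-line column pattern this is 480 baseline requests versus a single speculative one, which is the source of the throughput reduction reported later.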

Page 30:

CODE INSTRUMENTATION AND FIRST EVALUATION
• Application used: a cascading convolution filter, very typical of image processing or preprocessing.
• Source and destination images have a resolution of 640x480.
• The CMP architecture is chosen as a 7x7 processor matrix, each core with 256 KB of L2 cache.
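A quick footprint check for this setup (the 4-byte pixel size is an assumption; the slides give only the resolution):

```c
/* Frame footprint vs. aggregate L2 of the 7x7 CMP. */
long frame_bytes(int w, int h, int bytes_per_pixel) {
    return (long)w * h * bytes_per_pixel;
}
long aggregate_l2_bytes(int cores, long per_core_bytes) {
    return cores * per_core_bytes;
}
```

A 640x480 frame at 4 bytes per pixel is about 1.2 MB, while 49 cores x 256 KB gives roughly 12.25 MB of aggregate L2, so both source and destination images fit on chip and the traffic of interest is core-to-core coherence rather than off-chip misses.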

Page 31:

CoCCA: Instrumentation
● The Pin/Pintool software is used to extract shared-data reads and writes for each core.
● The "inscount" pintool instruments a program to count the number of instructions executed.
● The API can also be modified to obtain a per-core instruction count.
● Example:

Page 32:

CoCCA: Approach to patterns

Page 33:

CoCCA: Results and Conclusions
• Results from periodic execution of the program:
• A 37% reduction in message throughput.
• A new hardware component is introduced, which can be used to store and retrieve patterns.
• On the benchmark, the evaluation shows a performance boost of over 60%.

Page 34:

Questions?

Page 35:

Thank you!