Adaptive Memory Reconfiguration Management:
The AMRM Project
Rajesh Gupta, Alex Nicolau
University of California, Irvine
Andrew Chien
University of California, San Diego
DARPA DIS PI Meeting, Santa Fe, October 1998
Outline
• Project Drivers
– application needs for diverse (cache) memory configurations
– technology trends favoring reconfigurability in high-performance designs
• Project Goals and Deliverables
• Project Implementation Plan
• Project Team
• Summary of New Ideas Proposed by AMRM
[Figure: memory hierarchy. CPU with TLB, L1 (3 cycles, 2 GB/s), L2 (33 cycles), main memory (57-72 MB/s); ~10^6 cycles from disk]
Introduction
• Many defense applications are data-starved
– large data-sets, irregular locality characteristics
» FMM Radar Cross-section Modeling, OODB, CG
• Memory access times falling behind CPU speeds
– increased memory penalty and data starvation.
• No single architecture works well:
– data-intensive applications need a variety of strategies to deliver high performance according to application memory reference needs:
» multilevel caches/policies
» intelligent prefetching schemes
» dynamic “cache-like” structures: prediction tables, stream caches, victim caches
» even simple optimizations like block size selection improve performance significantly.
Technology Evolution
[Chart: wire delay (ns/cm) vs. year, 1989-2007]
Evolutionary growth, but its effects are subtle and powerful!
Industry continues to outpace NTRS projections on technology scaling and IC density.
[Chart: feature size (nm) vs. year of shipment, 1997-2012, comparing the NTRS-94 and NTRS-97 roadmaps across the 250, 180, 130, 100, and 70 nm nodes]
Average interconnect delay is greater than the gate delay!
• Reduced marginal cost of logic, coupled with signal regeneration, makes it possible to include logic in inter-block interconnect.
Consider Interconnect
[Chart: average interconnect length (um) vs. feature size (nm, 1000 down to 100), showing static and dynamic interconnect crossing the critical length in a cross-over region]
The Opportunity of Application-Adaptive Architectures
• Use interconnect and data-path reconfiguration to
– adapt architectures for increased performance, combat performance fragility, and improve fault tolerance
• AMRM's technological basis is reconfigurable hardware:
– configurable hardware is used to improve utilization of performance-critical resources (instead of using configurable hardware to build additional resources)
– design goal is to achieve peak performance across applications
First quantitative answers to the utility of architectural adaptation were provided by the MORPH Point Design Study (PDS)
MORPH Point Design Study: Custom Mechanisms Explored
• Combat latency deterioration
– optimal prefetching:
» “memory side pointer chasing”
– blocking mechanisms
– fast barrier, broadcast support
– synchronization support
• Bandwidth management
– memory (re)organization to suit application characteristics
– translate and gather hardware
» “prefetching with compaction”
• Memory controller design
Adaptation for Latency Tolerance
• Operation
1. Application sets prefetch parameters (compiler controlled)
» set lower/upper bounds on memory regions (for memory protection etc.)
» download pointer extraction function
» set element size
2. Prefetch event generation (runtime controlled)
» when a new cache block is filled
[Diagram: a Prefetcher sits between CPU/L1 and the L2 cache, observing virtual addresses/data and physical addresses and issuing additional addresses for prefetched data]
if (start <= vAddr && vAddr <= end) {
    if (pAddr & 0x20)
        addr = pAddr - 0x20;
    else
        addr = pAddr + 0x20;
    <initiate fetch of cache line at addr into L1>
}
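The predicate above can be written as a plain C helper. This is a sketch: the function name, the return-zero-for-no-prefetch convention, and the fixed 0x20 line toggle follow the slide's example rather than any fixed AMRM interface.

```c
#include <stdint.h>

/* Sketch of the compiler-configured prefetch predicate: when a fill's
   virtual address falls in [start, end], prefetch the companion 32-byte
   line of the same 64-byte block (toggle bit 0x20 of the physical
   address). Returns the physical address to prefetch, or 0 if none. */
static uint64_t prefetch_addr(uint64_t vAddr, uint64_t pAddr,
                              uint64_t start, uint64_t end) {
    if (start <= vAddr && vAddr <= end) {
        if (pAddr & 0x20)
            return pAddr - 0x20;
        else
            return pAddr + 0x20;
    }
    return 0;  /* outside the configured region: no prefetch */
}
```

For example, a fill at physical address 0x1000 inside the configured region triggers a prefetch of 0x1020, its companion line.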
Adaptation for Bandwidth Reduction
• Prefetching Entire Row/Column
• Pack Cache with Used Data Only
[Diagram: program view vs. physical layout. The processor issues accesses through address translation; gather logic walks the (val, RowPtr, ColPtr) triples in memory and synthesizes packed (val, col) pairs, so the L1 cache holds only the values actually used]
• No Change in Program Logical Data Structures
• Partition Cache
• Translate Data
• Synthesize Pointer
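The gather step can be sketched in software (hypothetical struct layout and function; the actual mechanism is hardware between memory and cache): only the fields a kernel actually touches are packed densely, so each cache line carries used data only.

```c
/* Software sketch of translate-and-gather: the program sees
   (val, row, col) triples scattered in memory; the gather step packs
   only the fields used by the kernel (val and col here) into dense
   buffers. The struct layout is illustrative, not the AMRM format. */
struct elem { double val; int row; int col; };

static int gather_row(const struct elem *a, int n, int row,
                      double *vals, int *cols) {
    int k = 0;
    for (int i = 0; i < n; i++) {
        if (a[i].row == row) {      /* keep only this row's entries */
            vals[k] = a[i].val;
            cols[k] = a[i].col;
            k++;
        }
    }
    return k;  /* number of packed entries */
}
```

For example, gathering row 0 of {(1.0, 0, 0), (2.0, 1, 1), (3.0, 0, 2)} packs two entries: vals {1.0, 3.0} with cols {0, 2}.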
Adaptation Results
[Charts: read/write miss rate (%) and data traffic (MB) for Naive, SW-Blocking, HW Gather, and HW Bypass configurations]
Hardware Block | LSI 10K Cells | Xilinx CLBs | Delay (cycles)
Prefetcher     | 4083          | 1558        | 3
Gather         | 627           | 1408        | 3
Translate      | 557           | 1378        | 2
10x reduction in miss rate.
100x reduction in BW.
Going Beyond PDS
• Memory hierarchy utilization
– estimate working set size
– memory grain size
– miss types: conflict, capacity, coherence, cold-start
– memory access patterns: sequential, stride prediction
– assess marginal miss rates and “what-if” scenarios
• Dynamic cache structures
– victim caches, stream caches, stride prediction, buffers
• Memory bank conflicts
– detect array references that cause bank conflicts
• PE load profiling
• Continuous validation hardware
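One of the dynamic structures listed above, a stride predictor, can be modeled in a few lines of C. This is an illustrative sketch: the table size, the modulo hash on the PC, and the two-hit confidence rule are assumptions, not the AMRM design.

```c
#include <stdint.h>

#define TABLE 64  /* illustrative table size */

/* Per-PC stride predictor entry: last address seen, last stride,
   and whether the same stride was observed twice in a row. */
struct entry { uint64_t last; int64_t stride; int confident; };
static struct entry tab[TABLE];  /* zero-initialized */

/* Record an access by (pc, addr); return the predicted next address,
   or 0 while the stride has not yet repeated. */
static uint64_t stride_predict(uint64_t pc, uint64_t addr) {
    struct entry *e = &tab[pc % TABLE];
    int64_t s = (int64_t)(addr - e->last);
    e->confident = (s == e->stride);  /* same stride twice in a row? */
    e->stride = s;
    e->last = addr;
    return e->confident ? addr + (uint64_t)s : 0;
}
```

A load walking an array with stride 100 warms up in two accesses: after seeing 100 and 200 the predictor emits 300, then 400, and so on.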
Challenges in Building AA Architectures
• Without automatic application analysis, application adaptation is still largely subject to hand-crafting
– compiler support for identification and use of appropriate architectural assists is crucial
• Significant semantic loss occurs when going from application to compiler-level optimizations.
• The runtime system must actively support architectural customization safely.
Project Goals
• Design an Adaptive Memory Reconfiguration Management (AMRM) system that provides
– 100X improvement in hierarchical memory system performance over conventional static memory hierarchy in terms of latency and available bandwidth.
• Develop compiler algorithms that statically select adaptation of memory hierarchy on a per application basis
• Develop operating system and architecture features which ensure process isolation, error detection and containment for a robust multi-process computing environment.
Project Deliverables
• An architecture for adaptive memory hierarchy
• Architectural mechanisms and policies for efficient memory system adaptation
• Compiler support (identification and selection) of the machine adaptation
• OS and HW architecture features which enable process isolation, error detection, and containment in dynamic adaptive systems.
Impact
• Optimized data placement and movement through the memory hierarchy yields per-application sustained performance close to peak machine performance
– particularly for applications with non-contiguous large data-sets such as
» sparse-matrix and conjugate gradient computations, circuit simulation
» data-base (relational and object-oriented) systems
» imaging data
» security-sensitive applications
Impact (continued)
• Integration with core system mechanisms enables multi-process, robust and safe computing
– enables basic software modularity through processes on adaptive hardware
– ensures static and dynamic adaptation will not compromise system robustness -- errors generally confined to a single process
– provides mechanisms for online validation of dynamic adaptation (catch compiler and hardware synthesis errors) enabling fallback to earlier versions for correctness
• High system performance using standard CPU components
– adaptive cache management achieved using reconfigurable logic, compiler, and OS smarts
– 15-20X improvement in sparse matrix/conjugate gradient computations
– 20X improvement in radar cross section modeling code
– high system performance without changing computation resources preserves the DOD investment into existing software
The AMRM Project: Enabling Co-ordinated Adaptation
[Diagram: coordinated adaptation across three thrusts.
1. Flexible Memory System Architecture: the base machine (CPU, TLB, L1, L2, memory) is augmented with adaptive cache structures (victim cache, stride predictor, prefetcher, stream cache, miss stride buffer, stream buffer, write buffer) built from reconfigurable logic.
2. Compiler Control of Cache Adaptation: application analysis, compilation for adaptive memory, application instrumentation for runtime adaptation, and synthesis & mapping software produce the adaptive machine definition.
3. Safe and Protected Execution: operating system strategies, fault detection and containment, and continuous validation.]
Project Organization
• Three coordinated thrusts
T1: design of a flexible memory system architecture
T2: compiler control of the adaptation process
T3: safe and protected execution environment
• System architecture enables machine adaptation
– by implementing architectural assists, mechanisms, and policies
• Compiler enables application-specific machine adaptation
– by providing powerful memory behavior analysis techniques
• Protection and validation enables a robust multi-process software environment
– by ensuring process isolation and online validation
Project Personnel
• Project Co-PIs
– Professor Rajesh Gupta, UC Irvine
– Professor Alex Nicolau, UC Irvine
– Professor Andrew Chien, UC San Diego
• Collaborators
– Dr. Phil Kuekes, HP Laboratories, Palo Alto
• Research Specialist
– Dr. Alexander Veidenbaum, UC Irvine
• Graduate Research Assistants
– Prashant Arora, Xiaomei Ji, Dan Nicolaescu, Rajesh Satapathy, Chang Chun, Weiyu Tang, Yibo Jiang, Louis Giannini, Jay Byun
• Contract Technical Monitor
– Dr. Larry Carter, AIC, Fort Huachuca, AZ
Summary of New Ideas in AMRM
1. Application-adaptive architectural mechanisms and policies for memory latency and bandwidth management:
– combat latency deterioration using hardware-assisted blocking, prefetching
– manage bandwidth through adaptive translation, movement, and placement of application data for the most efficient access
– cache organization, coherence, and dynamic cache structures are modified as needed by an application
2. Cache memory adaptation is driven by compiler techniques
– semantic retention applied at language and architectural levels
– control memory adaptation and maintain machine usability through application software
3. OS and architecture features enable process isolation and online validation of adaptations
– OS and architecture features enable error detection, isolation, and containment; online validation extends to dynamic adaptations
– modular, robust static and dynamic reconfiguration with precise characterization of isolation properties