Adaptive Memory Reconfiguration Management:
The AMRM Project
Rajesh Gupta, Alex Nicolau
University of California, Irvine
Andrew Chien
University of California, San Diego
DARPA DIS PI Meeting, Santa Fe, October 1998
Outline
• Project Drivers
– application needs for diverse (cache) memory configurations
– technology trends favoring reconfigurability in high-performance designs
• Project Goals and Deliverables
• Project Implementation Plan
• Project Team
• Summary of New Ideas Proposed by AMRM
[Figure: memory hierarchy. CPU with TLB, L1 (3 cycles, 2 GB/s), L2 (33 cycles), main memory (57-72 MB/s); ~10^6 cycles from disk]
Introduction
• Many defense applications are data-starved
– large data-sets, irregular locality characteristics
» FMM Radar Cross-section Modeling, OODB, CG
• Memory access times falling behind CPU speeds
– increased memory penalty and data starvation.
• No single architecture works well:
– data-intensive applications need a variety of strategies to deliver high performance according to application memory reference needs:
» multilevel caches/policies
» intelligent prefetching schemes
» dynamic “cache-like” structures: prediction tables, stream caches, victim caches
» even simple optimizations like block size selection improve performance significantly.
Technology Evolution
[Chart: wire delay (ns/cm) vs. year, 1989-2007]
Evolutionary growth, but its effects are subtle and powerful!
Industry continues to outpace NTRS projections on technology scaling and IC density.
[Chart: feature size (nm) vs. year of shipment, 1997-2012, comparing the NTRS-94 and NTRS-97 roadmaps across the 250, 180, 130, 100, and 70 nm nodes]
Average interconnect delay is greater than the gate delay!
• Reduced marginal cost of logic, coupled with signal regeneration, makes it possible to include logic in inter-block interconnect.
Consider Interconnect
[Chart: average interconnect length (um) vs. feature size (nm, 1000 down to 100), showing static and dynamic interconnect crossing the critical length in a cross-over region]
The Opportunity of Application-Adaptive Architectures
• Use interconnect and data-path reconfiguration to
– adapt architectures for increased performance, combat performance fragility, and improve fault tolerance
• AMRM's technological basis is reconfigurable hardware:
– configurable hardware is used to improve utilization of performance-critical resources (instead of using configurable hardware to build additional resources)
– design goal is to achieve peak performance across applications
First quantitative answers to the utility of architectural adaptation were provided by the MORPH Point Design Study (PDS)
MORPH Point Design Study: Custom Mechanisms Explored
• Combat latency deterioration
– optimal prefetching:
» “memory side pointer chasing”
– blocking mechanisms
– fast barrier, broadcast support
– synchronization support
• Bandwidth management
– memory (re)organization to suit application characteristics
– translate and gather hardware
» “prefetching with compaction”
• Memory controller design
Adaptation for Latency Tolerance
• Operation
1. Application sets prefetch parameters (compiler controlled)
» set lower/upper bounds on memory regions (for memory protection etc.)
» download pointer extraction function
» set element size
2. Prefetch event generation (runtime controlled)
» when a new cache block is filled
[Diagram: a Prefetcher sits between CPU/L1 and the L2 cache, observing virtual addresses/data and physical addresses and issuing additional addresses for prefetched data]
if (start <= vAddr && vAddr <= end) {
    if (pAddr & 0x20)
        addr = pAddr - 0x20;
    else
        addr = pAddr + 0x20;
    <initiate fetch of cache line at addr into L1>
}
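The predicate above can be written as a plain C helper. This is a sketch: the function name, the return-zero-for-no-prefetch convention, and the fixed 0x20 line toggle follow the slide's example rather than any fixed AMRM interface.

```c
#include <stdint.h>

/* Sketch of the compiler-configured prefetch predicate: when a fill's
   virtual address falls in [start, end], prefetch the companion 32-byte
   line of the same 64-byte block (toggle bit 0x20 of the physical
   address). Returns the physical address to prefetch, or 0 if none. */
static uint64_t prefetch_addr(uint64_t vAddr, uint64_t pAddr,
                              uint64_t start, uint64_t end) {
    if (start <= vAddr && vAddr <= end) {
        if (pAddr & 0x20)
            return pAddr - 0x20;
        else
            return pAddr + 0x20;
    }
    return 0;  /* outside the configured region: no prefetch */
}
```

For example, a fill at physical address 0x1000 inside the configured region triggers a prefetch of 0x1020, its companion line.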
Adaptation for Bandwidth Reduction
• Prefetching Entire Row/Column
• Pack Cache with Used Data Only
[Diagram: program view vs. physical layout. The processor issues accesses through address translation; gather logic walks the (val, RowPtr, ColPtr) triples in memory and synthesizes packed (val, col) pairs, so the L1 cache holds only the values actually used]
• No Change in Program Logical Data Structures
• Partition Cache
• Translate Data
• Synthesize Pointer
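The gather step can be sketched in software (hypothetical struct layout and function; the actual mechanism is hardware between memory and cache): only the fields a kernel actually touches are packed densely, so each cache line carries used data only.

```c
/* Software sketch of translate-and-gather: the program sees
   (val, row, col) triples scattered in memory; the gather step packs
   only the fields used by the kernel (val and col here) into dense
   buffers. The struct layout is illustrative, not the AMRM format. */
struct elem { double val; int row; int col; };

static int gather_row(const struct elem *a, int n, int row,
                      double *vals, int *cols) {
    int k = 0;
    for (int i = 0; i < n; i++) {
        if (a[i].row == row) {      /* keep only this row's entries */
            vals[k] = a[i].val;
            cols[k] = a[i].col;
            k++;
        }
    }
    return k;  /* number of packed entries */
}
```

For example, gathering row 0 of {(1.0, 0, 0), (2.0, 1, 1), (3.0, 0, 2)} packs two entries: vals {1.0, 3.0} with cols {0, 2}.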
Adaptation Results
[Charts: read/write miss rate (%) and data traffic (MB) for Naive, SW-Blocking, HW Gather, and HW Bypass configurations]
Hardware Block | LSI 10K Cells | Xilinx CLBs | Delay (cycles)
Prefetcher     | 4083          | 1558        | 3
Gather         | 627           | 1408        | 3
Translate      | 557           | 1378        | 2
10x reduction in miss rate.
100x reduction in BW.
Going Beyond PDS
• Memory hierarchy utilization
– estimate working set size
– memory grain size
– miss types: conflict, capacity, coherence, cold-start
– memory access patterns: sequential, stride prediction
– assess marginal miss rates and “what-if” scenarios
• Dynamic cache structures
– victim caches, stream caches, stride prediction, buffers
• Memory bank conflicts
– detect array references that cause bank conflicts
• PE load profiling
• Continuous validation hardware
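One of the dynamic structures listed above, a stride predictor, can be modeled in a few lines of C. This is an illustrative sketch: the table size, the modulo hash on the PC, and the two-hit confidence rule are assumptions, not the AMRM design.

```c
#include <stdint.h>

#define TABLE 64  /* illustrative table size */

/* Per-PC stride predictor entry: last address seen, last stride,
   and whether the same stride was observed twice in a row. */
struct entry { uint64_t last; int64_t stride; int confident; };
static struct entry tab[TABLE];  /* zero-initialized */

/* Record an access by (pc, addr); return the predicted next address,
   or 0 while the stride has not yet repeated. */
static uint64_t stride_predict(uint64_t pc, uint64_t addr) {
    struct entry *e = &tab[pc % TABLE];
    int64_t s = (int64_t)(addr - e->last);
    e->confident = (s == e->stride);  /* same stride twice in a row? */
    e->stride = s;
    e->last = addr;
    return e->confident ? addr + (uint64_t)s : 0;
}
```

A load walking an array with stride 100 warms up in two accesses: after seeing 100 and 200 the predictor emits 300, then 400, and so on.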
Challenges in Building AA Architectures
• Without automatic application analysis, application adaptation is still largely subject to hand-crafting
– compiler support for identification and use of appropriate architectural assists is crucial
• Significant semantic loss occurs when going from application to compiler-level optimizations.
• The runtime system must actively support architectural customization safely.
Project Goals
• Design an Adaptive Memory Reconfiguration Management (AMRM) system that provides
– 100X improvement in hierarchical memory system performance over conventional static memory hierarchy in terms of latency and available bandwidth.
• Develop compiler algorithms that statically select adaptation of memory hierarchy on a per application basis
• Develop operating system and architecture features which ensure process isolation, error detection and containment for a robust multi-process computing environment.
Project Deliverables
• An architecture for adaptive memory hierarchy
• Architectural mechanisms and policies for efficient memory system adaptation
• Compiler support (identification and selection) of the machine adaptation
• OS and HW architecture features which enable process isolation, error detection, and containment in dynamic adaptive systems.
Impact
• Optimized data placement and movement through the memory hierarchy yields per-application sustained performance close to peak machine performance
– particularly for applications with non-contiguous large data-sets such as
» sparse-matrix and conjugate gradient computations, circuit simulation
» data-base (relational and object-oriented) systems
» imaging data
» security-sensitive applications
Impact (continued)
• Integration with core system mechanisms enables multi-process, robust and safe computing
– enables basic software modularity through processes on adaptive hardware
– ensures static and dynamic adaptation will not compromise system robustness -- errors generally confined to a single process
– provides mechanisms for online validation of dynamic adaptation (catch compiler and hardware synthesis errors) enabling fallback to earlier versions for correctness
• High system performance using standard CPU components
– adaptive cache management achieved using reconfigurable logic, compiler, and OS smarts
– 15-20X improvement in sparse matrix/conjugate gradient computations
– 20X improvement in radar cross section modeling code
– high system performance without changing computation resources preserves the DOD investment into existing software
The AMRM Project: Enabling Co-ordinated Adaptation
[Diagram: coordinated adaptation across three thrusts.
1. Flexible Memory System Architecture: the base machine (CPU, TLB, L1, L2, memory) is augmented with adaptive cache structures (victim cache, stride predictor, prefetcher, stream cache, miss stride buffer, stream buffer, write buffer) built from reconfigurable logic.
2. Compiler Control of Cache Adaptation: application analysis, compilation for adaptive memory, application instrumentation for runtime adaptation, and synthesis & mapping software produce the adaptive machine definition.
3. Safe and Protected Execution: operating system strategies, fault detection and containment, and continuous validation.]
Project Organization
• Three coordinated thrusts
T1: design of a flexible memory system architecture
T2: compiler control of the adaptation process
T3: safe and protected execution environment
• System architecture enables machine adaptation
– by implementing architectural assists, mechanisms, and policies
• Compiler enables application-specific machine adaptation
– by providing powerful memory behavior analysis techniques
• Protection and validation enables a robust multi-process software environment
– by ensuring process isolation and online validation
Project Personnel
• Project Co-PIs
– Professor Rajesh Gupta, UC Irvine
– Professor Alex Nicolau, UC Irvine
– Professor Andrew Chien, UC San Diego
• Collaborators
– Dr. Phil Kuekes, HP Laboratories, Palo Alto
• Research Specialist
– Dr. Alexander Veidenbaum, UC Irvine
• Graduate Research Assistants
– Prashant Arora, Xiaomei Ji, Dan Nicolaescu, Rajesh Satapathy, Chang Chun, Weiyu Tang, Yibo Jiang, Louis Giannini, Jay Byun
• Contract Technical Monitor
– Dr. Larry Carter, AIC, Fort Huachuca, AZ
Summary of New Ideas in AMRM
1. Application-adaptive architectural mechanisms and policies for memory latency and bandwidth management:
– combat latency deterioration using hardware-assisted blocking, prefetching
– manage bandwidth through adaptive translation, movement, and placement of application data for the most efficient access
– cache organization, coherence, and dynamic cache structures are modified as needed by an application
2. Cache memory adaptation is driven by compiler techniques
– semantic retention applied at language and architectural levels
– control memory adaptation and maintain machine usability through application software
3. OS and architecture features enable process isolation and online validation of adaptations
– OS and architecture features enable error detection, isolation, and containment; online validation extends to dynamic adaptations
– modular, robust static and dynamic reconfiguration with precise characterization of isolation properties