A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Jason Zebchuk, Elham Safi, and Andreas Moshovos
{zebchuk,elham,moshovos}@eecg.toronto.edu
AENAO Research Group
Department of Electrical and Computer Engineering
University of Toronto




Conventional Block Centric Cache

“Small” blocks optimize bandwidth and performance, especially in large L2/L3 caches

[Figure: fine-grain view of memory in the L2 cache]

Big picture lost


“Big Picture” View

Region: 2^n-sized, aligned area of memory

Patterns and behavior exposed

Spatial locality

Exploit for performance/area/power

[Figure: coarse-grain view of memory in the L2 cache]
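
As a rough illustration of the region definition on this slide, the sketch below maps a byte address to its region and to its block within that region. The 1KB region and 64-byte block sizes are taken from the example configuration used later in the talk; the function names are hypothetical and only for illustration.

```python
REGION_SIZE = 1024  # bytes; assumed 1KB region (must be a power of two)
BLOCK_SIZE = 64     # bytes; assumed 64-byte cache block


def region_base(addr: int, region_size: int = REGION_SIZE) -> int:
    """Return the aligned start address of the region containing addr."""
    return addr & ~(region_size - 1)


def region_number(addr: int, region_size: int = REGION_SIZE) -> int:
    """Return the index of the region containing addr."""
    return addr // region_size


def block_in_region(addr: int, region_size: int = REGION_SIZE,
                    block_size: int = BLOCK_SIZE) -> int:
    """Return which block of its region addr falls in (0-based)."""
    return (addr % region_size) // block_size
```

Because regions are aligned powers of two, all three values come from simple masks and shifts on the address, which is what makes a region-level view cheap to derive in hardware.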


Exploiting Coarse-Grain Patterns

Many existing coarse-grain optimizations

Add new structures to track coarse-grain information

[Figure: CPU and L2 cache surrounded by existing optimizations: Stealth Prefetching, Flexible Snooping, Destination-Set Prediction, Spatial Memory Streaming, Coarse-Grain Coherence Tracking, RegionScout, Circuit-Switched Coherence]

Hard to justify for a commercial design

Coarse-Grain Framework


Exploiting Coarse-Grain Patterns

[Figure: CPU, L2 cache, and the Coarse-Grain Framework]

Embed coarse-grain information in tag array

Support many different optimizations with less area overhead

Adaptable optimization FRAMEWORK


RegionTracker Solution

Manage blocks, but also track and manage regions

[Figure: L2 cache with four L1s, a data array, and RegionTracker in place of the tag array. Labeled traffic: block requests, data blocks, region probes, region responses.]


RegionTracker Summary

Replace conventional tag array:
  4-core CMP with 8MB shared L2 cache
  Within 1% of original performance
  Up to 20% less tag area
  Average 33% less energy consumption

Optimization Framework:
  Stealth Prefetching: same performance, 36% less area
  RegionScout: 2x more snoops avoided, no area overhead


Road Map

Introduction

Goals

Coarse-Grain Cache Designs

RegionTracker: A Tag Array Replacement

RegionTracker: An Optimization Framework

Conclusion


Goals

1. Conventional Tag Array Functionality
  Identify data block location and state
  Leave data array unchanged

2. Optimization Framework Functionality
  Is Region X cached?
  Which blocks of Region X are cached? Where?
  Evict or migrate Region X
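
The region-level queries listed under goal 2 can be sketched as a small software model. This is only a toy under stated assumptions, not the hardware design from the talk: the class name `RegionDirectory`, its methods, and the dictionary layout are hypothetical, and 16 blocks per region assumes 1KB regions of 64-byte blocks.

```python
BLOCKS_PER_REGION = 16  # assumed: 1KB regions of 64-byte blocks


class RegionDirectory:
    """Toy model: region number -> {block offset within region: way}."""

    def __init__(self):
        self._regions = {}

    def insert_block(self, region, block, way):
        """Record that `block` of `region` is cached in `way`."""
        if not 0 <= block < BLOCKS_PER_REGION:
            raise ValueError("block offset out of range")
        self._regions.setdefault(region, {})[block] = way

    def is_region_cached(self, region):
        """Is Region X cached (at least one block present)?"""
        return bool(self._regions.get(region))

    def cached_blocks(self, region):
        """Which blocks are cached, and where: sorted (block, way) pairs."""
        return sorted(self._regions.get(region, {}).items())

    def evict_region(self, region):
        """Drop the region and return the (block, way) pairs to evict."""
        return sorted(self._regions.pop(region, {}).items())
```

The point of the sketch is that each query is answered from a single per-region record, with no scan over per-block tags.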


Coarse-Grain Cache Designs

Large Block Size

Increased BW, decreased hit-rates

[Figure: Region X in the tag array and data array]


Sector Cache

Decreased hit-rates

[Figure: Region X in the tag array and data array]


Sector Pool Cache

High Associativity (2 - 4 times)

[Figure: Region X in the tag array and data array]


Decoupled Sector Cache

Region information not exposed

Region replacement requires scanning multiple entries

[Figure: Region X in the tag array, status table, and data array]


Design Requirements

Small block size (64B)
Miss-rate does not increase
Lookup associativity does not increase
No additional access latency (i.e., no scanning, no multiple block evictions)
Does not increase latency, area, or energy
Allows banking and interleaving
Fit in conventional tag array “envelope”


RegionTracker: A Tag Array Replacement

3 SRAM arrays, combined smaller than tag array:
  Region Vector Array
  Block Status Table
  Evicted Region Buffer

[Figure: the three arrays between the four L1s and the data array]


Common Case: Hit

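[Figure: hit path]

Address fields, high to low: Region Tag | RVA Index | Region Offset | Block Offset (bit positions marked 49, 21, 10, 6, 0)

Region Vector Array (RVA) entry: Region Tag, then V and way fields for each of block0 ... block15 (field widths marked 1 and 4)

Second address: Data Array + BST Index | Block Offset (bit positions marked 19, 6, 0), sent to the data array

Block Status Table (BST) entry: status (field widths marked 3 and 2)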

Ex: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB region
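
The arithmetic behind this example configuration, and the address split shown in the figure, can be sketched as follows. The block-offset and region-offset widths follow directly from the 64-byte block and 1KB region; the 11-bit RVA index (bits 20 to 10) is an assumption read off the bit positions marked on the slide, and the function name `split_address` is hypothetical.

```python
CACHE_BYTES = 8 * 1024 * 1024   # 8MB L2
WAYS = 16                       # 16-way set-associative
BLOCK_BYTES = 64                # 64-byte blocks
REGION_BYTES = 1024             # 1KB regions

BLOCKS_PER_REGION = REGION_BYTES // BLOCK_BYTES   # 16 blocks per region
TOTAL_BLOCKS = CACHE_BYTES // BLOCK_BYTES         # 131072 block frames
SETS = TOTAL_BLOCKS // WAYS                       # 8192 sets

BLOCK_OFFSET_BITS = BLOCK_BYTES.bit_length() - 1          # 6 (bits 5..0)
REGION_OFFSET_BITS = BLOCKS_PER_REGION.bit_length() - 1   # 4 (bits 9..6)
RVA_INDEX_BITS = 11  # assumption: bits 20..10, as marked on the slide


def split_address(addr: int):
    """Split addr into (region_tag, rva_index, region_offset, block_offset)."""
    block_offset = addr & ((1 << BLOCK_OFFSET_BITS) - 1)
    addr >>= BLOCK_OFFSET_BITS
    region_offset = addr & ((1 << REGION_OFFSET_BITS) - 1)
    addr >>= REGION_OFFSET_BITS
    rva_index = addr & ((1 << RVA_INDEX_BITS) - 1)
    region_tag = addr >> RVA_INDEX_BITS
    return region_tag, rva_index, region_offset, block_offset
```

In this reading, the RVA index selects a set of region entries, the region tag is compared within that set, and the region offset picks one of the 16 per-block fields in the matching entry.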


Worst Case (Rare): Region Miss

[Figure: region miss path, using the same address fields, Region Vector Array (RVA), and Block Status Table (BST) as the hit case. Additional labels: No Match!, a Ptr field next to status in the BST (field widths marked 3 and 2), and Ptr in the Evicted Region Buffer (ERB).]


Methodology

Flexus simulator from CMU SimFlex group
  Based on Simics full-system simulator

4-core CMP modeled after Piranha
  Private 32KB, 4-way set-associative L1 caches
  Shared 8MB, 16-way set-associative L2 cache
  64-byte blocks

Miss-rates: functional simulation of 2 billion instructions per core
Performance and Energy: timing simulation using SMARTS sampling methodology
Area and Power: full custom implementation on 130nm commercial technology

9 commercial workloads:
  WEB: SpecWEB on Apache and Zeus
  OLTP: TPC-C on DB2 and Oracle
  DSS: 5 TPC-H queries on DB2

[Figure: four processors (P), each with a D$ and an I$, connected through an interconnect to the L2]


Miss-Rates vs. Area

Sector Cache: 512KB sectors; SPC and RT: 1KB regions

Trade-offs comparable to conventional cache

[Plot: relative miss-rate (y axis, 0.99 to 1.05) vs. relative tag array area (x axis, 0.5 to 1.2) for Sector Pool Cache, RegionTracker, and Conventional Tags. Sector Cache is off-scale at (0.25, 1.26). Point labels: 14-way, 15-way, 52-way, 48-way.]


Performance & Energy

12-way set-associative RegionTracker: 20% less area

Error bars: 95% confidence interval

Performance within 1%, with 33% tag energy reduction

[Plots: Performance, as normalized execution time (0.97 to 1.03), and Energy, as reduction in tag energy (0% to 50%), each for WEB, OLTP, and DSS]


Road Map

Introduction

Goals

Coarse-Grain Cache Designs

RegionTracker: A Tag Array Replacement

RegionTracker: An Optimization Framework

Conclusion


RegionTracker: An Optimization Framework

[Figure: four L1s, the RVA, ERB, and BST, and the data array]

Stealth Prefetching: average 20% performance improvement
  Drop-in RegionTracker for 36% less area overhead

RegionScout: in-depth analysis


Snoop Coherence: Common Case

[Figure: three CPUs and main memory. Labels: Read x, Read x+1, Read x+n, miss, miss.]

Many snoops are to non-shared regions


RegionScout

Eliminate broadcasts for non-shared regions

[Figure: three CPUs and main memory, with Non-Shared Regions and Locally Cached Regions structures. Labels: Read x, Miss, Region Miss, Global Region Miss.]


RegionTracker Implementation

Minimal overhead to support RegionScout optimization

Still uses less area than conventional tag array

Non-Shared Regions: add 1 bit to each RVA entry

Locally Cached Regions: already provided by RVA
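
A minimal software sketch of how one non-shared bit per RVA entry could filter snoop broadcasts is given below. It is a simplified model under stated assumptions, not the authors' hardware: the `Node` class, the `read_miss` function, and the dictionary standing in for the RVA are hypothetical, and evictions, writes, and block-level state are ignored.

```python
class Node:
    """One cache node; its RVA is modeled as a dict keyed by region number."""

    def __init__(self, name):
        self.name = name
        # Presence of a key means the region is locally cached;
        # the stored flag plays the role of the added non-shared bit.
        self.rva = {}

    def caches_region(self, region):
        return region in self.rva


def read_miss(requester, others, region):
    """Handle a block miss at `requester`; return True if a broadcast was sent."""
    entry = requester.rva.get(region)
    if entry is not None and entry["non_shared"]:
        return False  # region known to be non-shared: skip the broadcast

    # Broadcast: every other node reports whether it caches the region.
    shared = False
    for node in others:
        if node.caches_region(region):
            shared = True
            # The other node can no longer treat the region as non-shared.
            node.rva[region]["non_shared"] = False

    requester.rva[region] = {"non_shared": not shared}
    return True
```

In this model the first miss to a region still broadcasts, but later misses to the same region are filtered until another node touches it, which matches the slide's observation that many snoops are to non-shared regions.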


RegionTracker + RegionScout

4 processors, 512KB L2 caches, 1KB regions

Avoid 41% of snoop broadcasts, no area overhead compared to conventional tag array

BlockScout (4KB)

New optimization possible with RegionTracker

[Plot: reduction in snoop broadcasts (0% to 60%) for RS 7KB, RS 12KB, RS 22KB, and RSRT]


Result Summary

Replace Conventional Tag Array:
  20% less tag area
  33% less tag energy
  Within 1% of original performance

Coarse-Grain Optimization Framework:
  36% reduction in area overhead for Stealth Prefetching
  Filter 41% of snoop broadcasts with no area overhead compared to conventional cache


Exploiting Coarse-Grain Patterns

[Figure: CPU and L2 cache surrounded by coarse-grain optimizations: Stealth Prefetching, Run-time Adaptive Cache Hierarchy Management via Reference Analysis, Destination-Set Prediction, Spatial Memory Streaming, Coarse-Grain Coherence Tracking, RegionScout, Circuit-Switched Coherence]

Conclusion

RegionTracker framework makes coarse-grain optimizations more attractive
