32
Anurag Mukkara, Nathan Beckmann, Daniel Sanchez MIT CSAIL ASPLOS XXI - Atlanta, Georgia – 4 April 2016 W HIRLPOOL! IMPROVING D YNAMIC CACHE MANAGEMENT WITH STATIC D ATA CLASSIFICATION

WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Anurag Mukkara, Nathan Beckmann, Daniel Sanchez

MIT CSAIL

ASPLOS XXI - Atlanta, Georgia – 4 April 2016

WHIRLPOOL!

IMPROVING DYNAMIC CACHE MANAGEMENT

WITH STATIC DATA CLASSIFICATION

Page 2: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Processors are limited by data movement

Data movement often consumes >50% of time & energy

E.g., FP multiply-add: 20 pJ DRAM access: 20,000 pJ

To scale performance, must keep data near where its used

But how do programs use memory?

Cache banks

Good: nearby cache banks

Bad: faraway cache banks

Terrible: DRAM access

Page 3: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Static policies have limitations3

Program Code

Fixed policy

Exploits program semantics

Binary

E.g., scratchpads, bypass hints

Can’t adapt to application

phases, input-dependent

behavior, or shared systems

Static analysis

or profiling

Page 4: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Dynamic policies have limitations, too4

Binary

Dynamic policy

Responsive to actual

application behavior

E.g., data migration & replicationDifficult to recover program

semantics from loads/stores

Expensive mechanisms

(eg, extra data movement &

directories)

Observe

loads/stores

Page 5: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Combining static and dynamic is best5

Program Code

Binary

Static analysis

or profiling

Observe

loads/stores

Pool

A

Pool

B

Pool

C

Pool

D

Policy

A

Policy

B

Policy

C

Policy

D

Exploits program

semantics at low overhead

Responsive to actual

application behavior

Page 6: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Agenda6

Case study

Manual classification

Parallel applications

WhirlTool

Page 7: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

System configuration7

Core

L1i L1d

Private L2

Non-uniform cache access (NUCA):

Cache banks have different access latencies

Page 8: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

We apply Whirlpool to Jigsaw [Beckmann PACT’13],

a state-of-the-art NUCA cache

Allocates virtual caches, collections of parts of cache banks

Significantly outperforms prior D-NUCA schemes

Baseline dynamic NUCA scheme8

Reduce cache misses

Reduce on-chip

network traversals

Simple mechanisms

Page 9: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Dynamic policies can reduce data movement9

Jigsaw[Beckmann, PACT’13]

Dynamic policy performs somewhat better:

Static NUCA

4% better performance

12% lower energy

App: Delaunay

triangulation

Page 10: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Static analysis can help!10

Acc

ess

Inte

nsi

ty

Points

Vertices

Triangles

Accesses Footprint (MB)

Page 11: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Jigsaw with Static Classification11

Jigsaw[Beckmann, PACT’13]

Whirlpool!

Vs Jigsaw:

19% better performance

42% lower energy

Few data structures accessed

more frequently than others

Acc

ess

Inte

nsi

ty

Points

Vertices

Triangles

Page 12: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Agenda12

Case study

Manual classification

Parallel applications

WhirlTool

Page 13: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Whirlpool – Manual classification

Organize application data into memory pools

int poolPoints = pool_create();

Point* points = pool_malloc(sizeof(Point)*n, poolPoints);

int poolTris = pool_create();

Tri* smallTris = pool_malloc(sizeof(Tri)*m, poolTris);

Tri* largeTris = pool_malloc(sizeof(Tri)*M, poolTris);

Insight: Group semantically similar data into a pool

Points, Triangles

13

Page 14: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Minor changes to programs14

Application Pools LOC

Delaunay triangulation 3 11

Maximal matching 3 13

Delaunay refinement 3 8

Maximal independent set 3 13

Minimal spanning forest 3 11

401.bzip2 4 43

470.lbm 2 21

429.mcf 2 14

436.cactusADM 2 53

SPECCPU

2006

PBBS

Page 15: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Whirlpool on NUCA placement15

Use pools to improve Jigsaw’s decisions

Each pool is allocated to a virtual cache

Jigsaw transparently places pools in NUCA banks

Whirlpool requires no changes to core Jigsaw

Increase size of structures (few KBs)

Minor improvements, e.g. bypassing (see paper)

Pools useful elsewhere, eg to dynamic prefetching

Page 16: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Significant improvements on some apps16

bzip

2

refin

eM

ST

lbm

mcf

cactus

mat

ching

DT

MIS

0

10

20

30

40

50

60

En

erg

y s

avin

gs v

s J

igsa

w (

%)

bzip

2

refin

eM

ST

lbm

mcf

cactus

mat

ching

DT

MIS

0

2

4

6

8

10

12

14

Sp

ee

du

p v

s J

igsa

w (

%)

38

Up to 38% better performance Up to 53% lower energy

Performance Energy

Page 17: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Agenda17

Case study

Manual classification

Parallel applications

WhirlTool

Page 18: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Conventional runtimes can harm locality18

Optimize load

balance, not locality

Page 19: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Whirlpool co-locates tasks and data19

Break input into pools

Application indicates task affinity

Schedule + steal tasks from nearby their data

Dynamically adapt data placement

Requires minimal changes to task-parallel runtimes

Input

Page 20: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Whirlpool improves locality20

Page 21: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Whirlpool adapts schedule dynamically21

Data placement implicitly schedules tasks

Page 22: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Significant improvements at 16 cores22

MS FFT TC DT PR CC

0

10

20

30

40

50

60

70

Sp

ee

dup

vs J

igsaw

(%

)

MS FFT TC DT PR CC

1.0

1.5

2.0

2.5

3.0

En

erg

y s

avin

gs v

s J

igsa

w

Up to 67% better performance Up to 2.6x lower energy

ApplicationsDivide and conquer algorithms: Mergesort, FFT

Graph analytics: PageRank, Triangle Counting, Connected Components

Graphics: Delaunay Triangulation

Caveat: Splitting data into

pools can be expensive!

Page 23: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Agenda23

Case study

Manual classification

Parallel applications

WhirlTool

Page 24: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

WhirlTool – Automated classification24

Modifying program code is not always practical

A profile-guided tool can automatically classify data into

pools

WhirlTool

Profiler

WhirlTool

Analyzer

Per-callpoint

miss curves

Callpoint-to-

pool map

Application

WhirlTool

runtime

Whirlpool

Allocator

malloc()

pool_malloc()

Page 25: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

WhirlTool profiles miss curves25

Periodically records

per-callpoint

miss curves

Application

A B C ….

Allo

cA

ccs

Groups allocations

by callpoint

Profiles accesses

to each pool

T

i

m

e

Misses

Cache size

Page 26: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

WhirlTool analyzes curves to find pools26

Hardware can only support a limited number of pools

Jigsaw uses 3 virtual caches / thread

0.6% area overhead over LLC

Whirlpool adds 4 pools (each mapped to a virtual cache)

1.2% total area overhead over LLC

Must cluster callpoints into semantically similar groups

Per-callpoint

miss curves

Agglomerative

clustering

Callpoint-to-pool

mapping

Page 27: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Example of agglomerative clustering27

1

1

1

2

2

3

Page 28: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

WhirlTool’s distance metric28

Cache SizeM

isse

s

Small distance

Cache Size

Mis

ses

Large distance

Pool 1

Pool 2

Separated

Combined

Pool 3

How many misses are saved by separating pools?

Page 29: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

WhirlTool matches manual hints29

lesl

iegc

cge

ms

bzip

2om

net

ray

refin

esp

hinx

3M

ST

lbm

setC

over

sopl

exxa

lanc mcf SA

cact

usm

atch

ing

DT

MIS

0

2

4

6

8

10

12

14

Sp

eed

up

vs J

igsa

w (

%)

38

WhirlTool

lesl

iegc

cge

ms

bzip

2om

net

ray

refin

esp

hinx

3M

ST

lbm

setC

over

sopl

exxa

lanc mcf SA

cact

usm

atch

ing

DT

MIS

0

2

4

6

8

10

12

14

Sp

eed

up

vs J

igsa

w (

%)

38 38

WhirlTool

Manual

Page 30: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Multiprogram mixes30

4-core system with random SPECCPU2006 apps

Including those that do not benefit

Whirlpool improves performance by (gmean over 20 mixes)

35% over S-NUCA

30% over idealized shared-private D-NUCA [Hererro, ISCA’10]

26% over R-NUCA [Hardavellas, ISCA’09]

18% over page placement by Awasthi et al. [Awasthi HPCA’09]

5% over Jigsaw [Beckmann, PACT’13]

Page 31: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

Conclusion31

Semantic information from applications improves

performance of dynamic policies

Coordinated data and task placement gives large

improvements in parallel applications

Automated classification reduces programmer burden

Page 32: WHIRLPOOL - cs.cmu.edu · Whirlpool on NUCA placement 15 Use pools to improve Jigsaw’s decisions Each pool is allocated to a virtual cache Jigsaw transparently places pools in NUCA

THANKS FOR YOUR ATTENTION!

QUESTIONS ARE WELCOME!

32

WhirlTool code available at http://bit.ly/WhirlTool