45
Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance Rachata Ausavarungnirun Saugata Ghose, Onur Kayiran, Gabriel H. Loh Chita Das, Mahmut Kandemir, Onur Mutlu

Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

  • Upload
    others

  • View
    16

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance

Rachata Ausavarungnirun

Saugata Ghose, Onur Kayiran, Gabriel H. Loh

Chita Das, Mahmut Kandemir, Onur Mutlu

Page 2: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Overview of This Talk• Problem: memory divergence

– Threads execute in lockstep, but not all threads hit in the cache

– A single long latency thread can stall an entire warp

• Key Observations:– Memory divergence characteristic differs across warps

– Some warps mostly hit in the cache, some mostly miss

– Divergence characteristic is stable over time

– L2 queuing exacerbates memory divergence problem

• Our Solution: Memory Divergence Correction– Uses cache bypassing, cache insertion and memory scheduling to

prioritize mostly-hit warps and deprioritize mostly-miss warps

• Key Results: – 21.8% better performance and 20.1% better energy efficiency

compared to state-of-the-art caching policy on GPU

2

Page 3: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Outline

• Background

• Key Observations

• Memory Divergence Correction (MeDiC)

• Results

3

Page 4: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Latency Hiding in GPGPU Execution

4

Stall

Time

GPU Core

Active

GPU Core Status

Active

Warp A

Warp B

Warp C

Warp D

LockstepExecutionThread

Page 5: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Problem: Memory Divergence

5

Cache Miss

Cache Hit

Warp A

Time

Cache Hit

Main Memory

Stall Time

Page 6: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Outline

• Background

• Key Observations

• Memory Divergence Correction (MeDiC)

• Results

6

Page 7: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Observation 1: Divergence Heterogeneity

7

ReducedStall Time

Cache Miss

Cache Hit

Mostly-hit warp Mostly-miss warp

Time

Key Idea: • Convert mostly-hit warps to

all-hit warps• Convert mostly-miss warps to

all-miss warps

All-hit warp All-miss warp

Page 8: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Observation 2: Stable Divergence Char.

8

• Warp retains its hit ratio during a program phase

– Hit ratio number of hits / number of access

Page 9: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Observation 2: Stable Divergence Char.

9

• Warp retains its hit ratio during a program phase

0.00.10.20.30.40.50.60.70.80.91.0

Hit

Rat

io

Cycles

Warp 1 Warp 2 Warp 3 Warp 4 Warp 5 Warp 6

Mostly-hit

Balanced

Mostly-miss

Page 10: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Observation 3: Queuing at L2 Banks

10

Shared L2 Cache

Bank 0

Bank 1

Bank 2

Bank n

……

……

Request Buffers

MemoryScheduler

To DRAM

45% of requests stall 20+ cycles at the L2 queue

Long queuing delays exacerbate the effect of memory divergence

Page 11: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Outline

• Background

• Key Observations

• Memory Divergence Correction (MeDiC)

• Results

11

Page 12: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Our Solution: MeDiC

12

• Key Ideas:

– Convert mostly-hit warps to all-hit warps

– Convert mostly-miss warps to all-miss warps

– Reduce L2 queuing latency

– Prioritize mostly-hit warps at the memory

– Maintain memory bandwidth

Page 13: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Warp-type-awareMemory Scheduler

Memory Divergence Correction

Wa

rp Typ

e Iden

tificatio

n Lo

gic

MemoryRequest

Shared L2 Cache

Bank 0

Bank 1

Bank 2

Bank n

To DRAM

N

Y

Low Priority

High Priority

Any Requests in High Priority

……

Byp

assin

g Lo

gic

Warp-type-aware Cache BypassingMostly-miss, All-miss

Warp-type-aware Cache Insertion Policy13

Memory Scheduler

Page 14: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Mechanism to Identify Warp Type

14

• Profile hit ratio for each warp

• Group warp into one of five categories

• Periodically reset warp-type

All-missMostly-missBalancedMostly-hitAll-hit

Higher Priority Lower Priority

Page 15: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Warp-type-aware Cache Bypassing

Warp-type-awareMemory Scheduler

Warp-type-aware Cache Bypassing

Wa

rp Typ

e Iden

tificatio

n Lo

gic

MemoryRequest

Shared L2 Cache

Bank 0

Bank 1

Bank 2

Bank n

To DRAM

N

Y

Low Priority

High Priority

Any Requests in High Priority

……

Byp

assin

g Lo

gic

Mostly-miss, All-miss

Warp-type-aware Cache Insertion Policy15

Page 16: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Warp-type-aware Cache Bypassing

• Goal:

– Convert mostly-hit warps to all-hit warps

– Convert mostly-miss warps to all-miss warps

• Our Solution:

– All-miss and mostly-miss warps Bypass L2

– Adjust how we identify warps to maintain miss rate

• Key Benefits:

– More all-hit warps

– Reduce queuing latency for all warps

16

Page 17: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Warp-type-aware Cache Bypassing

Warp-type-awareMemory Scheduler

Warp-type-aware Cache Insertion

Wa

rp Typ

e Iden

tificatio

n Lo

gic

MemoryRequest

Shared L2 Cache

Bank 0

Bank 1

Bank 2

Bank n

To DRAM

N

Y

Low Priority

High Priority

Any Requests in High Priority

……

Byp

assin

g Lo

gic

Mostly-miss, All-miss

Warp-type-aware Cache Insertion Policy17

Page 18: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Warp-type-aware Cache Insertion

• Goal: Utilize the cache well

– Prioritize mostly-hit warps

– Maintain blocks with high reuse

• Our Solution:

– All-miss and mostly-miss Insert at LRU

– All-hit, mostly-hit and balanced Insert at MRU

• Benefits:

– All-hit and mostly-hit are less likely to be evicted

– Heavily reused cache blocks from mostly-miss are likely to remain in the cache

18

Page 19: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Warp-type-aware Cache Bypassing

Warp-type-awareMemory Scheduler

Warp-type-aware Memory Sched.

Wa

rp Typ

e Iden

tificatio

n Lo

gic

MemoryRequest

Shared L2 Cache

Bank 0

Bank 1

Bank 2

Bank n

To DRAM

N

Y

Low Priority

High Priority

Any Requests in High Priority

……

Byp

assin

g Lo

gic

Mostly-miss, All-miss

Warp-type-aware Cache Insertion Policy19

Page 20: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Not All Blocks Can Be Cached

• Despite best efforts, accesses from mostly-hit warps can still miss in the cache

– Compulsory misses

– Cache thrashing

• Solution: Warp-type-aware memory scheduler

20

Page 21: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Warp-type-aware Memory Sched.

• Goal: Prioritize mostly-hit over mostly-miss

• Mechanism: Two memory request queues

– High-priority all-hit and mostly-hit

– Low-priority balanced, mostly-miss and all-miss

• Benefits:

– Mostly-hit warps stall less

21

Page 22: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

MeDiC: Example

22

Mostly-hit Warp

Mostly-miss Warp

All-hit Warp

All-miss Warp

DRAM Queuing Latency

Cache/Mem Latency

BypassCache

Lowerqueuinglatency

Lowerstalltime

Insert atLRU

HighPriority

Insert atMRU

Cache Queuing Latency

Page 23: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Outline

• Background

• Key Observations

• Memory Divergence Correction (MeDiC)

• Results

23

Page 24: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Methodology

• Modified GPGPU-Sim modeling GTX480

– 15 GPU cores

– 6 memory partition

– 16KB 4-way L1, 768KB 16-way L2

– Model L2 queue and L2 queuing latency

– 1674 MHz GDDR5

• Workloads from CUDA-SDK, Rodinia, Mars and Lonestar suites

24

Page 25: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Comparison Points• FR-FCFS baseline [Rixner+, ISCA’00]• Cache Insertion:

– EAF [Seshadri+, PACT’12]• Tracks blocks that are recently evicted to detect high reuse and

inserts them at the MRU position• Does not take divergence heterogeneity into account• Does not lower queuing latency

• Cache Bypassing: – PCAL [Li+, HPCA’15]

• Uses tokens to limit number of warps that gets to access the L2 cache Lower cache thrashing

• Warps with highly reuse access gets more priority• Does not take divergence heterogeneity into account

– PC-based and Random bypassing policy

25

Page 26: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

0.5

1.0

1.5

2.0

2.5

Spe

ed

up

Ove

r B

ase

line

Baseline EAF PCAL MeDiC

Results: Performance of MeDiC

26

21.8%

MeDiC is effective in identifying warp-type and taking advantage of divergence heterogeneity

Page 27: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Results: Energy Efficiency of MeDiC

27

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

No

rm. E

ne

rgy

Effi

cie

ncy

Baseline EAF PCAL MeDiC

20.1%

Performance improvement outweighs the additional energy from extra cache misses

Page 28: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Other Results in the Paper

• Breakdowns of each component of MeDiC– Each component is effective

• Comparison against PC-based and random cache bypassing policy– MeDiC provides better performance

• Analysis of combining MeDiC+reuse mechanism– MeDiC is effective in caching highly-reused blocks

• Sensitivity analysis of each individual components– Minimal impact on L2 miss rate

– Minimal impact on row buffer locality

– Improved L2 queuing latency

28

Page 29: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Conclusion• Problem: memory divergence

– Threads execute in lockstep, but not all threads hit in the cache

– A single long latency thread can stall an entire warp

• Key Observations:– Memory divergence characteristic differs across warps

– Some warps mostly hit in the cache, some mostly miss

– Divergence characteristic is stable over time

– L2 queuing exacerbates memory divergence problem

• Our Solution: Memory Divergence Correction– Uses cache bypassing, cache insertion and memory scheduling to

prioritize mostly-hit warps and deprioritize mostly-miss warps

• Key Results: – 21.8% better performance and 20.1% better energy efficiency

compared to state-of-the-art caching policy on GPU

29

Page 30: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance

Rachata Ausavarungnirun

Saugata Ghose, Onur Kayiran, Gabriel H. Loh

Chita Das, Mahmut Kandemir, Onur Mutlu

Page 31: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Backup Slides

31

Page 32: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Queuing at L2 Banks: Real Workloads

32

0%

2%

4%

6%

8%

10%

12%

14%

16%

Frac

t. o

f L2

Re

qu

est

s

Queuing Time (cycles)

53.8%

Page 33: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Adding More Banks

33

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

No

rmal

ize

d P

erf

orm

ance 12 Banks 2 Ports 24 Banks 2 Ports

24 Banks 4 Ports 48 Banks 2 Ports

5%

Page 34: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

0

5

10

15

20

25

30

35

40

Qu

eu

ing

Late

ncy

(cy

cles

)

Baseline

WByp

MeDiC

88.5

Queuing Latency Reduction

34

69.8%

Page 35: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

MeDiC: Performance Breakdown

35

0.5

1.0

1.5

2.0

2.5

Sp

eed

up

Over

Baselin

e

WIP WMS WByp MeDiC

Page 36: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

MeDiC: Miss Rate

36

0.0

0.2

0.4

0.6

0.8

1.0

L2 C

ach

e M

iss R

ate

Baseline

Rand

WIP

MeDiC

Page 37: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

MeDiC: Row Buffer Hit Rate

37

0.4

0.5

0.6

0.7

0.8

0.9

1.0

Ro

w B

uff

er

Hit

Rate

Baseline WMS MeDiC

Page 38: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

MeDiC-Reuse

38

0.8

1.0

1.2

1.4

1.6

1.8

2.0

2.2

2.4

Sp

ee

du

p O

ver

Bas

eli

ne MeDiC MeDiC-reuse

Page 39: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

L2 Queuing Penalty

39

0

50

100

150

200

250

300

350

NN

CO

NS

SC

P

BP

HS

SC

IIX

PV

C

PV

R

SS

BF

S

BH

DM

R

MS

T

SS

SP

Avg

. P

en

alt

y f

rom

Div

erg

en

ce

L2

L2 + DRAM

0

100

200

300

400

500

600

700

800

900

1,000

NN

CO

NS

SC

P

BP

HS

SC

IIX

PV

C

PV

R

SS

BF

S

BH

DM

R

MS

T

SS

SP

Max

. P

en

alt

y f

rom

Div

erg

en

ce

L2

L2 + DRAM

Page 40: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Divergence Distribution

40

Page 41: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Divergence Distribution

41

Page 42: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Stable Divergence Characteristics

42

• Warp retains its hit ratio during a program phase

• Heterogeneity

– Control Divergence

– Memory Divergence

– Edge cases on the data the program is operating on

– Coalescing

– Affinity to different memory partition

• Stability

– Temporal + spatial locality

Page 43: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Warps Can Fetch Data for Others

• All-miss and mostly-miss warps can fetch cache blocks for other warps

– Blocks with high reuse

– Shared address with all-hit and mostly-hit warps

• Solution: Warp-type aware cache insertion

43

Page 44: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Warp-type Aware Cache Insertion

44

A1 A2

A3 A4

B1 A5

A6 B2

L2 Cache

LRU

MRU

A7 B3 B2

Future Cache Requests

LRU

LRU

MRU MRU

MRU

A8 A9

LRU

Page 45: Exploiting Inter-Warp Heterogeneity to Improve GPGPU ...omutlu/pub/MeDiC-for... · –MeDiC provides better performance •Analysis of combining MeDiC+reuse mechanism –MeDiC is

Warp-type Aware Memory Sched.

45

Memory Scheduler

Main Memory

Memory Request Queue

Warp Type

Memory Request

All-hit, Mostly-hit Balanced, Mostly-miss, All-miss

FR-FCFS Scheduler

Low Priority Queue

FR-FCFS Scheduler

High Priority Queue