Mosaic: A GPU Memory Manager with Application-Transparent Support
for Multiple Page Sizes
Rachata Ausavarungnirun Joshua Landgraf Vance Miller
Saugata Ghose Jayneel Gandhi
Christopher J. Rossbach Onur Mutlu
Executive Summary
• Problem: No single best page size for GPU virtual memory
  • Large pages: Better TLB reach
  • Small pages: Lower demand paging latency
• Our goal: Transparently enable both page sizes
• Key observations
  • Can easily coalesce an application’s contiguously-allocated small pages into a large page
  • Interleaved memory allocation across applications breaks page contiguity
• Key idea: Preserve virtual address contiguity of small pages when allocating physical memory to simplify coalescing
• Mosaic is a hardware/software cooperative framework that:
  • Coalesces small pages into a large page without data movement
  • Enables the benefits of both small and large pages
• Key result: 55% average performance improvement over the state-of-the-art GPU memory management mechanism
GPU Support for Virtual Memory
• Improves programmability with a unified address space
• Enables large data sets to be processed in the GPU
• Allows multiple applications to run on a GPU
  • Virtual memory can enforce memory protection
State-of-the-Art Virtual Memory on GPUs
[Figure: GPU cores with private per-core TLBs share a second-level TLB and the page table walkers; the page table and data reside in GPU-side main memory, connected to CPU-side memory over a high-latency I/O bus. The shared TLB has limited reach, and page walks incur high latency.]
Trade-Off with Page Size
• Larger pages:
  • Better TLB reach
  • High demand paging latency
• Smaller pages:
  • Lower demand paging latency
  • Limited TLB reach
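The TLB-reach side of this trade-off is simple arithmetic. A minimal sketch, using the entry counts from the methodology later in this talk:

```python
def tlb_reach(entries, page_size_bytes):
    # TLB reach = number of entries x bytes mapped per entry
    return entries * page_size_bytes

KB, MB = 1024, 1024 ** 2

# A 1024-entry shared TLB covers only 4MB with 4KB pages...
assert tlb_reach(1024, 4 * KB) == 4 * MB
# ...but 2GB with 2MB pages: 512x the reach from the same entry count.
assert tlb_reach(1024, 2 * MB) == 2048 * MB
```

The demand-paging side cuts the other way: faulting in a 2MB page moves 512x the data of a 4KB fault over the slow I/O bus.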
Trade-Off with Page Size
Can we get the best of both page sizes?
[Figure: Normalized performance with small (4KB) vs. large (2MB) pages. With no paging overhead, large pages perform 52% better; with paging overhead, large pages perform 93% worse.]
Outline
• Background
• Key challenges and our goal
• Mosaic
• Experimental evaluations
• Conclusions
Challenges with Multiple Page Sizes
[Figure: With the state-of-the-art allocator, App 1 and App 2 allocations are interleaved over time across large page frames, so neither application's pages can be coalesced without migrating multiple 4KB pages, and the allocator must search for which pages to coalesce.]
Desirable Allocation
[Figure: With the desirable allocation, each large page frame receives pages from only one application over time, so App 1's and App 2's pages can each be coalesced without moving data.]
Our Goals
• High TLB reach
• Low demand paging latency
• Application transparency
  • Programmers do not need to modify the applications
Outline
• Background
• Key challenges and our goal
• Mosaic
• Experimental evaluation
• Conclusions
Mosaic
[Figure: Mosaic consists of three components: Contiguity-Conserving Allocation and Contiguity-Aware Compaction in the GPU runtime (software), and the In-Place Coalescer in hardware.]
Outline
• Background
• Key challenges and our goal
• Mosaic
  • Contiguity-Conserving Allocation
• In-Place Coalescer
• Contiguity-Aware Compaction
• Experimental evaluations
• Conclusions
Conserves contiguity within the large page frame
Mosaic: Data Allocation
[Figure: (1) The application demands data; (2) the GPU runtime's Contiguity-Conserving Allocation allocates memory within a large page frame and updates the page table.]
Soft guarantee: A large page frame contains pages from only a single address space
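The soft guarantee above can be sketched as a simple allocator policy. The structures and names here are illustrative, not the paper's implementation:

```python
PAGES_PER_FRAME = 512  # 4KB small pages per 2MB large page frame

class ContiguityConservingAllocator:
    """Sketch: each large page frame holds small pages from only one
    address space, so a full frame can later be coalesced in place."""

    def __init__(self, num_frames):
        self.frames = [{"owner": None, "used": 0} for _ in range(num_frames)]

    def alloc_small_page(self, app_id):
        # Prefer a partially filled frame already owned by this application,
        # so its small pages stay contiguous within one large frame.
        for fid, f in enumerate(self.frames):
            if f["owner"] == app_id and f["used"] < PAGES_PER_FRAME:
                f["used"] += 1
                return fid, f["used"] - 1  # (frame, slot within frame)
        # Otherwise claim a fresh large page frame for this application.
        for fid, f in enumerate(self.frames):
            if f["owner"] is None:
                f["owner"], f["used"] = app_id, 1
                return fid, 0
        raise MemoryError("out of large page frames")

    def coalesceable_frames(self):
        # A fully allocated, single-owner frame can be coalesced
        # without moving any data.
        return [fid for fid, f in enumerate(self.frames)
                if f["used"] == PAGES_PER_FRAME]
```

With this policy, interleaved requests from two applications land in different frames, preserving per-application contiguity instead of breaking it.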
• Data transfer is done at a small page granularity
• A page that is transferred is immediately ready to use
Mosaic: Data Allocation
[Figure: (3) Data is transferred from CPU memory over the system I/O bus into the large page frame in GPU memory.]
Mosaic: Data Allocation
[Figure: (4) When the transfer over the system I/O bus is done, the data sits in the large page frame and is ready to use.]
Outline
• Background
• Key challenges and our goal
• Mosaic
  • Contiguity-Conserving Allocation
• In-Place Coalescer
• Contiguity-Aware Compaction
• Experimental evaluations
• Conclusions
• Fully-allocated large page frame → Coalesceable
• The allocator sends the list of coalesceable pages to the In-Place Coalescer
Mosaic: Coalescing
[Figure: (1) The GPU runtime sends the list of coalesceable large pages to the In-Place Coalescer in hardware.]
• In-Place Coalescer has:
• List of coalesceable large pages
• Key Task: Perform coalescing without moving data
• Simply need to update the page tables
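A sketch of what "just update the page tables" means: translation consults a per-frame coalesced bit, and coalescing flips that bit without touching the data. The table layout below is illustrative, not the paper's exact format:

```python
FRAME_BITS, PAGE_BITS = 21, 12  # 2MB large frames, 4KB small pages

def coalesce(frame_id, coalesced):
    # In-place: only page-table metadata changes. The 512 small pages
    # are already physically contiguous inside the frame, so no data
    # is copied and existing small-page entries stay valid.
    coalesced[frame_id] = 1

def translate(vaddr, small_pt, large_pt, coalesced):
    frame_id = vaddr >> FRAME_BITS
    if coalesced.get(frame_id):
        # One large-page entry covers the whole 2MB frame.
        return large_pt[frame_id] | (vaddr & ((1 << FRAME_BITS) - 1))
    page = vaddr >> PAGE_BITS
    return small_pt[page] | (vaddr & ((1 << PAGE_BITS) - 1))

# Virtual frame 0 maps to a 2MB-aligned physical frame.
base = 0x40000000
small_pt = {p: base + (p << PAGE_BITS) for p in range(512)}
large_pt = {0: base}
coalesced = {}

before = translate(0x1234, small_pt, large_pt, coalesced)
coalesce(0, coalesced)
after = translate(0x1234, small_pt, large_pt, coalesced)
assert before == after  # same physical address under either page size
```

Because the translation result is identical before and after, cached small-page TLB entries remain correct, which is why coalescing needs no TLB flush.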
Mosaic: Coalescing
[Figure: (2) The In-Place Coalescer updates the page tables; the data itself does not move.]
Mosaic: Coalescing
[Figure: Coalescing sets the coalesced bit from 0 to 1; both the large page table and the small page table remain valid.]
• Application-transparent
• Data can be accessed using either page size
• No TLB flush
Outline
• Background
• Key challenges and our goal
• Mosaic
  • Contiguity-Conserving Allocation
• In-Place Coalescer
• Contiguity-Aware Compaction
• Experimental evaluations
• Conclusions
• Key Task: Free up not-fully-used large page frames
• Splinter pages → Break down a large page into small pages
• Compaction → Combine fragmented large page frames
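Splintering is the inverse of coalescing, and in a sketch it is just as cheap: resetting the coalesced bit re-enables the 4KB page table entries. The structures here are illustrative:

```python
def deallocate(page_id, frame_used, coalesced, pages_per_frame=512):
    """Sketch: on deallocation, splinter only the affected large page
    frame by resetting its coalesced bit; no pages are moved or copied."""
    frame_id = page_id // pages_per_frame
    frame_used[frame_id] -= 1
    if coalesced.get(frame_id):
        coalesced[frame_id] = 0  # fall back to the 4KB page table entries
    return frame_id

frame_used = {0: 512, 1: 512}
coalesced = {0: 1, 1: 1}
deallocate(5, frame_used, coalesced)
assert coalesced == {0: 0, 1: 1}  # only the touched frame is splintered
assert frame_used[0] == 511       # frame 0 is now not fully used
```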
Mosaic: Data Deallocation
• Splinter only frames with deallocated pages
Mosaic: Data Deallocation
[Figure: (1) The application deallocates data; (2) the affected large page frame is splintered by resetting its coalesced bit in the page table.]
Mosaic: Compaction
• Key Task: Free up not-fully-used large page frames
• Splinter pages → Break down a large page into small pages
• Compaction → Combine fragmented large page frames
• Compaction decreases memory bloat
  • Happens only when memory is highly fragmented
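The compaction step can be sketched as packing the small pages of lightly-used frames into fuller ones, with the fragmentation trigger checked up front. The threshold and data structures are illustrative choices, not the paper's:

```python
CAPACITY = 512  # 4KB small pages per 2MB large page frame

def compact(frames, trigger=0.9):
    """Sketch: pack pages of lightly-used frames into fuller ones,
    freeing whole large page frames. Compacted frames lose virtual
    contiguity and are marked non-coalesceable."""
    occupied = sum(1 for f in frames if f["used"] > 0)
    if occupied / len(frames) < trigger:
        return 0  # run only when memory is highly fragmented
    partial = sorted((f for f in frames if 0 < f["used"] < CAPACITY),
                     key=lambda f: f["used"])
    freed = 0
    while len(partial) >= 2:
        src, dst = partial[0], partial[-1]  # emptiest into fullest
        moved = min(src["used"], CAPACITY - dst["used"])
        src["used"] -= moved
        dst["used"] += moved
        dst["coalesceable"] = False  # virtual contiguity is lost
        if dst["used"] == CAPACITY:
            partial.pop()
        if src["used"] == 0:
            partial.pop(0)
            freed += 1  # a whole large page frame is now free
    return freed
```

Unlike coalescing and splintering, this path does move data, which is why it is reserved for the highly fragmented case.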
Mosaic: Compaction
[Figure: (1) Contiguity-Aware Compaction compacts the pages of fragmented large page frames, freeing whole large pages; (2) the runtime receives the list of free pages.]
• Once pages are compacted, they become non-coalesceable
  • No virtual contiguity
• Maximizes the number of free large page frames
Mosaic: Compaction
Outline
• Background
• Key challenges and our goal
• Mosaic
  • Contiguity-Conserving Allocation
• In-Place Coalescer
• Contiguity-Aware Compaction
• Experimental evaluations
• Conclusions
Baseline: State-of-the-Art GPU Virtual Memory
[Figure: Same system organization as shown earlier: private per-core TLBs, a shared TLB, page table walkers, and the page table and data in GPU-side main memory, with CPU-side memory across the I/O bus.]
Methodology
• GPGPU-Sim (MAFIA) modeling a GTX 750 Ti
  • 30 GPU cores
• Multiple GPGPU applications execute concurrently
• 64KB 4-way L1, 2048KB 16-way L2
• 64-entry L1 TLB, 1024-entry L2 TLB
• 8-entry large page L1 TLB, 64-entry large page L2 TLB
• 3GB main memory
• Model sequential page walks
• Model page tables and virtual-to-physical mapping
• CUDA-SDK, Rodinia, Parboil, LULESH, and SHOC suites
  • 235 total workloads evaluated
• Available at: https://github.com/CMU-SAFARI/Mosaic
Comparison Points
• State-of-the-art CPU-GPU memory management
  • GPU-MMU based on [Power et al., HPCA’14]
  • Upside: Utilizes parallel page walks, TLB request coalescing, and a page walk cache to improve performance
  • Downside: Limited TLB reach
• Ideal TLB: Every TLB access is an L1 TLB hit
Performance
[Figure: Weighted speedup of GPU-MMU, Mosaic, and the Ideal TLB for 1 to 5 concurrently-executing applications, for homogeneous and heterogeneous workloads. Mosaic's improvement over GPU-MMU ranges from 21.4% to 95.0% across the configurations shown.]
Mosaic consistently improves performance across a wide variety of workloads
Mosaic performs within 10% of the ideal TLB
Other Results in the Paper
• TLB hit rate
  • Mosaic achieves an average TLB hit rate of 99%
• Per-application IPC
  • 97% of all applications perform faster
• Sensitivity to different TLB sizes
  • Mosaic is effective for various TLB configurations
• Memory fragmentation analysis
  • Mosaic reduces memory fragmentation and improves performance regardless of the original fragmentation
• Performance with and without demand paging
Outline
• Background
• Key challenges and our goal
• Mosaic
  • Contiguity-Conserving Allocation
• In-Place Coalescer
• Contiguity-Aware Compaction
• Experimental evaluations
• Conclusions
Summary
• Problem: No single best page size for GPU virtual memory
  • Large pages: Better TLB reach
  • Small pages: Lower demand paging latency
• Our goal: Transparently enable both page sizes
• Key observations
  • Can easily coalesce an application’s contiguously-allocated small pages into a large page
  • Interleaved memory allocation across applications breaks page contiguity
• Key idea: Preserve virtual address contiguity of small pages when allocating physical memory to simplify coalescing
• Mosaic is a hardware/software cooperative framework that:
  • Coalesces small pages into a large page without data movement
  • Enables the benefits of both small and large pages
• Key result: 55% average performance improvement over the state-of-the-art GPU memory management mechanism
Backup Slides
Current Methods to Share GPUs
• Time sharing
  • Fine-grained context switching
  • Coarse-grained context switching
• Spatial sharing
  • NVIDIA GRID
  • Multi-Process Service
Other Methods to Enforce Protection
• Segmented paging
• Static memory partitioning
TLB Flush
• With Mosaic, coalescing leaves the contents of the page tables the same, so coalescing itself requires no flush
• A TLB flush in Mosaic occurs when page table content is modified
  • Modification invalidates content cached in the TLB, which therefore needs to be flushed
• Both the large and small page TLBs are flushed
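The asymmetry above (no flush on coalesce, flush on modification) can be sketched as follows; the TLB objects and function shapes are illustrative:

```python
def coalesce(frame_id, coalesced, small_tlbs, large_tlbs):
    # Coalescing leaves every existing translation valid, so cached
    # TLB entries remain correct: no flush is needed.
    coalesced[frame_id] = 1

def splinter(frame_id, coalesced, small_tlbs, large_tlbs):
    # Splintering modifies page table content that TLBs may be caching,
    # so both the large and small page TLBs are flushed.
    coalesced[frame_id] = 0
    for tlb in small_tlbs + large_tlbs:
        tlb.clear()
```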
Performance with Demand Paging
[Figure: Normalized performance of GPU-MMU without paging, GPU-MMU with demand paging, and Mosaic with demand paging, for homogeneous and heterogeneous workloads.]
In-Place Coalescer: Coalescing
• Key assumption: Soft guarantee
  • A large page range always contains pages of the same application
• Benefit: No data movement

[Figure: Coalescing sets the large page bit in the L1 page table entry and the disabled bit in each L2 page table entry; the virtual address is split into PD, PT, and PO fields.]

Q: How to access the large page base entry?
In-Place Coalescer: Large Page Walk
• Large page index is available at leaf PTE
[Figure: The large page walk reuses the same L1/L2 page table structure; the large page base entry is reached via the large page index stored at the leaf PTE.]
Sample Application Pairs

[Figure: Weighted speedup of GPU-MMU, Mosaic, and the Ideal TLB for sample TLB-friendly and TLB-sensitive application pairs.]
TLB Hit Rate

[Figure: L1 and L2 TLB hit rates of GPU-MMU and Mosaic for 1 to 5 concurrently-executing applications.]
Pre-Fragmenting DRAM

[Figure: Normalized performance of no CAC, CAC, CAC-BC, and CAC-Ideal at fragmentation indices from 30% to 100%.]
Page Occupancy Experiment

[Figure: Normalized performance of no CAC, CAC, CAC-BC, and CAC-Ideal as a function of large page frame occupancy.]
Memory Bloat
[Figure: Memory bloat relative to GPU-MMU for 4KB pages, GPU-MMU, and CAC at page occupancies from 1% to 75%.]
Individual Application IPC

[Figure: Per-application normalized performance of GPU-MMU, Mosaic, and the Ideal TLB, with applications sorted by performance, for workloads with increasing numbers of concurrent applications.]
[Figure: Normalized performance of GPU-MMU and Mosaic while sweeping shared L2 TLB base page entries (64 to 4096), per-SM L1 TLB base page entries (8 to 256), shared L2 TLB large page entries (32 to 512), and per-SM L1 TLB large page entries (4 to 64).]
Mosaic: Putting Everything Together
[Figure: The full flow: the application demands data; the runtime's Contiguity-Conserving Allocation allocates memory and updates the page table; data is transferred over the system I/O bus; on transfer completion, the list of large pages goes to the In-Place Coalescer, which coalesces pages; when the application deallocates data, pages are splintered, and Contiguity-Aware Compaction compacts pages and returns the list of free pages.]
Mosaic: Data Allocation

[Figure: Backup copy of the data allocation flow: allocate memory, update the page table, transfer data over the system I/O bus, then pass the list of large pages to the In-Place Coalescer.]
Mosaic: Data Deallocation

[Figure: Backup copy of the deallocation flow: the application deallocates data; pages are splintered, compacted by Contiguity-Aware Compaction, and the list of free pages is returned.]