Mosaic: A GPU Memory Manager with Application-Transparent Support
for Multiple Page Sizes
Rachata Ausavarungnirun Joshua Landgraf Vance Miller
Saugata Ghose Jayneel Gandhi
Christopher J. Rossbach Onur Mutlu
Executive Summary
• Problem: No single best page size for GPU virtual memory
  • Large pages: Better TLB reach
  • Small pages: Lower demand paging latency
• Our goal: Transparently enable both page sizes
• Key observations
  • Can easily coalesce an application’s contiguously-allocated small pages into a large page
  • Interleaved memory allocation across applications breaks page contiguity
• Key idea: Preserve virtual address contiguity of small pages when allocating physical memory to simplify coalescing
• Mosaic is a hardware/software cooperative framework that:
  • Coalesces small pages into a large page without data movement
  • Enables the benefits of both small and large pages
• Key result: 55% average performance improvement over the state-of-the-art GPU memory management mechanism
GPU Support for Virtual Memory
• Improves programmability with a unified address space
• Enables large data sets to be processed in the GPU
• Allows multiple applications to run on a GPU
  • Virtual memory can enforce memory protection
State-of-the-Art Virtual Memory on GPUs
[Figure: GPU cores with private per-core TLBs share a second-level TLB and the page table walkers; the page table and data reside in GPU-side main memory, connected to CPU-side memory over a high-latency I/O bus. The shared TLB has limited reach, and page walks incur high latency.]
Trade-Off with Page Size
• Larger pages:
  • Better TLB reach
  • High demand paging latency
• Smaller pages:
  • Lower demand paging latency
  • Limited TLB reach
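The TLB-reach side of this trade-off is simple arithmetic. A minimal sketch, using the entry counts from the methodology later in this talk:

```python
def tlb_reach(entries, page_size_bytes):
    # TLB reach = number of entries x bytes mapped per entry
    return entries * page_size_bytes

KB, MB = 1024, 1024 ** 2

# A 1024-entry shared TLB covers only 4MB with 4KB pages...
assert tlb_reach(1024, 4 * KB) == 4 * MB
# ...but 2GB with 2MB pages: 512x the reach from the same entry count.
assert tlb_reach(1024, 2 * MB) == 2048 * MB
```

The demand-paging side cuts the other way: faulting in a 2MB page moves 512x the data of a 4KB fault over the slow I/O bus.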
Trade-Off with Page Size
Can we get the best of both page sizes?
[Figure: Normalized performance with small (4KB) vs. large (2MB) pages. With no paging overhead, large pages perform 52% better; with paging overhead, large pages perform 93% worse.]
Outline
• Background
• Key challenges and our goal
• Mosaic
• Experimental evaluations
• Conclusions
Challenges with Multiple Page Sizes
[Figure: With the state-of-the-art allocator, App 1 and App 2 allocations are interleaved over time across large page frames, so neither application's pages can be coalesced without migrating multiple 4KB pages, and the allocator must search for which pages to coalesce.]
Desirable Allocation
[Figure: With the desirable allocation, each large page frame receives pages from only one application over time, so App 1's and App 2's pages can each be coalesced without moving data.]
Our Goals
• High TLB reach
• Low demand paging latency
• Application transparency
  • Programmers do not need to modify the applications
Outline
• Background
• Key challenges and our goal
• Mosaic
• Experimental evaluation
• Conclusions
Mosaic
[Figure: Mosaic consists of three components: Contiguity-Conserving Allocation and Contiguity-Aware Compaction in the GPU runtime (software), and the In-Place Coalescer in hardware.]
Outline
• Background
• Key challenges and our goal
• Mosaic
  • Contiguity-Conserving Allocation
• In-Place Coalescer
• Contiguity-Aware Compaction
• Experimental evaluations
• Conclusions
Conserves contiguity within the large page frame
Mosaic: Data Allocation
[Figure: (1) The application demands data; (2) the GPU runtime's Contiguity-Conserving Allocation allocates memory within a large page frame and updates the page table.]
Soft guarantee: A large page frame contains pages from only a single address space
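The soft guarantee above can be sketched as a simple allocator policy. The structures and names here are illustrative, not the paper's implementation:

```python
PAGES_PER_FRAME = 512  # 4KB small pages per 2MB large page frame

class ContiguityConservingAllocator:
    """Sketch: each large page frame holds small pages from only one
    address space, so a full frame can later be coalesced in place."""

    def __init__(self, num_frames):
        self.frames = [{"owner": None, "used": 0} for _ in range(num_frames)]

    def alloc_small_page(self, app_id):
        # Prefer a partially filled frame already owned by this application,
        # so its small pages stay contiguous within one large frame.
        for fid, f in enumerate(self.frames):
            if f["owner"] == app_id and f["used"] < PAGES_PER_FRAME:
                f["used"] += 1
                return fid, f["used"] - 1  # (frame, slot within frame)
        # Otherwise claim a fresh large page frame for this application.
        for fid, f in enumerate(self.frames):
            if f["owner"] is None:
                f["owner"], f["used"] = app_id, 1
                return fid, 0
        raise MemoryError("out of large page frames")

    def coalesceable_frames(self):
        # A fully allocated, single-owner frame can be coalesced
        # without moving any data.
        return [fid for fid, f in enumerate(self.frames)
                if f["used"] == PAGES_PER_FRAME]
```

With this policy, interleaved requests from two applications land in different frames, preserving per-application contiguity instead of breaking it.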
• Data transfer is done at a small page granularity
• A page that is transferred is immediately ready to use
Mosaic: Data Allocation
[Figure: (3) Data is transferred from CPU memory over the system I/O bus into the large page frame in GPU memory.]
Mosaic: Data Allocation
[Figure: (4) When the transfer over the system I/O bus is done, the data sits in the large page frame and is ready to use.]
Outline
• Background
• Key challenges and our goal
• Mosaic
  • Contiguity-Conserving Allocation
• In-Place Coalescer
• Contiguity-Aware Compaction
• Experimental evaluations
• Conclusions
• Fully-allocated large page frame → Coalesceable
• The allocator sends the list of coalesceable pages to the In-Place Coalescer
Mosaic: Coalescing
[Figure: (1) The GPU runtime sends the list of coalesceable large pages to the In-Place Coalescer in hardware.]
• In-Place Coalescer has:
• List of coalesceable large pages
• Key Task: Perform coalescing without moving data
• Simply need to update the page tables
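A sketch of what "just update the page tables" means: translation consults a per-frame coalesced bit, and coalescing flips that bit without touching the data. The table layout below is illustrative, not the paper's exact format:

```python
FRAME_BITS, PAGE_BITS = 21, 12  # 2MB large frames, 4KB small pages

def coalesce(frame_id, coalesced):
    # In-place: only page-table metadata changes. The 512 small pages
    # are already physically contiguous inside the frame, so no data
    # is copied and existing small-page entries stay valid.
    coalesced[frame_id] = 1

def translate(vaddr, small_pt, large_pt, coalesced):
    frame_id = vaddr >> FRAME_BITS
    if coalesced.get(frame_id):
        # One large-page entry covers the whole 2MB frame.
        return large_pt[frame_id] | (vaddr & ((1 << FRAME_BITS) - 1))
    page = vaddr >> PAGE_BITS
    return small_pt[page] | (vaddr & ((1 << PAGE_BITS) - 1))

# Virtual frame 0 maps to a 2MB-aligned physical frame.
base = 0x40000000
small_pt = {p: base + (p << PAGE_BITS) for p in range(512)}
large_pt = {0: base}
coalesced = {}

before = translate(0x1234, small_pt, large_pt, coalesced)
coalesce(0, coalesced)
after = translate(0x1234, small_pt, large_pt, coalesced)
assert before == after  # same physical address under either page size
```

Because the translation result is identical before and after, cached small-page TLB entries remain correct, which is why coalescing needs no TLB flush.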
Mosaic: Coalescing
[Figure: (2) The In-Place Coalescer updates the page tables; the data itself does not move.]
Mosaic: Coalescing
[Figure: Coalescing sets the coalesced bit from 0 to 1; both the large page table and the small page table remain valid.]
• Application-transparent
• Data can be accessed using either page size
• No TLB flush
Outline
• Background
• Key challenges and our goal
• Mosaic
  • Contiguity-Conserving Allocation
• In-Place Coalescer
• Contiguity-Aware Compaction
• Experimental evaluations
• Conclusions
• Key Task: Free up not-fully-used large page frames
• Splinter pages → Break down a large page into small pages
• Compaction → Combine fragmented large page frames
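Splintering is the inverse of coalescing, and in a sketch it is just as cheap: resetting the coalesced bit re-enables the 4KB page table entries. The structures here are illustrative:

```python
def deallocate(page_id, frame_used, coalesced, pages_per_frame=512):
    """Sketch: on deallocation, splinter only the affected large page
    frame by resetting its coalesced bit; no pages are moved or copied."""
    frame_id = page_id // pages_per_frame
    frame_used[frame_id] -= 1
    if coalesced.get(frame_id):
        coalesced[frame_id] = 0  # fall back to the 4KB page table entries
    return frame_id

frame_used = {0: 512, 1: 512}
coalesced = {0: 1, 1: 1}
deallocate(5, frame_used, coalesced)
assert coalesced == {0: 0, 1: 1}  # only the touched frame is splintered
assert frame_used[0] == 511       # frame 0 is now not fully used
```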
Mosaic: Data Deallocation
• Splinter only frames with deallocated pages
Mosaic: Data Deallocation
[Figure: (1) The application deallocates data; (2) the affected large page frame is splintered by resetting its coalesced bit in the page table.]
Mosaic: Compaction
• Key Task: Free up not-fully-used large page frames
• Splinter pages → Break down a large page into small pages
• Compaction → Combine fragmented large page frames
• Compaction decreases memory bloat
  • Happens only when memory is highly fragmented
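The compaction step can be sketched as packing the small pages of lightly-used frames into fuller ones, with the fragmentation trigger checked up front. The threshold and data structures are illustrative choices, not the paper's:

```python
CAPACITY = 512  # 4KB small pages per 2MB large page frame

def compact(frames, trigger=0.9):
    """Sketch: pack pages of lightly-used frames into fuller ones,
    freeing whole large page frames. Compacted frames lose virtual
    contiguity and are marked non-coalesceable."""
    occupied = sum(1 for f in frames if f["used"] > 0)
    if occupied / len(frames) < trigger:
        return 0  # run only when memory is highly fragmented
    partial = sorted((f for f in frames if 0 < f["used"] < CAPACITY),
                     key=lambda f: f["used"])
    freed = 0
    while len(partial) >= 2:
        src, dst = partial[0], partial[-1]  # emptiest into fullest
        moved = min(src["used"], CAPACITY - dst["used"])
        src["used"] -= moved
        dst["used"] += moved
        dst["coalesceable"] = False  # virtual contiguity is lost
        if dst["used"] == CAPACITY:
            partial.pop()
        if src["used"] == 0:
            partial.pop(0)
            freed += 1  # a whole large page frame is now free
    return freed
```

Unlike coalescing and splintering, this path does move data, which is why it is reserved for the highly fragmented case.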
Mosaic: Compaction
[Figure: (1) Contiguity-Aware Compaction compacts the pages of fragmented large page frames, freeing whole large pages; (2) the runtime receives the list of free pages.]
• Once pages are compacted, they become non-coalesceable
  • No virtual contiguity
• Maximizes the number of free large page frames
Mosaic: Compaction
Outline
• Background
• Key challenges and our goal
• Mosaic
  • Contiguity-Conserving Allocation
• In-Place Coalescer
• Contiguity-Aware Compaction
• Experimental evaluations
• Conclusions
Baseline: State-of-the-Art GPU Virtual Memory
[Figure: Same system organization as shown earlier: private per-core TLBs, a shared TLB, page table walkers, and the page table and data in GPU-side main memory, with CPU-side memory across the I/O bus.]
Methodology
• GPGPU-Sim (MAFIA) modeling a GTX 750 Ti
  • 30 GPU cores
• Multiple GPGPU applications execute concurrently
• 64KB 4-way L1, 2048KB 16-way L2
• 64-entry L1 TLB, 1024-entry L2 TLB
• 8-entry large page L1 TLB, 64-entry large page L2 TLB
• 3GB main memory
• Model sequential page walks
• Model page tables and virtual-to-physical mapping
• CUDA-SDK, Rodinia, Parboil, LULESH, and SHOC suites
  • 235 total workloads evaluated
• Available at: https://github.com/CMU-SAFARI/Mosaic
Comparison Points
• State-of-the-art CPU-GPU memory management
  • GPU-MMU based on [Power et al., HPCA’14]
  • Upside: Utilizes parallel page walks, TLB request coalescing, and a page walk cache to improve performance
  • Downside: Limited TLB reach
• Ideal TLB: Every TLB access is an L1 TLB hit
Performance
[Figure: Weighted speedup of GPU-MMU, Mosaic, and the Ideal TLB for 1 to 5 concurrently-executing applications, for homogeneous and heterogeneous workloads. Mosaic's improvement over GPU-MMU ranges from 21.4% to 95.0% across the configurations shown.]
Mosaic consistently improves performance across a wide variety of workloads
Mosaic performs within 10% of the ideal TLB
Other Results in the Paper
• TLB hit rate
  • Mosaic achieves an average TLB hit rate of 99%
• Per-application IPC
  • 97% of all applications perform faster
• Sensitivity to different TLB sizes
  • Mosaic is effective for various TLB configurations
• Memory fragmentation analysis
  • Mosaic reduces memory fragmentation and improves performance regardless of the original fragmentation
• Performance with and without demand paging
Outline
• Background
• Key challenges and our goal
• Mosaic
  • Contiguity-Conserving Allocation
• In-Place Coalescer
• Contiguity-Aware Compaction
• Experimental evaluations
• Conclusions
Summary
• Problem: No single best page size for GPU virtual memory
  • Large pages: Better TLB reach
  • Small pages: Lower demand paging latency
• Our goal: Transparently enable both page sizes
• Key observations
  • Can easily coalesce an application’s contiguously-allocated small pages into a large page
  • Interleaved memory allocation across applications breaks page contiguity
• Key idea: Preserve virtual address contiguity of small pages when allocating physical memory to simplify coalescing
• Mosaic is a hardware/software cooperative framework that:
  • Coalesces small pages into a large page without data movement
  • Enables the benefits of both small and large pages
• Key result: 55% average performance improvement over the state-of-the-art GPU memory management mechanism
Backup Slides
Current Methods to Share GPUs
• Time sharing
  • Fine-grained context switching
  • Coarse-grained context switching
• Spatial sharing
  • NVIDIA GRID
  • Multi-Process Service
Other Methods to Enforce Protection
• Segmented paging
• Static memory partitioning
TLB Flush
• With Mosaic, coalescing leaves the contents of the page tables the same, so coalescing itself requires no flush
• A TLB flush in Mosaic occurs when page table content is modified
  • Modification invalidates content cached in the TLB, which therefore needs to be flushed
• Both the large and small page TLBs are flushed
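The asymmetry above (no flush on coalesce, flush on modification) can be sketched as follows; the TLB objects and function shapes are illustrative:

```python
def coalesce(frame_id, coalesced, small_tlbs, large_tlbs):
    # Coalescing leaves every existing translation valid, so cached
    # TLB entries remain correct: no flush is needed.
    coalesced[frame_id] = 1

def splinter(frame_id, coalesced, small_tlbs, large_tlbs):
    # Splintering modifies page table content that TLBs may be caching,
    # so both the large and small page TLBs are flushed.
    coalesced[frame_id] = 0
    for tlb in small_tlbs + large_tlbs:
        tlb.clear()
```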
Performance with Demand Paging
[Figure: Normalized performance of GPU-MMU without paging, GPU-MMU with demand paging, and Mosaic with demand paging, for homogeneous and heterogeneous workloads.]
In-Place Coalescer: Coalescing
• Key assumption: Soft guarantee
  • A large page range always contains pages of the same application
• Benefit: No data movement

[Figure: Coalescing sets the large page bit in the L1 page table entry and the disabled bit in each L2 page table entry; the virtual address is split into PD, PT, and PO fields.]

Q: How to access the large page base entry?
In-Place Coalescer: Large Page Walk
• Large page index is available at leaf PTE
[Figure: The large page walk reuses the same L1/L2 page table structure; the large page base entry is reached via the large page index stored at the leaf PTE.]
Sample Application Pairs

[Figure: Weighted speedup of GPU-MMU, Mosaic, and the Ideal TLB for sample TLB-friendly and TLB-sensitive application pairs.]
TLB Hit Rate

[Figure: L1 and L2 TLB hit rates of GPU-MMU and Mosaic for 1 to 5 concurrently-executing applications.]
Pre-Fragmenting DRAM

[Figure: Normalized performance of no CAC, CAC, CAC-BC, and CAC-Ideal at fragmentation indices from 30% to 100%.]
Page Occupancy Experiment

[Figure: Normalized performance of no CAC, CAC, CAC-BC, and CAC-Ideal as a function of large page frame occupancy.]
Memory Bloat
[Figure: Memory bloat relative to GPU-MMU for 4KB pages, GPU-MMU, and CAC at page occupancies from 1% to 75%.]
Individual Application IPC

[Figure: Per-application normalized performance of GPU-MMU, Mosaic, and the Ideal TLB, with applications sorted by performance, for workloads with increasing numbers of concurrent applications.]
[Figure: Normalized performance of GPU-MMU and Mosaic while sweeping shared L2 TLB base page entries (64 to 4096), per-SM L1 TLB base page entries (8 to 256), shared L2 TLB large page entries (32 to 512), and per-SM L1 TLB large page entries (4 to 64).]
Mosaic: Putting Everything Together
[Figure: The full flow: the application demands data; the runtime's Contiguity-Conserving Allocation allocates memory and updates the page table; data is transferred over the system I/O bus; on transfer completion, the list of large pages goes to the In-Place Coalescer, which coalesces pages; when the application deallocates data, pages are splintered, and Contiguity-Aware Compaction compacts pages and returns the list of free pages.]
Mosaic: Data Allocation

[Figure: Backup copy of the data allocation flow: allocate memory, update the page table, transfer data over the system I/O bus, then pass the list of large pages to the In-Place Coalescer.]
Mosaic: Data Deallocation

[Figure: Backup copy of the deallocation flow: the application deallocates data; pages are splintered, compacted by Contiguity-Aware Compaction, and the list of free pages is returned.]