Rust as a Language
for High Performance
Garbage Collector
Implementation
Yi Lin, Stephen M. Blackburn, Antony L. Hosking, Michael Norrish
1 Introduction
Motivation for Thesis
▪ A fast yet robust garbage collector is key to
garbage-collected language runtimes
▫ Manipulate raw memory with optimized code
▫ Rich in concurrency and thread parallelism
▫ Prone to memory bugs and race conditions
▪ Importance of high performance encourages use of
languages such as C / C++
▫ Weak type system, lack of memory safety and lack of
integrated support for concurrency
▫ Developers are solely responsible for memory and
thread safety
▪ Rust is a systems programming language that runs
blazingly fast, prevents segfaults, and guarantees thread
safety
Overview of Thesis
▪ Overview of Rust
▪ Implementing an Immix Garbage Collector in
Rust
▫ Overview of an Immix Garbage Collector
▫ Distinct Elements of Rust
▫ Abusing Rust
▪ Evaluation of Garbage Collector using Rust
▫ Extent of utilizing Rust’s safety
▫ Comparing Immix in Rust and Immix in C
▫ Comparing Immix in Rust and BDW in C
▪ Conclusion
2 Overview of Rust
Rust: Introduction
▪ Open source, sponsored by
Mozilla Research
▪ Designed to be safe,
concurrent and practical
▪ Syntactically similar to
C++, but designed for better
memory safety while
maintaining performance
▪ First place for “Most loved
programming language” in the
Stack Overflow Developer
Survey in 2016 and 2017
Rust: Ownership
▪ Grants variable unique ownership
of the value it is bound to
▪ Unbound variables are not allowed
▪ Rebinding involves move semantics
▪ Ownership expires and resources
reclaimed when variable goes out
of scope
▪ Key concept for memory safety
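The ownership rules above can be illustrated with a minimal sketch. The `Buffer` type and the function names are hypothetical and chosen only for this example.

```rust
/// A hypothetical owned resource: a heap-allocated byte buffer.
struct Buffer {
    bytes: Vec<u8>,
}

/// Takes ownership of the buffer; its memory is reclaimed when this
/// function returns and `buf` goes out of scope.
fn consume(buf: Buffer) -> usize {
    buf.bytes.len()
}

/// Shows move semantics on rebinding.
fn move_then_consume() -> usize {
    let a = Buffer { bytes: vec![0u8; 16] };
    let b = a; // ownership moves from `a` to `b`
    // let n = a.bytes.len(); // would not compile: use of moved value `a`
    consume(b) // ownership moves again, into `consume`
}
```

Because the compiler tracks the single owner statically, no runtime bookkeeping is needed to decide when the buffer is freed.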
Rust: References
▪ Borrow reference to access objects
▫ Less expensive - compiler does not need
extra code for destruction on expiry
▪ Immutable vs mutable
▫ 1+ coexisting immutable references
▫ 1 mutable and 0 immutable references
▪ Ownership cannot be moved when
borrowed
▫ Eliminates data races due to mutual
exclusivity of mutable (write) and
immutable (read) references
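A short sketch of the borrowing rules, assuming nothing beyond the standard library. The helper names are made up for illustration.

```rust
/// Reads through a shared (immutable) borrow.
fn total(values: &[u32]) -> u32 {
    values.iter().sum()
}

/// Writes through an exclusive (mutable) borrow.
fn double_all(values: &mut [u32]) {
    for v in values.iter_mut() {
        *v *= 2;
    }
}

fn borrow_demo() -> (u32, usize, u32) {
    let mut data: Vec<u32> = vec![1, 2, 3];
    let r1 = &data;
    let r2 = &data; // several immutable references may coexist
    let shared_sum = total(r1);
    let shared_len = r2.len();
    // `r1` and `r2` are not used past this point, so a single mutable
    // reference is now permitted.
    double_all(&mut data);
    // Holding `r1` across the call above would be rejected at compile time.
    (shared_sum, shared_len, total(&data))
}
```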
Rust: Data Guarantees
▪ Provides various wrapper types with
different guarantees and trade offs
▪ Box<T>: Pointer which owns a piece of
heap allocated data
▪ Arc<T>: Provides an atomically
reference-counted shared pointer to data
▫ All data stays accessible until every
Arc<T> goes out of scope
▪ Arc<Mutex<T>>: Provides a mutually exclusive
lock and allows sharing the mutex lock
across threads
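As a sketch of how `Arc<Mutex<T>>` is typically used, the hypothetical function below shares one counter across several threads. The name `parallel_count` is an assumption for this example.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Increments a shared counter from `threads` threads, `increments` times each.
fn parallel_count(threads: usize, increments: usize) -> usize {
    let counter = Arc::new(Mutex::new(0usize));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            // Each thread gets its own Arc handle to the same Mutex.
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..increments {
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    total
}
```

The counter stays alive until the last `Arc` handle is dropped, and the `Mutex` serializes every update.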
Rust: Unsafe Code
▪ Rust’s safety sometimes becomes too
restrictive or expensive
▪ Allows for unsafe code
▫ Raw pointers (e.g. *mut T)
▫ Forcefully allowing sharing data across
threads (e.g. unsafe impl Sync for T{})
▫ Intrinsic functions (e.g. mem::transmute())
▫ External functions from other languages
(e.g. libc::malloc())
▪ Alert programmers by requiring unsafe
code to be marked with unsafe
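Two of the unsafe features listed above, sketched with hypothetical helper functions. Both operations compile only inside an `unsafe` block.

```rust
/// Writes `value` through a raw pointer and reads it back.
fn write_via_raw_pointer(slot: &mut u64, value: u64) -> u64 {
    let p: *mut u64 = slot; // creating a raw pointer is safe
    unsafe {
        // Dereferencing it is not: the programmer vouches that `p` is valid.
        *p = value;
        *p
    }
}

/// Reinterprets the bits of an f32 as a u32 using `mem::transmute`.
fn float_bits(x: f32) -> u32 {
    unsafe { std::mem::transmute::<f32, u32>(x) }
}
```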
3 Implementing an Immix
Garbage Collector in Rust
Overview of Immix
▪ High performance garbage collector
▪ Mark-region Collector (Naive)
▫ Memory divided into fixed size regions
▫ Each region is either “free” or “unavailable”
▫ Allocates into free regions until all are used
▫ Triggers collection, which marks regions holding live objects as unavailable; the rest are free
▪ Optimizes Mark-region by operating at 2 levels
▫ Coarse grained blocks
▫ Fine grained lines
▪ Uses Opportunistic Defragmentation
▫ Identify source and target blocks
▫ Evacuate objects from source to target
▪ http://www.cs.utexas.edu/users/speedway/DaCapo/papers/immix-pldi-2008.pdf
Immix: Optimized Mark-Region
▪ Initial Allocation: All blocks are initially empty.
Allocator obtains empty block and allocates object. When
block is exhausted, request a new block. Repeat until heap
is exhausted, then trigger collection.
▪ Identification: Collector traces object graph and marks
objects and lines in a line map
▪ Reclamation: Performs a coarse-grained sweep, identifying
free blocks and lines. Returns free blocks to global pool
and recycles semi-free blocks
▪ Steady State Allocation: Resume allocation into recycled
blocks, skipping over full and empty blocks. Allocator
scans line map to find holes in a recycled block to
allocate objects until the recycled block is exhausted.
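The hole search in steady-state allocation can be sketched as a scan over a per-block line map. This is a simplified illustration: the `LineMark` enum and `next_hole` function are hypothetical, and refinements found in real Immix implementations are left out.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum LineMark {
    Free,
    Live,
}

/// Returns the half-open line range [start, end) of the first run of
/// free lines at or after index `from`, or None if no free line remains.
fn next_hole(lines: &[LineMark], from: usize) -> Option<(usize, usize)> {
    // First free line at or after `from`.
    let start = (from..lines.len()).find(|&i| lines[i] == LineMark::Free)?;
    // The hole extends up to the next live line, or to the end of the block.
    let end = (start..lines.len())
        .find(|&i| lines[i] != LineMark::Free)
        .unwrap_or(lines.len());
    Some((start, end))
}
```

An allocator would bump-allocate within the returned range and call `next_hole` again from `end` once the range is exhausted.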
Immix: Opportunistic Defragmentation
▪ Identify source and target blocks
▪ Evacuate objects from source to target
▪ Single pass to mark and copy
Utilizing Rust: Goals for Immix
▪ Use Immix as proof of concept implementation
▫ High-performance garbage collector
▫ Interesting characteristics beyond simple
mark-sweep or copying collector
▫ Well-documented publicly available reference
▪ Three key principles:
▫ Collector must be high performance
▫ Do not use unsafe code unless unavoidable
▫ Do not modify the Rust language in any way
▪ Four distinct elements in using Rust
▫ Encapsulating Address Types
▫ Ownership of Memory Blocks
▫ Globally Accessible Per-Thread State
▫ Library-Supported Parallelism
Encapsulating Address Types (1)
▪ Address: Arbitrary location in memory space
managed by GC (address arithmetic is necessary)
▪ Object Reference: Maps directly to a
language-level object in raw memory
▪ Important to abstract over both raw addresses
and references to user-level objects
▫ Offers type safety and disambiguation
▫ Differentiate addresses and object references
■ Object reference → Address : Valid
■ Address → Object reference : Unsafe
▪ Must be efficient in space and time
▫ Used pervasively in GC implementation
Encapsulating Address Types (2)
▪ Abstract over word-width integer, usize
▫ Disables operations on inner type
▫ New operations on abstract type
▫ No overhead in type size
▫ Static methods can be marked with
#[inline(always)] to remove call
overhead
▪ Restrict creation of Address from raw
pointer or existing Address
▪ Exception for Address::zero()
▫ Initial value for fields with type
Address within other structs
▫ Safer Alternative: Option<Address>
■ Performance overhead (4%)
▪ ObjectReference similar to Address
▫ Access to per-object memory manager
metadata
▫ ObjectReference can always be safely
cast to Address, but not vice versa
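A minimal sketch of the two abstractions as newtypes over `usize`. The method names here (`from_ptr`, `plus`, `diff`, `to_object_reference`, `to_address`) are illustrative assumptions, not necessarily those of the original implementation.

```rust
/// An arbitrary location in GC-managed memory; same size as a usize.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub struct Address(usize);

impl Address {
    /// The one exception to restricted creation: a placeholder initial value.
    #[inline(always)]
    pub fn zero() -> Address {
        Address(0)
    }

    #[inline(always)]
    pub fn from_ptr<T>(ptr: *const T) -> Address {
        Address(ptr as usize)
    }

    /// Address arithmetic is exposed as methods on the abstract type.
    #[inline(always)]
    pub fn plus(self, bytes: usize) -> Address {
        Address(self.0 + bytes)
    }

    #[inline(always)]
    pub fn diff(self, other: Address) -> usize {
        self.0 - other.0
    }

    #[inline(always)]
    pub fn is_zero(self) -> bool {
        self.0 == 0
    }

    /// Address -> ObjectReference is unsafe: the caller must know that an
    /// object really starts at this address.
    #[inline(always)]
    pub unsafe fn to_object_reference(self) -> ObjectReference {
        ObjectReference(self.0)
    }
}

/// A reference to a language-level object in raw memory.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct ObjectReference(usize);

impl ObjectReference {
    /// ObjectReference -> Address is always valid.
    #[inline(always)]
    pub fn to_address(self) -> Address {
        Address(self.0)
    }
}
```

Since the inner `usize` is private, code outside the defining module cannot add two addresses or build one from an arbitrary integer by accident.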
Ownership of Memory Blocks
▪ Memory Manager must correctly hand out raw
memory blocks to thread-local allocators,
ensuring exclusive ownership of any given
raw block
▪ Guarantee each block is either usable,
used, or being allocated into by a unique thread
▫ Create Block objects
▪ Global memory pool owns the Blocks
▫ Arranges into list of usable and used
▪ When allocator attempts to allocate
▫ Acquires ownership from usable list
▫ Gets memory address and allocation
context from the Block
▫ Allocations into corresponding memory
▪ When thread-local memory Block is full
▫ Block is returned to the used list
▫ Waits there for collection
▫ Moved to usable list if freed
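The block life cycle above maps naturally onto Rust's move semantics. The sketch below is an assumption-laden illustration: `BlockPool` and its methods are hypothetical, and `reclaim_all` assumes for simplicity that collection found every used block free.

```rust
use std::sync::Mutex;

/// A raw memory block. It is neither Clone nor Copy, so at any moment
/// exactly one list or one allocator owns it.
pub struct Block {
    start: usize,
    size: usize,
}

impl Block {
    pub fn start(&self) -> usize {
        self.start
    }
    pub fn size(&self) -> usize {
        self.size
    }
}

/// Global pool that owns every block not currently held by an allocator.
pub struct BlockPool {
    usable: Mutex<Vec<Block>>,
    used: Mutex<Vec<Block>>,
}

impl BlockPool {
    pub fn new(base: usize, block_size: usize, count: usize) -> BlockPool {
        let usable: Vec<Block> = (0..count)
            .map(|i| Block { start: base + i * block_size, size: block_size })
            .collect();
        BlockPool { usable: Mutex::new(usable), used: Mutex::new(Vec::new()) }
    }

    /// Moves a block out of the usable list; the caller becomes its sole owner.
    pub fn acquire(&self) -> Option<Block> {
        self.usable.lock().unwrap().pop()
    }

    /// Moves a full block into the used list, where it waits for collection.
    pub fn retire(&self, block: Block) {
        self.used.lock().unwrap().push(block);
    }

    /// Simplification: treats every used block as freed by the collector.
    pub fn reclaim_all(&self) {
        let mut used = self.used.lock().unwrap();
        self.usable.lock().unwrap().append(&mut *used);
    }

    /// Returns (usable, used) list lengths.
    pub fn counts(&self) -> (usize, usize) {
        let usable = self.usable.lock().unwrap().len();
        let used = self.used.lock().unwrap().len();
        (usable, used)
    }
}
```

Handing the same block to two allocators would require copying a `Block`, which the type does not permit, so the compiler rejects it.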
Globally Accessible Per-Thread State
▪ Thread-local allocator avoids costly
synchronization due to mutual exclusion
among allocators
▪ However, some parts of thread-local data
structure may be shared at collection time
▫ e.g. Allocators told to yield by
collector thread
▫ Rust does not allow for mixed
ownership
▪ Break per-thread Allocator into 2 parts
▫ Thread-local Part:
■ Data that is accessible strictly
within the current thread
■ Arc reference to its global part
▫ Global Part:
■ Safe wrapper for mutable fields
■ Allows shared per-thread data to
be safely accessed globally
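One way to sketch the split is shown below. The type and field names are hypothetical, and an `AtomicBool` yield flag stands in for whatever shared fields a real allocator would need.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Global part: the fields that other threads (e.g. the collector) may touch.
pub struct AllocatorGlobal {
    yield_requested: AtomicBool,
}

impl AllocatorGlobal {
    /// Called by the collector thread to tell the allocator to yield.
    pub fn request_yield(&self) {
        self.yield_requested.store(true, Ordering::Release);
    }
}

/// Thread-local part: owned by one mutator thread and never shared.
pub struct AllocatorLocal {
    cursor: usize,
    limit: usize,
    global: Arc<AllocatorGlobal>,
}

impl AllocatorLocal {
    /// Creates both parts; the returned Arc is the handle given to the collector.
    pub fn new(start: usize, limit: usize) -> (AllocatorLocal, Arc<AllocatorGlobal>) {
        let global = Arc::new(AllocatorGlobal { yield_requested: AtomicBool::new(false) });
        let local = AllocatorLocal { cursor: start, limit, global: Arc::clone(&global) };
        (local, global)
    }

    /// Bump-allocates `size` bytes. Returns None if the collector has asked
    /// this allocator to yield or if the current region is exhausted.
    pub fn alloc(&mut self, size: usize) -> Option<usize> {
        if self.global.yield_requested.load(Ordering::Acquire) {
            return None;
        }
        let end = self.cursor.checked_add(size)?;
        if end > self.limit {
            return None;
        }
        let result = self.cursor;
        self.cursor = end;
        Some(result)
    }
}
```

The fast path touches only thread-local fields plus one atomic load, so no lock is taken on allocation.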
Library-Supported Parallelism
▪ Efficiency depends on implementation of fast, correct, parallel
work queues (mark stack) for pending work
▫ Thread finds new marking work → add object reference to
work queue
▫ Thread needs work → take from work queue
▪ Safe abstractions from standard and external libraries in Rust
▫ std::sync::mpsc
■ Multiple-producers single-consumer FIFO queue
▫ crossbeam::sync::chase_lev
■ Work stealing deque, multiple stealers & single worker
▪ Parallel collector starts single-threaded, work on local queue
▪ Then turns into a controller with multiple stealer collectors
▪ Controller creates asynchronous mpsc channel and shared deque
▫ Collector keeps channel’s receiver end and deque’s worker
▪ Controller is responsible for receiving object references from
stealer threads and pushing them onto the shared deque
▪ Stealers steal work from deque, perform mark and trace, then
push references to local queue (for thread local tracing) or
global deque (if local queue is full)
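The controller/stealer arrangement can be sketched with the standard library alone. Note the substitutions: a `Mutex<VecDeque>` stands in for the crossbeam work-stealing deque, objects are plain indices into an adjacency list, and the `pending` counter used for termination is an assumption of this sketch rather than something stated on the slides.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{mpsc, Mutex};
use std::thread;

/// Hypothetical bound on a stealer's local queue.
const LOCAL_CAPACITY: usize = 4;

/// Marks every object reachable from `roots`; `graph[i]` lists the objects
/// that object `i` refers to. Returns one mark bit per object.
pub fn parallel_mark(graph: &[Vec<usize>], roots: &[usize], workers: usize) -> Vec<bool> {
    let marks: Vec<AtomicBool> = graph.iter().map(|_| AtomicBool::new(false)).collect();
    // Stand-in for the shared deque.
    let shared: Mutex<VecDeque<usize>> = Mutex::new(roots.iter().copied().collect());
    // Number of references sitting in the deque or channel, or being traced.
    let pending = AtomicUsize::new(roots.len());
    let (tx, rx) = mpsc::channel::<usize>();

    thread::scope(|s| {
        for _ in 0..workers.max(1) {
            let tx = tx.clone();
            let (marks, shared, pending) = (&marks, &shared, &pending);
            s.spawn(move || loop {
                let job = shared.lock().unwrap().pop_front();
                let first = match job {
                    Some(obj) => obj,
                    None => {
                        if pending.load(Ordering::SeqCst) == 0 {
                            break; // nothing queued or in flight anywhere
                        }
                        thread::yield_now();
                        continue;
                    }
                };
                // Thread-local tracing from the stolen reference.
                let mut local = vec![first];
                while let Some(obj) = local.pop() {
                    if marks[obj].swap(true, Ordering::SeqCst) {
                        continue; // already marked by some thread
                    }
                    for &child in &graph[obj] {
                        if local.len() < LOCAL_CAPACITY {
                            local.push(child);
                        } else {
                            // Local queue full: hand the reference to the controller.
                            pending.fetch_add(1, Ordering::SeqCst);
                            tx.send(child).unwrap();
                        }
                    }
                }
                pending.fetch_sub(1, Ordering::SeqCst);
            });
        }
        // Controller: keeps the receiver and feeds the shared deque until
        // every stealer has exited and dropped its sender.
        drop(tx);
        for obj in rx {
            shared.lock().unwrap().push_back(obj);
        }
    });
    marks.into_iter().map(|m| m.into_inner()).collect()
}
```

As on the slides, a single controller funnels all overflow work, which is simple but can become a bottleneck as the worker count grows.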
Abusing Rust
▪ Safety model too restrictive at times, so must use unsafe code
▪ Implementing the Line Mark Table
▫ Remember the state of every line in memory
▫ Map of unsigned bytes (u8) for every 256-bytes of memory
▪ Allocation: Multiple allocators may access line mark table
▫ Rust array of u8 disallows concurrent writing
▪ Collection: Set lines to live by atomically storing to a byte
▫ Rust did not support atomic unsigned bytes at the time (AtomicU8 was stabilized later, in Rust 1.34)
▪ Work Around: Generalize Line Mark
Table as AddressMapTable
▪ Wrap unsafe code into impl of
AddressMapTable
▪ Rely on compiler to generate x86
Byte store which is atomic
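For comparison, here is a sketch of a line mark table in safe Rust using `std::sync::atomic::AtomicU8`, which became stable in Rust 1.34, after this work. This is not the slides' `AddressMapTable` workaround; the type, constants and method names are assumptions for illustration.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const LOG_BYTES_IN_LINE: usize = 8; // 256-byte lines, as on the slide
const LINE_FREE: u8 = 0;
const LINE_LIVE: u8 = 1;

/// One mark byte per 256-byte line of the heap.
pub struct LineMarkTable {
    heap_start: usize,
    marks: Vec<AtomicU8>,
}

impl LineMarkTable {
    pub fn new(heap_start: usize, heap_bytes: usize) -> LineMarkTable {
        let lines = heap_bytes >> LOG_BYTES_IN_LINE;
        let marks = (0..lines).map(|_| AtomicU8::new(LINE_FREE)).collect();
        LineMarkTable { heap_start, marks }
    }

    /// Index of the line containing `addr`.
    fn index(&self, addr: usize) -> usize {
        (addr - self.heap_start) >> LOG_BYTES_IN_LINE
    }

    /// Marks the line containing `addr` as live. Takes `&self`, so several
    /// collector threads may call it concurrently without unsafe code.
    pub fn mark_live(&self, addr: usize) {
        self.marks[self.index(addr)].store(LINE_LIVE, Ordering::Relaxed);
    }

    pub fn is_live(&self, addr: usize) -> bool {
        self.marks[self.index(addr)].load(Ordering::Relaxed) == LINE_LIVE
    }
}
```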
4 Evaluation of Garbage
Collector in Rust
Utilization of Rust’s Safety
▪ 58 lines of unsafe code out of 1449 lines
▪ Mainly from:
▫ Required access to raw memory
▫ Work around Rust’s restricted semantics
▪ 96% of the implementation is safe
▪ “While Garbage Collectors are usually
considered low-level modules that operate
heavily on raw memory, the vast majority
of its code can in fact be safe and can
benefit from the implementation language
if the language offers safety”
Immix in Rust vs Immix in C
▪ Three performance-critical paths of the collector
run single-threaded as micro benchmarks
▫ Allocation, Object Marking, Object Tracing
▪ 50 Million objects of 24 bytes each (1200 MB of
heap memory)
▪ 2000 MB memory to control when tracing and
collection occurs (no spontaneous collection)
▪ Rust implementation matches performance of C
implementation across all micro benchmarks
Immix in Rust vs Immix in C
▪ Library-based Parallel Mark and Trace
▫ High-level approach is also performant
▪ 1 Worker: Very large overhead (716 ms vs 605 ms)
▫ Send object references to global deque using asynchronous
channel & stealing from shared deque when local queue is
empty
▪ 2-3 Workers: Satisfactory scaling
▪ 4+ Workers: Scaling falls off slightly
▫ Workers share the resources of a core once every core already hosts 1 worker
▫ One central controller becomes performance bottleneck
Immix in Rust vs BDW in C
▪ Using gcbench & mt-gcbench micro benchmarks
▫ Thread-local allocators and parallel marking
with 8 GC threads on BDW
▪ Bump pointer allocators (Immix) generally
outperform free list allocators (BDW)
▫ Immix implementation is conservative with
stacks but precise with heap, while BDW is
conservative with both
▫ Immix implementation presumes specified heap
size with contiguous memory space while BDW
allows dynamic growing of discontiguous heap
5 Conclusion
References
▪ http://users.cecs.anu.edu.au/~steveb/downloads/pdf/rust-ismm-2016.pdf
▪ http://www.cs.utexas.edu/users/speedway/DaCapo/papers/immix-pldi-2008.pdf
▪ http://slideplayer.com/slide/8827713/