Parallel GC (Chapter 14) — Eleanor Ainy, December 16th 2014





Outline of Today’s Talk

How to use parallelism in each of the 4 components of tracing GC:
• Marking
• Copying
• Sweeping
• Compaction


Till now …

Multiple mutator threads

But only 1 collector thread

Poor use of resources!

Assumption remains: No mutators run in parallel with the collector!

Introduction


Parallel vs. Non-Parallel Collection

[Figure: timeline of mutator execution interleaved with collection cycles 1 and 2.]


The Goal

To reduce:
• Time overhead of garbage collection
• Pause times in case of stop-the-world collection


Parallel GC Challenges

Ensure there is sufficient work to be done. Otherwise it’s not worth it!

Load balancing – distribute work & other resources in a way that minimizes the coordination needed.

Synchronization – needed for both correctness and to avoid repeating work.


More on Load Balancing

Static Partitioning
• Some processors will probably have more work to do compared to others.

• Some processors will exhaust their resources before others do.


Dynamic Load Balancing
• Sometimes it's possible to obtain a good estimate of the amount of work to be done in advance.

• More often it's not possible to estimate that.
Solution:
(1) Over-partition the work into more tasks.
(2) Have each thread compete to claim one task at a time to execute.
Advantages:
(1) More resilient to changes in the number of processors available.
(2) If one task takes longer to execute, other threads can execute any further work.


Why not divide the work into the smallest possible independent tasks?

The coordination cost is too expensive! Synchronization guarantees correctness and avoids unnecessary work, but has time & space overheads!

Algorithms try to minimize the synchronization needed by using thread-local data structures, for instance.


Processor-Centric vs. Memory-Centric

Processor-centric algorithms:
• threads acquire work that varies in size
• threads steal work from other threads
• little regard to the location of the objects

Memory-centric algorithms:
• take location into greater account
• operate on contiguous blocks of heap memory
• acquire/release work from/to shared pools of fixed-size buffers of work


Algorithms’ Abstraction

Assumption: Each collector thread executes the following loop (*):

while not terminated()
    acquireWork()
    performWork()
    generateWork()

(*) in most cases.
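As a concrete (if degenerate, single-threaded) rendering of this abstraction, here is a minimal Python sketch; the class and method names are my own, and acquireWork/generateWork are folded into a single worklist for brevity:

```python
class Collector:
    """Minimal sketch of the acquire/perform/generate loop (hypothetical names)."""
    def __init__(self, roots, children):
        self.worklist = list(roots)   # acquired work
        self.children = children      # object graph: id -> list of child ids
        self.marked = set()

    def terminated(self):
        return not self.worklist

    def perform_work(self):
        ref = self.worklist.pop()
        if ref not in self.marked:
            self.marked.add(ref)
            # generateWork: the children become new work
            self.worklist.extend(self.children.get(ref, []))

    def run(self):
        while not self.terminated():
            self.perform_work()
        return self.marked

graph = {"r": ["a", "b"], "a": ["c"], "b": [], "c": []}
print(sorted(Collector(["r"], graph).run()))   # ['a', 'b', 'c', 'r']
```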



Marking consists of…

1) Acquisition of an object from a work list
2) Testing & setting marks
3) Generating further marking work by adding the object's children to the work list

Parallel Marking


Important Note

All known parallel marking algorithms are processor-centric!


When is Synchronization Required?

No synchronization:
If the work list is thread-local.
Example: when an object's mark is represented by a bit in its header.

Synchronization needed:
Otherwise the thread must acquire work atomically from some other thread's work list or from some global list.
Example: when marks are stored in a shared bitmap.


Endo et al [1997] Parallel Mark Sweep Algorithm

N – total number of threads
Each marker thread has its own:
• local mark stack
• stealable work queue

shared stealableWorkQueue[N]
me ← myThreadId

acquireWork():
    if not isEmpty(myMarkStack)
        return
    stealFromMyself()
    if isEmpty(myMarkStack)
        stealFromOthers()


An idle thread acquires work by first examining its own queue and then other threads' queues.

stealFromMyself():
    lock(stealableWorkQueue[me])
    n ← size(stealableWorkQueue[me]) / 2
    transfer(stealableWorkQueue[me], n, myMarkStack)
    unlock(stealableWorkQueue[me])


stealFromOthers():
    for each j in Threads
        if not locked(stealableWorkQueue[j])
            if lock(stealableWorkQueue[j])
                n ← size(stealableWorkQueue[j]) / 2
                transfer(stealableWorkQueue[j], n, myMarkStack)
                unlock(stealableWorkQueue[j])
                return
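A runnable Python sketch of this stealing scheme. The names, the lock-based marking, and the simple idle-retry exit condition are my own simplifications; the slide's version relies instead on the separate termination detector shown later.

```python
import threading

class Marker:
    """Sketch of Endo-style marking: a private mark stack plus a stealable queue."""
    def __init__(self, graph, marked, mark_lock):
        self.graph = graph
        self.stack = []                 # private mark stack
        self.queue = []                 # stealable work queue
        self.qlock = threading.Lock()
        self.marked = marked            # shared marked set
        self.mark_lock = mark_lock
        self.peers = []

    def _take_half(self, src):          # caller must hold the lock guarding src
        n = max(1, len(src) // 2)
        self.stack.extend(src[:n])
        del src[:n]

    def acquire_work(self):
        if self.stack:
            return True
        with self.qlock:                # stealFromMyself
            if self.queue:
                self._take_half(self.queue)
        if self.stack:
            return True
        for peer in self.peers:         # stealFromOthers (non-blocking try-lock)
            if peer.qlock.acquire(blocking=False):
                try:
                    if peer.queue:
                        self._take_half(peer.queue)
                finally:
                    peer.qlock.release()
                if self.stack:
                    return True
        return False

    def generate_work(self):            # expose surplus work to thieves
        with self.qlock:
            if not self.queue and len(self.stack) > 1:
                self.queue.extend(self.stack[:-1])
                del self.stack[:-1]

    def perform_work(self):
        while self.stack:
            ref = self.stack.pop()
            for child in self.graph.get(ref, ()):
                with self.mark_lock:    # stands in for an atomic mark bit
                    if child in self.marked:
                        continue
                    self.marked.add(child)
                self.stack.append(child)
            self.generate_work()

    def run(self, max_idle=200):        # crude exit: give up after repeated misses
        idle = 0
        while idle < max_idle:
            if self.acquire_work():
                idle = 0
                self.perform_work()
            else:
                idle += 1

graph = {i: [2 * i + 1, 2 * i + 2] for i in range(31)}  # binary tree, ids 0..62
marked, mark_lock = set(), threading.Lock()
a, b = Marker(graph, marked, mark_lock), Marker(graph, marked, mark_lock)
a.peers, b.peers = [b], [a]
marked.add(0)
a.stack.append(0)                       # the root is handed to thread A only
threads = [threading.Thread(target=m.run) for m in (a, b)]
for t in threads: t.start()
for t in threads: t.join()
print(len(marked))                      # 63 reachable objects
```

Because a thread only exits once its own stack and queue are empty, and only the owner ever refills its queue, the full reachable set is marked even if one thread gives up early.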


performWork():
    while pop(myMarkStack, ref)
        for each fld in Pointers(ref)
            child ← *fld
            if child ≠ null && not isMarked(child)
                setMarked(child)
                push(myMarkStack, child)


[Figure: threads A and B, each with a mark stack and a stealable queue, both reaching the same child object C1.]

Notice: it is possible for threads to mark the same child object.


Each thread checks its own work queue; if it is empty, the thread transfers its entire mark stack (apart from local roots) to the queue.

generateWork():
    if isEmpty(stealableWorkQueue[me])
        n ← size(myMarkStack)
        lock(stealableWorkQueue[me])
        transfer(myMarkStack, n, stealableWorkQueue[me])
        unlock(stealableWorkQueue[me])


Parallel Marking With a Bitmap
The collector tests the bit and, only if it isn't set, attempts to set it atomically, retrying if the set fails.

setMarked(ref):
    bitPosition ← markBit(ref)
    loop
        oldByte ← markByte(ref) /* re-read on each retry */
        if isMarked(oldByte, bitPosition)
            return
        newByte ← mark(oldByte, bitPosition)
        if CompareAndSet(&markByte(ref), oldByte, newByte)
            return


CompareAndSet(x, old, new):
    atomic
        curr ← *x
        if curr = old
            *x ← new
            return true
        return false
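Python has no hardware compare-and-set, so the sketch below simulates the atomic step with a lock; the retry loop mirrors setMarked above. All names are my own.

```python
import threading

class MarkBitmap:
    """Sketch of bitmap marking with a simulated compare-and-set."""
    def __init__(self, nbits):
        self.bytes_ = bytearray((nbits + 7) // 8)
        self._lock = threading.Lock()   # stands in for hardware atomicity

    def compare_and_set(self, index, old, new):
        with self._lock:                # atomic: compare, then swap
            if self.bytes_[index] == old:
                self.bytes_[index] = new
                return True
            return False

    def set_marked(self, ref):
        """Returns True if this call marked ref, False if it was already marked."""
        byte_index, bit = divmod(ref, 8)
        while True:
            old = self.bytes_[byte_index]        # re-read on every retry
            if old & (1 << bit):
                return False                     # already marked
            new = old | (1 << bit)
            if self.compare_and_set(byte_index, old, new):
                return True

bm = MarkBitmap(64)
print(bm.set_marked(10), bm.set_marked(10))   # True False
```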


Termination Detection – Reminder From Previous Lecture:
• Separate thread for termination detection.
• Symmetric detection – every thread can play the role of the detector.


shared jobs[N] ← initial work assignments
shared busy[N] ← [true, …]
shared jobsMoved ← false
shared allDone ← false
me ← myThreadId


worker():
    loop
        while not isEmpty(jobs[me])
            job ← dequeue(jobs[me])
            perform job
        if another thread j exists whose jobs set appears relatively large
            some ← stealJobs(j)
            enqueue(jobs[me], some)
            continue
        busy[me] ← false
        while no thread has jobs to steal && not allDone
            /* do nothing: wait for work or termination */
        if allDone
            return
        busy[me] ← true


stealJobs(j):
    some ← atomicallyRemoveJobs(jobs[j])
    if not isEmpty(some)
        jobsMoved ← true
    return some


detect():
    anyActive ← true
    while anyActive
        anyActive ← (∃i) busy[i]
        anyActive ← anyActive || jobsMoved
        jobsMoved ← false
    allDone ← true
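A sequential simulation of the detector's scan loop; the scripted "snapshots" stand in for what a real detector would observe concurrently, so this only illustrates the logic, not the concurrency.

```python
def detect(snapshots):
    """snapshots yields (busy_flags, jobs_moved) pairs, mimicking what the
    detector would observe on successive scans; returns the number of scans
    until termination is declared."""
    scans = 0
    any_active = True
    while any_active:
        busy, jobs_moved = next(snapshots)
        scans += 1
        # a mover may have woken an idle thread, so jobsMoved forces a rescan
        any_active = any(busy) or jobs_moved
    return scans

# Workers go idle one by one; one late steal (jobs_moved) forces an extra scan.
observed = iter([
    ([True, True], False),
    ([False, True], False),
    ([False, False], True),    # all idle, but work moved during the scan
    ([False, False], False),   # clean scan: safe to declare termination
])
print(detect(observed))   # 4
```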


Running Example

Initially: queues are empty!
acquireWork – if the stack is non-empty, returns.
[Figure: threads A and B, each with a mark stack and an empty stealable queue.]


performWork pops, marks and pushes children.
[Figure: thread B pops an object and pushes its marked children onto Stack B.]


generateWork moves all the objects from the stack to the queue!
[Figure: Stack B's remaining objects (O2, O3) are transferred to Queue B.]

acquireWork – if the stack is empty, moves half the queue to the stack.
[Figure: half of Queue B's contents move back to Stack B.]

acquireWork – if the queue is also empty, steals from other queues. This continues until there is no more work (the detector will detect this!).
[Figure: thread B steals half of Queue A's contents.]

Flood et al [2001] Parallel Mark Sweep Algorithm

N – total number of threads
• Each thread has its own stealable deque (double-ended queue).
• The deques are fixed size (to avoid allocation during collection), which can cause overflow.
• All threads share a global overflow set, implemented as a list of lists.

shared overflowSet
shared deque[N]
me ← myThreadId

acquireWork():
    if not isEmpty(deque[me])
        return
    n ← dequeFixedSize / 2
    if extractFromOverflowSet(n)
        return
    stealFromOthers()


• The Java class structure holds the head of a list of overflow objects of that type, linked through the class pointer field in their header.

• An object’s type field can be restored on remove from overflow set (stop-the-world enables the type field to be used here).


Idle threads acquire work by trying to fill half their deque from the overflow set before stealing from other deques.

extractFromOverflowSet(n):
    transfer(overflowSet, n, deque[me])


Idle threads steal work from the top of others’ deques using remove.

stealFromOthers():
    for each j in Threads
        ref ← remove(deque[j])
        if ref ≠ null
            push(deque[me], ref)
            return

remove: requires synchronization!


performWork():
    loop
        ref ← pop(deque[me])
        if ref = null
            return
        for each fld in Pointers(ref)
            child ← *fld
            if child ≠ null && not isMarked(child)
                setMarked(child)
                if not push(deque[me], child)
                    n ← size(deque[me]) / 2
                    transfer(deque[me], n, overflowSet)

pop: requires synchronization only to claim the last element of the deque.
push: does not require synchronization.
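A single-threaded Python sketch of the deque-with-overflow mechanism; the capacity, names, and the omission of locking and stealing are my own simplifications.

```python
from collections import deque

class FloodMarker:
    """Single-threaded sketch of Flood-style marking: a fixed-size deque that
    spills half its contents to a shared overflow set when full."""
    DEQUE_CAP = 4

    def __init__(self, graph):
        self.graph = graph
        self.deque = deque()
        self.overflow = []            # the shared overflow set in the real algorithm
        self.marked = set()

    def push(self, ref):
        if len(self.deque) >= self.DEQUE_CAP:
            return False              # deque full: caller spills to overflow
        self.deque.append(ref)
        return True

    def acquire_work(self):
        if self.deque:
            return True
        n = self.DEQUE_CAP // 2       # refill half the deque from the overflow set
        while self.overflow and len(self.deque) < n:
            self.deque.append(self.overflow.pop())
        return bool(self.deque)

    def perform_work(self):
        while self.deque:
            ref = self.deque.pop()
            for child in self.graph.get(ref, ()):
                if child not in self.marked:
                    self.marked.add(child)
                    if not self.push(child):
                        # overflow: move half the deque to the overflow set
                        for _ in range(len(self.deque) // 2):
                            self.overflow.append(self.deque.popleft())
                        self.push(child)

    def run(self, root):
        self.marked.add(root)
        self.push(root)
        while self.acquire_work():
            self.perform_work()
        return self.marked

graph = {i: [2 * i + 1, 2 * i + 2] for i in range(15)}   # binary tree, ids 0..30
print(len(FloodMarker(graph).run(0)))                    # 31
```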


Work is generated inside performWork by pushing to the deque or transferring to the overflow set.

generateWork():
    /* nop */


Termination Detection
• Variation of the symmetric detection that we saw in the previous lecture.
• Status word – one bit per thread (active/inactive).


Running Example

Initially: deques are non-empty!
acquireWork – if the deque is non-empty, return.
[Figure: threads A and B, each with a non-empty deque.]

performWork – pop, mark and push children.
[Figure: thread B pops objects and pushes their marked children onto its deque.]

performWork – if a push causes overflow, copies half the deque to the overflow set.
[Figure: half of Deque B's contents move to the overflow set.]

performWork – the overflow set in this case:
[Figure: overflow objects linked into per-class lists headed by the Class A and Class B structures.]

acquireWork – if the deque is empty, takes work from the overflow set. If that fails, removes from other deques.
[Figure: thread B steals object O9 from Deque A.]

Mark Stacks With Work Stealing – Disadvantages
• This technique is best employed when the number of threads is known in advance.
• May be difficult for a thread:
    • To choose the best queue from which to steal.
    • To detect termination.

Wu and Li [2007] Parallel Tracing With Channels

• Threads exchange marking tasks through single-writer, single-reader channels.
• In a system of N threads, each thread has an array of N−1 queues.
• Notation for the input channel from thread i to thread j: i → j. This is also an output channel of thread i.

shared channel[N,N]
me ← myThreadId


If the thread’s stack is empty, it takes a task from some input channel k me.

acquireWork():
    if not isEmpty(myMarkStack)
        return
    for each k in Threads
        if not isEmpty(channel[k, me])
            ref ← remove(channel[k, me])
            push(myMarkStack, ref)
            return


Threads first try to add new tasks (marking children) to other threads' input channels (their output channels).

performWork():
    loop
        if isEmpty(myMarkStack)
            return
        ref ← pop(myMarkStack)
        for each fld in Pointers(ref)
            child ← *fld
            if child ≠ null && not isMarked(child)
                if not generateWork(child)
                    push(myMarkStack, child)


• When a thread generates a new task, it first checks whether any other thread k needs work.
• If so, it adds the task to the output channel me → k.
• Otherwise, it pushes the task onto its own stack.

generateWork(ref):
    for each k in Threads
        if needsWork(k) && not isFull(channel[me,k])
            add(channel[me,k], ref)
            return true
    return false
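A single-threaded Python sketch of the channel discipline using bounded queues. The needsWork flags and channel size are made-up illustrations; in the real algorithm each channel needs no atomic operations because it has exactly one writer and one reader.

```python
from queue import Queue, Full, Empty

N = 3
# channel[i][j]: single-writer (thread i), single-reader (thread j) bounded queue
channel = [[Queue(maxsize=1) for _ in range(N)] for _ in range(N)]
needs_work = [False, True, True]   # hypothetical: threads 1 and 2 are idle

def generate_work(me, ref):
    """Offer a new task to some idle thread's input channel; on failure the
    caller keeps the task on its own mark stack."""
    for k in range(N):
        if k != me and needs_work[k]:
            try:
                channel[me][k].put_nowait(ref)
                return True
            except Full:
                continue
    return False

def acquire_work(me):
    """Take one task from some input channel k -> me, if any."""
    for k in range(N):
        if k != me:
            try:
                return channel[k][me].get_nowait()
            except Empty:
                continue
    return None

# Thread 0 generates two tasks: the first fills thread 1's channel,
# so the second flows on to thread 2.
assert generate_work(0, "a") and generate_work(0, "b")
print(acquire_work(1), acquire_work(2))   # a b
```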


Advantages:
• No expensive atomic operations!
• Performs better on servers with many processors.
• Keeps all threads busy.

(*) On a machine with 16 Intel Xeon processors, queues of size one or two were found to scale best.


Copying is Different From Marking…

It’s essential that an object be copied only once!If an object is marked twice it usually does not affect the correctness of the program.

Parallel Copying


Processor-Centric Techniques: Cheng and Blelloch [2001] Parallel Copying

Each copying thread is given its own stack and transfers work between its local stack and a shared stack.

k – size of a local stack

shared sharedStack
myCopyStack[k]
sp ← 0 /* local stack pointer */


Using rooms, they allow multiple threads to:
• pop elements from the shared stack in parallel
• push elements to the shared stack in parallel
But not pop and push in parallel!

shared gate ← OPEN
shared popClients /* number of clients in the pop room */
shared pushClients /* number of clients in the push room */


while not terminated()
    enterRoom() /* enter pop room */
    for i ← 1 to k
        if isLocalStackEmpty()
            acquireWork()
        if isLocalStackEmpty()
            break
        performWork()
    transitionRooms()
    generateWork()
    if exitRoom() /* exit push room */
        terminate()

acquireWork():
    sharedPop()

performWork():
    ref ← localPop()
    scan(ref)

generateWork():
    sharedPush()

isLocalStackEmpty():
    return sp = 0


localPush(ref):
    myCopyStack[sp++] ← ref

localPop():
    return myCopyStack[--sp]

[Figure: the local stack; sp marks the top, moved down by localPop and up by localPush.]


sharedPop():
    cursor ← FetchAndAdd(&sharedStack, 1)
    if cursor ≥ stackLimit /* stack was empty */
        FetchAndAdd(&sharedStack, -1)
    else
        myCopyStack[sp++] ← cursor[0]

FetchAndAdd(x, v):
    atomic
        old ← *x
        *x ← old + v
        return old


sharedPush():
    cursor ← FetchAndAdd(&sharedStack, -sp) - sp
    for i ← 0 to sp-1
        cursor[i] ← myCopyStack[i]
    sp ← 0
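A Python sketch of the fetch-and-add shared stack. The lock merely simulates hardware fetch-and-add, the cursor here grows upward rather than in the slide's direction, and the rooms that keep pops and pushes from overlapping are elided.

```python
import threading

class SharedStack:
    """Sketch of a Cheng-Blelloch-style shared stack: a fixed array plus a
    cursor moved with (simulated) fetch-and-add."""
    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.top = 0                  # index of the next free slot
        self._lock = threading.Lock()

    def fetch_and_add(self, v):
        with self._lock:              # simulated atomic fetch-and-add
            old = self.top
            self.top = old + v
            return old

    def shared_push(self, local_stack):
        """Flush a whole local stack with one fetch-and-add reservation."""
        n = len(local_stack)
        base = self.fetch_and_add(n)  # reserve n contiguous slots
        self.slots[base:base + n] = local_stack
        local_stack.clear()

    def shared_pop(self):
        """Claim one element; undo the reservation if the stack was empty."""
        cursor = self.fetch_and_add(-1) - 1
        if cursor < 0:
            self.fetch_and_add(1)     # empty: restore the cursor
            return None
        return self.slots[cursor]

s = SharedStack(16)
local = ["x", "y", "z"]
s.shared_push(local)
print(s.shared_pop(), s.shared_pop(), s.shared_pop(), s.shared_pop())
# z y x None
```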


enterRoom():
    while gate ≠ OPEN
        /* do nothing: wait */
    FetchAndAdd(&popClients, 1)
    while gate ≠ OPEN
        FetchAndAdd(&popClients, -1) /* failure – return to previous state */
        while gate ≠ OPEN
            /* do nothing: wait */
        FetchAndAdd(&popClients, 1) /* try again */


transitionRooms(): /* move from pop room to push room */
    gate ← CLOSED /* close gate to pop room */
    FetchAndAdd(&pushClients, 1)
    FetchAndAdd(&popClients, -1)
    while popClients > 0
        /* do nothing: wait till none popping */


exitRoom():
    pushers ← FetchAndAdd(&pushClients, -1) - 1
    if pushers = 0 /* last in push room */
        gate ← OPEN
        if isEmpty(sharedStack) /* no work left */
            return true
    return false


Problem:
Any processor waiting to enter the push room must wait until all processors in the pop room have finished their work!

Possible Solution:
The work can be done outside the rooms! This increases the likelihood that the pop room is empty, so threads can enter the push room more quickly.


• Divide the heap into small, fixed-size chunks.
• Each thread receives its own chunks to scan and into which to copy survivors.
• Once a thread's copy chunk is full, it is transferred to a global pool where idle threads compete to scan it, and a new empty chunk is obtained for the thread itself.

Memory-Centric Techniques: Block-Structured Heaps


Mechanisms Used To Ensure Good Load Balancing:
• Chunks acquired were small (256 words).
• To avoid fragmentation, they used big bag of pages allocation for small objects.
• Larger objects and chunks were allocated from the shared heap using a lock.


• Balanced load at a finer granularity.
• Each chunk was divided into smaller blocks (32 words).


• After scanning a slot, the thread checks whether it has reached the block boundary.
• If so, and the next object was smaller than a block:
    • the thread advanced its scan pointer to the start of its current copy block.
    • This reduced contention – the thread did not have to compete to acquire a new scan block.
    • Un-scanned blocks in that area are given to the global pool.
• If the object was larger than a block but smaller than a chunk, the scan pointer was advanced to the start of its current copy chunk.
• If the object was large, the thread continued to scan it.


Block States and Transitions: [figure omitted]


State Transition Logic: [figure omitted]


Parallel Sweeping

Simple Strategies
1) Statically partition the heap into contiguous blocks for threads to sweep.
2) Over-partition the heap and have threads compete for a block to sweep to a free-list.

Problem: The free-list becomes a bottleneck!
Solution: Processors will have their own free-lists.


Endo et al [1997] Lazy Sweeping
• A naturally parallel solution to sweeping partially full blocks.
• In the sweep phase, we need to identify empty blocks and return them to the block allocator.
• Need to reduce contention.
• Gave each thread several consecutive blocks to process locally.
• They used bitmap marking, with bitmaps held in block headers (used to determine whether a block is empty or not).
• Empty blocks are added to a local free-block list.
• Partially full blocks are added to a local reclaim list for subsequent lazy sweeping.
• Once a processor finishes with its sweep set, it merges its local lists with the global free-block list.
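The per-thread classification step can be sketched as follows; merging with the global list and all locking are elided, and the per-block bitmaps are hypothetical.

```python
def sweep_blocks(my_blocks, block_bitmaps):
    """Sketch of Endo-style parallel sweeping: a thread classifies its own
    consecutive blocks locally before merging with the global lists."""
    local_free, local_reclaim = [], []
    for b in my_blocks:
        if any(block_bitmaps[b]):
            local_reclaim.append(b)   # partially full: lazy-sweep later
        else:
            local_free.append(b)      # empty: return to the block allocator
    return local_free, local_reclaim

# Hypothetical per-block mark bitmaps: blocks 0 and 2 are empty.
bitmaps = {0: [0, 0], 1: [1, 0], 2: [0, 0], 3: [1, 1]}
print(sweep_blocks([0, 1, 2, 3], bitmaps))   # ([0, 2], [1, 3])
```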


Parallel Compaction

Flood et al [2001] Parallel Mark-Compact

Observation:
Uniprocessor compaction algorithms typically slide all live data to one end of the heap space.
If multiple threads do so in parallel, one thread can overwrite live data before another thread has moved it!
[Figure: two threads compacting in parallel; one thread's copied data overwrites live data the other has not yet moved.]


Suggested Solution:
• Divide the heap space into several regions, one for each compacting thread.
• To reduce fragmentation, they also have threads alternate the direction in which they move objects in even and odd numbered regions.


4 Phases:
1) Parallel marking.
2) Calculate forwarding addresses.
3) Update references.
4) Move objects.


Phase 2 – Calculating Forwarding Addresses:
• Over-partition the space into M = 4N (N – number of threads) units of roughly the same size.
• Threads compete to claim units.
• Each thread counts the volume of live data in its unit.
• According to these volumes, they partition the space into N regions that contain approximately the same amount of live data.
• Threads compete to claim units and install forwarding addresses for each live object in their units.

[Figure: M = 12 units, N = 3 regions/threads; the units' live volumes are grouped into three regions of roughly equal total (30, 29, 30).]
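A greedy sketch of this volume-based partitioning; the splitting heuristic and the unit volumes below are my own illustration, not the algorithm's exact rule.

```python
def partition_regions(live_volumes, n_regions):
    """Split per-unit live volumes into n contiguous regions whose live-data
    totals are roughly equal (greedy: close a region when its running total
    is about to move away from the target)."""
    total = sum(live_volumes)
    target = total / n_regions
    regions, current, acc = [], [], 0
    for i, v in enumerate(live_volumes):
        remaining = len(live_volumes) - i
        must_keep = n_regions - len(regions) - 1   # units needed for later regions
        if (current and len(regions) < n_regions - 1
                and remaining > must_keep
                and abs(acc - target) <= abs(acc + v - target)):
            regions.append(current)                # adding v would overshoot: close here
            current, acc = [], 0
        current.append(v)
        acc += v
    regions.append(current)
    return regions

units = [3, 6, 13, 7, 10, 5, 7, 5, 12, 4, 8, 9]    # hypothetical unit live volumes
print([sum(r) for r in partition_regions(units, 3)])   # [29, 27, 33]
```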


Phase 3 – Updating References:
• Updating references to point to objects' new locations requires scanning:
    • Objects stored in mutator threads' stacks that might contain references to objects in the heap space (young generation).
    • Live objects in the heap space (old generation).
• Threads compete to claim old generation units to scan, and a single thread scans the young generation.

Phase 4 – Moving Objects:
• Each thread is in charge of a region.
• Good load balancing is guaranteed because the regions contain roughly equal volumes of live data.


Disadvantages:
1) The algorithm makes 3 passes over the heap, while other compacting algorithms make fewer passes.
2) Rather than compacting all live data to one end of the heap, the algorithm compacts into N regions, leaving (N+1)/2 gaps for allocation. If a large number of threads is used, it's difficult for mutators to allocate very large objects.


Abuaiadh et al [2004] Parallel Mark-Compact

1) Address the 3 passes problem:
• Calculate rather than store forwarding addresses, using the mark bitmap and an offset vector that holds the new address of the first live object in each block.
• To construct the offset vector, one pass over the mark-bit vector is needed.
• Only a single pass over the heap is needed to move objects and update references using these vectors.


• Bits in the mark-bit vector indicate the start and end of each live object.
• Words in the offset vector hold the address to which the first live object in their corresponding block will be moved.
• Forwarding addresses are not stored, but are calculated when needed from the offset and mark-bit vectors.
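A Python sketch of calculating a forwarding address from the two vectors, under the simplifying assumption of one mark bit per heap word (the real scheme marks the first and last words of each object). All names and the example heap are my own.

```python
def forwarding_address(mark_bits, offsets, block_size, addr):
    """New address = the block's offset-vector entry plus the number of live
    words preceding addr within its block (counted from the mark-bit vector)."""
    block = addr // block_size
    start = block * block_size
    live_before = sum(mark_bits[start:addr])   # live words before addr in its block
    return offsets[block] + live_before

block_size = 4
# 8-word heap with single-word live objects at words 1, 3 and 5
mark_bits = [0, 1, 0, 1, 0, 1, 0, 0]
# offset vector: block 0's first live object moves to 0; block 1's first
# live object lands after block 0's two live words
offsets = [0, 2]
print([forwarding_address(mark_bits, offsets, block_size, a)
       for a in (1, 3, 5)])   # [0, 1, 2]
```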


2) Address the small gaps problem:
• Over-partition the heap into fairly large areas.
• Threads race to claim the next area to compact, using an atomic operation to increment a global area index.
• If the thread succeeds, it has obtained an area to compact.
• If it fails, it tries to claim the next area.


• A table holds pointers to the beginning of the free space for each area.
• After winning an area to compact, a thread races to obtain an area into which it can move objects. It claims an area by trying to write null into its corresponding table slot.
• Threads never try to compact from or into an area whose table entry is null.
• Objects are never moved from a lower to a higher numbered area.
• Progress is guaranteed since a thread can always compact an area into itself.
• Once a thread has finished with an area, it updates the area's free pointer. If an area is full, its free space pointer will remain null.


[Figure: three heap areas and the free-pointer table; each entry holds the start of its area's free space (e.g. 200, 1000, 1800), and an area claimed as a compaction target has its entry set to NULL while objects are moved into it.]

2) Address the small gaps problem (continued):
Explored two ways in which objects can be moved:
a. Slide object by object.
b. To reduce compaction time, slide only complete blocks (256 bytes). Free space in each block is not squeezed out.

Discussion

• What is the tradeoff in the choice of the chunk size in parallel copying?
• Parallel copying with no synchronization can cause issues. For example, if an object is copied twice by two different threads, what can be the consequences?

[Figure: an object copied twice by two different threads, with other references split between the two inconsistent copies.]


Something Extra

https://www.youtube.com/watch?v=YhKZe22tZlc


Conclusions & Summary

• There should be enough work for parallel collection.
• Need to take into account synchronization costs.
• Need to balance loads between the multiple threads.
• Learned different algorithms for marking, sweeping, copying and compaction that take all these challenges into account.
• Difference between marking and copying – marking an object twice is not so bad; copying an object twice can harm correctness.