Parallel GC (Chapter 14) — Eleanor Ainy, December 16th 2014





Outline of Today’s Talk

How to use parallelism in each of the 4 components of tracing GC:
• Marking
• Copying
• Sweeping
• Compaction


Till now …

Multiple mutator threads

But only 1 collector thread

Poor use of resources!

Assumption remains: No mutators run in parallel with the collector!

Introduction


Parallel vs. Non-Parallel Collection

[Figure: timeline of mutator execution interleaved with collection cycles 1 and 2.]


The Goal

To reduce:
• Time overhead of garbage collection
• Pause times in case of stop-the-world collection


Parallel GC Challenges

Ensure there is sufficient work to be done. Otherwise it’s not worth it!

Load balancing – distribute work & other resources in a way that minimizes the coordination needed.

Synchronization – needed for both correctness and to avoid repeating work.


More on Load Balancing

Static Partitioning
• Some processors will probably have more work to do compared to others.

• Some processors will exhaust their resources before others do.


Dynamic Load Balancing
• Sometimes it's possible to obtain a good estimate of the amount of work to be done in advance.

• More often it's not possible to estimate that.
Solution:
(1) Over-partition the work into more tasks.
(2) Have each thread compete to claim one task at a time to execute.
Advantages:
(1) More resilient to changes in the number of processors available.
(2) If one task takes longer to execute, other threads can execute any further work.


Why not divide the work into the smallest possible independent tasks?

The coordination cost is too expensive! Synchronization guarantees correctness and avoids unnecessary work, but has time & space overheads!

Algorithms try to minimize the synchronization needed by using thread-local data structures, for instance.


Processor-Centric vs. Memory-Centric

Processor-centric algorithms:
• threads acquire work that varies in size
• threads steal work from other threads
• little regard to the location of the objects

Memory-centric algorithms:
• take location into greater account
• operate on contiguous blocks of heap memory
• acquire/release work from/to shared pools of fixed-size buffers of work


Algorithms’ Abstraction

Assumption: Each collector thread executes the following loop (*):

while not terminated()
    acquireWork()
    performWork()
    generateWork()

(*) in most cases.
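As a concrete (if degenerate, single-threaded) rendering of this abstraction, here is a minimal Python sketch; the class and method names are my own, and acquireWork/generateWork are folded into a single worklist for brevity:

```python
class Collector:
    """Minimal sketch of the acquire/perform/generate loop (hypothetical names)."""
    def __init__(self, roots, children):
        self.worklist = list(roots)   # acquired work
        self.children = children      # object graph: id -> list of child ids
        self.marked = set()

    def terminated(self):
        return not self.worklist

    def perform_work(self):
        ref = self.worklist.pop()
        if ref not in self.marked:
            self.marked.add(ref)
            # generateWork: the children become new work
            self.worklist.extend(self.children.get(ref, []))

    def run(self):
        while not self.terminated():
            self.perform_work()
        return self.marked

graph = {"r": ["a", "b"], "a": ["c"], "b": [], "c": []}
print(sorted(Collector(["r"], graph).run()))   # ['a', 'b', 'c', 'r']
```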



Marking consists of…

1) Acquisition of an object from a work list
2) Testing & setting marks
3) Generating further marking work by adding the object's children to the work list

Parallel Marking


Important Note

All known parallel marking algorithms are processor-centric!


When is Synchronization Required?

No synchronization:
If the work list is thread-local.
Example: when an object's mark is represented by a bit in its header.

Synchronization needed:
Otherwise the thread must acquire work atomically from some other thread's work list or from some global list.
Example: when marks are stored in a shared bitmap.


Endo et al [1997] Parallel Mark Sweep Algorithm

N – total number of threads
Each marker thread has its own:
• local mark stack
• stealable work queue

shared stealableWorkQueue[N]
me ← myThreadId

acquireWork():
    if not isEmpty(myMarkStack)
        return
    stealFromMyself()
    if isEmpty(myMarkStack)
        stealFromOthers()


An idle thread acquires work by first examining its own queue and then other threads' queues.

stealFromMyself():
    lock(stealableWorkQueue[me])
    n ← size(stealableWorkQueue[me]) / 2
    transfer(stealableWorkQueue[me], n, myMarkStack)
    unlock(stealableWorkQueue[me])


stealFromOthers():
    for each j in Threads
        if not locked(stealableWorkQueue[j])
            if lock(stealableWorkQueue[j])
                n ← size(stealableWorkQueue[j]) / 2
                transfer(stealableWorkQueue[j], n, myMarkStack)
                unlock(stealableWorkQueue[j])
                return
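A runnable Python sketch of this stealing scheme. The names, the lock-based marking, and the simple idle-retry exit condition are my own simplifications; the slide's version relies instead on the separate termination detector shown later.

```python
import threading

class Marker:
    """Sketch of Endo-style marking: a private mark stack plus a stealable queue."""
    def __init__(self, graph, marked, mark_lock):
        self.graph = graph
        self.stack = []                 # private mark stack
        self.queue = []                 # stealable work queue
        self.qlock = threading.Lock()
        self.marked = marked            # shared marked set
        self.mark_lock = mark_lock
        self.peers = []

    def _take_half(self, src):          # caller must hold the lock guarding src
        n = max(1, len(src) // 2)
        self.stack.extend(src[:n])
        del src[:n]

    def acquire_work(self):
        if self.stack:
            return True
        with self.qlock:                # stealFromMyself
            if self.queue:
                self._take_half(self.queue)
        if self.stack:
            return True
        for peer in self.peers:         # stealFromOthers (non-blocking try-lock)
            if peer.qlock.acquire(blocking=False):
                try:
                    if peer.queue:
                        self._take_half(peer.queue)
                finally:
                    peer.qlock.release()
                if self.stack:
                    return True
        return False

    def generate_work(self):            # expose surplus work to thieves
        with self.qlock:
            if not self.queue and len(self.stack) > 1:
                self.queue.extend(self.stack[:-1])
                del self.stack[:-1]

    def perform_work(self):
        while self.stack:
            ref = self.stack.pop()
            for child in self.graph.get(ref, ()):
                with self.mark_lock:    # stands in for an atomic mark bit
                    if child in self.marked:
                        continue
                    self.marked.add(child)
                self.stack.append(child)
            self.generate_work()

    def run(self, max_idle=200):        # crude exit: give up after repeated misses
        idle = 0
        while idle < max_idle:
            if self.acquire_work():
                idle = 0
                self.perform_work()
            else:
                idle += 1

graph = {i: [2 * i + 1, 2 * i + 2] for i in range(31)}  # binary tree, ids 0..62
marked, mark_lock = set(), threading.Lock()
a, b = Marker(graph, marked, mark_lock), Marker(graph, marked, mark_lock)
a.peers, b.peers = [b], [a]
marked.add(0)
a.stack.append(0)                       # the root is handed to thread A only
threads = [threading.Thread(target=m.run) for m in (a, b)]
for t in threads: t.start()
for t in threads: t.join()
print(len(marked))                      # 63 reachable objects
```

Because a thread only exits once its own stack and queue are empty, and only the owner ever refills its queue, the full reachable set is marked even if one thread gives up early.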


performWork():
    while pop(myMarkStack, ref)
        for each fld in Pointers(ref)
            child ← *fld
            if child ≠ null && not isMarked(child)
                setMarked(child)
                push(myMarkStack, child)


[Figure: threads A and B, each with a mark stack and a stealable queue, both reaching the same child object C1.]

Notice: it is possible for threads to mark the same child object.


Each thread checks its own work queue; if it is empty, the thread transfers its entire mark stack (apart from local roots) to the queue.

generateWork():
    if isEmpty(stealableWorkQueue[me])
        n ← size(myMarkStack)
        lock(stealableWorkQueue[me])
        transfer(myMarkStack, n, stealableWorkQueue[me])
        unlock(stealableWorkQueue[me])


Parallel Marking With a Bitmap
The collector tests the bit and, only if it isn't set, attempts to set it atomically, retrying if the set fails.

setMarked(ref):
    bitPosition ← markBit(ref)
    loop
        oldByte ← markByte(ref) /* re-read on each retry */
        if isMarked(oldByte, bitPosition)
            return
        newByte ← mark(oldByte, bitPosition)
        if CompareAndSet(&markByte(ref), oldByte, newByte)
            return


CompareAndSet(x, old, new):
    atomic
        curr ← *x
        if curr = old
            *x ← new
            return true
        return false
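Python has no hardware compare-and-set, so the sketch below simulates the atomic step with a lock; the retry loop mirrors setMarked above. All names are my own.

```python
import threading

class MarkBitmap:
    """Sketch of bitmap marking with a simulated compare-and-set."""
    def __init__(self, nbits):
        self.bytes_ = bytearray((nbits + 7) // 8)
        self._lock = threading.Lock()   # stands in for hardware atomicity

    def compare_and_set(self, index, old, new):
        with self._lock:                # atomic: compare, then swap
            if self.bytes_[index] == old:
                self.bytes_[index] = new
                return True
            return False

    def set_marked(self, ref):
        """Returns True if this call marked ref, False if it was already marked."""
        byte_index, bit = divmod(ref, 8)
        while True:
            old = self.bytes_[byte_index]        # re-read on every retry
            if old & (1 << bit):
                return False                     # already marked
            new = old | (1 << bit)
            if self.compare_and_set(byte_index, old, new):
                return True

bm = MarkBitmap(64)
print(bm.set_marked(10), bm.set_marked(10))   # True False
```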


Termination Detection – Reminder From Previous Lecture:
• Separate thread for termination detection.
• Symmetric detection – every thread can play the role of the detector.


shared jobs[N] ← initial work assignments
shared busy[N] ← [true, …]
shared jobsMoved ← false
shared allDone ← false
me ← myThreadId


worker():
    loop
        while not isEmpty(jobs[me])
            job ← dequeue(jobs[me])
            perform job
        if another thread j exists whose jobs set appears relatively large
            some ← stealJobs(j)
            enqueue(jobs[me], some)
            continue
        busy[me] ← false
        while no thread has jobs to steal && not allDone
            /* do nothing: wait for work or termination */
        if allDone
            return
        busy[me] ← true


stealJobs(j):
    some ← atomicallyRemoveJobs(jobs[j])
    if not isEmpty(some)
        jobsMoved ← true
    return some


detect():
    anyActive ← true
    while anyActive
        anyActive ← (∃i) busy[i]
        anyActive ← anyActive || jobsMoved
        jobsMoved ← false
    allDone ← true
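A sequential simulation of the detector's scan loop; the scripted "snapshots" stand in for what a real detector would observe concurrently, so this only illustrates the logic, not the concurrency.

```python
def detect(snapshots):
    """snapshots yields (busy_flags, jobs_moved) pairs, mimicking what the
    detector would observe on successive scans; returns the number of scans
    until termination is declared."""
    scans = 0
    any_active = True
    while any_active:
        busy, jobs_moved = next(snapshots)
        scans += 1
        # a mover may have woken an idle thread, so jobsMoved forces a rescan
        any_active = any(busy) or jobs_moved
    return scans

# Workers go idle one by one; one late steal (jobs_moved) forces an extra scan.
observed = iter([
    ([True, True], False),
    ([False, True], False),
    ([False, False], True),    # all idle, but work moved during the scan
    ([False, False], False),   # clean scan: safe to declare termination
])
print(detect(observed))   # 4
```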


Running Example

Initially: queues are empty!
acquireWork – if the stack is non-empty, returns.
[Figure: threads A and B, each with a mark stack and an empty stealable queue.]


performWork pops, marks and pushes children.
[Figure: thread B pops an object and pushes its marked children onto Stack B.]


generateWork moves all the objects from the stack to the queue!
[Figure: Stack B's remaining objects (O2, O3) are transferred to Queue B.]

acquireWork – if the stack is empty, moves half the queue to the stack.
[Figure: half of Queue B's contents move back to Stack B.]

acquireWork – if the queue is also empty, steals from other queues. This continues until there is no more work (the detector will detect this!).
[Figure: thread B steals half of Queue A's contents.]

Flood et al [2001] Parallel Mark Sweep Algorithm

N – total number of threads
• Each thread has its own stealable deque (double-ended queue).
• The deques are fixed size (to avoid allocation during collection), which can cause overflow.
• All threads share a global overflow set, implemented as a list of lists.

shared overflowSet
shared deque[N]
me ← myThreadId

acquireWork():
    if not isEmpty(deque[me])
        return
    n ← dequeFixedSize / 2
    if extractFromOverflowSet(n)
        return
    stealFromOthers()


• The Java class structure holds the head of a list of overflow objects of that type, linked through the class pointer field in their header.

• An object’s type field can be restored on remove from overflow set (stop-the-world enables the type field to be used here).


Idle threads acquire work by trying to fill half their deque from the overflow set before stealing from other deques.

extractFromOverflowSet(n):
    transfer(overflowSet, n, deque[me])


Idle threads steal work from the top of others’ deques using remove.

stealFromOthers():
    for each j in Threads
        ref ← remove(deque[j])
        if ref ≠ null
            push(deque[me], ref)
            return

remove: requires synchronization!


performWork():
    loop
        ref ← pop(deque[me])
        if ref = null
            return
        for each fld in Pointers(ref)
            child ← *fld
            if child ≠ null && not isMarked(child)
                setMarked(child)
                if not push(deque[me], child)
                    n ← size(deque[me]) / 2
                    transfer(deque[me], n, overflowSet)

pop: requires synchronization only to claim the last element of the deque.
push: does not require synchronization.
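A single-threaded Python sketch of the deque-with-overflow mechanism; the capacity, names, and the omission of locking and stealing are my own simplifications.

```python
from collections import deque

class FloodMarker:
    """Single-threaded sketch of Flood-style marking: a fixed-size deque that
    spills half its contents to a shared overflow set when full."""
    DEQUE_CAP = 4

    def __init__(self, graph):
        self.graph = graph
        self.deque = deque()
        self.overflow = []            # the shared overflow set in the real algorithm
        self.marked = set()

    def push(self, ref):
        if len(self.deque) >= self.DEQUE_CAP:
            return False              # deque full: caller spills to overflow
        self.deque.append(ref)
        return True

    def acquire_work(self):
        if self.deque:
            return True
        n = self.DEQUE_CAP // 2       # refill half the deque from the overflow set
        while self.overflow and len(self.deque) < n:
            self.deque.append(self.overflow.pop())
        return bool(self.deque)

    def perform_work(self):
        while self.deque:
            ref = self.deque.pop()
            for child in self.graph.get(ref, ()):
                if child not in self.marked:
                    self.marked.add(child)
                    if not self.push(child):
                        # overflow: move half the deque to the overflow set
                        for _ in range(len(self.deque) // 2):
                            self.overflow.append(self.deque.popleft())
                        self.push(child)

    def run(self, root):
        self.marked.add(root)
        self.push(root)
        while self.acquire_work():
            self.perform_work()
        return self.marked

graph = {i: [2 * i + 1, 2 * i + 2] for i in range(15)}   # binary tree, ids 0..30
print(len(FloodMarker(graph).run(0)))                    # 31
```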


Work is generated inside performWork by pushing to the deque or transferring to the overflow set.

generateWork():
    /* nop */


Termination Detection
• Variation of the symmetric detection that we saw in the previous lecture.
• Status word – one bit per thread (active/inactive).


Running Example

Initially: deques are non-empty!
acquireWork – if the deque is non-empty, return.
[Figure: threads A and B, each with a non-empty deque.]

performWork – pop, mark and push children.
[Figure: thread B pops objects and pushes their marked children onto its deque.]

performWork – if a push causes overflow, copies half the deque to the overflow set.
[Figure: half of Deque B's contents move to the overflow set.]

performWork – the overflow set in this case:
[Figure: overflow objects linked into per-class lists headed by the Class A and Class B structures.]

acquireWork – if the deque is empty, takes work from the overflow set. If that fails, removes from other deques.
[Figure: thread B steals object O9 from Deque A.]

Mark Stacks With Work Stealing – Disadvantages
• This technique is best employed when the number of threads is known in advance.
• May be difficult for a thread:
    • To choose the best queue from which to steal.
    • To detect termination.

Wu and Li [2007] Parallel Tracing With Channels

• Threads exchange marking tasks through single-writer, single-reader channels.
• In a system of N threads, each thread has an array of N−1 queues.
• Notation for the input channel from thread i to thread j: i → j. This is also an output channel of thread i.

shared channel[N,N]
me ← myThreadId


If the thread’s stack is empty, it takes a task from some input channel k me.

acquireWork():
    if not isEmpty(myMarkStack)
        return
    for each k in Threads
        if not isEmpty(channel[k, me])
            ref ← remove(channel[k, me])
            push(myMarkStack, ref)
            return


Threads first try to add new tasks (marking children) to other threads' input channels (their output channels).

performWork():
    loop
        if isEmpty(myMarkStack)
            return
        ref ← pop(myMarkStack)
        for each fld in Pointers(ref)
            child ← *fld
            if child ≠ null && not isMarked(child)
                if not generateWork(child)
                    push(myMarkStack, child)


• When a thread generates a new task, it first checks whether any other thread k needs work.
• If so, it adds the task to the output channel me → k.
• Otherwise, it pushes the task onto its own stack.

generateWork(ref):
    for each k in Threads
        if needsWork(k) && not isFull(channel[me,k])
            add(channel[me,k], ref)
            return true
    return false
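A single-threaded Python sketch of the channel discipline using bounded queues. The needsWork flags and channel size are made-up illustrations; in the real algorithm each channel needs no atomic operations because it has exactly one writer and one reader.

```python
from queue import Queue, Full, Empty

N = 3
# channel[i][j]: single-writer (thread i), single-reader (thread j) bounded queue
channel = [[Queue(maxsize=1) for _ in range(N)] for _ in range(N)]
needs_work = [False, True, True]   # hypothetical: threads 1 and 2 are idle

def generate_work(me, ref):
    """Offer a new task to some idle thread's input channel; on failure the
    caller keeps the task on its own mark stack."""
    for k in range(N):
        if k != me and needs_work[k]:
            try:
                channel[me][k].put_nowait(ref)
                return True
            except Full:
                continue
    return False

def acquire_work(me):
    """Take one task from some input channel k -> me, if any."""
    for k in range(N):
        if k != me:
            try:
                return channel[k][me].get_nowait()
            except Empty:
                continue
    return None

# Thread 0 generates two tasks: the first fills thread 1's channel,
# so the second flows on to thread 2.
assert generate_work(0, "a") and generate_work(0, "b")
print(acquire_work(1), acquire_work(2))   # a b
```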


Advantages:
• No expensive atomic operations!
• Performs better on servers with many processors.
• Keeps all threads busy.

(*) On a machine with 16 Intel Xeon processors, queues of size one or two were found to scale best.


Copying is Different From Marking…

It’s essential that an object be copied only once!If an object is marked twice it usually does not affect the correctness of the program.

Parallel Copying


Processor-Centric Techniques: Cheng and Blelloch [2001] Parallel Copying

Each copying thread is given its own stack and transfers work between its local stack and a shared stack.

k – size of a local stack

shared sharedStack
myCopyStack[k]
sp ← 0 /* local stack pointer */


Using rooms, they allow multiple threads to:
• pop elements from the shared stack in parallel
• push elements to the shared stack in parallel
But not pop and push in parallel!

shared gate ← OPEN
shared popClients /* number of clients in the pop room */
shared pushClients /* number of clients in the push room */


while not terminated()
    enterRoom() /* enter pop room */
    for i ← 1 to k
        if isLocalStackEmpty()
            acquireWork()
        if isLocalStackEmpty()
            break
        performWork()
    transitionRooms()
    generateWork()
    if exitRoom() /* exit push room */
        terminate()

acquireWork():
    sharedPop()

performWork():
    ref ← localPop()
    scan(ref)

generateWork():
    sharedPush()

isLocalStackEmpty():
    return sp = 0


localPush(ref):
    myCopyStack[sp++] ← ref

localPop():
    return myCopyStack[--sp]

[Figure: the local stack; sp marks the top, moved down by localPop and up by localPush.]


sharedPop():
    cursor ← FetchAndAdd(&sharedStack, 1)
    if cursor ≥ stackLimit /* stack was empty */
        FetchAndAdd(&sharedStack, -1)
    else
        myCopyStack[sp++] ← cursor[0]

FetchAndAdd(x, v):
    atomic
        old ← *x
        *x ← old + v
        return old


sharedPush():
    cursor ← FetchAndAdd(&sharedStack, -sp) - sp
    for i ← 0 to sp-1
        cursor[i] ← myCopyStack[i]
    sp ← 0
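A Python sketch of the fetch-and-add shared stack. The lock merely simulates hardware fetch-and-add, the cursor here grows upward rather than in the slide's direction, and the rooms that keep pops and pushes from overlapping are elided.

```python
import threading

class SharedStack:
    """Sketch of a Cheng-Blelloch-style shared stack: a fixed array plus a
    cursor moved with (simulated) fetch-and-add."""
    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.top = 0                  # index of the next free slot
        self._lock = threading.Lock()

    def fetch_and_add(self, v):
        with self._lock:              # simulated atomic fetch-and-add
            old = self.top
            self.top = old + v
            return old

    def shared_push(self, local_stack):
        """Flush a whole local stack with one fetch-and-add reservation."""
        n = len(local_stack)
        base = self.fetch_and_add(n)  # reserve n contiguous slots
        self.slots[base:base + n] = local_stack
        local_stack.clear()

    def shared_pop(self):
        """Claim one element; undo the reservation if the stack was empty."""
        cursor = self.fetch_and_add(-1) - 1
        if cursor < 0:
            self.fetch_and_add(1)     # empty: restore the cursor
            return None
        return self.slots[cursor]

s = SharedStack(16)
local = ["x", "y", "z"]
s.shared_push(local)
print(s.shared_pop(), s.shared_pop(), s.shared_pop(), s.shared_pop())
# z y x None
```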


enterRoom():
    while gate ≠ OPEN
        /* do nothing: wait */
    FetchAndAdd(&popClients, 1)
    while gate ≠ OPEN
        FetchAndAdd(&popClients, -1) /* failure – return to previous state */
        while gate ≠ OPEN
            /* do nothing: wait */
        FetchAndAdd(&popClients, 1) /* try again */


transitionRooms(): /* move from pop room to push room */
    gate ← CLOSED /* close gate to pop room */
    FetchAndAdd(&pushClients, 1)
    FetchAndAdd(&popClients, -1)
    while popClients > 0
        /* do nothing: wait till none popping */


exitRoom():
    pushers ← FetchAndAdd(&pushClients, -1) - 1
    if pushers = 0 /* last in push room */
        gate ← OPEN
        if isEmpty(sharedStack) /* no work left */
            return true
    return false


Problem:
Any processor waiting to enter the push room must wait until all processors in the pop room have finished their work!

Possible Solution:
The work can be done outside the rooms! This increases the likelihood that the pop room is empty, so threads can enter the push room more quickly.


• Divide the heap into small, fixed-size chunks.
• Each thread receives its own chunks to scan and into which to copy survivors.
• Once a thread's copy chunk is full, it is transferred to a global pool where idle threads compete to scan it, and a new empty chunk is obtained for the thread itself.

Memory-Centric Techniques: Block-Structured Heaps


Mechanisms Used To Ensure Good Load Balancing:
• Chunks acquired were small (256 words).
• To avoid fragmentation, they used big bag of pages allocation for small objects.
• Larger objects and chunks were allocated from the shared heap using a lock.


• Balanced load at a finer granularity.
• Each chunk was divided into smaller blocks (32 words).


• After scanning a slot, the thread checks whether it has reached the block boundary.
• If so, and the next object was smaller than a block:
    • the thread advanced its scan pointer to the start of its current copy block.
    • This reduced contention – the thread did not have to compete to acquire a new scan block.
    • Un-scanned blocks in that area are given to the global pool.
• If the object was larger than a block but smaller than a chunk, the scan pointer was advanced to the start of its current copy chunk.
• If the object was large, the thread continued to scan it.


Block States and Transitions: [figure omitted]


State Transition Logic: [figure omitted]


Parallel Sweeping

Simple Strategies
1) Statically partition the heap into contiguous blocks for threads to sweep.
2) Over-partition the heap and have threads compete for a block to sweep to a free-list.

Problem: The free-list becomes a bottleneck!
Solution: Processors will have their own free-lists.


Endo et al [1997] Lazy Sweeping
• A naturally parallel solution to sweeping partially full blocks.
• In the sweep phase, we need to identify empty blocks and return them to the block allocator.
• Need to reduce contention.
• Gave each thread several consecutive blocks to process locally.
• They used bitmap marking, with bitmaps held in block headers (used to determine whether a block is empty or not).
• Empty blocks are added to a local free-block list.
• Partially full blocks are added to a local reclaim list for subsequent lazy sweeping.
• Once a processor finishes with its sweep set, it merges its local lists with the global free-block list.
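The per-thread classification step can be sketched as follows; merging with the global list and all locking are elided, and the per-block bitmaps are hypothetical.

```python
def sweep_blocks(my_blocks, block_bitmaps):
    """Sketch of Endo-style parallel sweeping: a thread classifies its own
    consecutive blocks locally before merging with the global lists."""
    local_free, local_reclaim = [], []
    for b in my_blocks:
        if any(block_bitmaps[b]):
            local_reclaim.append(b)   # partially full: lazy-sweep later
        else:
            local_free.append(b)      # empty: return to the block allocator
    return local_free, local_reclaim

# Hypothetical per-block mark bitmaps: blocks 0 and 2 are empty.
bitmaps = {0: [0, 0], 1: [1, 0], 2: [0, 0], 3: [1, 1]}
print(sweep_blocks([0, 1, 2, 3], bitmaps))   # ([0, 2], [1, 3])
```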


Parallel Compaction

Flood et al [2001] Parallel Mark-Compact

Observation:
Uniprocessor compaction algorithms typically slide all live data to one end of the heap space.
If multiple threads do so in parallel, one thread can overwrite live data before another thread has moved it!
[Figure: two threads compacting in parallel; one thread's copied data overwrites live data the other has not yet moved.]


Suggested Solution:
• Divide the heap space into several regions, one for each compacting thread.
• To reduce fragmentation, they also have threads alternate the direction in which they move objects in even and odd numbered regions.


4 Phases:
1) Parallel marking.
2) Calculate forwarding addresses.
3) Update references.
4) Move objects.


Phase 2 – Calculating Forwarding Addresses:
• Over-partition the space into M = 4N (N – number of threads) units of roughly the same size.
• Threads compete to claim units.
• Each thread counts the volume of live data in its unit.
• According to these volumes, they partition the space into N regions that contain approximately the same amount of live data.
• Threads compete to claim units and install forwarding addresses for each live object in their units.

[Figure: M = 12 units, N = 3 regions/threads; the units' live volumes are grouped into three regions of roughly equal total (30, 29, 30).]
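A greedy sketch of this volume-based partitioning; the splitting heuristic and the unit volumes below are my own illustration, not the algorithm's exact rule.

```python
def partition_regions(live_volumes, n_regions):
    """Split per-unit live volumes into n contiguous regions whose live-data
    totals are roughly equal (greedy: close a region when its running total
    is about to move away from the target)."""
    total = sum(live_volumes)
    target = total / n_regions
    regions, current, acc = [], [], 0
    for i, v in enumerate(live_volumes):
        remaining = len(live_volumes) - i
        must_keep = n_regions - len(regions) - 1   # units needed for later regions
        if (current and len(regions) < n_regions - 1
                and remaining > must_keep
                and abs(acc - target) <= abs(acc + v - target)):
            regions.append(current)                # adding v would overshoot: close here
            current, acc = [], 0
        current.append(v)
        acc += v
    regions.append(current)
    return regions

units = [3, 6, 13, 7, 10, 5, 7, 5, 12, 4, 8, 9]    # hypothetical unit live volumes
print([sum(r) for r in partition_regions(units, 3)])   # [29, 27, 33]
```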


Phase 3 – Updating References:
• Updating references to point to objects' new locations requires scanning:
    • Objects stored in mutator threads' stacks that might contain references to objects in the heap space (young generation).
    • Live objects in the heap space (old generation).
• Threads compete to claim old generation units to scan, and a single thread scans the young generation.

Phase 4 – Moving Objects:
• Each thread is in charge of a region.
• Good load balancing is guaranteed because the regions contain roughly equal volumes of live data.


Disadvantages:
1) The algorithm makes 3 passes over the heap, while other compacting algorithms make fewer passes.
2) Rather than compacting all live data to one end of the heap, the algorithm compacts into N regions, leaving (N+1)/2 gaps for allocation. If a large number of threads is used, it's difficult for mutators to allocate very large objects.


Abuaiadh et al [2004] Parallel Mark-Compact

1) Address the 3 passes problem:
• Calculate rather than store forwarding addresses, using the mark bitmap and an offset vector that holds the new address of the first live object in each block.
• To construct the offset vector, one pass over the mark-bit vector is needed.
• Only a single pass over the heap is needed to move objects and update references using these vectors.


• Bits in the mark-bit vector indicate the start and end of each live object.
• Words in the offset vector hold the address to which the first live object in their corresponding block will be moved.
• Forwarding addresses are not stored, but are calculated when needed from the offset and mark-bit vectors.
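A Python sketch of calculating a forwarding address from the two vectors, under the simplifying assumption of one mark bit per heap word (the real scheme marks the first and last words of each object). All names and the example heap are my own.

```python
def forwarding_address(mark_bits, offsets, block_size, addr):
    """New address = the block's offset-vector entry plus the number of live
    words preceding addr within its block (counted from the mark-bit vector)."""
    block = addr // block_size
    start = block * block_size
    live_before = sum(mark_bits[start:addr])   # live words before addr in its block
    return offsets[block] + live_before

block_size = 4
# 8-word heap with single-word live objects at words 1, 3 and 5
mark_bits = [0, 1, 0, 1, 0, 1, 0, 0]
# offset vector: block 0's first live object moves to 0; block 1's first
# live object lands after block 0's two live words
offsets = [0, 2]
print([forwarding_address(mark_bits, offsets, block_size, a)
       for a in (1, 3, 5)])   # [0, 1, 2]
```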


2) Address the small gaps problem:
• Over-partition the heap into fairly large areas.
• Threads race to claim the next area to compact, using an atomic operation to increment a global area index.
• If the thread succeeds, it has obtained an area to compact.
• If it fails, it tries to claim the next area.


• A table holds pointers to the beginning of the free space for each area.
• After winning an area to compact, a thread races to obtain an area into which it can move objects. It claims an area by trying to write null into its corresponding table slot.
• Threads never try to compact from or into an area whose table entry is null.
• Objects are never moved from a lower to a higher numbered area.
• Progress is guaranteed since a thread can always compact an area into itself.
• Once a thread has finished with an area, it updates the area's free pointer. If an area is full, its free space pointer will remain null.


[Figure: three heap areas and the free-pointer table; each entry holds the start of its area's free space (e.g. 200, 1000, 1800), and an area claimed as a compaction target has its entry set to NULL while objects are moved into it.]

2) Address the small gaps problem (continued):
Explored two ways in which objects can be moved:
a. Slide object by object.
b. To reduce compaction time, slide only complete blocks (256 bytes). Free space in each block is not squeezed out.

Discussion

• What is the tradeoff in the choice of the chunk size in parallel copying?
• Parallel copying with no synchronization can cause issues. For example, if an object is copied twice by two different threads, what can be the consequences?

[Figure: an object copied twice by two different threads, with other references split between the two inconsistent copies.]


Something Extra

https://www.youtube.com/watch?v=YhKZe22tZlc


Conclusions & Summary

• There should be enough work for parallel collection.
• Need to take into account synchronization costs.
• Need to balance loads between the multiple threads.
• Learned different algorithms for marking, sweeping, copying and compaction that take all these challenges into account.
• Difference between marking and copying – marking an object twice is not so bad; copying an object twice can harm correctness.