Page 1

A New Approach to Parallelising

Tracing Algorithms

Computer Science Department

University of Western Ontario

Computer Laboratory

University of Cambridge

Cosmin E. Oancea, Alan Mycroft & Stephen M. Watt

Page 2

I. Motivation & High Level Goal

We study more scalable algorithms for parallel tracing: memory management is the primary motivation, but we do not claim immediate improvements to state-of-the-art GC.

Tracing is important to computing: the sequential & flat memory model is well understood; the parallel & multi-level memory case is less clear, since processor communication cost grows w.r.t. raw instruction speed x P x ILP.

We give a memory-centric algorithm for copy collection (a general form of tracing) that is free of locks on the mainline path.

Page 3

I. Abstract Tracing Algorithm

Assume an initialisation phase has already marked and processed some root nodes.

The abstract algorithm:

1. mark and process any unmarked child of a marked node;
2. repeat until no further marking is possible.

Implementing the implicit fix-point via worklists yields (see the sketch below):

1. pick a node from a worklist;
2. if unmarked, then mark it, process it, and add its unmarked children to worklists;
3. repeat until all worklists are empty.
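
A minimal sketch in Java of this worklist fix-point, assuming hypothetical Node, mark, process and children operations rather than the collector's real data structures:

import java.util.Deque;
import java.util.List;

// Minimal sketch of the worklist-based tracing fix-point (hypothetical types).
final class SequentialTracer {
    interface Node {
        boolean isMarked();
        void mark();              // mark the node as reached
        void process();           // e.g. copy/scan the node
        List<Node> children();    // outgoing references
    }

    static void trace(Deque<Node> worklist) {        // roots are already enqueued
        while (!worklist.isEmpty()) {
            Node n = worklist.poll();                 // 1. pick a node from a worklist
            if (!n.isMarked()) {                      // 2. if unmarked ...
                n.mark();
                n.process();
                for (Node c : n.children()) {
                    if (!c.isMarked()) worklist.add(c);  // add its unmarked children
                }
            }
        }                                             // 3. repeat until the worklist is empty
    }
}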

Page 4

I. Worklist Semantics: Classical

What should worklists model?

Classical approach: processing semantics.

Worklist i stores nodes to be processed by processor i!

[Figure: Worklist 1, Worklist 2, Worklist 3, Worklist 4 – one worklist per processor.]

Page 5

I. Classic Algorithm

Two layers of synchronisation:

Worklist level – small overhead via deque (Arora et al.) or work stealing (Michael et al.).

Frustrating atomic block – gives an idempotent copy, thus enabling the above small-overhead worklist-access solutions.

while (!worklist.isEmpty()) {
    int ind = 0;
    Object from_child, to_child, to_obj = worklist.deqRand();
    foreach (from_child in to_obj.fields()) {
        ind++;
        atomic {                                     // the frustrating atomic block
            if (from_child.isForwarded()) continue;
            to_child = copy(from_child);             // copy is idempotent only thanks to the atomic block
            setForwardingPtr(from_child, to_child);
        }
        to_obj.setField(to_child, ind - 1);
        queue.enqueue(to_child);
    }
}

Page 6

I. Related Work

Halstead (MultiLisp) – first parallel semi-space collector, but may lead to load imbalance. Solutions:

Object stealing: Arora et al., Flood et al., Endo et al., ...

Block-based approaches: Imai and Tick, Attanasio et al., Marlow et al., ...

Free-of-locks solutions exploiting immutable data: Doligez and Leroy, Huelsbergen and Larus.

Memory-centric solutions – studied only in the sequential case: Shuf et al., Demers et al., Chicha and Watt.

Page 7

II. Memory-Centric Tracing (High Level)

L = memory partition (local) size; it gives the trade-off between locality of reference and load balancing.

Worklist j stores slots: the to-space address pointing to a from-space field f of the currently copied/scanned object o, with j = (o.f quo L) rem N.
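
As a worked instance of the formula, the partition index depends only on the from-space address held in the field, not on which collector discovered it; a minimal sketch with a hypothetical helper (addresses modelled as longs):

final class PartitionIndex {
    // j = (o.f quo L) rem N : quo is integer division, rem is modulo.
    static int worklistIndex(long fieldValue, long L, int N) {
        return (int) ((fieldValue / L) % N);
    }

    public static void main(String[] args) {
        // With L = 64K and N = 8, address 0x30000 lies in partition 3, hence worklist 3.
        System.out.println(worklistIndex(0x30000L, 64 * 1024, 8));  // prints 3
    }
}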

Page 8

II. Memory-Centric Tracing (High Level)

Arrow semantics (figure): double-ended – copy to to-space; dashed – insert in queue; solid – slots pointing to fields.

1. Each worklist w is owned by at most one collector c (its owner).
2. Forwarded slots of c: those slots belonging to a partition owned by c, but discovered by another collector.
3. Eager strategy for acquiring worklist ownership: initially all roots are placed in worklists, and non-empty worklists are acquired.

Dispatching Slots to Worklists or Forwarding Queues
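
A sketch of the dispatching rule named in the caption above, with hypothetical bookkeeping (ownerOf, worklists and forwarding queues modelled as deques); eager ownership acquisition and synchronisation details are elided:

import java.util.Deque;

final class SlotDispatcher {
    final int myId;                  // this collector's id
    final int N;                     // number of partitions / worklists
    final long L;                    // partition size
    final int[] ownerOf;             // ownerOf[j] = collector owning worklist j, or -1 if unowned
    final Deque<Long>[] worklists;   // one worklist of slots per memory partition
    final Deque<Long>[] forwardTo;   // row myId of the PxP forwarding-queue matrix

    SlotDispatcher(int myId, int N, long L, int[] ownerOf,
                   Deque<Long>[] worklists, Deque<Long>[] forwardTo) {
        this.myId = myId; this.N = N; this.L = L;
        this.ownerOf = ownerOf; this.worklists = worklists; this.forwardTo = forwardTo;
    }

    // slotAddr: to-space address of the field being traced; fieldValue: the from-space address it holds.
    void dispatch(long slotAddr, long fieldValue) {
        int j = (int) ((fieldValue / L) % N);     // j = (o.f quo L) rem N
        if (ownerOf[j] == myId) {
            worklists[j].push(slotAddr);          // owned partition: local access, no locking
        } else if (ownerOf[j] >= 0) {
            forwardTo[ownerOf[j]].add(slotAddr);  // forwarded slot for the current owner
        } else {
            worklists[j].push(slotAddr);          // unowned: acquire eagerly (acquisition elided)
        }
    }
}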

Page 9

II. Memory-Centric Tracing Implem.

Each collector processes its forwarding queues (of size F); empty worklists are released (ownership given up).

Each collector then processes F*P*4 items from its owned worklists (the factor 4 was chosen empirically – inverse of the forwarding ratio). No locking is needed when accessing worklists or when copying.

L (the local partition size) gives the locality-of-reference level.

Repeat until: no owned worklists && all forwarding queues empty && all worklists empty (see the loop sketch below).
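
A sketch of one collector's outer loop under the assumptions above; the helpers are hypothetical and declared abstract, and global termination detection is simplified:

abstract class CollectorLoop {
    final int F;   // forwarding-queue size
    final int P;   // number of collectors

    CollectorLoop(int F, int P) { this.F = F; this.P = P; }

    abstract boolean done();                      // no owned worklists && all forwarding queues empty && all worklists empty
    abstract void drainForwardingQueues();        // process up to F items from each incoming queue
    abstract void releaseEmptyOwnedWorklists();   // give up ownership of exhausted partitions
    abstract boolean hasOwnedWork();
    abstract void processOneOwnedSlot();          // copy/scan one slot; only the owner touches owned worklists, so no locks

    final void run() {
        final int quota = F * P * 4;              // 4 chosen empirically (inverse of the forwarding ratio)
        while (!done()) {
            drainForwardingQueues();
            releaseEmptyOwnedWorklists();
            for (int i = 0; i < quota && hasOwnedWork(); i++) {
                processOneOwnedSlot();
            }
        }
    }
}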

Page 10

II. Forwarding Queues on INTEL IA-32

Implement inter-processor communication: with P collectors we have a PxP matrix of queues; entry (i,j) holds items enqueued by collector i and dequeued by collector j. Wait-free, lock-free and mfence-free IA-32 implementation:

// F is the queue capacity; one slot is kept empty to distinguish full from empty.
volatile Address buff[F];
volatile int tail = 0, head = 0;         // next : k -> (k+1) % F

bool enq(Address slot) {                 // called only by the producing collector i
    int new_tl = next(tail);
    if (new_tl == head) return false;    // full
    buff[tail] = slot;
    tail = new_tl;
    return true;
}

bool is_empty() { return head == tail; }

Address deq() {                          // called only by the consuming collector j
    Address slot = buff[head];
    head = next(head);
    return slot;
}

Page 11

II. Forwarding Queues on INTEL IA-32

The sequentially inconsistent pattern occurs, but the algorithm is still safe:

The head & tail interaction reduces to a collector failing to deq from a non-empty list (and to enq into a non-full list);

The buff[tail_prev] & head==tail_prev interaction is safe because writes are not re-ordered (on IA-32).

// Classic sequentially-inconsistent (store-buffering) pattern:
// a = b = 0;  initially
//   Proc 1:  a = 1;  mfence;  x = b;
//   Proc 2:  b = 1;  mfence;  y = a;
// Without the mfences, the outcome x == 0 && y == 0 is possible.
//
// The analogous interaction in the queue, (two enq) || (two is_empty; deq):
//   Proc i:  buff[tail] = ...;   tail = ...;        if (new_tl == head) ...
//   Proc j:  head = next(head);  if (head != tail)  ... = buff[head];

Page 12

II. Dynamic Load Balancing

Small partitions (64K) are OK under static ownership: grey objects are randomly distributed among the N partitions, yet still give some locality of reference (otherwise forwarding would be too expensive).

Larger partitions may need dynamic load balancing; partition ownership must be transferred:

A starving collector c signals nearby collectors; these may release ownership of an owned worklist w while placing an item of w on collector c's forwarding queue (see the sketch below).

Partition stealing requires locking on the mainline path, since the copy operation is not idempotent without it (Michael et al.)!
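
A sketch of the ownership hand-off on starvation described above; the flags and helpers are hypothetical, and the real protocol's synchronisation and memory-ordering details are not shown:

abstract class LoadBalancer {
    abstract boolean isStarving(int collector);            // flag raised by a starving collector c
    abstract long takeOneSlot(int worklist);               // remove one item of worklist w
    abstract void releaseOwnership(int worklist);          // drop ownership of w
    abstract void forwardSlot(int toCollector, long slot); // place the slot on c's forwarding queue

    // Called by the owner of worklist w when it notices a nearby collector c is starving:
    // c later picks up the forwarded slot and eagerly acquires ownership of w's partition.
    final void maybeDonate(int c, int w) {
        if (isStarving(c)) {
            long slot = takeOneSlot(w);
            releaseOwnership(w);
            forwardSlot(c, slot);
        }
    }
}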

Page 13

II. Optimisation; Run-Time Adaptation

Inter-collector producer-consumer relations are detected when forwarding queues are found full (F*P*4 processed items/iteration): ownership is transferred to the producer collector to optimise forwarding.

Run-time adaptation: monitor the forwarding ratio (FR) & load balancing (LB):
start with a large L; while LB is poor, decrease L;
if FR > FR_MAX or L < L_MIN, switch to the classical algorithm (see the sketch below)!
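
A sketch of this adaptation policy; the concrete thresholds FR_MAX and L_MIN below are assumptions for illustration, not values from the slides:

final class AdaptationPolicy {
    // Assumed thresholds, for illustration only.
    static final double FR_MAX = 8.0;
    static final long   L_MIN  = 16 * 1024;

    long L;                         // current partition size; start large
    boolean useClassical = false;   // set once we fall back to the classical collector

    AdaptationPolicy(long initialL) { this.L = initialL; }

    // Called periodically with the measured forwarding ratio (FR) and a load-balance verdict (LB).
    void adapt(double forwardingRatio, boolean poorLoadBalance) {
        if (poorLoadBalance) {
            L /= 2;                                          // while LB is poor, decrease L
        }
        if (forwardingRatio > FR_MAX || L < L_MIN) {
            useClassical = true;                             // switch to the classical algorithm
        }
    }
}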

Page 14

III. Empirical Results – Small Data

Two-quad-core AMD Opteron machine, small live data-set applications, against MMTk:
Time average over Antlr, Bloat, Pmd, Xalan, Fop, Jython, HsqldbS.
Heap Size = 120-200M, IFR average = 4.2, L = 64K.

GC Time, Small Live Data Sets (Sequential Time is 100); normalised time, lower is better:

Processors            1       2       4       6
Memory-Centric SD   139.7    92.4    59.9    49.6
Classical SD        111.7    77.3    54.3    48.7

Page 15

III. Empirical Results – Large Data

Two-quad-core AMD Opteron machine, large live data-set applications, against MMTk:
Time average over Hsqldb, GCBench, Voronoi, TreeAdd, MST, TSP, Perimeter, BH.
Heap Size > 500M, IFR average = 6.3, L = 128K.

GC Time, Large Live Data Sets (Sequential Time is 100); normalised time, lower is better:

Processors            1       2       4       6       8
Memory-Centric LD   131.0    78.5    40.5    27.6    23.1
Classical LD        111.3    96.9    92.5    92.8    95.6

Page 16

III. Empirical Results – Eclipse

Quad-core Intel machine on Eclipse (large live data-set):
Heap Size = 500M, IFR average = (only) 2.6 for L = 512K, otherwise 2.1!

GC Time, Eclipse (Sequential Time is 100); normalised time, lower is better:

Processors          1     2     3     4
Memory-Centric    148   100    81    69
Classical         116    69    57    48

Page 17

III. Empirical Results – Jython

Two-quad-core AMD machine on Jython:
Heap Size = 200M, IFR average = (only) 3.0!

GC Time, Jython (Sequential Time is 100); normalised time, lower is better:

Processors          1     2     4     6
Memory-Centric    145   102    64    53
Classical         108    70    58    44

Page 18

III. Conclusions

Memory-centric algorithms may be an important alternative to processing-centric algorithms, especially on non-homogeneous hardware.

We show how to explicitly represent and optimise two abstractions: locality of reference (L) and inter-processor communication (FR). L trades off locality of reference against load balancing.

Robust behaviour: scales well with both data size and number of processors.