An On-the-Fly Reference Counting Garbage Collector for Java

Erez Petrank

Technion – Israel Institute of Technology

Joint work with Yossi Levanoni – Microsoft Corporation

ACM Conference on Object-Oriented Programming, Systems, Languages & Applications

Tampa, Florida

October 18, 2001


Garbage Collection Today

• Two classic approaches:

– Tracing [McCarthy 1960]: trace reachable objects, reclaim objects not traced.

– Reference counting [Collins 1960]: keep a reference count for each object, reclaim objects with count 0.

• Today’s advanced environments:

– multiprocessors

– huge memories


Motivation for RC

• Reference counting work is proportional to the work on creations and modifications.

– Can tracing deal with tomorrow’s huge heaps?

• Reference counting has good locality.

• Tracing rules JVMs; is it justified?

• The challenge:

– RC write barriers seem too expensive.

– RC seems impossible to “parallelize”.


This work

• An improved RC (suitable for Java)

– Reduced overhead on the write barrier.

– Concurrent with low overhead: on-the-fly, no synchronization operation in the write barrier, multiprocessor.

– Thus: low latency, high performance.

• Implementation:

– JVM: Sun’s Java Virtual Machine 1.2.2

– Platform: 4-way IBM Netfinity 8500R server with 550MHz Intel Pentium III Xeon and 2GB memory.


Agenda

• Introduction

• Motivation

• The Algorithm

• Related issues

• Implementation and Measurements

• Conclusions


Terminology

[Figure: Stop-the-World, Parallel, Concurrent, and On-the-Fly collection compared; legend: program, GC]


Basic Reference Counting

• Each object has an RC field; new objects get o.RC:=1.

• When p that points to o1 is modified to point to o2 we do: o1.RC--, o2.RC++.

• If o1.RC==0 then:

– Delete o1.

– Decrement o.RC for all children o of o1.

– Recursively delete objects whose RC is decremented to 0.
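The three rules above can be sketched in C roughly as follows. This is only an illustration, not the collector from the talk: the `Object` layout, the fixed `MAX_CHILDREN` fan-out, the `freed_objects` counter, and the names `rc_new`, `rc_release`, and `rc_update` are all assumptions introduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_CHILDREN 2  /* assumption: small fixed fan-out keeps the sketch short */

typedef struct Object {
    int rc;                              /* reference count */
    struct Object *child[MAX_CHILDREN];  /* outgoing pointers (the "sons") */
} Object;

int freed_objects = 0;  /* instrumentation only: counts reclaimed objects */

/* Allocate an object; RC starts at 1 for the reference handed to the caller. */
Object *rc_new(void) {
    Object *o = calloc(1, sizeof *o);
    if (o == NULL) abort();
    o->rc = 1;
    return o;
}

/* Drop one reference; at zero, release the children recursively, then free. */
void rc_release(Object *o) {
    if (o == NULL) return;
    assert(o->rc > 0);
    if (--o->rc == 0) {
        for (int i = 0; i < MAX_CHILDREN; i++)
            rc_release(o->child[i]);
        free(o);
        freed_objects++;
    }
}

/* Pointer update: *slot changes from o1 to o2, so o2.RC++ and o1.RC--. */
void rc_update(Object **slot, Object *o2) {
    Object *o1 = *slot;
    if (o2 != NULL) o2->rc++;  /* increment first, so o1 == o2 stays safe */
    *slot = o2;
    rc_release(o1);
}
```

Clearing the only pointer to the head of a chain with `rc_update(&a->child[0], NULL)` releases the whole chain, which is the recursive deletion described in the last bullet.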



Deferred Reference Counting

• Problem: overhead on updating program variables (locals) costs too much.

• Solution [Deutsch & Bobrow]:

– Don’t update RC for locals.

– “Once in a while”: collect all objects with o.RC=0 that are not referenced from local roots.

• Deferred RC reduces overhead by 80%. Used in most modern RC systems.
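A minimal sketch of the deferral idea, under simplifying assumptions that are not from the talk: the heap is a small fixed array, objects are named by index, objects have no fields (so the recursive decrement of children is omitted), and the names `drc_alloc`, `drc_heap_store`, and `drc_collect` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HEAP_SIZE 8  /* assumption: a tiny fixed heap of object slots */

typedef struct {
    bool live;  /* slot currently holds an allocated object */
    int  rc;    /* counts references from heap slots only, never from locals */
} Object;

Object heap[HEAP_SIZE];

/* Allocation does not count the local variable that receives the result. */
int drc_alloc(void) {
    for (int i = 0; i < HEAP_SIZE; i++)
        if (!heap[i].live) { heap[i].live = true; heap[i].rc = 0; return i; }
    return -1;  /* heap full */
}

/* Heap pointer stores still adjust counts; stores to locals call nothing. */
void drc_heap_store(int old_target, int new_target) {
    if (new_target >= 0) heap[new_target].rc++;
    if (old_target >= 0) heap[old_target].rc--;
}

/* "Once in a while": reclaim objects with RC 0 that no local root references.
   roots[] holds the object indices currently held in locals. */
int drc_collect(const int *roots, int nroots) {
    int reclaimed = 0;
    for (int i = 0; i < HEAP_SIZE; i++) {
        if (!heap[i].live || heap[i].rc != 0) continue;
        bool rooted = false;
        for (int r = 0; r < nroots; r++)
            if (roots[r] == i) rooted = true;
        if (!rooted) { heap[i].live = false; reclaimed++; }
    }
    return reclaimed;
}
```

The point of the design is that assignments to locals, by far the most frequent pointer writes, cost nothing; the price is that an object with RC 0 may still be alive and can only be reclaimed after the roots have been checked.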

Multithreaded RC?

• Problem:

– Parallel updates confuse counts. Initially A.next points to B; C and D are the new targets:

Thread 1: Read A.next (sees B); A.next ← C; B.RC--; C.RC++

Thread 2: Read A.next (sees B); A.next ← D; B.RC--; D.RC++

– (And more: updating reference counts in parallel races.)
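The interleaving above can be replayed deterministically in a single thread to see what goes wrong. The sketch below is an illustration only; the `Node` type and the function names are assumptions, and the naive barrier is split into two halves purely so the bad schedule can be written out by hand.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node {
    int rc;
    struct Node *next;
} Node;

/* First half of the naive barrier: read the value about to be overwritten. */
Node *read_old(Node *holder) { return holder->next; }

/* Second half: store the new target and adjust counts using the stale read. */
void store_and_count(Node *holder, Node *old, Node *target) {
    holder->next = target;  /* A.next <- target */
    old->rc--;              /* decrement what this thread believes it overwrote */
    target->rc++;
}

/* Replays the problematic schedule: both threads read before either writes. */
void replay_race(Node *a, Node *c, Node *d) {
    Node *old1 = read_old(a);     /* thread 1 sees B */
    Node *old2 = read_old(a);     /* thread 2 also sees B */
    store_and_count(a, old1, c);  /* thread 1: A.next <- C; B.RC--; C.RC++ */
    store_and_count(a, old2, d);  /* thread 2: A.next <- D; B.RC--; D.RC++ */
}
```

Starting from B.RC=1 and C.RC=D.RC=0, the replay leaves A pointing to D, B with a count of -1 (decremented twice), and C with a count of 1 although nothing references it, so C would never be reclaimed.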

Multithreaded RC

• Problem:

– Parallel updates confuse counts.

– Updating reference counts in parallel races.

• [DeTreville]:

– Lock heap for each pointer modification.

– Thread records its updates in a buffer.

– Once in a while (snapshot-like):

    • GC thread reads all buffers to update reference counts.

    • Reclaims all objects with RC 0 that are not local.

To Summarize…

• Overhead on the write barrier is considered high.

– Even with the deferred RC of Deutsch & Bobrow.

• Using reference counting concurrently with program threads seems to bear a high synchronization cost.

– Lock or “compare & swap” for each pointer update.

Improving RC

• Consider a pointer p that takes the following values between GCs: O0, O1, O2, …, On.

• All RC algorithms perform 2n operations: O0.RC--; O1.RC++; O1.RC--; O2.RC++; O2.RC--; … ; On.RC++;

• But only two operations are needed: O0.RC-- and On.RC++

[Figure: pointer p successively referencing O0, O1, O2, O3, O4, …, On]

Improving RC cont’d

• Don’t record all pointer modifications. Record only the first modification of each pointer between GCs (its old value O0).

• During the collection, for each recorded pointer p:

– find O0 by checking the record,

– find On by reading the heap during the collection.

• Apply only two operations for each such pointer: O0.RC-- and On.RC++

This reduces the number of logging and counter updates by a factor of 100-1000 for normal benchmarks!
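A single-threaded sketch of this coalescing, to make the bookkeeping concrete. It is not the authors' implementation: the per-slot `dirty` flag stored next to the pointer, the fixed-size `update_log`, and the names `store` and `apply_log` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Object { int rc; } Object;

typedef struct {
    Object *target;  /* current value of the pointer p */
    bool dirty;      /* already logged since the last collection? */
} Slot;

typedef struct { Slot *slot; Object *old; } LogEntry;

#define LOG_CAPACITY 16  /* assumption: fixed-size log for brevity */
LogEntry update_log[LOG_CAPACITY];
size_t log_len = 0;

/* Mutator side: only the first store to a slot since the last GC is logged. */
void store(Slot *s, Object *new_target) {
    if (!s->dirty) {
        assert(log_len < LOG_CAPACITY);
        update_log[log_len++] = (LogEntry){ s, s->target };
        s->dirty = true;
    }
    s->target = new_target;
}

/* Collector side: two count updates per logged slot, however often it changed.
   Returns the number of RC operations performed. */
size_t apply_log(void) {
    size_t rc_ops = 0;
    for (size_t i = 0; i < log_len; i++) {
        Object *o0 = update_log[i].old;           /* O0: taken from the record */
        Object *on = update_log[i].slot->target;  /* On: read from the heap now */
        if (o0 != NULL) { o0->rc--; rc_ops++; }
        if (on != NULL) { on->rc++; rc_ops++; }
        update_log[i].slot->dirty = false;
    }
    log_len = 0;
    return rc_ops;
}
```

If p is stored to four times between collections, the log holds one entry and `apply_log` performs two count operations instead of eight; the intermediate objects O1 to O3 are never touched.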

Improving Synch. Overhead

• Simple solutions bear unacceptable overhead:

– DeTreville uses a lock for all pointer modifications.

– Simple alternatives require 3 compare-and-swaps.

• Our second contribution:

– A carefully designed write barrier (and an observation) allows elimination of all synchronization operations from the write barrier.

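The write barrier

    void Update(Object **slot, Object *new) {
        Object *old = *slot;       /* read the value about to be overwritten */
        if (!IsDirty(slot)) {      /* first modification since the last collection? */
            log(slot, old);        /* record the slot and its old value */
            SetDirty(slot);
        }
        *slot = new;               /* perform the actual store */
    }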

Observation: If two threads

1. invoke the write barrier in parallel, and

2. both log an old value,

then both record the same old value.

Intermediate Algorithm: Snapshot Oriented, Concurrent

• Use write barrier with program threads.

• To collect:

– Stop all threads.

– Scan roots (locals).

– Get the buffers with modified slots.

– Clear all dirty bits.

– Resume threads.

– For each modified slot:

    • decrease RC for the old value (written in buffer),

    • increase RC for the current value (“read heap”).

– Reclaim non-local objects with RC 0.

The Sliding View Algorithm: On-the-Fly

• Do all collection as threads run:

– Read threads’ buffers (one thread at a time),

– Clear all dirty bits,

– Update reference counts,

– Read roots of each thread, one at a time,

– Reclaim (recursively) objects with RC 0.

• Note: RCs are not correct for any specific point in time; yet, with care, most dead objects may be reclaimed!

• Borrow ideas from [Lamport et al.]

Sliding View

Cycles Collection

• Our solution: use a tracing algorithm infrequently.

• Currently this is the most efficient solution. Cycle collectors have high cost.

• We propose a new on-the-fly mark & sweep algorithm that works best with the same sliding view. It can also be used “on its own”.

Implementation for Java

• Based on Sun’s JDK 1.2.2 for Windows NT

• Main features:

– 2-bit RC field per object (à la [Wise et al.])

– A supplemental sliding view tracing algorithm

– A custom allocator for on-the-fly RC:

    • Multi-leveled fine-grained locking

    • Supports sporadic reclamation of objects

    • Supports sweeping the heap
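The slide does not spell out how a 2-bit count behaves on overflow. A common scheme for such limited-width fields is a saturating ("sticky") count, and the sketch below assumes that scheme; the names `RC_STUCK`, `rc2_inc`, and `rc2_dec` are hypothetical and not taken from the implementation.

```c
#include <assert.h>
#include <stdint.h>

#define RC_BITS  2
#define RC_STUCK ((1u << RC_BITS) - 1)  /* 3: saturated value, no longer exact */

/* Increment saturates at RC_STUCK instead of wrapping around. */
uint8_t rc2_inc(uint8_t rc) {
    return rc < RC_STUCK ? (uint8_t)(rc + 1) : (uint8_t)RC_STUCK;
}

/* A stuck count is never decremented: the true count is unknown, so only a
   tracing pass can decide whether the object is dead. */
uint8_t rc2_dec(uint8_t rc) {
    if (rc == RC_STUCK) return rc;
    assert(rc > 0);
    return (uint8_t)(rc - 1);
}
```

Under this assumption, objects whose counts get stuck cannot be reclaimed by counting alone, which is one reason a supplemental tracing algorithm fits naturally alongside the small RC field.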

Performance Measurements

• First multiprocessor measurements in a “normal” environment!

– (Previously reported measurements assumed one CPU is free for GC all the time.)

• Benchmarks:

– Server benchmarks

    • SPECjbb2000: simulates business-like transactions in a large firm

    • MTRT: a multi-threaded ray tracer

– Client benchmarks

    • SPECjvm98: a suite of mostly single-threaded client benchmarks

Improved RC

• How many RC updates are eliminated?

Benchmark    No. of stores    No. of “first” stores    Ratio of “first” stores
jbb          71,011,357       264,115                  1/269
Compress     64,905           51                       1/1273
Db           33,124,780       30,696                   1/1079
Jack         135,174,775      1,546                    1/87435
Javac        22,042,028       535,296                  1/41
Jess         26,258,107       27,333                   1/961
mpegaudio    5,517,795        51                       1/108192
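The ratio column appears to be the number of stores divided by the number of “first” stores, rounded to the nearest integer. A small helper (the name `stores_per_first_store` is an assumption introduced here) reproduces the N in each “1/N” entry:

```c
#include <assert.h>

/* Rounded value of stores / first_stores, i.e. the N in a "1/N" ratio.
   Requires first_stores > 0. */
unsigned long stores_per_first_store(unsigned long stores,
                                     unsigned long first_stores) {
    assert(first_stores > 0);
    return (stores + first_stores / 2) / first_stores;  /* round to nearest */
}
```

For example, jbb gives 71,011,357 / 264,115 ≈ 268.9, which rounds to 269, and Javac gives 22,042,028 / 535,296 ≈ 41.2, which rounds to 41.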

SPECjbb Latency (Max Transaction Time)

[Chart: SPECjbb -- Max. Response Time (600MB), in milliseconds, by number of threads]

# Threads    1       2       4       6       8       10      15      20
RC           16      16      47      78      110     146     245     329
Original     7433    8037    8463    6923    7857    7536    6593    5997

SPECjbb Throughput

[Chart: SPECjbb -- performance vs. # threads (600MB); change in throughput (%)]

# Threads    1       2       4        6        8        10       15       20
RC           0.4%    4.0%    -5.4%    -2.0%    -1.0%    -2.2%    -0.3%    2.4%

MTRT Throughput

[Chart: MTRT -- Improvement in Execution Time, by number of threads]

# Threads    1       2       3       4       8        12      16
RC           4.9%    5.0%    7.2%    5.6%    11.4%    0.2%    -0.1%

SPECjbb Heap Utilization

[Chart: SPECjbb --- Heap Usage; MB allocated (not free), by number of threads]

# Threads    1     2     4     6      8      10     15     20
RC           27    44    77    108    170    171    251    329
Original     26    42    74    104    135    166    243    320

Client Performance

[Chart: SPECjvm98 -- Total Execution Time; % slower execution time, by GC version]

RC: 3.6% slower

Related Work

• On-the-fly tracing:

– Dijkstra et al. (1976), Steele (1976), Lamport (1976),

– Kung & Song (1977), Gries (1977), Ben-Ari (1982, 1984), Huelsbergen et al. (1993, 1998)

– Doligez-Gonthier-Leroy (1993-4), Domani-Kolodner-Petrank (2000)

• Concurrent reference counting:

– DeTreville (1990),

– Martinez et al. (1990), Lins (1992)

– Plakal & Fischer (2001),

– Bacon et al. (2001)

Conclusions

• A new algorithm for reference counting.

– Low overhead on pointer modification

– On-the-fly

• Implementation for Java

• Measurements show high throughput and low latency.

• To be out soon: a matching paper on the sliding view tracing collector.
