An On-the-Fly Reference Counting
Garbage Collector for Java
Erez Petrank
Technion – Israel Institute of Technology
Joint work with Yossi Levanoni – Microsoft Corporation
ACM Conference on Object Oriented
Programming Systems Languages & Applications
Tampa, Florida
October 18, 2001
Levanoni & PetrankOn-the-Fly Reference Counting2
Garbage Collection Today
• Two classic approaches:
  – Tracing [McCarthy 1960]: trace reachable objects, reclaim objects not traced.
  – Reference counting [Collins 1960]: keep reference count for each object, reclaim objects with count 0.
• Today’s advanced environments:
  – multiprocessors
  – huge memories
Motivation for RC
• Reference counting work is proportional to the work on creations and modifications.
  – Can tracing deal with tomorrow’s huge heaps?
• Reference counting has good locality.
• Tracing rules JVMs; is it justified?
• The challenge:
  – RC write barriers seem too expensive.
  – RC seems impossible to “parallelize”.
This work
• An improved RC (suitable for Java)
  – Reduced overhead on the write barrier.
  – Concurrent with low overhead: on-the-fly, no sync. operation in the write barrier, multiprocessor.
  – Thus: low latency, high performance.
• Implementation:
  – JVM: Sun’s Java Virtual Machine 1.2.2
  – Platform: 4-way IBM Netfinity 8500R server with 550MHz Intel Pentium III Xeon and 2GB memory.
Agenda
• Introduction
• Motivation
• The Algorithm
• Related issues
• Implementation and Measurements
• Conclusions
Terminology
[Figure: thread timelines for four collector styles: Stop-the-World, Parallel, Concurrent, On-the-Fly. Legend: program, GC.]
Basic Reference Counting
• Each object has an RC field; new objects get o.RC:=1.
• When p that points to o1 is modified to point to o2 we do: o1.RC--, o2.RC++.
• If o1.RC==0 then:
  – Delete o1.
  – Decrement o.RC for every son o of o1.
  – Recursively delete objects whose RC is decremented to 0.
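The basic scheme above can be sketched in C as follows. This is a minimal illustration, not the authors' implementation: the `Obj` layout, the fixed `MAX_CHILDREN`, the `freed` flag (used instead of actually freeing memory so the result can be inspected), and the function names are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 2

/* Hypothetical object layout: a count, a "deleted" flag, and pointer fields. */
typedef struct Obj {
    int rc;
    int freed;                        /* set instead of calling free() */
    struct Obj *child[MAX_CHILDREN];
} Obj;

/* New objects start with RC = 1, as on the slide. */
void rc_init(Obj *o) {
    o->rc = 1;
    o->freed = 0;
    for (int i = 0; i < MAX_CHILDREN; i++) o->child[i] = NULL;
}

/* Drop one reference; on reaching 0, delete the object and recurse into its sons. */
void rc_release(Obj *o) {
    if (o == NULL) return;
    assert(o->rc > 0);
    if (--o->rc == 0) {
        o->freed = 1;
        for (int i = 0; i < MAX_CHILDREN; i++) {
            rc_release(o->child[i]);
            o->child[i] = NULL;
        }
    }
}

/* Pointer update p: o1 -> o2, i.e. o2.RC++ and o1.RC--. */
void rc_update(Obj **p, Obj *o2) {
    Obj *o1 = *p;
    if (o2 != NULL) o2->rc++;   /* increment first so assigning p to itself is safe */
    *p = o2;
    rc_release(o1);
}
```

If `p` holds the only reference to `a`, and `a` holds the only reference to `b`, then redirecting `p` to another object deletes both `a` and `b` through the recursive step.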
Deferred Reference Counting
• Problem: overhead on updating program variables (locals) costs too much.
• Solution [Deutsch & Bobrow]:
  – Don’t update RC for locals.
  – “Once in a while”: collect all objects with o.RC=0 that are not referenced from local roots.
• Deferred RC reduces overhead by 80%. Used in most modern RC systems.
Multithreaded RC?
• Problem:
  – Parallel updates confuse counts:
    Thread 1: Read A.next (sees B); A.next ← C; B.RC--; C.RC++
    Thread 2: Read A.next (sees B); A.next ← D; B.RC--; D.RC++
  – (And more: updating ref counts in parallel races.)
[Figure: A points to B; C and D are the new targets.]
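The bad interleaving can be replayed deterministically on a single thread, which makes the miscount easy to see. The sketch below is only an illustration: the `Node` type and the split of each update into a read step and a write step are assumptions for the example, and no real threads are involved.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node { int rc; struct Node *next; } Node;

/* First half of an update: read the old target of A.next. */
Node *step_read(Node *a) { return a->next; }

/* Second half: store the new target and adjust both counts. */
void step_write(Node *a, Node *seen_old, Node *new_target) {
    a->next = new_target;
    seen_old->rc--;
    new_target->rc++;
}

/* Replays: T1 reads, T2 reads, T1 writes C, T2 writes D.
   Returns B's final count. */
int replay_race(Node *a, Node *c, Node *d) {
    Node *b = a->next;
    assert(b != NULL);
    Node *old1 = step_read(a);   /* thread 1 sees B */
    Node *old2 = step_read(a);   /* thread 2 also sees B */
    step_write(a, old1, c);
    step_write(a, old2, d);
    return b->rc;
}
```

Starting from B.RC = 1 and C.RC = D.RC = 0, the replay leaves B.RC at -1 (decremented twice) and C.RC at 1 even though nothing points to C, so B is over-released and C can never be reclaimed by counts alone.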
Multithreaded RC
• Problem:
  – Parallel updates confuse counts.
  – Updating ref counts in parallel races.
• [DeTreville]:
  – Lock heap for each pointer modification.
  – Thread records its updates in a buffer.
  – Once in a while (snapshot-like):
    • GC thread reads all buffers to update ref counts.
    • Reclaims all objects with rc 0 that are not referenced from locals.
To Summarize…
• Overhead on the write barrier is considered high.
  – Even with the deferred RC of Deutsch & Bobrow.
• Using reference counting concurrently with program threads seems to bear a high synchronization cost.
  – Lock or “compare & swap” for each pointer update.
Improving RC
• Consider a pointer p that takes the following values between GC’s: O0, O1, O2, …, On.
• All RC algorithms perform 2n operations: O0.RC--; O1.RC++; O1.RC--; O2.RC++; O2.RC--; … ; On.RC++;
• But only two operations are needed: O0.RC-- and On.RC++
[Figure: pointer p moving from O0 through O1, O2, O3, O4, … to On.]
Improving RC cont’d
• Don’t record all pointer modifications. Record only the first modification between GC’s (which yields O0).
• During the collection, for each recorded pointer p:
  – find O0 by checking the record,
  – find On by reading the heap during the collection.
• Apply only two operations for each such pointer: O0.RC-- and On.RC++
• This reduces the number of logging & counter updates by a factor of 100-1000 for normal benchmarks!
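A small sketch of the "log only the first modification" idea, for a single pointer slot. It is an illustration under assumptions: the `Slot` structure with a per-slot `logged_old` field stands in for the real log buffers, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Obj { int rc; } Obj;

/* Hypothetical one-slot log: remembers the value the slot held at its
   first modification since the last collection (O0). */
typedef struct Slot {
    Obj *value;
    Obj *logged_old;
    int dirty;
} Slot;

/* Mutator side: only the first store after a collection is logged. */
void slot_store(Slot *s, Obj *new_value) {
    if (!s->dirty) {
        s->logged_old = s->value;
        s->dirty = 1;
    }
    s->value = new_value;
}

/* Collector side: O0.RC-- from the log, On.RC++ from the heap.
   Returns the number of counter operations performed. */
int slot_collect(Slot *s) {
    if (!s->dirty) return 0;
    int ops = 0;
    if (s->logged_old != NULL) { s->logged_old->rc--; ops++; }
    if (s->value != NULL)      { s->value->rc++;      ops++; }
    s->dirty = 0;
    s->logged_old = NULL;
    return ops;
}
```

However many times the slot is overwritten between collections, `slot_collect` performs at most two counter operations; the intermediate objects O1 … On-1 are never touched.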
Improving Synch. Overhead
• Simple solutions bear unacceptable overhead:
  – DeTreville uses a lock for all pointer modifications.
  – Simple alternatives require 3 compare-and-swap’s.
• Our second contribution:
  – A carefully designed write barrier (and an observation) allows elimination of all sync. operations from the write barrier.
The write barrier
    Update(Object **slot, Object *new) {
      Object *old = *slot;
      if (!IsDirty(slot)) {
        log(slot, old);
        SetDirty(slot);
      }
      *slot = new;
    }
Observation: If two threads
  1. invoke the write barrier in parallel, and
  2. both log an old value,
then both record the same old value.
Intermediate Algorithm: Snapshot Oriented, Concurrent
• Use write barrier with program threads.
• To collect:
  – Stop all threads.
  – Scan roots (locals).
  – Get the buffers with modified slots.
  – Clear all dirty bits.
  – Resume threads.
  – For each modified slot:
    • decrease rc for old value (written in buffer),
    • increase rc for current value (“read heap”).
  – Reclaim non-local objects with rc 0.
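The collection steps above can be sketched on a toy heap. This is a single-threaded simplification, not the algorithm as implemented: stopping and resuming threads is omitted, objects have no pointer fields (so there is no recursive reclamation), slots are assumed not to be modified again while the collector reads them, and the `Heap` layout and all function names are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NSLOTS 4
#define NOBJS  4
#define NROOTS 2

typedef struct Obj { int rc; int freed; } Obj;
typedef struct LogEntry { int slot; Obj *old; } LogEntry;

/* Hypothetical toy heap: pointer slots, one log buffer, and local roots. */
typedef struct Heap {
    Obj objs[NOBJS];
    Obj *slots[NSLOTS];
    int dirty[NSLOTS];
    LogEntry log[NSLOTS];   /* at most one entry per slot per cycle */
    int nlog;
    Obj *roots[NROOTS];     /* locals; not counted in rc (deferred RC) */
} Heap;

void heap_init(Heap *h) { memset(h, 0, sizeof *h); }

/* Write barrier from the slides: log the old value on the first store only. */
void heap_update(Heap *h, int slot, Obj *new_value) {
    Obj *old = h->slots[slot];
    if (!h->dirty[slot]) {
        assert(h->nlog < NSLOTS);
        h->log[h->nlog].slot = slot;
        h->log[h->nlog].old = old;
        h->nlog++;
        h->dirty[slot] = 1;
    }
    h->slots[slot] = new_value;
}

static int is_root(const Heap *h, const Obj *o) {
    for (int i = 0; i < NROOTS; i++)
        if (h->roots[i] == o) return 1;
    return 0;
}

/* One collection cycle; returns the number of objects reclaimed. */
int heap_collect(Heap *h) {
    /* Take the buffer, clear dirty bits, apply old.rc-- and current.rc++. */
    for (int i = 0; i < h->nlog; i++) {
        Obj *old = h->log[i].old;
        Obj *cur = h->slots[h->log[i].slot];
        h->dirty[h->log[i].slot] = 0;
        if (old != NULL) old->rc--;
        if (cur != NULL) cur->rc++;
    }
    h->nlog = 0;
    /* Reclaim objects with rc 0 that are not referenced from local roots. */
    int reclaimed = 0;
    for (int i = 0; i < NOBJS; i++) {
        Obj *o = &h->objs[i];
        if (!o->freed && o->rc == 0 && !is_root(h, o)) {
            o->freed = 1;
            reclaimed++;
        }
    }
    return reclaimed;
}
```

An object that is referenced only from a local root keeps rc 0 but survives the cycle, which is the deferred-RC behaviour the slide relies on.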
The Sliding View Algorithm: On-the-Fly
• Do all collection as threads run:
  – Read threads’ buffers (one thread at a time),
  – Clear all dirty bits,
  – Update reference counts,
  – Read roots of each thread, one at a time,
  – Reclaim (recursively) objects with rc 0.
• Note: rc’s are not correct for any specific point in time, yet, with care, most dead objects may be reclaimed!
• Borrow ideas from [Lamport et al.]
Sliding View
Cycles Collection
• Our solution: use a tracing algorithm infrequently.
• Currently this is the most efficient solution. Cycle collectors have high cost.
• We propose a new on-the-fly mark & sweep algorithm that works best with the same sliding view. Can also be used “on its own”.
Implementation for Java
• Based on Sun’s JDK 1.2.2 for Windows NT
• Main features
  – 2-bit RC field per object (à la [Wise et al.])
  – A supplemental sliding view tracing algorithm
  – A custom allocator for on-the-fly RC:
    • Multi-leveled fine-grained locking
    • Supports sporadic reclamation of objects
    • Supports sweeping the heap
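A 2-bit count can only represent a few values, so a limited-field scheme needs a rule for overflow. The sketch below assumes the usual "sticky" rule: the count saturates at its maximum and is then no longer adjusted, leaving such objects to the supplemental tracing collector. The slides do not spell this out, so the saturation behaviour, the `Header` word layout, and the function names are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define RC_BITS  2
#define RC_MASK  ((1u << RC_BITS) - 1u)   /* low two bits: 0b11 */
#define RC_STUCK RC_MASK                  /* saturated: count no longer tracked */

/* Hypothetical header word: the low 2 bits hold the count,
   the remaining bits are free for other uses. */
typedef uint32_t Header;

unsigned rc_get(Header h) { return h & RC_MASK; }

/* Increment, saturating at RC_STUCK. */
Header rc_inc(Header h) {
    unsigned rc = rc_get(h);
    if (rc == RC_STUCK) return h;
    return (h & ~RC_MASK) | (rc + 1u);
}

/* Decrement; a stuck count stays stuck because the true count is unknown. */
Header rc_dec(Header h) {
    unsigned rc = rc_get(h);
    if (rc == RC_STUCK) return h;
    assert(rc > 0);
    return (h & ~RC_MASK) | (rc - 1u);
}
```

Under this assumption, an object whose count ever reaches 3 cannot be reclaimed by reference counting, which is one reason an occasional tracing pass is needed in addition to cycle collection.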
Performance Measurements
• First multiprocessor measurements in a “normal” environment!
  – (Previously reported measurements assumed one CPU is free for GC all the time.)
• Benchmarks:
  – Server benchmarks
    • SPECjbb2000 --- simulates business-like transactions in a large firm
    • MTRT --- a multi-threaded ray tracer
  – Client benchmarks
    • SPECjvm98 --- a suite of mostly single-threaded client benchmarks
Improved RC
• How many RC updates are eliminated?

Benchmark    No. of stores    No. of “first” stores    Ratio of “first” stores
jbb          71,011,357       264,115                  1/269
Compress     64,905           51                       1/1273
Db           33,124,780       30,696                   1/1079
Jack         135,174,775      1,546                    1/87435
Javac        22,042,028       535,296                  1/41
Jess         26,258,107       27,333                   1/961
mpegaudio    5,517,795        51                       1/108192
SPECjbb Latency (Max Transaction Time)
SPECjbb -- Max. Response Time (600MB), in milliseconds

# Threads    1       2       4       6       8       10      15      20
RC           16      16      47      78      110     146     245     329
Original     7433    8037    8463    6923    7857    7536    6593    5997
SPECjbb Throughput
SPECjbb -- performance vs. # threads (600MB): change in throughput (%)

# Threads    1       2       4       6       8       10      15      20
RC           0.4%    4.0%    -5.4%   -2.0%   -1.0%   -2.2%   -0.3%   2.4%
MTRT Throughput
MTRT -- Improvement in Execution Time (%)

# Threads    1       2       3       4       8       12      16
RC           4.9%    5.0%    7.2%    5.6%    11.4%   0.2%    -0.1%
SPECjbb Heap Utilization
SPECjbb --- Heap Usage: MB allocated (not free)

# Threads    1      2      4      6      8      10     15     20
RC           27     44     77     108    170    171    251    329
Original     26     42     74     104    135    166    243    320
Client Performance
SPECjvm98 -- Total Execution Time: % slower execution time, by GC version
RC: 3.6%
Related Work
• On-the-fly tracing:
  – Dijkstra et al. (1976), Steele (1976), Lamport (1976),
  – Kung & Song (1977), Gries (1977), Ben-Ari (1982, 1984), Huelsbergen et al. (1993, 1998)
  – Doligez-Gonthier-Leroy (1993-4), Domani-Kolodner-Petrank (2000)
• Concurrent reference counting:
  – DeTreville (1990),
  – Martinez et al. (1990), Lins (1992)
  – Plakal & Fischer (2001),
  – Bacon et al. (2001)
Conclusions
• A new algorithm for reference counting.
  – Low overhead on pointer modification
  – On-the-fly
• Implementation for Java
• Measurements show high throughput and low latency.
• To be out soon: A matching paper on the sliding view tracing collector.