1 Sampling-based Program Locality Approximation Yutao Zhong, Wentao Chang Department of Computer Science George Mason University June 8th,2008

1

Sampling-based Program Locality Approximation

Yutao Zhong, Wentao Chang

Department of Computer Science, George Mason University

June 8th, 2008

2

Outline

• Background information

• Motivation

• Our sampling approach

• Experimental results

3

Reuse distance and reuse signature

a b c a a c b

• Reuse distance: the number of distinct data elements accessed between two consecutive uses of the same element

• Reuse signature: a histogram of reuse distances demonstrating the distribution of reuse distances over different lengths

[Figure: two reuses in the trace above, each drawn from its starting point to its ending point and labeled with reuse distance 2]
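The two definitions on this slide can be made concrete with a minimal Python sketch. The function names are illustrative and not taken from the talk, and the quadratic scan is chosen for clarity rather than speed.

```python
from collections import Counter

def reuse_distances(trace):
    """Return one entry per access: its reuse distance, or None for a first use."""
    last_index = {}   # element -> index of its most recent access
    distances = []
    for i, elem in enumerate(trace):
        if elem in last_index:
            # Distinct elements strictly between the two consecutive uses.
            between = trace[last_index[elem] + 1 : i]
            distances.append(len(set(between)))
        else:
            distances.append(None)
        last_index[elem] = i
    return distances

def reuse_signature(trace):
    """Histogram mapping each reuse distance to the number of reuses with it."""
    return Counter(d for d in reuse_distances(trace) if d is not None)

trace = "a b c a a c b".split()
print(reuse_distances(trace))                  # [None, None, None, 2, 0, 1, 2]
print(sorted(reuse_signature(trace).items()))  # [(0, 1), (1, 1), (2, 2)]
```

For the slide's trace, the second use of a and the second use of b both have distance 2.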

4

Reuse signature application

• Relationship to cache behavior:

  • Capacity miss <= reuse distance ≥ cache size

  • Reduce reuse distance => improve cache effectiveness

• Current applications:

  • Predict cache miss rate [Zhong+03] [Marin & Mellor-Crummey 04] [Fang+05] [Zhong+07]

  • Reorganize data [Zhong+04]

  • Provide caching hint [Beyls & D’Hollander 02]

  • Evaluate program optimizations [Beyls & D’Hollander 01] [Ding 00]
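As a rough illustration of the miss-rate application, the sketch below estimates a miss rate from a reuse signature. It assumes a fully associative LRU cache whose size is counted in data elements, and it counts first uses as misses. These modeling choices and the function name are assumptions for illustration, not details from the talk.

```python
def estimated_miss_rate(signature, first_uses, cache_size):
    """Estimate the miss rate of a fully associative LRU cache (assumed model).

    signature:  dict mapping reuse distance -> number of reuses
    first_uses: number of accesses with no earlier use (cold misses)
    cache_size: capacity in data elements
    """
    total = first_uses + sum(signature.values())
    if total == 0:
        return 0.0
    # A reuse misses when its distance is at least the cache size.
    capacity_misses = sum(n for d, n in signature.items() if d >= cache_size)
    return (first_uses + capacity_misses) / total

# Signature of the trace "a b c a a c b": 3 first uses, 4 reuses.
signature = {0: 1, 1: 1, 2: 2}
print(estimated_miss_rate(signature, first_uses=3, cache_size=2))  # 5 of 7 accesses miss
```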

5

Reuse distance measurement

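[Figure: measurement workflow. Get the accessed memory address; search and update the access time table by address; search the access trace by the last access time, count the distance, and update the trace; record the distance in the distance histogram.]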

① Storing traces and counting memory accesses require large space and a long counting time

② The cost is enormous for memory-intensive programs

[Figure: data structure example using the trace a c a b b a, with one reuse marked from its starting point to its ending point and labeled with distance 1]
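The workflow on this slide can be sketched in Python as follows. A sorted list of last-access times stands in for the access trace structure; this is a simplification chosen for brevity, and all names are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

def measure_signature(trace):
    """Measure the reuse signature using an access time table and a recency list."""
    last_time = {}    # access time table: address -> time of its last access
    live_times = []   # last-access times, one per distinct address, kept sorted
    histogram = Counter()
    for now, addr in enumerate(trace):
        prev = last_time.get(addr)
        if prev is not None:
            pos = bisect_left(live_times, prev)
            # Entries after pos belong to distinct addresses accessed since prev.
            histogram[len(live_times) - pos - 1] += 1
            live_times.pop(pos)
        live_times.append(now)   # times only grow, so the list stays sorted
        last_time[addr] = now
    return histogram

print(sorted(measure_signature("a c a b b a".split()).items()))  # [(0, 1), (1, 2)]
```

Both structures grow with the number of distinct addresses, and the list operations are linear in that number, which is why a tree is the usual choice in practice.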

6

Motivation

• Sampling is generally effective at reducing the overhead of program behavior profiling

• We aim to balance efficiency and accuracy

  • Sample only 1% of memory accesses

  • Improve measurement speed by 7.5 times on average

  • Achieve over 99% accuracy

7

Sampling algorithms

• Utilize the common structure of bursty tracing [Hirzel & Chilimbi 01]

• Sampling rate r = |IS| / (|IS| + |IH|), where IS is a sampling interval and IH is a hibernating interval

• Naïve sampling

  • Turn off profiling during hibernating intervals

  • No guarantee of accuracy

8

Naive sampling

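Memory access trace:

. . c a b c a c a b c a c a b c d a . . . .

[Figure: the trace divided into hibernating (IH) and sampling (IS) intervals. Under naïve sampling, five reuses ① to ⑤ are marked; ④ is labeled 1 and ⑤ is labeled 3, and the result is annotated as an inaccurate measurement.]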
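One way to read naïve sampling is that accesses in hibernating intervals are simply invisible to the profiler. The sketch below follows that reading. The fixed interval lengths `ls` and `lh`, the hibernating-then-sampling order within each period, and the function names are assumptions made for illustration.

```python
def is_sampled(t, ls, lh):
    """True if time t lies in a sampling interval (each period: lh hibernating, then ls sampling)."""
    return t % (lh + ls) >= lh

def distances(trace):
    """Reuse distance of every reuse in the trace, in access order."""
    last_index, out = {}, []
    for i, elem in enumerate(trace):
        if elem in last_index:
            out.append(len(set(trace[last_index[elem] + 1 : i])))
        last_index[elem] = i
    return out

def naive_sampled_distances(trace, ls, lh):
    """Measure distances on only the accesses that fall in sampling intervals."""
    visible = [elem for t, elem in enumerate(trace) if is_sampled(t, ls, lh)]
    return distances(visible)

trace = list("xyabca")
print(distances(trace))                            # [2]
print(naive_sampled_distances(trace, ls=1, lh=2))  # [0]
```

Here the reuse of a has actual distance 2, but b and c fall in a hibernating interval and are never seen, so the measured distance is 0.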

9

Biased sampling

• Ignore a datum that has been referenced within the current hibernating period

• Measured distance is always larger than or equal to the actual distance

• Probability of being sampled is not uniform

10

Biased sampling

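Memory access trace:

. . c a b c a f a b c a c a b f d a . . . .

[Figure: the trace divided into hibernating (IH) and sampling (IS) intervals, with five reuses ① to ⑤ marked under biased sampling]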

11

History-preserved representative sampling

• Add an additional tag for each address in access trace

• Mark references within a sampling period as sampled in the tag

• Reuse will only be sampled when starting point marked sampled

12

History-preserved representative sampling

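Memory access trace:

. . c a b c a f a b c a c a b f d a . . . .

[Figure: the trace divided into hibernating (IH) and sampling (IS) intervals, with five reuses ① to ⑤ marked under history-preserved representative sampling]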
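The three rules of history-preserved representative sampling can be sketched as below: every access updates the history, each record carries a sampled tag, and a reuse is recorded only if its starting point was tagged. The fixed interval lengths, the interval order, and all names are assumptions, and a sorted list stands in for the real trace structure.

```python
from bisect import bisect_left
from collections import Counter

def is_sampled(t, ls, lh):
    """True if time t lies in a sampling interval (each period: lh hibernating, then ls sampling)."""
    return t % (lh + ls) >= lh

def representative_signature(trace, ls, lh):
    """Record exact distances for reuses whose starting point was sampled."""
    last = {}         # datum -> (time of last access, sampled tag)
    live_times = []   # last-access times of all distinct data, kept sorted
    histogram = Counter()
    for now, datum in enumerate(trace):
        if datum in last:
            prev, tagged = last[datum]
            pos = bisect_left(live_times, prev)
            if tagged:
                # History is complete, so this distance is exact.
                histogram[len(live_times) - pos - 1] += 1
            live_times.pop(pos)
        live_times.append(now)
        last[datum] = (now, is_sampled(now, ls, lh))
    return histogram

print(representative_signature(list("xyabca"), ls=1, lh=2))  # Counter({2: 1})
```

Because hibernating accesses still update the history, the sampled reuse of a gets its true distance of 2 even though b and c were accessed while hibernating.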

13

Further improvements

• Simplifying maintenance in hibernating intervals

  • Reference trace implementation: splay tree [Ding & Zhong 03]

  • In a sampling period, full tree maintenance

  • In a hibernating period, instead of a new leaf node for each access, we construct a single node for each hibernating period with a counter of the number of distinct accesses

• Fast sample tag marking and checking

  • To save space, we fix the lengths of the sampling and hibernating periods, which avoids the additional tag
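A minimal sketch of both improvements follows, with a plain sorted list standing in for the splay tree. Each hibernating period is collapsed into one counted node, and the sampled status of a starting point is derived from its time instead of a stored tag. The interval order, the node layout, and all names are assumptions for illustration.

```python
from bisect import bisect_left
from collections import Counter

def is_sampled(t, ls, lh):
    """True if time t lies in a sampling interval (each period: lh hibernating, then ls sampling)."""
    return t % (lh + ls) >= lh

def optimized_signature(trace, ls, lh):
    period = lh + ls
    nodes = []   # [key, count] pairs sorted by key; key is a time stamp
    where = {}   # datum -> key of the node holding its last access
    histogram = Counter()
    for now, datum in enumerate(trace):
        if datum in where:
            key = where[datum]
            pos = bisect_left(nodes, [key])   # [key] sorts just before [key, count]
            if is_sampled(key, ls, lh):       # status derived from time, no tag stored
                histogram[sum(count for _, count in nodes[pos + 1:])] += 1
            nodes[pos][1] -= 1
            if nodes[pos][1] == 0:
                nodes.pop(pos)
        # Sampled accesses get their own node; hibernating accesses share
        # one node keyed by the start time of their period.
        key = now if is_sampled(now, ls, lh) else now - now % period
        if nodes and nodes[-1][0] == key:
            nodes[-1][1] += 1
        else:
            nodes.append([key, 1])
        where[datum] = key
    return histogram

print(optimized_signature(list("xyabca"), ls=1, lh=2))  # Counter({2: 1})
```

The distance of a sampled reuse is the sum of the counters of all later nodes, so it stays exact while the structure holds far fewer nodes when hibernating periods are long.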

14

Experiments

• Benchmarks from SPEC 2006, Olden, Chaos:

  • Floating point programs: CactusADM, Milc, Soplex, Apsi, MolDyn

  • Integer programs: Bzip2, Gcc, Libquantum, Perimeter, TSP

• Instrumentation tool: Valgrind 3.2.3

• Sampling rate: 1%

• We run each individual benchmark with 3 to 6 different inputs

• Repeat three times for each input

15

Experiments cont’d

• Comparison of accuracy and efficiency

  • Ding and Zhong’s approximation method [Ding & Zhong 03]

  • Time distance measurement [Shen+07]

• Implementation of four algorithms:

  • Naive sampling, biased sampling, basic and optimized representative sampling

16

Accuracy

17

Efficiency

Sampling even outperforms the lower bound: time distance measurement

Generally, speedup is less when the input size is small

18

Efficiency

• Speedup of basic representative sampling: around 4-5 times for most cases

• Speedup of optimized representative sampling:

  • Around 7-10 times for most cases, up to 33 times

  • Geometric mean is 7.5

• Sampling rate effect (TSP):

19

Related work

• Reuse signature collection

  • [Mattson+70] [Bennett & Kruskal 75] [Olken 81] [Kim+91] [Sugumar & Abraham 93] [Almasi+02] [Ding & Zhong 03] [Shen+07]

• Selective monitoring

  • Time sampling [Zagha+96] [Anderson+97] [Burrows+00] [Whaley 00] [Arnold & Sweeney 00] [Arnold & Ryder 01] [Hirzel & Chilimbi 01] [Chilimbi & Hirzel 02] [Itzkowitz+03] [Arnold & Grove 05]

  • Data sampling [Larus 90] [Ding & Zhong 02] [Zhao+07]

• Uses of efficient locality analysis [Huang & Shen 96] [Li+96] [Ding 00] [Beyls & D’Hollander 01] [Almasi+02] [Beyls & D’Hollander 02] [Zhong+04] [Marin & Mellor-Crummey 04] [Fang+05] [Zhong+07]

20

Future work

• Dynamically adjust sampling/hibernating lengths

• Store references in temporary buffer and then process them in batch

• Combine time sampling with data sampling

21

Thank you!

Questions?
