Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
'
&
$
%Boston University / Computer Science Department PDIS'96
Parallel and Distributed Information Systems, Miami Beach, FL
Characterizing Reference Locality in the WWW
Azer Bestavros & Mark Crovella
Computer Science Department
Boston University
Virgilio Almeida & Adriana de Oliveira
Computer Science Department
Universidade Federal de Minas Gerais
Thursday December 19th 1996
1
'
&
$
%Boston University / Computer Science Department PDIS'96
Talk Outline
� Introduction, Motivation, and Applications
� Experimental Environment and Data Collection
� Characterizing Web Document Popularity
� Characterizing Web Reference Locality
{ Temporal Locality: Evidence and Model
{ Spatial Locality: Evidence and Model
� Synthetic Web Reference Trace Generation
� Related Work
� Current and Future Work
2
'
&
$
%Boston University / Computer Science Department PDIS'96
Introduction
� The characteristics of Web access patterns fall into
two categories:
{ Static (e.g. popularity pro�les)
{ Dynamic (e.g. reference locality)
� Characterizing Web access patterns is crucial for
performance tuning and evaluation.
{ Client/server caching and prefetching protocols
{ Scheduling and load balancing protocols
{ Networking issues
3
'
&
$
%Boston University / Computer Science Department PDIS'96
Evaluating A Workload Model
Workload System
M o d e l
Systemρ
� A workload model W is a perfect representation of
the real workload R if the performance metrics �
obtained using W and R in the same system are
indistiguishable.
4
'
&
$
%Boston University / Computer Science Department PDIS'96
Data Collection
Log NCSA SDSC EPA BU
Duration 1 day 1 day 1 day 2 weeks
Start Date Dec 19 Aug 22 Aug 29 Oct 08
Total requests 46,955 28,338 47,748 80,518
Unique requests 4,851 1,267 6,518 4,471
Summary of Access Log Data
ClientIP : TimeStamp : RequestURL : Size
Data in a Typical Log Record
5
'
&
$
%Boston University / Computer Science Department PDIS'96
Characterizing Document Popularity
1
1.5
2
2.5
3
3.5
0 0.5 1 1.5 2 2.5 3
log1
0(N
umbe
r of
Ref
eren
ces)
log10(Document Rank)
Zipf's Law Applied to Web Documents
6
'
&
$
%Boston University / Computer Science Department PDIS'96
Characterizing Document Popularity
10
100
1000
1 10 100
Ref
eren
ces
Rank
Length 1Length 2Length 3
Zipf's Law Applied to Sequences
7
'
&
$
%Boston University / Computer Science Department PDIS'96
Measuring Locality of Reference
O1O2O3O4O5
OiOi+1
OiO1O2O3O4
Oi-1Oi+1
Object O is
referencedi
LRU stack distance
LRU Stack Distance Model
8
'
&
$
%Boston University / Computer Science Department PDIS'96
Measuring Locality of Reference
� For any string of requests R = r1:r2:r3 : : : we can
compute a corresponding string of stack distances
D = d1:d2:d3 : : :
� The request and distance strings are equivalent in
terms of the locality of reference information they
capture.
� The average stack distance of D is a measure of
the number of intervening requests to unique objects
between recurrring requests.
9
'
&
$
%Boston University / Computer Science Department PDIS'96
Evidence of Temporal Locality
� Consider a scrambled request string R0 that
corresponds to a random permutation of R.
� R and R0 have the same Zipf popularity pro�le since
they are permutations of each other.
� The di�erence in stack distance distribution for R
and R0 would be a measure of temporal locality.
BU Trace Original Scrambled
Mean Stack Distance 479.798 645.586
Standard Deviation 941.430 968.840
10
'
&
$
%Boston University / Computer Science Department PDIS'96
Characterizing Temporal Locality
� If FD is the distribution of the stack distance D,
then the miss rate M (C) of a cache that can hold
C �les is
M (C) = P [D > C] = 1� FD(C)
Knowledge of FD provides enough information to
predict the performance of a cache of any size for
the given trace.
� Our analysis shows that FD has a long tail, yet it
does not seem to follow a power-law (e.g. Pareto).
11
'
&
$
%Boston University / Computer Science Department PDIS'96
Characterizing Temporal Locality
� Lognormal distributions with parameters � and �
seem to provide the best �t for the distributions of
the stack distance in the traces we considered.
BU NCSA SDSC EPA
�̂ 1.829 1.730 1.568 2.150
�̂ 0.947 0.836 0.827 0.921
Lognormal Distribution Parameters
12
'
&
$
%Boston University / Computer Science Department PDIS'96
Characterizing Temporal Locality
0 1000 2000 3000 4000
02
00
00
40
00
0
Stack Distance
Nu
mb
er
of
Occu
rre
nce
s
0 1000 2000 3000
01
00
00
20
00
03
00
00
Stack Distance
Nu
mb
er
of
Occu
rre
nce
s
0 500 1000 15000
50
00
15
00
0
Stack Distance
Nu
mb
er
of
Occu
rre
nce
s
0 2000 4000 6000 8000
01
00
00
20
00
03
00
00
Stack Distance
Nu
mb
er
of
Occu
rre
nce
s
Stack Distance Distributions (BU, NCSA, SDSC, EPA)
0 1 2 3 4
0.0
0.1
0.2
0.3
0.4
log10(Stack Distance)0 1 2 3 4
0.0
0.2
0.4
log10(Stack Distance)0 1 2 3 4
0.0
0.2
0.4
log10(Stack Distance)0 1 2 3 4 5
0.0
0.1
0.2
0.3
0.4
log10(Stack Distance)
Lognormal Distributions and Fit (BU, NCSA, SDSC, EPA)
13
'
&
$
%Boston University / Computer Science Department PDIS'96
Temporal Locality Modeling Accuracy
0
0.2
0.4
0.6
0.8
1
0 500 1000 1500 2000 2500 3000 3500 4000 4500
Mis
s R
ate
Cache Size
Actual DataModel
0
0.2
0.4
0.6
0.8
1
0 500 1000 1500 2000 2500 3000 3500 4000
Mis
s R
ate
Cache Size
Actual DataModel
0
0.2
0.4
0.6
0.8
1
0 200 400 600 800 1000 1200 1400 1600 1800
Mis
s R
ate
Cache Size
Actual DataModel
0
0.2
0.4
0.6
0.8
1
0 1000 2000 3000 4000 5000 6000 7000 8000 9000
Mis
s R
ate
Cache Size
Actual DataModel
Actual and Predicted Miss Rates (BU, NCSA, SDSC, EPA)
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 1000 2000 3000 4000 5000 6000 7000 8000 9000
Miss
Rat
e
Cache Size
SDSC (m = 1.6, s = .83)NCSA (m = 1.7, s = .84)
BU (m = 1.8, s = .95)EPA (m = 2.2, s = .92)
Comparative Predicted Miss Rates (BU, NCSA, SDSC, EPA)
14
'
&
$
%Boston University / Computer Science Department PDIS'96
Evidence of Spatial Locality
5000
10000
15000
20000
25000
30000
2 4 6 8 10
Sequence Length
Num
ber
of U
niqu
e Se
quen
ces
BU1 Server Trace
BU1 Server Trace (Random Permutation)
BU1 Client Trace
Unique sequences observed in the BU trace
15
'
&
$
%Boston University / Computer Science Department PDIS'96
Characterizing Spatial Locality
� Stack distance series are bursty at all timescales
; They exhibit self-similar characteristics.
0
50000
100000
150000
200000
250000
300000
350000
38000 42000 46000 500000
50000
100000
150000
200000
250000
300000
350000
38000 42000 46000 50000
0
5000
10000
15000
20000
25000
30000
42000 42500 43000 43500 44000 44500 450000
5000
10000
15000
20000
25000
30000
42000 42500 43000 43500 44000 44500 45000
0
1000
2000
3000
4000
5000
6000
7000
44500 44600 44700 44800 44900 45000 451000
1000
2000
3000
4000
5000
6000
7000
44500 44600 44700 44800 44900 45000 45100
Stack Distance Scaling Behavior of Original (left) and Scrambled (right) BU Traces
16
'
&
$
%Boston University / Computer Science Department PDIS'96
Characterizing Spatial Locality
� Stack distance self-similarity is evidence of very
long-range correlations, which correspond to long
periods of very large stack distances|caused by
phase changes in referencing behavior.
� The degree of self-similarity is captured by theHurst
parameter H , which takes values between 0.5 and
1.0. As H ! 1, the burstiness becomes more
pronounced at high levels of aggregation.
�We use four methods to estimate the H parameter
for our datasets: the variance-time plot, the R/S
plot, the periodogram, and the Whittle estimator,
which provides con�dence intervals as well.
17
'
&
$
%Boston University / Computer Science Department PDIS'96
Characterizing Spatial Locality
-5
-4.5
-4
-3.5
-3
-2.5
-2
-1.5
-1
-0.5
0
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
log1
0(N
orm
aliz
ed V
aria
nce)
log10(m)
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
1 1.5 2 2.5 3 3.5 4 4.5 5
log1
0(r/
s)
log10(d)
6
7
8
9
10
11
12
13
14
-4 -3.5 -3 -2.5 -2 -1.5 -1 -0.5 0
log1
0(pe
riodo
gram
)
log10(frequency)
Graphical Estimators of H for BU Trace
V-T R/S Per. Wtl. (95% conf.)
BU .82 .78 .87 .85 (0.84, 0.87)
NCSA .71 .74 .74 .74 (0.73, 0.77)
SDSC .71 .68 .69 .68 (0.66, 0.71)
EPA .64 .66 .66 .65 (0.64, 0.67)
Estimates of H for Original Traces
18
'
&
$
%Boston University / Computer Science Department PDIS'96
Characterizing Spatial Locality
Original Trace Scrambled Trace
V-T R/S Per. Wtl. V-T R/S Per. Wtl.
BU .82 .78 .87 .85 .50 .55 .50 .50
NCSA .71 .74 .74 .74 .50 .51 .51 .49
SDSC .71 .68 .69 .68 .52 .54 .50 .50
EPA .64 .66 .66 .65 .51 .55 .47 .50
H for Original vs Scrambled Traces
19
'
&
$
%Boston University / Computer Science Department PDIS'96
Synthetic Web Reference Trace Generation
Step 1 :
Select parameters � and � re ecting temporal
locality, and H re ecting spatial locality|based on
empirical measurement of traces to be imitated, or
based on our results.
Step 2 :
Generate a stack distance trace with marginal
distribution determined by � and � and long-range
dependence determined by H using the two-phase
approach described in [Huang et al: 1995].
Step 3 :
Invert the stack distance trace to form a sequence of
�le names.
20
'
&
$
%Boston University / Computer Science Department PDIS'96
Related Work
Traditional Memory Systems
� Fundamentals of reference locality in hierarchical
memories [Denning and Schwartz: 1972].
� Stack distance analysis and algorithms [Mattson
et al: 1970].
� Establish the existence of long-range dependence
in reference strings [Spirn: 1976].
� Relate the fractal dimension of cache misses to
software complexity [Voldman et al: 1983].
�Model memory access pattern as a random walk
with fractal dimension [Thiebaut: 1989].
21
'
&
$
%Boston University / Computer Science Department PDIS'96
Related Work
Large-scale Information Systems
� Caching and replication for distributed �le
systems [Howard et al: 1988].
�Model Web access using Zipf-based popularity
pro�les [Glassman: 1994].
�Model Web access using frequency and recency
rates of past accesses [Recker and Pitkow: 1994].
� Characteristics of client access patterns [Cunha,
Bestavros, and Crovella: 1995].
� Used Markov processes to model access inter-
dependencies [Cunha and Bestavros: 1995, 1996].
22
'
&
$
%Boston University / Computer Science Department PDIS'96
Current and Future Work
� Incorporate �le size information in the access
pattern characterization.
� Study the e�ect of increased multiprogramming
levels on access pattern characteristics.
� Study the implication on caching and prefetching
algorithms at clients and servers.
� Use measured characteristics to design benchmarks
for evaluating client and server software.
23