23

arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Parallel and Distributed Information Systems, Miami Beach, FL

Characterizing Reference Locality in the WWW

Azer Bestavros & Mark Crovella

Computer Science Department

Boston University

Virgilio Almeida & Adriana de Oliveira

Computer Science Department

Universidade Federal de Minas Gerais

Thursday December 19th 1996

1

Page 2: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Talk Outline

� Introduction, Motivation, and Applications

� Experimental Environment and Data Collection

� Characterizing Web Document Popularity

� Characterizing Web Reference Locality

{ Temporal Locality: Evidence and Model

{ Spatial Locality: Evidence and Model

� Synthetic Web Reference Trace Generation

� Related Work

� Current and Future Work

2

Page 3: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Introduction

� The characteristics of Web access patterns fall into

two categories:

{ Static (e.g. popularity pro�les)

{ Dynamic (e.g. reference locality)

� Characterizing Web access patterns is crucial for

performance tuning and evaluation.

{ Client/server caching and prefetching protocols

{ Scheduling and load balancing protocols

{ Networking issues

3

Page 4: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Evaluating A Workload Model

Workload System

M o d e l

Systemρ

� A workload model W is a perfect representation of

the real workload R if the performance metrics �

obtained using W and R in the same system are

indistiguishable.

4

Page 5: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Data Collection

Log NCSA SDSC EPA BU

Duration 1 day 1 day 1 day 2 weeks

Start Date Dec 19 Aug 22 Aug 29 Oct 08

Total requests 46,955 28,338 47,748 80,518

Unique requests 4,851 1,267 6,518 4,471

Summary of Access Log Data

ClientIP : TimeStamp : RequestURL : Size

Data in a Typical Log Record

5

Page 6: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Characterizing Document Popularity

1

1.5

2

2.5

3

3.5

0 0.5 1 1.5 2 2.5 3

log1

0(N

umbe

r of

Ref

eren

ces)

log10(Document Rank)

Zipf's Law Applied to Web Documents

6

Page 7: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Characterizing Document Popularity

10

100

1000

1 10 100

Ref

eren

ces

Rank

Length 1Length 2Length 3

Zipf's Law Applied to Sequences

7

Page 8: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Measuring Locality of Reference

O1O2O3O4O5

OiOi+1

OiO1O2O3O4

Oi-1Oi+1

Object O is

referencedi

LRU stack distance

LRU Stack Distance Model

8

Page 9: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Measuring Locality of Reference

� For any string of requests R = r1:r2:r3 : : : we can

compute a corresponding string of stack distances

D = d1:d2:d3 : : :

� The request and distance strings are equivalent in

terms of the locality of reference information they

capture.

� The average stack distance of D is a measure of

the number of intervening requests to unique objects

between recurrring requests.

9

Page 10: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Evidence of Temporal Locality

� Consider a scrambled request string R0 that

corresponds to a random permutation of R.

� R and R0 have the same Zipf popularity pro�le since

they are permutations of each other.

� The di�erence in stack distance distribution for R

and R0 would be a measure of temporal locality.

BU Trace Original Scrambled

Mean Stack Distance 479.798 645.586

Standard Deviation 941.430 968.840

10

Page 11: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Characterizing Temporal Locality

� If FD is the distribution of the stack distance D,

then the miss rate M (C) of a cache that can hold

C �les is

M (C) = P [D > C] = 1� FD(C)

Knowledge of FD provides enough information to

predict the performance of a cache of any size for

the given trace.

� Our analysis shows that FD has a long tail, yet it

does not seem to follow a power-law (e.g. Pareto).

11

Page 12: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Characterizing Temporal Locality

� Lognormal distributions with parameters � and �

seem to provide the best �t for the distributions of

the stack distance in the traces we considered.

BU NCSA SDSC EPA

�̂ 1.829 1.730 1.568 2.150

�̂ 0.947 0.836 0.827 0.921

Lognormal Distribution Parameters

12

Page 13: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Characterizing Temporal Locality

0 1000 2000 3000 4000

02

00

00

40

00

0

Stack Distance

Nu

mb

er

of

Occu

rre

nce

s

0 1000 2000 3000

01

00

00

20

00

03

00

00

Stack Distance

Nu

mb

er

of

Occu

rre

nce

s

0 500 1000 15000

50

00

15

00

0

Stack Distance

Nu

mb

er

of

Occu

rre

nce

s

0 2000 4000 6000 8000

01

00

00

20

00

03

00

00

Stack Distance

Nu

mb

er

of

Occu

rre

nce

s

Stack Distance Distributions (BU, NCSA, SDSC, EPA)

0 1 2 3 4

0.0

0.1

0.2

0.3

0.4

log10(Stack Distance)0 1 2 3 4

0.0

0.2

0.4

log10(Stack Distance)0 1 2 3 4

0.0

0.2

0.4

log10(Stack Distance)0 1 2 3 4 5

0.0

0.1

0.2

0.3

0.4

log10(Stack Distance)

Lognormal Distributions and Fit (BU, NCSA, SDSC, EPA)

13

Page 14: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Temporal Locality Modeling Accuracy

0

0.2

0.4

0.6

0.8

1

0 500 1000 1500 2000 2500 3000 3500 4000 4500

Mis

s R

ate

Cache Size

Actual DataModel

0

0.2

0.4

0.6

0.8

1

0 500 1000 1500 2000 2500 3000 3500 4000

Mis

s R

ate

Cache Size

Actual DataModel

0

0.2

0.4

0.6

0.8

1

0 200 400 600 800 1000 1200 1400 1600 1800

Mis

s R

ate

Cache Size

Actual DataModel

0

0.2

0.4

0.6

0.8

1

0 1000 2000 3000 4000 5000 6000 7000 8000 9000

Mis

s R

ate

Cache Size

Actual DataModel

Actual and Predicted Miss Rates (BU, NCSA, SDSC, EPA)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 1000 2000 3000 4000 5000 6000 7000 8000 9000

Miss

Rat

e

Cache Size

SDSC (m = 1.6, s = .83)NCSA (m = 1.7, s = .84)

BU (m = 1.8, s = .95)EPA (m = 2.2, s = .92)

Comparative Predicted Miss Rates (BU, NCSA, SDSC, EPA)

14

Page 15: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Evidence of Spatial Locality

5000

10000

15000

20000

25000

30000

2 4 6 8 10

Sequence Length

Num

ber

of U

niqu

e Se

quen

ces

BU1 Server Trace

BU1 Server Trace (Random Permutation)

BU1 Client Trace

Unique sequences observed in the BU trace

15

Page 16: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Characterizing Spatial Locality

� Stack distance series are bursty at all timescales

; They exhibit self-similar characteristics.

0

50000

100000

150000

200000

250000

300000

350000

38000 42000 46000 500000

50000

100000

150000

200000

250000

300000

350000

38000 42000 46000 50000

0

5000

10000

15000

20000

25000

30000

42000 42500 43000 43500 44000 44500 450000

5000

10000

15000

20000

25000

30000

42000 42500 43000 43500 44000 44500 45000

0

1000

2000

3000

4000

5000

6000

7000

44500 44600 44700 44800 44900 45000 451000

1000

2000

3000

4000

5000

6000

7000

44500 44600 44700 44800 44900 45000 45100

Stack Distance Scaling Behavior of Original (left) and Scrambled (right) BU Traces

16

Page 17: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Characterizing Spatial Locality

� Stack distance self-similarity is evidence of very

long-range correlations, which correspond to long

periods of very large stack distances|caused by

phase changes in referencing behavior.

� The degree of self-similarity is captured by theHurst

parameter H , which takes values between 0.5 and

1.0. As H ! 1, the burstiness becomes more

pronounced at high levels of aggregation.

�We use four methods to estimate the H parameter

for our datasets: the variance-time plot, the R/S

plot, the periodogram, and the Whittle estimator,

which provides con�dence intervals as well.

17

Page 18: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Characterizing Spatial Locality

-5

-4.5

-4

-3.5

-3

-2.5

-2

-1.5

-1

-0.5

0

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

log1

0(N

orm

aliz

ed V

aria

nce)

log10(m)

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

1 1.5 2 2.5 3 3.5 4 4.5 5

log1

0(r/

s)

log10(d)

6

7

8

9

10

11

12

13

14

-4 -3.5 -3 -2.5 -2 -1.5 -1 -0.5 0

log1

0(pe

riodo

gram

)

log10(frequency)

Graphical Estimators of H for BU Trace

V-T R/S Per. Wtl. (95% conf.)

BU .82 .78 .87 .85 (0.84, 0.87)

NCSA .71 .74 .74 .74 (0.73, 0.77)

SDSC .71 .68 .69 .68 (0.66, 0.71)

EPA .64 .66 .66 .65 (0.64, 0.67)

Estimates of H for Original Traces

18

Page 19: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Characterizing Spatial Locality

Original Trace Scrambled Trace

V-T R/S Per. Wtl. V-T R/S Per. Wtl.

BU .82 .78 .87 .85 .50 .55 .50 .50

NCSA .71 .74 .74 .74 .50 .51 .51 .49

SDSC .71 .68 .69 .68 .52 .54 .50 .50

EPA .64 .66 .66 .65 .51 .55 .47 .50

H for Original vs Scrambled Traces

19

Page 20: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Synthetic Web Reference Trace Generation

Step 1 :

Select parameters � and � re ecting temporal

locality, and H re ecting spatial locality|based on

empirical measurement of traces to be imitated, or

based on our results.

Step 2 :

Generate a stack distance trace with marginal

distribution determined by � and � and long-range

dependence determined by H using the two-phase

approach described in [Huang et al: 1995].

Step 3 :

Invert the stack distance trace to form a sequence of

�le names.

20

Page 21: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Related Work

Traditional Memory Systems

� Fundamentals of reference locality in hierarchical

memories [Denning and Schwartz: 1972].

� Stack distance analysis and algorithms [Mattson

et al: 1970].

� Establish the existence of long-range dependence

in reference strings [Spirn: 1976].

� Relate the fractal dimension of cache misses to

software complexity [Voldman et al: 1983].

�Model memory access pattern as a random walk

with fractal dimension [Thiebaut: 1989].

21

Page 22: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Related Work

Large-scale Information Systems

� Caching and replication for distributed �le

systems [Howard et al: 1988].

�Model Web access using Zipf-based popularity

pro�les [Glassman: 1994].

�Model Web access using frequency and recency

rates of past accesses [Recker and Pitkow: 1994].

� Characteristics of client access patterns [Cunha,

Bestavros, and Crovella: 1995].

� Used Markov processes to model access inter-

dependencies [Cunha and Bestavros: 1995, 1996].

22

Page 23: arallel and Distributed Information Systems, Miami Beacbest/research/talks/Locality... · 2006. 3. 17. · arallel and Distributed Information Systems, Miami Beac h, FL Characterizing

'

&

$

%Boston University / Computer Science Department PDIS'96

Current and Future Work

� Incorporate �le size information in the access

pattern characterization.

� Study the e�ect of increased multiprogramming

levels on access pattern characteristics.

� Study the implication on caching and prefetching

algorithms at clients and servers.

� Use measured characteristics to design benchmarks

for evaluating client and server software.

23