Analysis methods of heavy-tailed data - John Wiley & … Analysis methods of heavy-tailed data Natalia Markovich Institute of Control Sciences Russian Academy of Sciences, Moscow,

logo

Analysis methods of heavy-tailed data

Natalia Markovich

Institute of Control SciencesRussian Academy of Sciences, Moscow, Russia

February, 13-18, 2006, Bamberg, GermanyJune, 19-23, 2006, Brest, France

May, 14-19, 2007, Trondheim, NorwayPhD course

Natalia Markovich Analysis methods of heavy-tailed data

logo

Chapter 1

Introduction:

definitions and basic properties of classes of heavy-taileddistributions. Tail index estimation. Methods for the selection ofthe number of the largest order statistics in Hill estimator.Rough methods for the detection of heavy tails and the numberof finite moments.

The Section 1 contains the introduction

with necessary definitions, basic properties and examples ofheavy-tailed data. The tail index indicates the shape of the tailand therefore it is the basic characteristic of heavy-tailed data.Methods of tail index estimation are presented.Finally, several rough tools for the detection of heavy-tails, thenumber of finite moments and dependence are considered.


logo

Main assumption.

Let X1, . . . ,Xn be a sample of i.i.d. r.v.s distributed with theheavy-tailed DF F (x).

Definition

A DF F (x) (or the r.v. X ) is called heavy-tailed if its tailF (x) = 1 − F (x) > 0, x ≥ 0, satisfies ∀y ≥ 0

limx→∞

P{X > x + y |X > x} = limx→∞

F (x + y)/F (x) = 1.

Comparison of probability tails.


logo

Examples of heavy- and light-tailed distributions

Heavy-tailed Subexponential:distributions: Pareto, Lognormal, Weibull

with shape parameter less than 1.With regularly varyingtails: Pareto, Cauchy, Burr, Frechet,Zipf-Mandelbrot law.

Light-tailed exponential, gamma, Weibulldistributions with shape parameter more than 1,

normal, finite distributions.


logo

Heavy-tailed distributions have been accepted asrealistic models for various phenomena:

WWW-session characteristicssizes and durations of sub-sessions; sizes of responsesinter-response time intervals

on/off-periods of packet traffic

file sizes

service-time in queueing model

flood levels of rivers

major insurance claims

extreme levels of ozon concentrations

high wind-speed values

wave heights during a storm

low and high temperatures


logo

Basic definitions

Let X n = {X1, . . . ,Xn} be a sample of i.i.d. r.v. distributed withthe DF F (x) and Mn = max(X1,X2, . . . ,Xn).Extreme value theory assumes that for a suitable choice ofnormalizing constants an,bn it holds

P{(Mn − bn)/an ≤ x} = F n(bn + anx) →n→∞ Hγ(x), x ∈ R,

and an Extreme Value DF Hγ(x) is of the following type:

Hγ(x) =

exp(−x−1/γ), x > 0, γ > 0 ’Frechet’,exp(−(−x)−1/γ), x < 0, γ < 0 ’Weibull’,exp(−e−x), γ = 0, x ∈ R ’Gumbel’.

Definition

The parameter γ is called the extreme value index (EVI) anddefines the shape of the tail of the r.v. X .The parameter α = 1/γ is called tail index.


logo

Pickands’s theorem:

The limit distribution of the excess distribution of the Xi ’s is nec-essarily of the Generalized Pareto form

limu↑xF ,u+x<xF

P (X1 − u > x |X1 > u) → (1 + γx)−1/γ+ , x ∈ R,

wherexF = sup{x ∈ R : F (x) < 1}

is the right endpoint of the distribution F and the shape parameterγ ∈ R.

Therefore, the Generalized Pareto distribution (GPD) with DF

Ψσ,γ(x) = 1 − (1 + γx/σ)−1/γ+

is often used as a model of the tail of the distribution.


logo

The classes of heavy-tailed distributions

distributions with regularly varying tails ( RVT )(X ∈ R−1/γ)

P{X > x} = x−1/γℓ(x),∀x > 0, γ > 0,

where ℓ is a slowly varying function, i.e.

limx→∞

ℓ(tx)/ℓ(x) = 1, ∀t > 0.

Examples:

Pareto, Burr, Frechet distributions belong to RVT .

Examples of ℓ(x) give

c ln x , c ln(ln x) and all functions converging to positiveconstants.


logo

The classes of heavy-tailed distributions

subexponential distributions ( S)(X ∈ S)

P{Sn > x} ∼ nP{X1 > x} ∼ P{Mn > x} as x → ∞,

where Sn = X1 + ...Xn, n ≥ 2, Mn = maxi=1,...,n{Xi}.

Example:

Weibull with shape parameter less than 1 belongs to S.

Intuitively, subexponentiality means

that the only way the sum can be large is by one of thesummands getting large (in contrast, in the light-tailed case allsummands are large if the sum is so).


logo

Basic properties of regularly varying distributions:

Lemma

Let X ∈ R−α. Then,

(i) X ∈ S.

(ii) E{Xβ} <∞ if β < α, E{Xβ} = ∞ if β ≥ α.

(iii) If α > 1, then X r ∈ R1−α and

P{X r > x} ∼ ℓ(x)x1−α/((α− 1)E{X}) as x → ∞.

(iv) If Y is non-negative and independent of X such thatP{Y > x} = ℓ2(x)x−α2 , then X + Y ∈ R−min(α,α2) andP{X + Y > x} ∼ P{X > x} + P{Y > x} as x → ∞.

(v) If Y is non-negative and independent of X such thatE{Y α+ε} <∞ for some ε > 0 then XY ∈ R−α and

P{XY > x} ∼ E{Y α}P{X > x} as x → ∞.


logo

Important property for the rough detection of heavytails and the number of finite moments:

Let X ∈ R−α.

Then E{Xβ} <∞, if β < 1/γ; E{Xβ} = ∞, if β > 1/γ.

Examples:

If α = 2, γ = 0.5, thenEX1 <∞ (the first moment is finite),EX 2

1 = ∞ (the second moment is infinite or does not exist).

If α = 0.5, γ = 2, thenEX1 = ∞, EX 2

1 = ∞... (all moments are infinite or do notexist).


logo

Estimators of tail index

X(1) ≤ X(2) ≤ ... ≤ X(n)

are order statistics of the sample X n = {X1,X2, ...,Xn}For γ > 0:

Hill’s estimator

γH(n, k) =1k

k∑

i=1

ln X(n−i+1) − ln X(n−k)

Ratio estimator Goldie, Smith, (1987)

an = an(xn) =n∑

i=1

ln(Xi/xn)I{Xi > xn}/n∑

i=1

I{Xi > xn},

xn is am arbitrary threshold level, e.g., xn = X(n−k)


logo

Estimators of tail index

For γ ∈ R:

”Moment estimator”, Dekkers, Einmahl, de Haan, (1989):

γM(n, k) = γH(n, k) + 1 − 0.5(

1 − (γH(n, k))2/Sn,k ))−1

,

Sn,k = (1/k)∑k

i=1

(ln X(n−i+1) − ln X(n−k)

)2

” UH estimator”, Berlinet, (1998):

γUH(n, k) = (1/k)

k∑

i=1

ln UHi−ln UHk+1,UHi = X(n−i)γH(n, i)

Pickands’s estimator

γP(n, k) = 1/ ln 2 ln(X(k) − X(2k)

)/(X(2k) − X(4k)

)

No recursiveness!Natalia Markovich Analysis methods of heavy-tailed data

logo

Group estimator, Davydov, Paulauskas, Rackauskas,(2000):

The sample X n is divided into l groups V1, ...,Vl , each groupcontaining m r.v.s, i.e. n = l · m.

Estimator of the function of the tail index:

zl = (1/l)∑l

i=1 kli = αα+1 = 1

1+γ ,

where

kli = M(2)li /M(1)

li , M(1)li = max{Xj : Xj ∈ Vi}

and M(2)li is the second largest element in the same group Vi .


logo

Group estimator, Davydov, Paulauskas, Rackauskas,(2000).Theoretical background.

1 For distributions with regularly varying tails

1 − F (x) = x−αℓ(x),

and l = m = [√

n], it is proved

zl →a.s. α

α+ 1=

11 + γ

.

2 For distributions

1 − F (x) = C1x−α + C2x−β + o(x−β),

with β = 2α it holds

l(l−1l∑

i=1

kli−α(1+α)−1)

(l∑

i=1

k2li − l−1

l∑

i=1

kli

)−1/2

→p N(0,1).


logo

Recursiveness of the estimate zl . On-line estimation.

Having the next group of observations Vl+1 it follows

γl+1 =

(l

l + 1· 1

1 + γl+

kl+1l+1

l + 1

)−1

− 1

After getting i additional groups with m elements eachVl+1, ...,Vl+i

γl+i = (l + i)(

l1 + γl

+ kl+1l+1+ ...+ kl+il+i

)−1

− 1,

i.e.

γl+i is obtained using γl by O(1) calculations.


logo

Recursiveness of the estimate zl . On-line estimation.

Sincezl+i = 1/(1 + γl+i)

it holds

zl+i =

lzl +

i∑

j=1

kl+jl+j

/(l + i),

bias(zl+i) = bias(zl), var(zl+i) = var(zl)l/(l + i),

var(zl+i) < var(zl) for ∀i > 0


logo

The selection of the number of the largest orderstatistics in Hill estimator

A Hill plot {(k , γH(n, k)),1 ≤ k ≤ n − 1}):the estimate of γH(n, k) is chosen from an interval in whichthis function demonstrate stability.Plot of the mean excess function{(u,e(u)) : X(1) < u < X(n)}, where

e(u) =n∑

i=1

(Xi − u)1{Xi > u}/n∑

i=1

1{Xi > u}

is the sample mean excess function over the threshold u.If this plot follows a reasonably straight line above a certainvalue of u, then this indicates that excesses over u followeP(u) = (1 + γu)/(1 − γ) of generalized Pareto distributionwith positive shape parameter. One can take the nearestorder statistics to u as the estimate of k .


logo

The sensitivity of the Hill estimate to k .

0 500 10004

2

0

2

4

6

8

10

12

14

16

k

gam

ma(k

)

0 500 10004

2

0

2

4

6

8

10

k

gam

ma(k

)

0 500 10002

1.5

1

0.5

0

0.5

1

1.5

k

gam

ma(k

)

The Hill’s estimate against k for 15 samples of Weibull(left), Pareto (middle) and Frech et (right) distributions, allwith shape parameter α = 1/γ = 0.5. Sample size isn = 1000.


logo

Plot of the mean excess function.

Left: Weibull distribution. Right: Pareto distribution. Theshape parameter is 0.5, sample size 1000.


logo

Bootstrap method for k selection.

Minimizing of the empirical bootstrap estimate of the meansquared error of γ:

MSE(γ) = E (γ − γ)2 → mink.

The bootstrap estimate is obtained by drawing B sampleswith replacement from the original data set X n.

Smaller re-samples {X ∗1 , ...,X∗n1} of the size

n1 = nβ , 0 < β < 1,are used.

The corresponding smaller k1 and an optimal k are relatedby:

k = k1(n/n1)α, 0 < α < 1,

where β = 1/2 and α = 2/3, P.Hall, (1990) .


logo

Empirical bootstrap estimate of the MSE(γ) is

MSE∗(n1, k1) =(

b∗(n1, k1))2

+ var∗(n1, k1) → mink1

,

where

b∗(n1, k1) =1B

B∑

b=1

γ∗b(n1, k1) − γ(n, k),

var∗(n1, k1) =1

B − 1

B∑

b=1

(γ∗b(n1, k1) −

1B

B∑

b=1

γ∗b(n1, k1)

)2

are the empirical bootstrap estimates of the bias and thevariance,

γ∗b is the Hill’s estimate constructed by some re-sample of thesize n1 with the parameter k1.


logo

Double bootstrap method for k selection.

Danielsson, de Haan, Peng and de Vries, (1997) requiresless parameters than bootstrap method, Hall, (1990):

n1 and B are required, α is not required.

Auxiliary statistic:

MSE(zn,k ) = E(

zn,k − z⋆n,k

)2,

wherezn,k = Mn,k − 2(γH(n, k))2,

Mn,k =1k

k∑

j=1

(log X(n−j+1) − log X(n−k)

)2,

z⋆n,k is a bootstrap estimate of zn,k .


logo

Double bootstrap method for k selection.

Mn,k/2γH(n, k) and γH(n, k) are consistent estimates for γ,

=⇒ zn,k → 0 as n → ∞.

=⇒ AMSE(zn,k ) = E(zn,k )2 → mink.


logo

Double bootstrap procedure is

Draw B bootstrap sub-samples of the size n1 ∈ (√

n,n)(e.g., n1 ∼ n3/4) from the original sample and determinethe value k⋆

n1that minimizes MSE of zn1,k .

Repeat this for B sub-samples of the size n2 = [n21/n] ([x ]

is the integer part of the number) and determine the valuek⋆

n2that minimizes MSE of zn2,k .

Calculate koptn from the formula

koptn = [

(k⋆n1

)2

k⋆n2

(1 − 1

ρ1

) 22ρ1−1

], ρ1 =log k⋆

n1

2 log(k⋆n1/n1)

,

and estimate γ by the Hill’s estimate with koptn .

The method is robust with respect to the choice of n1, Gomes,Oliveira, (2000).


logo

Sequential procedure for k selection

is based on the theoretical result:

√i(γH(n, i) − γ) ∼ (log log n)1/2, 2 ≤ i ≤ k

in probability, Drees & Kaufmann, (1998) .


logo

Sequential procedure for k selection. Algorithm.

An initial estimate γ0 = γH(n,2√

n) for the parameter γ isobtained by the Hill’s estimate.For rn = 2.5γ0n0.25 we compute

k(rn) = min{k ∈ 1, . . . ,n − 1| max2≤i≤k

√i(γH(n, i)−γH(n, k)) > rn}.

If rn is too large and max2≤i≤k√

i(γH(n, i) − γH(n, k)) > rn

is not satisfied it is recommended repeatedly replace rn by0.9rn until k(rn) i well defined.Similarly, k(r ε

n) for ε = 0.7 is computed.Optimal k

kopt =13

(k(r ε

n)

(k(rn))ε

)1/(1−ε)

(2γ0)1/3

is calculated and γ is estimated by γH(n, kopt).

The method is sensitive to the choice of rn.Natalia Markovich Analysis methods of heavy-tailed data

logo

Plot method for m selection in the group estimator.

The plot {(m,1/zm − 1)} for Pareto distribution with γ = 1,the true γ is shown by dotted line. Sample sizes aren = {150,500,1000}.


logo

Plot method for m selection in the group estimator.

Plot: {(m, zm),m0 < m < M0}, m0 > 2, M0 < n/2m = n/l , zm = (m/n)

∑[n/m]i=1 k(n/m)i

From consistency result zl →a.s. 11+γ it follows that

there must be an interval [m−,m+]

such that zm ≈ α/(1 + α) = (1 + γ)−1 for all m ∈ [m−,m+].

We suggest choosing the average value

z = mean{1/zm − 1 : m ∈ [m−,m+]}

and m∗ ∈ [m−,m+] as a point such that zm∗ = z.


logo

Bootstrap method for m automatical selection.

Minimizing of the empirical bootstrap estimate of the meansquared error of (1 + γl)

−1, l = n/m:

MSE(γl) = E

(1l

l∑

i=1

kli −1

1 + γ

)2

→ minm.

The bootstrap estimate is obtained by drawing B sampleswith replacement from the original data set X n.

Smaller re-samples {X ∗1 , ...,X∗n1} of the size

n1 = nd , 0 < d < 1,are used.

The re-sample is divided into l1 subgroups.

The size of subgroups m1 and m are related by:m = m1(n/n1)

c , 0 < c < 1,where m1 is the size of subgroups in re-samples.


logo

Empirical bootstrap estimate of the MSE

MSE∗(l1,m1) =(

b∗(l1,m1))2

+ var∗(l1,m1) → minm1,

where

b∗(l1,m1) =1B

B∑

b=1

zbl1 − zl ,

var∗(l1,m1) =1

B − 1

B∑

b=1

(zb

l1 −1B

B∑

b=1

zbl1

)2

are the empirical bootstrap estimates of the bias and thevariance,zb

l1= 1

l1

∑l1i=1 kl1 i

is the group estimator constructed by somere-sample.

How to select c and d?


logo

Simulation study: the selection of c and d .

Asymptotic theory (P.Hall, (1990)) recommends

d = 1/2 and c = 2/3

for the bootstrap estimation of the parameter k of the Hill’sestimate of γ.

We check c = {0.05,0.1(0.1); 0.5} for a fixed d = 0.5.

Relative bias and the square root of the mean squared error:

Biasγ =1γ

1

NR

NR∑

i=1

γi − γ

,

RMSEγ =1γ

√√√√ 1NR

NR∑

i=1

(γi − γ)2


logo

Conclusions:

the best values of c for the fixed d = 0.5 arec = {0.3 ÷ 0.5};

the bias of the group estimator is larger for Weibulldistribution.

Further research:

proof of theoretically best values of c and d for the groupestimator using the bootstrap.


logo

Rough methods for the detection of heavy tails and thenumber of finite moments.

1. Ratio of the maximum to the sumLet X1,X2, ...,Xn be i.i.d. r.v.s. We define the followingstatistic

Rn(p) =Mn(p)

Sn(p), n ≥ 1, p > 0, where

Sn(p) = |X1|p + . . . |Xn|p,Mn(p) = max

(|X1|p, . . . , |Xn|p

), n ≥ 1

to check the moment conditions of the data.

For different values of p the plot of n → Rn(p) gives

a preliminary information about the distribution P{|X | > x}.Then E|X |p <∞ follows, if Rn(p) is small for large n. For largen a significant difference between Rn(p) and zero indicates thatthe moment E|X |p is infinite.


logo


2. QQ-plot A QQ-plot (or ”quantiles against quantiles”-plot)draws the dependence

{(

X(k),F←(

n−k+1n+1

)): k = 1, ...,n}, where

X(1) ≥ ... ≥ X(n) are the order statistics of the sample, andF← is an inverse function of the DF F .

Often, the QQ-plot is built as a dependence of exponentialquantiles

against the order statistics of the underlying sample. Then F←

is an inverse function of the exponential DF. Then a linearQQ-plot corresponds to the exponential distribution.


logo


3. Plot of the mean excess function

en(u) =

n∑

i=1

(Xi − u)1{Xi > u}/n∑

i=1

1{Xi > u}

1 Heavy-tailed distributions: e(u) → ∞, u → ∞.2 A Pareto distribution: a linear e(u).3 An exponential distribution: the constant e(u) = 1/λ.4 Light-tailed distributions: e(u) → 0, u → ∞.


logo


Plots of function e(u) forseveral distributions.


logo


4. Hill’s and other estimators of the tail index

The Hill estimate works bad if1 the underlying DF does not have a regularly varying tail,2 the tail index α = 1/γ is not positive,3 the sample size is not large enough,4 the tail is not heavy enough, i.e. γ is not big,5 F ⊆ Rα, since it strongly depends on the slowly varying

function ℓ(x) that is usually unknown.

The disadvantages of Hill’s estimate show

that one has to apply several estimates of the tail index to dealwith the complex analysis of data.


logo

Definition of the independence.

Definition

Random variables ξ1, ξ2, ..., ξn are independent if for anyx1, x2, ..., xn it holds

P{ξ1 ≤ x1, ξ2 ≤ x2 , ..., ξn ≤ xn}= P{ξ1 ≤ x1}P{ξ2 ≤ x2}...P{ξn ≤ xn}

or in terms of cumulative distribution functions

F (x1, x2, ..., xn) = F1(x1)F2(x2)...Fn(xn),

where Fk (xk ) is a distribution function of the r.v. ξk .


logo

Rough methods for the dependence detection.

Correlation between two r.v.s X1 and X2 is defined by

ρ(X1,X2) =cov(X1,X2)√

var(X1)var(X2),

wherecov(X1,X2) = E(X1X2) − E(X1)E(X2)

is a covariance, and var(x) is a variance.

1 If X1 and X2 are independent then ρ(X1,X2) = 0.2 If ρ(X1,X2) = 0 then not necessary X1 and X2 are independent!3 If Gaussian r.v.s X1 and X2 are not correlated then they are

independent.4 For non-Gaussian r.v.s it may be not true.


logo


Generally, the covariances and correlations cannot indicate thedependence.

1 They describe the degree to which two r.v.s. are linearlydependent:

ρ(X1,X2) ∈ [−1,1].

2 ρ(X1,X2) = ±1 ⇔ X1 and X2 are perfectly linearly dependent,i.e. X2 = α+ βX1, for some α ∈ R and β 6= 0.

What tool could be an appropriate to indicate thedependence?


logo

Exact methods for the dependence detection.

The mixing conditions of a stationary sequence should beconsidered for dependence detection.

Definition

(Rosenblatt (1956b)) The strictly stationary ergodic sequenceof random vectors Xt is strongly mixing with rate function φk forσ-field, if

supA∈σ(Xt ,t≤0),B∈σ(Xt ,t>k)

|P (A ∩ B)−P (A) P (B) | = φk → 0, k → ∞.

The rate φk shows how fast the dependence between the pastand the future decreases.


logo

Exact methods for the dependence detection.

Another measure of dependence is given by the dependenceindex:

βn = supx ,y

n∑

j=1

|fj(x , y) − f (x)f (y)|,

where f (x) is a marginal probability density function (PDF) of astationary sequence {Xj , j = 1,2, ...},fj(x , y) is a joint PDF of X1 and X1+j , j = 1,2, ....

1 for i.i.d. sequences βn = 0 for all n,2 for sequences with strong long range dependence βn may tend

to infinity, and3 in between βn may converge to a finite limit at various rates.

It is difficult to estimate such dependence measures bystatistical tools.


logo


The autocorrelation function ( ACF ) is

ρX (h) = ρ(Xt ,Xt+h) = E ((Xt − E(Xt))(Xt+h − E(Xt+h))) /Var (Xt)

Let {Xt , t = 0,±1,±2, ...} be a stationary sample series. Thestandard sample ACF at lag h ∈ Z is

ρn,X (h) =

∑n−ht=1 (Xt − X n)(Xt+h − X n)∑n

t=1(Xt − X n)2,

where X n = 1n

∑nt=1 Xt represents the sample mean.

The accuracy of ρn,X (h) may be poor if the sample size n issmall or h is large. The relevance of ρn,X (h) is determined by itsrate of convergence to the real ACF . When the distribution ofthe Xt ’s is very heavy-tailed (in the sense that EX 4

t = ∞), thisrate can be extremely slow.


logo

The autocorrelation function for heavy-tailed data.

For heavy-tailed data with infinite variance it is better to use themodified sample ACF :

ρn,X (h) =

∑n−ht=1 XtXt+h∑n

t=1 X 2t

,

i.e. without X n.

However, this estimate may behave in a very unpredictable wayif one uses the class of non-linear processes in the sense thatthis sample ACF may converge in distribution to anon-degenerate r.v. depending on h.For linear models it converges in distribution to a constantdepending on h, Davis and Resnick (1985).


logo

Confidence intervals of the sample ACF: linearprocesses.

The causal ARMA process (autoregressive moving average)

Xt has the representation

Xt =∞∑

j=0

ψjZt−j , t = 0,±1, ..., (1)

where Zt is an i.i.d. noise sequence and {ψj} is a sequence ofreal numbers providing the convergence of random series in(1), i.e.,

∑∞j=0 |ψj | <∞, Brockwell and Davis (1991).

The model ARMA(p,q) has the formXt =

∑pj=0 θjXt−j +

∑qj=0 ψjZt−j , t = 1, ...,n.

The model MA(q) has the form Xt =∑q

j=0 ψjZt−j .


logo

Confidence intervals of the sample ACF: linearprocesses.

Bartlett’s formula.

If the process is linear (1),∑∞

j=−∞ |ψj | <∞ and EZ 4t <∞, then

standard sample ACF ρn,X (i) has the asymptotic joint normaldistribution with mean ρX (i) (i.e. ACF) and variancevar

(ρn,X (i)

)= cii/n,

cii =∞∑

k=−∞

[ρ2X (k + i) + ρX (k − i)ρX (k + i) + 2ρ2

X (i)ρ2X (k)

− 4ρX (i)ρX (k)ρX (k + i)], (2)

as n → ∞.The Bartlett’s formula (2) allows to check the hypothesisρn,X (i) = 0.


logo

Confidence intervals of the sample ACF.

Example

For i.i.d. white noise Zt we have ρZ (0) = 1 and ρZ (i) = 0 fori 6= 0 (since Zt and Zt+i are independent) andvar

(ρn,Z (i)

)= 1/n by (2).

For ARMA process driven by such a white noise Zt

1 the sample ACF is approximately normally distributed withmean 0 and variance 1/n for sufficiently large n.

2 It provides 95% confidence interval with the bounds ±1.96/√

nfor the sample ACF.

3 The hypothesis ρn,X (i) = 0 is accepted if ρn,X (i) falls within thisinterval.

What would happened if1 the noise Zt is not normal, and (or)2 the process is not linear?


logo

Confidence intervals of the heavy-tailed sample ACFρn,X (h): linear processes.

ARMA process with i.i.d. regularly varying noise and tail index0 < α < 2 (infinite variance).

ρn,X (h) estimates the quantity∑

j ψjψj+h/∑

j ψ2j that represents

the autocorrelation cov(X0,Xh) in the case of a finite variance.

Illusion: heavy-tailed sample ACF ρn,X (h) can be applied toheavy-tailed processes without problem.

Recommendation (Resnick (2006), p.349):

For α < 1 use ρn,X (h);

for 1 < α < 2 the classical sample ACF ρn,X (h).

The calculation of confidence intervals in both cases is noteasy.


logo

Confidence intervals of the heavy-tailed sample ACFρn,X (h): nonlinear processes.

GARCH(p,q) process (Mikosch (2002)):

Xt = µ+ σtZt , t ∈ Z ,

σ2t = α0 +

p∑

i=1

αiX2t−i +

q∑

j=1

βjσ2t−j ,

(Zt) is an i.i.d. sequence,

σt and Zt are independent for a fixed t ,

the mean µ is estimated from the data, particularly µ = 0,

αi and βj are non-negative constants.

If the marginal distribution of the time series is veryheavy-tailed, namely, the fourth moment is infinite, then theasymptotic normal confidence bounds for the sample ACF arenot applicable anymore.


logo

Testing of long-range dependence (LRD).

Definition

A stationary process (Xt) is long range dependent if∑∞h=0 |ρX (h)| = ∞, where ρX (h) is the ACF at lag h ∈ Z , and

short range dependent otherwise.

This property implies that even though ρX (h)’s areindividually small for large lags, their cumulative effect isimportant.

To detect LRD by statistical procedures one replaces ρX (h)by the sample ACF ρn,X (h).

The LRD effect is typical for long time series, e.g., severalthousand points. One can look at lags 250,300,350, etc.

Usual short range dependent data sets would show asample ACF dying after only a few lags and then persistingwithin the 95% Gaussian confidence window ±1.96/

√n.


logo

Testing of long-range dependence: Hurst parameter.

One can assume that for some constant cρ > 0

ρX (h) ∼ cρh2(H−1) for large h and H ∈ (0.5,1),

(LRD case).

The constant H ∈ (0.5,1) is called the Hurst parameter.

The closer H is to 1 the slower is the rate of ρX (h) to zero ash → ∞, i.e., the longer is the range of dependence in the timeseries.

Kettani and Gubner (2002):

Hn = 0.5(1 + log2(1 + ρn,X (1))

).


logo

Dependence detection by bivariate data

Measures

Rank coefficient Kendall’s τ can be estimated by the sample:

ρτ =2Sτ

n(n − 1), where Sτ =

n∑

i=1

n∑

j=i+1

sign(rj − ri

),

where ri is the order number of the individuum by the secondproperty with the number i by the first property.

Rank coefficient Spearman’s ρ can be estimated by the sample:

ρS = 1 − 6Sρ

n3 − n, where Sρ =

n∑

i=1

(ri − i)2

Pickands dependence function A(t).

ρτ and ρS can be represented by means of A(t).Natalia Markovich Analysis methods of heavy-tailed data

logo

Pickands dependence function

Let (X1,Y1), ..., (Xn,Yn) be a bivariate i.i.d. random sample

with a bivariate extreme value distribution G(x , y).

Example: X1 is a TCP-flow file size, Y1 is a duration of itstransmission.

Similarly to univariate case it implies, that there existnormalizing constants aj,n > 0 and bj,n ∈ R, j = 1,2 such thatas n → ∞,

P{(M1,n − b1,n

)/a1,n ≤ x ,

(M2,n − b2,n

)/a2,n ≤ y} (3)

= F n (a1,nx + b1,n,a2,ny + b2,n)→ G(x , y),

where M1,n = max (X1, ...,Xn), M2,n = max (Y1, ...,Yn) are thecomponent-wise maxima.The vector (M1,n,M2,n) will in general not be presence in theoriginal data.


logo

Pickands dependence function.

Definition

Let F1(x) and F2(y) be the DFs of X and Y .

Let Gj(x), j = 1,2 be a univariate extreme value DF and Fj(x)is in its domain of attraction.

G(x , y) may be determined by margins G1(x) and G2(y) by therepresentation

G(x , y) = exp(

log (G1(x)G2(y)) A(

log (G2(y))

log (G1(x)G2(y))

)),

where A(t), t ∈ [0,1], is the Pickands dependence function.


logo

Pickands dependence function: properties.

In the bivariate case the function A(t) satisfies two properties:

1 (1 − t) ∨ t ≤ A(t) ≤ 1, t ∈ [0,1], i.e. A(0) = A(1) = 1 andlies inside the triangle determined by points (0,1), (1,1)and (0.5,0.5);

2 A(t) is convex.

Case A(t) ≡ 1 corresponds to a total independence

and A(t) = (1 − t) ∨ t corresponds to a total dependence.


logo

A-estimators

Caperaa et al. (1997):

AHTn (t) =

((1/n)

n∑

i=1

min

(ξi/ξn

1 − t,ηi/ηn

t

))−1

,

Hall and Tajvidi (2000):

log ACn (t) =

1n

n∑

i=1

log max(

t ξi , (1 − t)ηi

)

−t1n

n∑

i=1

log ξi − (1 − t)1n

n∑

i=1

log ηi .

Here ξi = − log G1(Xi) and ηi = − log G2(Yi), i = 1, ...,n,ξn = n−1∑n

i=1 ξi , ηn = n−1∑ni=1 ηi .


logo

Problems of A-estimators

1 The estimators are not convex. They may be improved bytaking a convex hull.

2 The margins G1(x) and G2(x) are unknown. One has toreplace them by their estimates G1(x) and G2(x), e.g., byempirical DFs constructed by component-wise maximaover blocks of data. The amount of these maxima may bevery moderate that may reflect on the accuracy of anempirical DF.

3 The component-wise maxima may be not observabletogether, i.e. there are such pairs of maxima which do notpresence in the sample.


logo

Bivariate quantile curve

Both estimates of A(t) allow to get the estimate G(x , y) andconstruct bivariate quantile curves

Q(G,p) = {(x , y) : G(x , y) = p}, 0 < p < 1, (4)

G(x , y) ≈ exp

log

(G1(x)G2(y)

)A

log(

G2(y))

log(

G1(x)G2(y))

,

assuming that G1(x) = p(1−w)/A(w) and G2(x) = pw/A(w) forsome w ∈ [0,1] in order to get G(x , y) = p, Beirlant et al.(2004).The quantile curve consists of the points

Q(G,p) = {(

G−11 (p(1−w)/A(w)), G−1

2 (pw/A(w)))

: w ∈ [0,1]}.


logo

Web-traffic analysis

Characteristics of sub-sessions:the size of a sub-session (s.s.s);

the duration of a sub-session (d.s.s.).

Characteristics of the transferred Web-pages:

the size of the response (s.r.);

the inter-response time (i.r.t.).


logo

Description of the Web data

Main statistical characteristics of the Web-data.

s.s.s.(B) d.s.s.(sec) s.r.(B) i.r.t.(sec)Sample 373 373 7107 7107SizeMini 128 2 0 6.543 · 10−3

mumMaxi 5.884 · 107 9.058 · 104 2.052 · 107 5.676 · 104

mumMean 1.283 · 106 1.728 · 103 5.395 · 104 80.908StDev 4.079 · 106 5.206 · 103 4.931 · 105 728.266


logo

Results of the Web traffic analysis.

Estimation of the tail index for Web-traffic characteristics.

r.v. c mb γbl γp

l γHn,k

s.s.s. 0.3 8 1.179 0.877 0.960.4 10 0.8560.5 22 0.902

s.r. 0.3 72 0.75 0.763 0.7630.4 71 0.870.5 92 0.85

i.r.t. 0.3 42 0.69 0.495 0.380.4 65 0.6250.5 156 0.611

d.s.s. 0.3 10 0.658 0.739 0.50.4 13 0.5390.5 18 0.683


logo


Notations to the previous table:

γbl is the group estimator with the bootstrap selected

parameter m (mb);

γpl is the group estimator with the plot selected parameter

m;

γHn,k is the Hill’s estimator with the plot selected parameter

k .


logo

The EVI estimation by the Hill’s estimator and the groupestimator γl for the data sets size of sub-sessions (left) andduration of sub-sessions (right).


logo

The EVI estimation by the Hill’s estimator and the groupestimator γl for the data sets inter-response times (left) andsize of responses (right).


logo

Conclusions from analysis of the tail index:

1 the distributions of considered Web-traffic characteristicsare heavy-tailed;

2 at least βth moments, β ≥ 2 of the distribution of the s.s.s.,s.r., d.s.s. are not finite;

3 the distribution of i.r.t. has two finite moments;

4 it might be possible for s.s.s. (when 1 < γ < 2) that α < 1and the expectation could be also not finite.


logo


Analysis of values Rn(p).

the values Rn(p) are dramatically large for large n andp ≥ 2, e.g., in the case of the duration of sub-sessions andn = 350 ln Rn(p) ≈ 10 for p = 2 and ln Rn(p) ≈ 103 forp = 3.

Hence, one may conclude that all moments of theconsidered r.v.s apart from the first one are not finite.


logo

n → ln Rn(p) of the duration of sub-sessions (left) and thesize of sub-sessions (right) for a variety of p-values.


logo

n → ln Rn(p) of the inter-response times (left) and the sizeof responses (right) for a variety of p-values.


logo


Analysis of the mean excess function.

the plots u → en(u) tend to infinity for large u implyingheavy tails.

These plots are close to a linear shape for all sets of data.The latter implies that the considered distributions can bemodelled by a DF of a Pareto type.


logo

Exceedance en(u) against the threshold u for the durationof sub-sessions (left) and the size of sub-sessions (right).


logo

Exceedance en(u) against the threshold u for theinter-response times (left) and the size of responses (right).


logo


Analysis of QQ-plots.

Two versions of QQ-plots are shown.

Right top plots show, that the distribution of the d.s.s. isclose to lognormal and Pareto distributions; distributions ofs.s.s., i.r.t. and s.r. are close to Pareto.

Left top plots show that the exponential distribution cannotbe used as an appropriate model for these r.v.s.;distributions of d.s.s., s.s.s., i.r.t. and s.r. are close to aGeneralized Pareto distribution with the DF

Ψσ,γ(x) =

{1 − (1 + γx/σ)−1/γ , γ 6= 0,1 − exp (−x/σ) , γ = 0,

with different values of the parameters γ and σ.

The QQ-plot does not give a unique model to fit theunderlying distribution.


logo

QQ-plots for the duration of sub-sessions.

Left: exponential quantiles (top) and quantiles of theGPD(0.3;1) (bottom) against d.s.s./s (the linear curvescorrespond to the QQ-plot of the same distributions).Right: Empirical DFs of the r.v. Ui = F (Xi). Exponential,Pareto, Weibull, lognormal and normal distributions are used asmodels of F . The linear curve corresponds to the exponentialdistribution.


logo

QQ-plot for the size of sub-sessions.

Left: exponential quantiles (top) and quantiles of theGPD(1;0.015) (bottom, first plot) and the GPD(0.05;0.3)(bottom, second plot) against s.s.s./s (the linear curvescorrespond to the QQ-plot of the same distributions).Right: Empirical DFs of the r.v. Ui = F (Xi). Exponential,Pareto, Weibull, lognormal and normal distributions are used asthe models of F . The linear curve corresponds to theexponential distribution.


logo

QQ-plots for the inter-response times.

Left: exponential quantiles (top) and quantiles of theGPD(0.8;0.015) (bottom) against i.r.t./s (the linear curvescorrespond to the QQ-plot of the same distributions).Right: Empirical DFs of the r.v. Ui = F (Xi). Exponential,Pareto, Weibull, lognormal and normal distributions are used asmodels of F . The linear curve corresponds to the exponentialdistribution.


logo

QQ-plot for the size of responses.

Left: exponential quantiles (top) and quantiles of theGPD(1;0.015) (bottom) against s.r./s (the linear curvescorrespond to the QQ-plot of the same distributions).Right: Empirical DFs of the r.v. Ui = F (Xi). Exponential,Pareto, Weibull, lognormal and normal distributions are used asthe models of F . The linear curve corresponds to theexponential distribution.


logo


ACF analysis of Web-data.

The previous analysis shows that

the considered Web data are heavy-tailed with infinitevariance. Therefore, the application of formula

ρn,X (h) =

∑n−ht=1 XtXt+h∑n

t=1 X 2t

is relevant.


logo

Summary results of the preliminary analysis

Comparison of the recommended methods for Web traffic data

Amount of finite moments Type of distributionr.v. Rn(p) Hill & Group QQ-plot en(u)

estimators.s.s. 1 1 GPD(0.015; 1), Pareto(B) GPD(0.05; 0.3) -liked.s.s. 1 1 GPD(1; 0.3), Pareto(sec) lognormal -likes.r. 1 1 GPD(0.015; 1) Pareto(B) -likei.r.t. 1 2 GPD(0.015; 0.8) Pareto(sec) -like


logo

Testing of dependence

ACF estimation by the modified sample ACF and the standardsample ACF for the data sets s.s.s. (first two plots left), d.s.s.(last two plots right).

The dotted horizontal lines indicate 95% asymptotic confidencebounds (±1.96/

√n) corresponding to the ACF of i.i.d.

Gaussian r.v.s.


logo


ACF estimation by the modified sample ACF and the standardsample ACF for the data sets i.r.t. (first two plots left), s.r. (lasttwo plots right).

The dotted horizontal lines indicate 95% asymptotic confidencebounds (±1.96/

√n) corresponding to the ACF of i.i.d.

Gaussian r.v.s.


logo

Hurst parameter estimation for Web traffic data

Table: Hurst parameter estimation for Web traffic.

Data s.s.s. d .s.s. i .r .t . s.r .Hn 0.493 0.488 0.508 0.507

Main conclusion:

all data sets are heavy-tailed and not long-range dependent.


logo

TCP-flow analysis

We observe

TCP-flow sizes S and

durations D

gathered from one source destination pair.

Motivation is to estimate

the distribution of the maximal rate (or throughput)R = S/D and

the expected throughput R (or S/D)

that the transport system provides.


logo

Description of the TCP-flow data

The analyzed data consist of

TCP-flow sizes and durations of transmissions have beenmeasured from the mobile network of the Finnish operatorElisa;

mobile TCP connections from periods of low, average and highnetwork load conditions;

TCP flows on port 80 (a WWW (HTTP) application).

The number of analyzed flows is 610 000 and, for practicalreasons, we consider 61 disjoint bivariate samples, each of sizen = 10 000.


logo

Description of the TCP-flow data

Statis Unit Definition Sample Mean Sample Variance

tic Min Max Min MaxSize kB Content 9.0 20.3 1303 204553

Transmitted 9.5 20.1 1357 206658

Dura sec SYN-FIN 18.2 30.4 2219 52125tion

The results, [min,max] ranges over all 61 samples.

’Content’ refers to the size of the downloaded web content and’Transmitted’ means Content plus segments retransmitted byTCP. Both are measures of the size of a flow. ’SYN-FIN’ meansfrom the three-way handshaking (synchronization) to finish.


logo

Results of the TCP-flow data analysis

Table: Estimation of the EVI for flow sizes (“Content” and“Transmitted”) and durations (“SYN-FIN”)

γH (n, k) γl γM(n, k) γUH(n, k)Min Max Min Max Min Max Min Max

Content 0.59 0.87 0.45 0.98 0.51 0.98 0.52 0.99

Transmit 0.58 1.15 0.45 0.94 0.52 0.96 0.53 0.97ted

SYN-FIN 0.52 1.00 0.38 0.77 0.36 0.86 0.37 0.82


logo

Conclusions from analysis of the tail index forTCP-flow data:

1 the distributions of TCP-flow size and duration areheavy-tailed;

2 all estimators apart of the group estimator γl indicate thatthe flow sizes samples (both content and transmitted) mayhave infinite variance under the assumption that theirdistributions are regularly varying;

3 some samples of flow durations may have two finite firstmoments.


logo

Results of the TCP-flow data analysis

Table: Comparison of the “rough” methods for TCP-flow data

Amount of first finite Type of distributionmoments

Rn(p) Estimators of γ QQ-plot en(u)

Content 1 2 or 1 GPD(1,1.3) Pareto-like

Transmit 1 2 or < 1 GPD(1,1.3) Pareto-ted likeSYN-FIN < 1 2 or < 1 GPD(1,0.85) Pareto-

like


logo


ACF estimation by the modified sample ACF and the standardsample ACF of one sub-sample (n = 1000) of the TCP-flowsizes (first two plots left), and durations (last two plots right).

The horizontal lines indicate 95% asymptotic confidencebounds (±1.96/

√n).

Conclusions:

The TCP-flow sizes may be independent.

The ACFs of the TCP-flow durations have three clusters thatmay indicate the dependence.


logo

Testing of long-range dependence

Table: Hurst parameter estimation for Web traffic and TCP-flow data

Data TCP − flow size TCP − flow durationHn 0.498 0.506

Main conclusion:

TCP-flow size and duration data sets are heavy-tailed and notlong-range dependent.


logo

Problems of throughput investigation

The distributions of both S and D are heavy-tailed andtheir expectations may not be finite. Thus, ES/D may benot computable.

Since S and D are dependent and positive, then the DF ofthe ratio R = S/D is defined by

FR(x) = P{S/D ≤ x} =

∫ ∞

0

∫ zx

0f (y , z)dydz

=

∫ ∞

0

∫ zx

0dF (y , z),

f (y , z) is a joint PDF of S and D, and its expectation by

ER =

∫ ∞

0xdFR(x),

if the latter integral converges.


logo

Bivariate analysis of TCP-flow data

First, we check the dependencebetween the pairs

(S1,D1), ..., (Sn,Dn) to apply(3). For this purpose, we cancalculate the ACF of the r.v.sri =

√S2

i + D2i , i = 1, ...,n.

The sample ACF of {ri} is smallin absolute value at all lags(possible exception are two lagsthat do not persist within 95%confidence interval). One maysuppose that the sizes-durationpairs are independent.


logo

Bivariate analysis of TCP-flow data

0 10 20 30 40

Block Maximaof Size (MB)

0

1

2

3

4

Blo

ckM

axim

aof

Dur

atio

n(h

ours

)

Scatter plot of pairs of blockmaxima (M j

S,m,MjD,m),

j = 1, . . . ,610, when the blocksize is m = 1 000. The pairsthat are presented in the initialsample are marked by dots asfar as the pairs that are notpresented (e.g., the maximalsize in the group does notnecessarily correspond to themaximal duration in this group)by circles. Lines D = S/384and D = S/42 indicate 384 kb/s(EDGE) and 42 kb/s (GPRS)access rates.


logo

Testing of dependence of the block maxima

0 10 20 30 40 50 60

Lag

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

AC

F

0 10 20 30 40 50 60

Lag

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

AC

F

Estimates of standard sample ACF of the both maxima samples

of size 61 corresponding to TCP-flow sizes (left), and durations(right). The dotted horizontal lines indicate 95% asymptoticconfidence bounds ±1.96/

√n.


logo

Testing of dependence of the block maxima

0 100 200 300 400 500 600

Lag

-0.2

-0.1

0

0.1

0.2

AC

F

0 100 200 300 400 500 600

Lag

-0.2

-0.1

0

0.1

0.2

AC

F

Estimates of standard sample ACF of the both maxima samples

of size 610 corresponding to TCP-flow sizes (left), anddurations (right). The dotted horizontal lines indicate 95%asymptotic confidence bounds ±1.96/

√n.


logo

Testing of distribution of the block maxima

The Generalized Extreme Value (GEV) distribution

Hγ(x) =

{exp(−(1 + γ

( x−µσ

))−1/γ), γ 6= 0

exp(−e−(x−µ

σ)), γ = 0.

is applied as a model of the block maxima distribution.

Maximum likelihood estimates of GEV parameters by blockmaxima of size 610 of TCP-flow data

Statistic Definition γ µ σ

Size Content 0.332259 7075.92 4605.53Duration SYN-FIN 0.10263 3775.8 2433.27


logo

Testing of distribution of the block maxima

10000 20000 30000 40000

Sorted Sample

10000

20000

30000

40000

Qua

ntile

sof

GE

V

2000 4000 6000 8000 100001200014000

Sorted Sample

2500

5000

7500

10000

12500

15000

Qua

ntile

sof

GE

V

QQ-plots of block maxima samples

corresponding to TCP-flow sizes (left) and durations (right).


logo

Testing of dependence of the TCP-flow data

0 0.2 0.4 0.6 0.8 1

t

0.5

0.6

0.7

0.8

0.9

1A

(t)

0 0.2 0.4 0.6 0.8 1

t

0.5

0.6

0.7

0.8

0.9

1

A(t

)

The estimation of the Pickands dependence function

by estimators ACn (t) (dashed line) and AHT

n (t) (solid line). Themaxima sample of size 61 (left) and of size 610 (right). Themarginal distributions G1(x) and G2(x) of TCP-flow sizes anddurations are estimated by GEV.

Conclusions: TCP-flow size and duration are dependent.


logo

Bivariate quantile curves of the TCP-flow data

Using estimates of A(t) one can construct bivariate quantilecurves of TCP-flow data by (4).

0 10000 20000 30000 40000 50000

Flow Size

0

5000

10000

15000

20000

Flo

wD

urat

ion

p = 0.9

p = 0.95

p = 0.75

0 10000 20000 30000 40000 50000

Flow Size

0

5000

10000

15000

20000

Flo

wD

urat

ion

p = 0.9

p = 0.95

p = 0.75

p = 0.975

Estimated quantile curves of TCP-flow data

for p ∈ {0.75,0.9,0.95} corresponding to estimator ACn (t): the

maxima sample of size 61 (left), of size 610 (right).


logo

Conclusions from bivariate analysis of TCP-flow data

The analysis is made from samples of moderate size.

Size S and duration D are heavy-tailed with probablyinfinite second moment.

Their distributions are complicated in the sense that theydo not belong to any known parametric models.

Estimates of the Pickands dependence function show thatS and D are dependent.

Bivariate quantile curves show that the bivariate extremevalue distribution of (S,D) is ’not quite heavy-tailed’ in thesense that not many observations fall in the ”outliers area”,i.e. beyond the 97.5% quantile curve. This can be a specialproperty of this mobile TCP data.Bivariate quantile curves are sensitive to, at least,

1 estimation of parameters of margins of G(x , y) andestimates of A(t) and

2 the amount of component-wise maxima, or to the block size.Natalia Markovich Analysis methods of heavy-tailed data

logo

Notes

Considering the real data three items have to be investigated:

1 the preliminary detection of heavy tails;2 the dependence structure of univariate data;3 the dependence structure of multivariate data.


logo

Reference:

1. Markovich, N.M. and Krieger, U.R. (2005) StatisticalInspection and Analysis Techniques for Traffic Data Arisingfrom the Internet, HETNETs’04 special journal issue on”Convergent Multi-Service Networks and Next GenerationInternet: Performance Modelling and Evaluation”pp.72/1-72/9.

2. Beirlant, J., Goegebeur, Y., Teugels, J. and Segers, J.(2004) Statistics of Extremes: Theory and Applications.Wiley, Chichester, West Sussex.

3. Davis, R. and Resnick, S. (1985) Limit theory for movingaverages of random variables with regularly varying tailprobabilities. Ann. Probability 13, pp.179-195.

4. Embrechts, P., Kluppelberg, C. and Mikosch, T. (1997)Modeling Extremal Events. Springer, Berlin.

5. Brockwell P.J. and Davis R.A. (1991) Time series:Theory and Methods, 2nd edition. Springer, New York.


logo

Reference:

6. Kettani, H. and Gubner, J.A. (2002) A novel approach tothe estimation of the hurst parameter in self-similar traffic.In: Proceedings of IEEE conference on local computernetworks, Tampa, Florida.

7. Resnick, S.I. (2006) Heavy-Tail Phenomena. Probabilisticand Statistical Modeling. Springer, New York.


Documents

Analysis methods of heavy-tailed data - John Wiley & … Analysis methods of heavy-tailed data Natalia Markovich Institute of Control Sciences Russian Academy of Sciences, Moscow,