A Short Course on Quantile Regression
Xuming He (University of Michigan)
Huixia Judy Wang (North Carolina State University)
Course Outline:
1. Basics of quantile regression
2. Estimation and computation
3. Statistical properties
4. Inference: tests and confidence intervals
5. Bayesian quantile regression
6. Nonparametric quantile regression
1 Basics of Quantile Regression
1.1 Quantile Regression versus Mean Regression
Quantile. Let Y be a random variable with cumulative distribution
function (CDF) FY(y) = P(Y ≤ y). The τth quantile of Y is
Qτ (Y ) = inf{y : FY (y) ≥ τ},
where 0 < τ < 1 is the quantile level.
Special cases
• Q0.5(Y ): median, the second quartile
• Q0.25(Y ): the first quartile, 25th percentile
• Q0.75(Y ): the third quartile, 75th percentile
Note: Qτ(Y) is a nondecreasing function of τ, i.e.,
Qτ1(Y) ≤ Qτ2(Y) for τ1 < τ2.
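For example (a quick numerical check of ours in R):
set.seed(1)
quantile(rnorm(1000), probs = c(0.25, 0.50, 0.75))  # sample quartiles, nondecreasing in tau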
Conditional quantile. Suppose Y is the response variable, and X
is the p-dimensional predictor. Let FY (y|X = x) = P (Y ≤ y|X = x)
denote the conditional CDF of Y given X = x. Then the τth
conditional quantile of Y is defined as
Qτ (Y |X = x) = inf{y : FY (y|x) ≥ τ}.
Least squares linear (mean) regression model:
Y = XTβ + U, E(U) = 0.
Thus
E(Y |X = x) = xTβ,
where β measures the marginal change in the mean of Y due to a
marginal change in x.
Linear quantile regression model:
Qτ (Y |x) = xTβ(τ), 0 < τ < 1,
where β(τ) = (β1(τ), · · · , βp(τ))T is the vector of quantile
coefficients, which may depend on τ;
• the first element of x is one, corresponding to the intercept, so that
Qτ(Y|x) = β1(τ) + x2β2(τ) + · · · + xpβp(τ);
• β(τ) is the marginal change in the τth conditional quantile due to a
marginal change in x.
Note that Qτ (Y |x) is a nondecreasing function of τ for any given x.
Example: location-scale shift model
Yi = α + ZTi β + (1 + ZTi γ)εi,  εi i.i.d. ∼ F(·).
The conditional quantile function is
Qτ(Yi|Zi) = α(τ) + ZTi β(τ),
where
• α(τ) = α+ F−1(τ) is nondecreasing in τ ;
• β(τ) = β + γF−1(τ) may depend on τ . That is, the covariate is
allowed to have a different impact on different quantiles of the Y
distribution.
Location-shift model: γ = 0, so that β(τ) = β is constant across
quantile levels.
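A small simulation sketch of ours (using the quantreg package; the data-generating values are arbitrary): under a location-scale shift, the fitted slope changes with τ.
library(quantreg)
set.seed(1)
n <- 500
z <- runif(n)
y <- 1 + 2 * z + (1 + 0.8 * z) * rnorm(n)  # alpha = 1, beta = 2, gamma = 0.8
fit <- rq(y ~ z, tau = c(0.1, 0.5, 0.9))
coef(fit)  # slope increases with tau, approximately 2 + 0.8 * qnorm(tau)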
1.2 Quantile Treatment Effect
Quantile Treatment Effect
• Xi = 0: control; Xi = 1: treatment
• Yi|Xi = 0 ∼ F (control distribution) and Yi|Xi = 1 ∼ G(treatment distribution)
• Mean treatment effect:
∆ = E(Yi|Xi = 1) − E(Yi|Xi = 0) = ∫ y dG(y) − ∫ y dF(y).
• Quantile treatment effect:
δ(τ) = Qτ (Y |Xi = 1)−Qτ (Y |Xi = 0) = G−1(τ)− F−1(τ).
• Thus
∆ = ∫0^1 G−1(u) du − ∫0^1 F−1(u) du = ∫0^1 δ(u) du.
• Equivalent quantile regression model (with binary covariate):
Qτ (Y |X) = α(τ) + δ(τ)X.
• Location shift:
F(y) = G(y + δ) ⇒ δ(τ) = ∆ = δ.
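With a binary covariate, the sample QTE is a difference of group sample quantiles, and it coincides with the slope from quantile regression (a sketch of ours on simulated data):
library(quantreg)
set.seed(2)
y0 <- rnorm(300, mean = 2, sd = 1)  # control: F
y1 <- rnorm(300, mean = 2, sd = 2)  # treatment: G, a scale shift
x <- rep(c(0, 1), each = 300)
y <- c(y0, y1)
quantile(y1, probs = 0.9) - quantile(y0, probs = 0.9)  # direct estimate of delta(0.9)
coef(rq(y ~ x, tau = 0.9))["x"]  # same quantity, up to the sample-quantile convention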
[Figure: densities of F and G under a location shift (axes: y, Density), and the QTE δ(τ), constant in τ (axes: Tau, δ(τ) QTE).]
• Scale shift: ∆ = δ(0.5) = 0, but δ(τ) ≠ 0 at other quantiles.
[Figure: densities of F and G under a scale shift (axes: y, Density), and the corresponding QTE curve (axes: Tau, QTE).]
• Location-scale shift
[Figure: densities of F and G under a location-scale shift (axes: y, Density), and the corresponding QTE curve (axes: Tau, QTE).]
1.3 Advantages of Quantile Regression
Why Quantile Regression?
Reason 1: Quantile regression allows us to study the impact of
predictors on different quantiles of the response distribution, and thus
provides a complete picture of the relationship between Y and X.
Example: More Severe Tropical Cyclones?
• Yi: maximum wind speed of tropical cyclones in the North Atlantic
• Xi: year (1978–2009)
[Scatter plot: Max Wind Speed vs. Year (1980–2010) with the least squares fit.
LS estimate: slope 0.095, p-value 0.569. No significant trend in the mean!]
Q: Do the quantiles of max wind speed change over time?
τth quantile: Qτ(Y) = inf{y : P(Y ≤ y) ≥ τ}.
[Scatter plot: Max Wind Speed vs. Year with fitted regression quantile lines at τ = 0.25, 0.5, 0.75, 0.95.
p-values for the slopes: τ = 0.95: 0.009; τ = 0.75: 0.100; τ = 0.5: 0.718; τ = 0.25: 0.659.]
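A fit of this kind might be produced as follows (a sketch of ours; the data frame cyclones with columns speed and year is hypothetical):
library(quantreg)
fit <- rq(speed ~ year, tau = c(0.25, 0.5, 0.75, 0.95), data = cyclones)
summary(fit)  # slope estimate and inference at each quantile level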
Reason 2: Quantile regression is robust to outliers in the y observations.
[Figure: scatter plot of y vs. x with the median and mean regression fits; the median fit is insensitive to the outliers.]
Reason 3: Estimation and inference are distribution-free.
1.4 Other Properties of Quantiles and Quantile Regression
1. Basic equivariance properties. Let A be any p × p nonsingular
matrix, γ ∈ Rp, and a > 0 a constant. Let β̂(τ; y, X) be the
estimator from the τth quantile regression based on observations
(y, X). Then for any τ ∈ [0, 1],
(i) β̂(τ; ay, X) = aβ̂(τ; y, X);
(ii) β̂(τ; −ay, X) = −aβ̂(1 − τ; y, X);
(iii) β̂(τ; y + Xγ, X) = β̂(τ; y, X) + γ;
(iv) β̂(τ; y, XA) = A−1β̂(τ; y, X).
2. Equivariance to monotone transformations: suppose h(·) is an
increasing function on R. Then for any random variable Y,
Qτ{h(Y)} = h{Qτ(Y)}.
That is, the quantiles of the transformed r.v. h(Y ) are simply the
transformed quantiles on the original scale.
3. Interpolation: linear regression quantile fits interpolate p
observations. If the first column of the design matrix is one,
corresponding to the intercept, then there are roughly p zero, nτ
negative, and n(1 − τ) positive residuals yi − xTi β̂(τ). See the
subgradient condition in Section 3 for details, and the numerical
check below.
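A quick numerical check of ours:
library(quantreg)
set.seed(3)
n <- 200; tau <- 0.3
x <- rnorm(n)
y <- 1 + x + rnorm(n)
r <- resid(rq(y ~ x, tau = tau))  # p = 2 coefficients here
c(zero = sum(abs(r) < 1e-8), neg = mean(r < 0), pos = mean(r > 0))
# roughly p zero residuals; about tau negative and (1 - tau) positive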
2 Estimation and Computation
2.1 Estimation
Suppose we observe a random sample {yi,xi, i = 1, · · · , n} of (Y,X).
Mean and Least Squares Estimation (LSE)
• E(Y) = µY = arg mina E{(Y − a)2}.
• The sample mean solves mina ∑ni=1 (yi − a)2.
• The least squares estimator, which minimizes ∑ni=1 (yi − xTi β)2, is
consistent for the conditional mean E(Y|x) = xTβ.
Median and Least Absolute Deviation (LAD)
• Median: Q0.5(Y) = arg mina E|Y − a|.
• The sample median solves mina ∑ni=1 |yi − a|.
• If med(Y|x) = xTβ(0.5), then β(0.5) can be estimated by solving
minβ ∑ni=1 |yi − xTi β|.
Quantile Regression at Quantile level 0 < τ < 1
• τ th quantile of Y :
Qτ (Y ) = arg minaE{ρτ (Y − a)},
where ρτ (u) = u{τ − I(u < 0)} is the quantile loss function.
[Figure: the quantile loss (check) function ρτ(u), with slope τ − 1 for u < 0 and slope τ for u > 0.]
• The τth sample quantile of Y solves
mina ∑ni=1 ρτ(yi − a).
• If Qτ(Y|X) = XTβ(τ), then
β̂(τ) = arg minβ ∑ni=1 ρτ(yi − xTi β).
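A numerical sanity check of ours: minimizing the check loss over a constant recovers the sample quantile.
set.seed(4)
y <- rnorm(100)
tau <- 0.25
rho <- function(u) u * (tau - (u < 0))  # quantile (check) loss
optimize(function(a) sum(rho(y - a)), range(y))$minimum
quantile(y, probs = tau, type = 1)  # agrees up to the flat region of the objective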
2.2 Computation of Quantile Regression
2.2.1 Linear Programming
Linear programming (standard minimization problem):
miny∈Rm yTb,
subject to the constraints
yTA ≥ cT
and y1 ≥ 0, · · · , ym ≥ 0. Here A is an m × n matrix, b ∈ Rm, and c ∈ Rn.
The dual maximization problem:
maxx∈Rn cTx,
subject to the constraints Ax ≤ b and x ≥ 0.
Note that the linear quantile regression model can be rewritten as
yi = xTi β(τ) + ei = xTi β(τ) + (ui − vi),
where ui = eiI(ei > 0) and vi = |ei|I(ei < 0). Therefore,
minb ∑ni=1 ρτ(yi − xTi b)
⇔ min{b,u,v} τ1Tn u + (1 − τ)1Tn v
s.t. y − Xb = u − v, b ∈ Rp, u ≥ 0, v ≥ 0.
This is a standard linear programming (minimization) problem.
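A worked sketch of ours using the lpSolve package (rq() relies on specialized algorithms internally, but the LP solution coincides):
library(quantreg)
library(lpSolve)
set.seed(5)
n <- 50; tau <- 0.5
X <- cbind(1, rnorm(n))  # design matrix with intercept
y <- drop(X %*% c(1, 2) + rnorm(n))
p <- ncol(X)
# decision variables (all >= 0): (b+, b-, u, v), with b = b+ - b-
obj <- c(rep(0, 2 * p), rep(tau, n), rep(1 - tau, n))
con <- cbind(X, -X, diag(n), -diag(n))  # X(b+ - b-) + u - v = y
sol <- lp("min", obj, con, rep("=", n), y)$solution
sol[1:p] - sol[(p + 1):(2 * p)]  # beta(tau) from the LP
coef(rq(y ~ X - 1, tau = tau))  # matches rq()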
2.2.2 Existing Algorithms for Quantile Regression
• Simplex Method
By LP theory, solutions occur at vertices of the constraint set,
i.e., at subsets of p observations that are fit exactly.
– Phase 1: find an initial feasible vertex, e.g., choose any subset
h such that X(h) has full rank, and set b(h) = X(h)−1y(h).
– Phase 2: travel from one vertex to another until optimality is
achieved. Move along the direction of “steepest descent”, i.e.
the one with the most negative directional derivative; see
Barrodale and Roberts (1974) for more details.
– Efficient for problems of modest size (several thousand
observations); speed is comparable to the least squares estimate
(LSE) for n up to several hundred.
– Much slower than LSE for very large problems.
Reference: Koenker and d’Orey (1987).
• Frisch-Newton Interior Point Method
– In contrast to the simplex, the algorithm traverses the interior
of the feasible region.
– More efficient than simplex for larger sample size.
Reference: Portnoy and Koenker (1997)
• Interior point method with preprocessing
– Uses preprocessing to reduce the effective sample size.
– Recommended for very large problems (e.g., n > 10^5).
Reference: Portnoy and Koenker (1997).
• Sparse regression quantile fitting
– A sparse implementation of the Frisch-Newton interior-point
algorithm.
– Efficient when the data (design matrix) is sparse, i.e., has many
zeros.
– R function: rq.fit.sfn
– Reference: Koenker and Ng (2003)
2.3 Software
• R package quantreg
• SAS/STAT PROC QUANTREG
• STATA: www.stata.com
• SPLUS
• Matlab code for quantile regression:
www.stat.psu.edu/~dhunter/code/qrmatlab/
www.econ.uiuc.edu/~roger/research/rq/rq.m
2.3.1 Estimation in R
• R package quantreg (contributed by Koenker).
• Official releases of R and the install package of quantreg are
available at http://lib.stat.cmu.edu/R/CRAN/.
• The syntax of rq() function
library(quantreg)
rq(formula, tau=.5, data, method="br", ...)
• Arguments
– formula: model statement, e.g.,
y ~ x1 + x2
– tau: quantile level(s) of interest
∗ If tau is a single number ∈ (0, 1), returns the estimates from
regression at this specified quantile level.
∗ If tau is a vector, e.g. tau = c(0.25, 0.5, 0.75), returns the
estimates from regression at multiple quantiles.
∗ If tau ∉ (0, 1), returns the entire quantile process.
– method
∗ method=“br”: Simplex method (default).
∗ method=“fn”: the Frisch-Newton interior point method.
∗ method=“pfn”: the Frisch-Newton approach with
preprocessing.
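For instance (a usage sketch; engel is a data set shipped with quantreg):
library(quantreg)
data(engel)
fit <- rq(foodexp ~ income, tau = c(0.25, 0.5, 0.75), data = engel, method = "br")
coef(fit)  # one column of estimates per quantile level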
2.3.2 Estimation in SAS
• Package: SAS/STAT PROC QUANTREG
• Basic syntax
PROC QUANTREG DATA = sas-data-set <options>;
BY variables;
CLASS variables;
MODEL response = independents </options>;
RUN;
• To specify the quantile level:
– Use the option QUANTILE in the MODEL statement
MODEL Y = X / QUANTILE = <number list |ALL>;
Choice of quantile(s) Example
A single quantile QUANTILE = 0.25
Multiple quantiles QUANTILE = 0.25 0.5 0.75
Entire Quantile process QUANTILE = ALL
– Default value is 0.5, corresponding to the median.
• To specify the algorithm:
– Use the option ALGORITHM in the PROC
QUANTREG statement
Method Option
Simplex ALGORITHM = SIMPLEX
Interior point ALGORITHM = INTERIOR
Interior point with preprocessing ALGORITHM = INTERIOR PP
Smoothing ALGORITHM = SMOOTHING
– The default is the simplex algorithm.
• Default Output
– Model information: reports the name of the data set, the
response variable, the number of covariates, the number of
observations, the optimization algorithm, and the method for
confidence intervals.
– Summary statistics: reports the sample mean and standard
deviation, the sample median, the MAD, and the interquartile
range for each variable in the MODEL statement.
– Quantile objective function: reports the quantile level being
estimated and the value of the optimized objective function.
– Parameter estimates: reports the estimated coefficients
and their 95% confidence intervals.
Confidence interval construction will be discussed later.
Confidence interval constructions will be discussed later.
3 Statistical Properties
3.1 Consistency
The coefficient estimator in the linear quantile regression model is
β̂(τ) = arg minb∈Rp ∑ni=1 ρτ(yi − xTi b).
Classical Sufficient Regularity conditions
A1 The distribution functions of Y given xi, Fi(·), are absolutely
continuous with continuous densities fi(·) that are uniformly
bounded away from 0 and ∞ at ξi(τ) = Qτ (Y |xi).
A2 There exist positive definite matrices D0 and D1 such that
(i) limn→∞ n−1 ∑ni=1 xixTi = D0;
(ii) limn→∞ n−1 ∑ni=1 fi{ξi(τ)}xixTi = D1;
(iii) maxi=1,··· ,n ‖xi‖ = o(n1/2).
Theorem 1. Under Conditions A1 and A2(i), β̂(τ) →p β(τ).
Sketch of Proof:
1. Use the uniform law of large numbers to show that
supb∈B |n−1 ∑ni=1 [ρτ(yi − xTi b) − E{ρτ(yi − xTi b)}]| = op(1),
where B is a compact subset of Rp.
• Reference: Pollard (1991).
2. Note that β̂(τ) →p β(τ) holds if, for any ε > 0,
Q(b) − Q(β(τ)) is bounded away from zero whenever ‖b − β(τ)‖ ≥ ε,
where Q(b) ≡ n−1 ∑ni=1 E{ρτ(yi − xTi b)}.
3. Under Conditions A1 and A2 (i), Q(b) has a unique minimizer
β(τ).
4. The convergence is thus proven.
3.2 Asymptotic Normality
Theorem 2. Under Conditions A1 and A2,
n1/2{β̂(τ) − β(τ)} →d N(0, τ(1 − τ)D1−1D0D1−1).
For i.i.d. error models, i.e., fi{ξi(τ)} = fε(0) for all i, the above result
can be simplified to
n1/2{β̂(τ) − β(τ)} →d N(0, τ(1 − τ)fε(0)−2D0−1).
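A Monte Carlo sanity check of ours: with standard normal errors, standard normal x, and τ = 0.5, D0 = I, so the scaled variance of the slope should be close to τ(1 − τ)/fε(0)^2.
library(quantreg)
set.seed(6)
n <- 500; tau <- 0.5
slopes <- replicate(500, {
  x <- rnorm(n)
  y <- 1 + 2 * x + rnorm(n)
  coef(rq(y ~ x, tau = tau))[2]
})
n * var(slopes)  # empirical scaled variance of the slope
tau * (1 - tau) / dnorm(0)^2  # theoretical value, about 1.57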
Sketch of Proof
1. Define δ̂n = n1/2{β̂(τ) − β(τ)}, which is the minimizer of
Zn(δ) = ∑ni=1 [ρτ(εi − n−1/2xTi δ) − ρτ(εi)],
where εi = yi − xTi β(τ).
2. By Knight’s identity (Knight, 1998),
ρτ(u − v) − ρτ(u) = −vψτ(u) + ∫0^v {I(u ≤ s) − I(u ≤ 0)} ds,
it is easy to show that
Zn(δ) = −δTWn + Z2n(δ),
where
Wn = n−1/2 ∑ni=1 xiψτ(εi) →d W,  W ∼ N(0, τ(1 − τ)D0),
Z2n(δ) = ∑ni=1 Z2ni(δ) = ∑ni=1 ∫0^{xTi δ/√n} {I(εi ≤ s) − I(εi ≤ 0)} ds,
and ψτ(u) = τ − I(u < 0).
3. By Taylor expansion,
E{Z2n(δ)} = (1/2)δTD1δ + o(1),
and
V{Z2n(δ)} ≤ n−1/2 maxi |xTi δ| E{Z2n(δ)} = o(‖δ‖).
4. Thus
Zn(δ) → Z0(δ) = −δTW + (1/2)δTD1δ.
5. Z0(δ) is convex and thus has a unique minimizer
arg minδ Z0(δ) = D1−1W.
6. By the convexity lemma of Pollard (1991), the convergence in Step 4
can be strengthened to be uniform over δ in compact sets.
7. Thus δ̂n = arg minδ Zn(δ) →d D1−1W.
A different approach based on the score function can be found in
He and Shao (1996).
3.3 Subgradient Condition
Define
R(β) = ∑ni=1 ρτ(yi − xTi β).
• R(β) is piecewise linear and continuous.
• R(β) is differentiable except at points β such that yi − xTi β = 0 for
some i.
The directional derivative of R(β) in direction w is
∇R(β,w) = (d/dt)R(β + tw)|t=0.
Note that
(d/dt)ρτ(y − xTβ − xTwt)|t=0
= −xTwτ,  if y − xTβ > 0;
= −xTw(τ − 1),  if y − xTβ < 0;
= −xTw{τ − I(−xTw < 0)},  if y − xTβ = 0;
that is,
(d/dt)ρτ(y − xTβ − xTwt)|t=0 = −xTwψ∗τ(y − xTβ, −xTw),  (3.1)
where
ψ∗τ(u, v) = τ − I(u < 0) if u ≠ 0, and ψ∗τ(u, v) = τ − I(v < 0) if u = 0.
Thus
∇R(β,w) = −∑ni=1 xTi wψ∗τ(yi − xTi β, −xTi w).  (3.2)
Note that
∇R(β̂,w) ≥ 0 for all w ∈ Rp with ‖w‖ = 1 ⇔ β̂ = arg minβ R(β).
Theorem 3. If (y,X) are in general position (i.e., any p observations
yield a unique exact fit), then there exists a minimizer of R(β) of the
form b(h) = X(h)−1y(h) if and only if, for some h ∈ H,
−τ1p ≤ ξ(h) ≤ (1 − τ)1p,
where ξ(h)T = ∑i∈h̄ ψτ{yi − xTi b(h)}xTi X(h)−1, and h̄ is the
complement of h.
Proof. In linear programming, vertex (basic) solutions correspond to
points at which p observations are interpolated, i.e.,
(y(h), X(h)) = {(yi,xi), i ∈ h}. That is, the basic solutions pass
through these p points:
b(h) = X(h)−1y(h),  h ∈ H∗ = {h ∈ H : |X(h)| ≠ 0}.
For any w ∈ Rp, reparameterize to get v = X(h)w, i.e.,
w = X(h)−1v.
For a basic solution b(h) to be the minimizer, we need, for all v ∈ Rp,
−∑ni=1 ψ∗τ{yi − xTi b(h), −xTi X(h)−1v}xTi X(h)−1v ≥ 0.  (3.3)
WLOG, assume X(h) = (xT1 , · · · ,xTp )T .
• If i ∈ h, then xTi X(h)−1 = eTi , where ei is the p-dimensional vector
of all zeros except the ith element, which equals 1. Thus eTi v = vi.
• If (y,X) are in general position, none of the residuals yi − xTi b(h)
with i ∈ h̄ is zero. If the yi’s have a density with respect to Lebesgue
measure, then with probability one (y,X) are in general position.
• The space of directions v ∈ Rp is spanned by
v = ±ek, k = 1, · · · , p. So (3.3) holds for any v ∈ Rp iff the
inequality holds for ±ek, k = 1, · · · , p.
Therefore, (3.3) becomes
0 ≤ −∑i∈h ψ∗τ{0, −vi}vi − ξ(h)Tv,  (3.4)
where ξ(h)T = ∑i∈h̄ ψτ{yi − xTi b(h)}xTi X(h)−1.
• If v = ei, we have
0 ≤ −(τ − 1)− ξi(h), i = 1, · · · , p.
• If v = −ei, we have
0 ≤ τ + ξi(h), i = 1, · · · , p.
That is,
−τ1p ≤ ξ(h) ≤ (1− τ)1p.
□
Remark 1. The sum of the score functions satisfies
‖∑ni=1 xiψτ{yi − xTi β̂(τ)}‖ ≤ Cp maxi=1,··· ,n ‖xi‖.
3.4 Bahadur Representation
A Basic Bahadur Representation
Regularity conditions
C1 The distribution functions of Y given xi, Fi(·), have continuous
densities fi(·) that are bounded away from 0 and uniformly
bounded away from ∞ in a neighborhood of ξi(τ) = Qτ (Y |xi),
i = 1, · · · , n. In addition, the first derivative of fi(·) is uniformly
bounded in a neighborhood of ξi(τ), i = 1, · · · , n.
C2 maxi=1,··· ,n ‖xi‖ = O(n1/4{log(n)}−1/2).
C3 n−1∑ni=1 ‖xi‖4 ≤ B for some finite constant B.
C4 There exist positive definite matrices D0 and D1 such that
limn→∞ n−1 ∑ni=1 xixTi = D0 and
limn→∞ n−1 ∑ni=1 fi{xTi β(τ)}xixTi = D1.
Theorem 4. Assume Conditions C1–C4 hold. Then
n1/2{β̂(τ) − β(τ)} = D1−1 n−1/2 ∑ni=1 xiψτ{yi − xTi β(τ)}
+ Op(n−1/4 √(log n)).  (3.5)
Remark 2. For linear regression models with i.i.d. errors
εi = yi − xTi β(τ) ∼ F, we have
n1/2{β̂(τ) − β(τ)} = f(0)−1 D0−1 n−1/2 ∑ni=1 xiψτ(εi) + Op(n−1/4 √(log n)).
Sketch of Proof:
• Step 1 (uniform approximation). Let C be some fixed
constant. Define
Rn(δ) = ∑ni=1 xi[ψτ(εi − xTi δ) − ψτ(εi)],
where εi := yi − xTi β(τ). Then
supδ:‖δ‖≤C ‖Rn(δ) − E{Rn(δ)}‖ = Op(n1/2(log n)‖δ‖1/2).  (3.6)
The uniform approximation can be obtained by
– applying an exponential inequality and chaining arguments,
– e.g., applying Lemma 4.1 of He and Shao (1996).
• Step 2: the consistency of β̂(τ) was proven earlier.
• Define δ = b − β(τ). Then
Rn(δ) = ∑ni=1 xi{ψτ(εi − xTi δ) − ψτ(εi)}
= ∑ni=1 xi[ψτ{yi − xTi β(τ) − xTi δ} − ψτ{yi − xTi β(τ)}]
and
E{Rn(δ)} = ∑ni=1 xi[τ − Fi{xTi β(τ) + xTi δ}]
= ∑ni=1 xi[Fi{xTi β(τ)} − Fi{xTi β(τ) + xTi δ}]
= −∑ni=1 xi[fi{xTi β(τ)}xTi δ + f′i(ηi)(xTi δ)2]
= −∑ni=1 xixTi fi{xTi β(τ)}δ + O(∑ni=1 ‖xi‖3 ‖δ‖2).
Define δ̂ = β̂(τ) − β(τ). By the uniform approximation and the
root-n consistency of β̂(τ), under Conditions C1–C3, we get
∑ni=1 xiψτ{yi − xTi β̂(τ)} − ∑ni=1 xiψτ{yi − xTi β(τ)}
= −[n−1 ∑ni=1 xixTi fi{xTi β(τ)}] nδ̂
+ Op(n‖δ̂‖2) + Op(n1/2(log n)‖δ̂‖1/2).  (3.7)
By the subgradient condition and Condition C2,
∑ni=1 xiψτ{yi − xTi β̂(τ)} = Op(p maxi=1,··· ,n ‖xi‖) = Op(n1/4/√(log n)).
From (3.7), we get δ̂ = Op(n−1/2 log n).
Therefore,
n1/2{β̂(τ) − β(τ)} = D1−1 n−1/2 ∑ni=1 xiψτ{yi − xTi β(τ)}
+ Op(n−1/4 √(log n)).  (3.8)
4 Statistical Inference
4.1 Wald-type Test
4.1.1 Asymptotic Normality
• Asymptotic normality in iid settings:
n1/2{β̂(τ) − β(τ)} →d N(0, τ(1 − τ)fε(0)−2D0−1),
where D0 = limn→∞ n−1 ∑ni=1 xixTi .
• Asymptotic normality in non-iid settings:
n1/2{β̂(τ) − β(τ)} →d N(0, τ(1 − τ)D1(τ)−1D0D1(τ)−1),
where
D1(τ) = limn→∞ n−1 ∑ni=1 fi{xTi β(τ)}xixTi .
• Asymptotic covariance across quantile levels:
Acov(√n{β̂(τi) − β(τi)}, √n{β̂(τj) − β(τj)})
= (τi ∧ τj − τiτj)D1(τi)−1D0D1(τj)−1.
4.1.2 Wald Test for General Linear Hypotheses
Define the coefficient vector θ = (β(τ1)T, · · · ,β(τm)T)T.
• Null hypothesis H0: Rθ = r.
• Test statistic
Tn = n(Rθ̂ − r)T(RV̂ RT)−1(Rθ̂ − r),
where V̂ is the mp × mp matrix with (i, j)th block
V̂(τi, τj) = (τi ∧ τj − τiτj)D̂1(τi)−1D̂0D̂1(τj)−1.
• Under H0, Tn →d χ2q, where q is the rank of R.
• Drawback: the covariance matrix involves the unknown density
functions (nuisance parameters), i.e., fi{xTi β(τ)} in non-iid
settings and fε(0) in iid settings.
• Reference: Koenker and Machado (1999).
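In R, a Wald-type test of the equality of slopes across quantile levels is available through anova on rq fits (a usage sketch):
library(quantreg)
data(engel)
f1 <- rq(foodexp ~ income, tau = 0.25, data = engel)
f2 <- rq(foodexp ~ income, tau = 0.75, data = engel)
anova(f1, f2)  # tests whether the income slope is the same at tau = 0.25 and 0.75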
4.1.3 Estimation of Asymptotic Covariance Matrix
1. IID settings:
var{n1/2β̂(τ)} = τ(1 − τ)fε(0)−2D̂0−1,
where D̂0 = n−1 ∑ni=1 xixTi .
Estimation of fε(0) = fε{Fε−1(τ)}:
• Sparsity parameter:
s(τ) = 1/f{F−1(τ)}.
• Note that F{F−1(t)} = t. Differentiating both sides with respect
to t gives
f{F−1(t)}(d/dt)F−1(t) = 1 ⇔ (d/dt)F−1(t) = s(t).
That is, the sparsity parameter s(t) is simply the derivative of the
quantile function F−1(t) with respect to t.
• Difference quotient estimator (Siddiqui, 1960):
ŝn(t) = {F̂n−1(t + hn|x̄) − F̂n−1(t − hn|x̄)}/(2hn),
where hn → 0 as n → ∞, and F̂n−1(t|x̄) is the estimated tth
conditional quantile of Y given x̄ = n−1 ∑ni=1 xi.
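A sketch of the difference quotient in R (ours; the bandwidth hn is chosen ad hoc for illustration):
library(quantreg)
data(engel)
tau <- 0.5; hn <- 0.1
xbar <- c(1, mean(engel$income))  # mean design point, with intercept
qhi <- sum(xbar * coef(rq(foodexp ~ income, tau = tau + hn, data = engel)))
qlo <- sum(xbar * coef(rq(foodexp ~ income, tau = tau - hn, data = engel)))
(qhi - qlo) / (2 * hn)  # estimated sparsity s(tau) at the mean design point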
2. Non-IID settings:
var{n1/2β̂(τ)} = τ(1 − τ)D1(τ)−1D0D1(τ)−1.
Estimation of D1(τ)
• Hendricks-Koenker Sandwich
– Suppose the conditional quantiles of Y given x are linear at
quantile levels around τ .
– Then fit quantile regressions at the (τ ± hn)th quantile levels,
obtaining β̂(τ − hn) and β̂(τ + hn).
– Estimate fi{ξi(τ)} by
f̂i{ξi(τ)} = 2hn/{xTi β̂(τ + hn) − xTi β̂(τ − hn)},
where ξi(τ) = Qτ(Y|xi).
– In finite samples, estimated quantiles may cross, so that an
upper quantile may be estimated to be smaller than a lower
one. A modified estimator accounting for this issue is
f̂i{ξi(τ)} = max(0, 2hn/{xTi β̂(τ + hn) − xTi β̂(τ − hn) − ε}),
where ε is a small positive constant to avoid a zero
denominator.
– Estimator of D1(τ):
D̂1(τ) = n−1 ∑ni=1 f̂i{ξi(τ)}xixTi .
• Powell Sandwich
– Powell (1991): a kernel estimator of D1(τ):
D̂1(τ) = (nhn)−1 ∑ni=1 K{ε̂i(τ)/hn}xixTi ,
where ε̂i(τ) = yi − xTi β̂(τ), and hn is a bandwidth
parameter satisfying hn → 0 and n1/2hn → ∞ as n → ∞.
– Under certain continuity conditions on fi, D̂1(τ) →P D1(τ).
Implementation in R quantreg package
fit <- rq(y ~ x, tau = tau)
# assuming iid errors
summary(fit, se = "iid")
# assuming non-iid errors: Hendricks-Koenker sandwich
summary(fit, se = "nid")
# based on the Powell sandwich
summary(fit, se = "ker")
4.2 Rank Score Test
Consider the model
Qτ (Y |xi, zi) = xTi β(τ) + zTi γ(τ),
and hypotheses
H0: γ(τ) = 0 vs. Ha: γ(τ) ≠ 0.
Here β(τ) ∈ Rp and γ(τ) ∈ Rq.
Score function:
Sn = n−1/2 ∑ni=1 z∗i ψτ{yi − xTi β̂(τ)},
where
• ψτ (u) = τ − I(u < 0);
• Z∗ = (z∗i ) = Z −X(XTΨX)−1XTΨZ,
• Ψ = diag (fi{Qτ (Y |xi, zi)});
• β̂(τ) is the quantile coefficient estimator obtained under H0.
Asymptotic property: under H0, as n → ∞,
Sn is asymptotically N(0, Mn),  (4.1)
where Mn = τ(1 − τ) n−1 ∑ni=1 z∗i z∗Ti .
Rank-score test statistic:
Tn = SnT Mn−1 Sn →d χ2q under H0.
Simplification for i.i.d. settings
• Z∗ = (z∗i ) = {I − X(XTX)−1XT}Z: the residuals from
projecting Z on X;
• Mn = τ(1 − τ) n−1 ∑ni=1 z∗i z∗Ti ;
• so there is no need to estimate the nuisance parameters
fi{Qτ(Y|xi, zi)}.
Construction of a confidence interval (CI) for γ(τ)
• γ(τ) is a scalar parameter of interest, corresponding to one of the
covariates.
• A CI for γ(τ) can be constructed by inverting the rank score test.
• Consider the hypotheses
H0: γ(τ) = γ0 vs. Ha: γ(τ) ≠ γ0,
where γ0 is a prespecified scalar.
• Reject H0 if Tn exceeds χ21−α(1), the (1 − α)th quantile of the
χ2(1) distribution.
• The collection of all γ0 for which H0 is not rejected is taken to
be the (1 − α) CI for γ(τ).
• Reference: Gutenbrunner et al. (1993).
Implementation in R quantreg package
fit <- rq(y ~ x, tau = tau)
# assuming iid errors
summary(fit, se = "rank", alpha = 0.05, iid = TRUE)
# assuming non-iid errors
summary(fit, se = "rank", alpha = 0.05, iid = FALSE)
# Output: estimates and (1 - alpha) CIs for each coefficient
4.3 Resampling and Bootstrap
4.3.1 Residual bootstrap
Appropriate for i.i.d. errors (location-shift models).
• Obtain the estimator β̂(τ) from the observed sample, and compute
the residuals ε̂i = yi − xTi β̂(τ).
• Draw bootstrap errors ε∗i , i = 1, · · · , n, with replacement from
{ε̂i, i = 1, · · · , n}, and define y∗i = xTi β̂(τ) + ε∗i .
• Compute the bootstrap estimator β̂∗(τ) by quantile regression
using the bootstrap sample.
• Carry out inference by estimating the covariance of β̂(τ) with the
sample covariance of the bootstrap estimators, or by constructing
CIs with percentile methods.
• Reference: De Angelis et al. (1993); Knight (2003).
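A minimal sketch of the residual bootstrap (ours, on simulated data):
library(quantreg)
set.seed(7)
n <- 200; tau <- 0.5
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- rq(y ~ x, tau = tau)
e <- resid(fit); yhat <- fitted(fit)
bstar <- replicate(500, {
  ystar <- yhat + sample(e, n, replace = TRUE)  # resample residuals
  coef(rq(ystar ~ x, tau = tau))
})
apply(bstar, 1, sd)  # bootstrap standard errors of intercept and slope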
4.3.2 Paired bootstrap
• Generate a bootstrap sample (y∗i , x∗i ) by drawing with replacement
from the n pairs {(yi,xi), i = 1, · · · , n}.
• Obtain the bootstrap estimator β∗(τ) by quantile regression using
the bootstrap sample.
• Reference: Andrews and Buchinsky (2000, 2001)
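In quantreg, the paired (xy) bootstrap is available through summary, whose extra arguments are passed on to boot.rq (a usage sketch, with y and x as in earlier examples):
fit <- rq(y ~ x, tau = 0.5)
summary(fit, se = "boot", bsmethod = "xy", R = 500)  # paired bootstrap, 500 replications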
4.3.3 MCMB
Markov chain marginal bootstrap (MCMB; He and Hu, 2002;
Kocherginsky et al., 2005). Instead of solving a p-dimensional
estimating equation for each bootstrap replication, MCMB solves p
one-dimensional estimating equations. Model:
Qτ (Yi|xi) = xTi β(τ), β(τ) ∈ Rp,
where xi = (xi,1, · · · , xi,p)T .
Procedure
(i) Calculate ri = yi − xTi β̂(τ). Define zi = xiψτ(ri) − z̄, where
z̄ = n−1 ∑ni=1 xiψτ(ri) and ψτ(r) = τ − I(r < 0).
(ii) Step 0: let β(0) = β̂(τ).
(iii) Step k: for each integer 1 ≤ j ≤ p in ascending order, draw with
replacement from z1, · · · , zn to obtain z1(j,k), · · · , zn(j,k). Solve
for βj(k) as the solution to
∑ni=1 xi,j ψτ{yi − ∑l<j xi,lβl(k) − ∑l>j xi,lβl(k−1) − xi,jβj(k)} = ∑ni=1 zi,j(j,k),
where zi,j(j,k) denotes the jth component of zi(j,k).
(iv) Repeat Step (iii) until K replications β(k), k = 1, · · · ,K, are
obtained. The variance of β̂(τ) is then estimated by the sample
variance of {β(k), k = 1, · · · ,K}.
The MCMB method is implemented in both the R package quantreg
and SAS.
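For instance (a usage sketch, with y and x as in earlier examples; bsmethod = "mcmb" selects MCMB in boot.rq):
fit <- rq(y ~ x, tau = 0.5)
summary(fit, se = "boot", bsmethod = "mcmb", R = 500)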
Other options
• Bootstrap estimating equations: Parzen and Ying (1994).
• Wild bootstrap: Feng et al. (2011).
References
Andrews, D. W. K. and Buchinsky, M. (2000), “A Three-Step Method for
Choosing the Number of Bootstrap Repetitions,” Econometrica, 68,
23–51.
— (2001), “Evaluation of a Three-Step Method for Choosing the Number of
Bootstrap Repetitions,” Journal of Econometrics, 103, 345–386.
Barrodale, I. and Roberts, F. (1974), “Solution of an Overdetermined
System of Equations in the L1 Norm,” Communications of the ACM, 17,
319–320.
De Angelis, D., Hall, P., and Young, G. A. (1993), “Analytical and
Bootstrap Approximations to Estimator Distributions in L1 Regression,”
Journal of the American Statistical Association, 88, 1310–1316.
Feng, X., He, X., and Hu, J. (2011), “Wild Bootstrap for Quantile
Regression,” Biometrika, 98, 995–999.
Gutenbrunner, C., Jureckova, J., Koenker, R., and Portnoy, S. (1993),
“Tests of Linear Hypotheses Based on Regression Rank Scores,” Journal
of Nonparametric Statistics, 2, 307–333.
He, X. and Hu, F. (2002), “Markov Chain Marginal Bootstrap,” Journal of
the American Statistical Association, 97, 783–795.
He, X. and Shao, Q. M. (1996), “A General Bahadur Representation of
M-Estimators and Its Application to Linear Regression With
Nonstochastic Designs,” The Annals of Statistics, 24, 2608–2630.
Knight, K. (1998), “Limiting distributions for L1 regression estimators under
general conditions,” The Annals of Statistics, 26, 755–770.
— (2003), “On the Second Order Behaviour of the Bootstrap of L1
Regression Estimators,” Journal of the Iranian Statistical Society, 2,
21–42.
Kocherginsky, M., He, X., and Mu, Y. (2005), “Practical Confidence
Intervals for Regression Quantiles,” Journal of Computational and
Graphical Statistics, 14, 41–55.
Koenker, R. and Bassett, G. (1982), “Robust tests for heteroscedasticity
based on regression quantiles,” Econometrica, 50, 43–61.
Koenker, R. and d’Orey, V. (1987), “Computing Regression Quantiles,”
Applied Statistics, 36, 383–393.
Koenker, R. and Machado, J. A. F. (1999), “Goodness of Fit and Related
Inference Processes for Quantile Regression,” Journal of the American
Statistical Association, 94, 1296–1310.
Koenker, R. and Ng, P. (2003), “SparseM: A Sparse Matrix Package for R,”
Journal of Statistical Software, 8, 1–9.
Parzen, M. I., Wei, L. J., and Ying, Z. (1994), “A Resampling Method Based
on Pivotal Estimating Functions,” Biometrika, 81, 341–350.
Pollard, D. (1991), “Asymptotics for least absolute deviation regression
estimators,” Econometric Theory, 7, 186–199.
Portnoy, S. and Koenker, R. (1997), “The Gaussian Hare and the Laplacian
Tortoise: Computability of Squared-Error Versus Absolute-Error
Estimators, with Discussion,” Statistical Science, 12, 279–300.
Powell, J. L. (1991), “Estimation of Monotonic Regression Models under
Quantile Restrictions,” in Nonparametric and Semiparametric Methods in
Econometrics, eds. Barnett, W., Powell, J., and Tauchen, G., Cambridge:
Cambridge University Press.
Siddiqui, M. (1960), “Distribution of Quantiles from a Bivariate Population,”
Journal of Research of the National Bureau of Standards, 64, 145–150.