A Short Course on Quantile Regression
Xuming He (University of Michigan)
Huixia Judy Wang (North Carolina State University)
Course Outline:
1. Basics of quantile regression
2. Estimation and computation
3. Statistical properties
4. Inference: tests and confidence intervals
5. Bayesian quantile regression
6. Nonparametric quantile regression
1 Basics of Quantile Regression
1.1 Quantile Regression versus Mean Regression
Quantile. Let Y be a random variable with cumulative distribution
function (CDF) FY(y) = P(Y ≤ y). The τth quantile of Y is
Qτ (Y ) = inf{y : FY (y) ≥ τ},
where 0 < τ < 1 is the quantile level.
Special cases
• Q0.5(Y ): median, the second quartile
• Q0.25(Y ): the first quartile, 25th percentile
• Q0.75(Y ): the third quartile, 75th percentile
Note: Qτ(Y) is a nondecreasing function of τ, i.e.,
Qτ1(Y) ≤ Qτ2(Y) for τ1 < τ2.
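For example (a quick numerical check of ours in R):
set.seed(1)
quantile(rnorm(1000), probs = c(0.25, 0.50, 0.75))  # sample quartiles, nondecreasing in tau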
Conditional quantile. Suppose Y is the response variable, and X
is the p-dimensional predictor. Let FY (y|X = x) = P (Y ≤ y|X = x)
denote the conditional CDF of Y given X = x. Then the τth
conditional quantile of Y is defined as
Qτ (Y |X = x) = inf{y : FY (y|x) ≥ τ}.
Least squares linear (mean) regression model:
Y = XTβ + U, E(U) = 0.
Thus
E(Y |X = x) = xTβ,
where β measures the marginal change in the mean of Y due to a
marginal change in x.
Linear quantile regression model:
Qτ (Y |x) = xTβ(τ), 0 < τ < 1,
where β(τ) = (β1(τ), · · · , βp(τ))T is the vector of quantile
coefficients, which may depend on τ;
• the first element of x is one, corresponding to the intercept, so that
Qτ(Y|x) = β1(τ) + x2β2(τ) + · · · + xpβp(τ);
• β(τ) is the marginal change in the τth conditional quantile due to a
marginal change in x.
Note that Qτ (Y |x) is a nondecreasing function of τ for any given x.
Example: location-scale shift model
Yi = α + ZTi β + (1 + ZTi γ)εi,  εi i.i.d. ∼ F(·).
The conditional quantile function is
Qτ(Yi|Zi) = α(τ) + ZTi β(τ),
where
• α(τ) = α+ F−1(τ) is nondecreasing in τ ;
• β(τ) = β + γF−1(τ) may depend on τ . That is, the covariate is
allowed to have a different impact on different quantiles of the Y
distribution.
Location-shift model: γ = 0, so that β(τ) = β is constant across
quantile levels.
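A small simulation sketch of ours (using the quantreg package; the data-generating values are arbitrary): under a location-scale shift, the fitted slope changes with τ.
library(quantreg)
set.seed(1)
n <- 500
z <- runif(n)
y <- 1 + 2 * z + (1 + 0.8 * z) * rnorm(n)  # alpha = 1, beta = 2, gamma = 0.8
fit <- rq(y ~ z, tau = c(0.1, 0.5, 0.9))
coef(fit)  # slope increases with tau, approximately 2 + 0.8 * qnorm(tau)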
1.2 Quantile Treatment Effect
Quantile Treatment Effect
• Xi = 0: control; Xi = 1: treatment
• Yi|Xi = 0 ∼ F (control distribution) and Yi|Xi = 1 ∼ G(treatment distribution)
• Mean treatment effect:
∆ = E(Yi|Xi = 1) − E(Yi|Xi = 0) = ∫ y dG(y) − ∫ y dF(y).
• Quantile treatment effect:
δ(τ) = Qτ (Y |Xi = 1)−Qτ (Y |Xi = 0) = G−1(τ)− F−1(τ).
• Thus
∆ = ∫0^1 G−1(u) du − ∫0^1 F−1(u) du = ∫0^1 δ(u) du.
• Equivalent quantile regression model (with binary covariate):
Qτ (Y |X) = α(τ) + δ(τ)X.
• Location shift:
F(y) = G(y + δ) ⇒ δ(τ) = ∆ = δ.
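With a binary covariate, the sample QTE is a difference of group sample quantiles, and it coincides with the slope from quantile regression (a sketch of ours on simulated data):
library(quantreg)
set.seed(2)
y0 <- rnorm(300, mean = 2, sd = 1)  # control: F
y1 <- rnorm(300, mean = 2, sd = 2)  # treatment: G, a scale shift
x <- rep(c(0, 1), each = 300)
y <- c(y0, y1)
quantile(y1, probs = 0.9) - quantile(y0, probs = 0.9)  # direct estimate of delta(0.9)
coef(rq(y ~ x, tau = 0.9))["x"]  # same quantity, up to the sample-quantile convention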
[Figure: densities of F and G under a location shift (axes: y, Density), and the QTE δ(τ), constant in τ (axes: Tau, δ(τ) QTE).]
• Scale shift: ∆ = δ(0.5) = 0, but δ(τ) ≠ 0 at other quantiles.
[Figure: densities of F and G under a scale shift (axes: y, Density), and the corresponding QTE curve (axes: Tau, QTE).]
• Location-scale shift
[Figure: densities of F and G under a location-scale shift (axes: y, Density), and the corresponding QTE curve (axes: Tau, QTE).]
1.3 Advantages of Quantile Regression
Why Quantile Regression?
Reason 1: Quantile regression allows us to study the impact of
predictors on different quantiles of the response distribution, and thus
provides a complete picture of the relationship between Y and X.
Example: More Severe Tropical Cyclones?
• Yi: maximum wind speed of tropical cyclones in the North Atlantic
• Xi: year (1978–2009)
[Scatter plot: Max Wind Speed vs. Year (1980–2010) with the least squares fit.
LS estimate: slope 0.095, p-value 0.569. No significant trend in the mean!]
Q: Do the quantiles of max wind speed change over time?
τth quantile: Qτ(Y) = inf{y : P(Y ≤ y) ≥ τ}.
[Scatter plot: Max Wind Speed vs. Year with fitted regression quantile lines at τ = 0.25, 0.5, 0.75, 0.95.
p-values for the slopes: τ = 0.95: 0.009; τ = 0.75: 0.100; τ = 0.5: 0.718; τ = 0.25: 0.659.]
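A fit of this kind might be produced as follows (a sketch of ours; the data frame cyclones with columns speed and year is hypothetical):
library(quantreg)
fit <- rq(speed ~ year, tau = c(0.25, 0.5, 0.75, 0.95), data = cyclones)
summary(fit)  # slope estimate and inference at each quantile level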
Reason 2: Quantile regression is robust to outliers in the y observations.
[Figure: scatter plot of y vs. x with the median and mean regression fits; the median fit is insensitive to the outliers.]
Reason 3: Estimation and inference are distribution-free.
1.4 Other Properties of Quantiles and Quantile Regression
1. Basic equivariance properties. Let A be any p × p nonsingular
matrix, γ ∈ Rp, and a > 0 a constant. Let β̂(τ; y, X) be the
estimator from the τth quantile regression based on observations
(y, X). Then for any τ ∈ [0, 1],
(i) β̂(τ; ay, X) = aβ̂(τ; y, X);
(ii) β̂(τ; −ay, X) = −aβ̂(1 − τ; y, X);
(iii) β̂(τ; y + Xγ, X) = β̂(τ; y, X) + γ;
(iv) β̂(τ; y, XA) = A−1β̂(τ; y, X).
2. Equivariance to monotone transformations: suppose h(·) is an
increasing function on R. Then for any random variable Y,
Qτ{h(Y)} = h{Qτ(Y)}.
That is, the quantiles of the transformed r.v. h(Y ) are simply the
transformed quantiles on the original scale.
3. Interpolation: linear regression quantile fits interpolate p
observations. If the first column of the design matrix is one,
corresponding to the intercept, then there are roughly p zero, nτ
negative, and n(1 − τ) positive residuals yi − xTi β̂(τ). See the
subgradient condition in Section 3 for details, and the numerical
check below.
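A quick numerical check of ours:
library(quantreg)
set.seed(3)
n <- 200; tau <- 0.3
x <- rnorm(n)
y <- 1 + x + rnorm(n)
r <- resid(rq(y ~ x, tau = tau))  # p = 2 coefficients here
c(zero = sum(abs(r) < 1e-8), neg = mean(r < 0), pos = mean(r > 0))
# roughly p zero residuals; about tau negative and (1 - tau) positive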
2 Estimation and Computation
2.1 Estimation
Suppose we observe a random sample {yi,xi, i = 1, · · · , n} of (Y,X).
Mean and Least Squares Estimation (LSE)
• E(Y) = µY = arg mina E{(Y − a)2}.
• The sample mean solves mina ∑ni=1 (yi − a)2.
• The least squares estimator, which minimizes ∑ni=1 (yi − xTi β)2, is
consistent for the conditional mean E(Y|x) = xTβ.
Median and Least Absolute Deviation (LAD)
• Median: Q0.5(Y) = arg mina E|Y − a|.
• The sample median solves mina ∑ni=1 |yi − a|.
• If med(Y|x) = xTβ(0.5), then β(0.5) can be estimated by solving
minβ ∑ni=1 |yi − xTi β|.
Quantile Regression at Quantile level 0 < τ < 1
• τ th quantile of Y :
Qτ (Y ) = arg minaE{ρτ (Y − a)},
where ρτ (u) = u{τ − I(u < 0)} is the quantile loss function.
[Figure: the quantile loss (check) function ρτ(u), with slope τ − 1 for u < 0 and slope τ for u > 0.]
• The τth sample quantile of Y solves
mina ∑ni=1 ρτ(yi − a).
• If Qτ(Y|X) = XTβ(τ), then
β̂(τ) = arg minβ ∑ni=1 ρτ(yi − xTi β).
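A numerical sanity check of ours: minimizing the check loss over a constant recovers the sample quantile.
set.seed(4)
y <- rnorm(100)
tau <- 0.25
rho <- function(u) u * (tau - (u < 0))  # quantile (check) loss
optimize(function(a) sum(rho(y - a)), range(y))$minimum
quantile(y, probs = tau, type = 1)  # agrees up to the flat region of the objective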
2.2 Computation of Quantile Regression
2.2.1 Linear Programming
Linear programming (standard minimization problem):
miny∈Rm yTb,
subject to the constraints
yTA ≥ cT
and y1 ≥ 0, · · · , ym ≥ 0. Here A is an m × n matrix, b ∈ Rm, and c ∈ Rn.
The dual maximization problem:
maxx∈Rn cTx,
subject to the constraints Ax ≤ b and x ≥ 0.
Note that the linear quantile regression model can be rewritten as
yi = xTi β(τ) + ei = xTi β(τ) + (ui − vi),
where ui = eiI(ei > 0) and vi = |ei|I(ei < 0). Therefore,
minb ∑ni=1 ρτ(yi − xTi b)
⇔ min{b,u,v} τ1Tn u + (1 − τ)1Tn v
s.t. y − Xb = u − v, b ∈ Rp, u ≥ 0, v ≥ 0.
This is a standard linear programming (minimization) problem.
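A worked sketch of ours using the lpSolve package (rq() relies on specialized algorithms internally, but the LP solution coincides):
library(quantreg)
library(lpSolve)
set.seed(5)
n <- 50; tau <- 0.5
X <- cbind(1, rnorm(n))  # design matrix with intercept
y <- drop(X %*% c(1, 2) + rnorm(n))
p <- ncol(X)
# decision variables (all >= 0): (b+, b-, u, v), with b = b+ - b-
obj <- c(rep(0, 2 * p), rep(tau, n), rep(1 - tau, n))
con <- cbind(X, -X, diag(n), -diag(n))  # X(b+ - b-) + u - v = y
sol <- lp("min", obj, con, rep("=", n), y)$solution
sol[1:p] - sol[(p + 1):(2 * p)]  # beta(tau) from the LP
coef(rq(y ~ X - 1, tau = tau))  # matches rq()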
2.2.2 Existing Algorithms for Quantile Regression
• Simplex Method
By LP theory, solutions occur at vertices of the constraint set,
i.e., at subsets of p observations that are fit exactly.
– Phase 1: find an initial feasible vertex, e.g., choose any subset
h such that X(h) has full rank, and set b(h) = X(h)−1y(h).
– Phase 2: travel from one vertex to another until optimality is
achieved. Move along the direction of “steepest descent”, i.e.
the one with the most negative directional derivative; see
Barrodale and Roberts (1974) for more details.
– Efficient for problems of modest size (several thousand
observations); speed is comparable to the least squares estimate
(LSE) for n up to several hundred.
– Much slower than LSE for very large problems.
Reference: Koenker and d’Orey (1987).
• Frisch-Newton Interior Point Method
– In contrast to the simplex, the algorithm traverses the interior
of the feasible region.
– More efficient than simplex for larger sample size.
Reference: Portnoy and Koenker (1997)
• Interior point method with preprocessing
– Uses preprocessing to reduce the effective sample size.
– Recommended for very large problems (e.g., n > 10^5).
Reference: Portnoy and Koenker (1997).
• Sparse regression quantile fitting
– A sparse implementation of the Frisch-Newton interior-point
algorithm.
– Efficient when the data (design matrix) is sparse, i.e., has many
zeros.
– R function: rq.fit.sfn
– Reference: Koenker and Ng (2003)
2.3 Software
• R package quantreg
• SAS/STAT PROC QUANTREG
• STATA: www.stata.com
• SPLUS
• Matlab code for quantile regression:
www.stat.psu.edu/~dhunter/code/qrmatlab/
www.econ.uiuc.edu/~roger/research/rq/rq.m
2.3.1 Estimation in R
• R package quantreg (contributed by Koenker).
• Official releases of R and the install package of quantreg are
available at http://lib.stat.cmu.edu/R/CRAN/.
• The syntax of rq() function
library(quantreg)
rq(formula, tau=.5, data, method="br", ...)
• Arguments
– formula: model statement, e.g.,
y ~ x1 + x2
– tau: quantile level(s) of interest
∗ If tau is a single number ∈ (0, 1), returns the estimates from
regression at this specified quantile level.
∗ If tau is a vector, e.g. tau = c(0.25, 0.5, 0.75), returns the
estimates from regression at multiple quantiles.
∗ If tau ∉ (0, 1), returns the entire quantile process.
– method
∗ method=“br”: Simplex method (default).
∗ method=“fn”: the Frisch-Newton interior point method.
∗ method=“pfn”: the Frisch-Newton approach with
preprocessing.
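For instance (a usage sketch; engel is a data set shipped with quantreg):
library(quantreg)
data(engel)
fit <- rq(foodexp ~ income, tau = c(0.25, 0.5, 0.75), data = engel, method = "br")
coef(fit)  # one column of estimates per quantile level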
2.3.2 Estimation in SAS
• Package: SAS/STAT PROC QUANTREG
• Basic syntax
PROC QUANTREG DATA = sas-data-set <options>;
BY variables;
CLASS variables;
MODEL response = independents </options>;
RUN;
• To specify the quantile level:
– Use the option QUANTILE in the MODEL statement
MODEL Y = X / QUANTILE = <number list |ALL>;
Choice of quantile(s) Example
A single quantile QUANTILE = 0.25
Multiple quantiles QUANTILE = 0.25 0.5 0.75
Entire Quantile process QUANTILE = ALL
– Default value is 0.5, corresponding to the median.
• To specify the algorithm:
– Use the option ALGORITHM in the PROC
QUANTREG statement
Method Option
Simplex ALGORITHM = SIMPLEX
Interior point ALGORITHM = INTERIOR
Interior point with preprocessing ALGORITHM = INTERIOR PP
Smoothing ALGORITHM = SMOOTHING
– The default is the simplex algorithm.
• Default Output
– Model information: reports the name of the data set, the
response variable, the number of covariates, the number of
observations, the optimization algorithm, and the method for
confidence intervals.
– Summary statistics: reports the sample mean and standard
deviation, the sample median, the MAD, and the interquartile
range for each variable in the MODEL statement.
– Quantile objective function: reports the quantile level being
estimated and the value of the optimized objective function.
– Parameter estimates: reports the estimated coefficients
and their 95% confidence intervals.
Confidence interval construction will be discussed later.
Confidence interval constructions will be discussed later.
3 Statistical Properties
3.1 Consistency
The coefficient estimator in the linear quantile regression model is
β̂(τ) = arg minb∈Rp ∑ni=1 ρτ(yi − xTi b).
Classical Sufficient Regularity conditions
A1 The distribution functions of Y given xi, Fi(·), are absolutely
continuous with continuous densities fi(·) that are uniformly
bounded away from 0 and ∞ at ξi(τ) = Qτ (Y |xi).
A2 There exist positive definite matrices D0 and D1 such that
(i) limn→∞ n−1 ∑ni=1 xixTi = D0;
(ii) limn→∞ n−1 ∑ni=1 fi{ξi(τ)}xixTi = D1;
(iii) maxi=1,··· ,n ‖xi‖ = o(n1/2).
Theorem 1. Under Conditions A1 and A2(i), β̂(τ) →p β(τ).
Sketch of Proof:
1. Use the uniform law of large numbers to show that
supb∈B |n−1 ∑ni=1 [ρτ(yi − xTi b) − E{ρτ(yi − xTi b)}]| = op(1),
where B is a compact subset of Rp.
• Reference: Pollard (1991).
2. Note that β̂(τ) →p β(τ) holds if, for any ε > 0,
Q(b) − Q(β(τ)) is bounded away from zero whenever ‖b − β(τ)‖ ≥ ε,
where Q(b) ≡ n−1 ∑ni=1 E{ρτ(yi − xTi b)}.
3. Under Conditions A1 and A2 (i), Q(b) has a unique minimizer
β(τ).
4. The convergence is thus proven.
3.2 Asymptotic Normality
Theorem 2. Under Conditions A1 and A2,
n1/2{β̂(τ) − β(τ)} →d N(0, τ(1 − τ)D1−1D0D1−1).
For i.i.d. error models, i.e., fi{ξi(τ)} = fε(0) for all i, the above result
can be simplified to
n1/2{β̂(τ) − β(τ)} →d N(0, τ(1 − τ)fε(0)−2D0−1).
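A Monte Carlo sanity check of ours: with standard normal errors, standard normal x, and τ = 0.5, D0 = I, so the scaled variance of the slope should be close to τ(1 − τ)/fε(0)^2.
library(quantreg)
set.seed(6)
n <- 500; tau <- 0.5
slopes <- replicate(500, {
  x <- rnorm(n)
  y <- 1 + 2 * x + rnorm(n)
  coef(rq(y ~ x, tau = tau))[2]
})
n * var(slopes)  # empirical scaled variance of the slope
tau * (1 - tau) / dnorm(0)^2  # theoretical value, about 1.57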
Sketch of Proof
1. Define δ̂n = n1/2{β̂(τ) − β(τ)}, which is the minimizer of
Zn(δ) = ∑ni=1 [ρτ(εi − n−1/2xTi δ) − ρτ(εi)],
where εi = yi − xTi β(τ).
2. By Knight’s identity (Knight, 1998),
ρτ(u − v) − ρτ(u) = −vψτ(u) + ∫0^v {I(u ≤ s) − I(u ≤ 0)} ds,
it is easy to show that
Zn(δ) = −δTWn + Z2n(δ),
where
Wn = n−1/2 ∑ni=1 xiψτ(εi) →d W,  W ∼ N(0, τ(1 − τ)D0),
Z2n(δ) = ∑ni=1 Z2ni(δ) = ∑ni=1 ∫0^{xTi δ/√n} {I(εi ≤ s) − I(εi ≤ 0)} ds,
and ψτ(u) = τ − I(u < 0).
3. By Taylor expansion,
E{Z2n(δ)} = (1/2)δTD1δ + o(1),
and
V{Z2n(δ)} ≤ n−1/2 maxi |xTi δ| E{Z2n(δ)} = o(‖δ‖).
4. Thus
Zn(δ) → Z0(δ) = −δTW + (1/2)δTD1δ.
5. Z0(δ) is convex and thus has a unique minimizer
arg minδ Z0(δ) = D1−1W.
6. By the convexity lemma of Pollard (1991), the convergence in Step 4
can be strengthened to be uniform over δ in compact sets.
7. Thus δ̂n = arg minδ Zn(δ) →d D1−1W.
A different approach based on the score function can be found in
He and Shao (1996).
3.3 Subgradient Condition
Define
R(β) = ∑ni=1 ρτ(yi − xTi β).
• R(β) is piecewise linear and continuous.
• R(β) is differentiable except at points β such that yi − xTi β = 0 for
some i.
The directional derivative of R(β) in direction w is
∇R(β,w) = (d/dt)R(β + tw)|t=0.
Note that
(d/dt)ρτ(y − xTβ − xTwt)|t=0
= −xTwτ,  if y − xTβ > 0;
= −xTw(τ − 1),  if y − xTβ < 0;
= −xTw{τ − I(−xTw < 0)},  if y − xTβ = 0;
that is,
(d/dt)ρτ(y − xTβ − xTwt)|t=0 = −xTwψ∗τ(y − xTβ, −xTw),  (3.1)
where
ψ∗τ(u, v) = τ − I(u < 0) if u ≠ 0, and ψ∗τ(u, v) = τ − I(v < 0) if u = 0.
Thus
∇R(β,w) = −∑ni=1 xTi wψ∗τ(yi − xTi β, −xTi w).  (3.2)
Note that
∇R(β̂,w) ≥ 0 for all w ∈ Rp with ‖w‖ = 1 ⇔ β̂ = arg minβ R(β).
Theorem 3. If (y,X) are in general position (i.e., any p observations
yield a unique exact fit), then there exists a minimizer of R(β) of the
form b(h) = X(h)−1y(h) if and only if, for some h ∈ H,
−τ1p ≤ ξ(h) ≤ (1 − τ)1p,
where ξ(h)T = ∑i∈h̄ ψτ{yi − xTi b(h)}xTi X(h)−1, and h̄ is the
complement of h.
Proof. In linear programming, vertex (basic) solutions correspond to
points at which p observations are interpolated, i.e.,
(y(h), X(h)) = {(yi,xi), i ∈ h}. That is, the basic solutions pass
through these p points:
b(h) = X(h)−1y(h),  h ∈ H∗ = {h ∈ H : |X(h)| ≠ 0}.
For any w ∈ Rp, reparameterize to get v = X(h)w, i.e.,
w = X(h)−1v.
For a basic solution b(h) to be the minimizer, we need, for all v ∈ Rp,
−∑ni=1 ψ∗τ{yi − xTi b(h), −xTi X(h)−1v}xTi X(h)−1v ≥ 0.  (3.3)
WLOG, assume X(h) = (xT1 , · · · ,xTp )T .
• If i ∈ h, then xTi X(h)−1 = eTi , where ei is the p-dimensional vector
of all zeros except the ith element, which equals 1. Thus eTi v = vi.
• If (y,X) are in general position, none of the residuals yi − xTi b(h)
with i ∈ h̄ is zero. If the yi’s have a density with respect to Lebesgue
measure, then with probability one (y,X) are in general position.
• The space of directions v ∈ Rp is spanned by
v = ±ek, k = 1, · · · , p. So (3.3) holds for any v ∈ Rp iff the
inequality holds for ±ek, k = 1, · · · , p.
Therefore, (3.3) becomes
0 ≤ −∑i∈h ψ∗τ{0, −vi}vi − ξ(h)Tv,  (3.4)
where ξ(h)T = ∑i∈h̄ ψτ{yi − xTi b(h)}xTi X(h)−1.
• If v = ei, we have
0 ≤ −(τ − 1)− ξi(h), i = 1, · · · , p.
• If v = −ei, we have
0 ≤ τ + ξi(h), i = 1, · · · , p.
That is,
−τ1p ≤ ξ(h) ≤ (1− τ)1p.
□
Remark 1. The sum of the score functions satisfies
‖∑ni=1 xiψτ{yi − xTi β̂(τ)}‖ ≤ Cp maxi=1,··· ,n ‖xi‖.
3.4 Bahadur Representation
A Basic Bahadur Representation
Regularity conditions
C1 The distribution functions of Y given xi, Fi(·), have continuous
densities fi(·) that are bounded away from 0 and uniformly
bounded away from ∞ in a neighborhood of ξi(τ) = Qτ (Y |xi),
i = 1, · · · , n. In addition, the first derivative of fi(·) is uniformly
bounded in a neighborhood of ξi(τ), i = 1, · · · , n.
C2 maxi=1,··· ,n ‖xi‖ = O(n1/4{log(n)}−1/2).
C3 n−1∑ni=1 ‖xi‖4 ≤ B for some finite constant B.
C4 There exist positive definite matrices D0 and D1 such that
limn→∞ n−1 ∑ni=1 xixTi = D0 and
limn→∞ n−1 ∑ni=1 fi{xTi β(τ)}xixTi = D1.
Theorem 4. Assume Conditions C1–C4 hold. Then
n1/2{β̂(τ) − β(τ)} = D1−1 n−1/2 ∑ni=1 xiψτ{yi − xTi β(τ)}
+ Op(n−1/4 √(log n)).  (3.5)
Remark 2. For linear regression models with i.i.d. errors
εi = yi − xTi β(τ) ∼ F, we have
n1/2{β̂(τ) − β(τ)} = f(0)−1 D0−1 n−1/2 ∑ni=1 xiψτ(εi) + Op(n−1/4 √(log n)).
Sketch of Proof:
• Step 1 (uniform approximation). Let C be some fixed
constant. Define
Rn(δ) = ∑ni=1 xi[ψτ(εi − xTi δ) − ψτ(εi)],
where εi := yi − xTi β(τ). Then
supδ:‖δ‖≤C ‖Rn(δ) − E{Rn(δ)}‖ = Op(n1/2(log n)‖δ‖1/2).  (3.6)
The uniform approximation can be obtained by
– applying an exponential inequality and chaining arguments,
– e.g., applying Lemma 4.1 of He and Shao (1996).
• Step 2: the consistency of β̂(τ) was proven earlier.
• Define δ = b − β(τ). Then
Rn(δ) = ∑ni=1 xi{ψτ(εi − xTi δ) − ψτ(εi)}
= ∑ni=1 xi[ψτ{yi − xTi β(τ) − xTi δ} − ψτ{yi − xTi β(τ)}]
and
E{Rn(δ)} = ∑ni=1 xi[τ − Fi{xTi β(τ) + xTi δ}]
= ∑ni=1 xi[Fi{xTi β(τ)} − Fi{xTi β(τ) + xTi δ}]
= −∑ni=1 xi[fi{xTi β(τ)}xTi δ + f′i(ηi)(xTi δ)2]
= −∑ni=1 xixTi fi{xTi β(τ)}δ + O(∑ni=1 ‖xi‖3 ‖δ‖2).
Define δ̂ = β̂(τ) − β(τ). By the uniform approximation and the
root-n consistency of β̂(τ), under Conditions C1–C3, we get
∑ni=1 xiψτ{yi − xTi β̂(τ)} − ∑ni=1 xiψτ{yi − xTi β(τ)}
= −[n−1 ∑ni=1 xixTi fi{xTi β(τ)}] nδ̂
+ Op(n‖δ̂‖2) + Op(n1/2(log n)‖δ̂‖1/2).  (3.7)
By the subgradient condition and Condition C2,
∑ni=1 xiψτ{yi − xTi β̂(τ)} = Op(p maxi=1,··· ,n ‖xi‖) = Op(n1/4/√(log n)).
From (3.7), we get δ̂ = Op(n−1/2 log n).
Therefore,
n1/2{β̂(τ) − β(τ)} = D1−1 n−1/2 ∑ni=1 xiψτ{yi − xTi β(τ)}
+ Op(n−1/4 √(log n)).  (3.8)
4 Statistical Inference
4.1 Wald-type Test
4.1.1 Asymptotic Normality
• Asymptotic normality in iid settings:
n1/2{β̂(τ) − β(τ)} →d N(0, τ(1 − τ)fε(0)−2D0−1),
where D0 = limn→∞ n−1 ∑ni=1 xixTi .
• Asymptotic normality in non-iid settings:
n1/2{β̂(τ) − β(τ)} →d N(0, τ(1 − τ)D1(τ)−1D0D1(τ)−1),
where
D1(τ) = limn→∞ n−1 ∑ni=1 fi{xTi β(τ)}xixTi .
• Asymptotic covariance across quantile levels:
Acov(√n{β̂(τi) − β(τi)}, √n{β̂(τj) − β(τj)})
= (τi ∧ τj − τiτj)D1(τi)−1D0D1(τj)−1.
4.1.2 Wald Test for General Linear Hypotheses
Define the coefficient vector θ = (β(τ1)T, · · · ,β(τm)T)T.
• Null hypothesis H0: Rθ = r.
• Test statistic
Tn = n(Rθ̂ − r)T(RV̂ RT)−1(Rθ̂ − r),
where V̂ is the mp × mp matrix with (i, j)th block
V̂(τi, τj) = (τi ∧ τj − τiτj)D̂1(τi)−1D̂0D̂1(τj)−1.
• Under H0, Tn →d χ2q, where q is the rank of R.
• Drawback: the covariance matrix involves the unknown density
functions (nuisance parameters), i.e., fi{xTi β(τ)} in non-iid
settings and fε(0) in iid settings.
• Reference: Koenker and Machado (1999).
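In R, a Wald-type test of the equality of slopes across quantile levels is available through anova on rq fits (a usage sketch):
library(quantreg)
data(engel)
f1 <- rq(foodexp ~ income, tau = 0.25, data = engel)
f2 <- rq(foodexp ~ income, tau = 0.75, data = engel)
anova(f1, f2)  # tests whether the income slope is the same at tau = 0.25 and 0.75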
4.1.3 Estimation of Asymptotic Covariance Matrix
1. IID settings:
var{n1/2β̂(τ)} = τ(1 − τ)fε(0)−2D̂0−1,
where D̂0 = n−1 ∑ni=1 xixTi .
Estimation of fε(0) = fε{Fε−1(τ)}:
• Sparsity parameter:
s(τ) = 1/f{F−1(τ)}.
• Note that F{F−1(t)} = t. Differentiating both sides with respect
to t gives
f{F−1(t)}(d/dt)F−1(t) = 1 ⇔ (d/dt)F−1(t) = s(t).
That is, the sparsity parameter s(t) is simply the derivative of the
quantile function F−1(t) with respect to t.
• Difference quotient estimator (Siddiqui, 1960):
ŝn(t) = {F̂n−1(t + hn|x̄) − F̂n−1(t − hn|x̄)}/(2hn),
where hn → 0 as n → ∞, and F̂n−1(t|x̄) is the estimated tth
conditional quantile of Y given x̄ = n−1 ∑ni=1 xi.
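A sketch of the difference quotient in R (ours; the bandwidth hn is chosen ad hoc for illustration):
library(quantreg)
data(engel)
tau <- 0.5; hn <- 0.1
xbar <- c(1, mean(engel$income))  # mean design point, with intercept
qhi <- sum(xbar * coef(rq(foodexp ~ income, tau = tau + hn, data = engel)))
qlo <- sum(xbar * coef(rq(foodexp ~ income, tau = tau - hn, data = engel)))
(qhi - qlo) / (2 * hn)  # estimated sparsity s(tau) at the mean design point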
2. Non-IID settings:
var{n1/2β̂(τ)} = τ(1 − τ)D1(τ)−1D0D1(τ)−1.
Estimation of D1(τ)
• Hendricks-Koenker Sandwich
– Suppose the conditional quantiles of Y given x are linear at
quantile levels around τ .
– Then fit quantile regressions at the (τ ± hn)th quantile levels,
obtaining β̂(τ − hn) and β̂(τ + hn).
– Estimate fi{ξi(τ)} by
f̂i{ξi(τ)} = 2hn/{xTi β̂(τ + hn) − xTi β̂(τ − hn)},
where ξi(τ) = Qτ(Y|xi).
– In finite samples, estimated quantiles may cross, so that an
upper quantile may be estimated to be smaller than a lower
one. A modified estimator accounting for this issue is
f̂i{ξi(τ)} = max(0, 2hn/{xTi β̂(τ + hn) − xTi β̂(τ − hn) − ε}),
where ε is a small positive constant to avoid a zero
denominator.
– Estimator of D1(τ):
D̂1(τ) = n−1 ∑ni=1 f̂i{ξi(τ)}xixTi .
• Powell Sandwich
– Powell (1991): a kernel estimator of D1(τ):
D̂1(τ) = (nhn)−1 ∑ni=1 K{ε̂i(τ)/hn}xixTi ,
where ε̂i(τ) = yi − xTi β̂(τ), and hn is a bandwidth
parameter satisfying hn → 0 and n1/2hn → ∞ as n → ∞.
– Under certain continuity conditions on fi, D̂1(τ) →P D1(τ).
Implementation in R quantreg package
fit <- rq(y ~ x, tau = tau)
# assuming iid errors
summary(fit, se = "iid")
# assuming non-iid errors: Hendricks-Koenker sandwich
summary(fit, se = "nid")
# based on the Powell sandwich
summary(fit, se = "ker")
4.2 Rank Score Test
Consider the model
Qτ (Y |xi, zi) = xTi β(τ) + zTi γ(τ),
and hypotheses
H0: γ(τ) = 0 vs. Ha: γ(τ) ≠ 0.
Here β(τ) ∈ Rp and γ(τ) ∈ Rq.
Score function:
Sn = n−1/2 ∑ni=1 z∗i ψτ{yi − xTi β̂(τ)},
where
• ψτ (u) = τ − I(u < 0);
• Z∗ = (z∗i ) = Z −X(XTΨX)−1XTΨZ,
• Ψ = diag (fi{Qτ (Y |xi, zi)});
• β̂(τ) is the quantile coefficient estimator obtained under H0.
Asymptotic property: under H0, as n → ∞,
Sn is asymptotically N(0, Mn),  (4.1)
where Mn = τ(1 − τ) n−1 ∑ni=1 z∗i z∗Ti .
Rank-score test statistic:
Tn = SnT Mn−1 Sn →d χ2q under H0.
Simplification for i.i.d. settings
• Z∗ = (z∗i ) = {I − X(XTX)−1XT}Z: the residuals from
projecting Z on X;
• Mn = τ(1 − τ) n−1 ∑ni=1 z∗i z∗Ti ;
• so there is no need to estimate the nuisance parameters
fi{Qτ(Y|xi, zi)}.
Construction of a confidence interval (CI) for γ(τ)
• γ(τ) is a scalar parameter of interest, corresponding to one of the
covariates.
• A CI for γ(τ) can be constructed by inverting the rank score test.
• Consider the hypotheses
H0: γ(τ) = γ0 vs. Ha: γ(τ) ≠ γ0,
where γ0 is a prespecified scalar.
• Reject H0 if Tn exceeds χ21−α(1), the (1 − α)th quantile of the
χ2(1) distribution.
• The collection of all γ0 for which H0 is not rejected is taken to
be the (1 − α) CI for γ(τ).
• Reference: Gutenbrunner et al. (1993).
Implementation in R quantreg package
fit <- rq(y ~ x, tau = tau)
# assuming iid errors
summary(fit, se = "rank", alpha = 0.05, iid = TRUE)
# assuming non-iid errors
summary(fit, se = "rank", alpha = 0.05, iid = FALSE)
# Output: estimates and (1 - alpha) CIs for each coefficient
4.3 Resampling and Bootstrap
4.3.1 Residual bootstrap
Appropriate for i.i.d. errors (location-shift models).
• Obtain the estimator β̂(τ) from the observed sample, and compute
the residuals ε̂i = yi − xTi β̂(τ).
• Draw bootstrap errors ε∗i , i = 1, · · · , n, with replacement from
{ε̂i, i = 1, · · · , n}, and define y∗i = xTi β̂(τ) + ε∗i .
• Compute the bootstrap estimator β̂∗(τ) by quantile regression
using the bootstrap sample.
• Carry out inference by estimating the covariance of β̂(τ) with the
sample covariance of the bootstrap estimators, or by constructing
CIs with percentile methods.
• Reference: De Angelis et al. (1993); Knight (2003).
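A minimal sketch of the residual bootstrap (ours, on simulated data):
library(quantreg)
set.seed(7)
n <- 200; tau <- 0.5
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- rq(y ~ x, tau = tau)
e <- resid(fit); yhat <- fitted(fit)
bstar <- replicate(500, {
  ystar <- yhat + sample(e, n, replace = TRUE)  # resample residuals
  coef(rq(ystar ~ x, tau = tau))
})
apply(bstar, 1, sd)  # bootstrap standard errors of intercept and slope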
4.3.2 Paired bootstrap
• Generate a bootstrap sample (y∗i , x∗i ) by drawing with replacement
from the n pairs {(yi,xi), i = 1, · · · , n}.
• Obtain the bootstrap estimator β∗(τ) by quantile regression using
the bootstrap sample.
• Reference: Andrews and Buchinsky (2000, 2001)
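In quantreg, the paired (xy) bootstrap is available through summary, whose extra arguments are passed on to boot.rq (a usage sketch, with y and x as in earlier examples):
fit <- rq(y ~ x, tau = 0.5)
summary(fit, se = "boot", bsmethod = "xy", R = 500)  # paired bootstrap, 500 replications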
4.3.3 MCMB
Markov chain marginal bootstrap (MCMB; He and Hu, 2002;
Kocherginsky et al., 2005). Instead of solving a p-dimensional
estimating equation for each bootstrap replication, MCMB solves p
one-dimensional estimating equations. Model:
Qτ (Yi|xi) = xTi β(τ), β(τ) ∈ Rp,
where xi = (xi,1, · · · , xi,p)T .
Procedure
(i) Calculate ri = yi − xTi β̂(τ). Define zi = xiψτ(ri) − z̄, where
z̄ = n−1 ∑ni=1 xiψτ(ri) and ψτ(r) = τ − I(r < 0).
(ii) Step 0: let β(0) = β̂(τ).
(iii) Step k: for each integer 1 ≤ j ≤ p in ascending order, draw with
replacement from z1, · · · , zn to obtain z1(j,k), · · · , zn(j,k). Solve
for βj(k) as the solution to
∑ni=1 xi,j ψτ{yi − ∑l<j xi,lβl(k) − ∑l>j xi,lβl(k−1) − xi,jβj(k)} = ∑ni=1 zi,j(j,k),
where zi,j(j,k) denotes the jth component of zi(j,k).
(iv) Repeat Step (iii) until K replications β(k), k = 1, · · · ,K, are
obtained. The variance of β̂(τ) is then estimated by the sample
variance of {β(k), k = 1, · · · ,K}.
The MCMB method is implemented in both the R package quantreg
and SAS.
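For instance (a usage sketch, with y and x as in earlier examples; bsmethod = "mcmb" selects MCMB in boot.rq):
fit <- rq(y ~ x, tau = 0.5)
summary(fit, se = "boot", bsmethod = "mcmb", R = 500)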
Other options
• Bootstrap estimating equations: Parzen and Ying (1994).
• Wild bootstrap: Feng et al. (2011).
References
Andrews, D. W. K. and Buchinsky, M. (2000), “A Three-Step Method for
Choosing the Number of Bootstrap Repetitions,” Econometrica, 68,
23–51.
— (2001), “Evaluation of a Three-Step Method for Choosing the Number of
Bootstrap Repetitions,” Journal of Econometrics, 103, 345–386.
Barrodale, I. and Roberts, F. (1974), “Solution of an Overdetermined
System of Equations in the L1 Norm,” Communications of the ACM, 17,
319–320.
De Angelis, D., Hall, P., and Young, G. A. (1993), “Analytical and
Bootstrap Approximations to Estimator Distributions in L1 Regression,”
Journal of the American Statistical Association, 88, 1310–1316.
Feng, X., He, X., and Hu, J. (2011), “Wild Bootstrap for Quantile
Regression,” Biometrika, 98, 995–999.
Gutenbrunner, C., Jureckova, J., Koenker, R., and Portnoy, S. (1993),
“Tests of Linear Hypotheses Based on Regression Rank Scores,” Journal
of Nonparametric Statistics, 2, 307–333.
He, X. and Hu, F. (2002), “Markov Chain Marginal Bootstrap,” Journal of
the American Statistical Association, 97, 783–795.
He, X. and Shao, Q. M. (1996), “A General Bahadur Representation of
M-Estimators and Its Application to Linear Regression With
Nonstochastic Designs,” The Annals of Statistics, 24, 2608–2630.
Knight, K. (1998), “Limiting distributions for L1 regression estimators under
general conditions,” The Annals of Statistics, 26, 755–770.
— (2003), “On the Second Order Behaviour of the Bootstrap of L1
Regression Estimators,” Journal of the Iranian Statistical Society, 2,
21–42.
Kocherginsky, M., He, X., and Mu, Y. (2005), “Practical Confidence
Intervals for Regression Quantiles,” Journal of Computational and
Graphical Statistics, 14, 41–55.
Koenker, R. and Bassett, G. (1982), “Robust tests for heteroscedasticity
based on regression quantiles,” Econometrica, 50, 43–61.
Koenker, R. and d’Orey, V. (1987), “Computing Regression Quantiles,”
Applied Statistics, 36, 383–393.
Koenker, R. and Machado, J. A. F. (1999), “Goodness of Fit and Related
Inference Processes for Quantile Regression,” Journal of the American
Statistical Association, 94, 1296–1310.
Koenker, R. and Ng, P. (2003), “SparseM: A Sparse Matrix Package for R,”
Journal of Statistical Software, 8, 1–9.
Parzen, M. I., Wei, L. J., and Ying, Z. (1994), “A Resampling Method Based
on Pivotal Estimating Functions,” Biometrika, 81, 341–350.
Pollard, D. (1991), “Asymptotics for least absolute deviation regression
estimators,” Econometric Theory, 7, 186–199.
Portnoy, S. and Koenker, R. (1997), “The Gaussian Hare and the Laplacian
Tortoise: Computability of Squared-Error Versus Absolute-Error
Estimators, with Discussion,” Statistical Science, 12, 279–300.
Powell, J. L. (1991), “Estimation of Monotonic Regression Models under
Quantile Restrictions,” in Nonparametric and Semiparametric Methods in
Econometrics, eds. Barnett, W., Powell, J., and Tauchen, G., Cambridge:
Cambridge University Press.
Siddiqui, M. (1960), “Distribution of Quantiles from a Bivariate Population,”
Journal of Research of the National Bureau of Standards, 64, 145–150.