The structured elastic net for quantile regression and ... · For quantile regression, the quantity inffy: P(Y yjX = x) ˝gis called the ˝-quantile of the conditional distribution

The structured elastic net for quantile regression

and support vector classification

Martin Slawski

Machine Learning GroupDepartment of Computer Science, Saarland University, Saarbrucken, [email protected]

Abstract

In view of its ongoing importance for a variety of practical applications, featureselection via `1-regularization methods like the lasso has been subject to extensive the-oretical as well empirical investigations. Despite its popularity, mere `1-regularizationhas been criticized for being inadequate or ineffective, notably in situations in whichadditional structural knowledge about the predictors should be taken into account.This has stimulated the development of either systematically different regularizationmethods or double regularization approaches which combine `1-regularization with asecond kind of regularization designed to capture additional problem-specific structure.One instance thereof is the ’structured elastic net’, a generalization of the proposal inZou and Hastie [2005], studied in Slawski et al. [2010] for the class of generalized linearmodels.In this paper, we elaborate on the structured elastic net regularizer in conjunction withtwo important loss functions, the check loss of quantile regression and the hinge loss ofsupport vector classification. Solution paths algorithms are developed which computethe whole range of solutions as one regularization parameter varies and the second oneis kept fixed.The methodology and practical performance of our approach is illustrated by meansof case studies from image classification and climate science.keywords: double regularization, elastic net, quantile regression, solution path, supportvector classification

1 Introduction

Adopting the framework in Slawski et al. [2010], we let X = (X1, . . . , Xp)> be a randomvector of real-valued predictor variables and let Y be a random response variable takingvalues in a set Y. The aim is to model a functional ψ[Y |X] of the conditional distributionof Y |X via a linear predictor f(X;β∗0 ,β

∗) = β∗0 +X>β∗ and a function ζ : R→ Y such thatψ[Y |X] = ζ(f(X;β∗0 ,β

∗)). Given an i.i.d. sample S = (xi, yi)ni=1, where xini=1 andyini=1 are realizations of X and Y , respectively, we try to infer β∗0 ,β

∗ using regularizedrisk estimation, i.e. we determine estimators β0, β as minimizers

(β0, β) = argmin(β0,β)

n∑i=1

Lψ(yi, f(xi;β0,β)) + λΩ(β), λ > 0, (1.1)

where Lψ : Y × R→ R+0 is a loss function related to ψ and Ω : Rp → R+

0 is a regularizer.In this paper, we treat quantile regression (QR) and support vector classification (SVC),for which the relevant quantities are listed in Table 1.

1

[email protected]

Y ζ ψ Lψ

QR R idR infy : P(Y ≤ y|X = x) ≥ τ, τ ∈ (0, 1) τ [y − f ]+++(1− τ)[f − y]+

SVC −1, 1 sign sign(P(Y = 1|X = x)− 1/2) [1− yf ]+

Table 1: Key quantities for quantile regression (QR) and support vector classification(SVC). For any z ∈ R, we define [z]+ = max(0, z).

For quantile regression, the quantity infy : P(Y ≤ y|X = x) ≥ τ is called the τ -quantile ofthe conditional distribution of Y given X. Varying τ from 0 to 1 allows one to characterizethat conditional distribution as a whole, which may be more informative than the usualapproach of modeling the conditional expectation only. Choosing τ = 0.5 yields leastabsolute deviation regression. For detailed information about quantile regression and itsapplications, we refer to Koenker [2005] and the references therein.Turning to SVC, the loss is known as hinge loss or soft margin loss in the literature (Bennettand Mangasarian [1993]). Like the traditionally used logistic loss, the hinge loss is a convexsurrogate to the 0-1 loss in the sense of Bartlett et al. [2006]. The result that the hingeloss satisfies

argminf

EY |X=x[[1− Y f(x)]+] = sign(P(Y = 1|X = x)− 1/2),

i.e. the population minimizer is the Bayes rule, was proved in Lin [2002]. This propertyis not shared by the logistic loss which targets the log-odds. Estimating the log-odds islittle effective for regions far distant from the decision boundary, which is attributed fullattention by the hinge loss. Its decision boundary is determined by usually a small fractionof the sample, the so-called support vectors. If the estimation of conditional class prob-abilities is secondary, these facts make SVC more attractive for classification than, e.g.,logistic regression.The most common choice for the regularizer Ω(β) in Eq. (1.1) is Ω(β) = ‖β‖22 whichin combination which the hinge loss yields the standard formulation of the soft-marginSVC and its generalizations to reproducing kernel Hilbert spaces, see the monographsof Christianini and Shawe-Taylor [2000], Scholkopf and Smola [2002] and Steinwart andChristmann [2008]. QR in reproducing kernel Hilbert spaces is studied in Takeuchi et al.[2006] and Li et al. [2007]. While reproducing kernel Hilbert space methods aim at moreflexibility by implicitly mapping the predictor variables into a space of higher dimension,the goal of the `1- or lasso (Tibshirani [1996]) regularizer Ω(β) = ‖β‖1 is dimension reduc-tion by feature selection, yielding a minimizer β typically containing only a few non-zerocoefficients, thus being a computationally tractable alternative to best subset selection ifthe number of predictors p is large. In conjunction with the hinge loss, the resulting clas-sifier is known as linear programming machine (Bradley and Mangasarian [1998]). In thispaper, we equip QR and SVC with the structured elastic net, a combined `1-`2 regularizeroriginally introduced and studied with a focus on simple exponential family models for theresponse in Slawski et al. [2010].Additional strong motivation for studying QR and SVC is given by the applicabilityof piecewise linear regularized solution path algorithms (Rosset and Zhu [2007]), whichtremendously facilitate model selection. This issue is studied in detail in Section 3.The rest of this paper is organized as follows: Section 2 recalls the essential ideas of thestructured elastic net regularizer and illustrates the key methodological concepts of thepaper by means of an application to feature extraction in image classification. Section 4contains two further data examples. We conclude with a short discussion in Section 5.

2

2 The structured elastic net regularizer

Apart from few modifications, the material of Subsection 2.1 can be found in Slawski et al.[2010]; it is included here to achieve a self-contained treatment.

2.1 Double regularization

While there exists a large body of work about optimality properties of the lasso, severalmodifications have been suggested with the purpose to make `1-regularization behave moreeffectively in certain scenarios. For example, one important task in gene expression analysisis to identify a set of genes predictive for some phenotype. Concerning this problem, Zouand Hastie [2005] argue that the ’elastic net’-regularizer Ω(β) = α ‖β‖1+(1−α) ‖β‖22 , α ∈(0, 1) possesses a crucial advantage over the lasso: the combined `1-`2 regularizer allowsone to select whole groups of strongly correlated predictors, thereby enabling the scientistto reveal potentially interacting genes in the context of biological pathways. In a similarspirit, Bondell and Reich [2008] complement `1-regularization with a second regularizerwhich additionally clusters the set of predictors on the basis of their estimated regressioncoefficients. A second line of research has aimed at the explicit inclusion of structuralknowledge about the predictors. One simple form of such structural knowledge is an orderrelation Xj ≺ Xj′ ⇔ j < j′, which is satisfied, e.g., when the predictors represent a signalsampled at different time points. This situation is addressed by the ’fused lasso’ procedureof Tibshirani et al. [2005] who propagate

Ω(β) = α ‖β‖1 + (1− α) ‖Dβ‖1 , α ∈ (0, 1),

where

D : Rp → Rp−1

(β1, . . . , βp)> 7→ ([β2 − β1], . . . , [βp − βp−1])>(2.1)

is the first forward difference operator. Adding the total variation ‖Dβ‖1 of β as secondregularizer yields a sequence of estimators β1, . . . , βp, which is piecewise constant. Al-though the fused lasso simplifies interpretation, because it automatically clusters the setof predictors, one can argue that it yields a sequence β1, . . . , βp, which is too rough inthe sense that large deviations |βj − βj−1|, j = 2, . . . , p, are penalized less severely as inthe case one would take ‖Dβ‖22 instead of ‖Dβ‖1. This prompts the introduction of thestructured elastic net regularizer

Ω(β) = α ‖β‖1 + (1− α)β>Λβ, α ∈ (0, 1), (2.2)

where Λ is a p× p symmetric and positive semidefinite matrix. Setting Λ = D>D yieldsthe sum of the squared forward differences as second regularizer. In general, the twoingredients of the structured elastic net regularizer are differences of pairs of coefficientsand a weight function w : 1, . . . , p2 → R satisfying w(j, j) = 0 and w(j, j′) = w(j′, j) forall j, j′. The generic form of the regularizer is then given by

12

p∑j=1

p∑j′=1

|w(j, j′)| (βj − signw(j, j′) βj′)2 = β>Λβ, (2.3)

where Λ has entries

ljj′ =

∑pk=1 |w(j, k)| if j = j′,

−w(j, j′) otherwise.j, j′ = 1, . . . , p. (2.4)

3

One may think of |w(j, j′)| as the strength of the prior association of the predictors j, j′

such that |w(j, j′)| → ∞ enforces βj = sign(w(j, j′))βj′ . If w(j, j′) ≥ 0 for all j, j′, it is nothard to verify that constant vectors are contained in the null space of Λ. The regularizer(2.3) is versatile since it can handle any kind of neighbourhood relation, in particular spa-tial or temporal ones. Combined with an `1 regularizer, we aim at the selection of relevantsubstructures. The structured elastic net regularizer covers the graph-constrained estima-tion procedure of Li and Li [2010] as well as the correlation-based penalization method ofTutz and Ulbricht [2009], see also El Anbari and Mkhadri [2008], as special cases. Thesereferences also point to interesting applications in which, as opposed to those in this paper,Λ is not constructed from temporal or spatial neighbourhood information.We would like to clarify that the form of regularization employed in this paper is fundamen-tally different from the group lasso (Yuan and Lin [2006]) or the composite absolute penaltyfamily (Zhao et al. [2009]) in which groups are fixed prior to estimation, whereas the struc-tured elastic net only encourages grouping of predictors according to specific structuralprior knowledge.

2.2 Illustration

This subsection is intended to illustrate the main concepts of the paper by means of sim-ulated data mimicking a functional magnetic resonance imaging (FMRI) experiment in asimplistic way. FMRI is a technology widely used in human brain mapping, an area ofresearch where scientists try to associate each brain region with the primary task it isresponsible for. This can be achieved by identifying those brain regions which becomeactivated as reaction to specific experimental stimuli. The output of FMRI is a signalsampled on a three-dimensional grid of voxels, each corresponding to one small sectionof the brain. The fourth dimension of this kind of data is time, since periods with andwithout stimuli alternate. In the artificial example presented here, the problem is onlytwo-dimensional, because only one time point and one slice of the brain are considered.Suppose we are given ten different scans of the same brain region, five corresponding to anactive state, at which the experimental stimulus is present, and the remaining five to aninactive state. The task is to construct a system discriminating between scans of the twostates and extracting brain regions with high discriminatory power, thereby potentiallylocating the regions showing a reaction to the experimental stimulus. With the notationof Section 1, we have Y = −1, 1 with 1 corresponding to the active state and −1 tothe inactive state, respectively. Each scan be seen as a realization of a random vectorX = (Xt)t∈D, where D is a finite set of points contained in a bounded, connected subsetof R2. The data displayed in Figure 1 are generated according to the model

X|Y = y ∼ N(µy,Σ + τ2I),

µ−1 = (0, . . . , 0)>, µ1 = (µt)t∈D, µt =

ct t ∈ A,0 otherwise.

t ∈ D,

where the ct > 0 and A ⊂ D is the set of points in D belonging to the brain regionresponding to the experimental stimulus and the symmetric positive definite matrix Σ hasentries

γ(∥∥t− t′∥∥), t, t′ ∈ D, γ(h) =

12κ−1Γ(κ)

(h

φ

)κKκ

(h

φ

).

The function γ is the covariance function of the Matern family depending on a smoothnessparameter κ > 0 and a range parameter φ > 0 (Stein [1999]). Kκ denotes the modifiedBessel function of the third kind of order κ. To be specific, the covariance parameters are

4

Figure 1: The artificial FMRI data generated according to the model described in the text.The ten scans are displayed on the right half of the figure. The five scans with stimulusare shown in the rightmost column, and the scan of the top right column is zoomed at inthe upper left half of the figure. It is characterized by one eminent bright spot, which isthe region of activation A. The shape of the grid D is depicted in the lower left half of thefigure, in which A is marked by a rectangle.

chosen as τ2 = 1, κ = 32 and φ =

√2. The shape of the grid D and the activation region A

are taken from the dataset brain contained in the R package gamair (Wood [2006]). Thesedata were originally analyzed in Landau et al. [2003]. We address the joint classification-feature selection problem using SVC endowed with the structured elastic net regularizerbased on the weight function

w(t, t′) =

1 if t1 = t′1 and t′2 ∈ t2 − 1, t2 + 1

or t2 = t′2 and t′1 ∈ t1 − 1, t1 + 10 otherwise.

(2.5)

The resulting neighbourhood graph is depicted in the lower left panel of Figure 1. The re-gression coefficients βt, βt′ of adjacent pixels (w(t, t′) = 1) are hence enforced to be similar,thus incorporating a prior assumption of spatial smoothness. The `1 part of the structuredelastic net takes the sparse nature of the problem into account, since it is known that thestimulus only affects certain regions of the brain.

5

0.00 0.05 0.10 0.15 0.20 0.25

0.00

0.01

0.02

0.03

0.04

0.05

s

β(s)

*** * * * ** * * *** ** * ** * * *** * * * * ** * ***** * * * ** * * *** ** * ** * * *** * * * * ** * ***** * * * ** * * *** ** * ** * * *** * ** * ** * **

*** * * * ** * * *** **

*** *

* ***

* ** * ** * **

*** * * * ** * *

*****

*** * * *** * * * * ** * **

*** * *

*

**

* *

***** *

***

* **** * * * ** * **

***

**

*

**

**

***

***

***

****

* ** * ** * **

*** * * * ** * * *** ** * ** * * *** **

* * ** * **

*** * * *

**

* ****

***

***

****

** * * ** * **

*** *

*

*

**

**

***

**

*

***

****

* ** * ** * **

1 2 3 4 5

0 10 20 30 40

010

2030

40

1

0 10 20 30 40

010

2030

40

2

0 10 20 30 40

010

2030

40

3

0 10 20 30 40

010

2030

40

4

0 10 20 30 40

010

2030

40

5

36.0 36.5 37.0 37.5 38.0 38.5 39.0

2930

3132

3334

35

1

36.0 36.5 37.0 37.5 38.0 38.5 39.0

2930

3132

3334

35

2

36.0 36.5 37.0 37.5 38.0 38.5 39.0

2930

3132

3334

35

3

36.0 36.5 37.0 37.5 38.0 38.5 39.0

2930

3132

3334

35

4

36.0 36.5 37.0 37.5 38.0 38.5 39.0

2930

3132

3334

35

5

Figure 2: Solution path and coefficient surfaces (t, βt), t ∈ D for the artificial FMRIdataset when the amount of regularization in β>Λβ is kept fixed and the amount ofregularization in ‖β‖1 decreases from left to right. The piecewise linear path is in s = ‖β‖1.Each breakpoint of the piecewise linear path is marked by a star. The five numbered verticallines indicate the choices of s for which a coefficient surface is displayed at two differentlevels of resolution. The size of the dots is proportional to the size of the coefficient at therespective location. While the upper row displays the whole grid D, the lower row zoomsat the region A coloured in grey. The places where µ1 is different from zero are emphasizedby white squares.

The combination of two regularizers comes at the expense of the difficulty to choose twotuning parameters which in general entails a two-dimensional grid search. However, whenusing one of the two loss functions considered in this paper, a significant simplificationarises from the fact that for one of the two tuning parameters kept fixed, the whole rangeof solutions can be obtained by tracking a piecewise linear solution path (Rosset and Zhu[2007]). As a consequence, a reduction to a one-dimensional grid search is possible. Morespecifically, for k-fold cross-validation, one specifies a grid of values to be explored for oneof the two tuning parameters; in an outer loop, one runs through the different folds, fix-ing a training and a test sample (the sample hold out) each time. In an inner loop, one

6

runs through the grid points. Having one of the two parameters fixed within each inneriteration, a solution path of the second parameter is computed. Figures 2 and 3 show twoselected solution paths and the corresponding coefficient surfaces for the artificial FMRIdataset.

0.0 0.5 1.0 1.5 2.0

0.00

0.02

0.04

0.06

0.08

0.10

0.12

1 λ2

β(1

λ 2)

*********** *** * * ************ *** * * ************ *** * * *

****

******* *** * * *

*****

****** *** * * *

***********

***

*

*

*

****

*

***

*

*****

*

* *

*********** *** * * *

*********

***** * * *

*********

**

***

*

*

*

1 2 3 4 5

0 10 20 30 40

010

2030

40

1

0 10 20 30 40

010

2030

40

2

0 10 20 30 40

010

2030

40

3

0 10 20 30 40

010

2030

40

4

0 10 20 30 40

010

2030

40

5

36.0 36.5 37.0 37.5 38.0 38.5 39.0

2930

3132

3334

35

1

36.0 36.5 37.0 37.5 38.0 38.5 39.0

2930

3132

3334

35

2

36.0 36.5 37.0 37.5 38.0 38.5 39.0

2930

3132

3334

35

3

36.0 36.5 37.0 37.5 38.0 38.5 39.0

2930

3132

3334

35

4

36.0 36.5 37.0 37.5 38.0 38.5 39.0

2930

3132

3334

35

5

Figure 3: Solution path and coefficient surfaces for the artificial FMRI dataset when theamount of regularization in ‖β‖1 is kept fixed and the amount of regularization in β>Λβdecreases from left to right. The piecewise linear path is in 1/λ2 = 1/(λ(1 − α)). Thestructure of the figure is identical to that of Figure 2.

The two figures clearly demonstrate the influence as well as the interplay of the two reg-ularizers. The `1-regularizer keeps most coefficients down at zero, thereby highlightinglocations in the region that matters. Figure 3 shows that when reducing the amount ofquadratic regularization, one obtains a stronger concentration on single locations, whichreveals that the quadratic regularizer is indeed beneficial when the selection of whole neigh-bourhoods of pixels is desired.

7

3 Solution path algorithms

The two path-following algorithms presented in the next two subsections rely on method-ology developed in a series of papers discussing solution paths for the case that eitheran `1-regularizer or a quadratic regularizer is employed. More specifically, solution pathsfor `1-regularized SVC and QR are described in Zhu et al. [2003] and Li and Zhu [2008],respectively, while the second alternative is studied in Hastie et al. [2004] and Li et al.[2007], respectively. The approach of this paper is closely related to the work of Wanget al. [2006] who treat SVC with the elastic net regularizer. The authors prove that aslong as the amount of `1- or `2-regularization is kept fixed, the construction of piecewiselinear solution paths is possible, which coincides with the approach pursued in this paper.In view of the fact that the solution paths for SVC and QR parallel each other, we unifytheir description for the sake of brevity, at the cost of a slightly more involved notation.For better guidance, Figure 4 displays most of it at a glance. Throughout this section,we use the following notational conventions. For a vector a = (aj)j=1,...,N , a matrixA = (ajj′) j=1,...,N,

j′=1,...,N ′, index sets J ⊆ 1, . . . , N and J ′ ⊆ 1, . . . , N ′ we define

aJ = (aj)j∈J , AJ = (ajj′)j∈J , AJ ′ = (ajj′)j′∈J ′ , AJ ′J = (ajj′)j∈J,j′∈J ′ .

3.1 Solution path in ‖βββ‖1

In this subsection, we will show that the function

(β0(s), β(s)) = argminβ0,β

Lψ(yi, f(xi;β0,β)) +λ2

2β>Λβ, λ2 > 0

subject to ‖β‖1 ≤ s(3.1)

is well-defined and piecewise linear if Lψ is chosen as one of the loss functions of Table 1and regularity Assumption 1 given below holds. Note that above formulation is equivalentto a structured elastic net problem in the sense that for each s in Equation (3.1) there exista corresponding Lagrangian multiplier λ1(s), which will be shown to depend in a piecewiseconstant manner on s, such that

λ(s) = λ1(s) +λ2

2, α(s) =

λ1(s)λ1(s) + λ2

2

.

In the sequel, we tend to suppress dependence on s whenever it is clear from the context.

3.1.1 KKT conditions

Before developing the algorithm, we derive the Karush-Kuhn-Tucker (KKT) conditions ofthe convex optimization problem which is reformulated as the following quadratic program:

Minimize (1− τ)n∑i=1

ξi + τ

n∑i=1

ξi +λ2

2β>Λβ

subject to(C.1): ‖β‖1 ≤ s,(C.2): mi + ξi ≥ m0, i = 1, . . . , n,

(C.3): mi − ξi ≤ m0, i = 1, . . . , n,

(C.4): ξi, ξi ≥ 0, i = 1, . . . , n,

(3.2)

8

m−2 −1 0 1 2

ξξ~ξξ

SVCQR ττ == 0.5QR ττ == 0.2QR ττ == 0.8

m0

Figure 4: Graphical overview on the notation used in Subsections 3.1 and 3.2. The figuresketches the hinge loss of SVC and the check loss of QR for three different choices of τ asfunctions of the margin m.

where for QR, we have

mi = yi − f(xi;β0,β), m0 = 0,

ξi = [f(xi;β0,β)− yi]+, ξi = [yi − f(xi;β0,β)]+, i = 1, . . . , n.

On the other hand, for SVC, the relevant quantities are given by

mi = mC(yi, f(xi;β0,β)) = yif(xi;β0,β), m0 = 1, ξi = [1−mi]+, i = 1, . . . , n,

and we require further that τ = 0 such that the set of auxiliary variables ξini=1 and’constraint’ (C.3) are in fact not needed.The primal Lagrangian of the program (3.2) has the form

(1− τ)n∑i=1

ξi + τ

n∑i=1

ξi + λ1(‖β‖1 − s) +λ2

2β>Λβ +

+n∑i=1

γi(mi −m0 − ξi)−n∑i=1

ηi(mi −m0 + ξi)−n∑i=1

µiξi −n∑i=1

µiξi,

(3.3)

where ηini=1, γini=1, µini=1, µini=1 and λ1 are nonnegative Lagrangian multipliers.Denote the active set by V = j : βj 6= 0. Differentiating the primal Lagrangian (3.3)with respect to β0, βV , ξ = (ξi)i=1,...,n, ξ = (ξi)i=1,...,n and setting the result equal to zero,we obtain the following KKT conditions:

λ2ΛVV βV + λ1σ(V)− [XV ]>θ = 0|V|,

θ>1n = 0,(1− τ)1n − η − µ = 0n,

τ1n − γ − µ = 0n,

(3.4)

9

X = ((xi)j)i=1,...,n, j=1,...,p

σ(V) = (sign(βj))j∈V , η = (ηi)i=1,...,n, γ = (γi)i=1,...,n,(3.5)

and

θ = (θi)i=1,...,n =

γ − η for QR,diag(y1, . . . , yn)η for SVC,

noting that τ = 0 implies that γ = µ = 0n. The KKT conditions additionally include theconstraints

γi(mi −m0 − ξi) = 0,ηi(mi −m0 + ξi) = 0,µiξi = 0,

µiξi = 0, i = 1, . . . , n.

(3.6)

These constraints allow a division of the elements of the sample S into three subsets. Com-bining information in Equations (3.2), (3.4) and (3.6), we have the following implications.

(1) mi < m0 ⇒ ξi > 0, ξi = 0 ⇒ µi = 0, γi = 0 ⇒ θi =

−(1− τ) for QR,±1 for SVC,

(2) mi = m0 ⇒ ξi = ξi = 0 ⇒ µi, µi ≥ 0 ⇒ ηi ≤ 1− τ, γ ≤ τ

⇒

−(1− τ) ≤ θi ≤ τ for QR,−1 ≤ θi ≤ 1 for SVC.

(3) mi > m0 ⇒ ξi > 0, ξi = 0 ⇒ µi = 0, ηi = 0 ⇒ θi = τ, i = 1, . . . , n.(3.7)

We introduce sets L = i : mi < m0, E = i : mi = m0, R = i : mi > m0 and define

Θ(L) =

−(1− τ) for QR±1 for SVC

, Θ(R) = τ. (3.8)

As conclusion which turns out to be essential for the next two subsections, we obtain thatfor elements in S belonging to the set L and R, we know the values of the dual variablesθi, while for the elements in E , the θi are between Θ(L) and Θ(R) and further that theirmargins mi are all equal to the value m0, which corresponds to the only non-smooth pointof Lψ as a function of f . The main principle of the algorithm is to identify changes in thesets E and V, which correspond to the breakpoints of the piecewise linear path. Once abreakpoint is encountered, a new direction and a stepsize for the solution path is calculatedsuch that the KKT conditions continue to hold.

3.1.2 Regularity conditions

For the algorithm to work as formulated in Subsections 3.3 and 3.4, we require the followingconditions to hold.

Assumption 1.

(i) At every step of the algorithm, the matrix XVE has rank |E|, where X is defined inEq. (3.5).

10

(ii) The matrix Λ is positive definite.

(iii) At every step of the algorithm, precisely one of the sets E and V changes by preciselyone element.

Condition (i) and (ii) guarantee that the solution of the linear system (3.12) below is unique.Note that condition (i) claims in particular that |E| ≤ |V|. While we have required Λ to bepositive semidefinite in Section 2, positive definiteness is achieved by adding some smallvalue to the diagonal. The third condition is related to the ’one-at-a-time-condition’ inEfron et al. [2004]. We conjecture that a relaxation of condition (iii) is possible. However,practical experience has shown that multiple changes in E are typically irreconciliable withcondition (i).

3.1.3 Initialization

The initalization (s = 0) is actually the most critical part of the algorithm since in generalβ0(0) or the next direction the algorithm proceeds into are not determined uniquely. ForQR, as pointed out by Li and Zhu [2008], β(0) = 0p and β0(0) equals the sample τ -quantileof the yini=1. The latter is unique and attained by one of the yini=1 if and only if nτis not an integer. Let π(1), . . . , π(n) be an index permutation of 1, . . . , n such thatyπ(1) ≤ . . . ≤ yπ(n), then we set

β0(0) = ybi, i = π(bnτc+ 1) ⇒ E = i, L = i : yi < β0(0), R = i : yi > β0(0),

where we exclude ties in yini=1. Otherwise,

β0(0) ∈ [ybl, ybr], l = π(bnτc), r = π(bnτc+ 1).

We circumvent this issue by setting β0(0) equal to the left endpoint ybl, likewise E = land accordingly L and R. Equations (3.8) and (3.4) allow us to compute the θini=1. Thelatter are in turn used to compute the generalized correlations cjpj=1, the Lagrangianmultiplier λ1 and the active set V according to the first equation in the system (3.4) as

cj =n∑i=1

θi (xi)j , j = 1, . . . , p, λ1 = maxj=1,...,p

|cj |, V = j : |cj | = λ1. (3.9)

For the same reasons as for condition (iii) in Assumption 1, we assume that V consists ofprecisely one element.For support vector classification, the situation is even more difficult (Zhu et al. [2003]).Denoting by I+ = i : yi = 1 and by I− = i : yi = −1 the indices of the samplesbelonging to the positive and negative class respectively, then β0(0) = 1 if |I+| > |I−| andβ0(0) = −1 if |I−| > |I+|. Consequently, E = I+ or E = I−, which leads to a violationof condition (i) in Assumption 1. We resort to a workaround communicated by Ji Zhu,who suggests to add a small amount of jitter to the yi of the majority class, generatingslightly perturbed responses yini=1. Plugging these yini=1 into the optimization problem(3.2) yields a solution β0(0) such that the set E consists of precisely one element. We thenproceed as in Eq. (3.9) to obtain λ1 and V. A closely related workaround is employed inWang and Shen [2005].

11

3.1.4 Main algorithm

The principle of the algorithm can be summarized as follows. Given an initial solutionq(0) = (β0(0),θ(0)>, λ1(0))> together with an initial configuration of the sets L, E , Rand V, we determine the smallest δs such that the structure of the KKT conditions (3.4)and (3.6) belonging to optimization problem (3.2) with a specific choice of s is differentfrom the same problem with s + δs. We call these optimization problems different in thestructure if one of the following conditions holds:

(1) V(s) 6= V(s+ δs)(2) E(s) 6= E(s+ δs)

(3.10)

Note that, by continuity, a change in E precedes every transition of an index from L to R.Adopting the terminology of Zhu et al. [2003] and Wang et al. [2006], changes (1) and (2)are called events. The next event is predicted by computing the right derivative

∆q(s)∆s

= lim∆s→0

q(s+ ∆s)− q(s)∆s

, q(s) = (β0(s),θ(s)>, λ1(s))>.

from the KKT conditions and one then updates q(s+δs)← q(s)+δs∆q(s)

∆s . This is repeatedfrom event to event, until a stopping criterion is met (see below). A key observation forthe computation of the right derivative is that, according to the observation made in Eq.(3.7),

i ∈ L ∪R ⇒ ∆θi∆s

= 0.

The right derivative can hence be computed by solving the linear system

λ2ΛVV∆βV∆s

−[XVE

]> ∆θE∆s

+∆λ1

∆sσ(V) = 0|V|,

∆θ>E 1∆s

= 0,

∆β0

∆s+XVE

∆βV∆s

= 0|E|,

(3.11)

where the third equation follows from i ∈ E ⇔ mi = m0, i = 1, . . . , n. Furthermore, ∆λ1∆s

is a negative scalar (negativity follows from the fact that λ1(s) is decreasing in s), thereforewe may set ∆λ1

∆s = −1. The linear system (3.11) consists of |E|+ |V|+ 1 equations and thesame number of unknowns. It can be reduced to a system in |E|+ 1 unknowns by solvingthe first equation for ∆ bβV

∆s and plugging the result into the third equation. This yields thefollowing linear system to be solved.

[λ21|E| G(E ,V)

0 1>|E|

] ∆bβ0

∆s

∆θE∆s

=[−XVE

[ΛVV]−1

σ(V)0

], (3.12)

where

G(E ,V) = XVE[ΛVV]−1 [

XVE]>. (3.13)

12

Given ∆θE∆s , ∆bβ0

∆s , we now compute the resulting derivatives for the remaining quantities:

∆βV∆s

= [ΛVV ]−1 [XVE ]>∆θE∆s + σ(V)λ2

,

∆cVc

∆s= [XV

c

E ]>∆θE∆s− λ2ΛVVc

∆βV∆s

,

∆rEc

∆s= −∆β0

∆s−XVEc

∆βV∆s

, r = ξ + ξ.

(3.14)

Note that ξiξi = 0, i = 1, . . . , n. Together with the solution of the linear system (3.12), theright derivatives (3.14) are used to calculate the stepsize δ = δs/

∥∥∥∆ bβV∆s

∥∥∥1

and to updatethe status sets V, E , L and R. Defining

δθ = infd > 0 : ∃i ∈ E : θi + d

∆θi∆s∈ Θ(L) ∪Θ(R)

,

δr = infd > 0 : ∃i ∈ Ec : ri + d

∆ri∆s

= 0,

δbβ = inf

d > 0 : ∃j ∈ V : βj + d

∆βj∆s

= 0

,

δc = infd > 0 : ∃j ∈ Vc :

∣∣∣∣cj + d∆cj∆s

∣∣∣∣ = λ1 − d,

(3.15)

we have that δ = minδθ, δr, δbβ, δc, λ1(s). Including λ1(s) guarantees that λ1(s+ δs) ≥ 0.If λ1(s+ δs) = 0, the algorithm terminates. A further stopping criterion is that the loss inthe sample drops to zero, i.e. ∀i mi(s+ δs) ≥ m0 for SVC and ∀i mi(s+ δs) = m0 for QR,respectively.

3.2 Solution path in βββ>ΛΛΛβββ

In order to examine how the solution varies for a fixed amout of `1-regularization, weproceed as follows. We first fix λ2 = λinit

2 , say, and compute the entire solution path(β0(s), β(s)) according to the previous subsection. For each s, we then know the optimumvalues θ(s), β(s), λ1(s) as well as the status sets E(s) and V(s). For any choice of s, wemay subsequently trace the solution path (β0(λ2), β(λ2)), 0 ≤ λ2 ≤ λinit

2 . From the firstequation of the KKT conditions (3.4), one obtains

λ2βV = [ΛVV ]−1(

[XV ]>θ − λ1σ(V)).

Next, we use that mi = m0 for all i ∈ E , which, after premultiplying λ2, can be written as

θ01|E| +XVE [ΛVV ]−1

([XVE ]>θE − λ1σ(V)

)= λ2yE , θ0 = λ2β0.

Differentiating both sides w.r.t. λ2 yields

∆θ0

∆λ21|E| +G(E ,V)

∆θE∆λ2

= yE ,

where the matrix G(E ,V) is defined in Eq. (3.13). We additionally know that (cf. Eq.(3.11)) ∑

i∈E

∆θi∆λ2

= 0.

13

Altogether, ∆θ0∆λ2

, ∆θE∆λ2

can be computed by solving the linear system

[1|E| G(E ,V)0 1>|E|

] ∆θ0∆λ2

∆θE∆λ2

=[yE0

].

It remains to compute the stepsize δλ2 such that λnew2 = λ2 − δλ2 , where λnew

2 < λ2

corresponds to the next breakpoint of the solution path. As in Subsection 3.1, the stepsizeδλ2 is computed as the smallest value such that the next event occurs. Analogously to Eq.(3.15), we compute

δ′θ = infd > 0 : ∃i ∈ E : θi − d

∆θi∆λ2

∈ Θ(L) ∪Θ(R),

δ′r = inf+θ0 + x>i [ΛVV ]−1

([XV ]>θ − λ1σ(V)

)− λ2yi

∆θ0∆λ2

+ x>i [ΛVV ]−1[XVE ]>∆θE∆λ2− yi

, i ∈ Ec,

δ′bβ = inf+

([ΛVV ]−1[XV ]>θ)j − λ1(σ(V))j([ΛVV ]−1[XVE ]>∆θE

∆λ2)j

, j ∈ V

,

δ′bc = inf+

±λ1 + (X>Vcθ − λ2ΛVVcβV)j([XVc

E ]>∆θE∆λ2−ΛVVc [ΛVV ]−1[XVE ]>∆θE

∆λ2

)j

, j ∈ Vc

,

where inf+ indicates that the infimum is taken over indices for which the correspondingexpression within the curly brackets is positive. The stepsize is determined as δλ2 =minδ′θ, δ′r, δ′bβ, δ′bc, λ2. The stopping criteria of the algorithm are the same as for thesolution path in ‖β‖1.

4 Data analysis

The following section presents applications of the structured elastic net in one regressionand one classification problems taken from climate science and image processing, respec-tively.

4.1 Canadian weather data

In chapter 15 of Ramsay and Silverman [2006], the authors illustrate their functionallinear model approach by predicting the logarithm of the total annual precipitation at 35Canadian weather stations from the temperatures measured at 365 days of a year. Thedata were averaged over 35 annual reports starting in 1960 and ending in 1994. In ouranalysis, we aim at the prediction of the 0.25-, 0.5-, and 0.75 quantile of the total annualprecipitation by a linear regression on the temperature pattern, i.e. we set

yi,τ = βτ0 +p∑j=1

βτj xij ,

where yi,τ is the prediction of the τ -quantile, τ ∈ 0.25, 0.5, 0.75 for weather station i,i = 1, . . . , 35, and xij is the temperature for weather station i at day j, j = 1, . . . , p, p = 365.The use of the structured elastic net regularizer can be motivated from the following

14

considerations. At first, the predictors are ordered temporally such that their influence onthe response as quantified by the regression coefficients is expected to be similar. Secondly,interpretation is greatly simplified if predictors with low or no influence on the response areremoved. On the other hand, mere `1-regularization is not fully adequate for this purpose,since it tends to select single days. Intuitively, selection on the basis of days is an unreliableprocedure with respect to the prediction of future observations. It therefore seems to bebeneficial to identify relevant sequences of days, e.g. weeks or (parts of) months. One mightargue that one could alternatively coarsen the data by averaging over days. However, thiswould be a rather wasteful treatment of the information one has at hand. The spirit of ourapproach is closely related to the ideas underlying the analyses of the same dataset in Jameset al. [2008] and Tutz and Gertheiss [2010]. Both use squared loss and computationallydifferent approaches to incorporate temporal structure and to perform variable selection.We compare the performance of `1, elastic net and the structured elastic net regularizer,

0 100 200 300

−30

−20

−10

010

20

days

Tem

pera

ture

[deg

ree

Cel

sius

]

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

1 2 3 4 5 6 7 8 9 10 11 12

log10precipitation [mm]2.1 3.4

0 100 200 300

−0.0

010.

000

0.00

10.

002

0.00

3

days

β

1 2 3 4 5 6 7 8 9 10 11 12

τ = 0.25τ = 0.5τ = 0.75

Figure 5: Upper panel: temperatures of 35 Canadian weather stations. Their annual (log10)precipitations can be read off from the colour bar. Lower panel: Regression coefficients ofthe structured elastic net for median, lower and upper quartile. In both panels, the verticallines divide the 365 days into 12 months.

where the structure matrix Λ is composed of squared first forward differences for adjacentdays. We also compute the generalized ridge solution which only uses the quadratic partof the structured elastic net regularizer. All four approaches have in common that modelselection is simplified by the availability of piecewise linear solution path algorithms. For

15

method τ = 0.25 τ = 0.5 τ = 0.75loss (test sets) loss (test sets) loss (test sets)

generalized ridge 0.276 0.300 0.290(0.018) (0.020) (0.021)

lasso 0.285 0.361 0.326(0.029) (0.024) (0.022)

e.net 0.261 0.295 0.219(0.014) (0.014) (0.012)

structured e.net 0.222 0.265 0.229(0.013) (0.014) (0.014)

Table 2: Mean loss on the random test samples, averaged over 50 iterations, for the Cana-dian weather dataset. Standard errors are given in parentheses. The term ’e.net’ abbrevi-ates ’elastic net’.

the evaluation of the performance, we compute 50 random splits of the dataset into learningand test sample consisting of 30 and 5 observations, respectively. Hyperparameters aredetermined via tenfold cross-validation on the learning sets. Table 2 displays the meanloss (averages over the 50 splits) on the test sample, where ’loss’ refers to the check loss(cf. Table 1) and hence depends on τ . It is seen that mere `1-regularization shows a poorperformance, confirming the argumentation above. The structured elastic net performsbest for τ ∈ 0.25, 0.5, but is beaten by the ordinary elastic net for τ = 0.75. Theupper panel of Figure 5 displays the temperature profiles of the 35 weather stations andthe median (over the 50 splits) of the estimated coefficients βτj , j = 1, . . . , 365 of thestructured elastic net. According to the lower panel, the days in November carry muchpotential to predict the outcome, while the days in June up to October seem to be irrelevant.The latter is not surprising when looking at the upper panel: in the summer months, thereis considerable overlap of the temperature profiles of weather stations with rather differentannual precipitations. Too a large extent, the coefficient profile for the conditional median(τ = 0.5) agrees with those displayed in James et al. [2008] and Tutz and Gertheiss [2010],whose approaches target the conditional mean. In particular, the eminent bump at theend as well as the flat in the middle of the year is a common feature. Figure 6 displays

2.0 2.5 3.0 3.5

2.0

2.5

3.0

3.5

y

y

ττ == 0.25ττ == 0.5ττ == 0.75

Figure 6: Predicted quantiles (y) vs observed values (y). The straight lines correspondto linear regressions of the fitted values on the observed values, separately for each τ ∈0.25, 0.5, 0.75. The (y, y)-pairs for the weather station in Kamloops are emphasized bymeans of a frame in dashed lines.

16

an assessment of the fit, which appears to be decent except for few data. A particularlystrong misfit results for the weather station in Kamloops, an observation already madein the analysis of Ramsay and Silverman [2006], p. 269, where a plausible meteorologicalexplanation for the poor fit is provided.

4.2 Handwritten digit recognition

We analyze a subset of the USPS database described in Le Cun et al. [1989] and Hastieet al. [2001]. The complete database consists of more than 9,000 greyscale images ofhandwritten digits at a resolution of 16×16 pixels, with a predefined division into learningand test sample. We extract all instances of the digits ’3’, ’5’, ’6’, ’8’, ’9’, which − in viewof their composition of arcs and circular shapes − we regard to be suitable for an effectivevisualization (see Figure 7). We consider all resulting pairwise classification problems, inwhich each of the 256 pixels is used as input of a linear classification rule of the formy = sign(β0 + xj,kβj,k), where pixel j, k is denoted by xj,k, and βj,k is the correspondingcoefficient, j, k = 1, . . . , 16. We here state clearly that linear classifiers are well-known tobe of inferior performance for this dataset. Instead, our attention focuses on a comparisonof the lasso, elastic net and structured elastic net regularization. The ridge regularizercorresponds to standard linear SVC, and is hence included as baseline. For the structuredelastic net, the matrix Λ is chosen such that

β>Λβ =∑j

∑k

(βj,k−βj−1,k)2+(βj+1,k−βj,k)2+(βj,k−βj,k−1)2+(βj,k−βj,k+1)2,(4.1)

the usual discretization of the Laplacian acting on functions defined on R2. Note thatthe regularizer (4.1) results from the weight specification (2.5), identifying a grid pointt with a pair of indices (j, k). Hyperparameters are determined such that the numbersof misclassifications in tenfold cross-validation on the learning sample is minimized. Theresults are reported in Table 3, showing that mere `1-regularization performs poorly. Theelastic net and the structured elastic net select far more pixels, and the latter is typically agood deal more sparse than the former, while being equally good in terms of classificationperformance. However, both are beaten by standard linear SVC. It is instructive to havea look at the coefficient surfaces of the structured elastic net, depicted in Figure 7. Theresults are in accordance to what one would pick as predictive regions by visual inspection:large coefficients are predominantly placed in regions where the two digits show little orno overlap.

17

methodridge lasso e.net structured e.net

problem’3’ vs. ’5’ test error 39/326 48/326 38/326 36/326

nonzero.coef 29 226 189’3’ vs. ’6’ test error 10/336 24/336 6/336 6/336









nonzero.coef 26 124 84total test error 158/3356 280/3356 166/3356 164/3356

Table 3: Misclassification rates on the test for the ten selected handwritten digit recognitionproblems. The table also displays the number of nonzero coefficients (’nonzero.coef’) ofthe CV-optimal models.

18

solid: mean '5' dashed: mean '3'

pixels (horizontal)

pixe

ls (

vert

ical

)

5 10 15

510

15

1 2 3 4 6 7 8 9 11 12 13 14 16

12

34

67

89

1112

1314

16− − − − − − + + +

− − − − − + + +

+ + + + + + + +

+ + + + + + + + + +

+ + + + + + + + + + −

+ + + + + + + −

+ + + + + − − −

+ + + + + + − − − − − − + +

+ + + + + + + − − − − − −

+ + + + + + + − − − − − − −

+ + + + + + + + − − − − − − −

+ + + + + + + + + − − − − − −

+ + + + + + + + − − − − − − +

+ − + + + − − − − + + +

− − − − − − − − − + + + + +

− − − − − − − − − + + + + + +

1 2 3 4 6 7 8 9 11 12 13 14 16

12

34

67

89

1112

1314

16


pixels (horizontal)

pixe

ls (

vert

ical

)

5 10 15

510

15

1 2 3 4 6 7 8 9 11 12 13 14 16

12

34

67

89

1112

1314

16

+ + + + + + +

− + + + + + + + − − − +

+ + + + + + + + + + + − − −

+ + + + + + + + + + + + − − −

+ + + + + + + + + + + − − −

+ + + + + + + + + + + − − −

+ + + + + + + + + + − −

+ + + + + + + − − − − − − − −

+ + + + + + − − − − − − − −

+ + + + + + − − − − − − −

+ + + + + − − − − − − − −

+ + + + + − − − − − − −

− − − − + + − − − − − − − −

− − − − − − − − − − − − −

− − − − − − − − − − − − −

+ − − − − − − − − − − − −

1 2 3 4 6 7 8 9 11 12 13 14 16

12

34

67

89

1112

1314

16


pixels (horizontal)

pixe

ls (

vert

ical

)

5 10 15

510

15

1 2 3 4 6 7 8 9 11 12 13 14 16

12

34

67

89

1112

1314

16

− − − +

− − − − − − −

− − + − − − −

− + + + + + + − − − −

+ + + + + + + − − − −

+ + + + + + + − − − −

+ + + + + + + − − − − −

+ + + + + − − − −

+ + + + − − − − −

+ + + + − −

+ + + + +

+ + + + +

+ + + +

+ − −

1 2 3 4 6 7 8 9 11 12 13 14 16

12

34

67

89

1112

1314

16


pixels (horizontal)

pixe

ls (

vert

ical

)

5 10 155

1015

1 2 3 4 6 7 8 9 11 12 13 14 161

23

46

78

911

1213

1416

− − − −−− − − − − − − −

− − − −− − − − − − − − − − −

− − − − − − − − − − − − − − − −

− − − − − − − + + − − − − − − −

− − − − − − + + + + − − − − − −

− − + + + + − − − − − −

+ + + + + + + + + + − − − − − −

+ + + + + + + + − − − − −

+ + + + + + − − − + − −

+ + + + + + − − + + −

+ + + + + + + + +

+ + + + + + + + + + +

+ + + + + + − − − − −

+ + + + + − − +

− − − + − + + + +

− − − − − + + + + + + + +

1 2 3 4 6 7 8 9 11 12 13 14 161

23

46

78

911

1213

1416


pixels (horizontal)

pixe

ls (

vert

ical

)

5 10 15

510

15

1 2 3 4 6 7 8 9 11 12 13 14 16

12

34

67

89

1112

1314

16

−

− − −

+ + + +

+ + + + + + +

+ + + + + + + +

+ + + + + + + + + +

+ + + + + + + + +

+ + + + − − −

+ + − − − − −

− − − − − −

− − − − − − −

− − − − − − − − −

− − − − − − − − − − −

− − − − − − − − − − −

− − − − − − − − − − −

− − − − −

1 2 3 4 6 7 8 9 11 12 13 14 16

12

34

67

89

1112

1314

16


pixels (horizontal)

pixe

ls (

vert

ical

)

5 10 15

510

15

1 2 3 4 6 7 8 9 11 12 13 14 16

12

34

67

89

1112

1314

16

− − − − − − − + + − − − − − − −

− − − − − − − − − − − − − − − −

− − − − − + + − − − − − − − − −

+ − − + + + + + − − − − − − − −

− − + + + + + + − − − − − − −

− − − + + + + + + + − − − − − −

− − − + + + ++ + + − − − − − −

− − − − + + + + + + + − − − − −

− − − − − + + + + + + + + + + +

− − − − − − + + + + + + + + + +

− − − − − − − − + + + + + + + +

− − − + + − − − − + + + + + + −

− − + + + − − − + + + + − − −

− + + + + + − − − + + − − − −

+ + + + + + + + − − − − − − −

+ + + + + + + + + + − − − − − −

1 2 3 4 6 7 8 9 11 12 13 14 16

12

34

67

89

1112

1314

16


pixels (horizontal)

pixe

ls (

vert

ical

)

5 10 15

510

15

1 2 3 4 6 7 8 9 11 12 13 14 16

12

34

67

89

1112

1314

16

− − − − − − − −

− − − − − − − − − − −

− − − − − − − − − −

− − − − + + − − − − −

− − + + + − − − − −

+ + + − − − −

+ + − −

+ + +

+ + +

+ + + +

− − + + + + +

− − + + + + +

− − − + + +

− − −

+ +

+ + +

1 2 3 4 6 7 8 9 11 12 13 14 16

12

34

67

89

1112

1314

16


pixels (horizontal)

pixe

ls (

vert

ical

)

5 10 15

510

15

1 2 3 4 6 7 8 9 11 12 13 14 16

12

34

67

89

1112

1314

16

− − −

− − − − − − −

− − − − − − − −

− − − − − − −

− − − − − − −

− − − + + + − − − − −

− − − + + + + − − −

− − − + + + + +

+ + + + + + +

− + + + + +

+ + + − − + + + + + +

+ + + + + − + + + + + +

+ + + + + + + + + +

+ + + + + + + + + +

+ + + + + +

1 2 3 4 6 7 8 9 11 12 13 14 16

12

34

67

89

1112

1314

16


pixels (horizontal)

pixe

ls (

vert

ical

)

5 10 15

510

15

1 2 3 4 6 7 8 9 11 12 13 14 16

12

34

67

89

1112

1314

16

− − − − − − − − − −

− − − − − − − − − − − −

− − − − − − − − − − − − −

− − − − − − − − − − − − − −

− − − − − − − − − − − −

− − − − − − − + − − − − −

− − − − − − + + + − − − − −

− − − − − + + + + + − +

− − − + + + + + +

+ + + + + + +

+ + + − + + + + + + + +

+ + + + + − − + + + + + + +

+ + + + + − + + + + + + +

+ + + + + + + + + + + + +

+ + + + + + + + + + + +

+ + + + + + + + + +

1 2 3 4 6 7 8 9 11 12 13 14 16

12

34

67

89

1112

1314

16


pixels (horizontal)

pixe

ls (

vert

ical

)

5 10 15

510

15

1 2 3 4 6 7 8 9 11 12 13 14 16

12

34

67

89

1112

1314

16

− − − − − − −

− −− − − − − −

− −−− + − − −

− − −−− + + − − −

− − − − +

− − − − +

− − − + +

− − + + +

+ − − − + + + ++ − + +

+ + + +

+ +

+ +

−

− − − +

− − − + + +

1 2 3 4 6 7 8 9 11 12 13 14 16

12

34

67

89

1112

1314

16

Figure 7: Schematic representation of the coefficient profiles βj,k, 1 ≤ j, k ≤ 16 of theten CV-optimal structured elastic net models. The symbols ’+’ and ’−’ indicate the signof the coefficients (with the convention that the digit which is numerically the larger of thetwo is labeled +1), and the symbol size their absolute magnitude. For better guidance, theaverage shapes of the digits within the USPS database have been added.

19

5 Discussion

In this paper, we have described the structured elastic net regularizer for quantile regres-sion and support vector classification tailored to the grouped selection of variables on thebasis of a known association structure. By means of artificial and real world datasets, wehave demonstrated in what way the structured elastic net differs from its predecessors, andwe have shown a series of scenarios in which it can be useful. A drawback of the structuredelastic net is its dependence on two tuning parameters. The loss functions of quantileregression and support vector classification admit a computational shortcut by trackinga sequence of piecewise linear solution paths with one of the two parameters kept fixed.While this is already a considerable computational simplification, there is further room forimprovement. Cross-validation could be replaced by the computation of a suitable modelselection criterion, as done in previous work (Li and Zhu [2008]). For quantile regression,Rosset [2008] proposes a bi-level solution path algorithm for varying regularization pa-rameter and varying quantile τ , thus elegantly avoiding the problem of quantile crossing(Koenker [2005]). An extension of the path algorithm of Section 3 into this direction wouldbe of much practical use.The approach pursued in this paper focuses on how to compute estimators as minimizersof convex optimization problems, but no advice on how do further inference, notably howto obtain standard errors, is given − a problem that seems to be inherent in frequentistapproaches to variable selection based on `1-regularization. A Bayesian framework couldprovide an error assessment of all parameters of interest, also covering more complicatedcases where the matrix Λ depends on additional unknown parameters, which is out of thescope of the present paper. In this context, it is worth mentioning that Bayesian counter-parts of the two loss functions employed in this paper (Sollich [2002], Lancaster and Jun[2009]) as well as the `1-regularizer (Park and Casella [2008], Hans [2009], Hans [2010])have been developed.

Acknowledgments

The author is greatly indebted to Ji Zhu (University of Michigan) for providing him examplecode for the solution path of `1-regularized support vector classification, which turned outto be helpful for the development and implementation of the algorithm described in Section3.The author thanks the two reviewers for their constructive comments, leading to increasedclarity in presentation and an improved discussion.

References

P. Bartlett, M. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds. Journalof the American Statistical Association, 101:138–156, 2006.

K. Bennett and O. Mangasarian. Multicategory separation via linear programming. Opti-mization Methods and Software, 3:27–39, 1993.

H. Bondell and B. Reich. Simultaneous regression shrinkage, variable selection and clus-tering of predictors with OSCAR. Biometrics, 64:115–123, 2008.

P. Bradley and O. Mangasarian. Feature selection via concave minimization and supportvector machines. In International Conference on Machine Learning, 1998.

20

N. Christianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cam-bridge University Press, 2000.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least Angle Regression (with discus-sion). The Annals of Statistics, 32:407–499, 2004.

C. Hans. Bayesian Lasso regression. Biometrika, 96:221–229, 2009.

C. Hans. Model uncertainty and variable selection in Bayesian lasso regression. Statisticsand Computing, 20:221–229, 2010.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer,New York, 2001.

T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The Entire Regularization Path for theSupport Vector Machine. Journal of Machine Learning Research, 5:1391–1415, 2004.

G. James, J. Wang, and J. Zhu. Functional Linear Regression That’s Interpretable. TheAnnals of Statistics, 37:2083–2108, 2008.

R. Koenker. Quantile Regression. Cambridge University Press, 2005.

T. Lancaster and SJ. Jun. Bayesian quantile regression methods. Journal of AppliedEconometrics, 25:287–307, 2009.

S. Landau, I. Ellison-Wright, and E. Bullmore. Tests for a difference in timing of physiologi-cal response between two brain regions measured by using functional magnetic resonanceimaging. Applied Statistics, 63-82:53, 2003.

C. Li and H. Li. Variable Selection and Regression Analysis for Graph-Structured Co-variates with an Application to Genomics. The Annals of Applied Statistics, to appear,2010.

Y. Li and J. Zhu. L1-norm Quantile regression. Journal of Computational and GraphicalStatistics, 17:163–185, 2008.

Y. Li, Y. Liu, and J. Zhu. Quantile regression in reproducing kernel Hilbert spaces. Journalof the American Statistical Association, 102:255–268, 2007.

Y. Lin. Support vector machines and the Bayes rule in classification. Data Mining andKnowledge Discovery, 6:259–275, 2002.

T. Park and G. Casella. The Bayesian Lasso. Journal of the American Statistical Associ-ation, 103:681–686, 2008.

J. Ramsay and B. Silverman. Functional Data Analysis. Springer, New York, 2006.

S. Rosset. Bi-level path following for cross validated solution of kernel quantile regression.In International Conference on Machine Learning, 2008.

S. Rosset and J. Zhu. Piecewise linear regularized solution paths. The Annals of Statistics,1012-1030:35, 2007.

B. Scholkopf and A. Smola. Learning with kernels. MIT Press, Cambridge, Massachussets,2002.

21

M. Slawski, W. zu Castell, and G. Tutz. Feature Selection guided by Structural Informa-tion. The Annals of Applied Statistics, to appear, 2010.

P. Sollich. Bayesian methods for Support Vector Machines: Evidence and predictive classprobabilities. Machine Learning, 46:21–52, 2002.

M. Stein. Interpolation of Spatial Data. Springer, New York, 1999.

I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.

I. Takeuchi, Q. Le, T. Sears, and A. Smola. Nonparametric quantile regression. Journalof Machine Learning Research, 7:1231–1264, 2006.

M. El Anbari and A. Mkhadri. Penalized regression combing the L1 norm and a correlationbased penalty. Technical report, Universite Paris Sud 11, 2008.

Y. Le Cun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel.Backpropagation applied to handwritten zip code recognition. Neural Computation, 2:541–551, 1989.

R. Tibshirani. Regression shrinkage and variable selection via the lasso. Journal of theRoyal Statistical Society Series B, 58:671–686, 1996.

R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothnessvia the fused lasso. Journal of the Royal Statistical Society Series B, 67:91–108, 2005.

G. Tutz and J. Gertheiss. Feature extraction in signal regression: a boosting techniquefor functional data regression. Journal of Computational and Graphical Statistics, 19:154–174, 2010.

G. Tutz and J. Ulbricht. Penalized regression with correlation based penalty. Statisticsand Computing, 19:239–253, 2009.

L. Wang and X. Shen. Multi-category support vector machines, feature selection, andsolution path. Statistica Sinica, 16:617–634, 2005.

L. Wang, J. Zhu, and H. Zou. The doubly regularized support vector machine. StatisticaSinica, 16:589–616, 2006.

S. Wood. R package gamair: Data for ”GAMs: An Introduction with R”, version 0.0-4.available from www.r-project.org, 2006.

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables.Journal of the Royal Statistical Society Series B, 68:49–67, 2006.

P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped andhierarchical variable selection. The Annals of Statistics, 37:3468–3497, 2009.

J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. L1 norm support vector machine. Advancesin Neural Information Processing Systems, 16:55–63, 2003.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal ofthe Royal Statistical Society Series B, 67:301–320, 2005.

22

Documents

The structured elastic net for quantile regression and ... · For quantile regression, the quantity inffy: P(Y yjX = x) ˝gis called the ˝-quantile of the conditional distribution