Cost-Sensitive Tree of Classifiers

Zhixiang (Eddie) Xu, Matt J. Kusner, Kilian Q. Weinberger, Minmin Chen

Washington University, One Brookings Dr., St. Louis, MO 63130 USA

Abstract

Recently, machine learning algorithms have successfully entered large-scale real-world industrial applications (e.g. search engines and email spam filters). Here, the CPU cost during test-time must be budgeted and accounted for. In this paper, we address the challenge of balancing the test-time cost and the classifier accuracy in a principled fashion. The test-time cost of a classifier is often dominated by the computation required for feature extraction, which can vary drastically across features. We decrease this extraction time by constructing a tree of classifiers, through which test inputs traverse along individual paths. Each path extracts different features and is optimized for a specific sub-partition of the input space. By only computing features for inputs that benefit from them the most, our cost-sensitive tree of classifiers can match the high accuracies of the current state-of-the-art at a small fraction of the computational cost.

1. Introduction

Machine learning algorithms are widely used in many real-world applications, ranging from email-spam (Weinberger et al., 2009) and adult content filtering (Fleck et al., 1996) to web-search engines (Zheng et al., 2008). As machine learning transitions into these industry fields, managing the CPU cost at test-time becomes increasingly important. In applications of such large scale, computation must be budgeted and accounted for. Moreover, reducing energy wasted on unnecessary computation can lead to monetary savings and reductions of greenhouse gas emissions.

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

The test-time cost consists of the time required to evaluate a classifier and the time to extract features for that classifier, where the extraction time across features is highly variable. Imagine introducing a new feature to an email spam filtering algorithm that requires 0.01 seconds to extract per incoming email. If a web-service receives one billion emails (which many do daily), it would require 115 extra CPU days to extract just this feature. Although this additional feature may increase the accuracy of the filter, the cost of computing it for every email is prohibitive. This introduces the problem of balancing the test-time cost and the classifier accuracy. Addressing this trade-off in a principled manner is crucial for the applicability of machine learning.

In this paper, we propose a novel algorithm, Cost-Sensitive Tree of Classifiers (CSTC). A CSTC tree (illustrated schematically in Fig. 1) is a tree of classifiers that is carefully constructed to reduce the average test-time complexity of machine learning algorithms, while maximizing their accuracy. Different from prior work, which reduces the total cost for every input (Efron et al., 2004) or which stages the feature extraction into linear cascades (Viola & Jones, 2004; Lefakis & Fleuret, 2010; Saberian & Vasconcelos, 2010; Pujara et al., 2011; Chen et al., 2012), a CSTC tree incorporates input-dependent feature selection into training and dynamically allocates higher feature budgets for infrequently traveled tree-paths. By introducing a probabilistic tree-traversal framework, we can compute the exact expected test-time cost of a CSTC tree. CSTC is trained with a single global loss function, whose test-time cost penalty is a direct relaxation of this expected cost. This principled approach leads to unmatched test-cost/accuracy tradeoffs as it naturally divides the input space into sub-regions and extracts expensive features only when necessary.

We make several novel contributions: 1. We introduce the meta-learning framework of CSTC trees and derive the expected cost of an input traversing the tree during test-time. 2. We relax this expected cost with a mixed-norm relaxation and derive a single global optimization problem to train all classifiers jointly. 3. We demonstrate on synthetic data that CSTC effectively allocates features to classifiers where they are most beneficial, and show on large-scale real-world web-search ranking data that CSTC significantly outperforms the current state-of-the-art in test-time cost-sensitive learning, maintaining the performance of the best algorithms for web-search ranking at a fraction of their computational cost.

2. Related Work

A basic approach to control test-time cost is the use of $\ell_1$-norm regularization (Efron et al., 2004), which results in a sparse feature set and can significantly reduce the feature cost during test-time (as unused features are never computed). However, this approach fails to address the fact that some inputs may be successfully classified with only a few cheap features, whereas others strictly require expensive features for correct classification.

There is much previous work that extends single classifiers to classifier cascades, mostly for binary classification (Viola & Jones, 2004; Lefakis & Fleuret, 2010; Saberian & Vasconcelos, 2010; Pujara et al., 2011; Chen et al., 2012). In these cascades, several classifiers are ordered into a sequence of stages. Based on the prediction it makes for each input, a classifier can either reject the input (issuing that prediction as final) or pass it on to the next stage. To reduce the test-time cost, these cascade algorithms enforce that classifiers in early stages use very few and/or cheap features and reject many easily-classified inputs. Classifiers in later stages, however, are more expensive and cope with more difficult inputs. This linear structure is particularly effective for applications with highly skewed class imbalance and generic features. One celebrated example is face detection in images, where the majority of all image regions do not contain faces and can often be easily rejected based on the response of a few simple Haar features (Viola & Jones, 2004). The linear cascade model is however less suited for learning tasks with balanced classes and specialized features. It cannot fully capture the scenario where different partitions of the input space require different expert features, as all inputs follow the same linear chain.

Grubb & Bagnell (2012) and Xu et al. (2012) focus on training a classifier that explicitly trades off test-time cost and accuracy. Instead of optimizing the trade-off by building a cascade, they push the cost trade-off into the construction of the weak learners. It should be noted that, in spite of the high accuracy achieved by these techniques, the algorithms are based heavily on stage-wise regression (gradient boosting) (Friedman, 2001), and are less likely to work with more general weak learners.

Gao & Koller (2011) use locally weighted regression during test-time to predict the information gain of unknown features. Different from our algorithm, their model is learned during test-time, which introduces an additional cost, especially for large data sets. In contrast, our algorithm learns and fixes a tree structure during training and has a test-time complexity that is constant with respect to the training set size.

Karayev et al. (2012) use reinforcement learning to dynamically select features to maximize the average precision over time in an object detection setting. In this case, the dataset has multi-labeled inputs and thus warrants a different approach than ours.

Hierarchical Mixture of Experts (HME) (Jordan & Jacobs, 1994) also builds tree-structured classifiers. However, in contrast to CSTC, this work is not motivated by reductions in test-time cost and results in fundamentally different models. In CSTC, each classifier is trained with the test-time cost in mind and each test-input only traverses a single path from the root down to a terminal element, accumulating path-specific costs. In HME, all test-inputs traverse all paths and all leaf-classifiers contribute to the final prediction, incurring the same cost for all test-inputs.

Recent tree-structured classifiers include the work of Deng et al. (2011), who speed up the training and evaluation of label trees (Bengio et al., 2010) by avoiding many binary one-vs-all classifier evaluations. Differently, we focus on problems in which feature extraction time dominates the test-time cost, which motivates different algorithmic setups. Dredze et al. (2007) combine the cost to select a feature with the mutual information of that feature to build a decision tree that reduces the feature extraction cost. Different from our work, they do not directly minimize the total test-time cost of the decision tree or the risk. Possibly most similar to our work are Busa-Fekete et al. (2012), who learn a directed acyclic graph via a Markov decision process to select features for different instances, and Wang & Saligrama (2012), who adaptively partition the feature space and learn local region-specific classifiers. Although each work is similar in motivation, the algorithmic frameworks are very different and can be regarded as complementary to ours.


3. Cost-sensitive classification

We first introduce our notation and then formalize our test-time cost-sensitive learning setting. Let the training data consist of inputs $D = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d$ with corresponding class labels $\{y_1, \ldots, y_n\} \subseteq \mathcal{Y}$, where $\mathcal{Y} = \mathbb{R}$ in the case of regression ($\mathcal{Y}$ could also be a finite set of categorical labels; because of space limitations we do not focus on this case in this paper).

Non-linear feature space. Throughout this paper we focus on linear classifiers, but in order to allow non-linear decision boundaries we map the input into a non-linear feature space with the "boosting trick" (Friedman, 2001; Chapelle et al., 2011) prior to our optimization. In particular, we first train gradient boosted regression trees with a squared loss penalty (Friedman, 2001), $H'(x_i) = \sum_{t=1}^{T} h_t(x_i)$, where each function $h_t(\cdot)$ is a limited-depth CART tree (Breiman, 1984). We then apply the mapping $x_i \to \phi(x_i)$ to all inputs, where $\phi(x_i) = [h_1(x_i), \ldots, h_T(x_i)]^\top$. To avoid confusion between CART trees and the CSTC tree, we refer to the CART trees $h_t(\cdot)$ as weak learners.
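To make the boosting trick concrete, the following minimal sketch builds $\phi(x_i) = [h_1(x_i), \ldots, h_T(x_i)]^\top$ from the individual trees of a boosted ensemble. It uses scikit-learn's GradientBoostingRegressor as a stand-in for the paper's gradient boosting step; all function and variable names below are ours, not the authors'.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def boosting_trick_features(X_train, y_train, X, T=50, depth=4):
    """Train T limited-depth regression trees by gradient boosting and
    return phi(x) = [h_1(x), ..., h_T(x)] for every row of X."""
    gbrt = GradientBoostingRegressor(n_estimators=T, max_depth=depth, learning_rate=0.1)
    gbrt.fit(X_train, y_train)
    # gbrt.estimators_ has shape (T, 1); each entry is one CART weak learner h_t
    phi = np.column_stack([tree[0].predict(X) for tree in gbrt.estimators_])
    return phi  # shape (n_samples, T)

if __name__ == "__main__":
    rng = np.random.RandomState(0)
    X_train = rng.randn(200, 5)
    y_train = X_train[:, 0] ** 2 + 0.1 * rng.randn(200)
    print(boosting_trick_features(X_train, y_train, X_train, T=20).shape)  # (200, 20)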

Risk minimization. At each node in the CSTC tree we propose to learn a linear classifier in this feature space, $H(x_i) = \phi(x_i)^\top \beta$ with $\beta \in \mathbb{R}^T$, which is trained to explicitly reduce the CPU cost during test-time. We learn the weight vector $\beta$ by minimizing a convex empirical risk function $\ell(\phi(x_i)^\top\beta, y_i)$ with $\ell_1$ regularization, $|\beta|$. In addition, we incorporate a cost term $c(\beta)$, which we derive in the following subsection, to restrict test-time cost. The combined test-time cost-sensitive loss function becomes

\[
L(\beta) = \underbrace{\sum_i \ell(\phi(x_i)^\top\beta, y_i) + \rho|\beta|}_{\text{regularized risk}} \;+\; \lambda \underbrace{c(\beta)}_{\text{test-cost}}, \qquad (1)
\]

where $\lambda$ is the accuracy/cost trade-off parameter and $\rho$ controls the strength of the regularization.

Test-time cost. There are two factors that contribute to the test-time cost of each classifier: the weak learner evaluation cost of all active $h_t(\cdot)$ (those with $|\beta_t| > 0$) and the feature extraction cost of all features used in these weak learners. We assume that features are computed on demand, with cost $c_\alpha$ paid the first time they are used, and are free for future use (as feature values can be cached). We define an auxiliary matrix $F \in \{0,1\}^{d \times T}$ with $F_{\alpha t} = 1$ if and only if the weak learner $h_t$ uses feature $f_\alpha$. Let $e_t > 0$ be the cost to evaluate $h_t(\cdot)$, and $c_\alpha$ the cost to extract feature $f_\alpha$. With this notation, we can formulate the total test-time cost for an instance precisely as

\[
c(\beta) = \underbrace{\sum_t e_t \|\beta_t\|_0}_{\text{evaluation cost}} \;+\; \underbrace{\sum_\alpha c_\alpha \Big\| \sum_t |F_{\alpha t}\beta_t| \Big\|_0}_{\text{feature extraction cost}}, \qquad (2)
\]

where the $\ell_0$ norm for scalars is defined as $\|a\|_0 \in \{0,1\}$ with $\|a\|_0 = 1$ if and only if $a \neq 0$. The first term assigns cost $e_t$ to every weak learner used in $\beta$; the second term assigns cost $c_\alpha$ to every feature that is extracted by at least one such weak learner.
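For illustration, here is a minimal NumPy sketch of eq. (2); the code and variable layout are our own, not the authors' implementation.

import numpy as np

def exact_test_cost(beta, F, e, c):
    """Exact test-time cost of a single linear classifier, eq. (2).

    beta : (T,)   classifier weights over weak learners
    F    : (d, T) binary matrix, F[a, t] = 1 iff weak learner t uses feature a
    e    : (T,)   evaluation cost of each weak learner
    c    : (d,)   extraction cost of each feature
    """
    active = beta != 0                          # ||beta_t||_0 for every t
    eval_cost = np.sum(e[active])               # pay e_t for every active weak learner
    feat_used = F[:, active].sum(axis=1) > 0    # feature needed by at least one active learner
    feat_cost = np.sum(c[feat_used])            # pay c_alpha once per extracted feature
    return eval_cost + feat_cost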

Test-cost relaxation. The cost formulation in (2) is exact but difficult to optimize, as the $\ell_0$ norms are non-continuous and non-differentiable. As a solution, throughout this paper we use the mixed-norm relaxation of the $\ell_0$ norm over sums,

\[
\sum_j \Big\| \sum_i |a_{ij}| \Big\|_0 \;\to\; \sum_j \sqrt{\sum_i (a_{ij})^2}, \qquad (3)
\]

described by Kowalski (2009). Note that for a single element this relaxation reduces the $\ell_0$ norm to the $\ell_1$ norm, $\|a_{ij}\|_0 \to \sqrt{(a_{ij})^2} = |a_{ij}|$, and recovers the commonly used approximation to encourage sparsity (Efron et al., 2004; Scholkopf & Smola, 2001). We plug the cost term (2) into the loss in (1) and apply the relaxation (3) to all $\ell_0$ norms to obtain

\[
\sum_i \underbrace{\ell_i + \rho|\beta|}_{\text{regularized loss}} \;+\; \lambda \Big( \underbrace{\sum_t e_t |\beta_t|}_{\text{ev.\ cost penalty}} + \underbrace{\sum_\alpha c_\alpha \sqrt{\sum_t (F_{\alpha t}\beta_t)^2}}_{\text{feature cost penalty}} \Big), \qquad (4)
\]

where we abbreviate $\ell_i = \ell(\phi(x_i)^\top\beta, y_i)$ for simplicity.

While (4) is cost-sensitive, it is restricted to a single linear classifier. In the next section we describe how to expand this formulation into a cost-effective tree-structured model.
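A matching sketch of the relaxed cost penalty inside eq. (4), under the same assumed variable layout as above (again our own illustration, not the authors' code):

import numpy as np

def relaxed_cost_penalty(beta, F, e, c):
    """Mixed-norm relaxation of eq. (2), i.e. the cost term of eq. (4)."""
    eval_penalty = np.sum(e * np.abs(beta))                        # sum_t e_t |beta_t|
    # sum_alpha c_alpha * sqrt(sum_t (F_{alpha,t} beta_t)^2)
    feat_penalty = np.sum(c * np.sqrt(((F * beta) ** 2).sum(axis=1)))
    return eval_penalty + feat_penalty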

4. Cost-sensitive tree

We begin by introducing foundational concepts regarding the CSTC tree and derive a global loss function (5). Similar to the previous section, we first derive the exact cost term and then relax it with the mixed-norm. Finally, we describe how to optimize this function efficiently and how to undo some of the inaccuracy induced by the mixed-norm relaxations.

CSTC nodes. We make the assumption that instances with similar labels can utilize similar features.[1]

[1] For example, in web-search ranking, features generated by browser statistics are typically predictive only for highly relevant pages, as they require the user to spend significant time on the page and interact with it.


\[
\min_{\beta^1,\theta^1,\ldots,\beta^{|V|},\theta^{|V|}} \;
\underbrace{\sum_{v^k \in V} \Big( \frac{1}{n} \sum_{i=1}^{n} p^k_i \ell^k_i + \rho|\beta^k| \Big)}_{\text{regularized risk}}
\;+\; \lambda \sum_{v^l \in L} p^l \Big[ \underbrace{\sum_t e_t \sqrt{\sum_{v^j \in \pi^l} (\beta^j_t)^2}}_{\text{evaluation cost penalty}}
+ \underbrace{\sum_\alpha c_\alpha \sqrt{\sum_{v^j \in \pi^l} \sum_t (F_{\alpha t}\beta^j_t)^2}}_{\text{feature cost penalty}} \Big] \qquad (5)
\]

[Figure 1] A schematic layout of a CSTC tree. Classifier nodes $v^0, \ldots, v^6$ (each with parameters $(\beta^k, \theta^k)$) branch inputs to their upper child if $\phi(x)^\top\beta^k > \theta^k$ and to their lower child if $\phi(x)^\top\beta^k \le \theta^k$; terminal elements $v^7, \ldots, v^{10}$ mark the exits of the paths $\pi^7, \ldots, \pi^{10}$. Each node $v^k$ has a threshold $\theta^k$ to send instances to different parts of the tree and a weight vector $\beta^k$ for prediction. We solve for the $\beta^k$ and $\theta^k$ that best balance the accuracy/cost trade-off for the whole tree. All paths of the CSTC tree are shown in color.

We therefore design our tree algorithm to partition the input space based on classifier predictions. Classifiers that reside deep in the tree become experts for a small subset of the input space, and intermediate classifiers determine the path of instances through the tree. We distinguish between two different elements in a CSTC tree (depicted in Figure 1): classifier nodes (white circles) and terminal elements (black squares). Each classifier node $v^k$ is associated with a weight vector $\beta^k$ and a threshold $\theta^k$. Different from cascade approaches, these classifiers not only classify inputs using $\beta^k$, but also branch them by their threshold $\theta^k$, sending inputs to their upper child if $\phi(x_i)^\top\beta^k > \theta^k$, and to their lower child otherwise. Terminal elements are "dummy" structures and are not classifiers. They return the predictions of their direct parent classifier nodes, essentially functioning as placeholders for an exit out of the tree. The tree structure may be a full balanced binary tree of some depth (e.g. Figure 1), or can be pruned based on a validation set (e.g. Figure 4, left).

During test-time, inputs are first applied to the root node $v^0$. The root node produces predictions $\phi(x_i)^\top\beta^0$ and sends the input $x_i$ along one of two different paths, depending on whether $\phi(x_i)^\top\beta^0 > \theta^0$. By repeatedly branching the test-inputs, classifier nodes sitting deeper in the tree only handle a small subset of all inputs and become specialized towards that subset of the input space.
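The test-time behaviour can be sketched in a few lines of code. The data structures below are our own assumptions (the paper does not prescribe an implementation); the point is that each input follows exactly one root-to-terminal path and that the weak learners activated along that path are paid for at most once.

import numpy as np

class Node:
    def __init__(self, beta, theta=None, upper=None, lower=None):
        self.beta, self.theta = beta, theta      # weights and branching threshold
        self.upper, self.lower = upper, lower    # children; assume both set or both None

def cstc_predict(root, phi_x):
    """Send one input (already mapped to phi(x)) down its unique path."""
    node, computed = root, set()                 # 'computed' caches weak learners already paid for
    while True:
        computed.update(np.flatnonzero(node.beta))   # weak learners this node activates
        score = phi_x @ node.beta
        if node.upper is None:                   # no children: this node's prediction is returned
            return score, computed
        node = node.upper if score > node.theta else node.lower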

4.1. Tree loss

We derive a single global loss function over all nodes in the CSTC tree.

Soft tree traversal. Training the CSTC tree with hard thresholds leads to a combinatorial optimization problem, which is NP-hard. Therefore, during training, we softly partition the inputs and assign traversal probabilities $p(v^k|x_i)$ to denote the likelihood of input $x_i$ traversing through node $v^k$. Every input $x_i$ traverses through the root, so we define $p(v^0|x_i) = 1$ for all $i$. We use the sigmoid function to define a soft belief that an input $x_i$ will transition from classifier node $v^k$ to its upper child $v^j$ as $p(v^j|x_i, v^k) = \sigma(\phi(x_i)^\top\beta^k - \theta^k)$.[2] The probability of reaching child $v^j$ from the root is, recursively, $p(v^j|x_i) = p(v^j|x_i, v^k)\,p(v^k|x_i)$, because each node has exactly one parent. For a lower child $v^l$ of parent $v^k$ we naturally obtain $p(v^l|x_i) = [1 - p(v^j|x_i, v^k)]\,p(v^k|x_i)$.

[2] The sigmoid function is defined as $\sigma(a) = \frac{1}{1+\exp(-a)}$; it takes advantage of the fact that $\sigma(a) \in [0,1]$ and that $\sigma(\cdot)$ is strictly monotonic.

In the following paragraphs we incorporate this probabilistic framework into the single-node risk and cost terms of eq. (4) to obtain the corresponding expected tree risk and tree cost.

Expected tree risk. The expected tree risk is obtained by summing over all nodes $V$ and all inputs, weighing the risk $\ell(\cdot)$ of input $x_i$ at node $v^k$ by the probability $p^k_i = p(v^k|x_i)$:

\[
\frac{1}{n} \sum_{i=1}^{n} \sum_{v^k \in V} p^k_i\, \ell(\phi(x_i)^\top\beta^k, y_i). \qquad (6)
\]

This has two effects: 1. the local risk for each node focuses more on likely inputs; 2. the global risk attributes more weight to classifiers that serve many inputs.
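A sketch of these training-time quantities (our own code; it assumes nodes are stored in parent-before-child order and that each non-root node records its parent index and whether it is the upper child). It computes the traversal probabilities $p^k_i$ recursively and the expected tree risk of eq. (6) with a squared loss.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def traversal_probs(nodes, Phi):
    """nodes: list of dicts with keys 'beta', 'theta', 'parent', 'is_upper';
    node 0 is the root. Returns P with P[i, k] = p(v^k | x_i)."""
    P = np.zeros((Phi.shape[0], len(nodes)))
    P[:, 0] = 1.0                                    # every input reaches the root
    for k, node in enumerate(nodes[1:], start=1):    # parents come before children
        par = node['parent']
        up = sigmoid(Phi @ nodes[par]['beta'] - nodes[par]['theta'])
        P[:, k] = P[:, par] * (up if node['is_upper'] else 1.0 - up)
    return P

def expected_tree_risk(nodes, Phi, y, P):
    """Eq. (6) with squared loss: (1/n) sum_i sum_k p_i^k (phi(x_i)^T beta^k - y_i)^2."""
    risk = 0.0
    for k, node in enumerate(nodes):
        risk += np.sum(P[:, k] * (Phi @ node['beta'] - y) ** 2)
    return risk / Phi.shape[0]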

Expected tree costs. The cost of a test-input is the cumulative cost across all classifiers along its path through the CSTC tree. Figure 1 illustrates an example of a CSTC tree with all paths highlighted in color. Every test-input must follow exactly one of the paths from the root to a terminal element. Let $L$ denote the set of all terminal elements (e.g., in Figure 1 we have $L = \{v^7, v^8, v^9, v^{10}\}$), and for any $v^l \in L$ let $\pi^l$ denote the set of all classifier nodes along the unique path from the root $v^0$ down to terminal element $v^l$ (e.g., $\pi^9 = \{v^0, v^2, v^5\}$). The evaluation and feature cost of this unique path is exactly

\[
c^l = \underbrace{\sum_t e_t \Big\| \sum_{v^j \in \pi^l} |\beta^j_t| \Big\|_0}_{\text{evaluation cost}}
\;+\; \underbrace{\sum_\alpha c_\alpha \Big\| \sum_{v^j \in \pi^l} \sum_t |F_{\alpha t}\beta^j_t| \Big\|_0}_{\text{feature cost}}.
\]

This term is analogous to eq. (2), except that the cost $e_t$ of the weak learner $h_t$ is paid if any of the classifiers $v^j$ in path $\pi^l$ use this weak learner (i.e. assign $\beta^j_t$ non-zero weight). Similarly, the cost $c_\alpha$ of a feature $f_\alpha$ is paid exactly once if any of the weak learners of any of the classifiers along $\pi^l$ require it. Once computed, a feature or weak learner can be reused by all classifiers along the path for free (as the computation can be cached very efficiently).

Given an input $x_i$, the probability of reaching terminal element $v^l \in L$ (traversing along path $\pi^l$) is $p^l_i = p(v^l|x_i)$. Therefore, the marginal probability that a training input (picked uniformly at random from the training set) reaches $v^l$ is $p^l = \sum_i p(v^l|x_i)\, p(x_i) = \frac{1}{n}\sum_{i=1}^{n} p^l_i$. With this notation, the expected cost for an input traversing the CSTC tree becomes $E[c^l] = \sum_{v^l \in L} p^l c^l$. Using our $\ell_0$-norm relaxation in eq. (3) on both $\ell_0$ norms in $c^l$ gives the final expected tree cost penalty

\[
\sum_{v^l \in L} p^l \Big[ \sum_t e_t \sqrt{\sum_{v^j \in \pi^l} (\beta^j_t)^2} \;+\; \sum_\alpha c_\alpha \sqrt{\sum_{v^j \in \pi^l} \sum_t (F_{\alpha t}\beta^j_t)^2} \Big],
\]

which naturally encourages weak learner and feature re-use along paths through the CSTC tree.
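A sketch of this relaxed expected tree cost penalty (our own code; `paths` lists, for each terminal element, the classifier nodes $\pi^l$ on its path, and `p_l` holds the corresponding marginal probabilities):

import numpy as np

def expected_tree_cost_penalty(betas, paths, p_l, F, e, c):
    """betas: dict node_id -> weight vector (T,); paths: list of node-id lists (pi^l);
    p_l: array of marginal probabilities of reaching each terminal element."""
    total = 0.0
    for prob, path in zip(p_l, paths):
        B = np.stack([betas[k] for k in path])                    # (|pi^l|, T)
        eval_pen = np.sum(e * np.sqrt((B ** 2).sum(axis=0)))      # sum_t e_t sqrt(sum_j (beta^j_t)^2)
        # sum_alpha c_alpha sqrt(sum_j sum_t (F_{alpha,t} beta^j_t)^2)
        feat_pen = np.sum(c * np.sqrt(((F[:, None, :] * B) ** 2).sum(axis=(1, 2))))
        total += prob * (eval_pen + feat_pen)
    return total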

Optimization problem. We combine the risk (6) with the cost penalties and add the $\ell_1$-regularization term (which is unaffected by our probabilistic splitting) to obtain the global optimization problem (5). (We abbreviate the empirical risk at node $v^k$ as $\ell^k_i = \ell(\phi(x_i)^\top\beta^k, y_i)$.)

4.2. Optimization Details

There are many techniques to minimize the loss in (5). We use a cyclic optimization procedure, solving $\frac{\partial L}{\partial(\beta^k, \theta^k)}$ for each classifier node $v^k$ one at a time, keeping all other nodes fixed. For a given classifier node $v^k$, the traversal probabilities $p^j_i$ of a descendant node $v^j$ and the probability $p^l$ of an instance reaching a terminal element also depend on $\beta^k$ and $\theta^k$ (through their recursive definitions) and must be incorporated into the gradient computation.

To minimize (5) with respect to the parameters $\beta^k, \theta^k$, we use the lemma below to overcome the non-differentiability of the square-root terms (and $\ell_1$ norms) resulting from the $\ell_0$-relaxations (3).

Lemma 1. Given any $g(x) > 0$, the following holds:

\[
\sqrt{g(x)} = \min_{z > 0} \; \frac{1}{2}\Big[\frac{g(x)}{z} + z\Big]. \qquad (7)
\]

The lemma can be proved by noting that $z = \sqrt{g(x)}$ minimizes the function on the right hand side. Further, it is shown in (Boyd & Vandenberghe, 2004) that the right hand side is jointly convex in $x$ and $z$, so long as $g(x)$ is convex.

For each square-root or $\ell_1$ term we introduce an auxiliary variable (i.e., $z$ above) and alternate between minimizing the loss in (5) with respect to $\beta^k, \theta^k$ and with respect to the auxiliary variables. The former is performed with conjugate gradient descent and the latter can be computed efficiently in closed form. This pattern of block-coordinate descent followed by a closed-form minimization is repeated until convergence. Note that the loss is guaranteed to converge to a fixed point because each iteration decreases the loss function, which is bounded below by 0.
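As a toy illustration of this variational substitution (our own example with $g(x) = x^2 + 1$ and plain gradient steps instead of conjugate gradient; it is not the full CSTC objective), the sketch below alternates the closed-form update $z = \sqrt{g(x)}$ with a gradient step on the smooth surrogate $\frac{1}{2}[g(x)/z + z]$.

import numpy as np

def g(x):                  # any smooth g(x) > 0; a toy choice here
    return x ** 2 + 1.0

def grad_g(x):
    return 2.0 * x

def minimize_sqrt_g(x0, lr=0.1, iters=200):
    """Minimize sqrt(g(x)) via Lemma 1: alternate the closed-form z = sqrt(g(x))
    with gradient steps on the smooth surrogate 0.5 * (g(x)/z + z)."""
    x = x0
    for _ in range(iters):
        z = np.sqrt(g(x))                  # closed-form minimizer for fixed x
        x -= lr * grad_g(x) / (2.0 * z)    # gradient of 0.5 * g(x) / z with respect to x
    return x, np.sqrt(g(x))

print(minimize_sqrt_g(3.0))   # converges towards x = 0, where sqrt(g(x)) = 1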

Initialization. The minimization of eq. (5) is non-convex and therefore initialization dependent. However, minimizing eq. (5) with respect to the parameters of leaf classifier nodes is convex, as the loss function, after the substitutions based on Lemma 1, becomes jointly convex (because of the lack of descendant nodes). We therefore initialize the tree top-to-bottom, starting at $v^0$, and optimize over $\beta^k$ by minimizing (5) while considering all descendant nodes of $v^k$ as "cut off" (thus pretending node $v^k$ is a leaf).

Tree pruning. To obtain a more compact model and to avoid overfitting, the CSTC tree can be pruned with the help of a validation set. As each node is a classifier, we can apply the CSTC tree to a validation set and compute the validation error at each node. We prune away nodes that, upon removal, do not decrease the performance of CSTC on the validation set (in the case of ranking data, we can even use validation NDCG as our pruning criterion).

[Figure 2] CSTC on synthetic data. The box at left describes the artificial data set. The rest of the figure shows the CSTC tree built for the data set. At each node we show a plot of the predictions made by that classifier, and after each node the weight vector that was selected to make predictions and send instances to child nodes (if applicable).

Fine-tuning. The relaxation in (3) makes the exact $\ell_0$ cost terms differentiable and is well suited to approximate which dimensions in a vector $\beta^k$ should be assigned non-zero weights. The mixed-norm does, however, impact the performance of the classifiers because (different from the $\ell_0$ norm) larger weights in $\beta$ incur larger penalties in the loss. We therefore introduce a post-processing step to correct the classifiers for this unwanted regularization effect. We re-optimize all predictive classifiers (classifiers with terminal-element children, i.e. classifiers that make final predictions), while clamping all features with zero weight to strictly remain zero:

\[
\min_{\bar\beta^k} \; \sum_i p^k_i\, \ell(\phi(x_i)^\top\bar\beta^k, y_i) + \rho|\bar\beta^k|
\quad \text{subject to: } \bar\beta^k_t = 0 \text{ if } \beta^k_t = 0. \qquad (8)
\]

The final CSTC tree uses these re-optimized weight vectors $\bar\beta^k$ for all predictive classifier nodes $v^k$.
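A sketch of this re-optimization for a single predictive node (our own illustration, using a squared loss and plain gradient steps; the mask simply freezes the zero pattern produced by the relaxed objective):

import numpy as np

def fine_tune_node(Phi, y, p_k, beta_k, rho=0.01, lr=1e-3, iters=500):
    """Re-optimize one predictive node's weights in the spirit of eq. (8),
    keeping every coordinate that the relaxed objective zeroed out clamped at zero."""
    mask = (beta_k != 0).astype(float)          # free coordinates; zeros stay zero
    beta = beta_k.copy()
    for _ in range(iters):
        resid = Phi @ beta - y                                   # per-input prediction error
        grad = Phi.T @ (p_k * resid) + rho * np.sign(beta)       # weighted risk + l1 subgradient
        beta -= lr * grad * mask                                 # no update on clamped coordinates
    return beta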

5. Results

In this section, we first evaluate CSTC on a carefully constructed synthetic data set to test our hypothesis that CSTC learns specialized classifiers that rely on different feature subsets. We then evaluate the performance of CSTC on the large-scale Yahoo! Learning to Rank Challenge data set and compare it with state-of-the-art algorithms.

5.1. Synthetic data

We construct a synthetic regression dataset, sampled from the four quadrants of the $X,Z$-plane, where $X = Z = [-1, 1]$. The features belong to two categories: cheap features, $\mathrm{sign}(x)$ and $\mathrm{sign}(z)$ with cost $c=1$, which can be used to identify the quadrant of an input; and four expensive features $y_{++}, y_{+-}, y_{-+}, y_{--}$ with cost $c=10$, which represent the exact label of an input if it is from the corresponding region (and a random number otherwise). Since we do not transform the feature space for this synthetic data set, we have $\phi(x) = x$, and $F$ (the weak-learner feature-usage matrix) is the $6 \times 6$ identity matrix. By design, a perfect classifier can use the two cheap features to identify the sub-region of an instance and then extract the correct expensive feature to make a perfect prediction. The minimum feature cost of such a perfect classifier is exactly $c=12$ per instance. The labels are sampled from Gaussian distributions with quadrant-specific means $\mu_{++}, \mu_{-+}, \mu_{+-}, \mu_{--}$ and variance 1. Figure 2 shows the CSTC tree and the predictions of test inputs made by each node. In every path along the tree, the first two classifiers split on the two cheap features and identify the correct sub-region of the input. The final classifier extracts a single expensive feature to predict the labels. As a result, the mean squared errors on the training and testing data both approach 0.
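For readers who want to reproduce this setup, here is a sketch of a matching data generator (our own construction; the exact quadrant means, sample size, and feature ordering used in the paper are not specified, so the values below are placeholders):

import numpy as np

def make_synthetic(n=1000, means=(1.0, 2.0, 3.0, 4.0), seed=0):
    """Four-quadrant regression data: two cheap sign features (cost 1) and
    four expensive 'oracle' features (cost 10), one per quadrant."""
    rng = np.random.RandomState(seed)
    xz = rng.uniform(-1, 1, size=(n, 2))
    quad = (xz[:, 0] > 0).astype(int) * 2 + (xz[:, 1] > 0).astype(int)   # quadrant index 0..3
    y = rng.randn(n) + np.asarray(means)[quad]          # quadrant-specific Gaussian labels
    expensive = rng.randn(n, 4)                         # random unless it is the matching quadrant
    expensive[np.arange(n), quad] = y                   # oracle feature reveals the label
    X = np.column_stack([expensive, np.sign(xz)])       # 6 features total
    cost = np.array([10, 10, 10, 10, 1, 1])             # feature extraction costs
    return X, y, cost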


[Figure 3] The test ranking accuracy (NDCG@5) versus cost of various cost-sensitive classifiers: stage-wise regression (Friedman, 2001), a single cost-sensitive classifier, Early exit with s = 0.2, 0.3, 0.5 (Cambazoglu et al., 2010), Cronus optimized (Chen et al., 2012), and CSTC with and without fine-tuning. CSTC maintains its high retrieval accuracy significantly longer as the cost budget is reduced. Note that fine-tuning does not improve NDCG significantly because, as a metric, it is insensitive to mean squared error.

5.2. Yahoo! Learning to Rank

To evaluate the performance of CSTC on real-world tasks, we test our algorithm on the public Yahoo! Learning to Rank Challenge data set[3] (Chapelle & Chang, 2011). The set contains 19,944 queries and 473,134 documents. Each query-document pair $x_i$ consists of 519 features. An extraction cost, which takes on a value in the set {1, 5, 20, 50, 100, 150, 200}, is associated with each feature.[4] The unit of these values is the time required to evaluate a weak learner $h_t(\cdot)$. The label $y_i \in \{4, 3, 2, 1, 0\}$ denotes the relevancy of a document to its corresponding query, with 4 indicating a perfect match. In contrast to Chen et al. (2012), we do not inflate the number of irrelevant documents (by counting them 10 times). We measure performance using NDCG@5 (Jarvelin & Kekalainen, 2002), a preferred ranking metric when multiple levels of relevance are available. Unless otherwise stated, we restrict CSTC to a maximum of 10 nodes. All results are obtained on a desktop with two 6-core Intel i7 CPUs. Minimizing the global objective requires less than 3 hours to complete, and fine-tuning the classifiers takes about 10 minutes.

[3] http://learningtorankchallenge.yahoo.com
[4] The extraction costs were provided by a Yahoo! employee.

[Figure 4] (Left) The pruned CSTC tree generated from the Yahoo! Learning to Rank data set. (Right) Jaccard similarity coefficients between the predictive classifiers ($v^3, v^4, v^5, v^6, v^{14}$) within the learned CSTC tree.

Comparison with prior work. Figure 3 shows a comparison of CSTC with several recent algorithms for test-time cost-sensitive learning. We show NDCG versus cost (in units of weak learner evaluations). The plot shows different stages in our derivation of CSTC: the initial cost-insensitive ensemble classifier $H'(\cdot)$ (Friedman, 2001) from section 3 (stage-wise regression), a single cost-sensitive classifier as described in eq. (4), the CSTC tree (5), and the CSTC tree with fine-tuning (8). We obtain the curves by varying the accuracy/cost trade-off parameter $\lambda$ (and perform early stopping based on the validation data for fine-tuning). For the CSTC tree we evaluate the settings $\lambda \in \{\tfrac{1}{3}, \tfrac{1}{2}, 1, 2, 3, 4, 5, 6\}$. In the case of stage-wise regression, which is not cost-sensitive, the curve is simply a function of boosting iterations.

For competing algorithms, we include Early exit (Cambazoglu et al., 2010), which improves upon stage-wise regression by short-circuiting the evaluation of unpromising documents at test-time, reducing the overall test-time cost. The authors propose several criteria for rejecting inputs early, and we use the best-performing method, "early exits using proximity threshold". For Cronus (Chen et al., 2012), we use a cascade with a maximum of 10 nodes. All hyper-parameters (cascade length, keep ratio, discount, early-stopping) were set based on a validation set. The cost/accuracy curve was generated by varying the corresponding trade-off parameter $\lambda$.

As shown in the graph, CSTC significantly improves the cost/accuracy trade-off curve over all other algorithms. The power of Early exit is limited in this case, as the test-time cost is dominated by feature extraction rather than evaluation cost. Compared with Cronus, CSTC has the ability to identify features that are most beneficial to different groups of inputs. It is this ability which allows CSTC to maintain high NDCG significantly longer as the cost budget is reduced.


[Figure 5] The proportion of features, grouped by cost, that are extracted at different depths of the CSTC tree. The number of features in each cost group is indicated in parentheses in the legend: c=1 (123), c=5 (31), c=20 (191), c=50 (125), c=100 (16), c=150 (32), c=200 (1). More expensive features (c >= 20) are gradually extracted as we go deeper.

Note that CSTC with fine-tuning achieves only a very small improvement over CSTC without it. Although the fine-tuning step decreases the mean squared error on the test set, it has little effect on NDCG, which is based only on the relative ranking of the documents (as opposed to their exact predictions). Moreover, because we fine-tune prediction nodes until the validation NDCG decreases, only a small amount of fine-tuning occurs for the majority of $\lambda$ values.

Input space partition. Figure 4 (left) shows a pruned CSTC tree ($\lambda = 4$) for the Yahoo! data set. The number above each node indicates the average label of the training inputs passing through that node. We can observe that different branches aim at different parts of the input domain. In general, the upper branches focus on correctly classifying higher-ranked documents, while the lower branches target low-rank documents. Figure 4 (right) shows the Jaccard matrix of the predictive classifiers ($v^3, v^4, v^5, v^6, v^{14}$) from the same CSTC tree. The matrix shows a clear trend that the Jaccard coefficients decrease monotonically away from the diagonal. This indicates that classifiers share fewer features in common if their average labels are further apart (the most different classifiers, $v^3$ and $v^{14}$, have only 64% of their features in common) and validates that classifiers in the CSTC tree extract different features in different regions of the tree.

Feature extraction. We also investigate the features extracted by individual classifier nodes. Figure 5 shows the fraction of features, with a particular cost, extracted at different depths of the CSTC tree for the Yahoo! data. We observe a general trend that as depth increases, more features are used. However, cheap features (c <= 5) are fully extracted early on, whereas expensive features (c >= 20) are extracted by classifiers sitting deeper in the tree, where each individual classifier only copes with a small subset of inputs. The expensive features are used to classify these subsets of inputs more precisely. The only feature that has cost 200 is extracted at all depths; it seems essential to obtain high NDCG (Chen et al., 2012).

6. Conclusions

We introduce the Cost-Sensitive Tree of Classifiers (CSTC), a novel learning algorithm that explicitly addresses the trade-off between accuracy and expected test-time CPU cost in a principled fashion. The CSTC tree partitions the input space into sub-regions and identifies the most cost-effective features for each one of these regions, allowing it to match the high accuracy of the state-of-the-art at a small fraction of the cost. We obtain the CSTC algorithm by formulating the expected test-time cost of an instance passing through a tree of classifiers and relaxing it into a continuous cost function. This cost function can be minimized while learning the parameters of all classifiers in the tree jointly. By making the test-time cost vs. accuracy trade-off explicit, we enable high-performance classifiers that fit into computational budgets and can reduce unnecessary energy consumption in large-scale industrial applications. Further, engineers can design highly specialized features for particular edge-cases of their input domain, and CSTC will automatically incorporate them on-demand into its tree structure.

Acknowledgements. KQW, ZX, MK, and MC are supported by NIH grant U01 1U01NS073457-01 and NSF grants 1149882 and 1137211. The authors thank John P. Cunningham for clarifying discussions and suggestions.

References

Bengio, S., Weston, J., and Grangier, D. Label embedding trees for large multi-class tasks. NIPS, 23:163-171, 2010.

Boyd, S. P. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Breiman, L. Classification and Regression Trees. Chapman & Hall/CRC, 1984.

Busa-Fekete, R., Benbouzid, D., Kegl, B., et al. Fast classification using sparse decision DAGs. In ICML, 2012.

Cambazoglu, B. B., Zaragoza, H., Chapelle, O., Chen, J., Liao, C., Zheng, Z., and Degenhardt, J. Early exit optimizations for additive machine learned ranking systems. In WSDM, pp. 411-420, 2010.

Chapelle, O. and Chang, Y. Yahoo! Learning to Rank Challenge overview. In JMLR: Workshop and Conference Proceedings, volume 14, pp. 1-24, 2011.

Chapelle, O., Shivaswamy, P., Vadrevu, S., Weinberger, K., Zhang, Y., and Tseng, B. Boosted multi-task learning. Machine Learning, 85(1):149-173, 2011.

Chen, M., Xu, Z., Weinberger, K. Q., and Chapelle, O. Classifier cascade for minimizing feature evaluation cost. In AISTATS, 2012.

Deng, J., Satheesh, S., Berg, A. C., and Fei-Fei, L. Fast and balanced: Efficient label tree learning for large scale object recognition. In NIPS, 2011.

Dredze, M., Gevaryahu, R., and Elias-Bachrach, A. Learning fast classifiers for image spam. In Proceedings of the Conference on Email and Anti-Spam (CEAS), 2007.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. Least angle regression. The Annals of Statistics, 32(2):407-499, 2004.

Fleck, M., Forsyth, D., and Bregler, C. Finding naked people. ECCV, pp. 593-602, 1996.

Friedman, J. H. Greedy function approximation: a gradient boosting machine. The Annals of Statistics, pp. 1189-1232, 2001.

Gao, T. and Koller, D. Active classification based on value of classifier. In NIPS, pp. 1062-1070, 2011.

Grubb, A. and Bagnell, J. A. SpeedBoost: Anytime prediction with uniform near-optimality. In AISTATS, 2012.

Jarvelin, K. and Kekalainen, J. Cumulated gain-based evaluation of IR techniques. ACM TOIS, 20(4):422-446, 2002.

Jordan, M. I. and Jacobs, R. A. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181-214, 1994.

Karayev, S., Baumgartner, T., Fritz, M., and Darrell, T. Timely object recognition. In Advances in Neural Information Processing Systems 25, pp. 899-907, 2012.

Kowalski, M. Sparse regression using mixed norms. Applied and Computational Harmonic Analysis, 27(3):303-324, 2009.

Lefakis, L. and Fleuret, F. Joint cascade optimization using a product of boosted classifiers. In NIPS, pp. 1315-1323, 2010.

Pujara, J., Daume III, H., and Getoor, L. Using classifier cascades for scalable e-mail classification. In CEAS, 2011.

Saberian, M. and Vasconcelos, N. Boosting classifier cascades. In NIPS, pp. 2047-2055, 2010.

Scholkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

Viola, P. and Jones, M. J. Robust real-time face detection. IJCV, 57(2):137-154, 2004.

Wang, J. and Saligrama, V. Local supervised learning through space partitioning. In Advances in Neural Information Processing Systems 25, pp. 91-99, 2012.

Weinberger, K. Q., Dasgupta, A., Langford, J., Smola, A., and Attenberg, J. Feature hashing for large scale multitask learning. In ICML, pp. 1113-1120, 2009.

Xu, Z., Weinberger, K., and Chapelle, O. The greedy miser: Learning under test-time budgets. In ICML, pp. 1175-1182, 2012.

Zheng, Z., Zha, H., Zhang, T., Chapelle, O., Chen, K., and Sun, G. A general boosting method and its application to learning ranking functions for web search. In NIPS, pp. 1697-1704, 2008.