

Bayesian Online Learning for Multi-label and Multi-variate Performance Measures

Xinhua Zhang (University of Alberta, [email protected])
Thore Graepel (Microsoft Research, Cambridge, [email protected])
Ralf Herbrich (Microsoft Research, Cambridge, [email protected])

Abstract

Many real world applications employ multi-variate performance measures, and each example can belong to multiple classes. The currently most popular approaches train an SVM for each class, followed by ad hoc thresholding. Probabilistic models using Bayesian decision theory are also commonly adopted. In this paper, we propose a Bayesian online multi-label classification framework (BOMC) which learns a probabilistic linear classifier. The likelihood is modeled by a graphical model similar to TrueSkill™, and inference is based on Gaussian density filtering with expectation propagation. Using samples from the posterior, we label the testing data by maximizing the expected F1-score. Our experiments on the Reuters1-v2 dataset show that BOMC compares favorably to state-of-the-art online learners in macro-averaged F1-score and training time.

(Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.)

1 Introduction

Real world applications often involve a large number of classes, and each example can be associated with multiple classes. For instance, many web-related objects such as blogs, bookmarks, and RSS feeds are attached with tags, which are essentially a form of categorization. A news article on "Obama supported the $170 billion AIG bailout after intense debate" can be associated with insurance, economics, and politics. In the search industry, revenue comes from clicks on the ads embedded in the result page; the selection and placement of ads can be significantly improved if ads are automatically tagged, or further categorized into a hierarchy or ontology. This setting is referred to as multi-label classification in machine learning, and it is also useful in many other applications such as bioinformatics (Seki & Mostafa, 2005) and video retrieval (Qi et al., 2007).

Learning with multi-label data usually faces the following practical challenges:

1. The problem scale is large in the number of data points n, the number of features D, and the number of classes C. Usually we can afford at most O(nDC) computations, so efficiency is critical and expensive operations such as pairwise comparison must be avoided.

2. Often multi-variate performance measures are used, e.g. the macro-average Fβ-score and the area under the ROC curve. They couple the labels of all data points and/or classes in a nondecomposable way. As a result, models learned by minimizing the training error often perform poorly under such a measure, and it is important to calibrate the trained model according to the testing data.

3. Labels can be highly correlated, and many applications employ a tree-structured ontology. One example is the Pascal challenge on large scale hierarchical text classification, which is based on the ODP web directory data: lshtc.iit.demokritos.gr.

Existing algorithms for multi-label classification can be categorized along three dimensions: a) batch vs. online, b) frequentist vs. Bayesian, and c) using structure in the label space vs. treating the labels as independent. These dimensions help us analyze how well a learning algorithm addresses the above three challenges, and eventually motivate our new algorithm.

A typical max-margin frequentist method generalizes the binary hinge loss to the maximum inconsistency (Elisseeff & Weston, 2001). Intuitively, for each pair of an associated label c and a non-associated label c′, the linear score of c is expected to exceed that of c′ by a certain margin. However, this method may take O(C²) time, and is hence inapplicable when the number of classes is large. Among the probabilistic methods, mixture models are the most natural. They assume that each document has an unknown "topic", and each word is generated by the topic through a multinomial distribution. To cater for the multi-label scenario, McCallum (1999) proposed expanding the latent topic space to the power set of the topics.


Unfortunately, all of these batch methods are very expensive to train, and hence do not scale well. For large datasets, online learning becomes effective: it employs cheap updates and works well for streaming data. Although it has been widely used for binary classification, it is less studied in the multi-label scenario. For example, the additive online category ranking method (Crammer & Singer, 2003) uses pairwise class comparison but is made efficient by precomputation. Bayesian online learning (Opper, 1998) has also been studied. These methods essentially perform assumed density filtering, where at each step the posterior of the model is updated based on the likelihood of a single data point, followed by approximations, e.g. using Gaussians (Minka, 2001). We are unaware of any published Bayesian online learner for multi-label classification.

Bayesian methods learn a distribution over a family of models. They are both useful and commonly adopted. Although learning and applying distributions over models is generally more computationally expensive than point estimates, they provide more flexibility in decision making and allow the model to be used for different purposes. A case in point is the second challenge above: multi-variate performance measures. Joachims (2005) tailored the training of SVMs to a multi-variate measure; however, the testing examples were still labeled by applying the learned model independently. Other frequentist methods also rely on ad hoc thresholding. In contrast, with a distribution over models available, the Bayesian method provides a principled framework for labeling the test data by optimizing the posterior expectation of the multi-variate measure in a batch fashion. Also, the model can be estimated independently of the performance measure. This is especially useful for online learning, where data points are intrinsically decoupled.

Finally, to make use of structure in the label space, as called for by the third challenge, some frequentist methods such as that of Rousu et al. (2006) use the framework of maximum margin Markov networks, where the class hierarchy is represented by a Markov tree. This tree plays a key role in the definition of the discrepancy between labels and of the joint kernels (kernels on the pair of feature and label). On the Bayesian side, the most straightforward way to incorporate label interdependence is through conditional random fields (CRFs), based on which Ghamrawi & McCallum (2005) directly incorporated label co-occurrences into the features. Interestingly, this CRF model can also induce the structure of the labels from the data, instead of relying on a given structure as assumed by Rousu et al. (2006). However, the CRF was trained in a batch fashion and it is not clear whether it can also be learned efficiently in the stochastic online setting for multi-label data.

We propose a Bayesian online multi-label classification framework (BOMC) which learns a probabilistic model of a linear classifier (Section 2). The labels are loosely coupled via a global bias for the multi-label scenario. The training labels are incorporated to update the posterior of the classifiers via a graphical model similar to TrueSkill™ (Herbrich et al., 2007). Inference is based on assumed density filtering through the stream of training data, and expectation propagation (Minka, 2001) is applied to each training example (Section 3). This allows us to efficiently learn from a large amount of training data. Using samples from the posterior of the model, we label the testing examples by maximizing the expected Fβ-score (Section 4). Encouraging experimental results are presented in Section 5, including comparisons of macro-average Fβ-score and training time. Section 6 concludes the paper with future work.

2 A Bayesian model for multi-label data

Suppose we have n training examples with feature vectors {x_i ∈ R^D}_{i=1..n}. Assume there are C classes {1, ..., C} =: [C], and in the multi-label setting the label vector y_i ∈ {0,1}^C encodes y_ic = 1 if example x_i is relevant to class c, and 0 otherwise.

Our model uses a probabilistic linear discriminant w_c for each class c; the {w_c}_c are independent diagonal Gaussians whose means and variances are estimated from the training data. We start from a special case of multi-label classification, multi-class, where exactly one label is relevant. Our key model is the likelihood p(y | {w_c}_c, x), the probability of the label y given the weights {w_c}_c. By Bayes' rule, the posterior of the {w_c}_c can be computed via p({w_c}_c | y, x) ∝ p(y | {w_c}_c, x) p({w_c}_c | x).

2.1 Multi-class case

We model the likelihood using the factor graph shown in Figure 1 (below the dashed line), where class 2 is assumed to be the correct class. a_c = 〈w_c, x〉 (inner product) is simply a linear discriminant, encoded by the factor F_wa(w_c, a_c) := δ(a_c − 〈w_c, x〉), where δ is the Dirac/impulse function. To model noise for practical purposes, Gaussian noise N(0, β²) is added to a_c, yielding f_c; this is represented by the factor F_af(a_c, f_c) := N(f_c − a_c, β²). Our key assumption on the labeling mechanism is that the likelihood is non-zero only when f_2 is greater than all other f_c by a margin ε. This rule is implemented by first introducing a difference node d_c = f_2 − f_c via the factor F_fd(f_c, f_2, d_c) := δ(d_c − (f_2 − f_c)), and then checking whether d_c exceeds ε: F_d(d_c) := I(d_c > ε), where I(x) := 1 if x is true and 0 otherwise.
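To make the labeling mechanism concrete, here is a minimal simulation of these factors. This is our own illustrative Python sketch, not the authors' implementation (which was in F#); the function name is ours, and the defaults β = 0.01 and ε = 1 follow the experimental settings of Section 5.

```python
import numpy as np

rng = np.random.default_rng(0)

def multiclass_label_consistent(W, x, y, beta=0.01, eps=1.0):
    """One draw of the noisy-discriminant mechanism of Figure 1.

    W : (C, D) array holding one weight vector w_c per class
    x : (D,) feature vector;  y : index of the true class
    Returns 1 iff the sampled noisy scores satisfy f_y - f_c > eps for
    every competing class c, i.e. all factors I(d_c > eps) are satisfied.
    """
    a = W @ x                                    # a_c = <w_c, x>
    f = a + beta * rng.standard_normal(len(a))   # f_c ~ N(a_c, beta^2)
    d = f[y] - np.delete(f, y)                   # d_c = f_y - f_c
    return int(np.all(d > eps))
```

Under this reading, the likelihood p(y | {w_c}_c, x) is exactly the probability that this indicator equals 1 under the Gaussian noise.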

Figure 1: A factor graph for multi-class classification. The graph corresponds to an example x whose label is 2. (Above the dashed line sit the priors for w1, ..., wC; below it, per class, the discriminant factors δ(a_c − 〈w_c, x〉), the noise factors N(f_c − a_c, β²), the difference factors δ(d_c − (f_2 − f_c)), and the comparison factors I(d_c > ε).)

Figure 2: Multi-label classification via pairwise comparison. Classes 2 and 4 are relevant; d_ij = f_i − f_j, where i is relevant and j is not.

Figure 3: Multi-label classification via total ordering and a global bias b, with d_c := f_c − b, comparison factors I(· > ε) for relevant classes and I(· < −ε) for irrelevant ones.

By definition, the product of the factors below the dashed line in Figure 1 is α p(y, a, f, d | w, x), where α is independent of w. So the product of all the factors in Figure 1 is proportional to p(y | w, x) p(w | x) in w. Therefore, the posterior p({w_c}_c | y, x) can be obtained by simply marginalizing out a, f, and d in the graph.

It is noteworthy that our likelihood model and factor graph are very similar to the TrueSkill™ diagram (Herbrich et al., 2007). Our factor graph corresponds to a fixed example x, while in TrueSkill™ it corresponds to a match; the classes in our setting correspond to the teams. The parameters we estimate are the weights, which are linearly combined using x, while TrueSkill™ learns the players' skills, which are combined according to how the teams are formed.

2.2 Multi-label case

In the multi-label scenario, the likelihood model can be extended using the pairwise-comparison principle of Elisseeff & Weston (2001). Figure 2 illustrates this idea: the noisy discriminant values f_c of relevant classes are enforced to exceed those of the irrelevant classes. Unfortunately, this may cost O(C²) computations, which is not affordable. As a simplification, we assume a total order underlying the relevance of the labels. This translates into thresholding the discriminants of the classes by a global bias, as illustrated in Figure 3. One can further incorporate a "local" bias for each class by, e.g., adding an artificial constant feature; this could even eliminate the need for the global bias and decouple all the classes. We compare these two models in the experiments.
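A corresponding sketch for the total-order model of Figure 3, with the margin constraints d_c > ε for relevant and d_c < −ε for irrelevant classes (again our own hypothetical Python, not the authors' code):

```python
import numpy as np

def multilabel_label_consistent(W, b, x, relevant, beta=0.01, eps=1.0,
                                rng=np.random.default_rng(0)):
    """One draw of the global-bias mechanism of Figure 3.

    Relevant classes need a noisy score f_c with d_c = f_c - b > eps;
    irrelevant classes need d_c < -eps.
    """
    C = W.shape[0]
    f = W @ x + beta * rng.standard_normal(C)   # noisy discriminants
    d = f - b                                   # d_c := f_c - b
    rel = np.zeros(C, dtype=bool)
    rel[list(relevant)] = True
    return int(np.all(d[rel] > eps) and np.all(d[~rel] < -eps))
```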

3 Online learning and inference

To estimate the model w and b from the training data, we adopt the Gaussian density filtering scheme (GDF, Maybeck, 1982), which is also used by Herbrich et al. (2007). It employs a Gaussian prior p0(w), and at each iteration "absorbs" the likelihood of one training example (x_i, y_i), computes the posterior

$$p_i(\mathbf{w}) := p(\mathbf{w} \,|\, \mathbf{x}_i, \mathbf{y}_i) \propto p_{i-1}(\mathbf{w})\, p(\mathbf{y}_i \,|\, \mathbf{w}, \mathbf{x}_i),$$

and approximates it by the Gaussian that is closest in the sense of the Kullback-Leibler divergence.
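Structurally, GDF is a single sequential sweep over the data. A minimal skeleton (our own sketch; `ep_update` is a placeholder for the per-example approximation described in Section 3.1):

```python
def gaussian_density_filtering(prior, stream, ep_update):
    """One pass of GDF over the training stream.

    prior     : initial Gaussian belief, e.g. (mean, variance) arrays
    stream    : iterable of training examples (x_i, y_i)
    ep_update : maps the current Gaussian belief and one example to a
                Gaussian approximation of belief * likelihood
                (KL projection via moment matching).
    """
    belief = prior                               # p_0(w)
    for x_i, y_i in stream:
        belief = ep_update(belief, x_i, y_i)     # p_i(w), re-Gaussianized
    return belief
```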

In this paper, we restrict the prior and posterior to diagonal Gaussians, though covariance could be incorporated at a much higher computational cost. As analyzed before, the posterior can be computed by marginalizing out a, f, and d in the factor graph of Figure 3. This marginalization and the subsequent Gaussian approximation can be performed effectively by expectation propagation (EP, Minka, 2001).

3.1 EP and message passing schedule

Intuitively, EP is similar to loopy belief propagation, but further approximates the messages as much as possible. In particular, it approximates the marginals of the factors by Gaussians via matching the first and second moments. Consequently, the posterior is also approximated by a Gaussian. Since the set of factors used in our model is the same as in TrueSkill™, the message formulae can be found in Table 1 of Herbrich et al. (2007).

One important issue in implementing EP is the message passing schedule. There is no loop in any of the graphical models of Figures 1 to 3. However, they all contain non-Gaussian factors I(· > ε), which necessitates running EP repeatedly on the graph. Observe that the shortest paths between these factors only involve the factors {α_c, β_c}, the variables {d_c}, and the bias b (see Figure 3). Hence we only need to run EP iteratively over b and {α_c, d_c, β_c}_c. This significantly reduces the cost of each EP iteration from O(DC) (for all weights) to O(C).¹ In practice, suppose we only send messages from factors to variables; then we just need to repeatedly perform:

① {α_c → d_c} for c = 1..C;  ② {β_c → d_c} for c = 1..C;  ③ {α_c → b} for c = 1..C.

¹ After EP converges, it still takes O(DC) complexity to record the final posterior.
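The only non-Gaussian step in these updates is the moment matching against the truncation factor I(d > ε). A sketch of that correction, using the standard v and w functions (cf. Table 1 of Herbrich et al., 2007); SciPy is assumed:

```python
import numpy as np
from scipy.stats import norm

def v(t):
    """Additive correction to the mean of a Gaussian truncated from below."""
    return norm.pdf(t) / norm.cdf(t)

def w(t):
    """Multiplicative correction to the variance under the same truncation."""
    return v(t) * (v(t) + t)

def moment_match_greater(mu, sigma2, eps):
    """Best Gaussian fit (by moment matching) to N(mu, sigma2) * I(d > eps)."""
    sigma = np.sqrt(sigma2)
    t = (mu - eps) / sigma
    return mu + sigma * v(t), sigma2 * (1.0 - w(t))
```

(The I(d < −ε) factor for irrelevant classes follows by symmetry.)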

3.2 Dynamic learning

Our model is static, while many real world applications benefit from modeling temporal and spatial evolution. For example, the categorization rules of a news wire may vary with time. Besides, GDF depends on the random order of the training examples, and the belief of our model only propagates in the forward direction of the data stream. In the batch setting, we may add dynamic factors between the weight nodes of the factor graphs of adjacent time steps to allow smooth temporal variation. Dangauthier et al. (2008) extended TrueSkill™ to dynamic scenarios, where EP is performed back and forth over the whole dataset. While theoretically appealing, this is very expensive in both time and space, and hence we stick to GDF in this paper.

4 Generalization for multi-variate performance measures

Given a set of test data X_test := {x_i ∈ R^D : i ∈ [n]}, our task is to label each x_i with a subset of [C] using the posterior of the weights {w_c ∼ N(µ_c, Σ_c) : c ∈ [C]}, where Σ_c = diag(σ²_c,1, ..., σ²_c,D), and of the global bias b ∼ N(µ0, σ0²). Our objective is to optimize some multi-variate performance measure, as found, e.g., in many applications like information retrieval. In the case of binary classification, let l ∈ {0,1}^n be the reference labels and y ∈ {0,1}^n a predicted labeling. Multi-variate measures such as the Fβ-score and the area under the ROC curve require that the predicted labels on the test set be optimized as a whole. For example, the F1-score is defined as the harmonic mean of precision and recall:

$$\mathrm{F_1\text{-}score}(\mathbf{y}, \mathbf{l}) := \frac{2\sum_{i=1}^n y^i \cdot l^i}{\sum_{i=1}^n y^i + \sum_{i=1}^n l^i}. \qquad (1)$$

Furthermore, in the multi-label scenario, Eq. (1) defines for each class c an F1-score F1(c), and the overall macro-average F1-score (Lewis et al., 2004) is defined as the average $\sum_c \mathrm{F_1}(c)/C$. Clearly, it can be optimized by maximizing F1(c) for each c independently. So in the sequel we will focus on binary classification and omit the class index c when it is clear from the context. Following the principles of Bayesian decision theory, we will optimize the expected value of the multi-variate performance measure under the posterior model.

Let y^i be a Bernoulli random variable, with y^i = 1 indicating that x_i belongs to class c according to our model, and 0 otherwise. Given an instantiation of w and b, we define the label y by p(y = 1 | w, b) := I(〈w, x〉 − b > 0). Therefore, using the posterior of w and b, we have

$$p(y = 1) = \mathbb{E}_{\mathbf{w},b}\left[p(y = 1 \,|\, \mathbf{w}, b)\right] = \Phi\!\left(\frac{\langle\boldsymbol{\mu}, \mathbf{x}\rangle - \mu_0}{\sqrt{\sigma_0^2 + \mathbf{x}^\top \Sigma\, \mathbf{x}}}\right), \qquad (2)$$

where Φ is the cumulative distribution function of the standard normal distribution. Since there is no theoretical guarantee that thresholding p(y = 1) at 0.5 optimizes the multi-variate performance measure, we will label X_test in a much more principled Bayesian fashion, based on the joint distribution of all labels y := (y^1, ..., y^n)⊤. Given the model, all labels {y^i} are assumed to be independent. However, after integrating out w and b, the independence is lost under

$$p(\mathbf{y} \,|\, X_{\text{test}}) := \mathbb{E}_{\mathbf{w},b}\left[\prod_{i=1}^n p(y^i \,|\, \mathbf{x}_i, \mathbf{w}, b)\right]. \qquad (3)$$

4.1 Expected F1-score and optimization

Suppose we label X_test by l ∈ {0,1}^n. Then the expected F1-score is

$$\mathrm{ExpFs}(\mathbf{l}) := \mathbb{E}_{\mathbf{y}\sim p(\mathbf{y})}\left[\mathrm{F_1\text{-}score}(\mathbf{y}, \mathbf{l})\right],$$

and it is natural to choose the l which maximizes it:

$$\mathbf{l}^* := \operatorname*{argmax}_{\mathbf{l}\in\{0,1\}^n} \mathrm{ExpFs}(\mathbf{l}) = \operatorname*{argmax}_{\mathbf{l}\in\{0,1\}^n} \mathbb{E}_{\mathbf{y}}\left[\mathrm{F_1\text{-}score}(\mathbf{y}, \mathbf{l})\right]. \qquad (4)$$

In general, closed-form solutions rarely exist for maximizing expected multi-variate measures, and computing the expectation in (4) is intractable, so we resort to approximations. The algorithm most closely related to problem (4) is that of Jansche (2007). Intuitively, for a fixed value of $\sum_i l^i$, l appears only in the numerator of the objective and optimization becomes easier. However, in order to solve $\operatorname{argmax}_{\mathbf{l}:\, \sum_i l^i = r} \mathrm{ExpFs}(\mathbf{l})$ by simply sorting the E[y^i], Jansche (2007) assumed that the y^i are independent, which is unrealistic in both theory and practice.

Although we do recognize the importance of the correlation between the {y^i}, it is too expensive to compute and store. We resort to a heuristic which respects the order of the p(y^i = 1) but tunes the threshold: given a threshold θ ∈ [0,1], we consider the class to be relevant if, and only if, p(y^i = 1) > θ. So

$$\mathbf{l}(\theta) := \big(I(p(y^1 = 1) > \theta), \ldots, I(p(y^n = 1) > \theta)\big)^\top. \qquad (5)$$

We then find a deterministic labeling by maximizing the expected F1-score of l(θ) with respect to θ ∈ [0,1]:

$$\theta^* := \operatorname*{argmax}_{\theta\in[0,1]} \mathrm{ExpFs}(\mathbf{l}(\theta)). \qquad (6)$$
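Operationally, (5)-(6) amount to a one-dimensional sweep over candidate thresholds. A sketch (our own; `expfs` stands for any evaluator of ExpFs(l), e.g. the sample-based estimate of Eq. (7) below):

```python
import numpy as np

def best_threshold(p, expfs, thetas=np.linspace(0.0, 1.0, 21)):
    """Eqs. (5)-(6): build l(theta) = I(p(y^i = 1) > theta) on a grid of
    candidate thresholds and keep the maximizer of the expected F1-score.

    p     : (n,) array of marginals p(y^i = 1) from Eq. (2)
    expfs : callable evaluating ExpFs(l) for a 0/1 labeling l
    """
    labelings = [(p > th).astype(int) for th in thetas]
    scores = [expfs(l) for l in labelings]
    best = int(np.argmax(scores))
    return thetas[best], labelings[best]
```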

This reduction of the search space from {0,1}^n to [0,1] significantly simplifies optimization, although l(θ*) is not guaranteed to recover l*. Given l = l(θ), there is still no closed form for ExpFs(l), so we evaluate it approximately based on samples {ỹ_1, ..., ỹ_S} drawn independently and identically distributed (iid) from p(y) (via iid samples of w and b, then thresholding 〈w, x〉 − b at 0):

$$\widehat{\mathrm{ExpFs}}(\mathbf{l}) := \frac{1}{S}\sum_{s=1}^S \frac{2\sum_{i=1}^n \tilde{y}_s^i\, l^i}{\sum_{i=1}^n \tilde{y}_s^i + \sum_{i=1}^n l^i}. \qquad (7)$$

The concentration of the estimate (7) around ExpFs(l) can easily be quantified in probability by McDiarmid's inequality (Herbrich, 2002, Theorem A.119). Unfortunately, a naïve application of (7) costs O(nSCD) time, which is impractical for large datasets. We will design a more efficient algorithm in Section 4.3 using "sufficient statistics". Before that, we first justify the labeling criteria in (5) and (6).

Figure 4: Example curves of the estimated ExpFs(l(θ)) (solid blue) and F1-score(l(θ), y*) (dotted red) as functions of θ, for classes 4, 6, 55, and 74.
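For reference, the estimator (7) itself is only a few lines; a sketch of ours (the guard against an all-zero denominator is our own addition):

```python
import numpy as np

def expfs_hat(Y, l):
    """Eq. (7): empirical expected F1-score from S posterior samples.

    Y : (S, n) 0/1 matrix; row s holds the sampled labels obtained by
        thresholding <w_s, x_i> - b_s at 0 for posterior sample s
    l : (n,) candidate 0/1 labeling of the test set
    """
    num = 2.0 * (Y * l).sum(axis=1)        # 2 * sum_i y_s^i l^i
    den = Y.sum(axis=1) + l.sum()          # sum_i y_s^i + sum_i l^i
    return float(np.mean(num / np.maximum(den, 1)))
```

Combined with the sweep above: `theta_star, l_star = best_threshold(p, lambda l: expfs_hat(Y, l))`.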

4.2 Soundness of Bayesian labeling criteria

Our labeling criterion l* := argmax over l ∈ {0,1}^n of ExpFs(l) is deemed sound if l* is "close" to the ground truth y*, as long as p(w, b) has been well estimated. However, the intractability of finding l* precludes a direct check. Fortunately, we can indirectly check the soundness of max over θ ∈ [0,1] of ExpFs(l(θ)) by comparing two curves:

1. Expected F1-score: ExpFs(l(θ)) versus θ.
2. True F1-score: F1-score(l(θ), y*) versus θ.

If these two curves are "similar", this suggests that optimizing ExpFs(l(θ)) over θ is a good proxy for maximizing the real testing F1-score against the ground truth. In practice, we can only use the sample-based estimate (7) of these quantities and its maximizer θ̂*.

Figure 4 shows an experimental comparison of the estimated ExpFs(l(θ)) and F1-score(l(θ), y*) as functions of θ. It uses the topics group of the Reuters dataset with 5 random samples. Due to space constraints, only 4 typical plots are shown. Both curves follow a roughly similar trend. Indeed, we only need the maximizer of the estimated ExpFs(l(θ)) (solid) to approximately attain the maximum of F1-score(l(θ), y*) (dotted), i.e. F1-score(l(θ̂*), y*) should be close to max over θ of F1-score(l(θ), y*). In this example, this is pretty much the case: the first term is 60.97 after summing up all 101 classes, while the second term is 63.26.

4.3 Efficient calculation of the empirical expected F1-score

We design an efficient algorithm to compute the estimated ExpFs(l(θ)), using the Reuters dataset as an example. Here C = 300, D = 5·10⁴, n = 10⁵, the average number of non-zero features per example is D̄ = 70, and we use G = 20 candidate values of θ. Our key idea is to collect three "sufficient statistics" α, β, γ derived from the definition (for class c and threshold θ_g):

$$\widehat{\mathrm{ExpFs}}_c(\mathbf{l}(\theta_g)) = \frac{1}{S}\sum_{s=1}^{S} \frac{2\,\overbrace{\sum_{i=1}^{n} I\big(\langle \mathbf{x}_i, \tilde{\mathbf{w}}_{s,c}\rangle - \tilde{b}_s > 0\big)\, I\big(p(y_c^i = 1) > \theta_g\big)}^{=:\,\alpha_{c,s,g}}}{\underbrace{\sum_{i=1}^{n} I\big(\langle \mathbf{x}_i, \tilde{\mathbf{w}}_{s,c}\rangle - \tilde{b}_s > 0\big)}_{=:\,\beta_{c,s}} + \underbrace{\sum_{i=1}^{n} I\big(p(y_c^i = 1) > \theta_g\big)}_{=:\,\gamma_{c,g}}}.$$

α, β, γ are cheap in space, but it is challenging to compute α and β efficiently due to the following constraints. 1) Memory or privacy constraints force the test data to be accessed as a stream, which cannot be stored or revisited. In some cases, although revisiting is allowed, we can only afford at most a dozen passes due to the cost of IO and parsing. 2) Sampling is also expensive in time and space. For the Reuters dataset, one sample of w costs 8CD bytes = 120 MB, and it takes O(nCD̄) time to apply one sample to all the test data, i.e. about 2×10⁹ operations. Hence we can neither compute nor store more than a dozen samples of w, and so we let S = 10.

If we are only allowed to visit the test dataset in a single pass, then for each test example we must apply all the samples of w. Since there is not enough memory to store all the weight samples, we have to regenerate these samples for every test example. Furthermore, to ensure good statistical performance, we also store the seed of the random number generator for all the weight components, which allows us to apply the same samples of w to all the test examples.
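Putting the two constraints together, here is a single-pass sketch of the statistics collection for one class. This is our own hypothetical Python, not the authors' F# implementation; it reuses `prob_relevant` from Section 4, and the fixed seed regenerates identical weight samples for every example, mimicking the stored-seed trick just described.

```python
import numpy as np

def stream_sufficient_stats(test_stream, mu_w, var_w, mu_b, var_b,
                            thetas, S=10, seed=0):
    """Collect alpha[s, g], beta[s], gamma[g] in one pass for one class."""
    G = len(thetas)
    alpha, beta, gamma = np.zeros((S, G)), np.zeros(S), np.zeros(G)
    for x in test_stream:                        # each example seen once
        rng = np.random.default_rng(seed)        # same samples for every x
        over = prob_relevant(mu_w, var_w, mu_b, var_b, x) > thetas
        gamma += over                            # I(p(y^i = 1) > theta_g)
        for s in range(S):
            w = mu_w + np.sqrt(var_w) * rng.standard_normal(mu_w.size)
            b = mu_b + np.sqrt(var_b) * rng.standard_normal()
            pos = (w @ x - b) > 0                # I(<x_i, w_s> - b_s > 0)
            beta[s] += pos
            alpha[s] += pos * over
    return alpha, beta, gamma
```

The estimated ExpFs(l(θ_g)) is then the mean over s of 2α_{s,g}/(β_s + γ_g).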

5 Empirical evaluation

In this section, we compare the empirical performance of several variants of our Bayesian online multi-label classifier (BOMC) with a batch SVM and two state-of-the-art online learning classifiers. We focus on macro-average F1-score and training time; the dataset used is Reuters1-v2.

5.1 Dataset

The Reuters1-v2 dataset (Lewis et al., 2004) consists of three groups of categories: topics, industries, and regions, which contain 103, 354, and 366 categories (classes) respectively. It has 804,414 documents, and every document is associated with zero or more labels from each of the three groups. In the experiments, the training and test sets were both sampled uniformly at random from the whole dataset.

tf-idf features (Salton & Buckley, 1988) were used for the documents. On average each example has only about 77 non-zero features, although the whole training set contains about 35k features.

5.2 Algorithms

We compared different variants of BOMC with two state-of-the-art online learners. All these algorithms randomized the order of the training examples.

To train BOMC, the feature weights had prior N(0, 1), while the prior of the global bias was N(0, 10⁴). The noise level was β = 0.01 and the margin ε = 1. EP was used for inference; on average, the relative change of the mean and variance of the messages falls below 10⁻³ after only three iterations. All variants of BOMC were implemented in F#.

In practice, many classes have only very few positive examples, and this skewness is commonly dealt with by two heuristics. Yang (2001) and Lewis et al. (2004) tune the threshold by cross validation (CV), which translates the separating hyperplane towards the negative region. However, CV is expensive and intrinsically batch. The second approach requires more prior knowledge but is much cheaper: it uses different costs for misclassifying positive and negative examples, e.g. the "-j" parameter of SVMlight. Intuitively, it increases the influence of the less common classes. Using this heuristic with SVMlight, Lewis (2001) won the TREC-2001 batch filtering evaluation.

All algorithms under comparison perform very poorly when neither heuristic is used. Therefore we assume some prior knowledge such as the frequency ratio of positive to negative examples (denoted by r). BOMC can easily encode this prior by changing the factor I(· > 1) to I(d > ln(e + 1/r)) for positive examples. The intuition behind this choice is that if the dataset has only a small fraction of positive examples, then the model is expected to correctly classify these positive examples with a higher margin (i.e. higher confidence).
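A one-line helper makes the adjustment explicit (our own sketch, with a hypothetical function name):

```python
import numpy as np

def positive_margin(r):
    """Skew-adjusted margin for positive examples of a class with
    positive-to-negative frequency ratio r: the rarer the class
    (small r), the larger the required margin ln(e + 1/r)."""
    return np.log(np.e + 1.0 / r)
```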

BOMC with sampling (BOMC_Sample): To label the test data, we drew 5 samples from the posterior of the learned model; 10 or 20 samples did not lead to any improvement.

BOMC with class mass normalization (BOMC_CMN): A much simpler but non-Bayesian heuristic for tuning the threshold is to match the zeroth-order moment (Zhu et al., 2003): sort the p(y^i = 1) and set the threshold so that the class ratio on the test set is identical to that on the training set.
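A sketch of CMN (our own Python; `train_ratio` denotes the positive-class fraction observed in training):

```python
import numpy as np

def cmn_labels(p, train_ratio):
    """Class mass normalization: mark as relevant the k examples with the
    highest p(y^i = 1), where k matches the training positive-class ratio."""
    n = len(p)
    k = int(round(train_ratio * n))   # expected number of positives
    l = np.zeros(n, dtype=int)
    if k > 0:
        l[np.argsort(p)[-k:]] = 1     # k highest-probability examples
    return l
```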

BOMC: training all classes independently (BOMC_IND_CMN and BOMC_IND_Sample): We also tried training all classes independently, i.e. each class c has its own bias b_c without using the shared global bias. Now the posterior can be computed in closed form for each training example. During testing, both CMN and sampling are again applicable; the resulting variants are called BOMC_IND_CMN and BOMC_IND_Sample, respectively.

Batch SVM (SVM_Batch): Although online learning is the focus of this paper, we also tried a batch SVM as a baseline. It has an unfair advantage over online learning because it revisits training examples. One SVM is trained for each class independently. To deal with class skewness, we tuned the threshold using nested CV (Yang, 2001), which outperformed the heuristic of reweighting false positives and false negatives. We used liblinear (http://www.csie.ntu.edu.tw/~cjlin/liblinear), implemented in C.

LaSVM: LaSVM (http://leon.bottou.org/projects/lasvm) is an online solver for SVMs which, according to Bordes et al. (2005), takes a single pass to achieve generalization performance similar to the batch SVM. We tuned the bias using the CV-based strategy because of its superior empirical performance.

Passive-Aggressive (PA): PA (Crammer et al., 2006) optimizes the regularized risk of the current example at each step. We tuned the threshold by CV, which can use PA or a batch SVM; we call these variants PA_OnlineCV and PA_BatchCV, respectively. PA is equivalent to running liblinear for one iteration.

5.3 Results

We compared all algorithms with respect to the testing macro-average F1-score and the CPU time cost for training. We randomly sampled the training and test data five times, which allowed us to plot error bars.

5.3.1 Macro-average F1-score

Figure 5 shows the macro-average F1-score as a function of the number of training examples. Among all online learners, BOMC_CMN achieves the highest macro-average F1-score most of the time. BOMC_Sample is inferior to BOMC_CMN, but still competitive. Notice that CMN is also a method for choosing the threshold, so this suggests that the model is well trained and that our sample-based method for finding the threshold can be improved. Comparing Figure 5(c) and Figure 5(d) on the group regions, we observe that BOMC_CMN benefits significantly from a large test set. This is not surprising because the assumption made by CMN is more likely to hold when the test set is large.

Figure 5: Comparison of macro-average F1-score for the category groups: (a, b) industries, (c, d) regions, (e, f) topics, with #test = 200k and 700k respectively.

Figure 6: F1-score of coupled models minus F1-score of independent models. The height of the bars represents the relative difference. The bars in five colors correspond to five random draws of training/testing examples.

Unsurprisingly, SVM_Batch usually yields the highest F1-score. However, BOMC_CMN often performs as well as or even better than SVM_Batch with a single pass, especially on the industries group or when the training set size is medium. PA_OnlineCV and PA_BatchCV perform worse than the other algorithms, probably due to being more susceptible to noise. In contrast, LaSVM employs a removal step to handle noise, and converges to the true SVM solution if multiple passes are run. LaSVM is slightly worse than BOMC_CMN, but competitive.

5.3.2 Comparing coupled and decoupled BOMC

The benefit of modeling the interaction between labels has been confirmed by existing algorithms such as those of Rousu et al. (2006) and Ghamrawi & McCallum (2005).

Our multi-label model in Figure 3 loosely couples all the classes via the global bias. A natural question is why not introduce a "local" bias for each class and learn the models of all classes independently. Figure 6 shows how much the macro-average F1-score of BOMC_CMN (Fcoupled) exceeds that of BOMC_IND_CMN (Find), as quantified by the relative macro-average F1-score difference 200 · (Fcoupled − Find)/(Fcoupled + Find).

On the industries and regions groups, BOMC_CMN delivers a significantly higher macro-average F1-score than BOMC_IND_CMN. We observed that the global bias in BOMC_CMN is learned much more confidently (with higher precision) than the feature weights and the local biases in BOMC_IND_CMN. This is because the global bias serves as a hub and is updated more often. On the topics group, BOMC_IND_CMN performs slightly better.

5.3.3 Training time

Figure 7 presents the CPU time cost for training with these algorithms. If an algorithm uses CV, only the cost of training the final model is included.

Figure 7: CPU time for training on the groups industries, regions, and topics.

The key observation is that the training time of all algorithms except LaSVM is linear in the number of training examples, which matches their algorithmic properties. Training BOMC_IND_CMN takes slightly more time than BOMC_CMN, which suggests that the cost of updating the local biases outweighs the savings from the closed-form posterior. PA and SVM_Batch can be trained faster than BOMC by a factor of 2–3, which could be attributed to the programming language (C vs. F#).

Although LaSVM is the online learner that achieves the testing F1-score closest to BOMC, it takes much more training time. This is because LaSVM operates in the dual and has not been optimized for linear kernels.

6 Conclusion and future directions

We proposed a Bayesian online learning algorithm for multi-label classification. It uses Gaussian density filtering for efficient inference and can label unseen data in a principled manner, as opposed to the ad hoc thresholding schemes used in frequentist approaches. Empirically, it delivers a favorable macro-average F1-score compared with state-of-the-art online learners, and is even competitive with a batch SVM.

This work can be extended in several directions. We are designing efficient algorithms to train the dynamic models outlined in Section 3.2, which is expected to yield more accurate models. Label noise, as studied by Kim & Ghahramani (2006), can also be modeled in a straightforward way: for example, noise that flips the label with probability ρ can be modeled by replacing the factor I(d > ε) with (1 − ρ)I(d > ε) + ρI(d < −ε). Finally, label hierarchies can also be conveniently incorporated using graphical models.

References

Bordes, A., Ertekin, S., Weston, J., & Bottou, L. (2005). Fast kernel classifiers with online and active learning. JMLR, 6, 1579–1619.

Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., & Singer, Y. (2006). Online passive-aggressive algorithms. JMLR, 7, 551–585.

Crammer, K., & Singer, Y. (2003). Ultraconservative online algorithms for multiclass problems. JMLR, 3, 951–991.

Dangauthier, P., Herbrich, R., Minka, T., & Graepel, T. (2008). TrueSkill through time: Revisiting the history of chess. In NIPS, 337–344.

Elisseeff, A., & Weston, J. (2001). A kernel method for multi-labelled classification. In NIPS, 681–688.

Ghamrawi, N., & McCallum, A. (2005). Collective multi-label classification. In CIKM, 195–200.

Herbrich, R. (2002). Learning Kernel Classifiers: Theory and Algorithms. Cambridge, MA: MIT Press.

Herbrich, R., Minka, T., & Graepel, T. (2007). TrueSkill™: A Bayesian skill rating system. In NIPS, 569–576.

Jansche, M. (2007). A maximum expected utility framework for binary sequence labeling. In ACL, 736–743.

Joachims, T. (2005). A support vector method for multivariate performance measures. In ICML, 377–384.

Kim, H.-C., & Ghahramani, Z. (2006). Bayesian Gaussian process classification with the EM-EP algorithm. IEEE PAMI, 28(12), 1948–1959.

Lewis, D. (2001). Applying support vector machines to the TREC-2001 batch filtering and routing tasks. In Text REtrieval Conference, 286–292.

Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. JMLR, 5, 361–397.

Maybeck, P. S. (1982). Stochastic Models, Estimation and Control. Academic Press.

McCallum, A. (1999). Multi-label text classification with a mixture model trained by EM. In AAAI Workshop on Text Learning.

Minka, T. (2001). Expectation Propagation for approximate Bayesian inference. Ph.D. thesis, MIT Media Lab.

Opper, M. (1998). A Bayesian approach to online learning. In On-line Learning in Neural Networks, 363–378. Cambridge University Press.

Qi, G.-J., Hua, X.-S., Rui, Y., Tang, J., Mei, T., & Zhang, H.-J. (2007). Correlative multi-label video annotation. In International Conference on Multimedia, 17–26.

Rousu, J., Saunders, C., Szedmak, S., & Shawe-Taylor, J. (2006). Kernel-based learning of hierarchical multilabel classification models. JMLR, 7, 1601–1626.

Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513–523.

Seki, K., & Mostafa, J. (2005). An application of text categorization methods to gene ontology annotation. In SIGIR, 138–145.

Yang, Y. (2001). A study on thresholding strategies for text categorization. In SIGIR, 137–145.

Zhu, X., Ghahramani, Z., & Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 912–919.