Ambiguity-Based Multiclass Active Learning
Ran Wang, Member, IEEE, Chi-Yin Chow, Member, IEEE, and Sam Kwong, Fellow, IEEE

Abstract—Most existing works on active learning (AL) focus on binary classification problems, which limits their application in various real-world scenarios. One solution to multiclass AL (MAL) is to evaluate the informativeness of unlabeled samples by an uncertainty model and select the most uncertain one for query. In this paper, an ambiguity-based strategy is proposed to tackle this problem by applying the possibility approach. First, the possibilistic memberships of unlabeled samples in the multiple classes are calculated from the one-against-all (OAA)-based support vector machine (SVM) model. Then, by employing fuzzy logic operators, these memberships are aggregated into a new concept named k-order ambiguity, which estimates the risk of labeling a sample among k classes. Afterwards, the k-order ambiguities are used to form an overall ambiguity measure to evaluate the uncertainty of the unlabeled samples. Finally, the sample with the maximum ambiguity is selected for query, and a new MAL strategy is developed. Experiments demonstrate the feasibility and effectiveness of the proposed method.

Index Terms—Active learning, ambiguity, fuzzy sets and fuzzy logic, possibility approach, multiclass.

I. INTRODUCTION

ACTIVE learning (AL) [1], known as a revised supervised learning scheme, adopts a selective sampling manner to collect a sufficiently large training set. It iteratively selects informative unlabeled samples for query, and constructs a high-performance classifier by labeling as few samples as possible. A widely used base classifier in AL is the support vector machine (SVM) [2], a binary classification technique based on statistical learning theory. Under binary settings, many successful AL strategies have been proposed for SVM, such as uncertainty reduction [3], [4], version space (VS) reduction [5], and minimum expected model loss [6]. However, extending these strategies to multiclass problems remains challenging for the following two reasons.

• Traditional SVMs are binary classifiers. In order to realize multiclass AL (MAL), it is necessary to construct an effective multiclass SVM model.

• Designing a sample selection criterion for multiple classes is much more complicated than for two classes. For instance, the size of the VS for a binary SVM is easy to calculate, while the size of the VS for multiple SVMs is hard to define.

This work is partially supported by Hong Kong RGC General Research Fund GRF grant 9042038 (CityU 11205314); the National Natural Science Foundation of China under Grant 61402460; and the China Postdoctoral Science Foundation Funded Project under Grant 2015M572386.

R. Wang is with the Department of Computer Science, City University of Hong Kong, Hong Kong, and also with the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China (e-mail: [email protected]; [email protected]).

C.-Y. Chow and S. Kwong are with the Department of Computer Science, City University of Hong Kong, Hong Kong (e-mail: [email protected]; [email protected]).

Existing SVM-based multiclass models, such as one-against-all (OAA) [7] or one-against-one (OAO) [8], decompose the multiclass problem into a set of binary problems. These models evaluate the informativeness of unlabeled samples by aggregating the output decision values of the binary SVMs, or by estimating the class probabilities of the samples.

Possibility theory [9], as an extension of fuzzy sets and fuzzy logic [10], is a commonly used technique for dealing with vagueness and imprecision in the given information. It has great potential for solving AL problems, especially for SVM under multiclass environments. First, the possibility approach can evaluate the uncertainty of unlabeled samples by aggregating a set of class memberships, which is intrinsically compatible with MAL. Moreover, the memberships in possibility theory can be independent; under this condition, an SVM-based model can compute the memberships by decomposing a multiclass problem into a set of binary problems, one for each class. Second, the possibility approach is based on an ordering structure rather than an additive structure, which makes it less rigorous in measuring the unlabeled samples for an SVM-based model. For instance, the probability approach has to consider the pairwise relations of all the classes to satisfy the additive property, whereas the possibility approach can relax the additive property and enable SVM to compute the memberships in a simpler way.

To the best of our knowledge, the application of the possibility approach to AL has not been investigated. To this end, a new pool-based MAL strategy is proposed. First, the possibilistic memberships of unlabeled samples are calculated from the OAA-based SVM model. Then, fuzzy logic operators [11], [12], [13] are employed to aggregate the memberships into a new concept named k-order ambiguity, which estimates the risk of labeling a sample among k classes. Finally, a new uncertainty measurement is proposed by integrating the k-order ambiguities. It is noteworthy that the possibility approach is no different from the probability approach for a binary SVM, since the positive and negative memberships are always complementary. However, the possibility approach is more flexible for multiclass SVM, since it defines an ordering structure with independent memberships.

The remainder of this paper is organized as follows. In Section II, we present some related works. In Section III, we design the ambiguity measure, prove its basic properties, and establish the ambiguity-based multiclass active learning strategy. In Section IV, we conduct extensive experimental comparisons to show the feasibility and effectiveness of the proposed method. Finally, conclusions and future work directions are given in Section V.


II. BACKGROUND AND RELATED WORKS

Researchers have made some efforts to realize SVM-based MAL. A number of works adopt the OAA approach to construct the base models [5], [14], [15], [16], and design the sample selection criterion by aggregating the output decision values of the binary SVMs. Specifically, Tong [5] proposed to evaluate the uncertainty of an unlabeled sample by aggregating its distances to all the SVM hyperplanes. Later, he proposed to evaluate the model loss by aggregating the VS areas of the binary SVMs after having queried an unlabeled sample and received a certain label. This method was also discussed in [15] and [16], which designed a more effective approximation method for the VS areas. Moreover, Hospedales et al. [14] proposed to evaluate the unlabeled samples by both generative and discriminative models, in order to achieve both accurate prediction and class discovery. In addition to the OAA approach, the OAO approach is also effective for constructing multiclass SVM models. For instance, Joshi et al. [17] proposed a scalable AL strategy for multiclass image classification, which estimates the pairwise probabilities of the images by the OAO approach, and selects the one with the highest value-of-information. This work also stated that entropy might not be a good sample selection criterion, since the entropy value is highly affected by the probabilities of unimportant classes. Thus, the best vs. second best (BvSB) method was developed, which only makes use of the two largest probabilities.

The possibility approach, different from the above techniques, is an uncertainty analysis tool with imprecise probabilities. It is driven by the minimal specificity principle, which states that any hypothesis not known to be impossible cannot be ruled out. Given a C-class problem, assume µ_i(x) is the membership of sample x in the i-th class (i = 1, …, C); the memberships are said to be possibilistic/fuzzy if µ_i(x) ∈ [0, 1], and further probabilistic if $\sum_{i=1}^{C} \mu_i(x) = 1$.

In the context of AL, if the memberships are possibilistic, then:

1) µ_i(x) = 0 means that class i is rejected as impossible for x;
2) µ_i(x) = 1 means that class i is totally possible for x;
3) at least one class is totally possible for x, i.e., $\max_{i=1,\dots,C}\{\mu_i(x)\} = 1$.

With a normalisation process, condition 3 can be relaxed to $\max_{i=1,\dots,C}\{\mu_i(x)\} \neq 0$.

There are two schemes for handling possibilistic memberships: 1) transform them into probabilities and apply probability approaches; or 2) aggregate them by fuzzy logic operators. It is stated in [18] that transforming possibilities to probabilities, or conversely, can be useful in many cases. However, the two are not equivalent representations: probabilities are based on an additive structure, while possibilities are based on an ordering structure and are more explicit in handling imprecision. Thus, we only focus on the second scheme in this paper.

On the other hand, Wang et al. [19] proposed the concept of classification ambiguity to measure the non-specificity of a set, and applied it to the induction of fuzzy decision trees. Given a set R with a number of labeled samples, its classification ambiguity is defined as:

$\mathrm{Ambiguity}(R) = \sum_{i=1}^{C} (p^*_i - p^*_{i+1}) \log i$, (1)

where (p_1, …, p_C) is the class frequency in R, and (p^*_1, …, p^*_C, p^*_{C+1}) is the normalisation of (p_1, …, p_C, 0) in descending order, i.e., 1 = p^*_1 ≥ … ≥ p^*_{C+1} = 0. Later, Wang et al. [20] applied the same measure to the induction of the extreme learning machine tree, and used it for attribute selection during the induction process. It is noteworthy that both [19] and [20] applied the ambiguity measure to a set of probabilities. However, applying this concept to a set of possibilities might be more effective. Besides, other than measuring the non-specificity of a set, it also has the potential to measure the uncertainty of an unlabeled sample.
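For concreteness, the following minimal NumPy sketch (our own illustration, not code from [19] or [20]; the function name classification_ambiguity is hypothetical) computes Eq. (1) from a vector of class frequencies:

```python
import numpy as np

def classification_ambiguity(p):
    """Eq. (1): sort frequencies in descending order, normalise so the
    largest value is 1, pad with a trailing 0, and sum the weighted drops."""
    p = np.sort(np.asarray(p, dtype=float))[::-1]
    p_star = np.append(p / p[0], 0.0)      # 1 = p*_1 >= ... >= p*_{C+1} = 0
    ranks = np.arange(1, p_star.size)      # i = 1, ..., C
    return float(np.sum((p_star[:-1] - p_star[1:]) * np.log(ranks)))

# A uniform set is maximally non-specific; a pure set has ambiguity 0.
print(classification_ambiguity([5, 5, 5]))   # log 3 ~ 1.0986
print(classification_ambiguity([10, 0, 0]))  # 0.0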

Motivated by the above statements, in this paper we apply the possibility approach to MAL, and develop an ambiguity-based strategy for SVM.

III. AMBIGUITY-BASED MULTICLASS ACTIVE LEARNING

A. Ambiguity Measure

In fuzzy theory, fuzzy logic provides functions for aggregating fuzzy sets and fuzzy relations. These functions are called aggregation operators. In MAL, if we treat the possibilistic memberships of unlabeled samples as fuzzy sets, then the aggregation operators in fuzzy logic can be used to evaluate the informativeness of these samples.

Given a set of possibilistic memberships {µ_1, …, µ_C} where µ_i ∈ [0, 1], i = 1, …, C, Frelicot and Mascarilla [11] proposed the fuzzy OR-2 aggregation operator ▽^(2) as:

$\nabla^{(2)}_{i=1,\dots,C}\,\mu_i = \nabla_{i=1,\dots,C}\left(\mu_i \mathbin{\triangle} \left(\nabla_{j \neq i}\,\mu_j\right)\right)$, (2)

where (△, ▽) is a dual pair of t-norm and t-conorm. It is demonstrated in [12] that when the standard t-norm a △ b = min{a, b} and the standard t-conorm a ▽ b = max{a, b} are selected, Eq. (2) has the property $\nabla^{(2)}_{i=1,\dots,C}\,\mu_i = \mu'_2$, where µ'_2 is the second largest membership among {µ_1, …, µ_C}. Based on this property, they proposed a specific fuzzy OR-2 operator, i.e., $\nabla^{(2)}_{i=1,\dots,C}\,\mu_i = \triangle_{i=1,\dots,C}\left(\nabla_{j \neq i}\,\mu_j\right)$, and proved that it is continuous, monotonic, and symmetric under some boundary conditions.

Mascarilla et al. [13] further extended Eq. (2) to a generalized k-order fuzzy OR operator. Let C = {1, …, C}, P(C) be the power set of C, and P_k = {A ∈ P(C) : |A| = k}, where |A| is the cardinality of subset A. The k-order fuzzy OR operator ▽^(k) is defined as:

$\nabla^{(k)}_{i=1,\dots,C}\,\mu_i = \triangle_{A \in \mathcal{P}_{k-1}}\left(\nabla_{j \in \mathcal{C} \setminus A}\,\mu_j\right)$, (3)

where (△, ▽) is a dual pair of t-norm and t-conorm, and k ∈ {2, …, C}. By theoretical proof, they also demonstrated that when the standard t-norm a △ b = min{a, b} and the standard t-conorm a ▽ b = max{a, b} are selected, Eq. (3) has the property $\nabla^{(k)}_{i=1,\dots,C}\,\mu_i = \mu'_k$, where µ'_k is the k-th largest membership among {µ_1, …, µ_C}.

It is noteworthy that there are various combinations of aggregation operators and t-norms, but the study of them is not the focus of this work. For simplicity, we adopt the standard t-norm and standard t-conorm with the aforementioned aggregation operators.
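Under the standard min/max pair, the k-th-largest property is easy to check numerically. The brute-force sketch below (our own illustration under the stated min/max assumption, not the authors' code) enumerates the subsets A of size k − 1 exactly as in Eq. (3):

```python
import numpy as np
from itertools import combinations

def k_order_or(mu, k):
    """Eq. (3) with t-norm = min and t-conorm = max."""
    idx = set(range(len(mu)))
    # min over all subsets A with |A| = k-1 of the max over C \ A
    return min(max(mu[j] for j in idx - set(A))
               for A in combinations(idx, k - 1))

mu = [0.9, 0.3, 0.7, 0.1]
for k in range(2, 5):
    # equals the k-th largest membership, as stated above
    assert np.isclose(k_order_or(mu, k), sorted(mu, reverse=True)[k - 1])
```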

In decision theory, ambiguity indicates the risk of classifying a sample based on its memberships; a larger membership has a higher influence on the risk. Obviously, the larger the ambiguity, the harder it is to classify the sample. In this section, we design an ambiguity measure to capture this. For the sake of clarity, we start from a set of axioms. Given a sample x with a set of possibilistic memberships {µ_1(x), …, µ_C(x)}, where µ_i(x) ∈ [0, 1] (i = 1, …, C) is the possibility of x in the i-th class, the ambiguity measure on x, denoted as A_C(x) = A(µ_1(x), …, µ_C(x)), is a continuous mapping [0, 1]^C → [0, a] (where a ∈ R^+) satisfying the following three axioms:

1) Symmetry: A_C(x) is a symmetric function of µ_1(x), …, µ_C(x).
2) Monotonicity: A_C(x) is monotonically decreasing in max{µ_i(x)}, and monotonically increasing in the other µ_i(x).
3) Boundary condition: A_C(x) = a when µ_1(x) = µ_2(x) = … = µ_C(x) ≠ 0; A_C(x) = 0 when max{µ_i(x)} ≠ 0 and all the other µ_i(x) = 0.

According to Axiom 1, the ambiguity value is unchanged under any permutation of the memberships. According to Axiom 2, increasing the greatest membership or decreasing the other memberships leads to a smaller ambiguity, i.e., the classification risk of a sample is lower when the gaps between its greatest membership and the other memberships are larger. According to Axiom 3, the classification of a sample is hardest when it belongs equally to all the classes, and easiest when only the greatest membership is nonzero.

Frelicot et al. [12] applied the fuzzy OR-2 operator to the classification problem and proposed the 2-order ambiguity to evaluate the classification risk of x, defined as:

$A^{(2)}_C(x) = \nabla^{(2)}_{i=1,\dots,C}\,\mu_i(x) \,\big/\, \nabla_{i=1,\dots,C}\,\mu_i(x)$. (4)

Eq. (4) reflects the risk of labeling x between two classes, i.e., a class i ∈ C and the most preferred class among the others C \ i. With the standard t-norm and t-conorm, Eq. (4) amounts to comparing the largest membership with the second largest membership. In this paper, we extend the 2-order ambiguity measure to a generalized k-order ambiguity measure:

$A^{(k)}_C(x) = \nabla^{(k)}_{i=1,\dots,C}\,\mu_i(x) \,\big/\, \nabla_{i=1,\dots,C}\,\mu_i(x)$. (5)

Similarly, Eq. (5) reflects the risk of labeling x among k classes, i.e., the classes A ∈ P(C) with |A| = k − 1 and the most preferred class among the others C \ A. With the standard t-norm and t-conorm, Eq. (5) amounts to comparing the largest membership with the k-th largest membership.

In order to get the precise uncertainty information of x, we have to consider all the ambiguities, i.e., $A^{(k)}_C(x)$, k = 2, …, C. The most efficient way is to aggregate them by a weighted sum. As a result, we propose the overall ambiguity measure of Definition 1.

Definition 1: (Ambiguity) Given a sample x with a set of possibilistic memberships {µ_1(x), …, µ_C(x)}, where µ_i(x) ∈ [0, 1] (i = 1, …, C) is the possibility of x in the i-th class, the ambiguity of x is defined as:

$A_C(x) = \sum_{k=2}^{C} w_k A^{(k)}_C(x)$, (6)

where w_k is the weight for the k-order ambiguity.

It is a consensus that in classifying a sample, the large memberships are critical and the small memberships are less important. With the standard t-norm and t-conorm, the k-order ambiguity is proportional to the k-th largest membership. As a result, the 2-order ambiguity should be given the highest weight, and the C-order ambiguity the lowest. We therefore propose the nonlinear weight function $w_k = (\log k - \log(k-1))^{\gamma}$, since 1) it is positive and decreasing in k, and 2) it gives higher importance to the large memberships. In this weight function, the scale factor γ is a positive integer. Fig. 1 plots the normalised weights $(\log k - \log(k-1))^{\gamma}/(\log 2 - \log 1)^{\gamma}$ for γ = 1, …, 5, i.e., the weights rescaled so that the weight for k = 2 equals 1. Obviously, a larger γ further magnifies the importance of the large memberships, i.e., the 2-order ambiguity becomes even more important and the C-order ambiguity even less important.

Fig. 1: Values of $(\log k - \log(k-1))^{\gamma}/(\log 2 - \log 1)^{\gamma}$ with different γ.
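With the standard operators, $A^{(k)}_C(x)$ reduces to µ'_k/µ'_1, so Eq. (6) can be computed directly from the sorted memberships. A minimal sketch of this computation (ours; the function name ambiguity is hypothetical):

```python
import numpy as np

def ambiguity(mu, gamma=1):
    """Eq. (6) with standard min/max operators: sum_k w_k * mu'_k / mu'_1,
    where w_k = (log k - log(k-1))^gamma, k = 2..C."""
    mu = np.sort(np.asarray(mu, dtype=float))[::-1]   # mu'_1 >= ... >= mu'_C
    k = np.arange(2, mu.size + 1)
    w = (np.log(k) - np.log(k - 1)) ** gamma
    return float(np.sum(w * mu[1:] / mu[0]))
```

Increasing gamma shrinks the higher-order weights faster, pushing the measure toward a pure comparison of the two largest memberships.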

B. Properties of the Ambiguity Measure

Fig. 2 shows the ambiguity value when C = 2 and γ = 1. Under these conditions, the ambiguity has several notable features: it is symmetric about µ_1(x) = µ_2(x), strictly decreasing in max{µ_1(x), µ_2(x)}, strictly increasing in min{µ_1(x), µ_2(x)}, and concave. Besides, it attains its maximum at µ_1(x) = µ_2(x) and its minimum at min{µ_1(x), µ_2(x)} = 0. Given these observations, we derive some general properties of the ambiguity measure by relaxing the conditions C = 2 and γ = 1. For the sake of clarity, we write µ_i(x) as µ_i for short, and let µ'_1 ≥ … ≥ µ'_C be the sequence µ_1, …, µ_C sorted in descending order.

Theorem 1: A_C(x) is a symmetric function of µ_1, …, µ_C.


Fig. 2: Ambiguity value A_2(x) when C = 2 and γ = 1, plotted as a surface over (µ_1(x), µ_2(x)).

Proof: Let µ'_1 ≥ … ≥ µ'_C be the sequence of µ_1, …, µ_C in descending order; then

$A_C(x) = \sum_{k=2}^{C} A^{(k)}_C(x)\,(\log k - \log(k-1))^{\gamma} = \sum_{k=2}^{C} \left(\nabla^{(k)}_{i=1,\dots,C}\,\mu_i \big/ \nabla_{i=1,\dots,C}\,\mu_i\right)(\log k - \log(k-1))^{\gamma} = \sum_{k=2}^{C} \frac{\mu'_k}{\mu'_1}(\log k - \log(k-1))^{\gamma} = \frac{\mu'_2}{\mu'_1}(\log 2 - \log 1)^{\gamma} + \frac{\mu'_3}{\mu'_1}(\log 3 - \log 2)^{\gamma} + \dots + \frac{\mu'_C}{\mu'_1}(\log C - \log(C-1))^{\gamma}$. (7)

Given any permutation of µ_1, …, µ_C, the order µ'_1 ≥ … ≥ µ'_C remains unchanged. Based on Eq. (7) and the definition of a symmetric function, the proof is straightforward.

Theorem 2: A_C(x) decreases when µ'_1 increases and all the others are unchanged; A_C(x) increases when µ'_i, i ∈ {2, …, C}, increases and all the others are unchanged.

Proof: Following Eq. (7), for i = 1, $\partial A_C(x)/\partial \mu'_1 = -\frac{\mu'_2}{\mu'^2_1}(\log 2 - \log 1)^{\gamma} - \frac{\mu'_3}{\mu'^2_1}(\log 3 - \log 2)^{\gamma} - \dots - \frac{\mu'_C}{\mu'^2_1}(\log C - \log(C-1))^{\gamma} < 0$; for i = 2, …, C, $\partial A_C(x)/\partial \mu'_i = \frac{1}{\mu'_1}(\log i - \log(i-1))^{\gamma} > 0$.

Theorem 3: A_C(x) attains its maximum at µ'_1 = … = µ'_C ≠ 0, and its minimum at µ'_1 ≠ 0 with µ'_i = 0 for i ≥ 2.

Proof: When µ'_1 = … = µ'_C ≠ 0, clearly µ'_2/µ'_1 = … = µ'_C/µ'_1 = 1; when µ'_1 ≠ 0 and µ'_i = 0 for i ≥ 2, clearly µ'_2/µ'_1 = … = µ'_C/µ'_1 = 0. Since µ'_1 ≥ … ≥ µ'_C, the proof follows directly from Eq. (7).

Theorem 4: When γ is set to 1, A_C(x) has the same form as Eq. (1).

Proof: Following Eq. (7), suppose µ'_{C+1} = 0; when γ = 1 we have:

$A_C(x) = \frac{\mu'_2}{\mu'_1}(\log 2 - \log 1) + \frac{\mu'_3}{\mu'_1}(\log 3 - \log 2) + \dots + \frac{\mu'_C}{\mu'_1}(\log C - \log(C-1)) = \left(\frac{\mu'_1}{\mu'_1} - \frac{\mu'_2}{\mu'_1}\right)\log 1 + \left(\frac{\mu'_2}{\mu'_1} - \frac{\mu'_3}{\mu'_1}\right)\log 2 + \dots + \left(\frac{\mu'_C}{\mu'_1} - \frac{\mu'_{C+1}}{\mu'_1}\right)\log C = \sum_{i=1}^{C}(\mu^*_i - \mu^*_{i+1})\log i$,

where $\mu^*_i = \mu'_i/\mu'_1$. (The regrouping collects, for each log i, the positive contribution from the term with k = i and the negative contribution from the term with k = i + 1.)

Theorems 1∼3 show that the three basic axioms given in Section III-A are satisfied by the proposed ambiguity measure, and Theorem 4 demonstrates that the proposed measure is a generalized and extended version of Eq. (1) proposed in [19].
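Theorem 4 can also be checked numerically; the snippet below (a standalone sanity check of ours, not from the paper) evaluates both sides of the identity for a random membership vector with γ = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.sort(rng.uniform(0.05, 1.0, size=6))[::-1]  # mu'_1 >= ... >= mu'_C

# Left side: Eq. (7) with gamma = 1
k = np.arange(2, mu.size + 1)
lhs = np.sum((mu[1:] / mu[0]) * (np.log(k) - np.log(k - 1)))

# Right side: Eq. (1) form with mu*_i = mu'_i / mu'_1 and mu*_{C+1} = 0
mu_star = np.append(mu / mu[0], 0.0)
i = np.arange(1, mu.size + 1)
rhs = np.sum((mu_star[:-1] - mu_star[1:]) * np.log(i))

assert np.isclose(lhs, rhs)
```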

C. Fuzzy Memberships of Unlabeled Sample

Given a binary training set $\{(x_i, y_i)\}_{i=1}^{n} \in R^m \times \{+1, -1\}$, the SVM hyperplane is defined as $w^T x + b = 0$, where $w \in R^m$ and $b \in R$. The linearly separable case is formulated as $\min_{w,b} \frac{1}{2} w^T w$, s.t. $y_i(w^T x_i + b) \geq 1$, i = 1, …, n. The nonlinearly separable case can be handled by the soft-margin SVM, which transforms the formulation into $\min_{w,b,\xi} \frac{1}{2} w^T w + \theta \sum_{i=1}^{n} \xi_i$, s.t. $y_i(w^T x_i + b) \geq 1 - \xi_i$, $\xi_i \geq 0$, i = 1, …, n, where $\xi_i$ is the slack variable introduced for $x_i$, and θ is a trade-off between the maximum margin and the minimum training error. Besides, the kernel trick [2] is adopted, which maps the samples into a higher dimensional feature space via $\phi : x \rightarrow \phi(x)$, and expresses the inner product of the feature space as a kernel function K: $\langle \phi(x), \phi(x_i) \rangle = K(x, x_i)$. By the Lagrange method, the decision function is obtained as $h(x) = \sum_{i=1}^{n} y_i \alpha_i K(x, x_i) + b$, where $\alpha_i$ is the Lagrange multiplier of $x_i$, and the final classifier is $f(x) = \mathrm{sign}(h(x))$.

SVM-based multiclass models decompose a multiclass problem into a set of binary problems. Among the solutions, the OAA approach is one of the most effective and efficient. It constructs C binary classifiers for a C-class problem, where $f_q(x) = \mathrm{sign}(h_q(x))$, q = 1, …, C, separates class q from the remaining C − 1 classes. For a testing sample $x \in R^m$, the output class is determined as $y = \arg\max_{q=1,\dots,C} h_q(x)$. In a binary SVM, the absolute decision value of a sample is proportional to its distance to the SVM hyperplane, and a larger distance represents a higher degree of certainty in its label. In the OAA-based SVM model, the membership of x in class q can be computed by the logistic function [21], i.e., $\mu_q(x) = \frac{1}{1 + e^{-h_q(x)}}$, which has the following properties:

• When $h_q(x) > 0$, $\mu_q(x) \in (0.5, 1]$, and $\mu_q(x)$ increases with the increase of $|h_q(x)|$;
• When $h_q(x) < 0$, $\mu_q(x) \in [0, 0.5)$, and $\mu_q(x)$ decreases with the increase of $|h_q(x)|$;
• When $h_q(x) = 0$, x lies on the decision boundary, and $\mu_q(x) = 0.5$.

Obviously, each membership is defined independently for a specific class. Since the memberships are possibilistic, the problem is suitable for possibility theory.
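A sketch of this membership computation (ours; oaa_memberships is a hypothetical helper) from a vector of OAA decision values:

```python
import numpy as np

def oaa_memberships(h):
    """Map OAA decision values h_q(x) to possibilistic memberships
    mu_q(x) = 1 / (1 + exp(-h_q(x))); no sum-to-one constraint applies."""
    return 1.0 / (1.0 + np.exp(-np.asarray(h, dtype=float)))

# Decision values of one sample against four classes:
print(oaa_memberships([1.2, -0.3, 0.1, -2.0]))  # ~[0.77, 0.43, 0.52, 0.12]
```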

D. Algorithm Description

By applying the ambiguity measure and the fuzzy membership calculation, the ambiguity-based MAL strategy is depicted in Algorithm 1.

Algorithm 1: Ambiguity-Based MAL
Input:
• Initial labeled set $L = \{(x_i, y_i)\}_{i=1}^{l} \in R^m \times \{1, \dots, C\}$;
• Unlabeled pool $U = \{x_j\}_{j=1}^{u} \in R^m$;
• Scale factor γ and parameters for training the base classifiers.
Output:
• Classifiers $f_1, \dots, f_C$ trained on the final labeled set.

1  Learn C SVM hyperplanes $h_1, \dots, h_C$ on L by the OAA approach, one for each class;
2  while U is not empty do
3    if the stop criterion is met then
4      Let $f_1 = \mathrm{sign}(h_1), \dots, f_C = \mathrm{sign}(h_C)$;
5      return $f_1, \dots, f_C$;
6    else
7      for each $x_j \in U$ do
8        Calculate its decision values by the SVMs, i.e., $h_q(x_j)$, q = 1, …, C;
9        Calculate its fuzzy memberships in the classes, i.e., $\mu_q(x_j) = \frac{1}{1 + e^{-h_q(x_j)}}$, q = 1, …, C;
10       Calculate its ambiguity based on Eq. (6), i.e., $A_C(x_j)$;
11     end
12     Find the unlabeled sample with the maximum ambiguity, i.e., $x^* = \arg\max_{x_j} A_C(x_j)$;
13     Query the label of $x^*$, denoted by $y^*$;
14     Let $U = U \setminus x^*$, and $L = L \cup \{(x^*, y^*)\}$;
15     Update $h_1, \dots, h_C$ based on L;
16   end
17 end
18 return $f_1, \dots, f_C$;

We now analyze the time complexity of selecting one sample in Algorithm 1. In a given iteration, suppose the number of labeled training samples is l and the number of unlabeled samples is u. Based on [22], training a radial basis function (RBF) kernel-based SVM has a worst-case complexity of O(s³), and making a prediction for one testing sample costs O(sm), where s is the number of support vectors (SVs) and m is the input dimension. Thus, training the C binary SVMs (line 1) costs at most O(Cl³), calculating the decision values of the u unlabeled samples (line 8) costs at most O(uClm), and calculating the ambiguity values of the u unlabeled samples (lines 9∼10) costs at most O(uC²). Furthermore, finding the sample with the maximum ambiguity (line 12) costs O(u). Finally, the complexity of selecting one sample in Algorithm 1 is O(Cl³) + O(uClm) + O(uC²) + O(u) ≈ O(Cl³) + O(uClm) = O(Cl(l² + um)). It is noteworthy that this is the worst-case complexity, reached when all the training samples are SVs.
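A sketch of one selection iteration (lines 1 and 7∼12) is given below. It is our own illustration built on scikit-learn (an assumption; the paper itself uses LIBSVM under MATLAB), with the paper's θ = 100 and σ = 1 mapped to SVC's C=100 and gamma = 1/(2σ²) = 0.5:

```python
import numpy as np
from sklearn.svm import SVC  # assumption: scikit-learn stands in for LIBSVM

def select_most_ambiguous(L_X, L_y, U_X, gamma=1):
    """One iteration of Algorithm 1: train OAA RBF-SVMs on the labeled set,
    compute possibilistic memberships for the pool, and return the index
    of the pool sample with maximum ambiguity (Eq. (6))."""
    classes = np.unique(L_y)
    # line 1: one binary SVM per class (OAA decomposition)
    H = np.stack([SVC(kernel='rbf', C=100.0, gamma=0.5)
                  .fit(L_X, np.where(L_y == c, 1, -1))
                  .decision_function(U_X)             # line 8: h_q(x_j)
                  for c in classes], axis=1)
    mu = 1.0 / (1.0 + np.exp(-H))                     # line 9: memberships
    mu = np.sort(mu, axis=1)[:, ::-1]                 # mu'_1 >= ... >= mu'_C
    k = np.arange(2, classes.size + 1)
    w = (np.log(k) - np.log(k - 1)) ** gamma          # weights of Eq. (6)
    amb = (mu[:, 1:] / mu[:, :1]) @ w                 # line 10: A_C(x_j)
    return int(np.argmax(amb))                        # line 12
```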

IV. EXPERIMENTAL COMPARISONS

A. Comparative Methods

Five MAL strategies are used in this paper for comparison with the proposed algorithm (Ambiguity).

1) Random Sampling (Random): During each iteration, the learner randomly selects an unlabeled sample for query. The OAA approach is adopted to train the base classifiers.

2) Margin-Based Strategy (Margin) [5]: With the OAA approach, the margin values of the C binary SVMs are aggregated into an overall margin, i.e., $m(x) = \prod_{q=1}^{C} |h_q(x)|$, and the learner selects the sample with the minimum aggregated margin, i.e., $x^* = \arg\min_x m(x)$.

3) Version Space Reduction (VS Reduction) [15], [16]: With the OAA approach, assume the original VS area of $h_q$ is $\mathrm{Area}(V^{(q)})$, and the new area after querying sample x is $\mathrm{Area}(V^{(q)}_x)$; an approximation is then applied, i.e., $\mathrm{Area}(V^{(q)}_x) \approx \frac{|h_q(x)| + 1}{2}\,\mathrm{Area}(V^{(q)})$. Finally, the sample is selected by $x^* = \arg\min_x \prod_{q=1}^{C} \mathrm{Area}(V^{(q)}_x)$.

4) Entropy-Based Strategy (Entropy) [17]: This method is based on probability theory. The OAO approach is adopted, which constructs C(C−1)/2 binary classifiers, one for each pair of classes. The classifier of class q against class g is defined as $h_{q,g}(x)$ for q < g, where x belongs to class q if $h_{q,g}(x) > 0$ and to class g if $h_{q,g}(x) < 0$. Besides, $h_{q,g}(x) = -h_{g,q}(x)$ when q > g. The pairwise probabilities of x regarding classes q and g are derived as $r_{q,g}(x) = \frac{1}{1 + e^{-h_{q,g}(x)}}$ when q < g, and $r_{q,g}(x) = 1 - r_{g,q}(x)$ when q > g. The probability of x in class q is calculated as $p_q(x) = \frac{2 \sum_{g=1, g \neq q}^{C} r_{q,g}}{C(C-1)}$. Obviously, $\sum_{q=1}^{C} p_q(x) = 1$. Finally, the sample with the maximum entropy is selected, i.e., $x^* = \arg\max_x -\sum_{q=1}^{C} p_q(x) \log p_q(x)$.

5) Best vs. Second Best (BvSB) [17]: This method applies the same probability estimation process as Entropy, but only makes use of the two most important classes. Assume the largest and second largest class probabilities of sample x are $p^*_1(x)$ and $p^*_2(x)$, respectively; then the most informative sample is selected by $x^* = \arg\min_x (p^*_1(x) - p^*_2(x))$.
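A short sketch of these two probabilistic criteria (ours, for illustration): given an antisymmetric pairwise-probability matrix r with r[q,g] + r[g,q] = 1 off the diagonal, it recovers p_q(x), the entropy score, and the BvSB score:

```python
import numpy as np

def class_probs(r):
    """p_q(x) = 2 * sum_{g != q} r_{q,g} / (C (C - 1)); the diagonal of r
    is ignored. The result sums to 1 by construction."""
    C = r.shape[0]
    return 2.0 * (r.sum(axis=1) - np.diag(r)) / (C * (C - 1))

def entropy_score(p):          # larger = more uncertain (maximized)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def bvsb_score(p):             # smaller = more uncertain (minimized)
    p1, p2 = np.sort(p)[::-1][:2]
    return float(p1 - p2)
```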

B. Experimental Design

The experiments are first conducted on 12 multiclass UCI machine learning datasets, as listed in Table I. Since separate testing samples are not available for Glass, Cotton, Libras, Dermatology, Ecoli, Yeast, and Letter, in order to have a sufficiently large unlabeled pool, 90% of the data are randomly selected as the training set and the remaining 10% as the testing set. Each input feature is normalised to [0, 1]. The initial training set is formed by two randomly chosen labeled samples from each class, and the learning stops after 100 new samples have been labeled or the selective pool becomes empty. To avoid random effects, 50 trials are conducted on the datasets with fewer than 2,000 samples, and 10 trials on the larger datasets. Finally, the average results are recorded.

For fair comparison, θ is fixed at 100 for SVM, and the RBF kernel $K(x, x_i) = \exp(-\frac{\|x - x_i\|^2}{2\sigma^2})$ with σ = 1 is adopted. Besides, γ in Eq. (6) is treated as a parameter and tuned on the training set. More specifically, the training set X is divided into two subsets, X1 and X2, of equal size. Active learning is conducted on X1 with γ fixed to 1, 2, …, 10; then the models are validated on X2 and the best γ value is selected. The selected γ values for the 12 datasets are listed in the last column of Table I.


TABLE I: Selected datasets for performance comparison

Dataset | #Train | #Test | #Feature | Feature Type | #Class | Class Distribution in Training Set | γ
Glass | 214 | 0 | 10 | Real+Integer | 6 | 70 / 76 / 17 / 13 / 9 / 29 | 5
Cotton | 356 | 0 | 21 | Real | 6 | 55 / 49 / 30 / 118 / 77 / 27 | 6
Libras | 360 | 0 | 90 | Real | 15 | 24×15 | 7
Dermatology | 366 | 0 | 34 | Real+Integer | 6 | 112 / 61 / 72 / 49 / 52 / 20 | 7
Ecoli | 336 | 0 | 7 | Real+Integer | 8 | 143 / 77 / 2 / 2 / 35 / 20 / 5 / 52 | 8
Yeast | 1,484 | 0 | 8 | Real+Integer | 10 | 244 / 429 / 463 / 44 / 51 / 163 / 35 / 30 / 20 / 5 | 7
Letter | 20,000 | 0 | 16 | Real | 26 | see note | 10
Soybean | 307 | 376 | 35 | Real+Integer | 19 | 10×9 / 40×4 / 20×2 / 6×2 / 4 / 1 | 2
Vowel | 528 | 462 | 10 | Real | 11 | 48×11 | 10
Optdigits | 3,823 | 1,797 | 64 | Real+Integer | 10 | 376 / 389 / 380 / 389 / 387 / 376 / 377 / 387 / 380 / 382 | 10
Satellite | 4,435 | 2,000 | 36 | Real | 6 | 1072 / 479 / 961 / 415 / 470 / 1038 | 6
Pen | 7,494 | 3,498 | 16 | Real+Integer | 10 | 780 / 779 / 780 / 719 / 780 / 720 / 720 / 778 / 719 / 719 | 8

Note: The class distribution of dataset "Letter" is 789 / 766 / 736 / 805 / 768 / 775 / 773 / 734 / 755 / 747 / 739 / 761 / 792 / 783 / 753 / 803 / 783 / 758 / 748 / 796 / 813 / 764 / 752 / 787 / 786 / 734.

The experiments are performed under MATLAB 7.9.0 with the "svmtrain" and "svmpredict" functions of LIBSVM, executed on a computer with a 3.16-GHz Intel Core 2 Duo CPU, 4 GB of memory, and a 64-bit Windows 7 system.

C. Empirical Studies

Among the six strategies, Random is a baseline, Margin and VS Reduction directly utilize the output decision values of the SVMs, Entropy and BvSB are probability approaches based on the OAO model, and Ambiguity is a possibility approach based on the OAA model.

Fig. 3 shows the average testing accuracy and standard deviation over the different trials for the six strategies, and the average results on the 12 datasets are shown in Fig. 4. It is clear from these results that Ambiguity obtains satisfactory performance on all the datasets except Vowel. In fact, Vowel is a difficult problem on which all the methods fail to achieve an accuracy higher than 50% after the learning stops. Besides, Ambiguity achieves very similar performance to BvSB in some cases (e.g., datasets Dermatology, Yeast and Letter). This could be due to the fact that when the scale factor γ is large enough, the ambiguity measure effectively considers only the 2-order ambiguity. Since the 2-order ambiguity is decided only by the largest and second largest memberships, Ambiguity is intrinsically the same as BvSB in this case. Furthermore, Ambiguity achieves low standard deviation on most datasets. However, all the methods show fluctuating standard deviations on datasets Letter and Pen, which could be caused by the large size of these datasets and the small number of trials on them.

Fig. 4: Average result on the 12 UCI datasets: (a) testing accuracy and (b) standard deviation versus the number of new training samples.

Another typical phenomenon observed from Fig. 3 is that Entropy, an effective uncertainty measurement for many problems, performs worse than Ambiguity, and even worse than Random, in many cases. To find the reason, we investigate the learning process on dataset Ecoli. Fig. 5 shows the class possibilities and probabilities of three unlabeled samples in one iteration. From the possibilities, the ambiguity values of the three samples are 1.722, 0.966 and 1.040; obviously, Sample 1 is the most beneficial to the learning and will be selected by Ambiguity. From the probabilities, the entropy values of the three samples are 2.953, 2.981 and 2.951; in this case, Sample 2 will be selected by Entropy. However, this might not be a good selection, since the advantage of Sample 2 over Samples 1 and 3 is trivial. This example shows that the rigorous computation of class probabilities may weaken the differences in uncertainty among samples; in particular, when the number of classes is large, the entropy value is highly affected by the probabilities of unimportant classes. In the context of active learning, the possibility approach might be more effective in distinguishing unlabeled samples.

Fig. 5: Class memberships (possibility and probability per class) of 3 samples: (a) Sample 1, (b) Sample 2, (c) Sample 3.

Table II reports the mean accuracy and standard deviation over the 100 learning iterations, as well as the final accuracy and the average time for selecting one sample. It is observed that Ambiguity achieves the highest mean accuracy and final accuracy on 11 and 9 datasets out of 12, respectively. Besides, Ambiguity is much faster than Entropy and BvSB, but slightly slower than Margin and VS Reduction. It is noteworthy that in a real active learning process, labeling a sample usually takes much more time than selecting one: labeling may take several seconds to several minutes, while selection takes only milliseconds. Assuredly, the time complexity is acceptable.

Fig. 3: Experimental comparisons on the selected UCI datasets. (a)∼(l) Testing accuracy. (m)∼(x) Standard deviation. (Panels: Glass, Cotton, Libras, Dermatology, Ecoli, Yeast, Letter, Soybean, Vowel, Optdigits, Satellite, Pen.)

TABLE II: Performance comparison on the selected datasets: mean accuracy (%) ± standard deviation, final accuracy (%), and average time for selecting one sample (seconds). Each cell gives mean ± std / final / time.

Dataset | Random | Margin | VS Reduction | Entropy | BvSB | Ambiguity
Glass | 85.16±6.05 / 91.71 / 0.0074* | 88.99±7.22 / 94.57 / 0.0105 | 89.41±7.56 / 94.86 / 0.0112 | 87.87±4.84 / 93.62 / 0.0367 | 88.53±4.91 / 93.05 / 0.0369 | 90.62±6.78 / 94.57 / 0.0112
Cotton | 81.60±5.02 / 87.22 / 0.0139* | 85.03±6.64 / 91.56 / 0.0191 | 84.67±7.38 / 92.11 / 0.0200 | 81.17±4.79 / 87.11 / 0.0869 | 83.89±5.68 / 89.78 / 0.0873 | 86.44±6.80 / 92.72 / 0.0200
Libras | 64.93±6.89 / 74.94 / 0.0381* | 66.38±8.96 / 80.06 / 0.0724 | 66.41±9.28 / 80.22 / 0.0706 | 66.85±7.95 / 75.11 / 0.2391 | 67.96±8.31 / 79.28 / 0.2341 | 70.77±8.62 / 82.78 / 0.0803
Dermatology | 95.87±1.01 / 97.14 / 0.0146* | 96.51±1.35 / 97.68 / 0.0208 | 96.97±1.36 / 97.84 / 0.0224 | 96.26±1.10 / 96.54 / 0.0946 | 96.75±1.64 / 97.78 / 0.0933 | 97.17±1.14 / 97.73 / 0.0243
Ecoli | 77.67±2.69 / 80.76 / 0.0084* | 79.47±2.69 / 82.06 / 0.0132 | 79.86±2.66 / 81.88 / 0.0137 | 76.94±2.66 / 79.29 / 0.0658 | 79.55±2.78 / 81.18 / 0.0656 | 80.43±2.58 / 82.29 / 0.0141
Yeast | 48.66±5.36 / 54.51 / 0.0206* | 48.02±4.41 / 52.51 / 0.0362 | 47.51±4.20 / 51.45 / 0.0384 | 43.64±4.12 / 49.74 / 0.3970 | 49.66±5.11 / 54.77 / 0.3984 | 50.16±4.84 / 55.00 / 0.0432
Letter | 45.32±5.10 / 53.00 / 0.0465* | 45.47±4.49 / 52.31 / 0.7616 | 46.05±4.97 / 53.31 / 0.7852 | 39.32±1.83 / 41.80 / 10.455 | 46.53±5.95 / 54.51 / 10.625 | 46.53±5.92 / 55.53 / 0.9271
Soybean | 84.93±4.54 / 89.65 / 0.0262* | 85.87±5.54 / 92.65 / 0.0606 | 86.35±5.81 / 92.58 / 0.0633 | 82.13±2.32 / 86.19 / 0.2305 | 88.89±5.13 / 92.89 / 0.2261 | 90.04±4.42 / 93.29 / 0.0634
Vowel | 38.47±4.97 / 44.85 / 0.0179* | 37.79±4.46 / 44.11 / 0.0238 | 39.24±4.74 / 44.64 / 0.0247 | 38.63±2.92 / 42.34 / 0.1557 | 43.94±5.36 / 50.26 / 0.1547 | 40.03±5.46 / 46.63 / 0.0275
Optdigits | 84.16±6.35 / 90.82 / 0.0328* | 74.30±3.64 / 81.03 / 0.5581 | 75.36±2.79 / 80.96 / 0.5012 | 82.64±10.63 / 92.37 / 1.4646 | 82.23±11.95 / 93.96 / 1.4666 | 89.53±6.15 / 95.29 / 0.6073
Satellite | 78.92±2.82 / 81.00 / 0.0110* | 79.33±2.53 / 82.21 / 0.1476 | 80.14±3.74 / 83.34 / 0.1527 | 76.98±2.28 / 79.79 / 0.9483 | 80.99±2.69 / 83.41 / 0.9027 | 81.17±3.18 / 84.09 / 0.1778
Pen | 84.77±5.12 / 90.03 / 0.0175* | 88.36±6.49 / 95.01 / 0.1162 | 85.07±6.14 / 92.60 / 0.1170 | 84.99±4.93 / 89.56 / 1.8034 | 89.52±6.21 / 95.35 / 1.7915 | 90.18±6.94 / 96.48 / 0.1440
Avg. | 72.54±4.66 / 77.97 / 0.0212* | 72.96±4.87 / 78.81 / 0.1533 | 73.09±5.05 / 78.82 / 0.1517 | 71.45±4.20 / 76.12 / 1.3314 | 74.87±5.48 / 80.52 / 1.3402 | 76.09±5.23 / 81.37 / 0.1783

Note: For each dataset, the highest mean accuracy and final accuracy are in bold face, and the minimum time for selecting one sample is marked with *.

TABLE III: Paired Wilcoxon's signed rank tests (p-values). Each cell gives the p-value on the mean accuracy / the p-value on the final accuracy.

Method | Margin | VS Reduction | Entropy | BvSB | Ambiguity
Random | 0.1514 / 0.2334 | 0.0923 / 0.1763 | 0.2036 / 0.0923 | 0.0049† / 0.0005† | 0.0005† / 0.0005†
Margin | – | 0.1763 / 0.5186 | 0.0923 / 0.0342† | 0.0122† / 0.2661 | 0.0005† / 0.0010†
VS Reduction | – | – | 0.0522 / 0.0269† | 0.0425† / 0.3013 | 0.0005† / 0.0024†
Entropy | – | – | – | 0.0010† / 0.0010† | 0.0005† / 0.0005†
BvSB | – | – | – | – | 0.0269† / 0.0425†

Note: In each comparison, the first and second values are respectively the p-values of the Wilcoxon's signed rank tests on the mean accuracy and the final accuracy. For each test, † indicates that the two referred methods are significantly different at the significance level 0.05.

Finally, Table III reports the p-values of paired Wilcoxon's signed rank tests conducted on the accuracies listed in Table II. We adopt the significance level 0.05, i.e., if the p-value is smaller than 0.05, the two referred methods are considered statistically different. It can be seen that Ambiguity is statistically different from all the other methods, considering both the mean accuracy and the final accuracy.

signed rank tests conducted on the accuracy listed in Table II.We adopt the significance level 0.05, i.e., if thep value issmaller than 0.05, the two referred methods are consideredas statistically different. It can be seen thatAmbiguity isstatistically different from all the others by consideringboththe mean accuracy and final accuracy.

D. Handwritten Digits Image Recognition Problem

We further conduct experiments on the MNIST handwritten digit image recognition problem¹, which aims to distinguish the handwritten digits 0∼9 shown in Fig. 6(a). This dataset contains 60,000 training samples and 10,000 testing samples from approximately 250 writers, with a relatively balanced class distribution. We use a gradient-based method [23] to extract 2,172 features for each sample, and select 68 features by WEKA. Different from the previous experiments, we apply batch-mode active learning on this dataset, which selects multiple samples with high diversity during each iteration. We combine the ambiguity measure with two diversity criteria proposed in [24], namely angle-based diversity (ABD) and enhanced clustering-based diversity (ECBD), and realize two batch-mode active learning strategies, i.e., Ambiguity-ABD and Ambiguity-ECBD. Besides, we compare them with batch-mode random sampling (Random), the ambiguity strategy without diversity criteria (Ambiguity), and two strategies in [24] that combine multiclass-level uncertainty (MCLU) with ABD and ECBD, i.e., MCLU-ABD and MCLU-ECBD. The initial training set contains two randomly chosen samples from each class. During each iteration, the learner considers the 40 most informative samples and selects the five most diverse ones from them. The learning stops after 60 iterations, γ is tuned to 9 for all the ambiguity-based strategies, and 10 trials are conducted. The mean accuracy and standard deviation are shown in Figs. 6(b)∼(c). It can be observed that the initial accuracy is just slightly higher than 50%; after 300 new samples (0.5% of the whole training set) have been queried, the accuracy improves by about 30%. Besides, Ambiguity-ECBD achieves the best performance, which demonstrates the potential of combining the ambiguity measure with the ECBD criterion.

Fig. 6: Batch-mode active learning results on the MNIST dataset. (a) Samples in the MNIST dataset. (b) Testing accuracy. (c) Standard deviation.

¹http://yann.lecun.com/exdb/mnist
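To make the batch-mode idea concrete, here is a toy greedy sketch of ours (it is not the ABD or ECBD algorithm of [24]): keep the 40 most ambiguous pool samples, then repeatedly add the candidate least similar, by cosine similarity, to the batch chosen so far:

```python
import numpy as np

def batch_select(ambiguity, X_pool, n_candidates=40, batch_size=5):
    """Greedy diversity-aware batch selection (illustrative only):
    keep the most ambiguous candidates, then repeatedly add the one
    whose maximum cosine similarity to the current batch is smallest."""
    cand = np.argsort(ambiguity)[::-1][:n_candidates]
    Xn = X_pool[cand]
    Xn = Xn / np.linalg.norm(Xn, axis=1, keepdims=True)
    sim = Xn @ Xn.T                      # pairwise cosine similarity
    chosen = [0]                         # start from the most ambiguous
    while len(chosen) < batch_size:
        rest = [i for i in range(len(cand)) if i not in chosen]
        nxt = min(rest, key=lambda i: sim[i, chosen].max())
        chosen.append(nxt)
    return cand[np.array(chosen)]
```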

V. CONCLUSIONS AND FUTURE WORKS

This paper proposed an ambiguity-based MAL strategy by applying the possibility approach, and realized it for the OAA-based SVM model. This strategy relaxes the additive property required by the probability approach; thus, it computes the memberships in a more flexible way and evaluates unlabeled samples less rigorously. Experimental results demonstrate that the proposed strategy achieves satisfactory performance on various multiclass problems. Future developments regarding this work are as follows. 1) In the experiments, we treat the scale factor γ as a model parameter and tune it empirically; in the future, it might be useful to discuss how to obtain the optimal γ based on the characteristics of the dataset. 2) It might be interesting to apply the proposed ambiguity measure to base classifiers other than SVMs. 3) If we transform a possibility vector into a probability vector, or conversely, the existing possibilistic and probabilistic models can be realized in a more flexible way; how to realize an effective and efficient transformation between possibility and probability for MAL will also be one of the future research directions.

REFERENCES

[1] D. Cohn, L. Atlas, and R. Ladner, "Improving generalization with active learning," Mach. Learn., vol. 15, no. 2, pp. 201–221, 1994.
[2] V. N. Vapnik, The Nature of Statistical Learning Theory. Springer Verlag, 2000.


[3] R. Wang, D. Chen, and S. Kwong, "Fuzzy rough set based active learning," IEEE Trans. Fuzzy Syst., vol. 22, no. 6, pp. 1699–1704, 2014.
[4] R. Wang, S. Kwong, and D. Chen, "Inconsistency-based active learning for support vector machines," Pattern Recogn., vol. 45, no. 10, pp. 3751–3767, 2012.
[5] S. Tong, "Active learning: theory and applications," Ph.D. dissertation, Citeseer, 2001.
[6] M. Li and I. K. Sethi, "Confidence-based active learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 8, pp. 1251–1261, 2006.
[7] R. Rifkin and A. Klautau, "In defense of one-vs-all classification," J. Mach. Learn. Res., vol. 5, pp. 101–141, 2004.
[8] T.-F. Wu, C.-J. Lin, and R. C. Weng, "Probability estimates for multi-class classification by pairwise coupling," J. Mach. Learn. Res., vol. 5, pp. 975–1005, 2004.
[9] D. Dubois and H. Prade, Possibility Theory. Springer, 1988.
[10] G. J. Klir and B. Yuan, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall, New Jersey, 1995.
[11] C. Frelicot and L. Mascarilla, "A third way to design pattern classifiers with reject options," in Proc. 21st Int. Conf. of the North American Fuzzy Information Processing Society. IEEE, 2002, pp. 395–399.
[12] C. Frelicot, L. Mascarilla, and A. Fruchard, "An ambiguity measure for pattern recognition problems using triangular-norms combination," WSEAS Trans. Syst., vol. 8, no. 3, pp. 2710–2715, 2004.
[13] L. Mascarilla, M. Berthier, and C. Frelicot, "A k-order fuzzy OR operator for pattern classification with k-order ambiguity rejection," Fuzzy Sets Syst., vol. 159, no. 15, pp. 2011–2029, 2008.
[14] T. M. Hospedales, S. Gong, and T. Xiang, "Finding rare classes: Active learning with generative and discriminative models," IEEE Trans. Knowl. Data Eng., vol. 25, no. 2, pp. 374–386, 2013.
[15] R. Yan and A. Hauptmann, "Multi-class active learning for video semantic feature extraction," in Proc. 2004 IEEE Int. Conf. Multimedia and Expo, vol. 1. IEEE, 2004, pp. 69–72.
[16] R. Yan, J. Yang, and A. Hauptmann, "Automatically labeling video data using multi-class active learning," in Proc. 9th ICCV. IEEE, 2003, pp. 516–523.
[17] A. J. Joshi, F. Porikli, and N. P. Papanikolopoulos, "Scalable active learning for multiclass image classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2259–2273, 2012.
[18] D. Dubois, H. Prade, and S. Sandri, "On possibility/probability transformations," in Fuzzy Logic, 1993, pp. 103–112.
[19] X. Z. Wang, L. C. Dong, and J. H. Yan, "Maximum ambiguity-based sample selection in fuzzy decision tree induction," IEEE Trans. Knowl. Data Eng., vol. 24, no. 8, pp. 1491–1505, 2012.
[20] R. Wang, Y.-L. He, C.-Y. Chow, F.-F. Ou, and J. Zhang, "Learning ELM-tree from big data based on uncertainty reduction," Fuzzy Sets Syst., vol. 258, pp. 79–100, 2015.
[21] S. C. H. Hoi, R. Jin, J. Zhu, and M. R. Lyu, "Batch mode active learning and its application to medical image classification," in Proc. 23rd ICML. ACM, 2006, pp. 417–424.
[22] L. Bottou and C.-J. Lin, "Support vector machine solvers," Large Scale Kernel Machines, pp. 301–320, 2007.
[23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[24] B. Demir, C. Persello, and L. Bruzzone, "Batch-mode active-learning methods for the interactive classification of remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 49, no. 3, pp. 1014–1031, 2011.

Ran Wang (S'09-M'14) received her B.Eng. degree in computer science from the College of Information Science and Technology, Beijing Forestry University, China, in 2009, and the Ph.D. degree from City University of Hong Kong in 2014. She is currently a Postdoctoral Senior Research Associate at the Department of Computer Science, City University of Hong Kong. Since 2014, she has also been an Assistant Researcher at the Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China. Her current research interests include pattern recognition, machine learning, fuzzy sets and fuzzy logic, and their related applications.

Chi-Yin Chow received the M.S. and Ph.D. degrees from the University of Minnesota-Twin Cities in 2008 and 2010, respectively. He is currently an assistant professor in the Department of Computer Science, City University of Hong Kong. His research interests include spatio-temporal data management and analysis, GIS, mobile computing, and location-based services. He is the co-founder and co-organizer of ACM SIGSPATIAL MobiGIS 2012, 2013, and 2014.

Sam Kwong (M'93-SM'04-F'13) received the B.Sc. and M.S. degrees in electrical engineering from the State University of New York at Buffalo in 1983 and the University of Waterloo, ON, Canada, in 1985, and the Ph.D. degree from the University of Hagen, Germany, in 1996. From 1985 to 1987, he was a Diagnostic Engineer with Control Data Canada. He then joined Bell Northern Research Canada as a Member of Scientific Staff. In 1990, he became a Lecturer in the Department of Electronic Engineering, City University of Hong Kong, where he is currently a Professor and Head of the Department of Computer Science. His main research interests include evolutionary computation, video coding, pattern recognition, and machine learning.

Dr. Kwong is an Associate Editor of the IEEE Transactions on Industrial Electronics, the IEEE Transactions on Industrial Informatics, and the Information Sciences Journal.