Outline
• Background
• Binary Support Vector Machine (SVM) and ψ-Learning
• Multicategory ψ-Learning and SVM
• Optimization for Multicategory ψ-Learning
• Statistical Learning Theory for Multicategory ψ-Learning
• Numerical Examples
• Summary and Future Work
An Example
• Letter image recognition: train a machine to recognize hand-written English letters and classify them correctly.
• Dataset Letter from the Statlog collection: each sample contains
  – a vector x of 16 primitive numerical attributes;
  – a response variable y (class label) representing the 26 capital letters A, B, ..., Z.
• Goal: build a classifier from the training data to recognize new letters.
• A 3-class example: consider the letters D, O, Q.
[Figure: scatter plot of the samples for letters D, O, Q in the (x1, x2) plane, using the first two attributes of x.]
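To make the running example concrete, here is a minimal sketch of loading the Letter data and keeping the 3-class subset; using the OpenML copy of the dataset (name "letter") is an assumption, since the slides do not specify a source.

```python
# Sketch: load the Statlog Letter data and keep the 3-class subset D, O, Q.
# Assumes the OpenML copy of the dataset (name "letter"); any source with the
# 16 attributes and the 26-letter label works the same way.
from sklearn.datasets import fetch_openml

letter = fetch_openml(name="letter", version=1, as_frame=True)
X, y = letter.data, letter.target            # X: 16 numeric attributes, y: 'A'..'Z'

mask = y.isin(["D", "O", "Q"])               # the 3-class example
X3, y3 = X[mask], y[mask]
print(X3.shape, y3.value_counts())
```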
[Figure: the same scatter plot for letters D, O, Q. Separable case: many possible partitions.]
Literature Review
• Traditional statistical methods: Linear/Quadratic Discriminant Analysis, Nearest Neighbor, Logistic Regression, etc.
• Machine learning
  – Active research in computer science, engineering, etc.
  – Methods: Neural Networks, Boosting, SVM, ψ-Learning.
• Goal: maximize generalization ability.
Machine Learning and Statistics
• Statistics: estimate conditional probabilities to yield classification, e.g., CART, Logistic Regression, etc.
• Machine learning: margins, as in SVM (Boser, Guyon, & Vapnik, 1992; Vapnik, 1995) and Boosting (Freund & Schapire, 1997), etc.
• Theoretical foundation
  – Statistics: function estimation.
  – Machine learning: Vapnik-Chervonenkis theory.
• Level of difficulty: classification is easier than function estimation.
Multicategory Problem
• k-class problem
  – Construct a decision function vector f = (f_1, ..., f_k) from a sample (x_i, y_i), i = 1, ..., n, drawn i.i.d. from an unknown P(x, y).
  – f_j: a large f_j(x) represents class j; f_j need not be a probability.
  – Classifier: argmax_j f_j(x).
• Accuracy: generalization error (GE), Err(f) = P(Y ≠ argmax_j f_j(X)).
• Goal: seek f to minimize Err(f) directly.
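As a small illustration of the argmax rule and the empirical counterpart of Err(f), the sketch below scores a hypothetical decision matrix f_scores; nothing in it is specific to ψ-learning or SVM.

```python
# Sketch: the argmax classifier and an empirical estimate of the
# generalization error Err(f) = P(Y != argmax_j f_j(X)).
# f_scores is hypothetical: any (n, k) array of decision values f_j(x_i).
import numpy as np

def argmax_classify(f_scores: np.ndarray) -> np.ndarray:
    """Assign each row to the class with the largest decision value."""
    return np.argmax(f_scores, axis=1)

def empirical_error(f_scores: np.ndarray, y: np.ndarray) -> float:
    """Fraction of samples with y != argmax_j f_j(x): the empirical GE."""
    return float(np.mean(argmax_classify(f_scores) != y))

# Toy usage: 3 classes, 5 samples.
f_scores = np.array([[ 1.2, -0.3, -0.9],
                     [-0.5,  0.8, -0.3],
                     [ 0.1,  0.0, -0.1],
                     [-1.0, -0.2,  1.2],
                     [ 0.3,  0.4, -0.7]])
y = np.array([0, 1, 2, 2, 1])
print(empirical_error(f_scores, y))   # 1 of 5 misclassified -> 0.2
```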
[Figure: the (x1, x2) plane for letters D, O, Q partitioned into the three regions f1 > max(f2, f3), f2 > max(f1, f3), and f3 > max(f1, f2). Find f = (f1, f2, f3) and use argmax_j f_j to do classification.]
Difficulties
• Class label representation is not unique.
  – Binary: y ∈ {1, 2} or y ∈ {−1, +1} (SVM, ψ-learning).
  – Multicategory:
    1. Scalar: y ∈ {1, ..., k}.
    2. Vector: y = (y_1, ..., y_k) with 1 in the jth coordinate representing class j (Lee, Lin, & Wahba, 2001).
• Generalize the concept of Err(f) given the new coding system.
• Generalize the concept of margins (only available for the binary problem).
• Possible absence of a dominating class, i.e., P_j(x) < 1/2 for all j, where P_j(x) = P(Y = j | X = x).
  – The conventional "one-vs-rest" approach is then suboptimal (Lee et al., 2001):
    1. Perform k binary classifications (each class vs the rest) to obtain f.
    2. With no dominating class, every binary classifier may favor "rest", so argmax_j f_j(x) need not recover the Bayes rule.
Binary Case
• Begin with the binary case in the new setting for motivation.
• Usual setting: only one f; classifier sign(f) with y ∈ {−1, +1}.
• New setting: f = (f_1, f_2) and classifier argmax_j f_j.
  – sign(f_2 − f_1) suffices: classify x into class 2 if f_2(x) ≥ f_1(x).
  – Remove redundancy: sum-to-zero constraint f_1 + f_2 = 0.
• Linear: f_j(x) = w_j^T x + b_j; w_j ∈ R^d, b_j ∈ R.
• Margins
  – Functional margin for an instance (x, y): u = f_y(x) − f_{3−y}(x); reduces to y f(x) in the usual ±1 setting.
  – Separation margin γ:
    1. the vertical Euclidean distance between the hyperplanes f_2(x) − f_1(x) = ±1, equal to 1/||w_2|| under the sum-to-zero constraint;
    2. ||·||: the Euclidean norm in R^d.
SVM for Binary Case
• Separable: find the optimal hyperplane, i.e.,
  – maximize γ subject to u_i = f_{y_i}(x_i) − f_{3−y_i}(x_i) ≥ 1 for all i:
    (1) "zero" training error; (2) fixes the scaling of f (and hence of the w_j).
• Nonseparable: "zero" error not attainable, so introduce slack variables ξ_i ≥ 0 and minimize
  (1/2) Σ_j ||w_j||² + C Σ_i ξ_i subject to u_i ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., n,
  where C > 0 is a tuning parameter.
• Equivalently, solve f = (f_1, f_2) via minimizing
  (1/2) Σ_j ||w_j||² + C Σ_i V(u_i) subject to Σ_j f_j = 0,
  where V(u) = (1 − u)_+ (hinge loss).
• Support vectors (SVs) determine the solution.
  – Instances with u_i = 1 lie on the margin hyperplane f_{y_i}(x) − f_{3−y_i}(x) = 1.
  – Instances with u_i < 1 fall in the halfspace f_{y_i}(x) − f_{3−y_i}(x) < 1.
[Figure: binary SVM geometry for classes 1 and 2, showing the decision boundary f1(x) − f2(x) = 0, the margin hyperplanes f1(x) − f2(x) = ±1 with separation margin 1/||w2||, and three SVs lying on f1(x) − f2(x) = ±1.]
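A minimal sketch of the equivalent hinge-loss objective in the usual ±1 coding; the names w, b, and C are illustrative, not from a particular library.

```python
# Sketch of the binary SVM objective in the sum-to-zero setting, assuming a
# linear classifier f_2(x) = w^T x + b = -f_1(x): minimize
#   0.5 * ||w||^2 + C * sum_i hinge(u_i).
import numpy as np

def hinge(u: np.ndarray) -> np.ndarray:
    """Hinge loss V(u) = (1 - u)_+."""
    return np.maximum(0.0, 1.0 - u)

def svm_objective(w, b, X, y_sign, C=1.0):
    """y_sign in {-1,+1}; u_i = y_i * (w^T x_i + b) in the usual coding."""
    u = y_sign * (X @ w + b)
    return 0.5 * w @ w + C * hinge(u).sum()
```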
ψ-Learning for Binary Case
• Proposed by Shen, Tseng, Zhang, & Wong (JASA, 2003) under the usual binary ±1 setting.
• New setting: f = (f_1, f_2) and classifier argmax_j f_j.
• Goal: minimize the GE, Err(f) = (1/2) E[1 − sign(u)], with u the functional margin.
• Solve f = (f_1, f_2) via minimizing
  (1/2) Σ_j ||w_j||² + C Σ_i ψ(u_i) subject to Σ_j f_j = 0.
• ψ ≈ 1 − sign (non-increasing): solves the scaling problem.
  ψ(u) = 2(1 − u) if 0 ≤ u < 1; ψ(u) = 1 − sign(u) otherwise.
• Potential gains over SVM both theoretically and numerically.
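A minimal sketch of the binary ψ loss, assuming the piecewise form reconstructed above; the comparison with the hinge highlights that ψ is bounded by 2.

```python
# A minimal sketch of the binary psi loss, assuming the piecewise form
#   psi(u) = 1 - sign(u) for u < 0 or u >= 1, and 2(1 - u) for 0 <= u < 1,
# i.e., a continuous, non-increasing surrogate for the 0-1-type loss 1 - sign(u).
import numpy as np

def psi_binary(u) -> np.ndarray:
    u = np.asarray(u, dtype=float)
    return np.where(u >= 1.0, 0.0,           # correct with margin: no loss
           np.where(u < 0.0, 2.0,            # misclassified: flat loss 2
                    2.0 * (1.0 - u)))        # inside margin: linear ramp

# Unlike the hinge (1 - u)_+, psi is bounded by 2, so extreme outliers
# (u -> -inf) cannot dominate the objective.
print(psi_binary([-2.0, -0.5, 0.0, 0.5, 1.0, 2.0]))  # [2. 2. 2. 1. 0. 0.]
```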
[Figure: plot of the losses L(u) over u ∈ [−2, 2], comparing L = 1 − sign(u) with L = ψ(u).]
Multicategory Framework
• Classifier: argmax_j f_j(x) with f = (f_1, ..., f_k).
• Important concept, multiple comparison: u(f(x), y) = (f_y(x) − f_j(x), j ≠ y) ∈ R^{k−1}.
  – Compares class y with the remaining k − 1 classes.
  – Reduces to the usual functional margin when k = 2.
• f yields correct classification for (x, y) if every coordinate of u is positive, i.e., u_min > 0.
• Multivariate sign:
  sign(u) = 1 if u_min = min(u_1, ..., u_{k−1}) > 0; sign(u) = −1 if u_min ≤ 0.
• GE: Err(f) = (1/2) E[1 − sign(u(f(X), Y))].
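The two definitions above translate directly into code; the sketch below computes the multiple-comparison vector and the multivariate sign for one hypothetical score vector.

```python
# Sketch of the multiple-comparison vector u(f(x), y) and the multivariate
# sign, following the definitions above (k classes, u in R^(k-1)).
import numpy as np

def comparison_vector(f_x: np.ndarray, y: int) -> np.ndarray:
    """u_j = f_y(x) - f_j(x) for j != y; correct classification iff min u > 0."""
    return np.delete(f_x[y] - f_x, y)

def multivariate_sign(u: np.ndarray) -> int:
    """sign(u) = 1 if min(u) > 0, else -1."""
    return 1 if np.min(u) > 0 else -1

f_x = np.array([0.2, 0.9, -1.1])     # k = 3 decision values at one x
print(comparison_vector(f_x, 1))     # [0.7, 2.0]: class 1 beats both others
print(multivariate_sign(comparison_vector(f_x, 1)))  # 1 -> correct
```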
Multicategory Framework, cont'd
• Multivariate ψ function: for u ∈ R^{k−1} and k ≥ 2,
  0 < ψ(u) ≤ 2 if 0 ≤ u_min < 1; ψ(u) = 1 − sign(u) otherwise. (1)
• ψ is nonincreasing in each u_j.
• Includes the binary ψ function.
• A specific ψ for implementation:
  ψ(u) = 1 − sign(u) if u_min < 0 or u_min ≥ 1; ψ(u) = 2(1 − u_min) otherwise. (2)
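A sketch of the specific ψ in (2), under the reconstruction above (threshold 1, maximum loss 2); it relies on nothing beyond NumPy.

```python
# Sketch of the specific multivariate psi in (2), assuming the reconstruction
#   psi(u) = 1 - sign(u)      if u_min < 0 or u_min >= 1,
#   psi(u) = 2 * (1 - u_min)  if 0 <= u_min < 1,
# where u_min = min(u) and sign(u) is the multivariate sign defined above.
import numpy as np

def psi_multi(u: np.ndarray) -> float:
    u_min = float(np.min(u))
    if u_min < 0.0:
        return 2.0                 # 1 - sign(u) with sign(u) = -1
    if u_min >= 1.0:
        return 0.0                 # 1 - sign(u) with sign(u) = +1
    return 2.0 * (1.0 - u_min)     # linear ramp inside the "margin"

print(psi_multi(np.array([0.7, 2.0])))   # u_min = 0.7 -> 0.6
```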
[Figure: perspective plot of the 3-class ψ function ψ(u1, u2) over (u1, u2) ∈ [−2, 2]², as defined in (2); the loss ranges from 0 to 2.]
Multicategory ψ-Learning
• Multicategory ψ-learning: find f via minimizing
  (1/2) Σ_j ||w_j||² + C Σ_i ψ(u(f(x_i), y_i)), with the constraint Σ_j f_j = 0.
• Linear: f_j(x) = w_j^T x + b_j.
• Nonlinear: apply linear learning to a nonlinear feature space F induced by a kernel K(·, ·).
  – K satisfies Mercer's theorem (Courant & Hilbert, 1959);
  – f_j = h_j + b_j with h_j in the reproducing kernel Hilbert space H_K (Wahba, 1998);
  – Representer Theorem (Kimeldorf & Wahba, 1971):
    f_j(x) = b_j + Σ_{i=1}^n v_{ji} K(x_i, x), with ||h_j||² = Σ_{i,i'} v_{ji} v_{ji'} K(x_i, x_{i'}) replacing ||w_j||².
• Reduces to binary ψ-learning when k = 2.
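Putting the pieces together, a sketch of the linear multicategory ψ-learning objective; it reuses comparison_vector and psi_multi from the earlier sketches, and W, b, C are illustrative names. Enforcing the sum-to-zero constraint by centering is one convenient choice among several.

```python
# A minimal sketch of the linear multicategory psi-learning objective,
#   (1/2) sum_j ||w_j||^2 + C * sum_i psi(u(f(x_i), y_i)),  sum_j f_j = 0,
# reusing comparison_vector and psi_multi from the sketches above.
import numpy as np

def psi_objective(W: np.ndarray, b: np.ndarray, X: np.ndarray,
                  y: np.ndarray, C: float = 1.0) -> float:
    """W: (k, d) weights, b: (k,) intercepts, X: (n, d), y in {0,...,k-1}."""
    W = W - W.mean(axis=0)            # center rows so that sum_j f_j = 0
    b = b - b.mean()
    F = X @ W.T + b                   # (n, k) decision values f_j(x_i)
    loss = sum(psi_multi(comparison_vector(F[i], y[i])) for i in range(len(y)))
    return 0.5 * np.sum(W * W) + C * loss
```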
Multicategory SVM
• Multivariate hinge loss:
  V(u) = 1 − u_min if u_min < 1; V(u) = 0 otherwise.
• Multicategory SVM: find f via minimizing
  (1/2) Σ_j ||w_j||² + C Σ_i V(u(f(x_i), y_i)), with the constraint Σ_j f_j = 0.
• Reduces to the binary SVM when k = 2.
• Differs from other multiclass SVM extensions (Weston & Watkins, 1998; Vapnik, 1998; Lee et al., 2001).
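For contrast with ψ, a sketch of the multivariate hinge V(u) = (1 − u_min)_+; note that it is unbounded as u_min decreases, which is the robustness difference the Summary returns to.

```python
# Sketch of the multivariate hinge loss V(u) = (1 - u_min)_+ used by the
# multicategory SVM; unlike psi, it is unbounded as u_min -> -inf.
import numpy as np

def hinge_multi(u: np.ndarray) -> float:
    return max(0.0, 1.0 - float(np.min(u)))

print(hinge_multi(np.array([0.7, 2.0])))    # 0.3
print(hinge_multi(np.array([-5.0, 2.0])))   # 6.0: outliers dominate
```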
Generalized Margin Interpretation
• Generalized functional margin for (x_i, y_i): u_min,i = min_{j ≠ y_i} (f_{y_i}(x_i) − f_j(x_i)).
  – Indicates correctness and strength of classification.
  – Reduces to y_i f(x_i) when k = 2.
• Generalized separation margin γ = min_{j ≠ j'} γ_{jj'}, where γ_{jj'} is the Euclidean distance between the hyperplanes f_j(x) − f_{j'}(x) = ±1.
• Separable case: multicategory ψ-learning and SVM find f such that
  – instances of class j fall into the convex polyhedron D_j = {x : min_{j' ≠ j} (f_j(x) − f_{j'}(x)) ≥ 1}, j = 1, ..., k;
  – γ is maximized.
Support Vectors
• Separable case: instances on the boundaries of the polyhedrons D_j, j = 1, ..., k.
• Nonseparable case: instances with u_min < 1, falling outside D_j or on the boundary of D_j, j = 1, ..., k.
• Multicategory ψ-learning and SVM retain the SV property of the binary case.
• Sparsity of solutions, i.e., a small number of SVs, is desirable since data reduction can be achieved.
[Figure: illustration of margins and SVs in a 3-class separable example. The plane is divided into polyhedrons one, two, and three by the boundaries f1 − f2 = 0, f1 − f3 = 0, and f2 − f3 = 0, with the margin hyperplanes f1 − f2 = 1, f2 − f1 = 1, f1 − f3 = 1, f3 − f1 = 1, f2 − f3 = 1, and f3 − f2 = 1.]
Deterministic Nonconvex Minimization
• The minimization involved in ψ-learning is nonconvex.
• Largely unexplored in statistics.
• D.C. programming (global minimization). Key: a D.C. decomposition (difference of convex functions, i.e., a convex part plus a concave part).
  – DCA (An and Tao, J. Global Optimization, 1997).
  – Outer approximation (Blanquero & Carrizosa, J. Global Optimization, 2000).
D.C. Decomposition
• Decompose ψ = ψ_1 − ψ_2, where
  ψ_1(u) = 2(1 − u_min) if u_min < 1, and 0 otherwise;
  ψ_2(u) = −2 u_min if u_min < 0, and 0 otherwise.
• This yields a d.c. decomposition of the cost function s = s_1 − s_2:
  – s_1(f) = (1/2) Σ_j ||w_j||² + C Σ_i ψ_1(u(f(x_i), y_i)) is convex;
  – s_2(f) = C Σ_i ψ_2(u(f(x_i), y_i)) is convex, so −s_2 is concave.
• ψ-learning, subject to the sum-to-zero constraint, solves
  min_f s(f) = s_1(f) − s_2(f), (3)
  with the minimization taken over the coefficients of f.
• Nice interpretation:
  – s_1: the convex cost function of SVM (ψ_1 is twice the multivariate hinge);
  – s_2: a bias correction for generalization.
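A sketch of the decomposition ψ = ψ_1 − ψ_2 as reconstructed above, with a numerical consistency check against psi_multi from the earlier sketch.

```python
# Sketch of the D.C. decomposition psi = psi1 - psi2, assuming
#   psi1(u) = 2 * (1 - u_min)_+   (convex; twice the multivariate hinge)
#   psi2(u) = 2 * (-u_min)_+      (convex), so psi = psi1 - psi2.
import numpy as np

def psi1(u):
    return 2.0 * max(0.0, 1.0 - float(np.min(u)))

def psi2(u):
    return 2.0 * max(0.0, -float(np.min(u)))

# Consistency check against psi_multi from the earlier sketch:
for u in ([0.7, 2.0], [-0.5, 1.0], [1.5, 2.0]):
    u = np.array(u)
    assert abs(psi1(u) - psi2(u) - psi_multi(u)) < 1e-12
```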
[Figure: plot of the D.C. decomposition ψ = ψ_1 − ψ_2 for k = 2, showing L = ψ, L = ψ_1, and L = ψ_2 over u ∈ [−2, 2].]
D.C. Algorithm
• Idea: replace s_2(f) by its affine minorization s_2(f^(t)) + ⟨∇s_2(f^(t)), f − f^(t)⟩, where ∇s_2 is a subgradient.
• Solve a sequence of convex subproblems: given f^(t), obtain f^(t+1) by solving
  min_f s_1(f) − ⟨∇s_2(f^(t)), f⟩.
  – Employ Lagrange multipliers.
  – Solve the dual problem using quadratic programming (QP).
• Algorithm 1
  Step 1 (Initialization): choose f^(0).
  Step 2 (Iteration): at iteration t + 1, compute f^(t+1) by solving the QP subproblem.
  Step 3 (Stopping): stop when s(f^(t)) − s(f^(t+1)) is sufficiently small.
  Solution: f* = argmin_t s(f^(t)).
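A generic DCA sketch under the assumptions above: linearize s_2 at the current iterate and solve the convex subproblem. The slide solves each subproblem as a QP dual; here scipy's general-purpose minimizer stands in for illustration, on a toy one-dimensional d.c. objective.

```python
# Generic DCA sketch: minimize s = s1 - s2 by repeatedly linearizing s2 at
# the current iterate and solving the convex subproblem
#   min_x s1(x) - <grad_s2(x_t), x>.
import numpy as np
from scipy.optimize import minimize

def dca(s1, s2, grad_s2, x0, tol=1e-6, max_iter=100):
    x = np.asarray(x0, dtype=float)
    val = s1(x) - s2(x)
    for _ in range(max_iter):
        g = grad_s2(x)                                   # subgradient of s2
        res = minimize(lambda z: s1(z) - g @ z, x)       # convex subproblem
        x_new, val_new = res.x, s1(res.x) - s2(res.x)
        if abs(val - val_new) <= tol:                    # stopping rule
            return x_new
        x, val = x_new, val_new
    return x

# Toy 1-D d.c. objective: s(x) = x^2 - |x|, minimized at x = +/- 0.5.
s1 = lambda x: float(x @ x)
s2 = lambda x: float(np.sum(np.abs(x)))
grad_s2 = lambda x: np.sign(x)
print(dca(s1, s2, grad_s2, x0=np.array([2.0])))  # converges near x = 0.5
```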
D.C. Algorithm, cont'd
• Theorem (Convergence of Algorithm 1): the sequence s(f^(t)) is nonincreasing and converges, and Algorithm 1 terminates finitely.
• Convergence typically within 20 steps; complexity comparable to a QP.
• The solution may not be global.
• Choice of initial values
  – Important for the performance of the final solution.
  – Use the SVM solution.
Outer Approximation
• Idea: approximate s_2(f) from outside by a maximum of affine minorizations.
• Algorithm 2: solve a sequence of concave subproblems via vertex enumeration, refining the outer approximation at each step.
• Theorem (Convergence of Algorithm 2): the sequence f^(t) converges to a global optimum, i.e., lim_t s(f^(t)) = min_f s(f), when the algorithm stops.
• Comparison
  – Algorithm 1: good for large problems; may not be global.
  – Algorithm 2: good for small problems; global.
Theory for Multicategory ψ-Learning
• Class of candidate classification partitions: {argmax_j f_j(x) : f ∈ F}, induced by the function class F.
• Ideal performance: Err(f̄), where f̄ is a Bayes rule.
• Actual performance: Err(f̂) for the estimated f̂.
• Comparison (actual vs ideal):
  – e(f̂, f̄) = Err(f̂) − Err(f̄).
• Important formula for e(f, f̄):
  e(f, f̄) = (1/2) E[sign(u(f̄(X), Y)) − sign(u(f(X), Y))].
  – Reveals a dramatic difference between the binary and multicategory problems.
  – Does not suffer from the difficulty of no dominating class.
Theory of Multicategory ψ-Learning, cont'd
• Assumption A (Approximation): there exist f_n in the candidate class whose ψ-risk approaches that of the Bayes rule f̄ as n → ∞.
• Assumption B (Boundary behavior): a condition controlling the behavior of the conditional class probabilities near the decision boundary of the Bayes rule.
• Assumption C (Metric entropy): a bound on the L_2 metric entropy of the candidate function class, controlling its complexity.
• Assumption D: ψ satisfies (1).
Theory of Multicategory ψ-Learning, cont'd
• Theorem (Accuracy of ψ-learning: argmax_j f̂_j): under Assumptions A-D, for a constant a > 0,
  P(|e(f̂, f̄)| ≥ η_n) is bounded by a term exponentially small in n, (4)
  provided the tuning parameter C is suitably chosen, where the critical rate η_n is determined by the entropy trade-off in Assumption C.
• Corollary: |e(f̂, f̄)| = O_P(η_n) and E|e(f̂, f̄)| = O(η_n), provided the exponent in (4) is bounded away from zero.
• Allows studying the dependence of e(f̂, f̄) on the sample size and the number of classes simultaneously.
Theoretical Example: Linear
• Class: linear decision functions f_j(x) = w_j^T x + b_j with the sum-to-zero constraint.
• Input: x on a bounded support with a fixed, finite dimension.
• P(Y = j | x): uniform design with piecewise-defined conditional probabilities, so that the Bayes rule is linear.
• Result: |e(f̂, f̄)| converges at a fast rate in n for a suitable choice of the tuning parameter C.
• The rate is near optimal when the dimension is finite.
Theoretical Example: Nonlinear
• Class: nonlinear learning with a polynomial kernel of fixed order.
• Input: same as in the linear example.
• P(Y = j | x): same as in the linear example.
• Result: |e(f̂, f̄)| attains the same rate as in the linear example when the order of the polynomial kernel is fixed; near optimal.
Simulated Examples
• Performance: multicategory ψ-learning vs SVM.
• Improvement of ψ-learning over SVM: 100% × (T(SVM) − T(ψ)) / (T(SVM) − Bayes), where Bayes is the Bayes error and T(·) denotes the testing error.
• Each training sample has a fixed size; the testing and Bayes errors are computed via large independent testing samples.
• Perform linear learning and a grid search on C.
• Results are obtained by averaging 100 repeated simulations.
Simulated Examples: Data Generation
• Generate (x_1, x_2) from a bivariate t-distribution with d.f. = 1 and 3 in cases 1 and 2.
• Randomly assign a class label y ∈ {1, 2, 3} to each observation.
• Generate (x_1, x_2) for class j by adding the t-distributed vector to a class-specific center μ_j, j = 1, 2, 3.

[Figure: the three class regions, labeled Class 1, Class 2, Class 3.]
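A minimal sketch of this design; the class centers mu are hypothetical (the original values did not survive extraction), and independent t components stand in for the bivariate t-distribution.

```python
# Sketch of the simulation design as reconstructed above: class labels
# assigned at random, then t-distributed noise (d.f. = 1 or 3) added to a
# class-specific center. The centers mu below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def simulate(n: int, df: float):
    mu = np.array([[0.0, 2.0], [-2.0, -1.0], [2.0, -1.0]])  # hypothetical centers
    y = rng.integers(0, 3, size=n)             # random class labels
    e = rng.standard_t(df, size=(n, 2))        # independent t noise, d.f. = df
    return mu[y] + e, y

X, y = simulate(n=150, df=1)                   # heavy-tailed case (Cauchy)
```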
Case 1: d.f. = 1; Bayes error = 0.247; improvement of ψ over SVM is 43.22%.
Case 2: d.f. = 3; Bayes error = 0.146; improvement of ψ over SVM is 20.41%.

Case     Method   Train (s.e.)   Test (s.e.)   Test − Bayes   SVs (s.e.)
d.f.=1   SVM      .400 (.147)    .431 (.141)   .184           141.76 (10.97)
d.f.=1   ψ-L      .320 (.124)    .349 (.121)   .102           64.64 (15.43)
d.f.=3   SVM      .145 (.027)    .151 (.005)   .005           71.81 (11.02)
d.f.=3   ψ-L      .143 (.029)    .150 (.003)   .004           41.29 (13.51)

• ψ-learning has smaller testing error and consequently better generalization ability than SVM.
• When d.f. = 1, the moments of the bivariate t-distribution do not exist and SVM fails to accomplish data reduction, while ψ-learning has a much smaller number of SVs.
Applications: Letter Image Recognition
• 3-class example (letters D, O, Q).
• Search for the optimal C over a grid of points.
• Improvement: 100% × (T(SVM) − T(ψ-L)) / T(SVM).
• Observations
  – ψ-learning does better in testing than SVM.
  – On average, ψ-learning reduces the SVs of SVM. The percentage of reduction, however, varies.

Testing errors of SVM and ψ-learning over 10 cases:

Case   SVM    ψ-L    Improv.
1      .083   .079   3.39%
2      .073   .063   12.24%
3      .086   .076   11.41%
4      .072   .072   0%
5      .088   .085   3.74%
6      .077   .073   5.45%
7      .075   .072   4.39%
8      .079   .075   5.92%
9      .093   .091   1.51%
10     .090   .086   4.11%
Avg SVs: SVM 51.1, ψ-L 40.8
Summary
• Proposed a novel methodology for multicategory ψ-learning and SVM.
• Developed a statistical learning theory for multicategory ψ-learning.
• Proposed optimization methods to solve the nonconvex minimization.
• ψ-learning is robust to outliers. In contrast, any classifier with an unbounded loss function, such as SVM, suffers from extreme outliers.
• The numerical studies suggest that ψ-learning yields an even more sparse solution than SVM.
Future Directions
• Real applications: microarray classification data, text recognition, etc.
• Choices of tuning parameters and kernels.
• Variable selection for ψ-learning and SVM (Lasso, Tibshirani, JRSS, 1996; Basis Pursuit, Chen, Donoho, and Saunders, SIAM, 1998).
• Extensions to the nonstandard case: treating the k classes unequally.
References
• Liu, Y. and Shen, X. (2003). On multicategory ψ-learning and support vector machine. J. Amer. Statist. Assoc. Under review.
• Liu, Y., Shen, X., and Doss, H. (2003). Multicategory ψ-learning and support vector machine: computational tools. J. Comput. Graph. Statist. Tentatively accepted.
• Shen, X., Tseng, G. C., Zhang, X., and Wong, W. H. (2003). On ψ-learning. J. Amer. Statist. Assoc. 98, 724-734.
• An, H. L. T. and Tao, P. D. (1997). Solving a class of linearly constrained indefinite quadratic problems by D.C. algorithms. J. Global Optim. 11, 253-285.
• Shen, X. and Wong, W. H. (1994). Convergence rate of sieve estimates. Ann. Statist. 22, 580-615.
• Lee, Y., Lin, Y., and Wahba, G. (2003). Multicategory support vector machines: theory, and application to the classification of microarray data and satellite radiance data. J. Amer. Statist. Assoc. To appear.