Outline
• Background
• Binary Support Vector Machine (SVM) and ψ-Learning
• Multicategory ψ-Learning and SVM
• Optimization for Multicategory ψ-Learning
• Statistical Learning Theory for Multicategory ψ-Learning
• Numerical Examples
• Summary and Future Work
An Example
• Letter image recognition: train a machine to recognize hand-written English letters and classify them correctly.
• Dataset Letter from the Statlog collection: each sample contains
  – a vector x of 16 primitive numerical attributes;
  – a response variable y (class label) representing the 26 capital letters A, B, ..., Z.
• Goal: build a classifier from the training data to recognize new letters.
• A 3-class example: consider the letters D, O, Q.
[Figure: scatter plot of the samples for letters D, O, Q in the (x1, x2) plane, using the first two attributes of x.]
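To make the running example concrete, here is a minimal sketch of loading the Letter data and keeping the 3-class subset; using the OpenML copy of the dataset (name "letter") is an assumption, since the slides do not specify a source.

```python
# Sketch: load the Statlog Letter data and keep the 3-class subset D, O, Q.
# Assumes the OpenML copy of the dataset (name "letter"); any source with the
# 16 attributes and the 26-letter label works the same way.
from sklearn.datasets import fetch_openml

letter = fetch_openml(name="letter", version=1, as_frame=True)
X, y = letter.data, letter.target            # X: 16 numeric attributes, y: 'A'..'Z'

mask = y.isin(["D", "O", "Q"])               # the 3-class example
X3, y3 = X[mask], y[mask]
print(X3.shape, y3.value_counts())
```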
[Figure: the same scatter plot for letters D, O, Q. Separable case: many possible partitions.]
Literature Review
• Traditional statistical methods: Linear/Quadratic Discriminant Analysis, Nearest Neighbor, Logistic Regression, etc.
• Machine learning
  – Active research in computer science, engineering, etc.
  – Methods: Neural Networks, Boosting, SVM, ψ-Learning.
• Goal: maximize generalization ability.
Machine Learning and Statistics
• Statistics: estimate conditional probabilities to yield classification, e.g., CART, Logistic Regression, etc.
• Machine learning: margins, as in SVM (Boser, Guyon, & Vapnik, 1992; Vapnik, 1995) and Boosting (Freund & Schapire, 1997), etc.
• Theoretical foundation
  – Statistics: function estimation.
  – Machine learning: Vapnik-Chervonenkis theory.
• Level of difficulty: classification is easier than function estimation.
Multicategory Problem
• k-class problem
  – Construct a decision function vector f = (f_1, ..., f_k) from a sample (x_i, y_i), i = 1, ..., n, drawn i.i.d. from an unknown P(x, y).
  – f_j: a large f_j(x) represents class j; f_j need not be a probability.
  – Classifier: argmax_j f_j(x).
• Accuracy: generalization error (GE), Err(f) = P(Y ≠ argmax_j f_j(X)).
• Goal: seek f to minimize Err(f) directly.
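As a small illustration of the argmax rule and the empirical counterpart of Err(f), the sketch below scores a hypothetical decision matrix f_scores; nothing in it is specific to ψ-learning or SVM.

```python
# Sketch: the argmax classifier and an empirical estimate of the
# generalization error Err(f) = P(Y != argmax_j f_j(X)).
# f_scores is hypothetical: any (n, k) array of decision values f_j(x_i).
import numpy as np

def argmax_classify(f_scores: np.ndarray) -> np.ndarray:
    """Assign each row to the class with the largest decision value."""
    return np.argmax(f_scores, axis=1)

def empirical_error(f_scores: np.ndarray, y: np.ndarray) -> float:
    """Fraction of samples with y != argmax_j f_j(x): the empirical GE."""
    return float(np.mean(argmax_classify(f_scores) != y))

# Toy usage: 3 classes, 5 samples.
f_scores = np.array([[ 1.2, -0.3, -0.9],
                     [-0.5,  0.8, -0.3],
                     [ 0.1,  0.0, -0.1],
                     [-1.0, -0.2,  1.2],
                     [ 0.3,  0.4, -0.7]])
y = np.array([0, 1, 2, 2, 1])
print(empirical_error(f_scores, y))   # 1 of 5 misclassified -> 0.2
```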
[Figure: the (x1, x2) plane for letters D, O, Q partitioned into the three regions f1 > max(f2, f3), f2 > max(f1, f3), and f3 > max(f1, f2). Find f = (f1, f2, f3) and use argmax_j f_j to do classification.]
Difficulties
• Class label representation is not unique.
  – Binary: y ∈ {1, 2} or y ∈ {−1, +1} (SVM, ψ-learning).
  – Multicategory:
    1. Scalar: y ∈ {1, ..., k}.
    2. Vector: y = (y_1, ..., y_k) with 1 in the jth coordinate representing class j (Lee, Lin, & Wahba, 2001).
• Generalize the concept of Err(f) given the new coding system.
• Generalize the concept of margins (only available for the binary problem).
• Possible absence of a dominating class, i.e., P_j(x) < 1/2 for all j, where P_j(x) = P(Y = j | X = x).
  – The conventional "one-vs-rest" approach is then suboptimal (Lee et al., 2001):
    1. Perform k binary classifications (each class vs the rest) to obtain f.
    2. With no dominating class, every binary classifier may favor "rest", so argmax_j f_j(x) need not recover the Bayes rule.
Binary Case
• Begin with the binary case in the new setting for motivation.
• Usual setting: only one f; classifier sign(f) with y ∈ {−1, +1}.
• New setting: f = (f_1, f_2) and classifier argmax_j f_j.
  – sign(f_2 − f_1) suffices: classify x into class 2 if f_2(x) ≥ f_1(x).
  – Remove redundancy: sum-to-zero constraint f_1 + f_2 = 0.
• Linear: f_j(x) = w_j^T x + b_j; w_j ∈ R^d, b_j ∈ R.
• Margins
  – Functional margin for an instance (x, y): u = f_y(x) − f_{3−y}(x); reduces to y f(x) in the usual ±1 setting.
  – Separation margin γ:
    1. the vertical Euclidean distance between the hyperplanes f_2(x) − f_1(x) = ±1, equal to 1/||w_2|| under the sum-to-zero constraint;
    2. ||·||: the Euclidean norm in R^d.
SVM for Binary Case
• Separable: find the optimal hyperplane, i.e.,
  – maximize γ subject to u_i = f_{y_i}(x_i) − f_{3−y_i}(x_i) ≥ 1 for all i:
    (1) "zero" training error; (2) fixes the scaling of f (and hence of the w_j).
• Nonseparable: "zero" error not attainable, so introduce slack variables ξ_i ≥ 0 and minimize
  (1/2) Σ_j ||w_j||² + C Σ_i ξ_i subject to u_i ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., n,
  where C > 0 is a tuning parameter.
• Equivalently, solve f = (f_1, f_2) via minimizing
  (1/2) Σ_j ||w_j||² + C Σ_i V(u_i) subject to Σ_j f_j = 0,
  where V(u) = (1 − u)_+ (hinge loss).
• Support vectors (SVs) determine the solution.
  – Instances with u_i = 1 lie on the margin hyperplane f_{y_i}(x) − f_{3−y_i}(x) = 1.
  – Instances with u_i < 1 fall in the halfspace f_{y_i}(x) − f_{3−y_i}(x) < 1.
[Figure: binary SVM geometry for classes 1 and 2, showing the decision boundary f1(x) − f2(x) = 0, the margin hyperplanes f1(x) − f2(x) = ±1 with separation margin 1/||w2||, and three SVs lying on f1(x) − f2(x) = ±1.]
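A minimal sketch of the equivalent hinge-loss objective in the usual ±1 coding; the names w, b, and C are illustrative, not from a particular library.

```python
# Sketch of the binary SVM objective in the sum-to-zero setting, assuming a
# linear classifier f_2(x) = w^T x + b = -f_1(x): minimize
#   0.5 * ||w||^2 + C * sum_i hinge(u_i).
import numpy as np

def hinge(u: np.ndarray) -> np.ndarray:
    """Hinge loss V(u) = (1 - u)_+."""
    return np.maximum(0.0, 1.0 - u)

def svm_objective(w, b, X, y_sign, C=1.0):
    """y_sign in {-1,+1}; u_i = y_i * (w^T x_i + b) in the usual coding."""
    u = y_sign * (X @ w + b)
    return 0.5 * w @ w + C * hinge(u).sum()
```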
ψ-Learning for Binary Case
• Proposed by Shen, Tseng, Zhang, & Wong (JASA, 2003) under the usual binary ±1 setting.
• New setting: f = (f_1, f_2) and classifier argmax_j f_j.
• Goal: minimize the GE, Err(f) = (1/2) E[1 − sign(u)], with u the functional margin.
• Solve f = (f_1, f_2) via minimizing
  (1/2) Σ_j ||w_j||² + C Σ_i ψ(u_i) subject to Σ_j f_j = 0.
• ψ ≈ 1 − sign (non-increasing): solves the scaling problem.
  ψ(u) = 2(1 − u) if 0 ≤ u < 1; ψ(u) = 1 − sign(u) otherwise.
• Potential gains over SVM both theoretically and numerically.
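A minimal sketch of the binary ψ loss, assuming the piecewise form reconstructed above; the comparison with the hinge highlights that ψ is bounded by 2.

```python
# A minimal sketch of the binary psi loss, assuming the piecewise form
#   psi(u) = 1 - sign(u) for u < 0 or u >= 1, and 2(1 - u) for 0 <= u < 1,
# i.e., a continuous, non-increasing surrogate for the 0-1-type loss 1 - sign(u).
import numpy as np

def psi_binary(u) -> np.ndarray:
    u = np.asarray(u, dtype=float)
    return np.where(u >= 1.0, 0.0,           # correct with margin: no loss
           np.where(u < 0.0, 2.0,            # misclassified: flat loss 2
                    2.0 * (1.0 - u)))        # inside margin: linear ramp

# Unlike the hinge (1 - u)_+, psi is bounded by 2, so extreme outliers
# (u -> -inf) cannot dominate the objective.
print(psi_binary([-2.0, -0.5, 0.0, 0.5, 1.0, 2.0]))  # [2. 2. 2. 1. 0. 0.]
```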
[Figure: plot of the losses L(u) over u ∈ [−2, 2], comparing L = 1 − sign(u) with L = ψ(u).]
Multicategory Framework
• Classifier: argmax_j f_j(x) with f = (f_1, ..., f_k).
• Important concept, multiple comparison: u(f(x), y) = (f_y(x) − f_j(x), j ≠ y) ∈ R^{k−1}.
  – Compares class y with the remaining k − 1 classes.
  – Reduces to the usual functional margin when k = 2.
• f yields correct classification for (x, y) if every coordinate of u is positive, i.e., u_min > 0.
• Multivariate sign:
  sign(u) = 1 if u_min = min(u_1, ..., u_{k−1}) > 0; sign(u) = −1 if u_min ≤ 0.
• GE: Err(f) = (1/2) E[1 − sign(u(f(X), Y))].
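The two definitions above translate directly into code; the sketch below computes the multiple-comparison vector and the multivariate sign for one hypothetical score vector.

```python
# Sketch of the multiple-comparison vector u(f(x), y) and the multivariate
# sign, following the definitions above (k classes, u in R^(k-1)).
import numpy as np

def comparison_vector(f_x: np.ndarray, y: int) -> np.ndarray:
    """u_j = f_y(x) - f_j(x) for j != y; correct classification iff min u > 0."""
    return np.delete(f_x[y] - f_x, y)

def multivariate_sign(u: np.ndarray) -> int:
    """sign(u) = 1 if min(u) > 0, else -1."""
    return 1 if np.min(u) > 0 else -1

f_x = np.array([0.2, 0.9, -1.1])     # k = 3 decision values at one x
print(comparison_vector(f_x, 1))     # [0.7, 2.0]: class 1 beats both others
print(multivariate_sign(comparison_vector(f_x, 1)))  # 1 -> correct
```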
Multicategory Framework, cont'd
• Multivariate ψ function: for u ∈ R^{k−1} and k ≥ 2,
  0 < ψ(u) ≤ 2 if 0 ≤ u_min < 1; ψ(u) = 1 − sign(u) otherwise. (1)
• ψ is nonincreasing in each u_j.
• Includes the binary ψ function.
• A specific ψ for implementation:
  ψ(u) = 1 − sign(u) if u_min < 0 or u_min ≥ 1; ψ(u) = 2(1 − u_min) otherwise. (2)
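A sketch of the specific ψ in (2), under the reconstruction above (threshold 1, maximum loss 2); it relies on nothing beyond NumPy.

```python
# Sketch of the specific multivariate psi in (2), assuming the reconstruction
#   psi(u) = 1 - sign(u)      if u_min < 0 or u_min >= 1,
#   psi(u) = 2 * (1 - u_min)  if 0 <= u_min < 1,
# where u_min = min(u) and sign(u) is the multivariate sign defined above.
import numpy as np

def psi_multi(u: np.ndarray) -> float:
    u_min = float(np.min(u))
    if u_min < 0.0:
        return 2.0                 # 1 - sign(u) with sign(u) = -1
    if u_min >= 1.0:
        return 0.0                 # 1 - sign(u) with sign(u) = +1
    return 2.0 * (1.0 - u_min)     # linear ramp inside the "margin"

print(psi_multi(np.array([0.7, 2.0])))   # u_min = 0.7 -> 0.6
```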
[Figure: perspective plot of the 3-class ψ function ψ(u1, u2) over (u1, u2) ∈ [−2, 2]², as defined in (2); the loss ranges from 0 to 2.]
Multicategory ψ-Learning
• Multicategory ψ-learning: find f via minimizing
  (1/2) Σ_j ||w_j||² + C Σ_i ψ(u(f(x_i), y_i)), with the constraint Σ_j f_j = 0.
• Linear: f_j(x) = w_j^T x + b_j.
• Nonlinear: apply linear learning to a nonlinear feature space F induced by a kernel K(·, ·).
  – K satisfies Mercer's theorem (Courant & Hilbert, 1959);
  – f_j = h_j + b_j with h_j in the reproducing kernel Hilbert space H_K (Wahba, 1998);
  – Representer Theorem (Kimeldorf & Wahba, 1971):
    f_j(x) = b_j + Σ_{i=1}^n v_{ji} K(x_i, x), with ||h_j||² = Σ_{i,i'} v_{ji} v_{ji'} K(x_i, x_{i'}) replacing ||w_j||².
• Reduces to binary ψ-learning when k = 2.
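Putting the pieces together, a sketch of the linear multicategory ψ-learning objective; it reuses comparison_vector and psi_multi from the earlier sketches, and W, b, C are illustrative names. Enforcing the sum-to-zero constraint by centering is one convenient choice among several.

```python
# A minimal sketch of the linear multicategory psi-learning objective,
#   (1/2) sum_j ||w_j||^2 + C * sum_i psi(u(f(x_i), y_i)),  sum_j f_j = 0,
# reusing comparison_vector and psi_multi from the sketches above.
import numpy as np

def psi_objective(W: np.ndarray, b: np.ndarray, X: np.ndarray,
                  y: np.ndarray, C: float = 1.0) -> float:
    """W: (k, d) weights, b: (k,) intercepts, X: (n, d), y in {0,...,k-1}."""
    W = W - W.mean(axis=0)            # center rows so that sum_j f_j = 0
    b = b - b.mean()
    F = X @ W.T + b                   # (n, k) decision values f_j(x_i)
    loss = sum(psi_multi(comparison_vector(F[i], y[i])) for i in range(len(y)))
    return 0.5 * np.sum(W * W) + C * loss
```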
Multicategory SVM
• Multivariate hinge loss:
  V(u) = 1 − u_min if u_min < 1; V(u) = 0 otherwise.
• Multicategory SVM: find f via minimizing
  (1/2) Σ_j ||w_j||² + C Σ_i V(u(f(x_i), y_i)), with the constraint Σ_j f_j = 0.
• Reduces to the binary SVM when k = 2.
• Differs from other multiclass SVM extensions (Weston & Watkins, 1998; Vapnik, 1998; Lee et al., 2001).
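For contrast with ψ, a sketch of the multivariate hinge V(u) = (1 − u_min)_+; note that it is unbounded as u_min decreases, which is the robustness difference the Summary returns to.

```python
# Sketch of the multivariate hinge loss V(u) = (1 - u_min)_+ used by the
# multicategory SVM; unlike psi, it is unbounded as u_min -> -inf.
import numpy as np

def hinge_multi(u: np.ndarray) -> float:
    return max(0.0, 1.0 - float(np.min(u)))

print(hinge_multi(np.array([0.7, 2.0])))    # 0.3
print(hinge_multi(np.array([-5.0, 2.0])))   # 6.0: outliers dominate
```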
Generalized Margin Interpretation
• Generalized functional margin for (x_i, y_i): u_min,i = min_{j ≠ y_i} (f_{y_i}(x_i) − f_j(x_i)).
  – Indicates correctness and strength of classification.
  – Reduces to y_i f(x_i) when k = 2.
• Generalized separation margin γ = min_{j ≠ j'} γ_{jj'}, where γ_{jj'} is the Euclidean distance between the hyperplanes f_j(x) − f_{j'}(x) = ±1.
• Separable case: multicategory ψ-learning and SVM find f such that
  – instances of class j fall into the convex polyhedron D_j = {x : min_{j' ≠ j} (f_j(x) − f_{j'}(x)) ≥ 1}, j = 1, ..., k;
  – γ is maximized.
Support Vectors
• Separable case: instances on the boundaries of the polyhedrons D_j, j = 1, ..., k.
• Nonseparable case: instances with u_min < 1, falling outside D_j or on the boundary of D_j, j = 1, ..., k.
• Multicategory ψ-learning and SVM retain the SV property of the binary case.
• Sparsity of solutions, i.e., a small number of SVs, is desirable since data reduction can be achieved.
[Figure: illustration of margins and SVs in a 3-class separable example. The plane is divided into polyhedrons one, two, and three by the boundaries f1 − f2 = 0, f1 − f3 = 0, and f2 − f3 = 0, with the margin hyperplanes f1 − f2 = 1, f2 − f1 = 1, f1 − f3 = 1, f3 − f1 = 1, f2 − f3 = 1, and f3 − f2 = 1.]
Deterministic Nonconvex Minimization
• The minimization involved in ψ-learning is nonconvex.
• Largely unexplored in statistics.
• D.C. programming (global minimization). Key: a D.C. decomposition (difference of convex functions, i.e., a convex part plus a concave part).
  – DCA (An and Tao, J. Global Optimization, 1997).
  – Outer approximation (Blanquero & Carrizosa, J. Global Optimization, 2000).
D.C. Decomposition
• Decompose ψ = ψ_1 − ψ_2, where
  ψ_1(u) = 2(1 − u_min) if u_min < 1, and 0 otherwise;
  ψ_2(u) = −2 u_min if u_min < 0, and 0 otherwise.
• This yields a d.c. decomposition of the cost function s = s_1 − s_2:
  – s_1(f) = (1/2) Σ_j ||w_j||² + C Σ_i ψ_1(u(f(x_i), y_i)) is convex;
  – s_2(f) = C Σ_i ψ_2(u(f(x_i), y_i)) is convex, so −s_2 is concave.
• ψ-learning, subject to the sum-to-zero constraint, solves
  min_f s(f) = s_1(f) − s_2(f), (3)
  with the minimization taken over the coefficients of f.
• Nice interpretation:
  – s_1: the convex cost function of SVM (ψ_1 is twice the multivariate hinge);
  – s_2: a bias correction for generalization.
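A sketch of the decomposition ψ = ψ_1 − ψ_2 as reconstructed above, with a numerical consistency check against psi_multi from the earlier sketch.

```python
# Sketch of the D.C. decomposition psi = psi1 - psi2, assuming
#   psi1(u) = 2 * (1 - u_min)_+   (convex; twice the multivariate hinge)
#   psi2(u) = 2 * (-u_min)_+      (convex), so psi = psi1 - psi2.
import numpy as np

def psi1(u):
    return 2.0 * max(0.0, 1.0 - float(np.min(u)))

def psi2(u):
    return 2.0 * max(0.0, -float(np.min(u)))

# Consistency check against psi_multi from the earlier sketch:
for u in ([0.7, 2.0], [-0.5, 1.0], [1.5, 2.0]):
    u = np.array(u)
    assert abs(psi1(u) - psi2(u) - psi_multi(u)) < 1e-12
```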
[Figure: plot of the D.C. decomposition ψ = ψ_1 − ψ_2 for k = 2, showing L = ψ, L = ψ_1, and L = ψ_2 over u ∈ [−2, 2].]
D.C. Algorithm
• Idea: replace s_2(f) by its affine minorization s_2(f^(t)) + ⟨∇s_2(f^(t)), f − f^(t)⟩, where ∇s_2 is a subgradient.
• Solve a sequence of convex subproblems: given f^(t), obtain f^(t+1) by solving
  min_f s_1(f) − ⟨∇s_2(f^(t)), f⟩.
  – Employ Lagrange multipliers.
  – Solve the dual problem using quadratic programming (QP).
• Algorithm 1
  Step 1 (Initialization): choose f^(0).
  Step 2 (Iteration): at iteration t + 1, compute f^(t+1) by solving the QP subproblem.
  Step 3 (Stopping): stop when s(f^(t)) − s(f^(t+1)) is sufficiently small.
  Solution: f* = argmin_t s(f^(t)).
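A generic DCA sketch under the assumptions above: linearize s_2 at the current iterate and solve the convex subproblem. The slide solves each subproblem as a QP dual; here scipy's general-purpose minimizer stands in for illustration, on a toy one-dimensional d.c. objective.

```python
# Generic DCA sketch: minimize s = s1 - s2 by repeatedly linearizing s2 at
# the current iterate and solving the convex subproblem
#   min_x s1(x) - <grad_s2(x_t), x>.
import numpy as np
from scipy.optimize import minimize

def dca(s1, s2, grad_s2, x0, tol=1e-6, max_iter=100):
    x = np.asarray(x0, dtype=float)
    val = s1(x) - s2(x)
    for _ in range(max_iter):
        g = grad_s2(x)                                   # subgradient of s2
        res = minimize(lambda z: s1(z) - g @ z, x)       # convex subproblem
        x_new, val_new = res.x, s1(res.x) - s2(res.x)
        if abs(val - val_new) <= tol:                    # stopping rule
            return x_new
        x, val = x_new, val_new
    return x

# Toy 1-D d.c. objective: s(x) = x^2 - |x|, minimized at x = +/- 0.5.
s1 = lambda x: float(x @ x)
s2 = lambda x: float(np.sum(np.abs(x)))
grad_s2 = lambda x: np.sign(x)
print(dca(s1, s2, grad_s2, x0=np.array([2.0])))  # converges near x = 0.5
```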
D.C. Algorithm, cont'd
• Theorem (Convergence of Algorithm 1): the sequence s(f^(t)) is nonincreasing and converges, and Algorithm 1 terminates finitely.
• Convergence typically within 20 steps; complexity comparable to a QP.
• The solution may not be global.
• Choice of initial values
  – Important for the performance of the final solution.
  – Use the SVM solution.
Outer Approximation
• Idea: approximate s_2(f) from outside by a maximum of affine minorizations.
• Algorithm 2: solve a sequence of concave subproblems via vertex enumeration, refining the outer approximation at each step.
• Theorem (Convergence of Algorithm 2): the sequence f^(t) converges to a global optimum, i.e., lim_t s(f^(t)) = min_f s(f), when the algorithm stops.
• Comparison
  – Algorithm 1: good for large problems; may not be global.
  – Algorithm 2: good for small problems; global.
Theory for Multicategory ψ-Learning
• Class of candidate classification partitions: {argmax_j f_j(x) : f ∈ F}, induced by the function class F.
• Ideal performance: Err(f̄), where f̄ is a Bayes rule.
• Actual performance: Err(f̂) for the estimated f̂.
• Comparison (actual vs ideal):
  – e(f̂, f̄) = Err(f̂) − Err(f̄).
• Important formula for e(f, f̄):
  e(f, f̄) = (1/2) E[sign(u(f̄(X), Y)) − sign(u(f(X), Y))].
  – Reveals a dramatic difference between the binary and multicategory problems.
  – Does not suffer from the difficulty of no dominating class.
Theory of Multicategory ψ-Learning, cont'd
• Assumption A (Approximation): there exist f_n in the candidate class whose ψ-risk approaches that of the Bayes rule f̄ as n → ∞.
• Assumption B (Boundary behavior): a condition controlling the behavior of the conditional class probabilities near the decision boundary of the Bayes rule.
• Assumption C (Metric entropy): a bound on the L_2 metric entropy of the candidate function class, controlling its complexity.
• Assumption D: ψ satisfies (1).
Theory of Multicategory ψ-Learning, cont'd
• Theorem (Accuracy of ψ-learning: argmax_j f̂_j): under Assumptions A-D, for a constant a > 0,
  P(|e(f̂, f̄)| ≥ η_n) is bounded by a term exponentially small in n, (4)
  provided the tuning parameter C is suitably chosen, where the critical rate η_n is determined by the entropy trade-off in Assumption C.
• Corollary: |e(f̂, f̄)| = O_P(η_n) and E|e(f̂, f̄)| = O(η_n), provided the exponent in (4) is bounded away from zero.
• Allows studying the dependence of e(f̂, f̄) on the sample size and the number of classes simultaneously.
Theoretical Example: Linear
• Class: linear decision functions f_j(x) = w_j^T x + b_j with the sum-to-zero constraint.
• Input: x on a bounded support with a fixed, finite dimension.
• P(Y = j | x): uniform design with piecewise-defined conditional probabilities, so that the Bayes rule is linear.
• Result: |e(f̂, f̄)| converges at a fast rate in n for a suitable choice of the tuning parameter C.
• The rate is near optimal when the dimension is finite.
Theoretical Example: Nonlinear
• Class: nonlinear learning with a polynomial kernel of fixed order.
• Input: same as in the linear example.
• P(Y = j | x): same as in the linear example.
• Result: |e(f̂, f̄)| attains the same rate as in the linear example when the order of the polynomial kernel is fixed; near optimal.
Simulated Examples
• Performance: multicategory ψ-learning vs SVM.
• Improvement of ψ-learning over SVM: 100% × (T(SVM) − T(ψ)) / (T(SVM) − Bayes), where Bayes is the Bayes error and T(·) denotes the testing error.
• Each training sample has a fixed size; the testing and Bayes errors are computed via large independent testing samples.
• Perform linear learning and a grid search on C.
• Results are obtained by averaging 100 repeated simulations.
Simulated Examples: Data Generation
• Generate (x_1, x_2) from a bivariate t-distribution with d.f. = 1 and 3 in cases 1 and 2.
• Randomly assign a class label y ∈ {1, 2, 3} to each observation.
• Generate (x_1, x_2) for class j by adding the t-distributed vector to a class-specific center μ_j, j = 1, 2, 3.

[Figure: the three class regions, labeled Class 1, Class 2, Class 3.]
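A minimal sketch of this design; the class centers mu are hypothetical (the original values did not survive extraction), and independent t components stand in for the bivariate t-distribution.

```python
# Sketch of the simulation design as reconstructed above: class labels
# assigned at random, then t-distributed noise (d.f. = 1 or 3) added to a
# class-specific center. The centers mu below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def simulate(n: int, df: float):
    mu = np.array([[0.0, 2.0], [-2.0, -1.0], [2.0, -1.0]])  # hypothetical centers
    y = rng.integers(0, 3, size=n)             # random class labels
    e = rng.standard_t(df, size=(n, 2))        # independent t noise, d.f. = df
    return mu[y] + e, y

X, y = simulate(n=150, df=1)                   # heavy-tailed case (Cauchy)
```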
Case 1: d.f. = 1; Bayes error = 0.247; improvement of ψ over SVM is 43.22%.
Case 2: d.f. = 3; Bayes error = 0.146; improvement of ψ over SVM is 20.41%.

Case     Method   Train (s.e.)   Test (s.e.)   Test − Bayes   SVs (s.e.)
d.f.=1   SVM      .400 (.147)    .431 (.141)   .184           141.76 (10.97)
d.f.=1   ψ-L      .320 (.124)    .349 (.121)   .102           64.64 (15.43)
d.f.=3   SVM      .145 (.027)    .151 (.005)   .005           71.81 (11.02)
d.f.=3   ψ-L      .143 (.029)    .150 (.003)   .004           41.29 (13.51)

• ψ-learning has smaller testing error and consequently better generalization ability than SVM.
• When d.f. = 1, the moments of the bivariate t-distribution do not exist and SVM fails to accomplish data reduction, while ψ-learning has a much smaller number of SVs.
Applications: Letter Image Recognition
• 3-class example (letters D, O, Q).
• Search for the optimal C over a grid of points.
• Improvement: 100% × (T(SVM) − T(ψ-L)) / T(SVM).
• Observations
  – ψ-learning does better in testing than SVM.
  – On average, ψ-learning reduces the SVs of SVM. The percentage of reduction, however, varies.

Testing errors of SVM and ψ-learning over 10 cases:

Case   SVM    ψ-L    Improv.
1      .083   .079   3.39%
2      .073   .063   12.24%
3      .086   .076   11.41%
4      .072   .072   0%
5      .088   .085   3.74%
6      .077   .073   5.45%
7      .075   .072   4.39%
8      .079   .075   5.92%
9      .093   .091   1.51%
10     .090   .086   4.11%
Avg SVs: SVM 51.1, ψ-L 40.8
Summary
• Proposed a novel methodology for multicategory ψ-learning and SVM.
• Developed a statistical learning theory for multicategory ψ-learning.
• Proposed optimization methods to solve the nonconvex minimization.
• ψ-learning is robust to outliers. In contrast, any classifier with an unbounded loss function, such as SVM, suffers from extreme outliers.
• The numerical studies suggest that ψ-learning yields an even more sparse solution than SVM.
Future Directions
• Real applications: microarray classification data, text recognition, etc.
• Choices of tuning parameters and kernels.
• Variable selection for ψ-learning and SVM (Lasso, Tibshirani, JRSS, 1996; Basis Pursuit, Chen, Donoho, and Saunders, SIAM, 1998).
• Extensions to the nonstandard case: treating the k classes unequally.
References
• Liu, Y. and Shen, X. (2003). On multicategory ψ-learning and support vector machine. J. Amer. Statist. Assoc. Under review.
• Liu, Y., Shen, X., and Doss, H. (2003). Multicategory ψ-learning and support vector machine: computational tools. J. Comput. Graph. Statist. Tentatively accepted.
• Shen, X., Tseng, G. C., Zhang, X., and Wong, W. H. (2003). On ψ-learning. J. Amer. Statist. Assoc. 98, 724-734.
• An, H. L. T. and Tao, P. D. (1997). Solving a class of linearly constrained indefinite quadratic problems by D.C. algorithms. J. Global Optim. 11, 253-285.
• Shen, X. and Wong, W. H. (1994). Convergence rate of sieve estimates. Ann. Statist. 22, 580-615.
• Lee, Y., Lin, Y., and Wahba, G. (2003). Multicategory support vector machines: theory, and application to the classification of microarray data and satellite radiance data. J. Amer. Statist. Assoc. To appear.