
Page 1

Comparison of Binary Discrimination Methods for High Dimension Low Sample Size Data

Addy M. Bolivar Cime

CIMAT

Joint work with J. S. Marron (UNC at Chapel Hill)

Page 2

Content

1 Introduction

2 Geometric representation in the Gaussian setting

3 Binary discrimination methods
   Support Vector Machine
   Distance Weighted Discrimination
   Mean Difference
   Maximal Data Piling
   Naive Bayes

4 Asymptotic results for the normal vectors

5 Comparison of the MD and SVM methods

6 Conclusions


Page 8

HDLSS context

Data with High Dimension and Low Sample Size (HDLSS) arise in many fields.

There is increasing interest in the statistical analysis of this kind of data.

Recently, several authors have been interested in comparing classical methods for binary discrimination analysis with new ones designed for the HDLSS context.

Page 9

Geometric representation

Let $z = (z(1), \dots, z(d))^\top$ be a $d$-dimensional random vector drawn from the multivariate standard normal distribution.

By Hall, Marron and Neeman (2005), the random vector $z$ lies near the surface of an expanding sphere:

$$\|z\| = \left(\sum_{k=1}^{d} z(k)^2\right)^{1/2} = d^{1/2} + O_p(1), \quad \text{as } d \to \infty.$$

Page 10

If $z_1$ and $z_2$ are independent, then

$$\|z_1 - z_2\| = (2d)^{1/2} + O_p(1), \quad \text{as } d \to \infty,$$

and

$$\mathrm{Angle}(z_1, z_2) = \frac{\pi}{2} + O_p(d^{-1/2}), \quad \text{as } d \to \infty.$$
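These two displays are easy to check numerically. Below is a minimal sketch (not from the slides; the dimensions and seed are illustrative choices) that draws independent standard Gaussian vectors and reports the norm ratio and the pairwise angle as $d$ grows.

```python
# Numerical check of the geometric representation: for z ~ N_d(0, I_d),
# ||z|| is close to d^{1/2}, and independent draws are nearly orthogonal.
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
for d in (10, 1_000, 100_000):
    z1 = rng.standard_normal(d)
    z2 = rng.standard_normal(d)
    cos = z1 @ z2 / (np.linalg.norm(z1) * np.linalg.norm(z2))
    print(f"d={d:7d}  ||z1||/d^0.5 = {np.linalg.norm(z1) / d**0.5:.3f}  "
          f"Angle(z1, z2) = {np.degrees(np.arccos(cos)):5.1f} deg")
```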

Page 11

Case d = 3 and n = 3

Figure: 3-simplex

Page 12

Binary discrimination analysis

Suppose that we have the following training data set

$$(x_1, w_1), (x_2, w_2), \dots, (x_N, w_N), \tag{1}$$

where $x_i \in \mathbb{R}^d$ and $w_i \in \{-1, 1\}$, for $i = 1, 2, \dots, N$.

We have two classes of data, $C_+$ and $C_-$, corresponding to the vectors with $w_i = 1$ and $w_i = -1$, respectively.

Let $m$ and $n$ be the cardinalities of $C_+$ and $C_-$, respectively.

We want to assign a new data point x0 to one of these classes.
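As a concrete illustration of this setup (the dimension, class sizes, and mean shift below are hypothetical choices, not values from the slides), a training set of the form (1) can be generated as follows; later sketches reuse this pattern.

```python
# Illustrative training set (1): N = m + n points x_i in R^d with labels
# w_i in {-1, +1}; class C+ is shifted by an arbitrary mean vector.
import numpy as np

rng = np.random.default_rng(1)
d, m, n = 500, 20, 20                          # dimension and class sizes
X_plus = rng.standard_normal((m, d)) + 0.2     # class C+ (w_i = +1)
X_minus = rng.standard_normal((n, d))          # class C- (w_i = -1)
X = np.vstack([X_plus, X_minus])               # x_1, ..., x_N
w = np.concatenate([np.ones(m), -np.ones(n)])  # labels w_1, ..., w_N
```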


Page 16

Separable case

We say that the training data set is linearly separable if there exists a hyperplane such that all the data of class $C_+$ lie on one side of the hyperplane and all the data of class $C_-$ lie on the other side.

A hyperplane with this property is called a separating hyperplane of the training data set.

Page 17

Binary discrimination methods

We consider the following methods based on separating hyperplanes:

Mean Difference (MD)

Support Vector Machine (SVM)

Distance Weighted Discrimination (DWD)

Maximal Data Piling (MDP)

Naive Bayes (NB)

The DWD and MDP methods are designed for the HDLSS context.

Page 18

The MD, SVM and DWD methods have been previously studied in Hall, Marron and Neeman (2005), where the probability of misclassification of a new data point is considered as the dimension $d$ tends to infinity.

Marron, Todd and Ahn (2007) observed by simulation that the MD, SVM and DWD methods have asymptotically the same behavior in terms of the proportion of misclassification of new data.

In the present work (Bolivar-Cime and Marron (2011)) we take a different asymptotic viewpoint, based on the angle between the normal vectors of the separating hyperplanes.

Page 19

Support Vector Machine (SVM)

Vapnik (1982, 1995)

Cortes and Vapnik (1995)

Burges (1998)

Cristianini and Shawe-Taylor (2000)

Izenman (2008)

Page 20

Let $d_+$ and $d_-$ be the shortest distances from the separating hyperplane to the nearest vector in $C_+$ and $C_-$, respectively. The margin of the separating hyperplane is defined as $d_0 = d_+ + d_-$.

The vectors closest to the separating hyperplane are called support vectors.

Page 21

In the separable case the SVM hyperplane

$$v_0^\top x + b_0 = 0$$

is the unique separating hyperplane with maximal margin.

$v_0$ is a linear function only of the support vectors.
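In practice, $v_0$ and $b_0$ can be approximated with a soft-margin solver run with a very large cost parameter, which on separable data approaches the hard-margin solution. The sketch below uses scikit-learn's SVC; the simulated data and the choice C = 1e6 are assumptions of this example, not part of the slides.

```python
# Sketch: fit an (approximately hard-margin) linear SVM and read off the
# hyperplane v0^T x + b0 = 0 and the support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
d, m, n = 100, 20, 20
X = np.vstack([rng.standard_normal((m, d)) + 0.5,   # class C+
               rng.standard_normal((n, d))])        # class C-
w = np.concatenate([np.ones(m), -np.ones(n)])

svm = SVC(kernel="linear", C=1e6).fit(X, w)         # large C ~ hard margin
v0, b0 = svm.coef_.ravel(), svm.intercept_[0]
print("support vectors:", len(svm.support_))        # v0 depends only on these
```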


Page 23

Distance Weighted Discrimination (DWD)

In the HDLSS situation, Marron, Todd and Ahn (2007) observe that the projection of the data onto the normal vector of the SVM separating hyperplane produces substantial data piling.

They show that data piling may affect the generalization performance.

They propose the DWD method, which avoids the data piling problem and improves generalizability.

All the training data have a role in finding the DWD hyperplane, but data closer to the hyperplane have a bigger impact than data that are farther away.
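The piling effect itself is easy to see in simulation: in the HDLSS regime most training points become support vectors, so their projections onto the SVM normal direction stack up at the two margins. A hedged sketch (all parameters are illustrative):

```python
# Sketch of data piling: project HDLSS training data onto the unit normal
# of a fitted linear SVM; many projections nearly coincide at the margins.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
d, m, n = 1_000, 10, 10                             # d >> N: HDLSS regime
X = np.vstack([rng.standard_normal((m, d)) + 0.1,
               rng.standard_normal((n, d))])
w = np.concatenate([np.ones(m), -np.ones(n)])

v = SVC(kernel="linear", C=1e6).fit(X, w).coef_.ravel()
proj = X @ v / np.linalg.norm(v)                    # scores along unit normal
print(np.round(np.sort(proj[w == 1]), 3))           # near-constant: piling
print(np.round(np.sort(proj[w == -1]), 3))
```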

Page 24

Define the residual of the $i$-th data vector as the distance of this vector to the separating hyperplane.

Page 25

The DWD hyperplane

$$v_1^\top x + b_1 = 0$$

solves the optimization problem

$$\min_{v,\, b} \; \sum_{i=1}^{N} \frac{1}{r_i} \quad \text{subject to} \quad \|v\| = 1, \; r_i \geq 0, \; i = 1, 2, \dots, N, \tag{2}$$

where $r_i$ is the residual of the $i$-th data vector.

Page 26

Mean Difference (MD)

Also called the nearest centroid method (Schölkopf and Smola (2002)).

The separating hyperplane of this method is the one that orthogonally bisects the segment joining the centroids, or means, of the two classes.

That is, if the means of the classes $C_+$ and $C_-$ are given by

$$\bar{x}_+ = \frac{1}{m}\sum_{x_i \in C_+} x_i \quad \text{and} \quad \bar{x}_- = \frac{1}{n}\sum_{x_i \in C_-} x_i, \tag{3}$$

respectively, then the MD hyperplane has normal vector

$$u = \bar{x}_+ - \bar{x}_- \tag{4}$$

and bisects the segment joining the means $\bar{x}_+$ and $\bar{x}_-$.
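Equations (3)-(4) translate directly into code. A minimal sketch (the function name is mine):

```python
# Mean Difference / nearest centroid rule from (3)-(4): the normal vector is
# u = xbar_plus - xbar_minus and the hyperplane bisects the segment of means.
import numpy as np

def md_rule(X, w):
    xbar_plus = X[w == 1].mean(axis=0)        # (3)
    xbar_minus = X[w == -1].mean(axis=0)
    u = xbar_plus - xbar_minus                # (4): normal vector
    b = -u @ (xbar_plus + xbar_minus) / 2     # bisects the joining segment
    return lambda x0: np.sign(x0 @ u + b)     # +1 -> C+, -1 -> C-
```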

Page 27

Maximal Data Piling (MDP)

This method was proposed by Ahn and Marron (2010).

It was specially designed for the HDLSS context; we need to assume $d \geq N - 1$ and that the subspace generated by the data set has dimension $N - 1$.

There exist direction vectors onto which the projections of the training data pile completely at two distinct points, one for each class.

Page 28

For the MDP hyperplane

$$v_2^\top x + b_2 = 0,$$

the unit normal vector $v_2$ is the direction vector for which the projections of the two class means have maximal distance, and the projection of each data point onto the vector is the same as that of its class mean.

The bias $b_2$ can be calculated as

$$b_2 = -v_2^\top (m\bar{x}_+ + n\bar{x}_-)/N. \tag{5}$$
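Computing $v_2$ does not require explicit optimization: Ahn and Marron (2010) characterize the MDP direction in closed form through the Moore-Penrose pseudo-inverse of the global sample covariance applied to the mean-difference vector. The sketch below uses that characterization as I recall it; treat it as a sketch under that assumption rather than the paper's exact algorithm.

```python
# Sketch of the MDP direction: v2 proportional to pinv(S) u, with S the
# global sample covariance of all N points and u the mean difference.
# Requires d >= N - 1 and data spanning an (N-1)-dimensional subspace.
import numpy as np

def mdp_direction(X, w):
    u = X[w == 1].mean(axis=0) - X[w == -1].mean(axis=0)
    S = np.cov(X, rowvar=False)               # global sample covariance
    v2 = np.linalg.pinv(S) @ u                # pseudo-inverse closed form
    return v2 / np.linalg.norm(v2)            # unit normal vector

# Projections X @ mdp_direction(X, w) should pile near two points,
# one per class (up to numerical error).
```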


Page 30

Example: $d = 39$, $n = m = 20$. The data of $C_-$ and $C_+$ were drawn from multivariate normal distributions with identity covariance matrix and means $\mu_- = (-2.2, 0, \dots, 0)^\top$ and $\mu_+ = (+2.2, 0, \dots, 0)^\top$, respectively.

Page 31

The MDP method is equivalent to Fisher Linear Discrimination (FLD) in the non-HDLSS situation.

Bickel and Levina (2004) have demonstrated that FLD has very poor HDLSS properties.

Ahn and Marron (2010) showed that, although data piling may not be desirable, the MDP method can work very well, and better than FLD, in some settings in the HDLSS context.

Page 32

Naive Bayes (NB)

It uses the Bayes rule to obtain the normal vector of the separating hyperplane (Bickel and Levina (2004)).

This method assumes that the common covariance matrix $\Sigma$ of the two classes is diagonal, i.e. the entries of the random vectors are uncorrelated.

Page 33

Suppose that the classes have normal distributions $N_d(\mu_0, \Sigma)$ and $N_d(\mu_1, \Sigma)$.

Let $u$ be the MD normal vector and let $D = \mathrm{diag}(\hat{\Sigma})$ be the diagonal matrix whose entries are the diagonal elements of the pooled covariance matrix

$$\hat{\Sigma} = \frac{1}{m+n-2}\left[\sum_{x_i \in C_+} (x_i - \bar{x}_+)(x_i - \bar{x}_+)^\top + \sum_{x_i \in C_-} (x_i - \bar{x}_-)(x_i - \bar{x}_-)^\top\right]. \tag{6}$$

By Bickel and Levina (2004), the normal vector of the NB hyperplane is given by

$$v_3 = D^{-1}u. \tag{7}$$

Page 34

The bias of the NB hyperplane can be calculated as

$$b_3 = -v_3^\top \hat{\mu}, \tag{8}$$

where $\hat{\mu} = (\bar{x}_+ + \bar{x}_-)/2$.
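Equations (6)-(8) give the NB rule directly; since only the diagonal of $\hat{\Sigma}$ enters, the full matrix never has to be formed. A minimal sketch (the function name is mine):

```python
# Naive Bayes rule from (6)-(8): v3 = D^{-1} u with D the diagonal of the
# pooled covariance, and bias b3 = -v3^T (xbar_plus + xbar_minus) / 2.
import numpy as np

def nb_rule(X, w):
    Xp, Xm = X[w == 1], X[w == -1]
    m, n = len(Xp), len(Xm)
    xbar_p, xbar_m = Xp.mean(axis=0), Xm.mean(axis=0)
    u = xbar_p - xbar_m                                  # MD normal vector
    pooled_var = (((Xp - xbar_p) ** 2).sum(axis=0) +
                  ((Xm - xbar_m) ** 2).sum(axis=0)) / (m + n - 2)  # diag of (6)
    v3 = u / pooled_var                                  # (7): D^{-1} u
    b3 = -v3 @ (xbar_p + xbar_m) / 2                     # (8)
    return lambda x0: np.sign(x0 @ v3 + b3)
```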

Page 35

Assumptions

The random vectors in $C_+$ and $C_-$ are independent with $d$-variate normal distributions $N_d(v_d, I_d)$ and $N_d(0, I_d)$, respectively.

The difference between these classes is determined by the mean vector $v_d$.

The length of $v_d$, i.e. $\|v_d\|$, is crucial for classification performance.

Page 36

Figure: Case $\|v_d\| \ll d^{1/2}$

Figure: Case $\|v_d\| \gg d^{1/2}$


Page 38

Here $\|v_d\| \approx d^{1/2}$ is a critical boundary for classification performance.

Because the separating hyperplanes of the methods are determined by their normal vectors, classification behavior is studied through the direction of this vector.

Under our assumptions, $v_d$ is the optimal direction for the normal vector of the separating hyperplane.

Thus, a discrimination method has good properties if the angle between its normal vector and $v_d$ is close to zero.

Page 39

Asymptotic behavior

Theorem 4.1

If $v$ represents the normal vector of the MD, SVM, DWD or MDP hyperplane, we have that

$$\mathrm{Angle}(v, v_d) \xrightarrow{w} \begin{cases} 0, & \text{if } \|v_d\|\, d^{-1/2} \to \infty; \\[4pt] \dfrac{\pi}{2}, & \text{if } \|v_d\|\, d^{-1/2} \to 0; \\[4pt] \arccos\!\left(\dfrac{c}{(\gamma + c^2)^{1/2}}\right), & \text{if } \|v_d\|\, d^{-1/2} \to c > 0; \end{cases}$$

as $d \to \infty$, where $\gamma = \frac{1}{m} + \frac{1}{n}$.

$\|v_d\| \gg d^{1/2}$: consistent

$\|v_d\| \ll d^{1/2}$: strongly inconsistent
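The boundary case of Theorem 4.1 can be checked by simulation. The sketch below does this for the MD normal vector with $\|v_d\|\, d^{-1/2} = c$ held fixed; the sample sizes, $c$, and $d$ are illustrative choices.

```python
# Simulation of the boundary case of Theorem 4.1 for the MD direction:
# with ||v_d|| d^{-1/2} = c, Angle(v, v_d) should approach
# arccos(c / sqrt(gamma + c^2)), where gamma = 1/m + 1/n.
import numpy as np

rng = np.random.default_rng(4)
m = n = 10
gamma, c, d = 1 / m + 1 / n, 1.0, 20_000
v_d = np.zeros(d)
v_d[0] = c * np.sqrt(d)                            # ||v_d|| d^{-1/2} = c

X_plus = rng.standard_normal((m, d)) + v_d         # N_d(v_d, I_d)
X_minus = rng.standard_normal((n, d))              # N_d(0, I_d)
u = X_plus.mean(axis=0) - X_minus.mean(axis=0)     # MD normal vector

cos = u @ v_d / (np.linalg.norm(u) * np.linalg.norm(v_d))
print(f"empirical angle: {np.arccos(cos):.3f} rad, "
      f"limit: {np.arccos(c / np.sqrt(gamma + c ** 2)):.3f} rad")
```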


Page 41

Different asymptotic behavior of the NB method

Proposition 4.1

Let $v$ be the NB normal vector and assume $m + n > 6$. Suppose $v_d = \beta \mathbf{1}_d$, where $\beta = \beta_d \to c$ as $d \to \infty$, with $0 \leq c \leq \infty$. Then

$$\mathrm{Angle}(v, v_d) \xrightarrow{w} \begin{cases} \arccos\!\left(\dfrac{\tilde{\mu}}{(\tilde{\sigma}^2 + \tilde{\mu}^2)^{1/2}}\right), & \text{if } c = \infty; \\[6pt] \arccos\!\left(\dfrac{c\,\tilde{\mu}}{(\gamma + c^2)^{1/2}(\tilde{\sigma}^2 + \tilde{\mu}^2)^{1/2}}\right), & \text{if } 0 \leq c < \infty; \end{cases} \tag{9}$$

as $d \to \infty$, where $\gamma = \frac{1}{m} + \frac{1}{n}$,

$$\tilde{\mu} = \frac{m+n-2}{m+n-4}, \qquad \tilde{\sigma}^2 = \frac{2(m+n-2)^2}{(m+n-4)^2(m+n-6)}. \tag{10}$$

The NB normal vector is always inconsistent.

It is strongly inconsistent when $c = 0$.
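Proposition 4.1 is also straightforward to check numerically. The sketch below simulates the NB normal vector under $v_d = \beta\mathbf{1}_d$ with $\beta$ held at a constant $c$ (all concrete values are illustrative) and compares the empirical angle with the limit in (9)-(10).

```python
# Simulation of Proposition 4.1: NB normal vector under v_d = beta * 1_d.
import numpy as np

rng = np.random.default_rng(5)
m = n = 5                                          # m + n > 6 as required
d, c = 50_000, 1.0                                 # beta_d = c for all d
gamma = 1 / m + 1 / n
v_d = np.full(d, c)

Xp = rng.standard_normal((m, d)) + v_d             # N_d(v_d, I_d)
Xm = rng.standard_normal((n, d))                   # N_d(0, I_d)
u = Xp.mean(axis=0) - Xm.mean(axis=0)              # MD normal vector
pooled_var = (((Xp - Xp.mean(0)) ** 2).sum(0) +
              ((Xm - Xm.mean(0)) ** 2).sum(0)) / (m + n - 2)
v3 = u / pooled_var                                # NB normal vector (7)

mu_t = (m + n - 2) / (m + n - 4)                   # (10)
s2_t = 2 * (m + n - 2) ** 2 / ((m + n - 4) ** 2 * (m + n - 6))
cos = v3 @ v_d / (np.linalg.norm(v3) * np.linalg.norm(v_d))
limit = np.arccos(c * mu_t / (np.sqrt(gamma + c ** 2) * np.sqrt(s2_t + mu_t ** 2)))
print(f"empirical: {np.arccos(cos):.3f} rad, limit: {limit:.3f} rad")
```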


Page 43

Similar asymptotic behavior of the NB method

Proposition 4.2

Let $v$ be the NB normal vector and assume $m + n > 6$. Suppose $v_d = (d^\delta, 0, \dots, 0)^\top$, with $\delta > 0$. Then

$$\mathrm{Angle}(v, v_d) \xrightarrow{w} \begin{cases} 0, & \text{if } \delta > 1/2; \\[4pt] \dfrac{\pi}{2}, & \text{if } \delta < 1/2; \end{cases} \tag{11}$$

as $d \to \infty$.

The NB method thus behaves like the other four methods when $c = 0$ ($\delta < 1/2$) and $c = \infty$ ($\delta > 1/2$).


Page 45

First case ($v_d = \beta\mathbf{1}_d$): the mean vector must interact with all of the estimated marginal variances. This introduces a large amount of noise into the classification process.

Second case ($v_d = (d^\delta, 0, \dots, 0)^\top$): only one estimated marginal variance has substantial influence on the classification, so its effect is asymptotically negligible.

Page 46

Comparison of the methods

Hall, Marron and Neeman (2005) only studied misclassification probabilities, which offered little contrast between the methods.

We provide a sharper comparison through a deeper study of the asymptotic behavior of the methods in terms of the normal vectors of their separating hyperplanes.

Page 47

When $\|v_d\| \gg d^{1/2}$

Theorem 5.1 (Case m = 1, n = 2)

Suppose $\|v_d\| = d^\delta$ with $1/2 < \delta < 1$. Let $v_{SVM}$ and $v_{MD}$ be the normal vectors of the SVM and MD methods, respectively, and let $A_{SVM} = \mathrm{Angle}(v_{SVM}, v_d)$ and $A_{MD} = \mathrm{Angle}(v_{MD}, v_d)$. Then

$$A_{SVM}^2 - A_{MD}^2 = \frac{2X_1^2}{d} + O_p(d^{-(1+\varepsilon)}) \quad \text{as } d \to \infty,$$

for some $\varepsilon > 0$, where $X_1^2$ is a random variable with the chi-square distribution with one degree of freedom.

Page 48

When $\|v_d\| \gg d^{1/2}$

Theorem 5.2 (Case m = 1, n = 2)

Suppose $\|v_d\| = d^\delta$ with $\delta > 1$. Let $A_{SVM}$ and $A_{MD}$ be as in Theorem 5.1. Then

$$P(A_{SVM} > A_{MD}) \longrightarrow 1 \quad \text{as } d \to \infty. \tag{12}$$
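A rough empirical look at Theorem 5.2 (all simulation parameters are my own choices): fit both methods repeatedly in the $m = 1$, $n = 2$ setting with $\delta > 1$ and count how often $A_{SVM}$ exceeds $A_{MD}$. The fraction should approach one as $d$ grows.

```python
# Empirical check of Theorem 5.2: with ||v_d|| = d^delta, delta > 1,
# P(A_SVM > A_MD) should be close to 1 for large d.
import numpy as np
from sklearn.svm import SVC

def angle(a, b):
    return np.arccos(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(6)
d, delta, reps, wins = 2_000, 1.2, 200, 0
v_d = np.zeros(d)
v_d[0] = d ** delta                                # ||v_d|| = d^delta

for _ in range(reps):
    X = np.vstack([rng.standard_normal(d) + v_d,   # m = 1 point in C+
                   rng.standard_normal((2, d))])   # n = 2 points in C-
    w = np.array([1.0, -1.0, -1.0])
    u = X[0] - X[1:].mean(axis=0)                  # MD normal vector
    v_svm = SVC(kernel="linear", C=1e6).fit(X, w).coef_.ravel()
    wins += angle(v_svm, v_d) > angle(u, v_d)
print("fraction of runs with A_SVM > A_MD:", wins / reps)
```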

Page 49

When $\|v_d\| \ll d^{1/2}$

Theorem 5.3 (Case m = 1, n = 2)

Suppose $v_d = (d^\delta, 0, \dots, 0)^\top$ with $0 < \delta < 1/2$. Let $A_{SVM}$ and $A_{MD}$ be as in Theorem 5.1. Then

$$A_{SVM} - A_{MD} = \frac{N_0}{d} + \frac{X_1^2}{\gamma^{1/2}\, d^{3/2-\delta}} + O_p(d^{-(3/2-\delta+\varepsilon')}) = \frac{N_0}{d} + O_p(d^{-(1+\varepsilon)})$$

as $d \to \infty$, for some $\varepsilon', \varepsilon > 0$, where $N_0$ converges in distribution to the product of two independent standard normal random variables as $d \to \infty$, and $X_1^2$ is a random variable with the chi-square distribution with one degree of freedom.

Page 50

Conclusions

The MD, SVM, DWD and MDP methods have the same asymptotic behavior as $d \to \infty$.

However, depending on the asymptotic behavior of $\|v_d\|$, these methods may be asymptotically consistent or inconsistent.

The NB method sometimes behaves worse than the other four methods.

Generally, MD is better than the SVM method when $d$ is large but fixed.

Page 51

J. Ahn and J. S. Marron. The Maximal Data Piling Direction for Discrimination. Biometrika, 97(1):254–259, 2010.

A. Bolivar-Cime and J. S. Marron. Comparison of Binary Discrimination Methods for High Dimension Low Sample Size Data. Manuscript, June 2011.

P. J. Bickel and E. Levina. Some Theory for Fisher's Linear Discriminant Function, "Naive Bayes", and Some Alternatives When There Are Many More Variables than Observations. Bernoulli, 10:989–1010, 2004.

C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2:121–167, 1998.

Page 52

N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, U.K., 2000.

C. Cortes and V. N. Vapnik. Support-Vector Networks. Machine Learning, 20:273–297, 1995.

P. Hall, J. S. Marron, and A. Neeman. Geometric Representation of High Dimension, Low Sample Size Data. Journal of the Royal Statistical Society, Series B, 67(3):427–444, 2005.

A. J. Izenman. Modern Multivariate Statistical Techniques: Regression, Classification and Manifold Learning. Springer Texts in Statistics. Springer, New York, 2008.

Page 53

J. S. Marron, M. J. Todd, and J. Ahn. Distance-Weighted Discrimination. Journal of the American Statistical Association, 102(480):1267–1271, 2007.

B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. The MIT Press, Cambridge, Massachusetts, 2002.

V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Series in Statistics. Springer-Verlag, Berlin, 1982.

V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, Berlin, 1995.