22
UNCORRECTED PROOF AI Communications 00 (20xx) 1–22 1 DOI 10.3233/AIC-170729 IOS Press 1 52 2 53 3 54 4 55 5 56 6 57 7 58 8 59 9 60 10 61 11 62 12 63 13 64 14 65 15 66 16 67 17 68 18 69 19 70 20 71 21 72 22 73 23 74 24 75 25 76 26 77 27 78 28 79 29 80 30 81 31 82 32 83 33 84 34 85 35 86 36 87 37 88 38 89 39 90 40 91 41 92 42 93 43 94 44 95 45 96 46 97 47 98 48 99 49 100 50 101 51 102 Linear discriminant analysis: A detailed tutorial Alaa Tharwat a,b,,∗∗ , Tarek Gaber c,, Abdelhameed Ibrahim d,and Aboul Ella Hassanien e,a Department of Computer Science and Engineering, Frankfurt University of Applied Sciences, Frankfurt am Main, Germany b Faculty of Engineering, Suez Canal University, Egypt E-mail: [email protected] c Faculty of Computers and Informatics, Suez Canal University, Egypt E-mail: [email protected] d Faculty of Engineering, Mansoura University, Egypt E-mail: [email protected] e Faculty of Computers and Information, Cairo University, Egypt E-mail: [email protected] Abstract. Linear Discriminant Analysis (LDA) is a very common technique for dimensionality reduction problems as a pre- processing step for machine learning and pattern classification applications. At the same time, it is usually used as a black box, but (sometimes) not well understood. The aim of this paper is to build a solid intuition for what is LDA, and how LDA works, thus enabling readers of all levels be able to get a better understanding of the LDA and to know how to apply this technique in different applications. The paper first gave the basic definitions and steps of how LDA technique works supported with visual explanations of these steps. Moreover, the two methods of computing the LDA space, i.e. class-dependent and class-independent methods, were explained in details. Then, in a step-by-step approach, two numerical examples are demonstrated to show how the LDA space can be calculated in case of the class-dependent and class-independent methods. Furthermore, two of the most common LDA problems (i.e. Small Sample Size (SSS) and non-linearity problems) were highlighted and illustrated, and state- of-the-art solutions to these problems were investigated and explained. Finally, a number of experiments was conducted with different datasets to (1) investigate the effect of the eigenvectors that used in the LDA space on the robustness of the extracted feature for the classification accuracy, and (2) to show when the SSS problem occurs and how it can be addressed. Keywords: Dimensionality reduction, PCA, LDA, Kernel Functions, Class-Dependent LDA, Class-Independent LDA, SSS (Small Sample Size) problem, eigenvectors artificial intelligence 1. Introduction Dimensionality reduction techniques are important in many applications related to machine learning [15], data mining [6,33], Bioinformatics [47], biometric [61] and information retrieval [73]. The main goal of the di- mensionality reduction techniques is to reduce the di- mensions by removing the redundant and dependent features by transforming the features from a higher di- mensional space that may lead to a curse of dimension- ality problem, to a space with lower dimensions. There * Scientific Research Group in Egypt, (SRGE), http://www. egyptscience.net. ** Corresponding author. E-mail: [email protected]. are two major approaches of the dimensionality reduc- tion techniques, namely, unsupervised and supervised approaches. In the unsupervised approach, there is no need for labeling classes of the data. 
While in the su- pervised approach, the dimensionality reduction tech- niques take the class labels into consideration [15,32]. There are many unsupervised dimensionality reduc- tion techniques such as Independent Component Anal- ysis (ICA) [28,31] and Non-negative Matrix Factor- ization (NMF) [14], but the most famous technique of the unsupervised approach is the Principal Component Analysis (PCA) [4,62,67,71]. This type of data reduc- tion is suitable for many applications such as visualiza- tion [2,40], and noise removal [70]. On the other hand, the supervised approach has many techniques such as [research-article] p. 1/22 0921-7126/17/$35.00 © 2017 – IOS Press and the authors. All rights reserved

Linear discriminant analysis: A detailed tutorialscholar.cu.edu.eg/?q=abo/files/linear_discreminate_analysisp_detaile… · Linear discriminant analysis: A detailed tutorial Alaa

  • Upload
    others

  • View
    24

  • Download
    0

Embed Size (px)

Citation preview

  • UNCO

    RREC

    TED

    PRO

    OF

    AI Communications 00 (20xx) 1–22 1DOI 10.3233/AIC-170729IOS Press1 52

    2 53

    3 54

    4 55

    5 56

    6 57

    7 58

    8 59

    9 60

    10 61

    11 62

    12 63

    13 64

    14 65

    15 66

    16 67

    17 68

    18 69

    19 70

    20 71

    21 72

    22 73

    23 74

    24 75

    25 76

    26 77

    27 78

    28 79

    29 80

    30 81

    31 82

    32 83

    33 84

    34 85

    35 86

    36 87

    37 88

    38 89

    39 90

    40 91

    41 92

    42 93

    43 94

    44 95

    45 96

    46 97

    47 98

    48 99

    49 100

    50 101

    51 102

    Linear discriminant analysis: A detailedtutorial

    Alaa Tharwat a,b,∗,∗∗, Tarek Gaber c,∗, Abdelhameed Ibrahim d,∗ and Aboul Ella Hassanien e,∗a Department of Computer Science and Engineering, Frankfurt University of Applied Sciences, Frankfurt am Main,Germanyb Faculty of Engineering, Suez Canal University, EgyptE-mail: [email protected] Faculty of Computers and Informatics, Suez Canal University, EgyptE-mail: [email protected] Faculty of Engineering, Mansoura University, EgyptE-mail: [email protected] Faculty of Computers and Information, Cairo University, EgyptE-mail: [email protected]

    Abstract. Linear Discriminant Analysis (LDA) is a very common technique for dimensionality reduction problems as a pre-processing step for machine learning and pattern classification applications. At the same time, it is usually used as a black box,but (sometimes) not well understood. The aim of this paper is to build a solid intuition for what is LDA, and how LDA works,thus enabling readers of all levels be able to get a better understanding of the LDA and to know how to apply this technique indifferent applications. The paper first gave the basic definitions and steps of how LDA technique works supported with visualexplanations of these steps. Moreover, the two methods of computing the LDA space, i.e. class-dependent and class-independentmethods, were explained in details. Then, in a step-by-step approach, two numerical examples are demonstrated to show howthe LDA space can be calculated in case of the class-dependent and class-independent methods. Furthermore, two of the mostcommon LDA problems (i.e. Small Sample Size (SSS) and non-linearity problems) were highlighted and illustrated, and state-of-the-art solutions to these problems were investigated and explained. Finally, a number of experiments was conducted withdifferent datasets to (1) investigate the effect of the eigenvectors that used in the LDA space on the robustness of the extractedfeature for the classification accuracy, and (2) to show when the SSS problem occurs and how it can be addressed.

    Keywords: Dimensionality reduction, PCA, LDA, Kernel Functions, Class-Dependent LDA, Class-Independent LDA, SSS(Small Sample Size) problem, eigenvectors artificial intelligence

    1. Introduction

    Dimensionality reduction techniques are importantin many applications related to machine learning [15],data mining [6,33], Bioinformatics [47], biometric [61]and information retrieval [73]. The main goal of the di-mensionality reduction techniques is to reduce the di-mensions by removing the redundant and dependentfeatures by transforming the features from a higher di-mensional space that may lead to a curse of dimension-ality problem, to a space with lower dimensions. There

    *Scientific Research Group in Egypt, (SRGE), http://www.egyptscience.net.

    **Corresponding author. E-mail: [email protected].

    are two major approaches of the dimensionality reduc-tion techniques, namely, unsupervised and supervisedapproaches. In the unsupervised approach, there is noneed for labeling classes of the data. While in the su-pervised approach, the dimensionality reduction tech-niques take the class labels into consideration [15,32].

    There are many unsupervised dimensionality reduc-tion techniques such as Independent Component Anal-ysis (ICA) [28,31] and Non-negative Matrix Factor-ization (NMF) [14], but the most famous technique ofthe unsupervised approach is the Principal ComponentAnalysis (PCA) [4,62,67,71]. This type of data reduc-tion is suitable for many applications such as visualiza-tion [2,40], and noise removal [70]. On the other hand,the supervised approach has many techniques such as

    [research-article] p. 1/22

    0921-7126/17/$35.00 © 2017 – IOS Press and the authors. All rights reserved

    mailto:[email protected]:[email protected]:[email protected]:[email protected]://www.egyptscience.nethttp://www.egyptscience.netmailto:[email protected]

  • UNCO

    RREC

    TED

    PRO

    OF

    2 A. Tharwat et al. / Linear discriminant analysis: A detailed tutorial

    1 52

    2 53

    3 54

    4 55

    5 56

    6 57

    7 58

    8 59

    9 60

    10 61

    11 62

    12 63

    13 64

    14 65

    15 66

    16 67

    17 68

    18 69

    19 70

    20 71

    21 72

    22 73

    23 74

    24 75

    25 76

    26 77

    27 78

    28 79

    29 80

    30 81

    31 82

    32 83

    33 84

    34 85

    35 86

    36 87

    37 88

    38 89

    39 90

    40 91

    41 92

    42 93

    43 94

    44 95

    45 96

    46 97

    47 98

    48 99

    49 100

    50 101

    51 102

    Mixture Discriminant Analysis (MDA) [25] and Neu-ral Networks (NN) [27], but the most famous techniqueof this approach is the Linear Discriminant Analysis(LDA) [50]. This category of dimensionality reductiontechniques are used in biometrics [12,36], Bioinfor-matics [77], and chemistry [11].

    The LDA technique is developed to transform thefeatures into a lower dimensional space, which max-imizes the ratio of the between-class variance to thewithin-class variance, thereby guaranteeing maximumclass separability [43,76]. There are two types of LDAtechnique to deal with classes: class-dependent andclass-independent. In the class-dependent LDA, oneseparate lower dimensional space is calculated for eachclass to project its data on it whereas, in the class-independent LDA, each class will be considered as aseparate class against the other classes [1,74]. In thistype, there is just one lower dimensional space for allclasses to project their data on it.

    Although the LDA technique is considered the mostwell-used data reduction techniques, it suffers from anumber of problems. In the first problem, LDA fails tofind the lower dimensional space if the dimensions aremuch higher than the number of samples in the datamatrix. Thus, the within-class matrix becomes singu-lar, which is known as the small sample problem (SSS).There are different approaches that proposed to solvethis problem. The first approach is to remove the nullspace of within-class matrix as reported in [56,79]. Thesecond approach used an intermediate subspace (e.g.PCA) to convert a within-class matrix to a full-rankmatrix; thus, it can be inverted [4,35]. The third ap-proach, a well-known solution, is to use the regulariza-tion method to solve a singular linear systems [38,57].In the second problem, the linearity problem, if differ-ent classes are non-linearly separable, the LDA can-not discriminate between these classes. One solution tothis problem is to use the kernel functions as reportedin [50].

    The brief tutorials on the two LDA types are re-ported in [1]. However, the authors did not show theLDA algorithm in details using numerical tutorials, vi-sualized examples, nor giving insight investigation ofexperimental results. Moreover, in [57], an overviewof the SSS for the LDA technique was presented in-cluding the theoretical background of the SSS prob-lem. Moreover, different variants of LDA techniquewere used to solve the SSS problem such as Di-rect LDA (DLDA) [22,83], regularized LDA (RLDA)[18,37,38,80], PCA+LDA [42], Null LDA (NLDA)[10,82], and kernel DLDA (KDLDA) [36]. In addition,

    the authors presented different applications that usedthe LDA-SSS techniques such as face recognition andcancer classification. Furthermore, they conducted dif-ferent experiments using three well-known face recog-nition datasets to compare between different variantsof the LDA technique. Nonetheless, in [57], there isno detailed explanation of how (with numerical exam-ples) to calculate the within and between class vari-ances to construct the LDA space. In addition, the stepsof constructing the LDA space are not supported withwell-explained graphs helping for well understandingof the LDA underlying mechanism. In addition, thenon-linearity problem was not highlighted.

    This paper gives a detailed tutorial about the LDAtechnique, and it is divided into five sections. Section 2gives an overview about the definition of the mainidea of the LDA and its background. This section be-gins by explaining how to calculate, with visual expla-nations, the between-class variance, within-class vari-ance, and how to construct the LDA space. The algo-rithms of calculating the LDA space and projecting thedata onto this space to reduce its dimension are thenintroduced. Section 3 illustrates numerical examples toshow how to calculate the LDA space and how to selectthe most robust eigenvectors to build the LDA space.While Section 4 explains the most two common prob-lems of the LDA technique and a number of state-of-the-art methods to solve (or approximately solve) theseproblems. Different applications that used LDA tech-nique are introduced in Section 5. In Section 6, differ-ent packages for the LDA and its variants were pre-sented. In Section 7, two experiments are conducted toshow (1) the influence of the number of the selectedeigenvectors on the robustness and dimension of theLDA space, (2) how the SSS problem occurs and high-lights the well-known methods to solve this problem.Finally, concluding remarks will be given in Section 8.

    2. LDA technique

    2.1. Definition of LDA

    The goal of the LDA technique is to project the orig-inal data matrix onto a lower dimensional space. Toachieve this goal, three steps needed to be performed.The first step is to calculate the separability betweendifferent classes (i.e. the distance between the meansof different classes), which is called the between-classvariance or between-class matrix. The second step is tocalculate the distance between the mean and the sam-ples of each class, which is called the within-class vari-

    [research-article] p. 2/22

  • UNCO

    RREC

    TED

    PRO

    OF

    A. Tharwat et al. / Linear discriminant analysis: A detailed tutorial 3

    1 52

    2 53

    3 54

    4 55

    5 56

    6 57

    7 58

    8 59

    9 60

    10 61

    11 62

    12 63

    13 64

    14 65

    15 66

    16 67

    17 68

    18 69

    19 70

    20 71

    21 72

    22 73

    23 74

    24 75

    25 76

    26 77

    27 78

    28 79

    29 80

    30 81

    31 82

    32 83

    33 84

    34 85

    35 86

    36 87

    37 88

    38 89

    39 90

    40 91

    41 92

    42 93

    43 94

    44 95

    45 96

    46 97

    47 98

    48 99

    49 100

    50 101

    51 102

    ance or within-class matrix. The third step is to con-struct the lower dimensional space which maximizesthe between-class variance and minimizes the within-class variance. This section will explain these threesteps in detail, and then the full description of the LDAalgorithm will be given. Figures 1 and 2 are used tovisualize the steps of the LDA technique.

    2.2. Calculating the between-class variance (SB )

    The between-class variance of the ith class (SBi )represents the distance between the mean of the ithclass (μi) and the total mean (μ). LDA techniquesearches for a lower-dimensional space, which is usedto maximize the between-class variance, or simply

    Fig. 1. Visualized steps to calculate a lower dimensional subspace of the LDA technique.

    [research-article] p. 3/22

  • UNCO

    RREC

    TED

    PRO

    OF

    4 A. Tharwat et al. / Linear discriminant analysis: A detailed tutorial

    1 52

    2 53

    3 54

    4 55

    5 56

    6 57

    7 58

    8 59

    9 60

    10 61

    11 62

    12 63

    13 64

    14 65

    15 66

    16 67

    17 68

    18 69

    19 70

    20 71

    21 72

    22 73

    23 74

    24 75

    25 76

    26 77

    27 78

    28 79

    29 80

    30 81

    31 82

    32 83

    33 84

    34 85

    35 86

    36 87

    37 88

    38 89

    39 90

    40 91

    41 92

    42 93

    43 94

    44 95

    45 96

    46 97

    47 98

    48 99

    49 100

    50 101

    51 102

    Fig. 2. Projection of the original samples (i.e. data matrix) on the lower dimensional space of LDA (Vk).

    maximize the separation distance between classes.To explain how the between-class variance or thebetween-class matrix (SB ) can be calculated, the fol-lowing assumptions are made. Given the original datamatrix X = {x1, x2, . . . , xN }, where xi represents theith sample, pattern, or observation and N is the to-tal number of samples. Each sample is represented byM features (xi ∈ RM ). In other words, each sam-ple is represented as a point in M-dimensional space.Assume the data matrix is partitioned into c = 3classes as follows, X = [ω1, ω2, ω3] as shown inFig. 1 (step (A)). Each class has five samples (i.e.n1 = n2 = n3 = 5), where ni represents the number ofsamples of the ith class. The total number of samples(N ) is calculated as follows, N = ∑3i=1 ni .

    To calculate the between-class variance (SB ), theseparation distance between different classes whichis denoted by (mi − m) will be calculated as fol-lows:

    (mi − m)2 =(WT μi − WT μ

    )2= WT (μi − μ)(μi − μ)T W (1)

    where mi represents the projection of the mean of theith class and it is calculated as follows, mi = WT μi ,where m is the projection of the total mean of allclasses and it is calculated as follows, m = WT μ, Wrepresents the transformation matrix of LDA,1 μi(1 ×M) represents the mean of the ith class and it is com-puted as in Equation (2), and μ(1 × M) is the to-tal mean of all classes and it can be computed as inEquation (3) [36,83]. Figure 1 shows the mean of each

    1The transformation matrix (W ) will be explained in Section 2.4.

    class and the total mean in step (B) and (C), respec-tively.

    μj = 1nj

    ∑xi∈ωj

    xi (2)

    μ = 1N

    N∑i=1

    xi =c∑

    i=1

    ni

    Nμi (3)

    where c represents the total number of classes (in ourexample c = 3).

    The term (μi − μ)(μi − μ)T in Equation (1) repre-sents the separation distance between the mean of theith class (μi) and the total mean (μ), or simply it repre-sents the between-class variance of the ith class (SBi ).Substitute SBi into Equation (1) as follows:

    (mi − m)2 = WT SBiW (4)

    The total between-class variance is calculated as fol-lows, (SB = ∑ci=1 niSBi ). Figure 1 (step (D)) showsfirst how the between-class matrix of the first class(SB1 ) is calculated and then how the total between-class matrix (SB ) is then calculated by adding all thebetween-class matrices of all classes.

    2.3. Calculating the within-class variance (SW )

    The within-class variance of the ith class (SWi ) rep-resents the difference between the mean and the sam-ples of that class. LDA technique searches for a lower-dimensional space, which is used to minimize the dif-ference between the projected mean (mi) and the pro-jected samples of each class (WT xi), or simply min-imizes the within-class variance [36,83]. The within-

    [research-article] p. 4/22

  • UNCO

    RREC

    TED

    PRO

    OF

    A. Tharwat et al. / Linear discriminant analysis: A detailed tutorial 5

    1 52

    2 53

    3 54

    4 55

    5 56

    6 57

    7 58

    8 59

    9 60

    10 61

    11 62

    12 63

    13 64

    14 65

    15 66

    16 67

    17 68

    18 69

    19 70

    20 71

    21 72

    22 73

    23 74

    24 75

    25 76

    26 77

    27 78

    28 79

    29 80

    30 81

    31 82

    32 83

    33 84

    34 85

    35 86

    36 87

    37 88

    38 89

    39 90

    40 91

    41 92

    42 93

    43 94

    44 95

    45 96

    46 97

    47 98

    48 99

    49 100

    50 101

    51 102

    class variance of each class (SWj ) is calculated as inEquation (5).

    ∑xi∈ωj ,j=1,...,c

    (WT xi − mj

    )2

    =∑

    xi∈ωj ,j=1,...,c

    (WT xij − WT μj

    )2

    =∑

    xi∈ωj ,j=1,...,cWT (xij − μj )2W

    =∑

    xi∈ωj ,j=1,...,cWT (xij − μj )(xij − μj )T W

    =∑

    xi∈ωj ,j=1,...,cWT SWj W (5)

    From Equation (5), the within-class variance foreach class can be calculated as follows, SWj = dTj ∗dj = ∑nji=1(xij −μj )(xij −μj )T , where xij representsthe ith sample in the j th class as shown in Fig. 1 (step(E), (F)), and dj is the centering data of the j th class,i.e. dj = ωj − μj = {xi}nji=1 − μj . Moreover, step (F)in the figure illustrates how the within-class varianceof the first class (SW1 ) in our example is calculated.The total within-class variance represents the sum ofall within-class matrices of all classes (see Fig. 1 (step(F))), and it can be calculated as in Equation (6).

    SW =3∑

    i=1SWi

    =∑

    xi∈ω1(xi − μ1)(xi − μ1)T

    +∑

    xi∈ω2(xi − μ2)(xi − μ2)T

    +∑

    xi∈ω3(xi − μ3)(xi − μ3)T (6)

    2.4. Constructing the lower dimensional space

    After calculating the between-class variance (SB )and within-class variance (SW ), the transformation ma-trix (W ) of the LDA technique can be calculated as inEquation (7), which is called Fisher’s criterion. Thisformula can be reformulated as in Equation (8).

    arg maxW

    WT SBW

    WT SWW(7)

    SWW = λSBW (8)

    where λ represents the eigenvalues of the transfor-mation matrix (W ). The solution of this problemcan be obtained by calculating the eigenvalues (λ ={λ1, λ2, . . . , λM}) and eigenvectors (V = {v1, v2, . . . ,vM}) of W = S−1W SB , if SW is non-singular [36,81,83].

    The eigenvalues are scalar values, while the eigen-vectors are non-zero vectors, which satisfies the Equa-tion (8) and provides us with the information aboutthe LDA space. The eigenvectors represent the direc-tions of the new space, and the corresponding eigen-values represent the scaling factor, length, or the mag-nitude of the eigenvectors [34,59]. Thus, each eigen-vector represents one axis of the LDA space, andthe associated eigenvalue represents the robustnessof this eigenvector. The robustness of the eigenvec-tor reflects its ability to discriminate between differ-ent classes, i.e. increase the between-class variance,and decreases the within-class variance of each class;hence meets the LDA goal. Thus, the eigenvectors withthe k highest eigenvalues are used to construct a lowerdimensional space (Vk), while the other eigenvectors({vk+1, vk+2, vM}) are neglected as shown in Fig. 1(step (G)).

    Figure 2 shows the lower dimensional space of theLDA technique, which is calculated as in Fig. 1 (step(G)). As shown, the dimension of the original data ma-trix (X ∈ RN×M ) is reduced by projecting it onto thelower dimensional space of LDA (Vk ∈ RM×k) as de-noted in Equation (9) [81]. The dimension of the dataafter projection is k; hence, M − k features are ig-nored or deleted from each sample. Thus, each sample(xi) which was represented as a point a M-dimensionalspace will be represented in a k-dimensional space byprojecting it onto the lower dimensional space (Vk) asfollows, yi = xiVk .

    Y = XVk (9)Figure 3 shows a comparison between two lower-

    dimensional sub-spaces. In this figure, the original datawhich consists of three classes as in our example areplotted. Each class has five samples, and all samplesare represented by two features only (xi ∈ R2) tobe visualized. Thus, each sample is represented as apoint in two-dimensional space. The transformationmatrix (W(2 × 2)) is calculated using the steps inSection 2.2, 2.3, and 2.4. The eigenvalues (λ1 and λ2)and eigenvectors (i.e. sub-spaces) (V = {v1, v2}) ofW are then calculated. Thus, there are two eigenvec-tors or sub-spaces. A comparison between the twolower-dimensional sub-spaces shows the following no-tices:

    [research-article] p. 5/22

  • UNCO

    RREC

    TED

    PRO

    OF

    6 A. Tharwat et al. / Linear discriminant analysis: A detailed tutorial

    1 52

    2 53

    3 54

    4 55

    5 56

    6 57

    7 58

    8 59

    9 60

    10 61

    11 62

    12 63

    13 64

    14 65

    15 66

    16 67

    17 68

    18 69

    19 70

    20 71

    21 72

    22 73

    23 74

    24 75

    25 76

    26 77

    27 78

    28 79

    29 80

    30 81

    31 82

    32 83

    33 84

    34 85

    35 86

    36 87

    37 88

    38 89

    39 90

    40 91

    41 92

    42 93

    43 94

    44 95

    45 96

    46 97

    47 98

    48 99

    49 100

    50 101

    51 102

    Fig. 3. A visualized comparison between the two lower-dimensionalsub-spaces which are calculated using three different classes.

    • First, the separation distance between differentclasses when the data are projected on the firsteigenvector (v1) is much greater than when thedata are projected on the second eigenvector (v2).As shown in the figure, the three classes are effi-ciently discriminated when the data are projectedon v1. Moreover, the distance between the meansof the first and second classes (m1 −m2) when theoriginal data are projected on v1 is much greaterthan when the data are projected on v2, which re-flects that the first eigenvector discriminates thethree classes better than the second one.

    • Second, the within-class variance when the dataare projected on v1 is much smaller than when itprojected on v2. For example, SW1 when the dataare projected on v1 is much smaller than when thedata are projected on v2. Thus, projecting the dataon v1 minimizes the within-class variance muchbetter than v2.

    From these two notes, we conclude that the first eigen-vector meets the goal of the lower-dimensional spaceof the LDA technique than the second eigenvector;hence, it is selected to construct a lower-dimensionalspace.

    2.5. Class-dependent vs. class-independent methods

    The aim of the two methods of the LDA is to calcu-late the LDA space. In the class-dependent LDA, one

    separate lower dimensional space is calculated for eachclass as follows, Wi = S−1Wi SB , where Wi representsthe transformation matrix for the ith class. Thus, eigen-values and eigenvectors are calculated for each trans-formation matrix separately. Hence, the samples ofeach class are projected on their corresponding eigen-vectors. On the other hand, in the class-independentmethod, one lower dimensional space is calculated forall classes. Thus, the transformation matrix is calcu-lated for all classes, and the samples of all classes areprojected on the selected eigenvectors [1].

    2.6. LDA algorithm

    In this section the detailed steps of the algorithms ofthe two LDA methods are presented. As shown in Al-gorithms 1 and 2, the first four steps in both algorithmsare the same. Table 1 shows the notations which areused in the two algorithms.

    2.7. Computational complexity of LDA

    In this section, the computational complexity forLDA is analyzed. The computational complexity forthe first four steps, common in both class-dependentand class-independent methods, are computed as fol-lows. As illustrated in Algorithm 1, in step (2), to cal-culate the mean of the ith class, there are niM addi-tions and M divisions, i.e., in total, there are (NM +cM) operations. In step (3), there are NM additionsand M divisions, i.e., there are (NM + M) opera-tions. The computational complexity of the fourth stepis c(M + M2 + M2), where M is for μi − μ, M2 for(μi −μ)(μi −μ)T , and the last M2 is for the multipli-cation between ni and the matrix (μi −μ)(μi −μ)T . Inthe fifth step, there are N(M + M2) operations, whereM is for (xij −μj ) and M2 is for (xij −μj )(xij −μj )T .In the sixth step, there are M3 operations to calculateS−1W , M3 is for the multiplication between S

    −1W and

    SB , and M3 to calculate the eigenvalues and eigenvec-tors. Thus, in class-independent method, the computa-tional complexity is O(NM2) if N > M; otherwise,the complexity is O(M3).

    In Algorithm 2, the number of operations to cal-culate the within-class variance for each class SWj inthe sixth step is nj (M + M2), and to calculate SW ,N(M + M2) operations are needed. Hence, calculat-ing the within-class variance for both LDA methodsare the same. In the seventh step and eighth, thereare M3 operations for the inverse, M3 for the multi-plication of S−1Wi SB , and M

    3 for calculating eigenval-

    [research-article] p. 6/22

  • UNCO

    RREC

    TED

    PRO

    OF

    A. Tharwat et al. / Linear discriminant analysis: A detailed tutorial 7

    1 52

    2 53

    3 54

    4 55

    5 56

    6 57

    7 58

    8 59

    9 60

    10 61

    11 62

    12 63

    13 64

    14 65

    15 66

    16 67

    17 68

    18 69

    19 70

    20 71

    21 72

    22 73

    23 74

    24 75

    25 76

    26 77

    27 78

    28 79

    29 80

    30 81

    31 82

    32 83

    33 84

    34 85

    35 86

    36 87

    37 88

    38 89

    39 90

    40 91

    41 92

    42 93

    43 94

    44 95

    45 96

    46 97

    47 98

    48 99

    49 100

    50 101

    51 102

    Algorithm 1 Linear Discriminant Analysis (LDA):Class-Independent

    1: Given a set of N samples [xi]Ni=1, each of which isrepresented as a row of length M as in Fig. 1 (step(A)), and X(N × M) is given by,

    X =

    ⎡⎢⎢⎢⎣

    x(1,1) x(1,2) . . . x(1,M)x(2,1) x(2,2) . . . x(2,M)

    ......

    ......

    x(N,1) x(N,2) . . . x(N,M)

    ⎤⎥⎥⎥⎦ (10)

    2: Compute the mean of each class μi(1 × M) as inEquation (2).

    3: Compute the total mean of all data μ(1 ×M) as inEquation (3).

    4: Calculate between-class matrix SB(M×M) as fol-lows:

    SB =c∑

    i=1ni(μi − μ)(μi − μ)T (11)

    5: Compute within-class matrix SW(M × M), as fol-lows:

    SW =c∑

    j=1

    nj∑i=1

    (xij − μj )(xij − μj )T (12)

    where xij represents the ith sample in the j thclass.

    6: From Equation (11) and (12), the matrix W thatmaximizing Fisher’s formula which is defined inEquation (7) is calculated as follows, W = S−1W SB .The eigenvalues (λ) and eigenvectors (V ) of W arethen calculated.

    7: Sorting eigenvectors in descending order accord-ing to their corresponding eigenvalues. The first keigenvectors are then used as a lower dimensionalspace (Vk).

    8: Project all original samples (X) onto the lower di-mensional space of LDA as in Equation (9).

    ues and eigenvectors. These two steps are repeated foreach class which increases the complexity of the class-dependent algorithm. Totally, the computational com-plexity of the class-dependent algorithm is O(NM2) ifN > M; otherwise, the complexity is O(cM3). Hence,

    Algorithm 2 Linear Discriminant Analysis (LDA):Class-Dependent

    1: Given a set of N samples [xi]Ni=1, each of which isrepresented as a row of length M as in Fig. 1 (step(A)), and X(N × M) is given by,

    X =

    ⎡⎢⎢⎢⎣

    x(1,1) x(1,2) . . . x(1,M)x(2,1) x(2,2) . . . x(2,M)

    ......

    ......

    x(N,1) x(N,2) . . . x(N,M)

    ⎤⎥⎥⎥⎦ (13)

    2: Compute the mean of each class μi(1 × M) as inEquation (2).

    3: Compute the total mean of all data μ(1 ×M) as inEquation (3).

    4: Calculate between-class matrix SB(M × M) as inEquation (11)

    5: for all Class i, i = 1, 2, . . . , c do6: Compute within-class matrix of each class

    SWi (M × M), as follows:

    SWj =∑

    xi∈ωj(xi − μj )(xi − μj )T (14)

    7: Construct a transformation matrix for each class(Wi) as follows:

    Wi = S−1Wi SB (15)

    8: The eigenvalues (λi) and eigenvectors (V i) ofeach transformation matrix (Wi) are then calcu-lated, where λi and V i represent the calculatedeigenvalues and eigenvectors of the ith class, re-spectively.

    9: Sorting the eigenvectors in descending order ac-cording to their corresponding eigenvalues. Thefirst k eigenvectors are then used to construct alower dimensional space for each class V ik .

    10: Project the samples of each class (ωi) onto theirlower dimensional space (V ik ), as follows:

    �j = xiV jk , xi ∈ ωj (16)

    where �j represents the projected samples ofthe class ωj .

    11: end for

    [research-article] p. 7/22

  • UNCO

    RREC

    TED

    PRO

    OF

    8 A. Tharwat et al. / Linear discriminant analysis: A detailed tutorial

    1 52

    2 53

    3 54

    4 55

    5 56

    6 57

    7 58

    8 59

    9 60

    10 61

    11 62

    12 63

    13 64

    14 65

    15 66

    16 67

    17 68

    18 69

    19 70

    20 71

    21 72

    22 73

    23 74

    24 75

    25 76

    26 77

    27 78

    28 79

    29 80

    30 81

    31 82

    32 83

    33 84

    34 85

    35 86

    36 87

    37 88

    38 89

    39 90

    40 91

    41 92

    42 93

    43 94

    44 95

    45 96

    46 97

    47 98

    48 99

    49 100

    50 101

    51 102

    Table 1

    Notation

    Notation Description

    X Data matrix

    N Total number of samples in X

    W Transformation matrix

    ni Number of samples in ωiμi The mean of the ith class

    μ Total or global mean of all samples

    SWi Within-class variance or scatter matrix of the ith class(ωi )

    SBi Between-class variance of the ith class (ωi )

    V Eigenvectors of W

    Vi ith eigenvector

    xij The ith sample in the j th class

    k The dimension of the lower dimensional space (Vk)

    xi ith sample

    M Dimension of X or the number of features of X

    Vk The lower dimensional space

    c Total number of classes

    mi The mean of the ith class after projection

    m The total mean of all classes after projection

    SW Within-class variance

    SB Between-class variance

    λ Eigenvalue matrix

    λi ith eigenvalue

    Y Projection of the original data

    ωi ith Class

    the class-dependent method needs computations morethan class-independent method.

    In our case, we assumed that there are 40 classesand each class has ten samples. Each sample is repre-sented by 4096 features (M > N ). Thus, the compu-tational complexity of the class-independent method isO(M3) = 40963, while the class-dependent methodneeds O(cM3) = 40 × 40963.

    3. Numerical examples

    In this section, two numerical examples will bepresented. The two numerical examples explain thesteps to calculate the LDA space and how the LDAtechnique is used to discriminate between only twodifferent classes. In the first example, the lower-dimensional space is calculated using the class-independent method, while in the second example, theclass-dependent method is used. Moreover, a compar-ison between the lower dimensional spaces of eachmethod is presented. In all numerical examples, the

    numbers are rounded up to the nearest hundredths(i.e. only two digits after the decimal point are dis-played).

    The first four steps of both class-independent andclass-dependent methods are common as illustrated inAlgorithms 1 and 2. Thus, in this section, we show howthese steps are calculated.

    Given two different classes, ω1(5×2) and ω2(6×2)have (n1 = 5) and (n2 = 6) samples, respectively.Each sample in both classes is represented by two fea-tures (i.e. M = 2) as follows:

    ω1 =

    ⎡⎢⎢⎢⎢⎣

    1.00 2.002.00 3.003.00 3.004.00 5.005.00 5.00

    ⎤⎥⎥⎥⎥⎦ and

    ω2 =

    ⎡⎢⎢⎢⎢⎢⎢⎣

    4.00 2.005.00 0.005.00 2.003.00 2.005.00 3.006.00 3.00

    ⎤⎥⎥⎥⎥⎥⎥⎦

    (17)

    To calculate the lower dimensional space usingLDA, first the mean of each class μj is calculated. Thetotal mean μ(1×2) is then calculated, which representsthe mean of all means of all classes. The values of themean of each class and the total mean are shown below,

    μ1 =[3.00 3.60

    ],

    μ2 =[4.67 2.00

    ], and

    μ = [ 511μ1 611μ2] = [3.91 2.727](18)

    The between-class variance of each class (SBi (2 ×2)) and the total between-class variance (SB(2×2)) arecalculated. The values of the between-class variance ofthe first class (SB1 ) is equal to,

    SB1 = n1(μ1 − μ)T (μ1 − μ)= 5 [−0.91 0.87]T [−0.91 0.87]

    =[

    4.13 −3.97−3.97 3.81

    ](19)

    Similarly, SB2 is calculated as follows:

    SB2 =[

    3.44 −3.31−3.31 3.17

    ](20)

    [research-article] p. 8/22

  • UNCO

    RREC

    TED

    PRO

    OF

    A. Tharwat et al. / Linear discriminant analysis: A detailed tutorial 9

    1 52

    2 53

    3 54

    4 55

    5 56

    6 57

    7 58

    8 59

    9 60

    10 61

    11 62

    12 63

    13 64

    14 65

    15 66

    16 67

    17 68

    18 69

    19 70

    20 71

    21 72

    22 73

    23 74

    24 75

    25 76

    26 77

    27 78

    28 79

    29 80

    30 81

    31 82

    32 83

    33 84

    34 85

    35 86

    36 87

    37 88

    38 89

    39 90

    40 91

    41 92

    42 93

    43 94

    44 95

    45 96

    46 97

    47 98

    48 99

    49 100

    50 101

    51 102

    The total between-class variance is calculated a fol-lows:

    SB = SB1 + SB2=

    [4.13 −3.97

    −3.97 3.81]

    +[

    3.44 −3.31−3.31 3.17

    ]

    =[

    7.58 −7.27−7.27 6.98

    ](21)

    To calculate the within-class matrix, first subtractthe mean of each class from each sample in that classand this step is called mean-centering data and it is cal-culated as follows, di = ωi − μi , where di representscentering data of the class ωi . The values of d1 and d2are as follows:

    d1 =

    ⎡⎢⎢⎢⎢⎣

    −2.00 −1.60−1.00 −0.600.00 −0.601.00 1.402.00 1.40

    ⎤⎥⎥⎥⎥⎦ and

    d2 =

    ⎡⎢⎢⎢⎢⎢⎢⎣

    −0.67 0.000.33 −2.000.33 0.00

    −1.67 0.000.33 1.001.33 1.00

    ⎤⎥⎥⎥⎥⎥⎥⎦

    (22)

    In the next two subsections, two different methodsare used to calculate the LDA space.

    3.1. Class-independent method

    In this section, the LDA space is calculated us-ing the class-independent method. This method rep-resents the standard method of LDA as in Algo-rithm 1.

    After centring the data, the within-class variancefor each class (SWi (2 × 2)) is calculated as follows,SWj = dTj ∗ dj =

    ∑nji=1(xij − μj )T (xij − μj ), where

    xij represents the ith sample in the j th class. The totalwithin-class matrix (SW(2 × 2)) is then calculated asfollows, SW = ∑ci=1 SWi . The values of the within-

    class matrix for each class and the total within-classmatrix are as follows:

    SW1 =[

    10.00 8.008.00 7.20

    ],

    SW2 =[

    5.33 1.001.00 6.00

    ],

    SW =[

    15.33 9.009.00 13.20

    ](23)

    The transformation matrix (W ) in the class-independentmethod can be obtained as follows, W = S−1W SB , andthe values of (S−1W ) and (W ) are as follows:

    S−1W =[

    0.11 −0.07−0.07 0.13

    ]and

    W =[

    1.36 −1.31−1.48 1.42

    ] (24)

    The eigenvalues (λ(2 × 2)) and eigenvectors (V (2 ×2)) of W are then calculated as follows:

    λ =[

    0.00 0.000.00 2.78

    ]and

    V =[−0.69 0.68−0.72 −0.74

    ] (25)

    From the above results it can be noticed that, thesecond eigenvector (V2) has corresponding eigenvaluemore than the first one (V1), which reflects that, thesecond eigenvector is more robust than the first one;hence, it is selected to construct the lower dimen-sional space. The original data is projected on thelower dimensional space, as follows, yi = ωiV2, whereyi(ni ×1) represents the data after projection of the ithclass, and its values will be as follows:

    y1 = ω1V2

    =

    ⎡⎢⎢⎢⎢⎣

    1.00 2.002.00 3.003.00 3.004.00 5.005.00 5.00

    ⎤⎥⎥⎥⎥⎦

    [0.68

    −0.74]

    =

    ⎡⎢⎢⎢⎢⎣

    −0.79−0.85−0.18−0.97−0.29

    ⎤⎥⎥⎥⎥⎦ (26)

    [research-article] p. 9/22

  • UNCO

    RREC

    TED

    PRO

    OF

    10 A. Tharwat et al. / Linear discriminant analysis: A detailed tutorial

    1 52

    2 53

    3 54

    4 55

    5 56

    6 57

    7 58

    8 59

    9 60

    10 61

    11 62

    12 63

    13 64

    14 65

    15 66

    16 67

    17 68

    18 69

    19 70

    20 71

    21 72

    22 73

    23 74

    24 75

    25 76

    26 77

    27 78

    28 79

    29 80

    30 81

    31 82

    32 83

    33 84

    34 85

    35 86

    36 87

    37 88

    38 89

    39 90

    40 91

    41 92

    42 93

    43 94

    44 95

    45 96

    46 97

    47 98

    48 99

    49 100

    50 101

    51 102

    Fig. 4. Probability density function of the projected data of the firstexample, (a) the projected data on V1, (b) the projected data on V2.

    Similarly, y2 is as follows:

    y2 = ω2V2 =

    ⎡⎢⎢⎢⎢⎢⎢⎣

    1.243.391.920.561.181.86

    ⎤⎥⎥⎥⎥⎥⎥⎦

    (27)

    Figure 4 illustrates a probability density function(pdf) graph of the projected data (yi) on the two eigen-vectors (V1 and V2). A comparison of the two eigen-vectors reveals the following:

    • The data of each class is completely discriminatedwhen it is projected on the second eigenvector(see Fig. 4(b)) than the first one (see Fig. 4(a).In other words, the second eigenvector maximizesthe between-class variance more than the firstone.

    • The within-class variance (i.e. the variance be-tween the same class samples) of the two classesare minimized when the data are projected on thesecond eigenvector. As shown in Fig. 4(b), thewithin-class variance of the first class is smallcompared with Fig. 4(a).

    3.2. Class-dependent method

    In this section, the LDA space is calculated using theclass-dependent method. As mentioned in Section 2.5,the class-dependent method aims to calculate a sepa-rate transformation matrix (Wi) for each class.

    The within-class variance for each class (SWi (2 ×2)) is calculated as in class-independent method. Thetransformation matrix (Wi) for each class is then cal-culated as follows, Wi = S−1Wi SB . The values of thetwo transformation matrices (W1 and W2) will be asfollows:

    W1 = S−1W1 SB

    =[

    10.00 8.008.00 7.20

    ]−1 [ 7.58 −7.27−7.27 6.98

    ]

    =[

    0.90 −1.00−1.00 1.25

    ] [7.58 −7.27

    −7.27 6.98]

    =[

    14.09 −13.53−16.67 16.00

    ](28)

    Similarly, W2 is calculated as follows:

    W2 =[

    1.70 −1.63−1.50 1.44

    ](29)

    The eigenvalues (λi) and eigenvectors (Vi) for eachtransformation matrix (Wi) are calculated, and the val-ues of the eigenvalues and eigenvectors are shown be-low.

    λω1 =[

    0.00 0.000.00 30.01

    ]and

    Vω1 =[−0.69 0.65−0.72 −0.76

    ] (30)

    λω2 =[

    3.14 0.000.00 0.00

    ]and

    Vω2 =[

    0.75 0.69−0.66 0.72

    ] (31)

    where λωi and Vωi represent the eigenvalues and eigen-vectors of the ith class, respectively.

    From the results shown (above) it can be seen that,the second eigenvector of the first class (V {2}ω1 ) has cor-responding eigenvalue more than the first one; thus,the second eigenvector is used as a lower dimensionalspace for the first class as follows, y1 = ω1 ∗ V {2}ω1 ,where y1 represents the projection of the samples of

    [research-article] p. 10/22

  • UNCO

    RREC

    TED

    PRO

    OF

    A. Tharwat et al. / Linear discriminant analysis: A detailed tutorial 11

    1 52

    2 53

    3 54

    4 55

    5 56

    6 57

    7 58

    8 59

    9 60

    10 61

    11 62

    12 63

    13 64

    14 65

    15 66

    16 67

    17 68

    18 69

    19 70

    20 71

    21 72

    22 73

    23 74

    24 75

    25 76

    26 77

    27 78

    28 79

    29 80

    30 81

    31 82

    32 83

    33 84

    34 85

    35 86

    36 87

    37 88

    38 89

    39 90

    40 91

    41 92

    42 93

    43 94

    44 95

    45 96

    46 97

    47 98

    48 99

    49 100

    50 101

    51 102

    Fig. 5. Probability density function (pdf) of the projected data using

    class-dependent method, the first class is projected on V {2}ω1 , whilethe second class is projected on V {1}ω2 .

    the first class. While, the first eigenvector in the secondclass (V {1}ω2 ) has corresponding eigenvalue more thanthe second one. Thus, V {1}ω2 is used to project the data ofthe second class as follows, y2 = ω2 ∗ V {1}ω2 , where y2represents the projection of the samples of the secondclass. The values of y1 and y2 will be as follows:

    y1 =

    ⎡⎢⎢⎢⎢⎣

    −0.88−1.00−0.35−1.24−0.59

    ⎤⎥⎥⎥⎥⎦ and y2 =

    ⎡⎢⎢⎢⎢⎢⎢⎣

    1.683.762.430.931.772.53

    ⎤⎥⎥⎥⎥⎥⎥⎦

    (32)

    Figure 5 shows a pdf graph of the projected data (i.e.y1 and y2) on the two eigenvectors (V

    {2}ω1 and V

    {1}ω2 ) and

    a number of findings are revealed the following:

    • First, the projection data of the two classes areefficiently discriminated.

    • Second, the within-class variance of the projectedsamples is lower than the within-class variance ofthe original samples.

    3.3. Discussion

    In these two numerical examples, the LDA space iscalculated using class-dependent and class-independentmethods.

    Figure 6 shows a further explanation of the twomethods as following:

    • Class-Independent: As shown from the figure,there are two eigenvectors, V1 (dotted black line)and V2 (solid black line). The differences betweenthe two eigenvectors are as follows:

    ∗ The projected data on the second eigenvec-tor (V2) which has the highest correspondingeigenvalue will discriminate the data of thetwo classes better than the first eigenvector. Asshown in the figure, the distance between theprojected means m1 −m2 which represents SB ,increased when the data are projected on V2than V1.

    ∗ The second eigenvector decreases the within-class variance much better than the first eigen-vector. Figure 6 illustrates that the within-classvariance of the first class (SW1 ) was muchsmaller when it was projected on V2 than V1.

    ∗ As a result of the above two findings, V2 is usedto construct the LDA space.

    • Class-Dependent: As shown from the figure, thereare two eigenvectors, V {2}ω1 (red line) and V

    {1}ω2

    (blue line), which represent the first and secondclasses, respectively. The differences between thetwo eigenvectors are as following:

    ∗ Projecting the original data on the two eigen-vectors discriminates between the two classes.As shown in the figure, the distance betweenthe projected means m1 − m2 is larger than thedistance between the original means μ1 − μ2.

    ∗ The within-class variance of each class is de-creased. For example, the within-class varianceof the first class (SW1 ) is decreased when it isprojected on its corresponding eigenvector.

    ∗ As a result of the above two findings, V {2}ω1 andV

    {1}ω2 are used to construct the LDA space.

    • Class-Dependent vs. Class-Independent: The twoLDA methods are used to calculate the LDAspace, but a class-dependent method calculatesseparate lower dimensional spaces for each classwhich has two main limitations: (1) it needsmore CPU time and calculations more than class-independent method; (2) it may lead to SSS prob-lem because the number of samples in each classaffects the singularity of SWi .

    2

    These findings reveal that the standard LDA techniqueused the class-independent method rather than usingthe class-dependent method.

    2SSS problem will be explained in Section 4.2.

    [research-article] p. 11/22

  • UNCO

    RREC

    TED

    PRO

    OF

    12 A. Tharwat et al. / Linear discriminant analysis: A detailed tutorial

    1 52

    2 53

    3 54

    4 55

    5 56

    6 57

    7 58

    8 59

    9 60

    10 61

    11 62

    12 63

    13 64

    14 65

    15 66

    16 67

    17 68

    18 69

    19 70

    20 71

    21 72

    22 73

    23 74

    24 75

    25 76

    26 77

    27 78

    28 79

    29 80

    30 81

    31 82

    32 83

    33 84

    34 85

    35 86

    36 87

    37 88

    38 89

    39 90

    40 91

    41 92

    42 93

    43 94

    44 95

    45 96

    46 97

    47 98

    48 99

    49 100

    50 101

    51 102

    Fig. 6. Illustration of the example of the two different methods of LDA methods. The blue and red lines represent the first and second eigenvectorsof the class-dependent approach, respectively, while the solid and dotted black lines represent the second and first eigenvectors of class-indepen-dent approach, respectively.

    4. Main problems of LDA

    Although LDA is one of the most common data re-duction techniques, it suffers from two main problems:the Small Sample Size (SSS) and linearity problems.In the next two subsections, these two problems willbe explained, and some of the state-of-the-art solutionsare highlighted.

    4.1. Linearity problem

    LDA technique is used to find a linear transforma-tion that discriminates between different classes. How-ever, if the classes are non-linearly separable, LDA cannot find a lower dimensional space. In other words,LDA fails to find the LDA space when the discrimina-tory information are not in the means of classes. Fig-ure 7 shows how the discriminatory information doesnot exist in the mean, but in the variance of the data.This is because the means of the two classes are equal.

    The mathematical interpretation for this problem is asfollows: if the means of the classes are approximatelyequal, so the SB and W will be zero. Hence, the LDAspace cannot be calculated.

    One of the solutions of this problem is based on thetransformation concept, which is known as a kernelmethods or functions [3,50]. Figure 7 illustrates howthe transformation is used to map the original data intoa higher dimensional space; hence, the data will be lin-early separable, and the LDA technique can find thelower dimensional space in the new space. Figure 8graphically and mathematically shows how two non-separable classes in one-dimensional space are trans-formed into a two-dimensional space (i.e. higher di-mensional space); thus, allowing linear separation.

    The kernel idea is applied in Support Vector Ma-chine (SVM) [49,66,69] Support Vector Regression(SVR) [58], PCA [51], and LDA [50]. Let φ representsa nonlinear mapping to the new feature space Z . The

    [research-article] p. 12/22

  • UNCO

    RREC

    TED

    PRO

    OF

    A. Tharwat et al. / Linear discriminant analysis: A detailed tutorial 13

    1 52

    2 53

    3 54

    4 55

    5 56

    6 57

    7 58

    8 59

    9 60

    10 61

    11 62

    12 63

    13 64

    14 65

    15 66

    16 67

    17 68

    18 69

    19 70

    20 71

    21 72

    22 73

    23 74

    24 75

    25 76

    26 77

    27 78

    28 79

    29 80

    30 81

    31 82

    32 83

    33 84

    34 85

    35 86

    36 87

    37 88

    38 89

    39 90

    40 91

    41 92

    42 93

    43 94

    44 95

    45 96

    46 97

    47 98

    48 99

    49 100

    50 101

    51 102

    Fig. 7. Two examples of two non-linearly separable classes, toppanel shows how the two classes are non-separable, while the bot-tom shows how the transformation solves this problem and the twoclasses are linearly separable.

    transformation matrix (W ) in the new feature space(Z) is calculated as in Equation (33).

    F(W) = max∣∣∣∣ W

    T SφBW

    WT SφWW

    ∣∣∣∣ (33)

    where W is a transformation matrix and Z is the newfeature space. The between-class matrix (SφB ) and the

    within-class matrix (SφW ) are defined as follows:

    SφB =

    c∑i=1

    ni(μ

    φi − μφ

    )(μ

    φi − μφ

    )T (34)

    SφW =

    c∑j=1

    nj∑i=1

    (φ{xij } − μφj

    )

    × (φ{xij } − μφj )T (35)

    Fig. 8. Example of kernel functions, the samples lie on the top panel(X) which are represented by a line (i.e. one-dimensional space) arenon-linearly separable, where the samples lie on the bottom panel(Z) which are generated from mapping the samples of the top spaceare linearly separable.

    where μφi = 1ni∑ni

    i=1 φ{xi} and μφ = 1N ×∑Ni=1 φ{xi} =

    ∑ci=1

    niN

    μφi .

    Thus, in kernel LDA, all samples are transformednon-linearly into a new space Z using the function φ.In other words, the φ function is used to map theoriginal features into Z space by creating a non-linear combination of the original samples using adot-products of it [3]. There are many types of ker-nel functions to achieve this aim. Examples of thesefunction include Gaussian or Radial Basis Function(RBF), K(xi, xj ) = exp(−‖xi − xj‖2/2σ 2), whereσ is a positive parameter, and the polynomial kernelof degree d , K(xi, xj ) = (〈xi, xj 〉 + c)d , [3,29,51,72].

    4.2. Small sample size problem

    4.2.1. Problem definitionSingularity, Small Sample Size (SSS), or under-

    sampled problem is one of the big problems of LDAtechnique. This problem results from high-dimensionalpattern classification tasks or a low number of train-ing samples available for each class compared withthe dimensionality of the sample space [30,38,82,85].

    [research-article] p. 13/22

  • UNCO

    RREC

    TED

    PRO

    OF

    14 A. Tharwat et al. / Linear discriminant analysis: A detailed tutorial

    1 52

    2 53

    3 54

    4 55

    5 56

    6 57

    7 58

    8 59

    9 60

    10 61

    11 62

    12 63

    13 64

    14 65

    15 66

    16 67

    17 68

    18 69

    19 70

    20 71

    21 72

    22 73

    23 74

    24 75

    25 76

    26 77

    27 78

    28 79

    29 80

    30 81

    31 82

    32 83

    33 84

    34 85

    35 86

    36 87

    37 88

    38 89

    39 90

    40 91

    41 92

    42 93

    43 94

    44 95

    45 96

    46 97

    47 98

    48 99

    49 100

    50 101

    51 102

    The SSS problem occurs when the SW is singular.3

    The upper bound of the rank4 of SW is N − c, whilethe dimension of SW is M × M [17,38]. Thus, in mostcases M � N − c which leads to SSS problem. Forexample, in face recognition applications, the size ofthe face image my reach to 100×100 = 10,000 pixels,which represent high-dimensional features and it leadsto a singularity problem.

    4.2.2. Common solutions to SSS problem:There are many studies that proposed many solu-

    tions for this problem; each has its advantages anddrawbacks.

    • Regularization (RLDA): In regularizationmethod, the identity matrix is scaled by multi-plying it by a regularization parameter (η > 0)and adding it to the within-class matrix to makeit non-singular [18,38,45,82]. Thus, the diagonalcomponents of the within-class matrix are biasedas follows, SW = SW + ηI . However, choosingthe value of the regularization parameter requiresmore tuning and a poor choice for this parame-ter can degrade the performance of the method[38,45]. Another problem of this method is thatthe parameter η is just added to perform the in-verse of SW and has no clear mathematical inter-pretation [38,57].

    • Sub-space: In this method, a non-singular inter-mediate space is obtained to reduce the dimen-sion of the original data to be equal to the rankof SW ; hence, SW becomes full-rank,5 and thenSW can be inverted. For example, Belhumeur etal. [4] used PCA, to reduce the dimensions of theoriginal space to be equal to N − c (i.e. the upperbound of the rank of SW ). However, as reportedin [22], losing some discriminant information is acommon drawback associated with the use of thismethod.

    • Null Space: There are many studies proposedto remove the null space of SW to make SWfull-rank; hence, invertible. The drawback of thismethod is that more discriminant information islost when the null space of SW is removed, whichhas a negative impact on how the lower dimen-sional space satisfies the LDA goal [83].

    3A matrix is singular if it is square, does not have a matrix in-verse, the determinant is zeros; hence, not all columns and rows areindependent.

    4The rank of the matrix represents the number of linearly inde-pendent rows or columns.

    5A is a full-rank matrix if all columns and rows of the matrix areindependent, (i.e. rank(A) = #rows = #cols) [23].

    Four different variants of the LDA technique that areused to solve the SSS problem are introduced as fol-lows:

    PCA+LDA technique. In this technique, the orig-inal d-dimensional features are first reduced to h-dimensional feature space using PCA, and then theLDA is used to further reduce the features to k-dimensions. The PCA is used in this technique to re-duce the dimensions to make the rank of SW is N − cas reported in [4]; hence, the SSS problem is addressed.However, the PCA neglects some discriminant infor-mation, which may reduce the classification perfor-mance [57,60].

Direct LDA technique. Direct LDA (DLDA) is one of the well-known techniques used to solve the SSS problem. This technique has two main steps [83]. In the first step, the transformation matrix, W, is computed to transform the training data to the range space of S_B. In the second step, the dimensionality of the transformed data is further reduced using some regulating matrices, as in Algorithm 4. The benefit of DLDA is that no discriminative features are neglected, as happens in the PCA+LDA technique [83].

Regularized LDA technique. In Regularized LDA (RLDA), a small perturbation is added to the S_W matrix to make it non-singular, as mentioned in [18]. This regularization can be applied as follows:

(S_W + ηI)^{-1} S_B w_i = λ_i w_i    (36)

where η represents a regularization parameter. The diagonal components of S_W are biased by adding this small perturbation [13,18]. However, the regularization parameter needs to be tuned, and a poor choice of it can degrade the generalization performance [57].
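The regularized eigenproblem in Equation (36) can be sketched numerically as follows. This is an illustrative NumPy/SciPy snippet, not code from the paper; the scatter matrices are random placeholders and the value of η is an arbitrary example.

import numpy as np
from scipy.linalg import eigh

# Placeholder scatter matrices; in practice S_B and S_W come from the training data.
rng = np.random.default_rng(0)
M = 50
A = rng.normal(size=(M, M)); S_B = A @ A.T           # symmetric between-class scatter
B = rng.normal(size=(M, 10)); S_W = B @ B.T          # rank-deficient within-class scatter

eta = 1e-3                                           # example regularization parameter
# Generalized symmetric eigenproblem: S_B w = lambda (S_W + eta I) w,
# equivalent to (S_W + eta I)^{-1} S_B w = lambda w, i.e. Equation (36).
eigvals, eigvecs = eigh(S_B, S_W + eta * np.eye(M))
order = np.argsort(eigvals)[::-1]                    # sort by decreasing eigenvalue
W = eigvecs[:, order]                                # columns span the RLDA space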

Null LDA technique. The aim of the NLDA technique is to find the orientation matrix W, and this can be achieved in two steps. In the first step, the range space of S_W is neglected, and the data are projected only on the null space of S_W, i.e. S_W W = 0. In the second step, the aim is to search for W that satisfies S_B W ≠ 0 and maximizes |W^T S_B W|. The high dimensionality of the feature space may lead to computational problems. This can be solved by (1) using the PCA technique as a pre-processing step, i.e. before applying the NLDA technique, to reduce the dimension of the feature space to N − 1 by removing the null space of S_T = S_B + S_W [57], or (2) using the PCA technique before the second step of the NLDA


technique [54]. Mathematically, in the Null LDA (NLDA) technique, the h column vectors of the transformation matrix W = [w_1, w_2, ..., w_h] are taken from the null space of S_W, i.e. w_i^T S_W w_i = 0, ∀i = 1, ..., h, where w_i^T S_B w_i ≠ 0. Hence, M − (N − c) linearly independent vectors are used to form a new orientation matrix, which is used to maximize |W^T S_B W| subject to the constraint |W^T S_W W| = 0, as in Equation (37).

W = arg max_{|W^T S_W W| = 0} |W^T S_B W|    (37)
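A rough NumPy sketch of this two-step idea follows; it is illustrative only (not the paper's code), with a placeholder rank-deficient S_W and an arbitrary number k of discriminant directions.

import numpy as np

def null_lda(S_B, S_W, k, tol=1e-10):
    """Illustrative Null LDA: restrict to the null space of S_W, then
    take the k directions with the largest between-class scatter there."""
    # Step 1: null space of S_W (eigenvectors with ~zero eigenvalues).
    w_vals, w_vecs = np.linalg.eigh(S_W)
    Z = w_vecs[:, w_vals < tol]                  # basis of null(S_W), size M x q
    # Step 2: maximize |W^T S_B W| inside that null space.
    b_vals, b_vecs = np.linalg.eigh(Z.T @ S_B @ Z)
    top = b_vecs[:, np.argsort(b_vals)[::-1][:k]]
    return Z @ top                               # M x k orientation matrix W

# Tiny synthetic check with a rank-deficient S_W.
rng = np.random.default_rng(0)
M = 20
B = rng.normal(size=(M, 5)); S_W = B @ B.T       # rank <= 5, so a null space exists
A = rng.normal(size=(M, M)); S_B = A @ A.T
W = null_lda(S_B, S_W, k=3)
print(np.allclose(W.T @ S_W @ W, 0, atol=1e-8))  # True: W lies in the null space of S_W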

    5. Applications of the LDA technique

In many applications, due to the high number of features or dimensions, the LDA technique has been used. Some of the applications of the LDA technique and its variants are described as follows:

    5.1. Biometrics applications

Biometric systems have two main steps, namely, feature extraction (including pre-processing steps) and recognition. In the first step, the features are extracted from the collected data, e.g. face images, and in the second step, an unknown sample, e.g. an unknown face image, is identified or verified. The LDA technique and its variants have been applied in this field. For example, in [10,20,41,68,75,83], the LDA technique was applied to face recognition. Moreover, the LDA technique was used in ear [84], fingerprint [44], gait [5], and speech [24] applications. In addition, the LDA technique was used for animal biometrics as in [19,65].

    5.2. Agriculture applications

In agriculture applications, an unknown sample can be classified into a pre-defined species using computational models [64]. In this application, different variants of the LDA technique were used to reduce the dimension of the collected features, as in [9,21,26,46,63,64].

    5.3. Medical applications

In medical applications, the data, such as DNA microarray data, consist of a large number of features or dimensions. Due to this high dimensionality, the computational models need more time for training, which may be infeasible and expensive. Moreover, this high dimensionality reduces the classification performance of the computational model and increases its complexity. This problem can be solved using the LDA technique to construct a new, smaller set of features from the large number of original features. Many papers have used LDA in medical applications [8,16,39,52–55].

    6. Packages

In this section, some of the available packages that are used to compute the space of LDA and its variants are described. For example, WEKA⁶ is a well-known Java-based data mining tool with open-source machine learning software for classification, association rules, regression, pre-processing, clustering, and visualization. In WEKA, the machine learning algorithms can be applied directly to the dataset or called from the user's own Java code. XLSTAT⁷ is another data analysis and statistical package for Microsoft Excel that has a wide variety of dimensionality reduction algorithms, including LDA. The dChip⁸ package is used for visualization of gene expression and SNP microarrays and includes some data analysis algorithms such as LDA, clustering, and PCA. LDA-SSS⁹ is a Matlab package that contains several algorithms related to the LDA technique and its variants, such as DLDA, PCA+LDA, and NLDA. The MASS¹⁰ package is based on R and has functions that perform linear and quadratic discriminant function analysis. The Dimensionality Reduction¹¹ package is mainly written in Matlab and has a number of dimensionality reduction techniques such as ULDA, QLDA, and KDA. DTREG¹² is a software package for medical data and business modeling, and it has several predictive modeling methods such as LDA, PCA, linear regression, and decision trees.
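Although not among the packages listed above, equivalent functionality is also available in Python's scikit-learn; the following minimal usage sketch (illustrative only, on the standard Iris data rather than any dataset from this paper) shows how a class-independent LDA projection is typically obtained with such a package.

# Illustrative only: the packages above are WEKA/Excel/R/Matlab tools, but the same
# kind of LDA projection can be computed in Python with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)            # 150 samples, 4 features, 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)              # projection onto at most c - 1 = 2 axes
print(X_lda.shape)                           # (150, 2)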

    7. Experimental results and discussion

In this section, two experiments were conducted to illustrate: (1) how the LDA technique is used for different applications;

⁶ http:/www.cs.waikato.ac.nz/ml/weka/
⁷ http://www.xlstat.com/en/
⁸ https://sites.google.com/site/dchipsoft/home
⁹ http://www.staff.usp.ac.fj/sharma_al/index.htm
¹⁰ http://www.statmethods.net/advstats/discriminant.html
¹¹ http://www.public.asu.edu/*jye02/Software/index.html
¹² http://www.dtreg.com/index.htm


(2) what the relation is between its parameter (the eigenvectors) and the accuracy of a classification problem; and (3) when the SSS problem can appear and a method for solving it.

    7.1. Experimental setup

This section gives an overview of the datasets, the platform, and the machine specification used to conduct our experiments. Different biometric datasets were used in the experiments to show how the LDA technique, through its parameter, behaves with different data. These datasets are described as follows:

• ORL dataset¹³ (Olivetti Research Laboratory, Cambridge) [48] is a face image dataset that consists of 40 distinct individuals. In this dataset, each individual has ten images taken at different times and under varying lighting conditions. The size of each image is 92 × 112.

• Yale dataset¹⁴ is another face image dataset, which contains 165 grey-scale images in GIF format of 15 individuals [78]. Each individual has 11 images with different expressions and configurations: center-light, happy, left-light, with glasses, normal, right-light, sad, sleepy, surprised, and wink.

• 2D ear dataset¹⁵ (Carreira-Perpinan, 1995) [7] was also used. The ear dataset consists of 17 distinct individuals. Six views of the left profile of each subject were taken under uniform, diffuse lighting.

In all experiments, k-fold cross-validation tests were used. In k-fold cross-validation, the original samples of the dataset are randomly partitioned into k subsets of (approximately) equal size and the experiment is run k times. Each time, one subset is used as the testing set and the other k − 1 subsets are used as the training set. The average of the k results from the folds is then calculated to produce a single estimation. In this study, the value of k was set to 10.

The images in all datasets were resized to 64 × 64 and 32 × 32 as shown in Table 2. Figure 9 shows samples of the used datasets and Table 2 gives a description of the datasets used in our experiments.

In all experiments, the Nearest Neighbour classifier was used to show the effect of the LDA eigenvector parameter and the SSS problem on the classification accuracy.

¹³ http://www.cam-orl.co.uk
¹⁴ http://vision.ucsd.edu/content/yale-face-database
¹⁵ http://faculty.ucmerced.edu/mcarreira-perpinan/software.html

Table 2
Dataset description

Dataset      Dimension (M)   No. of samples (N)   No. of classes (c)
ORL64×64     4096            400                  40
ORL32×32     1024            400                  40
Ear64×64     4096            102                  17
Ear32×32     1024            102                  17
Yale64×64    4096            165                  15
Yale32×32    1024            165                  15

This classifier classifies a testing image by comparing its position in the LDA space with the positions of the training images. Furthermore, class-independent LDA was used in all experiments. Moreover, the experiments were run on the Matlab platform (R2013b) on a PC with the following specifications: Intel(R) Core(TM) i5-2400 CPU @ 3.10 GHz and 4.00 GB RAM, under a 32-bit Windows operating system.
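For readers who want to reproduce a comparable protocol, the following hedged Python sketch (not the paper's original Matlab code) combines an LDA projection, a nearest neighbour classifier, and 10-fold cross-validation on synthetic stand-in data; all sizes are arbitrary examples rather than the ORL/Ear/Yale settings.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the image data (the real experiments used ORL, Ear, and Yale).
rng = np.random.default_rng(0)
N, M, c = 200, 64, 10
class_means = rng.normal(scale=3, size=(c, M))
X = rng.normal(size=(N, M)) + np.repeat(class_means, N // c, axis=0)
y = np.repeat(np.arange(c), N // c)

# Project with LDA, then classify each test sample by its nearest training sample.
model = make_pipeline(LinearDiscriminantAnalysis(),
                      KNeighborsClassifier(n_neighbors=1))
scores = cross_val_score(model, X, y, cv=10)     # 10-fold cross-validation
print(scores.mean())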

    7.2. Experiment on LDA parameter (eigenvectors)

The aim of this experiment is to investigate the relation between the number of eigenvectors used in the LDA space, the classification accuracy based on these eigenvectors, and the CPU time required for this classification.

As explained earlier, the LDA space consists of k eigenvectors, which are sorted according to their robustness (i.e. their eigenvalues). The robustness of each eigenvector reflects its ability to discriminate between different classes. Thus, in this experiment, it was checked whether increasing the number of eigenvectors would increase the total robustness of the constructed LDA space, so that different classes could be well discriminated. It was also tested whether increasing the number of eigenvectors would increase the dimension of the LDA space and of the projected data and, hence, the CPU time. To investigate these issues, three datasets listed in Table 2 (i.e. ORL32×32, Ear32×32, and Yale32×32) were used. Moreover, seven, four, and eight images from each subject in the ORL, Ear, and Yale datasets, respectively, were used in this experiment. The results of this experiment are presented in Fig. 10.

From Fig. 10 it can be noticed that the accuracy and the CPU time are proportional to the number of eigenvectors used to construct the LDA space. Thus, the choice of using LDA in a specific application should consider a trade-off between these factors.


    Fig. 9. Samples of the first individual in: ORL face dataset (top row); Ear dataset (middle row), and Yale face dataset (bottom row).

Fig. 10. Accuracy and CPU time of the LDA technique using different percentages of eigenvectors: (a) accuracy, (b) CPU time.

Moreover, from Fig. 10(a), it can be remarked that when the number of eigenvectors used in computing the LDA space is increased, the classification accuracy also increases up to a specific extent, after which the accuracy remains constant. As seen in Fig. 10(a), this extent differs from one application to another. For example, the accuracy of the Ear dataset remains constant when the percentage of the used eigenvectors is more than 10%. This was expected, as the eigenvectors of the LDA space are sorted according to their robustness (see Section 2.6). Similarly, in the ORL and Yale datasets the accuracy becomes approximately constant when the percentage of the used eigenvectors is more than 40%. In terms of CPU time, Fig. 10(b) shows the CPU time using different percentages of the eigenvectors. As shown, the CPU time increases dramatically when the number of eigenvectors increases.

Figure 11 shows the weights¹⁶ of the first 40 eigenvectors, which confirms these findings. From these results, we can conclude that the high-order eigenvectors of each application (the first 10% for the Ear dataset and the first 40% for the ORL and Yale datasets) are robust enough to extract and preserve the most discriminative features, which are used to achieve a good accuracy.

These experiments confirmed that increasing the number of eigenvectors increases the dimension of the LDA space and, hence, the CPU time. At the same time, the amount of preserved discriminative information, and consequently the accuracy, increases.

¹⁶ The weight of an eigenvector represents the ratio of its corresponding eigenvalue (λ_i) to the total of all eigenvalues (λ_j, j = 1, 2, ..., k), i.e. λ_i / Σ_{j=1}^{k} λ_j.
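The eigenvector weights of footnote 16, and a percentage-based selection of eigenvectors as used in this experiment, can be sketched as follows; this snippet is illustrative only, with random placeholder eigenvalues and an example 40% threshold.

import numpy as np

# Placeholder eigenvalues of the LDA eigenproblem, sorted in descending order.
eigvals = np.sort(np.random.default_rng(0).gamma(2.0, size=40))[::-1]

weights = eigvals / eigvals.sum()          # weight of each eigenvector (footnote 16)
pct = 0.40                                 # keep the first 40% of eigenvectors (example)
k = int(np.ceil(pct * len(eigvals)))
print(k, weights[:k].sum())                # fraction of total robustness retained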


Fig. 11. The robustness of the first 40 eigenvectors of the LDA technique using the ORL32×32, Ear32×32, and Yale32×32 datasets.

    7.3. Experiments on the small sample size problem

The aim of this experiment is to show when the LDA technique is subject to the SSS problem and which methods can be used to solve it. In this experiment, the PCA-LDA [4] and Direct-LDA [22,83] methods are used to address the SSS problem. A set of experiments was conducted to show how the two methods interact with the SSS problem under different settings, including the number of samples in each class, the total number of classes, and the dimension of each sample.

As explained in Section 4, using LDA directly may lead to the SSS problem when the dimension of the samples is much higher than the total number of samples. As shown in Table 2, the size of each image or sample of the ORL64×64 dataset is 64 × 64 = 4096, while the total number of samples is 400. The mathematical interpretation of this point is that the dimension of S_W is M × M, while the upper bound of the rank of S_W is N − c [79,82]. Thus, all the datasets reported in Table 2 lead to a singular S_W.

To address this problem, the PCA-LDA and Direct-LDA methods were implemented. In the PCA-LDA method, PCA is first used to reduce the dimension of the original data to make S_W full-rank, and then standard LDA can be performed safely in the resulting subspace, as in Algorithm 3. More details of the PCA-LDA method are reported in [4]. In the Direct-LDA method, the null space of the S_W matrix is removed to make S_W full-rank, and then the standard LDA space is calculated, as in Algorithm 4. More details of the Direct-LDA method are found in [83].

Table 2 illustrates various scenarios designed to test the effect of different dimensions on the rank of S_W and on the accuracy.

Algorithm 3 PCA-LDA
1: Read the training images X = {x_1, x_2, ..., x_N}, where x_i (ro × co) represents the i-th training image, ro and co represent the rows (height) and columns (width) of x_i, respectively, and N represents the total number of training images.
2: Convert all images into vector representation Z = {z_1, z_2, ..., z_N}, where the dimension of each vector is M × 1, M = ro × co.
3: Calculate the mean of each class μ_i, the total mean of all data μ, the between-class matrix S_B (M × M), and the within-class matrix S_W (M × M) as in Algorithm 1 (Steps 2–5).
4: Use the PCA technique to reduce the dimension of X to be equal to or lower than r, where r represents the rank of S_W, as follows:

X_PCA = U^T X    (38)

where U ∈ R^{M×r} is the lower-dimensional space of the PCA and X_PCA represents the projected data on the PCA space.
5: Calculate the mean of each class μ_i, the total mean of all data μ, the between-class matrix S_B, and the within-class matrix S_W of X_PCA as in Algorithm 1 (Steps 2–5).
6: Calculate W as in Equation (7) and then calculate the eigenvalues (λ) and eigenvectors (V) of the W matrix.
7: Sort the eigenvectors in descending order according to their corresponding eigenvalues. The first k eigenvectors are then used as the lower-dimensional space (V_k).
8: The original samples X are first projected on the PCA space as in Equation (38). The projection on the LDA space is then calculated as follows:

X_LDA = V_k^T X_PCA = V_k^T U^T X    (39)

where X_LDA represents the final projection on the LDA space.
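An illustrative NumPy translation of Algorithm 3 is sketched below. It is not the authors' Matlab code; it assumes X is an N × M matrix of vectorized training images with labels y (rows as samples, so Equations (38) and (39) appear in row form), and it uses a pseudo-inverse as a guard against a nearly singular S_W.

import numpy as np

def pca_lda(X, y, k):
    """Sketch of Algorithm 3: PCA down to at most rank(S_W), then class-independent LDA."""
    N, M = X.shape
    classes = np.unique(y)
    mu = X.mean(axis=0)

    def scatter(Z):
        # Between- and within-class scatter of data Z (rows are samples).
        m = Z.mean(axis=0)
        S_B = np.zeros((Z.shape[1],) * 2)
        S_W = np.zeros_like(S_B)
        for c in classes:
            Zc = Z[y == c]
            d = (Zc.mean(axis=0) - m)[:, None]
            S_B += Zc.shape[0] * (d @ d.T)
            D = Zc - Zc.mean(axis=0)
            S_W += D.T @ D
        return S_B, S_W

    # Step 4: PCA projection with r = N - c components (upper bound of rank(S_W)).
    r = N - len(classes)
    Xc = X - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U = Vt[:r].T                       # M x r PCA basis
    X_pca = Xc @ U                     # Equation (38), row form

    # Steps 5-7: LDA in the PCA subspace (pinv guards against near-singularity).
    S_B, S_W = scatter(X_pca)
    vals, vecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    order = np.argsort(vals.real)[::-1][:k]
    V_k = vecs[:, order].real          # r x k LDA basis

    # Step 8: final projection, Equation (39) in row form.
    return X_pca @ V_k, U, V_k

# Example with synthetic data: 60 samples, 200 features, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200)); y = np.repeat([0, 1, 2], 20)
X_lda, U, V_k = pca_lda(X, y, k=2)
print(X_lda.shape)                     # (60, 2)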

The results of these scenarios using both the PCA-LDA and Direct-LDA methods are summarized in Table 3.

As summarized in Table 3, the rank of S_W is very small compared to the whole dimension of S_W; hence, the SSS problem occurs in all cases. As shown in Table 3, both the PCA-LDA and the Direct-LDA methods solve the SSS problem.


Table 3
Accuracy of the PCA-LDA and Direct-LDA methods using the datasets listed in Table 2

Dataset     Dim(S_W)       # Training  # Testing  Rank(S_W)  Accuracy (%)
                           images      images                PCA-LDA   Direct-LDA
ORL32×32    1024 × 1024    5           5          160        75.5      88.5
                           7           3          240        75.5      97.5
                           9           1          340        80.39     97.5
Ear32×32    1024 × 1024    3           3          34         80.39     96.08
                           4           2          51         88.24     94.12
                           5           1          68         100       100
Yale32×32   1024 × 1024    6           5          75         78.67     90.67
                           8           3          105        84.44     97.79
                           10          1          135        100       100
ORL64×64    4096 × 4096    5           5          160        72        87.5
                           7           3          240        81.67     96.67
                           9           1          340        82.5      97.5
Ear64×64    4096 × 4096    3           3          34         74.5      96.08
                           4           2          51         91.18     96.08
                           5           1          68         100       100
Yale64×64   4096 × 4096    6           5          75         74.67     92
                           8           3          105        95.56     97.78
                           10          1          135        93.33     100

The bold values indicate that the corresponding methods obtain the best performance.

Algorithm 4 Direct-LDA
1: Read the training images X = {x_1, x_2, ..., x_N}, where x_i (ro × co) represents the i-th training image, ro and co represent the rows (height) and columns (width) of x_i, respectively, and N represents the total number of training images.
2: Convert all images into vector representation Z = {z_1, z_2, ..., z_N}, where the dimension of each vector is M × 1, M = ro × co.
3: Calculate the mean of each class μ_i, the total mean of all data μ, the between-class matrix S_B (M × M), and the within-class matrix S_W (M × M) of X as in Algorithm 1 (Steps 2–5).
4: Find the k eigenvectors of S_B with non-zero eigenvalues, and denote them by U = [u_1, u_2, ..., u_k] (i.e. U^T S_B U > 0).
5: Calculate the eigenvalues and eigenvectors of U^T S_W U, sort the eigenvalues, and discard the eigenvectors that have high eigenvalues. The selected eigenvectors are denoted by V; thus, V represents the null space of S_W.
6: The final LDA matrix consists of the range¹⁷ of S_B and the null space of S_W, i.e. the product UV.
7: The original data are projected on the LDA space as follows, Y = XUV.
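A hedged NumPy sketch of Algorithm 4 follows; it is illustrative only, the scatter matrices are random placeholders standing in for the output of Algorithm 1, and the fraction of discarded directions is an arbitrary choice.

import numpy as np

def direct_lda(S_B, S_W, keep=0.5, tol=1e-10):
    """Sketch of Algorithm 4: keep the range of S_B, then the directions of
    U^T S_W U with the smallest eigenvalues (an approximate null space of S_W)."""
    # Step 4: eigenvectors of S_B with non-zero eigenvalues (range of S_B).
    b_vals, b_vecs = np.linalg.eigh(S_B)
    U = b_vecs[:, b_vals > tol]                     # M x k
    # Step 5: diagonalize U^T S_W U and discard the high-eigenvalue directions.
    w_vals, w_vecs = np.linalg.eigh(U.T @ S_W @ U)  # eigenvalues in ascending order
    n_keep = max(1, int(keep * w_vecs.shape[1]))
    V = w_vecs[:, :n_keep]                          # smallest eigenvalues first
    # Step 6: the final projection matrix is the product U V.
    return U @ V

# Tiny synthetic usage; Step 7 is then Y = X @ direct_lda(S_B, S_W).
rng = np.random.default_rng(0)
M = 30
A = rng.normal(size=(M, 4)); S_B = A @ A.T          # rank <= 4
B = rng.normal(size=(M, 10)); S_W = B @ B.T
print(direct_lda(S_B, S_W).shape)                   # (30, 2): 4 range directions, half kept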

The Direct-LDA method achieved better results than PCA-LDA because, as reported in [22,83], the Direct-LDA method preserves approximately all the information that is important for classification, while the PCA step of the PCA-LDA method preserves the information with high variance.

    8. Conclusions

In this paper, the definition, mathematics, and implementation of LDA were presented and explained. The paper aimed to give low-level details on how the LDA technique can address the dimensionality reduction problem by extracting discriminative features that maximize the ratio between the between-class variance, S_B, and the within-class variance, S_W, thus discriminating between different classes. To achieve this aim, the paper not only explained the steps of calculating S_B and S_W (i.e. the LDA space) but also visualized these steps with figures and diagrams to make them easy to understand. Moreover, the two LDA methods, i.e. class-dependent and class-independent, were explained, and two numerical examples were given and graphically illustrated to explain

¹⁷ The range of a matrix A is the column space of A, C(A), where the dimension of C(A) is equal to the rank of A.


how the LDA space can be constructed using the two methods. In all examples, the mathematical interpretation of the robustness and the selection of the eigenvectors, as well as the data projection, were detailed and discussed. Also, common LDA problems (e.g. the SSS and non-linearity problems) were mathematically explained using graphical examples, and their state-of-the-art solutions were highlighted. Moreover, a detailed implementation of LDA applications was presented. Using three standard datasets, a number of experiments were conducted to (1) investigate and explain the relation between the number of eigenvectors and the robustness of the LDA space, and (2) practically show when the SSS problem occurs and how it can be addressed.

    References

[1] S. Balakrishnama and A. Ganapathiraju, Linear discriminant analysis – A brief tutorial, Institute for Signal and Information Processing (1998).
[2] E. Barshan, A. Ghodsi, Z. Azimifar and M.Z. Jahromi, Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds, Pattern Recognition 44(7) (2011), 1357–1371. doi:10.1016/j.patcog.2010.12.015.
[3] G. Baudat and F. Anouar, Generalized discriminant analysis using a kernel approach, Neural Computation 12(10) (2000), 2385–2404. doi:10.1162/089976600300014980.
[4] P.N. Belhumeur, J.P. Hespanha and D. Kriegman, Eigenfaces vs. fisherfaces: Recognition using class specific linear projection, IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7) (1997), 711–720. doi:10.1109/34.598228.
[5] N.V. Boulgouris and Z.X. Chi, Gait recognition using Radon transform and linear discriminant analysis, IEEE Transactions on Image Processing 16(3) (2007), 731–740. doi:10.1109/TIP.2007.891157.
[6] M. Bramer, Principles of Data Mining, 2nd edn, Springer, 2013.
[7] M. Carreira-Perpinan, Compression neural networks for feature extraction: Application to human recognition from ear images, MS thesis, Faculty of Informatics, Technical University of Madrid, Spain, 1995.
[8] H.-P. Chan, D. Wei, M.A. Helvie, B. Sahiner, D.D. Adler, M.M. Goodsitt and N. Petrick, Computer-aided classification of mammographic masses and normal tissue: Linear discriminant analysis in texture feature space, Physics in Medicine and Biology 40(5) (1995), 857–876. doi:10.1088/0031-9155/40/5/010.
[9] L. Chen, N.C. Carpita, W.-D. Reiter, R.H. Wilson, C. Jeffries and M.C. McCann, A rapid method to screen for cell-wall mutants using discriminant analysis of Fourier transform infrared spectra, The Plant Journal 16(3) (1998), 385–392. doi:10.1046/j.1365-313x.1998.00301.x.
[10] L.-F. Chen, H.-Y.M. Liao, M.-T. Ko, J.-C. Lin and G.-J. Yu, A new LDA-based face recognition system which can solve the small sample size problem, Pattern Recognition 33(10) (2000), 1713–1726. doi:10.1016/S0031-3203(99)00139-9.
[11] D. Coomans, D. Massart and L. Kaufman, Optimization by statistical linear discriminant analysis in analytical chemistry, Analytica Chimica Acta 112(2) (1979), 97–122. doi:10.1016/S0003-2670(01)83513-3.
[12] J. Cui and Y. Xu, Three dimensional palmprint recognition using linear discriminant analysis method, in: Proceedings of the Second International Conference on Innovations in Bio-Inspired Computing and Applications (IBICA), 2011, IEEE, 2011, pp. 107–111. doi:10.1109/IBICA.2011.31.
[13] D.-Q. Dai and P.C. Yuen, Face recognition by regularized discriminant analysis, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 37(4) (2007), 1080–1085. doi:10.1109/TSMCB.2007.895363.
[14] D. Donoho and V. Stodden, When does non-negative matrix factorization give a correct decomposition into parts? in: Advances in Neural Information Processing Systems 16, MIT Press, 2004, pp. 1141–1148.
[15] R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, 2nd edn, Wiley, 2012.
[16] S. Dudoit, J. Fridlyand and T.P. Speed, Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American Statistical Association 97(457) (2002), 77–87. doi:10.1198/016214502753479248.
[17] T.-t. Feng and G. Wu, A theoretical contribution to the fast implementation of null linear discriminant analysis method using random matrix multiplication with scatter matrices, 2014, arXiv preprint arXiv:1409.2579.
[18] J.H. Friedman, Regularized discriminant analysis, Journal of the American Statistical Association 84(405) (1989), 165–175. doi:10.1080/01621459.1989.10478752.
[19] T. Gaber, A. Tharwat, A.E. Hassanien and V. Snasel, Biometric cattle identification approach based on Weber's local descriptor and adaboost classifier, Computers and Electronics in Agriculture 122 (2016), 55–66. doi:10.1016/j.compag.2015.12.022.
[20] T. Gaber, A. Tharwat, A. Ibrahim, V. Snasel and A.E. Hassanien, Human thermal face recognition based on random linear oracle (RLO) ensembles, in: International Conference on Intelligent Networking and Collaborative Systems, IEEE, 2015, pp. 91–98. doi:10.1109/INCoS.2015.67.
[21] T. Gaber, A. Tharwat, V. Snasel and A.E. Hassanien, Plant identification: Two dimensional-based vs. one dimensional-based feature extraction methods, in: 10th International Conference on Soft Computing Models in Industrial and Environmental Applications, Springer, 2015, pp. 375–385.
[22] H. Gao and J.W. Davis, Why direct LDA is not equivalent to LDA, Pattern Recognition 39(5) (2006), 1002–1006. doi:10.1016/j.patcog.2005.11.016.
[23] G.H. Golub and C.F. Van Loan, Matrix Computations, 4th edn, Vol. 3, Johns Hopkins University Press, 2012.
[24] R. Haeb-Umbach and H. Ney, Linear discriminant analysis for improved large vocabulary continuous speech recognition, in: IEEE International Conference on Acoustics, Speech, and Signal Processing (1992), Vol. 1, IEEE, 1992, pp. 13–16.
[25] T. Hastie and R. Tibshirani, Discriminant analysis by Gaussian mixtures, Journal of the Royal Statistical Society. Series B (Methodological) (1996), 155–176.
[26] K. Héberger, E. Csomós and L. Simon-Sarkadi, Principal component and linear discriminant analyses of free amino acids and biogenic amines in Hungarian wines, Journal of Agricultural and Food Chemistry 51(27) (2003), 8055–8060. doi:10.1021/jf034851c.


[27] G.E. Hinton and R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313(5786) (2006), 504–507. doi:10.1126/science.1127647.
[28] K. Honda and H. Ichihashi, Fuzzy local independent component analysis with external criteria and its application to knowledge discovery in databases, International Journal of Approximate Reasoning 42(3) (2006), 159–173. doi:10.1016/j.ijar.2005.10.011.
[29] Q. Hu, L. Zhang, D. Chen, W. Pedrycz and D. Yu, Gaussian kernel based fuzzy rough sets: Model, uncertainty measures and applications, International Journal of Approximate Reasoning 51(4) (2010), 453–471. doi:10.1016/j.ijar.2010.01.004.
[30] R. Huang, Q. Liu, H. Lu and S. Ma, Solving the small sample size problem of LDA, in: Proceedings of 16th International Conference on Pattern Recognition, 2002, Vol. 3, IEEE, 2002, pp. 29–32.
[31] A. Hyvärinen, J. Karhunen and E. Oja, Independent Component Analysis, Vol. 46, Wiley, 2004.
[32] M. Kirby, Geometric Data Analysis: An Empirical Approach to Dimensionality Reduction and the Study of Patterns, Wiley, 2000.
[33] D.T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, 1st edn, Wiley, 2014.
[34] T. Li, S. Zhu and M. Ogihara, Using discriminant analysis for multi-class classification: An experimental investigation, Knowledge and Information Systems 10(4) (2006), 453–472. doi:10.1007/s10115-006-0013-y.
[35] K. Liu, Y.-Q. Cheng, J.-Y. Yang and X. Liu, An efficient algorithm for Foley–Sammon optimal set of discriminant vectors by algebraic method, International Journal of Pattern Recognition and Artificial Intelligence 6(5) (1992), 817–829. doi:10.1142/S0218001492000412.
[36] J. Lu, K.N. Plataniotis and A.N. Venetsanopoulos, Face recognition using LDA-based algorithms, IEEE Transactions on Neural Networks 14(1) (2003), 195–200. doi:10.1109/TNN.2002.806647.
[37] J. Lu, K.N. Plataniotis and A.N. Venetsanopoulos, Regularized discriminant analysis for the small sample size problem in face recognition, Pattern Recognition Letters 24(16) (2003), 3079–3087. doi:10.1016/S0167-8655(03)00167-3.
[38] J. Lu, K.N. Plataniotis and A.N. Venetsanopoulos, Regularization studies of linear discriminant analysis in small sample size scenari