Svm my



  • 1. BY :-

2. CONTENTS
  • Classifiers
  • Difference b/w Classification and Clustering
  • What is SVM
  • Supervised learning
  • Linear SVM
  • Non-linear SVM
  • Features and Application

3. CLASSIFIERS
The goal of a classifier is to use an object's characteristics to identify which class (or group) it belongs to. Labels are available for some points (supervised learning).
[Figure: genes and proteins plotted against Feature X and Feature Y]

4. DIFFERENCE B/W CLASSIFICATION AND CLUSTERING
In general, in classification you have a set of predefined classes and want to know which class a new object belongs to. Clustering tries to group a set of objects. In the context of machine learning, classification is supervised learning and clustering is unsupervised learning.

5. WHAT IS SVM?
Support Vector Machines are based on the concept of decision planes that define decision boundaries. A decision plane is one that separates a set of objects having different class memberships.

6. The objects belong either to class GREEN or RED. The separating line defines a boundary on the right side of which all objects are GREEN and to the left of which all objects are RED. Any new object (white circle) falling to the right is labeled, i.e., classified, as GREEN (or classified as RED should it fall to the left of the separating line). This is a classic example of a linear classifier.

7. Most classification tasks are not as simple as the previous example. More complex structures are needed in order to make an optimal separation. Full separation of the GREEN and RED objects would require a curve (which is more complex than a line).

8. In the figure we can see the original objects (left side of the schematic) mapped, i.e., rearranged, using a set of mathematical functions known as kernels. The process of rearranging the objects is known as mapping (transformation).
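As a minimal sketch of the mapping idea on this slide, the snippet below uses a hypothetical one-dimensional dataset and an assumed feature map phi(x) = (x, x^2); neither comes from the slides, and the weights of the linear rule are picked by hand rather than learned. In the original space no single threshold separates the two classes, while in the mapped space a straight line does.

```python
# Hypothetical 1-D data: the RED class (-1) sits between two GREEN (+1) groups,
# so no single threshold on x separates the classes.
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]


def phi(x):
    """Assumed feature map: send x to the 2-D point (x, x^2)."""
    return (x, x * x)


def linear_classifier(z, w=(0.0, 1.0), b=-1.0):
    """Linear rule sign(w . z + b) in the mapped space (w and b chosen by hand)."""
    score = w[0] * z[0] + w[1] * z[1] + b
    return 1 if score > 0 else -1


def separable_by_threshold(xs, ys):
    """Check whether any rule 'x > t' (or its mirror image) labels every point correctly."""
    ordered = sorted(zip(xs, ys))
    truth = [label for _, label in ordered]
    for cut in range(len(ordered) + 1):
        guess = [-1] * cut + [1] * (len(ordered) - cut)
        if guess == truth or [-g for g in guess] == truth:
            return True
    return False


mapped = [phi(x) for x in points]
predictions = [linear_classifier(z) for z in mapped]

print("separable by a threshold in 1-D:", separable_by_threshold(points, labels))
print("linear rule correct after mapping:", predictions == labels)
```

The hand-picked rule corresponds to the horizontal line x^2 = 1 in the mapped plane; an SVM would instead learn such a line from the data.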
Note that in this new setting, the mapped objects (right side of the schematic) are linearly separable and, thus, instead of constructing the complex curve (left schematic), all we have to do is find an optimal line that can separate the GREEN and the RED objects.

9. Support Vector Machine (SVM) is primarily a classifier method that performs classification tasks by constructing hyperplanes in a multidimensional space that separate cases of different class labels. SVM supports both regression and classification tasks and can handle multiple continuous and categorical variables. For categorical variables a dummy variable is created with case values of either 0 or 1. Thus, a categorical dependent variable consisting of three levels, say (A, B, C), is represented by a set of three dummy variables: A: {1 0 0}, B: {0 1 0}, C: {0 0 1}

10. SUPPORT VECTOR MACHINES

11. SUPERVISED LEARNING
Training set: a number of expression profiles with known labels which represent the true population. Difference to clustering: there you don't know the labels; you have to find a structure on your own.
Learning/Training: find a decision rule which explains the training set well. This is the easy part, because we know the labels of the training set!
Generalisation ability: how does the decision rule learned from the training set generalize to new specimens?
Goal: find a decision rule with high generalisation ability.

12. LINEAR SEPARATORS
Binary classification can be viewed as the task of separating classes in feature space. The separator is the hyperplane $w^T x + b = 0$; points on one side satisfy $w^T x + b > 0$ and points on the other side satisfy $w^T x + b < 0$. The classifier is
$f(x) = \mathrm{sign}(w^T x + b)$

13. LINEAR SEPARATION OF THE TRAINING SET
A separating hyperplane is defined by the normal vector $w$ and the offset $b$:
hyperplane $= \{x \mid \langle w, x \rangle + b = 0\}$
$\langle w, x \rangle$ is called the inner product, scalar product or dot product.
Training: choose $w$ and $b$ from the labelled examples in the training set.

14. PREDICT THE LABEL OF A NEW POINT
Prediction: on which side of the hyperplane does the new point lie? Points in the direction of the normal vector are classified as POSITIVE. Points in the opposite direction are classified as NEGATIVE.

15. WHICH OF THE LINEAR SEPARATORS IS OPTIMAL?

16. CLASSIFICATION MARGIN
Distance from example $x_i$ to the separator is
$r = \frac{w^T x_i + b}{\|w\|}$
Examples closest to the hyperplane are support vectors. Margin $\rho$ of the separator is the distance between support vectors.

17. MAXIMUM MARGIN CLASSIFICATION
Maximizing the margin is good according to intuition and PAC theory. It implies that only support vectors matter; other training examples are ignorable.

18. LINEAR SVM MATHEMATICALLY
Let training set $\{(x_i, y_i)\}_{i=1..n}$, $x_i \in R^d$, $y_i \in \{-1, 1\}$ be separated by a hyperplane with margin $\rho$. Then for each training example $(x_i, y_i)$:
$w^T x_i + b \le -\rho/2$ if $y_i = -1$
$w^T x_i + b \ge \rho/2$ if $y_i = 1$
which is equivalent to $y_i (w^T x_i + b) \ge \rho/2$.
For every support vector $x_s$ the above inequality is an equality. After rescaling $w$ and $b$ by $\rho/2$ in the equality, we obtain that the distance between each $x_s$ and the hyperplane is
$r = \frac{y_s (w^T x_s + b)}{\|w\|} = \frac{1}{\|w\|}$
Then the margin can be expressed through (rescaled) $w$ and $b$ as:
$\rho = 2r = \frac{2}{\|w\|}$

19. LINEAR SVMS MATHEMATICALLY (CONT.)
Then we can formulate the quadratic optimization problem:
Find $w$ and $b$ such that $\rho = \frac{2}{\|w\|}$ is maximized and for all $(x_i, y_i)$, $i = 1..n$: $y_i (w^T x_i + b) \ge 1$
Which can be reformulated as:
Find $w$ and $b$ such that $\Phi(w) = \|w\|^2 = w^T w$ is minimized and for all $(x_i, y_i)$, $i = 1..n$: $y_i (w^T x_i + b) \ge 1$

20. SOLVING THE OPTIMIZATION PROBLEM
Find $w$ and $b$ such that $\Phi(w) = w^T w$ is minimized and for all $(x_i, y_i)$, $i = 1..n$: $y_i (w^T x_i + b) \ge 1$
Need to optimize a quadratic function subject to linear constraints. Quadratic optimization problems are a well-known class of mathematical programming problems for which several (non-trivial) algorithms exist.
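The constraints and the margin in the formulation above can be checked numerically for a candidate separator. The sketch below is illustrative only: the toy points and the candidate (w, b) are assumptions chosen by hand, not the output of a quadratic programming solver. It verifies y_i (w^T x_i + b) >= 1 for every example, lists the examples where equality holds (the candidates for support vectors), and evaluates the margin 2 / ||w||.

```python
import math

# Hypothetical 2-D training set with labels in {-1, +1}.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]

# Candidate separator w . x + b = 0, picked by hand (an assumption).
w = (0.5, 0.5)
b = -1.0


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def functional_margins(X, y, w, b):
    """Return y_i * (w . x_i + b) for every training example."""
    return [yi * (dot(w, xi) + b) for xi, yi in zip(X, y)]


margins = functional_margins(X, y, w, b)

# All constraints y_i (w . x_i + b) >= 1 satisfied?
feasible = all(m >= 1.0 - 1e-12 for m in margins)

# Indices where the constraint holds with equality.
tight = [i for i, m in enumerate(margins) if abs(m - 1.0) < 1e-9]

# Margin of the (rescaled) separator: rho = 2 / ||w||.
rho = 2.0 / math.sqrt(dot(w, w))

print("margins:", margins)
print("feasible:", feasible, "tight constraints at:", tight, "rho:", rho)
```

For this toy set the tight constraints belong to the points (2, 2) and (0, 0), and the computed margin equals the distance between them.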
The solution involves constructing a dual problem where a Lagrange multiplier $\alpha_i$ is associated with every inequality constraint in the primal (original) problem:
Find $\alpha_1 \ldots \alpha_n$ such that
$Q(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$ is maximized and
(1) $\sum_i \alpha_i y_i = 0$
(2) $\alpha_i \ge 0$ for all $\alpha_i$

21. THE OPTIMIZATION PROBLEM SOLUTION
Given a solution $\alpha_1 \ldots \alpha_n$ to the dual problem, the solution to the primal is:
$w = \sum_i \alpha_i y_i x_i$
$b = y_k - \sum_i \alpha_i y_i x_i^T x_k$ for any $\alpha_k > 0$
Each non-zero $\alpha_i$ indicates that the corresponding $x_i$ is a support vector.
Then the classifying function is (note that we don't need $w$ explicitly):
$f(x) = \sum_i \alpha_i y_i x_i^T x + b$
Notice that it relies on an inner product between the test point $x$ and the support vectors $x_i$; we will return to this later. Also keep in mind that solving the optimization problem involved computing the inner products $x_i^T x_j$ between all training points.

22. SOFT MARGIN CLASSIFICATION
What if the training set is not linearly separable? Slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.
[Figure: misclassified examples labelled with their slack variables $\xi_i$]

23. SOFT MARGIN CLASSIFICATION MATHEMATICALLY
The old formulation:
Find $w$ and $b$ such that $\Phi(w) = w^T w$ is minimized and for all $(x_i, y_i)$, $i = 1..n$: $y_i (w^T x_i + b) \ge 1$
The modified formulation incorporates slack variables:
Find $w$ and $b$ such that $\Phi(w) = w^T w + C \sum_i \xi_i$ is minimized and for all $(x_i, y_i)$, $i = 1..n$: $y_i (w^T x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$
Parameter $C$ can be viewed as a way to control overfitting: it trades off the relative importance of maximizing the margin and fitting the training data.

24. SOFT MARGIN CLASSIFICATION SOLUTION
The dual objective is identical to the separable case; the only change is the upper bound $C$ on each $\alpha_i$ (it would not be identical if the 2-norm penalty for slack variables $C \sum_i \xi_i^2$ was used in the primal objective; we would then need additional Lagrange multipliers for the slack variables):
Find $\alpha_1 \ldots \alpha_n$ such that
$Q(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$ is maximized and
(1) $\sum_i \alpha_i y_i = 0$
(2) $0 \le \alpha_i \le C$ for all $\alpha_i$
Again, $x_i$ with non-zero $\alpha_i$ will be support vectors. The solution to the dual problem is:
$w = \sum_i \alpha_i y_i x_i$
$b = y_k (1 - \xi_k) - \sum_i \alpha_i y_i x_i^T x_k$ for any $k$ s.t. $\alpha_k > 0$
Again, we don't need to compute $w$ explicitly for classification:
$f(x) = \sum_i \alpha_i y_i x_i^T x + b$

25. THEORETICAL JUSTIFICATION FOR MAXIMUM MARGINS
Vapnik has proved the following: the class of optimal linear separators has VC dimension $h$ bounded from above as
$h \le \min\left( \left\lceil \frac{D^2}{\rho^2} \right\rceil, m_0 \right) + 1$
where $\rho$ is the margin, $D$ is the diameter of the smallest sphere that can enclose all of the training examples, and $m_0$ is the dimensionality.
Intuitively, this implies that regardless of dimensionality $m_0$ we can minimize the VC dimension by maximizing the margin $\rho$. Thus, complexity of the classifier is kept small regardless of dimensionality.

26. LINEAR SVMS: OVERVIEW
The classifier is a separating hyperplane.
Most important training points are support vectors; they define the hyperplane.
Quadratic optimization algorithms can identify which training points $x_i$ are support vectors with non-zero Lagrangian multipliers $\alpha_i$.
Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:
Find $\alpha_1 \ldots \alpha_n$ such that
$Q(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$ is maximized and
(1) $\sum_i \alpha_i y_i = 0$
(2) $0 \le \alpha_i \le C$ for all $\alpha_i$
$f(x) = \sum_i \alpha_i y_i x_i^T x + b$

27. NON-LINEAR SVMS
Datasets that are linearly separable with some noise work out great.
But what are we going to do if the dataset is just too hard?
How about mapping data to a higher-dimensional space?
[Figure: one-dimensional data on the x axis, then the same data plotted against x and x^2]

28. NON-LINEAR SVMS: FEATURE SPACES
General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
$\Phi: x \to \varphi(x)$

29. PROPERTIES OF SVM
  • Flexibility in choosing a similarity function
  • Sparseness of solution when dealing with large data sets: only support vectors are used to specify the separating hyperplane
  • Ability to handle large feature spaces: complexity does not depend on the dimensionality of the feature space
  • Guaranteed to converge to a single global solution

30. SVM APPLICATIONS
SVM has been used successfully in many real-world problems:
  • text (and hypertext) categorization
  • image classification
  • bioinformatics (protein classification, cancer classification)
  • hand-written character recognition

31. THANK YOU