Review for final exam 2015: Fundamentals of ANN, RBF-ANN using clustering, Bayesian decision theory, Genetic algorithms, SOM, SVM


(Slide credits: Lecture Notes for E. Alpaydın 2010, Introduction to Machine Learning 2e, © The MIT Press, V1.0.)

Fundamentals of ANN

Boolean AND: a linearly separable 2D binary classification problem. From the data table, x_1 + x_2 = 1.5 is an acceptable linear discriminant.

Linear discriminant: w^T x = 0. [The slide's table of inputs (x_1, x_2), required outputs r, and the corresponding weight choices was garbled in extraction; only the first row, x_1 = 0, x_2 = 0, r = 0, which constrains w_0, survives.]

Bayesian decision theory

2 classes. In Bayes' rule, which functions are priors, which are class likelihoods, and which are posteriors? How do I use Bayes' rule to assign an instance to a class?

Review: Bayes' rule, K > 2 classes:

    P(C_i | x) = p(x | C_i) P(C_i) / p(x)

where P(C_i) is the prior, p(x | C_i) is the class likelihood, and P(C_i | x) is the posterior; p(x) = Σ_k p(x | C_k) P(C_k) normalizes the posteriors. Assign an instance x to the class with the largest posterior.

Estimating priors and class likelihoods from data. Assuming the training data are Gaussian distributed, how do we estimate the priors and class likelihoods? With class labels r_i^t (r_i^t = 1 if x^t belongs to C_i, 0 otherwise), the estimators are

    P(C_i) = Σ_t r_i^t / N
    m_i = Σ_t r_i^t x^t / Σ_t r_i^t
    S_i = Σ_t r_i^t (x^t - m_i)(x^t - m_i)^T / Σ_t r_i^t

Naive Bayes classification. What assumption is made to get the simpler form of the class likelihoods called naive Bayes? What is the form of this approximation to the class likelihoods? What parameters characterize a class in this approximation?

Review: naive Bayes classification. Assume the components of x are independent random variables. The covariance matrix is then diagonal, and p(x | C_i) is the product of the probabilities of the individual components of x. Each class is characterized by a set of means and variances, one pair for each component of the attributes in that class.

Risk of actions. Action α_i: assigning x to C_i, one of K classes. Loss λ_ik occurs if we take action α_i when x actually belongs to C_k. The risk of action α_i given x is

    R(α_i | x) = Σ_k λ_ik P(C_k | x)

Minimizing the risk of classification given attributes x: calculate R(α_i | x) for the 0/1 loss function (correct decisions incur no loss and all errors have equal cost) when the posteriors are normalized. For minimum risk, choose the most probable class.

Example of risk minimization with λ_11 = λ_22 = 0, λ_12 = 10, and λ_21 = 1 (recall that loss λ_ik occurs if we take α_i when x belongs to C_k):

    R(α_1 | x) = λ_11 P(C_1 | x) + λ_12 P(C_2 | x) = 10 P(C_2 | x)
    R(α_2 | x) = λ_21 P(C_1 | x) + λ_22 P(C_2 | x) = P(C_1 | x)

Choose C_1 if R(α_1 | x) < R(α_2 | x), which is true if 10 P(C_2 | x) < P(C_1 | x); using the normalization of the posteriors, this becomes P(C_1 | x) > 10/11. The consequence of erroneously assigning an instance to C_1 is so bad that we choose C_1 only when we are virtually certain it is correct.
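A minimal sketch of this estimate-then-decide pipeline, assuming NumPy and SciPy are available; the function names and toy data are illustrative, not from the slides. It estimates priors, means, and covariances from labeled Gaussian data, forms posteriors with Bayes' rule, and takes the action with minimum risk under the loss matrix above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_classes(X, y, n_classes):
    """Estimate P(C_i), m_i, S_i from labeled data. (np.cov divides by
    N_i - 1 rather than the slides' N_i; the difference is minor here.)"""
    priors, means, covs = [], [], []
    for i in range(n_classes):
        Xi = X[y == i]
        priors.append(len(Xi) / len(X))
        means.append(Xi.mean(axis=0))
        covs.append(np.cov(Xi, rowvar=False))
    return np.array(priors), means, covs

def posteriors(x, priors, means, covs):
    """Bayes' rule: P(C_i|x) = p(x|C_i) P(C_i) / p(x)."""
    lik = np.array([multivariate_normal.pdf(x, m, S)
                    for m, S in zip(means, covs)])
    joint = lik * priors
    return joint / joint.sum()

def min_risk_action(x, priors, means, covs, loss):
    """Choose alpha_i minimizing R(alpha_i|x) = sum_k lambda_ik P(C_k|x)."""
    post = posteriors(x, priors, means, covs)
    risks = loss @ post          # risks[i] = sum_k loss[i, k] * post[k]
    return int(np.argmin(risks)), risks

# Toy 2-class data and the slide's loss matrix:
# lambda_11 = lambda_22 = 0, lambda_12 = 10, lambda_21 = 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
priors, means, covs = fit_gaussian_classes(X, y, n_classes=2)
loss = np.array([[0.0, 10.0],
                 [1.0, 0.0]])
action, risks = min_risk_action(np.array([1.5, 1.5]), priors, means, covs, loss)
print(action, risks)  # action 0 (assign to C1) only if P(C1|x) > 10/11
```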
Genetic algorithms

Given the fitness of the chromosomes in a population, how do we choose a pair of chromosomes to update the population by crossover?

Given fitness f(x_i) for each chromosome in the population, assign each chromosome a discrete probability

    p_i = f(x_i) / Σ_j f(x_j)

Use the p_i to design a roulette wheel: divide the number line between 0 and 1 into segments of length p_i in a specified order; get r, a random number uniformly distributed between 0 and 1; choose the chromosome of the line segment containing r. Repeat for a second chromosome.

Exercise: 5-bit chromosomes (e.g. 00100) have the fitness values given on the slide; design a roulette wheel for random selection of chromosomes to replicate. [The slide's table of chromosomes, fitnesses f(x_i), and probabilities p_i was lost in extraction.]

Assume the pair with the two largest probabilities is selected for replication by crossover at the locus between the 1st and 2nd bits. How does the population change? Assume a mixing point (locus) is chosen between the first and second bits. Crossover is selected to induce change; mutation is rejected as a method to induce change. A sketch of roulette-wheel selection and single-point crossover follows this section.
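The sketch referenced above: minimal roulette-wheel selection and single-point crossover in plain Python. The population and fitness values are made up, since the slide's numbers were lost.

```python
import random

def roulette_select(population, fitness):
    """Fitness-proportional selection: p_i = f(x_i) / sum_j f(x_j).
    Divide [0, 1) into segments of length p_i, draw r ~ U(0, 1),
    and return the chromosome whose segment contains r."""
    total = sum(fitness)
    r = random.random()
    cumulative = 0.0
    for chrom, f in zip(population, fitness):
        cumulative += f / total
        if r < cumulative:
            return chrom
    return population[-1]  # guard against floating-point round-off

def crossover(parent_a, parent_b, locus):
    """Single-point crossover: swap the tails of two bit strings at `locus`."""
    child1 = parent_a[:locus] + parent_b[locus:]
    child2 = parent_b[:locus] + parent_a[locus:]
    return child1, child2

# Hypothetical 5-bit chromosomes and fitness values; crossover at the
# locus between the 1st and 2nd bits, as in the exercise above.
population = ["00100", "11010", "01101", "10011"]
fitness = [1.0, 4.0, 2.0, 3.0]
a = roulette_select(population, fitness)
b = roulette_select(population, fitness)
print(crossover(a, b, locus=1))
```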
Self organizing maps

How many prototype vectors will be generated in the SOM application illustrated? One for each node of the output array.

Describe the following types of SOM output:
- elastic net = a deformable grid connecting the positions of the prototype vectors in attribute space
- context map = mark the output nodes with the greatest activation by test patterns
- semantic map = label all output nodes by the test pattern that generates the greatest activation
- unified distance matrix (UMAT) = a heat map of the average difference between an output node's prototype and the prototypes of its neighbors
- UMAT with connectedness = add stars that connect maxima on the UMAT with one and only one output node

Are the bars that illustrate convergence of this SOM elastic nets, semantic maps, or UMATs? Is this elastic net covering input space or the lattice of output nodes? What are the dimensions of the output-node array in the SOM that produced this elastic net? Use the stars to draw a boundary on the cluster that contains horse and cow. [The figures these questions refer to were lost in extraction.]

Support vector machines

Review: constrained optimization by Lagrange multipliers. Find the stationary point of f(x_1, x_2) = 1 - x_1^2 - x_2^2 subject to the constraint g(x_1, x_2) = x_1 + x_2 = 1.

Form the Lagrangian L(x, λ) = f(x_1, x_2) + λ(g(x_1, x_2) - c), where λ is sometimes called the undetermined multiplier:

    L(x, λ) = 1 - x_1^2 - x_2^2 + λ(x_1 + x_2 - 1)

Set the partial derivatives of L with respect to x_1, x_2, and λ equal to zero:

    -2x_1 + λ = 0
    -2x_2 + λ = 0
    x_1 + x_2 - 1 = 0

Solve for x_1 and x_2; in this case it is not necessary to find λ. The solution is x_1* = x_2* = 1/2.

Why is maximizing margins a good strategy for classification? Separating hyperplane, margins, support vectors. For linearly separable classes, SVM achieves this goal by maximizing the distance of all points from the separating hyperplane. If the graph [lost in extraction] represents achievement of this goal, how do I draw the margins?

If the distance d^t of instance x^t from a separating hyperplane is |g(x^t)| / ||w||, write this relationship in terms of x^t, w, and r^t, the label of x^t:

    d^t = r^t (w^T x^t + w_0) / ||w||

Explain why r^t (w^T x^t + w_0) must be positive for a correctly classified instance, and why the canonical constraint requires it to be ≥ +1.

The weights that define a separating hyperplane with maximum distance from all instances in the training set minimize (1/2)||w||^2 subject to r^t (w^T x^t + w_0) ≥ +1 for all t. Use the definition of margins for binary classification to explain why this separating hyperplane has maximum margins. Why are the margins the same width on both sides of the separating hyperplane?

Duality in constrained optimization. Primal variables: variables, like the weights, that we want to optimize. Dual variables: coefficients of the constraints added to the original objective function to achieve constrained optimization of the primal variables; also called Lagrange multipliers. In L_p, which variables are primal and which are dual? How is L_p changed to L_d? L_d has dual variables only.

Active set in constrained optimization. What is the active set in constrained optimization? How does maximization of the dual make the constraints on support vectors the active set?

Maximize

    L_d = Σ_t α^t - (1/2) Σ_t Σ_s α^t α^s r^t r^s (x^t)^T x^s
    subject to Σ_t α^t r^t = 0 and α^t ≥ 0

Set α^t = 0 for data points sufficiently far from the discriminant to be ignored in the search for the hyperplane with maximum margins. Find the remaining α^t > 0 by quadratic programming. Given the α^t > 0 that maximize L_d, calculate

    w = Σ_t α^t r^t x^t

This is an iterative process. Suggest a way to obtain an initial guess of which instances have attributes near the separating hyperplane. What is the dimension of w? How do you find a value for w_0?

Perceptron learning algorithm (PLA): each iteration pulls the discriminant in a direction that tends to correct the misclassified data point. [Figure showing the regions where sign(w^T x) is negative and positive was lost in extraction.]

Given w = Σ_t α^t r^t x^t, show that g(x) = w^T x + w_0 is a discriminant for the classes on the margins of the hyperplane.

Given a binary classification problem where the number of attributes of an instance exceeds the size of the training set, which of the following methods cannot be used to find a solution?
1) Perceptron training algorithm
2) Logistic regression
3) Classification by least-squares regression
4) Classification by ANN
5) SVM
Which method is preferred?

In the equation for L_p below, which variables are primal and which are dual? Does it contain any parameters? If so, what is their purpose?

    L_p = (1/2)||w||^2 - Σ_t α^t [ r^t (w^T x^t + w_0) - 1 ]

In SVM, constraints are placed on the dual variables, for example Σ_t α^t r^t = 0 and α^t ≥ 0. What are the origins of these constraints?

In the equation for L_p with slack variables below, which variables are primal and which are dual? Does L_p contain any parameters? If so, what is their purpose?

    L_p = (1/2)||w||^2 + C Σ_t ξ^t - Σ_t α^t [ r^t (w^T x^t + w_0) - 1 + ξ^t ] - Σ_t μ^t ξ^t

x^t with r^t = 1 is correctly classified but inside the margins. What are the bounds on the hinge loss? x^t with r^t = 1 is misclassified. (a) What are the bounds on the hinge loss if x^t is inside the margins? (b) What are the bounds on the hinge loss if x^t is outside the margins? Would the answers to these questions be different if r^t = -1? What would the graph look like if r^t = -1?

The components of the soft error Σ_t ξ^t have the form of hinge loss; the (1/2)||w||^2 term has the form of weight decay. How does augmented error achieve weight decay and what is its effect? (A sketch of hinge loss and the soft-margin objective follows this section.)
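The sketch referenced above, assuming NumPy is available; the weights, bias, and data points are illustrative, not from the slides. It computes the per-instance hinge loss max(0, 1 - r^t (w^T x^t + w_0)) and the soft-margin augmented error it enters.

```python
import numpy as np

def hinge_loss(w, w0, X, r):
    """Per-instance hinge loss: max(0, 1 - r^t (w^T x^t + w0)).
    Zero when x^t is correct and outside the margin; in (0, 1] when
    correct but inside the margin; > 1 when misclassified."""
    return np.maximum(0.0, 1.0 - r * (X @ w + w0))

def soft_margin_objective(w, w0, X, r, C):
    """Augmented error: (1/2)||w||^2 + C * sum of slack (hinge) terms.
    The ||w||^2 term is the weight-decay part; C trades margin width
    against training error."""
    return 0.5 * w @ w + C * hinge_loss(w, w0, X, r).sum()

# Illustrative values: the first two points lie outside the margins
# (zero loss); the third is correctly classified but inside a margin.
X = np.array([[2.0, 2.0], [-1.0, -1.5], [0.6, 0.4]])
r = np.array([1.0, -1.0, 1.0])
w, w0 = np.array([1.0, 1.0]), -0.5
print(hinge_loss(w, w0, X, r))                    # [0.  0.  0.5]
print(soft_margin_objective(w, w0, X, r, C=1.0))  # 1.5
```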
Based on your understanding of weight decay by augmented error, explain how the value of the regularization parameter C affects binary classification by SVM with slack variables.

What are the primal variables in the L_p for ν-SVM shown below? Does it contain any parameters? If so, what is their purpose?

    L_p = (1/2)||w||^2 - νρ + (1/N) Σ_t ξ^t
    subject to r^t (w^T x^t + w_0) ≥ ρ - ξ^t, ξ^t ≥ 0, ρ ≥ 0

Add the Lagrange multipliers to L_p. By comparison with C-SVM, where C weights the total slack, ν bounds the fraction of margin errors from above and the fraction of support vectors from below.

Kernel machines

What 2 equations in feature-space SVM enabled the kernel trick? The dual and the discriminant:

    L_d = Σ_t α^t - (1/2) Σ_t Σ_s α^t α^s r^t r^s K(x^t, x^s)
    g(x) = Σ_t α^t r^t K(x^t, x) + w_0,  where K(x^t, x) = φ(x^t)^T φ(x)

Explicit use of the weights, which cannot be written as a dot product of features φ(x^t)^T φ(x), is not needed. (See the sketch at the end of this section.)

Do kernel machines include regularization? If so, how? Kernel machines still contain the regularization parameter C through the constraint 0 ≤ α^t ≤ C.

One-class SVM machines

Enclose the data in a hyper-sphere with center a and radius R, with slack variables ξ^t and Lagrange multipliers α^t ≥ 0 and γ^t ≥ 0 for the constraints. Set the derivatives with respect to the primal variables R, a, and ξ^t to 0:

    Σ_t α^t = 1,  a = Σ_t α^t x^t,  0 ≤ α^t ≤ C

Substituting back into L_p, we get the dual to be maximized:

    L_d = Σ_t α^t (x^t)^T x^t - Σ_t Σ_s α^t α^s (x^t)^T x^s

Given the region center a, how do we find the optimum radius R? R will be determined by the instances that are support vectors on the surface of the hyper-sphere.

What is wrong with these equations [lost in extraction] as a start to the development of an SVM for soft-margin hyperplanes? C is not a primal variable.
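The sketch referenced in the kernel-machines section above, assuming scikit-learn and NumPy are available; the toy data, gamma, and C are illustrative. It fits an RBF-kernel SVC and rebuilds the discriminant g(x) = Σ_t α^t r^t K(x^t, x) + w_0 from the stored dual coefficients (which hold α^t r^t) and support vectors, confirming that the explicit weight vector in feature space is never formed.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data with labels r in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (40, 2))])
r = np.array([-1] * 40 + [1] * 40)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, r)

def rbf_kernel(A, B, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2): dot products in feature
    space computed without ever forming phi(x) explicitly."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Discriminant from the dual solution: only support vectors
# (instances with alpha^t > 0, the active set) contribute.
x_new = np.array([[1.5, 1.5]])
K = rbf_kernel(clf.support_vectors_, x_new, gamma)
g = clf.dual_coef_ @ K + clf.intercept_
print(g, clf.decision_function(x_new))  # the two should agree
```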