Introduction to Machine Learning


1. Introduction to Machine Learning
Jinhyuk Choi
Human-Computer Interaction Lab @ Information and Communications University

2. Contents
  - Concepts of Machine Learning
  - Multilayer Perceptrons
  - Decision Trees
  - Bayesian Networks

3. What is Machine Learning?
  - Large storage / large amounts of data that look random but contain certain patterns: web log data, medical records, network optimization, bioinformatics, machine vision, speech recognition
  - No complete identification of the underlying process; a good or useful approximation suffices

4. What is Machine Learning?
  - Definition: programming computers to optimize a performance criterion using example data or past experience
  - Role of statistics: inference from a sample
  - Role of computer science: efficient algorithms to solve the optimization problem, and representing and evaluating the model for inference
  - Descriptive (training) / predictive (generalization)
  - Learning from human-generated data??

5. What is Machine Learning? Concept Learning
  - Inducing general functions from specific training examples (positive or negative)
  - Looking for the hypothesis that best fits the training examples
  (Figure: objects mapped to a concept, e.g. the boolean function Bird(animal): true or not)
  - Concepts: describe some subset of objects or events defined over a larger set; a boolean-valued function

6. What is Machine Learning? Concept Learning
  - Inferring a boolean-valued function from training examples of its input and output
  (Figure: two hypotheses approximating a concept over positive and negative examples)

7. What is Machine Learning? Learning Problem Design
  - "Do you enjoy sports?" Learn to predict the value of EnjoySport for an arbitrary day, based on the values of its other attributes
  - What problem? Why learning?
  - Attribute selection: effective? enough?
  - What learning algorithm?

8. Applications
  - Learning associations
  - Classification
  - Regression
  - Unsupervised learning
  - Reinforcement learning
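The concept-learning slides above treat a concept as a boolean-valued function over objects, and a hypothesis as acceptable when it fits every training example. A minimal Python sketch of that view; the attributes ("sky", "wind") and the rule are illustrative assumptions in the spirit of the EnjoySport example, not taken from the slides:

```python
def enjoy_sport(day):
    # Hypothetical target concept: a boolean-valued function over days.
    # The attributes and the rule here are illustrative only.
    return day["sky"] == "sunny" and day["wind"] == "strong"

# Training examples: (object, label) pairs, positive or negative
examples = [
    ({"sky": "sunny", "wind": "strong"}, True),   # positive example
    ({"sky": "rainy", "wind": "strong"}, False),  # negative example
]

def consistent(hypothesis, examples):
    # A hypothesis fits the training set if it reproduces every label
    return all(hypothesis(obj) == label for obj, label in examples)
```

Concept learning then amounts to searching a hypothesis space for functions on which `consistent` returns True.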
9. Examples (1)
  - TV program preference inference based on web usage data
  (Figure: web pages #1-#4 fed through a classifier, in steps 1-3, to predict TV programs #1-#4)
  - What are we supposed to do at each step?

10. Examples (2), from a HW of the Neural Networks class (KAIST, 2002)
  - Function approximation (Mexican hat): f3(x1, x2) = sin(2π(x1² + x2²)), x1, x2 ∈ [−1, 1]

11. Examples (3), from a HW of the Machine Learning class (ICU, 2006)
  - Face image classification

12. Examples (4), from a HW of the Machine Learning class (ICU, 2006)

13. Examples (5), from a HW of the Machine Learning class (ICU, 2006)
  - SenSay

14. Examples (6)
  - A. Krause et al., "Unsupervised, Dynamic Identification of Physiological and Activity Context in Wearable Computing", ISWC 2005

15. #1. Multilayer Perceptrons

16. Neural Network?
  - Adaline vs. MLP vs. SOM vs. Hopfield network vs. RBFN vs. bifurcating neuron networks

17. Multilayer Networks of Sigmoid Units
  - Supervised learning
  - 2-layer, fully connected
  - Really looks like the brain??

18. Sigmoid Unit

19. The back-propagation algorithm
  - Network model: input layer x_i, hidden layer y_j, output layer o_k, with input-to-hidden weights v_ji and hidden-to-output weights w_kj:
      y_j = s(Σ_i v_ji x_i),  o_k = s(Σ_j w_kj y_j)
  - Error function: E(v, w) = ½ Σ_k (t_k − o_k)²
  - Stochastic gradient descent

20. Gradient-Descent Function Minimization

21. Gradient-descent function minimization
  - In order to find a vector parameter x that minimizes a function f(x):
      1. Start with a random initial value x(0).
      2. Determine the direction of steepest descent in the parameter space: ∇f = (∂f/∂x1, ∂f/∂x2, ..., ∂f/∂xn)
      3. Move a step in that direction: x(i+1) = x(i) − η ∇f(x(i))
      Repeat steps 2 and 3 until no more change in x.
  - For gradient descent to work:
      - The function to be minimized should be continuous.
      - The function should not have too many local minima.

22. Back-propagation

23. Derivation of the back-propagation algorithm
  - Adjustment of w_kj:
      ∂E/∂w_kj = ∂/∂w_kj [½ Σ_k (t_k − s(Σ_j w_kj y_j))²]
               = −y_j o_k (1 − o_k)(t_k − o_k)
      Δw_kj = −η ∂E/∂w_kj = η o_k (1 − o_k)(t_k − o_k) y_j = η δ_k^o y_j,
      where δ_k^o = o_k (1 − o_k)(t_k − o_k)
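The network model and the gradient-descent weight updates derived on these slides can be sketched as follows. The layer sizes, learning rate η = 0.5, and random weight initialization are illustrative choices, not values from the slides:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TwoLayerNet:
    # Fully connected two-layer sigmoid network, trained by stochastic
    # gradient descent on E = 1/2 * sum_k (t_k - o_k)^2.
    def __init__(self, n_in, n_hidden, n_out, eta=0.5, seed=0):
        rng = random.Random(seed)
        self.eta = eta
        # v[j][i]: input-to-hidden weights; w[k][j]: hidden-to-output weights
        self.v = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                  for _ in range(n_hidden)]
        self.w = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                  for _ in range(n_out)]

    def forward(self, x):
        # y_j = s(sum_i v_ji x_i), o_k = s(sum_j w_kj y_j)
        y = [sigmoid(sum(vj[i] * xi for i, xi in enumerate(x))) for vj in self.v]
        o = [sigmoid(sum(wk[j] * yj for j, yj in enumerate(y))) for wk in self.w]
        return y, o

    def train_example(self, x, t):
        y, o = self.forward(x)
        # Output units: delta_k = o_k (1 - o_k)(t_k - o_k)
        d_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
        # Hidden units: delta_j = y_j (1 - y_j) sum_k w_kj delta_k
        d_y = [yj * (1 - yj) * sum(self.w[k][j] * d_o[k] for k in range(len(o)))
               for j, yj in enumerate(y)]
        # Updates: w_kj += eta * delta_k * y_j,  v_ji += eta * delta_j * x_i
        for k in range(len(o)):
            for j in range(len(y)):
                self.w[k][j] += self.eta * d_o[k] * y[j]
        for j in range(len(y)):
            for i in range(len(x)):
                self.v[j][i] += self.eta * d_y[j] * x[i]
        return 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, o))
```

Calling `train_example` once per training case is the incremental (stochastic) variant; summing the gradients over the whole training set before a single update gives the batch variant.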
24. Derivation of the back-propagation algorithm
  - Adjustment of v_ji:
      ∂E/∂v_ji = ∂/∂v_ji [½ Σ_k (t_k − s(Σ_j w_kj s(Σ_i v_ji x_i)))²]
               = −x_i y_j (1 − y_j) Σ_k w_kj o_k (1 − o_k)(t_k − o_k)
      Δv_ji = −η ∂E/∂v_ji = η y_j (1 − y_j) [Σ_k w_kj δ_k^o] x_i = η δ_j^y x_i,
      where δ_j^y = y_j (1 − y_j) Σ_k w_kj δ_k^o

25. Backpropagation

26. Batch learning vs. incremental learning
  - Batch standard backprop proceeds as follows:
      Initialize the weights W.
      Repeat the following steps:
        Process all the training data DL to compute the gradient of the average error function AQ(DL, W).
        Update the weights by subtracting the gradient times the learning rate.
  - Incremental standard backprop can be done as follows:
      Initialize the weights W.
      Repeat the following steps for j = 1 to NL:
        Process one training case (y_j, X_j) to compute the gradient of the error (loss) function Q(y_j, X_j, W).
        Update the weights by subtracting the gradient times the learning rate.

27. Training

28. Overfitting

29. #2. Decision Trees

30. Introduction
  - Divide & conquer: a hierarchical model, built as a sequence of recursive splits
  - Decision node vs. leaf node
  - Advantage: interpretability (IF-THEN rules)

31. Divide and Conquer
  - Internal decision nodes
      - Univariate: uses a single attribute, x_i
          - Numeric x_i: binary split, x_i > w_m
          - Discrete x_i: n-way split for n possible values
      - Multivariate: uses all attributes, x
  - Leaves
      - Classification: class labels, or proportions
      - Regression: numeric; the average r, or a local fit
  - Learning
      - Construction of the tree using training examples
      - Looking for the simplest tree among the trees that code the training data without error
      - Based on heuristics: finding the smallest such tree is NP-complete, so be greedy and find the best split recursively (Breiman et al., 1984; Quinlan, 1986, 1993)

32. Classification Trees
  - Splitting is the main procedure in tree construction, driven by an impurity measure
  - For node m, N_m instances reach m and N_m^i of them belong to class C_i:
      p̂_m^i = P̂(C_i | x, m) = N_m^i / N_m
  - Node m is pure if p_m^i is 0 or 1 (to be pure!!!)
  - The measure of impurity is entropy: I_m = −Σ_{i=1..K} p_m^i log2 p_m^i
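The entropy impurity of a node on the classification-trees slide can be computed directly from the per-class counts reaching it. A small sketch; the counts in the usage note are illustrative:

```python
import math

def node_impurity(counts):
    # Entropy impurity I_m = -sum_i p_m^i * log2(p_m^i),
    # where p_m^i = N_m^i / N_m from the class counts at node m.
    n_m = sum(counts)
    ps = [c / n_m for c in counts if c > 0]   # 0 * log 0 is taken as 0
    return -sum(p * math.log2(p) for p in ps)
```

For example, `node_impurity([9, 5])` gives about 0.940, while a pure node such as `[14, 0]` gives 0, matching "node m is pure if p is 0 or 1".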
33. Representation
  - Each node specifies a test of some attribute of the instance
  - Each branch corresponds to one of the possible values for this attribute

34. Best Split
  - If node m is pure, generate a leaf and stop; otherwise split and continue recursively
  - Impurity after the split: N_mj of the N_m instances take branch j, and N_mj^i of them belong to C_i:
      p̂_mj^i = P̂(C_i | x, m, j) = N_mj^i / N_mj
      I'_m = −Σ_{j=1..n} (N_mj / N_m) Σ_{i=1..K} p_mj^i log2 p_mj^i
  - Find the variable and split that minimize impurity (among all variables, and among split positions for numeric variables)
  - Q) Which attribute should be tested at the root of the tree?

35. Top-Down Induction of Decision Trees

36. Entropy
  - Measure of uncertainty: the expected number of bits to resolve the uncertainty
  - Suppose Pr{X = 0} = 1/8. If the other events are equally likely, there are 8 events, and indicating one out of so many events needs lg 8 bits.
  - Consider a binary random variable X such that Pr{X = 0} = 0.1. The expected number of bits:
      0.1 lg(1/0.1) + (1 − 0.1) lg(1/(1 − 0.1))
  - In general, if a random variable X has c values with probabilities p_1, ..., p_c, the expected number of bits is
      H = −Σ_{i=1..c} p_i lg p_i = Σ_{i=1..c} p_i lg(1/p_i)

37. Entropy Example
  - 14 examples: Entropy([9+, 5−]) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
  - Entropy 0: all members positive or all negative
  - Entropy 1: equal numbers of positive and negative
  - 0 < Entropy < 1: unequal numbers of positive and negative

38. Information Gain
  - Measures the expected reduction in entropy caused by partitioning the examples

39. Information Gain
  ICU-Student tree, candidate attributes Gender (Male/Female), IQ, Height:
  - 100 samples, 50 positive: entropy = 1
  - Split on Gender: left side, 50 samples with 40 positive, entropy = 0.72; right side, 50 samples with 10 positive, entropy = 0.72
  - On average: entropy = 0.5 × 0.72 + 0.5 × 0.72 = 0.72
  - Reduction in entropy = 0.28: the information gain

40. Training Examples

41. Selecting the Next Attribute

42. Partially learned tree

43. Hypothesis Space Search
  - Hypothesis space: the set of all possible decision trees
  - The search is guided by the information-gain measure
  - Occam's razor??
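The Gender split worked through on the information-gain slide can be reproduced in a few lines. A sketch using the slide's sample counts:

```python
import math

def entropy(p):
    # Binary entropy of a positive-class proportion p
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(n, n_pos, branches):
    # branches: list of (n_j, n_pos_j) pairs after the split
    before = entropy(n_pos / n)
    after = sum((n_j / n) * entropy(p_j / n_j) for n_j, p_j in branches)
    return before - after

# 100 samples, 50 positive; Gender splits them 50/50, with 40 and 10 positives
gain = information_gain(100, 50, [(50, 40), (50, 10)])
```

Here `gain` comes out to about 0.28: the parent entropy of 1 minus the weighted child entropy of about 0.72.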
44. Overfitting
  - Why over-fitting? A model can become more complex than the true target function (concept) when it tries to satisfy noisy data as well

45. Avoiding over-fitting the data
  - Two classes of approaches to avoid overfitting:
      - Stop growing the tree earlier
      - Post-prune the tree after overfitting
  - OK, but how to determine the optimal size of a tree?
      - Use validation examples to evaluate the effect of pruning (stopping)
      - Use a statistical test to estimate the effect of pruning (stopping)
      - Use a measure of complexity for encoding the decision tree
  - Approaches based on the second strategy:
      - Reduced-error pruning
      - Rule post-pruning

46. Rule Extraction from Trees
  - C4.5Rules (Quinlan, 1993)

47. #3. Bayesian Networks

48. Bayes Rule: Introduction
  - posterior = prior × likelihood / evidence:
      P(C | x) = P(C) p(x | C) / p(x)
  - P(C = 0) + P(C = 1) = 1
  - p(x) = p(x | C = 1) P(C = 1) + p(x | C = 0) P(C = 0)
  - P(C = 0 | x) + P(C = 1 | x) = 1

49. Bayes Rule: K > 2 Classes
  - P(C_i | x) = p(x | C_i) P(C_i) / p(x) = p(x | C_i) P(C_i) / Σ_{k=1..K} p(x | C_k) P(C_k)
  - P(C_i) ≥ 0 and Σ_{i=1..K} P(C_i) = 1
  - Choose C_i if P(C_i | x) = max_k P(C_k | x)

50. Bayesian Networks: Introduction
  - Graphical models, probabilistic networks: causality and influence
  - Nodes are hypotheses (random variables), and the probability corresponds to our belief in the truth of the hypothesis
  - Arcs are direct influences between hypotheses
  - The structure is represented as a directed acyclic graph (DAG): a representation of the dependencies among the random variables
  - The parameters are the conditional probabilities on the arcs
  - Instead of probabilities for all possible combinations of circumstances, a B.N. needs only a small set of probabilities relating each node to its neighbors

51. Bayesian Networks: Introduction
  - Learning: inducing a graph
      - From prior knowledge
      - From structure learning
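The K-class Bayes rule above maps priors and likelihoods to posteriors, then picks the class with the maximum posterior. A minimal sketch; the numeric priors and likelihoods are made-up values:

```python
def posteriors(priors, likelihoods):
    # P(C_i | x) = p(x | C_i) P(C_i) / sum_k p(x | C_k) P(C_k)
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)            # p(x)
    return [j / evidence for j in joint]

# Two classes with priors P(C_0) = 0.3, P(C_1) = 0.7 and
# likelihoods p(x | C_0) = 0.5, p(x | C_1) = 0.2 (illustrative numbers)
post = posteriors([0.3, 0.7], [0.5, 0.2])
# Choose C_i with the maximum posterior
choice = max(range(len(post)), key=lambda i: post[i])
```

Note that the evidence p(x) only normalizes the posteriors, so they always sum to 1 and the argmax is unchanged if it is dropped.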