Optimization Tutorial



Pritam Sukumar & Daphne Tsatsoulis
CS 546: Machine Learning for Natural Language Processing

What is Optimization?
Find the minimum or maximum of an objective function given a set of constraints:
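The slide's formula is not preserved in this transcript; the standard form of a constrained problem is

    \min_{x \in \mathbb{R}^n} f(x)
    \quad \text{subject to} \quad g_i(x) \le 0, \; i = 1, \dots, m,
    \quad h_j(x) = 0, \; j = 1, \dots, p

A maximization problem fits the same template, since maximizing f is the same as minimizing -f.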

Why Do We Care?
Many core machine learning problems are optimization problems, for example:
Linear Classification
Maximum Likelihood
K-Means
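As a concrete instance (not spelled out in the transcript), maximum likelihood estimation is the optimization problem

    \hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta)
                 = \arg\min_{\theta} \; -\sum_{i=1}^{n} \log p(x_i \mid \theta)

and K-Means minimizes the within-cluster sum of squared distances over cluster assignments and centers.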

Prefer Convex Problems
Non-convex functions can have local (non-global) minima and maxima, so iterative methods can get stuck; for a convex function, every local minimum is a global minimum.


Convex Functions and Sets
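The slide's figures are not preserved; the standard definitions are:

    A set C is convex if for all x, y \in C and \theta \in [0, 1]:
        \theta x + (1 - \theta) y \in C

    A function f is convex if its domain is a convex set and for all x, y and \theta \in [0, 1]:
        f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y)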

Important Convex Functions
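The examples on this slide are not preserved in the transcript; functions commonly listed at this point include

    \text{affine functions } a^T x + b, \quad \text{norms } \|x\|_p, \quad e^{ax}, \quad -\log x,
    \quad \text{log-sum-exp } \log \textstyle\sum_i e^{x_i}, \quad \text{pointwise maxima of convex functions}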

Convex Optimization Problem
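The standard form (the slide's statement is not preserved) is to minimize a convex objective over a convex feasible set,

    \min_x f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \; h_j(x) = 0

with f and each g_i convex and each h_j affine.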

Lagrangian Dual
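The standard construction (the slide's equations are not preserved): the Lagrangian attaches a multiplier to each constraint,

    L(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x), \quad \lambda \ge 0

and the dual function g(\lambda, \nu) = \inf_x L(x, \lambda, \nu) lower-bounds the optimal value for every \lambda \ge 0; maximizing it over (\lambda, \nu) gives the dual problem.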

First Order Methods: Outline and References
Gradient Descent and Newton's Method: John Duchi, Introduction to Convex Optimization for Machine Learning, UC Berkeley tutorial, 2009
Subgradient Descent: John Duchi, Introduction to Convex Optimization for Machine Learning, UC Berkeley tutorial, 2009
Stochastic Gradient Descent: Nathan Srebro and Ambuj Tewari, Stochastic Optimization for Machine Learning, tutorial presented at ICML 2010
Trust Regions: C.-J. Lin, R. C. Weng, and S. S. Keerthi, Trust Region Newton Method for Large-Scale Logistic Regression, Journal of Machine Learning Research, 2008
Dual Coordinate Descent: H.-F. Yu, F.-L. Huang, and C.-J. Lin, Dual Coordinate Descent Methods for Logistic Regression and Maximum Entropy Models, Machine Learning Journal, 2011
Linear Classification: G.-X. Yuan, C.-H. Ho, and C.-J. Lin, Recent Advances of Large-Scale Linear Classification, Proceedings of the IEEE, 100 (2012)

Gradient Descent
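The update rule is the standard gradient step: repeatedly move against the gradient with step size \eta,

    x_{t+1} = x_t - \eta \nabla f(x_t)

A minimal sketch in Python (the quadratic test problem and step size are illustrative choices, not from the slides):

    import numpy as np

    def gradient_descent(grad, x0, step=0.1, iters=100):
        # Minimize f by iterating x <- x - step * grad(x).
        x = np.asarray(x0, dtype=float)
        for _ in range(iters):
            x = x - step * grad(x)
        return x

    # Example: f(x) = 0.5 x^T A x - b^T x has gradient A x - b,
    # so gradient descent converges to the solution of A x = b.
    A = np.array([[3.0, 0.0], [0.0, 1.0]])
    b = np.array([1.0, 2.0])
    x_min = gradient_descent(lambda x: A @ x - b, x0=[0.0, 0.0])  # -> about [1/3, 2]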

Single Step Illustration

Full Gradient Descent Illustration
The ellipses are level curves of the objective (the function is quadratic). If the starting point lies on the axis of the ellipses (the dashed line in the omitted figure), the gradient points straight toward the minimum at the center; otherwise the iterates bounce back and forth across the level curves as they approach it.

Newton's Method

Newton's method multiplies the gradient by the inverse Hessian (the slide's equation, reconstructed):

    x_{t+1} = x_t - [\nabla^2 f(x_t)]^{-1} \nabla f(x_t)

The Hessian must be invertible, which holds when it is positive definite, for the step to be well defined.
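A minimal Newton's method sketch in Python (the test function is an illustrative choice, not from the slides):

    import numpy as np

    def newtons_method(grad, hess, x0, iters=20):
        # Minimize f by iterating x <- x - H(x)^{-1} g(x).
        x = np.asarray(x0, dtype=float)
        for _ in range(iters):
            # Solve H d = g rather than forming the inverse Hessian explicitly.
            d = np.linalg.solve(hess(x), grad(x))
            x = x - d
        return x

    # Example: f(x, y) = x^4 + y^2, gradient (4x^3, 2y), Hessian diag(12x^2, 2);
    # the minimum is at the origin.
    grad = lambda x: np.array([4 * x[0] ** 3, 2 * x[1]])
    hess = lambda x: np.diag([12 * x[0] ** 2, 2.0])
    x_min = newtons_method(grad, hess, x0=[1.0, 1.0])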

Newton's Method Picture

Subgradient Descent Motivation
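The motivation (the slide's content is not preserved): functions like the hinge loss or |x| are convex but not differentiable everywhere, so the gradient step is unavailable at kinks. A subgradient g at x is any vector satisfying

    f(y) \ge f(x) + g^T (y - x) \quad \text{for all } y

which coincides with the gradient wherever f is differentiable.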

Subgradient Descent Algorithm
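The standard iteration (the slide's statement is not preserved): pick any subgradient and step against it with a decaying step size,

    x_{t+1} = x_t - \alpha_t g_t, \quad g_t \in \partial f(x_t)

Unlike a gradient step, a subgradient step need not decrease f, so one tracks the best iterate seen so far; convergence requires step sizes with \sum_t \alpha_t = \infty and \sum_t \alpha_t^2 < \infty (e.g. \alpha_t = c/t).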

A Brief Introduction to Stochastic Gradient Descent

Online Learning and Optimization
The goal of machine learning is to minimize the expected loss

    \min_w \; \mathbb{E}_{(x, y)}[\text{loss}(w; x, y)]

given samples (x_1, y_1), \dots, (x_n, y_n) drawn from the data distribution. This is stochastic optimization. Assume the loss function is convex.

Batch (Sub)gradient Descent for ML
Process all examples together in each step:

    w_{t+1} = w_t - \alpha_t \cdot \frac{1}{n} \sum_{i=1}^{n} \nabla \text{loss}(w_t; x_i, y_i)

The entire training set is examined at each step, so each iteration is very slow when n is very large.

Stochastic (Sub)gradient Descent
Optimize using one example at a time. Choose examples randomly (or reorder the data once and sweep through in order), so that the sequence of updates is representative of the example distribution.

Stochastic (Sub)gradient Descent
This is equivalent to online learning: the weight vector w changes with every example. Convergence is guaranteed for convex functions (and for a convex function, any local minimum is the global minimum).

Hybrid!
Stochastic: one example per iteration. Batch: all the examples! Sample Average Approximation (SAA): sample m examples at each step and perform SGD on them. This allows for parallelization, but the choice of m is based on heuristics; see the sketch below.
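A minimal mini-batch SGD sketch in Python covering all three regimes via the batch size m (m = 1 is pure SGD, m = n is batch descent; the squared-loss model is an illustrative choice, not from the slides):

    import numpy as np

    def sgd(X, y, step=0.01, batch_size=1, epochs=10, rng=None):
        # Mini-batch SGD for least squares: loss(w) = 0.5 * mean((X w - y)^2).
        rng = rng or np.random.default_rng(0)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            order = rng.permutation(n)          # random order each epoch
            for start in range(0, n, batch_size):
                idx = order[start:start + batch_size]
                resid = X[idx] @ w - y[idx]     # residuals on the mini-batch
                grad = X[idx].T @ resid / len(idx)
                w -= step * grad
        return w

    # Example: recover w_true = [2, -3] from noisy linear measurements.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2))
    y = X @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=1000)
    w_hat = sgd(X, y, step=0.05, batch_size=32, epochs=20)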

SGD - Issues
Convergence is very sensitive to the learning rate \alpha_t: the probabilistic nature of the sampling causes oscillations near the solution. The learning rate may need to decrease over time to ensure that the algorithm eventually converges. Bottom line: SGD is well suited to machine learning with large data sets!
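Typical decaying schedules (standard choices, not taken from the slides):

    \alpha_t = \frac{\alpha_0}{1 + \lambda t}, \qquad
    \alpha_t \propto \frac{1}{\sqrt{t}} \ \text{(general convex losses)}, \qquad
    \alpha_t \propto \frac{1}{t} \ \text{(strongly convex losses)}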
