
Page 1

A Brief Look at Optimization

CSC 412/2506 Tutorial
David Madras

January 18, 2018

Slides adapted from last year’s version

Page 2

Overview

• Introduction
• Classes of optimization problems
• Linear programming
• Steepest (gradient) descent
• Newton’s method
• Quasi-Newton methods
• Conjugate gradients
• Stochastic gradient descent

Page 3

What is optimization?

• Typical setup (in machine learning, in life):

– Formulate a problem
– Design a solution (usually a model)
– Use some quantitative measure to determine how good the solution is.

• E.g., classification:
– Create a system to classify images
– Model is some simple classifier, like logistic regression
– Quantitative measure is classification error (lower is better in this case)

• The natural question to ask is: can we find a solution with a better score?

• Question: what could we change in the classification setup to lower the classification error (what are the free variables)?

Page 4

Formal definition

• f(θ): some arbitrary function
• c(θ): some arbitrary constraints
• Minimizing f(θ) is equivalent to maximizing −f(θ), so we can just talk about minimization and be OK.
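A standard way to write this constrained minimization problem (assuming inequality constraints, which is one common convention rather than the slide’s exact notation):

```latex
\min_{\theta} \; f(\theta) \quad \text{subject to} \quad c(\theta) \le 0
```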

Page 5

Types of optimization problems

• Depending on f, c, and the domain of θ we get many problems with many different characteristics.

• General optimization of arbitrary functions with arbitrary constraints is extremely hard.

• Most techniques exploit structure in the problem to find a solution more efficiently.

Page 6

Types of optimization

• Simple enough problems have a closed-form solution:

– f(x) = x²
– Linear regression (a worked example follows this list)

• If f and c are linear functions then we can use linear programming (solvable in polynomial time).

• If f and c are convex then we can use convex optimization techniques (most of machine learning uses these).

• If f and c are non-convex, we usually pretend the problem is convex and find a sub-optimal, but hopefully good enough, solution (e.g., deep learning).

• In the worst case there are global optimization techniques (operations research is very good at these).

• There are yet more techniques when the domain of θ is discrete.
• This list is far from exhaustive.
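As a small illustration of the closed-form case above, a minimal sketch (not from the slides; the data and names are made up) of solving linear regression via least squares:

```python
import numpy as np

# Toy data: 100 points with 3 features plus a bias column (hypothetical example).
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 3)), np.ones((100, 1))])
true_w = np.array([2.0, -1.0, 0.5, 0.3])
y = X @ true_w + 0.1 * rng.normal(size=100)

# Closed-form solution w* = (X^T X)^{-1} X^T y; lstsq solves the same
# least-squares problem more stably than forming the inverse explicitly.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)  # should be close to true_w
```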

Page 7

Types of optimization

• Takeaway:

Think hard about your problem, find the simplest category that it fits into, use the tools from that branch of optimization.

• Sometimes you can solve a hard problem with a special-purpose algorithm, but most of the time we favor a black-box approach because it’s simple and usually works.

Page 8

Really naïve optimization algorithm

• Suppose θ is a D-dimensional vector of parameters where each dimension is bounded above and below.

• For each dimension i, we pick some set of values to try.

• Try all combinations of values for each dimension, record f for each one.

• Pick the combination that minimizes f.

Page 9

Really naïve optimization algorithm

• This is called grid search. It works really well in low dimensions when you can afford to evaluate f many times.

• Less appealing when f is expensive or in high dimensions.

• You may have already done this when searching for a good L2 penalty value.
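A minimal grid-search sketch, assuming a generic objective f and hand-picked candidate values per dimension (all names are illustrative, not from the slides):

```python
import itertools

def grid_search(f, candidate_values):
    """candidate_values: one list of trial values per dimension."""
    best_theta, best_score = None, float("inf")
    # Try every combination of per-dimension values and record f for each one.
    for theta in itertools.product(*candidate_values):
        score = f(theta)
        if score < best_score:
            best_theta, best_score = theta, score
    return best_theta, best_score

# Example: minimize a simple quadratic over a coarse 2-D grid.
f = lambda t: (t[0] - 1.0) ** 2 + (t[1] + 2.0) ** 2
print(grid_search(f, [[-2, -1, 0, 1, 2], [-3, -2, -1, 0, 1]]))
```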

Page 10

Convex functions

Use the line test.
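The line test corresponds to the usual definition of convexity: the chord between any two points on the graph lies on or above the graph. In symbols (a standard statement, not taken verbatim from the slide):

```latex
f(\lambda x + (1 - \lambda) y) \;\le\; \lambda f(x) + (1 - \lambda) f(y)
\qquad \text{for all } x, y \text{ and } \lambda \in [0, 1]
```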

Page 11

Convex functions

Page 12

Convex optimization

• We’ve talked about 1D functions, but the definition still applies to higher dimensions.

• Why do we care about convex functions?
• In a convex function, any local minimum is automatically a global minimum.

• This means we can apply fairly naïve techniques to find the nearest local minimum and still guarantee that we’ve found the best solution!

Page 13

Steepest (gradient) descent

• Cauchy (1847)
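The steepest-descent update can be written as (standard form; the slide’s exact notation is assumed):

```latex
\theta_{t+1} \;=\; \theta_t - \eta_t \, \nabla f(\theta_t)
```

where η_t is the step size at iteration t.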

Page 14

Page 15

Aside: Taylor series

• A Taylor series is a polynomial series that converges to a function f.

• We say that the Taylor series expansion of f at x around a point a, f(x + a), is the series written out below.

• Truncating this series gives a polynomial approximation to a function.
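The expansion referred to above, in its standard form (the f(x + a) notation matches the slide; the rest is a reconstruction):

```latex
f(x + a) \;=\; \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}\, x^{n}
\;=\; f(a) + f'(a)\, x + \frac{f''(a)}{2!}\, x^{2} + \cdots
```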

Page 16

Blue: exponential function; Red: Taylor series approximation

Page 17

Multivariate Taylor Series

• The first-order Taylor series expansion of a function f(θ) around a point d is:
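In its standard form (the notation is assumed):

```latex
f(\theta + d) \;\approx\; f(\theta) + \nabla f(\theta)^{\top} d
```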

Page 18

Steepest descent derivation

• Suppose we are at θ and we want to pick a direction d (with norm 1) such that f(θ + ηd) is as small as possible for some step size η. This is equivalent to maximizing f(θ) − f(θ + ηd).

• Using a linear approximation (written out below).

• This approximation gets better as η gets smaller, since as we zoom in on a differentiable function it will look more and more linear.
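The linear approximation referred to above, written out (a standard reconstruction):

```latex
f(\theta + \eta d) \;\approx\; f(\theta) + \eta\, \nabla f(\theta)^{\top} d
\quad\Longrightarrow\quad
f(\theta) - f(\theta + \eta d) \;\approx\; -\eta\, \nabla f(\theta)^{\top} d
```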

Page 19

Steepest descent derivation

• We need to find the value of d that maximizes −∇f(θ)·d subject to ‖d‖ = 1.

• Using the definition of the dot product in terms of the cosine of the angle between two vectors:
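Writing this out (a standard reconstruction of the step), with α the angle between ∇f(θ) and d:

```latex
\nabla f(\theta)^{\top} d \;=\; \|\nabla f(\theta)\|\,\|d\|\cos\alpha
\qquad\Longrightarrow\qquad
d^{*} \;=\; -\frac{\nabla f(\theta)}{\|\nabla f(\theta)\|}
```

−∇f(θ)·d is largest when cos α = −1, i.e. when d points exactly opposite the gradient; this is the steepest-descent direction.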

Page 20

How to choose the step size?

• At iteration t, we take a step of size η_t along the descent direction.
• General idea: vary η_t until we find the minimum along that direction.

• This is a 1D optimization problem.
• In the worst case we can just make η_t very small, but then we need to take a lot more steps.

• General strategy: start with a big η_t and progressively make it smaller by, e.g., halving it until the function decreases.
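A minimal sketch of this “halve until the function decreases” strategy (a simple backtracking loop; the function and names are illustrative, not from the slides):

```python
import numpy as np

def backtracking_step(f, theta, grad, eta0=1.0, max_halvings=30):
    """Halve the step size until the objective actually decreases."""
    eta = eta0
    f0 = f(theta)
    for _ in range(max_halvings):
        if f(theta - eta * grad) < f0:
            return eta
        eta *= 0.5
    return eta  # give up and return a very small step

# Example on f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([3.0, -4.0])
f = lambda t: float(np.dot(t, t))
eta = backtracking_step(f, theta, 2 * theta)
print(eta, theta - eta * 2 * theta)
```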

Page 21

When have we converged?

• When the gradient vanishes: ∇f(θ) = 0.
• If the function is convex then we have reached a global minimum.

Page 22

The problem with gradient descent

source: http://trond.hjorteland.com/thesis/img208.gif

Page 23

Newton’s method

• To speed up convergence, we can use a more accurate approximation.

• Second-order Taylor expansion (given below).

• H is the Hessian matrix containing second derivatives.
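The second-order expansion referred to above (a standard reconstruction):

```latex
f(\theta + d) \;\approx\; f(\theta) + \nabla f(\theta)^{\top} d + \tfrac{1}{2}\, d^{\top} H\, d
```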

Page 24

Newton’s method
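Minimizing that quadratic model over d gives the Newton step (standard form; whether the slide included a step size η_t is an assumption):

```latex
d \;=\; -H^{-1} \nabla f(\theta), \qquad
\theta_{t+1} \;=\; \theta_t - \eta_t\, H^{-1} \nabla f(\theta_t)
```

with η_t = 1 recovering the classical Newton update.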

Page 25

What is it doing?

• At each step, Newton’s method approximates the function with a quadratic bowl, then goes to the minimum of this bowl.

• For twice (or more) differentiable convex functions, this is usually much faster than steepest descent (provably).

• Con: computing the Hessian requires O(D²) time and storage. Inverting the Hessian is even more expensive (up to O(D³)). This is problematic in high dimensions.

Page 26

Quasi-Newton methods

• Computation involving the Hessian is expensive.
• Modern approaches use computationally cheaper approximations to the Hessian or its inverse.

• Deriving these is beyond the scope of this tutorial, but we’ll outline some of the key ideas.

• These are implemented in many good software packages in many languages and can be treated as black-box solvers, but it’s good to know where they come from so that you know when to use them.

Page 27

BFGS

• Maintain a running estimate of the Hessian, B_t.
• At each iteration, set B_{t+1} = B_t + U_t + V_t, where U_t and V_t are rank-1 matrices (these are derived specifically for the algorithm).

• The advantage of using a low-rank update to improve the Hessian estimate is that B can be cheaply inverted at each iteration.

Page 28

Limited-memory BFGS

• BFGS progressively updates B, and so one can think of B_t as a sum of rank-1 matrices from steps 1 to t. We could instead store these updates and recompute B_t at each iteration (although this would involve a lot of redundant work).
• L-BFGS only stores the most recent updates, therefore the approximation itself is always low rank and only a limited amount of memory needs to be used (linear in D).
• L-BFGS works extremely well in practice.
• L-BFGS-B extends L-BFGS to handle bound constraints on the variables.
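Since these methods are available as black-box solvers, here is a minimal usage sketch with SciPy’s minimize (a library choice for illustration, not something named in the slides):

```python
import numpy as np
from scipy.optimize import minimize

# A toy objective: the Rosenbrock function and its gradient.
def f(theta):
    x, y = theta
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def grad_f(theta):
    x, y = theta
    return np.array([-2 * (1 - x) - 400 * x * (y - x ** 2),
                     200 * (y - x ** 2)])

# L-BFGS-B also accepts per-variable bound constraints via the `bounds` argument.
result = minimize(f, x0=np.array([-1.0, 2.0]), jac=grad_f, method="L-BFGS-B")
print(result.x, result.fun)  # should approach [1, 1] and 0
```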

Page 29

Conjugate gradients

• Steepest descent often picks a direction it’s travelled in before (this results in the wiggly behavior).
• Conjugate gradients makes sure we don’t travel in the same direction again.
• The derivation for quadratics is more involved than we have time for.
• The derivation for general convex functions is fairly hacky, but reduces to the quadratic version when the function is indeed quadratic.
• Takeaway: conjugate gradient works better than steepest descent, almost as good as L-BFGS. It also has a much cheaper per-iteration cost (still linear, but better constants).

Page 30

Stochastic Gradient Descent

• Recall that we can write the log-likelihood of a distribution as:
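For i.i.d. data x_1, …, x_N this takes the standard form (the notation is assumed):

```latex
\ell(\theta) \;=\; \log \prod_{i=1}^{N} p(x_i \mid \theta) \;=\; \sum_{i=1}^{N} \log p(x_i \mid \theta)
```

so the gradient is a sum of per-example terms, which is what makes sub-sampling possible.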

Page 31

Stochastic gradient descent

• Any iteration of a gradient descent (or quasi-Newton) method requires that we sum over the entire dataset to compute the gradient.

• SGD idea: at each iteration, sub-sample a small amount of data (even just 1 point can work) and use that to estimate the gradient.

• Each update is noisy, but very fast!
• This is the basis of optimizing ML algorithms with huge datasets (e.g., recent deep learning).

• Computing gradients using the full dataset is called batch learning; using subsets of data is called mini-batch learning.
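A minimal mini-batch SGD sketch (illustrative only: the least-squares model and all names are our own, not the slides’):

```python
import numpy as np

def sgd(X, y, n_steps=2000, batch_size=10, eta=0.01):
    """Mini-batch SGD for linear least squares: f(w) = mean_i (x_i . w - y_i)^2."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)  # sub-sample the data
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size  # noisy estimate of the full gradient
        w -= eta * grad  # fixed step size here; in practice it is usually decayed over time
    return w

# Toy data generated from known weights.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=1000)
print(sgd(X, y))  # should be close to [1.0, -2.0, 0.5]
```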

Page 32

Stochastic gradient descent

• Suppose we made a copy of each point, y = x, so that we now have twice as much data. The log-likelihood is now simply doubled: ℓ(θ) = 2 Σ_i log p(x_i | θ).

• In other words, the optimal parameters don’t change, but we have to do twice as much work to compute the log-likelihood and its gradient!

• The reason SGD works is because similar data yields similar gradients, so if there is enough redundancy in the data, the noise from subsampling won’t be so bad.

Page 33

Stochastic gradient descent

• In the stochastic setting, line searches break down and so do estimates of the Hessian, so stochastic quasi-Newton methods are very difficult to get right.

• So how do we choose an appropriate step size?
• Robbins and Monro (1951): pick a sequence of η_t satisfying the conditions written out below.

• These conditions are satisfied, as one example, by the sequence noted below.
• Balances “making progress” with averaging out noise.
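The Robbins and Monro conditions referred to above (standard statement):

```latex
\sum_{t=1}^{\infty} \eta_t = \infty
\qquad\text{and}\qquad
\sum_{t=1}^{\infty} \eta_t^{2} < \infty
```

For example, η_t ∝ 1/t satisfies both conditions.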

Page 34

Final words on SGD

• SGD is very easy to implement compared to other methods, but the step sizes need to be tuned to different problems, whereas batch learning typically “just works”.

• Tip 1: divide the log-likelihood estimate by the size of your mini-batches. This makes the learning rate invariant to mini-batch size.

• Tip 2: subsample without replacement so that you visit each point on each pass through the dataset (this is known as an epoch).
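A sketch combining both tips (illustrative; `grad_i`, a per-example gradient function, is an assumed user-supplied helper, not something from the slides):

```python
import numpy as np

def sgd_epochs(grad_i, X, y, w, n_epochs=10, batch_size=32, eta=0.01):
    """Run SGD in epochs: reshuffle once per pass so every point is visited exactly once."""
    rng = np.random.default_rng(0)
    n = len(X)
    for _ in range(n_epochs):
        order = rng.permutation(n)  # Tip 2: sample without replacement within each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Tip 1: average over the mini-batch so eta is invariant to the batch size.
            grad = np.mean([grad_i(w, X[i], y[i]) for i in idx], axis=0)
            w = w - eta * grad
    return w

# Example per-example gradient for least squares: d/dw (x.w - y)^2 = 2 (x.w - y) x.
grad_i = lambda w, x, y: 2 * (x @ w - y) * x
```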

Page 35

Useful References

• Linear programming:
  – Linear Programming: Foundations and Extensions (http://www.princeton.edu/~rvdb/LPbook/)
• Convex optimization:
  – http://web.stanford.edu/class/ee364a/index.html
  – http://stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf
• LP solver:
  – Gurobi: http://www.gurobi.com/
• Stats (Python):
  – Scipy stats: http://docs.scipy.org/doc/scipy-0.14.0/reference/stats.html
• Optimization (Python):
  – Scipy optimize: http://docs.scipy.org/doc/scipy/reference/optimize.html
• Optimization (Matlab):
  – minFunc: http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html
• General ML:
  – Scikit-Learn: http://scikit-learn.org/stable/