A Brief Look at Optimization
CSC412/2506 Tutorial
David Madras
January 18, 2018
Slides adapted from last year's version
Overview
• Introduction
• Classes of optimization problems
• Linear programming
• Steepest (gradient) descent
• Newton's method
• Quasi-Newton methods
• Conjugate gradients
• Stochastic gradient descent
What is optimization?
• Typical setup (in machine learning, life):
– Formulate a problem
– Design a solution (usually a model)
– Use some quantitative measure to determine how good the solution is.
• E.g., classification:
– Create a system to classify images
– Model is some simple classifier, like logistic regression
– Quantitative measure is classification error (lower is better in this case)
• The natural question to ask is: can we find a solution with a better score?
• Question: what could we change in the classification setup to lower the classification error (what are the free variables)?
Formal definition
• f(θ): some arbitrary function
• c(θ): some arbitrary constraints
• Minimizing f(θ) is equivalent to maximizing −f(θ), so we can just talk about minimization and be OK.
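In standard form, the problem is to choose θ to

minimize f(θ)   subject to   c(θ) ≤ 0,

where writing the constraints as c(θ) ≤ 0 is just one common convention.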
Types of optimization problems
• Depending on f, c, and the domain of θ we get many problems with many different characteristics.
• General optimization of arbitrary functions with arbitrary constraints is extremely hard.
• Most techniques exploit structure in the problem to find a solution more efficiently.
Types of optimization
• Simple enough problems have a closed-form solution:
– f(x) = x²
– Linear regression
• If f and c are linear functions then we can use linear programming (solvable in polynomial time).
• If f and c are convex then we can use convex optimization techniques (most of machine learning uses these).
• If f and c are non-convex we usually pretend they are convex and find a sub-optimal, but hopefully good enough, solution (e.g., deep learning).
• In the worst case there are global optimization techniques (operations research is very good at these).
• There are yet more techniques when the domain of θ is discrete.
• This list is far from exhaustive.
Types of optimization
• Takeaway: think hard about your problem, find the simplest category that it fits into, and use the tools from that branch of optimization.
• Sometimes you can solve a hard problem with a special-purpose algorithm, but most times we favor a black-box approach because it's simple and usually works.
Really naïve optimization algorithm
• Suppose θ is a D-dimensional vector of parameters where each dimension is bounded above and below.
• For each dimension i, pick some set of values to try.
• Try all combinations of values for each dimension, and record f for each one.
• Pick the combination that minimizes f.
Really naïve optimization algorithm
• This is called grid search. It works really well in low dimensions when you can afford to evaluate f many times.
• Less appealing when f is expensive or in high dimensions.
• You may have already done this when searching for a good L2 penalty value.
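A minimal sketch of grid search in Python (the objective f and the candidate grids below are made-up illustrations, not from the slides):

```python
import itertools
import numpy as np

def grid_search(f, grids):
    """Evaluate f at every combination of candidate values (one grid of
    candidates per dimension) and return the best combination found."""
    best_theta, best_val = None, np.inf
    for theta in itertools.product(*grids):
        val = f(np.array(theta))
        if val < best_val:
            best_theta, best_val = np.array(theta), val
    return best_theta, best_val

# Example: a simple 2D quadratic, 21 candidate values per dimension.
f = lambda th: (th[0] - 1.0) ** 2 + (th[1] + 2.0) ** 2
grids = [np.linspace(-5, 5, 21), np.linspace(-5, 5, 21)]
theta_star, f_star = grid_search(f, grids)
```

The number of evaluations grows as the product of the grid sizes, which is exactly why this stops being appealing in high dimensions.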
Convex functions
Use the line test.
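The line test says that f is convex if the line segment (chord) between any two points on its graph never lies below the graph, i.e., for all x, y and all λ ∈ [0, 1]:

f(λx + (1 − λ)y) ≤ λ f(x) + (1 − λ) f(y).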
Convex functions
Convex optimization
• We've talked about 1D functions, but the definition still applies to higher dimensions.
• Why do we care about convex functions?
• In a convex function, any local minimum is automatically a global minimum.
• This means we can apply fairly naïve techniques to find the nearest local minimum and still guarantee that we've found the best solution!
Steepest (gradient) descent
• Cauchy (1847)
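The update rule is

θ_{t+1} = θ_t − η ∇f(θ_t),

where η > 0 is a step size: repeatedly take a small step in the direction of the negative gradient.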
Aside: Taylor series
• A Taylor series is a polynomial series that converges to a function f.
• We say that the Taylor series expansion of f at x around a point a, f(x + a), is:
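f(x + a) = f(x) + f′(x) a + (f″(x)/2!) a² + (f‴(x)/3!) a³ + … = Σ_{n=0}^{∞} (f⁽ⁿ⁾(x)/n!) aⁿ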
• Truncating this series gives a polynomial approximation to a function.
Blue: exponential function; red: Taylor series approximation.
Multivariate Taylor Series
• The first-order Taylor series expansion of a function f around the point θ, for a small displacement d, is:
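f(θ + d) ≈ f(θ) + dᵀ ∇f(θ),

where ∇f(θ) is the gradient (the vector of partial derivatives) at θ.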
Steepest descent derivation
• Suppose we are at θ and we want to pick a direction d (with norm 1) such that f(θ + ηd) is as small as possible for some step size η. This is equivalent to maximizing f(θ) − f(θ + ηd).
• Using a linear approximation:
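f(θ) − f(θ + ηd) ≈ f(θ) − [f(θ) + η dᵀ ∇f(θ)] = −η dᵀ ∇f(θ)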
• This approximation gets better as η gets smaller, since as we zoom in on a differentiable function it will look more and more linear.
Steepest descent derivation
• We need to find the value for d that maximizes −dᵀ ∇f(θ), subject to ‖d‖ = 1.
• Using the definition of the cosine of the angle α between two vectors:
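dᵀ ∇f(θ) = ‖d‖ ‖∇f(θ)‖ cos(α) = ‖∇f(θ)‖ cos(α), so −dᵀ ∇f(θ) is maximized when cos(α) = −1, i.e. when

d = −∇f(θ) / ‖∇f(θ)‖.

The direction of steepest descent is the negative gradient.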
How to choose the step size?
• At iteration t we take the step θ_{t+1} = θ_t − η_t ∇f(θ_t).
• General idea: vary η_t until we find the minimum along the descent direction.
• This is a 1D optimization problem.
• In the worst case we can just make η_t very small, but then we need to take a lot more steps.
• General strategy: start with a big η_t and progressively make it smaller by, e.g., halving it until the function decreases.
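A minimal sketch of steepest descent with this halving (backtracking) strategy, assuming we are handed a differentiable objective f and its gradient grad (the toy quadratic at the bottom is just for illustration):

```python
import numpy as np

def gradient_descent(f, grad, theta0, eta0=1.0, tol=1e-6, max_iters=1000):
    """Steepest descent with a crude backtracking step size: start each
    iteration with a large eta and halve it until the function decreases."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        g = grad(theta)
        if np.linalg.norm(g) < tol:          # converged: gradient is ~ 0
            break
        eta = eta0
        while f(theta - eta * g) >= f(theta) and eta > 1e-12:
            eta *= 0.5                       # halve until f decreases
        theta = theta - eta * g
    return theta

# Toy example: minimize a simple quadratic.
f = lambda th: 0.5 * float(th @ th)
grad = lambda th: th
theta_star = gradient_descent(f, grad, np.array([3.0, -4.0]))
```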
When have we converged?
• When the gradient is (approximately) zero: ∇f(θ) = 0.
• If the function is convex then we have reached a global minimum.
The problem with gradient descent
Figure: steepest descent zig-zagging across a narrow valley. Source: http://trond.hjorteland.com/thesis/img208.gif
Newton’smethod
• Tospeedupconvergence,wecanuseamoreaccurateapproximation.
• SecondorderTaylorexpansion:
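f(θ + d) ≈ f(θ) + dᵀ ∇f(θ) + ½ dᵀ H d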
• H is the Hessian matrix containing second derivatives.
Newton’smethod
What is it doing?
• At each step, Newton's method approximates the function with a quadratic bowl, then goes to the minimum of this bowl.
• For twice-or-more differentiable convex functions, this is usually much faster than steepest descent (provably).
• Con: computing the Hessian requires O(D²) time and storage. Inverting the Hessian is even more expensive (up to O(D³)). This is problematic in high dimensions.
Quasi-Newton methods
• Computation involving the Hessian is expensive.
• Modern approaches use computationally cheaper approximations to the Hessian or its inverse.
• Deriving these is beyond the scope of this tutorial, but we'll outline some of the key ideas.
• These are implemented in many good software packages in many languages and can be treated as black-box solvers, but it's good to know where they come from so that you know when to use them.
BFGS
• Maintain a running estimate of the Hessian, B_t.
• At each iteration, set B_{t+1} = B_t + U_t + V_t, where U and V are rank-1 matrices (these are derived specifically for the algorithm).
• The advantage of using a low-rank update to improve the Hessian estimate is that B can be cheaply inverted at each iteration.
Limited-memory BFGS
• BFGS progressively updates B, so one can think of B_t as a sum of rank-1 matrices from steps 1 to t. We could instead store these updates and recompute B_t at each iteration (although this would involve a lot of redundant work).
• L-BFGS only stores the most recent updates; the approximation itself is therefore always low-rank, and only a limited amount of memory needs to be used (linear in D).
• L-BFGS works extremely well in practice.
• L-BFGS-B extends L-BFGS to handle bound constraints on the variables.
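For example, Scipy's optimize module (listed in the references at the end) exposes L-BFGS-B as exactly this kind of black-box solver. A small sketch on a made-up ridge-regression objective:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical objective: ridge-regularized least squares on random data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
lam = 0.1

def f(theta):
    r = X @ theta - y
    return 0.5 * r @ r + 0.5 * lam * theta @ theta

def grad(theta):
    return X.T @ (X @ theta - y) + lam * theta

# Treat L-BFGS-B as a black box; pass the gradient via `jac`,
# and (optionally) bound constraints via `bounds`.
res = minimize(f, x0=np.zeros(5), jac=grad, method="L-BFGS-B")
theta_star, f_star = res.x, res.fun
```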
Conjugate gradients
• Steepest descent often picks a direction it has travelled in before (this results in the wiggly behavior).
• Conjugate gradients make sure we don't travel in the same direction again.
• The derivation for quadratics is more involved than we have time for.
• The derivation for general convex functions is fairly hacky, but reduces to the quadratic version when the function is indeed quadratic.
• Takeaway: conjugate gradients work better than steepest descent and almost as well as L-BFGS, with a much cheaper per-iteration cost (still linear, but better constants).
Stochastic Gradient Descent
• Recall that we can write the log-likelihood of a distribution as:
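L(θ) = Σ_{i=1}^{N} log p(x_i | θ)

for a dataset of N points x_1, …, x_N.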
Stochastic gradient descent
• Any iteration of a gradient descent (or quasi-Newton) method requires that we sum over the entire dataset to compute the gradient.
• SGD idea: at each iteration, sub-sample a small amount of data (even just 1 point can work) and use that to estimate the gradient.
• Each update is noisy, but very fast!
• This is the basis of optimizing ML algorithms with huge datasets (e.g., recent deep learning).
• Computing gradients using the full dataset is called batch learning; using subsets of the data is called mini-batch learning.
Stochastic gradient descent
• Suppose we made a copy of each point, y = x, so that we now have twice as much data. The log-likelihood is now:
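L(θ) = Σ_{i=1}^{N} log p(x_i | θ) + Σ_{i=1}^{N} log p(y_i | θ) = 2 Σ_{i=1}^{N} log p(x_i | θ)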
• In other words, the optimal parameters don't change, but we have to do twice as much work to compute the log-likelihood and its gradient!
• The reason SGD works is that similar data yield similar gradients, so if there is enough redundancy in the data, the noise from subsampling won't be so bad.
Stochastic gradient descent
• In the stochastic setting, line searches break down and so do estimates of the Hessian, so stochastic quasi-Newton methods are very difficult to get right.
• So how do we choose an appropriate step size?
• Robbins and Monro (1951): pick a sequence of η_t such that:
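Σ_{t=1}^{∞} η_t = ∞   and   Σ_{t=1}^{∞} η_t² < ∞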
• Satisfied by, for example, η_t = 1/t.
• This balances "making progress" with averaging out noise.
Final words on SGD
• SGD is very easy to implement compared to other methods, but the step sizes need to be tuned to different problems, whereas batch learning typically "just works".
• Tip 1: divide the log-likelihood estimate by the size of your mini-batches. This makes the learning rate invariant to mini-batch size.
• Tip 2: subsample without replacement so that you visit each point on each pass through the dataset (such a pass is known as an epoch).
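A minimal SGD loop that follows both tips (a sketch only: grad_nll_minibatch is a hypothetical user-supplied function returning the summed gradient of the negative log-likelihood over a mini-batch, and data is assumed to be a NumPy array of points):

```python
import numpy as np

def sgd(grad_nll_minibatch, data, theta0, batch_size=32, epochs=10):
    """Minimize a negative log-likelihood with SGD. The mini-batch gradient
    is divided by the batch size (Tip 1), and the data are shuffled so each
    point is visited exactly once per epoch (Tip 2)."""
    theta = np.asarray(theta0, dtype=float).copy()
    n, t = len(data), 0
    for _ in range(epochs):
        perm = np.random.permutation(n)          # subsample without replacement
        for start in range(0, n, batch_size):
            batch = data[perm[start:start + batch_size]]
            t += 1
            eta = 1.0 / t                        # step sizes satisfying Robbins-Monro
            theta -= eta * grad_nll_minibatch(theta, batch) / len(batch)
    return theta
```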
Useful References
• Linear programming:
– Linear Programming: Foundations and Extensions (http://www.princeton.edu/~rvdb/LPbook/)
• Convex optimization:
– http://web.stanford.edu/class/ee364a/index.html
– http://stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf
• LP solver:
– Gurobi: http://www.gurobi.com/
• Stats (Python):
– Scipy stats: http://docs.scipy.org/doc/scipy-0.14.0/reference/stats.html
• Optimization (Python):
– Scipy optimize: http://docs.scipy.org/doc/scipy/reference/optimize.html
• Optimization (Matlab):
– minFunc: http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html
• General ML:
– Scikit-Learn: http://scikit-learn.org/stable/