A Brief Look at Optimization
CSC412/2506 Tutorial
David Madras
January 18, 2018
Slides adapted from last year's version
Overview
• Introduction
• Classes of optimization problems
• Linear programming
• Steepest (gradient) descent
• Newton's method
• Quasi-Newton methods
• Conjugate gradients
• Stochastic gradient descent
What is optimization?
• Typical setup (in machine learning, life):
– Formulate a problem
– Design a solution (usually a model)
– Use some quantitative measure to determine how good the solution is.
• E.g., classification:
– Create a system to classify images
– Model is some simple classifier, like logistic regression
– Quantitative measure is classification error (lower is better in this case)
• The natural question to ask is: can we find a solution with a better score?
• Question: what could we change in the classification setup to lower the classification error (what are the free variables)?
Formal definition
• f(θ): some arbitrary function
• c(θ): some arbitrary constraints
• Minimizing f(θ) is equivalent to maximizing −f(θ), so we can just talk about minimization and be OK.
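In standard form, the problem is to choose θ to

minimize f(θ)   subject to   c(θ) ≤ 0,

where writing the constraints as c(θ) ≤ 0 is just one common convention.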
Types of optimization problems
• Depending on f, c, and the domain of θ we get many problems with many different characteristics.
• General optimization of arbitrary functions with arbitrary constraints is extremely hard.
• Most techniques exploit structure in the problem to find a solution more efficiently.
Types of optimization
• Simple enough problems have a closed-form solution:
– f(x) = x²
– Linear regression
• If f and c are linear functions then we can use linear programming (solvable in polynomial time).
• If f and c are convex then we can use convex optimization techniques (most of machine learning uses these).
• If f and c are non-convex we usually pretend they are convex and find a sub-optimal, but hopefully good enough, solution (e.g., deep learning).
• In the worst case there are global optimization techniques (operations research is very good at these).
• There are yet more techniques when the domain of θ is discrete.
• This list is far from exhaustive.
Types of optimization
• Takeaway: think hard about your problem, find the simplest category that it fits into, and use the tools from that branch of optimization.
• Sometimes you can solve a hard problem with a special-purpose algorithm, but most times we favor a black-box approach because it's simple and usually works.
Really naïve optimization algorithm
• Suppose θ is a D-dimensional vector of parameters where each dimension is bounded above and below.
• For each dimension i, pick some set of values to try.
• Try all combinations of values for each dimension, and record f for each one.
• Pick the combination that minimizes f.
Really naïve optimization algorithm
• This is called grid search. It works really well in low dimensions when you can afford to evaluate f many times.
• Less appealing when f is expensive or in high dimensions.
• You may have already done this when searching for a good L2 penalty value.
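A minimal sketch of grid search in Python (the objective f and the candidate grids below are made-up illustrations, not from the slides):

```python
import itertools
import numpy as np

def grid_search(f, grids):
    """Evaluate f at every combination of candidate values (one grid of
    candidates per dimension) and return the best combination found."""
    best_theta, best_val = None, np.inf
    for theta in itertools.product(*grids):
        val = f(np.array(theta))
        if val < best_val:
            best_theta, best_val = np.array(theta), val
    return best_theta, best_val

# Example: a simple 2D quadratic, 21 candidate values per dimension.
f = lambda th: (th[0] - 1.0) ** 2 + (th[1] + 2.0) ** 2
grids = [np.linspace(-5, 5, 21), np.linspace(-5, 5, 21)]
theta_star, f_star = grid_search(f, grids)
```

The number of evaluations grows as the product of the grid sizes, which is exactly why this stops being appealing in high dimensions.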
Convex functions
Use the line test.
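The line test says that f is convex if the line segment (chord) between any two points on its graph never lies below the graph, i.e., for all x, y and all λ ∈ [0, 1]:

f(λx + (1 − λ)y) ≤ λ f(x) + (1 − λ) f(y).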
Convex functions
Convex optimization
• We've talked about 1D functions, but the definition still applies to higher dimensions.
• Why do we care about convex functions?
• In a convex function, any local minimum is automatically a global minimum.
• This means we can apply fairly naïve techniques to find the nearest local minimum and still guarantee that we've found the best solution!
Steepest (gradient) descent
• Cauchy (1847)
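The update rule is

θ_{t+1} = θ_t − η ∇f(θ_t),

where η > 0 is a step size: repeatedly take a small step in the direction of the negative gradient.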
Aside: Taylor series
• A Taylor series is a polynomial series that converges to a function f.
• We say that the Taylor series expansion of f at x around a point a, f(x + a), is:
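f(x + a) = f(x) + f′(x) a + (f″(x)/2!) a² + (f‴(x)/3!) a³ + … = Σ_{n=0}^{∞} (f⁽ⁿ⁾(x)/n!) aⁿ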
• Truncating this series gives a polynomial approximation to a function.
Blue: exponential function; red: Taylor series approximation.
Multivariate Taylor Series
• The first-order Taylor series expansion of a function f around the point θ, for a small displacement d, is:
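f(θ + d) ≈ f(θ) + dᵀ ∇f(θ),

where ∇f(θ) is the gradient (the vector of partial derivatives) at θ.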
Steepest descent derivation
• Suppose we are at θ and we want to pick a direction d (with norm 1) such that f(θ + ηd) is as small as possible for some step size η. This is equivalent to maximizing f(θ) − f(θ + ηd).
• Using a linear approximation:
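f(θ) − f(θ + ηd) ≈ f(θ) − [f(θ) + η dᵀ ∇f(θ)] = −η dᵀ ∇f(θ)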
• This approximation gets better as η gets smaller, since as we zoom in on a differentiable function it will look more and more linear.
Steepest descent derivation
• We need to find the value for d that maximizes −dᵀ ∇f(θ), subject to ‖d‖ = 1.
• Using the definition of the cosine of the angle α between two vectors:
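dᵀ ∇f(θ) = ‖d‖ ‖∇f(θ)‖ cos(α) = ‖∇f(θ)‖ cos(α), so −dᵀ ∇f(θ) is maximized when cos(α) = −1, i.e. when

d = −∇f(θ) / ‖∇f(θ)‖.

The direction of steepest descent is the negative gradient.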
How to choose the step size?
• At iteration t we take the step θ_{t+1} = θ_t − η_t ∇f(θ_t).
• General idea: vary η_t until we find the minimum along the descent direction.
• This is a 1D optimization problem.
• In the worst case we can just make η_t very small, but then we need to take a lot more steps.
• General strategy: start with a big η_t and progressively make it smaller by, e.g., halving it until the function decreases.
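A minimal sketch of steepest descent with this halving (backtracking) strategy, assuming we are handed a differentiable objective f and its gradient grad (the toy quadratic at the bottom is just for illustration):

```python
import numpy as np

def gradient_descent(f, grad, theta0, eta0=1.0, tol=1e-6, max_iters=1000):
    """Steepest descent with a crude backtracking step size: start each
    iteration with a large eta and halve it until the function decreases."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        g = grad(theta)
        if np.linalg.norm(g) < tol:          # converged: gradient is ~ 0
            break
        eta = eta0
        while f(theta - eta * g) >= f(theta) and eta > 1e-12:
            eta *= 0.5                       # halve until f decreases
        theta = theta - eta * g
    return theta

# Toy example: minimize a simple quadratic.
f = lambda th: 0.5 * float(th @ th)
grad = lambda th: th
theta_star = gradient_descent(f, grad, np.array([3.0, -4.0]))
```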
When have we converged?
• When the gradient is (approximately) zero: ∇f(θ) = 0.
• If the function is convex then we have reached a global minimum.
The problem with gradient descent
Figure: steepest descent zig-zagging across a narrow valley. Source: http://trond.hjorteland.com/thesis/img208.gif
Newton’smethod
• Tospeedupconvergence,wecanuseamoreaccurateapproximation.
• SecondorderTaylorexpansion:
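f(θ + d) ≈ f(θ) + dᵀ ∇f(θ) + ½ dᵀ H d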
• H is the Hessian matrix containing second derivatives.
Newton’smethod
What is it doing?
• At each step, Newton's method approximates the function with a quadratic bowl, then goes to the minimum of this bowl.
• For twice-or-more differentiable convex functions, this is usually much faster than steepest descent (provably).
• Con: computing the Hessian requires O(D²) time and storage. Inverting the Hessian is even more expensive (up to O(D³)). This is problematic in high dimensions.
Quasi-Newton methods
• Computation involving the Hessian is expensive.
• Modern approaches use computationally cheaper approximations to the Hessian or its inverse.
• Deriving these is beyond the scope of this tutorial, but we'll outline some of the key ideas.
• These are implemented in many good software packages in many languages and can be treated as black-box solvers, but it's good to know where they come from so that you know when to use them.
BFGS
• Maintain a running estimate of the Hessian, B_t.
• At each iteration, set B_{t+1} = B_t + U_t + V_t, where U and V are rank-1 matrices (these are derived specifically for the algorithm).
• The advantage of using a low-rank update to improve the Hessian estimate is that B can be cheaply inverted at each iteration.
Limited-memory BFGS
• BFGS progressively updates B, so one can think of B_t as a sum of rank-1 matrices from steps 1 to t. We could instead store these updates and recompute B_t at each iteration (although this would involve a lot of redundant work).
• L-BFGS only stores the most recent updates; the approximation itself is therefore always low-rank, and only a limited amount of memory needs to be used (linear in D).
• L-BFGS works extremely well in practice.
• L-BFGS-B extends L-BFGS to handle bound constraints on the variables.
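For example, Scipy's optimize module (listed in the references at the end) exposes L-BFGS-B as exactly this kind of black-box solver. A small sketch on a made-up ridge-regression objective:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical objective: ridge-regularized least squares on random data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
lam = 0.1

def f(theta):
    r = X @ theta - y
    return 0.5 * r @ r + 0.5 * lam * theta @ theta

def grad(theta):
    return X.T @ (X @ theta - y) + lam * theta

# Treat L-BFGS-B as a black box; pass the gradient via `jac`,
# and (optionally) bound constraints via `bounds`.
res = minimize(f, x0=np.zeros(5), jac=grad, method="L-BFGS-B")
theta_star, f_star = res.x, res.fun
```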
Conjugate gradients
• Steepest descent often picks a direction it has travelled in before (this results in the wiggly behavior).
• Conjugate gradients make sure we don't travel in the same direction again.
• The derivation for quadratics is more involved than we have time for.
• The derivation for general convex functions is fairly hacky, but reduces to the quadratic version when the function is indeed quadratic.
• Takeaway: conjugate gradients work better than steepest descent and almost as well as L-BFGS, with a much cheaper per-iteration cost (still linear, but better constants).
Stochastic Gradient Descent
• Recall that we can write the log-likelihood of a distribution as:
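L(θ) = Σ_{i=1}^{N} log p(x_i | θ)

for a dataset of N points x_1, …, x_N.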
Stochastic gradient descent
• Any iteration of a gradient descent (or quasi-Newton) method requires that we sum over the entire dataset to compute the gradient.
• SGD idea: at each iteration, sub-sample a small amount of data (even just 1 point can work) and use that to estimate the gradient.
• Each update is noisy, but very fast!
• This is the basis of optimizing ML algorithms with huge datasets (e.g., recent deep learning).
• Computing gradients using the full dataset is called batch learning; using subsets of the data is called mini-batch learning.
Stochastic gradient descent
• Suppose we made a copy of each point, y = x, so that we now have twice as much data. The log-likelihood is now:
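L(θ) = Σ_{i=1}^{N} log p(x_i | θ) + Σ_{i=1}^{N} log p(y_i | θ) = 2 Σ_{i=1}^{N} log p(x_i | θ)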
• In other words, the optimal parameters don't change, but we have to do twice as much work to compute the log-likelihood and its gradient!
• The reason SGD works is that similar data yield similar gradients, so if there is enough redundancy in the data, the noise from subsampling won't be so bad.
Stochastic gradient descent
• In the stochastic setting, line searches break down and so do estimates of the Hessian, so stochastic quasi-Newton methods are very difficult to get right.
• So how do we choose an appropriate step size?
• Robbins and Monro (1951): pick a sequence of η_t such that:
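Σ_{t=1}^{∞} η_t = ∞   and   Σ_{t=1}^{∞} η_t² < ∞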
• Satisfied by, for example, η_t = 1/t.
• This balances "making progress" with averaging out noise.
Final words on SGD
• SGD is very easy to implement compared to other methods, but the step sizes need to be tuned to different problems, whereas batch learning typically "just works".
• Tip 1: divide the log-likelihood estimate by the size of your mini-batches. This makes the learning rate invariant to mini-batch size.
• Tip 2: subsample without replacement so that you visit each point on each pass through the dataset (such a pass is known as an epoch).
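A minimal SGD loop that follows both tips (a sketch only: grad_nll_minibatch is a hypothetical user-supplied function returning the summed gradient of the negative log-likelihood over a mini-batch, and data is assumed to be a NumPy array of points):

```python
import numpy as np

def sgd(grad_nll_minibatch, data, theta0, batch_size=32, epochs=10):
    """Minimize a negative log-likelihood with SGD. The mini-batch gradient
    is divided by the batch size (Tip 1), and the data are shuffled so each
    point is visited exactly once per epoch (Tip 2)."""
    theta = np.asarray(theta0, dtype=float).copy()
    n, t = len(data), 0
    for _ in range(epochs):
        perm = np.random.permutation(n)          # subsample without replacement
        for start in range(0, n, batch_size):
            batch = data[perm[start:start + batch_size]]
            t += 1
            eta = 1.0 / t                        # step sizes satisfying Robbins-Monro
            theta -= eta * grad_nll_minibatch(theta, batch) / len(batch)
    return theta
```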
Useful References
• Linear programming:
– Linear Programming: Foundations and Extensions (http://www.princeton.edu/~rvdb/LPbook/)
• Convex optimization:
– http://web.stanford.edu/class/ee364a/index.html
– http://stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf
• LP solver:
– Gurobi: http://www.gurobi.com/
• Stats (Python):
– Scipy stats: http://docs.scipy.org/doc/scipy-0.14.0/reference/stats.html
• Optimization (Python):
– Scipy optimize: http://docs.scipy.org/doc/scipy/reference/optimize.html
• Optimization (Matlab):
– minFunc: http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html
• General ML:
– Scikit-Learn: http://scikit-learn.org/stable/