Bandit Learning for NMT Hyperparameter Search Kevin Duh

May 2018 discussion

Speeding up Hyperparameter Search

Given budget constraints, how to decide which run to kill before convergence?

K-arm bandit problem

- Each run/model is an arm
- Each time we pull an arm, we train the model by one step
- Which arm should we pull first?

Simulation

• WNMT 2018 DE-EN data (4M sentences)
• Run K models to convergence. Check if bandit learning can choose correctly.
• Seq2Seq hyperparameters:
  – Varied = BPE: 10k, 30k, 50k; Embedding size: 100, 300, 500; RNN hidden size: 100, 300, 500; #layers: 1, 2; Dropout: 0.0-0.4
  – Fixed = Default optimizer, learning-rate scheduler
• Checkpoint frequency: 10k, Batch size: 128
  – Each checkpoint = 1 unit of budget

Epsilon-Greedy Algorithm

• For each turn until budget runs out:
  – Draw x from random_uniform(0,1)
  – If x < epsilon (e.g. 0.1):
    • Pull random arm
  – Else:
    • Pull best arm: k' = argmax_k value[k]
    • Update value[k'] = latest BLEU (or average so far)
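The loop above can be sketched in Python as a replay over learning curves that were already trained to convergence, which is how the simulation in these slides is set up. This is a minimal sketch, not the author's code: the names `curves` and `epsilon_greedy` are hypothetical, the value is taken to be the latest BLEU rather than the running average, and updating the value after every pull (random or greedy) is an assumption.

```python
import random


def epsilon_greedy(curves, budget, epsilon=0.1, seed=0):
    """Replay epsilon-greedy over pre-recorded learning curves.

    curves[k][t] is the BLEU of arm k after t+1 checkpoints (hypothetical data).
    Returns (checkpoints trained per arm, latest BLEU per arm).
    """
    rng = random.Random(seed)
    num_arms = len(curves)
    value = [0.0] * num_arms  # latest BLEU observed for each arm
    steps = [0] * num_arms    # checkpoints trained so far for each arm
    for _ in range(budget):
        # Only arms with recorded checkpoints left can still be pulled.
        alive = [k for k in range(num_arms) if steps[k] < len(curves[k])]
        if not alive:
            break
        if rng.random() < epsilon:
            k = rng.choice(alive)                   # explore: random arm
        else:
            k = max(alive, key=lambda a: value[a])  # exploit: best arm so far
        value[k] = curves[k][steps[k]]              # one pull = one checkpoint
        steps[k] += 1
    return steps, value
```

Because every value starts at 0, the first arm that gets pulled immediately looks best, and with a small epsilon most of the budget stays on arms that looked good early.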

Epsilon-Greedy tends to explore only models that are good initially. OK here but risky. (Budget=40)

Upper Confidence Bound (UCB)

• Idea: more uncertainty on arms less pulled, so favor them.
• For each turn until budget runs out:
  – For each arm k:
    • Compute bound[k] = sqrt(2 log(totalcount) / count[k])
  – Pull best arm: k' = argmax_k value[k] + bound[k]
  – Update value[k'] = latest BLEU (or average so far)
  – Increment count[k'] += 1; totalcount += 1
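A minimal Python sketch of the UCB loop above, again replaying pre-recorded curves. The names `curves` and `ucb` are hypothetical, and pulling every arm once before the bound is used (to avoid dividing by a zero count) is an assumption that the slide does not spell out.

```python
import math


def ucb(curves, budget):
    """Replay UCB over pre-recorded learning curves (hypothetical data)."""
    num_arms = len(curves)
    value = [0.0] * num_arms  # latest score observed for each arm
    count = [0] * num_arms    # number of pulls per arm
    total = 0                 # total number of pulls

    def score(k):
        if count[k] == 0:
            return math.inf   # assumption: untried arms are pulled first
        return value[k] + math.sqrt(2 * math.log(total) / count[k])

    for _ in range(budget):
        alive = [k for k in range(num_arms) if count[k] < len(curves[k])]
        if not alive:
            break
        k = max(alive, key=score)
        value[k] = curves[k][count[k]]
        count[k] += 1
        total += 1
    return count, value


# Toy input with scores on a 0 to 1 scale (an assumption about the scale).
curves = [[0.10] * 10, [0.20] * 10, [0.30] * 10]
count, _ = ucb(curves, budget=9)
print(count)
```

In this toy input the bound stays above 1 while counts are small, which is larger than any gap between the arms, so the pulls end up spread almost evenly.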

Bound is too large in practice. UCB uniformly explores all arms. (Budget=40)

Hyperband / Successive Halving
Li et al. 2016. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization

• Previously: choosing 1 arm is too risky, and values aren't fairly comparable across steps
• Idea: Choose half of population at each turn

• L = list(Arms)
• For each turn until budget runs out:
  – Pull each arm k in L; update value[k] = current BLEU
  – S = [arms sorted by value]
  – L = top half of S
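The successive-halving loop above can be sketched as follows. This follows the simplified loop on the slide, not the full Hyperband algorithm of Li et al., which additionally runs several successive-halving brackets. The names `curves` and `successive_halving` are hypothetical, and letting the last surviving arm keep training until the budget is spent is an assumption.

```python
def successive_halving(curves, budget):
    """Replay successive halving over pre-recorded learning curves.

    Returns (surviving arms, checkpoints trained per arm, budget spent).
    """
    alive = list(range(len(curves)))  # L = list(Arms)
    steps = [0] * len(curves)
    value = [0.0] * len(curves)
    spent = 0
    # Each turn pulls every surviving arm once, so it costs len(alive) units.
    while alive and spent + len(alive) <= budget:
        if any(steps[k] >= len(curves[k]) for k in alive):
            break  # a surviving run has no recorded checkpoints left
        for k in alive:
            value[k] = curves[k][steps[k]]  # value[k] = current BLEU
            steps[k] += 1
        spent += len(alive)
        ranked = sorted(alive, key=lambda k: value[k], reverse=True)
        alive = ranked[:max(1, len(ranked) // 2)]  # keep the top half
    return alive, steps, spent
```

All arms are compared after the same number of checkpoints within a turn, which addresses the comparability concern raised above.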

Promising arms train successively longer under Successive Halving (Budget=40)

16 arms. Successive Halving with Budget=96.

Considerations / Discussions

• Next steps:
  – Include multiple objectives via Pareto ranks
  – Make this more practical. Implement successive halving as inner loop within evolutionary search
• Algorithmic questions:
  – Fixed optimizer & learning rate scheduler
  – New run vs old run

Considerations / Discussions

• Implementation questions:
  – Continue on finished run for Sockeye:
    • --params, --source-vocab, --target-vocab
    • Different datasets? E.g. smaller datasets
    • Sockeye.prepared_data?
  – Measurements:
    • Time: CPU decoding? Vs GPU decoding
    • Accuracy: validation BLEU vs train perplexity

Suggestions from Michael

• Try vastly different architectures and repeat the simulation
• Try different optimizers
  – ADAM for large data: long runs, looking at training perplexity, decreasing learning rate slowly by 0.9
  – EVE for small data
  – NADAM doesn't work, but interesting to try
• CPU decoder: look at WNMT'18 Docker for MKL version that is more performant

Suggestions for Rudolphe

• Initial condition: maybe I need to train each arm longer initially before starting the K-arm bandits
• But what if I have too many K?

June 2018 discussion

TED DE-EN – different optimizers

{adadelta, adagrad, adam, eve, nadam, rmsprop, sgd} x initial learning rate = {0.0002, 0.001}

batch_size=4096
schedule=plateau-reduce
learning_rate_reduce_factor=0.7
loss="cross-entropy"
checkpoint=750 (~1 epoch)

Total resource = 56

Same as last slide, but evaluate at 10-checkpoint intervals (750 x 10 updates)

Total resource = 280

Increasing resource usage → safer

Run        Initial learning rate   Validation Perplexity   Validation BLEU
Adadelta1  0.0002                  19.17                   23.27
Adadelta2  0.001                   17.24                   25.02
Adam1b     0.0002                  20.68                   24.53
Adam1      0.0002                  20.41                   24.25
Adam2b     0.001                   17.18                   25.68
Adam2      0.001                   19.14                   25.24
Eve1       0.0002                  14.83                   27.39
Eve2       0.001                   40.06                   12.59
Nadam1     0.0002                  20.89                   24.24
Nadam2     0.001                   15.43                   26.97
RMSprop1   0.0002                  19.08                   24.55
RMSprop2   0.001                   16.10                   27.00
adagrad    0.001, 0.0002           411, 195
sgd                                4171, 662

WMT ZH-EN – different architecture

WMT RU-EN – different architecture

Suggestions from Michael

• Different batch size
• Different scheduler (sqrt)
• Mix architectures (& different encoder/decoder depths & layer size)
• BLEU – how close to the best model, i.e. can I get to within 0.2 BLEU of the best model with 10% of the resources?

July 2018 discussion

More experiments to verify Hyperband's robustness

• Motivation:
  – Previous Hyperband results were promising, but want to test on more diverse (e.g. criss-crossing) learning curves
• This month:
  – Curriculum learning experiments

Curriculum Learning

• Hunch:
  – Start by training on easy samples
  – As model improves, add in harder samples
  – Maybe model will converge faster? Or better BLEU?
• Sockeye Implementation (at MTMA):
  – Easy/hard samples are assigned to different shards
  – Schedule what shard is visible to trainer at what time

Curriculum Learning - Visualization

[Figure: visible (i.e. available) shards plotted against training time (i.e. updates). Shards are labeled Very Easy, Easy, Mid Level, Hard, Very Hard. Training starts with the easy shard and gradually adds harder shards at the curriculum update frequency (e.g. every 1000 updates). At the final stage the trainer sees all data and gets random batches.]
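The schedule in this visualization can be sketched in a few lines of Python. This is an illustration only and not the Sockeye implementation: the function names, the rule of adding exactly one shard per curriculum update, and sampling with replacement are all assumptions.

```python
import random


def visible_shards(update, num_shards, update_freq=1000):
    """Indices of the shards visible at a training update (0 = easiest)."""
    stage = update // update_freq  # number of curriculum updates so far
    return list(range(min(stage + 1, num_shards)))


def sample_batch(shards, update, batch_size, rng, update_freq=1000):
    """Draw a random batch from the currently visible shards only."""
    visible = visible_shards(update, len(shards), update_freq)
    pool = [example for s in visible for example in shards[s]]
    return [rng.choice(pool) for _ in range(batch_size)]


# Hypothetical shards, ordered from easy to hard.
shards = [["easy-1", "easy-2"], ["mid-1"], ["hard-1"]]
print(visible_shards(0, len(shards)))
print(sample_batch(shards, 2500, 4, random.Random(0)))
```

Once `update // update_freq` reaches the number of shards minus one, every shard is visible and batches are drawn from all data, matching the last stage in the figure.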

Curriculum Learning – many variants to get different learning curves

• Different schedules, e.g.
• Different definitions of easy/hard:
  – Sentence length
  – Vocabulary frequency
  – Force-decode / 1-best score of existing model
• Different curriculum update frequency
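As an illustration of one of the definitions of easy/hard listed above, the sketch below assigns sentences to shards by sentence length. Whitespace tokenization, equal-sized shards, and the name `assign_shards` are assumptions made for the example.

```python
def assign_shards(sentences, num_shards):
    """Split sentences into shards of roughly equal size, shortest first."""
    # Rank sentences by token count; sorted() is stable, so ties keep input order.
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i].split()))
    shards = [[] for _ in range(num_shards)]
    for pos, i in enumerate(order):
        shards[pos * num_shards // len(sentences)].append(sentences[i])
    return shards


print(assign_shards(["a b c d", "a", "a b c", "a b"], num_shards=2))
```

Swapping the sort key for a vocabulary-frequency or model-score function would give the other variants without changing the rest of the code.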

Setup

• Data: De-En TED
• Preparation:
  – 100 training runs with different curriculum learning settings
  – Randomly draw 16 runs each time and observe Hyperband results
• Question:
  – Can Hyperband correctly bet on near-best runs?
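The evaluation protocol described above (draw a random subset of runs, let the bandit choose one, record where that choice ranks by final score) can be sketched as below. The names `rank_histogram`, `final_bleu`, and the `choose` callback are hypothetical; `choose` stands in for whatever bandit procedure is being tested.

```python
import random


def rank_histogram(final_bleu, choose, num_draw=16, trials=100, seed=0):
    """Histogram of the true rank of the run picked in each random trial.

    final_bleu[r] is the converged BLEU of run r; choose(drawn) returns the
    run id selected by the method under test. hist[0] counts rank-1 picks.
    """
    rng = random.Random(seed)
    hist = [0] * num_draw
    for _ in range(trials):
        drawn = rng.sample(range(len(final_bleu)), num_draw)
        picked = choose(drawn)
        ranked = sorted(drawn, key=lambda r: final_bleu[r], reverse=True)
        hist[ranked.index(picked)] += 1
    return hist
```

A histogram concentrated in the first few bins means the method usually bets on near-best runs.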

Rank Histogram (100 random trials)

#run  Halving_freq  Resource used vs grid search  Rank histogram
8     1             20/64                         [19,19,20,18,14,5,3,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
8     2             40/128                        [26,23,20,14,10,2,2,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
8     3             60/192                        [42,31,14,9,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
8     4             80/256                        [43,23,18,10,5,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
16    1             48/256                        [16,11,10,10,14,12,11,3,6,2,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
16    2             96/512                        [41,20,14,8,1,8,0,5,0,1,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
16    3             144/768                       [46,23,12,3,6,7,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
16    4             192/1024                      [47,21,8,6,7,4,3,2,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
32    1             112/1024                      [28,11,7,5,7,4,9,5,2,3,1,3,0,1,0,2,0,2,3,1,1,0,0,0,2,2,1,0,0,0]
32    2             224/2048                      [45,21,11,2,1,3,3,4,5,3,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
32    3             336/3072                      [101,61,14,6,2,4,2,1,2,3,2,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
32    4             448/4096                      [56,33,1,4,1,2,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
64    1             256/4096                      [10,22,27,14,1,1,0,1,0,0,2,0,0,2,3,0,3,3,3,4,0,0,2,0,2,0,0,0,0,0]
64    2             512/8192                      [19,32,18,18,8,3,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0]

Reading the first row: in 19/100 trials, Hyperband chose the rank 1 (best) curve; in 20/100 trials, Hyperband chose the rank 3 curve.
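The "resource used vs grid search" column can be turned into a saving, and each histogram into a top-k hit rate, with a little arithmetic. The sketch below assumes the fused fields on this slide split as #run, halving frequency, then used/grid (for example "16148/256" read as 16 runs, halving_freq 1, 48/256); the function names are hypothetical.

```python
# (#run, halving_freq, resource used, grid-search resource), as read from
# the rank-histogram slide under the field split assumed above.
rows = [
    (8, 1, 20, 64), (8, 2, 40, 128), (8, 3, 60, 192), (8, 4, 80, 256),
    (16, 1, 48, 256), (16, 2, 96, 512), (16, 3, 144, 768), (16, 4, 192, 1024),
    (32, 1, 112, 1024), (32, 2, 224, 2048), (32, 3, 336, 3072), (32, 4, 448, 4096),
    (64, 1, 256, 4096), (64, 2, 512, 8192),
]


def saved_fraction(used, grid):
    """Fraction of the grid-search budget that was not spent."""
    return 1.0 - used / grid


def top_k_rate(hist, k):
    """Share of trials in which the chosen run ranked within the top k."""
    return sum(hist[:k]) / sum(hist)


for runs, freq, used, grid in rows:
    saving = saved_fraction(used, grid)
    print(f"{runs:>2} runs, halving_freq {freq}: {saving:.1%} saved")

# First histogram row (8 runs, halving_freq 1), trailing zeros included.
first_row = [19, 19, 20, 18, 14, 5, 3, 2] + [0] * 22
print(top_k_rate(first_row, 3))
```

Under this reading the saving depends only on the number of runs: 68.75% for 8 runs, 81.25% for 16, about 89% for 32, and 93.75% for 64.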

Summary

• Found relatively robust settings for Hyperband on NMT learning curves
• Next: try on different NMT architectures and incorporate speed/accuracy multi-objective
• (Next meeting: September rather than August? Currently doing summer workshop on Domain Adaptation for NMT)

JHU HLTCOE SCALE 2018 Workshop: Resilient Machine Translation for New Domains

Kevin Duh, Paul McNamee, Kathy Baker, Philipp Koehn, Brian Thompson, Chris Callison-Burch, Jan Niehues, Marine Carpuat, Tim Anderson, Jeremy Gwinnup, Marianna Martindale, Jenn Drexler, Calandra Moore, Steven Bradtke, James Woo, Gaurav Kumar, Huda Khayrallah, Pamela Shapiro, Becky Marvin, Jonathan Weese, Dusan Varis

Final Presentation: Baltimore, August 9 (save the date!)

Goal: Improve Domain Adaptation of NMT

Test       Training data for NMT   Ar-En  De-En  Fa-En  Ko-En  Ru-En  Zh-En
TED Talks  General Domain          29.6   34.6   22.2   11.6   23.4   15.9
           In-Domain (TED)         27.4   32.3   21.3   14.4   22.9   16.2
           Continued Training      35.4   39.9   27.9   17.2   28.6   20.4
Patent     General Domain          n/a    36.0   n/a    2.7    23.4   12.6
           In-Domain (Patent)      n/a    61.9   n/a    29.9   26.9   40.2
           Continued Training      n/a    62.3   n/a    31.7   37.0   43.7

BLEU scores: Continued Training gives consistent gains (~0.5-5 BLEU)

[Diagram: 1. Train a GENERAL MODEL on large general-domain bitext. 2. Initialize the ADAPTED MODEL from it. 3. Continue training on in-domain bitext, e.g. patents.]

Suggestions from Michael

• Quantify resources saved:
  – What percentage of resources can we save vs grid search while achieving similar models?
• What were the bad examples in the histogram curve?
• Pull Request for Curriculum Learning code

September 2018 discussion

Summary so far

With bandit learning, we can save X% of resources while achieving less than Y degradation in BLEU. (Here, X=81%, Y=0)

Open question:
- Results for drastically different architectures

Other directions

• Methods for speeding up training (i.e. inner loop of hyperparameter optimization)
  – Bandit learning
  – Data sub-selection for training speedup
• Methods for speeding up models or reducing resource usage during inference in general
  – Model compression

Data subset selection: Formulation

• Training data T: N samples
• Can we select subset S of M << N samples
  – Where training on subset gives same hyperparameter search recommendations as training on full set?
• Formulation:
  1. Train K models with different hyperparameters on T
  2. Similarly, train the K models on subset S
  3. Compare the ranking of (1) and (2). If same, then data subset is a good surrogate
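Step 3 of the formulation needs some way to compare two rankings of the same K models. One common choice, used here as an assumption because the slides do not name a metric, is Kendall's rank correlation. The sketch implements the tau-a variant directly so it has no dependencies; `kendall_tau` and its argument names are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_full, scores_subset):
    """Kendall tau-a between two score lists over the same K models.

    +1 means the subset ranks the models exactly as the full set does,
    -1 means the ranking is fully reversed.
    """
    if len(scores_full) != len(scores_subset) or len(scores_full) < 2:
        raise ValueError("need two equal-length lists with at least 2 models")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_full)), 2):
        agreement = (scores_full[i] - scores_full[j]) * (scores_subset[i] - scores_subset[j])
        if agreement > 0:
            concordant += 1    # both settings order models i and j the same way
        elif agreement < 0:
            discordant += 1    # the two settings disagree on this pair
    pairs = len(scores_full) * (len(scores_full) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical BLEU scores of 4 models trained on the full set and on a subset.
print(kendall_tau([25.0, 27.0, 22.0, 24.0], [18.0, 20.0, 15.0, 17.0]))
```

A value below 1 still leaves open which pairs were swapped; swaps between models whose full-data scores are nearly tied matter less than swaps that change the top recommendation.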

Data subset selection: Details

• Baseline 1: Train on T as usual, with up to same training time as M * #epoch
• Baseline 2: Flip the subset selection criteria
• Subset selection method:
  – Cynical data selection
  – Vocabulary based selection
• Evaluation:
  – How to interpret ranking differences?

Model compression

• Focus more on inference resource constraints
• Existing ideas to explore:
  – Model Distillation
  – Quantization
  – (Discussion)
• Compare speed, memory footprint
• Integrate this in larger auto-tuning loop

Discussion notes (with Michael)

• Data subset selection:
  – It'd be good to have all the plots: informative whether results are good or bad
  – For vocab selection: currently something similar is done in unit tests. (Replace most vocab with unk)
• Model distillation:
  – Train big model and translate training data. Train small model and then continue training on big model's outputs. This may be sufficient (no need for output distribution as I originally imagined)
• Quantization:
  – Michael will help look for pointers on quantization work in MXNet

• Next: exploring a new orthogonal direction for speeding up hyperparameter search

Data Subset Selection for NMT Hyperparameter Search

Kevin Duh

Motivation

• It takes time to train models to convergence on large datasets
• Question: Can we train models to convergence on a small dataset?
  – E.g. in a paper a long time ago, LeCun suggests fiddling with the learning rate on a small subset first
  – What subset leads to fast convergence?
  – Does the ranking of hyperparameters on small subset correlate with that on the full dataset?

Data subset selection: Formulation

• Training data T: N samples
• Can we select subset S of M << N samples
  – Where training on subset gives same hyperparameter search recommendations as training on full set?
• Formulation:
  1. Train K models with different hyperparameters on T
  2. Similarly, train the K models on subset S
  3. Compare the ranking of (1) and (2). If same, then data subset is a good surrogate

Preliminary Experiments

• Big: Model trained on 28 million sentence pairs of general-domain DE-EN
  – Approx 300k-500k updates to converge
• Small: Model trained on randomly selected 10% of data
  – Approx 150k-300k updates (30-60 hours) to converge
• Vocab: Model trained on sentences containing only the top 1/256 vocabulary
  – Approx 100k-200k updates to converge

→ Vary #layers, size, etc. and see if ranking is the same for Big vs Small and Big vs Vocab
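The "Vocab" subset above keeps only sentences whose words all fall in the most frequent 1/256 of the vocabulary. A minimal sketch of that filter follows; whitespace tokenization, filtering on a single side of the bitext, and the name `select_by_vocab` are assumptions, since the slides do not give these details.

```python
from collections import Counter


def select_by_vocab(sentences, fraction=1 / 256):
    """Keep sentences made up only of the most frequent word types."""
    counts = Counter(tok for s in sentences for tok in s.split())
    keep = max(1, int(len(counts) * fraction))  # number of word types to keep
    top = {tok for tok, _ in counts.most_common(keep)}
    selected = []
    for s in sentences:
        toks = s.split()
        # Skip empty lines; keep a sentence only if every token is frequent.
        if toks and all(tok in top for tok in toks):
            selected.append(s)
    return selected


# Tiny hypothetical corpus; fraction raised so that one word type survives.
print(select_by_vocab(["a b", "a a", "b c", "a"], fraction=0.5))
```

On real data the filter would presumably be applied to sentence pairs, and the resulting subset size M depends on the corpus rather than on the fraction alone.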

Data subset selection by vocabulary

LSTM vs Transformer, Layer=1,2,4

Changing learning rates

Connections to Bandit Learning

• Bandit: Stop some runs before convergence
• Data Selection: Shorter time to convergence
• These are both heuristics on early stopping some models during hyperparameter search
• Next step:
  – Collect more empirical results
  – Experiment with other data selection methods

Discussion notes (with Michael)

• Plot all results on same figure
• Try even smaller datasets and see when ranking starts to break
• Experiment on at least one more dataset
• As wrap-up, find general recommendations: based on some dataset characteristic, what speed-up method to use?