Machine Learning & Data Mining (CMS/CS/CNS/EE 155)
Lecture 2: Perceptron & Stochastic Gradient Descent
Recap: Basic Recipe (Supervised Learning)

• Training Data: S = {(x_i, y_i)}_{i=1}^N, with x ∈ R^D and y ∈ {−1, +1}
• Model Class: f(x | w, b) = w^T x − b (linear models)
• Loss Function: L(a, b) = (a − b)^2 (squared loss)
• Learning Objective: argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b)) (an optimization problem)
Recap: Bias-Variance Trade-off

[Figure: three pairs of plots, each pair labeled with its bias and variance, illustrating the trade-off across model classes.]
Recap: Complete Pipeline

Training Data: S = {(x_i, y_i)}_{i=1}^N
→ Model Class(es): f(x | w, b) = w^T x − b
→ Loss Function: L(a, b) = (a − b)^2
→ Optimization: argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b))
→ Cross Validation & Model Selection → Profit!
Today

• Two basic learning algorithms:
  – Perceptron algorithm
  – (Stochastic) gradient descent, i.e., actually solving the optimization problem
The Perceptron

• One of the earliest learning algorithms (1957, Frank Rosenblatt)
• Still a great algorithm:
  – Fast
  – Clean analysis
  – Precursor to neural networks

[Photo: Frank Rosenblatt with the Mark I Perceptron machine]
Perceptron Learning Algorithm (Linear Classification Model)

Training Set: S = {(x_i, y_i)}_{i=1}^N, with y ∈ {+1, −1}
Model: f(x | w) = sign(w^T x − b)

• Initialize w^1 = 0, b^1 = 0
• For t = 1, …:
  – Receive example (x, y), going through the training set in arbitrary order (e.g., randomly)
  – If f(x | w^t, b^t) = y: [w^{t+1}, b^{t+1}] = [w^t, b^t]
  – Else: w^{t+1} = w^t + y x and b^{t+1} = b^t − y
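In Python, the update rule above can be sketched in a few lines (a minimal sketch; the toy dataset, the epoch cap, and treating a score of exactly 0 as predicting −1 are assumptions, not from the slides):

```python
def perceptron(S, epochs=100):
    """Perceptron learning for f(x | w, b) = sign(w.x - b)."""
    D = len(S[0][0])
    w, b = [0.0] * D, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in S:  # arbitrary (here: fixed) order through the training set
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) - b > 0 else -1
            if pred != y:  # misclassified: w += y*x, b -= y
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b = b - y
                mistakes += 1
        if mistakes == 0:  # a mistake-free pass: converged
            break
    return w, b

# Toy linearly separable data (labels follow sign(x1 - 2), by construction)
S = [([1.0, 0.0], -1), ([3.0, 1.0], +1), ([0.5, 2.0], -1), ([4.0, -1.0], +1)]
w, b = perceptron(S)
```

On separable data like this, the loop terminates with every training example classified correctly.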
Aside: Hyperplane Distance

• A line is 1-D, a plane is 2-D; a hyperplane generalizes to many dimensions (and includes lines and planes)
• Defined by (w, b)
• Distance from x to the hyperplane: |w^T x − b| / ‖w‖
• Signed distance: (w^T x − b) / ‖w‖
• The hyperplane sits at distance b/‖w‖ from the origin
• Linear model = un-normalized signed distance!
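The signed-distance formula can be checked numerically (a small sketch; the example point and hyperplane are made up for illustration):

```python
import math

def signed_distance(x, w, b):
    """Signed distance from point x to the hyperplane w.x - b = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) - b) / norm_w

# For w = [3, 4], b = 5: ||w|| = 5, and x = [3, 4] has w.x - b = 20,
# so its signed distance is 4; the origin has signed distance -b/||w|| = -1.
d = signed_distance([3.0, 4.0], [3.0, 4.0], 5.0)
```

The un-normalized score w^T x − b has the same sign as the signed distance, which is why the linear classifier can use it directly.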
Perceptron Learning

[Animation over several slides: the algorithm sweeps through the training set. Each correctly classified point leaves the model unchanged ("Correct!"); each misclassified point triggers an update that shifts the decision boundary ("Misclassified!" → "Update!"). Eventually: all training examples correctly classified!]

Perceptron Learning: Start Again

[Animation: rerunning from scratch, the same process again ends with all training examples correctly classified, but through a different sequence of updates, yielding a different final model.]
Recap: Perceptron Learning Algorithm (Linear Classification Model)

Training Set: S = {(x_i, y_i)}_{i=1}^N, with y ∈ {+1, −1}
Model: f(x | w) = sign(w^T x − b)

• Initialize w^1 = 0, b^1 = 0
• For t = 1, …:
  – Receive example (x, y), going through the training set in arbitrary order (e.g., randomly)
  – If f(x | w^t, b^t) = y: [w^{t+1}, b^{t+1}] = [w^t, b^t]
  – Else: w^{t+1} = w^t + y x and b^{t+1} = b^t − y
Comparing the Two Models

[Figure: the two decision boundaries found by the two runs; both are mistake-free on the training set.]

Convergence to mistake-free = linearly separable!
Linear Separability

• A classification problem is linearly separable if there exists a w with perfect classification accuracy
• Separable with margin γ:
  γ = max_w min_{(x,y)} y (w^T x) / ‖w‖
• Linearly separable: γ > 0
Perceptron Mistake Bound

**If linearly separable**, the number of mistakes is bounded by R^2 / γ^2, where:
• γ is the margin
• R = max_x ‖x‖ is the "radius" of the feature space

Holds for any ordering of the training examples!

More details: http://www.cs.nyu.edu/~mohri/pub/pmb.pdf
In the Real World…

• Most problems are NOT linearly separable!
• So the perceptron may never converge…
• What to do?
• Use a validation set!
Early Stopping via Validation

• Run perceptron learning on the training set
• Evaluate the current model on the validation set
• Terminate when validation accuracy stops improving

https://en.wikipedia.org/wiki/Early_stopping
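The three steps above can be sketched by running one perceptron pass per epoch and keeping whichever weights score best on the validation set (a sketch; the toy data and the `patience` threshold are assumptions, not from the slides):

```python
def validation_accuracy(w, b, V):
    correct = 0
    for x, y in V:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) - b > 0 else -1
        correct += (pred == y)
    return correct / len(V)

def perceptron_early_stopping(S, V, max_epochs=50, patience=5):
    """One perceptron pass per epoch; keep the weights with the best validation
    accuracy, and stop once it hasn't improved for `patience` epochs."""
    D = len(S[0][0])
    w, b = [0.0] * D, 0.0
    best = (validation_accuracy(w, b, V), list(w), b)
    stale = 0
    for _ in range(max_epochs):
        for x, y in S:  # one training pass
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) - b > 0 else -1
            if pred != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b -= y
        acc = validation_accuracy(w, b, V)
        if acc > best[0]:
            best, stale = (acc, list(w), b), 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best

# Toy split: both sets follow the rule sign(x1 - 2)
S = [([1.0, 0.0], -1), ([3.0, 1.0], +1), ([0.5, 2.0], -1), ([4.0, -1.0], +1)]
V = [([2.5, 0.0], +1), ([1.5, 0.0], -1)]
best_acc, w, b = perceptron_early_stopping(S, V)
```

Returning the best-so-far weights (rather than the last ones) is what makes this early *stopping* rather than just a termination test.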
Online Learning vs. Batch Learning

• Online learning:
  – Receive a stream of data (x, y)
  – Make incremental updates (typically)
  – Perceptron learning is an instance of online learning
• Batch learning:
  – Given all the data up front
  – Can use online learning algorithms for batch learning, e.g., by streaming the data to the learning algorithm

https://en.wikipedia.org/wiki/Online_machine_learning
Recap: Perceptron

• One of the first machine learning algorithms
• Benefits:
  – Simple and fast
  – Clean analysis
• Drawbacks:
  – Might not converge to a very good model
  – What is the objective function?
(Stochastic) Gradient Descent

Back to Optimizing Objective Functions

• Training Data: S = {(x_i, y_i)}_{i=1}^N, with x ∈ R^D and y ∈ {−1, +1}
• Model Class: f(x | w, b) = w^T x − b (linear models)
• Loss Function: L(a, b) = (a − b)^2 (squared loss)
• Learning Objective: argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b)) (an optimization problem)
Back to Optimizing Objective Functions

• Typically requires an optimization algorithm.
• Simplest: gradient descent.
• This lecture: stick with the squared loss (various loss functions next lecture).

argmin_{w,b} L(w, b) ≡ Σ_{i=1}^N L(y_i, f(x_i | w, b))
Gradient Review for Squared Loss

With L(a, b) = (a − b)^2 and f(x | w, b) = w^T x − b:

∂_w L(w, b) = ∂_w Σ_{i=1}^N L(y_i, f(x_i | w, b))
            = Σ_{i=1}^N ∂_w L(y_i, f(x_i | w, b))                       (linearity of differentiation)
            = Σ_{i=1}^N −2 (y_i − f(x_i | w, b)) ∂_w f(x_i | w, b)      (chain rule)
            = Σ_{i=1}^N −2 (y_i − f(x_i | w, b)) x_i
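The final expression can be sanity-checked against a finite-difference approximation of the derivative (a sketch; the toy data and the point at which we differentiate are assumptions):

```python
def f(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) - b

def loss(S, w, b):
    return sum((y - f(x, w, b)) ** 2 for x, y in S)

def grad_w(S, w, b):
    """Analytic gradient: sum_i -2 * (y_i - f(x_i | w, b)) * x_i."""
    g = [0.0] * len(w)
    for x, y in S:
        r = -2.0 * (y - f(x, w, b))
        g = [gj + r * xj for gj, xj in zip(g, x)]
    return g

# Central finite difference in the first coordinate of w
S = [([1.0, 2.0], 1.0), ([0.5, -1.0], -1.0)]
w, b, eps = [0.3, -0.2], 0.1, 1e-6
num = (loss(S, [w[0] + eps, w[1]], b) - loss(S, [w[0] - eps, w[1]], b)) / (2 * eps)
ana = grad_w(S, w, b)[0]
```

Because the loss is quadratic in w, the central difference matches the analytic gradient up to floating-point error.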
Gradient Descent

• Initialize: w^1 = 0, b^1 = 0
• For t = 1, …:
  w^{t+1} = w^t − η^{t+1} ∂_w L(w^t, b^t)
  b^{t+1} = b^t − η^{t+1} ∂_b L(w^t, b^t)
  where η is the "step size"
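For the squared loss, the update loop above might look like this (a sketch; the 1-D toy data, step size, and iteration count are assumptions):

```python
def gradient_descent(S, eta=0.05, iters=2000):
    """Full-batch gradient descent on the squared loss for f(x | w, b) = w.x - b."""
    D = len(S[0][0])
    w, b = [0.0] * D, 0.0
    for _ in range(iters):
        gw, gb = [0.0] * D, 0.0
        for x, y in S:
            r = y - (sum(wi * xi for wi, xi in zip(w, x)) - b)
            gw = [gj - 2.0 * r * xj for gj, xj in zip(gw, x)]  # d/dw = -2 r x
            gb += 2.0 * r                                      # d/db = +2 r
        w = [wi - eta * gj for wi, gj in zip(w, gw)]
        b = b - eta * gb
    return w, b

# Toy data generated exactly from y = 2*x1 - 1, so the optimum is w = 2, b = 1
S = [([0.0], -1.0), ([1.0], 1.0), ([2.0], 3.0)]
w, b = gradient_descent(S)
```

Since the squared loss is a convex quadratic, this loop converges to the least-squares solution for any sufficiently small step size.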
How to Choose Step Size?

[Plot frames: gradient descent on L(w) = (1 − w)^2, so ∂_w L(w) = −2(1 − w), minimized at w = 1.]
• With η = 1, each step overshoots the minimum by exactly as much as it started with: the iterates oscillate infinitely!
• With η = 0.0001, each step barely moves: it takes a really long time!
How to Choose Step Size?

[Plot: training loss vs. iterations for step sizes 0.001, 0.01, 0.1, and 0.5. Note that the absolute scale is not meaningful; focus on the relative magnitude differences.]

As large as possible! (Without diverging.)
Being Scale Invariant

• Consider the following two gradient updates:
  w^{t+1} = w^t − η^{t+1} ∂_w L(w^t, b^t)
  w^{t+1} = w^t − η̂^{t+1} ∂_w L̂(w^t, b^t)
• Suppose L̂ = 1000 L. How are the two step sizes related?
  η̂^{t+1} = η^{t+1} / 1000
Practical Rules of Thumb

• Divide the loss function by the number of examples:
  w^{t+1} = w^t − (η^{t+1} / N) ∂_w L(w^t, b^t)
• Start with a large step size:
  – If the loss plateaus, divide the step size by 2
  – (Can also use advanced optimization methods)
  – (The step size must decrease over time to guarantee convergence to the global optimum)
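The halve-on-plateau heuristic can be sketched on the slides' own 1-D example L(w) = (1 − w)^2 (the starting point, iteration cap, and plateau test are assumptions):

```python
def grad(w):                      # L(w) = (1 - w)^2, the slides' example
    return -2.0 * (1.0 - w)

def loss(w):
    return (1.0 - w) ** 2

w, eta = 0.0, 1.0                 # deliberately start with a too-large step size
prev = loss(w)
for _ in range(200):
    w_new = w - eta * grad(w)
    if loss(w_new) >= prev:       # no progress (plateau/oscillation): halve eta
        eta /= 2.0
        continue                  # retry the step with the smaller step size
    w, prev = w_new, loss(w_new)
```

Here η = 1 would oscillate forever, but the first rejected step triggers a halving to η = 0.5, after which the iterate jumps straight to the minimum w = 1.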
Aside: Convexity

[Figure: a convex function, where every chord lies above the graph, next to a function that is not convex. Image source: http://en.wikipedia.org/wiki/Convex_function]

• Easy to find global optima!
• Strictly convex if the second derivative is always > 0
Aside: Convexity

L(x_2) ≥ L(x_1) + ∇L(x_1)^T (x_2 − x_1)

The function is always above the locally linear extrapolation.
Aside: Convexity

• All local optima are global optima: gradient descent will find an optimum (assuming the step size is chosen safely)
• Strictly convex: unique global optimum
• Almost all standard objectives are (strictly) convex:
  – Squared loss, SVMs, logistic regression, ridge, lasso
  – We will see non-convex objectives later (e.g., deep learning)
Convergence

• Assume L is convex.
• How many iterations to achieve L(w) − L(w*) ≤ ε?
• If |L(a) − L(b)| ≤ ρ ‖a − b‖ (L is "ρ-Lipschitz"):
  – Then O(1/ε^2) iterations
• If ‖∇L(a) − ∇L(b)‖ ≤ ρ ‖a − b‖ (L is "ρ-smooth"):
  – Then O(1/ε) iterations
• If L(a) ≥ L(b) + ∇L(b)^T (a − b) + (ρ/2) ‖a − b‖^2 (L is "ρ-strongly convex"):
  – Then O(log(1/ε)) iterations

More details: Bubeck textbook, Chapter 3
Convergence

• In general, it takes infinite time to reach the global optimum.
• But in general, we don't care!
  – As long as we're close enough to the global optimum

[Plot: loss vs. iterations for step sizes 0.001, 0.01, 0.1, and 0.5. How do we know if we're here (nearly flat, close to the optimum) and not here (still descending)?]
When to Stop?

• Convergence analyses = worst-case upper bounds. What to do in practice?
• Stop when progress is sufficiently small
  – E.g., relative reduction less than 0.001
• Stop after a pre-specified number of iterations
  – E.g., 100,000
• Stop when the validation error stops going down
Limitation of Gradient Descent

• Requires a full pass over the training set per iteration:
  ∂_w L(w, b | S) = ∂_w Σ_{i=1}^N L(y_i, f(x_i | w, b))
• Very expensive if the training set is huge
• Do we need to do a full pass over the data?
Stochastic Gradient Descent

• Suppose the loss function decomposes additively:
  L(w, b) = (1/N) Σ_{i=1}^N L_i(w, b) = E_i[ L_i(w, b) ]
  where each L_i corresponds to a single data point:
  L_i(w, b) ≡ (y_i − f(x_i | w, b))^2
• The gradient is then the expected gradient of the sub-functions:
  ∂_w L(w, b) = ∂_w E_i[ L_i(w, b) ] = E_i[ ∂_w L_i(w, b) ]
Stochastic Gradient Descent

• It suffices to take a random gradient update, so long as it matches the true gradient in expectation.
• Each iteration t:
  – Choose i at random
  – w^{t+1} = w^t − η^{t+1} ∂_w L_i(w^t, b^t)
  – b^{t+1} = b^t − η^{t+1} ∂_b L_i(w^t, b^t)
  (the expected value of ∂_w L_i(w, b) is ∂_w L(w, b))
• SGD is an online learning algorithm!
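The per-iteration update above can be sketched as follows (the toy data, step size, and iteration count are assumptions; on this noiseless data a constant step size settles near the least-squares solution):

```python
import random

def sgd(S, eta=0.05, iters=2000, seed=0):
    """SGD on the per-example squared loss L_i(w, b) = (y_i - (w.x_i - b))^2."""
    rng = random.Random(seed)
    D = len(S[0][0])
    w, b = [0.0] * D, 0.0
    for _ in range(iters):
        x, y = rng.choice(S)              # choose i at random
        r = y - (sum(wi * xi for wi, xi in zip(w, x)) - b)
        w = [wi + eta * 2.0 * r * xi for wi, xi in zip(w, x)]  # w - eta*(-2 r x)
        b = b - eta * 2.0 * r                                  # b - eta*(+2 r)
    return w, b

# Same toy data as before: generated exactly from y = 2*x1 - 1
S = [([0.0], -1.0), ([1.0], 1.0), ([2.0], 3.0)]
w, b = sgd(S)
```

Each update only touches one example, so the cost per iteration is independent of N.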
Mini-Batch SGD

• Each L_i is a small batch of training examples:
  – E.g., 500 to 1000 examples
  – Can leverage vector operations
  – Decreases the volatility of gradient updates
• Industry state of the art:
  – Everyone uses mini-batch SGD
  – Often parallelized (e.g., different cores work on different mini-batches)
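A mini-batch version averages the gradient over a few sampled examples before each update (a sketch; the tiny batch size here is only for illustration on the toy data, where real batches would be hundreds of examples):

```python
import random

def minibatch_sgd(S, batch_size=2, eta=0.05, iters=1500, seed=0):
    """SGD where each L_i averages the squared loss over a small random batch."""
    rng = random.Random(seed)
    D = len(S[0][0])
    w, b = [0.0] * D, 0.0
    for _ in range(iters):
        batch = rng.sample(S, batch_size)
        gw, gb = [0.0] * D, 0.0
        for x, y in batch:
            r = y - (sum(wi * xi for wi, xi in zip(w, x)) - b)
            gw = [gj - 2.0 * r * xj for gj, xj in zip(gw, x)]
            gb += 2.0 * r
        # average over the batch: a lower-variance estimate of the true gradient
        w = [wi - eta * gj / batch_size for wi, gj in zip(w, gw)]
        b = b - eta * gb / batch_size
    return w, b

S = [([0.0], -1.0), ([1.0], 1.0), ([2.0], 3.0)]
w, b = minibatch_sgd(S)
```

In practice the inner loop would be a single vectorized matrix operation, which is where the speed advantage comes from.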
Checking for Convergence

• How to check for convergence?
  – Evaluating the loss on the entire training set seems expensive…

[Plot: loss vs. iterations for step sizes 0.001, 0.01, 0.1, and 0.5.]
Checking for Convergence

• How to check for convergence?
  – Evaluating the loss on the entire training set seems expensive…
• Don't check after every iteration
  – E.g., check every 1000 iterations
• Evaluate the loss on a subset of the training data
  – E.g., the previous 5000 examples
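One way to sketch the two ideas above: reuse the per-example losses SGD already computes, keep a sliding window of the most recent ones, and test for convergence only every `window` iterations (the window size, tolerance, and toy data are all assumptions, not from the slides):

```python
import random
from collections import deque

def sgd_until_flat(S, eta=0.05, window=200, tol=1e-4, max_iters=20000, seed=0):
    """SGD that stops when the mean loss over the last `window` examples
    stops improving; the check itself runs only every `window` iterations."""
    rng = random.Random(seed)
    D = len(S[0][0])
    w, b = [0.0] * D, 0.0
    recent = deque(maxlen=window)         # losses on the most recent examples
    prev_mean = float("inf")
    for t in range(1, max_iters + 1):
        x, y = rng.choice(S)
        r = y - (sum(wi * xi for wi, xi in zip(w, x)) - b)
        recent.append(r * r)              # this example's loss, before updating
        w = [wi + eta * 2.0 * r * xi for wi, xi in zip(w, x)]
        b = b - eta * 2.0 * r
        if t % window == 0:               # don't check after every iteration
            mean = sum(recent) / len(recent)
            if prev_mean - mean < tol:    # progress is sufficiently small
                break
            prev_mean = mean
    return w, b, t

S = [([0.0], -1.0), ([1.0], 1.0), ([2.0], 3.0)]
w, b, t = sgd_until_flat(S)
```

The windowed losses are free (they are byproducts of the updates), so the convergence check adds almost no cost.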
Recap: Stochastic Gradient Descent

• Conceptually:
  – Decompose the loss function additively
  – Choose a component randomly
  – Take a gradient update
• Benefits:
  – Avoids iterating over the entire dataset for every update
  – The gradient update is consistent (in expectation)
• Industry standard
Perceptron Revisited (What is the Objective Function?)

Training Set: S = {(x_i, y_i)}_{i=1}^N, with y ∈ {+1, −1}
Model: f(x | w) = sign(w^T x − b)

• Initialize w^1 = 0, b^1 = 0
• For t = 1, …:
  – Receive example (x, y), going through the training set in arbitrary order (e.g., randomly)
  – If f(x | w^t, b^t) = y: [w^{t+1}, b^{t+1}] = [w^t, b^t]
  – Else: w^{t+1} = w^t + y x and b^{t+1} = b^t − y
Perceptron (Implicit) Objective

L_i(w, b) = max{ 0, −y_i f(x_i | w, b) }

[Plot: loss vs. y·f(x); the loss is 0 when y·f(x) > 0 and grows linearly as y·f(x) becomes more negative.]
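This objective makes the perceptron an instance of SGD: with step size 1, a (sub)gradient step on L_i reproduces the perceptron update exactly (a sketch; treating a score of exactly 0 as misclassified is an assumption, matching how the update fires at w = 0):

```python
def perceptron_subgradient(x, y, w, b):
    """Subgradient of L_i(w, b) = max{0, -y * (w.x - b)} at (w, b)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) - b
    if y * score > 0:                  # correctly classified: zero loss, zero gradient
        return [0.0] * len(w), 0.0
    return [-y * xi for xi in x], y    # d/dw = -y x,  d/db = +y

def sgd_step(x, y, w, b, eta=1.0):
    gw, gb = perceptron_subgradient(x, y, w, b)
    return [wi - eta * gi for wi, gi in zip(w, gw)], b - eta * gb

# A misclassified example at w = 0, b = 0: the SGD step gives
# w <- w + y*x and b <- b - y, exactly the perceptron update.
w, b = sgd_step([2.0, 1.0], +1, [0.0, 0.0], 0.0)
```

So the "implicit" objective answers the recap's open question: the perceptron is stochastic subgradient descent on this hinge-at-zero loss.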
Recap: Complete Pipeline

Training Data: S = {(x_i, y_i)}_{i=1}^N
→ Model Class(es): f(x | w, b) = w^T x − b
→ Loss Function: L(a, b) = (a − b)^2
→ Optimization: argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b)) (use SGD!)
→ Cross Validation & Model Selection → Profit!
Next Week

• Different loss functions:
  – Hinge loss (SVM)
  – Log loss (logistic regression)
• Non-linear model classes:
  – Neural nets
• Regularization
• Next Thursday recitation:
  – Linear algebra & calculus