Machine Learning & Data Mining (CMS/CS/CNS/EE 155)
Lecture 2: Perceptron & Stochastic Gradient Descent
Recap: Basic Recipe (Supervised Learning)

• Training Data: S = {(x_i, y_i)}_{i=1}^N, with x ∈ R^D and y ∈ {−1, +1}
• Model Class: f(x | w, b) = w^T x − b (linear models)
• Loss Function: L(a, b) = (a − b)^2 (squared loss)
• Learning Objective: argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b)) (an optimization problem)
Recap: Bias-Variance Trade-off

[Figure: three pairs of plots, each pair labeled with its bias and variance, illustrating the trade-off across model classes.]
Recap: Complete Pipeline

Training Data: S = {(x_i, y_i)}_{i=1}^N
→ Model Class(es): f(x | w, b) = w^T x − b
→ Loss Function: L(a, b) = (a − b)^2
→ Optimization: argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b))
→ Cross Validation & Model Selection → Profit!
Today

• Two basic learning algorithms:
  – Perceptron algorithm
  – (Stochastic) gradient descent, i.e., actually solving the optimization problem
The Perceptron

• One of the earliest learning algorithms (1957, Frank Rosenblatt)
• Still a great algorithm:
  – Fast
  – Clean analysis
  – Precursor to neural networks

[Photo: Frank Rosenblatt with the Mark I Perceptron machine]
Perceptron Learning Algorithm (Linear Classification Model)

Training Set: S = {(x_i, y_i)}_{i=1}^N, with y ∈ {+1, −1}
Model: f(x | w) = sign(w^T x − b)

• Initialize w^1 = 0, b^1 = 0
• For t = 1, …:
  – Receive example (x, y), going through the training set in arbitrary order (e.g., randomly)
  – If f(x | w^t, b^t) = y: [w^{t+1}, b^{t+1}] = [w^t, b^t]
  – Else: w^{t+1} = w^t + y x and b^{t+1} = b^t − y
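In Python, the update rule above can be sketched in a few lines (a minimal sketch; the toy dataset, the epoch cap, and treating a score of exactly 0 as predicting −1 are assumptions, not from the slides):

```python
def perceptron(S, epochs=100):
    """Perceptron learning for f(x | w, b) = sign(w.x - b)."""
    D = len(S[0][0])
    w, b = [0.0] * D, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in S:  # arbitrary (here: fixed) order through the training set
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) - b > 0 else -1
            if pred != y:  # misclassified: w += y*x, b -= y
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b = b - y
                mistakes += 1
        if mistakes == 0:  # a mistake-free pass: converged
            break
    return w, b

# Toy linearly separable data (labels follow sign(x1 - 2), by construction)
S = [([1.0, 0.0], -1), ([3.0, 1.0], +1), ([0.5, 2.0], -1), ([4.0, -1.0], +1)]
w, b = perceptron(S)
```

On separable data like this, the loop terminates with every training example classified correctly.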
Aside: Hyperplane Distance

• A line is 1-D, a plane is 2-D; a hyperplane generalizes to many dimensions (and includes lines and planes)
• Defined by (w, b)
• Distance from x to the hyperplane: |w^T x − b| / ‖w‖
• Signed distance: (w^T x − b) / ‖w‖
• The hyperplane sits at distance b/‖w‖ from the origin
• Linear model = un-normalized signed distance!
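The signed-distance formula can be checked numerically (a small sketch; the example point and hyperplane are made up for illustration):

```python
import math

def signed_distance(x, w, b):
    """Signed distance from point x to the hyperplane w.x - b = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) - b) / norm_w

# For w = [3, 4], b = 5: ||w|| = 5, and x = [3, 4] has w.x - b = 20,
# so its signed distance is 4; the origin has signed distance -b/||w|| = -1.
d = signed_distance([3.0, 4.0], [3.0, 4.0], 5.0)
```

The un-normalized score w^T x − b has the same sign as the signed distance, which is why the linear classifier can use it directly.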
Perceptron Learning

[Animation over several slides: the algorithm sweeps through the training set. Each correctly classified point leaves the model unchanged ("Correct!"); each misclassified point triggers an update that shifts the decision boundary ("Misclassified!" → "Update!"). Eventually: all training examples correctly classified!]

Perceptron Learning: Start Again

[Animation: rerunning from scratch, the same process again ends with all training examples correctly classified, but through a different sequence of updates, yielding a different final model.]
Recap: Perceptron Learning Algorithm (Linear Classification Model)

Training Set: S = {(x_i, y_i)}_{i=1}^N, with y ∈ {+1, −1}
Model: f(x | w) = sign(w^T x − b)

• Initialize w^1 = 0, b^1 = 0
• For t = 1, …:
  – Receive example (x, y), going through the training set in arbitrary order (e.g., randomly)
  – If f(x | w^t, b^t) = y: [w^{t+1}, b^{t+1}] = [w^t, b^t]
  – Else: w^{t+1} = w^t + y x and b^{t+1} = b^t − y
Comparing the Two Models

[Figure: the two decision boundaries found by the two runs; both are mistake-free on the training set.]

Convergence to mistake-free = linearly separable!
Linear Separability

• A classification problem is linearly separable if there exists a w with perfect classification accuracy
• Separable with margin γ:
  γ = max_w min_{(x,y)} y (w^T x) / ‖w‖
• Linearly separable: γ > 0
Perceptron Mistake Bound

**If linearly separable**, the number of mistakes is bounded by R^2 / γ^2, where:
• γ is the margin
• R = max_x ‖x‖ is the "radius" of the feature space

Holds for any ordering of the training examples!

More details: http://www.cs.nyu.edu/~mohri/pub/pmb.pdf
In the Real World…

• Most problems are NOT linearly separable!
• So the perceptron may never converge…
• What to do?
• Use a validation set!
Early Stopping via Validation

• Run perceptron learning on the training set
• Evaluate the current model on the validation set
• Terminate when validation accuracy stops improving

https://en.wikipedia.org/wiki/Early_stopping
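The three steps above can be sketched by running one perceptron pass per epoch and keeping whichever weights score best on the validation set (a sketch; the toy data and the `patience` threshold are assumptions, not from the slides):

```python
def validation_accuracy(w, b, V):
    correct = 0
    for x, y in V:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) - b > 0 else -1
        correct += (pred == y)
    return correct / len(V)

def perceptron_early_stopping(S, V, max_epochs=50, patience=5):
    """One perceptron pass per epoch; keep the weights with the best validation
    accuracy, and stop once it hasn't improved for `patience` epochs."""
    D = len(S[0][0])
    w, b = [0.0] * D, 0.0
    best = (validation_accuracy(w, b, V), list(w), b)
    stale = 0
    for _ in range(max_epochs):
        for x, y in S:  # one training pass
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) - b > 0 else -1
            if pred != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b -= y
        acc = validation_accuracy(w, b, V)
        if acc > best[0]:
            best, stale = (acc, list(w), b), 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best

# Toy split: both sets follow the rule sign(x1 - 2)
S = [([1.0, 0.0], -1), ([3.0, 1.0], +1), ([0.5, 2.0], -1), ([4.0, -1.0], +1)]
V = [([2.5, 0.0], +1), ([1.5, 0.0], -1)]
best_acc, w, b = perceptron_early_stopping(S, V)
```

Returning the best-so-far weights (rather than the last ones) is what makes this early *stopping* rather than just a termination test.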
Online Learning vs. Batch Learning

• Online learning:
  – Receive a stream of data (x, y)
  – Make incremental updates (typically)
  – Perceptron learning is an instance of online learning
• Batch learning:
  – Given all the data up front
  – Can use online learning algorithms for batch learning, e.g., by streaming the data to the learning algorithm

https://en.wikipedia.org/wiki/Online_machine_learning
Recap: Perceptron

• One of the first machine learning algorithms
• Benefits:
  – Simple and fast
  – Clean analysis
• Drawbacks:
  – Might not converge to a very good model
  – What is the objective function?
(Stochastic) Gradient Descent

Back to Optimizing Objective Functions

• Training Data: S = {(x_i, y_i)}_{i=1}^N, with x ∈ R^D and y ∈ {−1, +1}
• Model Class: f(x | w, b) = w^T x − b (linear models)
• Loss Function: L(a, b) = (a − b)^2 (squared loss)
• Learning Objective: argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b)) (an optimization problem)
Back to Optimizing Objective Functions

• Typically requires an optimization algorithm.
• Simplest: gradient descent.
• This lecture: stick with the squared loss (various loss functions next lecture).

argmin_{w,b} L(w, b) ≡ Σ_{i=1}^N L(y_i, f(x_i | w, b))
Gradient Review for Squared Loss

With L(a, b) = (a − b)^2 and f(x | w, b) = w^T x − b:

∂_w L(w, b) = ∂_w Σ_{i=1}^N L(y_i, f(x_i | w, b))
            = Σ_{i=1}^N ∂_w L(y_i, f(x_i | w, b))                       (linearity of differentiation)
            = Σ_{i=1}^N −2 (y_i − f(x_i | w, b)) ∂_w f(x_i | w, b)      (chain rule)
            = Σ_{i=1}^N −2 (y_i − f(x_i | w, b)) x_i
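The final expression can be sanity-checked against a finite-difference approximation of the derivative (a sketch; the toy data and the point at which we differentiate are assumptions):

```python
def f(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) - b

def loss(S, w, b):
    return sum((y - f(x, w, b)) ** 2 for x, y in S)

def grad_w(S, w, b):
    """Analytic gradient: sum_i -2 * (y_i - f(x_i | w, b)) * x_i."""
    g = [0.0] * len(w)
    for x, y in S:
        r = -2.0 * (y - f(x, w, b))
        g = [gj + r * xj for gj, xj in zip(g, x)]
    return g

# Central finite difference in the first coordinate of w
S = [([1.0, 2.0], 1.0), ([0.5, -1.0], -1.0)]
w, b, eps = [0.3, -0.2], 0.1, 1e-6
num = (loss(S, [w[0] + eps, w[1]], b) - loss(S, [w[0] - eps, w[1]], b)) / (2 * eps)
ana = grad_w(S, w, b)[0]
```

Because the loss is quadratic in w, the central difference matches the analytic gradient up to floating-point error.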
Gradient Descent

• Initialize: w^1 = 0, b^1 = 0
• For t = 1, …:
  w^{t+1} = w^t − η^{t+1} ∂_w L(w^t, b^t)
  b^{t+1} = b^t − η^{t+1} ∂_b L(w^t, b^t)
  where η is the "step size"
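For the squared loss, the update loop above might look like this (a sketch; the 1-D toy data, step size, and iteration count are assumptions):

```python
def gradient_descent(S, eta=0.05, iters=2000):
    """Full-batch gradient descent on the squared loss for f(x | w, b) = w.x - b."""
    D = len(S[0][0])
    w, b = [0.0] * D, 0.0
    for _ in range(iters):
        gw, gb = [0.0] * D, 0.0
        for x, y in S:
            r = y - (sum(wi * xi for wi, xi in zip(w, x)) - b)
            gw = [gj - 2.0 * r * xj for gj, xj in zip(gw, x)]  # d/dw = -2 r x
            gb += 2.0 * r                                      # d/db = +2 r
        w = [wi - eta * gj for wi, gj in zip(w, gw)]
        b = b - eta * gb
    return w, b

# Toy data generated exactly from y = 2*x1 - 1, so the optimum is w = 2, b = 1
S = [([0.0], -1.0), ([1.0], 1.0), ([2.0], 3.0)]
w, b = gradient_descent(S)
```

Since the squared loss is a convex quadratic, this loop converges to the least-squares solution for any sufficiently small step size.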
How to Choose Step Size?

[Plot frames: gradient descent on L(w) = (1 − w)^2, so ∂_w L(w) = −2(1 − w), minimized at w = 1.]
• With η = 1, each step overshoots the minimum by exactly as much as it started with: the iterates oscillate infinitely!
• With η = 0.0001, each step barely moves: it takes a really long time!
How to Choose Step Size?

[Plot: training loss vs. iterations for step sizes 0.001, 0.01, 0.1, and 0.5. Note that the absolute scale is not meaningful; focus on the relative magnitude differences.]

As large as possible! (Without diverging.)
Being Scale Invariant

• Consider the following two gradient updates:
  w^{t+1} = w^t − η^{t+1} ∂_w L(w^t, b^t)
  w^{t+1} = w^t − η̂^{t+1} ∂_w L̂(w^t, b^t)
• Suppose L̂ = 1000 L. How are the two step sizes related?
  η̂^{t+1} = η^{t+1} / 1000
Practical Rules of Thumb

• Divide the loss function by the number of examples:
  w^{t+1} = w^t − (η^{t+1} / N) ∂_w L(w^t, b^t)
• Start with a large step size:
  – If the loss plateaus, divide the step size by 2
  – (Can also use advanced optimization methods)
  – (The step size must decrease over time to guarantee convergence to the global optimum)
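The halve-on-plateau heuristic can be sketched on the slides' own 1-D example L(w) = (1 − w)^2 (the starting point, iteration cap, and plateau test are assumptions):

```python
def grad(w):                      # L(w) = (1 - w)^2, the slides' example
    return -2.0 * (1.0 - w)

def loss(w):
    return (1.0 - w) ** 2

w, eta = 0.0, 1.0                 # deliberately start with a too-large step size
prev = loss(w)
for _ in range(200):
    w_new = w - eta * grad(w)
    if loss(w_new) >= prev:       # no progress (plateau/oscillation): halve eta
        eta /= 2.0
        continue                  # retry the step with the smaller step size
    w, prev = w_new, loss(w_new)
```

Here η = 1 would oscillate forever, but the first rejected step triggers a halving to η = 0.5, after which the iterate jumps straight to the minimum w = 1.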
Aside: Convexity

[Figure: a convex function, where every chord lies above the graph, next to a function that is not convex. Image source: http://en.wikipedia.org/wiki/Convex_function]

• Easy to find global optima!
• Strictly convex if the second derivative is always > 0
Aside: Convexity

L(x_2) ≥ L(x_1) + ∇L(x_1)^T (x_2 − x_1)

The function is always above the locally linear extrapolation.
Aside: Convexity

• All local optima are global optima: gradient descent will find an optimum (assuming the step size is chosen safely)
• Strictly convex: unique global optimum
• Almost all standard objectives are (strictly) convex:
  – Squared loss, SVMs, logistic regression, ridge, lasso
  – We will see non-convex objectives later (e.g., deep learning)
Convergence

• Assume L is convex.
• How many iterations to achieve L(w) − L(w*) ≤ ε?
• If |L(a) − L(b)| ≤ ρ ‖a − b‖ (L is "ρ-Lipschitz"):
  – Then O(1/ε^2) iterations
• If ‖∇L(a) − ∇L(b)‖ ≤ ρ ‖a − b‖ (L is "ρ-smooth"):
  – Then O(1/ε) iterations
• If L(a) ≥ L(b) + ∇L(b)^T (a − b) + (ρ/2) ‖a − b‖^2 (L is "ρ-strongly convex"):
  – Then O(log(1/ε)) iterations

More details: Bubeck textbook, Chapter 3
Convergence

• In general, it takes infinite time to reach the global optimum.
• But in general, we don't care!
  – As long as we're close enough to the global optimum

[Plot: loss vs. iterations for step sizes 0.001, 0.01, 0.1, and 0.5. How do we know if we're here (nearly flat, close to the optimum) and not here (still descending)?]
When to Stop?

• Convergence analyses = worst-case upper bounds. What to do in practice?
• Stop when progress is sufficiently small
  – E.g., relative reduction less than 0.001
• Stop after a pre-specified number of iterations
  – E.g., 100,000
• Stop when the validation error stops going down
Limitation of Gradient Descent

• Requires a full pass over the training set per iteration:
  ∂_w L(w, b | S) = ∂_w Σ_{i=1}^N L(y_i, f(x_i | w, b))
• Very expensive if the training set is huge
• Do we need to do a full pass over the data?
Stochastic Gradient Descent

• Suppose the loss function decomposes additively:
  L(w, b) = (1/N) Σ_{i=1}^N L_i(w, b) = E_i[ L_i(w, b) ]
  where each L_i corresponds to a single data point:
  L_i(w, b) ≡ (y_i − f(x_i | w, b))^2
• The gradient is then the expected gradient of the sub-functions:
  ∂_w L(w, b) = ∂_w E_i[ L_i(w, b) ] = E_i[ ∂_w L_i(w, b) ]
Stochastic Gradient Descent

• It suffices to take a random gradient update, so long as it matches the true gradient in expectation.
• Each iteration t:
  – Choose i at random
  – w^{t+1} = w^t − η^{t+1} ∂_w L_i(w^t, b^t)
  – b^{t+1} = b^t − η^{t+1} ∂_b L_i(w^t, b^t)
  (the expected value of ∂_w L_i(w, b) is ∂_w L(w, b))
• SGD is an online learning algorithm!
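The per-iteration update above can be sketched as follows (the toy data, step size, and iteration count are assumptions; on this noiseless data a constant step size settles near the least-squares solution):

```python
import random

def sgd(S, eta=0.05, iters=2000, seed=0):
    """SGD on the per-example squared loss L_i(w, b) = (y_i - (w.x_i - b))^2."""
    rng = random.Random(seed)
    D = len(S[0][0])
    w, b = [0.0] * D, 0.0
    for _ in range(iters):
        x, y = rng.choice(S)              # choose i at random
        r = y - (sum(wi * xi for wi, xi in zip(w, x)) - b)
        w = [wi + eta * 2.0 * r * xi for wi, xi in zip(w, x)]  # w - eta*(-2 r x)
        b = b - eta * 2.0 * r                                  # b - eta*(+2 r)
    return w, b

# Same toy data as before: generated exactly from y = 2*x1 - 1
S = [([0.0], -1.0), ([1.0], 1.0), ([2.0], 3.0)]
w, b = sgd(S)
```

Each update only touches one example, so the cost per iteration is independent of N.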
Mini-Batch SGD

• Each L_i is a small batch of training examples:
  – E.g., 500 to 1000 examples
  – Can leverage vector operations
  – Decreases the volatility of gradient updates
• Industry state of the art:
  – Everyone uses mini-batch SGD
  – Often parallelized (e.g., different cores work on different mini-batches)
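A mini-batch version averages the gradient over a few sampled examples before each update (a sketch; the tiny batch size here is only for illustration on the toy data, where real batches would be hundreds of examples):

```python
import random

def minibatch_sgd(S, batch_size=2, eta=0.05, iters=1500, seed=0):
    """SGD where each L_i averages the squared loss over a small random batch."""
    rng = random.Random(seed)
    D = len(S[0][0])
    w, b = [0.0] * D, 0.0
    for _ in range(iters):
        batch = rng.sample(S, batch_size)
        gw, gb = [0.0] * D, 0.0
        for x, y in batch:
            r = y - (sum(wi * xi for wi, xi in zip(w, x)) - b)
            gw = [gj - 2.0 * r * xj for gj, xj in zip(gw, x)]
            gb += 2.0 * r
        # average over the batch: a lower-variance estimate of the true gradient
        w = [wi - eta * gj / batch_size for wi, gj in zip(w, gw)]
        b = b - eta * gb / batch_size
    return w, b

S = [([0.0], -1.0), ([1.0], 1.0), ([2.0], 3.0)]
w, b = minibatch_sgd(S)
```

In practice the inner loop would be a single vectorized matrix operation, which is where the speed advantage comes from.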
Checking for Convergence

• How to check for convergence?
  – Evaluating the loss on the entire training set seems expensive…

[Plot: loss vs. iterations for step sizes 0.001, 0.01, 0.1, and 0.5.]
Checking for Convergence

• How to check for convergence?
  – Evaluating the loss on the entire training set seems expensive…
• Don't check after every iteration
  – E.g., check every 1000 iterations
• Evaluate the loss on a subset of the training data
  – E.g., the previous 5000 examples
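One way to sketch the two ideas above: reuse the per-example losses SGD already computes, keep a sliding window of the most recent ones, and test for convergence only every `window` iterations (the window size, tolerance, and toy data are all assumptions, not from the slides):

```python
import random
from collections import deque

def sgd_until_flat(S, eta=0.05, window=200, tol=1e-4, max_iters=20000, seed=0):
    """SGD that stops when the mean loss over the last `window` examples
    stops improving; the check itself runs only every `window` iterations."""
    rng = random.Random(seed)
    D = len(S[0][0])
    w, b = [0.0] * D, 0.0
    recent = deque(maxlen=window)         # losses on the most recent examples
    prev_mean = float("inf")
    for t in range(1, max_iters + 1):
        x, y = rng.choice(S)
        r = y - (sum(wi * xi for wi, xi in zip(w, x)) - b)
        recent.append(r * r)              # this example's loss, before updating
        w = [wi + eta * 2.0 * r * xi for wi, xi in zip(w, x)]
        b = b - eta * 2.0 * r
        if t % window == 0:               # don't check after every iteration
            mean = sum(recent) / len(recent)
            if prev_mean - mean < tol:    # progress is sufficiently small
                break
            prev_mean = mean
    return w, b, t

S = [([0.0], -1.0), ([1.0], 1.0), ([2.0], 3.0)]
w, b, t = sgd_until_flat(S)
```

The windowed losses are free (they are byproducts of the updates), so the convergence check adds almost no cost.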
Recap: Stochastic Gradient Descent

• Conceptually:
  – Decompose the loss function additively
  – Choose a component randomly
  – Take a gradient update
• Benefits:
  – Avoids iterating over the entire dataset for every update
  – The gradient update is consistent (in expectation)
• Industry standard
Perceptron Revisited (What is the Objective Function?)

Training Set: S = {(x_i, y_i)}_{i=1}^N, with y ∈ {+1, −1}
Model: f(x | w) = sign(w^T x − b)

• Initialize w^1 = 0, b^1 = 0
• For t = 1, …:
  – Receive example (x, y), going through the training set in arbitrary order (e.g., randomly)
  – If f(x | w^t, b^t) = y: [w^{t+1}, b^{t+1}] = [w^t, b^t]
  – Else: w^{t+1} = w^t + y x and b^{t+1} = b^t − y
Perceptron (Implicit) Objective

L_i(w, b) = max{ 0, −y_i f(x_i | w, b) }

[Plot: loss vs. y·f(x); the loss is 0 when y·f(x) > 0 and grows linearly as y·f(x) becomes more negative.]
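This objective makes the perceptron an instance of SGD: with step size 1, a (sub)gradient step on L_i reproduces the perceptron update exactly (a sketch; treating a score of exactly 0 as misclassified is an assumption, matching how the update fires at w = 0):

```python
def perceptron_subgradient(x, y, w, b):
    """Subgradient of L_i(w, b) = max{0, -y * (w.x - b)} at (w, b)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) - b
    if y * score > 0:                  # correctly classified: zero loss, zero gradient
        return [0.0] * len(w), 0.0
    return [-y * xi for xi in x], y    # d/dw = -y x,  d/db = +y

def sgd_step(x, y, w, b, eta=1.0):
    gw, gb = perceptron_subgradient(x, y, w, b)
    return [wi - eta * gi for wi, gi in zip(w, gw)], b - eta * gb

# A misclassified example at w = 0, b = 0: the SGD step gives
# w <- w + y*x and b <- b - y, exactly the perceptron update.
w, b = sgd_step([2.0, 1.0], +1, [0.0, 0.0], 0.0)
```

So the "implicit" objective answers the recap's open question: the perceptron is stochastic subgradient descent on this hinge-at-zero loss.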
Recap: Complete Pipeline

Training Data: S = {(x_i, y_i)}_{i=1}^N
→ Model Class(es): f(x | w, b) = w^T x − b
→ Loss Function: L(a, b) = (a − b)^2
→ Optimization: argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b)) (use SGD!)
→ Cross Validation & Model Selection → Profit!
Next Week

• Different loss functions:
  – Hinge loss (SVM)
  – Log loss (logistic regression)
• Non-linear model classes:
  – Neural nets
• Regularization
• Next Thursday recitation:
  – Linear algebra & calculus