Machine Learning & Data Mining CMS/CS/CNS/EE 155 Lecture 2: Perceptron & Stochastic Gradient Descent


Page 1

Machine Learning & Data Mining CMS/CS/CNS/EE 155
Lecture 2: Perceptron & Stochastic Gradient Descent

Page 2

Recap: Basic Recipe (supervised)

• Training Data: $S = \{(x_i, y_i)\}_{i=1}^N$, with $x \in \mathbb{R}^D$ and $y \in \{-1, +1\}$
• Model Class: $f(x \mid w, b) = w^T x - b$ (Linear Models)
• Loss Function: $L(a, b) = (a - b)^2$ (Squared Loss)
• Learning Objective: $\underset{w,b}{\operatorname{argmin}} \sum_{i=1}^N L\big(y_i, f(x_i \mid w, b)\big)$ (Optimization Problem)

Page 3

Recap: Bias-Variance Trade-off

[Figure: three pairs of plots contrasting model fits, each pair annotated with its Bias and Variance]

Page 4

Recap: Complete Pipeline

Training Data: $S = \{(x_i, y_i)\}_{i=1}^N$
→ Model Class(es): $f(x \mid w, b) = w^T x - b$
→ Loss Function: $L(a, b) = (a - b)^2$
→ $\underset{w,b}{\operatorname{argmin}} \sum_{i=1}^N L\big(y_i, f(x_i \mid w, b)\big)$
→ Cross Validation & Model Selection
→ Profit!

Page 5

Today

• Two Basic Learning Algorithms
  – Perceptron Algorithm
  – (Stochastic) Gradient Descent, a.k.a. actually solving the optimization problem

Page 6

The Perceptron

• One of the earliest learning algorithms
  – Introduced in 1957 by Frank Rosenblatt
• Still a great algorithm
  – Fast
  – Clean analysis
  – Precursor to Neural Networks

[Photo: Frank Rosenblatt with the Mark 1 Perceptron machine]

Page 7

Perceptron Learning Algorithm (Linear Classification Model)

Training Set: $S = \{(x_i, y_i)\}_{i=1}^N$, with $y \in \{+1, -1\}$
Model: $f(x \mid w) = \operatorname{sign}(w^T x - b)$

• Initialize $w^1 = 0$, $b^1 = 0$
• For $t = 1, \ldots$ (go through the training set in arbitrary order, e.g., randomly):
  – Receive example $(x, y)$
  – If $f(x \mid w^t, b^t) = y$: $[w^{t+1}, b^{t+1}] = [w^t, b^t]$
  – Else: $w^{t+1} = w^t + y x$ and $b^{t+1} = b^t - y$
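For concreteness, here is a minimal NumPy sketch of the loop above (the function and variable names are my own, not from the slides):

```python
import numpy as np

def perceptron(X, y, n_epochs=100):
    """Perceptron learning. X: (N, D) features; y: (N,) labels in {+1, -1}."""
    N, D = X.shape
    w = np.zeros(D)
    b = 0.0
    for _ in range(n_epochs):
        mistakes = 0
        for i in np.random.permutation(N):      # arbitrary (random) order
            if np.sign(w @ X[i] - b) != y[i]:   # misclassified
                w = w + y[i] * X[i]             # w_{t+1} = w_t + y x
                b = b - y[i]                    # b_{t+1} = b_t - y
                mistakes += 1
        if mistakes == 0:                       # mistake-free pass: converged
            break
    return w, b
```

Note that `np.sign(0) = 0`, so a point exactly on the boundary counts as a mistake here; that is a convention of this sketch.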

Page 8

Aside: Hyperplane Distance

• A line is 1-D, a plane is 2-D
• A hyperplane is many-D (includes lines and planes)
• Defined by $(w, b)$; $w$ is the normal vector, and the hyperplane's offset from the origin is $b / \|w\|$
• Distance from $x$ to the hyperplane: $\frac{|w^T x - b|}{\|w\|}$
• Signed distance: $\frac{w^T x - b}{\|w\|}$

Linear Model = un-normalized signed distance!
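As a quick sketch (my own code, not from the slides): the signed distance is just the model score normalized by $\|w\|$.

```python
import numpy as np

def signed_distance(x, w, b):
    """Signed distance from point x to the hyperplane {x : w^T x - b = 0}."""
    return (w @ x - b) / np.linalg.norm(w)
```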

Pages 9–21

Perceptron Learning (worked example)

[Animation over several slides: 2D training points are visited one at a time. A correctly classified point leaves the model unchanged ("Correct!"); each misclassified point triggers an update of the hyperplane ("Misclassified!" → "Update!"). After a sequence of such updates, all training examples are correctly classified!]

Pages 22–44

Perceptron Learning: Start Again

[Animation over several slides: the same procedure is run again, visiting the training set in a different order. After another sequence of Misclassified!/Update!/Correct! steps, all training examples are again correctly classified!]

Page 45

Recap: Perceptron Learning Algorithm (Linear Classification Model)

Training Set: $S = \{(x_i, y_i)\}_{i=1}^N$, with $y \in \{+1, -1\}$
Model: $f(x \mid w) = \operatorname{sign}(w^T x - b)$

• Initialize $w^1 = 0$, $b^1 = 0$
• For $t = 1, \ldots$ (go through the training set in arbitrary order, e.g., randomly):
  – Receive example $(x, y)$
  – If $f(x \mid w^t, b^t) = y$: $[w^{t+1}, b^{t+1}] = [w^t, b^t]$
  – Else: $w^{t+1} = w^t + y x$ and $b^{t+1} = b^t - y$

Page 46

Comparing the Two Models

[Figure: the two hyperplanes learned from the two runs, shown on the same training set]

Page 47

Convergence to Mistake-Free = Linearly Separable!

Page 48

Margin

$\gamma = \max_w \min_{(x,y)} \frac{y \, (w^T x)}{\|w\|}$

Page 49

Linear Separability

• A classification problem is Linearly Separable if:
  – there exists a $w$ with perfect classification accuracy
• Separable with Margin $\gamma$:
  $\gamma = \max_w \min_{(x,y)} \frac{y \, (w^T x)}{\|w\|}$
• Linearly Separable: $\gamma > 0$

Page 50

Perceptron Mistake Bound

If the data are linearly separable with margin $\gamma$, the number of mistakes is bounded by:
$\frac{R^2}{\gamma^2}$, where $R = \max_x \|x\|$ is the "radius" of the feature space.

Holds for any ordering of the training examples!

More details: http://www.cs.nyu.edu/~mohri/pub/pmb.pdf

Page 51

In the Real World…

• Most problems are NOT linearly separable!
• May never converge…
• So what to do?
• Use a validation set!

Page 52

Early Stopping via Validation

• Run Perceptron Learning on the Training Set
• Evaluate the current model on the Validation Set
• Terminate when validation accuracy stops improving

https://en.wikipedia.org/wiki/Early_stopping
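A minimal sketch of this early-stopping loop, reusing the perceptron update from the earlier sketch (the `accuracy` helper and the stopping test are my own illustrative choices, not from the slides):

```python
import numpy as np

def accuracy(X, y, w, b):
    """Fraction of examples with sign(w^T x - b) == y."""
    return np.mean(np.sign(X @ w - b) == y)

def perceptron_early_stopping(X_tr, y_tr, X_val, y_val, max_epochs=100):
    """Run one perceptron pass at a time; keep the weights that validate best."""
    w, b = np.zeros(X_tr.shape[1]), 0.0
    best_w, best_b, best_acc = w.copy(), b, -np.inf
    for _ in range(max_epochs):
        for i in np.random.permutation(len(y_tr)):
            if np.sign(w @ X_tr[i] - b) != y_tr[i]:
                w = w + y_tr[i] * X_tr[i]
                b = b - y_tr[i]
        val_acc = accuracy(X_val, y_val, w, b)
        if val_acc <= best_acc:        # validation accuracy stopped improving
            break                      # terminate early
        best_w, best_b, best_acc = w.copy(), b, val_acc
    return best_w, best_b
```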

Page 53

Online Learning vs. Batch Learning

• Online Learning:
  – Receive a stream of data (x, y)
  – Make incremental updates (typically)
  – Perceptron Learning is an instance of Online Learning
• Batch Learning:
  – Given all the data up front
  – Can use online learning algorithms for batch learning
  – E.g., stream the data to the learning algorithm

https://en.wikipedia.org/wiki/Online_machine_learning

Page 54

Recap: Perceptron

• One of the first machine learning algorithms
• Benefits:
  – Simple and fast
  – Clean analysis
• Drawbacks:
  – Might not converge to a very good model
  – What is the objective function?

Page 55

(Stochastic) Gradient Descent

Page 56

Back to Optimizing Objective Functions

• Training Data: $S = \{(x_i, y_i)\}_{i=1}^N$, with $x \in \mathbb{R}^D$ and $y \in \{-1, +1\}$
• Model Class: $f(x \mid w, b) = w^T x - b$ (Linear Models)
• Loss Function: $L(a, b) = (a - b)^2$ (Squared Loss)
• Learning Objective: $\underset{w,b}{\operatorname{argmin}} \sum_{i=1}^N L\big(y_i, f(x_i \mid w, b)\big)$ (Optimization Problem)

Page 57

Back to Optimizing Objective Functions

$\underset{w,b}{\operatorname{argmin}} \; L(w, b) \equiv \sum_{i=1}^N L\big(y_i, f(x_i \mid w, b)\big)$

• Typically requires an optimization algorithm.
• Simplest: Gradient Descent
• This lecture: stick with squared loss
  – We'll talk about various loss functions next lecture

Page 58

Gradient Review for Squared Loss

With $L(a, b) = (a - b)^2$ and $f(x \mid w, b) = w^T x - b$:

$\partial_w L(w, b) = \partial_w \sum_{i=1}^N L\big(y_i, f(x_i \mid w, b)\big)$
$\quad = \sum_{i=1}^N \partial_w L\big(y_i, f(x_i \mid w, b)\big)$  (linearity of differentiation)
$\quad = \sum_{i=1}^N -2\big(y_i - f(x_i \mid w, b)\big)\, \partial_w f(x_i \mid w, b)$  (chain rule)
$\quad = \sum_{i=1}^N -2\big(y_i - f(x_i \mid w, b)\big)\, x_i$
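A minimal NumPy sketch of this gradient computation (my own code, vectorized over all $N$ examples; the $\partial_b$ case follows the same chain rule with $\partial_b f = -1$):

```python
import numpy as np

def squared_loss_gradients(X, y, w, b):
    """Gradients of sum_i (y_i - (w^T x_i - b))^2 with respect to w and b."""
    residual = y - (X @ w - b)          # y_i - f(x_i | w, b), shape (N,)
    grad_w = -2.0 * (X.T @ residual)    # sum_i -2 (y_i - f(x_i)) x_i
    grad_b = 2.0 * residual.sum()       # chain rule with d f / d b = -1
    return grad_w, grad_b
```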

Page 59

Gradient Descent

• Initialize: $w^1 = 0$, $b^1 = 0$
• For $t = 1, \ldots$:
  $w^{t+1} = w^t - \eta^{t+1} \, \partial_w L(w^t, b^t)$
  $b^{t+1} = b^t - \eta^{t+1} \, \partial_b L(w^t, b^t)$

where $\eta$ is the "step size".
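Putting the two previous slides together, a sketch of the full gradient-descent loop (assumes the `squared_loss_gradients` helper from the earlier sketch; `eta` and `n_iters` are illustrative defaults):

```python
import numpy as np

def gradient_descent(X, y, eta=0.01, n_iters=1000):
    """Gradient descent on the squared-loss objective for f(x) = w^T x - b."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iters):
        grad_w, grad_b = squared_loss_gradients(X, y, w, b)  # from the sketch above
        w = w - eta * grad_w        # w_{t+1} = w_t - eta * dL/dw
        b = b - eta * grad_b        # b_{t+1} = b_t - eta * dL/db
    return w, b
```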

Pages 60–63

How to Choose Step Size?

[Figure sequence: gradient descent on $L(w) = (1 - w)^2$, so $\partial_w L(w) = -2(1 - w)$, with step size $\eta = 1$. The iterates jump back and forth across the minimum: Oscillate Infinitely!]
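To see the oscillation concretely, a quick worked computation on the slide's example (the starting point $w^1 = 0$ is my assumption for illustration):

$w^1 = 0:\quad \partial_w L(0) = -2(1 - 0) = -2,\qquad w^2 = 0 - 1 \cdot (-2) = 2$
$w^2 = 2:\quad \partial_w L(2) = -2(1 - 2) = 2,\qquad\; w^3 = 2 - 1 \cdot 2 = 0$

With $\eta = 1$, the iterates bounce between 0 and 2 forever and never reach the minimum at $w = 1$.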

Pages 64–67

How to Choose Step Size?

[Figure sequence: the same problem with step size $\eta = 0.0001$. The iterates inch toward the minimum: Takes a Really Long Time!]

Page 68

How to Choose Step Size?

[Figure: loss vs. iterations (0–200) for step sizes 0.001, 0.01, 0.1, and 0.5. Note that the absolute scale is not meaningful; focus on the relative magnitude differences.]

Answer: as large as possible (without diverging)!

Page 69

Being Scale Invariant

• Consider the following two gradient updates:
  $w^{t+1} = w^t - \eta^{t+1} \, \partial_w L(w^t, b^t)$
  $w^{t+1} = w^t - \hat{\eta}^{t+1} \, \partial_w \hat{L}(w^t, b^t)$
• Suppose $\hat{L} = 1000 \, L$. How are the two step sizes related?
  – The updates are identical when $\hat{\eta}^{t+1} = \eta^{t+1} / 1000$.

Page 70

Practical Rules of Thumb

• Divide the loss function by the number of examples:
  $w^{t+1} = w^t - \left(\frac{\eta^{t+1}}{N}\right) \partial_w L(w^t, b^t)$
• Start with a large step size
  – If the loss plateaus, divide the step size by 2
  – (Can also use advanced optimization methods)
  – (The step size must decrease over time to guarantee convergence to the global optimum)
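A sketch of these rules layered on the earlier gradient-descent loop (the plateau test and its 0.999 threshold are my own illustrative choices, not from the slides; assumes the `squared_loss_gradients` helper from before):

```python
import numpy as np

def gd_with_halving(X, y, eta=1.0, n_iters=1000):
    """Gradient descent with the 1/N scaling and step-size halving on plateaus."""
    N = len(y)
    w, b = np.zeros(X.shape[1]), 0.0
    prev_loss = np.inf
    for _ in range(n_iters):
        grad_w, grad_b = squared_loss_gradients(X, y, w, b)
        w = w - (eta / N) * grad_w
        b = b - (eta / N) * grad_b
        loss = np.sum((y - (X @ w - b)) ** 2)
        if loss > 0.999 * prev_loss:   # loss plateaued (or got worse)
            eta = eta / 2.0            # divide the step size by 2
        prev_loss = loss
    return w, b
```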

Page 71

Aside: Convexity

[Figure: a convex function. Image source: http://en.wikipedia.org/wiki/Convex_function]

• Easy to find global optima!
• Strictly convex if the second derivative is always > 0
• [Second figure: an example function that is Not Convex]

Page 72

Aside: Convexity

$L(x_2) \ge L(x_1) + \nabla L(x_1)^T (x_2 - x_1)$

The function always lies above its locally linear extrapolation.

[Figure: a convex curve lying above one of its tangent lines]

Page 73

Aside: Convexity

• All local optima are global optima:
  – Gradient Descent will find an optimum (assuming the step size is chosen safely)
• Strictly convex: unique global optimum
• Almost all standard objectives are (strictly) convex:
  – Squared Loss, SVMs, Logistic Regression, Ridge, Lasso
  – We will see non-convex objectives later (e.g., deep learning)

Page 74

Convergence

• Assume $L$ is convex. How many iterations to achieve $L(w) - L(w^*) \le \epsilon$?

• If $|L(a) - L(b)| \le \rho \|a - b\|$ ($L$ is "$\rho$-Lipschitz"):
  – Then $O(1/\epsilon^2)$ iterations
• If $\|\nabla L(a) - \nabla L(b)\| \le \rho \|a - b\|$ ($L$ is "$\rho$-smooth"):
  – Then $O(1/\epsilon)$ iterations
• If $L(a) \ge L(b) + \nabla L(b)^T (a - b) + \frac{\rho}{2} \|a - b\|^2$ ($L$ is "$\rho$-strongly convex"):
  – Then $O(\log(1/\epsilon))$ iterations

More details: Bubeck textbook, Chapter 3

Page 75

Convergence

• In general, it takes infinite time to reach the global optimum.
• But in general, we don't care!
  – As long as we're close enough to the global optimum

[Figure: loss vs. iterations for step sizes 0.001, 0.01, 0.1, and 0.5, annotated: "How do we know if we're here?" (on the plateau) "And not here?" (still descending)]

Page 76

When to Stop?

• Convergence analyses give worst-case upper bounds
  – What to do in practice?
• Stop when progress is sufficiently small
  – E.g., relative reduction less than 0.001
• Stop after a pre-specified number of iterations
  – E.g., 100,000
• Stop when validation error stops going down

Page 77

Limitation of Gradient Descent

$\partial_w L(w, b \mid S) = \partial_w \sum_{i=1}^N L\big(y_i, f(x_i \mid w, b)\big)$

• Requires a full pass over the training set per iteration
• Very expensive if the training set is huge
• Do we need to do a full pass over the data?

Page 78

Stochastic Gradient Descent

• Suppose the loss function decomposes additively:
  $L(w, b) = \frac{1}{N} \sum_{i=1}^N L_i(w, b) = E_i\big[L_i(w, b)\big]$
  where each $L_i$ corresponds to a single data point, e.g., $L_i(w, b) \equiv \big(y_i - f(x_i \mid w, b)\big)^2$
• The gradient is the expected gradient of the sub-functions:
  $\partial_w L(w, b) = \partial_w E_i\big[L_i(w, b)\big] = E_i\big[\partial_w L_i(w, b)\big]$

Page 79

Stochastic Gradient Descent

• It suffices to take a random gradient update
  – So long as it matches the true gradient in expectation
• Each iteration $t$:
  – Choose $i$ at random
  – $w^{t+1} = w^t - \eta^{t+1} \, \partial_w L_i(w^t, b^t)$
  – $b^{t+1} = b^t - \eta^{t+1} \, \partial_b L_i(w^t, b^t)$
  – The expected value of the update direction is $\partial_w L(w, b)$
• SGD is an online learning algorithm!
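A minimal sketch of SGD on the per-example squared loss (my own code; the step size is kept constant for simplicity):

```python
import numpy as np

def sgd(X, y, eta=0.01, n_iters=10000):
    """SGD on L_i(w, b) = (y_i - (w^T x_i - b))^2, one random example per step."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(n_iters):
        i = np.random.randint(N)                 # choose i at random
        residual = y[i] - (w @ X[i] - b)         # y_i - f(x_i | w, b)
        w = w - eta * (-2.0 * residual * X[i])   # gradient of L_i w.r.t. w
        b = b - eta * ( 2.0 * residual)          # gradient of L_i w.r.t. b
    return w, b
```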

Page 80

Mini-Batch SGD

• Each $L_i$ is a small batch of training examples
  – E.g., 500–1000 examples
  – Can leverage vector operations
  – Decreases the volatility of gradient updates
• Industry state-of-the-art
  – Everyone uses mini-batch SGD
  – Often parallelized (e.g., different cores work on different mini-batches)
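The mini-batch variant only changes how examples are sampled in the SGD sketch above (the default batch size of 500 comes from the slide's example range; the rest is my own code):

```python
import numpy as np

def minibatch_sgd(X, y, eta=0.01, batch_size=500, n_iters=1000):
    """SGD where each update averages the gradient over a random mini-batch."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(n_iters):
        idx = np.random.choice(N, size=min(batch_size, N), replace=False)
        residual = y[idx] - (X[idx] @ w - b)    # vectorized over the whole batch
        w = w - eta * (-2.0 / len(idx)) * (X[idx].T @ residual)
        b = b - eta * ( 2.0 / len(idx)) * residual.sum()
    return w, b
```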

Page 81

Checking for Convergence

• How to check for convergence?
  – Evaluating the loss on the entire training set seems expensive…

[Figure: loss vs. iterations for step sizes 0.001, 0.01, 0.1, and 0.5]

Page 82

Checking for Convergence

• How to check for convergence?
  – Evaluating the loss on the entire training set seems expensive…
• Don't check after every iteration
  – E.g., check every 1000 iterations
• Evaluate the loss on a subset of the training data
  – E.g., the previous 5000 examples

Page 83

Recap: Stochastic Gradient Descent

• Conceptually:
  – Decompose the loss function additively
  – Choose a component randomly
  – Take a gradient update
• Benefits:
  – Avoids iterating over the entire dataset for every update
  – The gradient update is consistent (in expectation)
• Industry standard

Page 84

Perceptron Revisited (What is the Objective Function?)

Training Set: $S = \{(x_i, y_i)\}_{i=1}^N$, with $y \in \{+1, -1\}$
Model: $f(x \mid w) = \operatorname{sign}(w^T x - b)$

• Initialize $w^1 = 0$, $b^1 = 0$
• For $t = 1, \ldots$ (go through the training set in arbitrary order, e.g., randomly):
  – Receive example $(x, y)$
  – If $f(x \mid w^t, b^t) = y$: $[w^{t+1}, b^{t+1}] = [w^t, b^t]$
  – Else: $w^{t+1} = w^t + y x$ and $b^{t+1} = b^t - y$

Page 85

Perceptron (Implicit) Objective

$L_i(w, b) = \max\{0, \, -y_i f(x_i \mid w, b)\}$

[Figure: loss vs. $y f(x)$ — zero for $y f(x) \ge 0$, increasing linearly as $y f(x)$ goes negative]
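The connection, implied by the slide sequence and spelled out here: wherever $-y_i f(x_i \mid w, b) > 0$ (a misclassified example), with $f(x \mid w, b) = w^T x - b$ we get

$\partial_w L_i(w, b) = -y_i x_i, \qquad \partial_b L_i(w, b) = y_i,$

so an SGD step with step size 1 gives $w^{t+1} = w^t + y x$ and $b^{t+1} = b^t - y$ — exactly the Perceptron update. On correctly classified examples the loss is 0 and nothing changes.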

Page 86

Recap: Complete Pipeline

Training Data: $S = \{(x_i, y_i)\}_{i=1}^N$
→ Model Class(es): $f(x \mid w, b) = w^T x - b$
→ Loss Function: $L(a, b) = (a - b)^2$
→ $\underset{w,b}{\operatorname{argmin}} \sum_{i=1}^N L\big(y_i, f(x_i \mid w, b)\big)$ — Use SGD!
→ Cross Validation & Model Selection
→ Profit!

Page 87

Next Week

• Different Loss Functions
  – Hinge Loss (SVM)
  – Log Loss (Logistic Regression)
• Non-linear model classes
  – Neural Nets
• Regularization

• Next Thursday Recitation:
  – Linear Algebra & Calculus