Linear models: Logistic regression
Chapter 3.3
1
Predicting probabilities
Objective: learn to predict a probability P(y | x) for a binary classification problem using a linear classifier. The target function: for positive examples P(y = +1 | x) = 1, whereas P(y = +1 | x) = 0 for negative examples.
2
The Target Function is Inherently Noisy
f(x) = P[y = +1 | x].
The data is generated from a noisy target function:
P(y | x) = f(x)      for y = +1;
           1 − f(x)  for y = −1.
c⃝ AML Creator: Malik Magdon-Ismail Logistic Regression and Gradient Descent: 6 /23 When is h good? −→
Predicting probabilities

Can we assume that P(y = +1 | x) is linear?

3
Logistic regression
4
The signal s = wᵀx = Σᵢ₌₀ᵈ wᵢxᵢ is the basis for several linear models:

  linear classification:  h(x) = sign(s)
  linear regression:      h(x) = s
  logistic regression:    h(x) = θ(s)

[Figure: three network diagrams, each combining the inputs x₀, x₁, …, x_d into the signal s, followed by a different output transform h(x).]

The logistic function (aka squashing function):

θ(s) = eˢ / (1 + eˢ)

[Figure: plot of θ(s), an S-shaped curve rising from 0 to 1.]
Properties of the logistic function
5
Predicting a Probability
Will someone have a heart attack over the next year?

  age          62 years
  gender       male
  blood sugar  120 mg/dL
  …            40,000
  HDL          50
  LDL          120
  Mass         190 lbs
  Height       5′ 10″
  …            …

Classification: Yes/No
Logistic Regression: likelihood of heart attack, y ∈ [0, 1]

h(x) = θ( Σᵢ₌₀ᵈ wᵢxᵢ ) = θ(wᵀx)

θ(s) = eˢ / (1 + eˢ) = 1 / (1 + e⁻ˢ).

θ(−s) = e⁻ˢ / (1 + e⁻ˢ) = 1 / (1 + eˢ) = 1 − θ(s).
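The identity θ(−s) = 1 − θ(s) is worth internalizing; a minimal numerical check:

```python
import math

def theta(s):
    # logistic (squashing) function: maps any real signal into (0, 1)
    return 1.0 / (1.0 + math.exp(-s))

for s in [-3.0, -0.5, 0.0, 1.2, 4.0]:
    # theta(-s) should equal 1 - theta(s) for every s
    assert abs(theta(-s) - (1.0 - theta(s))) < 1e-12
```

This symmetry is what later lets the two cases of P(y | x) collapse into the single formula θ(y · wᵀx).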
Predicting probabilities
Fitting the data means finding a good hypothesis h
6
What Makes an h Good?

h is good if:
  h(xₙ) ≈ 1 whenever yₙ = +1;
  h(xₙ) ≈ 0 whenever yₙ = −1.

A simple error measure that captures this:

Ein(h) = (1/N) Σₙ₌₁ᴺ ( h(xₙ) − ½(1 + yₙ) )²

Not very convenient (hard to minimize).
The Probabilistic Interpretation
Suppose that h(x) = θ(wᵀx) closely captures P[+1 | x]:

P(y | x) = θ(wᵀx)      for y = +1;
           1 − θ(wᵀx)  for y = −1.
The Probabilistic Interpretation
So, if h(x) = θ(wᵀx) closely captures P[+1 | x] (using 1 − θ(s) = θ(−s)):

P(y | x) = θ(wᵀx)    for y = +1;
           θ(−wᵀx)   for y = −1.
More compactly:

P(y | x) = θ(y · wᵀx)
Is logistic regression really linear?

8

P(y = +1 | x) = e^(wᵀx) / (e^(wᵀx) + 1)

P(y = −1 | x) = 1 − P(y = +1 | x) = 1 / (e^(wᵀx) + 1)

To see what the decision boundary looks like, set P(y = +1 | x) = P(y = −1 | x) and solve for x:

e^(wᵀx) = 1,   i.e.   wᵀx = 0

So the decision boundary is the hyperplane wᵀx = 0: logistic regression is indeed a linear classifier.
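A quick numerical illustration of this boundary, with made-up weights (the vector w below is a hypothetical example, with x₀ = 1 as the bias coordinate):

```python
import math

def p_positive(w, x):
    # P(y = +1 | x) under the logistic model
    s = sum(wi * xi for wi, xi in zip(w, x))
    return math.exp(s) / (math.exp(s) + 1.0)

w = [0.5, -1.0, 2.0]              # hypothetical weights (bias first)
x_on_boundary = [1.0, 2.5, 1.0]   # chosen so that w^T x = 0.5 - 2.5 + 2.0 = 0

# On the hyperplane w^T x = 0 the model is maximally uncertain:
print(p_positive(w, x_on_boundary))  # 0.5
```

Points with wᵀx > 0 get probability above ½, points with wᵀx < 0 below ½, exactly as a linear classifier would split them.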
Maximum likelihood
We will find w using the principle of maximum likelihood.
9
The Likelihood

P(y | x) = θ(y · wᵀx)

Recall: (x₁, y₁), . . . , (x_N, y_N) are independently generated.

Likelihood: the probability of getting the y₁, . . . , y_N in D from the corresponding x₁, . . . , x_N:

P(y₁, . . . , y_N | x₁, . . . , x_N) = Πₙ₌₁ᴺ P(yₙ | xₙ).

The likelihood measures how probable the observed data would be if h were the target f.
Maximizing the likelihood

(Replacing the likelihood by its logarithm below is valid since ln is monotonically increasing, so it preserves the maximizer.)

10
Maximizing The Likelihood (why?)

max Πₙ₌₁ᴺ P(yₙ | xₙ)
  ⇔ max ln( Πₙ₌₁ᴺ P(yₙ | xₙ) )
  ≡ max Σₙ₌₁ᴺ ln P(yₙ | xₙ)
  ⇔ min −(1/N) Σₙ₌₁ᴺ ln P(yₙ | xₙ)
  ≡ min (1/N) Σₙ₌₁ᴺ ln( 1 / P(yₙ | xₙ) )
  ≡ min (1/N) Σₙ₌₁ᴺ ln( 1 / θ(yₙ · wᵀxₙ) )   ← we specialize to our “model” here
  ≡ min (1/N) Σₙ₌₁ᴺ ln( 1 + e^(−yₙ·wᵀxₙ) )

Ein(w) = (1/N) Σₙ₌₁ᴺ ln( 1 + e^(−yₙ·wᵀxₙ) )
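The last step uses 1/θ(s) = 1 + e⁻ˢ. The equivalence of the two forms is easy to confirm numerically; a sketch with synthetic data and arbitrary weights:

```python
import math, random

def theta(s):
    return 1.0 / (1.0 + math.exp(-s))

def signal(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

random.seed(0)
# toy dataset of (x, y) pairs with x0 = 1 and labels in {-1, +1}
data = [([1.0, random.gauss(0, 1)], random.choice([-1, 1])) for _ in range(20)]
w = [0.3, -0.8]  # hypothetical weights

# Form 1: negative average log-likelihood, (1/N) sum ln(1/theta(y w^T x))
e1 = sum(math.log(1.0 / theta(y * signal(w, x))) for x, y in data) / len(data)
# Form 2: (1/N) sum ln(1 + exp(-y w^T x))
e2 = sum(math.log(1.0 + math.exp(-y * signal(w, x))) for x, y in data) / len(data)
assert abs(e1 - e2) < 1e-12
```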
Maximizing the likelihood

11

Summary: maximizing the likelihood is equivalent to minimizing the cross entropy error:

−(1/N) ln( Πₙ₌₁ᴺ θ(yₙ·wᵀxₙ) ) = (1/N) Σₙ₌₁ᴺ ln( 1 / θ(yₙ·wᵀxₙ) )      ( using θ(s) = 1/(1 + e⁻ˢ) )

Ein(w) = (1/N) Σₙ₌₁ᴺ ln( 1 + e^(−yₙ·wᵀxₙ) ),

where the summand is the pointwise error e(h(xₙ), yₙ).
Maximizing the likelihood

12

Exercise: check that the cross entropy error Ein(w) = (1/N) Σₙ₌₁ᴺ ln( 1 + e^(−yₙ·wᵀxₙ) ) is equivalent to:

Ein(w) = (1/N) Σₙ₌₁ᴺ [ I(yₙ = +1) ln( 1/h(xₙ) ) + I(yₙ = −1) ln( 1/(1 − h(xₙ)) ) ]
Digression: information theory
I am thinking of an integer between 0 and 1,023. You want to guess it using as few questions as possible. Most of us would ask “is it less than 512?” This is a good strategy because it provides the most information about the unknown number: it reveals the first binary digit. Initially you need to obtain log₂(1024) = 10 bits of information; after the first question you only need log₂(512) = 9 bits.
Information and Entropy

14

By halving the search space we obtained one bit. In general, the information associated with an outcome of probability p is:

I(p) = −log p

Why the logarithm? Assume we have two independent events x and y. We would like the information they carry to be additive. Let’s check:

I(x, y) = −log P(x, y) = −log P(x)P(y) = −log P(x) − log P(y) = I(x) + I(y)

Entropy is the expected information:

H(P) = −Σₓ P(x) log P(x)
Entropy

For a Bernoulli random variable:

H(p) = −p log p − (1 − p) log(1 − p)

Maximal when p = ½.
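A short sketch confirming that the Bernoulli entropy peaks at p = ½ (one bit, using base-2 logs):

```python
import math

def bernoulli_entropy(p):
    # H(p) = -p log p - (1-p) log(1-p), with the convention 0 log 0 = 0
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

# Entropy is maximal (1 bit) for a fair coin, p = 1/2:
ps = [i / 100.0 for i in range(101)]
best = max(ps, key=bernoulli_entropy)
print(best, bernoulli_entropy(best))  # 0.5 1.0
```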
KL divergence

16

The KL divergence between distributions P and Q:

D_KL(P‖Q) = −Σₓ P(x) log( Q(x) / P(x) )

Properties:
• Non-negative, equal to 0 iff P = Q
• It is not symmetric
KL divergence

17

D_KL(P‖Q) = −Σₓ P(x) log( Q(x) / P(x) )
          = −Σₓ P(x) log Q(x) + Σₓ P(x) log P(x)
          = cross entropy − entropy
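This decomposition can be verified directly; a sketch on two small example distributions (assumed strictly positive so every log is defined):

```python
import math

P = [0.5, 0.3, 0.2]   # example distributions
Q = [0.4, 0.4, 0.2]

kl            = -sum(p * math.log(q / p) for p, q in zip(P, Q))
cross_entropy = -sum(p * math.log(q) for p, q in zip(P, Q))
entropy       = -sum(p * math.log(p) for p in P)

# D_KL(P||Q) = cross entropy - entropy, and it is non-negative
assert abs(kl - (cross_entropy - entropy)) < 1e-12
assert kl >= 0.0
```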
Cross entropy and logistic regression

18

Cross entropy:

H(P, Q) = −Σₓ P(x) log Q(x)

And for binary variables (true label y, predicted probability ŷ):

H(y, ŷ) = −y log ŷ − (1 − y) log(1 − ŷ)

The logistic regression cost function is the average cross entropy between the learned P(y | x) and the observed labels:

Ein(w) = (1/N) Σₙ₌₁ᴺ [ I(yₙ = +1) ln( 1/h(xₙ) ) + I(yₙ = −1) ln( 1/(1 − h(xₙ)) ) ]
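The earlier exercise, checking that this cross-entropy form agrees with (1/N) Σ ln(1 + e^(−yₙ·wᵀxₙ)), can be done numerically; a sketch over a few toy (signal, label) pairs:

```python
import math

def theta(s):
    return 1.0 / (1.0 + math.exp(-s))

# toy dataset: (signal w^T x_n, label y_n) pairs, weights already folded in
examples = [(1.5, 1), (-0.3, 1), (0.8, -1), (-2.0, -1)]

# cross entropy (indicator) form, with h(x) = theta(w^T x) predicting P(y = +1 | x)
ce = sum(math.log(1.0 / theta(s)) if y == 1 else math.log(1.0 / (1.0 - theta(s)))
         for s, y in examples) / len(examples)

# compact form: (1/N) sum ln(1 + exp(-y w^T x))
compact = sum(math.log(1.0 + math.exp(-y * s)) for s, y in examples) / len(examples)

assert abs(ce - compact) < 1e-12
```

The key step is 1 − θ(s) = θ(−s), which turns the y = −1 branch into ln(1 + eˢ) = ln(1 + e^(−ys)).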
In-sample error

The in-sample error for logistic regression

19

Ein(w) = (1/N) Σₙ₌₁ᴺ ln( 1 + e^(−yₙ·wᵀxₙ) )

Its gradient:

∇Ein = −(1/N) Σₙ₌₁ᴺ yₙxₙ / ( 1 + e^(yₙ·w(t)ᵀxₙ) )

Gradient descent update, starting from w(0) at t = 0 and iterating t = 0, 1, 2, . . . :

w(t + 1) = w(t) − η∇Ein
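The gradient formula can be sanity-checked against finite differences; a sketch with synthetic data and hypothetical helper names:

```python
import math, random

def ein(w, data):
    # E_in(w) = (1/N) sum ln(1 + exp(-y w^T x))
    return sum(math.log(1 + math.exp(-y * sum(wi * xi for wi, xi in zip(w, x))))
               for x, y in data) / len(data)

def grad_ein(w, data):
    # analytic gradient: -(1/N) sum y x / (1 + exp(y w^T x))
    g = [0.0] * len(w)
    for x, y in data:
        s = sum(wi * xi for wi, xi in zip(w, x))
        c = -y / (1.0 + math.exp(y * s)) / len(data)
        for i, xi in enumerate(x):
            g[i] += c * xi
    return g

random.seed(1)
data = [([1.0, random.gauss(0, 1)], random.choice([-1, 1])) for _ in range(10)]
w = [0.2, -0.4]
g = grad_ein(w, data)

# compare with a central finite difference in each coordinate
for i in range(len(w)):
    wp = list(w); wp[i] += 1e-6
    wm = list(w); wm[i] -= 1e-6
    numeric = (ein(wp, data) - ein(wm, data)) / 2e-6
    assert abs(numeric - g[i]) < 1e-6
```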
Digression: gradient ascent/descent
Topographical maps can give us intuition on how to optimize a cost function
20
http://www.csus.edu/indiv/s/slaymaker/archives/geol10l/shield1.jpg http://www.sir-ray.com/touro/IMG_0001_NEW.jpg
Digression: gradient descent
21 Images from http://en.wikipedia.org/wiki/Gradient_descent
Given a function E(w), the gradient is the direction of steepest ascent. Therefore, to minimize E(w), take a step in the direction of the negative of the gradient.
Notice that the gradient is perpendicular to contours of equal E(w)
Gradient descent
Gradient descent is an iterative process. How do we pick the step size η?
22
Our Ein Has Only One Valley

[Figure: Ein plotted against the weights w, a single bowl-shaped valley.]

. . . because Ein(w) is a convex function of w. (So, who cares if it looks ugly!)
How to “Roll Down”?
Assume you are at weights w(t) and you take a step of size η in the direction v.
w(t + 1) = w(t) + ηv
We get to pick v ← what’s the best direction to take the step?
Pick v to make Ein(w(t + 1)) as small as possible.
Gradient descent

The gradient is the best direction to take to optimize Ein(w):

23

The Gradient is the Fastest Way to Roll Down

Approximating the change in Ein:

ΔEin = Ein(w(t+1)) − Ein(w(t))
     = Ein(w(t) + ηv) − Ein(w(t))
     = η∇Ein(w(t))ᵀv + O(η²)     (Taylor’s approximation)
     ⪆ −η‖∇Ein(w(t))‖

The first-order term is minimized, and the bound attained, at the negative gradient direction. The best (steepest) direction to move is:

v = −∇Ein(w(t)) / ‖∇Ein(w(t))‖
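That no other unit direction beats the negative gradient (to first order) is easy to see numerically; a sketch on a simple quadratic of my own choosing:

```python
import math

def E(w):
    # a simple smooth function to descend: E(w) = w0^2 + 3*w1^2
    return w[0] ** 2 + 3.0 * w[1] ** 2

def grad(w):
    return [2.0 * w[0], 6.0 * w[1]]

w = [1.0, 1.0]
eta = 1e-3
g = grad(w)
norm = math.sqrt(sum(gi * gi for gi in g))
v_star = [-gi / norm for gi in g]  # unit-length negative gradient

def delta(v):
    # change in E after a step of size eta in unit direction v
    return E([wi + eta * vi for wi, vi in zip(w, v)]) - E(w)

# the negative-gradient step decreases E at least as much as any
# other unit direction we try
angles = [k * math.pi / 8 for k in range(16)]
others = [[math.cos(a), math.sin(a)] for a in angles]
assert all(delta(v_star) <= delta(v) + 1e-9 for v in others)
```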
Choosing the step size

The choice of the step size affects the rate of convergence:

24

The ‘Goldilocks’ Step Size

[Figure: three plots of Ein versus the weights w: η too small (η = 0.1; 75 steps), η too large (η = 2; 10 steps), and variable ηₜ, just right (10 steps).]

Let’s use a variable learning rate:

w(t + 1) = w(t) + ηₜv
ηₜ = η · ‖∇Ein(w(t))‖

When approaching the minimum: ‖∇Ein(w(t))‖ → 0.
Choosing the step size

25

Substituting ηₜ = η · ‖∇Ein(w(t))‖ and v = −∇Ein(w(t)) / ‖∇Ein(w(t))‖, the normalization cancels:

ηₜv = −η · ‖∇Ein(w(t))‖ · ∇Ein(w(t)) / ‖∇Ein(w(t))‖ = −η∇Ein(w(t))
The final form of gradient descent

26

w(t + 1) = w(t) − η∇Ein(w(t))
Logistic regression using gradient descent
We will use gradient descent to minimize our error function. Fortunately, the logistic regression error function has a single global minimum:
27
Our Ein Has Only One Valley

[Figure: Ein plotted against the weights w, a single bowl-shaped valley.]

. . . because Ein(w) is a convex function of w. (So, who cares if it looks ugly!)

Finding The Best Weights - Hill Descent

A ball on a complicated hilly terrain rolls down to a local valley; this is called a local minimum.

Questions:
How do we get to the bottom of the deepest valley?
How do we do this when we don’t have gravity?
So we don’t need to worry about getting stuck in local minima
Logistic regression using gradient descent

Putting it all together: this is called batch gradient descent.

28

Fixed Learning Rate Gradient Descent

1: Initialize at step t = 0 to w(0).
2: for t = 0, 1, 2, . . . do
3:   Compute the gradient gₜ = ∇Ein(w(t)).   ←− (Ex. 3.7 in LFD)
4:   Move in the direction vₜ = −gₜ.
5:   Update the weights: w(t + 1) = w(t) + ηvₜ.
6:   Iterate ‘until it is time to stop’.
7: end for
8: Return the final weights.

Gradient descent can minimize any smooth function, for example

Ein(w) = (1/N) Σₙ₌₁ᴺ ln( 1 + e^(−yₙ·wᵀxₙ) )   ← logistic regression
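The steps above can be sketched in a few lines; this is a minimal illustration on made-up separable data (the function name and the dataset are my own, not from the slides):

```python
import math, random

def train_logistic_gd(data, eta=0.5, steps=2000):
    """Batch gradient descent on E_in(w) = (1/N) sum ln(1 + exp(-y w^T x))."""
    d = len(data[0][0])
    w = [0.0] * d                      # 1: initialize w(0)
    for _ in range(steps):             # 2: for t = 0, 1, 2, ...
        g = [0.0] * d                  # 3: gradient of E_in at w(t)
        for x, y in data:
            s = sum(wi * xi for wi, xi in zip(w, x))
            c = -y / (1.0 + math.exp(y * s)) / len(data)
            for i, xi in enumerate(x):
                g[i] += c * xi
        w = [wi - eta * gi for wi, gi in zip(w, g)]  # 4-5: step along -g
    return w                           # 8: final weights

# toy data: label = sign of x1 (x0 = 1 is the bias input)
random.seed(0)
data = []
for _ in range(50):
    x1 = random.uniform(-2, 2)
    data.append(([1.0, x1], 1 if x1 > 0 else -1))

w = train_logistic_gd(data)
theta = lambda s: 1.0 / (1.0 + math.exp(-s))
# a clearly positive point should get probability near 1
print(theta(w[0] + w[1] * 1.5))
```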
Logistic regression

29

Comments:
• In practice logistic regression is solved by faster methods than gradient descent
• There is an extension to multi-class classification
Stochastic gradient descent
A variation on gradient descent that considers the error of a single training example at a time. It tends to converge faster than the batch version.
30
Stochastic Gradient Descent (SGD)

A variation of GD that considers only the error on one data point.

Ein(w) = (1/N) Σₙ₌₁ᴺ ln( 1 + e^(−yₙ·wᵀxₙ) ) = (1/N) Σₙ₌₁ᴺ e(w, xₙ, yₙ)

• Pick a random data point (x*, y*)
• Run an iteration of GD on e(w, x*, y*):

w(t + 1) ← w(t) − η∇w e(w, x*, y*)

1. The ‘average’ move is the same as GD;
2. Computation: a fraction 1/N cheaper per step;
3. Stochastic: helps escape local minima;
4. Simple;
5. Similar to PLA.

Logistic Regression:

w(t + 1) ← w(t) + y*x* · η / ( 1 + e^(y*·wᵀx*) )

(Recall PLA: w(t + 1) ← w(t) + y*x*)
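The single-point update above is all SGD needs; a minimal sketch on made-up data (helper name and dataset are my own, not from the slides):

```python
import math, random

def sgd_step(w, x, y, eta):
    # SGD update for logistic regression on a single point (x, y):
    # w <- w + y x * eta / (1 + exp(y w^T x))
    s = sum(wi * xi for wi, xi in zip(w, x))
    scale = eta / (1.0 + math.exp(y * s))
    return [wi + y * xi * scale for wi, xi in zip(w, x)]

random.seed(0)
# toy data: label = sign of x1, with x0 = 1 as the bias coordinate
points = [[1.0, random.uniform(-2, 2)] for _ in range(100)]
data = [(x, 1 if x[1] > 0 else -1) for x in points]

w = [0.0, 0.0]
for _ in range(5000):
    x, y = random.choice(data)   # pick a random data point
    w = sgd_step(w, x, y, eta=0.1)

# the learned boundary w0 + w1*x1 = 0 should sit near x1 = 0
print(-w[0] / w[1])
```

Note how close `sgd_step` is to the PLA update: PLA adds y*x* outright, while SGD scales it down by how confidently the current w already classifies the point.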
Summary of linear models
Linear methods for classification and regression:
31
The Linear Signal

s = wᵀx  →  sign(wᵀx) ∈ {−1, +1}   (linear classification)
         →  wᵀx ∈ R                (linear regression)
         →  θ(wᵀx) ∈ [0, 1]        (logistic regression)
More to come!