

© Stanley Chan 2020. All Rights Reserved.

ECE595 / STAT598: Machine Learning I
Lecture 01: Linear Regression

Spring 2020

Stanley Chan

School of Electrical and Computer Engineering
Purdue University



Outline

Mathematical Background

Lecture 1: Linear regression: A basic data analytic tool

Lecture 2: Regularization: Constraining the solution

Lecture 3: Kernel Method: Enabling nonlinearity

Lecture 1: Linear Regression

Linear Regression

Notation
Loss Function
Solving the Regression Problem

Geometry

Projection
Minimum-Norm Solution
Pseudo-Inverse


Basic Notation

Scalar: $a, b, c \in \mathbb{R}$. Vector: $\boldsymbol{a}, \boldsymbol{b}, \boldsymbol{c} \in \mathbb{R}^d$.

Matrix: $A, B, C \in \mathbb{R}^{N \times d}$; entries are $a_{ij}$ or $[A]_{ij}$.

Rows and Columns

$$A = \begin{bmatrix} | & | & & | \\ \boldsymbol{a}_1 & \boldsymbol{a}_2 & \cdots & \boldsymbol{a}_d \\ | & | & & | \end{bmatrix}, \quad \text{and} \quad A = \begin{bmatrix} (\boldsymbol{x}_1)^T \\ (\boldsymbol{x}_2)^T \\ \vdots \\ (\boldsymbol{x}_N)^T \end{bmatrix}.$$

$\{\boldsymbol{a}_j\}$: the $j$-th feature. $\{\boldsymbol{x}_n\}$: the $n$-th sample.

Identity matrix I

All-one vector 1 and all-zero vector 0

Standard basis $\boldsymbol{e}_i$.


Line Fitting

[Figure: scatter plot of data points over $x \in [0, 1]$ with $y$ roughly between 0.8 and 1.7, illustrating line fitting.]


Linear Regression

The problem of regression can be summarized as:

Given measurements: $y_n$, where $n = 1, \ldots, N$.
Given inputs: $\boldsymbol{x}_n$.
Given a model: $g_{\boldsymbol{\theta}}(\boldsymbol{x}_n)$, parameterized by $\boldsymbol{\theta}$.
Determine $\boldsymbol{\theta}$ such that $y_n \approx g_{\boldsymbol{\theta}}(\boldsymbol{x}_n)$.

Linear regression is one type of regression:

Restrict $g_{\boldsymbol{\theta}}(\cdot)$ to a line: $g_{\boldsymbol{\theta}}(\boldsymbol{x}) = \boldsymbol{x}^T\boldsymbol{\theta}$.

The inputs $\boldsymbol{x}$ and the parameters $\boldsymbol{\theta}$ are

$$\boldsymbol{x} = [x_1, \ldots, x_d]^T \quad \text{and} \quad \boldsymbol{\theta} = [\theta_1, \ldots, \theta_d]^T.$$

This is equivalent to

$$g_{\boldsymbol{\theta}}(\boldsymbol{x}) = \boldsymbol{x}^T\boldsymbol{\theta} = \sum_{j=1}^{d} x_j \theta_j.$$
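As a quick sanity check, the inner-product form and the summation form give the same number. A minimal NumPy sketch (all values are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])        # features x_1, ..., x_d
theta = np.array([0.5, -1.0, 2.0])   # parameters theta_1, ..., theta_d

g_inner = x @ theta                                   # x^T theta
g_sum = sum(x[j] * theta[j] for j in range(x.size))   # sum_j x_j theta_j
print(g_inner, g_sum)                                 # both are 4.5
```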


Solving the Regression Problem

The (general) regression problem can be solved via the following logic:

Define a (Squared-Error) Loss Function

$$J(\boldsymbol{\theta}) = \sum_{n=1}^{N} L\big(g_{\boldsymbol{\theta}}(\boldsymbol{x}_n), y_n\big) = \sum_{n=1}^{N} \big(g_{\boldsymbol{\theta}}(\boldsymbol{x}_n) - y_n\big)^2, \quad \text{e.g., } L(\clubsuit, \spadesuit) \stackrel{\text{def}}{=} (\clubsuit - \spadesuit)^2.$$

Other loss functions can be used, e.g., $L(\clubsuit, \spadesuit) = |\clubsuit - \spadesuit|$.

The goal is to solve an optimization problem

$$\hat{\boldsymbol{\theta}} = \operatorname*{argmin}_{\boldsymbol{\theta}} \; J(\boldsymbol{\theta}).$$

The prediction for a new input $\boldsymbol{x}_{\text{new}}$ is $\hat{y}_{\text{new}} = g_{\hat{\boldsymbol{\theta}}}(\boldsymbol{x}_{\text{new}}) = \hat{\boldsymbol{\theta}}^T \boldsymbol{x}_{\text{new}}$.
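This define-the-loss, minimize, then predict recipe works with any numerical optimizer, before we specialize to the closed-form solution on the next slides. A hedged sketch with synthetic data (the array names and the use of SciPy's generic `minimize` are illustrative choices, not prescribed by the slides):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                     # rows are the inputs x_n^T
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=50)   # noisy measurements

def J(theta):
    """Squared-error loss J(theta) = sum_n (g_theta(x_n) - y_n)^2."""
    return np.sum((X @ theta - y) ** 2)

theta_hat = minimize(J, x0=np.zeros(3)).x        # argmin_theta J(theta)

x_new = np.array([0.2, 1.0, -0.3])
y_new = theta_hat @ x_new                        # prediction for a new input
print(theta_hat, y_new)
```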


Linear Regression Solution

The linear regression problem is a special case which we can solve analytically.

Restrict $g_{\boldsymbol{\theta}}(\cdot)$ to a line:

$$g_{\boldsymbol{\theta}}(\boldsymbol{x}_n) = \boldsymbol{\theta}^T\boldsymbol{x}_n.$$

Then the loss function becomes

$$J(\boldsymbol{\theta}) = \sum_{n=1}^{N} (\boldsymbol{\theta}^T\boldsymbol{x}_n - y_n)^2 = \|A\boldsymbol{\theta} - \boldsymbol{y}\|^2.$$

The matrix and vectors are defined as

$$A = \begin{bmatrix} (\boldsymbol{x}_1)^T \\ (\boldsymbol{x}_2)^T \\ \vdots \\ (\boldsymbol{x}_N)^T \end{bmatrix}, \quad \boldsymbol{\theta} = \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_d \end{bmatrix}, \quad \text{and} \quad \boldsymbol{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}.$$

$\|\cdot\|^2$ stands for the squared $\ell_2$-norm. See Tutorial on Linear Algebra.


Linear Regression Solution

Theorem

For a linear regression problem

$$\hat{\boldsymbol{\theta}} = \operatorname*{argmin}_{\boldsymbol{\theta}} \; J(\boldsymbol{\theta}) \stackrel{\text{def}}{=} \|A\boldsymbol{\theta} - \boldsymbol{y}\|^2,$$

the minimizer is

$$\hat{\boldsymbol{\theta}} = (A^T A)^{-1} A^T \boldsymbol{y}.$$

Take the derivative and set it to zero (see Tutorial on “Linear Algebra”):

$$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \left\{ \|A\boldsymbol{\theta} - \boldsymbol{y}\|^2 \right\} = 2 A^T (A\boldsymbol{\theta} - \boldsymbol{y}) = \boldsymbol{0}.$$

So the solution is $\hat{\boldsymbol{\theta}} = (A^T A)^{-1} A^T \boldsymbol{y}$, assuming $A^T A$ is invertible.
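A minimal NumPy check of the theorem on synthetic data (sizes and values are illustrative; in practice one would call `np.linalg.lstsq` rather than forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 4))                  # tall matrix, full column rank (almost surely)
theta_true = np.array([2.0, -1.0, 0.5, 3.0])
y = A @ theta_true + 0.05 * rng.normal(size=100)

# Closed-form solution: theta = (A^T A)^{-1} A^T y
theta_closed = np.linalg.solve(A.T @ A, A.T @ y)

# Library least-squares solver for comparison
theta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(theta_closed, theta_lstsq))  # True
```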


Examples

Example 1: Second-order polynomial fitting

$$y_n = a x_n^2 + b x_n + c$$

$$A = \begin{bmatrix} x_1^2 & x_1 & 1 \\ \vdots & \vdots & \vdots \\ x_N^2 & x_N & 1 \end{bmatrix}, \quad \boldsymbol{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}, \quad \boldsymbol{\theta} = \begin{bmatrix} a \\ b \\ c \end{bmatrix}$$

Example 2: Auto-regression

$$y_n = a y_{n-1} + b y_{n-2}$$

$$A = \begin{bmatrix} y_2 & y_1 \\ y_3 & y_2 \\ \vdots & \vdots \\ y_{N-1} & y_{N-2} \end{bmatrix}, \quad \boldsymbol{y} = \begin{bmatrix} y_3 \\ y_4 \\ \vdots \\ y_N \end{bmatrix}, \quad \boldsymbol{\theta} = \begin{bmatrix} a \\ b \end{bmatrix}$$
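Both examples reduce to building a design matrix and calling the same least-squares solver. A sketch in NumPy (the data are synthetic and the variable names illustrative):

```python
import numpy as np

# Example 1: second-order polynomial fit y_n = a x_n^2 + b x_n + c
x = np.linspace(0, 1, 20)
y = 2.0 * x**2 - 1.0 * x + 0.5                      # toy measurements (noise-free here)
A_poly = np.column_stack([x**2, x, np.ones_like(x)])
theta_poly, *_ = np.linalg.lstsq(A_poly, y, rcond=None)   # [a, b, c]

# Example 2: auto-regression y_n = a y_{n-1} + b y_{n-2}
A_ar = np.column_stack([y[1:-1], y[:-2]])           # rows [y_{n-1}, y_{n-2}] for n = 3, ..., N
y_ar = y[2:]                                        # targets y_3, ..., y_N
theta_ar, *_ = np.linalg.lstsq(A_ar, y_ar, rcond=None)    # [a, b]

print(theta_poly, theta_ar)
```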


Generalized Linear Regression

[Figure: several signal panels over $t \in [0, 10]$ illustrating the sinusoidal (Fourier) basis functions used in the example below.]

Example 1: Fourier series

$$\boldsymbol{x}_n = \begin{bmatrix} x_{n1} \\ x_{n2} \\ \vdots \\ x_{nd} \end{bmatrix} = \begin{bmatrix} \sin(\omega_0 t_n) \\ \sin(2\omega_0 t_n) \\ \vdots \\ \sin(K\omega_0 t_n) \end{bmatrix}$$

$$y_n = \boldsymbol{\theta}^T\boldsymbol{x}_n = \sum_{k=1}^{d} \theta_k \sin(k \omega_0 t_n)$$

$\theta_k$: $k$-th Fourier coefficient

$\sin(k\omega_0 t_n)$: $k$-th Fourier basis at time $t_n$
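The same least-squares machinery fits the Fourier coefficients once the design matrix is built from the sinusoidal basis. A hedged sketch (the fundamental frequency, number of harmonics, and toy signal are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 200)              # time samples t_n
omega0, K = 2 * np.pi / 10, 5            # fundamental frequency and number of basis terms

# Design matrix: column k is sin(k * omega0 * t_n)
A = np.column_stack([np.sin(k * omega0 * t) for k in range(1, K + 1)])

# Toy signal: first two harmonics plus noise
y = 1.0 * np.sin(omega0 * t) - 0.5 * np.sin(2 * omega0 * t) + 0.1 * rng.normal(size=t.size)

theta, *_ = np.linalg.lstsq(A, y, rcond=None)   # estimated Fourier coefficients
print(np.round(theta, 2))                       # close to [1.0, -0.5, 0, 0, 0]
```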




Linear Span

Given a set of vectors $\{\boldsymbol{a}_1, \ldots, \boldsymbol{a}_d\}$, the span is the set of all possible linear combinations of these vectors.

$$\operatorname{span}\{\boldsymbol{a}_1, \ldots, \boldsymbol{a}_d\} = \Bigg\{ \boldsymbol{z} \;\Bigg|\; \boldsymbol{z} = \sum_{j=1}^{d} \alpha_j \boldsymbol{a}_j \Bigg\} \tag{1}$$

Which of the following sets of vectors can span $\mathbb{R}^3$?
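The candidate sets from the slide's figure are not reproduced here, but such a question can be answered numerically: a set of vectors spans $\mathbb{R}^3$ exactly when the matrix holding them as columns has rank 3. A small sketch with made-up vectors:

```python
import numpy as np

# Illustrative candidate sets (columns are the vectors), not the ones from the slide
set1 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]]).T             # spans R^3
set2 = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0]]).T             # rank 2: does not span R^3
set3 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]).T  # four vectors, still spans R^3

for name, S in [("set1", set1), ("set2", set2), ("set3", set3)]:
    print(name, "spans R^3:", np.linalg.matrix_rank(S) == 3)
```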


Geometry of Linear Regression

Given $\boldsymbol{\theta}$, the product $A\boldsymbol{\theta}$ can be viewed as

$$A\boldsymbol{\theta} = \begin{bmatrix} | & | & & | \\ \boldsymbol{a}_1 & \boldsymbol{a}_2 & \cdots & \boldsymbol{a}_d \\ | & | & & | \end{bmatrix} \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} = \sum_{j=1}^{d} \theta_j \boldsymbol{a}_j.$$

So the set of all possible $A\boldsymbol{\theta}$'s is equivalent to $\operatorname{span}\{\boldsymbol{a}_1, \ldots, \boldsymbol{a}_d\}$. Define the range of $A$ as $\mathcal{R}(A) = \{\boldsymbol{y} \mid \boldsymbol{y} = A\boldsymbol{\theta} \text{ for some } \boldsymbol{\theta}\}$. Note that, in general, $\boldsymbol{y} \notin \mathcal{R}(A)$.


Orthogonality Principle

Consider the error $\boldsymbol{e} = \boldsymbol{y} - A\boldsymbol{\theta}$.

For the error to be minimized, it must be orthogonal to $\mathcal{R}(A)$, which is the span of the columns of $A$.

This orthogonality principle means that $\boldsymbol{a}_j^T \boldsymbol{e} = 0$ for all $j = 1, \ldots, d$, which implies $A^T \boldsymbol{e} = \boldsymbol{0}$.


Normal Equation

The orthogonality principle, which states that $A^T\boldsymbol{e} = \boldsymbol{0}$, implies that $A^T(\boldsymbol{y} - A\boldsymbol{\theta}) = \boldsymbol{0}$ by substituting $\boldsymbol{e} = \boldsymbol{y} - A\boldsymbol{\theta}$.

This is called the normal equation:

$$A^T A \boldsymbol{\theta} = A^T \boldsymbol{y}. \tag{2}$$

The predicted value is

$$\hat{\boldsymbol{y}} = A\hat{\boldsymbol{\theta}} = A(A^T A)^{-1} A^T \boldsymbol{y}$$

The matrix $P \stackrel{\text{def}}{=} A(A^T A)^{-1} A^T$ is a projection onto the span of $\{\boldsymbol{a}_1, \ldots, \boldsymbol{a}_d\}$, i.e., the range of $A$.

$P$ is called the projection matrix. It holds that $PP = P$.

The error $\boldsymbol{e} = \boldsymbol{y} - \hat{\boldsymbol{y}}$ is

$$\boldsymbol{e} = \boldsymbol{y} - A(A^T A)^{-1} A^T \boldsymbol{y} = (I - P)\boldsymbol{y}.$$
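A quick numerical check of these identities on a synthetic $A$ (a hedged sketch; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))              # tall matrix with full column rank (almost surely)
y = rng.normal(size=8)

P = A @ np.linalg.inv(A.T @ A) @ A.T     # projection onto the range of A
y_hat = P @ y                            # predicted value
e = (np.eye(8) - P) @ y                  # error (I - P) y

print(np.allclose(P @ P, P))             # PP = P
print(np.allclose(A.T @ e, 0))           # orthogonality: A^T e = 0
```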


Over-determined and Under-determined Systems

Assume $A$ has full rank.

Over-determined $A$: tall and skinny. $\hat{\boldsymbol{\theta}} = (A^T A)^{-1} A^T \boldsymbol{y}$.

Under-determined $A$: fat and short. $\hat{\boldsymbol{\theta}} = A^T (A A^T)^{-1} \boldsymbol{y}$.

If $A$ is under-determined, then there exists a non-trivial null space $\mathcal{N}(A) = \{\boldsymbol{\theta} \mid A\boldsymbol{\theta} = \boldsymbol{0}\}$.

This implies that if $\boldsymbol{\theta}$ is a solution, then $\boldsymbol{\theta} + \boldsymbol{\theta}_0$ is also a solution as long as $\boldsymbol{\theta}_0 \in \mathcal{N}(A)$. (Why?)
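A small sketch of the under-determined case, showing that adding any null-space vector to a solution leaves $A\boldsymbol{\theta}$ unchanged (sizes and data are illustrative):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 6))                  # fat: N = 3 equations, d = 6 unknowns
y = rng.normal(size=3)

theta = A.T @ np.linalg.inv(A @ A.T) @ y     # one particular solution
theta0 = null_space(A)[:, 0]                 # a non-trivial null-space direction

print(np.allclose(A @ theta, y))             # theta solves A theta = y
print(np.allclose(A @ (theta + theta0), y))  # so does theta + theta0
```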


Minimum-Norm Solution

Assume A is fat and has full row rank.

Since $A$ is fat, there exist infinitely many $\boldsymbol{\theta}$ such that $A\boldsymbol{\theta} = \boldsymbol{y}$.

So we need to pick one in order to obtain a unique solution.

It turns out that $\hat{\boldsymbol{\theta}} = A^T (A A^T)^{-1} \boldsymbol{y}$ is the solution to

$$\hat{\boldsymbol{\theta}} = \operatorname*{argmin}_{\boldsymbol{\theta}} \; \|\boldsymbol{\theta}\|^2 \quad \text{subject to} \quad A\boldsymbol{\theta} = \boldsymbol{y}. \tag{3}$$

(You can solve this problem using Lagrange multipliers. See Appendix.)

This is called the minimum-norm solution.
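As a hedged check, the formula can be compared against `np.linalg.lstsq`, which also returns the minimum-norm solution for under-determined systems (synthetic data, illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 6))                   # fat, full row rank (almost surely)
y = rng.normal(size=3)

theta_min = A.T @ np.linalg.inv(A @ A.T) @ y  # A^T (A A^T)^{-1} y
theta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.allclose(theta_min, theta_lstsq))    # True: both are the minimum-norm solution
print(np.allclose(A @ theta_min, y))          # the constraint A theta = y is satisfied
```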


What if Rank-Deficient?

If $A$ is rank-deficient, then $A^T A$ is not invertible.

Approach 1: Regularization. See Lecture 2.

Approach 2: Pseudo-inverse. Decompose $A = U S V^T$, where $U \in \mathbb{R}^{N \times N}$ with $U^T U = I$, and $V \in \mathbb{R}^{d \times d}$ with $V^T V = I$.

The diagonal block of $S \in \mathbb{R}^{N \times d}$ is $\operatorname{diag}\{s_1, \ldots, s_r, 0, \ldots, 0\}$.

The solution is given by the pseudo-inverse:

$$\hat{\boldsymbol{\theta}} = V S^+ U^T \boldsymbol{y}, \tag{4}$$

where $S^+ = \operatorname{diag}\{1/s_1, \ldots, 1/s_r, 0, \ldots, 0\}$.
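A sketch of the SVD-based solution on a deliberately rank-deficient $A$, compared against `np.linalg.pinv` (the example matrix is constructed by repeating a column; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(10, 2))
A = np.column_stack([B, B[:, 0]])          # third column repeats the first, so rank(A) = 2
y = rng.normal(size=10)

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # economy SVD: A = U S V^T

# Invert only the numerically nonzero singular values
tol = max(A.shape) * np.finfo(float).eps * s.max()
s_plus = np.zeros_like(s)
s_plus[s > tol] = 1.0 / s[s > tol]

theta_svd = Vt.T @ (s_plus * (U.T @ y))    # theta = V S^+ U^T y
theta_pinv = np.linalg.pinv(A) @ y         # library pseudo-inverse for comparison
print(np.allclose(theta_svd, theta_pinv))  # True
```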


Reading List

Linear Algebra

Gilbert Strang, Linear Algebra and Its Applications, 5th Edition.

Carl Meyer, Matrix Analysis and Applied Linear Algebra, SIAM, 2000.

Univ. Waterloo Matrix Cookbook. https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf

Linear Regression

Stanford CS 229 (Note on Linear Algebra) http://cs229.stanford.edu/section/cs229-linalg.pdf

Elements of Statistical Learning (Chapter 3.2) https://web.stanford.edu/~hastie/ElemStatLearn/

Learning from Data (Chapter 3.2) https://work.caltech.edu/telecourse


Appendix


Solving the Minimum-Norm Problem

Consider this problem

$$\hat{\boldsymbol{\theta}} = \operatorname*{argmin}_{\boldsymbol{\theta}} \; \|\boldsymbol{\theta}\|^2 \quad \text{subject to} \quad A\boldsymbol{\theta} = \boldsymbol{y}. \tag{5}$$

The Lagrangian is

$$\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\lambda}) = \|\boldsymbol{\theta}\|^2 + \boldsymbol{\lambda}^T (A\boldsymbol{\theta} - \boldsymbol{y}).$$

Taking derivatives with respect to $\boldsymbol{\theta}$ and $\boldsymbol{\lambda}$ yields

$$\nabla_{\boldsymbol{\theta}} \mathcal{L} = 2\boldsymbol{\theta} + A^T \boldsymbol{\lambda} = \boldsymbol{0}, \qquad \nabla_{\boldsymbol{\lambda}} \mathcal{L} = A\boldsymbol{\theta} - \boldsymbol{y} = \boldsymbol{0}.$$

The first equation gives us $\boldsymbol{\theta} = -A^T \boldsymbol{\lambda}/2$.

Substituting into the second equation yields $\boldsymbol{\lambda} = -2(A A^T)^{-1} \boldsymbol{y}$.

Therefore, $\hat{\boldsymbol{\theta}} = A^T (A A^T)^{-1} \boldsymbol{y}$.
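A small numerical verification of this derivation, comparing the closed-form expression against a generic constrained optimizer (a hedged sketch; SciPy's SLSQP solver and the problem sizes are illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 6))                     # fat, full row rank (almost surely)
y = rng.normal(size=3)

theta_closed = A.T @ np.linalg.inv(A @ A.T) @ y

# Solve min ||theta||^2 subject to A theta = y with a generic constrained solver
res = minimize(lambda th: th @ th, x0=np.zeros(6),
               constraints={"type": "eq", "fun": lambda th: A @ th - y})

print(np.allclose(theta_closed, res.x, atol=1e-6))   # True, up to solver tolerance
```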
