

A KTEC Center of Excellence 1

Basis Expansions and Regularization Continued
From Elements of Statistical Learning

(CH. 5 Part 2)

Speaker: Brian Quanz ([email protected])

7/3/2008


A KTEC Center of Excellence 2

Overview

• Nonparametric Logistic Regression

• Multidimensional Splines

• Regularization and Reproducing Kernel Hilbert Spaces

• Wavelet Smoothing


A KTEC Center of Excellence 3

Review: Logistic Regression

• Goal: fit a logistic curve to the data, using an iterative procedure to compute the maximum-likelihood parameters.

• Uses the logistic (a.k.a. sigmoid) function, shown below:
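For reference, the logistic function referenced above is

```latex
f(x) = \frac{1}{1 + e^{-x}} .
```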

• Can be used to associate probabilities with a discriminative classifier (i.e. P(Class=1|X=x)).

• For example, sigmoid fit is used with Support Vector Machines (SVM), where x is distance from separating hyperplane, to assign probabilities to a classification.

• H.-T. Lin, C.-J. Lin, and R. C. Weng. A note on Platt's probabilistic outputs for support vector machines. Technical report, Department of Computer Science, National Taiwan University, 2003. URL http://www.csie.ntu.edu.tw/~cjlin/papers/plattprob.ps.

[Figure: plot of the sigmoid f(x) against x, indicating P(Class = 1|X = x) and P(Class = 2|X = x). Original sigmoid image taken from wikipedia.org.]


A KTEC Center of Excellence 4

Review: Logistic Regression - Fitting

• Maximum Likelihood (two-class case):

• Sample data consists of samples x_1, x_2, …, x_n with labels y_1, y_2, …, y_n, where each x_i has dimension p and each y_i is 0 or 1.

• Maximize Prob(parameters and data) = P(B; X; Y) = P(Y|B; X) P(B; X).

• L(B) = P(Y|B; X) is called the likelihood function.

• Then, assuming the samples are IID and taking the log to simplify gives the log-likelihood function (shown below).

• Goal: find the B that maximizes ℓ(B); take the derivative to obtain the score functions (shown below):

– Text uses Newton-Raphson algorithm to find zeros
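For reference, the standard two-class forms (ESL §4.4), where each x_i includes a constant 1 for the intercept and p_i = P(y_i = 1 | x_i) = 1/(1 + e^{-B^T x_i}):

```latex
\ell(B) = \sum_{i=1}^{n} \Bigl\{ y_i\, B^T x_i - \log\bigl(1 + e^{B^T x_i}\bigr) \Bigr\},
\qquad
\frac{\partial \ell(B)}{\partial B} = \sum_{i=1}^{n} x_i\,(y_i - p_i) = 0 .
```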


A KTEC Center of Excellence 5

Logistic Regression – Newton-Raphson Method

• Newton-Raphson finds the zeros of an arbitrary function.

• Approximate the function at a starting point by its tangent; find the tangent's x-intercept to obtain a new starting point; repeat.

• Likely to converge here, since the log-likelihood is concave.

• Update rule for logistic regression: B^new = B^old + (X^T W X)^{-1} X^T (y − p), where W = diag(p_i(1 − p_i)) (a Python sketch follows below).

*Image taken from wikipedia.org

(f(x_n) − 0)/(x_n − x_{n+1}) = f′(x_n)  ⇒  x_{n+1} = x_n − f(x_n)/f′(x_n)
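A minimal Python sketch of this Newton-Raphson update for logistic regression; the synthetic data and all names are illustrative, not from the slides:

```python
import numpy as np

def fit_logistic(X, y, n_iter=25, tol=1e-8):
    """Newton-Raphson for two-class logistic regression.
    X: (n, p) design matrix (intercept column added inside);
    y: (n,) array of 0/1 labels."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))       # current probabilities
        W = p * (1.0 - p)                          # diagonal of weight matrix
        # Newton step: beta <- beta + (X^T W X)^{-1} X^T (y - p)
        step = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < tol:             # stop when updates stall
            break
    return beta

# Tiny usage example on synthetic data: P(y=1|x) = sigmoid(0.5 + 1.5*x1 - 2*x2)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -2.0]) + 0.5 + rng.logistic(size=200) > 0).astype(float)
print(fit_logistic(X, y))  # should roughly recover (0.5, 1.5, -2.0)
```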


A KTEC Center of Excellence 6

Nonparametric Logistic Regression

• No longer fix the log-odds to be linear; allow a smoother fit (see the forms below):

• Fit f(x) smoothly to allow a smoother conditional probability function.

• As with the smoothing spline, penalize curvature (shown below).

• As with smoothing splines, the optimal f is a finite-dimensional natural spline with knots at the unique values of x, and we can define the expansion shown below.
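For reference, the standard forms from ESL §5.6: the smooth log-odds model, the penalized log-likelihood, and the natural-spline expansion of the optimal f:

```latex
\log\frac{P(Y=1\mid X=x)}{P(Y=0\mid X=x)} = f(x),
\qquad
\ell(f;\lambda) = \sum_{i=1}^{N}\Bigl[y_i f(x_i) - \log\bigl(1 + e^{f(x_i)}\bigr)\Bigr]
 - \tfrac{1}{2}\lambda\!\int\!\{f''(t)\}^2\,dt,
\qquad
f(x) = \sum_{j=1}^{N} N_j(x)\,\theta_j .
```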


A KTEC Center of Excellence 7

Nonparametric Logistic Regression

• Which implies the score and Hessian shown below.

• Here p is the N-vector with elements p_i, as defined previously.

• And W is the diagonal matrix with entries P(Y = 1|X = x_i)(1 − P(Y = 1|X = x_i)).

• Using the Newton-Raphson update as before gives the update shown below.
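The standard expressions (ESL §5.6), with N the natural-spline basis matrix, Ω the penalty matrix with entries Ω_{jk} = ∫ N_j''(t) N_k''(t) dt, and z the working response:

```latex
\frac{\partial \ell(\theta)}{\partial \theta} = \mathbf{N}^T(\mathbf{y}-\mathbf{p}) - \lambda\,\boldsymbol{\Omega}\,\theta,
\qquad
\frac{\partial^2 \ell(\theta)}{\partial \theta\,\partial \theta^T} = -\,\mathbf{N}^T\mathbf{W}\mathbf{N} - \lambda\,\boldsymbol{\Omega},
```

```latex
\theta^{\mathrm{new}} = \bigl(\mathbf{N}^T\mathbf{W}\mathbf{N} + \lambda\,\boldsymbol{\Omega}\bigr)^{-1}\mathbf{N}^T\mathbf{W}\,\mathbf{z},
\qquad
\mathbf{z} = \mathbf{N}\theta^{\mathrm{old}} + \mathbf{W}^{-1}(\mathbf{y}-\mathbf{p}) .
```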


A KTEC Center of Excellence 8

Multidimensional Splines

• There are many options for constructing multidimensional splines.

• The most basic, not even covered by the text, is additive: simply add together the spline bases for the different dimensions.

• The tensor product basis combines the bases from different dimensions by forming all possible products, taking one basis function from each dimension (example below).

• Tensor product basis:
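The standard construction (ESL §5.7): given bases h_{1j}(X_1), j = 1, …, M_1, and h_{2k}(X_2), k = 1, …, M_2, the M_1 × M_2-dimensional tensor product basis and the resulting fit are

```latex
g_{jk}(X) = h_{1j}(X_1)\,h_{2k}(X_2),
\qquad
g(X) = \sum_{j=1}^{M_1}\sum_{k=1}^{M_2} \theta_{jk}\,g_{jk}(X) .
```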


A KTEC Center of Excellence 9

Multidimensional Splines – Tensor Product

• Simply expressed as a new basis, so the same fitting applies as before, e.g. least squares.

• With increasing dimension, the resulting basis dimension grows exponentially

• Selecting only the important basis functions, to keep the problem tractable, is discussed in Ch. 9.

*Image taken from the book
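A minimal Python sketch of the tensor-product construction; a plain polynomial basis stands in for the spline bases of the text, and all function names are illustrative:

```python
import numpy as np

def poly_basis(x, degree=3):
    """One-dimensional polynomial basis 1, x, ..., x^degree.
    (Stands in for a B-spline or natural-spline basis.)"""
    return np.vander(x, degree + 1, increasing=True)

def tensor_product_basis(x1, x2, degree=3):
    """All pairwise products h_{1j}(x1) * h_{2k}(x2), one factor per dimension."""
    H1 = poly_basis(x1, degree)                    # (n, M1)
    H2 = poly_basis(x2, degree)                    # (n, M2)
    n = x1.shape[0]
    return np.einsum('nj,nk->njk', H1, H2).reshape(n, -1)  # (n, M1*M2)

# Fit by ordinary least squares, exactly as with any other basis expansion
rng = np.random.default_rng(1)
x1, x2 = rng.uniform(size=300), rng.uniform(size=300)
y = np.sin(3 * x1) * np.cos(3 * x2) + 0.1 * rng.normal(size=300)
G = tensor_product_basis(x1, x2)
theta, *_ = np.linalg.lstsq(G, y, rcond=None)
print(G.shape, theta.shape)  # (300, 16) (16,): the basis grows as M1 * M2
```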


A KTEC Center of Excellence 10

Example Comparison: Additive and Tensor Natural Splines

[Figure taken from the book: the left panel shows the fit using an additive natural-spline basis; the right panel shows the fit using a tensor product natural-spline basis.]


A KTEC Center of Excellence 11

Smoothing Splines of Higher Dimension

• Same problem as before, except x now has d dimensions (shown below).

• J is an appropriate penalty. The text's example of a two-dimensional penalty, extending the one-dimensional penalty presented previously, is also shown below.

• This optimization results in a thin-plate spline, which shares many properties with previously presented smoothing splines

• Thin-plate splines can be generalized to higher dimensions by using the appropriate penalty J
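The standard forms (ESL §5.7): the d-dimensional problem and the text's two-dimensional curvature penalty:

```latex
\min_f \sum_{i=1}^{N} \{y_i - f(x_i)\}^2 + \lambda J[f],
\qquad
J[f] = \iint_{\mathbb{R}^2}
\Bigl[\Bigl(\frac{\partial^2 f}{\partial x_1^2}\Bigr)^{\!2}
 + 2\Bigl(\frac{\partial^2 f}{\partial x_1\,\partial x_2}\Bigr)^{\!2}
 + \Bigl(\frac{\partial^2 f}{\partial x_2^2}\Bigr)^{\!2}\Bigr]\,dx_1\,dx_2 .
```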


A KTEC Center of Excellence 12

Thin-Plate Splines

• Properties similar to the 1-D smoothing spline:

• As λ → 0, the solution approaches an interpolating function.

• As λ → ∞, the solution approaches the least-squares linear fit.

• For intermediate λ, the solution is expressed as a linear expansion of basis functions, with coefficients obtained from a generalized ridge regression.

• The h_j are in fact radial basis functions, as discussed in the previous presentation (see the form below).

• The O(N³) computational complexity can be reduced by choosing a subset of K < N knots, resulting in O(NK² + K³).
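The standard solution form (ESL §5.7), an expansion in radial basis functions:

```latex
f(x) = \beta_0 + \beta^T x + \sum_{j=1}^{N} \alpha_j\,h_j(x),
\qquad
h_j(x) = \eta(\lVert x - x_j\rVert),
\quad
\eta(z) = z^2 \log z^2 .
```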


A KTEC Center of Excellence 13

Thin-Plate Spline Example:

*Image taken from the book


A KTEC Center of Excellence 14

Additional Multidimensional Splines

• In general, there are many possibilities for multidimensional splines: we can use any suitably large basis expansion, of any basis type, together with a suitable regularizer (e.g. tensor products of B-splines).

• Additive splines are just one class, arising from an additive penalty (the f_j are univariate splines): J[f] = Σ_j ∫ f_j''(t_j)² dt_j, for f(X) = Σ_j f_j(X_j).

• This can be extended to bases of functions with higher-order interactions, e.g. f(X) = Σ_j f_j(X_j) + Σ_{j<k} f_{jk}(X_j, X_k) + ⋯ (an ANOVA spline decomposition).

• Many choices: the maximum interaction order, which terms to include, the basis type, etc. Automatic selection methods may be preferred (Ch. 9 and 10).


A KTEC Center of Excellence 15

Overview

• Nonparametric Logistic Regression

• Multidimensional Splines

• Regularization and Reproducing Kernel Hilbert Spaces

• Wavelet Smoothing


A KTEC Center of Excellence 16

Regularization and Reproducing Kernel Hilbert Spaces

“This section is quite technical and can be skipped by the disinterested or intimidated reader”

• Idea is to generalize the fitting/regression problem as much as possible

• Motivation: a truly general penalty.

• Start by considering abstract vector spaces, where vectors can represent any number of objects (points in Euclidean space, functions, graphs, etc.); as long as certain conditions are met, the same rules apply to all.

This leads to the general problem min_{f∈H} Σ_{i=1}^N L(y_i, f(x_i)) + λJ(f), where L is the loss function, J(f) is the penalty functional, and H is the space of functions on which J(f) is defined.


A KTEC Center of Excellence 17

A General Penalty

• One set of general penalties that has been proposed has the form shown below.

• Here f̃(s) denotes the Fourier transform of f, and G̃(s) is positive and approaches 0 for large ‖s‖, so that high-frequency components are penalized more heavily. Under additional assumptions this has the solution shown below.

• The φ_j span the null space of the penalty J (the set of functions with J(f) = 0).
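For reference, the standard forms (ESL §5.8.1): the penalty, and its solution in terms of G, the inverse Fourier transform of G̃:

```latex
J(f) = \int_{\mathbb{R}^d} \frac{|\tilde f(s)|^2}{\tilde G(s)}\,ds,
\qquad
f(X) = \sum_{j=1}^{K} \alpha_j\,\phi_j(X) + \sum_{i=1}^{N} \theta_i\,G(X - x_i) .
```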


A KTEC Center of Excellence 18

Hilbert Spaces: Introduction

• An example to introduce Hilbert spaces, which is also closely related to wavelets.

• Recall Fourier series: every continuous function f(x) defined on an interval of length L, 0 < x < L (let D denote this set of functions), can be expanded as a sine series (shown below).

*Ideas of introduction taken from course notes by Professor Edwin Langmann, “Notes on Hilbert space theory” 2006. http://courses.theophys.kth.se/5A1305/hil1.pdf

• This defines a vector space: given f, g in D and real constants a, b, the function h = af + bg, defined by h(x) = a f(x) + b g(x), is again a continuous function (an element of D), fulfilling the axioms of a vector space.

• This Fourier series representation can also be expressed in terms of orthonormal functions u_n (shown below):
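The two equivalent expansions, with the usual normalization of the basis functions:

```latex
f(x) = \sum_{n=1}^{\infty} b_n \sin\!\Bigl(\frac{n\pi x}{L}\Bigr)
\qquad\Longleftrightarrow\qquad
f = \sum_{n=1}^{\infty} f_n\,u_n,
\quad
u_n(x) = \sqrt{\tfrac{2}{L}}\,\sin\!\Bigl(\frac{n\pi x}{L}\Bigr) .
```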


A KTEC Center of Excellence 19

Hilbert Spaces: Introduction

• Thus the u_n are a set of special functions in D, and from Fourier series theory every element of D can be written as a linear combination of them.

• Immediate analogy with R^N: any vector v can be written v = Σ_n v_n e_n in the standard basis {e_n}.

• Also, we can compute components with a scalar product: v_n = ⟨e_n, v⟩.

• A component of f has the analogous formula shown below.

• This can be easily shown (see below).
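The component formula and its one-line verification (using the scalar product and orthonormality defined on the next slide):

```latex
f_n = \langle u_n, f\rangle,
\qquad
\langle u_n, f\rangle
= \Bigl\langle u_n, \sum_{m} f_m u_m \Bigr\rangle
= \sum_{m} f_m \langle u_n, u_m\rangle
= f_n .
```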


A KTEC Center of Excellence 20

Hilbert Spaces: Introduction

• This is in fact the same as the expression for Fourier series components, and can be shown in the same way.

• We can define the scalar product of two functions in D as shown below.

• The u_n are orthonormal in the same sense (shown below).
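The standard definitions: the scalar product on D and the orthonormality of the u_n:

```latex
\langle f, g\rangle = \int_0^L f(x)\,g(x)\,dx,
\qquad
\langle u_m, u_n\rangle = \delta_{mn} =
\begin{cases} 1, & m = n,\\ 0, & m \neq n. \end{cases}
```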


A KTEC Center of Excellence 21

Hilbert Spaces: Introduction

• Question of completeness: can every function in D be represented as a combination of the u_n? In general, orthogonality of a system of functions v_n shows that the truncated expansion Σ_{n=1}^M f_n v_n is the best M-term approximation of f.

• We say a system is complete if the error ‖f − Σ_{n=1}^M f_n v_n‖ goes to 0 as M goes to infinity.

• Thus the u_n form an orthonormal basis of D, which is of infinite dimension.

• This is an infinite-dimensional vector space. Both it and the Euclidean space R^N are special cases of the theory of Hilbert spaces.

• Hilbert spaces provide a generalized way of treating many different types of vector spaces.


A KTEC Center of Excellence 22

Hilbert Spaces: Definition

• A Hilbert space is an inner product space that is complete under the norm defined by the inner product ⟨·, ·⟩, namely ‖v‖ = ⟨v, v⟩^{1/2}.

• “Complete” means that if a sequence of vectors approaches a limit, then that limit is in the space as well.

• For example, the real numbers are complete while the rational numbers are not, since some rational sequences approach irrational numbers like √2.

• An inner product space is a vector space, of arbitrary dimension, equipped with an inner product, which associates a scalar quantity with each pair of vectors.

• A vector space is a collection of objects with vector addition and scalar multiplication operations satisfying 8 axioms, such as the operations being associative, commutative, and distributive, and an identity element existing.


A KTEC Center of Excellence 23

Reproducing Kernel Hilbert Space

*Taken from slides by Dr. Christian Igel: http://www.neuroinformatik.ruhr-uni-bochum.de/PEOPLE/igel/LT/LT2.pdf


A KTEC Center of Excellence 24

Reproducing Kernel Hilbert Space

*Taken from slides by Dr. Christian Igel: http://www.neuroinformatik.ruhr-uni-bochum.de/PEOPLE/igel/LT/LT2.pdf


A KTEC Center of Excellence 25

Reproducing Kernel Hilbert Space

• Moore-Aronszajn Theorem: for every positive definite function K(·, ·) on X × X there exists a unique RKHS, and vice versa.

• This allows us to apply ideas from Euclidean geometry to non-geometric problems, so long as we can define a suitable kernel K(·, ·).

*Taken from slides by Dr. Christian Igel: http://www.neuroinformatik.ruhr-uni-bochum.de/PEOPLE/igel/LT/LT2.pdf


A KTEC Center of Excellence 26

Results Presented in the Text about RKHS

• The text considers an important subclass of the general problem, min_{f∈H} Σ_i L(y_i, f(x_i)) + λJ(f), for which H is the space of functions H_K generated by a positive definite kernel K(x, y): a reproducing kernel Hilbert space (RKHS).

• Suppose K has the eigen-expansion shown below.

• Elements of H_K are then expanded as shown below.
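The standard expansions (ESL §5.8.1):

```latex
K(x,y) = \sum_{i=1}^{\infty} \gamma_i\,\phi_i(x)\,\phi_i(y),
\qquad
\gamma_i \ge 0,\;\; \sum_{i=1}^{\infty} \gamma_i^2 < \infty,
\qquad
f(x) = \sum_{i=1}^{\infty} c_i\,\phi_i(x) .
```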


A KTEC Center of Excellence 27

Results Presented in the Text about RKHS

• We can define the penalty as shown below, which can be interpreted as a generalized ridge penalty, in which eigenfunctions with large eigenvalues are penalized less.

• It can be shown that the solution has the finite-dimensional form shown below.

• It consists of the basis functions K(·, x_i), known as the representers of evaluation at x_i.
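The standard forms (ESL §5.8.1): the RKHS-norm penalty and the finite-dimensional (representer) solution:

```latex
J(f) = \lVert f\rVert_{\mathcal{H}_K}^2 = \sum_{i=1}^{\infty} \frac{c_i^2}{\gamma_i},
\qquad
f(x) = \sum_{i=1}^{N} \alpha_i\,K(x, x_i) .
```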


A KTEC Center of Excellence 28

Results Presented in the Text about RKHS

• Then, by the reproducing property ⟨K(·, x_i), f⟩_{H_K} = f(x_i), the objective function reduces to the finite-dimensional problem shown below, known as the kernel property in the support vector machine literature.

• We can have the penalty apply to only a subspace of the functions in H, by penalizing only the projection of each function onto that subspace.

• The solution then has the form shown below (the first term represents the expansion in the unpenalized subspace H_0).
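The standard reduced criterion, and the solution form when an unpenalized subspace H_0, spanned by functions h_j, is included (ESL §5.8.1):

```latex
\min_{\alpha}\; L(\mathbf{y}, \mathbf{K}\alpha) + \lambda\,\alpha^T\mathbf{K}\,\alpha,
\qquad
f(x) = \sum_{j} \beta_j\,h_j(x) + \sum_{i=1}^{N} \alpha_i\,K(x, x_i) .
```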


A KTEC Center of Excellence 29

RKHS Examples

• Squared-error loss: the criterion is shown below.

• This is a generalized ridge regression; the solution is obtained as shown below.
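For reference, the standard forms (ESL §5.8.2):

```latex
\min_{\alpha}\; (\mathbf{y}-\mathbf{K}\alpha)^T(\mathbf{y}-\mathbf{K}\alpha) + \lambda\,\alpha^T\mathbf{K}\,\alpha,
\qquad
\hat\alpha = (\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{y},
\qquad
\hat f(x) = \sum_{j=1}^{N} K(x, x_j)\,\hat\alpha_j .
```

A minimal Python sketch of this generalized ridge ("kernel ridge") solution, using a Gaussian kernel; the toy data and all names are illustrative:

```python
import numpy as np

def rbf_kernel(A, B, nu=1.0):
    """Gaussian kernel K(a, b) = exp(-nu * ||a - b||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-nu * sq)

def kernel_ridge_fit(X, y, lam=0.1, nu=1.0):
    """alpha-hat = (K + lam I)^{-1} y, the generalized ridge solution."""
    K = rbf_kernel(X, X, nu)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_new, X, alpha, nu=1.0):
    """f-hat(x) = sum_j K(x, x_j) alpha_j."""
    return rbf_kernel(X_new, X, nu) @ alpha

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(100, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=100)
alpha = kernel_ridge_fit(X, y)
print(kernel_ridge_predict(np.array([[0.5]]), X, alpha))  # roughly sin(1.0)
```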


A KTEC Center of Excellence 30

RKHS Examples

• Penalized Polynomial Regression

• Example: the degree-d polynomial kernel, shown below.

• Objective function: the penalized squared-error criterion, shown below.

• By substitution, this can be expressed as the squared-error loss problem from the previous slide.
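The standard example (ESL §5.8.2): the degree-d polynomial kernel on R^p has M = C(p+d, d) eigenfunctions, spanning the polynomials of total degree d; for p = 2, d = 2 it expands as below, and the objective is the penalized squared-error criterion:

```latex
K(x,y) = \bigl(\langle x, y\rangle + 1\bigr)^d,
\qquad
(\langle x, y\rangle + 1)^2 = 1 + 2x_1y_1 + 2x_2y_2 + x_1^2y_1^2 + x_2^2y_2^2 + 2x_1x_2y_1y_2,
```

```latex
\min_{f \in \mathcal{H}_K}\; \sum_{i=1}^{N} \bigl(y_i - f(x_i)\bigr)^2 + \lambda\,\lVert f\rVert_{\mathcal{H}_K}^2 .
```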


A KTEC Center of Excellence 31

RKHS Examples

• Gaussian Radial Basis Functions: kernel and solution shown below.

• This is an expansion in radial basis functions. As we saw earlier, a thin-plate spline was also an expansion in radial basis functions
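The standard forms: the Gaussian kernel and the resulting radial-basis expansion of the solution:

```latex
K(x,y) = e^{-\nu\,\lVert x-y\rVert^2},
\qquad
\hat f(x) = \sum_{j=1}^{N} \hat\alpha_j\, e^{-\nu\,\lVert x-x_j\rVert^2} .
```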


A KTEC Center of Excellence 32

RKHS Examples

• Support Vector Classifiers (ch. 12)
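A hedged sketch of the form treated in Ch. 12: with labels y_i ∈ {−1, +1} and f(x) = β_0 + Σ_i α_i K(x, x_i), the support vector classifier minimizes a hinge-loss criterion of roughly the form below (with the intercept β_0 unpenalized):

```latex
\min_{\beta_0,\,f}\; \sum_{i=1}^{N} \bigl[\,1 - y_i f(x_i)\,\bigr]_{+} + \frac{\lambda}{2}\,\lVert f\rVert_{\mathcal{H}_K}^2 .
```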


A KTEC Center of Excellence 33

Wavelet Smoothing

• Similar idea to the Fourier series representation, except wavelets are localized in both time and frequency

• We have a complete dictionary of orthonormal basis functions to represent functions

• Sparse representation is obtained by shrinking and selecting the coefficients of the basis functions, as we’ve seen before


A KTEC Center of Excellence 34

Wavelet example

• Fits the basis coefficients by least squares, then thresholds the smaller coefficients, like the Lasso.


A KTEC Center of Excellence 35

Wavelet Derivation

• We define father and mother wavelets; the rest are then created from them by translations and dilations, which increase the frequency, much as with Fourier series (shown below):
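The standard construction: translated and dilated copies of the father wavelet φ and mother wavelet ψ:

```latex
\phi_{j,k}(x) = 2^{j/2}\,\phi(2^j x - k),
\qquad
\psi_{j,k}(x) = 2^{j/2}\,\psi(2^j x - k),
\qquad j, k \in \mathbb{Z} .
```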


A KTEC Center of Excellence 36

Wavelet Derivation

• For example, for the Haar wavelet:

• Father wavelet: shown below.

• Build the orthogonal mother wavelet: shown below.

• All of these basis functions are orthonormal.

• The father wavelets form the basis for the rough components of a function, and the orthogonal mother wavelets build up the detail (see the expansion below).

• The Haar wavelet is often too coarse; many smoother wavelets have been invented, such as the symlet.
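For reference, the standard Haar father and mother wavelets, and the resulting expansion into rough plus detail components:

```latex
\phi(x) = \begin{cases} 1, & 0 \le x \le 1,\\ 0, & \text{otherwise,} \end{cases}
\qquad
\psi(x) = \begin{cases} 1, & 0 \le x < \tfrac12,\\ -1, & \tfrac12 \le x \le 1,\\ 0, & \text{otherwise,} \end{cases}
```

```latex
f(x) = \sum_{k} c_k\,\phi(x - k) + \sum_{j \ge 0}\sum_{k} d_{j,k}\,\psi_{j,k}(x) .
```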


A KTEC Center of Excellence 37

Wavelet Smoothing Example

FIGURE 5.14. The top panel shows an NMR signal, with the wavelet-shrunk version superimposed in green. The lower left panel represents the wavelet transform of the original signal, down to V4, using the symmlet-8 basis. Each coefficient is represented by the height (positive or negative) of the vertical bar. The lower right panel represents the wavelet coefficients after being shrunken using the waveshrink function in S-PLUS, which implements the SureShrink method of wavelet adaptation of Donoho and Johnstone.


A KTEC Center of Excellence 38

Adaptive Wavelet Filtering

• Suppose y is the response vector measured on a lattice of N uniformly spaced points, and W is the N×N orthonormal wavelet basis matrix evaluated at those points. Then y* = W^T y is called the wavelet transform of y.

• A popular method for adaptive wavelet fitting is known as SURE shrinkage (Stein Unbiased Risk Estimation); its criterion is shown below.

• This is the same as the previously seen Lasso criterion.

• Because W is orthonormal, there is a simple solution (shown below).

• The fitted function is obtained from the inverse wavelet transform: f̂ = Wθ̂.
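For reference, the standard SURE shrinkage forms (ESL §5.9.2): the Lasso-type criterion, its soft-threshold solution, and the inverse transform:

```latex
\min_{\theta}\; \lVert \mathbf{y} - \mathbf{W}\theta \rVert_2^2 + 2\lambda\,\lVert \theta\rVert_1,
\qquad
\hat\theta_j = \operatorname{sign}(y_j^{*})\,\bigl(|y_j^{*}| - \lambda\bigr)_{+},
\qquad
\hat{\mathbf{f}} = \mathbf{W}\hat\theta .
```

A Python sketch in this spirit using the PyWavelets library (pywt); the universal threshold λ = σ√(2 log N) and the median-based estimate of σ follow Donoho and Johnstone, while the function and variable names here are illustrative (this is not the S-PLUS waveshrink function):

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_shrink(y, wavelet='sym8', level=4):
    """Soft-threshold the wavelet coefficients of y at the universal
    threshold lambda = sigma * sqrt(2 log N), then invert the transform."""
    coeffs = pywt.wavedec(y, wavelet, level=level)        # y* = W^T y
    # Estimate sigma from the finest-scale detail coefficients (MAD / 0.6745)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    lam = sigma * np.sqrt(2.0 * np.log(len(y)))
    # Keep the coarse approximation; soft-threshold every detail level
    shrunk = [coeffs[0]] + [pywt.threshold(c, lam, mode='soft')
                            for c in coeffs[1:]]
    return pywt.waverec(shrunk, wavelet)                  # f-hat = W theta-hat

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 1024)
signal = np.sin(8 * np.pi * t) * (t > 0.3)                # piecewise-smooth signal
y = signal + 0.2 * rng.normal(size=t.size)
f_hat = wavelet_shrink(y)
# Shrinkage should reduce the error relative to the noisy signal
print(np.mean((f_hat - signal) ** 2) < np.mean((y - signal) ** 2))
```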


A KTEC Center of Excellence 39

Wavelets: Key Idea

• In general, any basis could be used, such as the smoothing splines we’ve seen before.

• The key difference is that wavelets are localized in both time and frequency (roughness), and together with the L1 penalty this allows sparse solutions.

• Smoothing splines compress by imposing smoothness; wavelets compress by imposing sparsity.


A KTEC Center of Excellence 40

Wavelet Compared to Smoothing Spline


A KTEC Center of Excellence 41

Wavelet Compared to Smoothing Spline


A KTEC Center of Excellence 42

The End

• Questions?