
Analyzing Task Driven Learning Algorithms: Mid-Year Status

Mike Pekala

December 6, 2011

Advisor: Prof. Doron Levy (dlevy at math.umd.edu)

UMD Dept. of Mathematics & Center for Scientific Computation and Mathematical Modeling (CSCAMM)


Today

1. Overview and Status: Overview; Schedule

2. Least Angle Regression: Geometric View; Rank 1 Updating; Validation

3. Dictionary Learning: Preliminary Results

4. Summary & Next Steps


Overview

The underlying notion of sparse coding is that, in many domains, data vectors can be concisely represented as a sparse linear combination of basis elements or dictionary atoms. Recent results suggest that, for many tasks, performance improvements can be obtained by explicitly learning dictionaries directly from the data (vs. using predefined dictionaries, such as wavelets). Further results suggest that additional gains are possible by jointly optimizing the dictionary for both the data and the task (e.g. classification, denoising).

Consider the Task Driven Learning algorithm [Mairal et al., 2010]:

Outer loop: Stochastic gradient descent to learn dictionary atoms and the classification weight vector. We talked about this at the kickoff.

Inner loop: Sparse approximation via penalized least squares. The authors use the Least Angle Regression (LARS) algorithm for this purpose. This was left vague at the kickoff; we'll talk about it a bit more today. (A schematic sketch of the overall outer/inner structure follows below.)
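A schematic sketch of that outer/inner structure, assuming for illustration a purely reconstructive gradient step for the dictionary (the true task-driven gradient of Mairal et al. [2010] also differentiates the task loss through the sparse coding step), and using scikit-learn's LARS-based Lasso solver for the inner loop rather than the project's own LARS code:

# Schematic sketch only: inner loop = sparse coding via LARS/Lasso,
# outer loop = stochastic gradient step on the dictionary. The dictionary
# update below is a plain reconstructive step, a simplification of the
# task-driven gradient derived in Mairal et al. (2010).
import numpy as np
from sklearn.linear_model import LassoLars

def sgd_dictionary_learning(X, n_atoms=30, lam=0.01, lr=1e-2, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    D = rng.standard_normal((n_features, n_atoms))
    D /= np.linalg.norm(D, axis=0)                    # unit-norm atoms
    for _ in range(n_iters):
        x = X[rng.integers(len(X))]                   # draw one sample (stochastic step)
        # Inner loop: sparse approximation (penalized least squares) via LARS/Lasso
        alpha = LassoLars(alpha=lam, fit_intercept=False).fit(D, x).coef_
        # Outer loop: gradient step on the reconstruction error w.r.t. D
        residual = x - D @ alpha
        D += lr * np.outer(residual, alpha)
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)   # re-normalize atoms
    return D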


Schedule and Milestones

Schedule and milestones from the kickoff:

Phase I: Algorithm development (Sept 23 - Jan 15)

  Phase Ia: Implement LARS (Sept 23 ∼ Oct 24)
    X Milestone: LARS code available

  Phase Ib: Validate LARS (Oct 24 ∼ Nov 14)
    X Milestone: results on diabetes data and hand-crafted problems

  Phase Ic: Implement SGD framework (Nov 14 ∼ Dec 15)
    90% Milestone: Initial SGD code available

  Phase Id: Validate SGD framework (Dec 15 ∼ Jan 15)
    25% Milestone: TDDL results on MNIST/USPS

Phase II: Analysis on new data sets (Jan 15 - May 1)

  Milestone: Preliminary results on selected dataset (∼ Mar 1)
  Milestone: Final report and presentation (∼ May 1)


Least Angle Regression


Problem: Constrained Least Squares

Recall the Lasso: given X = [x_1, . . . , x_m] ∈ R^{n×m}, t ∈ R_+, solve:

min_β ||y − Xβ||_2^2   s.t.   ||β||_1 ≤ t

which has an equivalent unconstrained formulation:

min_β ||y − Xβ||_2^2 + λ||β||_1

for some scalar λ ≥ 0. The L1 penalty improves upon OLS by introducing parsimony (feature selection) and regularization (improved generality).

There are multiple ways to solve this problem:

1 Directly, via convex optimization (can be expensive)

2 Iterative techniques

Forward selection (“matching pursuit”), forward stagewise, others
Least Angle Regression (LARS) [Efron et al., 2004]
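As a point of reference, the unconstrained formulation can be solved end-to-end with an existing LARS implementation; a minimal sketch on synthetic data using scikit-learn's lars_path (not the LARS code developed for this project):

import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, m = 50, 10
X = rng.standard_normal((n, m))
beta_true = np.zeros(m)
beta_true[:3] = [2.0, -1.5, 0.5]                 # sparse ground truth
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Columns of coefs are Lasso solutions along the regularization path (decreasing lambda)
alphas, active, coefs = lars_path(X, y, method="lasso")
print("order in which covariates enter the active set:", active)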


Visualizing the algorithm

[Figure: geometry when m = 2; the response y and the column space of X = [x_1, x_2].]


Ordinary Least Squares (OLS)

[Figure: OLS projects y onto the column space of X: y_2 = Xβ = Py.]

β = arg min_β ||y − Xβ||_2^2

Least Angle Regression (LARS)

β = arg min_β ||y − Xβ||_2^2   s.t.   ||β||_1 ≤ t

Assume ||β_OLS||_1 > t, so the constraint is active. Start from β_0 = 0 with active set A = {}.

[Figure sequence: the same m = 2 geometry, tracking the LARS estimate μ within the column space of X = [x_1, x_2].]

Choose the initial direction u_1 = x_1 / ||x_1||_2 (the covariate most correlated with y). A = {x_1}.

Move along u_1 until x_2 is equally correlated: μ_1 = γ_1 u_1. A = {x_1}.

Identify the equiangular vector u_2. A = {x_1, x_2}.

Move along the equiangular direction u_2: μ_2 = μ_1 + γ_2 u_2. A = {x_1, x_2}.

Relationship to OLS

μ_2 approaches the OLS solution y_2; the LARS solution at step k is related to the OLS solution of ||y − X_k β||_2^2.
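To make the geometric steps above concrete, here is a compact NumPy sketch of the basic (unmodified) LARS iteration, assuming centered, unit-norm covariates. It is an illustrative skeleton, not the project's implementation: it re-solves the small active-set system at every step and omits the Lasso modification and the incremental factorization updates discussed next.

import numpy as np

def lars(X, y, n_steps=None):
    # Basic LARS (Efron et al., 2004): returns the fitted vector mu and the active set.
    n, m = X.shape
    n_steps = min(n_steps or m, m)
    mu = np.zeros(n)                                  # current LARS estimate of y
    active, inactive = [], list(range(m))
    for _ in range(n_steps):
        c = X.T @ (y - mu)                            # current correlations
        j = max(inactive, key=lambda k: abs(c[k]))    # covariate entering the active set
        active.append(j)
        inactive.remove(j)
        C = abs(c[j])                                 # common absolute correlation
        s = np.sign(c[active])
        Xa = X[:, active] * s                         # sign-adjusted active covariates
        G = Xa.T @ Xa
        Ginv1 = np.linalg.solve(G, np.ones(len(active)))
        A = 1.0 / np.sqrt(Ginv1.sum())
        u = Xa @ (A * Ginv1)                          # equiangular (unit) direction
        if not inactive:
            gamma = C / A                             # final step: jump to the OLS fit
        else:
            a = X.T @ u
            cands = [g for k in inactive
                     for g in ((C - c[k]) / (A - a[k]), (C + c[k]) / (A + a[k]))
                     if g > 1e-12]
            gamma = min(cands)                        # smallest positive breakpoint
        mu = mu + gamma * u
    return mu, active

After m steps with a full-rank X, mu coincides with the OLS fit, consistent with the relationship to OLS noted above.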


Some Algorithm Properties (full details in [Efron et al., 2004])

(2.22) Successive LARS estimates μ_k always approach, but never reach, the OLS estimate y_k (except possibly on the final iteration).

(Theorem 1) With a small modification to the LARS step size calculation, and assuming covariates are added/removed one at a time from the active set, the complete LARS solution path yields all Lasso solutions.

(Sec. 3.1) With a change to the covariate selection rule, LARS can be modified to solve the Positive Lasso problem:

min_β ||y − Xβ||_2^2   s.t.   ||β||_1 ≤ t,   0 ≤ β_j

(Sec. 7) The cost of LARS is comparable to that of a least squares fit on m variables. The LARS sequence incrementally generates a Cholesky factorization of X^T X in a very specific order.
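As a sanity check for the Positive Lasso variant, an off-the-shelf path solver can be used for comparison; the sketch below assumes a scikit-learn version whose lars_path accepts the positive flag (this is not the project's non-negative LARS code):

import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(1)
X = rng.standard_normal((60, 8))
y = 1.5 * X[:, 0] + 0.7 * X[:, 3] + 0.05 * rng.standard_normal(60)

# Lasso path with the modified covariate selection rule enforcing 0 <= beta_j
_, _, coefs = lars_path(X, y, method="lasso", positive=True)
print(coefs.min() >= 0)     # True: every solution on the path is non-negative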


LARS & Cholesky Decomposition

At iteration k, to determine the equiangular vector u_k, one must invert the k × k matrix G_k := X_k^T X_k.

Well, don't really invert: generate the Cholesky decomposition G_k = R^T R and solve triangular linear systems.

(Recall: G_k is symmetric positive semi-definite, and s.p.d. if X_k is full rank):

∀ z ∈ R^k, z ≠ 0:   z^T G_k z = z^T X_k^T X_k z = (X_k z)^T (X_k z) = ||X_k z||_2^2 ≥ 0

Could call chol(G_k) each iteration, but there's a more efficient way...
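A minimal sketch of that point using SciPy (the one-shot factorization, not the incremental update used in the project): factor G_k once and solve the triangular systems, rather than forming G_k^{-1} explicitly.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(2)
Xk = rng.standard_normal((100, 5))        # active covariates at iteration k
Gk = Xk.T @ Xk                            # s.p.d. when Xk has full rank

c, low = cho_factor(Gk)                   # Cholesky factor of Gk
w = cho_solve((c, low), np.ones(5))       # solves Gk w = 1 via two triangular solves

# Equiangular direction: u_k = A_k * Xk @ w with A_k = 1/sqrt(1^T w)
A_k = 1.0 / np.sqrt(w.sum())
u_k = A_k * (Xk @ w)
print(np.linalg.norm(u_k))                # ~1.0: u_k has unit length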


QR Rank 1 Updates (Golub and Van Loan [1996, sec. 12.5.2])

Recall: if A = QR, then A^T A = (R^T Q^T)(QR) = R^T R.

Suppose we have Q, R, z and want the QR decomposition of:

Ã = [a_1, . . . , a_k, z, a_{k+1}, . . . , a_n]

Let w := Q^T z. Then,

Q^T Ã = [Q^T a_1, . . . , Q^T a_k, w, Q^T a_{k+1}, . . . , Q^T a_n]

e.g.

× × × × × ×
0 × × × × ×
0 0 × × × ×
0 0 0 × × ×
0 0 0 × 0 ×
0 0 0 × 0 0
0 0 0 × 0 0


QR Rank 1 Updates (Golub and Van Loan [1996, sec. 12.5.2]), continued

Use Givens rotations to remove the “spike” introduced by w:

× × × × × ×            × × × × × ×
0 × × × × ×            0 × × × × ×
0 0 × × × ×            0 0 × × × ×
0 0 0 × × ×     →      0 0 0 × × ×
0 0 0 × 0 ×            0 0 0 0 × ×
0 0 0 × 0 0            0 0 0 0 0 ×
0 0 0 × 0 0            0 0 0 0 0 0

The operation requires mn flops.
An analogous approach downdates R when removing a column.
Matlab functions: qrinsert(), qrdelete() (Octave has cholupdate()...).
Ran into trouble with the non-uniqueness of the QR decomposition; used

QR = Q I_k^T I_k R = (Q I_k)(I_k R)

to swap the sign of R_{k,k} (where I_k is the identity matrix with I_{k,k} = −1).
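For reference, SciPy exposes analogous update/downdate routines; a small sketch assuming scipy.linalg.qr_insert / qr_delete, which play the role of Matlab's qrinsert / qrdelete here (not the project's own implementation):

import numpy as np
from scipy.linalg import qr, qr_insert, qr_delete

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 4))
Q, R = qr(A)                                    # full QR of the current active matrix

z = rng.standard_normal(8)
Q1, R1 = qr_insert(Q, R, z, 2, which="col")     # update: insert z as column 2
print(np.allclose(Q1 @ R1, np.insert(A, 2, z, axis=1)))   # True

Q2, R2 = qr_delete(Q1, R1, 2, which="col")      # downdate: remove that column again
print(np.allclose(Q2 @ R2, A))                  # True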


Validation: Diabetes Data Set

[Figure: Diabetes validation test, coefficient paths: each of the 10 coefficients β_j plotted against ||β||_1 as the constraint is relaxed.]

m = 10, n = 442; compares well with Figure 1 in [Efron et al., 2004]. Also validated by comparing orthogonal designs with the theoretical result.
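For comparison, a similar coefficient-path figure can be regenerated with an off-the-shelf LARS implementation; a minimal sketch using scikit-learn's bundled diabetes data (not the project's own plotting code):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lars_path

X, y = load_diabetes(return_X_y=True)            # n = 442 samples, m = 10 covariates
alphas, active, coefs = lars_path(X, y, method="lasso")

l1_norms = np.sum(np.abs(coefs), axis=0)         # ||beta||_1 along the path
plt.plot(l1_norms, coefs.T)
plt.xlabel("||beta||_1")
plt.ylabel("coefficients")
plt.title("LARS/Lasso coefficient paths, diabetes data")
plt.show()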


Dictionary Learning


Dictionary Learning: Progress

Warning:

The following is preliminary - work in progress.

No intelligent parameter selection; only looking at dictionary learning at the moment.


Atoms: LARS+LASSO (time = 6230.27 sec, m = 30, nIters = 1000, λ = 0.01)

[Figure: learned dictionary atoms.]


Experiment: LARS+LASSO, USPS 5023 (num. atoms > 1e-4: 30)


[Figure: true digit vs. reconstruction. Largest coefficients: a_28 = 2.632 (0.12%), a_16 = 1.662 (0.07%), a_29 = 1.571 (0.07%), a_11 = 1.438 (0.06%), a_5 = −1.276 (0.06%).]


Atoms: LARS+LASSO-NN (time = 2435.80 sec, m = 30, nIters = 1000, λ = 0.01)

[Figure: learned dictionary atoms.]


Experiment: LARS+LASSO-NN, USPS 5023 (num. atoms > 1e-4: 13)


[Figure: true digit vs. reconstruction. Largest coefficients: a_27 = 2.520 (0.20%), a_21 = 1.994 (0.16%), a_28 = 1.882 (0.15%), a_9 = 1.822 (0.14%), a_8 = 1.270 (0.10%).]

Summary & Next Steps


Summary

Progress

Milestones met; currently on schedule.

Also completed a few extra tasks not in the original plan (non-negative LARS, incremental Cholesky).

Near Term

Finish Task Driven Learning Framework and validation

On to hyperspectral data!

Optional Steps

Parallel SGD (e.g. [Zinkevich et al., 2010])


Bibliography I

Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, 2004.

Gene Golub and Charles Van Loan. Matrix Computations. Johns Hopkins University Press, 1996.

Julien Mairal, Francis Bach, and Jean Ponce. Task-Driven Dictionary Learning. Rapport de recherche RR-7400, INRIA, 2010.

Martin Zinkevich, Markus Weimer, Alex Smola, and Lihong Li. Parallelized stochastic gradient descent. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2595–2603, 2010.
