
Analyzing Task Driven Learning Algorithms: Mid-Year Status

Mike Pekala

December 6, 2011

Advisor: Prof. Doron Levy (dlevy at math.umd.edu)

UMD Dept. of Mathematics & Center for Scientific Computation and Mathematical Modeling (CSCAMM)


Today

1. Overview and Status: Overview; Schedule

2. Least Angle Regression: Geometric View; Rank 1 Updating; Validation

3. Dictionary Learning: Preliminary Results

4. Summary & Next Steps


Overview

The underlying notion of sparse coding is that, in many domains, data vectors can be concisely represented as a sparse linear combination of basis elements or dictionary atoms. Recent results suggest that, for many tasks, performance improvements can be obtained by explicitly learning dictionaries directly from the data (vs. using predefined dictionaries, such as wavelets). Further results suggest that additional gains are possible by jointly optimizing the dictionary for both the data and the task (e.g. classification, denoising).

Consider the Task Driven Learning algorithm [Mairal et al., 2010]:

Outer loop: Stochastic gradient descent to learn dictionary atoms and the classification weight vector. We talked about this at the kickoff.

Inner loop: Sparse approximation via penalized least squares. The authors use the Least Angle Regression (LARS) algorithm for this purpose. This was left vague at the kickoff; we'll talk about it a bit more today. (A schematic sketch of the overall outer/inner structure follows below.)
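A schematic sketch of that outer/inner structure, assuming for illustration a purely reconstructive gradient step for the dictionary (the true task-driven gradient of Mairal et al. [2010] also differentiates the task loss through the sparse coding step), and using scikit-learn's LARS-based Lasso solver for the inner loop rather than the project's own LARS code:

# Schematic sketch only: inner loop = sparse coding via LARS/Lasso,
# outer loop = stochastic gradient step on the dictionary. The dictionary
# update below is a plain reconstructive step, a simplification of the
# task-driven gradient derived in Mairal et al. (2010).
import numpy as np
from sklearn.linear_model import LassoLars

def sgd_dictionary_learning(X, n_atoms=30, lam=0.01, lr=1e-2, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    D = rng.standard_normal((n_features, n_atoms))
    D /= np.linalg.norm(D, axis=0)                    # unit-norm atoms
    for _ in range(n_iters):
        x = X[rng.integers(len(X))]                   # draw one sample (stochastic step)
        # Inner loop: sparse approximation (penalized least squares) via LARS/Lasso
        alpha = LassoLars(alpha=lam, fit_intercept=False).fit(D, x).coef_
        # Outer loop: gradient step on the reconstruction error w.r.t. D
        residual = x - D @ alpha
        D += lr * np.outer(residual, alpha)
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)   # re-normalize atoms
    return D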


Schedule and Milestones

Schedule and milestones from the kickoff:

Phase I: Algorithm development (Sept 23 - Jan 15)

  Phase Ia: Implement LARS (Sept 23 ∼ Oct 24)
    X Milestone: LARS code available

  Phase Ib: Validate LARS (Oct 24 ∼ Nov 14)
    X Milestone: results on diabetes data and hand-crafted problems

  Phase Ic: Implement SGD framework (Nov 14 ∼ Dec 15)
    90% Milestone: Initial SGD code available

  Phase Id: Validate SGD framework (Dec 15 ∼ Jan 15)
    25% Milestone: TDDL results on MNIST/USPS

Phase II: Analysis on new data sets (Jan 15 - May 1)

  Milestone: Preliminary results on selected dataset (∼ Mar 1)
  Milestone: Final report and presentation (∼ May 1)


Least Angle Regression


Problem: Constrained Least Squares

Recall the Lasso: given X = [x_1, . . . , x_m] ∈ R^{n×m}, t ∈ R_+, solve:

min_β ||y − Xβ||_2^2   s.t.   ||β||_1 ≤ t

which has an equivalent unconstrained formulation:

min_β ||y − Xβ||_2^2 + λ||β||_1

for some scalar λ ≥ 0. The L1 penalty improves upon OLS by introducing parsimony (feature selection) and regularization (improved generality).

There are multiple ways to solve this problem:

1 Directly, via convex optimization (can be expensive)

2 Iterative techniques

Forward selection (“matching pursuit”), forward stagewise, others
Least Angle Regression (LARS) [Efron et al., 2004]
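As a point of reference, the unconstrained formulation can be solved end-to-end with an existing LARS implementation; a minimal sketch on synthetic data using scikit-learn's lars_path (not the LARS code developed for this project):

import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, m = 50, 10
X = rng.standard_normal((n, m))
beta_true = np.zeros(m)
beta_true[:3] = [2.0, -1.5, 0.5]                 # sparse ground truth
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Columns of coefs are Lasso solutions along the regularization path (decreasing lambda)
alphas, active, coefs = lars_path(X, y, method="lasso")
print("order in which covariates enter the active set:", active)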


Visualizing the algorithm

[Figure: geometry when m = 2; the response y and the column space of X = [x_1, x_2].]


Ordinary Least Squares (OLS)

[Figure: OLS projects y onto the column space of X: y_2 = Xβ = Py.]

β = arg min_β ||y − Xβ||_2^2

Least Angle Regression (LARS)

β = arg min_β ||y − Xβ||_2^2   s.t.   ||β||_1 ≤ t

Assume ||β_OLS||_1 > t, so the constraint is active. Start from β_0 = 0 with active set A = {}.

[Figure sequence: the same m = 2 geometry, tracking the LARS estimate μ within the column space of X = [x_1, x_2].]

Choose the initial direction u_1 = x_1 / ||x_1||_2 (the covariate most correlated with y). A = {x_1}.

Move along u_1 until x_2 is equally correlated: μ_1 = γ_1 u_1. A = {x_1}.

Identify the equiangular vector u_2. A = {x_1, x_2}.

Move along the equiangular direction u_2: μ_2 = μ_1 + γ_2 u_2. A = {x_1, x_2}.

Relationship to OLS

μ_2 approaches the OLS solution y_2; the LARS solution at step k is related to the OLS solution of ||y − X_k β||_2^2.
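To make the geometric steps above concrete, here is a compact NumPy sketch of the basic (unmodified) LARS iteration, assuming centered, unit-norm covariates. It is an illustrative skeleton, not the project's implementation: it re-solves the small active-set system at every step and omits the Lasso modification and the incremental factorization updates discussed next.

import numpy as np

def lars(X, y, n_steps=None):
    # Basic LARS (Efron et al., 2004): returns the fitted vector mu and the active set.
    n, m = X.shape
    n_steps = min(n_steps or m, m)
    mu = np.zeros(n)                                  # current LARS estimate of y
    active, inactive = [], list(range(m))
    for _ in range(n_steps):
        c = X.T @ (y - mu)                            # current correlations
        j = max(inactive, key=lambda k: abs(c[k]))    # covariate entering the active set
        active.append(j)
        inactive.remove(j)
        C = abs(c[j])                                 # common absolute correlation
        s = np.sign(c[active])
        Xa = X[:, active] * s                         # sign-adjusted active covariates
        G = Xa.T @ Xa
        Ginv1 = np.linalg.solve(G, np.ones(len(active)))
        A = 1.0 / np.sqrt(Ginv1.sum())
        u = Xa @ (A * Ginv1)                          # equiangular (unit) direction
        if not inactive:
            gamma = C / A                             # final step: jump to the OLS fit
        else:
            a = X.T @ u
            cands = [g for k in inactive
                     for g in ((C - c[k]) / (A - a[k]), (C + c[k]) / (A + a[k]))
                     if g > 1e-12]
            gamma = min(cands)                        # smallest positive breakpoint
        mu = mu + gamma * u
    return mu, active

After m steps with a full-rank X, mu coincides with the OLS fit, consistent with the relationship to OLS noted above.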


Some Algorithm Properties (full details in [Efron et al., 2004])

(2.22) Successive LARS estimates μ_k always approach, but never reach, the OLS estimate y_k (except possibly on the final iteration).

(Theorem 1) With a small modification to the LARS step size calculation, and assuming covariates are added/removed one at a time from the active set, the complete LARS solution path yields all Lasso solutions.

(Sec. 3.1) With a change to the covariate selection rule, LARS can be modified to solve the Positive Lasso problem:

min_β ||y − Xβ||_2^2   s.t.   ||β||_1 ≤ t,   0 ≤ β_j

(Sec. 7) The cost of LARS is comparable to that of a least squares fit on m variables. The LARS sequence incrementally generates a Cholesky factorization of X^T X in a very specific order.
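As a sanity check for the Positive Lasso variant, an off-the-shelf path solver can be used for comparison; the sketch below assumes a scikit-learn version whose lars_path accepts the positive flag (this is not the project's non-negative LARS code):

import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(1)
X = rng.standard_normal((60, 8))
y = 1.5 * X[:, 0] + 0.7 * X[:, 3] + 0.05 * rng.standard_normal(60)

# Lasso path with the modified covariate selection rule enforcing 0 <= beta_j
_, _, coefs = lars_path(X, y, method="lasso", positive=True)
print(coefs.min() >= 0)     # True: every solution on the path is non-negative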


LARS & Cholesky Decomposition

At iteration k, to determine the equiangular vector u_k, one must invert the k × k matrix G_k := X_k^T X_k.

Well, don't really invert: generate the Cholesky decomposition G_k = R^T R and solve triangular linear systems.

(Recall: G_k is symmetric positive semi-definite, and s.p.d. if X_k is full rank):

∀ z ∈ R^k, z ≠ 0:   z^T G_k z = z^T X_k^T X_k z = (X_k z)^T (X_k z) = ||X_k z||_2^2 ≥ 0

Could call chol(G_k) each iteration, but there's a more efficient way...
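A minimal sketch of that point using SciPy (the one-shot factorization, not the incremental update used in the project): factor G_k once and solve the triangular systems, rather than forming G_k^{-1} explicitly.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(2)
Xk = rng.standard_normal((100, 5))        # active covariates at iteration k
Gk = Xk.T @ Xk                            # s.p.d. when Xk has full rank

c, low = cho_factor(Gk)                   # Cholesky factor of Gk
w = cho_solve((c, low), np.ones(5))       # solves Gk w = 1 via two triangular solves

# Equiangular direction: u_k = A_k * Xk @ w with A_k = 1/sqrt(1^T w)
A_k = 1.0 / np.sqrt(w.sum())
u_k = A_k * (Xk @ w)
print(np.linalg.norm(u_k))                # ~1.0: u_k has unit length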


QR Rank 1 Updates (Golub and Van Loan [1996, sec. 12.5.2])

Recall: if A = QR, then A^T A = (R^T Q^T)(QR) = R^T R.

Suppose we have Q, R, z and want the QR decomposition of:

Ã = [a_1, . . . , a_k, z, a_{k+1}, . . . , a_n]

Let w := Q^T z. Then,

Q^T Ã = [Q^T a_1, . . . , Q^T a_k, w, Q^T a_{k+1}, . . . , Q^T a_n]

e.g.

× × × × × ×
0 × × × × ×
0 0 × × × ×
0 0 0 × × ×
0 0 0 × 0 ×
0 0 0 × 0 0
0 0 0 × 0 0


QR Rank 1 Updates (Golub and Van Loan [1996, sec. 12.5.2]), continued

Use Givens rotations to remove the “spike” introduced by w:

× × × × × ×            × × × × × ×
0 × × × × ×            0 × × × × ×
0 0 × × × ×            0 0 × × × ×
0 0 0 × × ×     →      0 0 0 × × ×
0 0 0 × 0 ×            0 0 0 0 × ×
0 0 0 × 0 0            0 0 0 0 0 ×
0 0 0 × 0 0            0 0 0 0 0 0

The operation requires mn flops.
An analogous approach downdates R when removing a column.
Matlab functions: qrinsert(), qrdelete() (Octave has cholupdate()...).
Ran into trouble with the non-uniqueness of the QR decomposition; used

QR = Q I_k^T I_k R = (Q I_k)(I_k R)

to swap the sign of R_{k,k} (where I_k is the identity matrix with I_{k,k} = −1).
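For reference, SciPy exposes analogous update/downdate routines; a small sketch assuming scipy.linalg.qr_insert / qr_delete, which play the role of Matlab's qrinsert / qrdelete here (not the project's own implementation):

import numpy as np
from scipy.linalg import qr, qr_insert, qr_delete

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 4))
Q, R = qr(A)                                    # full QR of the current active matrix

z = rng.standard_normal(8)
Q1, R1 = qr_insert(Q, R, z, 2, which="col")     # update: insert z as column 2
print(np.allclose(Q1 @ R1, np.insert(A, 2, z, axis=1)))   # True

Q2, R2 = qr_delete(Q1, R1, 2, which="col")      # downdate: remove that column again
print(np.allclose(Q2 @ R2, A))                  # True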


Validation: Diabetes Data Set

[Figure: Diabetes validation test, coefficient paths: each of the 10 coefficients β_j plotted against ||β||_1 as the constraint is relaxed.]

m = 10, n = 442; compares well with Figure 1 in [Efron et al., 2004]. Also validated by comparing orthogonal designs with the theoretical result.
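For comparison, a similar coefficient-path figure can be regenerated with an off-the-shelf LARS implementation; a minimal sketch using scikit-learn's bundled diabetes data (not the project's own plotting code):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lars_path

X, y = load_diabetes(return_X_y=True)            # n = 442 samples, m = 10 covariates
alphas, active, coefs = lars_path(X, y, method="lasso")

l1_norms = np.sum(np.abs(coefs), axis=0)         # ||beta||_1 along the path
plt.plot(l1_norms, coefs.T)
plt.xlabel("||beta||_1")
plt.ylabel("coefficients")
plt.title("LARS/Lasso coefficient paths, diabetes data")
plt.show()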


Dictionary Learning


Dictionary Learning: Progress

Warning:

The following is preliminary - work in progress.

No intelligent parameter selection; only looking at dictionary learning at the moment.


Atoms: LARS+LASSO (time = 6230.27 sec, m = 30, nIters = 1000, λ = 0.01)

[Figure: learned dictionary atoms.]


Experiment: LARS+LASSO, USPS 5023 (num. atoms > 1e-4: 30)


[Figure: true digit vs. reconstruction. Largest coefficients: a_28 = 2.632 (0.12%), a_16 = 1.662 (0.07%), a_29 = 1.571 (0.07%), a_11 = 1.438 (0.06%), a_5 = −1.276 (0.06%).]


Atoms: LARS+LASSO-NN (time = 2435.80 sec, m = 30, nIters = 1000, λ = 0.01)

[Figure: learned dictionary atoms.]


Experiment: LARS+LASSO-NN, USPS 5023 (num. atoms > 1e-4: 13)


[Figure: true digit vs. reconstruction. Largest coefficients: a_27 = 2.520 (0.20%), a_21 = 1.994 (0.16%), a_28 = 1.882 (0.15%), a_9 = 1.822 (0.14%), a_8 = 1.270 (0.10%).]

Summary & Next Steps


Summary

Progress

Milestones met; currently on schedule.

Also completed a few extra tasks not in the original plan (non-negative LARS, incremental Cholesky).

Near Term

Finish Task Driven Learning Framework and validation

On to hyperspectral data!

Optional Steps

Parallel SGD (e.g. [Zinkevich et al., 2010])


Bibliography I

Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, 2004.

Gene Golub and Charles Van Loan. Matrix Computations. Johns Hopkins University Press, 1996.

Julien Mairal, Francis Bach, and Jean Ponce. Task-Driven Dictionary Learning. Rapport de recherche RR-7400, INRIA, 2010.

Martin Zinkevich, Markus Weimer, Alex Smola, and Lihong Li. Parallelized stochastic gradient descent. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2595–2603, 2010.
