Using Deep Learning to Accelerate Sparse Recovery
Wotao Yin†
Joint work with Xiaohan Chen‡, Jialin Liu†, Zhangyang Wang‡
†UCLA Math ‡Texas A&M CSE
Texas A&M U — February 20, 2019
This talk is based on the following papers:
• X. Chen, J. Liu, Z. Wang, and W. Yin, Theoretical linear convergence of unfolded ISTA and its practical weights and thresholds, Advances in Neural Information Processing Systems (NeurIPS), 2018.
• J. Liu, X. Chen, Z. Wang, and W. Yin, ALISTA: analytic weights are as good as learned weights in LISTA, International Conference on Learning Representations (ICLR), 2019.
X. Chen and J. Liu are equal first authors in both papers.
Overview
Recover a sparse x∗
b := Ax∗ + white noise
where A ∈ Rm×n and b ∈ Rm are given.
Known as compressed sensing, feature selection, or LASSO. A fundamental problem with numerous applications in signal processing, inverse problems, and statistical/machine learning.
Application: Examples
MRI Reconstruction
Radar Sensing
Our methods improve upon classical analytical sparse recovery algorithms by
• recovering a signal closer to x∗ (higher quality)
• reducing the total number of iterations to just 15–20 (fast recovery)
Our methods improve upon existing deep learning-based recovery algorithms, e.g., LISTA (Gregor & LeCun ’10), by
• learning (much) fewer parameters (faster training)
• adding support detection (faster recovery)
• proving linear convergence and robustness (theoretical guarantee!)
Outline
• Review LASSO model and ISTA method
• LISTA: the classic method, then a series of parameter eliminations
• Theoretical results
• How to make it robust
LASSO and ISTA
LASSO model:
x_lasso ← minimize_x (1/2)‖b − Ax‖_2^2 + λ‖x‖_1
where λ is a model parameter, tuned by hand or by cross-validation.
Forward-backward splitting gives ISTA:
x^(k+1) = η_{λ/L}( x^(k) + (1/L) A^T (b − A x^(k)) ).

ISTA converges sublinearly to x_lasso (with an eventual linear speed); it does not converge to x∗.
FPC (fixed-point continuation) is faster: it uses a large λ and schedules its reduction. It provably achieves finite support detection and eventual linear convergence.
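For reference, here is a minimal NumPy sketch of the ISTA iteration above; the iteration count is an arbitrary illustrative choice.

```python
import numpy as np

def soft_threshold(v, tau):
    # eta_tau(v): component-wise soft-thresholding
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, b, lam, num_iters=1000):
    # L: Lipschitz constant of the gradient of 0.5 * ||b - A x||_2^2
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        x = soft_threshold(x + A.T @ (b - A @ x) / L, lam / L)
    return x
```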
Relax ISTA
Rewrite ISTA as

x^(k+1) = η_θ( W_1 b + W_2 x^(k) ),

where W_1 = (1/L) A^T, W_2 = I_n − (1/L) A^T A, and θ = λ/L.
Gregor & LeCun’10: Learned ISTA (LISTA)
Unfold K iterations of ISTA.
Free W_1^k, W_2^k, and θ^k, k = 0, …, K − 1, as parameters.
Learn them from a training set D = {(b_i, x∗_i)}:

minimize_{W_1^k, W_2^k, θ^k}  Σ_{(b,x∗)∈D} ‖x^K(b) − x∗‖_2^2.
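A hedged PyTorch sketch of this unfolding (our own illustrative construction, not the authors' released code): every layer is initialized at the classical ISTA values, and W_1^k, W_2^k, θ^k are exposed as trainable parameters; theta0 is an arbitrary initial threshold.

```python
import torch
import torch.nn as nn

def soft(v, theta):
    # eta_theta(v), written so that theta stays differentiable
    return torch.sign(v) * torch.relu(torch.abs(v) - theta)

class LISTA(nn.Module):
    def __init__(self, A, K=16, theta0=0.1):
        super().__init__()
        m, n = A.shape
        L = torch.linalg.matrix_norm(A, ord=2) ** 2
        # every layer starts from the classical ISTA values
        self.W1 = nn.ParameterList(nn.Parameter(A.T / L) for _ in range(K))
        self.W2 = nn.ParameterList(
            nn.Parameter(torch.eye(n) - A.T @ A / L) for _ in range(K))
        self.theta = nn.Parameter(torch.full((K,), theta0))

    def forward(self, b):  # b: (batch, m)
        x = b.new_zeros(b.shape[0], self.W2[0].shape[0])
        for k, (W1, W2) in enumerate(zip(self.W1, self.W2)):
            x = soft(b @ W1.T + x @ W2.T, self.theta[k])
        return x
```

Training then minimizes Σ ‖x^K(b) − x∗‖_2^2 over D with a standard optimizer.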
Just generate synthetic sparse signals and train it like a neural network.
Training is very slow. But K = 16 layers are enough, and the denoising quality is better.
Figure: NMSE (dB) versus iteration for ISTA (λ = 0.1, 0.05, 0.025) and LISTA.
However, it does not scale
To run K iterations, the total number of parameters is

O(n^2 K + mnK).
Too many parameters and too many hours to learn!
Coupling W_1, W_2

If we need x^K → x∗ uniformly for all sparse signals when there is no measurement noise, then we must have:

• W_2^k + W_1^k A → I,
• θ^k → 0.
Both indeed hold after training:
Figure: learned parameters across layers k = 1, …, 16; W_2^k + W_1^k A approaches I (left) and θ^k approaches 0 (right).
Therefore, we enforce the following coupling in all layers:
W_2^k = I_n − W_1^k A,

yielding the iteration:

x^(k+1) = η_{θ^k}( x^(k) + W_1^k (b − A x^(k)) ).
Parameter reduction
O(n^2 K + mnK) → O(mnK),

which is especially significant when m ≪ n. Coupling also helps stabilize training. A sketch of the coupled layer follows.
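A minimal sketch of one coupled layer, reusing the soft helper from the LISTA sketch above; the batched shapes follow the same convention (x: (batch, n), b: (batch, m), W1_k: (n, m)):

```python
def lista_cp_layer(x, b, A, W1_k, theta_k):
    # Coupled update: x <- eta_theta(x + W_1^k (b - A x));
    # W_2^k = I - W_1^k A is enforced implicitly, so it is never stored.
    return soft(x + (b - x @ A.T) @ W1_k.T, theta_k)
```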
Support selection
Inspired by FPC (Hale, Yin, Zhang ’08) and Linearized Bregman (Osher et al. ’10).

Idea: at each iteration, let the largest components bypass soft-thresholding.

The largest components are selected by a fraction, which is hand-tuned.
We obtained both empirical and theoretical improvements.
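A hedged sketch of the thresholding step with support selection, continuing the PyTorch snippets above (p, the number of bypassed entries, stands in for the hand-tuned fraction):

```python
def soft_with_support_selection(v, theta, p):
    # The p largest-magnitude entries of v bypass soft-thresholding
    # and keep their values; all other entries are shrunk as usual.
    idx = torch.topk(v.abs(), p, dim=-1).indices
    keep = torch.zeros_like(v, dtype=torch.bool).scatter(-1, idx, True)
    return torch.where(keep, v, soft(v, theta))
```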
Empirical results
• We compare results using normalized MSE (NMSE) in dB:

NMSE(x, x∗) = 20 log_10( ‖x − x∗‖_2 / ‖x∗‖_2 ).

• Notation:
  – original LISTA: LISTA;
  – LISTA with weight coupling: LISTA-CP;
  – LISTA with support selection: LISTA-SS;
  – LISTA with both structures: LISTA-CPSS.
• Setting (a data-generation sketch follows this list):
  – m = 250, n = 500, sparsity s ≈ 50;
  – A_ij ∼ N(0, 1/√m), i.i.d.; A is column-normalized;
  – magnitudes of x∗ are sampled from the standard Gaussian;
  – measurement noise levels are reported as signal-to-noise ratios (SNR).
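A NumPy sketch of the metric and of synthetic data generation matching the stated setting (we read N(0, 1/√m) as standard deviation 1/√m; the seed and the SNR mechanism are our own illustrative choices):

```python
import numpy as np

def nmse_db(x, x_star):
    # NMSE(x, x*) = 20 log10(||x - x*||_2 / ||x*||_2)
    return 20 * np.log10(np.linalg.norm(x - x_star) / np.linalg.norm(x_star))

def sample_problem(m=250, n=500, s=50, snr_db=None, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
    A /= np.linalg.norm(A, axis=0)            # column-normalize A
    x_star = np.zeros(n)
    support = rng.choice(n, size=s, replace=False)
    x_star[support] = rng.standard_normal(s)  # standard Gaussian magnitudes
    b = A @ x_star
    if snr_db is not None:                    # add noise at the requested SNR
        noise = rng.standard_normal(m)
        noise *= np.linalg.norm(b) / (np.linalg.norm(noise) * 10 ** (snr_db / 20))
        b = b + noise
    return A, b, x_star
```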
Weight coupling
Figure: NMSE (dB) versus iteration for ISTA, FISTA, AMP, LISTA, and LISTA-CP.

Weight coupling stabilizes intermediate results. The final recovery quality is unchanged.
(Adding) support selection
Noiseless case (SNR=∞)
Figure: NMSE (dB) versus iteration for ISTA, FISTA, AMP, LISTA, LAMP, LISTA-CP, LISTA-SS, and LISTA-CPSS; LISTA-SS and LISTA-CPSS separate clearly below LISTA and LISTA-CP.
(Adding) support selection
Noisy case (SNR=30)
Figure: NMSE (dB) versus iteration for the same methods in the noisy case.
Natural image compressive sensing reconstruction
Figure: (a) ground truth; (b)–(f) reconstructions at 20%, 30%, 40%, 50%, and 60% sample rates.
Theory: convergence analysis
Theorem (Convergence of LISTA-CP). Suppose K = ∞ and let {x^(k)} be generated by LISTA-CP. There exists a sequence of parameters Θ^(k) = {W_1^i, θ^i}_{i=0}^{k−1} such that

‖x^(k)(Θ^(k), b, x_0) − x∗‖_2 ≤ C_1 exp(−ck) + C_2 σ,  ∀k = 1, 2, …,

holds for all (x∗, ε) that are sparse and bounded, where c, C_1, C_2 > 0 are constants that depend only on A and the distribution of x∗, and σ is the noise level.
The error bound consists of two parts:
• the error that converges linearly to zero;
• the irreducible error caused by the measurement noise.
Theory: convergence analysis
Theorem (Convergence of LISTA-CPSS). Suppose K = ∞ and let {x^(k)} be generated by LISTA-CPSS. There exists a sequence of parameters Θ^(k) = {W_1^i, θ^i}_{i=0}^{k−1} such that

‖x^(k)(Θ^(k), b, x_0) − x∗‖_2 ≤ C_1 exp( −Σ_{t=0}^{k−1} c_ss^t ) + C_ss σ,  ∀k = 1, 2, …,

holds for all (x∗, ε) satisfying some assumptions, where c_ss^k ≥ c for all k, c_ss^k > c for large enough k, and C_ss < C_2.
The convergence rate is better: c_ss^k > c for large enough k, so the acceleration is more significant in deeper layers. The recovery error is also better: C_ss < C_2.
Tie W_1 across the iterations

In the proofs, we chose W_1^k independent of the layer index k.
So, we use just one W for all iterations:
O(mnK)→ O(mn),
yielding tied LISTA (TiLISTA):
x^(k+1) = η_{θ^k}( x^(k) + γ^k W^T (b − A x^(k)) ).

We learn the step sizes {γ^k}, the thresholds {θ^k}, and just one matrix W. Tied LISTA works as well as LISTA.
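One tied layer, sketched in the same style as the earlier snippets (W has the shape of A and is shared across layers; gamma_k and theta_k are per-layer scalars):

```python
def tilista_layer(x, b, A, W, gamma_k, theta_k):
    # Tied update: x <- eta_theta(x + gamma^k W^T (b - A x)), with one shared W
    return soft(x + gamma_k * (b - x @ A.T) @ W, theta_k)
```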
Analytic LISTA (ALISTA)
The proofs also reveal that W needs to have small mutual coherence with A. So we tried solving for W directly, independent of the training data.
Two steps:
1. Pre-compute W:

W ∈ argmin_{W ∈ R^{m×n}} ‖W^T A‖_F^2,  s.t. (W_{:,j})^T A_{:,j} = 1, ∀j = 1, 2, …, n,

which is a standard convex quadratic program and easy to solve.
2. With W fixed, learn {γ^k, θ^k} from data by back-propagation.
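The program decouples over the columns of W, and under that reading each column has the closed form w_j ∝ (AA^T)^{-1} a_j, rescaled to satisfy the constraint. A hedged NumPy sketch (our derivation via per-column Lagrange multipliers, not necessarily the authors' solver):

```python
import numpy as np

def alista_weight(A):
    # Per column j: minimize ||A^T w||_2^2 subject to a_j^T w = 1, which gives
    # w_j = (A A^T)^{-1} a_j / (a_j^T (A A^T)^{-1} a_j).
    G = A @ A.T                    # m x m Gram matrix, assumed invertible
    V = np.linalg.solve(G, A)      # column j is (A A^T)^{-1} a_j
    scale = np.sum(A * V, axis=0)  # a_j^T (A A^T)^{-1} a_j for each j
    return V / scale               # satisfies (W[:, j])^T A[:, j] = 1
```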
Analytic LISTA (ALISTA)
For the resulting ALISTA network:
1. The weight matrix W depends on the model A, but not on the training data.
2. The step sizes γ^k and thresholds θ^k are learned from data, but they are only a small number of scalars.
Numerical evaluation
Noiseless case (SNR = ∞):

Figure: NMSE (dB) versus iteration for ISTA, FISTA, LISTA, LISTA-CPSS, TiLISTA, and ALISTA.

Noisy case (SNR = 30 dB):

Figure: NMSE (dB) versus iteration for the same methods.
Numbers of parameters to train
K: number of layers. A has M rows and N columns.
Original LISTA   O(KN^2 + KMN + K)
LISTA-CPSS       O(KMN + K)
TiLISTA          O(MN + K)
ALISTA           O(K)
A 16-layer ALISTA network takes only around 0.1 hours (6 minutes) of training to achieve performance comparable to LISTA-CPSS, which takes around 1.5 hours to train.
Extension to convolutional A
Our main results extend directly to very large convolutions (circulant matrices), which can handle large images.

Problem: forming a full matrix W is impossible, even for 100 × 100 imaging problems.

Approach: restrict W to be a convolution, find a nearly optimal one, and minimize coherence using FFTs.

Theoretical guarantee: the approximation is accurate when the image (we consider 2D convolutions) is large enough.
An end-to-end robust model

Mutual coherence minimization can also be solved by unrolling an algorithm!
The coherence minimization can be relaxed to

argmin_{W ∈ R^{m×n}} ‖Q ⊙ (A^T W − I_n)‖_F^2,

where ⊙ is the Hadamard product and Q is a weight matrix that puts heavier penalties on the diagonal entries. This can be solved by gradient descent:

W^(k+1) = W^(k) − γ^(k) A( Q^2 ⊙ (A^T W^(k) − I_n) ).
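One unrolled encoder step, as a minimal NumPy sketch of the update above (in the unrolled network, each γ^(k) would be a learned scalar):

```python
import numpy as np

def coherence_gd_step(W, A, Q, gamma):
    # Gradient step on 0.5 * ||Q ⊙ (A^T W - I)||_F^2 with respect to W
    residual = A.T @ W - np.eye(A.shape[1])
    return W - gamma * A @ (Q**2 * residual)
```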
Figure: One Layer of the Encoder.
Robust ALISTA: An end-to-end robust model
We feed the encoder perturbed models Ã = A + ε_A so that W is robust to model perturbations to some extent.

The encoder takes Ã and returns W; it is obtained by unrolling the gradient descent on the previous slide.

The decoder takes W, Ã, and b and returns x; it is the ALISTA model.
Figure: Robust ALISTA: cascaded Encoder-Decoder Structure.
Numerical results
We perturb A_0 element-wise with Gaussian noise of σ ≤ 0.03; the perturbed matrix is then column-normalized. The testing model is Ã, perturbed from A_0.

The W matrices in the non-robust LISTA methods are obtained using A_0.
Figure: NMSE (dB) versus perturbation level σ.
Summary
There is huge room for speed improvement in adapting an algorithm to a subset of optimization problems.
We can integrate data-driven (slow, adaptive) and analytic (fast, universal) approaches to obtain fast and adaptive algorithms.
While optimization helps deep learning, deep learning ideas can also help optimization. This is part of the bigger picture of “differentiable programming,” a rising field in deep learning theory.
Thank you!