Using Deep Learning to Accelerate Sparse Recovery
Wotao Yin†
Joint work with Xiaohan Chen‡, Jialin Liu†, Zhangyang Wang‡
†UCLA Math ‡Texas A&M CSE
Texas A&M U — February 20, 2019
This talk is based on the following papers:
• X. Chen, J. Liu, Z. Wang, and W. Yin, Theoretical linear convergence of unfolded ISTA and its practical weights and thresholds, Advances in Neural Information Processing Systems (NeurIPS), 2018.
• J. Liu, X. Chen, Z. Wang, and W. Yin, ALISTA: analytic weights are as good as learned weights in LISTA, International Conference on Learning Representations (ICLR), 2019.
X. Chen and J. Liu are equal first authors in both papers.
Overview
Recover a sparse x∗
b := Ax∗ + white noise
where A ∈ Rm×n and b ∈ Rm are given.
Known as compressed sensing, feature selection, or LASSO. A fundamental problem with numerous applications in signal processing, inverse problems, and statistical/machine learning.
Application: Examples
MRI Reconstruction
Radar Sensing
Our methods improve upon classical analytical sparse recovery algorithms by
• recovering a signal closer to x∗ (higher quality)
• reducing the total number of iterations to just 15–20 (fast recovery)
Our methods improve upon existing deep learning-based recovery algorithms, e.g., LISTA (Gregor & LeCun ’10), by
• learning (much) fewer parameters (faster training)
• adding support detection (faster recovery)
• proving linear convergence and robustness (theoretical guarantee!)
Outline
• Review LASSO model and ISTA method
• LISTA: the classic method, then a series of parameter eliminations
• Theoretical results
• How to make it robust
LASSO and ISTA
LASSO model:
x_lasso ← minimize_x (1/2)‖b − Ax‖_2^2 + λ‖x‖_1
where λ is a model parameter, tuned by hand or by cross-validation.
Forward-backward splitting gives ISTA:
x^(k+1) = η_{λ/L}( x^(k) + (1/L) A^T (b − A x^(k)) ).

ISTA converges sublinearly to x_lasso (with an eventual linear speed); it does not converge to x∗.
FPC (fixed-point continuation) is faster: it uses a large λ and schedules its reduction. It provably achieves finite support detection and eventual linear convergence.
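For reference, here is a minimal NumPy sketch of the ISTA iteration above; the iteration count is an arbitrary illustrative choice.

```python
import numpy as np

def soft_threshold(v, tau):
    # eta_tau(v): component-wise soft-thresholding
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, b, lam, num_iters=1000):
    # L: Lipschitz constant of the gradient of 0.5 * ||b - A x||_2^2
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        x = soft_threshold(x + A.T @ (b - A @ x) / L, lam / L)
    return x
```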
Relax ISTA
Rewrite ISTA as

x^(k+1) = η_θ( W_1 b + W_2 x^(k) ),

where W_1 = (1/L) A^T, W_2 = I_n − (1/L) A^T A, and θ = λ/L.
Gregor & LeCun’10: Learned ISTA (LISTA)
Unfold K iterations of ISTA.
Free W_1^k, W_2^k, and θ^k, k = 0, …, K − 1, as parameters.
Learn them from a training set D = {(b_i, x∗_i)}:

minimize_{W_1^k, W_2^k, θ^k}  Σ_{(b,x∗)∈D} ‖x^K(b) − x∗‖_2^2.
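A hedged PyTorch sketch of this unfolding (our own illustrative construction, not the authors' released code): every layer is initialized at the classical ISTA values, and W_1^k, W_2^k, θ^k are exposed as trainable parameters; theta0 is an arbitrary initial threshold.

```python
import torch
import torch.nn as nn

def soft(v, theta):
    # eta_theta(v), written so that theta stays differentiable
    return torch.sign(v) * torch.relu(torch.abs(v) - theta)

class LISTA(nn.Module):
    def __init__(self, A, K=16, theta0=0.1):
        super().__init__()
        m, n = A.shape
        L = torch.linalg.matrix_norm(A, ord=2) ** 2
        # every layer starts from the classical ISTA values
        self.W1 = nn.ParameterList(nn.Parameter(A.T / L) for _ in range(K))
        self.W2 = nn.ParameterList(
            nn.Parameter(torch.eye(n) - A.T @ A / L) for _ in range(K))
        self.theta = nn.Parameter(torch.full((K,), theta0))

    def forward(self, b):  # b: (batch, m)
        x = b.new_zeros(b.shape[0], self.W2[0].shape[0])
        for k, (W1, W2) in enumerate(zip(self.W1, self.W2)):
            x = soft(b @ W1.T + x @ W2.T, self.theta[k])
        return x
```

Training then minimizes Σ ‖x^K(b) − x∗‖_2^2 over D with a standard optimizer.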
Just generate synthetic sparse signals and train it like a neural network.
Training is very slow. But K = 16 layers are enough, and the denoising quality is better.
Figure: NMSE (dB) versus iteration for ISTA (λ = 0.1, 0.05, 0.025) and LISTA.
However, it does not scale
To run K iterations, the total number of parameters is

O(n^2 K + mnK).
Too many parameters and too many hours to learn!
Coupling W_1, W_2

If we need x^K → x∗ uniformly for all sparse signals when there is no measurement noise, then we must have:

• W_2^k + W_1^k A → I,
• θ^k → 0.
Both indeed hold after training:
Figure: learned parameters across layers k = 1, …, 16; W_2^k + W_1^k A approaches I (left) and θ^k approaches 0 (right).
Therefore, we enforce the following coupling in all layers:
W_2^k = I_n − W_1^k A,

yielding the iteration:

x^(k+1) = η_{θ^k}( x^(k) + W_1^k (b − A x^(k)) ).
Parameter reduction
O(n^2 K + mnK) → O(mnK),

which is especially significant when m ≪ n. Coupling also helps stabilize training. A sketch of the coupled layer follows.
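A minimal sketch of one coupled layer, reusing the soft helper from the LISTA sketch above; the batched shapes follow the same convention (x: (batch, n), b: (batch, m), W1_k: (n, m)):

```python
def lista_cp_layer(x, b, A, W1_k, theta_k):
    # Coupled update: x <- eta_theta(x + W_1^k (b - A x));
    # W_2^k = I - W_1^k A is enforced implicitly, so it is never stored.
    return soft(x + (b - x @ A.T) @ W1_k.T, theta_k)
```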
Support selection
Inspired by FPC (Hale, Yin, Zhang ’08) and Linearized Bregman (Osher et al. ’10).

Idea: at each iteration, let the largest components bypass soft-thresholding.

The largest components are selected by a fraction, which is hand-tuned.
We obtained both empirical and theoretical improvements.
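A hedged sketch of the thresholding step with support selection, continuing the PyTorch snippets above (p, the number of bypassed entries, stands in for the hand-tuned fraction):

```python
def soft_with_support_selection(v, theta, p):
    # The p largest-magnitude entries of v bypass soft-thresholding
    # and keep their values; all other entries are shrunk as usual.
    idx = torch.topk(v.abs(), p, dim=-1).indices
    keep = torch.zeros_like(v, dtype=torch.bool).scatter(-1, idx, True)
    return torch.where(keep, v, soft(v, theta))
```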
Empirical results
• We compare results using normalized MSE (NMSE) in dB:

NMSE(x, x∗) = 20 log_10( ‖x − x∗‖_2 / ‖x∗‖_2 ).

• Notation:
  – original LISTA: LISTA;
  – LISTA with weight coupling: LISTA-CP;
  – LISTA with support selection: LISTA-SS;
  – LISTA with both structures: LISTA-CPSS.
• Setting (a data-generation sketch follows this list):
  – m = 250, n = 500, sparsity s ≈ 50;
  – A_ij ∼ N(0, 1/√m), i.i.d.; A is column-normalized;
  – magnitudes of x∗ are sampled from the standard Gaussian;
  – measurement noise levels are reported as signal-to-noise ratios (SNR).
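A NumPy sketch of the metric and of synthetic data generation matching the stated setting (we read N(0, 1/√m) as standard deviation 1/√m; the seed and the SNR mechanism are our own illustrative choices):

```python
import numpy as np

def nmse_db(x, x_star):
    # NMSE(x, x*) = 20 log10(||x - x*||_2 / ||x*||_2)
    return 20 * np.log10(np.linalg.norm(x - x_star) / np.linalg.norm(x_star))

def sample_problem(m=250, n=500, s=50, snr_db=None, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
    A /= np.linalg.norm(A, axis=0)            # column-normalize A
    x_star = np.zeros(n)
    support = rng.choice(n, size=s, replace=False)
    x_star[support] = rng.standard_normal(s)  # standard Gaussian magnitudes
    b = A @ x_star
    if snr_db is not None:                    # add noise at the requested SNR
        noise = rng.standard_normal(m)
        noise *= np.linalg.norm(b) / (np.linalg.norm(noise) * 10 ** (snr_db / 20))
        b = b + noise
    return A, b, x_star
```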
Weight coupling
Figure: NMSE (dB) versus iteration for ISTA, FISTA, AMP, LISTA, and LISTA-CP.

Weight coupling stabilizes intermediate results. The final recovery quality is unchanged.
(Adding) support selection
Noiseless case (SNR=∞)
Figure: NMSE (dB) versus iteration for ISTA, FISTA, AMP, LISTA, LAMP, LISTA-CP, LISTA-SS, and LISTA-CPSS; LISTA-SS and LISTA-CPSS separate clearly below LISTA and LISTA-CP.
(Adding) support selection
Noisy case (SNR=30)
Figure: NMSE (dB) versus iteration for the same methods in the noisy case.
Natural image compressive sensing reconstruction
Figure: (a) ground truth; (b)–(f) reconstructions at 20%, 30%, 40%, 50%, and 60% sample rates.
Theory: convergence analysis
Theorem (Convergence of LISTA-CP). Suppose K = ∞ and let {x^(k)} be generated by LISTA-CP. There exists a sequence of parameters Θ^(k) = {W_1^i, θ^i}_{i=0}^{k−1} such that

‖x^(k)(Θ^(k), b, x_0) − x∗‖_2 ≤ C_1 exp(−ck) + C_2 σ,  ∀k = 1, 2, …,

holds for all (x∗, ε) that are sparse and bounded, where c, C_1, C_2 > 0 are constants that depend only on A and the distribution of x∗, and σ is the noise level.
The error bound consists of two parts:
• the error that converges linearly to zero;
• the irreducible error caused by the measurement noise.
Theory: convergence analysis
Theorem (Convergence of LISTA-CPSS). Suppose K = ∞ and let {x^(k)} be generated by LISTA-CPSS. There exists a sequence of parameters Θ^(k) = {W_1^i, θ^i}_{i=0}^{k−1} such that

‖x^(k)(Θ^(k), b, x_0) − x∗‖_2 ≤ C_1 exp( −Σ_{t=0}^{k−1} c_ss^t ) + C_ss σ,  ∀k = 1, 2, …,

holds for all (x∗, ε) satisfying some assumptions, where c_ss^k ≥ c for all k, c_ss^k > c for large enough k, and C_ss < C_2.
The convergence rate is better: c_ss^k > c for large enough k, so the acceleration is more significant in deeper layers. The recovery error is also better: C_ss < C_2.
Tie W_1 across the iterations

In the proofs, we chose W_1^k independent of the layer index k.
So, we use just one W for all iterations:
O(mnK)→ O(mn),
yielding tied LISTA (TiLISTA):
x^(k+1) = η_{θ^k}( x^(k) + γ^k W^T (b − A x^(k)) ).

We learn the step sizes {γ^k}, the thresholds {θ^k}, and just one matrix W. Tied LISTA works as well as LISTA.
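One tied layer, sketched in the same style as the earlier snippets (W has the shape of A and is shared across layers; gamma_k and theta_k are per-layer scalars):

```python
def tilista_layer(x, b, A, W, gamma_k, theta_k):
    # Tied update: x <- eta_theta(x + gamma^k W^T (b - A x)), with one shared W
    return soft(x + gamma_k * (b - x @ A.T) @ W, theta_k)
```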
Analytic LISTA (ALISTA)
The proofs also reveal that W needs to have small mutual coherence with A. So we tried solving for W directly, independent of the training data.
Two steps:
1. Pre-compute W:

W ∈ argmin_{W ∈ R^{m×n}} ‖W^T A‖_F^2,  s.t. (W_{:,j})^T A_{:,j} = 1, ∀j = 1, 2, …, n,

which is a standard convex quadratic program and easy to solve.
2. With W fixed, learn {γ^k, θ^k} from data by back-propagation.
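The program decouples over the columns of W, and under that reading each column has the closed form w_j ∝ (AA^T)^{-1} a_j, rescaled to satisfy the constraint. A hedged NumPy sketch (our derivation via per-column Lagrange multipliers, not necessarily the authors' solver):

```python
import numpy as np

def alista_weight(A):
    # Per column j: minimize ||A^T w||_2^2 subject to a_j^T w = 1, which gives
    # w_j = (A A^T)^{-1} a_j / (a_j^T (A A^T)^{-1} a_j).
    G = A @ A.T                    # m x m Gram matrix, assumed invertible
    V = np.linalg.solve(G, A)      # column j is (A A^T)^{-1} a_j
    scale = np.sum(A * V, axis=0)  # a_j^T (A A^T)^{-1} a_j for each j
    return V / scale               # satisfies (W[:, j])^T A[:, j] = 1
```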
Analytic LISTA (ALISTA)
For the resulting ALISTA network:
1. The weight matrix W depends on the model A, but not on the training data.
2. The step sizes γ^k and thresholds θ^k are learned from data, but they are only a small number of scalars.
Numerical evaluation
Noiseless case (SNR = ∞):

Figure: NMSE (dB) versus iteration for ISTA, FISTA, LISTA, LISTA-CPSS, TiLISTA, and ALISTA.

Noisy case (SNR = 30 dB):

Figure: NMSE (dB) versus iteration for the same methods.
Numbers of parameters to train
K: number of layers. A has M rows and N columns.
Original LISTA   O(KN^2 + KMN + K)
LISTA-CPSS       O(KMN + K)
TiLISTA          O(MN + K)
ALISTA           O(K)
A 16-layer ALISTA network takes only around 0.1 hours (6 minutes) of training to achieve performance comparable to LISTA-CPSS, which takes around 1.5 hours to train.
Extension to convolutional A
Our main results extend directly to very large convolutions (circulant matrices), which can handle large images.

Problem: forming a full matrix W is impossible, even for 100 × 100 imaging problems.

Approach: restrict W to be a convolution, find a nearly optimal one, and minimize coherence using FFTs.

Theoretical guarantee: the approximation is accurate when the image (we consider 2D convolutions) is large enough.
An end-to-end robust model

Mutual coherence minimization can also be solved by unrolling an algorithm!
The coherence minimization can be relaxed to

argmin_{W ∈ R^{m×n}} ‖Q ⊙ (A^T W − I_n)‖_F^2,

where ⊙ is the Hadamard product and Q is a weight matrix that puts heavier penalties on the diagonal entries. This can be solved by gradient descent:

W^(k+1) = W^(k) − γ^(k) A( Q^2 ⊙ (A^T W^(k) − I_n) ).
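One unrolled encoder step, as a minimal NumPy sketch of the update above (in the unrolled network, each γ^(k) would be a learned scalar):

```python
import numpy as np

def coherence_gd_step(W, A, Q, gamma):
    # Gradient step on 0.5 * ||Q ⊙ (A^T W - I)||_F^2 with respect to W
    residual = A.T @ W - np.eye(A.shape[1])
    return W - gamma * A @ (Q**2 * residual)
```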
Figure: One Layer of the Encoder.
Robust ALISTA: An end-to-end robust model
We feed the encoder perturbed models Ã = A + ε_A so that W is robust to model perturbations to some extent.

The encoder takes Ã and returns W; it is obtained by unrolling the gradient descent on the previous slide.

The decoder takes W, Ã, and b and returns x; it is the ALISTA model.
Figure: Robust ALISTA: cascaded Encoder-Decoder Structure.
Numerical results
We perturb A_0 element-wise with Gaussian noise of σ ≤ 0.03; the perturbed matrix is then column-normalized. The testing model is Ã, perturbed from A_0.

The W matrices in the non-robust LISTA methods are obtained using A_0.
Figure: NMSE (dB) versus perturbation level σ.
Summary
There is huge room for speed improvement in adapting an algorithm to a subset of optimization problems.
We can integrate data-driven (slow, adaptive) and analytic (fast, universal) approaches to obtain fast and adaptive algorithms.
While optimization helps deep learning, deep learning ideas can also help optimization. This is part of the bigger picture of “differentiable programming,” a rising field in deep learning theory.
Thank you!