An Overview of Advances in Sparse Representation
Zhangyang (Atlas) Wang
Why Sparsity Is (Still) Important: things are prized for their scarcity
Definition: Sparsity
• A signal x ∈ C^n is sparse when most of its entries are zero.
• Formally, x ∈ C^n is s-sparse if it has at most s nonzero entries. One can think of an s-sparse signal as having only s degrees of freedom.
• In many cases, x is only approximately sparse (e.g., its magnitudes decay exponentially). Fortunately, most conclusions about s-sparse signals still generalize to this setting.
Extension: Low-Dimensionality
• Low rank: a matrix X ∈ C^{m×n} has low rank if its rank r is (substantially) less than the ambient dimension min(m, n).
• One can think of a rank-r matrix as having only r(m + n − r) degrees of freedom, as this is the dimension of the tangent space to the manifold of rank-r matrices.
• Higher-dimensional: tensors. See "Tensor Decomposition" and "Tensor Completion".
Example: Compressive Sensing
• An underdetermined system y = Ax: y ∈ C^m, A ∈ C^{m×n}, x ∈ C^n, and m ≪ n.
• It does not even take a professor to teach you that a unique solution x cannot be obtained in general.
• But it took many years before we realized that x can be recovered from its highly incomplete measurements y by tractable algorithms…
• … given that x is known to be sparse and that A satisfies a suitable condition (mutual incoherence, or the restricted isometry property, RIP).
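A minimal numpy sketch of the setting (all names and dimensions below are illustrative, not from the slides): a random Gaussian A with m ≪ n, an s-sparse x, and the observation that the minimum-l2-norm solution is dense, so sparsity must be exploited explicitly.

```python
import numpy as np

# Toy compressive-sensing setup: m measurements of an n-dimensional,
# s-sparse signal, with m << n (illustrative dimensions).
rng = np.random.default_rng(0)
n, m, s = 400, 100, 8

x_true = np.zeros(n)
support = rng.choice(n, size=s, replace=False)
x_true[support] = rng.standard_normal(s)

A = rng.standard_normal((m, n)) / np.sqrt(m)   # random Gaussian sensing matrix
y = A @ x_true                                 # highly incomplete measurements

# The minimum-l2-norm solution exists but is dense, not sparse:
x_l2 = np.linalg.pinv(A) @ y
print("nonzeros in x_true:", np.count_nonzero(x_true))
print("entries of x_l2 above 1e-6:", np.sum(np.abs(x_l2) > 1e-6))
# Recovering x_true requires exploiting sparsity (e.g., l1 minimization),
# under incoherence/RIP-type conditions on A.
```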
Example: Sparse Coding-based Classification (SRC)
• Deep learning has ruled ImageNet since 2012…
• But do you know who ruled before deep learning stepped in?
• The 2010 winner: the NEC-UIUC team, driven by sparsity (#parameters: less than 1% of AlexNet).
Applications of sparsity are found everywhere in science and technology
• Image, Speech, and Video Processing
• Pattern Classification/Clustering
• Matrix Completion
• Robust PCA
• System Identification
• Sensor Network Fusion
• MRI Phase Retrieval
• Quantum-State Tomography
The list goes on and on, and keeps on growing…
The Insights of Sparsity
• Sparsity is important for both predictive accuracy and model interpretation.
• It makes for an "organized" and "informative" representation, and usually implies statistical robustness.
• Example: a simple probability distribution vector
• Example: signal and noise under the Fourier transform
The Insights of Sparsity
• Consider what sparsity means (informally): given a class of objects, if you can create a model such that the objects can be represented much more compactly while still preserving high fidelity to the originals…
• … it means that you have created a great model, in the sense that the redundancy of the original objects is reduced without losing information.
The Insights of Sparsity
• Many computational benefits for algorithm speed, memory, and storage:
• In SVMs, many algorithmic approaches exploit sparsity for fast optimization.
• In LASSO-style regression, sparsity makes it possible to compute full regularization paths.
• …
Sparse Recovery Algorithms: how many ways are there to write the "茴" in fennel beans?
Problem Formulation
Primal-Dual Interior-Point Methods (PDIPM)
• The algorithm simultaneously optimizes the primal-dual pair of the linear programming problems (P) and (D).
• (P) can be converted to a family of logarithmic barrier problems.
• The primal-dual interior-point algorithm follows the central trajectory (central path) of the problems (P) and (D), defined by the KKT conditions.
Gradient Projection Methods (GPM)
• GPM reformulates l1 minimization as a quadratic program (QP).
• One can separate the positive coefficients x+ from the negative coefficients x− (writing x = x+ − x−, with x+, x− ≥ 0).
• The problem can then easily be rewritten in the standard QP form.
Gradient Projection Methods (GPM)
• The gradient of Q(z) is available in closed form.
• This leads to a basic algorithm that searches from each iterate along the negative gradient, with the aid of a standard line-search procedure.
• A variant: the truncated Newton interior-point method (TNIPM).
• A logarithmic barrier is constructed for the constraints. Using the primal barrier method, the optimal search direction is computed via Newton's method.
• The Newton step is approximated using preconditioned conjugate gradients (PCG).
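A minimal fixed-step sketch of the gradient-projection idea on the split formulation (the methods cited above use a line search or PCG rather than the crude constant step here; all names are illustrative):

```python
import numpy as np

def gpm_l1(A, y, lam, n_iter=500):
    """Gradient projection for min_x 0.5*||Ax - y||^2 + lam*||x||_1,
    via the split x = u - v with u, v >= 0 (a bound-constrained QP)."""
    n = A.shape[1]
    u = np.zeros(n)
    v = np.zeros(n)
    # Conservative constant step: the QP Hessian has spectral norm 2*||A||_2^2.
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)
    for _ in range(n_iter):
        g = A.T @ (A @ (u - v) - y)                  # gradient of the data term w.r.t. x
        u = np.maximum(0.0, u - step * (g + lam))    # step, then project onto u >= 0
        v = np.maximum(0.0, v - step * (-g + lam))   # step, then project onto v >= 0
    return u - v
```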
Homotopy Methods
• Both PDIPM and GPM require the solution sequence to stay close to a "central path", which is often difficult to satisfy and computationally expensive.
• The homotopy methods exploit the fact that the objective function of LASSO undergoes a homotopy from the l2 constraint to the l1 objective as the penalty coefficient decreases (the "solution path").
• It is only necessary to identify the "breakpoints" at which the nonzero support set changes.
Homotopy Methods
• Pros: the homotopy algorithm provably solves l1-min (P1) exactly [NOT approximately]. For a k-sparse signal, homotopy methods can find it in k iterations.
• Cons: it may lose its computational competitiveness when the sparsity of x grows proportionally with the observation dimension d.
Iterative Shrinkage and Thresholding Algorithm (ISTA)
• Each iteration has a closed-form solution w.r.t. each scalar coefficient (soft-thresholding).
• Related: FISTA, proximal gradient methods (PGM).
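A minimal sketch of ISTA for the l1-regularized least-squares problem, using the scalar closed-form (soft-thresholding) update and a fixed step size (variable names are illustrative):

```python
import numpy as np

def soft_threshold(z, tau):
    """Closed-form scalar solution: the proximal operator of tau*|.|."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(A, y, lam, n_iter=200):
    """ISTA for min_x 0.5*||Ax - y||^2 + lam*||x||_1 with step size 1/L."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - (A.T @ (A @ x - y)) / L, lam / L)
    return x
```

FISTA adds a momentum term to the same update; both are instances of proximal gradient methods.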
Alternating Direction Method of Multipliers (ADMM)
• Using the (augmented) Lagrangian method, the problem is converted to an unconstrained form with two additional variables; an alternating minimization procedure is then performed.
• ADMM can also be applied to the dual problem of l1-min. An implementation called YALL1 iterates in both the primal and dual spaces to converge.
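A minimal sketch of ADMM applied to the (primal) l1-regularized least-squares problem; this is a simplified illustration with illustrative parameter choices, not the YALL1 algorithm itself:

```python
import numpy as np

def admm_lasso(A, y, lam, rho=1.0, n_iter=200):
    """ADMM for min 0.5*||Ax - y||^2 + lam*||z||_1  s.t.  x = z."""
    n = A.shape[1]
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)   # u: scaled dual variable
    M = A.T @ A + rho * np.eye(n)                       # fixed across iterations
    Aty = A.T @ y
    for _ in range(n_iter):
        x = np.linalg.solve(M, Aty + rho * (z - u))                       # x-update (least squares)
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)   # z-update (shrinkage)
        u = u + x - z                                                     # dual ascent step
    return z
```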
Connecting "Sparse" to "Deep": the wheel of history always turns back to the same spoke
A Starting Point
• Regularized Least Squares (RLS):
  Y = argmin_Y ||X − DY||₂² + r(Y)
  X: input data; Y: feature; D: the basis for feature representation; r(Y): the regularization term that incorporates a problem-specific prior
• Why RLS?
• It represents a large family of feature learning models.
• It is solvable by a similar class of algorithms.
• It gives rise to the most popular building blocks of recent deep learning.
Approximated Regression Machine (ARM)
• General form of the iterative algorithm: Y_{k+1} = N(L1 X + L2 Y_k)
• Idea: unfold and truncate the iterative algorithm
[Block diagram: X → L1 → (+) → N → Y, with a feedback loop through L2]
Approximated Regression Machine (ARM)
• Example: unfold and truncate to k = 2 (a 2-iteration ARM)
• Training and inference are done in the same architecture
• Can be further tuned on training data
• End-to-end training and fast inference
• A k-iteration ARM is a (k+1)-layer neural network!
[Block diagram of the 2-iteration ARM: X → L1 → (+) → N → L2 → (+) → N → L2 → (+) → N → Y]
Approximated Regression Machine (ARM)
• l1-RLS: Y = argmin_Y ||X − DY||₂² + c||Y||₁
• Iterative algorithm:
  Y_{k+1} = N(L1 X + L2 Y_k)
• L1 X = D^T X, L2 Y_k = (I − D^T D) Y_k
• N(·): elementwise soft-thresholding with threshold c (zero on [−c, c]; a two-sided, shifted ReLU)
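A minimal numpy sketch of a k-iteration l1-ARM obtained by unfolding and truncating this recursion (illustrative, and assuming the dictionary D has spectral norm at most 1 so that a unit step size is valid):

```python
import numpy as np

def soft_threshold(z, c):
    return np.sign(z) * np.maximum(np.abs(z) - c, 0.0)

def l1_arm(X, D, c, k=2):
    """k-iteration l1-ARM: Y_{t+1} = N(L1 X + L2 Y_t), unfolded (k+1) times,
    with L1 = D^T, L2 = I - D^T D, and N = soft-thresholding at c."""
    L1X = D.T @ X                        # first "layer": a linear map of the input
    L2 = np.eye(D.shape[1]) - D.T @ D    # weight shared by all later layers
    Y = soft_threshold(L1X, c)           # iteration from Y_0 = 0
    for _ in range(k):
        Y = soft_threshold(L1X + L2 @ Y, c)
    return Y
```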
Approximated Regression Machine (ARM)
• Non-negative l1-RLS: Y = argmin_Y ||X − DY||₂² + c||Y||₁, s.t. Y ≥ 0
• Iterative algorithm:
  Y_{k+1} = N(L1 X + L2 Y_k)
• L1 X = D^T X − c, L2 Y_k = (I − D^T D) Y_k
[Block diagram of the non-negative l1-ARM (2 iterations): X → L1 → (+) → ReLU → L2 → (+) → ReLU → L2 → (+) → ReLU → Y]
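Correspondingly, a minimal sketch of the non-negative variant: the only changes are that the threshold c moves into the input map as a bias, and N becomes ReLU (same illustrative assumptions as above):

```python
import numpy as np

def nonneg_l1_arm(X, D, c, k=2):
    """k-iteration non-negative l1-ARM: L1 X = D^T X - c, L2 = I - D^T D, N = ReLU."""
    relu = lambda z: np.maximum(z, 0.0)
    L1X = D.T @ X - c                    # the threshold becomes a bias term
    L2 = np.eye(D.shape[1]) - D.T @ D
    Y = relu(L1X)                        # iteration from Y_0 = 0
    for _ in range(k):
        Y = relu(L1X + L2 @ Y)
    return Y
```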
More Examples: l0-, l2-, and l∞-ARMs
• Nonlinearity transform N as a neuron
• Nonlinearity transform N as pooling
• l0-RLS (form 1): Y = argmin_Y ||X − DY||₂² + c²||Y||₀
• l2-RLS: Y = argmin_Y ||X − DY||₂² + c||Y||₂²
• l∞-RLS: Y = argmin_Y ||X − DY||₂², s.t. ||Y||∞ ≤ c
• l0-RLS (form 2): Y = argmin_Y ||X − DY||₂², s.t. ||Y||₀ ≤ M
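Minimal sketches of the elementwise nonlinearity N(·) induced by these regularizers (the exact threshold and scaling constants depend on how the penalty weights are normalized, so the values below are illustrative):

```python
import numpy as np

def hard_threshold(z, tau):
    """l0-RLS (form 1): keep a coefficient only if its magnitude exceeds a
    threshold tau set by the penalty weight c."""
    return np.where(np.abs(z) > tau, z, 0.0)

def l2_shrink(z, c):
    """l2-RLS (squared penalty c*||Y||_2^2): a uniform rescaling toward zero;
    note that no exact zeros are produced."""
    return z / (1.0 + 2.0 * c)

def linf_clip(z, c):
    """l_inf-RLS: projection onto the box ||Y||_inf <= c, i.e. clipping."""
    return np.clip(z, -c, c)
```

The l0-RLS (form 2) case is the max-M pooling operator discussed on the next slide.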
Compare Non-linearities in ARMs
[Figure: the nonlinearities compared: (a) tanh, (b) ReLU, (c) L1, (d) L0, (e) L2, (f) L∞]
Compare Non-linearities in ARMs
• l0-RLS (form 2): Y = argmin_Y ||X − DY||₂², s.t. ||Y||₀ ≤ M
• N(·): keep the top-M largest coefficients ➡ max-M pooling
• A generalization of the well-known max pooling operator (M = 1)
• Explains its success in deep learning: sparse representation
[Figure: max-M pooling, M = 2]
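A minimal sketch of the max-M pooling operator (keep the M largest-magnitude coefficients, zero out the rest):

```python
import numpy as np

def max_m_pooling(z, M):
    """Projection onto ||Y||_0 <= M: keep the M largest-magnitude entries."""
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-M:]   # indices of the top-M magnitudes
    out[idx] = z[idx]
    return out

# Example: max_m_pooling(np.array([0.1, -3.0, 2.0, 0.5]), M=2) -> [0, -3, 2, 0]
```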
Convolutional RLS
{Z_i} = argmin ||X − Σ_i F_i ∗ Z_i||₂² + Σ_i r(Z_i)
X: input data; Z_i: feature maps
F_i: the convolutional filter bank
r(Z_i): the regularization term
• Two special cases of interest
• r(Z_i) = 0, ∀i
  → PCANet, a recently proposed baseline
• r(Z_i) = λ||Z_i||₁, Z_i ≥ 0, ∀i
  → Non-negative convolutional l1-ARM
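A minimal 1-D sketch (the slides use 2-D feature maps; this is just illustrative) of a single unfolded step of the non-negative convolutional l1-ARM, i.e. one feature map per filter computed as a filtering step followed by a bias and ReLU:

```python
import numpy as np

def conv_l1_arm_step(X, filters, c):
    """One unfolded step from Z = 0: Z_i = ReLU(correlate(X, F_i) - c).
    Correlation with F_i equals convolution with the flipped filter."""
    return [np.maximum(np.correlate(X, F, mode="same") - c, 0.0) for F in filters]

# Illustrative usage with a random 1-D signal and two random filters.
rng = np.random.default_rng(0)
X = rng.standard_normal(64)
filters = [rng.standard_normal(5), rng.standard_normal(5)]
Z = conv_l1_arm_step(X, filters, c=0.1)
```

As the next slides make explicit, this single step is exactly a convolutional layer followed by a ReLU neuron.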
Towards General Feed-Forward Networks
• Y_{k+1} = N(L1 X + L2 Y_k)
  L1 X = D^T X − c, L2 Y_k = (I − D^T D) Y_k, N = ReLU
• What if… we let k = 0? → A "reckless" 0-iteration ARM?
• Y = ReLU(D^T X − c)
• Equivalent to a fully-connected layer + ReLU!
[Block diagram of the unfolded ARM, as before: X → L1 → (+) → ReLU → L2 → (+) → ReLU → L2 → (+) → ReLU → Y]
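A one-line numpy sketch of the 0-iteration ARM (illustrative names):

```python
import numpy as np

def zero_iteration_arm(X, D, c):
    """0-iteration ARM: Y = ReLU(D^T X - c); D^T acts as the weight matrix
    and -c as the bias, i.e. a fully-connected layer followed by ReLU."""
    return np.maximum(D.T @ X - c, 0.0)
```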
Stacked Approximated Regression Machine
• Non-negative l1-RLS ➡ 0-iteration ARM
  ➡ Fully-connected layer + neuron (ReLU)
• Non-negative convolutional l1-RLS ➡ 0-iteration ARM
  ➡ Convolutional layer + neuron (ReLU)
• Besides, l0-RLS (form 2) ➡ max pooling
• Stacking (and/or jointly tuning) ARMs
  ➡ Stacked Approximated Regression Machine (SARM)
• Most currently popular deep models are special cases of SARM!
Interpret Deep Networks with SARM
• Each trainable layer (plus neuron) is a 0-iteration ARM of a non-negative (linear or convolutional) l1-RLS.
• Each ARM comes with its own set of parameters: a dictionary or filter bank.
• Each hidden layer outputs a one-step approximation of the solution to the original RLS model.
• All 0-iteration ARMs are stacked into a SARM and tuned end to end, learning all parameters jointly.
A Deep Model from a SARM Viewpoint
Interpret Deep Networks with SARM
[Figure annotations relating network components (Depth, ReLU, Pooling, Bias) to their sparse-coding roles: infinitely deep? / control sparsity / enforce "hard" sparsity / enforce non-negative sparsity]
Resemblance to Residual Learning
[Diagrams: the unfolded ARM (x passed through L1 and repeated L2 + N blocks with skip additions, producing z1, z2, z3) alongside a residual block (x passed through W1, W2, W3 with N and a skip connection)]
(Expansion: L2 = I − L1 L1^T)
Solving End-To-End Optimization with SARM
• Example: Compressive Sensing (CS)
• An end-to-end learning of all CS parameters:
Deeply Optimized Compressive Sensing
• A feed-forward network pipeline to solve the bi-level optimization
• More examples solved: SVD (as a special case), the dual sparsity model, etc.
[Pipeline diagram: Representation → Measurement → Recovery → Reconstruction, with an encoder (sparse representation D, measurement A) and a decoder (an l1-ARM recovery stage followed by reconstruction G)]
Related Work
• Existing (albeit limited) work on interpreting deep learning
• [S. Mallat et al., 2013] Scattering network
• [Y. Bengio et al., 2012-2016] Explaining dropout, initialization, …
• [R. Baraniuk et al., 2015] Generative probabilistic theory
• [U. Kamilov et al., 2016] Learning optimal nonlinearities for ISTA
• [B. Xin et al., 2016] Maximal sparsity with deep networks?
• Work correlating classical ML models with deep models
• [Y. LeCun et al., 2010] Learned ISTA (LISTA)
• [P. Sprechmann et al., 2015] Fixed-complexity basis pursuit
• [R. Vemulapalli et al., 2015] Deep Gaussian Random Field
• [S. Zheng et al., 2015] CRF as RNN
Take-Home Points
• Nonparametric structured models, based on sparsity, low-dimensionality, etc., are powerful and flexible.
• While they may not always be the best model for any particular application, they are quite often surprisingly competitive.
• They bring more interpretability than "data-driven black boxes". Their algorithms are concise and beautiful.
• We may still embrace them in the age of deep learning (see my next talk).