
Compressed Sensing and Neural Networks

Jan Vybíral

(Charles University & Czech Technical University, Prague, Czech Republic)

NOMAD Summer, Berlin, September 25-29, 2017



Outline

Lasso & Compressed Sensing
- Least squares & Regularization
- Convexity, P vs. NP
- Sparsity & ℓ1-minimization
- Compressed Sensing

Neural Networks
- Introduction
- Notation
- Training the network
- Applications



Part I: Lasso & Compressed Sensing
- Least squares & Regularization
- Convexity, P vs. NP
- Sparsity & ℓ1-minimization
- Compressed Sensing



Least squares

Fitting a cloud of points by a linear hyperplane

Considered already by Gauss and Legendre around 1800

In 2D: a straight line fitted through a cloud of points (figure)



Objects (=points) described by Ω real numbers:

d_1 = (d_{1,1}, ..., d_{1,Ω}) ∈ R^Ω
...
d_N = (d_{N,1}, ..., d_{N,Ω}) ∈ R^Ω

N – the number of objects; D – the N × Ω matrix with rows d_1, ..., d_N

P = (P_1, ..., P_N) are the properties of interest

We look for a linear dependence P = f(d) with a linear f, i.e.

P_i = Σ_{j=1}^{Ω} c_j d_{i,j},   or   P = Dc



The solution is found by minimizing the least-square error:

c = arg min_{c ∈ R^Ω} Σ_{i=1}^{N} ( P_i − Σ_{j=1}^{Ω} c_j d_{i,j} )^2 = arg min_{c ∈ R^Ω} ‖P − Dc‖_2^2

- A closed-form formula exists
- The objective function is convex
- The resulting c generically has all coordinates occupied (non-zero)
- An absolute (intercept) term is incorporated by an additional column full of ones
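
A minimal numerical sketch of this fit (assuming NumPy; the data matrix D and the property vector P below are random placeholders, not data from the talk). The intercept is handled, as described above, by appending a column of ones:

```python
import numpy as np

rng = np.random.default_rng(0)
N, Omega = 50, 3                      # 50 objects, 3 features (illustrative sizes)
D = rng.normal(size=(N, Omega))       # data matrix with rows d_1, ..., d_N
P = rng.normal(size=N)                # properties of interest

# Append a column of ones so that an absolute (intercept) term is fitted as well.
D1 = np.hstack([D, np.ones((N, 1))])

# Closed-form least-squares solution of min_c ||P - D1 c||_2^2
c, residuals, rank, _ = np.linalg.lstsq(D1, P, rcond=None)
print("coefficients:", c[:-1], "intercept:", c[-1])
```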



Regularization

How can we include prior knowledge about c?

Say we prefer a linear fit with small coefficients. We simply weight the error of the fit against the size of the coefficients!

λ > 0 – the regularization parameter

c = arg min_{c ∈ R^Ω} ‖P − Dc‖_2^2 + λ‖c‖_2^2

- λ → 0: least squares
- λ → ∞: c = 0
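
The regularized problem also has a closed form, c = (Dᵀ D + λ I)⁻¹ Dᵀ P, obtained from the normal equations. A small sketch under the same placeholder setup as above (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
N, Omega = 50, 3
D = rng.normal(size=(N, Omega))       # placeholder data
P = rng.normal(size=N)

def ridge(D, P, lam):
    """Minimizer of ||P - D c||_2^2 + lam * ||c||_2^2 via the normal equations."""
    return np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ P)

print(ridge(D, P, lam=1e-8))          # lam -> 0: essentially plain least squares
print(ridge(D, P, lam=1e6))           # large lam: coefficients pushed towards 0
```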


Tractability

Convexity

- The minimizer is unique
- A local minimum of a convex function is also a global one
- Many effective methods exist (convex optimization)

P vs. NP

- P problems: solvable in polynomial time (as a function of the size of the input)
- NP problems: a solution is verifiable in polynomial time; P ⊂ NP
- One-million-dollar problem: P = NP?
- Computational complexity


Sparsity

If Ω is large (especially Ω ≫ N), we are often interested in "selecting features", i.e. in c with many coordinates equal to zero.

‖c‖_0 := #{i : c_i ≠ 0} – the number of non-zero coordinates of c

Looking for a linear fit using only two features:

c = arg min_{c ∈ R^Ω, ‖c‖_0 ≤ 2} ‖P − Dc‖_2^2

Regularized version:

c = arg min_{c ∈ R^Ω} ‖P − Dc‖_2^2 + λ‖c‖_0

NP-hard!
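
To see where the combinatorial difficulty comes from, here is a brute-force sketch of the two-feature problem above: it enumerates all pairs of features and solves a small least-squares problem for each, which becomes infeasible once Ω or the sparsity level grows. Sizes and data are illustrative placeholders (NumPy assumed):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
N, Omega, s = 40, 10, 2
D = rng.normal(size=(N, Omega))       # placeholder data
P = rng.normal(size=N)

best_err, best_support, best_c = np.inf, None, None
for support in combinations(range(Omega), s):   # C(Omega, s) subsets: exponential in general
    cols = list(support)
    c_sub, *_ = np.linalg.lstsq(D[:, cols], P, rcond=None)
    err = np.sum((P - D[:, cols] @ c_sub) ** 2)
    if err < best_err:
        best_err, best_support, best_c = err, support, c_sub
print("best support:", best_support, "coefficients:", best_c, "error:", best_err)
```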


ℓ1-minimization

Other ways to measure the size of c: the ℓp-norms

‖c‖_p = ( Σ_{j=1}^{Ω} |c_j|^p )^{1/p}

- Unit balls of ℓ_p in R^2 (figure)
- p = ∞: ‖c‖_∞ = max_{j=1,...,Ω} |c_j|
- p ≥ 1 – convex problem
- p ≤ 1 – promotes sparsity


p ≤ 1 promotes sparsity:

Solution of S_p = arg min_{z ∈ R^2} ‖z‖_p   s.t.   Az = y,   for p = 1 and p = 2 (figure)


Take p = 1 (Lasso – Tibshirani, 1996):

c = arg min_{c ∈ R^Ω} ‖P − Dc‖_2^2 + λ‖c‖_1

- Chen, Donoho, Saunders: Basis pursuit (1998)
- λ → 0: least squares
- λ → ∞: c = 0
- In between: λ selects the sparsity
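
A small Lasso sketch using scikit-learn (an assumption of this transcript, not a tool mentioned in the talk). Note that scikit-learn's Lasso scales the data-fit term by 1/(2N), so its alpha corresponds to the λ above only up to that factor; the data are synthetic placeholders with two relevant features:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, Omega = 40, 20
D = rng.normal(size=(N, Omega))
c_true = np.zeros(Omega)
c_true[[2, 7]] = [1.5, -2.0]                       # only two features matter
P = D @ c_true + 0.01 * rng.normal(size=N)

for alpha in (1e-4, 1e-1, 1e2):                    # small, moderate, large regularization
    c_hat = Lasso(alpha=alpha, fit_intercept=False).fit(D, P).coef_
    # the printed support shrinks as alpha grows
    print(alpha, np.flatnonzero(np.abs(c_hat) > 1e-8))
```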


Effect of λ > 0 on the support of the coefficient vector ω (figure)


Compressed Sensing (aka Compressive Sensing, Compressive Sampling)

Theorem: Let D ∈ R^{N×Ω} have independent Gaussian entries. Let 0 < ε < 1, let s be a natural number, and let

N ≥ C ( s log(Ω) + log(1/ε) ),   C a universal constant.

If c ∈ R^Ω is s-sparse, P = Dc, and ĉ is the minimizer of

ĉ = arg min_{u ∈ R^Ω} ‖u‖_1   s.t.   P = Du,

then ĉ = c with probability at least 1 − ε.
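
The equality-constrained ℓ1 minimization in the theorem is a linear program: splitting u = u⁺ − u⁻ with u⁺, u⁻ ≥ 0 makes ‖u‖_1 a linear objective. A sketch assuming SciPy, with illustrative sizes chosen roughly in the regime N ≈ C s log Ω:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
Omega, s, N = 200, 5, 60                      # ambient dimension, sparsity, measurements
D = rng.normal(size=(N, Omega))               # Gaussian measurement matrix
c = np.zeros(Omega)
c[rng.choice(Omega, s, replace=False)] = rng.normal(size=s)   # s-sparse vector
P = D @ c

# min sum(u_plus + u_minus)  s.t.  D (u_plus - u_minus) = P,  u_plus, u_minus >= 0
cost = np.ones(2 * Omega)
A_eq = np.hstack([D, -D])
res = linprog(cost, A_eq=A_eq, b_eq=P, bounds=[(0, None)] * (2 * Omega))
c_hat = res.x[:Omega] - res.x[Omega:]
print("maximal recovery error:", np.max(np.abs(c_hat - c)))
```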


- Candès, Romberg, Tao (2006); Donoho (2006)
- Extensive theory of recovery of sparse vectors from linear measurements
- Optimal conditions on the number of measurements (i.e. data points): N ≈ C s log Ω
- Only true if most of the features (i.e. the columns of D) are incoherent with the majority of the others (if two features are very similar, it is difficult to distinguish between them)
- H. Boche, R. Calderbank, G. Kutyniok, J. V., A Survey of Compressed Sensing, first chapter in Compressed Sensing and its Applications, Birkhäuser/Springer, 2015


Dictionaries

Real-life signals are (almost) never sparse in the canonical basis of R^Ω; more often they are sparse in some orthonormal basis, i.e.

x = Bc,

where c ∈ R^Ω is sparse and the columns (and rows) of B ∈ R^{Ω×Ω} are orthonormal vectors – wavelets, the Fourier basis, etc.

Compressed sensing then applies without any essential change! Just replace D with DB ... i.e. you rotate the problem ...


Even more often, the signal is represented in an overcomplete dictionary/lexicon:

x = Lc,

where c ∈ R^ℓ is sparse and L ∈ R^{Ω×ℓ} is the dictionary/lexicon – its columns form an overcomplete system (ℓ > Ω).

x is a sparse combination of non-orthogonal vectors – the columns of L.

Examples: unions of two or more orthonormal bases, each capturing different features


- Compressed sensing can also be adapted to this situation
- Optimization: x = arg min_{u ∈ R^Ω} ‖L*u‖_1   s.t.   P = Du
- We do not recover the (non-unique!) sparse coefficients c, but (an approximation of) the signal x
- The error bound involves L*x and is reasonably small, for example, when L*L is nearly diagonal, i.e. when not too many features in the dictionary are too correlated


ℓ1-based optimization

- ℓ1-SVM: Support vector machines are a standard tool for classification problems. An ℓ1 penalty term leads to sparse classifiers.
- Nuclear norm: Minimizing the nuclear norm (= the sum of the singular values) of a matrix leads to low-rank matrices.
- TV (= total variation) norm: Minimizing Σ_{i,j} |u_{i,j+1} − u_{i,j}| over images u gives images with edges and flat parts.
- L1: Minimizing the L1-norm (= the integral of the absolute value) of a function leads to functions with small support.
- TV-norm of f: Minimizing ∫ |∇f| leads to functions with jumps along curves.


Part II: Neural Networks
- Introduction
- Notation
- Training the network
- Applications


Neural Networks

W. McCulloch, W. Pitts (1943). Motivated by biological research on the human brain and neurons.

A neural network is a partially connected graph of nodes. The nodes represent neurons; oriented connections between the nodes represent the transfer of the outputs of some neurons to the inputs of other neurons.


- In the 1970's and 1980's a number of obstacles appeared – insufficient computer power to train large neural networks, theoretical problems with processing exclusive-or, etc.
- Support vector machines (and other simpler algorithms) took over the field of machine learning
- 2010's: algorithmic advances and higher computational power allowed training large neural networks to human (and superhuman) performance in pattern recognition
- Large neural networks (a.k.a. deep learning) are used successfully in many tasks


Neural Networks: Artificial Neuron

An artificial neuron gets activated if a linear combination of its inputs grows over a certain threshold:

- Inputs x = (x_1, ..., x_n) ∈ R^n
- Weights w = (w_1, ..., w_n) ∈ R^n
- Compare ⟨w, x⟩ with a threshold b ∈ R
- Plug the result into the "activation function" – a jump (or smoothed-jump) function σ

An artificial neuron is thus the function

x → σ(⟨x, w⟩ − b),

where σ : R → R might be σ(x) = sgn(x) or σ(x) = e^x/(1 + e^x), etc.
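
The same neuron written out as code, directly following the formula above and using the smoothed jump σ(x) = e^x/(1 + e^x) (NumPy assumed; the input, weights and threshold are placeholder values):

```python
import numpy as np

def sigma(t):
    """Smoothed jump ('logistic') activation: sigma(t) = e^t / (1 + e^t)."""
    return 1.0 / (1.0 + np.exp(-t))

def neuron(x, w, b):
    """Artificial neuron: x -> sigma(<x, w> - b)."""
    return sigma(np.dot(x, w) - b)

x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([1.0, 0.3, -0.2])   # weights
b = 0.1                          # threshold
print(neuron(x, w, b))
```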


Neural Networks: Layers

An artificial neural network is a directed, acyclic graph of artificial neurons. The neurons are grouped into layers by their distance to the input.


- Input: x = (x_1, ..., x_n) ∈ R^n
- First layer of neurons: y_1 = σ(⟨x, w^1_1⟩ − b^1_1), ..., y_{n_1} = σ(⟨x, w^1_{n_1}⟩ − b^1_{n_1})
- The outputs y = (y_1, ..., y_{n_1}) become the inputs for the next layer, and so on; the last layer outputs y ∈ R
- Training the network: given inputs x^1, ..., x^N and outputs y^1, ..., y^N, optimize over the weights w and thresholds b
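
A minimal sketch of the forward pass through such a network with one hidden layer of n_1 neurons and a single output neuron (taking the last layer linear is an assumption made here for illustration; NumPy assumed, all sizes and parameters are random placeholders):

```python
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(x, W1, b1, w2, b2):
    """First layer: y_k = sigma(<x, w^1_k> - b^1_k); last layer: a single real output."""
    y = sigma(W1 @ x - b1)        # W1 holds the weight vectors w^1_k as rows
    return np.dot(w2, y) - b2     # linear output neuron

rng = np.random.default_rng(0)
n, n1 = 4, 6                      # input dimension, neurons in the first layer
W1, b1 = rng.normal(size=(n1, n)), rng.normal(size=n1)
w2, b2 = rng.normal(size=n1), 0.0
x = rng.normal(size=n)
print(forward(x, W1, b1, w2, b2))
```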


Neural Networks: Training

- The parameters p of the network are initialized (for example randomly) ⟹ N_p
- For a set of input/output pairs (x^i, y^i) we calculate the output of the neural network with the current parameters ⟹ z^i = N_p(x^i)
- In an optimal case, z^i = y^i for all inputs
- Update the parameters of the neural network to minimize/decrease the loss function, i.e. Σ_i |y^i − z^i|^2
- ... and repeat ...


- Non-convex minimization over a huge space!
- A huge number of local minimizers exist
- Initialization of the minimization algorithm is important
- Backpropagation algorithm: the error at the output is redistributed to the neurons of the last hidden layer, then to the previous one, etc.
- The error is distributed back through the network and used to update the parameters of each neuron by a gradient-descent method
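
A toy sketch of the whole training loop for a one-hidden-layer network: the gradient of the loss Σ_i |y^i − z^i|^2 is computed by backpropagation (the chain rule applied layer by layer) and the parameters are updated by gradient descent. The network size, the synthetic data, the learning rate and the number of iterations are all illustrative assumptions:

```python
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
n, n1, N = 3, 8, 200
X = rng.normal(size=(N, n))                  # placeholder inputs x^1, ..., x^N
Y = np.sin(X[:, 0]) + 0.5 * X[:, 1]          # placeholder target outputs y^1, ..., y^N

W1, b1 = rng.normal(size=(n1, n)), np.zeros(n1)   # random initialization
w2, b2 = rng.normal(size=n1), 0.0
lr = 0.05                                    # gradient-descent step size

for it in range(500):
    # Forward pass
    H = sigma(X @ W1.T - b1)                 # hidden activations, shape (N, n1)
    Z = H @ w2 - b2                          # network outputs z^i
    # Backpropagation of the squared loss sum_i (z^i - y^i)^2
    dZ = 2 * (Z - Y)
    grad_w2, grad_b2 = H.T @ dZ, -np.sum(dZ)
    dH = np.outer(dZ, w2) * H * (1 - H)      # chain rule through sigma
    grad_W1, grad_b1 = dH.T @ X, -np.sum(dH, axis=0)
    # Gradient-descent update (averaged over the N examples)
    W1 -= lr / N * grad_W1; b1 -= lr / N * grad_b1
    w2 -= lr / N * grad_w2; b2 -= lr / N * grad_b2

Z = sigma(X @ W1.T - b1) @ w2 - b2
print("final mean squared loss:", np.mean((Z - Y) ** 2))
```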


- Backpropagation was discovered in the 1960's
- Applied to neural networks in the 1970's
- Theoretical progress in the 1980's and 1990's
- Profited from the increased computational power of the 2010's, which allowed applications to large data sets and to neural networks with tens or hundreds of layers
- Achieved human and super-human performance in pattern recognition and later in many other applications


Neural Networks: Deep learning

- Training of networks with a large number (~ 100) of layers
- Made possible by the use of GPUs (Nvidia), which accelerated deep learning by a factor of roughly 100
- The large number of parameters makes such networks sensitive to overfitting (= too exact an adaptation to the training data, which does not carry over to other data from the same area)
- Overfitting is reduced by regularization methods: ℓ2 (weight decay) or ℓ1 (sparsity) penalties on the weights
- Further tricks are used to accelerate the learning algorithm


Applications

- Pattern recognition
- Computer vision
- Speech recognition
- Social network filtering
- Recommendation systems
- Bioinformatics
- AlphaGo
- ...


Thank you for your attention!
