TOTAL VARIATION AND CALCULUS
Lecture 3 - CMSC878O SP15
SPARSE RECOVERY PROBLEMS
minimize $\|x\|_0$ subject to $Ax = b$
"Fat" matrix (underdetermined)
Dense measurements
Sparse solution
COMPLEXITY ALERT: NP-Complete
(Reductions to subset cover and SAT)
Nonetheless: can solve by greedy methods
• Orthogonal Matching Pursuit (OMP)
• Stagewise methods: StOMP
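Greedy recovery is easy to sketch. Below is a minimal Orthogonal Matching Pursuit in NumPy; it is a generic textbook version with an illustrative random test problem, not the lecture's reference code.

```python
import numpy as np

def omp(A, b, k, tol=1e-8):
    """Orthogonal Matching Pursuit: greedily build a k-sparse x with Ax ~ b."""
    n = A.shape[1]
    x = np.zeros(n)
    residual = b.copy()
    support = []
    for _ in range(k):
        # pick the column most correlated with the current residual
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in support:
            support.append(j)
        # least-squares fit on the current support
        coef, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
        residual = b - A[:, support] @ coef
        if np.linalg.norm(residual) < tol:
            break
    x[support] = coef
    return x

# toy example: recover a 3-sparse signal from 20 random measurements
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
x_true = np.zeros(50); x_true[[3, 17, 41]] = [1.0, -2.0, 0.5]
b = A @ x_true
print(np.nonzero(omp(A, b, k=3))[0])   # expect indices 3, 17, 41
```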
L0 IS NOT CONVEX
minimize $\|x\|_0$ subject to $Ax = b$

CONVEX RELAXATION
minimize $\|x\|_1$ subject to $Ax = b$
WHY USE L1?
• L1 is the “tightest” convex relaxation of L0
SPARSE OPTIMIZATION PROBLEMS
Basis Pursuit: minimize $\|x\|_1$ subject to $Ax = b$
Basis Pursuit Denoising: minimize $\lambda\|x\|_1 + \frac{1}{2}\|Ax - b\|^2$
Lasso: minimize $\frac{1}{2}\|Ax - b\|^2$ subject to $\|x\|_1 \le \tau$
CO-SPARSITY
• Sometimes the signal is sparse under a transform
• When the transform is invertible, can use synthesis:
  minimize $\lambda\|\Phi x\|_1 + \frac{1}{2}\|Ax - b\|^2$ becomes minimize $\lambda\|v\|_1 + \frac{1}{2}\|A\Phi^{-1}v - b\|^2$
• Otherwise, use the analysis formulation
• The thing inside the L1 norm is sparse!
IMAGE GRADIENT
Neumann:
$\nabla u = \begin{pmatrix} u_1 - u_0 \\ u_2 - u_1 \\ u_3 - u_2 \\ u_4 - u_3 \end{pmatrix}$
Circulant:
$\nabla u = \begin{pmatrix} u_1 - u_0 \\ u_2 - u_1 \\ u_3 - u_2 \\ u_4 - u_3 \\ u_0 - u_4 \end{pmatrix}$
TOTAL VARIATION
$TV(x) = |\nabla x| = \sum_k |x_{k+1} - x_k|$
How to compute? $[-1\ \ 1]$
• Convolve with stencil [1, -1]
• Take absolute value
• Take sum
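As a concrete check, here are the three steps above in NumPy for a 1D signal; the test signal is illustrative.

```python
import numpy as np

def tv_1d(x):
    """Total variation of a 1D signal: sum of absolute forward differences."""
    d = np.convolve(x, [1, -1], mode='valid')   # convolution flips the stencil, giving x[k+1] - x[k]
    return np.abs(d).sum()

x = np.array([0.0, 0.0, 1.0, 1.0, 0.0])
print(tv_1d(x))   # one jump up and one jump down -> TV = 2.0
```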
TV IN 2D
$(\nabla x)_{ij} = (x_{i+1,j} - x_{i,j},\ x_{i,j+1} - x_{i,j})$
Anisotropic: $|(\nabla x)_{ij}| = |x_{i+1,j} - x_{i,j}| + |x_{i,j+1} - x_{i,j}|$
Isotropic: $\|(\nabla x)_{ij}\| = \sqrt{(x_{i+1,j} - x_{ij})^2 + (x_{i,j+1} - x_{ij})^2}$
TOTAL VARIATION: 2D
$(\nabla x)_{ij} = (x_{i+1,j} - x_{ij},\ x_{i,j+1} - x_{ij})$
$g_x = D_x I$, $g_y = D_y I$
$TV_{iso}(x) = |\nabla x| = \sum_{ij} \sqrt{(x_{i+1,j} - x_{ij})^2 + (x_{i,j+1} - x_{ij})^2}$
$TV_{an}(x) = |\nabla x| = \sum_{ij} |x_{i+1,j} - x_{ij}| + |x_{i,j+1} - x_{ij}|$
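A short NumPy sketch of both 2D variants using forward differences as defined above; the zero-padding convention at the boundary and the test image are illustrative choices.

```python
import numpy as np

def tv_2d(img, isotropic=True):
    """Isotropic or anisotropic total variation of a 2D image (forward differences)."""
    gx = np.diff(img, axis=0)   # x_{i+1,j} - x_{i,j}
    gy = np.diff(img, axis=1)   # x_{i,j+1} - x_{i,j}
    # pad so both gradient fields share the image's shape (zero at the boundary)
    gx = np.pad(gx, ((0, 1), (0, 0)))
    gy = np.pad(gy, ((0, 0), (0, 1)))
    if isotropic:
        return np.sqrt(gx**2 + gy**2).sum()
    return np.abs(gx).sum() + np.abs(gy).sum()

img = np.zeros((8, 8)); img[2:6, 2:6] = 1.0    # a bright square on a dark background
print(tv_2d(img), tv_2d(img, isotropic=False))
```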
IMAGE RESTORATION
(Figure: original vs. noisy image)
TOTAL VARIATION DENOISING
$\lambda|\nabla u| + \frac{1}{2}\|u - f\|^2$   (L1 penalty on the gradient)
$\lambda\|\nabla u\|^2 + \frac{1}{2}\|u - f\|^2$   (squared penalty on the gradient)
(Figure: 1D slice plots through the results)
TV IMAGING PROBLEMS
Denoising (ROF): minimize $\lambda|\nabla u| + \frac{1}{2}\|u - f\|^2$
Deblurring: minimize $\lambda|\nabla u| + \frac{1}{2}\|Ku - f\|^2$
TVL1: minimize $\lambda|\nabla u| + \|u - f\|_1$
COMPUTING TV ON IMAGES
$\begin{pmatrix} -1 & 1 \end{pmatrix}$, $\begin{pmatrix} -1 \\ 1 \end{pmatrix}$
• Two linear convolutions
• x-stencil = (1, -1)
• y-stencil = (1, -1)'
Circulant Boundary
Fast Transforms: use FFT
FOURIER TRANSFORM
$\hat{x}_k = \sum_n x_n e^{-i 2\pi k n / N}$
$\hat{x}_k = \langle x, F_k \rangle$
(Figure: Fourier modes 2, 4, and 10)
CONVOLUTION
$f * g(x) = \sum_s f(x - s)\, g(s) = \sum_s f(x + s)\, g(-s)$
Don't forget: the kernel runs backwards (forgetting to flip the kernel is WRONG!!)
PROPERTIES OF FFT
• Classical FFT is ORTHOGONAL: $F^H F = I$
• Computable in $O(N \log N)$ time (Cooley-Tukey):
  $Fx = \begin{pmatrix} I & B \\ I & -B \end{pmatrix} \begin{pmatrix} F x_e \\ F x_o \end{pmatrix}$
• Convolution = multiplication in the Fourier domain:
  $x * y = F^T(Fx \cdot Fy)$
what does this mean?
A BETTER WAY TO THINK ABOUT IT
$x * y = F^T(Fx \cdot Fy)$
• FFT DIAGONALIZES convolution matrices:
  $Kx = F^T D F x$, with $F$ orthogonal and $D$ diagonal
Trivia!
• Is $K_1 K_2 = K_2 K_1$?
• Does $\partial_x \partial_y u = \partial_y \partial_x u$?
HOW DO WE FIGURE OUT D?
• The diagonal matrix is just the FFT of the backward stencil
• Example: derivative $[-1\ \ 1]$
  • Define the stencil: [1, -1]
  • Embed the stencil in a matrix: $s = [-1\ 0\ 0\ 0\ 0\ 0\ 0\ 1]$
  • Compute the diagonal: $D = \mathrm{diag}(Fs)$
  • Perform the convolution using the EVD: $Kx = F^T D F x$
(show in Matlab)
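The lecture demos this in Matlab; an equivalent NumPy sketch is below: build the circulant forward-difference matrix, take the FFT of its first column $s$, and check that applying $K$ agrees with multiplying by the diagonal in the Fourier domain.

```python
import numpy as np

n = 8
# circulant forward-difference matrix K: (Kx)_k = x_{k+1} - x_k (with wrap-around)
K = -np.eye(n) + np.roll(np.eye(n), 1, axis=1)
s = K[:, 0]                     # first column: [-1, 0, ..., 0, 1]
d = np.fft.fft(s)               # eigenvalues = FFT of the stencil column

x = np.random.default_rng(0).standard_normal(n)
Kx_direct = K @ x
Kx_fft = np.fft.ifft(d * np.fft.fft(x)).real   # K x = F^{-1} D F x
print(np.allclose(Kx_direct, Kx_fft))          # True
```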
CONDITION NUMBER
$Kx = F^T D F x$,   $s = [-1\ 0\ 0\ 0\ \cdots\ 0\ 0\ 0\ 1]$
How poorly conditioned is a convolution?
How poorly conditioned is the FFT?
(Figure: magnitudes of the eigenvalues $Fs$, ranging from 0 to 2)
Zero: why? What is the eigenvector?
WHY IS THE CHIRP WELL-CONDITIONED?
Echolocation chirp: brown bat
Linear chirp
FFT IN THE WILD
Matlab's DFT is NOT ORTHOGONAL!
WHY?
Convolution with a delta function = identity
FFT still diagonalizes convolutions!
$F_o = \frac{1}{\sqrt{N}} F$ (the orthonormal version),  $F_o^{-1} = \sqrt{N}\, F^{-1}$
$K = F^{-1} D F = F_o^{-1} D F_o$
Same D
CONTINUOUS GRADIENTS
WHY WE NEED GRADIENTS
First-order conditions: $\nabla f(x^\star) = 0$
First-order methods: gradient descent
THE CALCULUS YO MOMMA TAUGHT YOU
$\nabla f(x, y, z) = \partial_x f\, \mathbf{i} + \partial_y f\, \mathbf{j} + \partial_z f\, \mathbf{k}$,   $f(x, y, z): \mathbb{R}^3 \to \mathbb{R}$
What letter do we use for the 4th variable?
What letter do we use for the 1,000,000th variable?
DERIVATIVE
The Fréchet derivative of $f: \mathbb{R}^n \to \mathbb{R}^m$ is a linear operator $D$ with
$\lim_{\|h\| \to 0} \dfrac{\|f(x + h) - f(x) - Dh\|}{\|h\|} = 0$
A linear operator that locally behaves like $f$
In optimization: variables are usually column vectors $x$; derivative $Df$
DERIVATIVE
In optimization: variables are usually column vectors
$Df = \begin{pmatrix} D_{x_1}f & D_{x_2}f & D_{x_3}f \end{pmatrix} = \begin{pmatrix} D_{x_1}f_1 & D_{x_2}f_1 & D_{x_3}f_1 \\ D_{x_1}f_2 & D_{x_2}f_2 & D_{x_3}f_2 \\ D_{x_1}f_3 & D_{x_2}f_3 & D_{x_3}f_3 \end{pmatrix}$
In this case the derivative is the Jacobian
GRADIENT
In optimization: variables are usually column vectors
For scalar-valued functions:
$\nabla f = \begin{pmatrix} \partial_{x_1} f \\ \partial_{x_2} f \\ \partial_{x_3} f \end{pmatrix}$,   $Df = \begin{pmatrix} \partial_{x_1} f & \partial_{x_2} f & \partial_{x_3} f \end{pmatrix}$
Definition: the gradient is the adjoint of the derivative
GRADIENT
Better definition: the gradient of a scalar-valued function is a matrix of derivatives that is the same shape as the unknowns
Example: matrix of unknowns
$x = \begin{pmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{pmatrix}$,   $\nabla f = \begin{pmatrix} \partial_{11} f & \partial_{12} f & \partial_{13} f \\ \partial_{21} f & \partial_{22} f & \partial_{23} f \\ \partial_{31} f & \partial_{32} f & \partial_{33} f \end{pmatrix}$
WHAT'S THE GRADIENT?!?!
Function: $f(x) = \frac{1}{2}\|x\|^2$   →   Gradient: $\nabla f(x) = x$
Function: $f(x) = Ax$   →   Gradient: $\nabla f(x) = A^T$
Function: $f(x) = \frac{1}{2} x^T A x$   →   Gradient: $\nabla f(x) = Ax$   (for symmetric $A$)
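A quick way to sanity-check entries like these is a finite-difference gradient check. Here is a minimal NumPy sketch; the helper name `numerical_grad` and the symmetric test matrix are illustrative, not from the lecture.

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of a scalar function f."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)); A = A + A.T      # symmetric, so grad of 0.5 x'Ax is Ax
x = rng.standard_normal(5)

print(np.allclose(numerical_grad(lambda v: 0.5 * v @ v, x), x))          # grad = x
print(np.allclose(numerical_grad(lambda v: 0.5 * v @ A @ v, x), A @ x))  # grad = Ax
```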
CHAIN RULE
$h(x) = f \circ g(x) = f(g(x))$
single-variable chain rule: $h'(x) = f'(g(x))\, g'(x)$
multi-variable chain rule: $Dh(x) = Df(g(x))\, Dg(x)$
for gradients, the order swaps: $\nabla h(x) = \nabla g(x)\, \nabla f(g(x))$
EXAMPLE: LEAST SQUARES
$f(g(x)) = \frac{1}{2}\|Ax - b\|^2$, with $g(x) = Ax - b$ and $f(z) = \frac{1}{2}\|z\|^2$
$\nabla g(x)\, \nabla f(g) = A^T(Ax - b)$
How would you minimize this?
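One natural answer is gradient descent with the gradient just derived. A minimal NumPy sketch follows; the random $A$, $b$ and the step-size choice are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 5))
b = rng.standard_normal(30)

# gradient descent on 0.5*||Ax - b||^2 with step size 1/L, L = ||A||^2
x = np.zeros(5)
step = 1.0 / np.linalg.norm(A, 2) ** 2
for _ in range(2000):
    x -= step * (A.T @ (A @ x - b))     # gradient is A^T (Ax - b)

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # matches the least-squares solution
```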
IMPORTANT RULES
If $h(x) = f(Ax)$, then $\nabla h(x) = A^T f'(Ax)$
If $h(x) = f(xA)$, then $\nabla h(x) = f'(xA)\, A^T$
Why?
EXAMPLE: RIDGE REGRESSION
$f(x) = \frac{\lambda}{2}\|x\|^2 + \frac{1}{2}\|Ax - b\|^2$
Optimality condition:
$\nabla f(x) = \lambda x + A^T(Ax - b) = 0$
$(A^T A + \lambda I)x = A^T b$
$x^\star = (A^T A + \lambda I)^{-1} A^T b$
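A minimal NumPy sketch of the closed-form solution, checking that the gradient vanishes at $x^\star$; the random $A$, $b$ and $\lambda = 0.1$ are illustrative.

```python
import numpy as np

def ridge(A, b, lam):
    """Closed-form ridge regression: solve (A^T A + lam*I) x = A^T b."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)
x = ridge(A, b, lam=0.1)
# the gradient lam*x + A^T(Ax - b) should vanish at the solution
print(np.allclose(0.1 * x + A.T @ (A @ x - b), 0, atol=1e-10))
```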
REGULARIZED GRADIENTS
Huber regularization: $|x| \approx h(x) = \begin{cases} \frac{1}{2}x^2, & \text{for } |x| \le \delta \\ \delta\left(|x| - \frac{1}{2}\delta\right), & \text{otherwise} \end{cases}$
Hyperbolic regularization: $|x| \approx \sqrt{x^2 + \epsilon^2}$
(Figure: both smooth approximations plotted against $|x|$)
GRADIENT OF TV
$|\nabla x| + \frac{\mu}{2}\|x - f\|^2$
Replace L1 with Huber: $h(\nabla x) + \frac{\mu}{2}\|x - f\|^2$
Differentiate: $\nabla^T h'(\nabla x) + \mu(x - f)$
$h'(x) = \dfrac{x}{\sqrt{x^2 + \epsilon^2}}$   (divide component-wise)
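A minimal NumPy sketch of this smoothed-TV gradient in 1D with circulant boundary conditions, verified with a finite-difference check; the hyperbolic smoothing, the values of $\epsilon$ and $\mu$, and the helper names are illustrative choices, not the lecture's code.

```python
import numpy as np

def grad_smoothed_tv(x, f, mu=1.0, eps=1e-2):
    """Gradient of sum(sqrt((Dx)^2 + eps^2)) + (mu/2)*||x - f||^2, D = circulant forward difference."""
    Dx = np.roll(x, -1) - x                       # forward differences with wrap-around
    h_prime = Dx / np.sqrt(Dx**2 + eps**2)        # derivative of the hyperbolic approximation
    DT_h = np.roll(h_prime, 1) - h_prime          # apply D^T (the adjoint is a backward difference)
    return DT_h + mu * (x - f)

def objective(x, f, mu=1.0, eps=1e-2):
    Dx = np.roll(x, -1) - x
    return np.sqrt(Dx**2 + eps**2).sum() + 0.5 * mu * np.sum((x - f)**2)

rng = np.random.default_rng(0)
x, f = rng.standard_normal(16), rng.standard_normal(16)
g = grad_smoothed_tv(x, f)
# finite-difference check of one coordinate
e = np.zeros(16); e[3] = 1e-6
print(np.isclose(g[3], (objective(x + e, f) - objective(x - e, f)) / 2e-6))
```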
LOGISTIC REGRESSION
Data vectors $\{D_i\}$, labels $\{l_i\}$
Model: $\sigma(D_i x) = \dfrac{e^{D_i x}}{1 + e^{D_i x}}$, where $\sigma$ is the sigmoid
Maximum Likelihood (MAP) Estimator: maximize $p(x \mid D, l)$
$\text{minimize}_x \sum_{k=1}^m \log\big(1 + \exp(-l_k (D_k x))\big) = f_{lr}(LDx)$
LOGISTIC REGRESSION
$\text{minimize}_x \sum_{k=1}^m \log\big(1 + \exp(-l_k (D_k x))\big) = f_{lr}(LDx)$,   $\hat{D} = LD$
Break the problem down: coordinate separable + linear operator
$f_{lr}(z) = \sum_{k=1}^m \log(1 + \exp(-z_k))$
$\text{minimize}_x\ f_{lr}(\hat{D}x)$
$\nabla\{f_{lr}(\hat{D}x)\} = \hat{D}^T \nabla f_{lr}(\hat{D}x)$,   $\nabla f_{lr}(z) = \dfrac{-\exp(-z)}{1 + \exp(-z)}$
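A minimal NumPy sketch of this separable loss and its gradient; the random data, the $\pm 1$ labels, and the fixed step size are illustrative.

```python
import numpy as np

def f_lr(z):
    """Coordinate-separable logistic loss: sum_k log(1 + exp(-z_k))."""
    return np.sum(np.log1p(np.exp(-z)))

def grad_logreg(x, D_hat):
    """Gradient of f_lr(D_hat x): D_hat^T grad f_lr(z), with grad f_lr(z) = -exp(-z)/(1+exp(-z))."""
    z = D_hat @ x
    return D_hat.T @ (-np.exp(-z) / (1.0 + np.exp(-z)))

rng = np.random.default_rng(0)
D = rng.standard_normal((100, 5))                  # data rows
l = np.sign(rng.standard_normal(100))              # +/-1 labels
D_hat = l[:, None] * D                             # D_hat = L D, with L = diag(labels)

# a few gradient-descent steps should decrease the loss
x = np.zeros(5)
for _ in range(200):
    x -= 0.01 * grad_logreg(x, D_hat)
print(f_lr(D_hat @ np.zeros(5)), ">", f_lr(D_hat @ x))
```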
NON-NEGATIVE MATRIX FACTORIZATION
$\text{minimize}_{X \ge 0,\, Y \ge 0}\ E(X, Y) = \frac{1}{2}\|XY - S\|^2$
$\partial_Y E = X^T(XY - S)$
$\partial_X E = (XY - S)Y^T$
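A minimal projected-gradient sketch using these partial derivatives; the alternating scheme, the rank, and the step size are illustrative choices, not the lecture's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.random((20, 30))                      # non-negative data matrix
X, Y = rng.random((20, 4)), rng.random((4, 30))

step = 1e-2
for _ in range(500):
    X = np.maximum(X - step * ((X @ Y - S) @ Y.T), 0)          # gradient step in X, project onto X >= 0
    Y = np.maximum(Y - step * (X.T @ (X @ Y - S)), 0)          # gradient step in Y, project onto Y >= 0

print(0.5 * np.linalg.norm(X @ Y - S)**2)     # objective after a few hundred iterations
```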
NEURAL NETS: LOGISTIC REGRESSION ON STEROIDS
MLP = "Multilayer Perceptron"
$\sigma(\sigma(\sigma(d_i W_1) W_2) W_3) = y_3$
GRADIENT OF NET
$\sigma(d_i W_1) = y_1$,  $\sigma(y_1 W_2) = y_2$,  $\sigma(y_2 W_3) = y_3$
$\sigma(\sigma(\sigma(d_i W_1) W_2) W_3) = y_3$
Derivative of the output with respect to the input:
$\dfrac{\partial y_3}{\partial d_i} = \dfrac{\partial y_3}{\partial y_2} \cdot \dfrac{\partial y_2}{\partial y_1} \cdot \dfrac{\partial y_1}{\partial d_i}$
or…
$\nabla_{d_i} y_3 = \nabla_{d_i} y_1\, \nabla_{y_1} y_2\, \nabla_{y_2} y_3$
Where do we evaluate these gradients?
FORWARD PASS
compute intermediate values
$\sigma(d_i W_1) = y_1$,  $\sigma(y_1 W_2) = y_2$,  $\sigma(y_2 W_3) = y_3$
BABY STEPS
Start with just a sigma: $\sigma(x): \mathbb{R}^m \to \mathbb{R}^m$
$D\sigma = \mathrm{diag}(\sigma'_1, \sigma'_2, \cdots, \sigma'_n) = \mathrm{diag}(\sigma')$,   $\nabla_x \sigma = \mathrm{diag}(\sigma')$
Now a single layer: $\sigma(xW): \mathbb{R}^n \to \mathbb{R}^m$
$D\sigma(xW) = W\,\mathrm{diag}(\sigma')$   (why on the right?)
$\nabla_x \sigma(xW) = \mathrm{diag}(\sigma')\, W^T$
Easy, right???
KEEP GOING! MATH IS FUN!!!
$\sigma(d_i W_1) = y_1$,  $\sigma(y_1 W_2) = y_2$,  $\sigma(y_2 W_3) = y_3$
$D\sigma_i(xW_i) = W_i\,\mathrm{diag}(\sigma'_i)$,   $\nabla_x \sigma_i(xW_i) = \mathrm{diag}(\sigma'_i)\, W_i^T$
$\dfrac{\partial y_3}{\partial d_i} = \dfrac{\partial y_3}{\partial y_2} \cdot \dfrac{\partial y_2}{\partial y_1} \cdot \dfrac{\partial y_1}{\partial d_i}$
$D_{d_i} y_3 = W_1\,\mathrm{diag}(\sigma'_1)\, W_2\,\mathrm{diag}(\sigma'_2)\, W_3\,\mathrm{diag}(\sigma'_3)$
Huh? What happened? Why did we multiply backward??
$\nabla_{d_i} y_3 = \mathrm{diag}(\sigma'_3)\, W_3^T\,\mathrm{diag}(\sigma'_2)\, W_2^T\,\mathrm{diag}(\sigma'_1)\, W_1^T$
ADD A LOSS FUNCTION
$\sigma(d_i W_1) = y_1$,  $\sigma(y_1 W_2) = y_2$,  $\sigma(y_2 W_3) = y_3$,  $\ell(y_3 W_4): \mathbb{R}^n \to \mathbb{R}$
$D\sigma_i(xW_i) = W_i\,\mathrm{diag}(\sigma'_i)$,   $\nabla_x \sigma_i(xW_i) = \mathrm{diag}(\sigma'_i)\, W_i^T$
$D_{d_i} \ell = W_1\,\mathrm{diag}(\sigma'_1)\, W_2\,\mathrm{diag}(\sigma'_2)\, W_3\,\mathrm{diag}(\sigma'_3)\, W_4\, D\ell$
$\nabla_{d_i} \ell = \ell'\, W_4^T\,\mathrm{diag}(\sigma'_3)\, W_3^T\,\mathrm{diag}(\sigma'_2)\, W_2^T\,\mathrm{diag}(\sigma'_1)\, W_1^T$
SANITY CHECK
What shape should the output be?
$\nabla_{d_i} \ell = \ell'\, W_4^T\,\mathrm{diag}(\sigma'_3)\, W_3^T\,\mathrm{diag}(\sigma'_2)\, W_2^T\,\mathrm{diag}(\sigma'_1)\, W_1^T$   (here $\ell' = \nabla\ell$)
INSANITY CHECK: WHAT ABOUT WEIGHTS?
$\sigma(d_i W_1) = y_1$,  $\sigma(y_1 W_2) = y_2$,  $\sigma(y_2 W_3) = y_3$,  $\ell(y_3): \mathbb{R}^n \to \mathbb{R}$
$\nabla_{W_3}\ell = \nabla_{W_3} y_3 \circ \nabla_{y_3}\ell$
This is BIG and not a matrix. "Shorthand" gradient: concatenate the gradient of each column:
$(\nabla_{W_3} y_3)_s = y_2^T\, \sigma'_3(y_2 W_3)$
Using this shorthand we get:
$\nabla_{W_3}\ell = (\nabla_{W_3} y_3)_s\, \mathrm{diag}(\nabla_{y_3}\ell)$
COMPLETE BACKPROP
$\sigma(d_i W_1) = y_1$,  $\sigma(y_1 W_2) = y_2$,  $\sigma(y_2 W_3) = y_3$,  $\ell(y_3 W_4): \mathbb{R}^n \to \mathbb{R}$ (last layer)
$\nabla_{y_3}\ell = \ell'\, W_4^T$
$\nabla_{y_2}\ell = \ell'\, W_4^T\,\mathrm{diag}(\sigma'_3)\, W_3^T$
$\nabla_{y_1}\ell = \ell'\, W_4^T\,\mathrm{diag}(\sigma'_3)\, W_3^T\,\mathrm{diag}(\sigma'_2)\, W_2^T$
$\nabla_{W_3}\ell = y_2^T\, \sigma'_3(y_2 W_3)\,\mathrm{diag}(\nabla_{y_3}\ell)$
$\nabla_{W_2}\ell = y_1^T\, \sigma'_2(y_1 W_2)\,\mathrm{diag}(\nabla_{y_2}\ell)$
$\nabla_{W_1}\ell = d_i^T\, \sigma'_1(d_i W_1)\,\mathrm{diag}(\nabla_{y_1}\ell)$
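A minimal NumPy sketch of these formulas for a three-layer sigmoid net, checked against a finite difference. For brevity it uses a plain squared-error loss on $y_3$ rather than the $\ell(y_3 W_4)$ layer of the slides, and the weight shapes and test data are arbitrary.

```python
import numpy as np

sig = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward(d, W1, W2, W3):
    y1 = sig(d @ W1); y2 = sig(y1 @ W2); y3 = sig(y2 @ W3)
    return y1, y2, y3

def backprop(d, t, W1, W2, W3):
    """Gradients of loss = 0.5*||y3 - t||^2 w.r.t. W1, W2, W3 (row-vector convention)."""
    y1, y2, y3 = forward(d, W1, W2, W3)
    g3 = y3 - t                           # grad of loss w.r.t. y3
    s3 = y3 * (1 - y3)                    # sigma'_3 evaluated at y2 W3
    gW3 = y2.T @ (s3 * g3)                # y2^T sigma'_3 diag(grad_{y3} loss)
    g2 = (s3 * g3) @ W3.T                 # grad_{y2} loss = grad_{y3} loss diag(sigma'_3) W3^T
    s2 = y2 * (1 - y2)
    gW2 = y1.T @ (s2 * g2)
    g1 = (s2 * g2) @ W2.T
    s1 = y1 * (1 - y1)
    gW1 = d.T @ (s1 * g1)
    return gW1, gW2, gW3

rng = np.random.default_rng(0)
d, t = rng.standard_normal((1, 4)), rng.random((1, 2))
W1, W2, W3 = rng.standard_normal((4, 5)), rng.standard_normal((5, 3)), rng.standard_normal((3, 2))

# finite-difference check of one weight entry
gW1, _, _ = backprop(d, t, W1, W2, W3)
loss = lambda W: 0.5 * np.sum((forward(d, W, W2, W3)[2] - t) ** 2)
E = np.zeros_like(W1); E[2, 1] = 1e-6
print(np.isclose(gW1[2, 1], (loss(W1 + E) - loss(W1 - E)) / 2e-6))
```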
CONVOLUTIONAL NETS
Convolution-based template matching
Regular-old MLP: $y_1 = \sigma(dW_1)$
Fancy-schmancy conv-net: $y_1 = \sigma(dK_1) = \sigma(d * k_1)$   ($K$ = conv matrix, $k$ = stencil)
CONVOLUTIONAL NETS
$y = \sigma(\sigma(\sigma(dK_1)K_2)K_3)$
$y_1 = \sigma(dK_1)$,  $y_2 = \sigma(y_1 K_2)$,  $y_3 = \sigma(y_2 K_3)$
Adjoint of convolution: $\nabla_x \sigma_i(xK_i) = \mathrm{diag}(\sigma'_i)\, K_i^T$
CONVOLUTIONAL NETS
Adjoint of convolution: $\nabla_x \sigma_i(xK_i) = \mathrm{diag}(\sigma'_i)\, K_i^T$
Forward: $K = F^H D F$, with diagonal $D$ given by $Fk$ (the FFT of the stencil)
Adjoint: $K^T = F^H \bar{D} F$, with diagonal $\bar{D}$ given by $F\,\mathrm{flip}(k)$ (the FFT of the flipped stencil)
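A small NumPy check of this relationship on a 1D circulant convolution (the length-16 signal and random stencil are just illustrative): the convolution matrix is diagonalized by the FFT, and its transpose corresponds to the flipped stencil, i.e. the conjugated diagonal.

```python
import numpy as np

n = 16
rng = np.random.default_rng(0)
k = np.zeros(n); k[:3] = rng.standard_normal(3)    # a short stencil, zero-padded to length n

# circulant convolution matrix K built from the stencil
K = np.array([np.roll(k, i) for i in range(n)]).T

x, y = rng.standard_normal(n), rng.standard_normal(n)
D = np.fft.fft(k)                                   # diagonal of K in the Fourier basis
flip_k = np.roll(k[::-1], 1)                        # circular flip of the stencil
D_adj = np.fft.fft(flip_k)

print(np.allclose(K @ x, np.fft.ifft(D * np.fft.fft(x)).real))          # K = F^H D F
print(np.allclose(K.T @ y, np.fft.ifft(D_adj * np.fft.fft(y)).real))    # K^T = F^H conj(D) F
print(np.allclose(D_adj, np.conj(D)))                                   # flipped stencil <-> conjugate diagonal
```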