Scientific Computing for X-Ray Computed Tomography
Introduction to Optimization (part I)
Martin S. Andersen
Section for Scientific Computing, DTU Compute
January 21, 2019
Unconstrained minimization
g : Rn → R differentiable
minimize g(x)
Gradient method: choose x(0) ∈ Rn and iterate

x(k+1) = x(k) − tk∇g(x(k)),   k = 0, 1, 2, . . .

• constant steps: tk = t > 0
• diminishing steps: tk = t/√k > 0
• exact line search: tk = argmin_{t ≥ 0} g(x(k) − t∇g(x(k)))
• backtracking line search
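As a concrete illustration, here is a minimal NumPy sketch of the gradient method with a constant step size; the names gradient_method and grad_g are placeholders for a user-supplied objective gradient, not part of the lecture material:

    import numpy as np

    def gradient_method(grad_g, x0, t=0.1, max_iter=1000, tol=1e-8):
        # Gradient method with constant step size t (a minimal sketch).
        x = np.asarray(x0, dtype=float)
        for k in range(max_iter):
            grad = grad_g(x)
            if np.linalg.norm(grad) <= tol:  # stop when the gradient is small
                break
            x = x - t * grad                 # x(k+1) = x(k) - t grad g(x(k))
        return x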
Example
g(x₁, x₂) = (1/2)x₁² + (1/4)x₂⁴ − (1/2)x₂²,   ∇g(x₁, x₂) = [ x₁,  x₂(x₂² − 1) ]ᵀ
[Figure: surface plot of g(x₁, x₂) for x₁, x₂ ∈ [−2, 2]]
The gradient method does not converge to a local minimum if the second component of x(0) is zero: the second component of the gradient then vanishes, so all iterates remain on the x₁-axis and converge to the saddle point at the origin.
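This failure mode is easy to reproduce numerically; the sketch below (the step size 0.5 is an arbitrary choice) starts on the x₁-axis, where the second gradient component vanishes:

    import numpy as np

    grad_g = lambda x: np.array([x[0], x[1]**3 - x[1]])

    # Starting with x2 = 0, every iterate keeps x2 = 0, so the method
    # converges to the saddle point (0, 0) rather than a local minimum.
    x = np.array([2.0, 0.0])
    for k in range(100):
        x = x - 0.5 * grad_g(x)
    print(x)  # approximately [0, 0]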
Lipschitz continuity
g : R → R is Lipschitz continuous if there exists a constant L such that
|g(y)− g(x)| ≤ L|y − x| for all x, y
L is referred to as a Lipschitz constant
Interpretation: left and right derivatives belong to [−L,L]
Lipschitz continuity
Multivariate functions
F : Rn → Rm is Lipschitz continuous with constant L if
‖F (y)− F (x)‖ ≤ L‖y − x‖ for all x, y
Lipschitz continuous gradient
∇g : Rn → Rn is Lipschitz continuous with constant L if
‖∇g(y)−∇g(x)‖ ≤ L‖y − x‖ for all x, y
for g : Rn → R continuously differentiable
Lipschitz continuous gradient (I)
Suppose ∇g is Lipschitz continuous and let φ(τ) = g(x+ τ(y−x))
Newton–Leibniz rule
∫₀¹ φ′(τ) dτ = φ(1) − φ(0)

∫₀¹ ∇g(x + τ(y − x))ᵀ(y − x) dτ = g(y) − g(x)
Lipschitz continuous gradient (II)
Let p = y − x
g(y) = g(x) + ∇g(x)ᵀp + ∫₀¹ (∇g(x + τp) − ∇g(x))ᵀp dτ
     ≤ g(x) + ∇g(x)ᵀp + ‖p‖₂ ∫₀¹ ‖∇g(x + τp) − ∇g(x)‖₂ dτ

Lipschitz property: ‖∇g(x + τp) − ∇g(x)‖₂ ≤ L‖x + τp − x‖₂ = τL‖p‖₂, so

g(x + p) ≤ g(x) + ∇g(x)ᵀp + ‖p‖₂ ∫₀¹ τL‖p‖₂ dτ

which yields the quadratic upper bound

g(y) ≤ g(x) + ∇g(x)ᵀ(y − x) + (L/2)‖y − x‖₂²
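The bound can be sanity-checked numerically; for a quadratic g(x) = (1/2)xᵀQx the gradient x ↦ Qx is Lipschitz with constant L = ‖Q‖₂ exactly, so the inequality must hold for every pair (x, y). The random test data below is illustrative only:

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.standard_normal((5, 5))
    Q = B.T @ B                      # symmetric positive semidefinite
    L = np.linalg.norm(Q, 2)         # Lipschitz constant of x -> Qx

    g = lambda x: 0.5 * x @ Q @ x
    grad_g = lambda x: Q @ x

    x, y = rng.standard_normal(5), rng.standard_normal(5)
    lhs = g(y)
    rhs = g(x) + grad_g(x) @ (y - x) + 0.5 * L * np.linalg.norm(y - x)**2
    assert lhs <= rhs + 1e-12        # quadratic upper bound holds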
Majorization minimization
A function ψ(y;x) is said to majorize g at x if
i. ψ(y;x) ≥ g(y) for all y
ii. ψ(x;x) = g(x)
ψ minorizes g provided that −ψ majorizes −g
Majorization minimization
x(k+1) = argminₓ ψ(x; x(k))

yields a descent method:

g(x(k)) = ψ(x(k); x(k)) ≥ ψ(x(k+1); x(k)) ≥ g(x(k+1))
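In code, majorization minimization is just a loop around an oracle that minimizes the majorizer; below is a minimal skeleton, where minimize_majorizer is an assumed user-supplied function returning argmin_y ψ(y; x):

    def mm(minimize_majorizer, x0, max_iter=100):
        # Generic majorization-minimization loop (a sketch).
        # By properties (i) and (ii) above, each step cannot increase g.
        x = x0
        for k in range(max_iter):
            x = minimize_majorizer(x)   # x(k+1) = argmin_y psi(y; x(k))
        return x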
Functions with Lipschitz continuous gradient

The Lipschitz property (quadratic upper bound) yields a majorization

g(y) ≤ g(x) + ∇g(x)ᵀ(y − x) + (L/2)‖y − x‖₂² =: ψ(y; x)
Minimizing the majorization (minimizing the right-hand side with respect to y) yields

x(k+1) = x(k) − (1/L)∇g(x(k))

with the descent property

g(x(k+1)) ≤ g(x(k)) − (1/(2L))‖∇g(x(k))‖₂²
• possible to show that ‖∇g(x(k))‖₂ → 0 as k → ∞
• x(k) may not converge to a local minimum!
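The descent property is easy to verify numerically on a quadratic, where the Lipschitz constant of the gradient is known exactly (random test data for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    B = rng.standard_normal((8, 8))
    Q = B.T @ B
    L = np.linalg.norm(Q, 2)

    g = lambda x: 0.5 * x @ Q @ x
    grad_g = lambda x: Q @ x

    x = rng.standard_normal(8)
    for k in range(20):
        grad = grad_g(x)
        x_next = x - grad / L        # minimize the majorizer psi(y; x)
        # descent property: g(x+) <= g(x) - ||grad||^2 / (2L)
        assert g(x_next) <= g(x) - grad @ grad / (2 * L) + 1e-10
        x = x_next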
Convex sets
C ⊆ Rn is convex if for all x, y ∈ C and θ ∈ [0, 1]
θx+ (1− θ)y ∈ C
[Figure: a convex set and a nonconvex set, each with two points x and y; the segment between x and y leaves the nonconvex set]
Convex functions
g : C → R is convex if for all x, y ∈ C and θ ∈ [0, 1]
g(θx+ (1− θ)y) ≤ θg(x) + (1− θ)g(y)
[Figure: graph of a convex function; the chord between (x, g(x)) and (y, g(y)) lies above the graph]
• the domain C ⊆ Rn must be a convex set
• g is concave if −g is convex
Strict and strong convexity
g is strictly convex if for all x, y ∈ dom g, x ≠ y, and θ ∈ (0, 1)
g(θx+ (1− θ)y) < θg(x) + (1− θ)g(y)
g is strongly convex if for all x, y ∈ dom g and θ ∈ [0, 1]
g(θx + (1 − θ)y) ≤ θg(x) + (1 − θ)g(y) − (µ/2)θ(1 − θ)‖x − y‖₂²
µ > 0 is called the modulus of strong convexity
strongly convex ⊂ strictly convex ⊂ convex
First-order conditions for convexity
Differentiable g is convex if and only if dom g is convex and
g(y) ≥ g(x) + ∇g(x)ᵀ(y − x)
for all x, y ∈ dom g
[Figure: graph of g(y) lying above its tangent line g(x) + ∇g(x)ᵀ(y − x) at the point x]
• the first-order Taylor approximation is a global underestimator
• ∇g(x) = 0 implies that x is a global minimizer of g
First-order conditions for strict and strong convexity
Suppose g is differentiable and dom g is convex
g is strictly convex if and only if for all x, y ∈ dom g, x ≠ y
g(y) > g(x) + ∇g(x)ᵀ(y − x)
∇g(x) = 0 implies that x is the unique global minimizer of g
g is strongly convex with modulus µ > 0 if for all x, y ∈ dom g

g(y) ≥ g(x) + ∇g(x)ᵀ(y − x) + (µ/2)‖y − x‖₂²

which provides a global quadratic underestimator
Second-order conditions for convexity
Twice differentiable g is convex if and only if dom g is convex and
∇²g(x) ⪰ 0, ∀x ∈ dom g
Strict convexity (sufficient condition)
∇²g(x) ≻ 0, ∀x ∈ dom g
Strong convexity (necessary and sufficient condition)
∇²g(x) ⪰ µI, ∀x ∈ dom g
• ∇²g(x) ⪰ µI means that ∇²g(x) − µI is positive semidefinite
• implies that g(x) − (µ/2)‖x‖₂² is convex
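A smallest-eigenvalue computation gives a numerical spot check of these conditions at sample points (a check only, not a proof of convexity; the helper name and random data below are illustrative assumptions):

    import numpy as np

    def satisfies_second_order(hess, points, mu=0.0):
        # Check nabla^2 g(x) - mu*I >= 0 (psd) at each sample point.
        return all(np.linalg.eigvalsh(hess(x)).min() >= mu for x in points)

    # Example: g(x) = (1/2)||Ax - b||^2 + (gamma/2)||x||^2 has the constant
    # Hessian A^T A + gamma*I, so one sample point suffices here.
    A = np.random.default_rng(2).standard_normal((10, 4))
    gamma = 0.1
    hess = lambda x: A.T @ A + gamma * np.eye(4)
    print(satisfies_second_order(hess, [np.zeros(4)], mu=gamma))  # True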
Gradient descent — rate of convergence
g : Rn → R differentiable with Lipschitz continuous gradient
• convex g and constant step size t = 1/L:

g(x(k)) − g(x⋆) ≤ 2L‖x(0) − x⋆‖₂² / (k + 4)
• strongly convex g and constant step size t = 2/(µ + L):

g(x(k)) − g(x⋆) ≤ (L/2) ((Qg − 1)/(Qg + 1))^(2k) ‖x(0) − x⋆‖₂²

‖x(k) − x⋆‖₂ ≤ ((Qg − 1)/(Qg + 1))^k ‖x(0) − x⋆‖₂

where Qg = L/µ
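For a strongly convex quadratic the linear rate can be observed directly; the sketch below uses a diagonal Hessian (an illustrative choice) so that µ and L are known exactly and the minimizer is x⋆ = 0:

    import numpy as np

    d = np.linspace(1.0, 10.0, 6)       # Hessian eigenvalues: mu = 1, L = 10
    mu, L = d.min(), d.max()
    grad_g = lambda x: d * x            # g(x) = (1/2) sum_i d_i x_i^2

    t = 2.0 / (mu + L)                  # constant step size from above
    q = (L/mu - 1) / (L/mu + 1)         # contraction factor (Qg-1)/(Qg+1)

    x0 = np.random.default_rng(3).standard_normal(6)
    x = x0.copy()
    for k in range(1, 30):
        x = x - t * grad_g(x)
        # linear rate: ||x(k) - x*|| <= q^k ||x(0) - x*||
        assert np.linalg.norm(x) <= q**k * np.linalg.norm(x0) + 1e-12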
Backtracking line search
Parameters α ∈ (0, 1/2) and β ∈ (0, 1)
Start with t = 1 and repeat t := βt until
g(x + t∆x) < g(x) + αt∇g(x)ᵀ∆x
Sufficient decrease condition (t ≤ t0)
[Figure: g(x + t∆x) as a function of t, together with the lines g(x) + t∇g(x)ᵀ∆x and g(x) + tα∇g(x)ᵀ∆x; the condition holds for t ≤ t₀]
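A direct transcription of the procedure as a small Python helper (assuming dx is a descent direction, e.g. dx = -grad, so the loop terminates):

    def backtracking(g, grad, x, dx, alpha=0.25, beta=0.5):
        # Shrink t until the sufficient decrease condition holds.
        # grad is the gradient of g at x; grad @ dx must be negative.
        t = 1.0
        while g(x + t * dx) >= g(x) + alpha * t * (grad @ dx):
            t *= beta
        return t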
Regularized least-squares (I)
minimize g(x) = (1/2)‖Ax − b‖₂² + (γ/2)‖x‖₂²
∇g(x) = Aᵀ(Ax − b) + γx,   ∇²g(x) = AᵀA + γI
Gradient descent
x(k+1) = x(k) − t(Aᵀ(Ax(k) − b) + γx(k))
       = (I − t(AᵀA + γI))x(k) + tAᵀb
       = (I − t∇²g(x(k)))x(k) + tAᵀb
contraction map if ‖I − t∇²g(x(k))‖₂ < 1 or, equivalently,

|1 − t(‖A‖₂² + γ)| < 1
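Putting the pieces together for the regularized least-squares problem, with step size t = 1/L and random data chosen purely for illustration:

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((50, 10))
    b = rng.standard_normal(50)
    gamma = 0.1

    L = np.linalg.norm(A, 2)**2 + gamma   # Lipschitz constant of the gradient
    t = 1.0 / L

    x = np.zeros(10)
    for k in range(500):
        x = x - t * (A.T @ (A @ x - b) + gamma * x)

    # compare with the closed-form solution (A^T A + gamma I) x = A^T b
    x_star = np.linalg.solve(A.T @ A + gamma * np.eye(10), A.T @ b)
    print(np.linalg.norm(x - x_star))     # small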
Regularized least-squares (II)
The Lipschitz constant L = ‖A‖₂² + γ can be estimated iteratively via power iteration:
let x ≠ 0 be a random vector
for k = 1, . . . , M
    z ← Ax/‖x‖₂
    x ← Aᵀz
end

L̂ = ‖x‖₂ + γ ≤ ‖A‖₂² + γ
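A direct NumPy transcription of this power iteration (the function name and M = 50 iterations are illustrative choices):

    import numpy as np

    def estimate_L(A, gamma, M=50, seed=0):
        # Power iteration on A^T A: returns a lower bound on ||A||_2^2 + gamma
        # that approaches it as M grows.
        x = np.random.default_rng(seed).standard_normal(A.shape[1])
        for _ in range(M):
            z = A @ x / np.linalg.norm(x)
            x = A.T @ z
        return np.linalg.norm(x) + gamma

    A = np.random.default_rng(5).standard_normal((50, 10))
    print(estimate_L(A, 0.1), np.linalg.norm(A, 2)**2 + 0.1)  # nearly equal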
• modulus of strong convexity: µ ≥ γ > 0
• linear rate of convergence: Qg = L/µ ≤ (‖A‖₂² + γ)/γ