Solving composite optimization problems, with applications to phase retrieval

John Duchi (based on joint work with Feng Ruan)


Outline

Composite optimization problems

Methods for composite optimization

Application: robust phase retrieval

Experimental evaluation

Large scale composite optimization?

What I hope to accomplish today

- Investigate problem structures that are not quite convex but still amenable to elegant solution approaches
- Show how we can leverage stochastic structure to turn hard non-convex problems into "easy" ones [Keshavan, Montanari, Oh 10; Loh & Wainwright 12]
- Consider large scale versions of these problems

Composite optimization problems

The problem:

    minimize_x   f(x) := h(c(x))

where h : R^m → R is convex and c : R^n → R^m is smooth

Motivation: the exact penalty

    minimize_x   f(x)   subject to   x ∈ X

is equivalent (for all large enough λ) to

    minimize_x   f(x) + λ dist(x, X)

Similarly,

    minimize_x   f(x)   subject to   c(x) = 0

is equivalent (for all large enough λ) to

    minimize_x   f(x) + λ ‖c(x)‖

where the penalty term λ ‖c(x)‖ = h(c(x)) for h(z) = λ ‖z‖

[Fletcher & Watson 80, 82; Burke 85]

Motivation: nonlinear measurements and modeling

- Have true signal x? ∈ R^n and measurement vectors ai ∈ R^n
- Observe nonlinear measurements

      bi = φ(〈ai, x?〉) + ξi,   i = 1, . . . , m

  for φ(·) a nonlinear but smooth function

An objective:

    f(x) = (1/m) Σ_{i=1}^m ( φ(〈ai, x〉) − bi )²

Nonlinear least squares [Nocedal & Wright 06; Plan & Vershynin 15; Oymak & Soltanolkotabi 16]

(Robust) Phase retrieval

[Candes, Li, Soltanolkotabi 15]

Observations (usually)

    bi = 〈ai, x?〉²

yield the objective

    f(x) = (1/m) Σ_{i=1}^m | 〈ai, x〉² − bi |

Optimization methods

How do we solve optimization problems?

1. Build a "good" but simple local model of f
2. Minimize the model (perhaps regularizing)

Gradient descent: Taylor (first-order) model

    f(y) ≈ fx(y) := f(x) + ∇f(x)^T (y − x)

Newton's method: Taylor (second-order) model

    f(y) ≈ fx(y) := f(x) + ∇f(x)^T (y − x) + (1/2)(y − x)^T ∇²f(x) (y − x)
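(A connection worth making explicit, though it is not spelled out on the slides: for the first-order Taylor model, minimizing the regularized model recovers the usual gradient step, since

    argmin_y { fx(y) + (1/(2α)) ‖y − x‖₂² } = x − α ∇f(x).

The prox-linear method below plays the same game, with a convex composite model in place of the linear one.)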

Modeling composite problems

Now we make a convex model of f(x) = h(c(x)) by linearizing c: since

    c(x) + ∇c(x)^T (y − x) = c(y) + O(‖x − y‖²),

we have

    f(y) ≈ h( c(x) + ∇c(x)^T (y − x) ),

and we define the model

    fx(y) := h( c(x) + ∇c(x)^T (y − x) )

[Burke 85; Drusvyatskiy, Ioffe, Lewis 16]

Example: f(x) = |x² − 1|, h(z) = |z| and c(x) = x² − 1
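To make the second-order agreement concrete, here is a minimal numpy check (ours, not from the slides) on the 1-D example; the function names are ours.

```python
import numpy as np

# 1-D example from the slides: f(x) = |x^2 - 1| with h(z) = |z|, c(x) = x^2 - 1.
h = np.abs
c = lambda x: x**2 - 1.0
dc = lambda x: 2.0 * x

def f(x):
    return h(c(x))

def f_model(x, y):
    # Convex model f_x(y) = h(c(x) + c'(x)(y - x)): h convex composed with an affine map.
    return h(c(x) + dc(x) * (y - x))

x = 1.3
for delta in [1e-1, 1e-2, 1e-3]:
    y = x + delta
    print(f"|y - x| = {delta:.0e}   |f_x(y) - f(y)| = {abs(f_model(x, y) - f(y)):.1e}")
```

The printed gap shrinks like |y − x|², matching the O(‖x − y‖²) error of the linearization.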

The prox-linear method [Burke, Drusvyatskiy et al.]

Iteratively (1) form a regularized convex model and (2) minimize it:

    xk+1 = argmin_{x ∈ X} { fxk(x) + (1/(2α)) ‖x − xk‖₂² }
         = argmin_{x ∈ X} { h( c(xk) + ∇c(xk)^T (x − xk) ) + (1/(2α)) ‖x − xk‖₂² }

Successive iterates in the accompanying figures satisfy

    |xk − x?| = .3,  .024,  3 · 10⁻⁴,  4 · 10⁻⁸
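As a sanity check (ours, not the authors' code), here is the prox-linear iteration on the 1-D example f(x) = |x² − 1|; for a single absolute-value term the subproblem has a closed-form solution that one can verify from the optimality conditions. The exact error sequence depends on the regularization α, which the slides do not specify, so the printout illustrates the quadratic behavior rather than reproducing the numbers above.

```python
import numpy as np

def prox_linear_step_1d(x, alpha):
    """One prox-linear step for f(x) = |x^2 - 1| = h(c(x)) with h = |.|, c(x) = x^2 - 1.

    Minimizes |c0 + g*d| + d^2 / (2*alpha) over the step d = y - x; the minimizer is
    d = -g * clip(c0 / g^2, -alpha, alpha).
    """
    c0, g = x**2 - 1.0, 2.0 * x
    d = -g * np.clip(c0 / g**2, -alpha, alpha)
    return x + d

x, alpha = 1.3, 1.0
for k in range(5):
    print(f"k = {k}   |x_k - 1| = {abs(x - 1.0):.1e}")
    x = prox_linear_step_1d(x, alpha)
```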

Robust phase retrieval problems

A nice application for these composite methods.

Data model: true signal x? ∈ R^n; for pfail < 1/2 observe

    bi = 〈ai, x?〉² + ξi   where   ξi = 0 w.p. ≥ 1 − pfail, and ξi is arbitrary otherwise

Goal: solve

    minimize_x   f(x) = (1/m) Σ_{i=1}^m | 〈ai, x〉² − bi |

Composite problem: f(x) = (1/m) ‖φ(Ax) − b‖₁ = h(c(x)), where φ(·) is the elementwise square,

    h(z) = (1/m) ‖z‖₁,   c(x) = φ(Ax) − b
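For concreteness, a minimal sketch of one prox-linear subproblem for this objective, written with cvxpy (an assumption; the authors' implementation may differ, and solver choice and stopping rules are omitted):

```python
import cvxpy as cp
import numpy as np

def prox_linear_step(A, b, x_k, alpha):
    """One prox-linear step for f(x) = (1/m) || (A x)^2 - b ||_1.

    The i-th entry of the linearized inner map c(x_k) + grad c(x_k)^T (y - x_k) is
    <a_i, x_k>^2 - b_i + 2 <a_i, x_k> <a_i, y - x_k>, which is affine in y,
    so the model below is convex.
    """
    m, n = A.shape
    y = cp.Variable(n)
    Ax_k = A @ x_k
    linearized = Ax_k**2 - b + 2 * cp.multiply(Ax_k, A @ (y - x_k))
    objective = cp.sum(cp.abs(linearized)) / m + cp.sum_squares(y - x_k) / (2 * alpha)
    cp.Problem(cp.Minimize(objective)).solve()
    return y.value

# Tiny synthetic usage example.
rng = np.random.default_rng(0)
n, m = 10, 60
x_star = rng.normal(size=n)
A = rng.normal(size=(m, n))
b = (A @ x_star)**2
x = prox_linear_step(A, b, x_star + 0.1 * rng.normal(size=n), alpha=1.0)
print(np.linalg.norm(x - x_star))
```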

A convergence theorem

Three key ingredients:

(1) Stability: f(x) − f(x?) ≥ λ ‖x − x?‖₂ ‖x + x?‖₂
(2) Close models: |fx(y) − f(y)| ≤ (1/m) |||A^T A|||_op ‖x − y‖₂²
(3) A good initialization

Here:

- Measurement matrix A = [a1 · · · am]^T ∈ R^{m×n}, with (1/m) A^T A = (1/m) Σ_{i=1}^m ai ai^T
- Convex model fx of f at x defined by fx(y) = h(c(x) + ∇c(x)^T (y − x)), which here is

      fx(y) = (1/m) Σ_{i=1}^m | 〈ai, x〉² − bi + 2 〈ai, x〉 〈ai, y − x〉 |

Theorem (D. & Ruan 17)
Define dist(x, x?) = min{ ‖x − x?‖₂, ‖x + x?‖₂ }. Let xk be generated by the prox-linear method and L = (1/m) |||A^T A|||_op. Then

    dist(xk, x?) ≤ ( (2L/λ) dist(x0, x?) )^(2^k).

Unpacking the convergence theorem

Theorem (D. & Ruan 17, restated)
With dist(x, x?) = min{ ‖x − x?‖₂, ‖x + x?‖₂ } and L = (1/m) |||A^T A|||_op, the prox-linear iterates satisfy

    dist(xk, x?) ≤ ( (2L/λ) dist(x0, x?) )^(2^k).

- Quadratic convergence: for all intents and purposes, 6 iterations
- Requires solving explicit convex optimization problems (quadratic programs) with no tuning parameters

Ingredients in convergence: stability

1. Stability (cf. Eldar and Mendelson 14):

    f(x) − f(x?) ≥ λ ‖x − x?‖₂ ‖x + x?‖₂

What is necessary?

Proposition (D. & Ruan 17)
Assume the uniformity condition: for all u, v ∈ R^n and a ∼ P,

    P( |u^T a a^T v| ≥ ε₀ ‖u‖₂ ‖v‖₂ ) ≥ c > 0.

Then f is (1/2)ε₀-stable with probability at least 1 − e^(−cm).

(Gaussians satisfy this)

Growth condition (stability):

    〈ai, x〉² − 〈ai, x?〉² = 〈ai, x − x?〉 〈ai, x + x?〉

and under random ai with uniform enough support,

    f(x) = (1/m) Σ_{i=1}^m | (x − x?)^T ai ai^T (x + x?) | ≳ ‖x − x?‖₂ ‖x + x?‖₂

Ingredients in convergence

2. Approximation: need (1/m) |||A^T A|||_op = O(1)

What is necessary?

Proposition (Vershynin 11)
If the measurement vectors ai are sub-Gaussian, then

    (1/m) |||A^T A|||_op ≤ O(1) · ( √(n/m) + t )   w.p. ≥ 1 − e^(−mt²).

Heavy-tailed data gets (1/m) |||A^T A|||_op = O(1) with reasonable probability for m a bit larger.

Ingredients in convergence: spectral initialization

Insight [Wang, Giannakis, Eldar 16]: most vectors ai ∈ R^n are orthogonal to x?, so

    X_init := Σ_{i : bi ≤ median(b)} ai ai^T

satisfies

    X_init ≈ E[ai ai^T] − c d? d?^T,   where d? = x? / ‖x?‖₂

3. Initialization: we need dist(x0, x?) ≲ (1/2) ‖x?‖₂

Estimate the direction d ≈ x? / ‖x?‖₂ and radius r by

    X_init := Σ_{i : bi ≤ median(b)} ai ai^T,   d = argmin_{d ∈ S^{n−1}} { d^T X_init d },

    r := ( (1/m) Σ_{i=1}^m bi )^(1/2) ≈ ‖x?‖₂

Proposition (D. & Ruan 17)
Under appropriate orthogonality conditions, x0 = r d satisfies

    dist(x0, x?) ≲ √(n/m) + t

with probability at least 1 − e^(−mt²)
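A small numpy sketch (ours) of this initialization; the direction is the bottom eigenvector of X_init, and the radius estimate uses √(mean(b)), consistent with E[bi] = ‖x?‖₂² for isotropic measurements:

```python
import numpy as np

def spectral_init(A, b):
    """Spectral initialization sketch: direction from the smallest eigenvector of
    X_init built from the below-median measurements, radius from the mean of b."""
    keep = b <= np.median(b)
    X_init = A[keep].T @ A[keep]               # sum of a_i a_i^T over kept indices
    _, eigvecs = np.linalg.eigh(X_init)        # eigenvalues in ascending order
    d = eigvecs[:, 0]                          # argmin over the unit sphere of d^T X_init d
    r = np.sqrt(np.mean(b))                    # radius estimate, roughly ||x*||_2
    return r * d

rng = np.random.default_rng(1)
n, m = 50, 400
x_star = rng.normal(size=n)
A = rng.normal(size=(m, n))
b = (A @ x_star)**2
x0 = spectral_init(A, b)
print(min(np.linalg.norm(x0 - x_star), np.linalg.norm(x0 + x_star)))
```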

Take-home result

- Stability: measurements ai are uniform enough in direction
- Closeness: ai are sub-Gaussian or normalized
- Sufficient conditions for initialization: for v ∈ S^{n−1},

      E[ai ai^T | 〈ai, v〉² ≤ ‖v‖₂²] = I_n − c v v^T + E,

  where c > 0 and E is a small error
- Measurement failure probability pfail ≤ 1/4

Theorem (D. & Ruan 17)
If these conditions hold and m/n ≳ 1, then the spectral initialization succeeds and the iterates xk of the prox-linear algorithm satisfy

    dist(xk, x?) ≤ ( O(1) · dist(x0, x?) )^(2^k)

Experiments

1. Random (Gaussian) measurements

2. Adversarially chosen outliers

3. Real images

Experiment 1: random Gaussian measurements

- Data generation: dimension n = 3000,

      ai ~iid N(0, In)   and   bi = 〈ai, x?〉²

- Compare to Wang, Giannakis, Eldar's Truncated Amplitude Flow (best performing non-convex approach)
- Look at success probability against m/n (note that m ≥ 2n − 1 is necessary for injectivity)

[Figure: P(success) vs. m/n over 1.80–2.20, comparing the prox-linear method (Prox) and TAF]

[Figure: fraction of one-sided success vs. m/n over 1.80–2.20, comparing Prox and TAF]

Experiment 2: corrupted measurements

- Data generation: dimension n = 200,

      ai ~iid N(0, In)   and   bi = 0 w.p. pfail, bi = 〈ai, x?〉² otherwise

  (zeroing out measurements is the corruption that most confuses our initialization method)
- Compare to Zhang, Chi, Liang's Median-Truncated Wirtinger Flow (designed specifically for standard Gaussian measurements)
- Look at success probability against m/n (note that m ≥ 2n − 1 is necessary for injectivity)

[Figure: recovery success rates (0–1) as a function of pfail ∈ [0, 0.3] and m/n ∈ {1.8, 2.0, 2.5, 3.0, 4.0, 6.0, 8.0}]

Experiment 3: digit recovery

- Data generation: handwritten 16×16 grayscale digits, sensing matrix (see the sketch after this list)

      A = [Hn S1; Hn S2; Hn S3] ∈ R^{3n×n}

  where n = 256, the Sl are diagonal random sign matrices, and Hn is the Hadamard transform matrix
- Observe

      b = (Ax?)² + ξ   where   ξi = 0 w.p. 1 − pfail, Cauchy otherwise

- Other non-convex approaches are designed for Gaussian data; unclear how to parameterize them
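A small construction sketch (ours) of this sensing matrix for power-of-two n, using scipy's dense Hadamard matrix; whether Hn carries a 1/√n normalization is our assumption, and at larger scales a fast Walsh–Hadamard transform would replace the dense matrix:

```python
import numpy as np
from scipy.linalg import hadamard

def hadamard_sign_sensing(n, num_blocks=3, seed=0):
    """Stack num_blocks copies of H_n @ S_l, where S_l is a random diagonal +/-1 matrix.

    Returns A of shape (num_blocks * n, n); n must be a power of two.
    """
    rng = np.random.default_rng(seed)
    H = hadamard(n) / np.sqrt(n)                  # orthonormal scaling (our assumption)
    blocks = []
    for _ in range(num_blocks):
        signs = rng.choice([-1.0, 1.0], size=n)   # diagonal of S_l
        blocks.append(H * signs)                  # equals H @ diag(signs)
    return np.vstack(blocks)

A = hadamard_sign_sensing(256)
print(A.shape)   # (768, 256)
```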

[Figure: digit recovery example — left: true image; middle: spectral initialization; right: solution]

[Figure: success probability and matrix-multiply count vs. pfail ∈ [0, 0.2] — performance of the composite optimization scheme versus failure probability]

Experiment 4: real images

Signal size n = 2²², measurements m = 3 · 2²⁴

Composite optimization at scale

Question: what if we have composite problems with a really big sample?

- Typical stochastic optimization setup:

      f(x) = E[F(x; S)]   where   F(x; S) = h(c(x; S); S)

- Example: large scale (robust) nonlinear regression,

      f(x) = (1/m) Σ_{i=1}^m | φ(〈ai, x〉) − bi |

A stochastic composite method

- Define the (random) convex approximation

      Fx(y; s) = h( c(x; s) + ∇c(x; s)^T (y − x) ; s ),   where c(x; s) + ∇c(x; s)^T (y − x) ≈ c(y; s)

- Then iterate: for k = 1, 2, . . ., sample Sk ~iid P and set (a minimal sketch follows this list)

      xk+1 = argmin_{x ∈ X} { Fxk(x; Sk) + (1/(2αk)) ‖x − xk‖₂² }
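A minimal numpy sketch (ours, not the authors' code) of this iteration for the robust single-measurement loss F(x; (a, b)) = |φ(〈a, x〉) − b|; with one absolute-value term the regularized model has a closed-form minimizer, which one can check from the optimality conditions. The diminishing stepsize below is a placeholder.

```python
import numpy as np

def stochastic_prox_linear_step(x, a, b, phi, dphi, alpha):
    """One stochastic prox-linear step for F(x; (a, b)) = |phi(<a, x>) - b|.

    Linearizing the inner map gives c0 + g^T d with c0 = phi(<a, x>) - b and
    g = dphi(<a, x>) * a; minimizing |c0 + g^T d| + ||d||^2 / (2 alpha) over the
    step d yields d = -g * clip(c0 / ||g||^2, -alpha, alpha).
    """
    t = a @ x
    c0, g = phi(t) - b, dphi(t) * a
    d = -g * np.clip(c0 / (g @ g), -alpha, alpha)
    return x + d

# Usage sketch on phase retrieval, phi(t) = t^2, noiseless measurements.
rng = np.random.default_rng(2)
n = 100
x_star = rng.normal(size=n)
x = x_star + 0.1 * rng.normal(size=n)          # pretend we start near the signal
phi, dphi = (lambda t: t**2), (lambda t: 2.0 * t)
for k in range(1, 2001):
    a = rng.normal(size=n)
    b = (a @ x_star)**2
    x = stochastic_prox_linear_step(x, a, b, phi, dphi, alpha=1.0 / k**0.6)
print(np.linalg.norm(x - x_star))
```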

Understanding convergence behavior

Ordinary differential equations (gradient flow):

    ẋ = −∇f(x),   i.e.   (d/dt) x(t) = −∇f(x(t))

Ordinary differential inclusions (subgradient flow):

    ẋ ∈ −∂f(x),   i.e.   (d/dt) x(t) ∈ −∂f(x(t))

The differential inclusion

For the stochastic function

    f(x) := E[F(x; S)] = E[h(c(x; S); S)] = ∫ h(c(x; s); s) dP(s)

the generalized subgradient (for non-convex, non-smooth f) is [D. & Ruan 17]

    ∂f(x) = ∫ ∇c(x; s) ∂h(c(x; s); s) dP(s)

Theorem (D. & Ruan 17)
For the stochastic composite problem, the subdifferential inclusion ẋ ∈ −∂f(x) has a unique trajectory for all time and

    f(x(t)) − f(x(0)) ≤ − ∫₀ᵗ ‖∂f(x(τ))‖² dτ.

It also has limit points, and they are stationary.

The limiting differential inclusion

Recall our iteration

    xk+1 = argmin_x { Fxk(x; Sk) + (1/(2αk)) ‖x − xk‖₂² }.

Optimality conditions: using Fx(y; s) = h(c(x; s) + ∇c(x; s)^T (y − x); s),

    0 ∈ ∇c(xk; s) ∂h( c(xk; s) + ∇c(xk; s)^T (xk+1 − xk) ; s ) + (1/αk)[xk+1 − xk],

where the argument of ∂h equals c(xk; s) ± O(‖xk − xk+1‖²). That is,

    (1/αk)[xk+1 − xk] ∈ −∇c(xk; s) ∂h(c(xk; s); s) + subgradient mess + Noise
                      = −∂f(xk) + subgradient mess + Noise

Graphical example

Iterate

    xk+1 = argmin_x { Fxk(x; Sk) + (1/(2αk)) ‖x − xk‖₂² }

A convergence guarantee

Consider the stochastic composite optimization problem

    minimize_{x ∈ X}   f(x) := E[F(x; S)]   where   F(x; s) = h(c(x; s); s).

Use the iteration

    xk+1 = argmin_{x ∈ X} { Fxk(x; Sk) + (1/(2αk)) ‖x − xk‖₂² }.

Theorem (D. & Ruan 17)
Assume X is compact and Σ_{k=1}^∞ αk = ∞, Σ_{k=1}^∞ αk² < ∞. Then the sequence {xk} satisfies

(1) f(xk) converges
(2) All cluster points of xk are stationary
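(For example, though the slides leave the schedule unspecified: polynomial stepsizes αk = α0 k^(−β) with β ∈ (1/2, 1] satisfy both conditions, since Σ k^(−β) diverges while Σ k^(−2β) converges; the stochastic sketch earlier uses β = 0.6.)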

Experiment: noiseless phase retrieval

[Figure: log-scale error (10⁻⁸ to 10⁻²) vs. iteration (0 to 200), comparing the prox-linear and stochastic (proxsgd) methods]

Conclusions

1. Broadly interesting structures for non-convex problems that are still approximable
2. Statistical modeling allows solution of non-trivial, non-smooth, non-convex problems
3. Large scale efficient methods still important

References

- Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval, arXiv:1705.02356
- Stochastic Methods for Composite Optimization Problems, arXiv:1703.08570