Conditional Value at Risk Sensitive Optimization via

Subgradient Methods

Avinash N. Madavan Subhonmesh Bose∗

April 16, 2020

Abstract

We study a first-order primal-dual subgradient method to optimize risk-constrained risk-penalized optimization problems, where risk is modeled via the popular conditional value at risk (CVaR) measure. The algorithm processes independent and identically distributed samples from the underlying uncertainty in an online fashion, and produces an η/√K-approximately feasible and η/√K-approximately optimal point within K iterations with a constant step size, where η increases with the tunable risk parameters of CVaR. We find optimized step sizes using our bounds and precisely characterize the computational cost of risk aversion as revealed by the growth in η. Our proposed algorithm makes a simple modification to a typical primal-dual stochastic subgradient algorithm. With this mild change, our analysis surprisingly obviates the need for a priori bounds or complex adaptive bounding schemes for dual variables assumed in many prior works. We also draw interesting parallels in sample complexity with that for chance-constrained programs derived in the literature with a very different solution architecture.

1 Introduction

We study iterative primal-dual stochastic subgradient algorithms to solve risk-sensitive optimization problems of the form

PCVaR : minimize_{x∈X} F(x) := CVaRα[fω(x)],
        subject to Gi(x) := CVaRβi[giω(x)] ≤ 0, i = 1, . . . ,m, (1)

where ω ∈ Ω is random and α, β := (β1, . . . , βm) in [0, 1) define risk-aversion parameters. The collection of functions fω, g1ω, . . . , gmω are assumed convex, but not necessarily differentiable, over the convex set X ⊆ Rn, where R and R+ stand for the sets of real and nonnegative real numbers, respectively. Denote by G and gω the collections of the Gi's and the giω's, respectively, for i = 1, . . . ,m. CVaR stands for conditional value at risk. For any δ ∈ [0, 1), CVaRδ[yω] of a scalar random variable yω with continuous distribution equals its expectation computed over the 1 − δ tail of the distribution of yω. For yω with general distributions, CVaR is defined via the following variational characterization

CVaRδ[yω] = min_{u∈R} { u + (1/(1 − δ)) E[yω − u]+ }, (2)

following [31].

∗A. N. Madavan and S. Bose are with the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801. Emails: {madavan2,boses}@illinois.edu. This work was partially supported by the International Institute of Carbon-Neutral Energy Research (I2CNER) and the Power System Engineering Research Center (PSERC).

arXiv:1908.01086v2 [math.OC] 15 Apr 2020

PCVaR offers a modeler the flexibility to indicate her risk preference in α, β. With α close to zero, she indicates risk-neutrality towards the uncertain cost associated with the decision. With α closer to one, she expresses her risk aversion towards the same and seeks a decision that limits the possibility of large random costs associated with the decision. Similarly, the β's express the risk tolerance in constraint violation. Choosing β's close to zero indicates that the constraints need only be satisfied on average over Ω rather than on each sample. Driving β's to unity amounts to requiring the constraints to be met almost surely. Said succinctly, PCVaR offers the modeler the ability to customize her risk preference between the risk-neutral choice of expected evaluations of functions and the conservative choice of robust evaluations.
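The variational characterization (2) is easy to exercise numerically. The following sketch (our illustration, not the paper's code; the grid search and sample sizes are arbitrary choices) estimates CVaRδ[yω] from i.i.d. samples by minimizing the right-hand side of (2) over a grid of u, and checks it against the tail-expectation definition for a continuous distribution:

```python
import numpy as np

def cvar_variational(samples, delta, grid_size=500):
    # CVaR_delta[y] = min_u { u + E[y - u]_+ / (1 - delta) }, following (2).
    u_grid = np.linspace(samples.min(), samples.max(), grid_size)
    vals = [u + np.maximum(samples - u, 0.0).mean() / (1.0 - delta) for u in u_grid]
    return min(vals)

rng = np.random.default_rng(0)
y = rng.normal(size=20_000)
delta = 0.9
# For a continuous distribution, CVaR_delta equals the mean of the upper (1 - delta) tail.
tail_mean = np.sort(y)[int(delta * y.size):].mean()
assert abs(cvar_variational(y, delta) - tail_mean) < 0.05
```

Note that at δ = 0 the minimum is attained at any u below the smallest sample and the estimate reduces to the plain sample mean, matching the risk-neutral reading of CVaR0.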

There is a growing interest in solving risk-sensitive optimization problems with data. See [18, 3] for recent examples that tackle problems with the generalized mean semi-deviation risk that equals E[yω] + c E[|yω − E[yω]|p]^{1/p} for p > 1 for a random variable yω. There is a long literature on risk measures; see [31, 29, 32, 11, 22, 1] for an excellent survey. We choose CVaR for three particular reasons. First, it is a coherent risk measure, meaning that it is normalized, sub-additive, positively homogeneous and translation invariant, i.e.,

CVaRδ[0] = 0, CVaRδ[y1ω + y2ω] ≤ CVaRδ[y1ω] + CVaRδ[y2ω],
CVaRδ[t yω] = t CVaRδ[yω], CVaRδ[yω + t′] = CVaRδ[yω] + t′

for random variables yω, y1ω, y2ω, and scalars t > 0 and t′ ∈ R. An important consequence of coherence is that F and G in PCVaR inherit the convexity of fω and gω. Convexity together with the variational characterization in (2) allow us to design sampling-based primal-dual methods for PCVaR for which we are able to provide a finite sample analysis of approximate optimality and feasibility. The popularity of the CVaR measure is our second reason to study PCVaR. Following Rockafellar and Uryasev's seminal work in [31], CVaR has found applications in various engineering domains, e.g., see [23, 20], and we therefore anticipate wide applications of our result. Our third and final reason to study PCVaR is its close relation to other optimization paradigms in the literature, as we describe next.

PCVaR without constraints and with α = 0 reduces to the minimization of E[fω(x)], the canonical stochastic optimization problem. With α ↑ 1, the problem description of PCVaR approaches that of a robust optimization problem (see [4]) of the form min_{x∈X} ess sup_{ω∈Ω} fω(x), where ess sup denotes the essential supremum. Driving the β's to unity, PCVaR demands the constraints to be enforced almost surely. Such robust constraint enforcement is common in multi-stage stochastic optimization problems with recourse and discrete-time optimal control problems, e.g., in [34, 35, 15]. CVaR-based constraints are closely related to chance-constraints introduced by Charnes and Cooper in [11], which enforce Pr{gω(x) ≤ 0} > 1 − ε, where Pr refers to the probability measure on Ω. Even if gω is convex, chance-constraints typically describe a nonconvex feasible set. It is well-known that CVaR-based constraints provide a convex inner approximation of chance-constraints. Restricting the probability of constraint violation does not limit the extent of any possible violation, while CVaR-based enforcement does so in expectation. CVaR is also intimately related to the buffered probability of exceedance (bPOE) introduced and studied more recently in [22, 40]. In fact, bPOE is the inverse function of CVaR and hence, problems with bPOE-constraints can often be reformulated as instances of PCVaR.
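The inner-approximation property can be checked directly on samples. In this small sketch (our own example, not from the paper), g has a comfortable safety margin; its empirical CVaRβ is nonpositive, and since CVaRβ[gω(x)] upper-bounds the β-quantile of gω(x), the chance constraint Pr{gω(x) ≤ 0} ≥ β follows:

```python
import numpy as np

rng = np.random.default_rng(1)
g = rng.normal(loc=-2.0, scale=1.0, size=100_000)  # samples of g_omega(x) at some fixed x

beta = 0.9
# Empirical CVaR_beta: average of the worst (largest) (1 - beta) fraction of outcomes.
cvar_beta = np.sort(g)[int(beta * g.size):].mean()
violation_prob = (g > 0).mean()

# CVaR_beta[g] <= 0 means the worst (1 - beta) tail is nonpositive on average,
# which is stronger than the chance constraint Pr{g > 0} <= 1 - beta.
assert cvar_beta <= 0.0
assert violation_prob <= 1.0 - beta
```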

It can be challenging to compute the CVaR of fω(x) or gω(x) for a given decision variable x with respect to a general distribution on Ω for two reasons. First, if samples from Ω are obtained from a simulation tool, an explicit representation of the probability distribution on Ω may not be available. Second, even if such a distribution is available, computation of CVaR (or even the expectation) can be difficult. For example, with fω as the positive part of an affine function and ω being uniformly distributed over a unit hypercube, computation of E[fω] via a multivariate integral is #P-hard according to [16, Corollary 1]. Therefore, we do not assume knowledge of F and G but rather study a sampling-based algorithm to solve PCVaR.

Solution architectures for PCVaR via sampling come in two flavors. The first approach is sample average approximation (SAA), which replaces the expectation in (2) by an empirical average over N samples. One can then solve the sampled problem as a deterministic convex program.1 We take the second and alternate approach of stochastic approximation and process independent and identically distributed (i.i.d.) samples from Ω in an online fashion. Iterative stochastic approximation algorithms for the unconstrained problem have been studied since the early works by Robbins and Monro in [30] and by Kiefer and Wolfowitz in [19]. See [21] for a more recent survey. Zinkevich in [41] proposed a projected stochastic subgradient method to tackle constraints in such problems. Without directly knowing G, we cannot project the iterates on the feasible set {x ∈ X | G(x) ≤ 0}. We circumvent the challenge by associating Lagrange multipliers z ∈ Rm+ with the constraints and iteratively updating x, z using fω, gω and their subgradients via a first-order stochastic primal-dual algorithm for PCVaR along the lines of [26, 37, 39].

In Section 2, we first design and analyze Algorithm 1 for PCVaR with α = 0, β = 0, i.e., the optimization problem

PE : minimize_{x∈X} F(x) := E[fω(x)],
     subject to Gi(x) := E[giω(x)] ≤ 0, i = 1, . . . ,m. (3)

First-order stochastic primal-dual algorithms have a long history, dating back almost forty years, including that in [28, 26, 14, 21, 27]. The analyses of these algorithms often require a bound on the possible growth of the dual variables. Borkar and Meyn in [7] stress the importance of compactness assumptions in their analysis of stochastic approximation algorithms. A priori bounds used in [27] are difficult to know in practice, and techniques for the iterative construction of such bounds as in [26] require extra computational effort. Motivated to circumvent this limitation, we propose a simple modification to the classical primal-dual stochastic subgradient algorithm. We were surprised to find that its analysis allows us to completely bypass the need for explicit or implicit bounds on the dual variable. For Algorithm 1 with a constant step size, we bound the expected optimality gap and constraint violations at a suitably weighted average of the iterates by η/√K for a constant η. Using these bounds, we then carefully optimize the step size that allows us to reach within a given threshold of suboptimality and constraint violation with the minimum number of iterations. We also provide a stability analysis of our algorithm with decaying step sizes.

In Section 3, we solve PCVaR with general risk aversion parameters α, β using Algorithm 1 on an instance of PE obtained through a standard reformulation via the variational formula for CVaR from [31] in (2). We then bound the expected suboptimality and constraint violation at a weighted average of the iterates for PCVaR by η(α, β)/√K. Upon utilizing the optimized step sizes from the analysis of PE, we are then able to study the precise growth in the required iteration (and sample) complexity of PCVaR as a function of α, β. Not surprisingly, the more risk-averse a problem one aims to solve, the more this complexity increases. A modeler chooses risk aversion parameters primarily driven by attitudes towards risk in specific applications. Our precise characterization of the growth in sample complexity with risk aversion will permit the modeler to balance between desired risk levels and the computational challenge of handling that risk. Using concentration inequalities, we also report an interesting connection of our results to those in [9, 10] on scenario approximations to chance-constrained programs. The resemblance in sample complexity is surprising, given that the approach in [9, 10] solves a deterministic convex program with sampled constraints, while we process samples in an online fashion.

1For the unconstrained problem, variance-reduced stochastic gradient descent methods can efficiently minimize the resulting finite sum as in [33, 17].
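To make the reformulation concrete, here is a hedged sketch (our illustration; the names f_tilde and u0 are ours, not the paper's) of how (2) lifts a CVaR evaluation into an expectation over an augmented variable, at the price of one extra scalar decision variable per CVaR term:

```python
import numpy as np

# Via (2), CVaR_alpha[f_w(x)] = min_{u0} E[ u0 + (f_w(x) - u0)_+ / (1 - alpha) ],
# so the objective of P_CVaR becomes an expectation over the augmented variable
# (x, u0); each CVaR constraint is lifted the same way with its own scalar.
def f_tilde(x, u0, w, f, alpha):
    return u0 + max(f(x, w) - u0, 0.0) / (1.0 - alpha)

# Sanity check on a toy scalar example: minimizing the sample average of f_tilde
# over u0 recovers the empirical tail expectation of f_w(x).
rng = np.random.default_rng(2)
ws = rng.normal(size=5_000)
alpha, x = 0.8, 1.0
f = lambda x, w: x * w
grid = np.linspace(-3.0, 3.0, 61)
vals = [np.mean([f_tilde(x, u0, w, f, alpha) for w in ws]) for u0 in grid]
tail_mean = np.sort(x * ws)[int(alpha * ws.size):].mean()
assert abs(min(vals) - tail_mean) < 0.05
```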

There is an inherent problem in optimizing step sizes based on upper bounds on suboptimality and constraint violation. While the bounds are order-optimal, our numerical experiments reveal that a solution with the desired risk tolerance can usually be found in far fewer iterations than suggested by the upper bound. This limitation is unfortunately common for subgradient algorithms and likely cannot be overcome when optimizing the general nonsmooth functions that we study. We end the paper in Section 4 with discussions on possible extensions of our analysis.

Very recently, it was brought to our attention that the concurrent work in [6] presents a related approach to tackle optimization of composite nonconvex functions under related but different assumptions. In fact, their work appeared at the same time as our early version and claims a similar result that does not require bounds on the dual variables. Our analysis does not require or analyze the case with strongly convex functions within our setup, and therefore Nesterov-style acceleration remains untenable. As a result, our algorithm is different. Our focus on CVaR permits us to further analyze the growth in optimized sample complexity with risk aversion and its connection to chance-constrained optimization, which is quite different.

2 Algorithm for PE and its analysis

We present the primal-dual stochastic subgradient method to solve PE in Algorithm 1.

Algorithm 1: Primal-dual stochastic subgradient method for PE.

Initialization: Choose x1 ∈ X, z1 = 0, and a positive step-size sequence γ.
1 for k ≥ 1 do
2   Sample ωk ∈ Ω. Update x as

      xk+1 ← argmin_{x∈X} ⟨∇fωk(xk) + Σ_{i=1}^m zik ∇giωk(xk), x − xk⟩ + (1/(2γk)) ‖x − xk‖². (4)

3   Sample ωk+1/2 ∈ Ω. Update z as

      zk+1 ← argmax_{z∈Rm+} ⟨gωk+1/2(xk+1), z − zk⟩ − (1/(2γk)) ‖z − zk‖². (5)
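A minimal runnable sketch of Algorithm 1 on a toy instance of PE (the instance, the box X = [−5, 5], and the step-size value are our illustrative choices, not from the paper): minimize E[(x − 2)² + ωx] with E[ω] = 0, so F(x) = (x − 2)², subject to E[x − ω′] ≤ 0 with E[ω′] = 1, i.e., G(x) = x − 1 ≤ 0, whose optimum is x⋆ = 1. The x-update (4) with its quadratic proximal term reduces to a projected stochastic gradient step, and the z-update (5) draws a fresh sample, as the algorithm prescribes:

```python
import numpy as np

rng = np.random.default_rng(3)
K = 20_000
gamma = 0.5 / np.sqrt(K)      # constant step size gamma_k = gamma / sqrt(K), gamma = 0.5

x, z = 0.0, 0.0
xs = []
for k in range(K):
    w = rng.normal(0.0, 1.0)                        # sample omega_k for the x-update
    grad = 2.0 * (x - 2.0) + w + z * 1.0            # stochastic subgradient of the Lagrangian in x
    x = float(np.clip(x - gamma * grad, -5.0, 5.0)) # update (4): projected gradient step on X
    w2 = rng.normal(1.0, 1.0)                       # fresh sample omega_{k+1/2} for the z-update
    z = max(0.0, z + gamma * (x - w2))              # update (5): project the dual onto R_+
    xs.append(x)

x_bar = np.mean(xs)   # with constant steps, the weighted iterate average is the plain average
assert abs(x_bar - 1.0) < 0.2
```

The averaged iterate lands near the constrained optimum x⋆ = 1 rather than the unconstrained minimizer 2, with the dual variable supplying the needed constraint pressure.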


The notation ⟨·, ·⟩ stands for the usual inner product in Euclidean space and ‖ · ‖ denotes the induced ℓ2-norm. The subscripts on ω are suppressed in the sequel for notational convenience. Convergence analysis of Algorithm 1 requires the functions in PE to satisfy the following properties.

Assumption 1. We make the following assumptions:

  • Subgradients of F and G are bounded, i.e., ‖∇F(x)‖ ≤ CF and ‖∇Gi(x)‖ ≤ CiG for each i = 1, . . . ,m and all x ∈ X.

  • ∇fω and ∇giω for i = 1, . . . ,m have bounded variance, i.e., E‖∇fω(x) − E[∇fω(x)]‖² ≤ σ²F and E‖∇giω(x) − E[∇giω(x)]‖² ≤ [σiG]² for all x ∈ X.

  • gω(x) has a bounded second moment, i.e., E‖giω(x)‖² ≤ [DiG]² for all x ∈ X.

  • Problem (1) admits a finite primal-dual optimizer (x⋆, z⋆) ∈ X × Rm+.

The subgradient of F and the variance of its noisy estimate are assumed bounded. Such an assumption is standard in the convergence analysis of unconstrained stochastic subgradient methods. The assumptions regarding G are similar, but we additionally require the second moment of the noisy estimate of G to be bounded over X. Boundedness of G in primal-dual subgradient methods has appeared in prior literature, e.g., in [37, 39]. The second moment remains bounded if giω is uniformly bounded over X and Ω for each i. It is also satisfied if G remains bounded over X and its noisy estimate has a bounded variance. Convergence analyses of unconstrained optimization problems typically assume the existence of a bounded primal solution. We extend that requirement to include the same for a dual solution, one that is satisfied under constraint qualifications such as Slater's condition. Notice that we do not assume any a priori bounds on the dual variable, nor do we use a projection over adaptively computed compact sets.

Our first result provides a bound on the distance to optimality and constraint violation at a weighted average of the iterates generated by the algorithm. Denote by CG, DG, and σG the collections of CiG, DiG, and σiG, respectively. We make use of the following notation:

P1 := 2‖x1 − x⋆‖² + 4‖1 + z⋆‖², P2 := 8(4C²F + σ²F) + 2‖DG‖², P3 := 8m(4‖CG‖² + ‖σG‖²).

Theorem 1 (Convergence result for PE). Suppose Assumption 1 holds. For a positive sequence {γk}Kk=1, if P3 Σ_{k=1}^K γ²k < 1, then the iterates generated by Algorithm 1 satisfy

E[F(x̄K+1)] − F(x⋆) ≤ (1/(4 Σ_{k=1}^K γk)) · (P1 + P2 Σ_{k=1}^K γ²k)/(1 − P3 Σ_{k=1}^K γ²k), (6)

E[Gi(x̄K+1)] ≤ (1/(4 Σ_{k=1}^K γk)) · (P1 + P2 Σ_{k=1}^K γ²k)/(1 − P3 Σ_{k=1}^K γ²k) (7)

for each i = 1, . . . ,m, where x̄K+1 := (Σ_{k=1}^K γk xk+1)/(Σ_{k=1}^K γk). Moreover, if γk = γ/√K for k = 1, . . . ,K with 0 < γ < P3^{−1/2}, then

E[F(x̄K+1)] − F(x⋆) ≤ η/√K, E[Gi(x̄K+1)] ≤ η/√K (8)

for i = 1, . . . ,m, where η := (P1 + P2γ²)/(4γ(1 − P3γ²)).
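The constant η can be optimized over γ ∈ (0, P3^{−1/2}) numerically. In this sketch the constants P1, P2, P3 are assumed placeholder values (they are problem-dependent); we pick the γ minimizing η and then the smallest K meeting a target accuracy ε via η/√K ≤ ε:

```python
import numpy as np

P1, P2, P3 = 10.0, 4.0, 0.25   # placeholder constants; problem-dependent in practice

def eta(g):
    # eta(gamma) = (P1 + P2*gamma^2) / (4*gamma*(1 - P3*gamma^2)), 0 < gamma < P3**-0.5
    return (P1 + P2 * g ** 2) / (4.0 * g * (1.0 - P3 * g ** 2))

grid = np.linspace(1e-4, P3 ** -0.5 - 1e-4, 100_000)
g_star = grid[np.argmin(eta(grid))]
eps = 1e-2
K_needed = int(np.ceil((eta(g_star) / eps) ** 2))  # smallest K with eta/sqrt(K) <= eps
assert eta(g_star) / np.sqrt(K_needed) <= eps
```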


A constant step size of γ/√K over a fixed number K of iterations yields the O(1/√K) decay rate in the expected distance to optimality and constraint violation of Algorithm 1. This is indeed order optimal, as implied by Nesterov's celebrated result in [28, Theorem 3.2.1].

Remark 1. While we present the proof for an i.i.d. sequence of samples, we believe that the result can be extended to the case where the ω's follow a Markov chain with a geometric mixing rate, following the technique in [36]. For such settings, the expectations in the definitions of F, G are computed with respect to the stationary distribution of the chain. The results will then apply to Markov decision processes with applications to stochastic control.

Given that the literature on primal-dual subgradient methods is extensive, it is important for us to relate and distinguish Algorithm 1 and Theorem 1 from prior work. We begin by defining the Lagrangian function for PE as

L(x, z) := F(x) + zᵀG(x) = E[Lω(x, z)]

for x ∈ X, z ∈ Rm+, where Lω(x, z) := fω(x) + zᵀgω(x). Then, Algorithm 1 can be written as

xk+1 := projX[xk − γk∇xLω(xk, zk)],
zk+1 := projRm+[zk + γk∇zLω(xk+1, zk)], (9)

where projA projects its argument on the set A. The vectors ∇xLω and ∇zLω are stochastic subgradients of the Lagrangian function with respect to x and z, respectively. Therefore, Algorithm 1 is a projected stochastic subgradient algorithm that seeks to solve the saddle-point reformulation of PE as min_{x∈X} max_{z∈Rm+} L(x, z). Implicit in our algorithm is the assumption that projection on X is computationally easy. Any functional constraints describing X that make such projection challenging should be included in G.
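The updates in (9) only need the two projections, which are one-liners for common easy sets. A small sketch (the box form of X is our assumed example of an easy-to-project set):

```python
import numpy as np

def proj_box(x, lo, hi):
    # projection onto X = {x : lo <= x <= hi}, coordinate-wise
    return np.clip(x, lo, hi)

def proj_nonneg(z):
    # projection onto R^m_+ for the dual variable
    return np.maximum(z, 0.0)

x = proj_box(np.array([-3.0, 0.5, 7.0]), -1.0, 1.0)
z = proj_nonneg(np.array([-0.2, 0.4]))
assert np.allclose(x, [-1.0, 0.5, 1.0]) and np.allclose(z, [0.0, 0.4])
```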

Closest in spirit to our work on PE are the papers by Baes et al. in [2], Yu et al. in [39], Xu in [37], and Nedić and Ozdaglar in [26]. The stochastic mirror-prox algorithm in [2] and the projected subgradient method in [26] are similar in their updates to ours except in two ways. First, these algorithms in the context of PE update the dual variable zk based on G or its noisy estimate evaluated at xk, while we update it based on the estimate at xk+1. Second, both project the dual variable on a compact subset of Rm+ that contains the optimal set of dual multipliers. While the authors in [2] assume an a priori set to project on, the authors in [26] compute such a set from a point that satisfies Slater's constraint qualification G(x) < 0. Our algorithm does not require such a step as the proof provides an explicit bound on the growth of the dual variable sequence, much in line with Xu's analysis in [37]. Much to our surprise, the minor modification of using a Gauss-Seidel style dual update, as opposed to the popular Jacobi style dual update, obviates the need for crucial assumptions in the literature for the proofs to work. Specifically, the Jacobi style update requires smoothness of the objective and constraints, as well as an explicit a priori bound on the optimal dual multiplier. The latter is an inherent difficulty with existing algorithms, and the removal of this assumption was a primary motivator for this work. Unfortunately, this Gauss-Seidel style updating scheme comes at the additional cost of an extra sample required per iteration of the primal-dual algorithm. However, requiring different samples to update the primal and dual variables points to a possible extension of our result to an asynchronous setting, often useful in engineering applications.

While sharing some parallels, our work has an important difference from that in [37]. Xu considers a collection of deterministic constraint functions, i.e., gω is identical for all ω ∈ Ω, and considers a modified augmented Lagrangian function of the form L(x, z) := F(x) + (1/m) Σ_{i=1}^m ϕδ(x, zi), where

ϕδ(x, zi) := { zi gi(x) + (δ/2)[gi(x)]², if δ gi(x) + zi ≥ 0,
             { −(zi)²/(2δ),              otherwise

for i = 1, . . . ,m, with a suitable time-varying sequence of δ's. His algorithm is similar to Algorithm 1 but performs a randomized coordinate update for the dual variable instead of (5). To the best of our knowledge, Xu's analysis in [37] with such a Lagrangian function does not directly apply to our setting with stochastic constraints, which is crucial for the subsequent analysis of the risk-sensitive problem PCVaR.

Finally, Yu et al.'s work in [39] provides an analysis of the algorithm that updates its dual variables using

zk+1 := argmax_{z∈Rm+} ⟨vk, z − zk⟩ − (1/(2γk)) ‖z − zk‖², (10)

where vik := giωk(xk) + ⟨∇giωk(xk), xk+1 − xk⟩ for i = 1, . . . ,m. In contrast, our z-update in (5) samples ωk+1/2 and sets vk := gωk+1/2(xk+1) at the already computed point xk+1. We are able to recover the O(1/√K) decay rate of suboptimality and constraint violation with a proof technique much closer to the classical analysis of subgradient methods in [8, 26]. Unlike [39], we provide a clean characterization of the constant η in (8) that is crucial to study the growth in sample (and iteration) complexity of Algorithm 1 applied to a reformulation of PCVaR.

2.1 Proof of Theorem 1

The proof proceeds in four steps.

(a) We establish the following dissipation inequality that consecutive iterates of the algorithm satisfy:

γk E[L(xk+1, z) − L(x, zk)] + (1/2) E‖xk+1 − x‖² + (1/2) E‖zk+1 − z‖²
  ≤ (1/2) E‖xk − x‖² + (1/2) E‖zk − z‖² + (1/4) P2 γ²k + (1/4) P3 γ²k E‖zk‖² (11)

for any x ∈ X and z ∈ Rm+.

(b) We utilize the definition of the Lagrangian function to show that

L(x, z⋆) ≥ L(x⋆, z⋆) ≥ L(x⋆, z) (12)

for each x ∈ X and z ∈ Rm+, where (x⋆, z⋆) defines a finite primal-dual optimal solution pair.²

(c) Next, we bound E‖zk‖² generated by our algorithm from above using steps (a) and (b) as

E‖zk‖² ≤ (P1 + P2 AK)/(1 − P3 AK) (13)

for k = 1, . . . ,K, where AK := Σ_{k=1}^K γ²k.

²An optimal primal-dual pair is a saddle point for the Lagrangian function and satisfies (12) for all x ∈ X, z ≥ 0. We explicitly derive (12) from the Karush-Kuhn-Tucker optimality conditions for completeness.


(d) We combine the results in steps (a), (b) and (c) to complete the proof.

Define the filtration W1 ⊂ W1+1/2 ⊂ W2 ⊂ . . ., where Wk is the σ-algebra generated by the samples ω1, . . . , ωk−1/2, for k ranging over multiples of 1/2 starting from unity. Then, x1, z1, . . . ,xk, zk is Wk-measurable, while x1, z1, . . . ,xk, zk,xk+1 is Wk+1/2-measurable.

• Step (a) – Proof of (11): We first utilize the x-update in (4) to prove

E[F(xk+1) − F(x) + zkᵀG(xk+1) − zkᵀG(x) | Wk] + (1/(2γk)) E[‖xk+1 − x‖² | Wk]
  ≤ (1/(2γk)) ‖xk − x‖² + 2γk(4C²F + σ²F) + 2γk m(4‖CG‖² + ‖σG‖²)‖zk‖² (14)

for all x ∈ X. Then, we utilize the z-update in (5) to prove

E[zᵀG(xk+1) − zkᵀG(xk+1) | Wk] + (1/(2γk)) E[‖zk+1 − z‖² | Wk]
  ≤ (1/(2γk)) ‖zk − z‖² + (γk/2) ‖DG‖² (15)

for all z ∈ Rm+. The law of total probability is then applied to the sum of (14) and (15), followed by a multiplication by γk, yielding the desired result in (11).

Proof of (14): The x-update in (4) yields

⟨xk+1 − x, ∇fω(xk) + Σ_{i=1}^m zik∇giω(xk) + (1/γk)(xk+1 − xk)⟩ ≤ 0. (16)

We simplify the inner product in (16) by considering its three summands separately. The inner product with ∇fω(xk) can be expressed as

⟨xk+1 − x, ∇fω(xk)⟩ = ⟨xk+1 − xk, ∇F(xk+1)⟩ + ⟨xk − x, ∇fω(xk)⟩
                      − ⟨xk+1 − xk, ∇F(xk+1) − ∇fω(xk)⟩, (17)

where ∇F(xk+1) denotes a subgradient of F at xk+1 and the first summand on the right hand side (RHS) satisfies ⟨xk+1 − xk, ∇F(xk+1)⟩ ≥ F(xk+1) − F(xk) by the convexity of F. Since E[∇fω(xk)|Wk] ∈ ∂F(xk) from [5], the expectation of the second summand on the RHS of (17) satisfies

E[⟨xk − x, ∇fω(xk)⟩ | Wk] = ⟨xk − x, ∇F(xk)⟩ ≥ F(xk) − F(x). (18)

Taking expectations in (17), the above relations imply

E[⟨xk+1 − x, ∇fω(xk)⟩ | Wk] ≥ E[F(xk+1) − F(x) − ⟨xk+1 − xk, ∇F(xk+1) − ∇fω(xk)⟩ | Wk]. (19)


Next, we bound the inner product with the second summand in (16). To that end, utilize the convexity of the member functions in gω and G along the above lines to infer

Σ_{i=1}^m E[⟨xk+1 − x, zik∇giω(xk)⟩ | Wk]
  ≥ Σ_{i=1}^m E[zikGi(xk+1) − zikGi(x) | Wk] − E[⟨xk+1 − xk, zik∇Gi(xk+1) − zik∇giω(xk)⟩ | Wk]
  = E[zkᵀG(xk+1) − zkᵀG(x) | Wk] − Σ_{i=1}^m E[⟨xk+1 − xk, zik∇Gi(xk+1) − zik∇giω(xk)⟩ | Wk]. (20)

To tackle the inner product with the third summand in (16), we use the identity

⟨xk+1 − x, (1/γk)(xk+1 − xk)⟩ = (1/(2γk)) [‖xk+1 − x‖² − ‖xk − x‖² + ‖xk+1 − xk‖²]. (21)

The inequalities in (19), (20), and the equality in (21) together give

E[F(xk+1) − F(x) + zkᵀG(xk+1) − zkᵀG(x) | Wk]
  − E[⟨xk+1 − xk, ∇F(xk+1) − ∇fω(xk)⟩ | Wk]
  − Σ_{i=1}^m E[⟨xk+1 − xk, zik∇Gi(xk+1) − zik∇giω(xk)⟩ | Wk]
  + (1/(2γk)) E[‖xk+1 − x‖² + ‖xk+1 − xk‖² | Wk]
  ≤ (1/(2γk)) ‖xk − x‖². (22)

To simplify the above relation, apply Young's inequality to obtain

E[⟨xk+1 − xk, ∇F(xk+1) − ∇fω(xk)⟩ | Wk]
  ≤ (1/(4γk)) E[‖xk+1 − xk‖² | Wk] + γk E[‖∇F(xk+1) − ∇fω(xk)‖² | Wk]
  ≤ (1/(4γk)) E[‖xk+1 − xk‖² | Wk] + 2γk E[‖∇F(xk+1) − E∇fω(xk)‖² + ‖E∇fω(xk) − ∇fω(xk)‖² | Wk].

Recall that E[∇fω(xk)|Wk] ∈ ∂F(xk), subgradients of F are bounded, and ∇fω has bounded variance. Therefore, we infer from the above inequality that

E[⟨xk+1 − xk, ∇F(xk+1) − ∇fω(xk)⟩ | Wk] ≤ (1/(4γk)) E[‖xk+1 − xk‖² | Wk] + 2γk(4C²F + σ²F). (23)


Appealing to Young's inequality m times and a similar line of argument as above gives

Σ_{i=1}^m E[⟨xk+1 − xk, zik∇Gi(xk+1) − zik∇giω(xk)⟩ | Wk]
  ≤ (1/(4γk)) E[‖xk+1 − xk‖² | Wk] + 2γk m Σ_{i=1}^m (4[CiG]² + [σiG]²)‖zk‖². (24)

Leveraging the relations in (23) and (24) in (22), we get

E[F(xk+1) − F(x) + zkᵀG(xk+1) − zkᵀG(x) | Wk] + (1/(2γk)) E[‖xk+1 − x‖² + ‖xk+1 − xk‖² | Wk]
  ≤ (1/(2γk)) ‖xk − x‖² + (1/(4γk)) E[‖xk+1 − xk‖² | Wk] + 2γk(4C²F + σ²F)
    + (1/(4γk)) E[‖xk+1 − xk‖² | Wk] + 2γk m Σ_{i=1}^m (4[CiG]² + [σiG]²)‖zk‖²,

which upon simplification gives (14).

Proof of (15): From the z-update in (5), we obtain

⟨zk+1 − z, −gω(xk+1) + (1/γk)(zk+1 − zk)⟩ ≤ 0 (25)

for all z ≥ 0. Again, we deal with the two summands in the second factor of the inner product in (25) separately. The expectation of the inner product with the first summand yields

E[⟨zk+1 − z, −gω(xk+1)⟩ | Wk+1/2]
  = E[⟨zk+1 − zk, −gω(xk+1)⟩ | Wk+1/2] + E[⟨zk − z, −gω(xk+1)⟩ | Wk+1/2]
  ≥ −(1/(2γk)) E[‖zk+1 − zk‖² | Wk+1/2] − (γk/2) E[‖gω(xk+1)‖² | Wk+1/2] + E[⟨zk − z, −gω(xk+1)⟩ | Wk+1/2]
  ≥ −(1/(2γk)) E[‖zk+1 − zk‖² | Wk+1/2] − (γk/2) ‖DG‖² + ⟨zk − z, −G(xk+1)⟩³
  = −(1/(2γk)) E[‖zk+1 − zk‖² | Wk+1/2] − (γk/2) ‖DG‖² + zᵀG(xk+1) − zkᵀG(xk+1). (26)

In the above derivation, we have utilized Young's inequality and the boundedness of the second moment of gω. Since Wk ⊂ Wk+1/2, the law of total probability can be used to condition (26) on Wk rather than on Wk+1/2. To simplify the inner product with the second summand in (25), we use the identity

⟨zk+1 − z, (1/γk)(zk+1 − zk)⟩ = (1/(2γk)) [‖zk+1 − z‖² − ‖zk − z‖² + ‖zk+1 − zk‖²]. (27)

³E[gω(xk+1) | Wk+1/2] = G(xk+1) requires that we sample ω once more for the z-update.


Utilizing (26) and (27) in (25) gives (15). Adding (14) and (15), followed by a multiplication by γk, yields

γk E[L(xk+1, z) − L(x, zk) | Wk] + (1/2) E[‖xk+1 − x‖² + ‖zk+1 − z‖² | Wk]
  ≤ (1/2) ‖xk − x‖² + (1/2) ‖zk − z‖² + (1/4) P2 γ²k + (1/4) P3 γ²k ‖zk‖². (28)

Taking the expectation and applying the law of total probability completes the proof of (11).

• Step (b) – Proof of (12): Denote by NX(x⋆) the normal cone to X at x⋆. From the Karush-Kuhn-Tucker optimality conditions for (1), we infer that there exist subgradients ∇F(x⋆) ∈ ∂F(x⋆), ∇Gi(x⋆) ∈ ∂Gi(x⋆), i = 1, . . . ,m, and n ∈ NX(x⋆) for which

∇F(x⋆) + Σ_{i=1}^m zi⋆∇Gi(x⋆) + n = 0. (29)

For any x ∈ X, we then have

⟨∇F(x⋆), x − x⋆⟩ + Σ_{i=1}^m zi⋆⟨∇Gi(x⋆), x − x⋆⟩ + ⟨n, x − x⋆⟩ = 0,

where the first summand is at most F(x) − F(x⋆), the second is at most Σ_{i=1}^m zi⋆[Gi(x) − Gi(x⋆)], and the third is at most 0.

The inequalities in the above relation follow from the convexity of F and Gi’s, nonnegativity of z?,and the definition of the normal cone. From the above inequalities, we conclude L(x, z?) ≥ L(x?, z?)for all x ∈ X. Furthermore, for any z ≥ 0, we have

L(x?, z?)− L(x?, z) = zᵀ?G(x?)− zᵀG(x?) ≥ 0,

where the last step follows from the fact that z ∈ Rm+, G(x⋆) ≤ 0, and the complementary slackness condition, which implies z⋆ᵀG(x⋆) = 0.
• Step (c) – Proof of (13): Plugging (x, z) = (x⋆, z⋆) in the inequality for the one-step update in (11) and summing it over k = 1, ..., κ for κ ≤ K gives

$$
\sum_{k=1}^{\kappa}\gamma_k\,\underbrace{\mathbb{E}[L(x_{k+1}, z_\star) - L(x_\star, z_k)]}_{\geq 0 \text{ from } (12)} + \tfrac{1}{2}\sum_{k=1}^{\kappa}\left[\mathbb{E}\|x_{k+1} - x_\star\|^2 + \mathbb{E}\|z_{k+1} - z_\star\|^2\right] \leq \tfrac{1}{2}\sum_{k=1}^{\kappa}\left[\mathbb{E}\|x_k - x_\star\|^2 + \mathbb{E}\|z_k - z_\star\|^2\right] + \tfrac{1}{4}P_2\sum_{k=1}^{\kappa}\gamma_k^2 + \tfrac{1}{4}P_3\sum_{k=1}^{\kappa}\gamma_k^2\,\mathbb{E}\|z_k\|^2 \qquad (30)
$$

for κ = 1, . . . ,K. The above then yields

$$
\underbrace{\mathbb{E}\|x_{\kappa+1} - x_\star\|^2}_{\geq 0} + \mathbb{E}\|z_{\kappa+1} - z_\star\|^2 \leq \|x_1 - x_\star\|^2 + \|z_1 - z_\star\|^2 + \tfrac{1}{2}P_2\sum_{k=1}^{\kappa}\gamma_k^2 + \tfrac{1}{2}P_3\sum_{k=1}^{\kappa}\gamma_k^2\,\mathbb{E}\|z_k\|^2. \qquad (31)
$$


Notice that 2E‖zκ+1 − z⋆‖² + 2‖z⋆‖² ≥ E‖zκ+1‖². This inequality and z₁ = 0 in (31) give

$$
\begin{aligned}
\mathbb{E}\|z_{\kappa+1}\|^2 &\leq 2\|x_1 - x_\star\|^2 + 4\|z_\star\|^2 + P_2\sum_{k=1}^{\kappa}\gamma_k^2 + P_3\sum_{k=1}^{\kappa}\gamma_k^2\,\mathbb{E}\|z_k\|^2 \\
&\leq P_1 + P_2\sum_{k=1}^{\kappa}\gamma_k^2 + P_3\sum_{k=1}^{\kappa}\gamma_k^2\,\mathbb{E}\|z_k\|^2.
\end{aligned}
$$

We argue the bound on E‖zk‖² for k = 1, ..., K inductively. Since z₁ = 0, the base case trivially holds. Assume that the bound holds for k = 1, ..., κ for κ < K. With the notation $A_K = \sum_{k=1}^K \gamma_k^2$, the relation above implies

$$
\begin{aligned}
\mathbb{E}\|z_{\kappa+1}\|^2 &\leq P_1 + P_2\sum_{k=1}^{\kappa}\gamma_k^2 + P_3\sum_{k=1}^{\kappa}\gamma_k^2\,\frac{P_1 + P_2 A_K}{1 - P_3 A_K} \\
&\leq P_1 + P_2 A_K + P_3\,\frac{P_1 + P_2 A_K}{1 - P_3 A_K}\,A_K \\
&= \frac{P_1 + P_2 A_K}{1 - P_3 A_K}, \qquad (32)
\end{aligned}
$$

completing the proof of step (c).
• Step (d) – Combining steps (a), (b), (c) to prove Theorem 1: For any z ≥ 0, the inequality in (11) with x = x⋆ from step (a) summed over k = 1, ..., K gives

$$
\sum_{k=1}^{K}\gamma_k\,\mathbb{E}[L(x_{k+1}, z) - L(x_\star, z_k)] + \tfrac{1}{2}\sum_{k=1}^{K}\left[\mathbb{E}\|x_{k+1} - x_\star\|^2 + \mathbb{E}\|z_{k+1} - z\|^2\right] \leq \tfrac{1}{2}\sum_{k=1}^{K}\left[\mathbb{E}\|x_k - x_\star\|^2 + \mathbb{E}\|z_k - z\|^2\right] + \tfrac{1}{4}P_2\sum_{k=1}^{K}\gamma_k^2 + \tfrac{1}{4}P_3\sum_{k=1}^{K}\gamma_k^2\,\mathbb{E}\|z_k\|^2. \qquad (33)
$$

Using z1 = 0 and an appeal to (12) from step (b) yields

$$
\begin{aligned}
\sum_{k=1}^{K}\gamma_k\,\mathbb{E}[L(x_{k+1}, z) - L(x_\star, z_\star)] &+ \tfrac{1}{2}\mathbb{E}\|x_{K+1} - x_\star\|^2 + \tfrac{1}{2}\mathbb{E}\|z_{K+1} - z\|^2 \\
&\leq \tfrac{1}{2}\mathbb{E}\|x_1 - x_\star\|^2 + \tfrac{1}{4}P_2\sum_{k=1}^{K}\gamma_k^2 + \tfrac{1}{4}P_3\sum_{k=1}^{K}\gamma_k^2\,\mathbb{E}\|z_k\|^2 + \tfrac{1}{2}\|z\|^2 \\
&\leq \tfrac{1}{4}P_1 - \|\mathbf{1} + z_\star\|^2 + \tfrac{1}{4}P_2 A_K + \tfrac{1}{4}P_3\sum_{k=1}^{K}\gamma_k^2\,\frac{P_1 + P_2 A_K}{1 - P_3 A_K} + \tfrac{1}{2}\|z\|^2 \\
&= \tfrac{1}{4}P_1 + \tfrac{1}{4}P_2 A_K + \tfrac{1}{4}P_3 A_K\,\frac{P_1 + P_2 A_K}{1 - P_3 A_K} + \tfrac{1}{2}\|z\|^2 - \|\mathbf{1} + z_\star\|^2 \\
&= \tfrac{1}{4}\left(\frac{P_1 + P_2 A_K}{1 - P_3 A_K}\right) + \tfrac{1}{2}\|z\|^2 - \|\mathbf{1} + z_\star\|^2. \qquad (34)
\end{aligned}
$$

In deriving the above inequality, we have utilized the bound on E‖zk‖² from step (c) and the definitions of P₁ and A_K. To further simplify the above inequality, notice that L(x, z) is convex in x and L(x⋆, z⋆) = F(x⋆). Therefore, Jensen's inequality implies

$$
\sum_{k=1}^{K}\gamma_k\,\mathbb{E}[L(x_{k+1}, z) - L(x_\star, z_\star)] \geq \left(\sum_{k=1}^{K}\gamma_k\right)\mathbb{E}[L(\bar{x}_{K+1}, z) - F(x_\star)], \qquad (35)
$$

where recall that x̄K+1 is the γ-weighted average of the iterates. Utilizing (35) in (34), we get
$$
\left(\sum_{k=1}^{K}\gamma_k\right)\mathbb{E}[L(\bar{x}_{K+1}, z) - F(x_\star)] \leq \tfrac{1}{4}\left(\frac{P_1 + P_2 A_K}{1 - P_3 A_K}\right) + \tfrac{1}{2}\|z\|^2 - \|\mathbf{1} + z_\star\|^2. \qquad (36)
$$

The above relation defines a bound on E[L(x̄K+1, z)] for every z ≥ 0. Choosing z = 0 and noting ‖1 + z⋆‖² ≥ 0, we get the bound on expected suboptimality in (6). To derive the bound on expected constraint violation in (7), notice that (12) implies
$$
\mathbb{E}[L(\bar{x}_{K+1}, \mathbf{1}_i + z_\star) - F(x_\star)] = \mathbb{E}[L(\bar{x}_{K+1}, z_\star) - L(x_\star, z_\star)] + \mathbb{E}\left[\mathbf{1}_i^\top G(\bar{x}_{K+1})\right] \geq \mathbb{E}[G_i(\bar{x}_{K+1})],
$$

where 1i ∈ Rm is a vector of all zeros except the i-th entry, which is unity. Choosing z = 1i + z⋆ in (36) together with the above observation gives

$$
\begin{aligned}
\mathbb{E}[G_i(\bar{x}_{K+1})] &\leq \frac{1}{4\sum_{k=1}^{K}\gamma_k}\left(\frac{P_1 + P_2 A_K}{1 - P_3 A_K} + 2\|\mathbf{1}_i + z_\star\|^2 - 4\|\mathbf{1} + z_\star\|^2\right) \\
&\leq \frac{1}{4\sum_{k=1}^{K}\gamma_k}\left(\frac{P_1 + P_2 A_K}{1 - P_3 A_K}\right)
\end{aligned}
$$
for each i = 1, ..., m. This completes the proof of (7). The bounds in (8) are immediate from those in (6)–(7). This completes the proof of Theorem 1.

2.2 Optimal step size selection

We exploit the bounds in Theorem 1 to select a step size that minimizes the iteration count to reach an ε-approximately feasible and optimal solution to PE, and solve⁴

$$
\begin{aligned}
\underset{K,\,\gamma > 0}{\text{minimize}} \quad & K, \\
\text{subject to} \quad & \frac{\eta}{\sqrt{K}} = \frac{P_1 + P_2\gamma^2}{4\gamma\sqrt{K}(1 - P_3\gamma^2)} \leq \varepsilon, \quad P_3\gamma^2 < 1.
\end{aligned} \qquad (37)
$$

The following characterization of optimal step sizes and the resulting iteration count from Proposition 1 will prove useful in studying the growth in iteration complexity in solving PCVaR with the risk-aversion parameters α, β in the following section.

Proposition 1. For any ε > 0, the optimal solution of (37) satisfies
$$
\gamma_\star^2 = \frac{2P_3^{-1}}{2 + y + \sqrt{y^2 + 8y}}, \qquad K_\star = \frac{(P_1 + P_2\gamma_\star^2)^2}{16\gamma_\star^2(1 - P_3\gamma_\star^2)^2\varepsilon^2}, \qquad (38)
$$
where y = 1 + P₂/(P₁P₃).

⁴ The integrality of K is ignored for notational convenience.


Proof. It is evident from (38) that γ⋆² < P₃⁻¹. Then, it suffices to show that γ⋆ from (38) minimizes
$$
\sqrt{K} = \frac{P_1 + P_2\gamma^2}{4\gamma(1 - P_3\gamma^2)\varepsilon}
$$
over γ > 0. To that end, notice that
$$
\frac{d}{d\gamma}\left(\frac{P_1 + P_2\gamma^2}{\gamma(1 - P_3\gamma^2)}\right) = \frac{P_2 P_3\gamma^4 + (P_2 + 3P_1 P_3)\gamma^2 - P_1}{\gamma^2(1 - P_3\gamma^2)^2}.
$$
The above derivative is negative at γ = 0⁺ and vanishes only at γ⋆ over positive values of γ, certifying it as the global minimizer.

Parameter P₁ is generally not known a priori. However, it is often possible to bound it from above. One can calculate γ⋆ and K⋆ using (38), replacing P₁ with its overestimate. Notice that
$$
\frac{dK_\star}{dP_1} := \frac{\partial K_\star}{\partial P_1} + \frac{\partial K_\star}{\partial \gamma_\star}\,\frac{d\gamma_\star}{dy}\,\frac{dy}{dP_1}.
$$
It is straightforward to verify that ∂K⋆/∂P₁ > 0, dy/dP₁ ≤ 0, and ∂γ⋆/∂y ≤ 0, and hence, overestimating P₁ results in a larger γ⋆. Finally, ∂K⋆/∂γ > 0 for γ > γ⋆, implying that K⋆ calculated with an overestimate of P₁ is larger than the optimal iteration count, the computational burden we must bear for not knowing P₁. Our algorithm does require knowledge of P₃, which in turn depends only on the nature of the functions defining the constraints and not on an optimizer.
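The formulas in (38) are straightforward to evaluate. The following Python sketch (the function name and interface are ours, not the paper's) computes γ⋆ and K⋆ from the constants P₁, P₂, P₃ and the tolerance ε:

```python
import math

def optimal_step_size(P1, P2, P3, eps):
    """gamma_* and K_* from (38); P1, P2, P3 are the bound constants, eps the tolerance."""
    y = 1.0 + P2 / (P1 * P3)
    gamma_sq = (2.0 / P3) / (2.0 + y + math.sqrt(y * y + 8.0 * y))
    K = (P1 + P2 * gamma_sq) ** 2 / (16.0 * gamma_sq * (1.0 - P3 * gamma_sq) ** 2 * eps ** 2)
    return math.sqrt(gamma_sq), math.ceil(K)
```

Replacing P₁ with an overestimate only inflates the returned iteration count, in line with the discussion above.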

2.3 Asymptotic almost sure convergence with decaying step-sizes

Subgradient methods are often studied with decaying non-summable square-summable step sizes, for which they converge to an optimizer in the unconstrained setting. The result holds even for distributed variants and for mirror descent methods (see [12]). Establishing convergence of Algorithm 1 to a primal-dual optimizer of PE is much more challenging without assumptions of strong convexity in the objective. With such step sizes, we provide the following result to guarantee the stability of our algorithm, which is reminiscent of [24, Theorem 4].

Proposition 2. Suppose Assumption 1 holds and {γk}∞k=1 is a non-summable square-summable nonnegative sequence, i.e., $\sum_{k=1}^\infty \gamma_k = \infty$ and $\sum_{k=1}^\infty \gamma_k^2 < \infty$. Then, (xk, zk) generated by Algorithm 1 remains bounded and lim_{k→∞} L(xk, z⋆) − L(x⋆, zk) = 0 almost surely.

This ‘gap’ function L(x, z⋆) − L(x⋆, z) looks notoriously similar to the duality gap at (x, z), but is not the same. We are unaware of any results on asymptotic almost sure convergence of primal-dual first-order algorithms to an optimizer for constrained convex programs with convex, but not necessarily strongly convex, objectives. A recent result in [38] establishes such a convergence in primal-dual dynamics in continuous time; our attempts at leveraging discretizations of the same have so far proven unsuccessful.

The proof of Proposition 2 relies on the well-studied almost supermartingale convergence result by Robbins and Siegmund in [30, Theorem 1].


Theorem (Convergence of almost supermartingales). Let mk, nk, rk, sk be Fk-measurable finite nonnegative random variables, where F₁ ⊆ F₂ ⊆ ... describes a filtration. If $\sum_{k=1}^\infty s_k < \infty$, $\sum_{k=1}^\infty r_k < \infty$, and
$$
\mathbb{E}[m_{k+1} \mid \mathcal{F}_k] \leq m_k(1 + s_k) + r_k - n_k, \qquad (39)
$$
then lim_{k→∞} mk exists and is finite and $\sum_{k=1}^\infty n_k < \infty$ almost surely.

Proof of Proposition 2. Using notation from the proof of Theorem 1, (14) and (15) together yield
$$
\gamma_k\mathbb{E}[L(x_{k+1}, z_\star) - L(x_\star, z_k) \mid \mathcal{W}_k] + \tfrac{1}{2}\mathbb{E}\left[\|x_{k+1} - x_\star\|^2 + \|z_{k+1} - z_\star\|^2 \mid \mathcal{W}_k\right] \leq \tfrac{1}{2}\left[\|x_k - x_\star\|^2 + \|z_k - z_\star\|^2\right] + \tfrac{1}{4}P_2\gamma_k^2 + \tfrac{1}{4}P_3\gamma_k^2\|z_k\|^2. \qquad (40)
$$

We utilize the above to derive a similar inequality replacing E[L(xk+1, z⋆) | Wk] with L(xk, z⋆) by bounding the difference between them. Then, we apply the almost supermartingale convergence theorem to the result to conclude the proof. To bound said difference, the convexity of L in x and Young's inequality together imply

$$
\begin{aligned}
L(x_k, z_\star) - \mathbb{E}[L(x_{k+1}, z_\star) \mid \mathcal{W}_k] &\leq \mathbb{E}[\langle \nabla_x L(x_{k+1}, z_\star), x_k - x_{k+1}\rangle \mid \mathcal{W}_k] \\
&\leq \tfrac{\gamma_k}{2}\mathbb{E}[\|\nabla_x L(x_{k+1}, z_\star)\|^2 \mid \mathcal{W}_k] + \tfrac{1}{2\gamma_k}\mathbb{E}[\|x_k - x_{k+1}\|^2 \mid \mathcal{W}_k], \qquad (41)
\end{aligned}
$$

where ∇xL denotes a subgradient of L with respect to x. To further bound the RHS of (41), Assumption 1 allows us to deduce

$$
\|\nabla_x L(x, z_\star)\|^2 \leq 2\|\nabla F(x)\|^2 + 2m\sum_{i=1}^m\|z_\star^i \nabla G_i(x)\|^2 \leq 2C_F^2 + 2m\|z_\star\|^2\|C_G\|^2 := 2Q_1 \qquad (42)
$$

for any x ∈ X. Furthermore, the x-update in (9) and the non-expansive nature of the projection operator yield

$$
\frac{1}{\gamma_k^2}\mathbb{E}[\|x_{k+1} - x_k\|^2 \mid \mathcal{W}_k] \leq \mathbb{E}\left[\left\|\nabla f_\omega(x_k) + \sum_{i=1}^m z_k^i\nabla g_\omega^i(x_k)\right\|^2 \,\Big|\, \mathcal{W}_k\right] \leq 2\mathbb{E}[\|\nabla f_\omega(x_k)\|^2 \mid \mathcal{W}_k] + 2m\sum_{i=1}^m\mathbb{E}\left[(z_k^i)^2\|\nabla g_\omega^i(x_k)\|^2 \mid \mathcal{W}_k\right]. \qquad (43)
$$

From Assumption 1, we get
$$
\mathbb{E}[\|\nabla f_\omega(x_k)\|^2 \mid \mathcal{W}_k] \leq 2\mathbb{E}\left[\|\nabla f_\omega(x_k) - \mathbb{E}\nabla f_\omega(x_k)\|^2 + \|\mathbb{E}\nabla f_\omega(x_k)\|^2 \mid \mathcal{W}_k\right] \leq 2\sigma_F^2 + 2C_F^2,
$$


and along similar lines
$$
\sum_{i=1}^m\mathbb{E}\left[(z_k^i)^2\|\nabla g_\omega^i(x_k)\|^2 \mid \mathcal{W}_k\right] \leq 2(\|\sigma_G\|^2 + \|C_G\|^2)\|z_k\|^2,
$$

that together in (43) yield
$$
\frac{1}{\gamma_k^2}\mathbb{E}[\|x_{k+1} - x_k\|^2 \mid \mathcal{W}_k] \leq \underbrace{4(\sigma_F^2 + C_F^2)}_{:= 2Q_2} + \underbrace{4m(\|\sigma_G\|^2 + \|C_G\|^2)}_{:= 2Q_3}\|z_k\|^2.
$$

Combining the above with (42) in (41) gives
$$
\gamma_k\left(L(x_k, z_\star) - \mathbb{E}[L(x_{k+1}, z_\star) \mid \mathcal{W}_k]\right) \leq \gamma_k^2(Q_1 + Q_2 + Q_3\|z_k\|^2). \qquad (44)
$$

Adding (44) to (40) and simplifying, we obtain
$$
\tfrac{1}{2}\mathbb{E}\left[\|x_{k+1} - x_\star\|^2 + \|z_{k+1} - z_\star\|^2 \mid \mathcal{W}_k\right] \leq \tfrac{1}{2}\left[\|x_k - x_\star\|^2 + \|z_k - z_\star\|^2\right] - \gamma_k[L(x_k, z_\star) - L(x_\star, z_k)] + \gamma_k^2\left(\tfrac{1}{4}P_2 + Q_1 + Q_2\right) + \gamma_k^2\left(\tfrac{1}{4}P_3 + Q_3\right)\|z_k\|^2. \qquad (45)
$$

The above inequality with
$$
\|z_k\|^2 \leq 2\|x_k - x_\star\|^2 + 2\|z_k - z_\star\|^2 + 2\|z_\star\|^2
$$
becomes (39), where

$$
\begin{aligned}
m_k &= \tfrac{1}{2}\mathbb{E}\|x_k - x_\star\|^2 + \tfrac{1}{2}\mathbb{E}\|z_k - z_\star\|^2, & n_k &= \gamma_k[L(x_k, z_\star) - L(x_\star, z_k)], \\
r_k &= \gamma_k^2\left[\tfrac{1}{4}P_2 + Q_1 + Q_2 + \left(\tfrac{1}{2}P_3 + 2Q_3\right)\|z_\star\|^2\right], & s_k &= \gamma_k^2\left(\tfrac{1}{2}P_3 + 2Q_3\right).
\end{aligned}
$$

Each term is nonnegative, owing to (12), and γ defines a square-summable sequence. Applying [30, Theorem 1], mk converges to a constant and $\sum_{k=1}^\infty n_k < \infty$. The latter, combined with the non-summability of γ, implies the result.

3 Algorithm for PCVaR and its analysis

We now devote our attention to solving PCVaR via a primal-dual algorithm. To do so, we reformulate it as an instance of PE and utilize Algorithm 1 to solve that reformulation under a stronger set of assumptions given below.

Assumption 2. We make the following assumptions:

• Subgradients of F and G are bounded, i.e., ‖∇fω(x)‖ ≤ CF and ‖∇g^i_ω(x)‖ ≤ C^i_G almost surely for all x ∈ X.

• gω(x) is bounded, i.e., |g^i_ω(x)| ≤ D^i_G for all x ∈ X, almost surely.

• Problem (47) admits a finite primal-dual optimizer (x⋆, z⋆) ∈ X × Rm+.

Using the variational characterization (2) of CVaR, rewrite PCVaR as
$$
\begin{aligned}
\underset{x \in X}{\text{minimize}} \quad & \min_{u^0 \in \mathbb{R}} \mathbb{E}[\psi^f_\omega(x, u^0; \alpha)], \\
\text{subject to} \quad & \min_{u^i \in \mathbb{R}} \mathbb{E}[\psi^{g^i}_\omega(x, u^i; \beta_i)] \leq 0, \quad i = 1, \ldots, m,
\end{aligned} \qquad (46)
$$

where $\psi^h_\omega(x, u; \delta) := u + \tfrac{1}{1-\delta}[h_\omega(x) - u]_+$ for any collection of convex functions hω : Rn → R, ω ∈ Ω. Notice that the CVaR of any random variable can only vary between the mean and the maximum of that variable. Coupled with Assumption 2, we are then able to bound |u^i| ≤ D^i_G for each i = 1, ..., m, which allows us to rewrite PCVaR as

$$
\mathcal{P}_{E'}: \quad
\begin{aligned}
\underset{x \in X,\, u^0 \in \mathbb{R},\, |u| \leq D_G}{\text{minimize}} \quad & \mathbb{E}[\psi^f_\omega(x, u^0; \alpha)], \\
\text{subject to} \quad & \mathbb{E}[\psi^{g^i}_\omega(x, u^i; \beta_i)] \leq 0, \quad \text{for each } i = 1, \ldots, m,
\end{aligned} \qquad (47)
$$
where | · | denotes the element-wise absolute value. Call the optimal value of PCVaR (and PE′) p⋆CVaR in the sequel.
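The variational characterization invoked above is easy to sanity-check numerically. The sketch below is our own illustration (function names are not from the paper): it minimizes the sample average of ψ over u on a grid and compares the result with a direct tail average of the same sample.

```python
import numpy as np

def cvar_variational(samples, delta, grid):
    """min over u of u + E[y - u]_+ / (1 - delta), the objective psi averaged over samples."""
    return min(u + np.maximum(samples - u, 0.0).mean() / (1.0 - delta) for u in grid)

def cvar_tail(samples, delta):
    """Average of the worst (1 - delta) fraction of the samples."""
    y = np.sort(samples)
    return y[int(np.ceil(delta * len(y))):].mean()

rng = np.random.default_rng(1)
y = rng.normal(size=50_000)
estimate = cvar_variational(y, 0.9, np.linspace(-3.0, 3.0, 601))
```

Both estimates agree up to Monte Carlo and grid error; for a standard normal, CVaR at level 0.9 is approximately 1.75.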

Theorem 2 (Convergence result for PCVaR). Suppose Assumption 2 holds. The iterates generated by Algorithm 1 on PE′ for PCVaR with parameters α, β satisfy
$$
\mathbb{E}[\mathrm{CVaR}_\alpha(f_\omega(\bar{x}_{K+1}))] - p^\star_{\mathrm{CVaR}} \leq \frac{\eta(\alpha, \beta)}{\sqrt{K}}, \qquad (48)
$$
$$
\mathbb{E}[\mathrm{CVaR}_{\beta_i}(g^i_\omega(\bar{x}_{K+1}))] \leq \frac{\eta(\alpha, \beta)}{\sqrt{K}} \qquad (49)
$$
for i = 1, ..., m with step sizes γk = γ/√K for k = 1, ..., K with 0 < γ < P₃^{−1/2}(α, β), where η(α, β) := (P₁ + γ²P₂(α, β))/(4γ(1 − γ²P₃(α, β))) and
$$
\begin{aligned}
P_2(\alpha, \beta) &:= \frac{16(C_F^2 + 1)}{(1 - \alpha)^2} + 2\left\|\mathrm{diag}(\mathbf{1} + \beta)\,\mathrm{diag}(\mathbf{1} - \beta)^{-1} D_G\right\|^2, \\
P_3(\alpha, \beta) &:= 16m\left\|\begin{pmatrix}\mathrm{diag}(\mathbf{1} - \beta)^{-1} C_G \\ \mathrm{diag}(\mathbf{1} - \beta)^{-1}\mathbf{1}\end{pmatrix}\right\|^2.
\end{aligned} \qquad (50)
$$

Proof. We prove the result in the following steps.

(a) Under Assumption 2, we revise P₂ and P₃ in Theorem 1 for PE.

(b) We show that if fω, gω satisfy Assumption 2, then ψ^f_ω and ψ^{g^i}_ω, i = 1, ..., m, satisfy Assumption 2, but with different bounds on the gradients and function values. Leveraging these bounds, we obtain P₂(α, β) and P₃(α, β) for PE′ using step (a).

(c) We apply Theorem 1 with P₂(α, β) and P₃(α, β) from step (b) on PE′ to derive the bounds in (48) and (49).

• Step (a) – Revising Theorem 1 with Assumption 2: Recall that in the derivation of (23) in the proof of Theorem 1, Assumption 1 yields
$$
\|\nabla F(x_{k+1}) - \nabla f_\omega(x_k)\|^2 \leq 2(4C_F^2 + \sigma_F^2).
$$
Assumption 2 allows us to bound the same by 4C_F², yielding P₂ = 16C_F² + 2‖D_G‖². Along the same lines, we get P₃ = 16m‖C_G‖².

• Step (b) – Deriving properties of ψω: Consider the stochastic subgradient of ψ^f_ω(x, u; α) given by
$$
\nabla \psi^f_\omega(x, u; \alpha) = \begin{pmatrix}\frac{1}{1-\alpha}\nabla f_\omega(x)\,\mathbb{I}_{f_\omega(x) \geq u} \\[2pt] 1 - \frac{1}{1-\alpha}\mathbb{I}_{f_\omega(x) \geq u}\end{pmatrix}, \qquad (51)
$$

where I_· denotes the indicator function of the event in the subscript. Recall that ‖∇fω(x)‖ ≤ CF for all x ∈ X almost surely. Therefore, we have
$$
\|\nabla \psi^f_\omega(x, u; \alpha)\|^2 = \left\|\tfrac{1}{1-\alpha}\nabla f_\omega(x)\,\mathbb{I}_{f_\omega(x) \geq u}\right\|^2 + \left|1 - \tfrac{1}{1-\alpha}\mathbb{I}_{f_\omega(x) \geq u}\right|^2 \leq \frac{C_F^2 + 1}{(1 - \alpha)^2}. \qquad (52)
$$

Proceeding similarly, we obtain
$$
\left\|\nabla \psi^{g^i}_\omega(x, u^i; \beta_i)\right\|^2 \leq \frac{[C_G^i]^2 + 1}{(1 - \beta_i)^2}. \qquad (53)
$$

We also have
$$
|\psi^{g^i}_\omega(x, u^i; \beta_i)| = \left|\max\left\{\frac{g^i_\omega(x) - \beta_i u^i}{1 - \beta_i},\; u^i\right\}\right| \leq \frac{1 + \beta_i}{1 - \beta_i}D_G^i. \qquad (54)
$$
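The subgradient in (51) is simple to implement. In this sketch (our notation; a scalar decision for brevity), `grad_h` and `h_val` are a sampled subgradient and function value of hω at x:

```python
def grad_psi(grad_h, h_val, u, delta):
    """Stochastic subgradient of psi(x, u; delta) = u + [h(x) - u]_+ / (1 - delta), cf. (51)."""
    tail = 1.0 if h_val >= u else 0.0      # indicator of the event {h(x) >= u}
    gx = grad_h * tail / (1.0 - delta)     # component along x
    gu = 1.0 - tail / (1.0 - delta)        # component along u
    return gx, gu
```

On the tail event the x-component is scaled by 1/(1 − δ); in either case the squared norm stays below (C² + 1)/(1 − δ)², matching (52).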

Then, (50) follows from step (a) using (52), (53), and (54).
• Step (c) – Proof of (48) and (49): Applying Theorem 1 with revised P₂ and P₃ from step (b) to PE′, for which x0, ..., xK+1 and u0_0, ..., u0_{K+1} are WK+1/2-measurable, we obtain
$$
\begin{aligned}
\mathbb{E}[\mathrm{CVaR}_\alpha(f_\omega(\bar{x}_{K+1}))] &= \mathbb{E}\left[\min_{u^0 \in \mathbb{R}}\mathbb{E}[\psi^f_\omega(\bar{x}_{K+1}, u^0; \alpha) \mid \mathcal{W}_{K+1/2}]\right] \\
&\leq \mathbb{E}\left[\mathbb{E}[\psi^f_\omega(\bar{x}_{K+1}, \bar{u}^0_{K+1}; \alpha) \mid \mathcal{W}_{K+1/2}]\right] \\
&= \mathbb{E}\left[\psi^f_\omega(\bar{x}_{K+1}, \bar{u}^0_{K+1}; \alpha)\right] \leq p^\star_{\mathrm{CVaR}} + \frac{\eta(\alpha, \beta)}{\sqrt{K}}. \qquad (55)
\end{aligned}
$$

Following a similar argument for i = 1, ..., m, we get
$$
\mathbb{E}\left[\mathrm{CVaR}_{\beta_i}(g^i_\omega(\bar{x}_{K+1}))\right] = \mathbb{E}\left[\min_{u^i \in \mathbb{R}}\mathbb{E}[\psi^{g^i}_\omega(\bar{x}_{K+1}, u^i; \beta_i) \mid \mathcal{W}_{K+1/2}]\right] \leq \frac{\eta(\alpha, \beta)}{\sqrt{K}}, \qquad (56)
$$
completing the proof.


Our proof architecture generalizes to problems with other risk measures, as long as that measure preserves convexity of fω, gω, admits a variational characterization as in (2), and a subgradient for this modified objective can be easily computed and remains bounded over X. We restrict our attention to CVaR to keep the exposition concrete.

CVaR of a random variable depends on the tail of its distribution. The higher the risk aversion, the further into the tail one needs to look, generally requiring more samples. As opposed to sample average approximation (SAA) algorithms, we never quite compute or estimate CVaR[fω(x)] or CVaR[gω(x)] for any given decision x. Yet, our analysis provides guarantees on the same at x̄K+1. It is natural to expect our sample complexity to grow with risk aversion, as the following result confirms. This is not surprising, given that few samples simply cannot capture tail behavior.

Proposition 3. Suppose Assumption 2 holds. For an ε-approximately feasible and optimal solution of PCVaR with risk-aversion parameters α, β using Algorithm 1 on PE′, the optimal step size γ⋆(α, β) and the optimal number of iterations K⋆(α, β) from Proposition 1 increase with α and β.

Proof. We borrow the notation from Proposition 1 and tackle the variation with α and β separately.

• Variation with α: P₂ increases with α, implying γ⋆ decreases with α because dγ⋆²/dy ≤ 0 and dy/dP₂ ≥ 0. Furthermore, using ∂K⋆/∂γ⋆ < 0 for γ < γ⋆ and ∂K⋆/∂P₂ ≥ 0 in
$$
\frac{dK_\star}{dP_2} = \frac{\partial K_\star}{\partial P_2} + \frac{\partial K_\star}{\partial \gamma_\star}\,\frac{d\gamma_\star}{dP_2},
$$
we infer that K⋆ increases with α.
• Variation with βi: Both P₂ and P₃ increase with βi and
$$
\frac{d\gamma_\star^2}{d\beta_i} = \frac{\partial \gamma_\star^2}{\partial P_2}\,\frac{dP_2}{d\beta_i} + \frac{\partial \gamma_\star^2}{\partial P_3}\,\frac{dP_3}{d\beta_i}. \qquad (57)
$$

Following an argument similar to that for the variation with α, the first term on the RHS of the above equation can be shown to be nonpositive. Next, we show that the second term is nonpositive to conclude that γ⋆ decreases with βi, where we use dP₃/dβi ≥ 0. Utilizing P₂/(P₁P₃) = y − 1, we infer
$$
\begin{aligned}
\frac{\partial \gamma_\star^2}{\partial P_3} &= -\frac{2}{P_3^2\left(2 + y + \sqrt{y^2 + 8y}\right)} + \frac{\partial \gamma_\star^2}{\partial y}\,\frac{\partial y}{\partial P_3} \\
&= -\frac{2}{P_3^2\left(2 + y + \sqrt{y^2 + 8y}\right)} + 2\,\frac{4 + y + \sqrt{y^2 + 8y}}{P_3\sqrt{y^2 + 8y}\left(2 + y + \sqrt{y^2 + 8y}\right)^2}\,\frac{P_2}{P_1 P_3^2} \\
&= -2\,\frac{5y + 4 + 3\sqrt{y^2 + 8y}}{P_3^2\sqrt{y^2 + 8y}\left(2 + y + \sqrt{y^2 + 8y}\right)^2} \leq 0.
\end{aligned}
$$

To characterize the variation of K⋆, notice that
$$
\frac{dK_\star}{d\beta_i} = \frac{\partial K_\star}{\partial P_2}\,\frac{\partial P_2}{\partial \beta_i} + \frac{\partial K_\star}{\partial P_3}\,\frac{\partial P_3}{\partial \beta_i}.
$$
Again, the first term on the RHS of the above relation is nonnegative, owing to an argument similar to that used for the variation of K⋆ with α. We show dK⋆/dP₃ ≥ 0 to conclude the proof. Treating K⋆ as a function of P₃ and γ⋆, we obtain
$$
\frac{dK_\star}{dP_3} = \frac{\partial K_\star}{\partial P_3} + \frac{\partial K_\star}{\partial \gamma_\star}\,\frac{\partial \gamma_\star}{\partial P_3}.
$$
It is straightforward to verify that the first summand is nonnegative. We have already argued that γ⋆ decreases with P₃, and ∂K⋆/∂γ < 0 for γ < γ⋆, implying that the second summand is nonnegative as well, completing the proof.

It is straightforward to compute K⋆(α, β) and γ⋆(α, β) from Proposition 1. Instead of providing the formula, we illustrate the same in Figure 1. To derive an additional insight, fix β and drive α towards unity. Then, we have
$$
P_2(\alpha, \beta) \sim (1 - \alpha)^{-2}, \qquad \gamma_\star(\alpha, \beta) \sim (1 - \alpha), \qquad K_\star(\alpha, \beta) \sim \frac{1}{\varepsilon^2(1 - \alpha)^2}.
$$
With α approaching unity, PCVaR is attempting to solve a robust optimization problem via sampling. Not surprisingly, sample complexity exhibits unbounded growth with robustness requirements, since we do not assume Ω to be finite. Also, this growth matches that of solving the SAA problem within ε-tolerance on the unconstrained problem to minimize $\hat{F}(x) := \tfrac{1}{K}\sum_{j=1}^K \psi^f_{\omega_j}(x, u; \alpha)$. To see this, apply Theorem 1 on $\hat{F}$ with the optimized step size from Proposition 1, where P₂ ∼ ‖∇F̂(x)‖² ∼ (1 − α)⁻² and P₃ = 1.

Figure 1: Plot of (a) the optimal step size γ⋆(α, β), and (b) the number of iterations K⋆(α, β) required to achieve a tolerance of ε = 10⁻³ with m = 1 constraint and CF = CG = DG = P₁ = 1.
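The trends in Figure 1 can be reproduced directly from (38) and (50). Below is a sketch under the caption's constants (CF = CG = DG = P₁ = 1, m = 1); the function names are ours:

```python
import math

CF = CG = DG = P1 = 1.0

def constants(alpha, beta):
    """P2 and P3 from (50), specialized to m = 1."""
    P2 = 16.0 * (CF ** 2 + 1.0) / (1.0 - alpha) ** 2 \
        + 2.0 * ((1.0 + beta) / (1.0 - beta) * DG) ** 2
    P3 = 16.0 * (CG ** 2 + 1.0) / (1.0 - beta) ** 2
    return P2, P3

def optimal_pair(alpha, beta, eps=1e-3):
    """gamma_* and K_* from (38) for the reformulated problem."""
    P2, P3 = constants(alpha, beta)
    y = 1.0 + P2 / (P1 * P3)
    g2 = (2.0 / P3) / (2.0 + y + math.sqrt(y * y + 8.0 * y))
    K = (P1 + P2 * g2) ** 2 / (16.0 * g2 * (1.0 - P3 * g2) ** 2 * eps ** 2)
    return math.sqrt(g2), K
```

Sweeping either parameter confirms Proposition 3: γ⋆ shrinks and K⋆ grows with α and with β.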

Parallelization can lead to stronger bounds. More precisely, run stochastic approximation in parallel on N machines, each with K samples, and compute $\langle x\rangle_{K+1} := \tfrac{1}{N}\sum_{j=1}^N \bar{x}_{K+1}[j]$ using x̄K+1[1], ..., x̄K+1[N] obtained from the N separate runs. Then, we have

$$
\begin{aligned}
\Pr\left\{G_i(\langle x\rangle_{K+1}) \geq (1 + \tau)\eta(\alpha, \beta)/\sqrt{K}\right\} &\leq \Pr\left\{\frac{1}{N}\sum_{j=1}^N \mathrm{CVaR}_{\beta_i}\left[g^i_\omega(\bar{x}_{K+1}[j])\right] \geq (1 + \tau)\frac{\eta(\alpha, \beta)}{\sqrt{K}}\right\} \\
&\leq \exp\left(-\frac{N\tau^2\eta^2(\alpha, \beta)}{K[D_G^i]^2}\right) \qquad (58)
\end{aligned}
$$


for i = 1, ..., m and τ > 0. The steps combine the coherence of CVaR, the convexity and uniform boundedness of g^i_ω, Hoeffding's inequality, and Theorem 2. A similar bound can be derived for suboptimality. Thus, parallelized stochastic approximation produces a result whose O(1/√K)-violation occurs with a probability that decays exponentially with the degree of parallelization N.
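The exponential tail in (58) is a one-line computation; this hypothetical helper (names ours) evaluates the right-hand side:

```python
import math

def violation_bound(N, tau, eta, K, DG_i):
    """RHS of (58): bound on the probability that the averaged iterate violates
    the i-th CVaR constraint by more than a factor (1 + tau) of eta / sqrt(K)."""
    return math.exp(-N * tau ** 2 * eta ** 2 / (K * DG_i ** 2))
```

Doubling the number of parallel runs squares the bound, i.e., the violation probability decays exponentially in N.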

The bound in (58) reveals an interesting connection with results for chance-constrained programs. To describe the link, notice that CVaRδ[yω] ≤ 0 implies Pr{yω ≤ 0} ≥ 1 − δ for any random variable yω and δ ∈ [0, 1). Therefore, (58) implies
$$
\Pr\left\{\text{the constraint } \Pr\left\{g^i_\omega(\langle x\rangle_{K+1}) \leq C/\sqrt{K}\right\} \geq 1 - \beta_i \text{ is violated}\right\} \leq \exp(-C'/K) \leq \nu
$$
for constants C, C′. Said differently, our stochastic approximation algorithm requires O(log(1/ν)) samples to produce a solution that satisfies an O(1/√(log(1/ν)))-approximate chance-constraint with a violation probability bounded by ν. This result bears a striking similarity to that derived in [10], where the authors deterministically enforce O(log(1/ν)) sampled constraints to produce a solution that satisfies the exact chance-constraint Pr{g^i_ω(x) ≤ 0} ≥ 1 − βi with a violation probability bounded by ν. This resemblance in order-wise sample complexity is intriguing, given the significant differences between the algorithms.

Example 3.1. Consider the problem
$$
\underset{-1 \leq x \leq 1}{\text{minimize}} \quad \mathrm{CVaR}_\alpha\left[\tfrac{1}{2}(x - \omega^2 + 0.5)^2\right], \qquad \text{subject to} \quad \mathrm{CVaR}_\beta[x + \omega] \leq 0. \qquad (59)
$$
Let ω = 2ξ − 1 ∈ [−1, 1], where ξ ∼ beta(2, 2), for which we have
$$
C_F = 2.5, \qquad C_G = 1, \qquad D_G = 2, \qquad P_2 = \frac{116}{(1 - \alpha)^2} + 8\left(\frac{1 + \beta}{1 - \beta}\right)^2, \qquad P_3 = \frac{32}{(1 - \beta)^2}.
$$
While we cannot determine z⋆ a priori, our analysis indicates that we can assume |z⋆| ≤ 2 which, taking x0 = 0, further implies P₁ = 10.
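A runnable sketch of the primal-dual update on (59) is given below. It is our own illustration rather than the paper's implementation: the constant step size is heuristic rather than the optimized γ⋆, the iteration budget is kept small, and all variable names are ours. The dual update draws a fresh sample, as the analysis requires.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.6, 0.4
K = 100_000
gamma = 0.05 / np.sqrt(K)   # heuristic constant step size

def sample_omega():
    return 2.0 * rng.beta(2.0, 2.0) - 1.0   # omega = 2*xi - 1 with xi ~ beta(2, 2)

x, u0, u1, z = 0.0, 0.0, 0.0, 0.0
num, den = 0.0, 0.0
for _ in range(K):
    w = sample_omega()
    f_val = 0.5 * (x - w ** 2 + 0.5) ** 2
    on_f = 1.0 if f_val >= u0 else 0.0      # objective tail event
    on_g = 1.0 if x + w >= u1 else 0.0      # constraint tail event
    # stochastic subgradients of psi^f(x, u0; alpha) and z * psi^g(x, u1; beta), cf. (51)
    dx = (x - w ** 2 + 0.5) * on_f / (1 - alpha) + z * on_g / (1 - beta)
    du0 = 1.0 - on_f / (1 - alpha)
    du1 = z * (1.0 - on_g / (1 - beta))
    # primal updates; project x onto [-1, 1] and u1 onto [-D_G, D_G] with D_G = 2
    x = min(max(x - gamma * dx, -1.0), 1.0)
    u0 -= gamma * du0
    u1 = min(max(u1 - gamma * du1, -2.0), 2.0)
    # dual ascent on a fresh sample, projected onto the nonnegative reals
    psi_g = u1 + max(x + sample_omega() - u1, 0.0) / (1 - beta)
    z = max(z + gamma * psi_g, 0.0)
    num, den = num + gamma * x, den + gamma

x_bar = num / den   # gamma-weighted average iterate
```

Since CVaR of ω at level β = 0.4 is positive, the constraint drives the averaged iterate to a negative value of x.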

Figure 2: Plot of (a) iteration count K⋆(α, β) as a function of α and β, and (b) convergence of the iterates, function value, and constraint violation (|F(xk) − F(x⋆)|, |G(xk) − G(x⋆)|, and |xk − x⋆| on a logarithmic scale) to their respective optima for α = 0.6, β = 0.4.


As Figure 2a demonstrates, iteration complexity grows more rapidly with an increase in α than in β. This is due to the coefficient of (1 − α)⁻² in P₂ being 3.625 times larger than that of (1 − β)⁻² in P₃. P₃, however, grows linearly in the number of constraints. As a result, risk-sensitive constraint satisfaction typically bears a comparatively higher computational burden than objective minimization.

The optimal iteration count is quite high even for this simple example. This is the downside of optimizing upper bounds to compute that count for subgradient methods that are inherently slow. The iterate in Figure 2b seems to stabilize much earlier, however, even though theoretical guarantees at low iteration counts remain elusive. Careful design of a termination criterion will be useful for practical applications.

4 Conclusions and future work

In this paper, we study a stochastic approximation algorithm for CVaR-sensitive optimization problems. Such problems are remarkably rich in their modeling power and encompass a plethora of stochastic programming problems with broad applications. We study a primal-dual algorithm to solve that problem that processes samples in an online fashion, i.e., obtains samples and updates decision variables in each iteration. Such algorithms are useful when sampling is easy and intermediate approximate solutions, albeit inexact, are useful. The convergence analysis allows us to optimize the number of iterations required to reach a solution within a prescribed tolerance on expected suboptimality and constraint violation. The sample and iteration complexity predictably grows with risk aversion. Our work affirms that a modeler must not only consider the attitude towards risk but also the computational burdens of risk in deciding the problem formulation.

Two possible extensions are of immediate interest. First, primal-dual algorithms find applications in multi-agent distributed optimization problems over a (possibly time-varying) communication network. We plan to extend our results to solve distributed risk-sensitive convex optimization problems over networks, borrowing techniques from [25, 13]. Second, the relationship to sample complexity for chance-constrained programs in [10] encourages us to pursue a possible exploration of stochastic approximation for such optimization problems.

5 Acknowledgements

We would like to acknowledge Eilyan Bitar and Rayadurgam Srikant for useful discussions.

References

[1] Amir Ahmadi-Javid. Entropic value-at-risk: A new coherent risk measure. Journal of Optimization Theory and Applications, 155(3):1105–1123, 2012.

[2] Michel Baes, Michael Burgisser, and Arkadi Nemirovski. A randomized mirror-prox method for solving structured large-scale matrix saddle-point problems. SIAM Journal on Optimization, 23(2):934–962, 2013.


[3] Amrit Singh Bedi, Alec Koppel, and Ketan Rajawat. Nonparametric compositional stochastic optimization, 2019.

[4] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust Optimization, volume 28. Princeton University Press, 2009.

[5] Dimitri P Bertsekas. Stochastic optimization problems with nondifferentiable cost functionals. Journal of Optimization Theory and Applications, 12(2):218–231, 1973.

[6] D Boob, Q Deng, and G Lan. Stochastic first-order methods for convex and nonconvex functional constrained optimization. arXiv preprint arXiv:1908.02734, 2019.

[7] Vivek S Borkar and Sean P Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 2000.

[8] Stephen Boyd and Almir Mutapcic. Subgradient methods. Lecture notes of EE364b, Stanford University, Winter Quarter 2006–2007.

[9] Giuseppe Calafiore and Marco C Campi. Uncertain convex programs: randomized solutions and confidence levels. Mathematical Programming, 102(1):25–46, 2005.

[10] Marco C Campi and Simone Garatti. The exact feasibility of randomized solutions of uncertain convex programs. SIAM Journal on Optimization, 19(3):1211–1230, 2008.

[11] Abraham Charnes and William W Cooper. Chance-constrained programming. Management Science, 6(1):73–79, 1959.

[12] Thinh T Doan, Subhonmesh Bose, D Hoa Nguyen, and Carolyn L Beck. Convergence of the iterates in mirror descent methods. IEEE Control Systems Letters, 3(1):114–119, 2018.

[13] Alejandro D Dominguez-Garcia and Christoforos N Hadjicostis. Distributed matrix scaling and application to average consensus in directed graphs. IEEE Transactions on Automatic Control, 58(3):667–681, 2013.

[14] Yu M Ermoliev. Methods of stochastic programming, 1976.

[15] Michael J Hadjiyiannis, Paul J Goulart, and Daniel Kuhn. An efficient method to estimate the suboptimality of affine controllers. IEEE Transactions on Automatic Control, 56(12):2841–2853, 2011.

[16] Grani A Hanasusanto, Daniel Kuhn, and Wolfram Wiesemann. A comment on computational complexity of stochastic programming problems. Mathematical Programming, 159(1-2):557–569, 2016.

[17] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[18] Dionysios S. Kalogerias and Warren B. Powell. Recursive optimization of convex risk measures: Mean-semideviation models, 2018.

[19] Jack Kiefer, Jacob Wolfowitz, et al. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.


[20] Jakob Kisiala. Conditional value-at-risk: Theory and applications. arXiv preprint arXiv:1511.00140, 2015.

[21] Harold Kushner and G George Yin. Stochastic Approximation and Recursive Algorithms and Applications, volume 35. Springer Science & Business Media, 2003.

[22] Alexander Mafusalov and Stan Uryasev. Buffered probability of exceedance: mathematical properties and optimization. SIAM Journal on Optimization, 28(2):1077–1103, 2018.

[23] Christopher W Miller and Insoon Yang. Optimal control of conditional value-at-risk in continuous time. SIAM Journal on Control and Optimization, 55(2):856–884, 2017.

[24] Angelia Nedic and Soomin Lee. On stochastic subgradient mirror-descent algorithm with weighted averaging. SIAM Journal on Optimization, 24(1):84–107, 2014.

[25] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

[26] Angelia Nedic and Asuman Ozdaglar. Subgradient methods for saddle-point problems. Journal of Optimization Theory and Applications, 142(1):205–228, 2009.

[27] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[28] Yurii Nesterov. Introductory lectures on convex programming volume I: Basic course. Lecture notes, 1998.

[29] Włodzimierz Ogryczak and Andrzej Ruszczynski. From stochastic dominance to mean-risk models: Semideviations as risk measures. European Journal of Operational Research, 116(1):33–50, 1999.

[30] Herbert Robbins and David Siegmund. A convergence theorem for non negative almost supermartingales and some applications. In Optimizing Methods in Statistics, pages 233–257. Elsevier, 1971.

[31] R Tyrrell Rockafellar and Stanislav Uryasev. Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26(7):1443–1471, 2002.

[32] Andrzej Ruszczynski and Alexander Shapiro. Optimization of convex risk functions. Mathematics of Operations Research, 31(3):433–452, 2006.

[33] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.

[34] Alexander Shapiro and Andy Philpott. A tutorial on stochastic programming. Manuscript. Available at www2.isye.gatech.edu/ashapiro/publications.html, 2007.

[35] Joelle Skaf and Stephen P Boyd. Design of affine controllers via convex optimization. IEEE Transactions on Automatic Control, 55(11):2476–2487, 2010.


[36] Tao Sun, Yuejiao Sun, and Wotao Yin. On Markov chain gradient descent. In Advances in Neural Information Processing Systems, pages 9896–9905, 2018.

[37] Yangyang Xu. Primal-dual stochastic gradient method for convex programs with many functional constraints. arXiv preprint arXiv:1802.02724, 2018.

[38] Shunya Yamashita, Takeshi Hatanaka, Junya Yamauchi, and Masayuki Fujita. Passivity-based generalization of primal–dual dynamics for non-strictly convex cost functions. Automatica, 112:108712, 2020.

[39] Hao Yu, Michael Neely, and Xiaohan Wei. Online convex optimization with stochastic constraints. In Advances in Neural Information Processing Systems, pages 1428–1438, 2017.

[40] Tong Zhang, Stan Uryasev, and Yongpei Guan. Derivatives and subderivatives of buffered probability of exceedance. Operations Research Letters, 47(2):130–132, 2019.

[41] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.
