
A Generic First-Order Algorithmic Framework for Bi-Level Programming Beyond Lower-Level Singleton

Risheng Liu 1 2 Pan Mu 1 2 Xiaoming Yuan 3 Shangzhi Zeng 3 Jin Zhang 4

Abstract

In recent years, a variety of gradient-based bi-level optimization methods have been developed for learning tasks. However, theoretical guarantees of these existing approaches often heavily rely on the simplification that for each fixed upper-level variable, the lower-level solution must be a singleton (a.k.a., Lower-Level Singleton, LLS). In this work, by formulating bi-level models from the optimistic viewpoint and aggregating hierarchical objective information, we establish Bi-level Descent Aggregation (BDA), a flexible and modularized algorithmic framework for bi-level programming. Theoretically, we derive a new methodology to prove the convergence of BDA without the LLS condition. Furthermore, we improve the convergence properties of conventional first-order bi-level schemes (under the LLS simplification) based on our proof recipe. Extensive experiments justify our theoretical results and demonstrate the superiority of the proposed BDA for different tasks, including hyper-parameter optimization and meta learning.

1. Introduction

Bi-Level Programs (BLPs) are mathematical programs with optimization problems in their constraints and have recently been recognized as powerful theoretical tools for a variety of machine learning applications. Mathematically, BLPs can be (re)formulated as the following optimization problem:

$$\min_{x \in \mathcal{X},\, y \in \mathbb{R}^m} F(x, y), \quad \text{s.t.}\ \ y \in \mathcal{S}(x), \tag{1}$$

¹ DUT-RU International School of Information Science and Engineering, Dalian University of Technology. ² Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province. ³ Department of Mathematics, The University of Hong Kong. ⁴ SUSTech International Center for Mathematics and Department of Mathematics, Southern University of Science and Technology. Correspondence to: Jin Zhang <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

where the Upper-Level (UL) objective F is a jointly continuous function, the UL constraint X is a compact set, and the set-valued mapping S(x) indicates the parameterized solution set of the Lower-Level (LL) subproblem. Without loss of generality, we consider the following LL subproblem:

$$\mathcal{S}(x) = \arg\min_{y} f(x, y), \tag{2}$$

where f is another jointly continuous function. Indeed, the BLPs model formulated in Eqs. (1)-(2) is a hierarchical optimization problem with two coupled variables (x, y) ∈ R^n × R^m. Specifically, given the UL variable x from the feasible set X (i.e., x ∈ X), the LL variable y is an optimal solution of the LL subproblem governed by x (i.e., y ∈ S(x)). Due to the hierarchical structure and the complicated dependency between UL and LL variables, solving the above BLPs problem is challenging in general, especially when the LL solution set S(x) in Eq. (2) is not a singleton (Jeroslow, 1985; Dempe, 2018). In this work, we always call the condition that S(x) is a singleton Lower-Level Singleton (or LLS for short).

1.1. Related Work

Although early works on BLPs date back to the nineteen seventies (Dempe, 2018), it was not until the last decade that a large number of bi-level optimization models were established to formulate specific machine learning problems, including meta learning (Franceschi et al., 2018; Rajeswaran et al., 2019; Zugner & Gunnemann, 2019), hyper-parameter optimization (Franceschi et al., 2017; Okuno et al., 2018; MacKay et al., 2019), reinforcement learning (Yang et al., 2019), generative adversarial learning (Pfau & Vinyals, 2016), and image processing (Kunisch & Pock, 2013; De los Reyes et al., 2017), just to name a few.

A large number of optimization techniques have been developed to solve BLPs in Eqs. (1)-(2). For example, the works in (Kunapuli et al., 2008; Moore, 2010; Okuno et al., 2018) aimed to reformulate the original BLPs in Eqs. (1)-(2) as a single-level optimization problem based on the first-order optimality conditions. However, these approaches involve too many auxiliary variables and thus are not applicable to complex machine learning tasks.



Table 1. Comparing the convergence results (together with the properties required of the UL and LL subproblems) between BDA and the existing bi-level FOMs in different scenarios (i.e., BLPs with and without the LLS condition). Here →(s) and →(u) represent subsequential and uniform convergence, respectively, and the superscript * denotes the true optimal variables/values.

Existing bi-level FOMs:
  With LLS:
    UL: F(x, ·) is Lipschitz continuous.
    LL: {y_K(x)} is uniformly bounded on X, y_K(x) →(u) y*(x).
    Main results: x_K →(s) x*, inf_{x∈X} ϕ_K(x) → inf_{x∈X} ϕ(x).
  Without LLS: Not available.

BDA:
  With LLS:
    UL: F(x, ·) is Lipschitz continuous.
    LL: {y_K(x)} is uniformly bounded on X, f(x, y_K(x)) →(u) f*(x).
    Main results: x_K →(s) x*, inf_{x∈X} ϕ_K(x) → inf_{x∈X} ϕ(x).
  Without LLS:
    UL: F(x, ·) is Lipschitz continuous, L_F-smooth, and σ-strongly convex.
    LL: f(x, ·) is L_f-smooth and convex, f(x, y) is level-bounded in y locally uniformly in x ∈ X, S(x) is continuous.
    Main results: x_K →(s) x*, inf_{x∈X} ϕ_K(x) → inf_{x∈X} ϕ(x).

Recently, gradient-based First-Order Methods (FOMs) have also been investigated to solve BLPs. The key idea underlying these approaches is to hierarchically calculate gradients of the UL and LL objectives. Specifically, the works in (Maclaurin et al., 2015; Franceschi et al., 2017; 2018) first calculate gradient representations of the LL objective and then perform either reverse or forward gradient computations (a.k.a., automatic differentiation, based on the LL gradients) for the UL subproblem. It is known that the reverse mode is related to back-propagation through time, while the forward mode essentially follows the standard chain rule (Franceschi et al., 2017). In fact, similar ideas have also been used in (Jenni & Favaro, 2018; Zugner & Gunnemann, 2019; Rajeswaran et al., 2019), but with different specific implementations. In (Shaban et al., 2019), a truncated back-propagation scheme is adopted to mitigate the scaling issue of the LL gradient updates. Furthermore, the works in (Lorraine & Duvenaud, 2018; MacKay et al., 2019) trained a so-called hyper-network to map LL gradients for their hierarchical optimization.

Although widely used in different machine learning applications, the theoretical properties of these bi-level FOMs are still not convincing (summarized in Table 1). Indeed, all of these methods require the LLS constraint in Eq. (2) to simplify their optimization process and theoretical analysis. For example, to satisfy such a restrictive condition, existing works (Franceschi et al., 2018; Shaban et al., 2019) have to enforce a (local) strong convexity assumption on their LL subproblem, which is often too strong to hold in complex real-world tasks.

1.2. Our Contributions

This work proposes Bi-level Descent Aggregation (BDA), a generic bi-level first-order algorithmic framework that is flexible and modularized to handle BLPs in Eqs. (1)-(2). Unlike the above existing bi-level FOMs, which require the LLS assumption on Eq. (2) and separate the original model into two single-level subproblems, our BDA investigates BLPs from the optimistic viewpoint and develops a new hierarchical optimization scheme, which consists of a single-level optimization formulation for the UL variable x and a simple bi-level optimization formulation for the LL variable y. Theoretically, we establish a general proof recipe to analyze the convergence behaviors of these bi-level FOMs. We prove that the convergence of BDA can be strictly guaranteed in the absence of the restrictive LLS condition. Furthermore, we demonstrate that the strong convexity of the LL objective (required in previous theoretical analysis (Franceschi et al., 2018)) is actually non-essential for the existing LLS-based bi-level FOMs, such as (Domke, 2012; Maclaurin et al., 2015; Franceschi et al., 2017; 2018; Shaban et al., 2019). Table 1 compares the convergence results of BDA and the existing approaches. It can be seen that in the LLS scenario, BDA and the existing methods share the same requirements for the UL subproblem. However, for the LL subproblem, the assumptions required by previous approaches are essentially more restrictive than those of BDA. More importantly, when solving BLPs without LLS, no theoretical results can be obtained for these classical methods. Fortunately, BDA can still obtain the same convergence properties as in the LLS scenario. The contributions can be summarized as:

• A counter-example (i.e., Example 1) explicitly indicates the importance of the LLS condition for the existing bi-level FOMs. In particular, we investigate their iteration behaviors and reach the conclusion that using these approaches in the absence of the LLS condition may lead to incorrect solutions.

• By formulating BLPs in Eqs. (1)-(2) from the viewpoint of optimistic bi-level optimization, BDA provides a generic bi-level algorithmic framework. Embedded with a specific gradient-aggregation-based iterative module, BDA is applicable to a variety of learning tasks.

• A general proof recipe is established to analyze the convergence behaviors of bi-level FOMs. We strictly prove the convergence of BDA without the LLS assumption. Furthermore, we revisit and improve the convergence properties of the existing bi-level FOMs in the LLS scenario.

2. First-Order Methods for BLPs

2.1. Solution Strategies with Lower-Level Singleton

As aforementioned, a number of FOMs have been proposed to solve BLPs in Eqs. (1)-(2). However, these existing methods all rely on the uniqueness of S(x) (i.e., the LLS assumption). That is, rather than considering the original BLPs in Eqs. (1)-(2), they actually solve the following simplification:

$$\min_{x \in \mathcal{X}} F(x, y), \quad \text{s.t.}\ \ y = \arg\min_{y} f(x, y), \tag{3}$$

where the LL subproblem has only one solution for a given x. By considering y as a function of x, the idea behind these approaches is to apply a gradient-based first-order scheme (e.g., gradient descent, stochastic gradient descent, or their variations) to the LL subproblem. Therefore, with the initialization point y_0, a sequence {y_k}_{k=0}^K parameterized by x can be generated, e.g.,

$$y_{k+1} = y_k - s_l \nabla_y f(x, y_k), \quad k = 0, \dots, K-1, \tag{4}$$

where s_l > 0 is an appropriately chosen step size. Then, by considering y_K(x) (i.e., the output of Eq. (4) for a given x) as an approximate optimal solution to the LL subproblem, we can incorporate y_K(x) into the UL objective and obtain a single-level approximation model, i.e., min_{x∈X} F(x, y_K(x)). Finally, by unrolling the iterative update scheme in Eq. (4), we can calculate the derivative of F(x, y_K(x)) (w.r.t. x) to optimize Eq. (3) by automatic differentiation techniques (Franceschi et al., 2017; Baydin et al., 2017).
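For illustration, the following is a minimal PyTorch sketch of this LLS-based pipeline (the function and parameter names are ours, not tied to any particular implementation); F and f are assumed to be callables returning scalar tensors, and the hypergradient of F(x, y_K(x)) is obtained by back-propagating through the unrolled LL iterations of Eq. (4).

```python
import torch

def hypergradient(F, f, x, y0, K=16, s_l=0.1):
    # Reverse-mode hypergradient of F(x, y_K(x)); a sketch only.
    x = x.detach().clone().requires_grad_(True)
    y = y0.detach().clone().requires_grad_(True)
    for _ in range(K):
        # Eq. (4): y_{k+1} = y_k - s_l * grad_y f(x, y_k), kept on the autograd graph
        g_y = torch.autograd.grad(f(x, y), y, create_graph=True)[0]
        y = y - s_l * g_y
    phi_K = F(x, y)  # single-level approximation of the UL objective
    return torch.autograd.grad(phi_K, x)[0]
```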

2.2. Fundamental Issues and Counter-Example

It can be observed that the LLS condition is essential for the validity of the existing bi-level FOMs. However, such a singleton assumption on the solution set of the LL subproblem is actually too restrictive to be satisfied, especially in real-world applications. In this subsection, we design a counter-example (Example 1 below) to illustrate the failure of these conventional gradient-based bi-level schemes in the absence of the LLS condition.

Example 1 (Counter-Example). With x ∈ [−100, 100] and y ∈ R², we consider the following BLPs problem:

$$\min_{x \in [-100, 100]} \ \frac{1}{2}(x - [y]_2)^2 + \frac{1}{2}([y]_1 - 1)^2, \quad \text{s.t.}\ \ y \in \arg\min_{y \in \mathbb{R}^2} \ \frac{1}{2}[y]_1^2 - x[y]_1, \tag{5}$$

where [·]_i denotes the i-th element of the vector. By simple calculation, we know that the optimal solution of Eq. (5) is x* = 1, y* = (1, 1). However, if adopting the existing gradient-based scheme in Eq. (4) with initialization y_0 = (0, 0) and varying step sizes s_l^k ∈ (0, 1), we have $[y_K]_1 = \big(1 - \prod_{k=0}^{K-1}(1 - s_l^k)\big)x$ and $[y_K]_2 = 0$. Then the approximated problem of Eq. (5) amounts to

$$\min_{x \in [-100, 100]} F(x, y_K) = \frac{1}{2}x^2 + \frac{1}{2}\Big(\big(1 - \prod_{k=0}^{K-1}(1 - s_l^k)\big)x - 1\Big)^2.$$

By defining $\varphi_K(x) = F(x, y_K)$, we have

$$x_K^* = \arg\min_{x \in [-100, 100]} \varphi_K(x) = \frac{1 - \prod_{k=0}^{K-1}(1 - s_l^k)}{1 + \big(1 - \prod_{k=0}^{K-1}(1 - s_l^k)\big)^2}.$$

It is easy to check that

$$0 \le \liminf_{K \to \infty} \prod_{k=0}^{K-1}(1 - s_l^k) \le \limsup_{K \to \infty} \prod_{k=0}^{K-1}(1 - s_l^k) \le 1,$$

and hence

$$\limsup_{K \to \infty} \ \frac{1 - \prod_{k=0}^{K-1}(1 - s_l^k)}{1 + \big(1 - \prod_{k=0}^{K-1}(1 - s_l^k)\big)^2} \le \frac{1}{2}.$$

So $x_K^*$ cannot converge to the true solution (i.e., x* = 1).

Remark 1. The UL objective F is indeed a function of both the UL variable x and the LL variable y. Conventional bi-level FOMs only use the gradient information of the LL subproblem to update y. Thanks to the LLS assumption, for a fixed UL variable x the LL solution y can be uniquely determined, so the sequence {y_k}_{k=0}^K converges to the true optimal solution, which minimizes both the LL and UL objectives. When LLS is absent, however, {y_k}_{k=0}^K may easily fail to converge to the true solution, and x_K^* may therefore converge to incorrect limiting points. Fortunately, we will demonstrate in Sections 3 and 5 that the example in Eq. (5) can be efficiently solved by our proposed BDA.
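As a quick numerical illustration of Example 1 (a sketch with an assumed constant step size s_l = 0.1 ∈ (0, 1)), one can run the plain LL updates of Eq. (4) and minimize ϕ_K over a fine grid of x ∈ [−100, 100]; the resulting minimizer stays near 1/2 for every K and never approaches the true solution x* = 1.

```python
import numpy as np

def yK1(xs, K, s_l=0.1):
    # [y_K]_1 on a grid of x values; from y_0 = (0, 0), [y_K]_2 stays 0 because
    # the LL objective 0.5*[y]_1^2 - x*[y]_1 has zero gradient w.r.t. [y]_2.
    y1 = np.zeros_like(xs)
    for _ in range(K):
        y1 = y1 - s_l * (y1 - xs)  # Eq. (4) applied to [y]_1
    return y1

xs = np.linspace(-100.0, 100.0, 2_000_001)
for K in (8, 16, 64):
    phi_K = 0.5 * xs ** 2 + 0.5 * (yK1(xs, K) - 1.0) ** 2  # F(x, y_K) with [y_K]_2 = 0
    print(K, xs[np.argmin(phi_K)])  # roughly 0.43, 0.49, 0.50 -- never x* = 1
```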

3. Bi-level Descent Aggregation

In contrast to previous works, which only consider simplified BLPs with the LLS assumption in Eq. (3), we propose a new algorithmic framework, named Bi-level Descent Aggregation (BDA), to handle more generic BLPs in Eqs. (1)-(2).

3.1. Optimistic Bi-level Algorithmic Framework

In fact, the situation becomes intricate if the LL subproblem is not uniquely solvable for each x ∈ X. In this work, we consider BLPs from the optimistic bi-level viewpoint¹; thus, for any given x, we expect to choose the LL solution y ∈ S(x) that also leads to the best objective value of the UL objective (i.e., F(x, ·)). Inspired by this observation, we can reformulate Eqs. (1)-(2) as

$$\min_{x \in \mathcal{X}} \varphi(x), \quad \text{with}\ \ \varphi(x) = \inf_{y \in \mathcal{S}(x)} F(x, y). \tag{6}$$

¹ For more theoretical details of optimistic BLPs, we refer to (Dempe, 2018) and the references therein.


Such a reformulation reduces BLPs to a single-level problem min_{x∈X} ϕ(x) w.r.t. the UL variable x, while for any given x, ϕ actually turns out to be the value function of a simple bi-level problem w.r.t. the LL variable y, i.e.,

$$\min_{y} F(x, y), \quad \text{s.t.}\ \ y \in \mathcal{S}(x) \quad (\text{with fixed } x). \tag{7}$$

Based on the above analysis, we can thus update y by

$$y_{k+1}(x) = \mathcal{T}_{k+1}(x, y_k(x)), \quad k = 0, \dots, K-1, \tag{8}$$

where T_k(x, ·) stands for a schematic iterative module originating from a certain simple bi-level solution strategy for Eq. (7) with a fixed UL variable x.² Let y_0 = T_0(x) be the initialization of the above scheme and denote y_K(x) as the output of Eq. (8) after K iterations (including the initial calculation T_0). Then we can replace ϕ(x) by F(x, y_K(x)) and obtain the following approximation of Eq. (6):

$$\min_{x \in \mathcal{X}} \varphi_K(x) = F(x, y_K(x)). \tag{9}$$

With the above procedure, the BLPs in Eqs. (1)-(2) are approximated by a sequence of standard single-level optimization problems. For each approximation subproblem in Eq. (9), its descent direction is implicitly representable in terms of a certain simple bi-level solution strategy (i.e., Eq. (8)). Therefore, the existing automatic differentiation techniques can all be employed to obtain optimal solutions to Eq. (9) (Franceschi et al., 2017; Baydin et al., 2017).
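A minimal sketch of this outer loop is given below (the names, step size, and projection are assumptions, not the paper's implementation): given any inner solver that returns y_K(x) while keeping the autograd graph with respect to x, the UL variable is updated by projected gradient descent on ϕ_K(x) = F(x, y_K(x)).

```python
import torch

def solve_upper(F, inner_solver, x0, project, steps=100, lr=0.01):
    # Projected gradient descent on phi_K(x) = F(x, inner_solver(x)); a sketch only.
    x = x0.detach().clone().requires_grad_(True)
    for _ in range(steps):
        y_K = inner_solver(x)               # e.g., the unrolled scheme of Eq. (8)
        phi_K = F(x, y_K)
        g, = torch.autograd.grad(phi_K, x)  # hypergradient via automatic differentiation
        with torch.no_grad():
            x.copy_(project(x - lr * g))    # keep x inside the compact set X
    return x.detach()
```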

3.2. Aggregated Iteration Modules

Now optimizing BLPs in Eqs. (1)-(2) reduces to the problem of designing a proper T_k for Eq. (8). As discussed above, T_k is related to both the UL and LL objectives, so it is natural to aggregate the descent information of these two subproblems to design T_k. Specifically, for a given x, the descent directions of the UL and LL objectives can be defined as

$$d_k^F(x) = s_u \nabla_y F(x, y_k), \qquad d_k^f(x) = s_l \nabla_y f(x, y_k),$$

where s_u, s_l > 0 are their step size parameters. Then we formulate T_k as the following first-order descent scheme:

$$\mathcal{T}_{k+1}(x, y_k(x)) = y_k - \big(\alpha_k d_k^F(x) + (1 - \alpha_k) d_k^f(x)\big), \tag{10}$$

where α_k ∈ (0, 1) denotes the aggregation parameter.
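A minimal PyTorch sketch of this aggregated module follows (an illustrative helper, with α_k decaying like 1/k as suggested by the theory in Section 4; the exact constants there involve β, γ, and ε). Plugging it into the outer sketch from Section 3.1 (e.g., `inner_solver = lambda x: bda_inner(F, f, x, y0)`) yields a complete, if simplified, BDA loop.

```python
import torch

def bda_inner(F, f, x, y0, K=16, s_u=0.7, s_l=0.2, gamma=0.5):
    # One realization of Eq. (10): aggregate UL and LL descent directions; a sketch only.
    y = y0.detach().clone().requires_grad_(True)
    for k in range(1, K + 1):
        alpha = min(gamma / k, 0.9)  # alpha_k in (0, 1), decaying like 1/k
        d_F = torch.autograd.grad(F(x, y), y, create_graph=True)[0]  # UL direction (scaled by s_u)
        d_f = torch.autograd.grad(f(x, y), y, create_graph=True)[0]  # LL direction (scaled by s_l)
        y = y - (alpha * s_u * d_F + (1.0 - alpha) * s_l * d_f)
    return y  # y_K(x), differentiable w.r.t. x
```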

Remark 2. In this part, we just introduce a gradient-aggregation-based T_k to handle the simple bi-level subproblem in Eq. (7). Indeed, our theoretical analysis in Section 4 will

² It can be seen that T_k should actually integrate the information from both the UL and LL subproblems in Eqs. (1)-(2). We will discuss specific choices of T_k in the following subsection.

demonstrate that the BDA algorithmic framework is flexible enough to incorporate a variety of numerical schemes. For example, in the Supplemental Material, we also design an appropriate T_k to handle BLPs with a nonsmooth LL objective, while its convergence is still strictly guaranteed within our framework.

4. Theoretical Investigations

In this section, we first derive a general convergence proof recipe together with two elementary properties to systematically investigate the convergence behaviors of bi-level FOMs (Section 4.1). Following this roadmap, the convergence of our BDA can be established without relying on the LLS condition (Section 4.2). We also improve the convergence results for the existing bi-level FOMs in the LLS scenario (Section 4.3). To avoid triviality, hereafter we always assume that S(x) is nonempty for any x ∈ X. Please notice that all the proofs of our theoretical results are stated in the Supplemental Material.

4.1. A General Proof Recipe

We first state some definitions, which are necessary for our analysis.³ A series of continuity properties for set-valued mappings and functions can be defined as follows.

Definition 1. A set-valued mapping S(x) : R^n ⇒ R^m is Outer Semi-Continuous (OSC) at x̄ if lim sup_{x→x̄} S(x) ⊆ S(x̄), and Inner Semi-Continuous (ISC) at x̄ if lim inf_{x→x̄} S(x) ⊇ S(x̄). S(x) is called continuous at x̄ if it is both OSC and ISC at x̄, as expressed by lim_{x→x̄} S(x) = S(x̄). Here lim sup_{x→x̄} S(x) and lim inf_{x→x̄} S(x) are defined as

$$\limsup_{x \to \bar{x}} \mathcal{S}(x) = \{\, y \mid \exists\, x^\nu \to \bar{x},\ \exists\, y^\nu \to y \ \text{with}\ y^\nu \in \mathcal{S}(x^\nu) \,\},$$
$$\liminf_{x \to \bar{x}} \mathcal{S}(x) = \{\, y \mid \forall\, x^\nu \to \bar{x},\ \exists\, y^\nu \to y \ \text{with}\ y^\nu \in \mathcal{S}(x^\nu) \,\},$$

where ν ∈ N.

Definition 2. A function ϕ(x) : R^n → R is Upper Semi-Continuous (USC) at x̄ if lim sup_{x→x̄} ϕ(x) ≤ ϕ(x̄), or equivalently lim sup_{x→x̄} ϕ(x) = ϕ(x̄), and USC on R^n if this holds for every x̄ ∈ R^n. Similarly, ϕ(x) is Lower Semi-Continuous (LSC) at x̄ if lim inf_{x→x̄} ϕ(x) ≥ ϕ(x̄), or equivalently lim inf_{x→x̄} ϕ(x) = ϕ(x̄), and LSC on R^n if this holds for every x̄ ∈ R^n. Here lim sup_{x→x̄} ϕ(x) and lim inf_{x→x̄} ϕ(x) are respectively defined as

$$\limsup_{x \to \bar{x}} \varphi(x) = \lim_{\delta \to 0}\Big[\sup_{x \in \mathcal{B}_\delta(\bar{x})} \varphi(x)\Big] \quad \text{and} \quad \liminf_{x \to \bar{x}} \varphi(x) = \lim_{\delta \to 0}\Big[\inf_{x \in \mathcal{B}_\delta(\bar{x})} \varphi(x)\Big],$$

where B_δ(x̄) = {x : ‖x − x̄‖ ≤ δ}.

³ Please also refer to (Rockafellar & Wets, 2009) for more details on these variational analysis properties.

Then, for a given function f(x, y), we state the property that it is level-bounded in y locally uniformly in x in the following definition.

Definition 3. Given a function f(x, y) : R^n × R^m → R, if for a point x̄ ∈ X ⊆ R^n and c ∈ R there exist δ > 0 and a bounded set B ⊂ R^m such that

$$\{\, y \in \mathbb{R}^m \mid f(x, y) \le c \,\} \subseteq B, \quad \forall x \in \mathcal{B}_\delta(\bar{x}) \cap \mathcal{X},$$

then we say that f(x, y) is level-bounded in y locally uniformly at x̄ ∈ X. If the above property holds for each x̄ ∈ X, we further call f(x, y) level-bounded in y locally uniformly in x ∈ X.

Now we are ready to establish the general proof recipe, which describes the main steps to achieve the convergence guarantees for our bi-level updating scheme (stated in Eqs. (8)-(9), with a schematic T_k). Basically, our proof methodology consists of two main steps:

(1) LL solution set property: for any ε > 0, there exists k(ε) > 0 such that whenever K > k(ε),

$$\sup_{x \in \mathcal{X}} \mathrm{dist}(y_K(x), \mathcal{S}(x)) \le \varepsilon.$$

(2) UL objective convergence property: ϕ(x) is LSC on X, and for each x ∈ X,

$$\varphi_K(x) \to \varphi(x) \ \text{as}\ K \to \infty.$$

Equipped with the above two properties, we can establish our general convergence results in the following theorem for the schematic bi-level scheme in Eqs. (8)-(9).

Theorem 1. Suppose both the above LL solution set and UL objective convergence properties hold. Then we have:

(1) If x_K ∈ arg min_{x∈X} ϕ_K(x), then any limit point x̄ of the sequence {x_K} satisfies x̄ ∈ arg min_{x∈X} ϕ(x).

(2) inf_{x∈X} ϕ_K(x) → inf_{x∈X} ϕ(x) as K → ∞.

Remark 3. Indeed, if x_K is a local minimum of ϕ_K(x) with uniform neighborhood modulus δ > 0, we can still conclude that any limit point x̄ of the sequence {x_K} is a local minimum of ϕ(x). Please see our Supplemental Material for more details on this issue.

4.2. Convergence Properties of BDA

The objective here is to demonstrate that our BDA meets the two elementary properties required by Theorem 1. Before proving the convergence results for BDA, we first take the following as our blanket assumption.

Assumption 1. For any x ∈ X, F(x, ·) : R^m → R is L_0-Lipschitz continuous, L_F-smooth, and σ-strongly convex, and f(x, ·) : R^m → R is L_f-smooth and convex.

Please notice that Assumption 1 is quite standard for BLPs in machine learning (Franceschi et al., 2018; Shaban et al., 2019). As can be seen, it is satisfied by all the applications considered in this work. We first present some necessary variational analysis preliminaries. Denoting the set of LL solutions that also minimize the UL objective by

$$\bar{\mathcal{S}}(x) = \arg\min_{y \in \mathcal{S}(x)} F(x, y),$$

under Assumption 1 we can quickly obtain that S̄(x) is nonempty and a singleton for any x ∈ X. Moreover, we can derive the boundedness of S̄(x) in the following lemma.

Lemma 1. Suppose F(x, y) is level-bounded in y locally uniformly in x ∈ X. If S(x) is ISC on X, then ∪_{x∈X} S̄(x) is bounded.

Thanks to the continuity of f(x, y), we further have the following result.

Lemma 2. Denote f*(x) = min_y f(x, y). If f(x, y) is continuous on X × R^m, then f*(x) is USC on X.

Now we are ready to establish the fundamental LL solution set and UL objective convergence properties required in Theorem 1. In the following proposition, we first derive the convergence of {y_K(x)} in light of the general fact stated in (Sabach & Shtern, 2017).

Proposition 1. Suppose Assumption 1 is satisfied and let {y_K} be defined as in Eq. (10), with s_l ∈ (0, 1/L_f], s_u ∈ (0, 2/(L_F + σ)],

$$\alpha_k = \min\{\, 2\gamma/(k(1-\beta)),\ 1-\varepsilon \,\},$$

with k ≥ 1, ε > 0, γ ∈ (0, 1], and

$$\beta = \sqrt{1 - \frac{2 s_u \sigma L_F}{\sigma + L_F}}.$$

Denote ȳ_K(x) = y_K(x) − s_l ∇_y f(x, y_K(x)) and

$$C_{y^*(x)} = \max\Big\{ \|y_0 - y^*(x)\|,\ \frac{s_u}{1-\beta}\|\nabla_y F(x, y^*(x))\| \Big\},$$

with y*(x) ∈ S̄(x) and x ∈ X. Then we have

$$\|y_K(x) - y^*(x)\| \le C_{y^*(x)},$$
$$\|\bar{y}_K(x) - y_K(x)\| \le \frac{2 C_{y^*(x)} (J + 2)}{K(1-\beta)},$$
$$f(x, \bar{y}_K(x)) - f^*(x) \le \frac{2 C_{y^*(x)}^2 (J + 2)}{K(1-\beta)\, s_l},$$

where J = ⌊2/(1 − β)⌋. Furthermore, for any x ∈ X, {y_K(x)} converges to S̄(x) as K → ∞.


Proposition 1, together with Lemma 1, shows that {y_K(x)} is a bounded sequence and {f(x, ȳ_K(x))} converges uniformly. We next prove the uniform convergence of {y_K(x)} towards the solution set S(x) through the uniform convergence of {f(x, ȳ_K(x))}.

Proposition 2. Let Y ⊆ R^m be a bounded set and ε > 0. If S(x) is ISC on X, then there exists δ > 0 such that for any y ∈ Y,

$$\sup_{x \in \mathcal{X}} \mathrm{dist}(y, \mathcal{S}(x)) \le \varepsilon,$$

whenever sup_{x∈X} {f(x, y) − f*(x)} ≤ δ is satisfied.

Combining Lemmas 1 and 2 together with Proposition 2, the LL solution set property required in Theorem 1 can eventually be derived. Let us now prove the LSC property of ϕ on X in the following proposition.

Proposition 3. Suppose F(x, y) is level-bounded in y locally uniformly in x ∈ X. If S(x) is OSC at x̄ ∈ X, then ϕ(x) is LSC at x̄ ∈ X.

Then the UL objective convergence property required in Theorem 1 can be obtained subsequently based on Proposition 3. In summary, we present the main convergence results of BDA in the following theorem.

Theorem 2. Suppose Assumption 1 is satisfied and let {y_K} be defined as in Eq. (10), with s_l ∈ (0, 1/L_f], s_u ∈ (0, 2/(L_F + σ)],

$$\alpha_k = \min\{\, 2\gamma/(k(1-\beta)),\ 1-\varepsilon \,\},$$

with k ≥ 1, ε > 0, γ ∈ (0, 1], and

$$\beta = \sqrt{1 - \frac{2 s_u \sigma L_F}{\sigma + L_F}}.$$

Assume further that S(x) is continuous on X. Then both the LL solution set and UL objective convergence properties hold.

Remark 4. Our proposed theoretical results are general enough for BLPs in different application scenarios. For example, when the LL objective takes a nonsmooth form, e.g., h = f + g with smooth f and nonsmooth g, we can adopt a proximal-operation-based iteration module (Beck, 2017) to construct T_k within our BDA framework. The convergence proofs are highly similar to those of Theorem 2. More details on this extension can be found in our Supplemental Material.

4.3. Improving Existing LLS Theories

Even with the LLS simplification of BLPs in Eqs. (1)-(2), the theoretical properties of the existing bi-level FOMs are still not fully satisfactory. Their convergence proofs in essence depend on the strong convexity of the LL objective, which may restrict the use of these approaches in complex machine learning applications. In this subsection, by weakening the required assumptions, we improve the convergence results in (Franceschi et al., 2018; Shaban et al., 2019) for these conventional bi-level FOMs in the LLS scenario. Specifically, we first introduce an assumption on the LL objective f(x, y), which is needed for our analysis in this subsection.

Assumption 2. f(x, y) : R^n × R^m → R is level-bounded in y locally uniformly in x ∈ X.

In fact, Assumption 2 is mild and holds for a large class of LL subproblems that are convex but not necessarily strongly convex. In contrast, the theoretical results in the existing literature (Franceschi et al., 2018; Shaban et al., 2019) require the more restrictive (local) strong convexity property on the LL objective to meet the LLS condition.

Under Assumption 2, the following lemma verifies the continuity of S(x) in the LLS scenario.

Lemma 3. Suppose S(x) is single-valued on X and Assumption 2 is satisfied. Then S(x) is continuous on X.

As can be seen from the proof of Theorem 3 in our Supplemental Material, Lemma 3 and the uniform convergence of {f(x, y_K(x))} actually imply the LL solution set and UL objective convergence properties. Hence Theorem 1 is applicable, which inspires an improved version of the convergence results for the existing bi-level FOMs as follows.

Theorem 3. Suppose S(x) is single-valued on X and Assumption 2 is satisfied, {y_K(x)} is uniformly bounded on X, and {f(x, y_K(x))} converges uniformly to f*(x) on X as K → ∞. Then both the LL solution set and UL objective convergence properties hold.

Remark 5. Theorem 3 actually improves the convergence results in (Franceschi et al., 2018). In fact, the uniform convergence assumption of {y_K(x)} towards y*(x) required in (Franceschi et al., 2018) is essentially based on the strong convexity assumption (see Remark 3.3 of (Franceschi et al., 2018)). Instead of assuming such strong convexity, we only need to assume the weaker condition that {f(x, y_K(x))} converges uniformly to f*(x) on X as K → ∞.

It is natural for us to illustrate our improvement in terms of concrete applications. Specifically, we take the gradient-based bi-level scheme summarized in Section 2.1 (which has been used in (Franceschi et al., 2018; Jenni & Favaro, 2018; Shaban et al., 2019; Zugner & Gunnemann, 2019; Rajeswaran et al., 2019)). In the following two propositions, we assume that f(x, ·) : R^m → R is L_f-smooth and convex, and s_l ≤ 1/L_f.

Inspired by Theorems 10.21 and 10.23 in (Beck, 2017), we first derive the following proposition.

Proposition 4. Let {y_K} be defined as in Eq. (4). Then it holds that

$$\|y_K(x) - y^*(x)\| \le \|y_0 - y^*(x)\|,$$

and

$$f(x, y_K(x)) - f^*(x) \le \frac{\|y_0 - y^*(x)\|^2}{2 s_l K},$$

with y*(x) ∈ S(x) and x ∈ X.

Then, in the following proposition, we can immediately verify our required assumption on {f(x, y_K(x))} in the absence of the strong convexity property on the LL objective.

Proposition 5. Suppose that S(x) is single-valued on X and Assumption 2 is satisfied. Let {y_K} be defined as in Eq. (4). Then {y_K(x)} is uniformly bounded on X and {f(x, y_K(x))} converges uniformly to f*(x) on X as K → ∞.

Remark 6. When the LL subproblem is convex, but not necessarily strongly convex, a large number of gradient-based methods, including accelerated gradient methods such as FISTA (Beck & Teboulle, 2009) and the block coordinate descent method (Tseng, 2001), automatically meet our assumption, i.e., the uniform convergence of the optimal values {f(x, y_K(x))} towards f*(x) on X.

5. Experimental Results

In this section, we first verify the theoretical findings and then evaluate the performance of our proposed method on different problems, such as hyper-parameter optimization and meta learning. We conducted these experiments on a PC with an Intel Core i7-7700 CPU (3.6 GHz), 32GB RAM and an NVIDIA GeForce RTX 2060 6GB GPU.

5.1. Synthetic BLPs

Our theoretical findings are investigated based on the synthetic BLPs described in Section 2.2. As stated above, this deterministic bi-level formulation satisfies all the assumptions required in Section 4, but it cannot meet the LLS condition considered in (Finn et al., 2017; Franceschi et al., 2017; 2018; Shaban et al., 2019). Here, we fix the learning rate parameters s_u = 0.7 and s_l = 0.2 in this experiment.

In Figure 1, we plotted the numerical results of BDA and one of the most representative bi-level FOMs (i.e., Reverse Hyper-Gradient (RHG) (Franceschi et al., 2017; 2018)) with different initialization points. We considered different numerical metrics, such as |F − F*|, |f − f*|, ‖x − x*‖²/‖x*‖², and ‖y − y*‖²/‖y*‖², for evaluation. It can be observed that RHG struggles to obtain the correct solution, even when started from different initialization points. This is mainly because the solution set of the LL subproblem in Eq. (5) is not a singleton, which violates the fundamental assumption of RHG. In contrast, our BDA aggregates the UL and LL information to perform the LL updating, so we are able to obtain the true optimal solution in all these scenarios. The initialization only slightly affected the convergence speed of our iterative sequences.

Figure 2 further plotted the convergence behaviors of BDA and RHG with different numbers of LL iterations (i.e., K). We observed that the results of RHG cannot be improved by increasing K. For BDA, however, the three iterative sequences (with K = 8, 16, 64) always converged, and the numerical performance can be improved by performing relatively more LL iterations. In the above two figures, we set α_k = 0.5/k, k = 1, ..., K.

Figure 3 evaluated the convergence behaviors of BDA with different choices of α_k. By setting α_k = 0, we were unable to use the UL information to guide the LL updating, making it hard to obtain proper feasible solutions for the UL subproblem. When choosing a fixed α_k in (0, 1) (e.g., α_k = 0.5), the numerical performance improved but the convergence speed was still slow. Fortunately, following our theoretical findings, we introduced an adaptive strategy to incorporate UL information into the LL iterations, leading to nice convergence behaviors for both the UL and LL variables.
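Under the stated settings (s_u = 0.7, s_l = 0.2, α_k = 0.5/k, K = 16), the sketches from Sections 2.1 and 3.1-3.2 can be combined on Example 1 as follows; this is only an illustrative reproduction with assumed outer-loop settings, not the authors' code.

```python
import torch

# assumes solve_upper and bda_inner from the earlier sketches
F = lambda x, y: 0.5 * (x - y[1]) ** 2 + 0.5 * (y[0] - 1.0) ** 2  # UL objective of Eq. (5)
f = lambda x, y: 0.5 * y[0] ** 2 - x * y[0]                       # LL objective of Eq. (5)

x0 = torch.tensor(0.0)
y0 = torch.tensor([2.0, 2.0])
x_K = solve_upper(F,
                  lambda x: bda_inner(F, f, x, y0, K=16, s_u=0.7, s_l=0.2, gamma=0.5),
                  x0,
                  project=lambda t: t.clamp(-100.0, 100.0),
                  steps=200, lr=0.1)
print(x_K)  # moves from 0 toward the true solution x* = 1 (cf. Figure 1); larger K tightens the gap
```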

5.2. Hyper-parameter Optimization

Hyper-parameter optimization aims at choosing a set of optimal hyper-parameters for a given machine learning task. In this experiment, we consider a specific hyper-parameter optimization example, known as data hyper-cleaning (Franceschi et al., 2017; Shaban et al., 2019), to evaluate our proposed bi-level algorithm. In this task, we need to train a linear classifier on a given image set, but part of the training labels are corrupted. Following (Franceschi et al., 2017; Shaban et al., 2019), here we consider softmax regression (with parameters y) as our classifier and introduce hyper-parameters x to weight the samples for training.

Specifically, let ℓ(y; u_i, v_i) be the cross-entropy function with the classification parameter y and data pair (u_i, v_i), and denote D_tr and D_val as the training and validation sets, respectively. Then we can define the LL objective as the following weighted training loss:

$$f(x, y) = \sum_{(u_i, v_i) \in \mathcal{D}_{\mathrm{tr}}} [\sigma(x)]_i\, \ell(y; u_i, v_i),$$

where x is the hyper-parameter vector that weights the objective for the different training samples. Here σ(x) denotes the element-wise sigmoid function on x and is used to constrain the weights to the range [0, 1]. For the UL subproblem, we define the objective as the cross-entropy loss with ℓ2 regularization on the validation set, i.e.,

$$F(x, y) = \sum_{(u_i, v_i) \in \mathcal{D}_{\mathrm{val}}} \ell(y(x); u_i, v_i) + \lambda \|y(x)\|^2,$$

where λ > 0 is the trade-off parameter, fixed as 10⁻⁴.
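A minimal sketch of these two objectives is given below (with illustrative tensor shapes and names): U denotes the feature matrices, v the integer labels, and y the softmax-regression weights.

```python
import torch
import torch.nn.functional as Fn

def ll_objective(x, y, U_tr, v_tr):
    # f(x, y): training cross-entropy, re-weighted per sample by sigmoid(x_i)
    per_sample = Fn.cross_entropy(U_tr @ y, v_tr, reduction='none')
    return (torch.sigmoid(x) * per_sample).sum()

def ul_objective(x, y, U_val, v_val, lam=1e-4):
    # F(x, y): validation cross-entropy plus l2 regularization on y
    return Fn.cross_entropy(U_val @ y, v_val, reduction='sum') + lam * (y ** 2).sum()
```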

Figure 1. Illustrating the numerical performance of first-order BLPs algorithms with different initialization points. Top row: fix x_0 = 0 and vary y_0 = (0, 0), (2, 2). Bottom row: fix y_0 = (2, 2) and vary x_0 = 0, 2. We fix K = 16 for UL iterations. The dashed and solid curves denote the results of RHG and BDA, respectively. The legend is only plotted in the first subfigure.

Figure 2. Illustrating the numerical performance of first-order BLPs algorithms with different LL iterations (i.e., K = 8, 16, 64). We fix the initialization as x_0 = 0 and y_0 = (2, 2). The dashed and solid curves denote the results of RHG and BDA, respectively. The legend is only plotted in the first subfigure.

Figure 3. Illustrating the numerical performance of BDA with fixed α_k (e.g., α_k = 0, 0.5) and adaptive α_k (e.g., {α_k = 0.9/k}, denoted as "Adap. α"). The initialization and LL iterations are fixed as x_0 = 0, y_0 = (2, 2), and K = 16, respectively.

We applied our BDA together with the baselines, RHG and Truncated RHG (T-RHG) (Shaban et al., 2019), to solve the above BLPs problem on the MNIST database (LeCun et al., 1998). Both the training and the validation sets consist of 7000 class-balanced samples, and the remaining 56000 samples are used as the test set. We adopted the architectures used in RHG as the feature extractor for all the compared methods. For T-RHG, we chose 25-step truncated back-propagation to guarantee its convergence. Table 2 reports the averaged accuracy of all the compared methods with different numbers of LL iterations (i.e., K = 50, 100, 200, 400, 800). We observed that RHG outperformed T-RHG, while BDA consistently achieved the highest accuracy. Our theoretical results suggest that most of the improvements of BDA come from the aggregation of the UL and LL information. The results also show that more LL iterations improve the final performance in most cases.

Table 2. Data hyper-cleaning accuracy of the compared methods with different numbers of LL iterations (i.e., K = 50, 100, 200, 400, 800) on MNIST.

Method   K=50    K=100   K=200   K=400   K=800
RHG      88.96   89.73   90.13   90.19   90.15
T-RHG    87.90   88.28   88.50   88.52   89.99
BDA      89.12   90.12   90.57   90.81   90.86


Table 3. The averaged few-shot classification accuracy on Omniglot (N = 5, 20 and M = 1, 5).

Method     5-way 1-shot   5-way 5-shot   20-way 1-shot   20-way 5-shot
MAML       98.70          99.91          95.80           98.90
Meta-SGD   97.97          98.96          93.98           98.40
Reptile    97.68          99.48          89.43           97.12
RHG        98.60          99.50          95.50           98.40
T-RHG      98.74          99.52          95.82           98.95
BDA        99.04          99.62          96.50           99.10

5.3. Meta Learning

The aim of meta learning is to learn an algorithm that should work well on novel tasks. In particular, we consider the few-shot learning problem (Vinyals et al., 2016; Qiao et al., 2018), where each task is an N-way classification and the goal is to learn the hyper-parameter x such that each task can be solved with only M training samples (i.e., N-way M-shot).

Following the experimental protocol used in recent works, we separate the network architecture into two parts: the cross-task intermediate representation layers (parameterized by x), which output the meta features, and the multinomial logistic regression layer (parameterized by y^j), which serves as the ground classifier for the j-th task. We also collect a meta-training data set D = {D^j}, where D^j = D^j_tr ∪ D^j_val is linked to the j-th task. Then, for the j-th task, we consider the cross-entropy function ℓ(x, y^j; D^j_tr) as the task-specific loss, and thus the LL objective can be defined as

$$f(x, \{y^j\}) = \sum_{j} \ell(x, y^j; \mathcal{D}^j_{\mathrm{tr}}).$$

As for the UL objective, we also utilize the cross-entropy function but define it based on {D^j_val} as

$$F(x, \{y^j\}) = \sum_{j} \ell(x, y^j; \mathcal{D}^j_{\mathrm{val}}).$$
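A minimal sketch of these objectives follows (an illustrative structure only, not the authors' code), where x parameterizes a stand-in shared representation and each y_j is the task-specific logistic-regression head.

```python
import torch
import torch.nn.functional as Fn

def task_loss(x, y_j, batch):
    feats = torch.relu(batch['inputs'] @ x)  # stand-in for the cross-task representation layers
    return Fn.cross_entropy(feats @ y_j, batch['labels'], reduction='sum')

def ll_objective(x, ys, tasks):
    # f(x, {y_j}): sum of task-specific losses on the training splits D_tr^j
    return sum(task_loss(x, y_j, task['train']) for y_j, task in zip(ys, tasks))

def ul_objective(x, ys, tasks):
    # F(x, {y_j}): the same cross-entropy, evaluated on the validation splits D_val^j
    return sum(task_loss(x, y_j, task['val']) for y_j, task in zip(ys, tasks))
```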

Our experiments are conducted on two widely used benchmarks: Omniglot (Lake et al., 2015), which contains 1623 handwritten characters from 50 alphabets, and MiniImageNet (Vinyals et al., 2016), which is a subset of ImageNet (Deng et al., 2009) and includes 60000 downsampled images from 100 different classes. We followed the experimental protocol used in MAML (Finn et al., 2017) and compared our BDA to several state-of-the-art approaches, such as MAML (Finn et al., 2017), Meta-SGD (Li et al., 2018), Reptile (Nichol et al., 2018), RHG, and T-RHG.

It can be seen in Table 3 that BDA compared well to these methods and achieved the highest classification accuracy except in the 5-way 5-shot task; in this case, the practical performance of BDA was slightly worse than MAML. We further conducted experiments on the more challenging MiniImageNet data set. In the second column of Table 4, we reported the averaged accuracy of three first-order BLPs based methods (i.e., RHG, T-RHG and BDA). Again, the performance of BDA is better than that of RHG and T-RHG. In the rightmost two columns, we also compared the number of averaged UL iterations needed to achieve almost the same accuracy (≈ 44%). These results showed that BDA needed the fewest iterations to reach this accuracy. The validation losses on Omniglot and MiniImageNet for the 5-way 1-shot setting are plotted in Figure 4.

Table 4. The few-shot classification performance on MiniImageNet (N = 5 and M = 1). The second column reports the averaged accuracy after convergence. The rightmost two columns compare the UL iterations (denoted as "UL Iter.") needed to achieve almost the same accuracy (≈ 44%). Here "Ave. ± Var. (Acc.)" denotes the averaged accuracy and the corresponding variance.

Method   Acc.    Ave. ± Var. (Acc.)   UL Iter.
RHG      48.89   44.46 ± 0.78         3300
T-RHG    47.67   44.21 ± 0.78         3700
BDA      49.08   44.24 ± 0.79         2500

Figure 4. Illustrating the validation loss (i.e., UL objective F(x, y)) for three BLPs based methods on the few-shot classification task. The curves in the left and right subfigures are based on the 5-way 1-shot results in Tables 3 and 4, respectively.

6. Conclusions

The proposed BDA is a generic first-order algorithmic scheme to address BLPs. We first designed a counter-example to indicate that the existing bi-level FOMs may lead to incorrect solutions in the absence of the LLS condition. Considering BLPs from the optimistic bi-level viewpoint, BDA reformulates the original models in Eqs. (1)-(2) as the composition of a single-level subproblem (w.r.t. x) and a simple bi-level subproblem (w.r.t. y). We established a general proof recipe for bi-level FOMs and proved the convergence of BDA without the LLS assumption. As a nontrivial byproduct, we further improved the convergence results for the existing schemes. Extensive evaluations showed the superiority of BDA for different applications.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 61922019, 61672125, 61733002, 61772105 and 11971220), the LiaoNing Revitalization Talents Program (XLYC1807088), the Fundamental Research Funds for the Central Universities and the Natural Science Foundation of Guangdong Province 2019A1515011152. This work was also supported by the General Research Fund 12302318 from the Hong Kong Research Grants Council.

References

Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. Automatic differentiation in machine learning: a survey. The Journal of Machine Learning Research, 18(1):5595–5637, 2017.
Beck, A. First-order methods in optimization. SIAM, 2017.
Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
Bonnans, J. F. and Shapiro, A. Perturbation analysis of optimization problems. Springer Science & Business Media, 2013.
De los Reyes, J. C., Schonlieb, C.-B., and Valkonen, T. Bilevel parameter learning for higher-order total variation regularisation models. Journal of Mathematical Imaging and Vision, 57(1):1–25, 2017.
Dempe, S. Bilevel optimization: theory, algorithms and applications. TU Bergakademie Freiberg Mining Academy and Technical University, 2018.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In CVPR, pp. 248–255, 2009.
Domke, J. Generic methods for optimization-based modeling. In AISTATS, pp. 318–326, 2012.
Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pp. 1126–1135, 2017.
Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. Forward and reverse gradient-based hyperparameter optimization. In ICML, pp. 1165–1173, 2017.
Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. Bilevel programming for hyperparameter optimization and meta-learning. In ICML, pp. 1563–1572, 2018.
Jenni, S. and Favaro, P. Deep bilevel learning. In ECCV, pp. 618–633, 2018.
Jeroslow, R. G. The polynomial hierarchy and a simple model for competitive analysis. Mathematical Programming, 32(2):146–164, 1985.
Kunapuli, G., Bennett, K. P., Hu, J., and Pang, J.-S. Classification model selection via bilevel programming. Optimization Methods & Software, 23(4):475–489, 2008.
Kunisch, K. and Pock, T. A bilevel optimization approach for parameter learning in variational models. SIAM Journal on Imaging Sciences, 6(2):938–983, 2013.
Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Li, Z., Zhou, F., Chen, F., and Li, H. Meta-sgd: Learning to learn quickly for few-shot learning. In ICML, 2018.
Lorraine, J. and Duvenaud, D. Stochastic hyperparameter optimization through hypernetworks. CoRR, abs/1802.09419, 2018.
MacKay, M., Vicol, P., Lorraine, J., Duvenaud, D., and Grosse, R. Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions. ICLR, 2019.
Maclaurin, D., Duvenaud, D., and Adams, R. Gradient-based hyperparameter optimization through reversible learning. In ICML, pp. 2113–2122, 2015.
Moore, G. M. Bilevel programming algorithms for machine learning model selection. Rensselaer Polytechnic Institute, 2010.
Nichol, A., Achiam, J., and Schulman, J. On first-order meta-learning algorithms. CoRR, abs/1803.02999, 2018.
Okuno, T., Takeda, A., and Kawana, A. Hyperparameter learning via bilevel nonsmooth optimization. CoRR, abs/1806.01520, 2018.
Pfau, D. and Vinyals, O. Connecting generative adversarial networks and actor-critic methods. In NeurIPS Workshop on Adversarial Training, 2016.
Qiao, S., Liu, C., Shen, W., and Yuille, A. L. Few-shot image recognition by predicting parameters from activations. In CVPR, pp. 7229–7238, 2018.
Rajeswaran, A., Finn, C., Kakade, S. M., and Levine, S. Meta-learning with implicit gradients. In NeurIPS, pp. 113–124, 2019.


Rockafellar, R. T. and Wets, R. J.-B. Variational analysis. Springer Science & Business Media, 2009.
Sabach, S. and Shtern, S. A first order method for solving convex bilevel optimization problems. SIAM Journal on Optimization, 27(2):640–660, 2017.
Shaban, A., Cheng, C.-A., Hatch, N., and Boots, B. Truncated back-propagation for bilevel optimization. In AISTATS, pp. 1723–1732, 2019.
Tseng, P. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.
Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In NeurIPS, pp. 3630–3638, 2016.
Wang, S., Fidler, S., and Urtasun, R. Proximal deep structured models. In NeurIPS, pp. 865–873, 2016.
Yang, Z., Chen, Y., Hong, M., and Wang, Z. Provably global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. In NeurIPS, pp. 8351–8363, 2019.
Zugner, D. and Gunnemann, S. Adversarial attacks on graph neural networks via meta learning. ICLR, 2019.

Appendix

This supplemental material is organized as follows. In Section A, we present the detailed proofs of Section 4.1. Sections B and C provide the detailed proofs of Sections 4.2 and 4.3, respectively. Section D then proves some extended theoretical results, including the local convergence behaviors of BDA and the algorithmic scheme and convergence properties of BDA for BLPs with a nonsmooth LL objective.

A. Proofs of Section 4.1

A.1. Proof of Theorem 1

Proof. Since X is compact, we can assume without loss of generality that x_K → x̄ ∈ X. For any ε > 0, there exists k(ε) > 0 such that whenever K > k(ε), we have

$$\sup_{x \in \mathcal{X}} \mathrm{dist}(y_K(x), \mathcal{S}(x)) \le \frac{\varepsilon}{2 L_0}.$$

Thus, for any x ∈ X, there exists y*(x) ∈ S(x) such that

$$\|y_K(x) - y^*(x)\| \le \frac{\varepsilon}{L_0}.$$

Therefore, for any x ∈ X, we have

$$\varphi(x) = \inf_{y \in \mathcal{S}(x)} F(x, y) \le F(x, y^*(x)) \le F(x, y_K(x)) + L_0 \|y_K(x) - y^*(x)\| \le \varphi_K(x) + \varepsilon.$$

This implies that, for any ε > 0, there exists k(ε) > 0 such that whenever K > k(ε), it holds that

$$\varphi(x_K) \le \varphi_K(x_K) + \varepsilon \le \varphi_K(x) + \varepsilon, \quad \forall x \in \mathcal{X}.$$

Taking K → ∞ and using the LSC of ϕ, we have

$$\varphi(\bar{x}) \le \liminf_{K \to \infty} \varphi(x_K) \le \liminf_{K \to \infty} \varphi_K(x_K) + \varepsilon \le \lim_{K \to \infty} \varphi_K(x) + \varepsilon = \varphi(x) + \varepsilon, \quad \forall x \in \mathcal{X}.$$

By taking ε → 0, we have

$$\varphi(\bar{x}) \le \varphi(x), \quad \forall x \in \mathcal{X},$$

which implies x̄ ∈ arg min_{x∈X} ϕ(x).

We next show that inf_{x∈X} ϕ_K(x) → inf_{x∈X} ϕ(x) as K → ∞. If this were not true, then there would exist δ > 0 and a sequence {l} ⊆ N such that

$$\Big| \inf_{x \in \mathcal{X}} \varphi_l(x) - \inf_{x \in \mathcal{X}} \varphi(x) \Big| > \delta, \quad \forall l. \tag{11}$$

For each l, there exists x_l ∈ X such that

$$\varphi_l(x_l) \le \inf_{x \in \mathcal{X}} \varphi_l(x) + \delta/2.$$

Since X is compact, we can assume without loss of generality that x_l → x̂ ∈ X. For any ε > 0, there exists k(ε) > 0 such that whenever l > k(ε), the following holds:

$$\varphi(x_l) \le \varphi_l(x_l) + \varepsilon \le \inf_{x \in \mathcal{X}} \varphi_l(x) + \delta/2 + \varepsilon \le \varphi_l(x) + \delta/2 + \varepsilon, \quad \forall x \in \mathcal{X}.$$

By taking l → ∞ and using the LSC of ϕ, we have

$$\varphi(\hat{x}) \le \liminf_{l \to \infty} \varphi(x_l) \le \liminf_{l \to \infty}\Big(\inf_{x \in \mathcal{X}} \varphi_l(x)\Big) + \delta/2 + \varepsilon \le \limsup_{l \to \infty}\Big(\inf_{x \in \mathcal{X}} \varphi_l(x)\Big) + \delta/2 + \varepsilon \le \varphi(x) + \delta/2 + \varepsilon, \quad \forall x \in \mathcal{X}.$$


Then, by taking ε → 0, we have

$$\inf_{x \in \mathcal{X}} \varphi(x) \le \liminf_{l \to \infty}\Big(\inf_{x \in \mathcal{X}} \varphi_l(x)\Big) + \delta/2 \le \limsup_{l \to \infty}\Big(\inf_{x \in \mathcal{X}} \varphi_l(x)\Big) + \delta/2 \le \inf_{x \in \mathcal{X}} \varphi(x) + \delta/2,$$

which contradicts Eq. (11). Thus inf_{x∈X} ϕ_K(x) → inf_{x∈X} ϕ(x) as K → ∞.

B. Proofs of Section 4.2

B.1. Proof of Lemma 1

Proof. We prove this result by contradiction; that is, suppose there exist {x_t} ⊆ X and y_t ∈ S̄(x_t) such that ‖y_t‖ → +∞. As X is compact, we can assume without loss of generality that x_t → x̄ ∈ X. Since F(x, y) is level-bounded in y locally uniformly in x ∈ X, we must have ϕ(x_t) = F(x_t, y_t) → +∞. On the other hand, for any ε > 0, let ȳ ∈ S(x̄) satisfy F(x̄, ȳ) ≤ ϕ(x̄) + ε. As F is continuous at (x̄, ȳ), there exists δ_0 > 0 such that

$$F(x, y) \le F(\bar{x}, \bar{y}) + \varepsilon, \quad \forall (x, y) \in \mathcal{B}_{\delta_0}(\bar{x}, \bar{y}).$$

As S(x) is ISC at x̄ relative to X, it follows that there exists δ with (√2/2)δ_0 ≥ δ > 0 satisfying

$$\mathcal{S}(x) \cap \mathcal{B}_{\frac{\sqrt{2}}{2}\delta_0}(\bar{y}) \ne \emptyset, \quad \forall x \in \mathcal{B}_{\delta}(\bar{x}) \cap \mathcal{X}.$$

Therefore, for any x ∈ B_δ(x̄) ∩ X, there exists y ∈ S(x) satisfying (x, y) ∈ B_{δ_0}(x̄, ȳ), and thus F(x, y) ≤ F(x̄, ȳ) + ε. Consequently, for any x ∈ B_δ(x̄) ∩ X, we have

$$\varphi(x) = \min_{y \in \mathcal{S}(x)} F(x, y) \le F(\bar{x}, \bar{y}) + \varepsilon \le \varphi(\bar{x}) + 2\varepsilon,$$

which contradicts ϕ(x_t) → +∞.

B.2. Proof of Lemma 2

Proof. For any sequence {x_t} ⊆ X satisfying x_t → x̄ ∈ X and any ε > 0, let ȳ ∈ R^m satisfy f(x̄, ȳ) ≤ f*(x̄) + ε. As f is continuous at (x̄, ȳ), there exists T > 0 such that

$$f^*(x_t) \le f(x_t, \bar{y}) \le f(\bar{x}, \bar{y}) + \varepsilon \le f^*(\bar{x}) + 2\varepsilon, \quad \forall t > T,$$

and thus

$$\limsup_{t \to \infty} f^*(x_t) \le f^*(\bar{x}) + 2\varepsilon.$$

By taking ε → 0, we get lim sup_{t→∞} f*(x_t) ≤ f*(x̄).

We next prove the uniform convergence of {y_K(x)} towards the solution set S(x) through the uniform convergence of {f(x, ȳ_K(x))}.

B.3. Proof of Proposition 2

Proof. We prove this statement by contradiction. Assume that there exist a bounded set Y ⊆ R^m, ε > 0, sequences {(x_t, y_t)} ⊆ X × Y and {δ_t} with δ_t → 0 satisfying

$$f(x_t, y_t) - f^*(x_t) \le \delta_t \quad \text{and} \quad \mathrm{dist}(y_t, \mathcal{S}(x_t)) > \varepsilon.$$

Without loss of generality, we can assume that x_t → x̄ ∈ X and y_t → ȳ ∈ R^m as t → ∞. According to the continuity of f and the USC of f* from Lemma 2, we have

$$0 \le f(\bar{x}, \bar{y}) - f^*(\bar{x}) \le \liminf_{t \to \infty}\big(f(x_t, y_t) - f^*(x_t)\big) \le 0,$$

which implies ȳ ∈ S(x̄). However, as dist(y_t, S(x_t)) > ε, it follows from the ISC of S(x) at x̄ and Proposition 5.11 of (Rockafellar & Wets, 2009) that

$$\mathrm{dist}(\bar{y}, \mathcal{S}(\bar{x})) \ge \limsup_{t \to \infty} \mathrm{dist}(\bar{y}, \mathcal{S}(x_t)) \ge \limsup_{t \to \infty}\big(\mathrm{dist}(y_t, \mathcal{S}(x_t)) - \|y_t - \bar{y}\|\big) \ge \liminf_{t \to \infty} \mathrm{dist}(y_t, \mathcal{S}(x_t)) \ge \varepsilon,$$

which contradicts ȳ ∈ S(x̄).

B.4. Proof of Proposition 3

Proof. Suppose to the contrary that there exists x̄ ∈ X at which

$$\liminf_{x \to \bar{x}} \varphi(x) < \varphi(\bar{x})$$

holds. Then there exist ε > 0 and sequences x_t → x̄ ∈ X and y_t ∈ S(x_t) satisfying

$$F(x_t, y_t) \le \varphi(x_t) + \varepsilon < \varphi(\bar{x}) - \varepsilon.$$

Furthermore, since F(x, y) is level-bounded in y locally uniformly in x ∈ X, we have that {y_t} is bounded. Take a subsequence {y_ν} of {y_t} such that y_ν → ȳ; it then follows from the OSC of S that ȳ ∈ S(x̄). Then we have

$$\varphi(\bar{x}) \le F(\bar{x}, \bar{y}) \le \limsup_{t \to \infty} F(x_t, y_t) \le \varphi(\bar{x}) - \varepsilon,$$

which is a contradiction. Thus

$$\varphi(\bar{x}) \le \liminf_{x \to \bar{x}} \varphi(x),$$

and we get the conclusion.

B.5. Proof of Theorem 2

Proof. We first show that F(x, y) is level-bounded in y locally uniformly in x ∈ X. For any x̄ ∈ X, let {x_t} ⊆ X with x_t → x̄ and {y_t} ⊆ R^m with ‖y_t‖ → +∞. Then, by Assumption 1 we have

$$F(x_t, y_t) \ge F(x_t, y_1) + \langle \nabla_y F(x_t, y_1), y_t - y_1 \rangle + \frac{\sigma}{2}\|y_t - y_1\|^2.$$

As F (x, ·) : Rm → R is Lipschitz continuous with uniformconstant L0 for any x ∈ X , we have ‖∇yF (xt,y1)‖ ≤ L0.Then, by the continuity of F , with xt → x ∈ X , and‖yt‖ → +∞, we have F (xt,yt)→ +∞. Thus F (x,y) islevel-bounded in y locally uniformly in x ∈ X . Then withProposition 3 and assumptions in Theorem 2, we get theLSC property of ϕ on X . And according to Lemma 1, thereexists M > 0 such that Cy∗(x) ≤M for any y∗(x) ∈ S(x)and x ∈ X . Following Proposition 1, there exists C > 0such that for any x ∈ X we have

‖y_K(x)‖ ≤ C,  ∀K ≥ 0,

‖y_K(x) − ȳ_K(x)‖ ≤ C/K,

and

f(x, ȳ_K(x)) − f*(x) ≤ C/K,  ∀K ≥ 0.

Next, according to Proposition 2, for any ε > 0 there exists k(ε) > 0 such that whenever K > max{2C/ε, k(ε)} we have

sup_{x∈X} dist(y_K(x), S(x)) ≤ sup_{x∈X} ‖y_K(x) − ȳ_K(x)‖ + sup_{x∈X} dist(ȳ_K(x), S(x)) ≤ ε.

Then it follows from Proposition 1 that ϕ_K(x) → ϕ(x) as K → ∞ for any x ∈ X.

C. Proofs of Section 4.3

C.1. Proof of Lemma 3

Proof. First, according to Proposition 4.4 of (Bonnans & Shapiro, 2013), if f(x, y) : R^n × R^m → R is continuous on X × R^m and level-bounded in y locally uniformly in x ∈ X, then f*(x) is continuous on X, and S(x) is OSC on X and locally bounded at each x̄ ∈ X. Thus, for any x̄ ∈ X, f*(x) is locally bounded at x̄. As S(x) is a single-valued mapping on X that is OSC at x̄ ∈ X and locally bounded at x̄, Proposition 5.20 of (Rockafellar & Wets, 2009) yields that S(x) is ISC at x̄, and thus continuous at x̄. This completes the proof.

C.2. Proof of Theorem 3

Proof. First, we get the continuity of S(x) on X from Lemma 3. Then, by Proposition 3, we obtain the LSC of ϕ(x) on X. From Proposition 2 and Lemma 3, we have that for any ε > 0, there exists k(ε) > 0 such that whenever K > k(ε),

sup_{x∈X} dist(y_K(x), S(x)) ≤ ε.

As S(x) is a single-valued mapping on X, we have ϕ_K(x) → ϕ(x) for any x ∈ X as K → ∞.

In the following two propositions, we assume that f(x, ·) : R^m → R is L_f-smooth and convex, and that s_l ≤ 1/L_f.

C.3. Proof of Proposition 4

Proof. This proposition can be directly obtained from Theorem 10.21 and Theorem 10.23 of (Beck, 2017).

Then, in the following proposition, we can immediately verify the required assumption on {f(x, y_K(x))} in the absence of strong convexity of the LL objective.

C.4. Proof of Proposition 5

Proof. By the same arguments as in the proof of Lemma 3, we can show that S(x) is locally bounded at each point of X under Assumption 2. As X is compact, ∪_{x∈X} S(x) is bounded. Then the conclusion follows directly from Proposition 4.

D. Extended Theoretical Results

D.1. Local Convergence Results

In this part, we analyze the local convergence behavior of BDA. In fact, even if x_K is only a local minimum of ϕ_K(x) with uniform neighborhood modulus δ > 0, we can still obtain convergence results similar to those in Theorem 1. These properties are summarized in the following theorem.

Theorem 4. Suppose both the LL solution set and UL objective convergence properties (as required by Theorem 1) hold, and let x_K be a local minimum of ϕ_K(x) with uniform neighborhood modulus δ > 0. Then any limit point x̄ of the sequence {x_K} is a local minimum of ϕ, i.e., there exists δ > 0 such that

ϕ(x̄) ≤ ϕ(x),  ∀x ∈ B_δ(x̄) ∩ X.

Proof. Since X is compact, by passing to a subsequence of {x_K} we can assume without loss of generality that x_K → x̄ ∈ X and x_K ∈ B_{δ/2}(x̄). For any ε > 0, there exists k(ε) > 0 such that whenever K > k(ε), we have

sup_{x∈X} dist(y_K(x), S(x)) ≤ ε/(2L_0).

Thus, for any x ∈ X, there exists y*(x) ∈ S(x) such that

‖y_K(x) − y*(x)‖ ≤ ε/L_0.


Therefore, for any x ∈ X, we have

ϕ(x) = inf_{y∈S(x)} F(x, y) ≤ F(x, y*(x)) ≤ F(x, y_K(x)) + L_0‖y_K(x) − y*(x)‖ ≤ ϕ_K(x) + ε.

This implies that, for any ε > 0, there exists k(ε) > 0 such that whenever K > k(ε), we have

ϕ(x) ≤ ϕ_K(x) + ε,  ∀x ∈ X,  and in particular  ϕ(x_K) ≤ ϕ_K(x_K) + ε.

Next, as x_K is a local minimum of ϕ_K(x) with uniform neighborhood modulus δ, it follows that

ϕ_K(x_K) ≤ ϕ_K(x),  ∀x ∈ B_δ(x_K) ∩ X.

Since B_{δ/2}(x̄) ⊆ B_{δ/2+‖x_K−x̄‖}(x_K) ⊆ B_δ(x_K) (recall that x_K ∈ B_{δ/2}(x̄)), we have that for any ε > 0 and any x ∈ B_{δ/2}(x̄) ∩ X, there exists k(ε) > 0 such that whenever K > k(ε),

ϕ(x_K) ≤ ϕ_K(x_K) + ε ≤ ϕ_K(x) + ε.

Taking K → ∞ and using the LSC of ϕ, for all x ∈ B_{δ/2}(x̄) ∩ X we have

ϕ(x̄) ≤ lim inf_{K→∞} ϕ(x_K) ≤ lim inf_{K→∞} ϕ_K(x_K) + ε ≤ lim_{K→∞} ϕ_K(x) + ε = ϕ(x) + ε.

By taking ε → 0, we have

ϕ(x̄) ≤ ϕ(x),  ∀x ∈ B_{δ/2}(x̄) ∩ X,

which implies x̄ ∈ arg min_{x∈B_{δ/2}(x̄)∩X} ϕ(x), i.e., x̄ is a local minimum of ϕ.

D.2. Nonsmooth LL Objective

It is well known that a variety of nonsmooth regularization techniques (e.g., ℓ_1-norm regularization) are widely used in learning and vision applications. In this section, we therefore briefly discuss a potential extension of BDA to BLPs with a nonsmooth LL objective, i.e.,

S(x) = arg min_y h(x, y),  with  h(x, y) = f(x, y) + g(x, y). (12)

Here we consider f to have the same properties as in the above analysis, while g is convex (but not necessarily smooth) w.r.t. y and continuous w.r.t. (x, y). Since g need not be differentiable w.r.t. y, existing gradient-based first-order BLP methods are not applicable to this problem. Fortunately, we demonstrate that, by slightly modifying our inner updating rule, BDA can be directly

extended to address BLPs with the nonsmooth LL objective in Eq. (12). Specifically, we first write the descent direction of the LL subproblem as

d^h_k(x) = y_k − prox_{s_l g(x,·)}( y_k − s_l ∇_y f(x, y_k) ),

where prox_{s_l g(x,·)} denotes the proximal operator w.r.t. the nonsmooth function g(x, ·) with step size s_l. Then, by aggregating d^F_k(x) and d^h_k(x), we derive a new T_k to handle BLPs with the nonsmooth composite LL objective h, i.e.,

T_{k+1}(x, y_k(x)) = y_k − ( α_k d^h_k(x) + (1 − α_k) d^F_k(x) ), (13)

where α_k ∈ (0, 1]. In fact, since explicitly estimating the subgradient information of some proximal operators may be computationally infeasible in practice, one may apply automatic differentiation through the dynamical system with approximation techniques (Wang et al., 2016; Rajeswaran et al., 2019) to obtain dϕ_K/dx, where ϕ_K(x) = F(x, y_K(x)).
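To make the modified inner rule concrete, the following sketch (ours, not from the paper) implements K passes of the aggregated update in Eq. (13) in plain NumPy. It assumes user-supplied gradients ∇_y f and ∇_y F, a proximal operator for g(x, ·), and step sizes s_l, s_u; mirroring the smooth case, it takes d^F_k(x) = s_u ∇_y F(x, y_k), and it reads the schedule of Proposition 6 as α_k = min{2γ/(k(1−β)), 1−ε}. Both of these readings, as well as the ℓ_1 example in the trailing comment, are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1 (soft-thresholding); used only in the
    # illustrative l1 example below, not prescribed by the paper.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def inner_updates_nonsmooth(x, y0, grad_y_f, grad_y_F, prox_g, s_l, s_u, K,
                            beta, gamma=1.0, eps=1e-2):
    """Run K aggregated updates of Eq. (13) for the composite LL objective h = f + g.

    grad_y_f(x, y), grad_y_F(x, y): gradients of f and F w.r.t. y (user-supplied).
    prox_g(x, v, s):                prox of s * g(x, .) evaluated at v (user-supplied).
    """
    y = np.array(y0, dtype=float)
    for k in range(1, K + 1):
        # Schedule of Proposition 6, read as alpha_k = min{2*gamma / (k*(1-beta)), 1-eps}.
        alpha_k = min(2.0 * gamma / (k * (1.0 - beta)), 1.0 - eps)
        # Proximal-gradient direction of the LL objective h.
        d_h = y - prox_g(x, y - s_l * grad_y_f(x, y), s_l)
        # UL descent direction; d_F = s_u * grad_y F(x, y) is assumed here,
        # mirroring the smooth-case aggregation.
        d_F = s_u * grad_y_F(x, y)
        # Aggregated update T_{k+1}(x, y_k), Eq. (13).
        y = y - (alpha_k * d_h + (1.0 - alpha_k) * d_F)
    return y

# Illustrative usage with g(x, y) = x * ||y||_1 (a hypothetical choice), so that
# prox of s * g(x, .) is soft-thresholding with threshold s * x:
# yK = inner_updates_nonsmooth(
#     x=0.1, y0=np.ones(5),
#     grad_y_f=lambda x, y: y,            # f(x, y) = 0.5 * ||y||^2,    L_f = 1
#     grad_y_F=lambda x, y: y - 2.0,      # F(x, y) = 0.5 * ||y - 2||^2, sigma = L_F = 1
#     prox_g=lambda x, v, s: soft_threshold(v, s * x),
#     s_l=0.5, s_u=0.5, K=100, beta=0.71) # beta ~ sqrt(0.5) for these constants
```

As α_k decays, the update is increasingly dominated by the proximal-gradient direction of the LL objective, consistent with the aggregation idea of BDA.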

We are now in a position to extend the convergence properties of BDA, with the modified update in Eq. (13), from the smooth LL case to the nonsmooth LL case. Similar to the discussion in the smooth case, our analysis follows the roadmap below:

Step 1: Denoting S̄(x) = arg min_{y∈S(x)} F(x, y) and, further, h*(x) = min_y h(x, y), as extensions of Lemma 1 and Lemma 2 we shall derive the boundedness of S̄(x) and the USC of h*(x) for the nonsmooth LL case, respectively. The proofs are straightforward and purely technical, and thus omitted here.

Step 2: As an extension of Proposition 1, which focuses on the smooth case, we may derive the following convergence results regarding {y_K(x)}, in light of the general results stated in (Sabach & Shtern, 2017).

Proposition 6. Suppose Assumption 1 is satisfied, g is continuous and convex w.r.t. y, and let {y_K} be defined as in Eq. (13) with s_l ∈ (0, 1/L_f], s_u ∈ (0, 2/(L_F + σ)],

α_k = min{ 2γ/k(1 − β), 1 − ε },

with k ≥ 1, ε > 0, γ ∈ (0, 1], and

β = √(1 − 2s_u σ L_F/(σ + L_F)).

Denote

ȳ_K(x) = prox_{s_l g(x,·)}( y_K(x) − s_l ∇_y f(x, y_K(x)) )

and

C_{y*(x)} = max{ ‖y_0 − y*(x)‖, (s_u/(1 − β)) ‖∇_y F(x, y*(x))‖ },


with y*(x) ∈ S(x) and x ∈ X. Then it holds that

‖y_K(x) − y*(x)‖ ≤ C_{y*(x)},

‖y_K(x) − ȳ_K(x)‖ ≤ 2C_{y*(x)}(J + 2)/(K(1 − β)),

h(x, ȳ_K(x)) − h*(x) ≤ 2C_{y*(x)}²(J + 2)/(K(1 − β)s_l),

where J = ⌊2/(1 − β)⌋. Further, y_K(x) converges to S(x) as K → ∞ for any x ∈ X.
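For reference, the constants and O(1/K) rates in Proposition 6 can be evaluated numerically. The helper below is an illustrative snippet (not from the paper); L_F, σ, s_u, s_l, and C_{y*(x)} must be supplied by the user, and the bound expressions follow the displays above.

```python
import math

def prop6_rates(L_F, sigma, s_u, s_l, C_ystar, K):
    """Evaluate beta, J, and the two O(1/K) bounds of Proposition 6."""
    beta = math.sqrt(1.0 - 2.0 * s_u * sigma * L_F / (sigma + L_F))
    J = math.floor(2.0 / (1.0 - beta))
    # Bound on ||y_K(x) - bar{y}_K(x)||.
    iterate_gap = 2.0 * C_ystar * (J + 2) / (K * (1.0 - beta))
    # Bound on h(x, bar{y}_K(x)) - h*(x).
    objective_gap = 2.0 * C_ystar ** 2 * (J + 2) / (K * (1.0 - beta) * s_l)
    return beta, J, iterate_gap, objective_gap

# Example: L_F = sigma = 1, s_u = s_l = 0.5, C_ystar = 1, K = 1000 gives
# beta ~ 0.707 and J = 6, with both bounds shrinking like 1/K.
```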

Step 3: Taking a closer look at the proofs of Proposition 2 and Proposition 3, we observe that the techniques used there hardly rely on the smoothness of the LL objective. Therefore, straightforward extensions of Proposition 2 and Proposition 3 to the nonsmooth case yield the desired uniform convergence of y_K(x) and the UL objective convergence, respectively.

Step 4: Similar to the arguments in the proof of Theorem 2, by combining Step 1 and Step 2 we obtain the LL solution set and UL objective convergence properties, and hence the analysis framework of Theorem 1 is activated. Therefore, the same convergence results concerning {x_K}_{K∈N} and {ϕ_K(x)} can be achieved, as stated below.

Theorem 5. Suppose Assumption 1 is satisfied, g is continuous and convex w.r.t. y, and let {y_K} be defined as in Eq. (13) with s_l ∈ (0, 1/L_f], s_u ∈ (0, 2/(L_F + σ)],

α_k = min{ 2γ/k(1 − β), 1 − ε },

with k ≥ 1, ε > 0, γ ∈ (0, 1], and

β = √(1 − 2s_u σ L_F/(σ + L_F)).

Assume further that S(x) is nonempty for any x ∈ X and S(x) is continuous on X. Then

(1) if x_K ∈ arg min_{x∈X} ϕ_K(x), we have the same results as in Theorem 1;

(2) if x_K is a local minimum of ϕ_K(x) with uniform neighborhood modulus δ > 0, then any limit point x̄ of the sequence {x_K} is a local minimum of ϕ.
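As a rough end-to-end illustration of how x_K ≈ arg min_{x∈X} ϕ_K(x) might be computed in practice, the sketch below wraps an inner solver such as inner_updates_nonsmooth above in a projected-gradient outer loop. The hypergradient dϕ_K/dx is approximated here by forward finite differences purely for simplicity; the paper instead suggests automatic differentiation through the inner dynamical system (Wang et al., 2016; Rajeswaran et al., 2019). The step size, iteration count, and projection are all illustrative assumptions, not part of the paper's implementation.

```python
import numpy as np

def outer_loop(x0, project_X, phi_K, s_x=0.01, T=200, fd_eps=1e-5):
    """Approximately minimize phi_K(x) = F(x, y_K(x)) over X by projected gradient.

    phi_K:     callable x -> F(x, y_K(x)), e.g. built on top of inner_updates_nonsmooth.
    project_X: Euclidean projection onto the compact UL feasible set X.
    """
    x = np.array(x0, dtype=float)
    for _ in range(T):
        base = phi_K(x)
        grad = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x)
            e[i] = fd_eps
            # Forward difference in coordinate i: a crude stand-in for
            # differentiating through the inner dynamics.
            grad[i] = (phi_K(x + e) - base) / fd_eps
        x = project_X(x - s_x * grad)  # projected (hyper)gradient step on phi_K
    return x

# Example projection onto a box X = [0, 1]^n:
# project_X = lambda x: np.clip(x, 0.0, 1.0)
```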