
Non-monotone Continuous DR-submodular Maximization: Structure and Algorithms

    An Bian, Kfir Y. Levy, Andreas Krause and Joachim M. Buhmann

    DR-submodular (Diminishing Returns) Maximization & Its Applications

    πŸ‘‰ Softmax extension for determinantal point processes (DPPs) [Gillenwater et al β€˜12]

    πŸ‘‰ Mean-field inference for log-submodular models [Djolonga et al β€˜14]

    πŸ‘‰ DR-submodular quadratic programming

    πŸ‘‰ Generalized submodularity over conic lattices, e.g., logistic regression with a non-convex separable regularizer [Antoniadis et al β€˜11]

    πŸ‘‰ Etc. (see the paper for more)

    Based on the Local-Global Relation, any solver that finds an approximately stationary point can be used as the subroutine, e.g., the non-convex Frank-Wolfe solver of [Lacoste-Julien β€˜16].

    TWO-PHASE ALGORITHM
    Input: stopping tolerances Ρ₁, Ξ΅β‚‚; #iterations K₁, Kβ‚‚
    𝒙 ← Non-convex Frank-Wolfe(f, 𝒫, K₁, Ρ₁) // Phase I on 𝒫
    𝒬 ← 𝒫 ∩ {π’š | π’š ≀ 𝒖̄ βˆ’ 𝒙}
    𝒛 ← Non-convex Frank-Wolfe(f, 𝒬, Kβ‚‚, Ξ΅β‚‚) // Phase II on 𝒬
    Output: argmax{f(𝒙), f(𝒛)}
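    A minimal Python sketch of this two-phase scheme. The helper `nonconvex_frank_wolfe(grad_f, lmo, K, eps, upper)` is a hypothetical stand-in for the non-convex Frank-Wolfe subroutine of [Lacoste-Julien β€˜16]; the extra `upper` box bound is how this sketch realizes the restriction to 𝒬:

```python
import numpy as np

def two_phase(f, grad_f, lmo, u_bar, K1, eps1, K2, eps2, nonconvex_frank_wolfe):
    """Sketch of the TWO-PHASE ALGORITHM; `nonconvex_frank_wolfe` is an assumed
    solver returning an approximately stationary point of f over the region
    {v feasible for lmo : v <= upper}."""
    # Phase I: approximately stationary point on P
    x = nonconvex_frank_wolfe(grad_f, lmo, K1, eps1, upper=u_bar)
    # Phase II: run the same solver on Q = P ∩ {y : y <= u_bar - x}
    z = nonconvex_frank_wolfe(grad_f, lmo, K2, eps2, upper=u_bar - x)
    # Return the better of the two candidate points
    return x if f(x) >= f(z) else z
```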

    Underlying Properties of DR-submodular Maximization

    πŸ‘‰ Concavity Along Non-negative Directions:

    Experimental Results (see the paper for more)

    DR-submodular (DR property) [Bian et al β€˜17]: βˆ€ 𝒂 ≀ 𝒃 ∈ 𝒳, βˆ€ i, βˆ€ k ∈ ℝ₊, it holds

    f(k𝒆ᡒ + 𝒂) βˆ’ f(𝒂) β‰₯ f(k𝒆ᡒ + 𝒃) βˆ’ f(𝒃).

    - If f is differentiable, βˆ‡f(Β·) is an antitone mapping (βˆ€ 𝒂 ≀ 𝒃, it holds βˆ‡f(𝒂) β‰₯ βˆ‡f(𝒃))

    - If f is twice differentiable, all Hessian entries are non-positive: βˆ‡Β²α΅’β±Ό f(𝒙) ≀ 0, βˆ€ 𝒙, βˆ€ i, j
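    As an illustration (not from the poster), the DR inequality can be spot-checked numerically for a candidate f; `check_dr_property` is a hypothetical helper:

```python
import numpy as np

def check_dr_property(f, a, b, i, k, tol=1e-9):
    """Spot-check f(k*e_i + a) - f(a) >= f(k*e_i + b) - f(b) for a <= b, k >= 0."""
    assert np.all(a <= b) and k >= 0
    e_i = np.zeros_like(a)
    e_i[i] = 1.0
    return (f(a + k * e_i) - f(a)) >= (f(b + k * e_i) - f(b)) - tol

# Example: a concave separable function is DR-submodular on the positive orthant
f = lambda x: np.sum(np.sqrt(x))
print(check_dr_property(f, np.array([0.1, 0.2]), np.array([0.5, 0.9]), i=0, k=0.3))
```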

    maxπ’™βˆˆπ’«

    𝑓(𝒙)𝑓: 𝒳 β†’ ℝ is continuous DR-submodular. 𝒳 is a hypercube. Wlog, let 𝒳 = 𝟎, 𝒖0 . 𝒫 βŠ† 𝒳 is convex and down-closed: 𝒙 ∈ 𝒫 & 𝟎 ≀ π’š ≀ 𝒙 implies π’š ∈ 𝒫.

    References

    Feldman, Naor, and Schwartz. A unified continuous greedy algorithm for submodular maximization. FOCS 2011.

    Gillenwater, Kulesza, and Taskar. Near-optimal MAP inference for determinantal point processes. NIPS 2012.

    Bach. Submodular functions: from discrete to continuous domains. arXiv:1511.00394, 2015.

    Lacoste-Julien. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv:1607.00345, 2016.

    Bian, Mirzasoleiman, Buhmann, and Krause. Guaranteed non-convex optimization: Submodular maximization over continuous domains. AISTATS 2017.

    Quadratic Lower Bound. With an L-Lipschitz gradient, for all 𝒙 and 𝒗 ∈ ±ℝ₊ⁿ, it holds
    f(𝒙 + 𝒗) β‰₯ f(𝒙) + βŸ¨βˆ‡f(𝒙), π’—βŸ© βˆ’ (L/2)‖𝒗‖².

    Strongly DR-submodular & Quadratic Upper Bound. f is ΞΌ-strongly DR-submodular if for all 𝒙 and 𝒗 ∈ ±ℝ₊ⁿ, it holds
    f(𝒙 + 𝒗) ≀ f(𝒙) + βŸ¨βˆ‡f(𝒙), π’—βŸ© βˆ’ (ΞΌ/2)‖𝒗‖².

    Two Guaranteed Algorithms

    Guarantee of TWO-PHASE ALGORITHM.

    max{f(𝒙), f(𝒛)} β‰₯ (ΞΌ/8)(‖𝒙 βˆ’ 𝒙*β€–Β² + ‖𝒛 βˆ’ 𝒛*β€–Β²) + (1/4)[f(𝒙*) βˆ’ min{O(1/√(K₁+1)), Ρ₁} βˆ’ min{O(1/√(Kβ‚‚+1)), Ξ΅β‚‚}],
    where 𝒛* ≔ 𝒙 ∨ 𝒙* βˆ’ 𝒙.

    NON-MONOTONE FRANK-WOLFE VARIANT
    Input: step size Ξ³ ∈ (0,1]
    𝒙^(0) ← 𝟎, k ← 0, t^(0) ← 0 // t: cumulative step size
    While t^(k) < 1 do:
        𝒗^(k) ← argmax_{𝒗 ∈ 𝒫, 𝒗 ≀ 𝒖̄ βˆ’ 𝒙^(k)} βŸ¨π’—, βˆ‡f(𝒙^(k))⟩ // shrunken LMO
        Ξ³β‚– ← min{Ξ³, 1 βˆ’ t^(k)}
        𝒙^(k+1) ← 𝒙^(k) + γₖ𝒗^(k), t^(k+1) ← t^(k) + Ξ³β‚–, k++
    Output: 𝒙^(K)
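    A runnable sketch of the variant for a polytope 𝒫 = {𝒙 β‰₯ 0 : 𝐀𝒙 ≀ 𝒃, 𝒙 ≀ 𝒖̄}, using scipy's LP solver for the shrunken LMO (helper names and the default Ξ³ are illustrative choices, not the paper's):

```python
import numpy as np
from scipy.optimize import linprog

def shrunken_lmo(grad, A, b, upper):
    """argmax_v <v, grad> s.t. A v <= b, 0 <= v <= upper (the shrunken LMO)."""
    res = linprog(-grad, A_ub=A, b_ub=b, bounds=[(0.0, u) for u in upper])
    return res.x

def nonmonotone_frank_wolfe(grad_f, A, b, u_bar, gamma=0.05):
    x, t = np.zeros_like(u_bar), 0.0  # x^(0) = 0, cumulative step size t = 0
    while t < 1.0:
        v = shrunken_lmo(grad_f(x), A, b, u_bar - x)  # shrink the upper bound
        step = min(gamma, 1.0 - t)                    # last step lands on t = 1
        x, t = x + step * v, t + step
    return x
```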

    Guarantee of NON-MONOTONE FRANK-WOLFE VARIANT.

    f(𝒙^(K)) β‰₯ e⁻¹ f(𝒙*) βˆ’ O(1/KΒ²) f(𝒙*) βˆ’ (LDΒ²)/(2K)

    Baselines:
    - QUADPROGIP: global solver for non-convex quadratic programming (possibly exponential time)
    - Projected Gradient Ascent (PROJGRAD) with diminishing step sizes 1/(k+1)

    DR-submodular Quadratic Programming. Synthetic problem instances
    f(𝒙) = Β½ 𝒙ᡀ𝐇𝒙 + 𝒉ᡀ𝒙 + c, 𝒫 = {𝒙 ∈ ℝ₊ⁿ | 𝐀𝒙 ≀ 𝒃, 𝒙 ≀ 𝒖̄}, 𝐀 ∈ ℝ₊^(mΓ—n), 𝒃 ∈ ℝ₊^m, so 𝒫 has m linear constraints.
    Instances are randomly generated in two manners: 1) uniform distribution (see the figures below); 2) exponential distribution.
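    For example, one way to draw such an instance in the uniform manner (a sketch; the exact distributions, ranges, and the βˆ’0.2 offset here are assumptions and may differ from the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_dr_qp(n, m):
    """Random DR-submodular QP instance; parameter ranges here are assumptions."""
    H = -rng.uniform(0.01, 1.0, size=(n, n))   # entrywise H <= 0 => DR-submodular
    H = 0.5 * (H + H.T)                        # symmetrize
    A = rng.uniform(0.01, 1.0, size=(m, n))    # non-negative linear constraints
    b = np.ones(m)
    u_bar = np.min(b[:, None] / A, axis=0)     # tightest per-coordinate box bound
    h = -0.2 * (H @ u_bar)                     # offset making f non-monotone
    f = lambda x: 0.5 * x @ H @ x + h @ x
    grad_f = lambda x: H @ x + h
    return f, grad_f, A, b, u_bar
```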

    Maximizing Softmax Extensions for MAP inference of DPPs.
    f(𝒙) = log det(diag(𝒙)(𝐋 βˆ’ 𝐈) + 𝐈), 𝒙 ∈ [0,1]ⁿ,
    where 𝐋 is the kernel/similarity matrix. 𝒫 is a matching polytope for matched summarization.

    Synthetic problem instances:
    - Softmax objectives: generate 𝐋 with n random eigenvalues
    - Generate polytope constraints similarly to those for quadratic programming
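    A sketch of drawing a random-eigenvalue kernel and evaluating the softmax extension (the eigenvalue range is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_softmax_instance(n):
    """Random PSD kernel L with n random eigenvalues, plus its softmax extension."""
    Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # random orthonormal basis
    L = (Q * rng.uniform(0.0, 1.5, size=n)) @ Q.T  # L = Q diag(eigvals) Q^T
    def f(x):
        # f(x) = log det(diag(x)(L - I) + I); slogdet avoids overflow
        _, logdet = np.linalg.slogdet(np.diag(x) @ (L - np.eye(n)) + np.eye(n))
        return logdet
    return f, L

f, L = random_softmax_instance(8)
print(f(0.5 * np.ones(8)))
```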

    Real-world results on matched summarization: select a set of document pairs out of a corpus of documents, such that the two documents within a pair are similar, and the overall set of pairs is as diverse as possible. In a setting similar to [Gillenwater et al β€˜12], we experimented on the 2012 US Republican debates data.

    [Figures: function value vs. match quality controller (0.2–1), and function value vs. iteration (0–100), for matched summarization]

    [Diagram: relation of the DR-submodular class to the submodular, concave, and convex function classes]

    πŸ‘‰ Approximately Stationary Points & Global Optimum:

    (Local-Global Relation). Let 𝒙 ∈ 𝒫 with non-stationarity g_𝒫(𝒙). Define 𝒬 ≔ 𝒫 ∩ {π’š | π’š ≀ 𝒖̄ βˆ’ 𝒙}. Let 𝒛 ∈ 𝒬 with non-stationarity g_𝒬(𝒛). Then

    max{f(𝒙), f(𝒛)} β‰₯ (1/4)[f(𝒙*) βˆ’ g_𝒫(𝒙) βˆ’ g_𝒬(𝒛)] + (ΞΌ/8)(‖𝒙 βˆ’ 𝒙*β€–Β² + ‖𝒛 βˆ’ 𝒛*β€–Β²),

    where 𝒛* ≔ 𝒙 ∨ 𝒙* βˆ’ 𝒙.

    - Proof using the essential DR property on carefully constructed auxiliary points

    - Good empirical performance of the Two-Phase algorithm: if 𝒙 is far from 𝒙*, the term ‖𝒙 βˆ’ 𝒙*β€–Β² augments the bound; if 𝒙 is close to 𝒙*, then by the smoothness of f, f(𝒙) should be near-optimal.

    Abstract

    DR-submodularity captures a subclass of non-convex/non-concave functions that enables exact minimization and approximate maximization in polynomial time.

    πŸ‘‰ Investigate geometric properties that underlie such objectives; e.g., we prove a strong relation between (approximately) stationary points and the global optimum.

    πŸ‘‰ Devise two guaranteed algorithms: i) a β€œtwo-phase” algorithm with a ΒΌ approximation guarantee; ii) a non-monotone Frank-Wolfe variant with a 1/e approximation guarantee.

    πŸ‘‰ Extend to a much broader class of submodular functions on β€œconic” lattices.


    [Figures: approximation ratio vs. dimensionality (n = 8–16), for m = 0.5n, m = n, and m = 1.5n]

    [Figures: function value vs. dimensionality (n = 8–16), for m = 0.5n, m = n, and m = 1.5n]

    The shrunken LMO is the key difference from the monotone Frank-Wolfe variant of [Bian et al β€˜17].

    Lemma. For any 𝒙, π’š: βŸ¨π’š βˆ’ 𝒙, βˆ‡f(𝒙)⟩ β‰₯ f(𝒙 ∨ π’š) + f(𝒙 ∧ π’š) βˆ’ 2f(𝒙) + (ΞΌ/2)‖𝒙 βˆ’ π’šβ€–Β²

    If βˆ‡f(𝒙) = 𝟎, then 2f(𝒙) β‰₯ f(𝒙 ∨ π’š) + f(𝒙 ∧ π’š) + (ΞΌ/2)‖𝒙 βˆ’ π’šβ€–Β² → an implicit relation between 𝒙 & π’š. (Finding an exact stationary point is difficult 😟)

    Non-stationarity Measure [Lacoste-Julien β€˜16]. For any 𝒬 βŠ† 𝒳, the non-stationarity of 𝒙 ∈ 𝒬 is
    g_𝒬(𝒙) ≔ max_{𝒗 ∈ 𝒬} βŸ¨π’— βˆ’ 𝒙, βˆ‡f(𝒙)⟩.
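    This measure reduces to a single linear program over 𝒬; a sketch for 𝒬 = {𝒗 β‰₯ 0 : 𝐀𝒗 ≀ 𝒃, 𝒗 ≀ 𝒖̄} (helper name is illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def non_stationarity(x, grad, A, b, u_bar):
    """g_Q(x) = max_{v in Q} <v - x, grad f(x)>, computed with one LP."""
    res = linprog(-grad, A_ub=A, b_ub=b, bounds=[(0.0, u) for u in u_bar])
    return float(-res.fun - grad @ x)  # max <v, grad> minus <x, grad>
```

    For 𝒙 ∈ 𝒬 the value is non-negative, and it is zero exactly at stationary points.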

    (∨: coordinate-wise max; ∧: coordinate-wise min)

    D: diameter of 𝒫; L: Lipschitz constant of the (smooth) gradient

    [Figure: softmax (red) & multilinear (blue) extensions, and concave cross-sections; figure from [Gillenwater et al β€˜12]]