
LECTURE NOTES

Convexity, Duality, and

Lagrange Multipliers

Dimitri P. Bertsekas

with assistance from

Angelia Geary-Nedic and Asuman Koksal

Massachusetts Institute of Technology

Spring 2001

These notes were developed for the needs of the 6.291 class at M.I.T. (Spring 2001). They are copyright-protected, but they may be reproduced freely for noncommercial purposes.


Contents

1. Convex Analysis and Optimization

1.1. Linear Algebra and Analysis
1.1.1. Vectors and Matrices
1.1.2. Topological Properties
1.1.3. Square Matrices
1.1.4. Derivatives

1.2. Convex Sets and Functions
1.2.1. Basic Properties
1.2.2. Convex and Affine Hulls
1.2.3. Closure, Relative Interior, and Continuity
1.2.4. Recession Cones

1.3. Convexity and Optimization
1.3.1. Local and Global Minima
1.3.2. The Projection Theorem
1.3.3. Directions of Recession and Existence of Optimal Solutions
1.3.4. Existence of Saddle Points

1.4. Hyperplanes
1.5. Conical Approximations and Constrained Optimization
1.6. Polyhedral Convexity

1.6.1. Polyhedral Cones
1.6.2. Polyhedral Sets
1.6.3. Extreme Points
1.6.4. Extreme Points and Linear Programming

1.7. Subgradients
1.7.1. Directional Derivatives
1.7.2. Subgradients and Subdifferentials
1.7.3. ε-Subgradients
1.7.4. Subgradients of Extended Real-Valued Functions
1.7.5. Directional Derivative of the Max Function

1.8. Optimality Conditions
1.9. Notes and Sources



2. Lagrange Multipliers

2.1. Introduction to Lagrange Multipliers
2.2. Enhanced Fritz John Optimality Conditions
2.3. Informative Lagrange Multipliers
2.4. Pseudonormality and Constraint Qualifications
2.5. Exact Penalty Functions
2.6. Using the Extended Representation
2.7. Extensions to the Nondifferentiable Case
2.8. Notes and Sources

3. Lagrangian Duality

3.1. Geometric Multipliers
3.2. Duality Theory
3.3. Linear and Quadratic Programming Duality
3.4. Strong Duality Theorems

3.4.1. Convex Cost – Linear Constraints
3.4.2. Convex Cost – Convex Constraints

3.5. Notes and Sources

4. Conjugate Duality and Applications

4.1. Conjugate Functions
4.2. The Fenchel Duality Theorem
4.3. The Primal Function and Sensitivity Analysis
4.4. Exact Penalty Functions
4.5. Notes and Sources

5. Dual Computational Methods

5.1. Dual Subgradients
5.2. Subgradient Methods

5.2.1. Analysis of Subgradient Methods
5.2.2. Subgradient Methods with Randomization

5.3. Cutting Plane Methods
5.4. Ascent Methods
5.5. Notes and Sources


Preface

These lecture notes were developed for the needs of a graduate course at the Electrical Engineering and Computer Science Department at M.I.T. They focus selectively on a number of fundamental analytical and computational topics in (deterministic) optimization that span a broad range from continuous to discrete optimization, but are connected through the recurring theme of convexity, Lagrange multipliers, and duality. These topics include Lagrange multiplier theory, Lagrangian and conjugate duality, and nondifferentiable optimization. The notes contain substantial portions that are adapted from my textbook “Nonlinear Programming: 2nd Edition,” Athena Scientific, 1999. However, the notes are more advanced, more mathematical, and more research-oriented.

As part of the course I have also decided to develop in detail those aspects of the theory of convex sets and functions that are essential for an in-depth coverage of Lagrange multipliers and duality. I have long thought that convexity, aside from being an eminently useful subject in engineering and operations research, is also an excellent vehicle for assimilating some of the basic concepts of analysis within an intuitive geometrical setting. Unfortunately, the subject’s coverage in mathematics and engineering curricula is scant and incidental. I believe that at least part of the reason is that while there are a number of excellent books on convexity, as well as a true classic (Rockafellar’s 1970 book), none of them is well suited for teaching nonmathematicians who form the largest part of the potential audience.

I have therefore tried in these notes to make convex analysis accessible by limiting somewhat its scope and by emphasizing its geometrical character, while at the same time maintaining mathematical rigor. The coverage of the theory is significantly extended in the exercises, whose detailed solutions are posted on the internet. I have included as many insightful illustrations as I could come up with, and I have tried to use geometric visualization as a principal tool for maintaining the students’ interest in mathematical proofs. To highlight a contrast in style, Rockafellar’s marvelous book contains no figures at all!



An important part of my approach has been to maintain a close link between the theoretical treatment of convexity concepts and their application to optimization. For example, in Chapter 1, soon after the development of some of the basic facts about convexity, I discuss some of their applications to optimization and saddle point theory; soon after the discussion of hyperplanes and cones, I discuss conical approximations and necessary conditions for optimality; soon after the discussion of polyhedral convexity, I discuss its application in linear programming; and soon after the discussion of subgradients, I discuss their use in optimality conditions. I follow this style consistently in the remaining chapters, although, having developed in Chapter 1 most of the needed convexity theory, the discussion in the subsequent chapters is more heavily weighted towards optimization.

In addition to their educational purpose, these notes aim to develop two topics that I have recently researched with two of my students, and to integrate them into the overall landscape of convexity, duality, and optimization. These topics are:

(a) A new approach to Lagrange multiplier theory, based on a set of enhanced Fritz John conditions and the notion of constraint pseudonormality. This work, joint with my Ph.D. student Asuman Koksal, aims to generalize, unify, and streamline the theory of constraint qualifications. It allows for an abstract set constraint (in addition to equalities and inequalities), it highlights the essential structure of constraints using the new notion of pseudonormality, and it develops the connection between Lagrange multipliers and exact penalty functions.

(b) A new approach to the computational solution of (nondifferentiable) dual problems via incremental subgradient methods. These methods, developed jointly with my Ph.D. student Angelia Geary-Nedic, include some interesting randomized variants which, according to both analysis and experiment, perform substantially better than the standard subgradient methods for large scale problems that typically arise in the context of duality.

The lecture notes may be freely reproduced and distributed for noncommercial purposes. They represent work in progress, and your feedback and suggestions for improvements in content and style will be most welcome.

Dimitri P. Bertsekas
[email protected]
Spring 2001


1

Convex Analysis and Optimization

Date: June 10, 2001

Contents

1.1. Linear Algebra and Analysis
1.1.1. Vectors and Matrices
1.1.2. Topological Properties
1.1.3. Square Matrices
1.1.4. Derivatives

1.2. Convex Sets and Functions
1.2.1. Basic Properties
1.2.2. Convex and Affine Hulls
1.2.3. Closure, Relative Interior, and Continuity
1.2.4. Recession Cones

1.3. Convexity and Optimization
1.3.1. Local and Global Minima
1.3.2. The Projection Theorem
1.3.3. Directions of Recession and Existence of Optimal Solutions
1.3.4. Existence of Saddle Points

1.4. Hyperplanes



1.5. Conical Approximations and Constrained Optimization
1.6. Polyhedral Convexity

1.6.1. Polyhedral Cones
1.6.2. Polyhedral Sets
1.6.3. Extreme Points
1.6.4. Extreme Points and Linear Programming

1.7. Subgradients
1.7.1. Directional Derivatives
1.7.2. Subgradients and Subdifferentials
1.7.3. ε-Subgradients
1.7.4. Subgradients of Extended Real-Valued Functions
1.7.5. Directional Derivative of the Max Function

1.8. Optimality Conditions
1.9. Notes and Sources


In this chapter we provide the mathematical background for this book. In Section 1.1, we list some basic definitions, notational conventions, and results from linear algebra and analysis. We assume that the reader is familiar with this material, so no proofs are given. In the remainder of the chapter, we focus on convex analysis with an emphasis on optimization-related topics. We assume no prior knowledge of the subject, and we provide proofs and a fairly detailed development.

For related and additional material, we recommend the books by Hoffman and Kunze [HoK71], Lancaster and Tismenetsky [LaT85], and Strang [Str76] (linear algebra), the books by Ash [Ash72], Ortega and Rheinboldt [OrR70], and Rudin [Rud76] (analysis), and the books by Rockafellar [Roc70], Ekeland and Temam [EkT76], Rockafellar [Roc84], Hiriart-Urruty and Lemarechal [HiL93], Rockafellar and Wets [RoW98], Bonnans and Shapiro [BoS00], and Borwein and Lewis [BoL00] (convex analysis).

The book by Rockafellar [Roc70], widely viewed as the classic convex analysis text, contains a deeper and more extensive development of convexity than the one given here, although it does not cross over into nonconvex optimization. The book by Rockafellar and Wets [RoW98] is a deep and detailed treatment of “variational analysis,” a broad spectrum of topics that integrate classical analysis, convexity, and optimization of both convex and nonconvex (possibly nonsmooth) functions. These two books represent important milestones in the development of optimization theory, and contain a wealth of material, a good deal of which is original. However, they are written for the advanced reader, in a style that many nonmathematicians find challenging.

As we embark on the study of convexity, it is worth listing some of the properties of convex sets and functions that make them so special in optimization.

(a) Convex functions have no local minima that are not global. Thus the difficulties associated with multiple disconnected local minima, whose global optimality is hard to verify in practice, are avoided.

(b) Convex sets are connected and have feasible directions at any point (assuming they consist of more than one point). By this we mean that given any point x in a convex set X, it is possible to move from x along some directions and stay within X for at least a nontrivial interval. In fact a stronger property holds: given any two distinct points x and x̄ in X, the direction x̄ − x is a feasible direction at x, and all feasible directions can be characterized this way. For optimization purposes, this is important because it allows a calculus-based comparison of the cost of x with the cost of its close neighbors, and forms the basis for some important algorithms. Furthermore, much of the difficulty commonly associated with discrete constraint sets (arising for example in combinatorial optimization) is not encountered under convexity.


(c) Convex sets have a nonempty relative interior. In other words, when viewed within the smallest affine set containing it, a convex set has a nonempty interior. Thus convex sets avoid the analytical and computational optimization difficulties associated with “thin” and “curved” constraint surfaces.

(d) A nonconvex function can be “convexified” while maintaining the optimality of its global minima, by forming the convex hull of the epigraph of the function.

(e) The existence of a global minimum of a convex function over a convex set is conveniently characterized in terms of directions of recession (see Section 1.3).

(f) Polyhedral convex sets (those specified by linear equality and inequality constraints) are characterized in terms of a finite set of extreme points and extreme directions. This is the basis for finitely terminating methods for linear programming, including the celebrated simplex method (see Section 1.6).

(g) Convex functions are continuous and have nice differentiability properties. In particular, a real-valued convex function is directionally differentiable at any point. Furthermore, while a convex function need not be differentiable, it possesses subgradients, which are nice and geometrically intuitive substitutes for a gradient (see Section 1.7). Just like gradients, subgradients figure prominently in optimality conditions and computational algorithms.

(h) Convex functions are central in duality theory. Indeed, the dual problem of a given optimization problem (discussed in Chapters 3 and 4) consists of minimization of a convex function over a convex set, even if the original problem is not convex.

(i) Closed convex cones are self-dual with respect to orthogonality. In words, the set of vectors orthogonal to the set C⊥ (the set of vectors that form a nonpositive inner product with all vectors in a closed and convex cone C) is equal to C. This simple and geometrically intuitive property (discussed in Section 1.5) underlies important aspects of Lagrange multiplier theory.

(j) Convex, lower semicontinuous functions are self-dual with respect to conjugacy. It will be seen in Chapter 4 that a certain geometrically motivated conjugacy operation on a given convex, lower semicontinuous function generates a convex, lower semicontinuous function, and when applied for a second time regenerates the original function. The conjugacy operation is central in duality theory, and has a nice interpretation that can be used to visualize and understand some of the most profound aspects of optimization.


Our approach in this chapter is to maintain a close link between the theoretical treatment of convexity concepts and their application to optimization. For example, soon after the development of some of the basic facts about convexity in Section 1.2, we discuss some of their applications to optimization in Section 1.3; and soon after the discussion of hyperplanes and cones in Sections 1.4 and 1.5, we discuss conditions for optimality. We follow this style consistently in the remaining chapters, although, having developed in Chapter 1 most of the convexity theory that we will need, the discussion in the subsequent chapters is more heavily weighted towards optimization.

1.1 LINEAR ALGEBRA AND ANALYSIS

Notation

If X is a set and x is an element of X, we write x ∈ X. A set can be specified in the form X = {x | x satisfies P}, as the set of all elements satisfying property P. The union of two sets X1 and X2 is denoted by X1 ∪ X2 and their intersection by X1 ∩ X2. The symbols ∃ and ∀ have the meanings “there exists” and “for all,” respectively. The empty set is denoted by Ø.

The set of real numbers (also referred to as scalars) is denoted by <. The set < augmented with +∞ and −∞ is called the set of extended real numbers. We denote by [a, b] the set of (possibly extended) real numbers x satisfying a ≤ x ≤ b. A rounded, instead of square, bracket denotes strict inequality in the definition. Thus (a, b], [a, b), and (a, b) denote the set of all x satisfying a < x ≤ b, a ≤ x < b, and a < x < b, respectively. When working with extended real numbers, we use the natural extensions of the rules of arithmetic: x · 0 = 0 for every extended real number x, x · ∞ = ∞ if x > 0, x · ∞ = −∞ if x < 0, and x + ∞ = ∞ and x − ∞ = −∞ for every scalar x. The expression ∞ − ∞ is meaningless and is never allowed to occur.

If f is a function, we use the notation f : X ↦ Y to indicate the fact that f is defined on a set X (its domain) and takes values in a set Y (its range). If f : X ↦ Y is a function, and U and V are subsets of X and Y, respectively, the set

{f(x) | x ∈ U}

is called the image or forward image of U, and the set

{x ∈ X | f(x) ∈ V}

is called the inverse image of V.

1.1.1 Vectors and Matrices

We denote by <n the set of n-dimensional real vectors. For any x ∈ <n, we use xi to indicate its ith coordinate, also called its ith component.


Vectors in <n will be viewed as column vectors, unless the contrary is explicitly stated. For any x ∈ <n, x′ denotes the transpose of x, which is an n-dimensional row vector. The inner product of two vectors x, y ∈ <n is defined by x′y = ∑_{i=1}^n x_i y_i. Any two vectors x, y ∈ <n satisfying x′y = 0 are called orthogonal.

If x is a vector in <n, the notations x > 0 and x ≥ 0 indicate that all coordinates of x are positive and nonnegative, respectively. For any two vectors x and y, the notation x > y means that x − y > 0. The notations x ≥ y, x < y, etc., are to be interpreted accordingly.

If X is a set and λ is a scalar, we denote by λX the set {λx | x ∈ X}. If X1 and X2 are two subsets of <n, we denote by X1 + X2 the vector sum

{x1 + x2 | x1 ∈ X1, x2 ∈ X2}.

We use a similar notation for the sum of any finite number of subsets. In the case where one of the subsets consists of a single vector x̄, we simplify this notation as follows:

x̄ + X = {x̄ + x | x ∈ X}.

Given sets Xi ⊂ <ni, i = 1, . . . , m, the Cartesian product of the Xi, denoted by X1 × · · · × Xm, is the subset

{(x1, . . . , xm) | xi ∈ Xi, i = 1, . . . , m}

of <n1+···+nm.

Subspaces and Linear Independence

A subset S of <n is called a subspace if ax + by ∈ S for every x, y ∈ S and every a, b ∈ <. An affine set in <n is a translated subspace, i.e., a set of the form x̄ + S = {x̄ + x | x ∈ S}, where x̄ is a vector in <n and S is a subspace of <n. The span of a finite collection {x1, . . . , xm} of elements of <n (also called the subspace generated by the collection) is the subspace consisting of all vectors y of the form y = ∑_{k=1}^m a_k x_k, where each a_k is a scalar.

The vectors x1, . . . , xm ∈ <n are called linearly independent if there exists no set of scalars a1, . . . , am, not all zero, such that ∑_{k=1}^m a_k x_k = 0. An equivalent definition is that x1 ≠ 0, and for every k > 1, the vector xk does not belong to the span of x1, . . . , xk−1.

If S is a subspace of <n containing at least one nonzero vector, a basis for S is a collection of vectors that are linearly independent and whose span is equal to S. Every basis of a given subspace has the same number of vectors. This number is called the dimension of S. By convention, the subspace {0} is said to have dimension zero. The dimension of an affine set x̄ + S is the dimension of the corresponding subspace S. Every subspace of nonzero dimension has an orthogonal basis, i.e., a basis consisting of mutually orthogonal vectors.

Given any set X, the set of vectors that are orthogonal to all elements of X is a subspace denoted by X⊥:

X⊥ = {y | y′x = 0, ∀ x ∈ X}.

If S is a subspace, S⊥ is called the orthogonal complement of S. It can be shown that (S⊥)⊥ = S (see the Polar Cone Theorem in Section 1.5). Furthermore, any vector x can be uniquely decomposed as the sum of a vector from S and a vector from S⊥ (see the Projection Theorem in Section 1.3.2).

Matrices

For any matrix A, we use Aij, [A]ij, or aij to denote its ijth element. The transpose of A, denoted by A′, is defined by [A′]ij = aji. For any two matrices A and B of compatible dimensions, the transpose of the product matrix AB satisfies (AB)′ = B′A′.

If X is a subset of <n and A is an m × n matrix, then the image of X under A is denoted by AX (or A · X if this enhances notational clarity):

AX = {Ax | x ∈ X}.

If X is a subspace, then AX is also a subspace.

Let A be a square matrix. We say that A is symmetric if A′ = A. We say that A is diagonal if [A]ij = 0 whenever i ≠ j. We use I to denote the identity matrix. The determinant of A is denoted by det(A).

Let A be an m × n matrix. The range space of A, denoted by R(A), is the set of all vectors y ∈ <m such that y = Ax for some x ∈ <n. The null space of A, denoted by N(A), is the set of all vectors x ∈ <n such that Ax = 0. It is seen that the range space and the null space of A are subspaces. The rank of A is the dimension of the range space of A. The rank of A is equal to the maximal number of linearly independent columns of A, and is also equal to the maximal number of linearly independent rows of A. The matrix A and its transpose A′ have the same rank. We say that A has full rank if its rank is equal to min{m, n}. This is true if and only if either all the rows of A are linearly independent, or all the columns of A are linearly independent.

The range of an m × n matrix A and the orthogonal complement of the nullspace of its transpose are equal, i.e.,

R(A) = N(A′)⊥.

Another way to state this result is that given vectors a1, . . . , an ∈ <m (the columns of A) and a vector x ∈ <m, we have x′y = 0 for all y such that a_i′y = 0 for all i if and only if x = λ1a1 + · · · + λnan for some scalars λ1, . . . , λn. This is a special case of Farkas’ lemma, an important result for constrained optimization, which will be discussed later in Section 1.6. A useful application of this result is that if S1 and S2 are two subspaces of <n, then

S1⊥ + S2⊥ = (S1 ∩ S2)⊥.

This follows by introducing matrices B1 and B2 such that S1 = {x | B1x = 0} = N(B1) and S2 = {x | B2x = 0} = N(B2), and writing

S1⊥ + S2⊥ = R( [B1′  B2′] ) = N( [B1; B2] )⊥ = ( N(B1) ∩ N(B2) )⊥ = (S1 ∩ S2)⊥,

where [B1′  B2′] denotes the matrix whose columns are those of B1′ followed by those of B2′, and [B1; B2] denotes the matrix obtained by stacking B1 on top of B2.
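
The identity R(A) = N(A′)⊥ is easy to probe numerically. The following minimal NumPy sketch (an illustration added here, with an arbitrarily chosen matrix A that is not from the text) checks that the orthogonal projectors onto R(A) and N(A′) sum to the identity:

```python
import numpy as np

# Hypothetical example matrix; any m x n matrix of full column rank works.
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]]).T            # 3 x 2, so R(A) is a plane in <3

# Orthonormal basis for R(A) via the SVD (valid here since A has full column rank).
U = np.linalg.svd(A, full_matrices=False)[0]
P_range = U @ U.T                            # projector onto R(A)

# Basis for N(A'): right singular vectors of A' with zero singular value.
_, s, Vt = np.linalg.svd(A.T)
rank = len(s[s > 1e-12])
null_basis = Vt[rank:].T                     # columns span N(A')
P_null = null_basis @ null_basis.T           # projector onto N(A')

# R(A) = N(A')^perp means the two projectors sum to the identity.
print(np.allclose(P_range + P_null, np.eye(3)))   # True
```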

A function f : <n ↦ < is said to be affine if it has the form f(x) = a′x + b for some a ∈ <n and b ∈ <. Similarly, a function f : <n ↦ <m is said to be affine if it has the form f(x) = Ax + b for some m × n matrix A and some b ∈ <m. If b = 0, f is said to be a linear function or linear transformation.

1.1.2 Topological Properties

Definition 1.1.1: A norm ‖ · ‖ on <n is a function that assigns a scalar ‖x‖ to every x ∈ <n and that has the following properties:

(a) ‖x‖ ≥ 0 for all x ∈ <n.

(b) ‖αx‖ = |α| · ‖x‖ for every scalar α and every x ∈ <n.

(c) ‖x‖ = 0 if and only if x = 0.

(d) ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ <n (this is referred to as the triangle inequality).

The Euclidean norm of a vector x = (x1, . . . , xn) is defined by

‖x‖ = (x′x)^{1/2} = ( ∑_{i=1}^n |x_i|² )^{1/2}.

The space <n, equipped with this norm, is called a Euclidean space. We will use the Euclidean norm almost exclusively in this book. In particular, in the absence of a clear indication to the contrary, ‖ · ‖ will denote the Euclidean norm. Two important results for the Euclidean norm are:


Proposition 1.1.1: (Pythagorean Theorem) For any two vectors x and y that are orthogonal, we have

‖x + y‖² = ‖x‖² + ‖y‖².

Proposition 1.1.2: (Schwarz inequality) For any two vectors x and y, we have

|x′y| ≤ ‖x‖ · ‖y‖,

with equality holding if and only if x = αy for some scalar α.

Two other important norms are the maximum norm ‖ · ‖∞ (also called sup-norm or ℓ∞-norm), defined by

‖x‖∞ = max_i |x_i|,

and the ℓ1-norm ‖ · ‖1, defined by

‖x‖1 = ∑_{i=1}^n |x_i|.
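
As a quick illustration (a sketch added here, with an arbitrary sample vector), all three norms are available in NumPy:

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

euclidean = np.linalg.norm(x)          # (sum_i |x_i|^2)^(1/2) = sqrt(26)
max_norm  = np.linalg.norm(x, np.inf)  # max_i |x_i| = 4
l1_norm   = np.linalg.norm(x, 1)       # sum_i |x_i| = 8

print(euclidean, max_norm, l1_norm)
```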

Sequences

We use both subscripts and superscripts in sequence notation. Generally, we prefer subscripts, but we use superscripts whenever we need to reserve the subscript notation for indexing coordinates or components of vectors and functions. The meaning of the subscripts and superscripts should be clear from the context in which they are used.

A sequence {xk | k = 1, 2, . . .} (or {xk} for short) of scalars is said to converge if there exists a scalar x such that for every ε > 0 we have |xk − x| < ε for every k greater than some integer K (depending on ε). We call the scalar x the limit of {xk}, and we also say that {xk} converges to x; symbolically, xk → x or lim_{k→∞} xk = x. If for every scalar b there exists some K (depending on b) such that xk ≥ b for all k ≥ K, we write xk → ∞ and lim_{k→∞} xk = ∞. Similarly, if for every scalar b there exists some K such that xk ≤ b for all k ≥ K, we write xk → −∞ and lim_{k→∞} xk = −∞.


A sequence {xk} is called a Cauchy sequence if for every ε > 0, there exists some K (depending on ε) such that |xk − xm| < ε for all k ≥ K and m ≥ K.

A sequence {xk} is said to be bounded above (respectively, below) if there exists some scalar b such that xk ≤ b (respectively, xk ≥ b) for all k. It is said to be bounded if it is bounded above and bounded below. The sequence {xk} is said to be monotonically nonincreasing (respectively, nondecreasing) if xk+1 ≤ xk (respectively, xk+1 ≥ xk) for all k. If {xk} converges to x and is nonincreasing (nondecreasing), we also use the notation xk ↓ x (xk ↑ x, respectively).

Proposition 1.1.3: Every bounded and monotonically nonincreasing or nondecreasing scalar sequence converges.

Note that a monotonically nondecreasing sequence {xk} is either bounded, in which case it converges to some scalar x by the above proposition, or else it is unbounded, in which case xk → ∞. Similarly, a monotonically nonincreasing sequence {xk} is either bounded and converges, or it is unbounded, in which case xk → −∞.

The supremum of a nonempty set X of scalars, denoted by sup X, is defined as the smallest scalar x such that x ≥ y for all y ∈ X. If no such scalar exists, we say that the supremum of X is ∞. Similarly, the infimum of X, denoted by inf X, is defined as the largest scalar x such that x ≤ y for all y ∈ X, and is equal to −∞ if no such scalar exists. For the empty set, we use the convention

sup(Ø) = −∞,   inf(Ø) = ∞.

(This is somewhat paradoxical, since we have that the sup of a set is less than its inf, but it works well for our analysis.) If sup X is equal to a scalar x that belongs to the set X, we say that x is the maximum point of X and we often write

x = sup X = max X.

Similarly, if inf X is equal to a scalar x that belongs to the set X, we often write

x = inf X = min X.

Thus, when we write max X (or min X) in place of sup X (or inf X, respectively) we do so just for emphasis: we indicate that it is either evident, or it is known through earlier analysis, or it is about to be shown that the maximum (or minimum, respectively) of the set X is attained at one of its points.

Given a scalar sequence {xk}, the supremum of the sequence, denoted by sup_k xk, is defined as sup{xk | k = 1, 2, . . .}. The infimum of a sequence is similarly defined. Given a sequence {xk}, let ym = sup{xk | k ≥ m}, zm = inf{xk | k ≥ m}. The sequences {ym} and {zm} are nonincreasing and nondecreasing, respectively, and therefore have a limit whenever {xk} is bounded above or is bounded below, respectively (Prop. 1.1.3). The limit of ym is denoted by lim sup_{k→∞} xk, and is referred to as the limit superior of {xk}. The limit of zm is denoted by lim inf_{k→∞} xk, and is referred to as the limit inferior of {xk}. If {xk} is unbounded above, we write lim sup_{k→∞} xk = ∞, and if it is unbounded below, we write lim inf_{k→∞} xk = −∞.

Proposition 1.1.4: Let {xk} and {yk} be scalar sequences.

(a) There holds

inf_k xk ≤ lim inf_{k→∞} xk ≤ lim sup_{k→∞} xk ≤ sup_k xk.

(b) {xk} converges if and only if lim inf_{k→∞} xk = lim sup_{k→∞} xk and, in that case, both of these quantities are equal to the limit of xk.

(c) If xk ≤ yk for all k, then

lim inf_{k→∞} xk ≤ lim inf_{k→∞} yk,   lim sup_{k→∞} xk ≤ lim sup_{k→∞} yk.

(d) We have

lim inf_{k→∞} xk + lim inf_{k→∞} yk ≤ lim inf_{k→∞} (xk + yk),

lim sup_{k→∞} xk + lim sup_{k→∞} yk ≥ lim sup_{k→∞} (xk + yk).
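
For instance, the sequence xk = (−1)^k (1 + 1/k) has lim inf equal to −1 and lim sup equal to 1, but no limit. A short sketch (illustrative only, using this assumed example sequence) approximates the limit superior and inferior through the tail suprema ym and infima zm defined above:

```python
import numpy as np

k = np.arange(1, 2001)
x = (-1.0) ** k * (1.0 + 1.0 / k)   # oscillates between roughly -1 and 1

m = 1000                            # y_m and z_m approach lim sup and lim inf as m grows
y_m = x[m:].max()                   # sup over the tail {x_k | k >= m}
z_m = x[m:].min()                   # inf over the tail

print(y_m, z_m)                     # approximately 1 and -1
```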

A sequence {xk} of vectors in <n is said to converge to some x ∈ <n if the ith coordinate of xk converges to the ith coordinate of x for every i. We use the notations xk → x and lim_{k→∞} xk = x to indicate convergence for vector sequences as well. The sequence {xk} is called bounded (respectively, a Cauchy sequence) if each of its corresponding coordinate sequences is bounded (respectively, a Cauchy sequence). It can be seen that {xk} is bounded if and only if there exists a scalar c such that ‖xk‖ ≤ c for all k.


Definition 1.1.2: We say that a vector x ∈ <n is a limit point of a sequence {xk} in <n if there exists a subsequence of {xk} that converges to x.

Proposition 1.1.5: Let {xk} be a sequence in <n.

(a) If {xk} is bounded, it has at least one limit point.

(b) {xk} converges if and only if it is bounded and it has a unique limit point.

(c) {xk} converges if and only if it is a Cauchy sequence.

o(·) Notation

If p is a positive integer and h : <n ↦ <m, then we write

h(x) = o(‖x‖^p)

if

lim_{k→∞} h(xk) / ‖xk‖^p = 0,

for all sequences {xk}, with xk ≠ 0 for all k, that converge to 0.

Closed and Open Sets

We say that x is a closure point or limit point of a set X ⊂ <n if there exists a sequence {xk}, consisting of elements of X, that converges to x. The closure of X, denoted cl(X), is the set of all limit points of X.

Definition 1.1.3: A set X ⊂ <n is called closed if it is equal to its closure. It is called open if its complement (the set {x | x ∉ X}) is closed. It is called bounded if there exists a scalar c such that the magnitude of any coordinate of any element of X is less than c. It is called compact if it is closed and bounded.


Definition 1.1.4: A neighborhood of a vector x is an open set containing x. We say that x is an interior point of a set X ⊂ <n if there exists a neighborhood of x that is contained in X. A vector x ∈ cl(X) which is not an interior point of X is said to be a boundary point of X.

Let ‖ · ‖ be a given norm in <n. For any ε > 0 and x∗ ∈ <n, consider the sets

{x | ‖x − x∗‖ < ε},   {x | ‖x − x∗‖ ≤ ε}.

The first set is open and is called an open sphere centered at x∗, while the second set is closed and is called a closed sphere centered at x∗. Sometimes the terms open ball and closed ball are used, respectively.

Proposition 1.1.6:

(a) The union of finitely many closed sets is closed.

(b) The intersection of closed sets is closed.

(c) The union of open sets is open.

(d) The intersection of finitely many open sets is open.

(e) A set is open if and only if all of its elements are interior points.

(f) Every subspace of <n is closed.

(g) A set X is compact if and only if every sequence of elements of X has a subsequence that converges to an element of X.

(h) If {Xk} is a sequence of nonempty and compact sets such that Xk ⊃ Xk+1 for all k, then the intersection ∩_{k=0}^∞ Xk is nonempty and compact.

The topological properties of subsets of <n, such as being open, closed, or compact, do not depend on the norm being used. This is a consequence of the following proposition, referred to as the norm equivalence property in <n, which shows that if a sequence converges with respect to one norm, it converges with respect to all other norms.

Proposition 1.1.7: For any two norms ‖ · ‖ and ‖ · ‖′ on <n, there exists a scalar c such that ‖x‖ ≤ c‖x‖′ for all x ∈ <n.

Using the preceding proposition, we obtain the following.


Proposition 1.1.8: If a subset of <n is open (respectively, closed, bounded, or compact) with respect to some norm, it is open (respectively, closed, bounded, or compact) with respect to all other norms.

Sequences of Sets

Let {Xk} be a sequence of nonempty subsets of <n. The outer limit of {Xk}, denoted lim sup_{k→∞} Xk, is the set of all x ∈ <n such that every neighborhood of x has a nonempty intersection with infinitely many of the sets Xk, k = 1, 2, . . .. Equivalently, lim sup_{k→∞} Xk is the set of all limits of subsequences {xk}K such that xk ∈ Xk for all k ∈ K.

The inner limit of {Xk}, denoted lim inf_{k→∞} Xk, is the set of all x ∈ <n such that every neighborhood of x has a nonempty intersection with all except finitely many of the sets Xk, k = 1, 2, . . .. Equivalently, lim inf_{k→∞} Xk is the set of all limits of sequences {xk} such that xk ∈ Xk for all k = 1, 2, . . ..

The sequence {Xk} is said to converge to a set X if

X = lim inf_{k→∞} Xk = lim sup_{k→∞} Xk,

in which case X is said to be the limit of {Xk}. The inner and outer limits are closed (possibly empty) sets. If each set Xk consists of a single point xk, lim sup_{k→∞} Xk is the set of limit points of {xk}, while lim inf_{k→∞} Xk is just the limit of {xk} if {xk} converges, and otherwise it is empty.

Continuity

Let X be a subset of <n and let f : X ↦ <m be some function. Let x be a closure point of X. If there exists a vector y ∈ <m such that the sequence {f(xk)} converges to y for every sequence {xk} ⊂ X such that lim_{k→∞} xk = x, we write lim_{z→x} f(z) = y.

If X is a subset of < and x is a closure point of X, we denote by lim_{z↑x} f(z) [respectively, lim_{z↓x} f(z)] the limit of f(xk), where {xk} is any sequence of elements of X converging to x and satisfying xk ≤ x (respectively, xk ≥ x), assuming that at least one such sequence {xk} exists, and the corresponding limit of f(xk) exists and is independent of the choice of {xk}.


Definition 1.1.5: Let X be a subset of <n.

(a) A function f : X ↦ <m is called continuous at a point x ∈ X if lim_{z→x} f(z) = f(x).

(b) A function f : X ↦ <m is called right-continuous (respectively, left-continuous) at a point x ∈ X if lim_{z↓x} f(z) = f(x) [respectively, lim_{z↑x} f(z) = f(x)].

(c) A real-valued function f : X ↦ < is called upper semicontinuous (respectively, lower semicontinuous) at a vector x ∈ X if f(x) ≥ lim sup_{k→∞} f(xk) [respectively, f(x) ≤ lim inf_{k→∞} f(xk)] for every sequence {xk} of elements of X converging to x.

If f : X ↦ <m is continuous at every point of a subset of its domain X, we say that f is continuous over that subset. If f : X ↦ <m is continuous at every point of its domain X, we say that f is continuous. We use similar terminology for right-continuous, left-continuous, upper semicontinuous, and lower semicontinuous functions.


Proposition 1.1.9:

(a) The composition of two continuous functions is continuous.

(b) Any vector norm on <n is a continuous function.

(c) Let f : <n ↦ <m be continuous, and let Y ⊂ <m be open (respectively, closed). Then the inverse image of Y, {x ∈ <n | f(x) ∈ Y}, is open (respectively, closed).

(d) Let f : <n ↦ <m be continuous, and let X ⊂ <n be compact. Then the forward image of X, {f(x) | x ∈ X}, is compact.

Matrix Norms

A norm ‖ · ‖ on the set of n × n matrices is a real-valued function that has the same properties as vector norms do when the matrix is viewed as an element of <n². The norm of an n × n matrix A is denoted by ‖A‖.

We are mainly interested in induced norms, which are constructed as follows. Given any vector norm ‖ · ‖, the corresponding induced matrix norm, also denoted by ‖ · ‖, is defined by

‖A‖ = sup_{‖x‖=1} ‖Ax‖.

It is easily verified that for any vector norm, the above equation defines a bona fide matrix norm having all the required properties.

Note that by the Schwarz inequality (Prop. 1.1.2), we have

‖A‖ = sup_{‖x‖=1} ‖Ax‖ = sup_{‖y‖=‖x‖=1} |y′Ax|.

By reversing the roles of x and y in the above relation and by using the equality y′Ax = x′A′y, it follows that ‖A‖ = ‖A′‖.

1.1.3 Square Matrices

Definition 1.1.6: A square matrix A is called singular if its determinant is zero. Otherwise it is called nonsingular or invertible.

Proposition 1.1.10:

(a) Let A be an n × n matrix. The following are equivalent:

(i) The matrix A is nonsingular.

(ii) The matrix A′ is nonsingular.

(iii) For every nonzero x ∈ <n, we have Ax ≠ 0.

(iv) For every y ∈ <n, there is a unique x ∈ <n such that Ax = y.

(v) There is an n × n matrix B such that AB = I = BA.

(vi) The columns of A are linearly independent.

(vii) The rows of A are linearly independent.

(b) Assuming that A is nonsingular, the matrix B of statement (v) (called the inverse of A and denoted by A−1) is unique.

(c) For any two square invertible matrices A and B of the same dimensions, we have (AB)−1 = B−1A−1.


Definition 1.1.7: The characteristic polynomial φ of an n × n matrix A is defined by φ(λ) = det(λI − A), where I is the identity matrix of the same size as A. The n (possibly repeated or complex) roots of φ are called the eigenvalues of A. A nonzero vector x (with possibly complex coordinates) such that Ax = λx, where λ is an eigenvalue of A, is called an eigenvector of A associated with λ.

Proposition 1.1.11: Let A be a square matrix.

(a) A complex number λ is an eigenvalue of A if and only if there exists a nonzero eigenvector associated with λ.

(b) A is singular if and only if it has an eigenvalue that is equal to zero.

Note that the only use of complex numbers in this book is in relation to eigenvalues and eigenvectors. All other matrices or vectors are implicitly assumed to have real components.

Proposition 1.1.12: Let A be an n × n matrix.

(a) If T is a nonsingular matrix and B = TAT−1, then the eigenvalues of A and B coincide.

(b) For any scalar c, the eigenvalues of cI + A are equal to c + λ1, . . . , c + λn, where λ1, . . . , λn are the eigenvalues of A.

(c) The eigenvalues of A^k are equal to λ1^k, . . . , λn^k, where λ1, . . . , λn are the eigenvalues of A.

(d) If A is nonsingular, then the eigenvalues of A−1 are the reciprocals of the eigenvalues of A.

(e) The eigenvalues of A and A′ coincide.

Symmetric and Positive Definite Matrices

Symmetric matrices have several special properties, particularly regarding their eigenvalues and eigenvectors. In what follows in this section, ‖ · ‖ denotes the Euclidean norm.

Proposition 1.1.13: Let A be a symmetric n × n matrix. Then:

(a) The eigenvalues of A are real.

(b) The matrix A has a set of n mutually orthogonal, real, and nonzero eigenvectors x1, . . . , xn.

(c) Suppose that the eigenvectors in part (b) have been normalized so that ‖xi‖ = 1 for each i. Then

A = ∑_{i=1}^n λ_i x_i x_i′,

where λ_i is the eigenvalue corresponding to x_i.
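
This spectral decomposition is straightforward to verify numerically; the following sketch (an illustration with an arbitrarily chosen symmetric matrix) uses NumPy's eigh, which returns the eigenvalues of a symmetric matrix together with orthonormal eigenvectors:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])              # symmetric

lam, X = np.linalg.eigh(A)              # lam: eigenvalues, X: orthonormal eigenvectors

# Reconstruct A as sum_i lam_i * x_i x_i'
A_rebuilt = sum(lam[i] * np.outer(X[:, i], X[:, i]) for i in range(len(lam)))
print(np.allclose(A, A_rebuilt))        # True

# The eigenvectors are mutually orthogonal with unit norm: X'X = I
print(np.allclose(X.T @ X, np.eye(2)))  # True
```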

Proposition 1.1.14: Let A be a symmetric n × n matrix, and let λ1 ≤ · · · ≤ λn be its (real) eigenvalues. Then:

(a) ‖A‖ = max{ |λ1|, |λn| }, where ‖ · ‖ is the matrix norm induced by the Euclidean norm.

(b) λ1‖y‖² ≤ y′Ay ≤ λn‖y‖² for all y ∈ <n.

(c) If A is nonsingular, then

‖A−1‖ = 1 / min{ |λ1|, |λn| }.

Proposition 1.1.15: Let A be a square matrix, and let ‖ · ‖ be the matrix norm induced by the Euclidean norm. Then:

(a) If A is symmetric, then ‖A^k‖ = ‖A‖^k for any positive integer k.

(b) ‖A‖² = ‖A′A‖ = ‖AA′‖.


Definition 1.1.8: A symmetric n × n matrix A is called positive definite if x′Ax > 0 for all x ∈ <n, x ≠ 0. It is called positive semidefinite if x′Ax ≥ 0 for all x ∈ <n.

Throughout this book, the notion of positive definiteness applies exclusively to symmetric matrices. Thus whenever we say that a matrix is positive (semi)definite, we implicitly assume that the matrix is symmetric.

Proposition 1.1.16:

(a) The sum of two positive semidefinite matrices is positive semidefinite. If one of the two matrices is positive definite, the sum is positive definite.

(b) If A is a positive semidefinite n × n matrix and T is an m × n matrix, then the matrix TAT′ is positive semidefinite. If A is positive definite and T is invertible, then TAT′ is positive definite.

Proposition 1.1.17:

(a) For any m × n matrix A, the matrix A′A is symmetric and positive semidefinite. A′A is positive definite if and only if A has rank n. In particular, if m = n, A′A is positive definite if and only if A is nonsingular.

(b) A square symmetric matrix is positive semidefinite (respectively, positive definite) if and only if all of its eigenvalues are nonnegative (respectively, positive).

(c) The inverse of a symmetric positive definite matrix is symmetric and positive definite.
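
Part (b) gives a practical test for (semi)definiteness. A minimal sketch (added for illustration, with an arbitrary random matrix) that checks it with NumPy, together with the A′A construction of part (a):

```python
import numpy as np

A = np.random.randn(4, 3)          # a generic 4 x 3 matrix has rank 3
M = A.T @ A                        # symmetric positive semidefinite, 3 x 3

eigvals = np.linalg.eigvalsh(M)    # real eigenvalues of a symmetric matrix
print(eigvals)                     # all nonnegative (up to roundoff); all > 0 iff rank(A) = 3

# Direct check of the definition on a random vector: x'Mx = ||Ax||^2 >= 0
x = np.random.randn(3)
print(x @ M @ x >= 0)              # True
```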


Proposition 1.1.18: Let A be a symmetric positive semidefinite n × n matrix of rank m. There exists an n × m matrix C of rank m such that

A = CC′.

Furthermore, for any such matrix C:

(a) A and C′ have the same null space: N(A) = N(C′).

(b) A and C have the same range space: R(A) = R(C).

Proposition 1.1.19: Let A be a square symmetric positive semidefinite matrix.

(a) There exists a symmetric matrix Q with the property Q² = A. Such a matrix is called a symmetric square root of A and is denoted by A^{1/2}.

(b) There is a unique symmetric square root if and only if A is positive definite.

(c) A symmetric square root A^{1/2} is invertible if and only if A is invertible. Its inverse is denoted by A^{−1/2}.

(d) There holds A^{−1/2}A^{−1/2} = A^{−1}.

(e) There holds AA^{1/2} = A^{1/2}A.
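
A symmetric square root can be formed from the spectral decomposition of Prop. 1.1.13(c) by taking square roots of the eigenvalues. A short sketch (illustrative, with an assumed example matrix):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])                 # symmetric positive definite

lam, X = np.linalg.eigh(A)
A_half = X @ np.diag(np.sqrt(lam)) @ X.T   # symmetric square root A^{1/2}

print(np.allclose(A_half @ A_half, A))     # Q^2 = A
print(np.allclose(A_half, A_half.T))       # Q is symmetric
print(np.allclose(A @ A_half, A_half @ A)) # A A^{1/2} = A^{1/2} A
```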

1.1.4 Derivatives

Let f : <n ↦ < be some function, fix some x ∈ <n, and consider the expression

lim_{α→0} ( f(x + αe_i) − f(x) ) / α,

where e_i is the ith unit vector (all components are 0 except for the ith component, which is 1). If the above limit exists, it is called the ith partial derivative of f at the point x and it is denoted by (∂f/∂x_i)(x) or ∂f(x)/∂x_i (x_i in this section will denote the ith coordinate of the vector x). Assuming all of these partial derivatives exist, the gradient of f at x is defined as the column vector

∇f(x) = ( ∂f(x)/∂x_1 , . . . , ∂f(x)/∂x_n )′.


For any y ∈ <n, we define the one-sided directional derivative of f in the direction y to be

f′(x; y) = lim_{α↓0} ( f(x + αy) − f(x) ) / α,

provided that the limit exists. We note from the definitions that

f′(x; e_i) = −f′(x; −e_i)   ⟹   f′(x; e_i) = (∂f/∂x_i)(x).

If the directional derivative of f at a vector x exists in all directions y and f′(x; y) is a linear function of y, we say that f is differentiable at x. This type of differentiability is also called Gateaux differentiability. It is seen that f is differentiable at x if and only if the gradient ∇f(x) exists and satisfies ∇f(x)′y = f′(x; y) for every y ∈ <n. The function f is called differentiable over a given subset U of <n if it is differentiable at every x ∈ U. The function f is called differentiable (without qualification) if it is differentiable at all x ∈ <n.

If f is differentiable over an open set U and the gradient ∇f(x) is continuous at all x ∈ U, f is said to be continuously differentiable over U. Such a function has the property

lim_{y→0} ( f(x + y) − f(x) − ∇f(x)′y ) / ‖y‖ = 0,   ∀ x ∈ U,   (1.1)

where ‖ · ‖ is an arbitrary vector norm. If f is continuously differentiable over <n, then f is also called a smooth function.
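
Equation (1.1) is the basis of the standard finite-difference "gradient check." A minimal sketch (the test function below is an assumption for illustration, not from the text):

```python
import numpy as np

def f(x):
    return np.sin(x[0]) + x[0] * x[1] ** 2      # assumed smooth test function

def grad_f(x):
    return np.array([np.cos(x[0]) + x[1] ** 2,  # df/dx1
                     2.0 * x[0] * x[1]])        # df/dx2

x = np.array([0.7, -1.3])
for t in [1e-2, 1e-4, 1e-6]:
    y = t * np.array([1.0, 2.0])                # shrink y toward 0
    err = abs(f(x + y) - f(x) - grad_f(x) @ y) / np.linalg.norm(y)
    print(err)                                  # tends to 0 as y -> 0, per Eq. (1.1)
```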

The preceding equation can also be used as an alternative definition of differentiability. In particular, f is called Frechet differentiable at x if there exists a vector g satisfying Eq. (1.1) with ∇f(x) replaced by g. If such a vector g exists, it can be seen that all the partial derivatives (∂f/∂x_i)(x) exist and that g = ∇f(x). Frechet differentiability implies (Gateaux) differentiability but not conversely (see for example [OrR70] for a detailed discussion). In this book, when dealing with a differentiable function f, we will always assume that f is continuously differentiable (smooth) over a given open set [∇f(x) is a continuous function of x over that set], in which case f is both Gateaux and Frechet differentiable, and the distinctions made above are of no consequence.

The definitions of differentiability of f at a point x only involve the values of f in a neighborhood of x. Thus, these definitions can be used for functions f that are not defined on all of <n, but are defined instead in a neighborhood of the point at which the derivative is computed. In particular, for functions f : X ↦ < that are defined over a strict subset X of <n, we use the above definition of differentiability of f at a vector x, provided x is an interior point of the domain X. Similarly, we use the above definition of differentiability or continuous differentiability of f over a subset U, provided U is an open subset of the domain X. Thus any mention of differentiability of a function f over a subset implicitly assumes that this subset is open.

If f : <n ↦ <m is a vector-valued function, it is called differentiable (or smooth) if each component f_i of f is differentiable (or smooth, respectively). The gradient matrix of f, denoted ∇f(x), is the n × m matrix whose ith column is the gradient ∇f_i(x) of f_i. Thus,

∇f(x) = [ ∇f_1(x) · · · ∇f_m(x) ].

The transpose of ∇f is called the Jacobian of f and is a matrix whose ijth entry is equal to the partial derivative ∂f_i/∂x_j.

Now suppose that each one of the partial derivatives of a function f : <n ↦ < is a smooth function of x. We use the notation (∂²f/∂x_i∂x_j)(x) to indicate the ith partial derivative of ∂f/∂x_j at a point x ∈ <n. The Hessian of f is the matrix whose ijth entry is equal to (∂²f/∂x_i∂x_j)(x), and is denoted by ∇²f(x). We have (∂²f/∂x_i∂x_j)(x) = (∂²f/∂x_j∂x_i)(x) for every x, which implies that ∇²f(x) is symmetric.

If f : <m+n ↦ < is a function of (x, y), where x = (x1, . . . , xm) ∈ <m and y = (y1, . . . , yn) ∈ <n, we write

∇_x f(x, y) = ( ∂f(x, y)/∂x_1 , . . . , ∂f(x, y)/∂x_m )′,   ∇_y f(x, y) = ( ∂f(x, y)/∂y_1 , . . . , ∂f(x, y)/∂y_n )′.

We denote by ∇²_{xx} f(x, y), ∇²_{xy} f(x, y), and ∇²_{yy} f(x, y) the matrices with components

[∇²_{xx} f(x, y)]_{ij} = ∂²f(x, y)/∂x_i∂x_j,   [∇²_{xy} f(x, y)]_{ij} = ∂²f(x, y)/∂x_i∂y_j,   [∇²_{yy} f(x, y)]_{ij} = ∂²f(x, y)/∂y_i∂y_j.

If f : <m+n ↦ <r, f = (f1, f2, . . . , fr), we write

∇_x f(x, y) = [ ∇_x f_1(x, y) · · · ∇_x f_r(x, y) ],   ∇_y f(x, y) = [ ∇_y f_1(x, y) · · · ∇_y f_r(x, y) ].

Let f : <k ↦ <m and g : <m ↦ <n be smooth functions, and let h be their composition, i.e.,

h(x) = g( f(x) ).


Then, the chain rule for differentiation states that

∇h(x) = ∇f(x) ∇g( f(x) ),   ∀ x ∈ <k.

Some examples of useful relations that follow from the chain rule are:

∇( f(Ax) ) = A′ ∇f(Ax),   ∇²( f(Ax) ) = A′ ∇²f(Ax) A,

where A is a matrix, and

∇_x ( f( h(x), y ) ) = ∇h(x) ∇_h f( h(x), y ),

∇_x ( f( h(x), g(x) ) ) = ∇h(x) ∇_h f( h(x), g(x) ) + ∇g(x) ∇_g f( h(x), g(x) ).

We now state some theorems relating to differentiable functions that will be useful for our purposes.

Proposition 1.1.20: (Mean Value Theorem) Let f : <n ↦ < be continuously differentiable over an open sphere S, and let x be a vector in S. Then for all y such that x + y ∈ S, there exists an α ∈ [0, 1] such that

f(x + y) = f(x) + ∇f(x + αy)′y.

Proposition 1.1.21: (Second Order Expansions) Let f : <n ↦ < be twice continuously differentiable over an open sphere S, and let x be a vector in S. Then for all y such that x + y ∈ S:

(a) There holds

f(x + y) = f(x) + y′∇f(x) + (1/2) y′ ( ∫_0^1 ( ∫_0^t ∇²f(x + τy) dτ ) dt ) y.

(b) There exists an α ∈ [0, 1] such that

f(x + y) = f(x) + y′∇f(x) + (1/2) y′∇²f(x + αy) y.

(c) There holds

f(x + y) = f(x) + y′∇f(x) + (1/2) y′∇²f(x) y + o(‖y‖²).
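
Part (c) can be checked numerically: the error of the quadratic model must vanish faster than ‖y‖². A minimal sketch (the test function is an assumption chosen for illustration):

```python
import numpy as np

def f(x):    return np.exp(x[0]) * np.cos(x[1])
def grad(x): return np.array([np.exp(x[0]) * np.cos(x[1]),
                              -np.exp(x[0]) * np.sin(x[1])])
def hess(x): return np.array([[ np.exp(x[0]) * np.cos(x[1]), -np.exp(x[0]) * np.sin(x[1])],
                              [-np.exp(x[0]) * np.sin(x[1]), -np.exp(x[0]) * np.cos(x[1])]])

x = np.array([0.3, 0.8])
d = np.array([1.0, -2.0])
for t in [1e-1, 1e-2, 1e-3]:
    y = t * d
    model = f(x) + grad(x) @ y + 0.5 * y @ hess(x) @ y
    print(abs(f(x + y) - model) / (t * t))   # -> 0, i.e., the error is o(||y||^2)
```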


Proposition 1.1.22: (Implicit Function Theorem) Consider a function f : <n+m ↦ <m of x ∈ <n and y ∈ <m such that:

(1) f(x̄, ȳ) = 0.

(2) f is continuous, and has a continuous and nonsingular gradient matrix ∇_y f(x, y) in an open set containing (x̄, ȳ).

Then there exist open sets Sx̄ ⊂ <n and Sȳ ⊂ <m containing x̄ and ȳ, respectively, and a continuous function φ : Sx̄ ↦ Sȳ such that ȳ = φ(x̄) and f( x, φ(x) ) = 0 for all x ∈ Sx̄. The function φ is unique in the sense that if x ∈ Sx̄, y ∈ Sȳ, and f(x, y) = 0, then y = φ(x). Furthermore, if for some integer p > 0, f is p times continuously differentiable, the same is true for φ, and we have

∇φ(x) = −∇_x f( x, φ(x) ) ( ∇_y f( x, φ(x) ) )^{−1},   ∀ x ∈ Sx̄.
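
As a concrete one-dimensional illustration (an assumed example, not from the text), take f(x, y) = x² + y² − 1 near (x̄, ȳ) = (0, 1); then φ(x) = (1 − x²)^{1/2} and the theorem's formula gives ∇φ(x) = −(2x)(2y)⁻¹ = −x/φ(x):

```python
import numpy as np

def f(x, y): return x**2 + y**2 - 1.0     # f(0, 1) = 0 and df/dy = 2y is nonzero near y = 1
def phi(x):  return np.sqrt(1.0 - x**2)   # explicit solution of f(x, phi(x)) = 0

x = 0.3
grad_phi = -(2.0 * x) / (2.0 * phi(x))    # -grad_x f * (grad_y f)^(-1) at (x, phi(x))

eps = 1e-7                                # compare with a central finite difference
print(np.isclose(grad_phi, (phi(x + eps) - phi(x - eps)) / (2 * eps)))  # True
```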

As a final word of caution to the reader, let us mention that one can easily get confused with gradient notation and its use in various formulas, such as for example the order of multiplication of various gradients in the chain rule and the Implicit Function Theorem. Perhaps the safest guideline to minimize errors is to remember our conventions:

(a) A vector is viewed as a column vector (an n × 1 matrix).

(b) The gradient ∇f of a scalar function f : <n ↦ < is also viewed as a column vector.

(c) The gradient matrix ∇f of a vector function f : <n ↦ <m with components f1, . . . , fm is the n × m matrix whose columns are the (column) vectors ∇f1, . . . , ∇fm.

With these rules in mind, one can use “dimension matching” as an effective guide to writing correct formulas quickly.

1.2 CONVEX SETS AND FUNCTIONS

1.2.1 Basic Properties

The notion of a convex set is defined below and is illustrated in Fig. 1.2.1.


Figure 1.2.1. Illustration of the definition of a convex set. For convexity, linear interpolation between any two points of the set must yield points that lie within the set.

Definition 1.2.1: Let C be a subset of <n. We say that C is convex if

αx + (1− α)y ∈ C, ∀ x, y ∈ C, ∀ α ∈ [0, 1]. (1.2)

Note that the empty set is by convention considered to be convex. Generally, when referring to a convex set, it will usually be apparent from the context whether this set can be empty, but we will often be specific in order to minimize ambiguities.

The following proposition lists some operations that preserve convexity of a set.


Proposition 1.2.1:

(a) The intersection ∩i∈I Ci of any collection {Ci | i ∈ I} of convex sets is convex.

(b) The vector sum C1 + C2 of two convex sets C1 and C2 is convex.

(c) The set x + λC is convex for any convex set C, vector x, and scalar λ. Furthermore, if C is a convex set and λ1, λ2 are positive scalars, we have

(λ1 + λ2)C = λ1C + λ2C.

(d) The closure and the interior of a convex set are convex.

(e) The image and the inverse image of a convex set under an affine function are convex.

Proof: The proof is straightforward using the definition of convexity, cf. Eq. (1.2). For example, to prove part (a), we take two points x and y from ∩i∈I Ci, and we use the convexity of Ci to argue that the line segment connecting x and y belongs to all the sets Ci, and hence, to their intersection. The proofs of parts (b)-(e) are similar and are left as exercises for the reader. Q.E.D.

A set C is said to be a cone if for all x ∈ C and λ > 0, we have λx ∈ C. A cone need not be convex and need not contain the origin (although the origin always lies in its closure). Several of the results of the above proposition have analogs for cones (see the exercises).

Convex Functions

The notion of a convex function is defined below and is illustrated in Fig. 1.2.2.


Definition 1.2.2: Let C be a convex subset of <n. A function f : C 7→ < is called convex if

f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y), ∀ x, y ∈ C, ∀ α ∈ [0, 1]. (1.3)

The function f is called concave if −f is convex. The function f is called strictly convex if the above inequality is strict for all x, y ∈ C with x ≠ y, and all α ∈ (0, 1). For a function f : X 7→ <, we also say that f is convex over the convex set C if the domain X of f contains C and Eq. (1.3) holds, i.e., when the domain of f is restricted to C, f becomes convex.

Figure 1.2.2. Illustration of the definition of a function that is convex over a convex set C. The linear interpolation αf(x) + (1 − α)f(y) overestimates the function value f(αx + (1 − α)y) for all α ∈ [0, 1].

If C is a convex set and f : C 7→ < is a convex function, the level sets {x ∈ C | f(x) ≤ γ} and {x ∈ C | f(x) < γ} are convex for all scalars γ. To see this, note that if x, y ∈ C are such that f(x) ≤ γ and f(y) ≤ γ, then for any α ∈ [0, 1], we have αx + (1 − α)y ∈ C, by the convexity of C, and f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) ≤ γ, by the convexity of f.

However, the converse is not true; for example, the function f(x) = √|x| has convex level sets but is not convex.
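A two-line numerical check of this counterexample (illustrative only):

```python
import math

# f(x) = sqrt(|x|) has convex level sets (intervals), yet violates the
# convexity inequality, e.g. with x = 0, y = 4, a = 1/2:
f = lambda x: math.sqrt(abs(x))
x, y, a = 0.0, 4.0, 0.5
lhs = f(a * x + (1 - a) * y)        # f(2) = 1.414...
rhs = a * f(x) + (1 - a) * f(y)     # 0.5 * 0 + 0.5 * 2 = 1.0
print(lhs, rhs, lhs <= rhs)         # 1.414... 1.0 False
```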

Unless otherwise indicated, we implicitly assume that a convex function is real-valued and is defined over the entire Euclidean space (rather than over just a convex subset). We occasionally deal with extended real-valued convex functions that can take the value of ∞ or can take the value −∞ (but never with functions that can take both values −∞ and ∞). A function f mapping a convex set C ⊂ <n into (−∞, ∞] is also called convex if the condition

f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y), ∀ x, y ∈ C, ∀ α ∈ [0, 1]

holds. It can again be seen that if f is convex, the level sets {x ∈ C | f(x) ≤ γ} and {x ∈ C | f(x) < γ} are convex for all scalars γ. The effective domain of such a convex function f is the convex set

dom(f) = {x ∈ C | f(x) < ∞}.

By replacing the domain of an extended real-valued convex function with its effective domain, we can convert it to a real-valued function. In this way, we can use results stated in terms of real-valued functions, and we can also avoid calculations with ∞. Thus, the entire subject of convex functions can be developed without resorting to extended real-valued functions. The reverse is also true, namely that extended real-valued functions can be adopted as the norm; for example, the classical treatment of Rockafellar [Roc70] uses this approach. We will adopt a flexible approach, generally preferring to avoid extended real-valued functions, unless there are strong notational or other advantages for doing so.

An extended real-valued function f : X 7→ (−∞, ∞] is called lower semicontinuous at a vector x ∈ X if f(x) ≤ lim inf_{k→∞} f(xk) for every sequence {xk} converging to x. This definition is consistent with the corresponding definition for real-valued functions [cf. Def. 1.1.5(c)]. If f is lower semicontinuous at every x in a subset U ⊂ X, we say that f is lower semicontinuous over U.

The epigraph of a function f : X 7→ (−∞, ∞], where X ⊂ <n, is the subset of <n+1 given by

epi(f) = {(x, w) | x ∈ X, w ∈ <, f(x) ≤ w};

(see Fig. 1.2.3). Note that if we restrict f to its effective domain {x ∈ X | f(x) < ∞}, so that it becomes real-valued, the epigraph remains unaffected. Epigraphs are useful for our purposes because of the following proposition, which shows that questions about convexity and lower semicontinuity of functions can be reduced to corresponding questions of convexity and closure of their epigraphs.


Figure 1.2.3. Illustration of the epigraph of a convex function and a nonconvex function f : X 7→ (−∞, ∞].

Proposition 1.2.2: Let f : X 7→ (−∞,∞] be a function. Then:

(a) epi(f) is convex if and only if the set X is convex and f is convex over X.

(b) Assuming X = <n, the following are equivalent:

(i) epi(f) is closed.

(ii) f is lower semicontinuous over <n.

(iii) The level sets {x | f(x) ≤ γ} are closed for all scalars γ.

Proof: (a) Assume that X is convex and f is convex over X. If (x1, w1) and (x2, w2) belong to epi(f) and α ∈ [0, 1], we have

f(x1) ≤ w1, f(x2) ≤ w2,

and by multiplying these inequalities with α and (1 − α), respectively, by adding, and by using the convexity of f, we obtain

f(αx1 + (1 − α)x2) ≤ αf(x1) + (1 − α)f(x2) ≤ αw1 + (1 − α)w2.

Hence the vector (αx1 + (1 − α)x2, αw1 + (1 − α)w2), which is equal to α(x1, w1) + (1 − α)(x2, w2), belongs to epi(f), showing the convexity of epi(f).

Conversely, assume that epi(f) is convex, and let x1, x2 ∈ X and α ∈ [0, 1]. The pairs (x1, f(x1)) and (x2, f(x2)) belong to epi(f), so by convexity, we have

(αx1 + (1 − α)x2, αf(x1) + (1 − α)f(x2)) ∈ epi(f).


Therefore, by the definition of epi(f), it follows that αx1 + (1 − α)x2 ∈ X, so X is convex, while

f(αx1 + (1 − α)x2) ≤ αf(x1) + (1 − α)f(x2),

so f is convex over X.

(b) We first show that (i) and (ii) are equivalent. Assume that f is lower semicontinuous over <n, and let (x, w) be the limit of a sequence {(xk, wk)} ⊂ epi(f). We have f(xk) ≤ wk, and by taking the limit as k → ∞ and by using the lower semicontinuity of f at x, we obtain f(x) ≤ lim inf_{k→∞} f(xk) ≤ w. Hence (x, w) ∈ epi(f) and epi(f) is closed.

Conversely, assume that epi(f) is closed, choose any x ∈ <n, let {xk} be a sequence converging to x, and let w = lim inf_{k→∞} f(xk). We will show that f(x) ≤ w. Indeed, if w = ∞, we have f(x) ≤ w. If w < ∞, for each positive integer n, let wn = w + 1/n, and let k(n) be an integer such that k(n) ≥ n and f(xk(n)) ≤ wn. The sequence {(xk(n), wn)} belongs to epi(f) and converges to (x, w), so by the closure of epi(f), we must have f(x) ≤ w. Thus, f is lower semicontinuous at x.

We next show that (i) implies (iii), and that (iii) implies (ii). Assume that epi(f) is closed, and let {xk} be a sequence that converges to some x and belongs to the level set {z | f(z) ≤ γ}, where γ is a scalar. Then (xk, γ) ∈ epi(f) for all k, and by the closure of epi(f), we have (x, γ) ∈ epi(f). Hence x belongs to the level set {x | f(x) ≤ γ}, implying that this set is closed. Therefore (i) implies (iii).

Finally, assume that the level sets {x | f(x) ≤ γ} are closed, fix an x, and let {xk} be a sequence converging to x. If lim inf_{k→∞} f(xk) < ∞, then for each γ with lim inf_{k→∞} f(xk) < γ and all sufficiently large k, we have f(xk) < γ. From the closure of the level sets {x | f(x) ≤ γ}, it follows that x belongs to all the level sets with lim inf_{k→∞} f(xk) < γ, implying that f(x) ≤ lim inf_{k→∞} f(xk), and that f is lower semicontinuous at x. Therefore, (iii) implies (ii). Q.E.D.

If the epigraph of a function f : X 7→ (−∞, ∞] is a closed set, we say that f is a closed function. Thus, if we extend the domain of f to <n and consider the function f̄ given by

f̄(x) = { f(x) if x ∈ X, ∞ if x /∈ X },

we see that, according to the preceding proposition, f is closed if and only if f̄ is lower semicontinuous over <n.

Common examples of convex functions are affine functions and norms; this is straightforward to verify, using the definition of convexity. For example, for any x, y ∈ <n and any α ∈ [0, 1], we have by using the triangle inequality,

‖αx + (1 − α)y‖ ≤ ‖αx‖ + ‖(1 − α)y‖ = α‖x‖ + (1 − α)‖y‖,


so the norm function ‖ · ‖ is convex. The following proposition provides some means for recognizing convex functions, and gives some algebraic operations that preserve convexity of a function.

Proposition 1.2.3:

(a) Let f1, . . . , fm : <n 7→ (−∞, ∞] be given functions, let λ1, . . . , λm be positive scalars, and consider the function g : <n 7→ (−∞, ∞] given by

g(x) = λ1f1(x) + · · · + λmfm(x).

If f1, . . . , fm are convex, then g is also convex, while if f1, . . . , fm are closed, then g is also closed.

(b) Let f : <m 7→ (−∞, ∞] be a given function, let A be an m × n matrix, and consider the function g : <n 7→ (−∞, ∞] given by

g(x) = f(Ax).

If f is convex, then g is also convex, while if f is closed, then g is also closed.

(c) Let fi : <n 7→ (−∞, ∞] be given functions for i ∈ I, where I is an index set, and consider the function g : <n 7→ (−∞, ∞] given by

g(x) = sup_{i∈I} fi(x).

If fi, i ∈ I, are convex, then g is also convex, while if fi, i ∈ I, are closed, then g is also closed.

Proof: (a) Let f1, . . . , fm be convex. We use the definition of convexity to write, for any x, y ∈ <n and α ∈ [0, 1],

g(αx + (1 − α)y) = ∑_{i=1}^m λifi(αx + (1 − α)y)
≤ ∑_{i=1}^m λi(αfi(x) + (1 − α)fi(y))
= α∑_{i=1}^m λifi(x) + (1 − α)∑_{i=1}^m λifi(y)
= αg(x) + (1 − α)g(y).

Hence g is convex.

Let f1, . . . , fm be closed. Then the fi are lower semicontinuous at every x ∈ <n [cf. Prop. 1.2.2(b)], so for every sequence {xk} converging to x, we have fi(x) ≤ lim inf_{k→∞} fi(xk) for all i. Hence

g(x) ≤ ∑_{i=1}^m λi lim inf_{k→∞} fi(xk) ≤ lim inf_{k→∞} ∑_{i=1}^m λifi(xk) = lim inf_{k→∞} g(xk),

where we have used Prop. 1.1.4(d) (the sum of the limit inferiors of sequences is less than or equal to the limit inferior of the sum sequence). Therefore, g is lower semicontinuous at all x ∈ <n, so by Prop. 1.2.2(b), it is closed.

(b) This is straightforward, along the lines of the proof of part (a).

(c) A pair (x, w) belongs to the epigraph

epi(g) = {(x, w) | g(x) ≤ w}

if and only if fi(x) ≤ w for all i ∈ I, or (x, w) ∈ ∩i∈I epi(fi). Therefore,

epi(g) = ∩i∈I epi(fi).

If the fi are convex, the epigraphs epi(fi) are convex, so epi(g) is convex, and g is convex. If the fi are closed, then the epigraphs epi(fi) are closed, so epi(g) is closed, and g is closed. Q.E.D.
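As an illustration of part (c), the following randomized check (with arbitrary data of our own choosing) verifies that the pointwise maximum of finitely many affine functions, each convex and closed, satisfies the convexity inequality:

```python
import numpy as np

# g(x) = max_i (a_i' x + b_i): a pointwise supremum of affine functions.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))      # rows are the a_i'
b = rng.standard_normal(5)

def g(x):
    return np.max(A @ x + b)

for _ in range(1000):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    a = rng.random()
    assert g(a * x + (1 - a) * y) <= a * g(x) + (1 - a) * g(y) + 1e-12
print("no violation found")
```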

Characterizations of Differentiable Convex Functions

For differentiable functions, there is an alternative characterization of convexity, given in the following proposition and illustrated in Fig. 1.2.4.

Proposition 1.2.4: Let C ⊂ <n be a convex set and let f : <n 7→ < be differentiable over <n.

(a) f is convex over C if and only if

f(z) ≥ f(x) + (z − x)′∇f(x), ∀ x, z ∈ C. (1.4)

(b) f is strictly convex over C if and only if the above inequality is strict whenever x ≠ z.

Proof: We prove (a) and (b) simultaneously. Assume that the inequality (1.4) holds. Choose any x, y ∈ C and α ∈ [0, 1], and let z = αx + (1 − α)y. Using the inequality (1.4) twice, we obtain

f(x) ≥ f(z) + (x − z)′∇f(z),


Figure 1.2.4. Characterization of convexity in terms of first derivatives. The condition f(z) ≥ f(x) + (z − x)′∇f(x) states that a linear approximation, based on the first order Taylor series expansion, underestimates a convex function.

f(y) ≥ f(z) + (y − z)′∇f(z).

We multiply the first inequality by α, the second by (1 − α), and add them to obtain

αf(x) + (1 − α)f(y) ≥ f(z) + (αx + (1 − α)y − z)′∇f(z) = f(z),

which proves that f is convex. If the inequality (1.4) is strict as stated in part (b), then if we take x ≠ y and α ∈ (0, 1) above, the three preceding inequalities become strict, thus showing the strict convexity of f.

Conversely, assume that f is convex, let x and z be any vectors in C with x ≠ z, and consider the function

g(α) = (f(x + α(z − x)) − f(x))/α, α ∈ (0, 1].

We will show that g(α) is monotonically decreasing with α, and is strictly monotonically decreasing if f is strictly convex. This will imply that

(z − x)′∇f(x) = lim_{α↓0} g(α) ≤ g(1) = f(z) − f(x),

with strict inequality if g is strictly monotonically decreasing, thereby showing that the desired inequality (1.4) holds (and holds strictly if f is strictly convex). Indeed, consider any α1, α2, with 0 < α1 < α2 < 1, and let

ᾱ = α1/α2,  z̄ = x + α2(z − x). (1.5)

We have

f(x + ᾱ(z̄ − x)) ≤ ᾱf(z̄) + (1 − ᾱ)f(x),


or

(f(x + ᾱ(z̄ − x)) − f(x))/ᾱ ≤ f(z̄) − f(x), (1.6)

and the above inequalities are strict if f is strictly convex. Substituting the definitions (1.5) in Eq. (1.6), we obtain after a straightforward calculation

(f(x + α1(z − x)) − f(x))/α1 ≤ (f(x + α2(z − x)) − f(x))/α2,

or

g(α1) ≤ g(α2),

with strict inequality if f is strictly convex. Hence g is monotonically decreasing with α, and strictly so if f is strictly convex. Q.E.D.

Note a simple consequence of Prop. 1.2.4(a): if f : <n 7→ < is a convex function and ∇f(x∗) = 0, then x∗ minimizes f over <n. This is a classical sufficient condition for unconstrained optimality, originally formulated (in one dimension) by Fermat in 1637.
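A minimal numerical illustration of this sufficient condition, for a convex quadratic whose stationary point can be found by solving a linear system (the data below are arbitrary):

```python
import numpy as np

# For f(x) = (1/2) x'Qx - b'x with Q positive definite, solving
# grad f(x*) = Qx* - b = 0 gives the global minimum (cf. Prop. 1.2.4).
rng = np.random.default_rng(2)
M = rng.standard_normal((4, 4))
Q = M @ M.T + np.eye(4)              # positive definite
b = rng.standard_normal(4)

x_star = np.linalg.solve(Q, b)       # stationary point

def f(x):
    return 0.5 * x @ Q @ x - b @ x

# f at random perturbations never goes below f(x_star):
vals = [f(x_star + rng.standard_normal(4)) for _ in range(1000)]
print(f(x_star) <= min(vals))        # True
```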

For twice differentiable convex functions, there is another characterization of convexity, as shown by the following proposition.

Proposition 1.2.5: Let C ⊂ <n be a convex set and let f : <n 7→ < be twice continuously differentiable over <n.

(a) If ∇2f(x) is positive semidefinite for all x ∈ C, then f is convex over C.

(b) If ∇2f(x) is positive definite for all x ∈ C, then f is strictly convex over C.

(c) If C = <n and f is convex, then ∇2f(x) is positive semidefinite for all x.

Proof: (a) By Prop. 1.1.21(b), for all x, y ∈ C we have

f(y) = f(x) + (y − x)′∇f(x) + (1/2)(y − x)′∇2f(x + α(y − x))(y − x)

for some α ∈ [0, 1]. Therefore, using the positive semidefiniteness of ∇2f, we obtain

f(y) ≥ f(x) + (y − x)′∇f(x), ∀ x, y ∈ C.

From Prop. 1.2.4(a), we conclude that f is convex.

(b) Similar to the proof of part (a), we have f(y) > f(x) + (y − x)′∇f(x) for all x, y ∈ C with x ≠ y, and the result follows from Prop. 1.2.4(b).


(c) Suppose that f : <n 7→ < is convex and suppose, to obtain a contradiction, that there exist some x ∈ <n and some z ∈ <n such that z′∇2f(x)z < 0. Using the continuity of ∇2f, we see that we can choose the norm of z to be small enough so that z′∇2f(x + αz)z < 0 for every α ∈ [0, 1]. Then, using again Prop. 1.1.21(b), we obtain f(x + z) < f(x) + z′∇f(x), which, in view of Prop. 1.2.4(a), contradicts the convexity of f. Q.E.D.
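Prop. 1.2.5(a) also suggests a crude numerical convexity test: sample points and check that the Hessian has no negative eigenvalues. A sketch for the log-sum-exp function, whose Hessian is known in closed form (our choice of example, not from the text):

```python
import numpy as np

# f(x) = log(sum(exp(x))): its Hessian is diag(p) - p p', with p the
# softmax of x, and is positive semidefinite everywhere.
def hess(x):
    p = np.exp(x) / np.sum(np.exp(x))
    return np.diag(p) - np.outer(p, p)

rng = np.random.default_rng(3)
min_eig = min(np.linalg.eigvalsh(hess(rng.standard_normal(2))).min()
              for _ in range(1000))
print(min_eig >= -1e-12)   # True: no negative eigenvalue at any sample
```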

If f is convex over a strict subset C ⊂ <n, it is not necessarily true that ∇2f(x) is positive semidefinite at any point of C [take for example n = 2, C = {(x1, 0) | x1 ∈ <}, and f(x) = x1^2 − x2^2]. A strengthened version of Prop. 1.2.5 is given in the exercises. It can be shown that the conclusion of Prop. 1.2.5(c) also holds if C is assumed to have nonempty interior instead of being equal to <n.

The following proposition considers a strengthened form of strict convexity, characterized by the condition

(∇f(x) − ∇f(y))′(x − y) ≥ α‖x − y‖^2, ∀ x, y ∈ <n, (1.7)

where α is some positive number. Convex functions with this property are called strongly convex with coefficient α.

Proposition 1.2.6: (Strong Convexity) Let f : <n 7→ < be smooth. If f is strongly convex with coefficient α, then f is strictly convex. Furthermore, if f is twice continuously differentiable, then strong convexity of f with coefficient α is equivalent to the positive semidefiniteness of ∇2f(x) − αI for every x ∈ <n, where I is the identity matrix.

Proof: Fix some x, y ∈ <n such that x ≠ y, and define the function h : < 7→ < by h(t) = f(x + t(y − x)). Consider some t, t′ ∈ < such that t < t′. Using the chain rule and Eq. (1.7), we have

((dh/dt)(t′) − (dh/dt)(t))(t′ − t) = (∇f(x + t′(y − x)) − ∇f(x + t(y − x)))′(y − x)(t′ − t) ≥ α(t′ − t)^2 ‖x − y‖^2 > 0.

Thus, dh/dt is strictly increasing and for any t ∈ (0, 1), we have

(h(t) − h(0))/t = (1/t)∫_0^t (dh/dτ)(τ) dτ < (1/(1 − t))∫_t^1 (dh/dτ)(τ) dτ = (h(1) − h(t))/(1 − t).

Equivalently, th(1) + (1 − t)h(0) > h(t). The definition of h yields tf(y) + (1 − t)f(x) > f(ty + (1 − t)x). Since this inequality has been proved for arbitrary t ∈ (0, 1) and x ≠ y, we conclude that f is strictly convex.


Suppose now that f is twice continuously differentiable and Eq. (1.7) holds. Let c be a scalar. We use Prop. 1.1.21(b) twice to obtain

f(x + cy) = f(x) + cy′∇f(x) + (c^2/2)y′∇2f(x + tcy)y,

and

f(x) = f(x + cy) − cy′∇f(x + cy) + (c^2/2)y′∇2f(x + scy)y,

for some t and s belonging to [0, 1]. Adding these two equations and using Eq. (1.7), we obtain

(c^2/2)y′(∇2f(x + scy) + ∇2f(x + tcy))y = (∇f(x + cy) − ∇f(x))′(cy) ≥ αc^2‖y‖^2.

We divide both sides by c^2 and then take the limit as c → 0 to conclude that y′∇2f(x)y ≥ α‖y‖^2. Since this inequality is valid for every y ∈ <n, it follows that ∇2f(x) − αI is positive semidefinite.

For the converse, assume that ∇2f(x) − αI is positive semidefinite for all x ∈ <n. Consider the function g : < 7→ < defined by

g(t) = ∇f(tx + (1 − t)y)′(x − y).

Using the Mean Value Theorem (Prop. 1.1.20), we have (∇f(x) − ∇f(y))′(x − y) = g(1) − g(0) = (dg/dt)(t̄) for some t̄ ∈ [0, 1]. The result follows because

(dg/dt)(t̄) = (x − y)′∇2f(t̄x + (1 − t̄)y)(x − y) ≥ α‖x − y‖^2,

where the last inequality is a consequence of the positive semidefiniteness of ∇2f(t̄x + (1 − t̄)y) − αI. Q.E.D.

As an example, consider the quadratic function

f(x) = x′Qx,

where Q is a symmetric matrix. By Prop. 1.2.5, the function f is convex if and only if Q is positive semidefinite. Furthermore, by Prop. 1.2.6, f is strongly convex with coefficient α if and only if ∇2f(x) − αI = 2Q − αI is positive semidefinite for some α > 0. Thus f is strongly convex with some positive coefficient (as well as strictly convex) if and only if Q is positive definite.
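This equivalence is easy to probe numerically. In the sketch below (illustrative data of our own), α is taken as the smallest eigenvalue of 2Q, and inequality (1.7) is then checked at random point pairs:

```python
import numpy as np

# f(x) = x'Qx with Q symmetric positive definite: grad f(x) = 2Qx, and
# the largest valid strong convexity coefficient is lambda_min(2Q).
rng = np.random.default_rng(4)
M = rng.standard_normal((3, 3))
Q = M @ M.T + 0.1 * np.eye(3)
alpha = np.linalg.eigvalsh(2 * Q).min()

grad = lambda x: 2 * Q @ x
for _ in range(1000):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    lhs = (grad(x) - grad(y)) @ (x - y)   # equals (x-y)' 2Q (x-y)
    assert lhs >= alpha * np.sum((x - y) ** 2) - 1e-9
print("inequality (1.7) holds with alpha =", alpha)
```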


1.2.2 Convex and Affine Hulls

Let X be a subset of <n. A convex combination of elements of X is a vector of the form ∑_{i=1}^m αixi, where m is a positive integer, x1, . . . , xm belong to X, and α1, . . . , αm are scalars such that

αi ≥ 0, i = 1, . . . , m,  ∑_{i=1}^m αi = 1.

Note that if X is convex, then the convex combination ∑_{i=1}^m αixi belongs to X (this is easily shown by induction; see the exercises), and for any function f : <n 7→ < that is convex over X, we have

f(∑_{i=1}^m αixi) ≤ ∑_{i=1}^m αif(xi). (1.8)

This follows by using repeatedly the definition of convexity. The preceding relation is a special case of Jensen’s inequality and can be used to prove a number of interesting inequalities in applied mathematics and probability theory.
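A randomized numerical check of the inequality (1.8), with f(x) = ‖x‖^2 (an illustrative choice of ours):

```python
import numpy as np

# Jensen's inequality (1.8) for f(x) = ||x||^2 and random convex
# combinations of random points.
rng = np.random.default_rng(5)
f = lambda x: np.sum(x ** 2)

for _ in range(1000):
    pts = rng.standard_normal((6, 3))        # x_1, ..., x_6 in R^3
    a = rng.random(6)
    a /= a.sum()                             # convex combination weights
    assert f(a @ pts) <= np.dot(a, [f(p) for p in pts]) + 1e-12
print("Jensen's inequality verified on random samples")
```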

The convex hull of a set X, denoted conv(X), is the intersection of all convex sets containing X, and is a convex set by Prop. 1.2.1(a). It is straightforward to verify that the set of all convex combinations of elements of X is convex, and is equal to the convex hull conv(X) (see the exercises). In particular, if X consists of a finite number of vectors x1, . . . , xm, its convex hull is

conv({x1, . . . , xm}) = {∑_{i=1}^m αixi | αi ≥ 0, i = 1, . . . , m, ∑_{i=1}^m αi = 1}.

We recall that an affine set M is a set of the form x + S, where S is a subspace, called the subspace parallel to M. If X is a subset of <n, the affine hull of X, denoted aff(X), is the intersection of all affine sets containing X. Note that aff(X) is itself an affine set and that it contains conv(X). It can be seen that the affine hull of X, the affine hull of the convex hull conv(X), and the affine hull of the closure cl(X) coincide (see the exercises). For a convex set C, the dimension of C is defined to be the dimension of aff(C).

Given a subset X ⊂ <n, a nonnegative combination of elements of X is a vector of the form ∑_{i=1}^m αixi, where m is a positive integer, x1, . . . , xm belong to X, and α1, . . . , αm are nonnegative scalars. If the scalars αi are all positive, the combination ∑_{i=1}^m αixi is said to be positive. The cone generated by X, denoted by cone(X), is the set of nonnegative combinations of elements of X. It is easily seen that cone(X) is a convex cone, although it need not be closed [cone(X) can be shown to be closed in special cases, such as when X is a finite set – this is one of the central results of polyhedral convexity and will be shown in Section 1.6].

The following is a fundamental characterization of convex hulls.

Proposition 1.2.7: (Caratheodory’s Theorem) Let X be a subset of <n.

(a) Every x in conv(X) can be represented as a convex combination of vectors x1, . . . , xm ∈ X such that x2 − x1, . . . , xm − x1 are linearly independent, where m is a positive integer with m ≤ n + 1.

(b) Every x in cone(X) can be represented as a positive combination of vectors x1, . . . , xm ∈ X that are linearly independent, where m is a positive integer with m ≤ n.

Proof: (a) Let x be a vector in the convex hull of X, and let m be the smallest integer such that x has the form ∑_{i=1}^m αixi, where ∑_{i=1}^m αi = 1, αi > 0, and xi ∈ X for all i = 1, . . . , m. The m − 1 vectors x2 − x1, . . . , xm − x1 belong to the subspace parallel to aff(X). Assume, to arrive at a contradiction, that these vectors are linearly dependent. Then there must exist scalars λ2, . . . , λm, at least one of which is positive, such that

∑_{i=2}^m λi(xi − x1) = 0.

Letting µi = λi for i = 2, . . . , m and µ1 = −∑_{i=2}^m λi, we see that

∑_{i=1}^m µixi = 0,  ∑_{i=1}^m µi = 0,

while at least one of the scalars µ2, . . . , µm is positive. Define

ᾱi = αi − γ̄µi, i = 1, . . . , m,

where γ̄ > 0 is the largest γ such that αi − γµi ≥ 0 for all i. Then, since ∑_{i=1}^m µixi = 0, we see that x is also represented as ∑_{i=1}^m ᾱixi. Furthermore, in view of the choice of γ̄ and the fact ∑_{i=1}^m µi = 0, the coefficients ᾱi are nonnegative, sum to one, and at least one of them is zero. Thus, x can be represented as a convex combination of fewer than m vectors of X, contradicting our earlier assumption. It follows that the vectors x2 − x1, . . . , xm − x1 must be linearly independent, so their number must be at most n. Hence m ≤ n + 1.


(b) Let x be a nonzero vector in cone(X), and let m be the smallest integer such that x has the form ∑_{i=1}^m αixi, where αi > 0 and xi ∈ X for all i = 1, . . . , m. If the vectors xi were linearly dependent, there would exist scalars λ1, . . . , λm, with ∑_{i=1}^m λixi = 0 and at least one of the λi positive. Consider the combination ∑_{i=1}^m (αi − γ̄λi)xi, where γ̄ is the largest γ such that αi − γλi ≥ 0 for all i. This combination provides a representation of x as a positive combination of fewer than m vectors of X – a contradiction. Since any linearly independent set of vectors contains at most n elements, we must have m ≤ n. Q.E.D.

It is not generally true that the convex hull of a closed set is closed [take for instance the convex hull of the set consisting of the origin and the subset {(x1, x2) | x1x2 = 1, x1 ≥ 0, x2 ≥ 0} of <2]. We have, however, the following.

Proposition 1.2.8: The convex hull of a compact set is compact.

Proof: Let X be a compact subset of <n. By Caratheodory’s Theorem, a sequence in conv(X) can be expressed as {∑_{i=1}^{n+1} α_i^k x_i^k}, where for all k and i, α_i^k ≥ 0, x_i^k ∈ X, and ∑_{i=1}^{n+1} α_i^k = 1. Since the sequence

{(α_1^k, . . . , α_{n+1}^k, x_1^k, . . . , x_{n+1}^k)}

belongs to a compact set, it has a limit point (α1, . . . , αn+1, x1, . . . , xn+1) such that ∑_{i=1}^{n+1} αi = 1, and for all i, αi ≥ 0 and xi ∈ X. Thus, the vector ∑_{i=1}^{n+1} αixi, which belongs to conv(X), is a limit point of the sequence {∑_{i=1}^{n+1} α_i^k x_i^k}, showing that conv(X) is compact. Q.E.D.

1.2.3 Closure, Relative Interior, and Continuity

We now consider some generic topological properties of convex sets and functions. Let C be a nonempty convex subset of <n. The closure cl(C) of C is also a nonempty convex set (Prop. 1.2.1). While the interior of C may be empty, it turns out that convexity implies the existence of interior points relative to the affine hull of C. This is an important property, which we now formalize.

We say that x is a relative interior point of C if x ∈ C and there exists a neighborhood N of x such that N ∩ aff(C) ⊂ C, i.e., x is an interior point of C relative to aff(C). The relative interior of C, denoted ri(C), is the set of all relative interior points of C. For example, if C is a line segment connecting two distinct points in the plane, then ri(C) consists of all points of C except for the end points.


The following proposition gives some basic facts about relative interior points.

Proposition 1.2.9: Let C be a nonempty convex set.

(a) (Line Segment Principle) If x ∈ ri(C) and x̄ ∈ cl(C), then all points on the line segment connecting x and x̄, except possibly x̄, belong to ri(C).

(b) (Nonemptiness of Relative Interior) ri(C) is a nonempty and convex set, and has the same affine hull as C. In fact, if m is the dimension of aff(C) and m > 0, there exist vectors x0, x1, . . . , xm ∈ ri(C) such that x1 − x0, . . . , xm − x0 span the subspace parallel to aff(C).

(c) x ∈ ri(C) if and only if every line segment in C having x as one endpoint can be prolonged beyond x without leaving C [i.e., for every x̄ ∈ C, there exists a γ > 1 such that x + (γ − 1)(x − x̄) ∈ C].

Proof: (a) In the case where x̄ ∈ C, see Fig. 1.2.5. In the case where x̄ /∈ C, to show that for any α ∈ (0, 1] we have xα = αx + (1 − α)x̄ ∈ ri(C), consider a sequence {x̄k} ⊂ C that converges to x̄, and let xk,α = αx + (1 − α)x̄k. Then, as in Fig. 1.2.5, we see that {z | ‖z − xk,α‖ < αε} ∩ aff(C) ⊂ C for all k. Since for large enough k, we have

{z | ‖z − xα‖ < αε/2} ⊂ {z | ‖z − xk,α‖ < αε},

it follows that {z | ‖z − xα‖ < αε/2} ∩ aff(C) ⊂ C, which shows that xα ∈ ri(C).

(b) Convexity of ri(C) follows from the line segment principle of part (a). By using a translation argument if necessary, we assume without loss of generality that 0 ∈ C. Then, the affine hull of C is a subspace of dimension m. If m = 0, then C and aff(C) consist of a single point, which is a unique relative interior point. If m > 0, we can find m linearly independent vectors z1, . . . , zm in C that span aff(C); otherwise there would exist r < m linearly independent vectors in C whose span contains C, contradicting the fact that the dimension of aff(C) is m. Thus z1, . . . , zm form a basis for aff(C).

Consider the set

X = {x | x = ∑_{i=1}^m αizi, ∑_{i=1}^m αi < 1, αi > 0, i = 1, . . . , m}

(see Fig. 1.2.6). This set is open relative to aff(C); that is, for every x ∈ X, there exists an open set N such that x ∈ N and N ∩ aff(C) ⊂ X. [To see this, note that X is the inverse image of the open set in <m

{(α1, . . . , αm) | ∑_{i=1}^m αi < 1, αi > 0, i = 1, . . . , m}

under the linear transformation from aff(C) to <m that maps ∑_{i=1}^m αizi into (α1, . . . , αm); openness of the above set follows by continuity of the linear transformation.] Therefore all points of X are relative interior points of C, and ri(C) is nonempty. Since by construction, aff(X) = aff(C) and X ⊂ ri(C), it follows that ri(C) and C have the same affine hull.

Figure 1.2.5. Proof of the line segment principle for the case where x̄ ∈ C. Since x ∈ ri(C), there exists a sphere S = {z | ‖z − x‖ < ε} such that S ∩ aff(C) ⊂ C. For all α ∈ (0, 1], let xα = αx + (1 − α)x̄ and let Sα = {z | ‖z − xα‖ < αε}. It can be seen that each point of Sα ∩ aff(C) is a convex combination of x̄ and some point of S ∩ aff(C). Therefore, Sα ∩ aff(C) ⊂ C, implying that xα ∈ ri(C).

To show the last assertion of part (b), consider vectors

x0 = α∑_{i=1}^m zi,  xi = x0 + αzi, i = 1, . . . , m,

where α is a positive scalar such that α(m + 1) < 1. The vectors x0, . . . , xm are in the set X and in the relative interior of C, since X ⊂ ri(C). Furthermore, because xi − x0 = αzi for all i and the vectors z1, . . . , zm span aff(C), the vectors x1 − x0, . . . , xm − x0 also span aff(C).

(c) If x ∈ ri(C), the condition given clearly holds. Conversely, let x satisfy the given condition. We will show that x ∈ ri(C). By part (b), there exists a vector x̄ ∈ ri(C). We may assume that x̄ ≠ x, since otherwise we are done. By the given condition, since x̄ is in C, there is a γ > 1 such that y = x + (γ − 1)(x − x̄) ∈ C. Then we have x = (1 − α)x̄ + αy, where α = 1/γ ∈ (0, 1), so by part (a), we obtain x ∈ ri(C). Q.E.D.


Figure 1.2.6. Construction of the relatively open set X in the proof of nonemptiness of the relative interior of a convex set C that contains the origin. We choose m linearly independent vectors z1, . . . , zm ∈ C, where m is the dimension of aff(C), and let

X = {∑_{i=1}^m αizi | ∑_{i=1}^m αi < 1, αi > 0, i = 1, . . . , m}.

In view of Prop. 1.2.9(b), C and ri(C) have the same dimension. It can also be shown that C and cl(C) have the same dimension (see the exercises). The next proposition gives several properties of closures and relative interiors of convex sets.

Proposition 1.2.10: Let C be a nonempty convex set.

(a) cl(C) = cl(ri(C)).

(b) ri(C) = ri(cl(C)).

(c) Let C̄ be another nonempty convex set. Then the following three conditions are equivalent:

(i) C and C̄ have the same relative interior.

(ii) C and C̄ have the same closure.

(iii) ri(C) ⊂ C̄ ⊂ cl(C).

(d) A · ri(C) = ri(A · C) for all m × n matrices A.

(e) If C is bounded, then A · cl(C) = cl(A · C) for all m × n matrices A.

Proof: (a) Since ri(C) ⊂ C, we have cl(ri(C)) ⊂ cl(C). Conversely, let x ∈ cl(C). We will show that x ∈ cl(ri(C)). Let x̄ be any point in ri(C) [there exists such a point by Prop. 1.2.9(b)], and assume that x̄ ≠ x (otherwise we are done). By the line segment principle [Prop. 1.2.9(a)], we have αx̄ + (1 − α)x ∈ ri(C) for all α ∈ (0, 1]. Thus, x is the limit of the sequence {(1/k)x̄ + (1 − 1/k)x | k ≥ 1} that lies in ri(C), so x ∈ cl(ri(C)).

(b) Since C ⊂ cl(C), we must have ri(C) ⊂ ri(cl(C)). To prove the reverse inclusion, let z ∈ ri(cl(C)). We will show that z ∈ ri(C). By Prop. 1.2.9(b), there exists an x̄ ∈ ri(C). We may assume that x̄ ≠ z (otherwise we are done). We choose γ > 1, with γ sufficiently close to 1 so that the vector y = z + (γ − 1)(z − x̄) belongs to ri(cl(C)) [cf. Prop. 1.2.9(c)], and hence also to cl(C). Then we have z = (1 − α)x̄ + αy, where α = 1/γ ∈ (0, 1), so by the line segment principle [Prop. 1.2.9(a)], we obtain z ∈ ri(C).

(c) If ri(C) = ri(C̄), part (a) implies that cl(C) = cl(C̄). Similarly, if cl(C) = cl(C̄), part (b) implies that ri(C) = ri(C̄). Furthermore, if these conditions hold, the relation ri(C) ⊂ C̄ ⊂ cl(C) implies condition (iii). Finally, assume that condition (iii) holds. Then by taking closures, we have cl(ri(C)) ⊂ cl(C̄) ⊂ cl(C), and by using part (a), we obtain cl(C) ⊂ cl(C̄) ⊂ cl(C). Hence C and C̄ have the same closure.

(d) For any set X, we have A · cl(X) ⊂ cl(A · X), since if a sequence {xk} ⊂ X converges to some x ∈ cl(X), then the sequence {Axk} ⊂ A · X converges to Ax, implying that Ax ∈ cl(A · X). We use this fact and part (a) to write

A · ri(C) ⊂ A · C ⊂ A · cl(C) = A · cl(ri(C)) ⊂ cl(A · ri(C)).

Thus A · C lies between the set A · ri(C) and the closure of that set, implying that the relative interiors of the sets A · C and A · ri(C) are equal [part (c)]. Hence ri(A · C) ⊂ A · ri(C). We will show the reverse inclusion by taking any z ∈ A · ri(C) and showing that z ∈ ri(A · C). Let x̄ be any vector in A · C, and let z̄ ∈ ri(C) and x ∈ C be such that Az̄ = z and Ax = x̄. By Prop. 1.2.9(c), there exists γ > 1 such that the vector y = z̄ + (γ − 1)(z̄ − x) belongs to C. Thus we have Ay ∈ A · C and Ay = z + (γ − 1)(z − x̄), so by Prop. 1.2.9(c) it follows that z ∈ ri(A · C).

(e) By the argument given in part (d), we have A · cl(C) ⊂ cl(A · C). To show the converse, choose any x ∈ cl(A · C). Then, there exists a sequence {xk} ⊂ C such that Axk → x. Since C is bounded, {xk} has a subsequence that converges to some x̄ ∈ cl(C), and we must have Ax̄ = x. It follows that x ∈ A · cl(C). Q.E.D.

Note that if C is closed but unbounded, the set A · C need not be closed [cf. part (e) of the above proposition]. For example, take the closed set C = {(x1, x2) | x1x2 ≥ 1, x1 ≥ 0, x2 ≥ 0}, and let A have the effect of projecting the typical vector x on the horizontal axis, i.e., A(x1, x2) = (x1, 0). Then A · C is the (nonclosed) halfline {(x1, x2) | x1 > 0, x2 = 0}.


Proposition 1.2.11: Let C1 and C2 be nonempty convex sets.

(a) Assume that the sets ri(C1) and ri(C2) have a nonempty intersection. Then

cl(C1 ∩ C2) = cl(C1) ∩ cl(C2),  ri(C1 ∩ C2) = ri(C1) ∩ ri(C2).

(b) ri(C1 + C2) = ri(C1) + ri(C2).

Proof: (a) Let y ∈ cl(C1) ∩ cl(C2), and let x ∈ ri(C1) ∩ ri(C2). By the line segment principle [Prop. 1.2.9(a)], the vector αx + (1 − α)y belongs to ri(C1) ∩ ri(C2) for all α ∈ (0, 1]. Hence y is the limit of a sequence {αkx + (1 − αk)y} ⊂ ri(C1) ∩ ri(C2) with αk → 0, implying that y ∈ cl(ri(C1) ∩ ri(C2)). Hence, we have

cl(C1) ∩ cl(C2) ⊂ cl(ri(C1) ∩ ri(C2)) ⊂ cl(C1 ∩ C2).

Also, C1 ∩ C2 is contained in cl(C1) ∩ cl(C2), which is a closed set, so we have

cl(C1 ∩ C2) ⊂ cl(C1) ∩ cl(C2).

Thus, equality holds throughout in the preceding two relations, so that cl(C1 ∩ C2) = cl(C1) ∩ cl(C2). Furthermore, the sets ri(C1) ∩ ri(C2) and C1 ∩ C2 have the same closure. Therefore, by Prop. 1.2.10(c), they have the same relative interior, implying that

ri(C1 ∩ C2) ⊂ ri(C1) ∩ ri(C2).

To show the converse, take any x ∈ ri(C1) ∩ ri(C2) and any y ∈ C1 ∩ C2. By Prop. 1.2.9(c), the line segment connecting x and y can be prolonged beyond x by a small amount without leaving C1 ∩ C2. By the same proposition, it follows that x ∈ ri(C1 ∩ C2).

(b) Consider the linear transformation A : <2n 7→ <n given by A(x1, x2) = x1 + x2 for all x1, x2 ∈ <n. The relative interior of the Cartesian product C1 × C2 (viewed as a subset of <2n) is easily seen to be ri(C1) × ri(C2). Since A(C1 × C2) = C1 + C2, the result follows from Prop. 1.2.10(d). Q.E.D.

The requirement that ri(C1) ∩ ri(C2) ≠ Ø is essential in part (a) of the above proposition. As an example, consider the subsets of the real line C1 = {x | x > 0} and C2 = {x | x < 0}. Then we have cl(C1 ∩ C2) = Ø ≠ {0} = cl(C1) ∩ cl(C2). Also, consider C1 = {x | x ≥ 0} and C2 = {x | x ≤ 0}. Then we have ri(C1 ∩ C2) = {0} ≠ Ø = ri(C1) ∩ ri(C2).


Continuity of Convex Functions

We close this section with a basic result on the continuity properties of convex functions.

Proposition 1.2.12: If f : <n 7→ < is convex, then it is continuous. More generally, if C ⊂ <n is convex and f : C 7→ < is convex, then f is continuous over the relative interior of C.

Proof: Restricting attention to the affine hull of C and using a transformation argument if necessary, we assume without loss of generality that the origin is an interior point of C and that the unit cube X = {x | ‖x‖∞ ≤ 1} is contained in C. It will suffice to show that f is continuous at 0, i.e., that for any sequence {xk} ⊂ <n that converges to 0, we have f(xk) → f(0).

Let ei, i = 1, . . . , 2^n, be the corners of X, i.e., each ei is a vector whose entries are either 1 or −1. It is not difficult to see that any x ∈ X can be expressed in the form x = ∑_{i=1}^{2^n} αiei, where each αi is a nonnegative scalar and ∑_{i=1}^{2^n} αi = 1. Let A = max_i f(ei). From Jensen’s inequality [Eq. (1.8)], it follows that f(x) ≤ A for every x ∈ X.

For the purpose of proving continuity at zero, we can assume that xk ∈ X and xk ≠ 0 for all k. Consider the sequences {yk} and {zk} given by

yk = xk/‖xk‖∞,  zk = −xk/‖xk‖∞

(cf. Fig. 1.2.7). Using the definition of a convex function for the line segment that connects yk, xk, and 0, we have

f(xk) ≤ (1 − ‖xk‖∞)f(0) + ‖xk‖∞f(yk).

We have ‖xk‖∞ → 0 while f(yk) ≤ A for all k, so by taking the limit as k → ∞, we obtain

lim sup_{k→∞} f(xk) ≤ f(0).

Using the definition of a convex function for the line segment that connects xk, 0, and zk, we have

f(0) ≤ (‖xk‖∞/(‖xk‖∞ + 1))f(zk) + (1/(‖xk‖∞ + 1))f(xk),

and letting k → ∞, we obtain

f(0) ≤ lim inf_{k→∞} f(xk).

Thus, lim_{k→∞} f(xk) = f(0) and f is continuous at zero. Q.E.D.

A straightforward consequence of the continuity of a real-valued function f that is convex over <n is that its epigraph as well as the level sets {x | f(x) ≤ γ} are closed and convex (cf. Prop. 1.2.2).


Figure 1.2.7. Construction for proving continuity of a convex function (cf. Prop. 1.2.12).

1.2.4 Recession Cones

Some of the preceding results [Props. 1.2.8, 1.2.10(e)] have illustrated how boundedness affects the topological properties of sets obtained through various operations on convex sets. In this section we take a closer look at this issue.

Given a convex set C, we say that a vector y is a direction of recession of C if x + αy ∈ C for all x ∈ C and α ≥ 0. In words, y is a direction of recession of C if, starting at any x in C and going indefinitely along y, we never cross the boundary of C to points outside C. The set of all directions of recession is a cone containing the origin. It is called the recession cone of C and it is denoted by RC (see Fig. 1.2.8). This definition implies that the recession cone of the intersection of any collection of sets Ci, i ∈ I, is equal to the corresponding intersection of the recession cones:

R∩i∈ICi = ∩i∈I RCi.

The following proposition gives some additional properties of recession cones.

Proposition 1.2.13: (Recession Cone Theorem) Let C be a nonempty closed convex set.

(a) The recession cone RC is a closed convex cone.

(b) A vector y belongs to RC if and only if there exists a vector x ∈ C such that x + αy ∈ C for all α ≥ 0.

(c) RC contains a nonzero direction if and only if C is unbounded.


Figure 1.2.8. Illustration of the recession cone RC of a convex set C. A direction of recession y has the property that x + αy ∈ C for all x ∈ C and α ≥ 0.

Proof: (a) If y1, y2 belong to RC and λ1, λ2 are positive scalars such that λ1 + λ2 = 1, we have for any x ∈ C and α ≥ 0

x + α(λ1y1 + λ2y2) = λ1(x + αy1) + λ2(x + αy2) ∈ C,

where the last inclusion holds because x + αy1 and x + αy2 belong to C by the definition of RC. Hence λ1y1 + λ2y2 ∈ RC, implying that RC is convex.

Let y be in the closure of RC, and let {yk} ⊂ RC be a sequence converging to y. For any x ∈ C and α ≥ 0 we have x + αyk ∈ C for all k, and because C is closed, we have x + αy ∈ C. This implies that y ∈ RC and that RC is closed.

(b) If y ∈ RC, every vector x ∈ C has the required property by the definition of RC. Conversely, let y be such that there exists a vector x̄ ∈ C with x̄ + αy ∈ C for all α ≥ 0. We fix x ∈ C and ᾱ > 0, and we show that x + ᾱy ∈ C. We may assume that y ≠ 0 (otherwise we are done) and, without loss of generality, that ‖y‖ = 1. Let

zk = x̄ + kᾱy, k = 1, 2, . . . .

If x = zk for some k, then x + ᾱy = x̄ + (k + 1)ᾱy, which belongs to C and we are done. We thus assume that x ≠ zk for all k, and we define

yk = (zk − x)/‖zk − x‖, k = 1, 2, . . .

(see the construction of Fig. 1.2.9). We have

yk = (‖zk − x̄‖/‖zk − x‖) · (zk − x̄)/‖zk − x̄‖ + (x̄ − x)/‖zk − x‖ = (‖zk − x̄‖/‖zk − x‖) · y + (x̄ − x)/‖zk − x‖.

Because {zk} is an unbounded sequence,

‖zk − x̄‖/‖zk − x‖ → 1,  (x̄ − x)/‖zk − x‖ → 0,


Figure 1.2.9. Construction for the proof of Prop. 1.2.13(b).

so by combining the preceding relations, we have yk → y. Thus x + ᾱy is the limit of {x + ᾱyk}. The vector x + ᾱyk lies between x and zk on the line segment connecting x and zk for all k such that ‖zk − x‖ ≥ ᾱ, so by convexity of C, we have x + ᾱyk ∈ C for all sufficiently large k. Using the closure of C, it follows that x + ᾱy must belong to C.

(c) Assuming that C is unbounded, we will show that RC contains a nonzero direction (the reverse is clear). Choose any x ∈ C and any unbounded sequence {zk} ⊂ C. Consider the sequence {yk}, where

yk = (zk − x)/‖zk − x‖,

and let y be a limit point of {yk} (compare with the construction of Fig. 1.2.9). For any fixed α ≥ 0, the vector x + αyk lies between x and zk on the line segment connecting x and zk for all k such that ‖zk − x‖ ≥ α. Hence, by convexity of C, we have x + αyk ∈ C for all sufficiently large k. Since x + αy is a limit point of {x + αyk} and C is closed, we have x + αy ∈ C. Hence, by part (b), the nonzero vector y is a direction of recession. Q.E.D.

Note that part (c) of the above proposition yields a characterization of compact and convex sets, namely that a closed convex set is bounded if and only if RC = {0}. A useful generalization is that, for a nonempty closed convex set C ⊂ <n, a compact set W ⊂ <m, and an m × n matrix A, the set

V = {x ∈ C | Ax ∈ W}

is compact if and only if RC ∩ N(A) = {0}. To see this, note that the recession cone of the set

V̄ = {x ∈ <n | Ax ∈ W}

is N(A) [clearly N(A) ⊂ RV̄; conversely, if x ∈ RV̄ but x /∈ N(A), then for any x̄ ∈ V̄ we would have Ax̄ + αAx ∈ W for all α > 0, which contradicts the boundedness of W]. Hence, the recession cone of V is RC ∩ N(A), so by Prop. 1.2.13(c), V is compact if and only if RC ∩ N(A) = {0}.

One possible use of recession cones is to obtain conditions guaranteeing the closedness of linear transformations and vector sums of convex sets in the absence of boundedness, as in the following two propositions (some refinements are given in the exercises).

Proposition 1.2.14: Let C be a nonempty closed convex subset of <n and let A be an m × n matrix with nullspace denoted by N(A). Assume that RC ∩ N(A) = {0}. Then AC is closed.

Proof: For any y ∈ cl(AC), the set

Cε = {x ∈ C | ‖y − Ax‖ ≤ ε}

is nonempty for all ε > 0. Furthermore, by the discussion following the proof of Prop. 1.2.13, the assumption RC ∩ N(A) = {0} implies that Cε is compact. It follows that the set ∩ε>0 Cε is nonempty, and any x ∈ ∩ε>0 Cε satisfies Ax = y, so y ∈ AC. Q.E.D.

Proposition 1.2.15: Let C1, . . . , Cm be nonempty closed convex subsets of <n such that the equality y1 + · · · + ym = 0 for some vectors yi ∈ RCi implies that yi = 0 for all i = 1, . . . , m. Then the vector sum C1 + · · · + Cm is a closed set.

Proof: Let C be the Cartesian product C1 × · · · × Cm, viewed as a subset of <mn, and let A be the linear transformation that maps a vector (x1, . . . , xm) ∈ <mn into x1 + · · · + xm. We have

RC = RC1 × · · · × RCm

(see the exercises) and

N(A) = {(y1, . . . , ym) | y1 + · · · + ym = 0, yi ∈ <n},

so under the given condition, we obtain RC ∩ N(A) = {0}. Since AC = C1 + · · · + Cm, the result follows by applying Prop. 1.2.14. Q.E.D.

When specialized to just two sets C1 and C2, the above proposition says that if there is no nonzero direction of recession of C1 that is the opposite of a direction of recession of C2, then C1 + C2 is closed. This is true in particular if RC1 = {0}, which is equivalent to C1 being compact [cf. Prop. 1.2.13(c)]. We thus obtain the following proposition.

Proposition 1.2.16: Let C1 and C2 be closed, convex sets. If C1 is bounded, then C1 + C2 is a closed and convex set. If both C1 and C2 are bounded, then C1 + C2 is a convex and compact set.

Proof: Closedness of C1 + C2 follows from the preceding discussion. If both C1 and C2 are bounded, then C1 + C2 is also bounded and hence also compact. Q.E.D.

Note that if C1 and C2 are both closed and unbounded, the vector sum C1 + C2 need not be closed. For example, consider the closed subsets of <2 given by C1 = {(x1, x2) | x1x2 ≥ 1, x1 ≥ 0, x2 ≥ 0} and C2 = {(x1, x2) | x1 = 0}. Then C1 + C2 is the open halfspace {(x1, x2) | x1 > 0}.

E X E R C I S E S

1.2.1

(a) Show that a set is convex if and only if it contains all the convex combinations of its elements.

(b) Show that the convex hull of a set coincides with the set of all the convex combinations of its elements.

1.2.2

Let C be a nonempty set in <n, and let λ1 and λ2 be positive scalars. Show by example that the sets (λ1 + λ2)C and λ1C + λ2C may differ when C is not convex [cf. Prop. 1.2.1].

1.2.3 (Properties of Cones)

(a) For any collection {Ci | i ∈ I} of cones, the intersection ∩i∈ICi is a cone.

(b) The vector sum C1 + C2 of two cones C1 and C2 is a cone.

Page 57: Convexity, Duality, and Lagrange Multipliersceimat/antiguo/sea/apuntes/Chapter1.pdf · alongsome directions and stay withinX forat least anontrivial inter-val. In fact a stronger

Sec. 1.2 Convex Sets and Functions 51

(c) The closure of a cone is a cone.

(d) The image and the inverse image of a cone under a linear transformation is a cone.

(e) For any collection of vectors {ai | i ∈ I}, the set

C = {x | ai′x ≤ 0, i ∈ I}

is a closed convex cone.

1.2.4 (Convex Cones)

Let C1 and C2 be convex cones containing the origin.

(a) Show that

C1 + C2 = conv(C1 ∪ C2).

(b) Consider the set C given by

C = ∪_{α∈[0,1]} ((1 − α)C1 ∩ αC2).

Show that

C = C1 ∩ C2.

1.2.5

Given sets Xi ⊂ <ni, i = 1, . . . , m, let X = X1 × · · · × Xm be their Cartesian product.

(a) Show that the convex hull (closure, affine hull) of X is equal to the Cartesian product of the convex hulls (closures, affine hulls, respectively) of the Xi’s.

(b) Assuming X1, . . . , Xm are convex, show that the relative interior (recession cone) of X is equal to the Cartesian product of the relative interiors (recession cones) of the Xi’s.

1.2.6

Let {Ci | i ∈ I} be an arbitrary collection of convex sets in <n, and let C be the convex hull of the union of the collection. Show that

C = ∪ (∑_{i∈I} αiCi),

where the union is taken over all convex combinations such that only finitely many coefficients αi are nonzero.


1.2.7

Let X be a nonempty set.

(a) Show that X, conv(X), and cl(X) have the same dimension.

(b) Show that cone(X) = cone(conv(X)).

(c) Show that the dimension of conv(X) is at most as large as the dimension of cone(X). Give an example where the dimension of conv(X) is smaller than the dimension of cone(X).

(d) Assuming that the origin belongs to conv(X), show that conv(X) and cone(X) have the same dimension.

1.2.8

Let g be a convex, monotonically nondecreasing function of a single variable [i.e., g(y) ≤ g(ȳ) for y < ȳ], and let f be a convex function defined on a convex set C ⊂ <n. Show that the function h defined by

h(x) = g(f(x))

is convex over C.

1.2.9 (Convex Functions)

Show that the following functions are convex:

(a) f1 : X 7→ < given by

f1(x1, . . . , xn) = −(x1x2 · · · xn)^{1/n},

where X = {(x1, . . . , xn) | x1 ≥ 0, . . . , xn ≥ 0}.

(b) f2(x) = ‖x‖^p with p ≥ 1.

(c) f3(x) = 1/g(x), with g a concave function over <n such that g(x) > 0 for all x.

(d) f4(x) = αf(x) + β, with f a convex function over <n, and α and β scalars such that α ≥ 0.

(e) f5(x) = max{f(x), 0}, with f a convex function over <n.

(f) f6(x) = ‖Ax − b‖, with A an m × n matrix and b a vector in <m.

(g) f7(x) = x′Ax + b′x + β, with A an n × n positive semidefinite symmetric matrix, b a vector in <n, and β a scalar.

(h) f8(x) = e^{βx′Ax}, with A an n × n positive semidefinite symmetric matrix and β a positive scalar.


1.2.10

Use the Line Segment Principle and the method of proof of Prop. 1.2.5(c) to show that if C is a convex set with nonempty interior, and f : <n 7→ < is convex and twice continuously differentiable over C, then ∇2f(x) is positive semidefinite for all x ∈ C.

1.2.11

Let C ⊂ <n be a convex set and let f : <n 7→ < be twice continuously differentiable over C. Let S be the subspace that is parallel to the affine hull of C. Show that f is convex over C if and only if y′∇2f(x)y ≥ 0 for all x ∈ C and y ∈ S.

1.2.12

Let f : <n 7→ < be a differentiable function. Show that f is convex over a convex set C if and only if

(∇f(x) − ∇f(y))′(x − y) ≥ 0, ∀ x, y ∈ C.

Hint: The condition above says that the function f, restricted to the line segment connecting x and y, has monotonically nondecreasing gradient; see also the proof of Prop. 1.2.6.

1.2.13 (Ascent/Descent Behavior of a Convex Function)

Let f : < 7→ < be a convex function of a single variable.

(a) (Monotropic Property) Use the definition of convexity to show that f is “turning upwards” in the sense that if x1, x2, x3 are three scalars such that x1 < x2 < x3, then

(f(x2) − f(x1))/(x2 − x1) ≤ (f(x3) − f(x2))/(x3 − x2).

(b) Use part (a) to show that there are four possibilities as x increases to ∞: (1) f(x) decreases monotonically to −∞, (2) f(x) decreases monotonically to a finite value, (3) f(x) reaches and stays at some value, (4) f(x) increases monotonically to ∞ when x ≥ x̄ for some x̄ ∈ <.

1.2.14 (Arithmetic-Geometric Mean Inequality)

Show that if α1, . . . , αn are positive scalars with ∑_{i=1}^n αi = 1, then for every set of positive scalars x1, . . . , xn, we have

x1^{α1} x2^{α2} · · · xn^{αn} ≤ α1x1 + α2x2 + · · · + αnxn,

with equality if and only if x1 = x2 = · · · = xn. Hint: Show that − ln x is a strictly convex function on (0, ∞).


1.2.15

Use the result of Exercise 1.2.14 to verify Young’s inequality

xy ≤ x^p/p + y^q/q,

where p > 0, q > 0, 1/p + 1/q = 1, x ≥ 0, and y ≥ 0. Then, use Young’s inequality to verify Hölder’s inequality

∑_{i=1}^n |xiyi| ≤ (∑_{i=1}^n |xi|^p)^{1/p} (∑_{i=1}^n |yi|^q)^{1/q}.

1.2.16

Let f : <n+m 7→ < be a convex function. Consider the function h : <n 7→ < given by

h(x) = inf_{u∈U} f(x, u),

where U is a nonempty convex subset of <m. Assuming that h(x) > −∞ for all x ∈ <n, show that h is convex. Hint: There cannot exist α ∈ [0, 1], x1, x2, u1 ∈ U, u2 ∈ U such that h(αx1 + (1 − α)x2) > αf(x1, u1) + (1 − α)f(x2, u2).

1.2.17

(a) Let C be a convex set in <n+1 and let

f(x) = inf{w | (x, w) ∈ C}.

Show that f is convex over <n.

(b) Let f1, . . . , fm be convex functions over <n and let

f(x) = inf{ ∑_{i=1}^{m} fi(xi) | ∑_{i=1}^{m} xi = x }.

Assuming that f(x) > −∞ for all x, show that f is convex over <n.

(c) Let h : <m 7→ < be a convex function and let

f(x) = inf_{By=x} h(y),

where B is an n × m matrix. Assuming that f(x) > −∞ for all x, show that f is convex over the range space of B.

(d) In parts (b) and (c), show by example that if the assumption f(x) > −∞ for all x is violated, then the set {x | f(x) > −∞} need not be convex.


1.2.18

Let {fi | i ∈ I} be an arbitrary collection of convex functions on <n. Define the convex hull of these functions f : <n 7→ < as the pointwise infimum of the collection, i.e.,

f(x) = inf{w | (x, w) ∈ conv(∪_{i∈I} epi(fi))}.

Show that f(x) is given by

f(x) = inf{ ∑_{i∈I} αi fi(xi) | ∑_{i∈I} αi xi = x },

where the infimum is taken over all representations of x as a convex combination of elements xi, such that only finitely many coefficients αi are nonzero.

1.2.19 (Convexification of Nonconvex Functions)

Let X be a nonempty subset of <n, and let f : X 7→ < be a function that is bounded below over X. Define the function F : conv(X) 7→ < by

F(x) = inf{w | (x, w) ∈ conv(epi(f))}.

(a) Show that F is convex over conv(X) and it is given by

F(x) = inf{ ∑_{i=1}^{m} αi f(xi) | ∑_{i=1}^{m} αi xi = x, xi ∈ X, ∑_{i=1}^{m} αi = 1, αi ≥ 0, m ≥ 1 }.

(b) Show that

inf_{x∈conv(X)} F(x) = inf_{x∈X} f(x).

(c) Show that the set of global minima of F over conv(X) includes all global minima of f over X.

1.2.20 (Extension of Caratheodory’s Theorem)

Let X1 and X2 be subsets of <n, and let X = conv(X1) + cone(X2). Show that every vector x in X can be represented in the form

x = ∑_{i=1}^{k} αi xi + ∑_{i=k+1}^{m} αi xi,

where m is a positive integer with m ≤ n + 1, the vectors x1, . . . , xk belong to X1, the vectors xk+1, . . . , xm belong to X2, and the scalars α1, . . . , αm are nonnegative with α1 + · · · + αk = 1. Furthermore, the vectors x2 − x1, . . . , xk − x1, xk+1, . . . , xm are linearly independent.


1.2.21

Let x0, . . . , xm be vectors in <n such that x1 − x0, . . . , xm − x0 are linearly independent. The convex hull of x0, . . . , xm is called an m-dimensional simplex, and x0, . . . , xm are called the vertices of the simplex.

(a) Show that the dimension of a convex set C is the maximum of the dimensions of the simplices included in C.

(b) Use part (a) to show that a nonempty convex set has a nonempty relative interior.

1.2.22

Let X be a bounded subset of <n. Show that

cl(conv(X)) = conv(cl(X)).

In particular, if X is closed and bounded, then conv(X) is closed and bounded (cf. Prop. 1.2.7).

1.2.23

Let C1 and C2 be two nonempty convex sets such that C1 ⊂ C2.

(a) Give an example showing that ri(C1) need not be a subset of ri(C2).

(b) Assuming that the sets ri(C1) and ri(C2) have nonempty intersection, show that ri(C1) ⊂ ri(C2).

(c) Assuming that the sets C1 and ri(C2) have nonempty intersection, show that the set ri(C1) ∩ ri(C2) is nonempty.

1.2.24

Let C be a nonempty convex set.

(a) Show the following refinement of the Line Segment Principle [Prop. 1.2.9(c)]: x ∈ ri(C) if and only if for every x̄ ∈ aff(C), there exists γ > 1 such that x + (γ − 1)(x − x̄) ∈ C.

(b) Assuming that the origin lies in ri(C), show that cone(C) coincides with aff(C).

(c) Show the following extension of part (b) to a nonconvex set: If X is a nonempty set such that the origin lies in the relative interior of conv(X), then cone(X) coincides with aff(X).


1.2.25

Let C be a compact set.

(a) Assuming that C is a convex set not containing the origin on its boundary, show that cone(C) is closed.

(b) Give examples showing that the assertion of part (a) fails if C is unbounded or if C contains the origin on its boundary.

(c) The convexity assumption in part (a) can be relaxed as follows: assuming that conv(C) does not contain the origin on its boundary, show that cone(C) is closed. Hint: Use part (a) and Exercise 1.2.7(b).

1.2.26

(a) Let C be a convex cone. Show that ri(C) is also a convex cone.

(b) Let C = cone({x1, . . . , xm}). Show that

ri(C) = { ∑_{i=1}^{m} αi xi | αi > 0, i = 1, . . . , m }.

1.2.27

Let A be an m × n matrix and let C be a nonempty convex set in <n. Assuming that the inverse image A−1 · C is nonempty, show that

ri(A−1 · C) = A−1 · ri(C),  cl(A−1 · C) = A−1 · cl(C).

[Compare these relations with those of Prop. 1.2.9(d) and (e), respectively.]

1.2.28 (Lipschitz Property of Convex Functions)

Let f : <n 7→ < be a convex function and let X be a bounded set in <n. Show that f has the Lipschitz property over X, i.e., there exists a positive scalar c such that

|f(x) − f(y)| ≤ c‖x − y‖, ∀ x, y ∈ X.

1.2.29

Let C be a closed convex set and let M be an affine set such that the intersection C ∩ M is nonempty and bounded. Show that for every affine set M̄ that is parallel to M, the intersection C ∩ M̄ is bounded when nonempty.


1.2.30 (Recession Cones of Nonclosed Sets)

Let C be a nonempty convex set.

(a) Show by counterexample that part (b) of the Recession Cone Theorem need not hold when C is not closed.

(b) Show that

RC ⊂ Rcl(C),  cl(RC) ⊂ Rcl(C).

Give an example where the inclusion cl(RC) ⊂ Rcl(C) is strict, and another example where cl(RC) = Rcl(C). Also, give an example showing that Rri(C) need not be a subset of RC.

(c) Let C̄ be a closed convex set such that C ⊂ C̄. Show that RC ⊂ RC̄. Give an example showing that the inclusion can fail if C̄ is not closed.

1.2.31 (Recession Cones of Relative Interiors)

Let C be a nonempty convex set.

(a) Show that a vector y belongs to Rri(C) if and only if there exists a vector x ∈ ri(C) such that x + αy ∈ C for every α ≥ 0.

(b) Show that Rri(C) = Rcl(C).

(c) Let C̄ be a relatively open convex set such that C ⊂ C̄. Show that RC ⊂ RC̄. Give an example showing that the inclusion can fail if C̄ is not relatively open. [Compare with Exercise 1.2.30(b).]

Hint: In part (a), follow the proof of Prop. 1.2.12(b). In parts (b) and (c), use the result of part (a).

1.2.32

Let C be a nonempty convex set in <n and let A be an m × n matrix.

(a) Show the following refinement of Prop. 1.2.13: if Rcl(C) ∩ N(A) = {0}, then

cl(A · C) = A · cl(C),  A · Rcl(C) = RA·cl(C).

(b) Give an example showing that A · Rcl(C) and RA·cl(C) can differ when Rcl(C) ∩ N(A) ≠ {0}.

1.2.33 (Lineality Space and Recession Cone)

Let C be a nonempty convex set in <n. Define the lineality space of C, denoted by L, to be the subspace of all vectors y such that simultaneously y ∈ RC and −y ∈ RC.

(a) Show that for every subspace S ⊂ L,

C = (C ∩ S⊥) + S.


(b) Show the following refinement of Prop. 1.2.13 and Exercise 1.2.32: if A is an m × n matrix and Rcl(C) ∩ N(A) is a subspace of L, then

cl(A · C) = A · cl(C),  RA·cl(C) = A · Rcl(C).

1.2.34

This exercise is a refinement of Prop. 1.2.14.

(a) Let C1, . . . , Cm be nonempty closed convex sets in <n such that the equality y1 + · · · + ym = 0 with yi ∈ RCi implies that each yi belongs to the lineality space of Ci. Then the vector sum C1 + · · · + Cm is a closed set and

RC1+···+Cm = RC1 + · · · + RCm.

(b) Show the following extension of part (a) to nonclosed sets: Let C1, . . . , Cm be nonempty convex sets in <n such that the equality y1 + · · · + ym = 0 with yi ∈ Rcl(Ci) implies that each yi belongs to the lineality space of cl(Ci). Then we have

cl(C1 + · · · + Cm) = cl(C1) + · · · + cl(Cm),

Rcl(C1+···+Cm) = Rcl(C1) + · · · + Rcl(Cm).

1.3 CONVEXITY AND OPTIMIZATION

In this section we discuss applications of convexity to some basic optimization issues, such as the existence and uniqueness of global minima. Several other applications, relating to optimality conditions and polyhedral convexity, will be discussed in subsequent sections.

1.3.1 Local and Global Minima

Let X be a nonempty subset of <n and let f : <n 7→ (−∞, ∞] be a function. We say that a vector x∗ ∈ X is a minimum of f over X if f(x∗) = inf_{x∈X} f(x). We also call x∗ a minimizing point or minimizer or global minimum over X. Alternatively, we say that f attains a minimum over X at x∗, and we indicate this by writing

x∗ ∈ arg min_{x∈X} f(x).


We use similar terminology for maxima, i.e., a vector x∗ ∈ X such that f(x∗) = sup_{x∈X} f(x) is said to be a maximum of f over X, and we indicate this by writing

x∗ ∈ arg max_{x∈X} f(x).

If the domain of f is the set X (instead of <n), we also call x∗ a (global) minimum or (global) maximum of f (without the qualifier “over X”).

A basic question in minimization problems is whether an optimal solution exists. This question can often be resolved with the aid of the classical theorem of Weierstrass, which states that a continuous function attains a minimum over a compact set. We will provide a more general version of this theorem, and to this end, we introduce some terminology. We say that a function f : <n 7→ (−∞, ∞] is coercive if

lim_{k→∞} f(xk) = ∞

for every sequence {xk} such that ‖xk‖ → ∞ for some norm ‖ · ‖. Note that, as a consequence of the definition, the level sets {x | f(x) ≤ γ} of a coercive function f are bounded whenever they are nonempty.
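For instance (a worked illustration added here, not in the original text), coercivity and its failure can be seen in two scalar cases:

```latex
f_1(x) = x^2 \colon\quad f_1(x_k) \to \infty \text{ whenever } |x_k| \to \infty,
\text{ so } f_1 \text{ is coercive, with bounded level sets } [-\sqrt{\gamma},\, \sqrt{\gamma}\,].
\\[4pt]
f_2(x) = e^{x} \colon\quad \text{for } x_k = -k,\ |x_k| \to \infty \text{ while } f_2(x_k) \to 0,
\text{ so } f_2 \text{ is not coercive, and its level sets } (-\infty,\, \ln\gamma\,] \text{ are unbounded.}
```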

Proposition 1.3.1: (Weierstrass’ Theorem) Let X be a nonempty subset of <n and let f : <n 7→ (−∞, ∞] be a closed function. Assume that one of the following three conditions holds:

(1) X is compact.

(2) X is closed and f is coercive.

(3) There exists a scalar γ such that the set {x ∈ X | f(x) ≤ γ} is nonempty and compact.

Then, f attains a minimum over X.

Proof: If f(x) = ∞ for all x ∈ X, then every x ∈ X attains the minimum of f over X. Thus, with no loss of generality, we assume that inf_{x∈X} f(x) < ∞. Assume condition (1). Let {xk} ⊂ X be a sequence such that

lim_{k→∞} f(xk) = inf_{x∈X} f(x).

Since X is bounded, this sequence has at least one limit point x∗ [Prop. 1.1.5(a)]. Since f is closed, f is lower semicontinuous at x∗ [cf. Prop. 1.2.2(b)], so that f(x∗) ≤ lim_{k→∞} f(xk) = inf_{x∈X} f(x). Since X is closed, x∗ belongs to X, so we must have f(x∗) = inf_{x∈X} f(x).


Assume condition (2). Consider a sequence {xk} as in the proof under condition (1). Since f is coercive, {xk} must be bounded, and the proof proceeds as under condition (1).

Assume condition (3). If the given γ is equal to inf_{x∈X} f(x), the set of minima of f over X is {x ∈ X | f(x) ≤ γ}, and since by assumption this set is nonempty, we are done. If inf_{x∈X} f(x) < γ, consider a sequence {xk} as in the proof under condition (1). Then, for all k sufficiently large, xk must belong to the set {x ∈ X | f(x) ≤ γ}. Since this set is compact, {xk} must be bounded, and the proof proceeds as under condition (1). Q.E.D.

Note that with appropriate adjustments, the above proposition applies to the existence of maxima of f over X. In particular, if f is upper semicontinuous at all points of X and X is compact, then f attains a maximum over X.
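As a numerical illustration (ours; the functions are chosen for the example and are not from the text), the role of coercivity in condition (2) can be seen over the closed set X = (−∞, 0]:

```python
import numpy as np
from scipy.optimize import brentq

# f(x) = exp(x) is continuous on the closed set X = (-inf, 0], but it is not
# coercive there: f(x) -> 0 as x -> -inf, and the infimum 0 is never attained.
# None of the three conditions of Prop. 1.3.1 applies.
print([float(np.exp(-k)) for k in range(1, 6)])   # 0.367..., 0.135..., ... -> 0

# g(x) = exp(x) + x**2 is coercive over X, so condition (2) applies and a
# minimum exists; we locate it by solving g'(x) = exp(x) + 2x = 0 in (-1, 0).
x_star = brentq(lambda x: np.exp(x) + 2 * x, -1.0, 0.0)
print(x_star, np.exp(x_star) + x_star**2)         # approx -0.3517 and 0.8272
```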

We say that a vector x∗ ∈ X is a local minimum of f over X if there exists some ε > 0 such that f(x∗) ≤ f(x) for every x ∈ X satisfying ‖x − x∗‖ ≤ ε, where ‖ · ‖ is some vector norm. If the domain of f is the set X (instead of <n), we also call x∗ a local minimum (or local maximum) of f (without the qualifier “over X”). Local and global maxima are defined similarly.

An important implication of convexity of f and X is that all local minima are also global, as shown in the following proposition and in Fig. 1.3.1.

Proposition 1.3.2: If X ⊂ <n is a convex set and f : X 7→ < is a convex function, then a local minimum of f is also a global minimum. If in addition f is strictly convex, then there exists at most one global minimum of f.

Proof: See Fig. 1.3.1 for a proof that a local minimum of f is also global. Let f be strictly convex, and to obtain a contradiction, assume that two distinct global minima x and y exist. Then the average (x + y)/2 must belong to X, since X is convex. Furthermore, the value of f must be smaller at the average than at x and y, by the strict convexity of f. Since x and y are global minima, we obtain a contradiction. Q.E.D.

1.3.2 The Projection Theorem

In this section we develop a basic result of analysis and optimization.


Figure 1.3.1. Proof of why local minima of convex functions are also global [the figure plots f along the line segment connecting x∗ and a point x of lower cost]. Suppose that f is convex, and assume, to arrive at a contradiction, that x∗ is a local minimum that is not global. Then there must exist an x ∈ X such that f(x) < f(x∗). By convexity, for all α ∈ (0, 1),

f(αx∗ + (1 − α)x) ≤ αf(x∗) + (1 − α)f(x) < f(x∗).

Thus, f has strictly lower value than f(x∗) at every point on the line segment connecting x∗ with x, except x∗. This contradicts the local minimality of x∗.


Proposition 1.3.3: (Projection Theorem) Let C be a closed convex set and let ‖ · ‖ be the Euclidean norm.

(a) For every x ∈ <n, there exists a unique vector z ∈ C that minimizes ‖z − x‖ over all z ∈ C. This vector is called the projection of x on C, and is denoted by PC(x), i.e.,

PC(x) = arg min_{z∈C} ‖z − x‖.

(b) For every x ∈ <n, a vector z ∈ C is equal to PC(x) if and only if

(y − z)′(x − z) ≤ 0, ∀ y ∈ C.

(c) The function f : <n 7→ C defined by f(x) = PC(x) is continuous and nonexpansive, i.e.,

‖PC(x) − PC(y)‖ ≤ ‖x − y‖, ∀ x, y ∈ <n.

(d) The distance function

d(x, C) = min_{z∈C} ‖z − x‖, x ∈ <n,

is convex.

Proof: (a) Fix x and let w be some element of C. Minimizing ‖x − z‖ over all z ∈ C is equivalent to minimizing the same function over all z ∈ C such that ‖x − z‖ ≤ ‖x − w‖, which is a compact set. Furthermore, the function g defined by g(z) = ‖z − x‖² is continuous. Existence of a minimizing vector follows by Weierstrass’ Theorem (Prop. 1.3.1).

To prove uniqueness, notice that the square of the Euclidean norm is a strictly convex function of its argument [Prop. 1.2.5(d)]. Therefore, g is strictly convex and it follows that its minimum is attained at a unique point (Prop. 1.3.2).

(b) For all y and z in C we have

‖y − x‖² = ‖y − z‖² + ‖z − x‖² − 2(y − z)′(x − z) ≥ ‖z − x‖² − 2(y − z)′(x − z).

Therefore, if z is such that (y − z)′(x − z) ≤ 0 for all y ∈ C, we have ‖y − x‖² ≥ ‖z − x‖² for all y ∈ C, implying that z = PC(x).

Conversely, let z = PC(x), consider any y ∈ C, and for α > 0, define yα = αy + (1 − α)z. We have

‖x − yα‖² = ‖(1 − α)(x − z) + α(x − y)‖² = (1 − α)²‖x − z‖² + α²‖x − y‖² + 2(1 − α)α(x − z)′(x − y).

Viewing ‖x − yα‖² as a function of α, we have

(∂/∂α){‖x − yα‖²}|_{α=0} = −2‖x − z‖² + 2(x − z)′(x − y) = −2(y − z)′(x − z).

Therefore, if (y − z)′(x − z) > 0 for some y ∈ C, then

(∂/∂α){‖x − yα‖²}|_{α=0} < 0,

and for positive but small enough α, we obtain ‖x − yα‖ < ‖x − z‖. This contradicts the fact that z = PC(x) and shows that (y − z)′(x − z) ≤ 0 for all y ∈ C.

(c) Let x and y be elements of <n. From part (b), we have (w − PC(x))′(x − PC(x)) ≤ 0 for all w ∈ C. Since PC(y) ∈ C, we obtain

(PC(y) − PC(x))′(x − PC(x)) ≤ 0.

Similarly,

(PC(x) − PC(y))′(y − PC(y)) ≤ 0.

Adding these two inequalities, we obtain

(PC(y) − PC(x))′(x − PC(x) − y + PC(y)) ≤ 0.

By rearranging and by using the Schwarz inequality, we have

‖PC(y) − PC(x)‖² ≤ (PC(y) − PC(x))′(y − x) ≤ ‖PC(y) − PC(x)‖ · ‖y − x‖,

showing that PC(·) is nonexpansive and a fortiori continuous.

(d) Assume, to arrive at a contradiction, that there exist x1, x2 ∈ <n and an α ∈ [0, 1] such that

d(αx1 + (1 − α)x2, C) > αd(x1, C) + (1 − α)d(x2, C).

Then there must exist z1, z2 ∈ C such that

d(αx1 + (1 − α)x2, C) > α‖z1 − x1‖ + (1 − α)‖z2 − x2‖,

which implies that

‖αz1 + (1 − α)z2 − αx1 − (1 − α)x2‖ > α‖x1 − z1‖ + (1 − α)‖x2 − z2‖.

This contradicts the triangle inequality in the definition of norm. Q.E.D.

Figure 1.3.2 illustrates the necessary and sufficient condition of part(b) of the Projection Theorem.


Figure 1.3.2. Illustration of the condition satisfied by the projection PC(x). For each vector y ∈ C, the vectors x − PC(x) and y − PC(x) form an angle greater than or equal to π/2 or, equivalently,

(y − PC(x))′(x − PC(x)) ≤ 0.
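For a concrete case (an added sketch, not from the text), the projection on a box C = {x | ℓ ≤ x ≤ u} is componentwise clipping, and the variational inequality of part (b) and the nonexpansiveness of part (c) can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
l, u = -np.ones(3), np.ones(3)          # the box C = {x | -1 <= x_i <= 1}

def proj(x):
    # Componentwise clipping is the Euclidean projection on a box.
    return np.minimum(np.maximum(x, l), u)

for _ in range(1000):
    x, y = rng.normal(size=3) * 3, rng.normal(size=3) * 3
    z = proj(x)
    w = rng.uniform(l, u)               # an arbitrary point of C
    assert (w - z) @ (x - z) <= 1e-10   # part (b): (y - z)'(x - z) <= 0
    assert np.linalg.norm(proj(x) - proj(y)) <= np.linalg.norm(x - y) + 1e-12
print("variational inequality and nonexpansiveness verified")
```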

1.3.3 Directions of Recession and Existence of Optimal Solutions

The recession cone, discussed in Section 1.2.4, is also useful for characterizing directions along which convex functions asymptotically increase or decrease. A key idea here is that a function that is convex over <n can be described in terms of its epigraph, which is a closed and convex set. The recession cone of the epigraph can be used to obtain the directions along which the function “slopes downward.” This is the idea underlying the following proposition.

Proposition 1.3.4: Let f : <n 7→ < be a convex function and consider the level sets La = {x | f(x) ≤ a}.

(a) All the level sets La that are nonempty have the same recession cone, given by

RLa = {y | (y, 0) ∈ Repi(f)},

where Repi(f) is the recession cone of the epigraph of f.

(b) If one nonempty level set La is compact, then all nonempty level sets are compact.

Proof: From the formula for the epigraph,

epi(f) = {(x, w) | f(x) ≤ w},

it can be seen that for all a for which La is nonempty, we have

{(x, a) | x ∈ La} = epi(f) ∩ {(x, a) | x ∈ <n}.

The recession cone of the set on the left-hand side above is {(y, 0) | y ∈ RLa}. The recession cone of the set on the right-hand side is equal to the intersection of the recession cone of epi(f) and the recession cone of {(x, a) | x ∈ <n}, which is equal to {(y, 0) | y ∈ <n}, the horizontal subspace that passes through the origin. Thus we have

{(y, 0) | y ∈ RLa} = {(y, 0) | (y, 0) ∈ Repi(f)},

from which it follows that

RLa = {y | (y, 0) ∈ Repi(f)}.

This proves part (a). Part (b) follows by applying Prop. 1.2.13(c) to the recession cone of epi(f). Q.E.D.

For a convex function f : <n 7→ <, the (common) recession cone of the nonempty level sets of f is referred to as the recession cone of f, and is denoted by Rf. Thus

Rf = {y | f(x) ≥ f(x + αy), ∀ x ∈ <n, α ≥ 0}.

Each y ∈ Rf is called a direction of recession of f. If we start at any x ∈ <n and move indefinitely along a direction of recession y, we must stay within each level set that contains x, or equivalently we must encounter exclusively points z with f(z) ≤ f(x). In words, a direction of recession of f is a direction of uninterrupted nonascent for f.
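As a worked example (ours, not in the original), take f(x1, x2) = e^{x1} + x2² on <2; computing f along x + αy identifies Rf directly:

```latex
f(x+\alpha y) = e^{x_1+\alpha y_1} + (x_2+\alpha y_2)^2 .
% If y_2 \ne 0, the quadratic term tends to \infty as \alpha \to \infty; if y_1 > 0,
% so does the exponential term.  Nonascent for every x and every \alpha \ge 0
% therefore forces
y_1 \le 0, \quad y_2 = 0,
\qquad\text{i.e.,}\qquad
R_f = \bigl\{ (y_1, 0) \mid y_1 \le 0 \bigr\},
% matching the level sets of f, which are unbounded only toward x_1 \to -\infty.
```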

Conversely, if we start at some x ∈ <n and, while moving along a direction y, we encounter a point z with f(z) > f(x), then y cannot be a direction of recession. It is easily seen via a convexity argument that once we cross the boundary of a level set of f we never cross it back again, and with a little thought (see Fig. 1.3.3 and the exercises), it follows that a direction that is not a direction of recession of f is a direction of eventual uninterrupted ascent of f. In view of these observations, it is not surprising that directions of recession play a prominent role in characterizing the existence of solutions of convex optimization problems, as shown in the following proposition.

Proposition 1.3.5: Let f : <n 7→ < be a convex function, let X be a nonempty closed convex subset of <n, and let X∗ be the set of minimizing points of f over X. Then X∗ is nonempty and compact if and only if X and f have no common nonzero direction of recession.

Figure 1.3.3. Ascent/descent behavior of a convex function starting at some x ∈ <n and moving along a direction y [the six panels plot f(x + αy) against α]. If y is a direction of recession of f, there are two possibilities: either f decreases monotonically to a finite value or to −∞ [panels (a) and (b), respectively], or f reaches a value that is less than or equal to f(x) and stays at that value [panels (c) and (d)]. If y is not a direction of recession of f, then eventually f increases monotonically to ∞ [panels (e) and (f)]; i.e., for some ᾱ ≥ 0 and all α1, α2 ≥ ᾱ with α1 < α2, we have f(x + α1y) < f(x + α2y).

Proof: Let f∗ = inf_{x∈X} f(x), and note that

X∗ = X ∩ {x | f(x) ≤ f∗}.

If X∗ is nonempty and compact, it has no nonzero direction of recession [Prop. 1.2.13(c)]. Therefore, there is no nonzero vector in the intersection of the recession cones of X and {x | f(x) ≤ f∗}. This is equivalent to X and f having no common nonzero direction of recession.

Conversely, let a be a scalar such that the set

Xa = X ∩ {x | f(x) ≤ a}


is nonempty and has no nonzero direction of recession. Then, Xa is closed [since X is closed and {x | f(x) ≤ a} is closed by the continuity of f], and by Prop. 1.2.13(c), Xa is compact. Since minimization of f over X and over Xa yields the same set of minima, X∗, by Weierstrass’ Theorem (Prop. 1.3.1) X∗ is nonempty, and since X∗ ⊂ Xa, we see that X∗ is bounded. Since X∗ is closed [since X is closed and {x | f(x) ≤ f∗} is closed by the continuity of f], it is compact. Q.E.D.

If the closed convex set X and the convex function f of the above proposition have a common direction of recession, then either X∗ is empty [take, for example, X = (−∞, 0] and f(x) = ex] or else X∗ is nonempty and unbounded [take, for example, X = (−∞, 0] and f(x) = max{0, x}].

Another interesting question is what happens when X and f have a common direction of recession, call it y, but f is bounded below over X:

f∗ = inf_{x∈X} f(x) > −∞.

Then for any x ∈ X, we have x + αy ∈ X (since y is a direction of recession of X), and f(x + αy) is monotonically nonincreasing, converging to a finite value as α → ∞ (since y is a direction of recession of f and f∗ > −∞). Generally, the minimum of f over X need not be attained. However, it turns out that the minimum is attained in an important special case: when f is quadratic and X is polyhedral (i.e., is specified by linear equality and inequality constraints).

To understand the main idea, consider the problem

minimize f(x) = c′x + (1/2) x′Qx
subject to Ax = 0,    (1.9)

where Q is a positive semidefinite symmetric n × n matrix, c ∈ <n is a given vector, and A is an m × n matrix. Let N(Q) and N(A) denote the nullspaces of Q and A, respectively. There are two possibilities:

(a) For some x ∈ N(A) ∩ N(Q), we have c′x ≠ 0. Then, since f(αx) = αc′x for all α ∈ <, it follows that f becomes unbounded from below either along x or along −x.

(b) For all x ∈ N(A) ∩ N(Q), we have c′x = 0. In this case, we have f(x) = 0 for all x ∈ N(A) ∩ N(Q). For x ∈ N(A) such that x ∉ N(Q), since N(Q) and R(Q), the range of Q, are orthogonal subspaces, x can be uniquely decomposed as xR + xN, where xN ∈ N(Q) and xR ∈ R(Q), and we have f(x) = c′x + (1/2)xR′QxR, where xR is the (nonzero) component of x along R(Q). Hence f(αx) = αc′x + (1/2)α²xR′QxR for all α > 0, with xR′QxR > 0. It follows that f is bounded below along all feasible directions x ∈ N(A).


We thus conclude that for f to be bounded from below along all directions in N(A), it is necessary and sufficient that c′x = 0 for all x ∈ N(A) ∩ N(Q). However, boundedness from below of a convex cost function f along all directions of recession of a constraint set does not guarantee existence of an optimal solution, or even boundedness from below over the constraint set (see the exercises). On the other hand, since the constraint set N(A) is a subspace, it is possible to use a transformation x = Bz, where the columns of the matrix B are basis vectors for N(A), and view the problem as an unconstrained minimization over z of the cost function h(z) = f(Bz), which is positive semidefinite quadratic. We can then argue that boundedness from below of this function along all directions z is necessary and sufficient for existence of an optimal solution. This argument indicates that problem (1.9) has an optimal solution if and only if c′x = 0 for all x ∈ N(A) ∩ N(Q). By using a translation argument, this result can also be extended to the case where the constraint set is a general affine set of the form {x | Ax = b} rather than the subspace {x | Ax = 0}.
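The nullspace criterion is straightforward to test numerically. The sketch below (our illustration, not from the text) builds a small instance of problem (1.9), forms a basis of N(A) ∩ N(Q) as the nullspace of the stacked matrix [A; Q] via scipy.linalg.null_space, and checks whether c is orthogonal to it:

```python
import numpy as np
from scipy.linalg import null_space

# A small instance of problem (1.9): minimize c'x + (1/2) x'Qx s.t. Ax = 0.
Q = np.diag([2.0, 1.0, 0.0])            # positive semidefinite, N(Q) = span{e3}
A = np.array([[1.0, 1.0, 0.0]])         # N(A) = span{(1,-1,0), (0,0,1)}
c = np.array([1.0, 0.0, 0.0])

# N(A) ∩ N(Q) is the nullspace of the stacked matrix [A; Q]; here span{e3}.
NAQ = null_space(np.vstack([A, Q]))
print("optimal solution exists:", np.allclose(c @ NAQ, 0.0))      # True: c'e3 = 0

c_bad = np.array([0.0, 0.0, 1.0])       # now c'e3 = 1 != 0, f unbounded below
print("optimal solution exists:", np.allclose(c_bad @ NAQ, 0.0))  # False
```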

In part (a) of the following proposition we state the result just described (equality constraints only). While we can prove the result by formalizing the argument outlined above, we will use instead a more elementary variant of this argument, whereby the constraints are eliminated via a penalty function; this will give us the opportunity to introduce a line of proof that we will frequently employ in other contexts as well. In part (b) of the proposition, we allow linear inequality constraints, and we show that a convex quadratic program has an optimal solution if and only if its optimal value is bounded below. Note that the cost function may be linear, so the proposition applies to linear programs as well.


Proposition 1.3.6: (Existence of Solutions of Quadratic Programs) Let f : <n 7→ < be a quadratic function of the form

f(x) = c′x + (1/2) x′Qx,

where Q is a positive semidefinite symmetric n × n matrix and c ∈ <n is a given vector. Let also A be an m × n matrix and b ∈ <m be a vector. Denote by N(A) and N(Q) the nullspaces of A and Q, respectively.

(a) Let X = {x | Ax = b} and assume that X is nonempty. The following are equivalent:

(i) f attains a minimum over X.

(ii) f∗ = inf_{x∈X} f(x) > −∞.

(iii) c′y = 0 for all y ∈ N(A) ∩ N(Q).

(b) Let X = {x | Ax ≤ b} and assume that X is nonempty. The following are equivalent:

(i) f attains a minimum over X.

(ii) f∗ = inf_{x∈X} f(x) > −∞.

(iii) c′y ≥ 0 for all y ∈ N(Q) such that Ay ≤ 0.

Proof: (a) (i) clearly implies (ii).

We next show that (ii) implies (iii). For all x ∈ X, y ∈ N(A) ∩ N(Q), and α ∈ <, we have x + αy ∈ X and

f(x + αy) = c′(x + αy) + (1/2)(x + αy)′Q(x + αy) = f(x) + αc′y.

If c′y ≠ 0, then either lim_{α→∞} f(x + αy) = −∞ or lim_{α→−∞} f(x + αy) = −∞, and we must have f∗ = −∞. Hence (ii) implies that c′y = 0 for all y ∈ N(A) ∩ N(Q).

We finally show that (iii) implies (i) by first using a translation argument and then using a penalty function argument. Choose any x ∈ X, so that X = x + N(A). Then minimizing f over X is equivalent to minimizing f(x + y) over y ∈ N(A), or

minimize h(y)
subject to Ay = 0,

where

h(y) = f(x + y) = f(x) + ∇f(x)′y + (1/2) y′Qy.


For any integer k > 0, let

hk(y) = h(y) + (k/2)‖Ay‖² = f(x) + ∇f(x)′y + (1/2) y′Qy + (k/2)‖Ay‖².    (1.10)

Note that for all k,

hk(y) ≤ hk+1(y), ∀ y ∈ <n,

and

inf_{y∈<n} hk(y) ≤ inf_{y∈<n} hk+1(y) ≤ inf_{Ay=0} h(y) ≤ h(0) = f(x).    (1.11)

Denote

S = (N(A) ∩ N(Q))⊥,

and write any y ∈ <n as y = z + w, where

z ∈ S,  w ∈ S⊥ = N(A) ∩ N(Q).

Then, by using the assumption c′w = 0 [implying that ∇f(x)′w = (c + Qx)′w = 0], we see from Eq. (1.10) that

hk(y) = hk(z + w) = hk(z),    (1.12)

i.e., hk is determined in terms of its restriction to the subspace S. It can be seen from Eq. (1.10) that the function hk has no nonzero direction of recession in common with S, so hk(z) attains a minimum over S, call it yk, and in view of Eq. (1.12), yk also attains the minimum of hk(y) over <n.

From Eq. (1.11), we have

hk(yk) ≤ hk+1(yk+1) ≤ inf_{Ay=0} h(y) ≤ f(x),    (1.13)

and we will use this relation to show that {yk} is bounded and each of its limit points minimizes h(y) subject to Ay = 0. Indeed, from Eq. (1.13), the sequence {hk(yk)} is bounded, so if {yk} were unbounded, then assuming without loss of generality that yk ≠ 0, we would have hk(yk)/‖yk‖ → 0, or

lim_{k→∞} ( (1/‖yk‖) f(x) + ∇f(x)′ȳk + ‖yk‖ ( (1/2) ȳk′Qȳk + (k/2)‖Aȳk‖² ) ) = 0,

where ȳk = yk/‖yk‖. For this to be true, all limit points ȳ of the bounded sequence {ȳk} must be such that ȳ′Qȳ = 0 and Aȳ = 0, which is impossible since ‖ȳ‖ = 1 and ȳ ∈ S. Thus {yk} is bounded, and for any one of its limit points, call it ŷ, we have ŷ ∈ S and

lim sup_{k→∞} hk(yk) = f(x) + ∇f(x)′ŷ + (1/2) ŷ′Qŷ + lim sup_{k→∞} (k/2)‖Ayk‖² ≤ inf_{Ay=0} h(y).


It follows that Aŷ = 0 and that ŷ minimizes h(y) subject to Ay = 0. This implies that the vector x + ŷ minimizes f(x) subject to Ax = b.
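The penalty-function line of proof can also be watched numerically. The sketch below (ours; for simplicity Q is taken positive definite and x = 0) minimizes hk through its optimality condition (Q + kA′A)y = −c for growing k, and compares with the exact equality-constrained minimizer obtained from the KKT system:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 2
M = rng.normal(size=(n, n)); Q = M.T @ M      # positive definite here, for simplicity
c = rng.normal(size=n); A = rng.normal(size=(m, n))

# Unconstrained minimizer of h_k(y) = c'y + (1/2)y'Qy + (k/2)||Ay||^2:
# the gradient condition is (Q + k A'A) y = -c.
for k in [1, 10, 100, 10000]:
    yk = np.linalg.solve(Q + k * A.T @ A, -c)
    print(k, yk)

# Exact minimizer of c'y + (1/2)y'Qy subject to Ay = 0, via the KKT system.
KKT = np.block([[Q, A.T], [A, np.zeros((m, m))]])
rhs = np.concatenate([-c, np.zeros(m)])
print("exact:", np.linalg.solve(KKT, rhs)[:n])   # the y_k approach this vector
```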

(b) Clearly (i) implies (ii), and, as in the proof of part (a), (ii) implies that c′y ≥ 0 for all y ∈ N(Q) with Ay ≤ 0.

Finally, we show that (iii) implies (i) by using the corresponding result of part (a). For any x ∈ X, let J(x) denote the index set of active constraints at x, i.e., J(x) = {j | aj′x = bj}, where the aj′ are the rows of A.

For any sequence {xk} ⊂ X with f(xk) → f∗, we can extract a subsequence such that J(xk) is constant and equal to some J. Accordingly, we select a sequence {xk} ⊂ X such that f(xk) → f∗ and the index set J(xk) is equal for all k to a set J that is maximal over all such sequences [for any other sequence {x̄k} ⊂ X with f(x̄k) → f∗ and such that J(x̄k) = J̄ for all k, we cannot have J ⊂ J̄ unless J̄ = J].

Consider the problem

minimize f(x)
subject to aj′x = bj, j ∈ J.    (1.14)

Assume, to come to a contradiction, that this problem does not have a solution. Then, by part (a), we have c′y < 0 for some y ∈ N(Ā) ∩ N(Q), where Ā is the matrix having as rows the aj′, j ∈ J. Consider the line {xk + γy | γ > 0}. Since y ∈ N(Q), we have

f(xk + γy) = f(xk) + γc′y,

so that

f(xk + γy) < f(xk), ∀ γ > 0.

Furthermore, since y ∈ N(Ā), we have

aj′(xk + γy) = bj, ∀ j ∈ J, γ > 0.

We must also have aj′y > 0 for at least one j ∉ J [otherwise (iii) would be violated], so the line {xk + γy | γ > 0} crosses the boundary of X for some γk > 0. The sequence {x̄k}, where x̄k = xk + γky, satisfies {x̄k} ⊂ X, f(x̄k) → f∗ [since f∗ ≤ f(x̄k) ≤ f(xk)], and the active index set J(x̄k) strictly contains J for all k. This contradicts the maximality of J, and shows that problem (1.14) has an optimal solution, call it x̃.

Since xk is a feasible solution of problem (1.14), we have

f(x̃) ≤ f(xk), ∀ k,

so that

f(x̃) ≤ f∗.


We will now show that x̃ minimizes f over X, by showing that x̃ ∈ X, thereby completing the proof. Assume, to arrive at a contradiction, that x̃ ∉ X. Let x̂k be the point of the interval connecting xk and x̃ that belongs to X and is closest to x̃. We have that J(x̂k) strictly contains J for all k. Since f(x̃) ≤ f(xk) and f is convex over the interval [xk, x̃], it follows that

f(x̂k) ≤ max{f(xk), f(x̃)} = f(xk).

Thus f(x̂k) → f∗, which contradicts the maximality of J. Q.E.D.

1.3.4 Existence of Saddle Points

Suppose that we are given a function φ : X × Z 7→ <, where X ⊂ <n, Z ⊂ <m, and we wish to either

minimize sup_{z∈Z} φ(x, z)
subject to x ∈ X

or

maximize inf_{x∈X} φ(x, z)
subject to z ∈ Z.

These problems are encountered in at least three major optimization contexts:

(1) Worst-case design, whereby we view z as a parameter and we wish to minimize over x a cost function, assuming the worst possible value of z. A special case of this is the discrete minimax problem, where we want to minimize over x ∈ X

max{f1(x), . . . , fm(x)},

where the fi are some given functions. Here, Z is the finite set {1, . . . , m}. Within this context, it is important to provide characterizations of the max function

max_{z∈Z} φ(x, z),

particularly its directional derivative. We do this in Section 1.7, where we discuss the differentiability properties of convex functions.

(2) Exact penalty functions, which can be used for example to convert constrained optimization problems of the form

minimize f(x)
subject to x ∈ X, gj(x) ≤ 0, j = 1, . . . , r    (1.15)

to (less constrained) minimax problems of the form

minimize f(x) + c max{0, g1(x), . . . , gr(x)}
subject to x ∈ X,

where c is a large positive penalty parameter. This conversion is useful for both analytical and computational purposes, and will be discussed in Chapters 2 and 4.

(3) Duality theory, where, using problem (1.15) as an example, we introduce the so-called Lagrangian function

L(x, µ) = f(x) + ∑_{j=1}^{r} µj gj(x)

involving the vector µ = (µ1, . . . , µr) ∈ <r, and the dual problem

maximize inf_{x∈X} L(x, µ)
subject to µ ≥ 0.    (1.16)

The original (primal) problem (1.15) can also be written as

minimize sup_{µ≥0} L(x, µ)
subject to x ∈ X

[if x violates any of the constraints gj(x) ≤ 0, we have sup_{µ≥0} L(x, µ) = ∞, and if it does not, we have sup_{µ≥0} L(x, µ) = f(x)]. Thus the primal and the dual problems (1.15) and (1.16) can be viewed in terms of a minimax problem.
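The bracketed identity can be checked on a tiny instance (our sketch, with f(x) = x² and a single constraint g(x) = 1 − x ≤ 0; the finite grid over µ is a crude stand-in for the supremum):

```python
import numpy as np

f = lambda x: x**2
g = lambda x: 1.0 - x                 # feasible iff x >= 1
L = lambda x, mu: f(x) + mu * g(x)    # Lagrangian

mus = np.linspace(0.0, 1000.0, 100001)
for x in [2.0, 1.0, 0.5]:
    # sup over mu >= 0 of L(x, mu): equals f(x) if g(x) <= 0, +inf otherwise
    print(x, L(x, mus).max())
# x = 2.0 -> 4.0 and x = 1.0 -> 1.0 (feasible points, sup = f(x));
# x = 0.5 -> grows with the mu grid, reflecting sup = +inf (infeasible).
```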

We will now derive conditions guaranteeing that

sup_{z∈Z} inf_{x∈X} φ(x, z) = inf_{x∈X} sup_{z∈Z} φ(x, z),    (1.17)

and that the inf and sup above are attained. This is a major issue in duality theory because it connects the primal and the dual problems [cf. Eqs. (1.15) and (1.16)] through their optimal values and optimal solutions. In particular, when we discuss duality in Chapter 3, we will see that a major question is whether there is no duality gap, i.e., whether the optimal primal and dual values are equal. This is so if and only if

sup_{µ≥0} inf_{x∈X} L(x, µ) = inf_{x∈X} sup_{µ≥0} L(x, µ).    (1.18)

We will prove in this section one major result, the Saddle Point Theorem, which guarantees the equality (1.17), assuming convexity/concavity assumptions on φ and (essentially) compactness assumptions on X and Z.† Unfortunately, the theorem to be shown later in this section is only partially adequate for the development of duality theory, because compactness of Z and, to some extent, compactness of X turn out to be restrictive assumptions [for example, Z corresponds to the set {µ | µ ≥ 0} in Eq. (1.18), which is not compact]. We will derive additional theorems of the minimax type in Chapter 3, when we discuss duality and we make a closer connection with the theory of Lagrange multipliers.

A first observation regarding the potential validity of the minimax equality (1.17) is that we always have the inequality

sup_{z∈Z} inf_{x∈X} φ(x, z) ≤ inf_{x∈X} sup_{z∈Z} φ(x, z)    (1.19)

[for every z ∈ Z, write inf_{x∈X} φ(x, z) ≤ inf_{x∈X} sup_{z∈Z} φ(x, z) and take the supremum over z ∈ Z of the left-hand side]. However, special conditions are required to guarantee equality.

Suppose that x∗ is an optimal solution of the problem

minimize sup_{z∈Z} φ(x, z)
subject to x ∈ X    (1.20)

and z∗ is an optimal solution of the problem

maximize inf_{x∈X} φ(x, z)
subject to z ∈ Z.    (1.21)

† The Saddle Point Theorem is also central in game theory, as we now briefly explain. In the simplest type of zero sum game, there are two players: the first may choose one out of n moves and the second may choose one out of m moves. If moves i and j are selected by the first and the second player, respectively, the first player gives a specified amount aij to the second. The objective of the first player is to minimize the amount given to the other player, and the objective of the second player is to maximize this amount. The players use mixed strategies, whereby the first player selects a probability distribution x = (x1, . . . , xn) over his n possible moves and the second player selects a probability distribution z = (z1, . . . , zm) over his m possible moves. Since the probability of selecting i and j is xizj, the expected amount to be paid by the first player to the second is ∑_{i,j} aij xi zj, or x′Az, where A is the n × m matrix with elements aij.

If each player adopts a worst case viewpoint, whereby he optimizes his choice against the worst possible selection by the other player, the first player must minimize maxz x′Az and the second player must maximize minx x′Az. The main result, a special case of the existence result we will prove shortly, is that these two optimal values are equal, implying that there is an amount that can be meaningfully viewed as the value of the game for its participants.
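For a concrete game (our illustration, not from the notes), the value and an optimal mixed strategy can be computed by linear programming; the formulation below minimizes v subject to A′x ≤ v·1 over probability vectors x, using scipy.optimize.linprog:

```python
import numpy as np
from scipy.optimize import linprog

# Zero-sum game: the row player pays a_ij to the column player.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
n, m = A.shape

# Variables (x_1, ..., x_n, v): minimize v subject to A'x <= v*1,
# sum(x) = 1, x >= 0.  The optimal v equals min_x max_z x'Az.
c = np.concatenate([np.zeros(n), [1.0]])
A_ub = np.hstack([A.T, -np.ones((m, 1))])        # A'x - v*1 <= 0
b_ub = np.zeros(m)
A_eq = np.concatenate([np.ones(n), [0.0]]).reshape(1, -1)
b_eq = [1.0]
bounds = [(0, None)] * n + [(None, None)]        # x >= 0, v free
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
x, v = res.x[:n], res.x[n]
print("mixed strategy:", x, " value of the game:", v)   # (0.5, 0.5), value 0
```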


Then we have

sup_{z∈Z} inf_{x∈X} φ(x, z) = inf_{x∈X} φ(x, z∗) ≤ φ(x∗, z∗) ≤ sup_{z∈Z} φ(x∗, z) = inf_{x∈X} sup_{z∈Z} φ(x, z).    (1.22)

If the minimax equality [cf. Eq. (1.17)] holds, then equality holds throughout above, so that

sup_{z∈Z} φ(x∗, z) = φ(x∗, z∗) = inf_{x∈X} φ(x, z∗),    (1.23)

or equivalently

φ(x∗, z) ≤ φ(x∗, z∗) ≤ φ(x, z∗), ∀ x ∈ X, z ∈ Z.    (1.24)

A pair of vectors x∗ ∈ X and z∗ ∈ Z satisfying the two above (equivalent) relations is called a saddle point of φ (cf. Fig. 1.3.4).

The preceding argument showed that if the minimax equality (1.17) holds, any vectors x∗ and z∗ that are optimal solutions of problems (1.20) and (1.21), respectively, form a saddle point. Conversely, if (x∗, z∗) is a saddle point, then the definition [cf. Eq. (1.23)] implies that

inf_{x∈X} sup_{z∈Z} φ(x, z) ≤ sup_{z∈Z} inf_{x∈X} φ(x, z).

This, together with the minimax inequality (1.19), guarantees that the minimax equality (1.17) holds, and, from Eq. (1.22), that x∗ and z∗ are optimal solutions of problems (1.20) and (1.21), respectively.

We summarize the above discussion in the following proposition.

Proposition 1.3.7: A pair (x∗, z∗) is a saddle point of φ if and only if the minimax equality (1.17) holds, and x∗ and z∗ are optimal solutions of problems (1.20) and (1.21), respectively.

Note a simple consequence of the above proposition: the set of saddle points, when nonempty, is the Cartesian product X∗ × Z∗, where X∗ and Z∗ are the sets of optimal solutions of problems (1.20) and (1.21), respectively. In other words, x∗ and z∗ can be independently chosen within the sets X∗ and Z∗, respectively, to form a saddle point. Note also that if the minimax equality (1.17) does not hold, there is no saddle point, even if the sets X∗ and Z∗ are nonempty.

One can visualize saddle points in terms of the sets of minimizing points over X for fixed z ∈ Z and maximizing points over Z for fixed x ∈ X:

X(z) = {x | x minimizes φ(x, z) over X},

Z(x) = {z | z maximizes φ(x, z) over Z}.

Figure 1.3.4. Illustration of a saddle point of a function φ(x, z) over x ∈ X and z ∈ Z [the function plotted here is φ(x, z) = (1/2)(x² + 2xz − z²)]. Let

x̂(z) = arg min_{x∈X} φ(x, z),  ẑ(x) = arg max_{z∈Z} φ(x, z)

be the curves of minimizing and maximizing points, and consider the corresponding curves φ(x̂(z), z) and φ(x, ẑ(x)) [shown in the figure for the case where x̂(z) and ẑ(x) are unique; otherwise x̂(z) and ẑ(x) should be viewed as set-valued mappings]. By definition, a pair (x∗, z∗) is a saddle point if and only if

max_{z∈Z} φ(x∗, z) = φ(x∗, z∗) = min_{x∈X} φ(x, z∗),

or equivalently, if (x∗, z∗) lies on both curves [x∗ = x̂(z∗) and z∗ = ẑ(x∗)]. At such a pair, we also have

max_{z∈Z} φ(x̂(z), z) = max_{z∈Z} min_{x∈X} φ(x, z) = φ(x∗, z∗) = min_{x∈X} max_{z∈Z} φ(x, z) = min_{x∈X} φ(x, ẑ(x)),

so that

φ(x̂(z), z) ≤ φ(x∗, z∗) ≤ φ(x, ẑ(x)), ∀ x ∈ X, z ∈ Z

(see Prop. 1.3.7). Visually, the curve of maxima φ(x, ẑ(x)) must lie “above” the curve of minima φ(x̂(z), z) (completely, i.e., for all x ∈ X and z ∈ Z).

The definitions imply that the pair (x∗, z∗) is a saddle point if and only if it is a “point of intersection” of X(·) and Z(·) in the sense that

x∗ ∈ X(z∗),  z∗ ∈ Z(x∗);

see Fig. 1.3.4. We now consider conditions that guarantee the existence of a saddle point. We have the following classical result.

Proposition 1.3.8: (Saddle Point Theorem) Let X and Z be closed, convex subsets of <n and <m, respectively, and let φ : X × Z 7→ < be a function. Assume that for each z ∈ Z, the function φ(·, z) : X 7→ < is convex and lower semicontinuous, and for each x ∈ X, the function φ(x, ·) : Z 7→ < is concave and upper semicontinuous. Then there exists a saddle point of φ under any one of the following four conditions:

(1) X and Z are compact.

(2) Z is compact and there exists a vector z̄ ∈ Z such that φ(·, z̄) is coercive.

(3) X is compact and there exists a vector x̄ ∈ X such that −φ(x̄, ·) is coercive.

(4) There exist vectors x̄ ∈ X and z̄ ∈ Z such that φ(·, z̄) and −φ(x̄, ·) are coercive.

Proof: We first prove the result under the assumption that X and Z are compact [condition (1)], and the additional assumption that φ(x, ·) is strictly concave for each x ∈ X.

By Weierstrass’ Theorem (Prop. 1.3.1), the function

f(x) = max_{z∈Z} φ(x, z), x ∈ X,

is real-valued and the maximum above is attained for each x ∈ X at a point denoted z(x), which is unique by the strict concavity assumption. Furthermore, by Prop. 1.2.3(c), f is lower semicontinuous, so again by Weierstrass’ Theorem, f attains a minimum over x ∈ X at some point x∗. Let z∗ = z(x∗), so that

φ(x∗, z) ≤ φ(x∗, z∗) = f(x∗), ∀ z ∈ Z.    (1.25)

We will show that (x∗, z∗) is a saddle point of φ, and in view of the above relation, it will suffice to show that φ(x∗, z∗) ≤ φ(x, z∗) for all x ∈ X.


Choose any x ∈ X and let

xk = (1/k) x + (1 − 1/k) x∗,  zk = z(xk),  k = 1, 2, . . . .

Let z̄ be any limit point of {zk} corresponding to a subsequence K of positive integers. Using the convexity of φ(·, zk), we have

f(x∗) ≤ f(xk) = φ(xk, zk) ≤ (1/k) φ(x, zk) + (1 − 1/k) φ(x∗, zk).    (1.26)

Taking the limit as k → ∞ with k ∈ K, and using the upper semicontinuity of φ(x∗, ·), we obtain

f(x∗) ≤ lim sup_{k→∞, k∈K} φ(x∗, zk) ≤ φ(x∗, z̄) ≤ max_{z∈Z} φ(x∗, z) = f(x∗).

Hence equality holds throughout above, and it follows that φ(x∗, z) ≤ φ(x∗, z̄) = f(x∗) for all z ∈ Z. Since z∗ is the unique maximizer of φ(x∗, ·) over Z, we see that z̄ = z(x∗) = z∗, so that {zk} has z∗ as its unique limit point, independently of the choice of the vector x within X (this is the fine point in the argument where the strict concavity assumption is needed). Equation (1.25) implies that for all k,

φ(x∗, zk) ≤ φ(x∗, z∗) = f(x∗),

which, when combined with Eq. (1.26), yields

φ(x∗, z∗) ≤ (1/k) φ(x, zk) + (1 − 1/k) φ(x∗, z∗), ∀ x ∈ X,

or

φ(x∗, z∗) ≤ φ(x, zk), ∀ x ∈ X.

Taking the limit as zk → z∗, and using the upper semicontinuity of φ(x, ·), we obtain

φ(x∗, z∗) ≤ φ(x, z∗), ∀ x ∈ X.

Combining this relation with Eq. (1.25), we see that (x∗, z∗) is a saddle point of φ.

Next, we remove the assumption of strict concavity of φ(x, ·). We introduce the functions φk : X × Z 7→ < given by

φk(x, z) = φ(x, z) − (1/k)‖z‖²,  k = 1, 2, . . . .

Since φk(x, ·) is strictly concave for each x ∈ X, by what has already been proved, there exists a saddle point (x∗k, z∗k) of φk, satisfying

φ(x∗k, z) − (1/k)‖z‖² ≤ φ(x∗k, z∗k) − (1/k)‖z∗k‖² ≤ φ(x, z∗k) − (1/k)‖z∗k‖², ∀ x ∈ X, z ∈ Z.


Let (x∗, z∗) be a limit point of {(x∗k, z∗k)}. By taking the limit as k → ∞ and by using the semicontinuity assumptions on φ, it follows that

φ(x∗, z) ≤ lim inf_{k→∞} φ(x∗k, z) ≤ lim sup_{k→∞} φ(x, z∗k) ≤ φ(x, z∗), ∀ x ∈ X, z ∈ Z.    (1.27)

By alternately setting x = x∗ and z = z∗ in the above relation, we obtain

φ(x∗, z) ≤ φ(x∗, z∗) ≤ φ(x, z∗), ∀ x ∈ X, z ∈ Z,

so (x∗, z∗) is a saddle point of φ.

We now prove the existence of a saddle point under conditions (2), (3), or (4). Let k̄ be large enough that the sets Xk̄ and Zk̄ defined below are nonempty and, in addition,

k̄ ≥ max_{z∈Z} ‖z‖   if condition (2) holds,
k̄ ≥ max_{x∈X} ‖x‖   if condition (3) holds,
k̄ ≥ max{‖x̄‖, ‖z̄‖}   if condition (4) holds,

where x̄ and z̄ are the points appearing in conditions (2)-(4). For each k ≥ k̄, we introduce the convex and compact sets

Xk = {x ∈ X | ‖x‖ ≤ k},  Zk = {z ∈ Z | ‖z‖ ≤ k}.

Note that Z = Zk under condition (2) and X = Xk under condition (3). Furthermore, for all k ≥ k̄, the vectors x̄ and z̄ appearing in the assumed condition satisfy x̄ ∈ Xk and z̄ ∈ Zk.

Using the result already shown under condition (1), for each k ≥ k̄ there exists a saddle point of φ over Xk × Zk, i.e., a pair (xk, zk) such that

φ(xk, z) ≤ φ(xk, zk) ≤ φ(x, zk), ∀ x ∈ Xk, z ∈ Zk.    (1.28)

Assume that condition (2) holds. Then, since Z is compact, {zk} is bounded. If {xk} were unbounded, the coercivity of φ(·, z̄) would imply that φ(xk, z̄) → ∞, and from Eq. (1.28) it would follow that φ(x, zk) → ∞ for all x ∈ X. This contradicts the boundedness of {zk}. Hence {xk} must be bounded, and (xk, zk) must have a limit point (x∗, z∗). Taking the limit as k → ∞ in Eq. (1.28), and using the semicontinuity assumptions on φ, it follows that

φ(x∗, z) ≤ lim inf_{k→∞} φ(xk, z) ≤ lim sup_{k→∞} φ(x, zk) ≤ φ(x, z∗), ∀ x ∈ X, z ∈ Z,

which is identical to Eq. (1.27). Thus, by the argument following that equation, (x∗, z∗) is a saddle point of φ.

A symmetric argument, with the obvious modifications, shows the result under condition (3). Finally, under condition (4), note that Eq. (1.28) yields for all k ≥ k̄,

φ(xk, z̄) ≤ φ(xk, zk) ≤ φ(x̄, zk).

If {xk} were unbounded, the coercivity of φ(·, z̄) would imply that φ(xk, z̄) → ∞ and hence φ(x̄, zk) → ∞, which violates the coercivity of −φ(x̄, ·). Hence {xk} must be bounded, and a symmetric argument shows that {zk} must be bounded. Thus (xk, zk) must have a limit point (x∗, z∗). The result then follows from Eq. (1.27), as in the case where condition (2) holds. Q.E.D.

It is easy to construct examples showing that the convexity of X and Z are essential assumptions for the above proposition (this is also evident from Fig. 1.3.4). The assumptions of compactness/coercivity and lower/upper semicontinuity of φ are essential for the existence of a saddle point (just as they are essential in Weierstrass’ Theorem). An interesting question is whether convexity/concavity and lower/upper semicontinuity of φ are sufficient to guarantee the minimax equality (1.17). Unfortunately this is not so, for reasons that also touch upon some of the deeper aspects of duality theory (see Chapter 3). Here is an example:

Example 1.3.1

Let

X = {x ∈ <2 | x ≥ 0},  Z = {z ∈ < | z ≥ 0},

and let

φ(x, z) = e^{−√(x1x2)} + zx1,

which satisfies the convexity/concavity and lower/upper semicontinuity assumptions of Prop. 1.3.8. For all z ≥ 0, we have

inf_{x≥0} { e^{−√(x1x2)} + zx1 } = 0,

since the expression in braces is nonnegative for x ≥ 0 and can approach zero by taking x1 → 0 and x1x2 → ∞. Hence

sup_{z≥0} inf_{x≥0} φ(x, z) = 0.

We also have, for all x ≥ 0,

sup_{z≥0} { e^{−√(x1x2)} + zx1 } = 1 if x1 = 0,  ∞ if x1 > 0.

Hence

inf_{x≥0} sup_{z≥0} φ(x, z) = 1,

so that inf_{x≥0} sup_{z≥0} φ(x, z) > sup_{z≥0} inf_{x≥0} φ(x, z). The difficulty here is that the compactness/coercivity assumptions of Prop. 1.3.8 are violated.
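The two values can be checked numerically (our sketch; the evaluations follow the limiting arguments in the example):

```python
import numpy as np

phi = lambda x, z: np.exp(-np.sqrt(x[0] * x[1])) + z * x[0]

# For any fixed z >= 0, phi can be driven to 0 along x = (1/t, t**3):
# x1 -> 0 while x1*x2 = t**2 -> inf, so inf_x phi(x, z) = 0 for every z,
# and sup_z inf_x phi = 0.
for t in [1e1, 1e2, 1e3]:
    print(phi((1 / t, t**3), z=5.0))    # 0.50004..., 0.05, 0.005 -> 0

# For any fixed x: if x1 > 0 then z*x1 is unbounded in z, so sup_z phi = inf;
# if x1 = 0 then phi = 1 for all z.  Hence inf_x sup_z phi = 1.
print(phi((0.0, 7.0), z=123.0))         # 1.0
```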


E X E R C I S E S

1.3.1

Let f : <n 7→ < be a convex function, let X be a closed convex set, and assume that f and X have no common direction of recession. Let X∗ be the optimal solution set (nonempty and compact by Prop. 1.3.5) and let f∗ = inf_{x∈X} f(x). Show that:

(a) For every ε > 0 there exists a δ > 0 such that every vector x ∈ X with f(x) ≤ f∗ + δ satisfies min_{x∗∈X∗} ‖x − x∗‖ ≤ ε.

(b) Every sequence {xk} ⊂ X satisfying f(xk) → f∗ is bounded and all of its limit points belong to X∗.

1.3.2

Let C be a convex set and S be a subspace. Show that projection on S is a linear transformation, and use this to show that the projection of the set C on S is a convex set, which is compact if C is compact. Is the projection of C always closed if C is closed?

1.3.3 (Existence of Solution of Quadratic Programs)

This exercise deals with an extension of Prop. 1.3.6 to the case where the quadratic cost may not be convex. Consider a problem of the form

minimize c′x + (1/2) x′Qx
subject to Ax ≤ b,

where Q is a symmetric (not necessarily positive semidefinite) matrix, c and b are given vectors, and A is a matrix. Show that the following are equivalent:

(a) There exists at least one optimal solution.

(b) The cost function is bounded below over the constraint set.

(c) The problem has at least one feasible solution, and for any feasible x, there is no y ∈ <n such that Ay ≤ 0 and either y′Qy < 0, or y′Qy = 0 and (c + Qx)′y < 0.


1.3.4 (Existence of Optimal Solutions)

Let f : <n 7→ < be a convex function, and consider the problem of minimizing f over a closed and convex set X. Suppose that f attains a minimum along all half lines of the form {x + αy | α ≥ 0}, where x ∈ X and y is in the recession cone of X. Show that we may still have inf_{x∈X} f(x) = −∞. Hint: Use the case n = 2, X = <2, f(x) = min_{z∈C} ‖z − x‖² − x1, where C = {(x1, x2) | x1² ≤ x2}.

1.3.5 (Saddle Points in two Dimensions)

Consider a function φ of two real variables x and z, taking values in compact intervals X and Z, respectively. Assume that for each z ∈ Z, the function φ(·, z) is minimized over X at a unique point denoted x(z). Similarly, assume that for each x ∈ X, the function φ(x, ·) is maximized over Z at a unique point denoted z(x). Assume further that the functions x(z) and z(x) are continuous over Z and X, respectively. Show that φ has a unique saddle point (x∗, z∗). Use this to investigate the existence of saddle points of φ(x, z) = x² + z² over X = [0, 1] and Z = [0, 1].

1.4 HYPERPLANES

A hyperplane is a set of the form {x | a′x = b}, where a ∈ <n, a ≠ 0, and b ∈ <, as illustrated in Fig. 1.4.1. An equivalent definition is that a hyperplane in <n is an affine set of dimension n − 1. The vector a is called the normal vector of the hyperplane (it is orthogonal to the difference x − y of any two vectors x and y of the hyperplane). The two sets

{x | a′x ≥ b},  {x | a′x ≤ b},

are called the halfspaces associated with the hyperplane (also referred to as the positive and negative halfspaces, respectively). We have the following result, which is also illustrated in Fig. 1.4.1. The proof is based on the Projection Theorem and is illustrated in Fig. 1.4.2.

Proposition 1.4.1: (Supporting Hyperplane Theorem) If C ⊂ <n is a convex set and x̄ is a point that does not belong to the interior of C, there exists a vector a ≠ 0 such that

a′x ≥ a′x̄, ∀ x ∈ C.    (1.29)


Figure 1.4.1. (a) A hyperplane {x | a′x = b} divides the space into the positive halfspace {x | a′x ≥ b} and the negative halfspace {x | a′x ≤ b}, as illustrated. (b) Geometric interpretation of the Supporting Hyperplane Theorem. (c) Geometric interpretation of the Separating Hyperplane Theorem.

Figure 1.4.2. Illustration of the proof of the Supporting Hyperplane Theorem for the case where the vector x̄ belongs to the closure of C. We choose a sequence {xk} of vectors not belonging to the closure of C which converges to x̄, and we project xk on the closure of C. We then consider, for each k, the hyperplane that is orthogonal to the line segment connecting xk and its projection, and passes through xk. These hyperplanes “converge” to a hyperplane that supports C at x̄.

Proof: Consider the closure cl(C) of C, which is a convex set by Prop. 1.2.1(d). Let {xk} be a sequence of vectors not belonging to cl(C), which converges to x̄; such a sequence exists because x̄ does not belong to the interior of C. If x̂k is the projection of xk on cl(C), we have by part (b) of the Projection Theorem (Prop. 1.3.3)

(x̂k − xk)′(x − x̂k) ≥ 0, ∀ x ∈ cl(C).

Hence we obtain for all k and x ∈ cl(C),

(x̂k − xk)′x ≥ (x̂k − xk)′x̂k = (x̂k − xk)′(x̂k − xk) + (x̂k − xk)′xk ≥ (x̂k − xk)′xk.

We can write this inequality as

ak′x ≥ ak′xk, ∀ x ∈ cl(C), k = 0, 1, . . . ,    (1.30)

where

ak = (x̂k − xk)/‖x̂k − xk‖.

We have ‖ak‖ = 1 for all k, and hence the sequence {ak} has a subsequence that converges to a nonzero limit a. By considering Eq. (1.30) for all ak belonging to this subsequence and by taking the limit as k → ∞, we obtain Eq. (1.29). Q.E.D.

Proposition 1.4.2: (Separating Hyperplane Theorem) If C1 and C2 are two nonempty and disjoint convex subsets of <n, there exists a hyperplane that separates them, i.e., a vector a 6= 0 such that

a′x1 ≤ a′x2, ∀ x1 ∈ C1, x2 ∈ C2. (1.31)

Proof: Consider the convex set

C = {x | x = x2 − x1, x1 ∈ C1, x2 ∈ C2}.

Since C1 and C2 are disjoint, the origin does not belong to C, so by the Supporting Hyperplane Theorem there exists a vector a 6= 0 such that

0 ≤ a′x, ∀ x ∈ C,

which is equivalent to Eq. (1.31). Q.E.D.


Proposition 1.4.3: (Strict Separation Theorem) If C1 and C2 are two nonempty and disjoint convex sets such that C1 is closed and C2 is compact, there exists a hyperplane that strictly separates them, i.e., a vector a 6= 0 and a scalar b such that

a′x1 < b < a′x2, ∀ x1 ∈ C1, x2 ∈ C2. (1.32)

Proof: Consider the problem

minimize ‖x1 − x2‖
subject to x1 ∈ C1, x2 ∈ C2.

This problem is equivalent to minimizing the distance d(x2, C1) over x2 ∈ C2. Since C2 is compact, and the distance function is convex (cf. Prop. 1.3.3) and hence continuous (cf. Prop. 1.2.12), the problem has at least one solution (x̄1, x̄2). Let

a = (x̄2 − x̄1)/2, x̄ = (x̄1 + x̄2)/2, b = a′x̄.

Then, a 6= 0, since x̄1 ∈ C1, x̄2 ∈ C2, and C1 and C2 are disjoint. The hyperplane

{x | a′x = b}

contains x̄, and it can be seen from the preceding discussion that x̄1 is the projection of x̄ on C1, and x̄2 is the projection of x̄ on C2 (see Fig. 1.4.3). By Prop. 1.3.3(b), we have

(x̄ − x̄1)′(x1 − x̄1) ≤ 0, ∀ x1 ∈ C1,

or equivalently, since x̄ − x̄1 = a,

a′x1 ≤ a′x̄1 = a′x̄ + a′(x̄1 − x̄) = b − ‖a‖² < b, ∀ x1 ∈ C1.

Thus, the left-hand side of Eq. (1.32) is proved. The right-hand side is proved similarly. Q.E.D.
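The construction in this proof is easy to reproduce numerically. The following Python sketch (an added illustration, not part of the original notes; the sets, solver, and tolerances are arbitrary choices) computes a closest pair between the closed halfplane C1 = {x | x1 ≤ 0} and the compact disk C2 of radius 1 centered at (3, 0), and then forms the strictly separating hyperplane exactly as in the proof:

import numpy as np
from scipy.optimize import minimize

center, radius = np.array([3.0, 0.0]), 1.0

# Minimize ||x1 - x2||^2 over x1 in C1, x2 in C2 (variable v = (x1, x2) in R^4).
res = minimize(
    lambda v: np.sum((v[:2] - v[2:])**2),
    x0=np.array([-1.0, 0.5, 3.0, 0.5]),
    constraints=[
        {"type": "ineq", "fun": lambda v: -v[0]},  # x1 in C1: first coordinate <= 0
        {"type": "ineq",
         "fun": lambda v: radius**2 - np.sum((v[2:] - center)**2)},  # x2 in C2
    ],
)
x1_bar, x2_bar = res.x[:2], res.x[2:]

# Hyperplane of Prop. 1.4.3: a = (x2_bar - x1_bar)/2, b = a'(x1_bar + x2_bar)/2.
a = (x2_bar - x1_bar) / 2
b = a @ (x1_bar + x2_bar) / 2
print(x1_bar, x2_bar, a, b)   # approx. (0, 0), (2, 0), a = (1, 0), b = 1

Here a′x1 ≤ 0 < 1 for all x1 ∈ C1 and a′x2 ≥ 2 > 1 for all x2 ∈ C2, so the hyperplane {x | a′x = 1} strictly separates the two sets.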

The preceding proposition may be used to provide a fundamental characterization of closed convex sets.

Proposition 1.4.4:

(a) A closed convex set C ⊂ <n is the intersection of the halfspaces that contain C.

(b) The closure of the convex hull of a set C ⊂ <n is the intersection of the halfspaces that contain C.


Figure 1.4.3. (a) Illustration of the construction of a strictly separating hyperplane of two disjoint closed convex sets C1 and C2, one of which is also bounded (cf. Prop. 1.4.3). (b) An example with C1 = {(ξ1, ξ2) | ξ1 ≤ 0} and C2 = {(ξ1, ξ2) | ξ1 > 0, ξ2 > 0, ξ1ξ2 ≥ 1}, showing that if neither of the two sets is compact, there may not exist a strictly separating hyperplane. This is due to the fact that the set C = {x1 − x2 | x1 ∈ C1, x2 ∈ C2} is equal to {(ξ1, ξ2) | ξ1 < 0} and is not closed, even though C1 and C2 are closed.

Proof: (a) Clearly C is contained in the intersection of the halfspaces that contain C, so we focus on proving the reverse inclusion. Let x /∈ C. Applying the Strict Separation Theorem (Prop. 1.4.3) to the sets C and {x}, we see that there exists a halfspace containing C but not containing x. Hence, if x /∈ C, then x cannot belong to the intersection of the halfspaces containing C, proving the result.

(b) Each halfspace H that contains C must also contain conv(C) (since H is convex), and also cl(conv(C)) (since H is closed). Hence the intersection of all halfspaces containing C and the intersection of all halfspaces containing cl(conv(C)) coincide. From part (a), the latter intersection is equal to cl(conv(C)). Q.E.D.

E X E R C I S E S

1.4.1 (Strong Separation)

Let C1 and C2 be two nonempty, convex subsets of <n, and let B denote the unit ball in <n. A hyperplane H is said to strongly separate C1 and C2 if there exists an ε > 0 such that C1 + εB is contained in one of the open halfspaces associated with H and C2 + εB is contained in the other. Show that:

(a) The following three conditions are equivalent:

(i) There exists a hyperplane strongly separating C1 and C2.

(ii) There exists a vector b ∈ <n such that infx∈C1 b′x > supx∈C2 b′x.

(iii) inf{‖x1 − x2‖ | x1 ∈ C1, x2 ∈ C2} > 0.

(b) If C1 and C2 are closed and have no common directions of recession, there exists a hyperplane strongly separating C1 and C2.

(c) If the two sets C1 and C2 have disjoint closures, and at least one of the two is bounded, there exists a hyperplane strongly separating them.

1.4.2 (Proper Separation)

Let C1 and C2 be two nonempty, convex subsets of <n. A hyperplane H is said to properly separate C1 and C2 if C1 and C2 are not both contained in H. Show that the following three conditions are equivalent:

(i) There exists a hyperplane properly separating C1 and C2.

(ii) There exists a vector b ∈ <n such that

infx∈C1 b′x ≥ supx∈C2 b′x, supx∈C1 b′x > infx∈C2 b′x.

(iii) The relative interiors ri(C1) and ri(C2) have no point in common.

1.5 CONICAL APPROXIMATIONS AND CONSTRAINED OPTIMIZATION

Given a set C, the cone given by

C⊥ = {y | y′x ≤ 0, ∀ x ∈ C},

is called the polar cone of C (see Fig. 1.5.1). Clearly, the polar cone C⊥, being the intersection of a collection of closed halfspaces, is closed and convex (regardless of whether C is closed and/or convex). We also have the following basic result.


Figure 1.5.1. Illustration of the polar cone C⊥ of a set C ⊂ <2. In (a) C consists of just two points, a1 and a2, while in (b) C is the convex cone {x | x = µ1a1 + µ2a2, µ1 ≥ 0, µ2 ≥ 0}. The polar cone C⊥ is the same in both cases.

Proposition 1.5.1: (Polar Cone Theorem)

(a) For any set C, we have

C⊥ = (cl(C))⊥ = (conv(C))⊥ = (cone(C))⊥.

(b) For any cone C, we have

(C⊥)⊥ = cl(conv(C)).

In particular, if C is closed and convex, (C⊥)⊥ = C.

Proof: (a) Clearly, we have C⊥ ⊃ (cl(C))⊥. Conversely, if y ∈ C⊥, then y′xk ≤ 0 for all k and all sequences {xk} ⊂ C, so that y′x ≤ 0 for all limits x of such sequences. Hence y ∈ (cl(C))⊥ and C⊥ ⊂ (cl(C))⊥.

Similarly, we have C⊥ ⊃ (conv(C))⊥. Conversely, if y ∈ C⊥, then y′x ≤ 0 for all x ∈ C, so that y′z ≤ 0 for all z that are convex combinations of vectors x ∈ C. Hence y ∈ (conv(C))⊥ and C⊥ ⊂ (conv(C))⊥. A nearly identical argument also shows that C⊥ = (cone(C))⊥.

(b) Figure 1.5.2 shows that if C is closed and convex, then (C⊥)⊥ = C. From this it follows that

((cl(conv(C)))⊥)⊥ = cl(conv(C)),

and by using part (a) in the left-hand side above, we obtain (C⊥)⊥ = cl(conv(C)). Q.E.D.

Figure 1.5.2. Proof of the Polar Cone Theorem for the case where C is a closed and convex cone. If x ∈ C, then for all y ∈ C⊥, we have x′y ≤ 0, which implies that x ∈ (C⊥)⊥. Hence, C ⊂ (C⊥)⊥. To prove the reverse inclusion, take z ∈ (C⊥)⊥, and let ẑ be the unique projection of z on C, as shown in the figure. Since C is closed, the projection exists by the Projection Theorem (Prop. 1.3.3), which also implies that

(z − ẑ)′(x − ẑ) ≤ 0, ∀ x ∈ C.

By taking in the preceding relation x = 0 and x = 2ẑ (which belong to C since C is a closed cone), it is seen that

(z − ẑ)′ẑ = 0.

Combining the last two relations, we obtain (z − ẑ)′x ≤ 0 for all x ∈ C. Therefore, (z − ẑ) ∈ C⊥, and since z ∈ (C⊥)⊥, we obtain (z − ẑ)′z ≤ 0, which together with (z − ẑ)′ẑ = 0 yields ‖z − ẑ‖² ≤ 0. Therefore, z = ẑ and z ∈ C. It follows that (C⊥)⊥ ⊂ C.
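As a computational aside (added here, with arbitrary generators, sample sizes, and tolerances), the Polar Cone Theorem can be spot-checked in <2 for the finitely generated cone C = {µ1a1 + µ2a2 | µ1, µ2 ≥ 0}: membership in C is a nonnegative least squares test, membership in C⊥ amounts to the inequalities a′jy ≤ 0 (cf. Prop. 1.6.1 in Section 1.6), and (C⊥)⊥ can be probed by sampling:

import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])          # columns: a1 = (1, 0)', a2 = (1, 1)'

def in_cone(x, tol=1e-8):
    # x in C = cone{a1, a2} iff x = A mu for some mu >= 0 (zero NNLS residual)
    _, residual = nnls(A, x)
    return residual < tol

def in_polar(y, tol=1e-8):
    # C_perp = {y | a1'y <= 0, a2'y <= 0}
    return bool(np.all(A.T @ y <= tol))

# Probe (C_perp)_perp = C on random samples:
polar_dirs = [y for y in rng.normal(size=(2000, 2)) if in_polar(y)]
mismatches = 0
for x in rng.normal(size=(300, 2)):
    in_double_polar = all(x @ y <= 1e-6 for y in polar_dirs)
    mismatches += (in_double_polar != in_cone(x))
print(mismatches)   # expect 0, up to boundary/tolerance effects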

The analysis of a constrained optimization problem often centers on how the cost function behaves along directions leading away from a local minimum to some neighboring feasible points. The sets of the relevant directions constitute cones that can be viewed as approximations to the constraint set, locally near a point of interest. Let us introduce two such cones, which are important in connection with optimality conditions.


Definition 1.5.1: Given a subset X of <n and a vector x ∈ X, a feasible direction of X at x is a vector y ∈ <n such that there exists an ᾱ > 0 with x + αy ∈ X for all α ∈ [0, ᾱ]. The set of all feasible directions of X at x is a cone denoted by FX(x).

It can be seen that if X is convex, the feasible directions at x are the vectors of the form α(x̄ − x) with α > 0 and x̄ ∈ X. However, when X is nonconvex, the cone of feasible directions need not provide interesting information about the local structure of the set X near the point x. For example, often there is no nonzero feasible direction at x when X is nonconvex [think of the set X = {x | h(x) = 0}, where h : <n 7→ < is a nonlinear function]. The next definition introduces a cone that provides information on the local structure of X even when there is no nonzero feasible direction.

Definition 1.5.2: Given a subset X of <n and a vector x ∈ X, a vector y is said to be a tangent of X at x if either y = 0 or there exists a sequence {xk} ⊂ X such that xk 6= x for all k and

xk → x, (xk − x)/‖xk − x‖ → y/‖y‖.

The set of all tangents of X at x is a cone called the tangent cone of X at x, and is denoted by TX(x).

Figure 1.5.3. Illustration of a tangent y at a vector x ∈ X. There is a sequence {xk} ⊂ X that converges to x and is such that the normalized direction sequence (xk − x)/‖xk − x‖ converges to y/‖y‖, the normalized direction of y, or equivalently, the sequence yk = ‖y‖(xk − x)/‖xk − x‖ illustrated in the figure converges to y.

Thus a nonzero vector y is a tangent at x if it is possible to approach x with a feasible sequence {xk} such that the normalized direction sequence (xk − x)/‖xk − x‖ converges to y/‖y‖, the normalized direction of y (see Fig. 1.5.3).

Figure 1.5.4. Examples of the cones FX(x) and TX(x) of a set X at the vector x = (0, 1). In (a), we have

X = {(x1, x2) | (x1 + 1)² − x2 ≤ 0, (x1 − 1)² − x2 ≤ 0}.

Here X is convex and the tangent cone TX(x) is equal to the closure of the cone of feasible directions FX(x) (which is an open set in this example). Note, however, that the vectors (1, 2) and (−1, 2) (as well as the origin) belong to TX(x) and also to the closure of FX(x), but are not feasible directions. In (b), we have

X = {(x1, x2) | ((x1 + 1)² − x2)((x1 − 1)² − x2) = 0}.

Here the set X is nonconvex, and TX(x) is closed but not convex. Furthermore, FX(x) consists of just the zero vector.

The following proposition provides an equivalent definition of a tangent, which is occasionally more convenient.

Proposition 1.5.2: A vector y is a tangent of a set X at a vector x ∈ X if and only if there exists a sequence {xk} ⊂ X with xk → x, and a positive sequence {αk} such that αk → 0 and (xk − x)/αk → y.

Proof: If {xk} is the sequence in the definition of a tangent, take αk = ‖xk − x‖/‖y‖. Conversely, if αk → 0, (xk − x)/αk → y, and y 6= 0, clearly xk → x and

(xk − x)/‖xk − x‖ = ((xk − x)/αk) / ‖(xk − x)/αk‖ → y/‖y‖,

so {xk} satisfies the definition of a tangent. Q.E.D.

Figure 1.5.4 illustrates the cones TX(x) and FX(x) with examples. The following proposition gives some of the properties of the cones FX(x) and TX(x).


Proposition 1.5.3: Let X be a nonempty subset of <n and let x be a vector in X. The following hold regarding the cone of feasible directions FX(x) and the tangent cone TX(x).

(a) TX(x) is a closed cone.

(b) cl(FX(x)) ⊂ TX(x).

(c) If X is convex, then FX(x) and TX(x) are convex, and we have cl(FX(x)) = TX(x).

Proof: (a) Let {yk} be a sequence in TX(x) that converges to y. We will show that y ∈ TX(x). If y = 0, then y ∈ TX(x), so assume that y 6= 0. By the definition of a tangent, for every yk there is a sequence {x^i_k} ⊂ X with x^i_k 6= x such that

lim_{i→∞} x^i_k = x,  lim_{i→∞} (x^i_k − x)/‖x^i_k − x‖ = yk/‖yk‖.

For k = 1, 2, . . ., choose an index ik such that i1 < i2 < · · · < ik and

lim_{k→∞} ‖x^{ik}_k − x‖ = 0,  lim_{k→∞} ‖(x^{ik}_k − x)/‖x^{ik}_k − x‖ − yk/‖yk‖‖ = 0.

We also have for all k

‖(x^{ik}_k − x)/‖x^{ik}_k − x‖ − y/‖y‖‖ ≤ ‖(x^{ik}_k − x)/‖x^{ik}_k − x‖ − yk/‖yk‖‖ + ‖yk/‖yk‖ − y/‖y‖‖,

so the fact yk → y and the preceding two relations imply that

lim_{k→∞} ‖x^{ik}_k − x‖ = 0,  lim_{k→∞} ‖(x^{ik}_k − x)/‖x^{ik}_k − x‖ − y/‖y‖‖ = 0.

Hence y ∈ TX(x) and TX(x) is closed.

(b) Clearly every feasible direction is also a tangent, so FX(x) ⊂ TX(x). Since by part (a), TX(x) is closed, the result follows.

(c) If X is convex, the feasible directions at x are the vectors of the form α(x̄ − x) with α > 0 and x̄ ∈ X. From this it can be seen that FX(x) is convex. Convexity of TX(x) will follow from the convexity of FX(x) once we show that cl(FX(x)) = TX(x).

In view of part (b), it will suffice to show that TX(x) ⊂ cl(FX(x)). Let y ∈ TX(x) and, using Prop. 1.5.2, let {xk} be a sequence in X and {αk} be a positive sequence such that αk → 0 and (xk − x)/αk → y. Since X is a convex set, the direction (xk − x)/αk is feasible at x for all k. Hence y ∈ cl(FX(x)), and it follows that TX(x) ⊂ cl(FX(x)). Q.E.D.
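The convex case of part (c) can be visualized numerically. The following sketch (an added illustration; the set and step sizes are arbitrary choices) takes X to be the closed unit disk and x = (1, 0), where TX(x) = {y | y1 ≤ 0}, and checks on random directions that every feasible direction lies in the tangent cone:

import numpy as np

# X = closed unit disk in R^2 and x = (1, 0); here T_X(x) = {y | y_1 <= 0}.
x = np.array([1.0, 0.0])
inside = lambda p: p @ p <= 1.0 + 1e-12

rng = np.random.default_rng(2)
for y in rng.normal(size=(1000, 2)):
    # crude test of y in F_X(x): is x + a*y in X for some small a > 0?
    feasible = any(inside(x + a * y) for a in (1e-2, 1e-4, 1e-6))
    assert (not feasible) or y[0] <= 0.0   # F_X(x) lies inside T_X(x), cf. part (b)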

The tangent cone finds an important application in the following basic necessary condition for local optimality:

Proposition 1.5.4: Let f : <n 7→ < be a smooth function, and let x∗ be a local minimum of f over a set X ⊂ <n. Then

∇f(x∗)′y ≥ 0, ∀ y ∈ TX(x∗).

If X is convex, this condition can be equivalently written as

∇f(x∗)′(x − x∗) ≥ 0, ∀ x ∈ X,

and in the case where X = <n, it reduces to ∇f(x∗) = 0.

Proof: Let y be a nonzero tangent of X at x∗. Then there exists a sequence {ξk} and a sequence {xk} ⊂ X such that xk 6= x∗ for all k,

ξk → 0, xk → x∗,

and

(xk − x∗)/‖xk − x∗‖ = y/‖y‖ + ξk.

By the Mean Value Theorem, we have for all k

f(xk) = f(x∗) + ∇f(x̃k)′(xk − x∗),

where x̃k is a vector that lies on the line segment joining xk and x∗. Combining the last two equations, we obtain

f(xk) = f(x∗) + (‖xk − x∗‖/‖y‖) ∇f(x̃k)′yk, (1.33)

where

yk = y + ‖y‖ξk.

If ∇f(x∗)′y < 0, since x̃k → x∗ and yk → y, it follows that for all sufficiently large k, ∇f(x̃k)′yk < 0 and [from Eq. (1.33)] f(xk) < f(x∗). This contradicts the local optimality of x∗.


When X is convex, we have cl(FX(x∗)) = TX(x∗) (cf. Prop. 1.5.3). Thus the condition shown can be written as

∇f(x∗)′y ≥ 0, ∀ y ∈ cl(FX(x∗)),

which in turn is equivalent to

∇f(x∗)′(x − x∗) ≥ 0, ∀ x ∈ X.

If X = <n, by setting x = x∗ + ei and x = x∗ − ei, where ei is the ith unit vector (all components are 0 except for the ith, which is 1), we obtain ∂f(x∗)/∂xi = 0 for all i = 1, . . . , n, so ∇f(x∗) = 0. Q.E.D.
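For a concrete check (an added example; the particular f and X are assumptions chosen for the demo), the following sketch minimizes a smooth convex function over a box and then samples the condition ∇f(x∗)′(x − x∗) ≥ 0:

import numpy as np
from scipy.optimize import minimize

# f(x) = (x1 - 2)^2 + (x2 + 1)^2, minimized over the box X = [0, 1] x [0, 1].
f = lambda x: (x[0] - 2.0)**2 + (x[1] + 1.0)**2
grad = lambda x: np.array([2.0 * (x[0] - 2.0), 2.0 * (x[1] + 1.0)])

res = minimize(f, x0=np.array([0.5, 0.5]), jac=grad, bounds=[(0.0, 1.0), (0.0, 1.0)])
x_star = res.x   # expected: (1, 0), on the boundary of X

# Necessary condition of Prop. 1.5.4 for convex X: grad f(x*)'(x - x*) >= 0 on X.
rng = np.random.default_rng(1)
samples = rng.uniform(0.0, 1.0, size=(1000, 2))
print(np.min((samples - x_star) @ grad(x_star)))   # >= 0, up to solver tolerance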

A direction y for which ∇f(x∗)′y < 0 may be viewed as a descent direction of f at x∗, in the sense that we have (by Taylor's theorem)

f(x∗ + αy) = f(x∗) + α∇f(x∗)′y + o(α) < f(x∗)

for sufficiently small but positive α. Thus Prop. 1.5.4 says that if x∗ is a local minimum, there is no descent direction within the tangent cone TX(x∗).

Note that the necessary condition of Prop. 1.5.4 can equivalently be written as

−∇f(x∗) ∈ TX(x∗)⊥

(see Fig. 1.5.5). There is an interesting converse of this result, namely that given any vector z ∈ TX(x∗)⊥, there exists a smooth function f such that −∇f(x∗) = z and x∗ is a local minimum of f over X. We will return to this result and to the subject of conical approximations when we discuss Lagrange multipliers in Chapter 2.

The Normal Cone

In addition to the cone of feasible directions and the tangent cone, there is one more conical approximation that is of special interest for the optimization topics covered in this book. This is the normal cone of X at x, denoted by NX(x), and obtained from the polar cone TX(x)⊥ by means of a closure operation. In particular, we have z ∈ NX(x) if there exist sequences {xk} ⊂ X and {zk} such that xk → x, zk → z, and zk ∈ TX(xk)⊥ for all k. Equivalently, the graph of NX(·), viewed as a point-to-set mapping, is the closure of the graph of TX(·)⊥:

{(x, z) | x ∈ X, z ∈ NX(x)} = cl({(x, z) | x ∈ X, z ∈ TX(x)⊥}).

Clearly, NX(x) is a closed cone containing TX(x)⊥, but it need not be convex like TX(x)⊥ (see the examples of Fig. 1.5.6).


Figure 1.5.5. Illustration of the necessary optimality condition

−∇f(x∗) ∈ TX(x∗)⊥

for x∗ to be a local minimum of f over X.

In the case where TX(x)⊥ = NX(x), we say that X is regular at x. An important consequence of convexity of X is that it implies regularity, as shown in the following proposition.

Proposition 1.5.5: Let X be a convex set. Then for all x ∈ X, we have

z ∈ TX(x)⊥ if and only if z′(x̄ − x) ≤ 0, ∀ x̄ ∈ X. (1.34)

Furthermore, X is regular at all x ∈ X.

Proof: Since (x̄ − x) ∈ FX(x) ⊂ TX(x) for all x̄ ∈ X, it follows that if z ∈ TX(x)⊥, then z′(x̄ − x) ≤ 0 for all x̄ ∈ X. Conversely, let z be such that z′(x̄ − x) ≤ 0 for all x̄ ∈ X, and to arrive at a contradiction, assume that z /∈ TX(x)⊥. Then there exists some y ∈ TX(x) such that z′y > 0. Since cl(FX(x)) = TX(x) [cf. Prop. 1.5.3(c)], there exists a sequence {yk} ⊂ FX(x) such that yk → y, so that yk = αk(x̄k − x) for some αk > 0 and some x̄k ∈ X. Since z′y > 0, we have αkz′(x̄k − x) > 0 for large enough k, which is a contradiction.


Figure 1.5.6. Examples of normal cones. In the case of figure (a), X is the union of two lines passing through the origin:

X = {x | (a′1x)(a′2x) = 0}.

For x = 0 we have TX(x) = X, TX(x)⊥ = {0}, while NX(x) is the nonconvex set consisting of the two lines of vectors that are collinear to either a1 or a2. Thus X is not regular at x = 0. At all other vectors x ∈ X, we have regularity, with TX(x)⊥ and NX(x) equal to either the line of vectors that are collinear to a1 or the line of vectors that are collinear to a2.

In the case of figure (b), X is regular at all points except at x = 0, where we have TX(x) = <n, TX(x)⊥ = {0}, while NX(x) is equal to the horizontal axis.


If x ∈ X and z ∈ NX(x), there must exist sequences {xk} ⊂ X and {zk} such that xk → x, zk → z, and zk ∈ TX(xk)⊥. By Eq. (1.34) (which critically depends on the convexity of X), we must have z′k(x̄ − xk) ≤ 0 for all x̄ ∈ X. Taking the limit as k → ∞, we obtain z′(x̄ − x) ≤ 0 for all x̄ ∈ X, which by Eq. (1.34), implies that z ∈ TX(x)⊥. Thus, we have NX(x) ⊂ TX(x)⊥. Since the reverse inclusion always holds, it follows that NX(x) = TX(x)⊥, so that X is regular at x. Q.E.D.

Note that convexity of TX(x) does not imply regularity of X at x, as the example of Fig. 1.5.6(b) shows. However, it can be shown that the converse is true, i.e., if X is regular at x, then TX(x) must be convex (see Rockafellar and Wets [RoW98]). This provides an often useful method for recognizing nonregularity.

E X E R C I S E S

1.5.1 (Fermat’s Principle in Optics)

Let C ⊂ <n be a closed convex set, and let y and z be given vectors in <n such that the line segment connecting y and z does not intersect C. Consider the problem of minimizing the sum of distances ‖y − x‖ + ‖z − x‖ over x ∈ C. Derive a necessary and sufficient optimality condition. Does an optimal solution exist, and if so, is it unique? Discuss the case where C is closed but not convex.

1.5.2

Let C1, C2, and C3 be three closed subsets of <n. Consider the problem of finding a triangle with minimum perimeter that has one vertex on each of the three sets, i.e., the problem of minimizing ‖x1 − x2‖ + ‖x2 − x3‖ + ‖x3 − x1‖ subject to xi ∈ Ci, i = 1, 2, 3, and the additional condition that x1, x2, and x3 do not lie on the same line. Show that if (x∗1, x∗2, x∗3) defines an optimal triangle, there exists a vector z∗ in the triangle such that

(z∗ − x∗i ) ∈ TCi(x∗i )⊥, i = 1, 2, 3.

1.5.3 (Cone Decomposition Theorem)

Let C ⊂ <n be a closed convex cone and let x be a given vector in <n. Show that:

(a) x̂ is the projection of x on C if and only if

x̂ ∈ C, x − x̂ ∈ C⊥, (x − x̂)′x̂ = 0.

(b) The following two properties are equivalent:

(i) x1 and x2 are the projections of x on C and C⊥, respectively.

(ii) x = x1 + x2 with x1 ∈ C, x2 ∈ C⊥, and x′1x2 = 0.

1.5.4

Let C ⊂ <n be a closed convex cone and let a be a given vector in <n. Show that for any positive scalars β and γ, we have

max{a′x | ‖x‖ ≤ β, x ∈ C} ≤ γ if and only if a ∈ C⊥ + {x | ‖x‖ ≤ γ/β}.

1.5.5 (Quasiregularity)

Consider the set

X = {x | hi(x) = 0, i = 1, . . . , m},

where the functions hi : <n 7→ < are smooth. For any x ∈ X consider the subspace

V (x) = {y | ∇hi(x)′y = 0, i = 1, . . . , m}.

Show that:

(a) TX(x) ⊂ V (x).

(b) TX(x) = V (x) if either the gradients ∇hi(x), i = 1, . . . , m, are linearly independent or the functions hi are linear. Note: The property TX(x) = V (x) is called quasiregularity, and will be significant in the Lagrange multiplier theory of Chapter 2.

1.6 POLYHEDRAL CONVEXITY

1.6.1 Polyhedral Cones

We now develop some basic results regarding the geometry of polyhedral sets. We start with properties of cones that have a polyhedral structure. We say that a cone C is finitely generated, if it has the form

C = {x | x = µ1a1 + · · · + µrar, µj ≥ 0, j = 1, . . . , r},

where a1, . . . , ar are some vectors. We say that a cone C is polyhedral, if it has the form

C = {x | a′jx ≤ 0, j = 1, . . . , r},

where a1, . . . , ar are some vectors.

It can be seen that polyhedral cones are closed, since they are intersections of closed halfspaces. Finitely generated cones are also closed and are in fact polyhedral, but this is a fairly deep fact, which is shown in the following proposition.

Proposition 1.6.1:

(a) Let a1, . . . , ar be vectors of <n. Then the finitely generated cone

C = {x | x = µ1a1 + · · · + µrar, µj ≥ 0, j = 1, . . . , r} (1.35)

is closed and its polar cone is the polyhedral cone given by

C⊥ = {x | a′jx ≤ 0, j = 1, . . . , r}. (1.36)

(b) (Farkas' Lemma) Let x, e1, . . . , em, and a1, . . . , ar be vectors of <n. We have x′y ≤ 0 for all vectors y ∈ <n such that

y′ei = 0, ∀ i = 1, . . . , m, y′aj ≤ 0, ∀ j = 1, . . . , r,

if and only if x can be expressed as

x = λ1e1 + · · · + λmem + µ1a1 + · · · + µrar,

where λi and µj are some scalars with µj ≥ 0 for all j.

(c) (Minkowski–Weyl Theorem) A cone is polyhedral if and only if it is finitely generated.

Proof: (a) We first show that the polar cone of C has the desired form (1.36). If y satisfies y′aj ≤ 0 for all j, then y′x ≤ 0 for all x ∈ C, so the set in the right-hand side of Eq. (1.36) is a subset of C⊥. Conversely, if y ∈ C⊥, i.e., if y′x ≤ 0 for all x ∈ C, then (since the aj belong to C) we have y′aj ≤ 0 for all j. Thus, C⊥ is a subset of the set in the right-hand side of Eq. (1.36).


It remains to show that C is closed. We will give two proofs for this. The first proof is simpler and suffices for the purpose of showing Farkas' lemma [part (b)]. The second proof shows a stronger result, namely that C is polyhedral. This not only shows that C is closed, but also proves half of the Minkowski–Weyl Theorem [part (c)].

The first proof that C is closed is based on induction on the number of vectors r. When r = 1, C is either {0} (if a1 = 0) or a halfline, and is therefore closed. Suppose, for some r ≥ 1, all cones of the form

{x | x = µ1a1 + · · · + µrar, µj ≥ 0}

are closed. Then, we will show that a cone of the form

Cr+1 = {x | x = µ1a1 + · · · + µr+1ar+1, µj ≥ 0}

is also closed. Without loss of generality, assume that ‖aj‖ = 1 for all j. There are two cases: (i) The vectors −a1, . . . ,−ar+1 belong to Cr+1, in which case Cr+1 is the subspace spanned by a1, . . . , ar+1 and is therefore closed, and (ii) The negative of one of the vectors, say −ar+1, does not belong to Cr+1. In this case, consider the cone

Cr = {x | x = µ1a1 + · · · + µrar, µj ≥ 0},

which is closed by the induction hypothesis. Let

m = min{a′r+1x | x ∈ Cr, ‖x‖ = 1}.

Since the set {x ∈ Cr | ‖x‖ = 1} is nonempty and compact, the minimum above is attained at some x∗ by Weierstrass' Theorem. We have, using the Schwarz inequality,

m = a′r+1x∗ ≥ −‖ar+1‖ · ‖x∗‖ = −1,

with equality if and only if x∗ = −ar+1. It follows that

m > −1,

since otherwise we would have x∗ = −ar+1, which (since Cr ⊂ Cr+1) violates the hypothesis (−ar+1) /∈ Cr+1. Let {xk} be a convergent sequence in Cr+1. We will prove that its limit belongs to Cr+1, thereby showing that Cr+1 is closed. Indeed, for all k, we have xk = ξkar+1 + yk, where ξk ≥ 0 and yk ∈ Cr. Using the fact ‖ar+1‖ = 1, we obtain

‖xk‖² = ξk² + ‖yk‖² + 2ξk a′r+1yk
     ≥ ξk² + ‖yk‖² + 2m ξk‖yk‖
     = (ξk − ‖yk‖)² + 2(1 + m) ξk‖yk‖.

Since {xk} converges, ξk ≥ 0, and 1 + m > 0, it follows that the sequences {ξk} and {yk} are bounded and hence they have limit points, denoted by ξ and y, respectively. The limit of {xk} is

lim k→∞ (ξkar+1 + yk) = ξar+1 + y,

which belongs to Cr+1, since ξ ≥ 0 and y ∈ Cr (by the closure hypothesis on Cr). We conclude that Cr+1 is closed, completing the proof.

We now give the alternative proof that C is closed, by showing that it is polyhedral. The proof is constructive and uses induction on the number of vectors r. To start the induction, we assume without loss of generality that a1 = 0. Then, for r = 1, we have C = {0}, which is polyhedral, since it can be expressed as

{x | u′ix ≤ 0, −u′ix ≤ 0, i = 1, . . . , n},

where ui is the ith unit coordinate vector.

Assume that for some r ≥ 2, the set

Cr−1 = {x | x = µ1a1 + · · · + µr−1ar−1, µj ≥ 0}

has a polyhedral representation

Pr−1 = {x | b′jx ≤ 0, j = 1, . . . , m}.

Let

βj = a′rbj , j = 1, . . . , m,

and define the index sets

J− = {j | βj < 0}, J0 = {j | βj = 0}, J+ = {j | βj > 0}.

Let also

bl,k = bl − (βl/βk) bk , ∀ l ∈ J+, k ∈ J−.


We will show that the set

Cr = {x | x = µ1a1 + · · · + µrar, µj ≥ 0}

has the polyhedral representation

Pr = {x | b′jx ≤ 0, j ∈ J− ∪ J0, b′l,kx ≤ 0, l ∈ J+, k ∈ J−},

thus completing the induction.

We have Cr ⊂ Pr because by construction, all the vectors a1, . . . , ar satisfy the inequalities defining Pr. To show the reverse inclusion, we fix a vector x ∈ Pr and we verify that there exists µr ≥ 0 such that

x − µrar ∈ Pr−1

[since Pr−1 = Cr−1 by the induction hypothesis, this implies that x ∈ Cr]. This condition is equivalent to

γ ≤ µr ≤ δ,

where

γ = max{0, maxj∈J+ (b′jx/βj)}, δ = mink∈J− (b′kx/βk).

Since x ∈ Pr, we have

0 ≤ b′kx/βk , ∀ k ∈ J−, (1.37)

and also b′l,kx ≤ 0 for all l ∈ J+, k ∈ J−, or equivalently

b′lx/βl ≤ b′kx/βk , ∀ l ∈ J+, k ∈ J−. (1.38)

Equations (1.37) and (1.38) imply that γ ≤ δ, thereby completing the proof.

(b) Define ar+i = ei and ar+m+i = −ei, i = 1, . . . , m. The result to be shown translates to

x ∈ P⊥ ⇐⇒ x ∈ C,

where

P = {y | y′aj ≤ 0, j = 1, . . . , r + 2m},

C = {x | x = µ1a1 + · · · + µr+2mar+2m, µj ≥ 0}.


Since by part (a), P = C⊥ and C is closed and convex, we have by the Polar Cone Theorem (Prop. 1.5.1) P⊥ = (C⊥)⊥ = C.

(c) We have already shown in the proof of part (a) that a finitely generated cone is polyhedral. To show the reverse, we use part (a) and the Polar Cone Theorem (Prop. 1.5.1) to conclude that the polar of any polyhedral cone [cf. Eq. (1.36)] is finitely generated [cf. Eq. (1.35)]. The finitely generated cone (1.35) has already been shown to be polyhedral, so its polar, which is the “typical” polyhedral cone (1.36), is finitely generated. This completes the proof. Q.E.D.
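Farkas' Lemma also has a direct computational reading: deciding whether x = λ1e1 + · · · + λmem + µ1a1 + · · · + µrar with µj ≥ 0 is a linear feasibility problem. The sketch below (an added aside; the data are made up) tests it with scipy.optimize.linprog:

import numpy as np
from scipy.optimize import linprog

E = np.array([[1.0], [0.0], [0.0]])                 # columns e_i (here m = 1)
A = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])  # columns a_j (here r = 2)
x = np.array([2.0, 3.0, 1.0])

n_lam, n_mu = E.shape[1], A.shape[1]
res = linprog(
    c=np.zeros(n_lam + n_mu),                # pure feasibility problem
    A_eq=np.hstack([E, A]),
    b_eq=x,
    bounds=[(None, None)] * n_lam + [(0, None)] * n_mu,  # lam free, mu >= 0
)
print(res.status == 0)   # True: x lies in the cone, so x'y <= 0 for all admissible y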

1.6.2 Polyhedral Sets

A nonempty subset of <n is said to be a polyhedral set (or polyhedron) if it is of the form

P = {x | a′jx ≤ bj , j = 1, . . . , r},

where aj are some vectors and bj are some scalars.

The following is a fundamental result, showing that a polyhedral set can be represented as the vector sum of the convex hull of a finite set of points and a finitely generated cone.

Proposition 1.6.2: A set P is polyhedral if and only if there exist a nonempty and finite set of vectors {v1, . . . , vm}, and a finitely generated cone C such that

P = {x | x = y + µ1v1 + · · · + µmvm, y ∈ C, µ1 + · · · + µm = 1, µj ≥ 0, j = 1, . . . , m}.

Proof: Assume that P is polyhedral. Then, it has the form

P = {x | a′jx ≤ bj , j = 1, . . . , r},

for some vectors aj and some scalars bj. Consider the polyhedral cone of <n+1 given by

P̂ = {(x, w) | 0 ≤ w, a′jx ≤ bjw, j = 1, . . . , r},

and note that

P = {x | (x, 1) ∈ P̂}.


By the Minkowski–Weyl Theorem [Prop. 1.6.1(c)], P̂ is finitely generated, so it has the form

P̂ = {(x, w) | x = µ1v1 + · · · + µmvm, w = µ1d1 + · · · + µmdm, µj ≥ 0, j = 1, . . . , m},

for some vectors vj and scalars dj. Since w ≥ 0 for all vectors (x, w) ∈ P̂, we see that dj ≥ 0 for all j. Let

J+ = {j | dj > 0}, J0 = {j | dj = 0}.

By replacing µj with µjdj and vj with vj/dj for all j ∈ J+, we obtain the equivalent description

P̂ = {(x, w) | x = µ1v1 + · · · + µmvm, w = ∑j∈J+ µj , µj ≥ 0, j = 1, . . . , m}.

Since P = {x | (x, 1) ∈ P̂}, we obtain

P = {x | x = ∑j∈J+ µjvj + ∑j∈J0 µjvj , ∑j∈J+ µj = 1, µj ≥ 0, j = 1, . . . , m}.

Thus, P is the vector sum of the convex hull of the vectors vj, j ∈ J+, and the finitely generated cone

{∑j∈J0 µjvj | µj ≥ 0, j ∈ J0}.

To prove, conversely, that the vector sum of the convex hull of a finite set of points with a finitely generated cone is a polyhedral set, we use a reverse argument; we pass to a finitely generated cone description, we use the Minkowski–Weyl Theorem to assert that this cone is polyhedral, and we finally construct a polyhedral set description. The details are left as an exercise for the reader. Q.E.D.
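In the compact case (C = {0}), the passage between the two descriptions of Prop. 1.6.2 is implemented in standard software; for instance, scipy.spatial.ConvexHull converts the points v1, . . . , vm into a halfspace description. A small sketch (an added aside, with random data):

import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
points = rng.normal(size=(10, 2))     # v_1, ..., v_10 in the plane

hull = ConvexHull(points)
# hull.equations has rows [a_j, c_j] with a_j' x + c_j <= 0 on the hull,
# i.e., a halfspace representation P = {x | a_j' x <= b_j} with b_j = -c_j.
A, b = hull.equations[:, :-1], -hull.equations[:, -1]

# Every generating point satisfies all the inequalities (up to round-off):
print(np.max(A @ points.T - b[:, None]))   # should be <= ~1e-12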

1.6.3 Extreme Points

A vector x is said to be an extreme point of a convex set C if x belongs to C and there do not exist vectors y ∈ C and z ∈ C, with y 6= x and z 6= x, and a scalar α ∈ (0, 1) such that x = αy + (1 − α)z. An equivalent definition is that x cannot be expressed as a convex combination of some vectors of C, all of which are different from x.

Page 112: Convexity, Duality, and Lagrange Multipliersceimat/antiguo/sea/apuntes/Chapter1.pdf · alongsome directions and stay withinX forat least anontrivial inter-val. In fact a stronger

106 Convex Analysis and Optimization Chap. 1

The following proposition provides some intuition into the nature of extreme points.

Proposition 1.6.3: Let C be a nonempty, closed, convex set in <n.

(a) If H is a hyperplane that passes through a boundary point of C and contains C in one of its halfspaces, then every extreme point of T = C ∩ H is also an extreme point of C.

(b) C has at least one extreme point if and only if it does not contain a line, i.e., a set L of the form L = {x + αd | α ∈ <} with d 6= 0.

Proof: (a) Let x be an element of T which is not an extreme point of C. Then we have x = αy + (1 − α)z for some α ∈ (0, 1), and some y ∈ C and z ∈ C, with y 6= x and z 6= x. Since x ∈ H, x is a boundary point of C, and the halfspace containing C is of the form {w | a′w ≥ a′x}, where a 6= 0. Then a′y ≥ a′x and a′z ≥ a′x, which in view of x = αy + (1 − α)z, implies that a′y = a′x and a′z = a′x. Therefore, y ∈ T and z ∈ T, showing that x cannot be an extreme point of T.

(b) Assume that C has an extreme point x̄ and contains a line L = {x + αd | α ∈ <}, where d 6= 0. We will arrive at a contradiction. For each integer n > 0, the vector

xn = (1 − 1/n)x̄ + (1/n)(x + nd) = x̄ + d + (1/n)(x − x̄)

lies in the line segment connecting x̄ and x + nd, so it belongs to C. Since C is closed, x̄ + d = limn→∞ xn must also belong to C. Similarly, we show that x̄ − d must belong to C. Thus x̄ − d, x̄, and x̄ + d all belong to C, contradicting the hypothesis that x̄ is an extreme point.

Conversely, we use induction on the dimension of the space to show that if C does not contain a line, it must have an extreme point. This is true in the real line <1, so assume it is true in <n−1. If a nonempty, closed, convex subset C of <n contains no line, it must have some boundary point x̄. Take any hyperplane H passing through x̄ and containing C in one of its halfspaces. Then, since H is an (n − 1)-dimensional affine set, the set C ∩ H lies in an (n − 1)-dimensional space and contains no line, so by the induction hypothesis, it must have an extreme point. By part (a), this extreme point must also be an extreme point of C. Q.E.D.

An important fact, which forms the basis for the simplex method of linear programming, is that if a linear function f attains a minimum over a polyhedral set C having at least one extreme point, then f attains a minimum at some extreme point of C (as well as possibly at some other nonextreme points). We will come to this fact after considering the more general case where f is concave and C is closed and convex.

We say that a set C ⊂ <n is bounded from below if there exists a vector b ∈ <n such that x ≥ b for all x ∈ C.

Proposition 1.6.4: Let C be a closed convex set which is bounded from below and let f : C 7→ < be a concave function. Then if f attains a minimum over C, it attains a minimum at some extreme point of C.

Proof: We first show that f attains a minimum at some boundary point of C. Let x∗ be a vector where f attains a minimum over C. If x∗ is a boundary point we are done, so assume that x∗ is an interior point of C. Let

L = {x | x = x∗ + λd, λ ∈ <}

be a line passing through x∗, where d is a vector with strictly positive coordinates. Then, using the boundedness from below, convexity, and closedness of C, we see that the set C ∩ L contains a set of the form

{x∗ + λd | λ1 ≤ λ ≤ λ2}

for some λ2 > 0 and some λ1 < 0 for which the vector

x̄ = x∗ + λ1d

is a boundary point of C. Note that x∗ = µx̄ + (1 − µ)(x∗ + λ2d), where µ = λ2/(λ2 − λ1) ∈ (0, 1). If f(x̄) > f(x∗), we have by concavity of f,

f(x∗) ≥ µ f(x̄) + (1 − µ) f(x∗ + λ2d) > µ f(x∗) + (1 − µ) f(x∗ + λ2d).

It follows that f(x∗) > f(x∗ + λ2d). This contradicts the optimality of x∗, proving that f(x̄) = f(x∗).

We have shown that the minimum of f is attained at some boundary point x̄ of C. If x̄ is an extreme point of C, we are done. If it is not an extreme point, consider a hyperplane H passing through x̄ and containing C in one of its halfspaces. The intersection T1 = C ∩ H is closed, convex, bounded from below, and lies in an affine set M1 of dimension n − 1. Furthermore, f attains its minimum over T1 at x̄. Thus, by the preceding argument, it also attains its minimum at some boundary point x1 of T1. If x1 is an extreme point of T1, then by Prop. 1.6.3, it is also an extreme point of C and the result follows. If x1 is not an extreme point of T1, then we view M1 as a space of dimension n − 1 and we form T2, the intersection of T1 with a hyperplane in M1 that passes through x1 and contains T1 in one of its halfspaces. This hyperplane will be of dimension n − 2. We can continue this process for at most n times, when a set Tn consisting of a single point is obtained. This point is an extreme point of Tn and, by repeated application of Prop. 1.6.3, an extreme point of C. Q.E.D.

As a corollary we have the following:

Proposition 1.6.5: Let C be a closed convex set and let f : C 7→ < be a concave function. Assume that for some invertible n × n matrix A and some b ∈ <n we have

Ax ≥ b, ∀ x ∈ C.

Then if f attains a minimum over C, it attains a minimum at some extreme point of C.

Proof: Consider the transformation x = A−1y and the problem of minimizing

h(y) = f(A−1y)

over Y = {y | A−1y ∈ C}. The function h is concave over the closed convex set Y. Furthermore, y ≥ b for all y ∈ Y and hence Y is bounded from below. By Prop. 1.6.4, h attains a minimum at some extreme point y∗ of Y. Then f attains its minimum over C at x∗ = A−1y∗, while x∗ is an extreme point of C, since it can be verified that invertible transformations of sets map extreme points to extreme points. Q.E.D.

1.6.4 Extreme Points and Linear Programming

We now consider a polyhedral set P and we characterize the set of its extreme points (also called vertices). By Prop. 1.6.2, P can be represented as

P = C + P̄,

where C is a finitely generated cone and P̄ is the convex hull of some vectors v1, . . . , vm:

P̄ = {x | x = µ1v1 + · · · + µmvm, µ1 + · · · + µm = 1, µj ≥ 0, j = 1, . . . , m}.

We note that an extreme point x of P cannot be of the form x = c + x̄, where c 6= 0, c ∈ C, and x̄ ∈ P̄, since in this case x would be the midpoint of the line segment connecting the distinct vectors x̄ and 2c + x̄. Therefore, an extreme point of P must belong to P̄, and since P̄ ⊂ P, it must also be an extreme point of P̄. An extreme point of P̄ must be one of the vectors v1, . . . , vm, since otherwise this point would be expressible as a convex combination of v1, . . . , vm, all of which are different from it. Thus the set of extreme points of P is either empty or finite. Using Prop. 1.6.3(b), it follows that the set of extreme points of P is nonempty and finite if and only if P contains no line.

If P is bounded, then we must have P = P̄, and it can be shown that P is equal to the convex hull of its extreme points (not just the convex hull of the vectors v1, . . . , vm), as shown in the following proposition. The proposition also gives another and more specific characterization of extreme points of polyhedral sets, which is central in the theory of linear programming.

Proposition 1.6.6: Let P be a polyhedral set in <n.

(a) If P has the form

P = {x | a′jx ≤ bj , j = 1, . . . , r},

where aj and bj are given vectors and scalars, respectively, then a vector v ∈ P is an extreme point of P if and only if the set

Av = {aj | a′jv = bj , j = 1, . . . , r}

contains n linearly independent vectors.

(b) If P has the form

P = {x | Ax = b, x ≥ 0},

where A is a given m × n matrix and b is a given vector, then a vector v ∈ P is an extreme point of P if and only if the columns of A corresponding to the nonzero coordinates of v are linearly independent.

(c) If P has the form

P = {x | x = µ1v1 + · · · + µmvm, µ1 + · · · + µm = 1, µj ≥ 0, j = 1, . . . , m}, (1.39)

where vj are given vectors, then P is equal to the convex hull of its extreme points.


(d) (Fundamental Theorem of Linear Programming) Assume that P has at least one extreme point. Then if a linear function attains a minimum over P, it attains a minimum at some extreme point of P.

Proof: (a) If the set Av contains fewer than n linearly independent vectors, then the system of equations

a′jw = 0, ∀ aj ∈ Av

has a nonzero solution w. For sufficiently small γ > 0, we have v + γw ∈ P and v − γw ∈ P, thus showing that v is not an extreme point. Thus, if v is an extreme point, Av must contain n linearly independent vectors.

Conversely, suppose that Av contains a subset Āv consisting of n linearly independent vectors. Suppose that for some y ∈ P, z ∈ P, and α ∈ (0, 1), we have v = αy + (1 − α)z. Then for all aj ∈ Āv, we have

bj = a′jv = αa′jy + (1 − α)a′jz ≤ αbj + (1 − α)bj = bj .

Thus v, y, and z are all solutions of the system of n linearly independent equations

a′jw = bj , ∀ aj ∈ Āv.

Hence v = y = z, implying that v is an extreme point.

(b) Let k be the number of zero coordinates of v, and consider the matrix Ā, which is the same as A except that the columns corresponding to the zero coordinates of v are set to zero. We write P in the form

P = {x | Ax ≤ b, −Ax ≤ −b, −x ≤ 0},

and apply the result of part (a). We obtain that v is an extreme point if and only if Ā contains n − k linearly independent rows, which is equivalent to the n − k nonzero columns of A (corresponding to the nonzero coordinates of v) being linearly independent.

(c) We use induction on the dimension of the space. Suppose that all bounded polyhedra of (n − 1)-dimensional spaces are equal to the convex hull of their extreme points, but there is a bounded polyhedron P ⊂ <n of the form (1.39) and a vector x ∈ P which is not in the convex hull PE of the extreme points of P. Let x̄ be the projection of x on PE and let x̂ be a solution of the problem

maximize (x − x̄)′z
subject to z ∈ P.

The polyhedron

P̂ = P ∩ {z | (x − x̄)′z = (x − x̄)′x̂}

is equal to the convex hull of its extreme points by the induction hypothesis. One can show that PE ∩ P̂ = Ø, while, by Prop. 1.6.3(a), each of the extreme points of P̂ is also an extreme point of P and hence belongs to PE, arriving at a contradiction.

(d) Since P is polyhedral, it has a representation

P = {x | Ax ≥ b},

for some m × n matrix A and some b ∈ <m. If A had rank less than n, then its nullspace would contain some nonzero vector x̄, so P would contain a line parallel to x̄, contradicting the existence of an extreme point [cf. Prop. 1.6.3(b)]. Thus A has rank n and hence it must contain n linearly independent rows that constitute an n × n invertible submatrix Ā. If b̄ is the corresponding subvector of b, we see that every x ∈ P satisfies Āx ≥ b̄. The result then follows using Prop. 1.6.5. Q.E.D.
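Parts (a) and (d) suggest a brute-force method in low dimension (an added illustration; the polyhedron and cost are arbitrary choices): enumerate candidate vertices as intersections of n active constraints, keep the feasible ones, and compare the best vertex value with an LP solver:

import numpy as np
from itertools import combinations
from scipy.optimize import linprog

# P = {x in R^2 | a_j' x <= b_j}: the box [0,2] x [0,1] cut by x1 + x2 <= 2.5.
A = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, -1.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([0.0, 2.0, 0.0, 1.0, 2.5])

# Prop. 1.6.6(a) in R^2: v is a vertex iff v in P and two linearly
# independent constraints are active at v.
vertices = []
for i, j in combinations(range(len(b)), 2):
    Aij = A[[i, j]]
    if abs(np.linalg.det(Aij)) < 1e-12:
        continue                                  # not linearly independent
    v = np.linalg.solve(Aij, b[[i, j]])
    if np.all(A @ v <= b + 1e-9):                 # feasibility check
        vertices.append(v)
vertices = np.array(vertices)

# Fundamental Theorem of LP: the minimum of c'x over P is attained at a vertex.
c = np.array([-1.0, -1.0])
res = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2)
print(res.fun, np.min(vertices @ c))              # the two values should agree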

E X E R C I S E S

1.6.1

Let C1 and C2 be polyhedral sets. Show that C1 + C2 is polyhedral. Show also that if C1 and C2 are polyhedral cones, then C1 + C2 is a polyhedral cone.

1.6.2

Show that the image and the inverse image of a polyhedral set under a linear transformation are polyhedral.

1.6.3 (Gordan’s Theorem of the Alternative)

(a) Let A be an m × n matrix. Show that exactly one of the following two conditions holds:

(i) There is an x ∈ <n such that Ax < 0 (all components of Ax are strictly negative).

(ii) There is a µ ∈ <m such that A′µ = 0, µ 6= 0, and µ ≥ 0.


(b) Show that an alternative and equivalent statement of part (a) is the following: a polyhedral cone has nonempty interior if and only if its polar cone does not contain a line, i.e., a set of the form {αz | α ∈ <}, where z is a nonzero vector.

1.6.4

Let P be a polyhedral set represented as

P = {x | x = y + µ1v1 + · · · + µmvm, y ∈ C, µ1 + · · · + µm = 1, µj ≥ 0, j = 1, . . . , m},

where v1, . . . , vm are some vectors and C is a finitely generated cone (cf. Prop. 1.6.2). Show that the recession cone of P is equal to C.

1.7 SUBGRADIENTS

1.7.1 Directional Derivatives

Convex functions have interesting differentiability properties, which we discuss in this section. We first consider convex functions of a single variable.

Let I be an interval of real numbers, and let f : I 7→ < be convex. If x, y, z ∈ I and x < y < z, then we can show the relation

(f(y) − f(x))/(y − x) ≤ (f(z) − f(x))/(z − x) ≤ (f(z) − f(y))/(z − y), (1.40)

which is illustrated in Fig. 1.7.1. For a formal proof, note that, using the definition of a convex function [cf. Eq. (1.3)], we obtain

f(y) ≤ ((y − x)/(z − x)) f(z) + ((z − y)/(z − x)) f(x),

and either of the desired inequalities follows by appropriately rearranging terms.

Let a and b be the infimum and the supremum, respectively, of I, also referred to as the end points of I. For any x ∈ I, x 6= b, and for any α > 0 such that x + α ∈ I, we define

s+(x, α) = (f(x + α) − f(x))/α.


Figure 1.7.1. Illustration of the inequalities (1.40). The rate of change of the function f is nondecreasing with its argument.

Let 0 < α ≤ α′. We use the first inequality in Eq. (1.40) with y = x + α and z = x + α′ to obtain s+(x, α) ≤ s+(x, α′). Therefore, s+(x, α) is a nondecreasing function of α and, as α decreases to zero, it converges either to a finite number or to −∞. Let f+(x) be the value of the limit, which we call the right derivative of f at the point x. Similarly, if x ∈ I, x 6= a, α > 0, and x − α ∈ I, we define

s−(x, α) = (f(x) − f(x − α))/α,

which is, by a symmetrical argument, a nonincreasing function of α. Its limit as α decreases to zero, denoted by f−(x), is called the left derivative of f at the point x, and is either finite or equal to +∞.

In the case where the end points a and b belong to the domain I of f, we define for completeness f−(a) = −∞ and f+(b) = +∞.

Proposition 1.7.1: Let I ⊂ < be an interval with end points a and b, and let f : I 7→ < be a convex function.

(a) We have f−(y) ≤ f+(y) for every y ∈ I.

(b) If x belongs to the interior of I, then f+(x) and f−(x) are finite.

(c) If x, z ∈ I and x < z, then f+(x) ≤ f−(z).

(d) The functions f−, f+ : I 7→ [−∞, +∞] are nondecreasing.

(e) The function f+ (respectively, f−) is right- (respectively, left-) continuous at every interior point of I. Also, if a ∈ I (respectively, b ∈ I) and f is continuous at a (respectively, b), then f+ (respectively, f−) is right- (respectively, left-) continuous at a (respectively, b).


(f) If f is differentiable at a point x belonging to the interior of I, then f+(x) = f−(x) = (df/dx)(x).

(g) For any x, z ∈ I and any d satisfying f−(x) ≤ d ≤ f+(x), we have

f(z) ≥ f(x) + d(z − x).

(h) The function f+ : I 7→ (−∞, +∞] (respectively, f− : I 7→ [−∞, +∞)) is upper (respectively, lower) semicontinuous at every x ∈ I.

Proof: (a) If y is an end point of I, the result is trivial because f−(a) = −∞ and f+(b) = +∞. We assume that y is an interior point, we let α > 0, and use Eq. (1.40), with x = y − α and z = y + α, to obtain s−(y, α) ≤ s+(y, α). Taking the limit as α decreases to zero, we obtain f−(y) ≤ f+(y).

(b) Let x belong to the interior of I and let α > 0 be such that x − α ∈ I. Then f−(x) ≥ s−(x, α) > −∞. For similar reasons, we obtain f+(x) < ∞. Part (a) then implies that f−(x) < ∞ and f+(x) > −∞.

(c) We use Eq. (1.40), with y = (z + x)/2, to obtain s+(x, (z − x)/2) ≤ s−(z, (z − x)/2). The result then follows because f+(x) ≤ s+(x, (z − x)/2) and s−(z, (z − x)/2) ≤ f−(z).

(d) This follows by combining parts (a) and (c).

(e) Fix some x ∈ I, x 6= b, and some positive δ and α such that x + δ + α < b. We allow x to be equal to a, in which case f is assumed to be continuous at a. We have f+(x + δ) ≤ s+(x + δ, α). We take the limit, as δ decreases to zero, to obtain limδ↓0 f+(x + δ) ≤ s+(x, α). We have used here the fact that s+(x, α) is a continuous function of x, which is a consequence of the continuity of f (Prop. 1.2.12). We now let α decrease to zero to obtain limδ↓0 f+(x + δ) ≤ f+(x). The reverse inequality is also true because f+ is nondecreasing, and this proves the right-continuity of f+. The proof for f− is similar.

(f) This is immediate from the definition of f+ and f−.

(g) Fix some x, z ∈ I. The result is trivially true for x = z. We only consider the case x < z; the proof for the case x > z is similar. Since s+(x, α) is nondecreasing in α, we have (f(z) − f(x))/(z − x) ≥ s+(x, α) for α belonging to (0, z − x). Letting α decrease to zero, we obtain (f(z) − f(x))/(z − x) ≥ f+(x) ≥ d, and the result follows.

(h) This follows from parts (a), (d), (e), and the definition of semicontinuity (Definition 1.1.5). Q.E.D.
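Numerically, the monotone convergence of the difference quotients s+ and s− is easy to observe (an added illustration; the function and step sizes are arbitrary choices):

import numpy as np

# Convex f(x) = max(0, (x^2 - 1)/2); one-sided derivatives at the kink x = 1.
f = lambda x: max(0.0, 0.5 * (x * x - 1.0))

def one_sided(f, x, sign, alphas=(1e-1, 1e-2, 1e-3, 1e-4)):
    # sign = +1 gives s+(x, a), which decreases to f+(x) as a decreases to 0;
    # sign = -1 gives s-(x, a), which increases to f-(x).
    return [sign * (f(x + sign * a) - f(x)) / a for a in alphas]

print(one_sided(f, 1.0, +1))   # tends to f+(1) = 1
print(one_sided(f, 1.0, -1))   # tends to f-(1) = 0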

Given a function f : <n 7→ <, a point x ∈ <n, and a vector y ∈ <n, we say that f is directionally differentiable at x in the direction y if the limit

lim α↓0 (f(x + αy) − f(x))/α

exists, in which case it is denoted by f ′(x; y). We say that f is directionally differentiable at a point x if it is directionally differentiable at x in all directions. As in Section 1.1.4, we say that f is differentiable at x if it is directionally differentiable at x and f ′(x; y) is a linear function of y, denoted by

f ′(x; y) = ∇f(x)′y,

where ∇f(x) is the gradient of f at x.

where ∇f(x) is the gradient of f at x.An interesting property of convex functions is that they are direction-

ally differentiable at all points x. This is a consequence of the directionaldifferentiability of scalar convex functions, as can be seen from the relation

f ′(x; y) = limα↓0

f(x + αy)− f(x)

α= lim

α↓0

Fy(α) − Fy(0)

α= F+

y (0), (1.41)

where F+y (0) is the right derivative of the convex scalar function

Fy(α) = f(x + αy)

at α = 0. Note that the above calculation also shows that the left derivativeF−y (0) of Fy is equal to −f ′(x;−y) and, by using Prop. 1.7.1(a), we obtainF−y (0) ≤ F+

y (0), or equivalently,

−f ′(x;−y) ≤ f ′(x; y), ∀ y ∈ <n. (1.42)

Note also that for a convex function, the difference quotient (f(x + αy) − f(x))/α is a monotonically nondecreasing function of α, so an equivalent definition of the directional derivative is

f ′(x; y) = inf α>0 (f(x + αy) − f(x))/α.
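For instance (an added numerical aside, with arbitrary data), for f(x) = ‖x‖1 the quotient can be tabulated for decreasing α:

import numpy as np

# Directional derivative of the convex f(x) = ||x||_1 at x in direction y,
# approximated by the monotone difference quotient (f(x + a*y) - f(x))/a.
f = lambda x: np.sum(np.abs(x))
x = np.array([1.0, 0.0])
y = np.array([-1.0, 1.0])

quotients = [(f(x + a * y) - f(x)) / a for a in (2.0, 1.0, 0.5, 0.1, 1e-3)]
print(quotients)   # nonincreasing as a decreases; infimum f'(x; y) = -1 + 1 = 0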

The directional derivative can be used to provide a necessary and sufficient condition for optimality in the problem of minimizing a convex function f : <n 7→ < over a convex set X ⊂ <n. In particular, x∗ is a global minimum of f over X if and only if

f ′(x∗; x − x∗) ≥ 0, ∀ x ∈ X.

This follows from the definition (1.41) of the directional derivative, and from the fact that the difference quotient

(f(x∗ + α(x − x∗)) − f(x∗))/α

is a monotonically nondecreasing function of α.

The following proposition generalizes the upper semicontinuity property of right derivatives of scalar convex functions [Prop. 1.7.1(h)], and shows that if f is differentiable, then its gradient is continuous.

erty of right derivatives of scalar convex functions [Prop. 1.7.1(h)], andshows that if f is differentiable, then its gradient is continuous.

Proposition 1.7.2: Let f : <n 7→ < be convex, and let {fk} be a sequence of convex functions fk : <n 7→ < with the property that limk→∞ fk(xk) = f(x) for every x ∈ <n and every sequence {xk} that converges to x. Then for any x ∈ <n and y ∈ <n, and any sequences {xk} and {yk} converging to x and y, respectively, we have

lim supk→∞ f ′k(xk; yk) ≤ f ′(x; y). (1.43)

Furthermore, if f is differentiable over <n, then it is continuously differentiable over <n.

Proof: For any µ > f ′(x; y), there exists an ᾱ > 0 such that

(f(x + αy) − f(x))/α < µ, ∀ α ≤ ᾱ.

Hence, for α ≤ ᾱ, we have

(fk(xk + αyk) − fk(xk))/α < µ

for all sufficiently large k, and using Eq. (1.41), we obtain

lim supk→∞ f ′k(xk; yk) < µ.

Since this is true for all µ > f ′(x; y), inequality (1.43) follows.

If f is differentiable at all x ∈ <n, then using the continuity of f and the part of the proposition just proved, we have for every sequence {xk} converging to x and every y ∈ <n,

lim supk→∞ ∇f(xk)′y = lim supk→∞ f ′(xk; y) ≤ f ′(x; y) = ∇f(x)′y.

By replacing y with −y in the preceding argument, we obtain

−lim infk→∞ ∇f(xk)′y = lim supk→∞ (−∇f(xk)′y) ≤ −∇f(x)′y.

Therefore, we have ∇f(xk)′y → ∇f(x)′y for every y, which implies that ∇f(xk) → ∇f(x). Hence, the gradient is continuous. Q.E.D.


1.7.2 Subgradients and Subdifferentials

Given a convex function f : <n 7→ <, we say that a vector d ∈ <n is a subgradient of f at a point x ∈ <n if

f(z) ≥ f(x) + (z − x)′d, ∀ z ∈ <n. (1.44)

If instead f is a concave function, we say that d is a subgradient of f at x if −d is a subgradient of the convex function −f at x. The set of all subgradients of a convex (or concave) function f at x ∈ <n is called the subdifferential of f at x, and is denoted by ∂f(x). Figure 1.7.2 provides some examples of subdifferentials.

Figure 1.7.2. The subdifferential ∂f(x) of two scalar convex functions, f(x) = |x| and f(x) = max{0, (1/2)(x² − 1)}, as a function of the argument x.
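The defining inequality (1.44) is easy to test numerically. The sketch below (an added aside) verifies that at x = 0 every d ∈ [−1, 1] is a subgradient of f(x) = |x|, while d = 1.1 is not:

import numpy as np

# Subdifferential of f(x) = |x| at x = 0: the inequality (1.44) reads |z| >= d*z.
z = np.linspace(-5.0, 5.0, 1001)

for d in (-1.0, -0.3, 0.0, 0.7, 1.0):          # candidates in [-1, 1]
    assert np.all(np.abs(z) >= d * z)          # holds whenever |d| <= 1

print(np.all(np.abs(z) >= 1.1 * z))            # False: d = 1.1 is not a subgradient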

We next provide the relationship between the directional derivative and the subdifferential, and prove some basic properties of subgradients.


Proposition 1.7.3: Let f : <n 7→ < be convex. For every x ∈ <n, the following hold:

(a) A vector d is a subgradient of f at x if and only if

f′(x; y) ≥ y′d, ∀ y ∈ <n.

(b) The subdifferential ∂f(x) is a nonempty, convex, and compact set, and there holds

f′(x; y) = max_{d∈∂f(x)} y′d, ∀ y ∈ <n. (1.45)

In particular, f is differentiable at x with gradient ∇f(x) if and only if it has ∇f(x) as its unique subgradient at x. Furthermore, if X is a bounded set, the set ∪x∈X ∂f(x) is bounded.

(c) If a sequence {xk} converges to x and dk ∈ ∂f(xk) for all k, the sequence {dk} is bounded and each of its limit points is a subgradient of f at x.

(d) If f is equal to the sum f1 + · · · + fm of convex functions fj : <n 7→ <, j = 1, . . . , m, then ∂f(x) is equal to the vector sum ∂f1(x) + · · · + ∂fm(x).

(e) If f is equal to the composition of a convex function h : <m 7→ < and an m × n matrix A [f(x) = h(Ax)], then ∂f(x) is equal to A′∂h(Ax) = {A′g | g ∈ ∂h(Ax)}.

(f) x∗ minimizes f over a convex set X ⊂ <n if and only if there exists a subgradient d ∈ ∂f(x∗) such that

d′(x − x∗) ≥ 0, ∀ x ∈ X.

Equivalently, x∗ minimizes f over a convex set X ⊂ <n if and only if

0 ∈ ∂f(x∗) + TX(x∗)⊥.

Proof: (a) The subgradient inequality (1.44) is equivalent to

(f(x + αy) − f(x))/α ≥ y′d, ∀ y ∈ <n, α > 0.

Since the quotient on the left above decreases monotonically to f′(x; y) as α ↓ 0 [Eq. (1.40)], we conclude that the subgradient inequality (1.44) is equivalent to f′(x; y) ≥ y′d for all y ∈ <n. Therefore we obtain

d ∈ ∂f(x) ⇐⇒ f′(x; y) ≥ y′d, ∀ y ∈ <n. (1.46)


(b) From Eq. (1.46), we see that ∂f(x) is the intersection of the closed halfspaces {d | y′d ≤ f′(x; y)}, where y ranges over the nonzero vectors of <n. It follows that ∂f(x) is closed and convex. It is also bounded, since otherwise, for some y ∈ <n, y′d could be made unbounded by proper choice of d ∈ ∂f(x), contradicting Eq. (1.46). Since ∂f(x) is both closed and bounded, it is compact.

To show that ∂f(x) is nonempty and that Eq. (1.45) holds, we first observe that Eq. (1.46) implies that f′(x; y) ≥ max_{d∈∂f(x)} y′d [where the maximum is −∞ if ∂f(x) is empty]. To show the reverse inequality, take any x and y in <n, and consider the subset of <n+1

C1 = {(µ, z) | µ > f(z)},

and the half-line

C2 = {(µ, z) | µ = f(x) + αf′(x; y), z = x + αy, α ≥ 0};

see Fig. 1.7.3. Using the definition of directional derivative and the convexity of f, it follows that these two sets are nonempty, convex, and disjoint. By applying the Separating Hyperplane Theorem (Prop. 1.4.2), we see that there exists a nonzero vector (γ, w) ∈ <n+1 such that

γµ + w′z ≥ γ(f(x) + αf′(x; y)) + w′(x + αy), ∀ α ≥ 0, z ∈ <n, µ > f(z). (1.47)

We cannot have γ < 0, since then the left-hand side above could be made arbitrarily small by choosing µ sufficiently large. Also if γ = 0, then Eq. (1.47) implies that w = 0, which is a contradiction. Therefore, γ > 0 and, by dividing with γ in Eq. (1.47), we obtain

µ + (z − x)′(w/γ) ≥ f(x) + αf′(x; y) + αy′(w/γ), ∀ α ≥ 0, z ∈ <n, µ > f(z). (1.48)

By setting α = 0 in the above relation and by taking the limit as µ ↓ f(z), we obtain f(z) ≥ f(x) + (z − x)′(−w/γ) for all z ∈ <n, implying that (−w/γ) ∈ ∂f(x). By setting z = x and α = 1 in Eq. (1.48), and by taking the limit as µ ↓ f(x), we obtain y′(−w/γ) ≥ f′(x; y), which implies that max_{d∈∂f(x)} y′d ≥ f′(x; y). The proof of Eq. (1.45) is complete.

From the definition of directional derivative, we see that f is differentiable at x with gradient ∇f(x) if and only if the directional derivative f′(x; y) is a linear function of the form f′(x; y) = ∇f(x)′y. Thus, from Eq. (1.45), f is differentiable at x with gradient ∇f(x) if and only if it has ∇f(x) as its unique subgradient at x.

Finally, let X be a bounded set. To show that ∪x∈X ∂f(x) is bounded, we assume the contrary, i.e., that there exists a sequence {xk} ⊂ X, and a sequence {dk} with dk ∈ ∂f(xk) for all k and ‖dk‖ → ∞. Without loss of generality, we assume that dk ≠ 0 for all k, and we denote yk = dk/‖dk‖.


[Figure 1.7.3. Illustration of the sets C1 and C2 used in the hyperplane separation argument of the proof of Prop. 1.7.3(b).]

Since both {xk} and {yk} are bounded, they must contain convergent subsequences. We assume without loss of generality that xk converges to some x and yk converges to some y with ‖y‖ = 1. By Eq. (1.45), we have

f′(xk; yk) ≥ d′k yk = ‖dk‖,

so it follows that f′(xk; yk) → ∞. This contradicts, however, Eq. (1.43), which requires that lim sup_{k→∞} f′(xk; yk) ≤ f′(x; y).

(c) By part (b), the sequence {dk} is bounded, and by part (a), we have

y′dk ≤ f′(xk; y), ∀ y ∈ <n.

If d is a limit point of {dk}, we have, by taking the limit in the above relation and by using Prop. 1.7.2,

y′d ≤ lim sup_{k→∞} f′(xk; y) ≤ f′(x; y), ∀ y ∈ <n.

Therefore, by part (a), we have d ∈ ∂f(x).

(d) It will suffice to prove the result for the case where f = f1 + f2. If d1 ∈ ∂f1(x) and d2 ∈ ∂f2(x), then from the subgradient inequality (1.44), we have

f1(z) ≥ f1(x) + (z − x)′d1, ∀ z ∈ <n,

f2(z) ≥ f2(x) + (z − x)′d2, ∀ z ∈ <n,

so by adding, we obtain

f(z) ≥ f(x) + (z − x)′(d1 + d2), ∀ z ∈ <n.


Hence d1 + d2 ∈ ∂f(x), implying that ∂f1(x) + ∂f2(x) ⊂ ∂f(x).

To prove the reverse inclusion, suppose, to arrive at a contradiction, that there exists a d ∈ ∂f(x) such that d ∉ ∂f1(x) + ∂f2(x). Since by part (b) the sets ∂f1(x) and ∂f2(x) are compact, the set ∂f1(x) + ∂f2(x) is compact (cf. Prop. 1.2.16), and by Prop. 1.4.3, there exists a hyperplane strictly separating {d} from ∂f1(x) + ∂f2(x), i.e., a vector y and a scalar b such that

y′(d1 + d2) < b < y′d, ∀ d1 ∈ ∂f1(x), d2 ∈ ∂f2(x).

From this we obtain

max_{d1∈∂f1(x)} y′d1 + max_{d2∈∂f2(x)} y′d2 < y′d,

or, using part (b),

f′1(x; y) + f′2(x; y) < y′d.

By the definition of directional derivative, f′1(x; y) + f′2(x; y) = f′(x; y), so we have

f′(x; y) < y′d,

which is a contradiction in view of part (a).

(e) It is seen using the definition of directional derivative that

f′(x; y) = h′(Ax; Ay), ∀ y ∈ <n.

Let g ∈ ∂h(Ax) and d = A′g. Then by part (a), we have

g′z ≤ h′(Ax; z), ∀ z ∈ <m,

and in particular,

g′Ay ≤ h′(Ax; Ay), ∀ y ∈ <n,

or

(A′g)′y ≤ f′(x; y), ∀ y ∈ <n.

Hence, by part (a), we have A′g ∈ ∂f(x), so that A′∂h(Ax) ⊂ ∂f(x).

To prove the reverse inclusion, suppose, to arrive at a contradiction, that there exists a d ∈ ∂f(x) such that d ∉ A′∂h(Ax). Since by part (b) the set ∂h(Ax) is compact, the set A′∂h(Ax) is also compact [cf. Prop. 1.1.9(d)], and by Prop. 1.4.3, there exists a hyperplane strictly separating {d} from A′∂h(Ax), i.e., a vector y and a scalar b such that

y′(A′g) < b < y′d, ∀ g ∈ ∂h(Ax).


From this we obtain

max_{g∈∂h(Ax)} (Ay)′g < y′d,

or, using part (b),

h′(Ax; Ay) < y′d.

Since h′(Ax; Ay) = f′(x; y), it follows that

f′(x; y) < y′d,

which is a contradiction in view of part (a).

(f) Suppose that for some d ∈ ∂f(x∗) and all x ∈ X, we have d′(x − x∗) ≥ 0. Then, since from the definition of a subgradient we have f(x) − f(x∗) ≥ d′(x − x∗) for all x ∈ X, we obtain f(x) − f(x∗) ≥ 0 for all x ∈ X, so x∗ minimizes f over X. Conversely, suppose that x∗ minimizes f over X. Consider the set of feasible directions of X at x∗,

FX(x∗) = {w ≠ 0 | x∗ + αw ∈ X for some α > 0},

and the cone

W = −FX(x∗)⊥ = {d | d′w ≥ 0, ∀ w ∈ FX(x∗)}.

If ∂f(x∗) and W have a point in common, we are done, so to arrive at a contradiction, assume the opposite, i.e., ∂f(x∗) ∩ W = Ø. Since ∂f(x∗) is compact and W is closed, by Prop. 1.4.3 there exists a hyperplane strictly separating ∂f(x∗) and W, i.e., a vector y and a scalar c such that

g′y < c < d′y, ∀ g ∈ ∂f(x∗), ∀ d ∈ W.

Using the fact that W is a closed cone, it follows that

c < 0 ≤ d′y, ∀ d ∈ W, (1.49)

which, when combined with the preceding inequality, also yields

max_{g∈∂f(x∗)} g′y < c < 0.

Thus, using part (b), we have f′(x∗; y) < 0, while from Eq. (1.49), we see that y belongs to the polar cone of FX(x∗)⊥, which, by the Polar Cone Theorem (Prop. 1.5.1), implies that y is in the closure of the set of feasible directions FX(x∗). Hence, for a sequence {yk} of feasible directions converging to y, we have f′(x∗; yk) < 0 for all sufficiently large k [by Prop. 1.7.2], which contradicts the optimality of x∗.

The last statement follows from the convexity of X, which implies that TX(x∗)⊥ is the set of all z such that z′(x − x∗) ≤ 0 for all x ∈ X (cf. Props. 1.5.3 and 1.5.5). Q.E.D.


Note that the last part of the above proposition generalizes the optimality condition of Prop. 1.6.2 for the case where f is convex and smooth:

∇f(x∗)′(x − x∗) ≥ 0, ∀ x ∈ X.

In the special case where X = <n, we obtain a basic necessary and sufficient condition for unconstrained optimality of x∗:

0 ∈ ∂f(x∗).

This optimality condition is also evident from the subgradient inequality (1.44).
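As an illustration, let f(x) = |x| + x² for x ∈ <. By Prop. 1.7.3(d) and the first example of Fig. 1.7.2, ∂f(0) = [−1, 1] + {0} = [−1, 1], which contains 0, so x∗ = 0 is the global minimum of f over <, even though f is not differentiable at x∗.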

We close this section with a version of the chain rule for directional derivatives and subgradients.

Proposition 1.7.4: (Chain Rule) Let f : <n 7→ < be a convex function and let g : < 7→ < be a smooth scalar function. Then the function F defined by

F(x) = g(f(x))

is directionally differentiable at all x, and its directional derivative is given by

F′(x; y) = ∇g(f(x)) f′(x; y), ∀ x ∈ <n, y ∈ <n. (1.50)

Furthermore, if g is convex and monotonically nondecreasing, then F is convex and its subdifferential is given by

∂F(x) = ∇g(f(x)) ∂f(x), ∀ x ∈ <n. (1.51)

Proof: We have

F′(x; y) = lim_{α↓0} (F(x + αy) − F(x))/α = lim_{α↓0} (g(f(x + αy)) − g(f(x)))/α. (1.52)

From the convexity of f it follows that there are three possibilities: (1) for some ᾱ > 0, f(x + αy) = f(x) for all α ∈ (0, ᾱ]; (2) for some ᾱ > 0, f(x + αy) > f(x) for all α ∈ (0, ᾱ]; (3) for some ᾱ > 0, f(x + αy) < f(x) for all α ∈ (0, ᾱ].

In case (1), from Eq. (1.52), we have F′(x; y) = f′(x; y) = 0 and the given formula (1.50) holds. In case (2), from Eq. (1.52), we have for all α ∈ (0, ᾱ]

F′(x; y) = lim_{α↓0} [(f(x + αy) − f(x))/α] · [(g(f(x + αy)) − g(f(x)))/(f(x + αy) − f(x))].


As α ↓ 0, we have f(x + αy) → f(x), so the preceding equation yields F′(x; y) = ∇g(f(x)) f′(x; y). The proof for case (3) is similar.

Assume now that g is convex and monotonically nondecreasing. Then F is convex since for any x1, x2 ∈ <n and α ∈ [0, 1], we have

αF(x1) + (1 − α)F(x2) = αg(f(x1)) + (1 − α)g(f(x2))
≥ g(αf(x1) + (1 − α)f(x2))
≥ g(f(αx1 + (1 − α)x2))
= F(αx1 + (1 − α)x2),

where for the first inequality we use the convexity of g, and for the second inequality we use the convexity of f and the monotonicity of g.

To obtain the formula for the subdifferential of F, we note that by Prop. 1.7.3(a), d ∈ ∂F(x) if and only if y′d ≤ F′(x; y) for all y ∈ <n, or equivalently (from what has already been shown)

y′d ≤ ∇g(f(x)) f′(x; y), ∀ y ∈ <n.

If ∇g(f(x)) = 0, this relation yields d = 0, so ∂F(x) = {0} and the desired formula (1.51) holds. If ∇g(f(x)) ≠ 0, we have ∇g(f(x)) > 0 by the monotonicity of g, so we obtain

y′d/∇g(f(x)) ≤ f′(x; y), ∀ y ∈ <n,

which, by Prop. 1.7.3(a), is equivalent to d/∇g(f(x)) ∈ ∂f(x). Thus we have shown that d ∈ ∂F(x) if and only if d/∇g(f(x)) ∈ ∂f(x), which proves the desired formula (1.51). Q.E.D.
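As an illustration of Prop. 1.7.4, let F(x) = e^{f(x)}, where f : <n 7→ < is convex. Then g(t) = e^t is smooth, convex, and monotonically nondecreasing, so F is convex, and Eq. (1.51) gives ∂F(x) = e^{f(x)} ∂f(x) for all x ∈ <n.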

1.7.3 ε-Subgradients

We now consider a notion of approximate subgradient. Given a convex function f : <n 7→ < and a scalar ε > 0, we say that a vector d ∈ <n is an ε-subgradient of f at a point x ∈ <n if

f(z) ≥ f(x) + (z − x)′d − ε, ∀ z ∈ <n. (1.53)

If instead f is a concave function, we say that d is an ε-subgradient of f at x if −d is an ε-subgradient of the convex function −f at x. The set of all ε-subgradients of a convex (or concave) function f at x ∈ <n is called the ε-subdifferential of f at x, and is denoted by ∂εf(x).
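For example, for f(x) = |x| and a point x > 0, it can be verified that ∂εf(x) = [max{−1, 1 − ε/x}, 1], whereas ∂f(x) = {1} for all x > 0. Thus, in contrast with the subdifferential, the ε-subdifferential captures the behavior of f away from x (here, the kink at 0) once ε is comparable to x.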

For our purposes, ε-subgradients are important because they find several applications in nondifferentiable optimization. Some of their important properties are given in the following proposition.


Proposition 1.7.5: Let f : <n 7→ < be convex and let ε be a positive scalar. For every x ∈ <n, the following hold:

(a) The ε-subdifferential ∂εf(x) is a nonempty, convex, and compact set, and for all y ∈ <n there holds

inf_{α>0} (f(x + αy) − f(x) + ε)/α = max_{d∈∂εf(x)} y′d.

(b) We have 0 ∈ ∂εf(x) if and only if

f(x) ≤ inf_{z∈<n} f(z) + ε.

(c) If a direction y is such that y′d < 0 for all d ∈ ∂εf(x), then

inf_{α>0} f(x + αy) < f(x) − ε.

(d) If 0 ∉ ∂εf(x), then the direction y = −d̄, where

d̄ = arg min_{d∈∂εf(x)} ‖d‖,

satisfies y′d < 0 for all d ∈ ∂εf(x).

(e) If f is equal to the sum f1 + · · · + fm of convex functions fj : <n 7→ <, j = 1, . . . , m, then

∂εf(x) ⊂ ∂εf1(x) + · · · + ∂εfm(x) ⊂ ∂mεf(x).

Proof: (a) We have

d ∈ ∂εf(x) ⇐⇒ f(x + αy) ≥ f(x) + αy′d − ε, ∀ α > 0, y ∈ <n. (1.54)

Hence

d ∈ ∂εf(x) ⇐⇒ inf_{α>0} (f(x + αy) − f(x) + ε)/α ≥ y′d, ∀ y ∈ <n. (1.55)

It follows that ∂εf(x) is the intersection of the closed halfspaces

{d | inf_{α>0} (f(x + αy) − f(x) + ε)/α ≥ y′d}


as y ranges over <n. Hence ∂εf(x) is closed and convex. To show that ∂εf(x) is also bounded, suppose, to arrive at a contradiction, that there is a sequence {dk} ⊂ ∂εf(x) with ‖dk‖ → ∞. Let yk = dk/‖dk‖. Then, from Eq. (1.54), we have for α = 1

f(x + yk) ≥ f(x) + ‖dk‖ − ε, ∀ k,

so it follows that f(x + yk) → ∞. This is a contradiction since f is convex and hence continuous, so it is bounded on any bounded set. Thus ∂εf(x) is bounded.

To show that ∂εf(x) is nonempty and satisfies

inf_{α>0} (f(x + αy) − f(x) + ε)/α = max_{d∈∂εf(x)} y′d, ∀ y ∈ <n,

we argue similarly to the proof of Prop. 1.7.3(b). Consider the subset of <n+1

C1 = {(µ, z) | µ > f(z)},

and the half-line

C2 = {(µ, z) | µ = f(x) − ε + β inf_{α>0} (f(x + αy) − f(x) + ε)/α, z = x + βy, β ≥ 0}.

These sets are nonempty and convex. They are also disjoint, since we have for all (µ, z) ∈ C2

µ = f(x) − ε + β inf_{α>0} (f(x + αy) − f(x) + ε)/α
≤ f(x) − ε + β · (f(x + βy) − f(x) + ε)/β
= f(x + βy)
= f(z).

Hence there exists a hyperplane separating them, i.e., a (γ, w) ≠ (0, 0) such that for all β ≥ 0, z ∈ <n, and µ > f(z),

γµ + w′z ≤ γ(f(x) − ε + β inf_{α>0} (f(x + αy) − f(x) + ε)/α) + w′(x + βy). (1.56)

We cannot have γ > 0, since then the left-hand side above could be made arbitrarily large by choosing µ to be sufficiently large. Also, if γ = 0, Eq. (1.56) implies that w = 0, contradicting the fact that (γ, w) ≠ (0, 0). Therefore, γ < 0 and, after dividing Eq. (1.56) by γ, we obtain for all β ≥ 0, z ∈ <n, and µ > f(z),

µ + (w/γ)′(z − x) ≥ f(x) − ε + β inf_{α>0} (f(x + αy) − f(x) + ε)/α + β(w/γ)′y. (1.57)


Taking the limit above as µ ↓ f(z) and setting β = 0, we obtain

f(z) ≥ f(x) − ε + (−w/γ)′(z − x), ∀ z ∈ <n.

Hence −w/γ belongs to ∂εf(x). Also, by taking z = x in Eq. (1.57), by letting µ ↓ f(x), and by dividing with β, we obtain

(−w/γ)′y ≥ −ε/β + inf_{α>0} (f(x + αy) − f(x) + ε)/α.

Since β can be chosen as large as desired, we see that

(−w/γ)′y ≥ inf_{α>0} (f(x + αy) − f(x) + ε)/α.

Combining this relation with Eq. (1.55), we obtain

max_{d∈∂εf(x)} d′y = inf_{α>0} (f(x + αy) − f(x) + ε)/α.

(b) By definition, 0 ∈ ∂εf(x) if and only if f(z) ≥ f(x) − ε for all z ∈ <n, which is equivalent to inf_{z∈<n} f(z) ≥ f(x) − ε.

(c) Assume that a direction y is such that

max_{d∈∂εf(x)} y′d < 0, (1.58)

while inf_{α>0} f(x + αy) ≥ f(x) − ε. Then f(x + αy) − f(x) ≥ −ε for all α > 0, or equivalently

(f(x + αy) − f(x) + ε)/α ≥ 0, ∀ α > 0.

Consequently, using part (a), we have

max_{d∈∂εf(x)} y′d = inf_{α>0} (f(x + αy) − f(x) + ε)/α ≥ 0,

which contradicts Eq. (1.58).

(d) The vector d̄ is the projection of the zero vector on the convex and compact set ∂εf(x). If 0 ∉ ∂εf(x), we have ‖d̄‖ > 0, while by Prop. 1.3.3,

(d − d̄)′d̄ ≥ 0, ∀ d ∈ ∂εf(x).

Hence

d′d̄ ≥ ‖d̄‖² > 0, ∀ d ∈ ∂εf(x).


(e) It will suffice to prove the result for the case where f = f1 + f2. If d1 ∈ ∂εf1(x) and d2 ∈ ∂εf2(x), then from Eq. (1.54), we have

f1(x + αy) ≥ f1(x) + αy′d1 − ε, ∀ α > 0, y ∈ <n,

f2(x + αy) ≥ f2(x) + αy′d2 − ε, ∀ α > 0, y ∈ <n,

so by adding, we obtain

f(x + αy) ≥ f(x) + αy′(d1 + d2) − 2ε, ∀ α > 0, y ∈ <n.

Hence from Eq. (1.54), we have d1 + d2 ∈ ∂2εf(x), implying that ∂εf1(x) + ∂εf2(x) ⊂ ∂2εf(x).

To prove that ∂εf(x) ⊂ ∂εf1(x) + ∂εf2(x), we use an argument similar to the one of the proof of Prop. 1.7.3(d). Suppose, to arrive at a contradiction, that there exists a d ∈ ∂εf(x) such that d ∉ ∂εf1(x) + ∂εf2(x). Since by part (a) the sets ∂εf1(x) and ∂εf2(x) are compact, the set ∂εf1(x) + ∂εf2(x) is compact (cf. Prop. 1.2.16), and by Prop. 1.4.3, there exists a hyperplane strictly separating {d} from ∂εf1(x) + ∂εf2(x), i.e., a vector y and a scalar b such that

y′(d1 + d2) < b < y′d, ∀ d1 ∈ ∂εf1(x), d2 ∈ ∂εf2(x).

From this we obtain

max_{d1∈∂εf1(x)} y′d1 + max_{d2∈∂εf2(x)} y′d2 < y′d,

or, using part (a),

inf_{α>0} (f1(x + αy) − f1(x) + ε)/α + inf_{α>0} (f2(x + αy) − f2(x) + ε)/α < y′d.

Let ᾱj, j = 1, 2, be positive scalars such that

Σ_{j=1}^{2} (fj(x + ᾱj y) − fj(x) + ε)/ᾱj < y′d. (1.59)

Define

ᾱ = 1 / (Σ_{j=1}^{2} 1/ᾱj).

As a consequence of the convexity of fj, the ratio (fj(x + αy) − fj(x))/α is monotonically nondecreasing in α. Thus, since ᾱj ≥ ᾱ, we have

(fj(x + ᾱj y) − fj(x))/ᾱj ≥ (fj(x + ᾱy) − fj(x))/ᾱ, j = 1, 2,


and Eq. (1.59) together with the definition of ᾱ yields

y′d > Σ_{j=1}^{2} (fj(x + ᾱj y) − fj(x) + ε)/ᾱj
≥ ε/ᾱ + Σ_{j=1}^{2} (fj(x + ᾱy) − fj(x))/ᾱ
≥ inf_{α>0} (f(x + αy) − f(x) + ε)/α.

This contradicts Eq. (1.55), thus implying that ∂εf(x) ⊂ ∂εf1(x) + ∂εf2(x). Q.E.D.

Parts (b)-(d) of Prop. 1.7.5 contain the elements of an important algorithm (called the ε-descent algorithm, introduced by Bertsekas and Mitter [BeM71], [BeM73]) for minimizing convex functions to within a tolerance of ε. At a given point x, we check whether 0 ∈ ∂εf(x). If this is so, then by Prop. 1.7.5(b), x is an ε-optimal solution. If not, then by going along the direction opposite to the vector of minimum norm in ∂εf(x), we are guaranteed a cost improvement of at least ε. An implementation of this algorithm, due to Lemarechal [Lem74], [Lem75], is given in the exercises.
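The following is a minimal one-dimensional sketch of this scheme in Python (the function names, the sampling grids, and the crude argmin step-size search are our own choices, not part of the algorithm as published). It approximates the endpoints of ∂εf(x) through the quotient characterization of Prop. 1.7.5(a), applies the ε-optimality test of part (b), and otherwise moves opposite to the minimum-norm ε-subgradient as in parts (c) and (d):

    import numpy as np

    def eps_subdiff_interval(f, x, eps, deltas):
        # In one dimension, by Prop. 1.7.5(a), the right endpoint of the
        # eps-subdifferential is inf over d > 0 of (f(x+d) - f(x) + eps)/d,
        # and the left endpoint is the sup of the same quotient over d < 0;
        # sampling over a finite grid only approximates these endpoints.
        q = lambda d: (f(x + d) - f(x) + eps) / d
        hi = min(q(d) for d in deltas)
        lo = max(q(-d) for d in deltas)
        return lo, hi

    def eps_descent(f, x, eps=0.01, max_iter=100):
        deltas = np.logspace(-6, 3, 200)      # grid of positive step sizes
        for _ in range(max_iter):
            lo, hi = eps_subdiff_interval(f, x, eps, deltas)
            if lo <= 0.0 <= hi:               # 0 in the eps-subdifferential:
                return x                      # x is eps-optimal by part (b)
            d_bar = lo if lo > 0 else hi      # min-norm element of [lo, hi]
            # By parts (c)-(d), some step along -d_bar improves f by at
            # least eps; locate a good step size by crude sampling.
            alphas = np.logspace(-6, 6, 400)
            vals = [f(x - a * d_bar) for a in alphas]
            x = x - alphas[int(np.argmin(vals))] * d_bar
        return x

    f = lambda t: abs(t) + (t - 2.0) ** 2     # convex, minimized at t = 3/2
    print(eps_descent(f, x=10.0))             # a point with cost within eps
                                              # of the minimal cost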

1.7.4 Subgradients of Extended Real-Valued Functions

In this work the major emphasis is on real-valued convex functions f : <n 7→ <, which are defined over the entire space <n and are convex over <n. There are, however, important cases, prominently arising in the context of duality, where we must deal with functions g : D 7→ < that are defined over a convex subset D of <n, and are convex over D. This type of function may also be specified as the extended real-valued function f : <n 7→ (−∞,∞] given by

f(x) = { g(x) if x ∈ D, ∞ otherwise,

with D referred to as the effective domain of f.

The notion of a subdifferential and a subgradient of such a function can be developed along the lines of the present section. In particular, given a convex function f : <n 7→ (−∞,∞], a vector d is a subgradient of f at a vector x such that f(x) < ∞ if the subgradient inequality holds, i.e.,

f(z) ≥ f(x) + (z − x)′d, ∀ z ∈ <n.

If g : D 7→ < is a concave function (i.e., −g is a convex function over the convex set D), it can also be represented as the extended real-valued function f : <n 7→ [−∞,∞), where

f(x) = { g(x) if x ∈ D, −∞ otherwise.


As earlier, we say that d is a subgradient of f at an x ∈ D if −d is a subgradient of the convex function −g at x.

The subdifferential ∂f(x) is the set of all subgradients of the convex (or concave) function f. By convention, ∂f(x) is considered empty for all x with f(x) = ∞. Note that, contrary to the case of real-valued functions, ∂f(x) may be empty, or closed but unbounded. For example, the extended real-valued convex function given by

f(x) = { −√x if 0 ≤ x ≤ 1, ∞ otherwise,

has the subdifferential

∂f(x) = { −1/(2√x) if 0 < x < 1, [−1/2, ∞) if x = 1, Ø if x ≤ 0 or 1 < x.

Thus, ∂f(x) can be empty and can be unbounded at points x that belong to the effective domain of f (as in the cases x = 0 and x = 1, respectively, of the above example). However, it can be shown that ∂f(x) is nonempty and compact at points x that are interior points of the effective domain of f, as also illustrated by the above example.

Similarly, a vector d is an ε-subgradient of f at a vector x such that f(x) < ∞ if

f(z) ≥ f(x) + (z − x)′d − ε, ∀ z ∈ <n.

The ε-subdifferential ∂εf(x) is the set of all ε-subgradients of f. Figure 1.7.4 illustrates the definition of ∂εf(x) for the case of a one-dimensional function f. The figure illustrates that even when f is extended real-valued, the ε-subdifferential ∂εf(x) is nonempty at all points of the effective domain D. This can be shown for multi-dimensional functions f as well.

One can provide generalized versions of the results of Props. 1.7.3-1.7.5 within the context of extended real-valued convex functions, but with appropriate adjustments and additional assumptions to deal with cases where ∂f(x) may be empty or noncompact. Since we will not need these generalizations in this book, we will not state and prove them. The reader will find a detailed account in the book by Rockafellar [Roc70].

1.7.5 Directional Derivative of the Max Function

As mentioned in Section 1.3.4, the max function f(x) = max_{z∈Z} φ(x, z) arises in a variety of interesting contexts, including duality. It is therefore important to characterize this function, and the following proposition gives the directional derivative and the subdifferential of f for the case where φ(·, z) is convex for all z ∈ Z.


[Figure 1.7.4. Illustration of the ε-subdifferential ∂εf(x) of a one-dimensional function

f(x) = { g(x) if x ∈ D, ∞ otherwise,

where g is convex and the effective domain D is an interval. It corresponds to the set of slopes indicated in the figure. Note that ∂εf(x) is nonempty at all x ∈ D. Its left endpoint is

f−ε(x) = { sup_{δ<0, x+δ∈D} (f(x + δ) − f(x) + ε)/δ if inf D < x, −∞ if inf D = x,

and its right endpoint is

f+ε(x) = { inf_{δ>0, x+δ∈D} (f(x + δ) − f(x) + ε)/δ if x < sup D, ∞ if x = sup D.

Note that these endpoints can be −∞ (as in the figure on the right) or ∞. For ε = 0, the above formulas also give the endpoints of the subdifferential ∂f(x). Note that while ∂f(x) is nonempty for all x in the interior of D, it may be empty for x at the boundary of D (as in the figure on the right).]

Proposition 1.7.6: (Danskin's Theorem) Let Z ⊂ <m be a compact set, and let φ : <n × Z 7→ < be continuous and such that φ(·, z) : <n 7→ < is convex for each z ∈ Z.

(a) The function f : <n 7→ < given by

f(x) = max_{z∈Z} φ(x, z) (1.60)

is convex and has directional derivative given by

f′(x; y) = max_{z∈Z(x)} φ′(x, z; y), (1.61)


where φ′(x, z; y) is the directional derivative of the function φ(·, z) at x in the direction y, and Z(x) is the set of maximizing points in Eq. (1.60):

Z(x) = {z̄ | φ(x, z̄) = max_{z∈Z} φ(x, z)}.

In particular, if Z(x) consists of a unique point z̄ and φ(·, z̄) is differentiable at x, then f is differentiable at x, and ∇f(x) = ∇xφ(x, z̄), where ∇xφ(x, z̄) is the vector with coordinates

∂φ(x, z̄)/∂xi, i = 1, . . . , n.

(b) If φ(·, z) is differentiable for all z ∈ Z and ∇xφ(x, ·) is continuous on Z for each x, then

∂f(x) = conv{∇xφ(x, z) | z ∈ Z(x)}, ∀ x ∈ <n.

(c) The conclusion of part (a) also holds if, instead of assuming that Z is compact, we assume that Z(x) is nonempty for all x ∈ <n, and that φ and Z are such that for every convergent sequence {xk}, there exists a bounded sequence {zk} with zk ∈ Z(xk) for all k.

Proof: (a) The convexity of f has been established in Prop. 1.2.3(c). We note that since φ is continuous and Z is compact, the set Z(x) is nonempty by Weierstrass' Theorem (Prop. 1.3.1) and f takes real values. For any z ∈ Z(x), y ∈ <n, and α > 0, we use the definition of f to obtain

(f(x + αy) − f(x))/α ≥ (φ(x + αy, z) − φ(x, z))/α.

Taking the limit as α decreases to zero, we obtain f′(x; y) ≥ φ′(x, z; y). Since this is true for every z ∈ Z(x), we conclude that

f′(x; y) ≥ sup_{z∈Z(x)} φ′(x, z; y), ∀ y ∈ <n. (1.62)

To prove the reverse inequality and that the supremum in the right-hand side of the above inequality is attained, consider a sequence {αk} with αk ↓ 0 and let xk = x + αk y. For each k, let zk be a vector in Z(xk). Since {zk} belongs to the compact set Z, it has a subsequence converging to some z̄ ∈ Z. Without loss of generality, we assume that the entire sequence {zk} converges to z̄. We have

φ(xk, zk) ≥ φ(xk, z), ∀ z ∈ Z,


so by taking the limit as k → ∞ and by using the continuity of φ, we obtain

φ(x, z̄) ≥ φ(x, z), ∀ z ∈ Z.

Therefore, z̄ ∈ Z(x). We now have

f′(x; y) ≤ (f(x + αk y) − f(x))/αk
= (φ(x + αk y, zk) − φ(x, z̄))/αk
≤ (φ(x + αk y, zk) − φ(x, zk))/αk
≤ −φ′(x + αk y, zk; −y)
≤ φ′(x + αk y, zk; y), (1.63)

where the last inequality follows from inequality (1.42). We apply Prop. 1.7.2 to the functions fk defined by fk(·) = φ(·, zk), and with xk = x + αk y, to obtain

lim sup_{k→∞} φ′(x + αk y, zk; y) ≤ φ′(x, z̄; y). (1.64)

We take the limit in inequality (1.63) as k → ∞, and we use inequality (1.64) to conclude that

f′(x; y) ≤ φ′(x, z̄; y).

This relation together with inequality (1.62) proves Eq. (1.61).

For the last statement of part (a), if Z(x) consists of the unique point z̄, Eq. (1.61) and the differentiability assumption on φ yield

f′(x; y) = φ′(x, z̄; y) = y′∇xφ(x, z̄), ∀ y ∈ <n,

which implies that ∇f(x) = ∇xφ(x, z̄).

(b) By part (a), we have

f′(x; y) = max_{z∈Z(x)} ∇xφ(x, z)′y,

while by Prop. 1.7.3, we have

f′(x; y) = max_{d∈∂f(x)} d′y.

For all z ∈ Z(x) and y ∈ <n, we have

f(y) = max_{z̄∈Z} φ(y, z̄)
≥ φ(y, z)
≥ φ(x, z) + ∇xφ(x, z)′(y − x)
= f(x) + ∇xφ(x, z)′(y − x).


Therefore, ∇xφ(x, z) is a subgradient of f at x, implying that

conv{∇xφ(x, z) | z ∈ Z(x)} ⊂ ∂f(x).

To prove the reverse inclusion, we use a hyperplane separation argument. By the continuity of φ(x, ·) and the compactness of Z, we see that Z(x) is compact, so by the continuity of ∇xφ(x, ·), the set {∇xφ(x, z) | z ∈ Z(x)} is compact. By Prop. 1.2.8, it follows that conv{∇xφ(x, z) | z ∈ Z(x)} is compact. If d ∈ ∂f(x) while d ∉ conv{∇xφ(x, z) | z ∈ Z(x)}, by the Strict Separation Theorem (Prop. 1.4.3), there exist y ≠ 0 and γ ∈ < such that

d′y > γ > ∇xφ(x, z)′y, ∀ z ∈ Z(x).

Therefore, we have

d′y > max_{z∈Z(x)} ∇xφ(x, z)′y = f′(x; y),

contradicting Prop. 1.7.3. Thus ∂f(x) ⊂ conv{∇xφ(x, z) | z ∈ Z(x)}, and the proof is complete.

(c) The proof of this part is nearly identical to the proof of part (a). Q.E.D.

A simple but important application of the above proposition is when Z is a finite set and φ is linear in x for all z. In this case, f(x) can be represented as

f(x) = max{a′1x + b1, . . . , a′mx + bm},

where a1, . . . , am are given vectors in <n and b1, . . . , bm are given scalars. Then ∂f(x) is the convex hull of the set of vectors ai such that a′ix + bi = max{a′1x + b1, . . . , a′mx + bm}.

E X E R C I S E S

1.7.1

Consider the problem of minimizing a convex function f : <n 7→ < over the polyhedral set

X = {x | a′jx ≤ bj, j = 1, . . . , r}.


Show that x∗ is an optimal solution if and only if there exist scalars µ∗1, . . . , µ∗r such that

(i) µ∗j ≥ 0 for all j, and µ∗j = 0 for all j such that a′jx∗ < bj.

(ii) 0 ∈ ∂f(x∗) + Σ_{j=1}^{r} µ∗j aj.

Hint: Characterize the cone TX(x∗)⊥, and use Prop. 1.7.5 and Farkas' lemma.

1.7.2

Use the boundedness properties of the subdifferential to show that a convex function f : <n 7→ < is Lipschitz continuous on bounded sets, i.e., given any bounded set B, there exists a scalar L such that

|f(x) − f(y)| ≤ L‖x − y‖, ∀ x, y ∈ B.

1.7.3 (Steepest Descent Directions of Convex Functions)

Let f : <n 7→ < be a convex function and fix a vector x ∈ <n. Show that a vector d̄ is the vector of minimum norm in ∂f(x) if and only if either d̄ = 0 or else −d̄/‖d̄‖ minimizes f′(x; d) over all d with ‖d‖ ≤ 1.

1.7.4 (Generating Descent Directions of Convex Functions)

Let f : <n 7→ < be a convex function and fix a vector x ∈ <n. A vector d ∈ <n is said to be a descent direction of f at x if the corresponding directional derivative of f satisfies

f′(x; d) < 0.

This exercise provides a method for generating a descent direction in circumstances where obtaining a single subgradient at x is relatively easy.

Assume that x does not minimize f, and let g1 be some subgradient of f at x. For k = 2, 3, . . ., let wk be the vector of minimum norm in the convex hull of g1, . . . , gk−1,

wk = arg min_{g∈conv{g1,...,gk−1}} ‖g‖.

If −wk is a descent direction, stop; else let gk be an element of ∂f(x) such that

g′k wk = min_{g∈∂f(x)} g′wk = −f′(x; −wk).

Show that this process terminates in a finite number of steps with a descent direction. Hint: If −wk is not a descent direction, then g′i wk ≥ ‖wk‖² ≥ ‖g∗‖² > 0 for all i = 1, . . . , k − 1, where g∗ is the subgradient of minimum norm in ∂f(x), while at the same time g′k wk ≤ 0. Consider a limit point of {(wk, gk)}.


1.7.5 (Implementing the ε-Descent Method [Lem74])

Let f : <n 7→ < be a convex function. This exercise shows how the procedure of Exercise 1.7.4 can be modified so that it generates an ε-descent direction at a given vector x. At the typical step of this procedure, we have g1, . . . , gk−1 ∈ ∂εf(x). Let wk be the vector of minimum norm in the convex hull of g1, . . . , gk−1,

wk = arg min_{g∈conv{g1,...,gk−1}} ‖g‖.

If wk = 0, stop; we have 0 ∈ ∂εf(x), so by Prop. 1.7.5(b), f(x) ≤ inf_{z∈<n} f(z) + ε and x is ε-optimal. Otherwise, by a search along the line {x − αwk | α ≥ 0}, determine whether there exists a scalar ᾱ such that f(x − ᾱwk) < f(x) − ε. If such an ᾱ can be found, stop and replace x with x − ᾱwk (the value of f has been improved by at least ε). Otherwise, let gk be an element of ∂εf(x) such that

g′k wk = min_{g∈∂εf(x)} g′wk.

Show that this process will terminate in a finite number of steps with either an improvement of the value of f by at least ε, or a confirmation that x is an ε-optimal solution.

1.8 OPTIMALITY CONDITIONS

We finally extend the optimality conditions of Props. 1.6.2 and 1.7.3(f) to the case where the cost function involves a (possibly nonconvex) smooth component and a convex (possibly nondifferentiable) component.

Proposition 1.8.1: Let x∗ be a local minimum of a function f : <n 7→ < over a subset X of <n. Assume that the tangent cone TX(x∗) is convex, and that f has the form

f(x) = f1(x) + f2(x),

where f1 is convex and f2 is smooth. Then

−∇f2(x∗) ∈ ∂f1(x∗) + TX(x∗)⊥.

Proof: The proof is analogous to the one of Prop. 1.6.2. Let y be a nonzero tangent of X at x∗. Then there exists a sequence {ξk} and a sequence {xk} ⊂ X such that xk ≠ x∗ for all k,

ξk → 0, xk → x∗,


and

(xk − x∗)/‖xk − x∗‖ = y/‖y‖ + ξk.

We write this equation as

xk − x∗ = (‖xk − x∗‖/‖y‖) yk, (1.65)

where

yk = y + ‖y‖ξk.

By the convexity of f1, we have

−f′1(xk; xk − x∗) ≤ f′1(xk; x∗ − xk)

and

f1(xk) + f′1(xk; x∗ − xk) ≤ f1(x∗),

and by adding these inequalities, we obtain

f1(xk) ≤ f1(x∗) + f′1(xk; xk − x∗), ∀ k.

Also, by the Mean Value Theorem, we have

f2(xk) = f2(x∗) + ∇f2(x̄k)′(xk − x∗), ∀ k,

where x̄k is a vector on the line segment joining xk and x∗. By adding the last two relations,

f(xk) ≤ f(x∗) + f′1(xk; xk − x∗) + ∇f2(x̄k)′(xk − x∗), ∀ k,

so using Eq. (1.65), we obtain

f(xk) ≤ f(x∗) + (‖xk − x∗‖/‖y‖)(f′1(xk; yk) + ∇f2(x̄k)′yk), ∀ k. (1.66)

We now show by contradiction that f′1(x∗; y) + ∇f2(x∗)′y ≥ 0. Indeed, if f′1(x∗; y) + ∇f2(x∗)′y < 0, then since xk → x∗ and yk → y, it follows, using also Prop. 1.7.2, that for all sufficiently large k,

f′1(xk; yk) + ∇f2(x̄k)′yk < 0

and [from Eq. (1.66)] f(xk) < f(x∗). This contradicts the local optimality of x∗.

We have thus shown that f′1(x∗; y) + ∇f2(x∗)′y ≥ 0 for all y ∈ TX(x∗), or equivalently, by Prop. 1.7.3(b),

max_{d∈∂f1(x∗)} d′y + ∇f2(x∗)′y ≥ 0,


or equivalently [since TX(x∗) is a cone]

min_{‖y‖≤1, y∈TX(x∗)} max_{d∈∂f1(x∗)} (d + ∇f2(x∗))′y = 0.

We can now apply the Saddle Point Theorem (Prop. 1.3.8 – the convexity/concavity and compactness assumptions of that proposition are satisfied) to assert that there exists a d̄ ∈ ∂f1(x∗) such that

min_{‖y‖≤1, y∈TX(x∗)} (d̄ + ∇f2(x∗))′y = 0.

This implies that

−(d̄ + ∇f2(x∗)) ∈ TX(x∗)⊥,

which in turn is equivalent to −∇f2(x∗) ∈ ∂f1(x∗) + TX(x∗)⊥. Q.E.D.

Note that in the special case where f1(x) ≡ 0, we obtain Prop. 1.6.2. The convexity assumption on TX(x∗) is unnecessary in this case, but it is essential in general. [Consider the subset X = {(x1, x2) | x1x2 = 0} of <2; it is easy to construct a convex nondifferentiable function that has a global minimum at x∗ = 0 without satisfying the necessary condition of Prop. 1.8.1.]

In the special case where f2(x) ≡ 0 and X is convex, Prop. 1.8.1 yields the necessity part of Prop. 1.7.3(f). More generally, when X is convex, an equivalent statement of Prop. 1.8.1 is that if x∗ is a local minimum of f over X, there exists a subgradient d ∈ ∂f1(x∗) such that

(d + ∇f2(x∗))′(x − x∗) ≥ 0, ∀ x ∈ X.

This is because for a convex set X, we have z ∈ TX(x∗)⊥ if and only if z′(x − x∗) ≤ 0 for all x ∈ X.
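As an illustration, let X = < and f(x) = |x| + (x − 1)², with f1(x) = |x| and f2(x) = (x − 1)². At x∗ = 1/2 we have ∂f1(x∗) = {1} and −∇f2(x∗) = −2(x∗ − 1) = 1, so the condition of Prop. 1.8.1 holds [here TX(x∗)⊥ = {0}, since X = <], consistent with the fact that x∗ = 1/2 minimizes f over <.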

1.9 NOTES AND SOURCES

The material in this chapter is classical and is developed in various books. Most of these books relate to both convex analysis and optimization, but differ in their scope and emphasis. Rockafellar [Roc70] focuses on convexity and duality in finite dimensions. Ekeland and Temam [EkT76] develop the subject in infinite dimensional spaces. Hiriart-Urruty and Lemarechal [HiL93] emphasize algorithms for dual and nondifferentiable optimization. Rockafellar [Roc84] focuses on convexity and duality in network optimization and an important generalization, known as monotropic programming. Bertsekas [Ber98] also gives a detailed coverage of this material, which owes much to the early work of Minty [Min60] on network optimization. Rockafellar and Wets [RoW98] extend many of the classical convexity concepts to a broader analytical context involving nonconvex sets and functions in finite dimensions, a subject also known as nonsmooth analysis. The normal cone, introduced by Mordukhovich [Mor76], plays a central role in this subject. Bonnans and Shapiro [BoS00] emphasize sensitivity analysis and discuss infinite dimensional problems as well. Borwein and Lewis [BoL00] develop many of the concepts in Rockafellar and Wets [RoW98], but with less detail.

Among the early contributors to convexity theory, we note Steinitz [Ste13], [Ste14], [Ste16], who developed the theory of relative interiors, recession cones, and polar cones, and Minkowski [Min11], who first investigated hyperplane separation theorems. The Minkowski-Weyl Theorem given here is due to Weyl [Wey35]. The notion of a tangent and the tangent cone originated with Bouligand [Bou30], [Bou32]. Subdifferential theory owes much to the work of Fenchel, who in his 1951 lecture notes [Fen51] laid the foundations for the subsequent work on convexity and its connection with optimization.

REFERENCES

[BeM71] Bertsekas, D. P., and Mitter, S. K., 1971. “Steepest Descent for Optimization Problems with Nondifferentiable Cost Functionals,” Proc. 5th Annual Princeton Confer. Inform. Sci. Systems, Princeton, N. J., pp. 347-351.

[BeM73] Bertsekas, D. P., and Mitter, S. K., 1973. “A Descent Numerical Method for Optimization Problems with Nondifferentiable Cost Functionals,” SIAM J. on Control, Vol. 11, pp. 637-652.

[Ber98] Bertsekas, D. P., 1998. Network Optimization: Continuous and Discrete Models, Athena Scientific, Belmont, MA.

[Ber99] Bertsekas, D. P., 1999. Nonlinear Programming: 2nd Edition, Athena Scientific, Belmont, MA.

[BoS00] Bonnans, J. F., and Shapiro, A., 2000. Perturbation Analysis of Optimization Problems, Springer-Verlag, Berlin and N. Y.

[BoL00] Borwein, J. M., and Lewis, A. S., 2000. Convex Analysis and Nonlinear Optimization, Springer-Verlag, N. Y.

[Bou30] Bouligand, G., 1930. “Sur les Surfaces Depourvues de Points Hyperlimites,” Annales de la Societe Polonaise de Mathematique, Vol. 9, pp. 32-41.

[Bou32] Bouligand, G., 1932. Introduction a la Geometrie Infinitesimale Directe, Gauthiers-Villars, Paris.


[EkT76] Ekeland, I., and Temam, R., 1976. Convex Analysis and Variational Problems, North-Holland Publ., Amsterdam.

[Fen51] Fenchel, W., 1951. “Convex Cones, Sets, and Functions,” Mimeographed Notes, Princeton Univ.

[HiL93] Hiriart-Urruty, J.-B., and Lemarechal, C., 1993. Convex Analysis and Minimization Algorithms, Vols. I and II, Springer-Verlag, Berlin and N. Y.

[HoK71] Hoffman, K., and Kunze, R., 1971. Linear Algebra, Prentice-Hall, Englewood Cliffs, N. J.

[LaT85] Lancaster, P., and Tismenetsky, M., 1985. The Theory of Matrices, Academic Press, N. Y.

[Lem74] Lemarechal, C., 1974. “An Algorithm for Minimizing Convex Functions,” in Information Processing ’74, Rosenfeld, J. L., (Ed.), North-Holland, Amsterdam, pp. 552-556.

[Lem75] Lemarechal, C., 1975. “An Extension of Davidon Methods to Nondifferentiable Problems,” Math. Programming Study 3, Balinski, M., and Wolfe, P., (Eds.), North-Holland, Amsterdam, pp. 95-109.

[Min11] Minkowski, H., 1911. “Theorie der Konvexen Körper, Insbesondere Begründung Ihres Oberflächenbegriffs,” Gesammelte Abhandlungen, II, Teubner, Leipzig.

[Min60] Minty, G. J., 1960. “Monotone Networks,” Proc. Roy. Soc. London, A, Vol. 257, pp. 194-212.

[Mor76] Mordukhovich, B. S., 1976. “Maximum Principle in the Problem of Time Optimal Response with Nonsmooth Constraints,” J. of Applied Mathematics and Mechanics, Vol. 40, pp. 960-969.

[OrR70] Ortega, J. M., and Rheinboldt, W. C., 1970. Iterative Solution of Nonlinear Equations in Several Variables, Academic Press, N. Y.

[RoW98] Rockafellar, R. T., and Wets, R. J.-B., 1998. Variational Analysis, Springer-Verlag, Berlin.

[Roc70] Rockafellar, R. T., 1970. Convex Analysis, Princeton Univ. Press, Princeton, N. J.

[Roc84] Rockafellar, R. T., 1984. Network Flows and Monotropic Optimization, Wiley, N. Y.; republished by Athena Scientific, Belmont, MA, 1998.

[Rud76] Rudin, W., 1976. Principles of Mathematical Analysis, McGraw-Hill, N. Y.

[Ste13] Steinitz, H., 1913. “Bedingt Konvergente Reihen und Konvexe Systeme, I,” J. of Math., Vol. 143, pp. 128-175.


[Ste14] Steinitz, H., 1914. “Bedingt Konvergente Reihen und Konvexe Systeme, II,” J. of Math., Vol. 144, pp. 1-40.

[Ste16] Steinitz, H., 1916. “Bedingt Konvergente Reihen und Konvexe Systeme, III,” J. of Math., Vol. 146, pp. 1-52.

[Str76] Strang, G., 1976. Linear Algebra and Its Applications, Academic Press, N. Y.

[Wey35] Weyl, H., 1935. “Elementare Theorie der Konvexen Polyeder,” Commentarii Mathematici Helvetici, Vol. 7, pp. 290-306.