
Optimal vector quantization in terms of Wasserstein distance

Wolfgang Kreitmeier

University of Passau, Department of Informatics and Mathematics, D-94032 Passau, Germany

Email address: [email protected]

Preprint submitted to Journal of Multivariate Analysis, February 21, 2011

Abstract

The optimal quantizer in memory-size constrained vector quantization induces a quantization error which is equal to a Wasserstein distortion. However, for the optimal (Shannon-)entropy constrained quantization error a proof of a similar identity is still missing. Relying on principal results of optimal mass transportation theory, we will prove that the optimal quantization error is equal to a Wasserstein distance. Since we will state the quantization problem in a very general setting, our approach includes the Rényi-α-entropy as a complexity constraint, which covers the special cases of (Shannon-)entropy constrained (α = 1) and memory-size constrained (α = 0) quantization. Additionally, we will derive, for certain distance functions, codecell convexity for quantizers with a finite codebook. Using other methods, this regularity in codecell geometry has already been proved earlier by György and Linder [12, 13].

Key words: Wasserstein distance, optimal quantization error, codecell convexity, Rényi-α-entropy
2000 MSC: 60B10, 60E05, 62E17, 68P30, 94A17, 94A29

1. Introduction

Optimal quantization often arises in electrical engineering in connection with signal processing and data compression. The survey article of Gray and Neuhoff [9] provides a comprehensive overview of this subject. In mathematical terms, quantization is concerned with the approximation of a given probability by another probability which is induced as an image under a quantizer. In doing so, the complexity of the quantizer must not exceed a certain bound. Optimal quantization is achieved if the quantization error between the original distribution and the approximation is, subject to the given bound, minimal. A first rigorous treatment of this problem in a higher-dimensional space apparently goes back to Steinhaus [30]. The complexity of a quantizer can be measured by different mappings. Standard choices are the support cardinality of the approximating distribution [8] or its (Shannon-)entropy [12, 13]. In the first case, we speak of memory-size (or fixed-rate) quantization; the second one is called (Shannon-)entropy constrained quantization. Recently, the Rényi-α-entropy has been suggested as complexity mapping [16, 17], which contains the special cases of (Shannon-)entropy constrained (α = 1) and memory-size constrained (α = 0) quantization. Alternatively, the quantization error between the original probability and its approximation can be interpreted as the costs arising out of the mass transport between these two distributions. Indeed, in the case α = 0 and for the distance mapping l(x) = x^r

with r ≥ 1, it is well-known (see e.g. [8, Lemma 3.4], [24]) that the optimal quantization error is equal to the Wasserstein distance between the original and the (optimal) approximation, which reflects these costs and is a key term in the theory of optimal mass transportation. Our main goal is to prove that this identity remains valid also for general complexity and distance mappings.

By using principal results in optimal mass transportation theory, we will show in this paper for a fairly large class of distance mappings l that the optimal quantization error is equivalent to the minimization of a Wasserstein distance (cf. Thm. 3.2). Because we make only very few assumptions regarding the complexity mapping (cf. Definition 2.1), the case of the Rényi-α-entropy and, therefore, the special cases of memory-size and (Shannon-)entropy constrained optimal quantization are included. Results from mass transportation theory yield that the codecells, i.e. the preimages of the quantizer, have the shape of convex polytopes if the distance mapping is quadratic. Moreover, the codecells are intervals for a large class of distance mappings in the one-dimensional setting. Using other methods, this regularity in codecell geometry has already been proved earlier by György and Linder [12, 13].

The rest of this paper is organized as follows: The second section contains the setup of optimal quantization and mass transportation theory. In the third section, we introduce different types of quantization errors and Wasserstein distances. Our main result (Theorem 3.2) shows that these different notions coincide under very general assumptions on the distance and complexity mapping. In particular, we obtain codecell regularity for special distance mappings. Additionally, for the distance mapping l(x) = x^r with r ≥ 1, we will generalize a consistency result for the optimal quantization error (cf. Corollary 3.9). The last section introduces the Rényi-α-entropy as complexity mapping and compares the results of this paper with known results for the cases α ∈ {0, 1}. Most of our proofs are given in the appendix.

2. Setup and notation

2.1. Optimal quantization

We begin with a very general definition of optimal quantization. Let d ∈ N = {1, 2, ...} and let µ be a Borel probability measure on R^d. Let I ⊂ N and S = {S_i : i ∈ I} be a countable and measurable partition of R^d. Moreover, let C = {c_i : i ∈ I} be a countable set of points in R^d. Now (S_i, c_i)_{i∈I} defines a quantizer q : R^d → C with

q(x) = c_i if and only if x ∈ S_i.

We call C a codebook consisting of codepoints c_i. Every S_i ∈ S is called a codecell. Clearly, C = q(R^d). Moreover, if we assume w.l.o.g. that c_i ≠ c_j for every i, j ∈ I, i ≠ j, then

S = {q^{-1}(z) : z ∈ q(R^d)}.

Denote by Q_d the set of all quantizers and by δ_a the Dirac measure in a ∈ R^d. For every q ∈ Q_d, the image measure

µ ◦ q^{-1}(·) = ∑_{i∈I} µ(S_i) δ_{c_i}(·)

has a countable support and defines an approximation of µ, the so-called quantization of µ by q. Now let P be the space of all probability vectors on [0, 1]^N, i.e., for every p = (p_i)_{i∈N} ∈ P we have p_i ∈ [0, 1] and ∑_{i=1}^∞ p_i = 1.

Definition 2.1. We call a mapping P ∋ p → H(p) ∈ [0, ∞] a complexity mapping if

(a) for any bijection τ : N → N and (p_i)_{i∈N} ∈ P we have H((p_i)_{i∈N}) = H((p_{τ(i)})_{i∈N}), and

(b) for every p ∈ P and k ≥ 2 with p^k = (p_1, .., p_{k−1}, ∑_{i≥k} p_i, 0, ..) ∈ P we have H(p^k) ≤ H(p).

With any enumeration {i_1, i_2, ...} of I we define

H_µ(q) = H((µ(S_{i_1}), µ(S_{i_2}), ...))

as the H-complexity of q w.r.t. µ. Now we intend to quantify the distance between µ and its approximation under q. To this end, let ‖·‖ be the Euclidean norm on R^d and l : [0, ∞) → [0, ∞) a strictly increasing (and therefore Borel-measurable) distance mapping with l(0) = 0. For q ∈ Q_d we define as the distance between µ and µ ◦ q^{-1} the quantization error

D_µ(q) = ∫ l(‖x − q(x)‖) dµ(x).   (1)

For any R ≥ 0 we denote

D_µ^H(R) = inf{D_µ(q) : q ∈ Q_d, H_µ(q) ≤ R}   (2)

as the optimal quantization error of µ under H-complexity bound R. We call a quantizer q optimal for µ under H-complexity bound R if D_µ(q) = D_µ^H(R). Denote by Q*_d the set of all quantizers whose range is finite. It is essential for this paper, and also of principal interest, that we can replace Q_d with Q*_d in relation (2) under a moment condition on µ. To this end, let M(R^d) be the set of all Borel probability measures on R^d with finite l-moment, i.e. ∫ l(‖x‖) dµ(x) < ∞ for every µ ∈ M(R^d). The following statement is proved in the appendix.

Proposition 2.2. Let µ ∈ M(R^d). For every complexity mapping H and bound R ≥ 0 we have

D_µ^H(R) = inf{D_µ(q) : q ∈ Q*_d, H_µ(q) ≤ R}.
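To make definitions (1) and (2) concrete, the following illustrative sketch (which assumes NumPy, takes l(x) = x², and replaces µ by an empirical measure; all names are ours) evaluates D_µ(q) for a nearest-neighbour quantizer with finite range, together with the codecell masses (µ(S_i))_i that enter H_µ(q).

```python
import numpy as np

def quantize(samples, codebook):
    """Nearest-neighbour quantizer: map each sample to the index of the
    closest codepoint, i.e. to its codecell S_i."""
    d2 = ((samples[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def distortion_and_cell_masses(samples, codebook, l=lambda t: t ** 2):
    """D_mu(q) = mean of l(||x - q(x)||) and the vector (mu(S_i))_i,
    both for the empirical measure mu of the samples."""
    idx = quantize(samples, codebook)
    err = np.linalg.norm(samples - codebook[idx], axis=1)
    masses = np.bincount(idx, minlength=len(codebook)) / len(samples)
    return l(err).mean(), masses

rng = np.random.default_rng(0)
x = rng.normal(size=(10_000, 2))                     # samples from N(0, I_2)
c = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 1.5]])  # finite codebook
D, p = distortion_and_cell_masses(x, c)
print("D_mu(q) =", D, "cell masses =", p)  # p feeds any complexity mapping H
```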

As already stated in the introduction, two choices for the complexity mapping are of great practical importance. Quantization with the complexity mapping

P ∋ p → H(p) = log(∑_{i∈N} 1_{(0,1]}(p_i))   (3)

is called memory-size constrained quantization, where we denote by 1_A the characteristic function of a set A ⊂ R^d. If

P ∋ p → H(p) = −∑_{i∈N} p_i log(p_i),   (4)

then we speak of (Shannon-)entropy constrained quantization. The Rényi-α-entropy as a more general complexity constraint is discussed in Section 4. For memory-size constrained quantization, optimal quantizers exist under weak assumptions on µ (see e.g. [8, Thm. 4.12], [24]). If µ is non-atomic, György and Linder [12, Thm. 3] have shown for (Shannon-)entropy constrained quantization in the one-dimensional case that optimal quantizers always exist. If µ is absolutely continuous with respect to the Lebesgue measure and l(x) = x², then this existence result also holds in higher dimensions (cf. [13, Thm. 3]). Unfortunately, optimal quantizers do not exist in general: there are complexity mappings H which lead to the non-existence of optimal quantizers (cf. [16, Thm. 3.1]).

2.2. Transportation theory and its relation to quantization and the Wasserstein distance

The problem of optimal transportation in the sense of Kantorovich [15] can be stated on finite-dimensional spaces as follows: Let X and Y be closed and non-empty subsets of R^d. Let µ be a Borel probability measure on X and ν be a Borel probability measure on Y. Consider the set Γ(µ, ν) of all Borel probability measures on X × Y with first marginal µ and second marginal ν. Kantorovich's problem was to determine the minimal transport cost

w(µ, ν) = inf{ ∫_{X×Y} c(x, y) dγ(x, y) : γ ∈ Γ(µ, ν) }   (5)

with the measurable cost function c(·, ·) : X × Y → [0, ∞). We call γ ∈ Γ(µ, ν) an optimal transport plan if

w(µ, ν) = ∫_{X×Y} c(x, y) dγ(x, y).

Because the cost function takes only non-negative values in our setting, an optimal solution always exists (cf. [31, Thm. 4.1]). Now we specify for the rest of this paper

c(x, y) = l(‖x − y‖) for every (x, y) ∈ X × Y.
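For finitely supported µ and ν, problem (5) is a finite linear program in the entries of the transport plan γ, with the two marginal conditions as equality constraints. The sketch below (an illustration of ours; it assumes SciPy, and every name is ours) computes w(µ, ν) this way for the cost c(x, y) = l(‖x − y‖).

```python
import numpy as np
from scipy.optimize import linprog

def transport_cost(mu_pts, mu_wts, nu_pts, nu_wts, l=lambda t: t ** 2):
    """Minimal transport cost (5) between finitely supported measures,
    solved as a linear program over the n*m transport matrix gamma."""
    n, m = len(mu_wts), len(nu_wts)
    cost = l(np.linalg.norm(mu_pts[:, None, :] - nu_pts[None, :, :], axis=2))
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0  # first marginal: row i sums to mu_i
    for j in range(m):
        A_eq[n + j, j::m] = 1.0           # second marginal: column j sums to nu_j
    b_eq = np.concatenate([mu_wts, nu_wts])
    return linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).fun

mu_pts = np.array([[0.0], [0.5], [1.0]]); mu_wts = np.full(3, 1 / 3)
nu_pts = np.array([[0.25], [0.75]]);      nu_wts = np.array([0.5, 0.5])
print("w(mu, nu) =", transport_cost(mu_pts, mu_wts, nu_pts, nu_wts))
```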

Definition 2.3. A transport plan γ ∈ Γ(µ, ν) is said to be deterministic if a measurable mapping q : X → Y exists such that γ = µ ◦ φ^{-1} with the mapping

X ∋ x → φ(x) = (x, q(x)).

The mapping q is called Monge mapping or transport mapping.

Consequently, every deterministic transport plan is induced by a Monge mapping q which is µ-almost surely uniquely defined (once the transport plan has been fixed). Moreover, ν = µ ◦ q^{-1}. Roughly speaking, one could say that the Monge mapping q transports the mass represented by the measure µ to the mass represented by the measure ν.

If we restrict the transport plans in (5) to be deterministic, the problem of optimal transport turns into the so-called Monge transportation problem, where we have to determine the optimal Monge mapping q such that

∫ l(‖x − q(x)‖) dµ(x) = inf{ ∫ l(‖x − t(x)‖) dµ(x) : t measurable, µ ◦ t^{-1} = ν }
                      = inf{ ∫_{X×Y} c(x, y) dγ(x, y) : γ ∈ Γ(µ, ν), γ deterministic }.   (6)

Now, if we compare (5), (6) and (1), it turns out that the optimal total cost of transportation would be a quantization error if the optimal transport plan were deterministic and the related optimal Monge map were a quantizer. The following Theorem 2.6 states that this is the case if the target distribution ν is discrete and the source distribution µ satisfies a certain continuity assumption. Our proof of this fundamental statement relies on principal results in the theory of optimal mass transportation. We need the following definition:

Definition 2.4. [31, Definition 5.2 and Remark 5.6] A mapping f : X → R is said to be c-convex if there exists a mapping g : Y → R such that

f(x) = sup{g(y) − c(x, y) : y ∈ Y} for every x ∈ X.

Moreover, the c-subdifferential or c-subgradient ∂_c f(x) of the mapping f at the point x ∈ X is defined by

∂_c f(x) = {y ∈ Y : f(x) + c(x, y) ≤ f(z) + c(z, y) for every z ∈ X}.

We rely in this paper on the following fundamental result. Recall definition (5) of w(µ, ν). Denote by card the cardinality of a set.

Theorem 2.5. [31, Theorem 5.30] Assume that w(µ, ν) < ∞. If for any c-convex mapping f : X → R the set

{x ∈ X : card(∂_c f(x)) > 1}

is contained in a µ-measure-zero set, then there exists a unique optimal transport plan γ which is deterministic. The Monge map q which induces γ is characterized by the existence of a c-convex function f such that

q(x) ∈ ∂_c f(x) for µ-a.e. x ∈ X.

Now we intend to apply Theorem 2.5 to a discrete target distribution ν on R^d. To be precise, let m ∈ N and let a_1, .., a_m ∈ R^d be m different points in R^d. Let p_1, .., p_m ∈ (0, 1] be such that ∑_{i=1}^m p_i = 1, and let

ν = ∑_{i=1}^m p_i δ_{a_i}.   (7)

For (λ_1, .., λ_m) ∈ R^m and i ∈ {1, .., m}, we define the set

A_i = A_i(λ_1, .., λ_m) = {x ∈ R^d : l(‖x − a_i‖) − λ_i = min{l(‖x − a_j‖) − λ_j : j ∈ {1, .., m}}}.   (8)

Moreover, let

Q(ν) = {q ∈ Q*_d : q(R^d) = {a_1, .., a_m} with µ(q^{-1}(a_i)) = p_i for every i ∈ {1, .., m}}.
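The sets A_i(λ_1, .., λ_m) of (8) are generalized (shifted) Voronoi cells: a point x belongs to the cell of the index minimizing l(‖x − a_i‖) − λ_i. A brief sketch of this assignment rule (our illustration; parameter names are ours, and ties on cell boundaries are broken towards the smallest index):

```python
import numpy as np

def cell_index(x, a, lam, l=lambda t: t ** 2):
    """Index i minimizing l(||x - a_i||) - lambda_i, i.e. the cell
    A_i(lambda_1, .., lambda_m) of (8) containing x."""
    scores = l(np.linalg.norm(a - x, axis=1)) - lam
    return int(scores.argmin())

a = np.array([[0.0, 0.0], [2.0, 0.0]])   # codepoints a_1, a_2
lam = np.array([0.0, 1.0])               # shifts lambda_1, lambda_2
# With l(x) = x^2 the boundary between the two cells is the hyperplane
# ||x - a_1||^2 = ||x - a_2||^2 - 1, here the vertical line x_1 = 0.75.
print(cell_index(np.array([0.70, 0.3]), a, lam))   # -> 0
print(cell_index(np.array([0.80, 0.3]), a, lam))   # -> 1
```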

Theorem 2.6. Let ν be a discrete probability on R^d as defined in (7). Assume that µ vanishes on the boundary of A_i(β_1, .., β_m) for every (β_1, .., β_m) ∈ R^m and i ∈ {1, .., m}. If w(µ, ν) < ∞, then a quantizer q ∈ Q(ν) and (λ_1, .., λ_m) ∈ R^m exist such that

w(µ, ν) = ∫ l(‖x − q(x)‖) dµ(x) = min{ ∫ l(‖x − q̄(x)‖) dµ(x) : q̄ ∈ Q(ν) }

and

q^{-1}(a_i) = A_i(λ_1, .., λ_m) µ-almost surely for every i ∈ {1, .., m}.

The quantizer q is µ-almost surely uniquely defined, i.e. we have µ ◦ q^{-1} = µ ◦ q̄^{-1} for every q̄ ∈ Q(ν) which attains the above minimum.

Theorem 2.6 is proved in the appendix by applying Theorem 2.5. If r ≥ 1 and l(x) = x^r, then the mapping

ρ_r(·, ·) = l^{-1} ◦ w(·, ·) = w(·, ·)^{1/r}   (9)

is a metric on M(R^d) × M(R^d) (see e.g. [31, Definition 6.1]) and is often called the Wasserstein distance. In this paper, we are generally interested in those continuous distance mappings l for which l^{-1} ◦ w satisfies the triangle inequality, i.e. for which for every µ_1, µ_2, µ_3 ∈ M(R^d) with w(µ_1, µ_3) < ∞ the relation

l^{-1}(w(µ_1, µ_3)) ≤ l^{-1}(w(µ_1, µ_2)) + l^{-1}(w(µ_2, µ_3))

holds. Even if the triangle inequality is satisfied, it is not clear whether the mapping l^{-1} ◦ w always has finite values. Thus l^{-1} ◦ w does not a priori satisfy all axioms of a metric, even if the triangle inequality is in force. For a historical overview of Kantorovich's problem and further aspects of transportation theory, the reader is referred to Rüschendorf [28] and the references therein. Ambrosio et al. [4] is also a good source of information on Wasserstein distances.
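On the real line and for l(x) = x^r, the optimal transport plan is the monotone (quantile) coupling, so ρ_r between two empirical measures with the same number of equally weighted atoms reduces to sorting. A short sketch of this well-known special case (our addition, assuming NumPy):

```python
import numpy as np

def rho_r_empirical(xs, ys, r=2.0):
    """rho_r = w(mu, nu)^(1/r) for two empirical measures on R with n
    equally weighted atoms each; the optimal coupling matches sorted samples."""
    return (np.abs(np.sort(xs) - np.sort(ys)) ** r).mean() ** (1.0 / r)

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=5000)
b = rng.normal(0.5, 1.0, size=5000)
# For two normals with equal variance, rho_2 is the distance of the means,
# here 0.5; the empirical value should come out close to that.
print(rho_r_empirical(a, b, r=2.0))
```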

3. The optimal quantization error in terms of a Wasserstein distortion

Recall that M(R^d) is the set of all Borel probability measures on R^d with finite l-moment. We denote by supp(µ) the support of µ ∈ M(R^d) and define

M*(R^d) = {µ ∈ M(R^d) : card(supp(µ)) < ∞},
M_∞(R^d) = {µ ∈ M(R^d) : card(supp(µ)) ≤ card(N)}.

Let ν ∈ M_∞(R^d) and denote supp(ν) = {a_i : i ∈ I} with I ⊂ N. By adding zeros, if necessary, the distribution ν induces a probability vector p^ν ∈ P, where ∑_{i=1}^∞ p^ν_i = ∑_{i∈I} ν(a_i) = 1. According to property (a) of H, the mapping ν → H(p^ν) is well-defined. For µ ∈ M(R^d) and R ≥ 0 we define

V_µ^H(R) = inf{w(µ, ν) : ν ∈ M_∞(R^d), H(p^ν) ≤ R}

as the optimal total cost of transportation for source distribution µ, where the target distributions ν have countable support and induce an H-complexity which is lower than or equal to the bound R. If l^{-1} ◦ w satisfies the triangle inequality or l is bounded, then we can replace M_∞(R^d) by M*(R^d) in the definition of V_µ^H(R). The following statement is proved in the appendix.

Proposition 3.1. Let µ ∈ M(R^d) and assume that l is continuous and l^{-1} ◦ w satisfies the triangle inequality, or assume that l is bounded and continuous. For every complexity mapping H and bound R ≥ 0 we have

V_µ^H(R) = inf{w(µ, ν) : ν ∈ M*(R^d), H(p^ν) ≤ R}.   (10)

Because we are interested in the case where the target distributions ν are induced by a quantizer whose range is finite, we also define

W_µ^H(R) = inf{w(µ, µ ◦ q^{-1}) : q ∈ Q*_d, H_µ(q) ≤ R}.   (11)

Obviously,

V_µ^H(R) ≤ W_µ^H(R).   (12)

Example 4.3 will show that inequality (12) can be strict. Recall definition (2) of the optimal quantization error D_µ^H(R). In view of Proposition 2.2 and Proposition 3.1 it is natural to ask under which conditions (if any) the quantities W_µ^H(R), V_µ^H(R) and D_µ^H(R) coincide. In addition to this question (which will be answered in Theorem 3.2), we want to know more about the codecell geometry of the quantizers. Such knowledge has proved very useful in analyzing optimal scalar and vector quantizer performance [9]. Of particular interest is the question whether it suffices to consider in definition (2) only quantizers whose image has finite cardinality and whose codecells are convex polytopes. To be precise, let a ∈ R^d and b ∈ R^d, a ≠ b. We define the closed halfspace

T(a, b) = {x ∈ R^d : ‖x − a‖ ≤ ‖x − b‖}.   (13)

We call a set P ⊂ R^d a convex polytope if P is a finite intersection of closed or open halfspaces. Let Q^c_d ⊂ Q*_d be the set of all quantizers where each codecell is a convex polytope and every codepoint of such a codecell lies in the closure of the codecell. We define

D_µ^{H,c}(R) = inf{D_µ(q) : q ∈ Q^c_d, H_µ(q) ≤ R}.

From the definition we immediately obtain for any R ≥ 0 that

D_µ^H(R) ≤ D_µ^{H,c}(R).   (14)

Inequality (14) can be strict (cf. [12, Example 1]), but Theorem 3.2 below states conditions under which (14) turns into an equality. We are interested in two subclasses of distance functions, namely

(C1) l is continuous, twice continuously differentiable with l'' ≥ 0 on (0, ∞), and l^{-1} ◦ w satisfies the triangle inequality;

(C2) l is continuous, twice continuously differentiable with l'' ≤ 0 on (0, ∞), and bounded.

Now we can state the main result of this paper. Theorem 3.2 will be proved in the appendix.

Theorem 3.2. Let µ ∈ M(R^d) and assume that µ vanishes on continuously differentiable (d − 1)-dimensional submanifolds of R^d. Let the distance function l be of type (C1) or (C2). For every R ≥ 0 we have

V_µ^H(R) = W_µ^H(R) = D_µ^H(R).   (15)

Additionally, if

(a) d = 1 and l is of type (C1), or
(b) d > 1 and l(x) = x² for every x ≥ 0,

then

D_µ^{H,c}(R) = D_µ^H(R).   (16)

As already stated (cf. (9)), the distance mappings l(x) = x^r satisfy condition (C1). Now we state a criterion for distance mappings ensuring that they satisfy condition (C1). To this end, we need the following definition.

Definition 3.3. A mapping f : [0, ∞) → [0, ∞) is called superadditive if

f(x + y) ≥ f(x) + f(y) for every x, y ∈ [0, ∞).

Remark 3.4. Consider the following classes of distance functions:

(D1) l is continuous, twice continuously differentiable with l'' > 0 on (0, ∞), and the mapping l'/l'' is superadditive on (0, ∞);

(D2) l(x) = x for every x ∈ [0, ∞).

Lemma A.2 states that every distance mapping which satisfies (D1) or (D2) also satisfies condition (C1). Insofar, Theorem 3.2 also holds for distance mappings which satisfy (D1) or (D2).

Example 3.5. The function l(x) = x exp(−1/x) for x > 0, l(0) = 0, lies in class (D1) and thus, according to Remark 3.4, also in class (C1). Theorem 3.2 is also applicable to the concave distance mappings l(x) = arctan(x) and l(x) = tanh(x) as elements of class (C2).
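For the function of Example 3.5, a direct computation gives l'(x) = e^{−1/x}(1 + 1/x) and l''(x) = e^{−1/x}/x³, hence l'/l'' = x³ + x², whose superadditivity is elementary: (x + y)³ + (x + y)² − x³ − x² − y³ − y² = 3x²y + 3xy² + 2xy ≥ 0. A quick numerical confirmation on a grid (our addition, purely illustrative):

```python
import numpy as np

def ratio(x):
    """l'(x)/l''(x) for l(x) = x * exp(-1/x); in closed form x**3 + x**2."""
    return x ** 3 + x ** 2

x = np.linspace(0.01, 10.0, 400)
X, Y = np.meshgrid(x, x)
# superadditivity of l'/l'': f(x + y) >= f(x) + f(y) on the whole grid
assert np.all(ratio(X + Y) >= ratio(X) + ratio(Y))
print("l'/l'' is superadditive on the tested grid")
```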

Remark 3.6. Common distance mappings are those which satisfy an Orlicz condition (see e.g. [25, Example 2.2.1])

sup{l(2x)/l(x) : x > 0} < ∞.   (17)

It is not difficult to construct distance mappings which are continuous, twice continuously differentiable with l'' ≥ 0 on (0, ∞) and satisfy (17), but are not superadditive. It remains open whether Theorem 3.2 is still true if we replace the triangle inequality for l^{-1} ◦ w in (C1) by condition (17).

Example 4.3 will show that Theorem 3.2 becomes invalid in general if we drop the assumption that µ vanishes on continuously differentiable (d − 1)-dimensional submanifolds of R^d. We denote by λ_d the d-dimensional Lebesgue measure on R^d.

Remark 3.7. Every Borel measure µ on R^d which is absolutely continuous with respect to λ_d vanishes on continuously differentiable (d − 1)-dimensional submanifolds of R^d. By a result of Mattila [23], the Hausdorff measure restricted to a self-similar set whose span equals R^d and which satisfies the open set condition also vanishes on continuously differentiable (d − 1)-dimensional submanifolds of R^d.

Remark 3.8. The boundedness of the distance mapping in the definition of class (C2) and the triangle inequality in (C1) are only needed in the proof of Proposition 3.1. These are sufficient conditions ensuring that equation (10) holds. It remains open to characterize those distance mappings l for which equation (10) is true.

The identity (16) has already been shown by György and Linder [12, 13]. Although they investigate only the case of (Shannon-)entropy constrained quantization, their proof also works in our more general setting, because their construction of a (finite) quantizer with convex codecells always starts from an arbitrary one by redefining the codecells to convex ones having the same probability.

For the special distance mapping l(x) = x^r with fixed norm exponent r ≥ 1, we can easily derive a consistency result for the optimal quantization error. In the special case of memory-size constrained quantization, the result is well-known (see e.g. [8, p. 57], [24, Thm. 9]). Recall the definition (9) of the Wasserstein distance ρ_r.

Corollary 3.9. Let µ_1, µ_2 ∈ M(R^d) and assume that µ_i vanishes on continuously differentiable (d − 1)-dimensional submanifolds of R^d for i ∈ {1, 2}. Let r ≥ 1 and assume that l(x) = x^r for every x ≥ 0. Let R ≥ 0. Then

|(D_{µ_1}^H(R))^{1/r} − (D_{µ_2}^H(R))^{1/r}| ≤ ρ_r(µ_1, µ_2).   (18)

Proof. We distinguish two cases.

1. D_{µ_1}^H(R) ≥ D_{µ_2}^H(R).

Let ε > 0. According to Theorem 3.2, let ν_2 ∈ M_∞(R^d) be such that H(p^{ν_2}) ≤ R and

(D_{µ_2}^H(R))^{1/r} ≥ ρ_r(µ_2, ν_2) − ε.

Again by Theorem 3.2, we obtain

|(D_{µ_1}^H(R))^{1/r} − (D_{µ_2}^H(R))^{1/r}|
  = (D_{µ_1}^H(R))^{1/r} − (D_{µ_2}^H(R))^{1/r}
  ≤ inf{ρ_r(µ_1, ν) − ρ_r(µ_2, ν_2) : ν ∈ M_∞(R^d), H(p^ν) ≤ R} + ε
  ≤ inf{ρ_r(µ_1, µ_2) + ρ_r(ν, ν_2) : ν ∈ M_∞(R^d), H(p^ν) ≤ R} + ε
  = ρ_r(µ_1, µ_2) + ε.

By letting ε → 0, we obtain (18).

2. D_{µ_1}^H(R) < D_{µ_2}^H(R).

This case is handled similarly to the first one.

Corollary 3.9 becomes invalid if we drop the condition that µ_i vanishes on continuously differentiable (d − 1)-dimensional submanifolds of R^d for i ∈ {1, 2}. For a (counter-)example, the reader is referred to [12, Example 2]. Insofar, we cannot apply Corollary 3.9 to the important case where µ_2 = µ_2^{(n)} is the empirical (n-sample) version of µ_1. Although we know that consistency in this empirical case holds for memory-size constrained quantization ([8, Corollary 4.24]), it remains open whether this is true in general.

Nevertheless, Corollary 3.9 could also be of practical relevance for algorithmic quantizer design. Algorithms for designing optimal quantizers often converge to a local error minimum which is not a global one. To avoid this effect, a perturbation approach has been proposed [1, 19] in the case of memory-size constrained quantization. The original distribution µ is approximated (in a weak sense) by µ_n = (1 − a_n)µ + a_n ν, where (a_n) ⊂ (0, 1) is a sequence decreasing to zero and ν represents a distribution which has a unique local (and global) optimal quantizer. Now if an optimal quantizer for µ_n is used as the initial (suboptimal) quantizer for µ_{n+1}, the algorithm is likely to converge to a (global) optimal quantizer for µ_{n+1}. Corollary 3.9 ensures that the quantization errors of these (global) optimal quantizers converge towards the optimal quantization error for µ. Further research is needed to determine whether this approach also works for general complexity mappings.
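The perturbation approach just described amounts to a small homotopy loop around an inner design routine. The following sketch (our addition; it uses plain Lloyd iterations for the memory-size constrained case with l(x) = x² and works on samples, so everything beyond the cited idea is an illustrative assumption) anneals a_n to zero and reuses each codebook as the initialization for the next step:

```python
import numpy as np

def lloyd(samples, codebook, iters=50):
    """Plain Lloyd iterations for l(x) = x^2: alternate nearest-neighbour
    assignment and centroid update (memory-size constrained design)."""
    for _ in range(iters):
        d2 = ((samples[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        idx = d2.argmin(axis=1)
        for i in range(len(codebook)):
            if np.any(idx == i):
                codebook[i] = samples[idx == i].mean(axis=0)
    return codebook

def perturbed_design(sample_mu, sample_nu, codebook, a_seq, seed=0):
    """Design for mu_n = (1 - a_n) mu + a_n nu with a_n decreasing to zero,
    reusing the codebook of mu_n as the initial codebook for mu_{n+1}."""
    rng = np.random.default_rng(seed)
    for a in a_seq:
        take = rng.random(len(sample_mu)) < a          # Bernoulli(a) mixing
        mixed = np.where(take[:, None], sample_nu, sample_mu)
        codebook = lloyd(mixed, codebook)
    return codebook

rng = np.random.default_rng(2)
mu = np.concatenate([rng.normal(-3, 0.5, (500, 1)), rng.normal(3, 0.5, (500, 1))])
nu = rng.normal(0, 1, (1000, 1))    # nu: a distribution with an easy optimum
cb = perturbed_design(mu, nu, rng.normal(size=(4, 1)), a_seq=[0.5, 0.25, 0.1, 0.0])
print(np.sort(cb.ravel()))
```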

4. Rényi-α-entropy as complexity and comparison with known results

Let us give an exact definition of the Rényi-α-entropy [26, 29]. Let N := {1, 2, ...}. Let α ∈ [−∞, ∞] and p = (p_1, p_2, ...) ∈ P. The Rényi-α-entropy H_α(p) ∈ [0, ∞] is defined as (cf. [3, Definition 5.2.35], see also [14, p. 1])

H_α(p) = −∑_{i=1}^∞ p_i log(p_i),               if α = 1,
H_α(p) = −log(sup{p_i : i ∈ N}),                if α = ∞,
H_α(p) = −log(inf{p_i : i ∈ N, p_i > 0}),       if α = −∞,
H_α(p) = (1/(1−α)) log(∑_{i=1}^∞ p_i^α),        if α ∈ (−∞, ∞)\{1}.

We use the conventions 0 · log(0) := 0 and 0^x := 0 for all real x ≥ 0. Moreover, 1/0 := ∞. The logarithm log is taken to base e.

Remark 4.1. With these conventions we obtain

H_0(p) = log(∑_{i∈N} 1_{(0,1]}(p_i)).

Using l'Hospital's rule it is easy to see that the case α = 1 is reached from α ≠ 1 by taking the limit α → 1 (see e.g. [3, Remark 5.2.34]). Moreover, one has

lim_{α→∞} H_α(·) = H_∞(·) and lim_{α→−∞} H_α(·) = H_{−∞}(·).
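A short implementation of H_α for finite probability vectors, with the conventions above, may help; the following sketch is our addition (names are illustrative) and also probes the limit α → 1 numerically:

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Renyi-alpha-entropy of a probability vector p (natural logarithm),
    with the conventions 0*log 0 = 0 and the special cases of the text."""
    p = np.asarray(p, dtype=float)
    q = p[p > 0]                       # zero entries never contribute
    if alpha == 1:                     # Shannon entropy
        return float(-(q * np.log(q)).sum())
    if alpha == np.inf:
        return float(-np.log(q.max()))
    if alpha == -np.inf:
        return float(-np.log(q.min()))
    if alpha == 0:                     # log of support size, cf. Remark 4.1
        return float(np.log(len(q)))
    return float(np.log((q ** alpha).sum()) / (1.0 - alpha))

p = [0.5, 0.25, 0.125, 0.125, 0.0]
print(renyi_entropy(p, 0), renyi_entropy(p, 1), renyi_entropy(p, np.inf))
# the case alpha = 1 is the limit alpha -> 1:
print(abs(renyi_entropy(p, 1 + 1e-6) - renyi_entropy(p, 1)) < 1e-4)
```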

Now let us show that H_α is a complexity mapping in the sense of Definition 2.1.

Proposition 4.2. The mapping H_α is a complexity mapping for every α ∈ [−∞, ∞].

Proof. Clearly, H_α satisfies condition (a) of a complexity mapping. To show that condition (b) of a complexity mapping is also satisfied, we distinguish several cases.

1. α ∈ {−∞, 0, ∞}.
In this case, we deduce immediately from the definition that H_α satisfies condition (b).

2. α = 1.
Let p ∈ P and k ≥ 2 with p^k = (p_1, .., p_{k−1}, ∑_{i≥k} p_i, 0, ..) ∈ P. If H_1(p) = ∞, then we have nothing to prove. So let us assume that H_1(p) < ∞. From the recursivity of the Shannon entropy (cf. [3, relation (1.2.8)]), we obtain

H_1(p^k) ≤ H_1(p^{k+1}).

Due to H_1(p^k) → H_1(p) < ∞ as k → ∞, we obtain that condition (b) is satisfied.

3. α ∈ (−∞, 1)\{0}.
Let p ∈ P and k ≥ 2. Due to α < 1 we obtain, with the convention 1/0 := ∞, that

∑_{i=k}^∞ p_i^α ≥ (∑_{i=k}^∞ p_i)^α,   (19)

yielding H_α(p^k) ≤ H_α(p).

4. α ∈ (1, ∞).
In this case, inequality (19) holds in the reversed order, yielding again H_α(p^k) ≤ H_α(p).
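Condition (b), i.e. H_α(p^k) ≤ H_α(p) when the tail of p is merged into a single mass point, can be probed numerically with the renyi_entropy sketch above (our addition):

```python
import numpy as np

def merge_tail(p, k):
    """p^k of Definition 2.1(b): keep p_1, .., p_{k-1}, merge the tail."""
    p = np.asarray(p, dtype=float)
    return np.concatenate([p[:k - 1], [p[k - 1:].sum()]])

rng = np.random.default_rng(3)
p = rng.random(12); p /= p.sum()
p = np.sort(p)[::-1]                  # any ordering works, cf. property (a)
for alpha in [-2.0, 0.5, 1.0, 2.0, 5.0]:
    ok = all(renyi_entropy(merge_tail(p, k), alpha)
             <= renyi_entropy(p, alpha) + 1e-12
             for k in range(2, len(p) + 1))
    print(alpha, ok)                  # expect True for every alpha
```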

In view of relation (3), memory-size constrained quantization is quantization with complexity H_0, and according to (4), Shannon-entropy constrained quantization uses H_1 as complexity mapping. Quantization with the Rényi-α-entropy as complexity has been investigated in [16, 17, 18].

As already announced in Section 3, we will now give an example showing that Theorem 3.2 becomes invalid in general if µ does not vanish on continuously differentiable (d − 1)-dimensional submanifolds of R^d.

Example 4.3. Let l(x) = x² for x ≥ 0 and d = 1. Let z ∈ (1/2, 1) and

µ = (1/3) · (δ_0 + δ_z + δ_1).

Let α > 0 and p = (1/3, 2/3, 0, ..) ∈ P. Define R_0 = H_α(p). Now let

ν = (1/5) · δ_0 + (4/5) · δ_{(1+z)/2}.

It is plain to see that H_α(p^ν) < R_0. Now let R ∈ (0, R_0). Let q ∈ Q_1 with (H_α)_µ(q) ≤ R. From the definition of R_0 we obtain that q consists of only one codecell. As shown in [8, Example 2.3(b)], the optimal codepoint {c} = q(R) equals the centre of mass, i.e.

c = (1 + z)/3.

We calculate

D_µ(q) = (1/3)((0 − c)² + (z − c)² + (1 − c)²) = (2/9)(1 − z + z²) = D_µ^{H_α}(R′)

for every R′ ≤ R. Because q consists of only one codecell, we obtain

W_µ^H(R) = D_µ(q) = D_µ^{H_α,c}(R).   (20)

On the other hand, we have

ρ_2²(µ, ν) ≤ (1/5) · 0² + (1/3 − 1/5) · ((1 + z)/2)² + 2 · (1/3) · ((1 + z)/2 − z)²
           = (1/5)(1 − (4/3)z + z²).

Together with (12) and (20) we deduce

V_µ^H(R) < W_µ^H(R) = D_µ^{H_α}(R) = D_µ^{H_α,c}(R).
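A quick numerical check of Example 4.3 (our addition): for a concrete z it compares D_µ(q) = (2/9)(1 − z + z²) with the transport bound (1/5)(1 − (4/3)z + z²) and confirms the strict inequality.

```python
import numpy as np

z = 0.75
atoms = np.array([0.0, z, 1.0])          # support of mu, weights 1/3 each
c = atoms.mean()                         # centre of mass (1 + z)/3
D = ((atoms - c) ** 2).mean()            # one-codecell error (2/9)(1 - z + z^2)
t = (1 + z) / 2                          # second atom of nu
bound = (1/3 - 1/5) * t**2 + (1/3) * (t - z)**2 + (1/3) * (1 - t)**2
print(D, bound, bound < D)               # -> 0.18055... 0.1125 True
```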

Remark 4.4. For the distance mapping l(x) = x² and the Shannon entropy as complexity (α = 1), an identity similar to (15) has been proved by Linder [20, Lemma 1] for the so-called Lagrangian distortion. But for an entropy bound R > 0, this modified distortion only coincides with the quantization error (2) if the point (R, D_µ^{H_1}(R)) lies on the lower convex hull of the mapping D_µ^{H_1}(·), which is not the case in general (cf. [11, 16]).

Remark 4.5. Assume that l is of class (D2), i.e. equals the identity, and α = 1. Moreover, assume that µ is absolutely continuous with respect to λ_d and has a compact support. In this special case, Matloub et al. [22] have shown for every ε > 0 that

D_µ^{H_1}(R + ε) ≤ V_µ^H(R) ≤ D_µ^{H_1}(R).   (21)

This result follows from Theorem 3.2. If α = 0 and l(x) = x^r with r ≥ 1, then it is well-known (see e.g. [8, Lemma 3.3, Lemma 3.4], [24]) that the equations (15) and (16) are true.

Remark 4.6. If α = 1 and l is of type (D2), then we could immediately deduce equation (15) from relation (21) if the mapping D_µ^{H_1}(·) were continuous. Although Theorem 3.2 is true for every α ∈ [−∞, ∞], the mapping D_µ^{H_α}(·) is in general not continuous for α ≤ 0 (see e.g. [18, Lemma 7.2], [8, Example 5.5]). In the case α > 0, the mapping D_{U([0,1])}^{H_α}(·) has been completely determined and is Lipschitz continuous, where U([0, 1]) denotes the uniform distribution on [0, 1] (cf. [16]). It remains open whether D_µ^{H_α}(·) is Lipschitz continuous for α > 0 and distributions µ which vanish on continuously differentiable (d − 1)-dimensional submanifolds of R^d.

Remark 4.7. In view of (8), optimal codecells for finite quantizers are in general not polytopes if l(x) ≠ x². For illustrations of such codecells, the reader is referred to [27, Chapter 1.2], [6, Fig. 2] and [2]. For a detailed study of the topological properties of the codecells defined by (8), the author recommends [5]. If l(x) = x², then the polytopes defined in (8) are often called Laguerre tessellations in the literature (cf. [10]).

A. Appendix

Recall that 1_A denotes the characteristic function of a set A ⊂ R^d.

Proof of Proposition 2.2. Obviously, we can assume w.l.o.g. that D_µ^H(R) < ∞. Let ε > 0. According to relation (2), let q ∈ Q_d with H_µ(q) ≤ R and D_µ^H(R) ≥ D_µ(q) − ε. Denote q(R^d) = {a_i : i ∈ I} with a countable set I ⊂ N and points a_i ∈ R^d such that a_i ≠ a_j for every i, j ∈ I with i ≠ j. Because ∫ l(‖x‖) dµ(x) < ∞ and due to

∞ > D_µ(q) = ∑_{i∈I} ∫_{q^{-1}(a_i)} l(‖x − a_i‖) dµ(x),

a finite set J ⊂ I exists such that

∫_{∪_{i∈I\J} q^{-1}(a_i)} l(‖x − a_i‖) dµ(x) < ε   (22)

and

∫_{∪_{i∈I\J} q^{-1}(a_i)} l(‖x‖) dµ(x) < ε.   (23)

Now we define the quantizer

q_J = ∑_{j∈J} a_j 1_{q^{-1}(a_j)} + 0 · 1_{∪_{i∈I\J} q^{-1}(a_i)}.

Let M < ∞ be the cardinality of J and {j_1, .., j_M} be an enumeration of J. Let us first assume that 0 ∈ {a_j : j ∈ J}, i.e. a k ∈ {1, .., M} exists such that a_{j_k} = 0. Applying properties (a) and (b) of H from Definition 2.1, we obtain

H_µ(q_J) = H(µ(q^{-1}(a_{j_1})), .., µ(q^{-1}(a_{j_{k−1}})), µ(∪_{i∈{j_k}∪I\J} q^{-1}(a_i)), µ(q^{-1}(a_{j_{k+1}})), .., µ(q^{-1}(a_{j_M})), 0, ..)
         ≤ H(µ(q^{-1}(a_{j_1})), .., µ(q^{-1}(a_{j_M})), µ(∪_{i∈I\J} q^{-1}(a_i)), 0, ..).   (24)

If 0 ∉ {a_j : j ∈ J}, then (24) turns into an equality. Again, by properties (a) and (b) of H from Definition 2.1, we deduce in any case that

H_µ(q_J) ≤ H(µ(q^{-1}(a_{j_1})), .., µ(q^{-1}(a_{j_M})), µ(∪_{i∈I\J} q^{-1}(a_i)), 0, ..) ≤ H_µ(q) ≤ R.

Now we derive from (22) and (23) that

D_µ^H(R) ≥ D_µ(q) − ε
         ≥ D_µ(q_J) − |D_µ(q) − D_µ(q_J)| − ε
         ≥ D_µ(q_J) − | ∫_{∪_{i∈I\J} q^{-1}(a_i)} (l(‖x − q(x)‖) − l(‖x‖)) dµ(x) | − ε
         ≥ D_µ(q_J) − 3ε.

By letting ε → 0 we get

D_µ^H(R) ≥ inf{D_µ(q) : q ∈ Q*_d, H_µ(q) ≤ R} ≥ inf{D_µ(q) : q ∈ Q_d, H_µ(q) ≤ R} = D_µ^H(R),

which yields the assertion.

Proof of Theorem 2.6. From definition (5) we obtain

w(µ, ν) = inf{ ∫_{R^d×R^d} l(‖x − y‖) dγ(x, y) : γ ∈ Γ(µ, ν) }.

For every γ ∈ Γ(µ, ν), the second marginal of γ equals ν and, therefore, has support {a_1, .., a_m}. Let ν̄ be the restriction of ν to {a_1, .., a_m}. Recall (cf. [31, Thm. 4.1]) that an optimal solution always exists, i.e. choose γ ∈ Γ(µ, ν) such that w(µ, ν) = ∫_{R^d×R^d} l(‖x − y‖) dγ(x, y). Because the support of γ is concentrated on R^d × {a_1, .., a_m} (see also [31, Thm. 5.19]), we deduce that

w(µ, ν) = inf{ ∫_{R^d×{a_1,..,a_m}} l(‖x − y‖) dγ(x, y) : γ ∈ Γ(µ, ν̄) }
        = ∫_{R^d×{a_1,..,a_m}} l(‖x − y‖) dγ_0(x, y)

with γ_0 as the restriction of γ to R^d × {a_1, .., a_m}. If we denote by c̄ the restriction of c to R^d × {a_1, .., a_m}, then γ_0 is also an optimal solution of the Kantorovich problem for source µ and target ν̄ on R^d × {a_1, .., a_m}. Now let f be a c̄-convex mapping on R^d. According to Definition 2.4, let β_1, .., β_m ∈ R be such that

f(x) = max{β_i − l(‖x − a_i‖) : i ∈ {1, .., m}} for every x ∈ R^d.

Next we will show that

A_i(β_1, .., β_m) = {x ∈ R^d : a_i ∈ ∂_{c̄} f(x)}.   (25)

To this end, let x ∈ A_i(β_1, .., β_m). Applying (8) we obtain for every z ∈ R^d that

f(x) = β_i − l(‖x − a_i‖)
     = l(‖z − a_i‖) + β_i − l(‖z − a_i‖) − l(‖x − a_i‖)
     ≤ l(‖z − a_i‖) + max{β_j − l(‖z − a_j‖) : j ∈ {1, .., m}} − l(‖x − a_i‖)
     = c̄(z, a_i) + f(z) − c̄(x, a_i).

Consequently, a_i ∈ ∂_{c̄} f(x) according to Definition 2.4. Now let x ∈ R^d be such that a_i ∈ ∂_{c̄} f(x). Let z ∈ R^d. Choose i_x, i_z ∈ {1, .., m} such that

f(z) = β_{i_z} − l(‖z − a_{i_z}‖) and f(x) = β_{i_x} − l(‖x − a_{i_x}‖).   (26)

Using a_i ∈ ∂_{c̄} f(x) we deduce

f(x) + l(‖x − a_i‖) ≤ f(z) + l(‖z − a_i‖).   (27)

Combining (26) and (27) we obtain

l(‖x − a_i‖) − β_{i_z} ≤ l(‖x − a_{i_x}‖) − β_{i_x} − l(‖z − a_{i_z}‖) + l(‖z − a_i‖).   (28)

Now specialize z ∈ R^d such that i_z = i. Because A_i(β_1, .., β_m) is non-empty for every i ∈ {1, .., m}, such a choice for z is always possible. From (28) we get

l(‖x − a_i‖) − β_i ≤ l(‖x − a_{i_x}‖) − β_{i_x} = min{l(‖x − a_j‖) − β_j : j ∈ {1, .., m}}.

Thus, x ∈ A_i(β_1, .., β_m), which proves (25). Now let x ∈ R^d with card(∂_{c̄} f(x)) > 1. In view of (25), we know that i, j ∈ {1, .., m} exist with i ≠ j such that x ∈ A_i(β_1, .., β_m) ∩ A_j(β_1, .., β_m). Hence,

l(‖x − a_i‖) − β_i = l(‖x − a_j‖) − β_j.

Because a_i ≠ a_j, we can assume w.l.o.g. that x ≠ a_j. According to definition (8), we can find for every ε > 0 a point z ∈ R^d such that

‖z − x‖ < ε, ‖x − a_j‖ = ‖z − a_j‖ and ‖x − a_i‖ < ‖z − a_i‖.

Because l is strictly increasing, we obtain

l(‖z − a_i‖) − β_i > l(‖z − a_j‖) − β_j,

yielding that z lies in the complement of A_i(β_1, .., β_m). This implies that x is an element of the boundary of A_i(β_1, .., β_m), i.e. the set {x ∈ R^d : card(∂_{c̄} f(x)) > 1} is contained in a µ-measure-zero set by our assumption. Now we can apply Theorem 2.5, which implies that γ_0 is deterministic. Let q_0 be the Monge map which induces γ_0 and let f be the c̄-convex function such that

q_0(x) ∈ ∂_{c̄} f(x) for µ-a.e. x ∈ R^d.

From the above and Definition 2.4 we deduce that for µ-almost every x ∈ R^d exactly one i ∈ {1, .., m} exists such that a_i ∈ ∂_{c̄} f(x), i.e. q_0(x) = a_i. Now let B_1, .., B_m be a partition of R^d with

B_i = {x ∈ R^d : a_i ∈ ∂_{c̄} f(x)} µ-almost surely.

According to Definition 2.4 and relation (25), we obtain that (λ_1, .., λ_m) ∈ R^m exist such that for every i ∈ {1, .., m} the set B_i equals µ-almost surely the set A_i(λ_1, .., λ_m). Consequently, we can assume w.l.o.g. that B_i is measurable for every i ∈ {1, .., m}. Now we define the quantizer q with q(x) = a_i if x ∈ B_i. Because q(x) = q_0(x) for µ-almost every x ∈ R^d and q_0 is a Monge map for γ_0 ∈ Γ(µ, ν̄), we obtain that µ ◦ q^{-1} = ν, which yields µ ◦ q^{-1}(a_i) = ν({a_i}) = p_i for every i ∈ {1, .., m}. We get

w(µ, ν) = ∫ l(‖x − q_0(x)‖) dµ(x) = ∫ l(‖x − q(x)‖) dµ(x).

Now let q̄ ∈ Q(ν). Because q̄ induces a transport plan π ∈ Γ(µ, ν), we have w(µ, ν) ≤ ∫ l(‖x − q̄(x)‖) dµ(x). If w(µ, ν) = ∫ l(‖x − q̄(x)‖) dµ(x), then π = γ_0 according to Theorem 2.5. Thus, we obtain again from the above considerations that q̄(x) = q(x) for µ-almost every x ∈ R^d, which finally proves the assertion.

We denote by ∇F the gradient of a differentiable mapping F : R^d → R.

Lemma A.1. Let µ ∈ M(R^d) and assume that µ vanishes on continuously differentiable (d − 1)-dimensional submanifolds of R^d. Let m ∈ N and let a_1, .., a_m ∈ R^d be m different points in R^d. Let λ_1, .., λ_m ∈ R and A_j = A_j(λ_1, .., λ_m) as defined in (8). If l is of type (C1) or (C2), then the boundary of A_j has zero µ-measure. Additionally, if d = 1 and l is of type (C1), then A_j is an interval. If d > 1 and l(x) = x² for every x ≥ 0, then A_j is a convex polytope.

Proof. Let j ∈ {1, .., m}. Obviously, we can write

A_j = ∩_{i=1, i≠j}^m {x ∈ R^d : l(‖x − a_j‖) − l(‖x − a_i‖) ≤ λ_j − λ_i}.   (29)

For every i ∈ {1, .., m}\{j} we define

G_{j,i} = {a_j + λ(a_j − a_i) : λ ∈ R}

and for every x ∈ R^d let

Ψ_{j,i}(x) = l(‖x − a_j‖) − l(‖x − a_i‖).

Let G_j = ∪_{i=1, i≠j}^m G_{j,i}. In order to show that the boundary of A_j has zero µ-measure, we distinguish several cases.

1. d = 1.

1.1. l is of type (C1).
We proceed as in the proof of [12, Lemma 1]. Let i ∈ {1, .., m}\{j} and assume w.l.o.g. that a_i > a_j. If x < a_j, then

Ψ'_{j,i}(x) = l'(a_i − x) − l'(a_j − x).

Due to l'' ≥ 0 we obtain that Ψ_{j,i} is monotone increasing on (−∞, a_j). If x > a_i, then we deduce by similar considerations that Ψ'_{j,i}(x) ≥ 0, i.e. Ψ_{j,i} is monotone increasing on (a_i, ∞). If a_j < x < a_i, then we obtain

Ψ'_{j,i}(x) = l'(a_i − x) + l'(x − a_j).

Due to l' > 0 we get that Ψ_{j,i} is strictly increasing on (a_j, a_i). Obviously, Ψ_{j,i} is continuous on R. Hence, Ψ_{j,i}^{-1}((−∞, λ_j − λ_i]) is a (possibly empty or degenerate) interval. Due to (29), A_j is a finite intersection of intervals, yielding that the boundary of A_j is a finite set. Because µ is non-atomic, the assertion is proved in this case.

1.2. l is of type (C2).
Due to l'' ≤ 0, we obtain similarly to the above that Ψ_{j,i} is monotone decreasing on (−∞, a_j) and on (a_i, ∞). Moreover, Ψ_{j,i} is strictly increasing on (a_j, a_i) and continuous on R. Note also that Ψ_{j,i} is negative on (−∞, a_j) and positive on (a_i, ∞). Hence, Ψ_{j,i}^{-1}((−∞, λ_j − λ_i]) consists of at most two intervals. Hence, the assertion is proved as in case 1.1.

2. d > 1.
Let i ∈ {1, .., m}\{j} and let x ∈ R^d\G_j. The mapping Ψ_{j,i} is differentiable on R^d\G_j and we obtain

∇Ψ_{j,i}(x) = l'(‖x − a_j‖) (x − a_j)/‖x − a_j‖ − l'(‖x − a_i‖) (x − a_i)/‖x − a_i‖.

Because x ∉ G_j, we know that x − a_j and x − a_i are linearly independent and min(‖x − a_j‖, ‖x − a_i‖) > 0. Because l' has no zeros on (0, ∞), we obtain ∇Ψ_{j,i}(x) ≠ 0. If λ_j − λ_i ∈ Ψ_{j,i}(R^d\G_j), then the submersion theorem yields that Ψ_{j,i}^{-1}(λ_j − λ_i)\G_j is covered by a continuously differentiable (d − 1)-dimensional submanifold of R^d\G_j. Because G_j, as a finite union of lines, can be covered by a finite union of (d − 1)-dimensional hyperplanes, we finally obtain that Ψ_{j,i}^{-1}(λ_j − λ_i) is always covered by a finite union of continuously differentiable (d − 1)-dimensional submanifolds of R^d. Note that the boundary of A_j is contained in ∪_{i=1, i≠j}^m Ψ_{j,i}^{-1}(λ_j − λ_i), which yields the assertion also in this case.

Now let d = 1 and assume that l is of type (C1). From 1.1 we obtain that A_j is an interval. Finally, assume that d > 1 and l(x) = x² for every x ≥ 0. For any i ∈ {1, .., m}\{j} let λ = (λ_j − λ_i)/(2‖a_i − a_j‖²) and

b_i = a_i + λ(a_i − a_j), b_j = a_j + λ(a_i − a_j).

Now recall the definition (13) of a closed halfspace. A simple calculation shows that

A_j = ∩_{i=1, i≠j}^m T(b_j, b_i)

is a finite intersection of closed halfspaces and, therefore, a convex polytope.
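The closing halfspace representation is easy to test numerically: for l(x) = x², membership x ∈ A_j according to (8) must agree with ‖x − b_j‖ ≤ ‖x − b_i‖ for all i ≠ j, with b_i, b_j as constructed above. A small randomized check (our addition, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(4)
m, d = 5, 2
a = rng.normal(size=(m, d))             # codepoints a_1, .., a_m
lam = rng.normal(size=m)                # shifts lambda_1, .., lambda_m

def in_A(x, j):
    """x in A_j(lambda) per (8) with l(x) = x^2."""
    s = ((x - a) ** 2).sum(axis=1) - lam
    return s[j] <= s.min() + 1e-12

def in_halfspaces(x, j):
    """x in the intersection of the halfspaces T(b_j, b_i) of the proof."""
    for i in range(m):
        if i == j:
            continue
        t = (lam[j] - lam[i]) / (2 * ((a[i] - a[j]) ** 2).sum())
        bi, bj = a[i] + t * (a[i] - a[j]), a[j] + t * (a[i] - a[j])
        if np.linalg.norm(x - bj) > np.linalg.norm(x - bi) + 1e-12:
            return False
    return True

for x in rng.normal(size=(1000, d)):
    for j in range(m):
        assert in_A(x, j) == in_halfspaces(x, j)
print("A_j = intersection of halfspaces T(b_j, b_i): verified on samples")
```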

Lemma A.2. If the distance mapping l is of type (D1) or (D2), then the triangle inequality for l^{-1} ◦ w holds.

Proof. Let µ_i ∈ M(R^d), i ∈ {1, 2, 3}, and w(µ_1, µ_3) < ∞. We have to show that

l^{-1}(w(µ_1, µ_3)) ≤ l^{-1}(w(µ_1, µ_2)) + l^{-1}(w(µ_2, µ_3)).   (30)

We follow the standard argument showing that ρ_r satisfies the triangle inequality (see e.g. [31, p. 94] or [7, Proposition 2]) and apply a generalization of the Minkowski inequality (cf. [21, Thm. 3]). Obviously, we can assume w.l.o.g. that w(µ_1, µ_2) < ∞ and w(µ_2, µ_3) < ∞. If l is of type (D2), then the assertion follows immediately from the triangle inequality of the Wasserstein distance. Hence, let us assume that l is of type (D1). Because l is continuous, we deduce from general existence results in optimal transportation theory (cf. [31, Thm. 4.1]) that distributions P_1 and P_2 on R^d × R^d exist such that

w(µ_1, µ_2) = ∫_{R^d×R^d} l(‖x − y‖) dP_1(x, y)   (31)

and

w(µ_2, µ_3) = ∫_{R^d×R^d} l(‖x − y‖) dP_2(x, y),   (32)

where P_1 has marginals µ_1 and µ_2 and P_2 has marginals µ_2 and µ_3. Moreover, let P be a distribution on R^d × R^d × R^d with marginal distribution P_1 when projecting on the first two components and P_2 when projecting on the last two components. Regarding the existence of such a distribution, see e.g. [4, Remark 5.3.3] or [31, Chapter 1]. Denote by P_3 the marginal of P obtained by projecting on the first and the last component. The first marginal of P_3 equals µ_1 and the second marginal of P_3 equals µ_3. Because l and l^{-1} are increasing, we deduce

l^{-1}(w(µ_1, µ_3)) ≤ l^{-1}( ∫_{R^d×R^d} l(‖x − y‖) dP_3(x, y) )
                   = l^{-1}( ∫_{R^d×R^d×R^d} l(‖x − z‖) dP(x, y, z) )
                   ≤ l^{-1}( ∫_{R^d×R^d×R^d} l(‖x − y‖ + ‖y − z‖) dP(x, y, z) ).

Now let (u_n)_{n∈N} and (v_n)_{n∈N} be sequences of step functions on R^d × R^d × R^d with u_n ≤ u_{n+1}, v_n ≤ v_{n+1} and

sup_{n∈N} u_n(x, y, z) = ‖x − y‖, sup_{n∈N} v_n(x, y, z) = ‖y − z‖.

Note that l is continuous and monotone increasing and that l^{-1} is continuous. Thus, by monotone convergence we obtain

l^{-1}(w(µ_1, µ_3)) ≤ lim_{n→∞} l^{-1}( ∫_{R^d×R^d×R^d} l(u_n(x, y, z) + v_n(x, y, z)) dP(x, y, z) ).

Now we distinguish two cases. Let us first assume that µ_i(A) ∈ {0, 1} for every measurable A ⊂ R^d and i ∈ {1, 2, 3}. Consequently, a_i ∈ R^d exist such that µ_i = δ_{a_i}. But then relation (30) follows immediately from the triangle inequality. Hence, we can assume that a measurable set A ⊂ R^d and j ∈ {1, 2, 3} exist with µ_j(A) ∈ (0, 1). Let B = ∏_{i∈{1,2,3}} A_i with A_j = A and A_i = R^d for every i ≠ j. Hence, P(B) = µ_j(A) ∈ (0, 1). Due to the assumptions on l, we can apply a generalized version of the Minkowski inequality [21, Thm. 3] and deduce

l^{-1}(w(µ_1, µ_3))
  ≤ lim_{n→∞} l^{-1}( ∫_{R^d×R^d×R^d} l(u_n(x, y, z) + v_n(x, y, z)) dP(x, y, z) )
  ≤ lim_{n→∞} ( l^{-1}( ∫_{R^d×R^d×R^d} l(u_n(x, y, z)) dP(x, y, z) ) + l^{-1}( ∫_{R^d×R^d×R^d} l(v_n(x, y, z)) dP(x, y, z) ) )
  = l^{-1}( ∫_{R^d×R^d} l(‖x − y‖) dP_1(x, y) ) + l^{-1}( ∫_{R^d×R^d} l(‖y − z‖) dP_2(y, z) ).

Applying the identities (31) and (32), we obtain inequality (30).

Proof of Proposition 3.1. Obviously, we can assume w.l.o.g. that V_µ^H(R) < ∞. Let ν ∈ M_∞(R^d)\M*(R^d) with H(p^ν) ≤ R and w(µ, ν) < ∞. Let {a_1, a_2, ...} = supp(ν) and define for every k ≥ 2 the distribution

ν_k = ∑_{i=1}^{k−1} ν(a_i) δ_{a_i} + ∑_{i≥k} ν(a_i) δ_0.

Property (b) of the mapping H implies

H(p^{ν_k}) ≤ H(p^ν) ≤ R.

1. Let us first assume that l is continuous and l^{-1} ◦ w satisfies the triangle inequality.
We define the transport mapping

q_k = ∑_{i=1}^{k−1} a_i 1_{{a_i}} + 0 · 1_{R^d\{a_1,..,a_{k−1}}}

with ν ◦ q_k^{-1} = ν_k. Denote by π_k ∈ Γ(ν, ν_k) the (deterministic) transport plan which is induced by q_k. Thus, we get

w(ν, ν_k) ≤ ∫ l(‖x − y‖) dπ_k(x, y) = ∑_{i≥k} ν(a_i) l(‖a_i‖).   (33)

Because ν ∈ M(R^d), we obtain ∫ l(‖x‖) dν(x) = ∑_{i=1}^∞ ν(a_i) l(‖a_i‖) < ∞, which together with (33) yields that w(ν, ν_k) → 0 as k → ∞. Using the triangle inequality for l^{-1} ◦ w we conclude that

∞ > l^{-1}(w(µ, ν)) ≥ l^{-1}(w(µ, ν_k)) − l^{-1}(w(ν_k, ν)).

Letting k → ∞, we obtain from the continuity of l and l^{-1} that

w(µ, ν) ≥ lim inf_{k→∞} w(µ, ν_k).

This implies

V_µ^H(R) ≥ inf{w(µ, κ) : κ ∈ M*(R^d), H(p^κ) ≤ R}   (34)

and yields the assertion in this first case.

2. Now we assume that l is bounded.
Note that ν_k converges weakly to ν. Moreover,

lim inf_{k→∞} w(µ, ν_k) ≤ sup_{x∈[0,∞)} l(x) < ∞.

Denote by π_k an optimal transport plan for source µ and target ν_k, i.e.

w(µ, ν_k) = ∫ l(‖x − y‖) dπ_k(x, y).

Applying a stability result for optimal transport plans (cf. [31, Thm. 5.20]), we obtain a subsequence of (π_k), also denoted by (π_k), such that (π_k) converges weakly to an optimal transport plan π for source µ and target ν, i.e.

w(µ, ν) = ∫_{R^d×R^d} l(‖x − y‖) dπ(x, y).

Because l is bounded, weak convergence implies

w(µ, ν_k) = ∫ l(‖x − y‖) dπ_k(x, y) → ∫ l(‖x − y‖) dπ(x, y) = w(µ, ν) as k → ∞.

As an immediate consequence we obtain

w(µ, ν) ≥ inf{w(µ, κ) : κ ∈ M*(R^d), H(p^κ) ≤ R},

which yields (34) and, hence, proves the assertion also in this second case.

For any set A ⊂ R^d we denote by int(A) the interior of A. Recall that 1_A is the characteristic function on A. Now we can prove the main result of this paper.

Proof of Theorem 3.2. We proceed in several steps.

1. We prove that W_µ^H(R) ≤ D_µ^H(R).
Let q ∈ Q*_d with H_µ(q) ≤ R. With the measurable mapping

R^d ∋ x ↦ φ(x) = (x, q(x)) ∈ R^d × R^d

and in view of (5), we obtain

D_µ(q) = ∫_{R^d} l(‖x − q(x)‖) dµ(x) = ∫_{R^d×R^d} l(‖x − y‖) d(µ ◦ φ^{-1})(x, y) ≥ w(µ, µ ◦ q^{-1}).

Hence, relation (11) and Proposition 2.2 imply that W_µ^H(R) ≤ D_µ^H(R).

2. We prove that V_µ^H(R) ≥ D_µ^H(R).
Let us assume w.l.o.g. that V_µ^H(R) < ∞. Let m ∈ N and ν ∈ M*(R^d) with card(supp(ν)) = m, H(p^ν) ≤ R and w(µ, ν) < ∞. Let us denote supp(ν) = {a_1, .., a_m}. For any shifts β_1, .., β_m ∈ R recall the definition (8) of the set A_j(β_1, .., β_m). We obtain from Lemma A.1 that the boundary of A_j(β_1, .., β_m) has zero µ-measure. Thus, we deduce from Theorem 2.6 that a quantizer q and uniquely determined constants λ_1, .., λ_m ∈ R exist such that q(R^d) = {a_1, .., a_m} and

w(µ, ν) = D_µ(q) = ∑_{j=1}^m ∫_{A_j(λ_1,..,λ_m)} l(‖x − a_j‖) dµ(x),   (35)

where

µ(q^{-1}(a_j)) = µ(A_j(λ_1, .., λ_m)) = ν(a_j) > 0 for every j ∈ {1, .., m}.

Note that

H_µ(q) = H(p^ν) ≤ R.

Using (35) we obtain

w(µ, ν) = D_µ(q) ≥ D_µ^H(R).

Because l satisfies (C1) or (C2), the assertion of step 2 follows from Proposition 3.1.

3. Rest of the proof.
Combining (12) with step 1 and step 2, we obtain

V_µ^H(R) ≤ W_µ^H(R) ≤ D_µ^H(R) ≤ V_µ^H(R),

which proves equation (15). Now assume, additionally, that condition (a) or (b) in the assumptions of this theorem is satisfied. In view of (14) and (15) it suffices to prove that V_µ^H(R) ≥ D_µ^{H,c}(R). Again, we can assume w.l.o.g. that V_µ^H(R) < ∞. Let ν ∈ M*(R^d) and q be as in step 2. Due to Lemma A.1, the source distribution µ vanishes on the boundary of A_j(λ_1, .., λ_m). Thus, we can assume w.l.o.g. that

int(A_j(λ_1, .., λ_m)) ⊂ q^{-1}(a_j) ⊂ A_j(λ_1, .., λ_m), j ∈ {1, .., m},

and that q^{-1}(a_j) is either an interval (d = 1) or a convex polytope (d > 1). If d = 1, then we can clearly assume w.l.o.g. that a_j is contained in A_j. Moreover, if d > 1, let us assume w.l.o.g., according to [8, Example 2.3(b)], that a_j is the centroid and therefore optimal for µ restricted to A_j, i.e.

∫_{A_j} ‖x − a_j‖² dµ(x) = inf{ ∫_{A_j} ‖x − b‖² dµ(x) : b ∈ R^d }.

Because we are operating with the Euclidean norm, we obtain by exactly the same argument as in the proof of [8, Lemma 2.6(a)] that a_j is contained in A_j, which is the closure of q^{-1}(a_j). By (35) we get

w(µ, ν) = D_µ(q) ≥ D_µ^{H,c}(R).

Because l is of type (C1) or (C2), Proposition 3.1 yields V_µ^H(R) ≥ D_µ^{H,c}(R), which finishes the proof.

Acknowledgements

The author would like to thank the two anonymous referees for valuable comments leading to an improved version of this paper. The research was supported by a grant from the German Research Foundation (DFG).


References

[1] E.F. Abaya and G.L. Wise, Convergence of vector quantizers with applications to optimal quantization, SIAM J. Appl. Math. 44 (1984), pp. 183-189.

[2] T. Abdellaoui, Optimal solution of a Monge-Kantorovitch transportation problem, J. Comput. Appl. Math. 96 (1998), pp. 149-161.

[3] J. Aczél and Z. Daróczy, On Measures of Information and Their Characterizations, Mathematics in Science and Engineering, Academic Press, London (1975).

[4] L. Ambrosio, N. Gigli and G. Savaré, Gradient Flows in Metric Spaces and in the Space of Probability Measures, 2nd ed., Lectures in Mathematics, Birkhäuser, Basel (2008).

[5] P.F. Ash and E.D. Bolker, Generalized Dirichlet tessellations, Geom. Dedicata 20 (1986), pp. 209-243.

[6] W. Gangbo and R.J. McCann, The geometry of optimal transportation, Acta Math. 177 (1996), pp. 113-161.

[7] C.R. Givens and R.M. Shortt, A class of Wasserstein metrics for probability distributions, Mich. Math. J. 31 (1984), pp. 231-240.

[8] S. Graf and H. Luschgy, Foundations of Quantization for Probability Distributions, Lecture Notes in Mathematics 1730, Springer, Berlin (2000).

[9] R. Gray and D. Neuhoff, Quantization, IEEE Trans. Inf. Theory 44 (1998), pp. 2325-2383.

[10] C. Lautensack and S. Zuyev, Random Laguerre tessellations, Adv. in Appl. Probab. 40 (2008), pp. 630-650.

[11] A. György and T. Linder, Optimal entropy-constrained scalar quantization of a uniform source, IEEE Trans. Inf. Theory 46 (2000), pp. 2704-2711.

[12] A. György and T. Linder, On the structure of optimal entropy-constrained scalar quantizers, IEEE Trans. Inf. Theory 48 (2002), pp. 416-427.

[13] A. György and T. Linder, Codecell convexity in optimal entropy-constrained vector quantization, IEEE Trans. Inf. Theory 49 (2003), pp. 1821-1828.

[14] P. Harremoës, Joint range of Rényi entropies, Kybernetika 45 (2009), pp. 901-911.

[15] L. Kantorovitch, On the translocation of masses, C. R. (Dokl.) Acad. Sci. URSS, n. Ser. 37 (1942), pp. 199-201.

[16] W. Kreitmeier, Optimal quantization for the one-dimensional uniform distribution with Rényi-α-entropy constraints, Kybernetika 46 (2010), pp. 96-113.

[17] W. Kreitmeier, Error bounds for high-resolution quantization with Rényi-α-entropy constraints, Acta Math. Hungar. 127 (2010), pp. 34-51.

[18] W. Kreitmeier and T. Linder, High-resolution scalar quantization with Rényi entropy constraint, preprint, arXiv:1008.1744 (2010).

[19] Y. Linde, A. Buzo and R.M. Gray, An algorithm for vector quantizer design, IEEE Trans. Comm. 28 (1980), pp. 84-95.

[20] T. Linder, Lagrangian empirical design of variable-rate vector quantizers: consistency and convergence rates, IEEE Trans. Inf. Theory 48 (2002), pp. 2998-3003.

[21] J. Matkowski, The converse of the Minkowski's inequality theorem and its generalization, Proc. Am. Math. Soc. 109 (1990), pp. 663-675.

[22] S. Matloub, D. O'Brien and R. Gray, Optimal quantizer performance and the Wasserstein distortion, Proceedings of the 2005 IEEE Data Compression Conference, 29-31 March 2005, pp. 243-250.

[23] P. Mattila, On the structure of self-similar fractals, Ann. Acad. Sci. Fenn., Ser. A 7 (1982), pp. 189-195.

[24] D. Pollard, Quantization and the method of k-means, IEEE Trans. Inf. Theory 28 (1982), pp. 199-205.

[25] S.T. Rachev, Probability Metrics and the Stability of Stochastic Models, Wiley Series in Probability and Statistics, Wiley, Chichester (1991).

[26] A. Rényi, On measures of entropy and information, Proc. 4th Berkeley Symp. Math. Stat. Probab. 1 (1960), pp. 547-561.

[27] L. Rüschendorf and L. Uckelmann, On optimal multivariate couplings, in: Distributions with Given Marginals and Moment Problems (Proceedings of the 1996 conference, Prague, Czech Republic), Kluwer (1996), pp. 261-273.

[28] L. Rüschendorf, Monge-Kantorovich transportation problem and optimal couplings, Jahresber. Dtsch. Math.-Ver. 109 (2007), pp. 113-137.

[29] M.P. Schützenberger, Contribution aux applications statistiques de la théorie de l'information, Publ. Inst. Stat. Univ. Paris 3 (1954), pp. 1-114.

[30] H. Steinhaus, Sur la division des corps matériels en parties, Bull. Acad. Polon. Sci. Cl. III 4 (1957), pp. 801-804.

[31] C. Villani, Optimal Transport. Old and New, Grundlehren der Mathematischen Wissenschaften, Springer, Berlin (2009).
