Algorithmic Frontiers of Doubling Metric Spaces

Robert Krauthgamer

Weizmann Institute of Science

Based on joint works with Yair Bartal, Lee-Ad Gottlieb, Aryeh Kontorovich

The Traveling Salesman Problem: Low-dimensionality implies PTAS

Robert Krauthgamer Weizmann Institute of Science

Joint work with Yair Bartal and Lee-Ad Gottlieb

Traveling Salesman Problem (TSP)

Definition: Given a set of cities (points), find a minimum-length tour that visits all points

Classic, well-studied NP-hard problem [Karp’72; Papadimitriou-Vempala’06]

Mentioned in a handbook from 1832!

Common benchmark for optimization methods

Many books devoted to TSP…

Numerous variants: closed/open tour, multiple tours, average visit time (repairman), etc.

(Figure: optimal tour)

Metric TSP

Basic assumptions on distances:

Symmetric: d(x,y) = d(y,x)

Metric (triangle inequality): d(x,z) ≤ d(x,y) + d(y,z)

Easy 2-approximation via MST, since OPT ≥ MST

Can do better: MST + matching ≤ 3/2·OPT [Christofides’76]


(Figure: MST)
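The MST-based 2-approximation mentioned above can be sketched as follows. This is a minimal illustration, not code from the talk: the function names (`mst_edges`, `mst_tour`, `tour_length`) are hypothetical, the points are assumed to be given as a list, and `dist` is assumed to be any symmetric function satisfying the triangle inequality. The tour is the preorder walk of the MST, which shortcuts the doubled tree and therefore has length at most 2·MST ≤ 2·OPT.

```python
import math

def mst_edges(points, dist):
    """Prim's algorithm on the complete graph; returns MST edges as (parent, child)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known connection cost to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def mst_tour(points, dist):
    """Visit points in preorder of the MST rooted at index 0 (shortcut of the doubled tree)."""
    children = {i: [] for i in range(len(points))}
    for u, v in mst_edges(points, dist):
        children[u].append(v)
    tour, stack = [], [0]
    while stack:
        u = stack.pop()
        tour.append(u)
        stack.extend(reversed(children[u]))
    return tour

def tour_length(points, tour, dist):
    """Length of the closed tour that returns to its starting point."""
    k = len(tour)
    return sum(dist(points[tour[i]], points[tour[(i + 1) % k]]) for i in range(k))
```

For example, `mst_tour(points, math.dist)` on a list of coordinate tuples returns a permutation of the indices; Christofides' improvement replaces the doubling of the tree by a matching on odd-degree vertices and is not shown here.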

Euclidean TSP

Sanjeev Arora [JACM’98] and Joe Mitchell [SICOMP’99]: Euclidean TSP with fixed dimension admits a PTAS

Find a (1+ε)-approximate tour

In time n·(log n)^{ε^{-Õ(dimension)}}, where n = #points

(Extends to other norms)

They were awarded the 2010 Gödel Prize for this discovery

PTAS Beyond Euclidean?

To achieve a PTAS, two properties were assumed:

Euclidean space (at least approximately)

Fixed dimension

Are both these assumptions required?

Fixed dimension is necessary: no PTAS for (log n)-dimensions unless P=NP [Trevisan’00]

Is Euclidean necessary? Consider metric spaces with low intrinsic dimension…


Doubling Dimension

Definition: Ball B(x,r) = all points within distance r from x.

The doubling constant λ(M) of a metric M is the minimum value λ > 0 such that every ball can be covered by λ balls of half the radius

First used by [Assouad’83], algorithmically by [Clarkson’97]

The doubling dimension is ddim(M) = log_2 λ(M) [Gupta-K.-Lee’03]

M is called doubling if its doubling dimension is constant

Packing property of doubling spaces: a set with diameter D > 0 and inter-point distance ≥ a contains at most (D/a)^{O(ddim)} points

(Figure: a ball covered by balls of half the radius; here λ ≤ 7)
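To make the definition concrete, here is a small sketch that computes an upper bound on the doubling constant of a finite point set. It is an illustration under stated assumptions, not a procedure from the talk: the names `greedy_net` and `doubling_constant_upper_bound` are hypothetical, ball centers are restricted to the input points, and each ball is covered greedily, so the result can overestimate the true minimum.

```python
import math

def greedy_net(points, radius, dist):
    """Greedy net: chosen centers are pairwise more than `radius` apart,
    and every input point is within `radius` of some chosen center."""
    centers = []
    for p in points:
        if all(dist(p, c) > radius for c in centers):
            centers.append(p)
    return centers

def doubling_constant_upper_bound(points, dist):
    """Max, over balls B(x, r), of the size of a greedy cover by (r/2)-balls.
    Only radii equal to pairwise distances are tried: between two consecutive
    distances the ball stays the same while the covering radius only grows."""
    radii = sorted({dist(p, q) for p in points for q in points if p != q})
    lam = 1
    for x in points:
        for r in radii:
            ball = [p for p in points if dist(x, p) <= r]
            lam = max(lam, len(greedy_net(ball, r / 2, dist)))
    return lam

def doubling_dimension_upper_bound(points, dist):
    """ddim = log_2(lambda), using the greedy upper bound on lambda."""
    return math.log2(doubling_constant_upper_bound(points, dist))
```

For points on a line, such as `[(float(i),) for i in range(8)]` with `math.dist`, the bound stays a small constant no matter how many points are added, which is the behaviour the definition is meant to capture.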

Applications of Doubling Dimension

Nearest neighbor search [K.-Lee’04; HarPeled-Mendel’06; Beygelzimer-Kakade-Langford’06; Cole-Gottlieb’06]

Spanners, routing [Talwar’04; Kleinberg-Slivkins-Wexler’04; Abraham-Gavoille-Goldberg-Malkhi’05; Konjevod-Richa-Xia-Yu’07; Gottlieb-Roditty’08; Elkin-Solomon’12]

Distance oracles [HarPeled-Mendel’06; Bartal-Gottlieb-Roditty-Kopelowitz-Lewenstein’11]

Dimension reduction [Bartal-Recht-Schulman’11; Gottlieb-K.’11]

Machine learning and statistics [Bshouty-Li-Long’09; Gottlieb-Kontorovich-K.’10,’12]



PTAS for Metric TSP?

Does TSP on doubling metrics admit a PTAS?

Arora and Mitchell made strong use of Euclidean properties

“Most fascinating problem left open in this area” [James Lee, tcsmath blog, June ’10]

Some attempts:

Quasi-PTAS [Talwar’04] (first description of problem)

Quasi-PTAS for TSP w/neighborhoods [Mitchell’07; Chan-Elbassioni’11]

Subexponential-TAS, under weaker assumption [Chan-Gupta’08]

Our result: TSP on doubling metrics admits a PTAS

Find a (1+ε)-approximate tour

In time: n^{2^{O(ddim)}} · 2^{ε^{-Õ(ddim)} · 2^{O(ddim^2)} · log^{1/2} n}

Euclidean (to compare): n·(log n)^{ε^{-Õ(dimension)}}

Throughout, think of ddim and ε as constants

Metric Partition

A quadtree-like hierarchy [Bartal’96, Gupta-K.-Lee’03, Talwar’04]

At level i:

Centers are 2^i apart, in arbitrary order

Random radii R_i ∈ [2^i, 2·2^i]
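One level of such a partition can be sketched as below. This is an illustration only: `greedy_net` and `partition_level` are hypothetical names, the centers are taken to be a greedy net with spacing 2^i, and an independent random radius is drawn per center (the slides do not pin down whether radii are shared across a level). Each point joins the first center, in the arbitrary net order, whose ball contains it.

```python
import math
import random

def greedy_net(points, spacing, dist):
    """Centers pairwise more than `spacing` apart; every point within `spacing` of one."""
    centers = []
    for p in points:
        if all(dist(p, c) > spacing for c in centers):
            centers.append(p)
    return centers

def partition_level(points, i, dist, rng):
    """Split `points` into clusters of radius R in [2^i, 2*2^i] around net centers."""
    scale = 2.0 ** i
    centers = greedy_net(points, scale, dist)
    radii = [rng.uniform(scale, 2 * scale) for _ in centers]
    clusters = [[] for _ in centers]
    for p in points:
        for k, (c, radius) in enumerate(zip(centers, radii)):
            if dist(p, c) <= radius:      # first ball (in center order) that contains p
                clusters[k].append(p)
                break
    return [cl for cl in clusters if cl]  # a center may itself fall in an earlier ball

# Usage: partition 200 random points in the unit square at level i = -2 (scale 1/4).
rng = random.Random(0)
pts = [(rng.random(), rng.random()) for _ in range(200)]
level = partition_level(pts, -2, math.dist, rng)
```

Because every point lies within 2^i of some net center and every radius is at least 2^i, each point is assigned to exactly one cluster. Applying the same routine inside each cluster at level i-1 gives the recursive hierarchy.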

Metric Partition (2)

A quadtree-like hierarchy [Bartal’96, Gupta-K.-Lee’03, Talwar’04]

Recursively to level i-1:

Random radii R_{i-1} ∈ [2^{i-1}, 2·2^{i-1}]

Caveat: log(n) hierarchical levels suffice; ignore tiny distances < 1/n^2

Dense Areas

Key observation: the points (metric space) can be decomposed into sparse areas

Call a level-i ball “dense” if the local tour weight (i.e. inside the R_i-ball) is ≥ R_i/ε

Such a ball can be removed, solving each sub-problem separately

Cost to join tours is relatively small: only R_i


Sparsification

Sparse decomposition:

Search hierarchy bottom-up for dense balls

Remove dense ball: it is composed of 2^{O(ddim)} sparse sub-balls, so it’s barely dense, i.e. local tour weight ≤ 2^{O(ddim)}·R_{i-1}/ε

Recurse on remaining point set

But how do we know the local weight of the tour in a ball?

Can be estimated using the local MST, modulo caveats like “long” edges…

OPT ∩ B(u,R) ≤ O(MST(S))

OPT ∩ B(u,3R) ≥ Ω(MST(S)) − ε^{-O(ddim)}·R

Henceforth, we assume the input is sparse
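The density test can be illustrated with the local MST standing in for the unknown local tour weight. The sketch below is a simplification under stated assumptions: `mst_weight` and `is_dense` are hypothetical names, the hidden constants in the O(·) and Ω(·) bounds are taken to be 1, and the caveat about “long” edges is ignored.

```python
import math

def mst_weight(points, dist):
    """Total weight of a minimum spanning tree (Prim's algorithm, O(n^2))."""
    if len(points) < 2:
        return 0.0
    # best[v] = cheapest connection of point v to the tree grown from point 0
    best = {v: dist(points[0], points[v]) for v in range(1, len(points))}
    total = 0.0
    while best:
        u = min(best, key=best.get)
        total += best.pop(u)
        for v in best:
            best[v] = min(best[v], dist(points[u], points[v]))
    return total

def is_dense(points, center, radius, eps, dist):
    """Proxy test: the MST of the points inside B(center, radius) weighs >= radius/eps."""
    ball = [p for p in points if dist(center, p) <= radius]
    return mst_weight(ball, dist) >= radius / eps

# Usage: a 10 x 10 unit grid is dense at a large scale but not at a small one.
grid = [(float(x), float(y)) for x in range(10) for y in range(10)]
dense_large = is_dense(grid, (0.0, 0.0), 20.0, 0.25, math.dist)
dense_small = is_dense(grid, (0.0, 0.0), 1.0, 0.25, math.dist)
```

A bottom-up search would run this test on the balls of each level, starting from the smallest radii, and cut out the first dense ball it finds before recursing on the remaining points.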

Light Tours

Definition: A tour is (m,r)-light on a hierarchy if it enters every cell (cluster)

at most r times, and

only via m designated portals

Choose portals as (2^i/M)-net points; then m = M^{O(ddim)}

(Figure label: 2^{i-1}/M)

Optimizing over Light Tours

Theorem [Arora’98, Talwar’04]: Given a hierarchical partition, a minimum-length (m,r)-light tour for it can be computed exactly

In time m^{r·O(ddim)}·n·log n

Via dynamic programming: join tours for small clusters into tour for larger cluster

Typically both m,r ≈ polylog(n/ε), thus m^r ≈ n^{polylog n}

Better Partitions and Lighter Tours

Our Theorem: For every (optimal) tour T, there is a partition with an (m,r)-light tour T’ such that

M = ddim·log n/ε

m = M^{O(ddim)} = (log n/ε)^{Õ(ddim)}

r = ε^{-O(ddim)}·loglog n

and length(T’) ≤ (1+ε)·length(T)

Now m^r ≈ poly(n)

If the partition were known, then a tour like T’ could be found in time m^{r·O(ddim)}·n·log n = n·2^{ε^{-Õ(ddim)}·(loglog n)^2}

It remains to prove the Theorem (a bit later), and show how to find the partition (after that)
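As a rough check of the running time for a known partition, one can substitute the Theorem's values of m and r into the dynamic-programming bound. This is a sketch only: logarithms are base 2, and constants, factors of ddim and factors of log(1/ε) are absorbed into the Õ notation.

```latex
\begin{align*}
m^{\,r\cdot O(\mathrm{ddim})}
  &= 2^{\,O(\mathrm{ddim})\cdot r\cdot \log m},\\
\log m
  &= \tilde{O}(\mathrm{ddim})\cdot \log\frac{\log n}{\varepsilon},
  \qquad
  r = \varepsilon^{-O(\mathrm{ddim})}\,\log\log n,\\
m^{\,r\cdot O(\mathrm{ddim})}
  &= 2^{\,\varepsilon^{-\tilde{O}(\mathrm{ddim})}\,(\log\log n)^{2}}
   = n^{o(1)}
  \quad\text{for fixed } \mathrm{ddim} \text{ and } \varepsilon .
\end{align*}
```

The last step uses (loglog n)^2 = o(log n), so the dynamic program contributes only a sub-polynomial factor on top of n·log n.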

Constructing Light Tours

Modify a tour T to be (m,r)-light [Arora’98, Talwar’04]

Part I: Focus on m (i.e. net points)

Move cut edges to be incident on net points

(Figure label: 2^{i-1}/M)

Expected cost at one level (for edge of unit length):

Radius R_{i-1} ≈ 2^{i-1}

Pr[cut edge] ≤ O(ddim/R_{i-1})

Expected cost ≤ (R_{i-1}/M)·(ddim/R_{i-1}) = ddim/M = ε/log n

Expected cost to edge over all levels: ≤ log n · ε/log n = ε

We thus constructed a (1+ε)-approximate tour

Constructing Light Tours (2)

Modify a tour to be (m,r)-light [Arora’98, Talwar’04]

Part II: Focus on r (i.e. number of crossing edges)

Reduce number of crossings

Patching step: reroute (almost all) crossings back into cluster

Cost ≈ length of tour on the patched endpoints ≈ MST of these points

MST Theorem [Talwar’04]: For a set S of points, MST(S) ≤ diam(S)·|S|^{1-1/ddim}

Cost per point ≤ diam(S)/|S|^{1/ddim}

Constructing Light Tours (3)

Modify a tour to be (m,r)-light [Arora’98, Talwar’04]

Expected cost to edge at level i-1:

Radius R_{i-1} ≈ 2^{i-1}

Pr[edge is patched] ≤ Pr[edge is cut]

Expected cost ≤ (R_{i-1}/r^{1/ddim})·(ddim/R_{i-1}) = ddim/r^{1/ddim}

As before, want this to be ≤ ε/log n (because we sum over log n levels)

Could take r = (ddim·log n/ε)^{ddim}

But dynamic program runs in time m^r: QPTAS! [Talwar’04]

Challenge: smaller value for r

(Figure label: 2R_{i-1})

Patching in Sparse Areas

Suppose a tour is q-sparse with respect to hierarchy: every R-ball contains weight qR (for all R = 2^i)

Expectation: random R-ball cuts weight Rq/R = q

Cluster formed by cuts from many levels

Expectation: weight q is cut per level

If r = q·2·loglog n

Expectation: level i-1 patching includes edges cut at much higher levels

Charge only “top” half of patched edges

Each charged about 2R_{i-1}

Pr[edge is charged for patching] ≤ Pr[edge is cut at level i+loglog n] ≤ ddim/(R_{i-1}·log n)

(Figure label: R_{i-1}/M)

Wrapping Up (Patching Sparse Areas)

Modify a tour to be (m,r)-light [Arora’98, Talwar’04]

Expected cost at level i-1:

Expected cost ≤ (R_{i-1}/r^{1/ddim})·(ddim/(R_{i-1}·log n)) = ddim/(log n·r^{1/ddim})

As before, want this term to be equal to ε/log n

Take r = (ddim/ε)^{ddim}

Obtain PTAS!

(Figure label: 2R_{i-1})
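The two choices of r come from solving the per-level cost bound for r. Spelled out, and treating the target inequality as an equality as the slides do:

```latex
\begin{align*}
\text{general case:}\quad
  \frac{\mathrm{ddim}}{r^{1/\mathrm{ddim}}} = \frac{\varepsilon}{\log n}
  &\;\Longrightarrow\;
  r = \Bigl(\frac{\mathrm{ddim}\cdot\log n}{\varepsilon}\Bigr)^{\mathrm{ddim}},\\[4pt]
\text{sparse case:}\quad
  \frac{\mathrm{ddim}}{\log n\cdot r^{1/\mathrm{ddim}}} = \frac{\varepsilon}{\log n}
  &\;\Longrightarrow\;
  r = \Bigl(\frac{\mathrm{ddim}}{\varepsilon}\Bigr)^{\mathrm{ddim}}.
\end{align*}
```

The extra 1/log n factor in the charging probability cancels the log n levels, so the second value of r no longer grows polynomially in log n; that is what takes m^r out of the quasi-polynomial range.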

Technical Subtleties

Outstanding problem: previous analysis assumed ball cuts only q edges

True in expectation… not good enough

Solution: try many hierarchies

Choose at random log n radii for each ball and try all their combinations!

WHP, some hierarchy cuts q edges in every ball

Drives up runtime of dynamic program

Algorithmic Frontiers of Doubling Metrics

Robert Krauthgamer Weizmann Institute of Science

Joint work with Lee-Ad Gottlieb and Aryeh Kontorovich

Machine Learning in Doubling Metrics

Large-margin classification in metric spaces [vonLuxburg-Bousquet’04]

Unknown distribution D of labeled points (x,y) ∈ M × {-1,1}

M is a metric space (generalizes R^{dim})

Labels are L-Lipschitz: |y_i - y_j| ≤ L·d(x_i,x_j) (generalizes margin)

Resource: sample of labeled points

Goal: build hypothesis f: M → {-1,1} that has (1-ε)-agreement with D

Statistical complexity: how many samples needed?

Computational complexity: running time?

Extensions:

Small fraction of labels are wrong (adversarial noise)

Real-valued labels y ∈ [-1,1] (metric regression)

(Figure: points labeled -1 and +1 at distance 2/L, with hypothesis f)

Generalization Bounds

Our approach: Assume M is doubling and use generalized VC-theory [Alon-BenDavid-CesaBianchi-Haussler’97, Bartlett-ShaweTaylor’99]

Example: Earthmover distance (EMD) in the plane between sets of size k has ddim ≤ O(k log k)

Standard algorithm: pick hypothesis that fits all/most observed samples

Theorem: Class of L-Lipschitz functions has fat-shattering dimension fsdim ≤ (c·L·diam(M))^{ddim}.

Corollary: If f is L-Lipschitz and classifies n samples correctly, WHP

Pr_D[sgn(f(x)) ≠ y] ≤ O(fsdim·(log n)^2/n).

Similarly, if f correctly classifies all but η-fraction, then WHP

Pr_D[sgn(f(x)) ≠ y] ≤ η + O(fsdim·(log n)^2/n)^{1/2}.

Bounds incomparable to [vonLuxburg-Bousquet’04]


Algorithmic Aspects (noise-free)

Computing a hypothesis f from the samples (x_i,y_i):

f: x ↦ min_i ( y_i + 2·d(x,x_i)/d(S^+,S^-) )

where S^+ and S^- are the positively and negatively labeled samples

Lemma (Lipschitz extension): If labels are L-Lipschitz, so is f.

Evaluating f(x) requires solving Nearest Neighbor Search

Explains a common classification heuristic, e.g. [Cover-Hart’67]

But might require Ω(n) time…

We show how to use (1+ε)-Nearest Neighbor Search

This can be solved quickly in doubling metrics

We prove similar generalization bound by sandwiching sgn(f(x))

(Figure: points labeled -1 and +1, hypothesis f, and a query point “?”)
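The hypothesis above can be evaluated directly by a linear scan, which is the Ω(n)-time baseline the slide refers to. The sketch below implements the formula as written on the slide; `make_classifier` is a hypothetical name, samples are assumed to be `(x, y)` pairs with y in {-1, +1}, and `dist` is any metric. It does not include the (1+ε)-approximate nearest neighbor speed-up.

```python
import math

def make_classifier(samples, dist):
    """Build f(x) = min_i ( y_i + 2*d(x, x_i) / d(S+, S-) ) and its sign classifier."""
    pos = [x for x, y in samples if y > 0]
    neg = [x for x, y in samples if y < 0]
    # d(S+, S-): smallest distance between oppositely labeled samples
    margin = min(dist(p, q) for p in pos for q in neg)

    def f(x):
        return min(y + 2.0 * dist(x, xi) / margin for xi, y in samples)

    def classify(x):
        return 1 if f(x) > 0 else -1

    return f, classify

# Usage on the real line: negatives near 0, positives near 5.
samples = [((0.0,), -1), ((1.0,), -1), ((4.0,), 1), ((5.0,), 1)]
f, classify = make_classifier(samples, math.dist)
```

With this form of f, every term coming from a positive sample is at least 1, so the sign at x is decided by the distance from x to the nearest negatively labeled sample. That distance is what a nearest neighbor data structure would supply in place of the linear scan.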

Extensions (noisy case)

1. A small fraction of labels are wrong (adversarial noise)

How to compute a hypothesis?

Build a bipartite graph (on S^+ ∪ S^-) of all violations to Lipschitz condition (edge between two points at distance < 2/L)

Compute a minimum vertex cover (or faster: 2-approximation)

2. Real-valued labels y ∈ [-1,1] (metric regression)

Minimize risk (expected loss) E_{x,y} |f(x)-y|

Extend the statistical framework by similar ideas

But how to compute a hypothesis?

Write LP: minimize Σ_i |f(x_i)-y_i| subject to |f(x_i)-f(x_j)| ≤ L·d(x_i,x_j) ∀ i,j

Reduce #constraints from O(n^2) to O(ε^{-ddim}·n) using (1+ε)-spanner on x_i’s

Apply fast approximate LP solver
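For the adversarial-noise case (item 1), the violation graph and the faster 2-approximate vertex cover can be sketched as follows. The names `violation_edges`, `vertex_cover_2approx` and `prune_noisy` are hypothetical, the cover is obtained from a greedy maximal matching (a standard 2-approximation, swapped in here for an exact bipartite minimum vertex cover), and the quadratic pair scan is for illustration only.

```python
import math

def violation_edges(samples, L, dist):
    """Pairs of oppositely labeled samples closer than 2/L (Lipschitz violations)."""
    edges = []
    for i, (xi, yi) in enumerate(samples):
        for j in range(i + 1, len(samples)):
            xj, yj = samples[j]
            if yi != yj and dist(xi, xj) < 2.0 / L:
                edges.append((i, j))
    return edges

def vertex_cover_2approx(edges):
    """Take both endpoints of each edge of a greedy maximal matching."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover

def prune_noisy(samples, L, dist):
    """Drop the samples in the cover; the remaining labels are L-Lipschitz."""
    bad = vertex_cover_2approx(violation_edges(samples, L, dist))
    return [s for k, s in enumerate(samples) if k not in bad]

# Usage: one mislabeled point at 1.1 sits among the negatives.
noisy = [((0.0,), -1), ((1.0,), -1), ((1.1,), 1), ((2.0,), -1), ((10.0,), 1), ((11.0,), 1)]
kept = prune_noisy(noisy, 1.0, math.dist)
```

The matching-based cover may discard up to twice as many samples as necessary; in the example the exact cover would remove only the point at 1.1. The regression LP of item 2 is not sketched here.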


Conclusion

General paradigm: low-dim. Euclidean spaces ↔ doubling metric spaces

Mathematically: the latter is a different (strictly bigger) family

Not even low-distortion embeddings [Laakso’00,’01]

For algorithmic efficiency: strong analogy/similarity

E.g., nearest neighbor search, distributed computing and networking, combinatorial optimization, machine learning

Research directions:

Other computational tasks or application areas? Particularly in machine learning, data structures

Scenarios where analogy fails? E.g. [Indyk-Naor’05] which uses random projections

Other metric models? E.g. hyperbolic …

