Ryan O’Donnell (CMU, IAS)
joint work with
Yi Wu (CMU, IBM), Yuan Zhou (CMU)
Locality Sensitive Hashing [Indyk–Motwani ’98]
h : objects → sketches
H : family of hash functions h s.t.
“similar” objects collide w/ high prob.
“dissimilar” objects collide w/ low prob.
Abbreviated history
Broder ’97, Altavista
[figure: documents A, B as 0/1 indicator vectors over words 1 … d, e.g.
A = 0 1 1 1 0 0 1 0 0
B = 1 1 1 0 0 0 1 0 1]
Jaccard similarity: |A ∩ B| / |A ∪ B|
Invented simple H s.t. Pr[h(A) = h(B)] = |A ∩ B| / |A ∪ B|.
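Broder’s family can be sketched with random permutations (min-hashing); the sets A, B and universe size below are illustrative, not from the slides:

```python
import random

def make_minhash(universe_size, seed):
    # h_pi(A) = minimum of pi(x) over x in A, for a random permutation pi.
    pi = list(range(universe_size))
    random.Random(seed).shuffle(pi)
    return lambda A: min(pi[x] for x in A)

# Over a random pi, h(A) = h(B) iff the minimum of pi over A ∪ B lands
# in A ∩ B, which happens with probability |A ∩ B| / |A ∪ B|.
A, B = {0, 1, 2, 3}, {2, 3, 4, 5}
trials = 20000
hits = sum(make_minhash(10, s)(A) == make_minhash(10, s)(B)
           for s in range(trials))
print(hits / trials)  # close to |A ∩ B| / |A ∪ B| = 2/6 ≈ 0.333
```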
Indyk–Motwani ’98 (cf. Gionis–I–M ’98)
Defined LSH.
Invented very simple H good for
{0, 1}^d under Hamming distance.
Showed good LSH implies good
nearest-neighbor-search data structs.
Charikar ’02, STOC
Proposed alternate H (“simhash”) for
Jaccard similarity.
Many papers about LSH

Practice:
Free code base [AI’04]
Sequence comparison in bioinformatics
Association-rule finding in data mining
Collaborative filtering
Clustering nouns by meaning in NLP
Pose estimation in vision
• • •

Theory:
[Broder ’97]
[Indyk–Motwani ’98]
[Gionis–Indyk–Motwani ’98]
[Charikar ’02]
[Datar–Immorlica–Indyk–Mirrokni ’04]
[Motwani–Naor–Panigrahy ’06]
[Andoni–Indyk ’06]
[Terasawa–Tanaka ’07]
[Neylon ’10]
[Andoni–Indyk ’08, CACM]
Given: (X, dist), r > 0, c > 1
(distance space, “radius”, “approx. factor”)
Goal: Family H of functions X → S
(S can be any finite set)
s.t. ∀ x, y ∈ X:
dist(x, y) ≤ r ⇒ Pr_{h∈H}[h(x) = h(y)] ≥ p,
dist(x, y) ≥ cr ⇒ Pr_{h∈H}[h(x) = h(y)] ≤ q.
Quality: write p = q^ρ; want p ≥ q^{.5}, better p ≥ q^{.25}, better p ≥ q^{.1}, …
i.e., the smaller ρ = ln(1/p)/ln(1/q), the better.
Theorem [IM’98, GIM’98]
Given LSH family for (X, dist),
can solve “(r, cr)-near-neighbor search”
for n points with data structure of
size: O(n^{1+ρ}),
query time: Õ(n^ρ) hash fcn evals.
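The way such a data structure typically uses H (a standard amplification sketch, with illustrative parameters k, L; the theory sets k ≈ log_{1/q} n and L ≈ n^ρ):

```python
import random

def build_tables(points, sample_h, k, L, seed=0):
    # L hash tables, each keyed by g(x) = (h_1(x), ..., h_k(x)) with the
    # h_i drawn independently from the LSH family H.
    rng = random.Random(seed)
    tables = []
    for _ in range(L):
        g = [sample_h(rng) for _ in range(k)]
        table = {}
        for p in points:
            table.setdefault(tuple(h(p) for h in g), []).append(p)
        tables.append((g, table))
    return tables

def candidates(tables, q):
    # Near-neighbor candidates: points colliding with q in at least one table.
    out = set()
    for g, table in tables:
        out.update(table.get(tuple(h(q) for h in g), []))
    return out

# Illustrative family: the [IM'98] random-coordinate hash on {0,1}^d.
d = 16
def sample_h(rng):
    i = rng.randrange(d)
    return lambda x, i=i: x[i]

rng = random.Random(1)
pts = [tuple(rng.randrange(2) for _ in range(d)) for _ in range(50)]
tabs = build_tables(pts, sample_h, k=4, L=8)
print(len(candidates(tabs, pts[0])))  # always includes pts[0] itself
```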
Example
X = {0,1}^d, dist = Hamming
r = ϵd, c = 5
[figure: strings x, y with dist(x, y) either ≤ ϵd or ≥ 5ϵd, e.g.
x = 0 1 1 1 0 0 1 0 0
y = 1 1 1 0 0 0 1 0 1]
H = { h1, h2, …, hd }, hi(x) = xi  [IM’98]
(“output a random coord.”)
Analysis
Pr[h(x) = h(y)] = 1 − dist(x, y)/d, so:
dist ≥ 5ϵd ⇒ Pr[h(x) = h(y)] ≤ 1 − 5ϵ = q,
dist ≤ ϵd ⇒ Pr[h(x) = h(y)] ≥ 1 − ϵ = q^ρ.
(1 − 5ϵ)^{1/5} ≤ 1 − ϵ. ∴ ρ ≤ 1/5.
In general, achieves ρ ≤ 1/c ∀ c (∀ r).
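For c = 5 the exponent ρ = ln(1/p)/ln(1/q) of the random-coordinate family indeed lands just under 1/c; a quick numeric check (illustrative, not from the slides):

```python
import math

eps, c = 0.02, 5
p = 1 - eps        # collision prob. at distance eps*d
q = 1 - c * eps    # collision prob. at distance c*eps*d
rho = math.log(1 / p) / math.log(1 / q)
print(rho)  # ≈ 0.19, just under 1/c = 0.2
```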
Optimal upper bound?
( {0,1}^d, Ham ), r > 0, c > 1.
S ≝ {0,1}^d ∪ {✔}, H ≝ { h_ab : dist(a, b) ≤ r },
h_ab(x) = ✔ if x = a or x = b,
          x otherwise.
Then q = 0 while p is positive,
so ρ = ln(1/p)/ln(1/q) beats 0.5, 0.1, 0.01, 0.0001, …
Wait, what?
[IM’98, GIM’98] Theorem:
Given LSH family for (X, dist),
can solve “(r,cr)-near-neighbor search”
for n points with data structure of
size: Õ(n^{1+ρ})
query time: Õ(n^ρ) hash fcn evals
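The degenerate family above can be checked directly for small d (an illustrative computation, with ✔ represented as a string):

```python
from itertools import product

d, r, c = 6, 1, 3
cube = list(product((0, 1), repeat=d))
dist = lambda a, b: sum(x != y for x, y in zip(a, b))

# The degenerate family: h_ab sends a and b to a check mark, all else to itself.
H = [(a, b) for a in cube for b in cube if a < b and dist(a, b) <= r]
def h(ab, x):
    return "✔" if x in ab else x

def collision_prob(x, y):
    return sum(h(ab, x) == h(ab, y) for ab in H) / len(H)

# p: min collision prob. over pairs at distance <= r  -> positive (= 1/|H|)
p = min(collision_prob(x, y) for x in cube for y in cube
        if x != y and dist(x, y) <= r)
# q: max collision prob. over pairs at distance >= c*r -> exactly 0
q = max(collision_prob(x, y) for x in cube for y in cube if dist(x, y) >= c * r)
# So rho = ln(1/p)/ln(1/q) = 0: the definition degenerates when q can be tiny.
print(p, q)
```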
More results
For Jaccard similarity: ρ ≤ 1/c [Bro’97]
For R^d with ℓp-distance: ρ ≤ 1/c^p
when p = 1 [IM’98], 0 < p < 1 [DIIM’04], p = 2 [AI’06]
For {0,1}^d with Hamming distance:
ρ ≥ 0.462/c − o_d(1) (assuming q ≥ 2^{−o(d)}) [MNP’06];
extends immediately to ℓp-distance.
Our Theorem
For {0,1}^d with Hamming distance:
ρ ≥ 1/c − o_d(1) (assuming q ≥ 2^{−o(d)}) (∃ r s.t.);
immediately gives ρ ≥ 1/c^p − o_d(1)
for ℓp-distance.
Proof also yields ρ ≥ 1/c for Jaccard.
Proof:
Noise-stability is log-convex.
(A definition, and two lemmas.)
Fix any function h : {0,1}^d → S.
Pick x ∈ {0,1}^d at random:
x = 0 1 1 1 0 0 1 0 0, h(x) = s.
Run a continuous-time (lazy)
random walk from x for time τ:
y = 0 0 1 1 0 0 1 1 0, h(y) = s′.
def: K_h(τ) ≝ Pr[h(x) = h(y)].
Lemma 1: For x →τ→ y,
dist(x, y) ≈ (τ/2)·d
when τ ≪ 1.
Lemma 2: K_h(τ) is a log-convex function of τ.
(for any h)
From which the proof of ρ ≥ 1/c follows easily.
[plot: K_h(τ) decaying from 1 at τ = 0]
Continuous-Time Random Walk
An alarm clock: repeatedly
— waits Exponential(1) seconds,
— dings.
(Reminder: T ~ Expon(1) means Pr[T > u] = e^{−u}.)
In C.T.R.W. on {0,1}^d, each coord. gets
its own independent alarm clock.
When the ith clock dings, coord. i is rerandomized.
[figure: walk from x to y over time τ, rerandomized coords. marked, e.g.
x = 0 1 1 1 0 0 1 0 0 1
y = 0 1 0 1 0 0 1 0 1 1]
Pr[coord. i never updated] = Pr[Exp(1) > τ] = e^{−τ}
∴ Pr[xi ≠ yi] = (1 − e^{−τ})/2
⇒ Lemma 1: dist(x, y) ≈ (1 − e^{−τ})·(d/2) ≈ (τ/2)·d for τ ≪ 1.
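A quick simulation (illustrative; the parameters d, τ are arbitrary) confirms the flip probability and the distance estimate in Lemma 1:

```python
import math, random

def ctrw_step(x, tau, rng):
    # Continuous-time lazy walk: coordinate i is rerandomized iff its
    # Exponential(1) clock dings before time tau, i.e. with prob. 1 - e^(-tau),
    # so Pr[x_i != y_i] = (1 - e^(-tau)) / 2.
    return [rng.randrange(2) if rng.random() < 1 - math.exp(-tau) else xi
            for xi in x]

rng = random.Random(0)
d, tau, trials = 200, 0.3, 2000
total_dist = 0
for _ in range(trials):
    x = [rng.randrange(2) for _ in range(d)]
    y = ctrw_step(x, tau, rng)
    total_dist += sum(xi != yi for xi, yi in zip(x, y))

avg = total_dist / trials
expected = (1 - math.exp(-tau)) * d / 2   # Lemma 1: ≈ (tau/2)·d for small tau
print(avg, expected)
```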
Lemma 2: K_h(τ) is a log-convex function of τ.
Remark: True for any reversible C.T.M.C.
Recall: For f : {0,1}^d → ℝ,
E[f(x) f(y)] = Σ_{S⊆[d]} f̂(S)² e^{−|S|τ}.
Given hash function h : {0,1}^d → S,
for each s ∈ S, introduce
h_s : {0,1}^d → {0,1}, h_s(x) = 1{h(x)=s}.
Proof of Lemma 2:
K_h(τ) = Σ_s E[h_s(x) h_s(y)] = Σ_s Σ_{S⊆[d]} ĥ_s(S)² e^{−|S|τ},
a non-neg. lin. comb. of the log-convex fcns. e^{−|S|τ}, hence log-convex.
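Lemma 2 can be sanity-checked by brute force on a small cube, for an arbitrary h (an illustration, not part of the proof):

```python
import math
from itertools import product

def K(h, d, tau):
    # Exact K_h(tau) = Pr[h(x) = h(y)] for x uniform and y a time-tau
    # lazy walk from x: each coord. flips with prob. (1 - e^(-tau)) / 2.
    flip = (1 - math.exp(-tau)) / 2
    total = 0.0
    for x in product((0, 1), repeat=d):
        for y in product((0, 1), repeat=d):
            nflip = sum(a != b for a, b in zip(x, y))
            total += flip**nflip * (1 - flip)**(d - nflip) * (h(x) == h(y))
    return total / 2**d

d = 4
h = lambda x: (x[0] ^ x[1], x[2] & x[3])   # an arbitrary hash into a small set
t1, t2 = 0.3, 0.9
mid = math.log(K(h, d, (t1 + t2) / 2))
avg = (math.log(K(h, d, t1)) + math.log(K(h, d, t2))) / 2
# log-convexity: ln K at the midpoint is at most the average at the endpoints
print(mid, avg)
```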
Recap:
Lemma 1: For x →τ→ y, dist(x, y) ≈ (τ/2)·d when τ ≪ 1.
Lemma 2: K_h(τ) is a log-convex function of τ.
These give the Theorem: LSH for {0,1}^d requires ρ ≥ 1/c − o_d(1).
Proof: Say H is an LSH family for {0,1}^d
with params r, (c − o(1))r, p = q^ρ, q.
def: K_H(τ) ≝ avg_{h∈H} K_h(τ).
(Non-neg. lin. comb. of log-convex fcns.,
∴ K_H(τ) is also log-convex.)
Choose ϵ so that w.v.h.p. dist(x, y) ≤ r at time ϵ,
and dist(x, y) ≥ (c − o(1))r at time cϵ. Then:
K_H(ϵ) ≳ q^ρ, i.e. ln K_H(ϵ) ≥ ρ ln q;
K_H(cϵ) ≲ q, i.e. ln K_H(cϵ) ≤ ln q
(in truth, q + 2^{−Θ(d)}; we assume q not tiny);
K_H(0) = 1, i.e. ln K_H(0) = 0.
[plot: ln K_H(τ) vs. τ, marking 0 at τ = 0, ρ ln q at τ = ϵ, ln q at τ = cϵ]
Since K_H(τ) is log-convex, ln K_H(ϵ) lies below the chord from (0, 0) to (cϵ, ln K_H(cϵ)):
ρ ln q ≤ ln K_H(ϵ) ≤ (1/c) ln K_H(cϵ) ≤ (1/c) ln q.
Dividing by ln q < 0 flips the inequality: ∴ ρ ≥ 1/c. ∎
Super-tedious, super-straightforward details:
Make Lemma 1 precise. (Chernoff)
Make the ≳/≲ steps precise. (Taylor)
Choose ϵ = ϵ(c, q, d) very carefully.
Theorem: LSH for {0,1}^d with Hamming distance requires ρ ≥ 1/c − o_d(1).
Meaningful iff q ≥ 2^{−o(d)}; i.e., q not tiny.