
Probabilistic Hashing for Efficient Search and Learning

Ping Li

Department of Statistical Science

Faculty of Computing and Information Science

Cornell University

Ithaca, NY 14853

Fall 2012, Cornell STSCI6520


Practice of Statistical Analysis and Data Mining

Key Components:

Models (Methods) + Variables (Features) + Observations (Samples)

Often a Good Practice:

Simple Models (Methods) + Lots of Features + Lots of Data

or simply

Simple Methods + BigData


BigData Everywhere

Conceptually, consider a dataset as a matrix of size n × D.

In modern applications, # examples n = 10^6 is common and n = 10^9 is not rare; for example, images, documents, spam, and search click-through data.

High-dimensional (image, text, biological) data are common: D = 10^6 (million), D = 10^9 (billion), D = 10^12 (trillion), D = 2^64, or even higher. In a sense, D can be arbitrarily high by considering pairwise, 3-way, or higher-order interactions.


Examples of BigData Challenges: Linear Learning

Binary classification: Dataset {(x_i, y_i)}_{i=1}^n, x_i ∈ R^D, y_i ∈ {−1, 1}.

One can fit an L2-regularized linear logistic regression:

    min_w  (1/2) w^T w + C Σ_{i=1}^n log(1 + e^{−y_i w^T x_i}),

or the L2-regularized linear SVM:

    min_w  (1/2) w^T w + C Σ_{i=1}^n max{1 − y_i w^T x_i, 0},

where C > 0 is the penalty (regularization) parameter.

The weight vector w has length D, the same as the data dimensionality.
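The two objectives above map directly onto standard solvers. A minimal sketch, assuming scikit-learn's LIBLINEAR-backed implementations (the deck uses LIBLINEAR later) and a placeholder file name for a libsvm-format dataset:

```python
# Fit the L2-regularized linear SVM and logistic regression written above.
# "webspam.svmlight" is a hypothetical path; C is the penalty parameter.
from sklearn.datasets import load_svmlight_file
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

X, y = load_svmlight_file("webspam.svmlight")   # sparse n x D matrix, labels in {-1, +1}

C = 1.0
svm = LinearSVC(C=C, loss="hinge").fit(X, y)                    # hinge loss + (1/2) w^T w
logit = LogisticRegression(C=C, solver="liblinear").fit(X, y)   # logistic loss + (1/2) w^T w

print(svm.coef_.shape, logit.coef_.shape)   # each weight vector w has length D
```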


Challenges of Learning with Massive High-dimensional Data

• The data often cannot fit in memory (even when the data are sparse).

• Data loading (or transmission over the network) takes too long.

• Training can be expensive, even for simple linear models.

• Testing may be too slow to meet the demand, which is especially crucial for applications in search, high-speed trading, or interactive visual data analytics.

• The model itself can be too large to store; for example, we cannot really store a vector of weights for logistic regression on data of 2^64 dimensions.

• Near neighbor search, for example, finding the most similar document among billions of Web pages without scanning them all.


Dimensionality Reduction and Data Reduction

Dimensionality reduction: Reducing D, for example, from 2^64 to 10^5.

Data reduction: Reducing the # nonzeros is often more important. With modern linear learning algorithms, the cost (for storage, transmission, computation) is mainly determined by the # nonzeros, not so much by the dimensionality.

Is PCA enough? PCA is infeasible for big data, at least not in real time. Updating PCA is non-trivial. PCA does not provide a good indexing scheme. In extremely high-dimensional data, PCA may not be very useful even if it can be computed.


High-Dimensional Data Generated by Feature Expansions

• Certain datasets (e.g., genomics) are naturally high-dimensional.

• For some applications, one can choose either

  – Low-dimensional representations + complex & slow methods

  or

  – High-dimensional representations + simple & fast methods

• Two popular feature expansion methods (a small sketch of both follows)

  – Global expansion: e.g., pairwise (or higher-order) interactions

  – Local expansion: e.g., word/character n-grams using contiguous words/characters.
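Both expansion styles are easy to implement; below is a small illustrative sketch (not from the slides) of a global pairwise expansion and a local character n-gram expansion:

```python
# Global expansion: all pairwise interaction features x_i * x_j.
# Local expansion: contiguous character n-grams as binary presence features.
from itertools import combinations

def pairwise_expansion(x):
    return list(x) + [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]

def char_ngrams(text, n=3):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

print(len(pairwise_expansion([1.0, 0.0, 2.0, 3.0])))      # 4 + C(4,2) = 10 features
print(sorted(char_ngrams("today is a nice day", n=3))[:5])
```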


Webspam: Text Data and Local Expansion

                 Dim.         Time      Accuracy
  1-gram         254          20 sec    93.30%
  Binary 1-gram  254          20 sec    87.78%
  3-gram         16,609,143   200 sec   99.6%
  Binary 3-gram  16,609,143   200 sec   99.6%
  Kernel         16,609,143   Days      99.6%

Experiments were based on linear SVM (LIBLINEAR) unless marked as “kernel”.


MNIST: Digit Image Data and Global Expansion

                    Dim.      Time      Accuracy
  Original          768       80 sec    92.11%
  Binary original   768       30 sec    91.25%
  Pairwise          295,296   400 sec   98.33%
  Binary pairwise   295,296   400 sec   97.88%
  Kernel            768       Hours     98.6%

(It is actually not easy to achieve the well-known 98.6% accuracy; one must use very precise kernel parameters.)


Observations from Feature Expansion Experiments

• 3-grams + linear SVM on webspam produces excellent classification results.

• All pairwise features + linear SVM on MNIST also produces excellent classification results, close to the well-known result from the Gaussian kernel.

• For low-dimensional data, binary quantization of the features may lead to serious degradation of classification performance.

• Binary quantization on high-dimensional data either does not affect the results at all (e.g., webspam) or does not hurt the performance much (e.g., MNIST).


Major issues with feature expansions:

• high-dimensionality

• high storage cost

• relatively high training/testing cost

The search industry has adopted the practice of

high-dimensional data + linear algorithms + probabilistic hashing

——————

Random projection can be viewed as a hashing method.


A Popular Solution Based on Normal Random Projections

Random Projections: Replace the original data matrix A by B = A × R.

    A × R = B

R ∈ R^{D×k}: a random matrix with i.i.d. entries sampled from N(0, 1).

B ∈ R^{n×k}: the projected matrix, also random.

B approximately preserves the Euclidean distances and inner products between any two rows of A. In particular, E(BB^T) = E(ARR^T A^T) = AA^T.

Therefore, we can simply feed B into (e.g.,) SVM or logistic regression solvers.
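A quick numerical check of the claim above (a sketch, not from the slides); the sizes are arbitrary, and the inner-product estimate is normalized by 1/k, matching the estimator used a few slides later:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, k = 3, 10_000, 1024
A = rng.random((n, D))                 # toy data matrix

R = rng.standard_normal((D, k))        # i.i.d. N(0, 1) projection matrix
B = A @ R

exact = A[0] @ A[1]
approx = (B[0] @ B[1]) / k             # approximately preserves the inner product
print(exact, approx)
```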


Very Sparse Random Projections

The projection matrix: R = {r_ij} ∈ R^{D×k}. Instead of sampling from normals, we sample from a sparse distribution parameterized by s ≥ 1:

    r_ij =  −1   with prob. 1/(2s)
             0   with prob. 1 − 1/s
             1   with prob. 1/(2s)

If s = 100, then on average 99% of the entries are zero.

If s = 10000, then on average 99.99% of the entries are zero.

——————-

Ref: Li, Hastie, Church, Very Sparse Random Projections, KDD’06.

Ref: Li, Very Sparse Stable Random Projections, KDD’07.
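A sketch of how such a sparse projection matrix can be generated (sizes and the value of s are illustrative):

```python
import numpy as np

def very_sparse_projection(D, k, s, rng):
    """Entries are -1, 0, +1 with probabilities 1/(2s), 1 - 1/s, 1/(2s)."""
    u = rng.random((D, k))
    R = np.zeros((D, k))
    R[u < 1.0 / (2 * s)] = -1.0
    R[u > 1.0 - 1.0 / (2 * s)] = 1.0
    return R

rng = np.random.default_rng(0)
R = very_sparse_projection(D=10_000, k=256, s=1000, rng=rng)
print((R != 0).mean())   # roughly 1/s = 0.001 of the entries are nonzero
```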


A Running Example Using (Small) Webspam Data

Dataset: 350K text samples, 16 million dimensions, about 4000 nonzeros on average, 24 GB disk space.

[Figure: histogram of the number of nonzeros per document in the webspam dataset]

Task: Binary classification for spam vs. non-spam.

——————–

The dataset was generated using 3-grams.


Very Sparse Projections + Linear SVM on Webspam Data

Red dashed curves: results based on the original data

[Figure: classification accuracy (%) vs. SVM regularization parameter C on webspam, for s = 1 and s = 1000, with curves for k = 32, 64, 128, 256, 512, 1024, 4096]

Observations:

• We need a large number of projections (e.g., k ≥ 4096) for high accuracy.

• The sparsity parameter s matters little unless k is small.


Very Sparse Projections + Linear SVM on Webspam Data

[Figure: classification accuracy (%) vs. C on webspam, for s = 1 and s = 10000, with curves for k = 32 to 4096]

As long as k is large (necessary for high accuracy), the projection matrix can be extremely sparse, even with s = 10000.


Many more experimental results (classification, clustering, and regression) on very sparse random projections, as well as b-bit minwise hashing, can be found at the 2012 Stanford Modern Massive Data Sets (MMDS) Workshop:

http://www.stanford.edu/group/mmds/slides2012/s-pli.pdf


Disadvantages of Random Projections (and Variants)

Inaccurate, especially on binary data.


Variance Analysis for Inner Product Estimates

    A × R = B

First two rows of A: u1, u2 ∈ R^D (D is very large):

    u1 = [u_{1,1}, u_{1,2}, ..., u_{1,i}, ..., u_{1,D}]
    u2 = [u_{2,1}, u_{2,2}, ..., u_{2,i}, ..., u_{2,D}]

First two rows of B: v1, v2 ∈ R^k (k is small):

    v1 = [v_{1,1}, v_{1,2}, ..., v_{1,j}, ..., v_{1,k}]
    v2 = [v_{2,1}, v_{2,2}, ..., v_{2,j}, ..., v_{2,k}]

    E(v1^T v2) = u1^T u2 = a


    â = (1/k) Σ_{j=1}^{k} v_{1,j} v_{2,j}   (which is also an inner product)

    E(â) = a

    Var(â) = (1/k) (m1 m2 + a^2),   where  m1 = Σ_{i=1}^{D} |u_{1,i}|^2,  m2 = Σ_{i=1}^{D} |u_{2,i}|^2

———

Random projections may not be good for inner products because the variance is dominated by the marginal l2 norms m1 m2, especially when a ≈ 0.

For real-world datasets, most pairs are often close to orthogonal (a ≈ 0).

——————-

Ref: Li, Hastie, and Church, Very Sparse Random Projections, KDD’06.

Ref: Li, Shrivastava, Moore, Konig, Hashing Algorithms for Large-Scale Learning, NIPS’11
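A quick Monte Carlo check (a sketch with arbitrary toy vectors) of the variance formula above, Var(â) = (m1 m2 + a^2)/k:

```python
import numpy as np

rng = np.random.default_rng(1)
D, k, trials = 1000, 64, 1000
u1 = (rng.random(D) < 0.05).astype(float)   # sparse 0/1 toy vectors
u2 = (rng.random(D) < 0.05).astype(float)
a, m1, m2 = u1 @ u2, u1 @ u1, u2 @ u2

est = np.empty(trials)
for t in range(trials):
    R = rng.standard_normal((D, k))
    est[t] = (u1 @ R) @ (u2 @ R) / k        # projection-based inner product estimate

print("empirical var:", est.var(), " formula:", (m1 * m2 + a * a) / k)
```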


b-Bit Minwise Hashing

• Simple algorithm designed for massive, high-dimensional, binary data.

• Much more accurate than random projections for estimating inner products.

• Much smaller space requirement than the original minwise hashing algorithm.

• Can estimate 3-way similarities, which random projections cannot.

• Large-scale linear learning (and kernel learning of course).

• Sub-linear time near neighbor search. (Slides available in the appendix)

——————–

Major drawback: very expensive preprocessing. (Now solved!)


Massive, High-dimensional, Sparse, and Binary Data

Binary sparse data are very common in the real world:

• For many applications (such as text), binary sparse data are very natural.

• Many datasets can be quantized/thresholded to be binary without hurting the prediction accuracy.

• In some cases, even when the “original” data are not too sparse, they often become sparse when considering pairwise and higher-order interactions.


Binary Data Vectors and Sets

A binary (0/1) vector in D dimensions can be viewed as a set S ⊆ Ω = {0, 1, ..., D − 1}.

Example: S1, S2, S3 ⊆ Ω = {0, 1, ..., 15} (i.e., D = 16).

[Figure: the three sets shown as rows of a 0/1 data matrix with 16 columns]

S1 = {1, 4, 5, 8},  S2 = {8, 10, 12, 14},  S3 = {3, 6, 7, 14}


Binary (0/1) Massive Text Data by Shingling

Shingling: Each document (Web page) can be viewed as a set of w-shingles.

For example, after parsing, a sentence “today is a nice day” becomes

• w = 1: “today”, “is”, “a”, “nice”, “day”

• w = 2: “today is”, “is a”, “a nice”, “nice day”

• w = 3: “today is a”, “is a nice”, “a nice day”

Previous studies used w ≥ 5, as a single-word (unigram) model is not sufficient.

Shingling generates extremely high-dimensional vectors, e.g., D = (10^5)^w. Note that (10^5)^5 = 10^25 ≈ 2^83, although in current practice it seems D = 2^64 suffices.
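A small sketch of w-shingling; hashing each shingle to a 64-bit integer (so that Ω = {0, ..., 2^64 − 1}) is an assumption made here for illustration:

```python
import hashlib

def w_shingles(text, w):
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def shingle_ids(text, w):
    """Map each w-shingle to an integer in {0, ..., 2^64 - 1}."""
    return {int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")
            for s in w_shingles(text, w)}

print(sorted(w_shingles("today is a nice day", 2)))
# ['a nice', 'is a', 'nice day', 'today is']
```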


Notation

A binary (0/1) vector can be equivalently viewed as a set (the locations of its nonzeros).

Consider two sets S1, S2 ⊆ Ω = {0, 1, 2, ..., D − 1} (e.g., D = 2^64).

[Figure: two overlapping sets with sizes f1 and f2 and intersection size a]

    f1 = |S1|,  f2 = |S2|,  a = |S1 ∩ S2|.

The resemblance R is a popular measure of set similarity:

    R = |S1 ∩ S2| / |S1 ∪ S2| = a / (f1 + f2 − a).    (Is it more rational than a / √(f1 f2)?)


Minwise Hashing: Standard Algorithm in the Context of Search

The standard practice in the search industry:

Suppose a random permutation π is performed on Ω, i.e.,

    π : Ω −→ Ω,  where Ω = {0, 1, ..., D − 1}.

An elementary probability argument shows that

    Pr(min(π(S1)) = min(π(S2))) = |S1 ∩ S2| / |S1 ∪ S2| = R.


An Example

D = 5.  S1 = {0, 3, 4},  S2 = {1, 2, 3},  R = |S1 ∩ S2| / |S1 ∪ S2| = 1/5.

One realization of the permutation π can be

    0 =⇒ 3
    1 =⇒ 2
    2 =⇒ 0
    3 =⇒ 4
    4 =⇒ 1

π(S1) = {3, 4, 1} = {1, 3, 4},  π(S2) = {2, 0, 4} = {0, 2, 4}

In this example, min(π(S1)) ≠ min(π(S2)).
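An empirical check (a sketch) that the collision probability indeed equals R = 1/5 for this toy example, averaging over random permutations of {0, ..., 4}:

```python
import random

D, S1, S2 = 5, {0, 3, 4}, {1, 2, 3}
trials, hits = 100_000, 0
for _ in range(trials):
    perm = list(range(D))
    random.shuffle(perm)                 # one random permutation pi
    if min(perm[i] for i in S1) == min(perm[i] for i in S2):
        hits += 1
print(hits / trials)                     # close to R = 0.2
```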


Minwise Hashing In 0/1 Data Matrix

[Figure: the original 0/1 data matrix for S1, S2, S3 (as on the earlier slide) and the permuted data matrix π(S1), π(S2), π(S3)]

min(π(S1)) = 2, min(π(S2)) = 0, min(π(S3)) = 0


Minwise Hashing Estimator

After k permutations, π1, π2, ..., πk, one can estimate R without bias:

    R̂_M = (1/k) Σ_{j=1}^{k} 1{min(πj(S1)) = min(πj(S2))},

    Var(R̂_M) = (1/k) R(1 − R).

—————————-

A major problem: one needs 64 bits to store each hashed value.
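A compact sketch of the estimator R̂_M above, using k explicit random permutations of a small Ω (real systems use 64-bit hash functions rather than storing permutations):

```python
import random

def minhash_signature(S, perms):
    return [min(p[i] for i in S) for p in perms]

def estimate_R(sig1, sig2):
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

D, k = 1000, 500
rng = random.Random(0)
perms = []
for _ in range(k):
    p = list(range(D))
    rng.shuffle(p)
    perms.append(p)

S1 = set(range(0, 200))      # toy sets with R = 100 / 300 = 1/3
S2 = set(range(100, 300))
print(estimate_R(minhash_signature(S1, perms), minhash_signature(S2, perms)))
```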


b-Bit Minwise Hashing: the Intuition

Basic idea: Only store the lowest b bits of each hashed value, for small b.

Intuition:

• When two sets are identical, the lowest b bits of their hashed values are of course also equal. b = 1 only stores whether a number is even or odd.

• When two sets are similar, the lowest b bits of their hashed values “should be” also similar (True?? Need a proof).

• Thus, hopefully we do not need many bits to obtain useful information, as real applications often care about pairs with high similarities (e.g., R > 0.5).


The Basic Collision Probability Result

Consider two sets, S1 and S2,

    S1, S2 ⊆ Ω = {0, 1, 2, ..., D − 1},  f1 = |S1|,  f2 = |S2|,  a = |S1 ∩ S2|.

Define the minimum values under π : Ω → Ω to be z1 and z2:

    z1 = min(π(S1)),  z2 = min(π(S2)),

and their b-bit versions

    z1^(b) = the lowest b bits of z1,  z2^(b) = the lowest b bits of z2.

Example: if z1 = 7 (= 111 in binary), then z1^(1) = 1 and z1^(2) = 3.


Decimal   Binary (8 bits)   Lowest b = 2 bits   Corresp. decimal
0         00000000          00                  0
1         00000001          01                  1
2         00000010          10                  2
3         00000011          11                  3
4         00000100          00                  0
5         00000101          01                  1
6         00000110          10                  2
7         00000111          11                  3
8         00001000          00                  0
9         00001001          01                  1


Collision probability: Assume D is large (which is virtually always true). Then

    P_b = Pr(z1^(b) = z2^(b)) = C1,b + (1 − C2,b) R.

Note that the exact probability can be written as a complicated double summation.

——————–

Recall that (assuming infinite precision, or as many digits as needed) we have

    Pr(z1 = z2) = R.


Collision probability: Assume D is large (which is virtually always true). Then

    P_b = Pr(z1^(b) = z2^(b)) = C1,b + (1 − C2,b) R,

where

    r1 = f1/D,  r2 = f2/D,  f1 = |S1|,  f2 = |S2|,

    C1,b = A1,b · r2/(r1 + r2) + A2,b · r1/(r1 + r2),
    C2,b = A1,b · r1/(r1 + r2) + A2,b · r2/(r1 + r2),

    A1,b = r1 (1 − r1)^(2^b − 1) / (1 − (1 − r1)^(2^b)),
    A2,b = r2 (1 − r2)^(2^b − 1) / (1 − (1 − r2)^(2^b)).

——————–

Ref: Li and Konig, b-Bit Minwise Hashing, WWW’10.
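The formula above can be evaluated directly; solving the linear relation P_b = C1,b + (1 − C2,b) R for R also gives the natural estimator from an observed collision rate. A sketch (set sizes are illustrative):

```python
def b_bit_constants(f1, f2, D, b):
    r1, r2 = f1 / D, f2 / D
    A1 = r1 * (1 - r1) ** (2**b - 1) / (1 - (1 - r1) ** (2**b))
    A2 = r2 * (1 - r2) ** (2**b - 1) / (1 - (1 - r2) ** (2**b))
    C1 = A1 * r2 / (r1 + r2) + A2 * r1 / (r1 + r2)
    C2 = A1 * r1 / (r1 + r2) + A2 * r2 / (r1 + r2)
    return C1, C2

def collision_prob(R, f1, f2, D, b):
    C1, C2 = b_bit_constants(f1, f2, D, b)
    return C1 + (1 - C2) * R

def estimate_R(P_hat, f1, f2, D, b):
    # invert the linear relation P_b = C1 + (1 - C2) R
    C1, C2 = b_bit_constants(f1, f2, D, b)
    return (P_hat - C1) / (1 - C2)

print(collision_prob(R=0.5, f1=10_000, f2=8_000, D=2**30, b=1))
```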


The Exact Collision Probability for b = 1

    P_1 = Pr(z1^(1) = z2^(1)) = Pr(both even) + Pr(both odd)

        = R + Σ_{i=0,2,4,...} Σ_{j=0,2,4,..., j≠i} Pr(z1 = i, z2 = j) + Σ_{i=1,3,5,...} Σ_{j=1,3,5,..., j≠i} Pr(z1 = i, z2 = j),

where, for i < j,

    Pr(z1 = i, z2 = j) = (f2/D) · ((f1 − a)/(D − 1)) · ∏_{t=0}^{j−i−2} (D − f2 − i − 1 − t)/(D − 2 − t) · ∏_{t=0}^{i−1} (D − f1 − f2 + a − t)/(D + i − j − 1 − t).


The closed-form collision probability is remarkably accurate even for small D.

The absolute errors (approximate - exact) are very small even for D = 20.

[Figure: approximation error (approximate − exact) vs. a/f2 for b = 1, in two settings: D = 20, f1 = 4 (f2 = 2, 4; errors below 0.01) and D = 500, f1 = 50 (f2 = 2, 25, 50; errors below 4×10^−4)]

—————–

Ref: Li and Konig, Theory and Applications of b-Bit Minwise Hashing, CACM Research Highlights, 2011.


The Variance-Space Trade-off

Smaller b =⇒ smaller space for each sample. However, smaller b =⇒ larger variance.

To characterize the variance-space trade-off, we compare

    64 (i.e., space) × variance   versus   b (i.e., space) × variance.

In the least-favorable situation, the relative improvement (for b = 1) is

    (64 × variance) / (b × variance) = 64 R / (R + 1).

If R = 0.5, then the improvement will be 64/3 ≈ 21.3-fold.

    64 bits + k permutations  ⇔  1 bit + 3k permutations


Experiments: Duplicate Detection on Microsoft News Articles

The dataset was crawled as part of the BLEWS project at Microsoft. We computed pairwise resemblances for all documents and retrieved document pairs with resemblance R larger than a threshold R0.

[Figure: precision vs. sample size k for thresholds R0 = 0.3 and R0 = 0.4, comparing b = 1, 2, 4 bits and the original minwise hashing (M)]


[Figure: precision vs. sample size k for thresholds R0 = 0.5, 0.6, 0.7, and 0.8, comparing b = 1, 2, 4 bits and the original minwise hashing (M)]


1/2 Bit Suffices for High Similarity

Certain applications concern very high similarities. One can XOR two bits from two permutations into just one bit:

    1 XOR 1 = 0,  0 XOR 0 = 0,  1 XOR 0 = 1,  0 XOR 1 = 1

The new estimator is denoted by R̂_{1/2}. Compared to the 1-bit estimator R̂_1:

• A good idea for highly similar data:

    lim_{R→1} Var(R̂_1) / Var(R̂_{1/2}) = 2.

• Not so good an idea when the data are not very similar:

    Var(R̂_1) < Var(R̂_{1/2})  if R < 0.5774, assuming sparse data.

—————

Ref: Li and Konig, b-Bit Minwise Hashing, WWW’10.


b-Bit Minwise Hashing for Estimating 3-Way Similarities

Notation for 3-way set intersections: R123 = |S1 ∩ S2 ∩ S3| / |S1 ∪ S2 ∪ S3|

[Figure: three overlapping sets with sizes f1, f2, f3, pairwise intersections a12, a13, a23, and the normalized quantities r1, r2, r3, s12, s13, s23, s]

3-Way collision probability: Assume D is large. Then

    Pr(the lowest b bits of the 3 hashed values are equal) = R123 + Z/u,

where u = r1 + r2 + r3 − s12 − s13 − s23 + s123, ri = |Si|/D, sij = |Si ∩ Sj|/D, and ...


Z = (s12 − s) A3,b + [(r3 − s13 − s23 + s)/(r1 + r2 − s12)] s12 G12,b
  + (s13 − s) A2,b + [(r2 − s12 − s23 + s)/(r1 + r3 − s13)] s13 G13,b
  + (s23 − s) A1,b + [(r1 − s12 − s13 + s)/(r2 + r3 − s23)] s23 G23,b
  + [(r2 − s23) A3,b + (r3 − s23) A2,b] · [(r1 − s12 − s13 + s)/(r2 + r3 − s23)] G23,b
  + [(r1 − s13) A3,b + (r3 − s13) A1,b] · [(r2 − s12 − s23 + s)/(r1 + r3 − s13)] G13,b
  + [(r1 − s12) A2,b + (r2 − s12) A1,b] · [(r3 − s13 − s23 + s)/(r1 + r2 − s12)] G12,b,

    Aj,b = rj (1 − rj)^(2^b − 1) / (1 − (1 − rj)^(2^b)),

    Gij,b = (ri + rj − sij)(1 − ri − rj + sij)^(2^b − 1) / (1 − (1 − ri − rj + sij)^(2^b)),   i, j ∈ {1, 2, 3}, i ≠ j.

—————

Ref: Li, Konig, Gui, b-Bit Minwise Hashing for Estimating 3-Way Similarities, NIPS'10.


Useful messages:

1. Substantial improvement over using 64 bits, just like in the 2-way case.

2. Must use b ≥ 2 bits for estimating 3-way similarities.

——————

Next question: How to use b-bit minwise hashing for linear learning?


Using b-Bit Minwise Hashing for Learning

Random projection for linear learning is straightforward:

    A × R = B

because the projected data naturally live in an inner product space (linear kernel).

Natural questions about minwise hashing

• Is the n × n matrix of resemblances positive definite (PD)?

• Is the matrix of estimated resemblances PD?

We look for linear kernels to take advantage of modern linear learning algorithms.


Kernels from Minwise Hashing and b-Bit Minwise Hashing

Definition: A symmetric n × n matrix K satisfying Σ_{ij} c_i c_j K_ij ≥ 0 for all real vectors c is called positive definite (PD).

Result: Apply a permutation π to n sets S1, ..., Sn ⊆ Ω = {0, 1, ..., D − 1} and define zi = min(π(Si)). The following three matrices are all PD:

1. The resemblance matrix R ∈ R^{n×n}:  Rij = |Si ∩ Sj| / |Si ∪ Sj| = |Si ∩ Sj| / (|Si| + |Sj| − |Si ∩ Sj|)

2. The minwise hashing matrix M ∈ R^{n×n}:  Mij = 1{zi = zj}

3. The b-bit minwise hashing matrix M^(b) ∈ R^{n×n}:  M^(b)_ij = ∏_{t=1}^{b} 1{e_{i,t} = e_{j,t}}, where e_{i,t} is the t-th lowest bit of zi.

Consequently, for k independent permutations, the sum Σ_{s=1}^{k} M^(b)(s) is also PD.


b-Bit Minwise Hashing for Large-Scale Linear Learning

Linear learning algorithms require the estimators to be inner products.

The estimator of minwise hashing,

    R̂_M = (1/k) Σ_{j=1}^{k} 1{min(πj(S1)) = min(πj(S2))},

is indeed an inner product of two (extremely sparse) vectors in D × k dimensions.

Since z1 = min(π(S1)), z2 = min(π(S2)) ∈ Ω = {0, 1, ..., D − 1}, we can write

    1{min(πj(S1)) = min(πj(S2))} = Σ_{i=0}^{D−1} 1{z1 = i} × 1{z2 = i}.

———–

Ref: Li, Shrivastava, Moore, Konig, Hashing Algorithms for Large-Scale Learning, NIPS’11.


Consider D = 5. We can expand numbers into vectors of length 5:

    0 =⇒ [0, 0, 0, 0, 1],  1 =⇒ [0, 0, 0, 1, 0],  2 =⇒ [0, 0, 1, 0, 0],
    3 =⇒ [0, 1, 0, 0, 0],  4 =⇒ [1, 0, 0, 0, 0].

———————

If z1 = 2 and z2 = 3, then

    0 = 1{z1 = z2} = the inner product of [0, 0, 1, 0, 0] and [0, 1, 0, 0, 0].

If z1 = 2 and z2 = 2, then

    1 = 1{z1 = z2} = the inner product of [0, 0, 1, 0, 0] and [0, 0, 1, 0, 0].
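A tiny sketch of this one-hot expansion (D = 5), showing that the indicator 1{z1 = z2} is exactly an inner product:

```python
import numpy as np

def one_hot(z, D):
    v = np.zeros(D)
    v[D - 1 - z] = 1.0        # index chosen to match the layout printed above
    return v

D = 5
print(one_hot(2, D) @ one_hot(3, D))   # 0.0, since z1 != z2
print(one_hot(2, D) @ one_hot(2, D))   # 1.0, since z1 == z2
```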


Linear Learning Algorithms

Binary classification: Dataset {(x_i, y_i)}_{i=1}^n, x_i ∈ R^D, y_i ∈ {−1, 1}.

One can fit an L2-regularized linear SVM:

    min_w  (1/2) w^T w + C Σ_{i=1}^n max{1 − y_i w^T x_i, 0},

or the L2-regularized logistic regression:

    min_w  (1/2) w^T w + C Σ_{i=1}^n log(1 + e^{−y_i w^T x_i}),

where C > 0 is the penalty (regularization) parameter.


Integrating b-Bit Minwise Hashing for (Linear) Learning

Very simple:

1. Apply k independent random permutations to each (binary) feature vector xi and store the lowest b bits of each hashed value. The storage costs nbk bits.

2. At run-time, expand a hashed data point into a vector of length 2^b × k, i.e., concatenate k vectors of length 2^b. The new feature vector has exactly k 1's.


An example with k = 3 and b = 2

For set (vector) S1:

    Original hashed values (k = 3):        12013   25964   20191
    Original binary representations:       010111011101101   110010101101100   100111011011111
    Lowest b = 2 binary digits:            01   00   11
    Corresponding decimal values:          1   0   3
    Expanded 2^b = 4 binary digits:        0010   0001   1000
    New feature vector fed to a solver:    [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0] × 1/√3

The same procedure is applied to sets S2, S3, ... (a code sketch follows below).

——————

Note that (i) solvers require normalization, and (ii) the expansion 0000 is never used.
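A sketch of this expansion for one data point, reproducing the example above (the bit layout within each length-2^b block matches the printed string; any fixed layout works, since only inner products matter):

```python
import numpy as np

def expand_b_bit(hashed_values, b):
    k = len(hashed_values)
    block = 1 << b                                 # 2^b
    x = np.zeros(k * block)
    for j, z in enumerate(hashed_values):
        low = z & (block - 1)                      # lowest b bits
        x[j * block + (block - 1 - low)] = 1.0     # one-hot within the j-th block
    return x / np.sqrt(k)                          # normalization required by solvers

x1 = expand_b_bit([12013, 25964, 20191], b=2)
print(np.round(x1, 3))   # [0, 0, 0.577, 0, 0, 0, 0, 0.577, 0.577, 0, 0, 0]
```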


Datasets and Solver for Linear Learning

This talk presents the experiments on two text datasets.

Dataset            # Examples (n)   # Dims (D)       Avg # Nonzeros   Train / Test
Webspam (24 GB)     350,000          16,609,143       3728             80% / 20%
Rcv1 (200 GB)       781,265          1,010,017,424    12062            50% / 50%

To generate the Rcv1 dataset, we used the original features + all pairwise features + some 3-way features.

We chose LIBLINEAR as the basic solver for linear learning. Note that our method is purely statistical/probabilistic, independent of the underlying procedures.


Experimental Results on Linear SVM and Webspam

• We conducted extensive experiments for a wide range of regularization values C (from 10^−3 to 10^2), with fine spacings in [0.1, 10].

• We experimented with k = 30 to k = 500, and b = 1, 2, 4, 8, and 16.


Testing Accuracy

[Figure: test accuracy (%) vs. C for linear SVM on webspam with k = 200; curves for b = 1, 2, 4, 6, 8, 10, 16]

• Solid curves: b-bit hashing. Dashed (red) curve: the original data.

• Using b ≥ 8 and k ≥ 200 achieves about the same test accuracies as using the original data.

• The results are averaged over 50 runs.

[Figure: test accuracy (%) vs. C for linear SVM on webspam, for k = 50, 100, 150, 200, 300, 500 and b = 1 to 16]

Stability of Testing Accuracy (Standard Deviation)

Our method produces very stable results, especially for b ≥ 4.

[Figure: standard deviation of test accuracy vs. C for linear SVM on webspam, for k = 50 to 500 and b = 1 to 16]

Training Time

[Figure: training time (sec) vs. C for linear SVM on webspam, for k = 50 to 500 and b = 1 to 16]

• The reported times do not include data loading time (which is small for b-bit hashing).

• The original training time is about 100 seconds.

• b-bit minwise hashing needs about 3 ∼ 7 seconds (3 seconds when b = 8).

Testing Time

[Figure: testing time (sec) vs. C for linear SVM on webspam, for k = 50 to 500]

However, here we assume the test data have already been processed.

Experimental Results on L2-Regularized Logistic Regression

Testing Accuracy

[Figure: test accuracy (%) vs. C for logistic regression on webspam, for k = 50 to 500 and b = 1 to 16]

Training Time

[Figure: training time (sec) vs. C for logistic regression on webspam, for k = 50 to 500 and b = 1 to 16]


Comparisons with Other Algorithms

We conducted extensive experiments with the VW algorithm (Weinberger et al., ICML'09; not the VW online learning platform), which has the same variance as random projections.

Consider two sets S1, S2 with f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|, and R = |S1 ∩ S2| / |S1 ∪ S2| = a / (f1 + f2 − a). (Note a = 0 ⇔ R = 0.) Then, from their variances,

    Var(â_VW) ≈ (1/k) (f1 f2 + a^2),

    Var(R̂_MINWISE) = (1/k) R(1 − R),

we can immediately see the significant advantage of minwise hashing, especially when a ≈ 0 (which is common in practice).


Comparing b-Bit Minwise Hashing with VW

8-bit minwise hashing (dashed, red) with k ≥ 200 (and C ≥ 1) achieves about the same test accuracy as VW with k = 10^4 to 10^6.

[Figure: test accuracy (%) vs. k for VW and 8-bit minwise hashing on webspam, for linear SVM and logistic regression, at several values of C]


8-bit hashing is substantially faster than VW (to achieve the same accuracy).

[Figure: training time (sec) vs. k for VW and 8-bit minwise hashing on webspam, for linear SVM and logistic regression, at several values of C]

Experimental Results on Rcv1 (200 GB)

Test accuracy using linear SVM (the original data cannot be trained directly)

[Figure: test accuracy (%) vs. C for linear SVM on rcv1, for k = 50, 100, 150, 200, 300, 500 and b = 1, 2, 4, 8, 12, 16]

Training time using linear SVM

[Figure: training time (sec) vs. C for linear SVM on rcv1, for k = 50 to 500]

Test accuracy using logistic regression

[Figure: test accuracy (%) vs. C for logistic regression on rcv1, for k = 50 to 500 and b = 1, 2, 4, 8, 12, 16]

Training time using logistic regression

[Figure: training time (sec) vs. C for logistic regression on rcv1, for k = 50 to 500]

Comparisons with VW on Rcv1

Test accuracy using linear SVM

[Figure: test accuracy (%) vs. k for VW and b-bit hashing (b = 1, 2, 4, 8, 12, 16) on rcv1, linear SVM, at several values of C]

Test accuracy using logistic regression

[Figure: test accuracy (%) vs. k for VW and b-bit hashing (b = 1, 2, 4, 8, 12, 16) on rcv1, logistic regression, at several values of C]

Training time

[Figure: training time (sec) vs. k for VW and 8-bit hashing on rcv1, for linear SVM and logistic regression, at several values of C]


k-Permutation Hashing: Conclusions

• BigData everywhere. For many operations, including clustering, classification, near neighbor search, etc., exact answers are often not necessary.

• Very sparse random projections (KDD 2006) work as well as dense random projections. They do, however, require many projections (e.g., k = 10000).

• Minwise hashing is a standard procedure in the context of search. b-bit minwise hashing is a substantial improvement, using only b bits per hashed value instead of 64 bits (e.g., a 20-fold reduction in space).

• b-bit minwise hashing provides a simple strategy for efficient linear learning, by expanding the bits into vectors of length 2^b × k (originally 2^64 × k).

• We can build hash tables for achieving sub-linear time near neighbor search.

• It can be substantially more accurate than random projections (and variants).


• Major drawbacks of b-bit minwise hashing:

– It was designed (mostly) for binary data.

– It needs a fairly high-dimensional representation (2^b × k) after expansion.

– Expensive preprocessing and expensive testing for unprocessed data. Parallelization solutions (e.g., based on GPUs) offer good speed, but they are not energy-efficient.


Why Is Minwise Hashing Wasteful?

[Figure: an original binary data matrix with rows S1, S2, S3 over columns 0-15, and the corresponding permuted data matrix with rows π(S1), π(S2), π(S3) under a random permutation π.]

Keep only the minimums, and repeat the process k (e.g., 500) times.
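To make the baseline concrete, here is a minimal Python sketch (my own, not the authors' implementation) of the standard k-permutation scheme: for each of k independent permutations, only the minimum permuted index of the nonzeros is kept. The function name and the small example set are hypothetical.

```python
# A minimal sketch of standard k-permutation minwise hashing for a set S of
# nonzero column indices in {0, ..., D-1}.
import random

def minwise_hash(S, k, D, seed=0):
    """Return k hashed values: for each of k random permutations,
    the smallest permuted index among the nonzeros of S."""
    rng = random.Random(seed)
    hashes = []
    for _ in range(k):
        perm = list(range(D))
        rng.shuffle(perm)                       # one permutation of the D columns
        hashes.append(min(perm[i] for i in S))  # keep only the minimum
    return hashes

# Toy example with a small hypothetical set and D = 16:
print(minwise_hash({2, 4, 7, 13}, k=5, D=16))
```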


One Permutation Hashing

S1, S2, S3 ⊆ Ω = {0, 1, ..., 15} (i.e., D = 16). The figure presents the permuted sets as three binary (0/1) vectors:

π(S1) = {2, 4, 7, 13}, π(S2) = {0, 6, 13}, π(S3) = {0, 1, 10, 12}

[Figure: the three permuted binary vectors π(S1), π(S2), π(S3) over columns 0-15, with the 16 columns divided into k = 4 bins (labeled 1, 2, 3, 4) of width 4.]

One permutation hashing: divide the space Ω evenly into k = 4 bins and

select the smallest nonzero in each bin.
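A short Python sketch (mine, not from the slides) of this step on the example above: the D = 16 columns are split into k = 4 bins of width 4, and the smallest nonzero location in each bin is kept, with None marking an empty bin.

```python
# One permutation hashing on the example above: D = 16, k = 4 bins of width 4.
# For each bin, keep the smallest nonzero (permuted) index, or None if the bin is empty.
def one_perm_hash(permuted_set, D, k):
    width = D // k                      # assumes k divides D, as in the D = 16, k = 4 example
    mins = [None] * k
    for idx in permuted_set:
        j = idx // width                # which bin this nonzero falls into
        if mins[j] is None or idx < mins[j]:
            mins[j] = idx
    return mins

pi_S1 = {2, 4, 7, 13}
pi_S2 = {0, 6, 13}
pi_S3 = {0, 1, 10, 12}
print(one_perm_hash(pi_S1, D=16, k=4))   # [2, 4, None, 13]
print(one_perm_hash(pi_S2, D=16, k=4))   # [0, 6, None, 13]
print(one_perm_hash(pi_S3, D=16, k=4))   # [0, None, 10, 12]
```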


Two major tasks:

• Theoretical analysis and estimators

• Practical strategies to deal with empty bins, should they occur

Ref: P. Li, A. Owen, C-H Zhang, One Permutation Hashing, NIPS’12


Two Definitions

Recall: the space is divided evenly into k bins

# Jointly empty bins:   N_{emp} = \sum_{j=1}^{k} I_{emp,j}

# Matched bins:   N_{mat} = \sum_{j=1}^{k} I_{mat,j}


where I_{emp,j} and I_{mat,j} are defined for the j-th bin as

I_{emp,j} = 1 if both π(S1) and π(S2) are empty in the j-th bin, and 0 otherwise.

I_{mat,j} = 1 if both π(S1) and π(S2) are non-empty in the j-th bin and the smallest element of π(S1) matches the smallest element of π(S2), and 0 otherwise.

———————————

N_{emp} = \sum_{j=1}^{k} I_{emp,j},   N_{mat} = \sum_{j=1}^{k} I_{mat,j}
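As a small illustration (my own sketch), the two counts can be computed directly from the per-bin minima of the k = 4 example above, where None marks an empty bin:

```python
# Count jointly empty bins (N_emp) and matched bins (N_mat) from per-bin minima.
# None marks an empty bin; values are the smallest nonzero index in that bin.
def count_emp_mat(h1, h2):
    n_emp = sum(1 for a, b in zip(h1, h2) if a is None and b is None)
    n_mat = sum(1 for a, b in zip(h1, h2)
                if a is not None and b is not None and a == b)
    return n_emp, n_mat

h1 = [2, 4, None, 13]   # bin minima of pi(S1) = {2, 4, 7, 13}, k = 4
h2 = [0, 6, None, 13]   # bin minima of pi(S2) = {0, 6, 13}
print(count_emp_mat(h1, h2))   # (1, 1): the third bin is jointly empty; the fourth matches (13 == 13)
```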


Results for the Number of Jointly Empty Bins Nemp

Notation :

f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|, f = |S1 ∪ S2| = f1 + f2 − a.

Distribution:

Pr(N_{emp} = j) = \sum_{s=0}^{k-j} (-1)^s \frac{k!}{j!\, s!\, (k-j-s)!} \prod_{t=0}^{f-1} \frac{D\left(1 - \frac{j+s}{k}\right) - t}{D - t}

Expectation:

\frac{E(N_{emp})}{k} = \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j} \le \left(1 - \frac{1}{k}\right)^{f}


Approximation of Nemp in Sparse Data

f = |S1 ∪ S2| = f1 + f2 − a.

\frac{E(N_{emp})}{k} \approx \left(1 - \frac{1}{k}\right)^{f} \approx e^{-f/k}

(the second step uses f \log(1 - 1/k) \approx -f/k when k is large).

f/k = 5 ⟹ E(N_{emp})/k ≈ 0.0067 (negligible).

f/k = 1 ⟹ E(N_{emp})/k ≈ 0.3679 (noticeable, but then why use hashing?).


An Unbiased Estimator

Estimator:   R_{mat} = \frac{N_{mat}}{k - N_{emp}}

Expectation:   E(R_{mat}) = R

Variance:   Var(R_{mat}) < \frac{R(1 - R)}{k}

One-permutation hashing has smaller variance than k-permutation hashing.
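A self-contained Monte Carlo sketch (my own; D, k, the seed, and the two synthetic sets are arbitrary choices for illustration) suggesting that R_mat = N_mat / (k − N_emp) is centered at the true resemblance R = |S1 ∩ S2| / |S1 ∪ S2|:

```python
# Monte Carlo check that R_mat = N_mat / (k - N_emp) estimates the resemblance R.
import random

D, k = 1024, 64
rng = random.Random(0)
S1 = set(rng.sample(range(D), 100))
S2 = set(rng.sample(sorted(S1), 50)) | set(rng.sample(range(D), 50))  # forced overlap with S1
R_true = len(S1 & S2) / len(S1 | S2)

def bin_minima(S, perm, D, k):
    """Smallest permuted index of S in each of the k equal-width bins (None if empty)."""
    width = D // k
    mins = [None] * k
    for i in S:
        p = perm[i]                     # position of element i after the permutation
        j = p // width
        if mins[j] is None or p < mins[j]:
            mins[j] = p
    return mins

trials, acc = 2000, 0.0
for _ in range(trials):
    perm = list(range(D))
    rng.shuffle(perm)
    h1 = bin_minima(S1, perm, D, k)
    h2 = bin_minima(S2, perm, D, k)
    n_emp = sum(a is None and b is None for a, b in zip(h1, h2))
    n_mat = sum(a is not None and b is not None and a == b for a, b in zip(h1, h2))
    acc += n_mat / (k - n_emp)

print(f"true R = {R_true:.4f}   average R_mat = {acc / trials:.4f}")
```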


Empirical Validation

Based on many pairs of vectors, the empirical MSEs (MSE = bias^2 + variance) verify the theory:

[Figure: MSE(R_mat) vs. k (10^1 to 10^5) for three word pairs (RIGHTS–RESERVED, OF–AND, THIS–HAVE); each panel overlays the empirical MSE, the theoretical variance, and k-permutation minwise hashing.]


Summary of Advantages of One Permutation Hashing

[Figure: the permuted binary vectors π(S1), π(S2), π(S3) over columns 0-15, divided into k = 4 bins (same illustration as above).]

• Computationally much more efficient and energy-efficient.

• A parallelization-based solution requires additional hardware and implementation effort.

• Testing unprocessed data is much faster, which is crucial for user-facing applications.

• Implementation is easier from the perspective of random number generation; e.g., storing a permutation vector of length D = 10^9 (4GB) is not a big deal.

• One permutation hashing is a much better matrix sparsification scheme.


One Permutation Hashing for Linear Learning

Almost the same as k-permutation hashing, but we must deal with empty bins.

Two ideas:

1. Zero coding: encode an empty bin as all zeros (e.g., 00000000) in the expanded space.

2. Random coding: encode an empty bin as a random number. This did not work as well.


Statistics of # nonzeros in Webspam Data

[Figure: histogram of the number of nonzeros per sample in the Webspam data; x-axis: # nonzeros (0 to 10000), y-axis: frequency (up to about 4 × 10^4).]

The dataset has 350,000 samples. The average number of nonzeros is about 4000, which should be much larger than k (e.g., 200 to 500).

About 10% (or 2.8%) of the samples have < 500 (or < 200) nonzeros. We must deal with empty bins if we do not want to exclude those data points.


Review the original example with k = 3 permutations and b = 2

For set (vector) S1:

Original hashed values (k = 3): 12013   25964   20191

Original binary representations: 010111011101101   110010101101100   100111011011111

Lowest b = 2 binary digits: 01   00   11

Corresponding decimal values: 1   0   3

Expanded 2^b = 4 binary digits: 0010   0001   1000

New feature vector fed to a solver: [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0] × 1/√3

Same procedures on sets S2, S3, ...

0000 was never used!
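A minimal sketch (mine, not the authors' code) of this expansion: each of the k hashed values keeps its lowest b bits, which are one-hot coded into a block of length 2^b; the k blocks are concatenated and scaled by 1/√k. The slot ordering below is chosen so the blocks come out as 0010, 0001, 1000, matching the example.

```python
# Expand k minwise-hashed values into a 2^b * k feature vector (lowest b bits,
# one-hot coded per permutation), scaled by 1/sqrt(k).
from math import sqrt

def bbit_expand(hashed_values, b):
    k = len(hashed_values)
    dim = 1 << b                       # 2^b slots per permutation
    vec = [0.0] * (dim * k)
    for i, h in enumerate(hashed_values):
        low = h & (dim - 1)            # lowest b bits of the hashed value
        vec[i * dim + (dim - 1 - low)] = 1.0 / sqrt(k)   # matches the 0010 / 0001 / 1000 layout above
    return vec

print(bbit_expand([12013, 25964, 20191], b=2))
# -> [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0] scaled by 1/sqrt(3)
```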


Zero Coding Example

One permutation hashing with k = 4 and b = 2

For set (vector) S1:

Original hashed values (k = 4): 12013   25964   20191   ∗

Original binary representations: 010111011101101   110010101101100   100111011011111   ∗

Lowest b = 2 binary digits: 01   00   11   ∗

Corresponding decimal values: 1   0   3   ∗

Expanded 2^b = 4 binary digits: 0010   0001   1000   0000

New feature vector: [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0] × 1/√(4 − 1)

Here ∗ marks the empty bin, which zero coding expands to 0000; the scaling 1/√(4 − 1) is 1/√(k − N_emp).

Same procedures on sets S2, S3, ...
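Under the same conventions as the previous sketch, zero coding changes only two things: an empty bin (shown as ∗ above, represented as None here) contributes an all-zero block, and the scaling becomes 1/√(k − N_emp):

```python
# Zero coding: empty bins (None) map to an all-zero 2^b block; nonempty bins are
# one-hot coded as before, and the vector is scaled by 1/sqrt(k - N_emp).
from math import sqrt

def bbit_expand_zero_coding(bin_values, b):
    k = len(bin_values)
    dim = 1 << b
    n_emp = sum(v is None for v in bin_values)
    scale = 1.0 / sqrt(k - n_emp)
    vec = [0.0] * (dim * k)
    for i, v in enumerate(bin_values):
        if v is None:                  # empty bin: leave the whole block at zero
            continue
        low = v & (dim - 1)
        vec[i * dim + (dim - 1 - low)] = scale
    return vec

print(bbit_expand_zero_coding([12013, 25964, 20191, None], b=2))
# -> [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0] scaled by 1/sqrt(3)
```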


Experimental Results on Webspam Data

[Figure: Webspam test accuracy (linear SVM) vs. C (10^-3 to 10^2) for k = 256 and k = 512, with b = 1, 2, 4, 6, 8; curves compare the original data, one permutation hashing, and k-permutation hashing.]

When k = 512 (or even 256) and b = 8, b-bit one permutation hashing achieves test accuracy similar to using the original data.

One permutation hashing (zero coding) is even slightly more accurate than k-permutation hashing (at merely 1/k of the processing cost).


One Permutation vs. k-Permutation

Linear SVM

[Figure: Webspam test accuracy (linear SVM) vs. C (10^-3 to 10^2) for k = 32, 64, 256, 512 and b = 1, 2, 4, 6, 8; curves compare the original data, one permutation hashing, and k-permutation hashing.]

Logistic Regression

[Figure: Webspam test accuracy (logistic regression) vs. C (10^-3 to 10^2) for k = 32, 64, 256, 512 and b = 1, 2, 4, 6, 8; same comparison as above.]

————————

The influence of empty bins is not very noticeable, as expected.


One Permutation Hashing: Conclusion

One permutation hashing is at least as accurate as the standard k-permutation

minwise hashing at merely 1/k of the original processing cost.


Testing the Robustness of Zero Coding on 20Newsgroup Data

For the Webspam data, the number of empty bins may be too small to stress-test the algorithm.

The 20Newsgroups (News20) dataset has merely 20,000 samples in about one million dimensions, with on average about 500 nonzeros per sample.

Therefore, News20 serves as a toy example for testing the robustness of our algorithm. In fact, we let k be as large as 4096, i.e., most of the bins are empty.


20Newsgroups: One Permutation vs. k-Permutation (SVM)

[Figure: News20 test accuracy (linear SVM) vs. C (10^-1 to 10^3) for k = 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096 and b = 1, 2, 4, 6, 8; curves compare the original data, one permutation hashing, and k-permutation hashing.]

One permutation hashing with zero coding can reach the original accuracy of 98%.

The original k-permutation scheme only reaches 97.5%, even with k = 4096.


20Newsgroups: One Permutation vs. k-Permutation (Logit)

Logistic Regression

[Figure: News20 test accuracy (logistic regression) vs. C (10^-1 to 10^3) for k = 32, 64, 128, 256, 512, 1024, 2048, 4096 and b = 1, 2, 4, 6, 8; curves compare the original data, one permutation hashing, and k-permutation hashing.]


b-Bit Minwise Hashing for Efficient Near Neighbor Search

Near neighbor search is a much more frequent operation than training an SVM.

The bits from b-bit minwise hashing can be directly used to build hash tables to

enable sub-linear time near neighbor search.

This is an instance of the general family of locality-sensitive hashing (LSH).


An Example of b-Bit Hashing for Near Neighbor Search

We use b = 2 bits and k = 2 permutations to build a hash table indexed from 0000 to 1111, i.e., the table size is 2^{2×2} = 16.

[Figure: two hash tables, each indexed by the 4-bit codes 0000 to 1111; each bucket lists the IDs of the data points whose hashed value equals that index (e.g., 8, 13, 251 or 7, 24, 156), with some buckets empty.]

Then, the data points are placed in the buckets according to their hashed values.

To answer a query, look for near neighbors in the bucket that matches the hash value of the query.

Replicate the hash table (twice in this case) so that good near neighbors missed by one table can still be caught by another.

The final set of retrieved data points is the union of the matched buckets.
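A minimal sketch (my own; the helper names and toy numbers are hypothetical) of this table-based search: each of the L tables keys its buckets by a B = b × k bit code obtained by concatenating the lowest b bits of k hashed values, and a query returns the union of its matching buckets across tables.

```python
# Build L hash tables keyed by B = b*k-bit codes formed by concatenating the
# lowest b bits of k minwise-hashed values; query by taking the union of buckets.
from collections import defaultdict

def code(hashed_values, b):
    """Concatenate the lowest b bits of each hashed value into one integer key."""
    key = 0
    for h in hashed_values:
        key = (key << b) | (h & ((1 << b) - 1))
    return key

def build_tables(all_hashes, b):
    """all_hashes[t][i] = list of k hashed values of data point i under table t's permutations."""
    tables = []
    for per_table in all_hashes:                 # one dict per table (L tables total)
        table = defaultdict(list)
        for point_id, hv in enumerate(per_table):
            table[code(hv, b)].append(point_id)
        tables.append(table)
    return tables

def query(tables, query_hashes, b):
    """Union of the buckets matching the query's code in each of the L tables."""
    candidates = set()
    for table, hv in zip(tables, query_hashes):
        candidates.update(table.get(code(hv, b), []))
    return candidates

# Toy usage: L = 2 tables, k = 2 permutations, b = 2 bits (so B = 4-bit indices).
all_hashes = [
    [[12013, 25964], [20191, 7]],   # table 1: hashed values of data points 0 and 1
    [[25964, 20191], [12013, 8]],   # table 2
]
tables = build_tables(all_hashes, b=2)
print(query(tables, query_hashes=[[12013, 25964], [25964, 20191]], b=2))  # -> {0}
```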


Experiments on the Webspam Dataset

B = b × k: number of bits of the table index (table size 2^B); L: number of tables.

Fractions of retrieved data points

[Figure: fraction of data points evaluated vs. B (8 to 24 bits) on Webspam, for L = 25, 50, 100 tables; curves for b-bit hashing with b = 1, 2, 4 and for SRP.]

SRP (sign random projections) and b-bit hashing (with b = 1, 2) retrieved similar

numbers of data points.

————————

Ref: Shrivastava and Li, Fast Near Neighbor Search in High-Dimensional Binary

Data, ECML’12


Precision-Recall curves (the higher the better) for retrieving top-T data points.

[Figure: precision-recall curves on Webspam for retrieving the top T = 5, 10, 50 data points, with B = 24 and B = 16 bits and L = 100 tables; curves for b-bit hashing (b = 1, 4) and SRP.]

Using L = 100 tables, b-bit hashing considerably outperformed SRP (sign random projections) for index sizes B = 24 and B = 16 bits.