
Probabilistic Hashing for Efficient Search and Learning

Ping Li

Department of Statistical Science

Faculty of Computing and Information Science

Cornell University

Ithaca, NY 14853

Fall 2012, Cornell STSCI6520


Practice of Statistical Analysis and Data Mining

Key Components:

Models (Methods) + Variables (Features) + Observations (Samples)

Often a Good Practice:

Simple Models (Methods) + Lots of Features + Lots of Data

or simply

Simple Methods + BigData


BigData Everywhere

Conceptually, consider a dataset as a matrix of size n × D.

In modern applications, # examples n = 10^6 is common and n = 10^9 is not rare; for example, images, documents, spam, and search click-through data.

High-dimensional (image, text, biological) data are common: D = 10^6 (million), D = 10^9 (billion), D = 10^12 (trillion), D = 2^64, or even higher. In a sense, D can be arbitrarily high by considering pairwise, 3-way, or higher-order interactions.


Examples of BigData Challenges: Linear Learning

Binary classification: Dataset {(x_i, y_i)}_{i=1}^n, x_i ∈ R^D, y_i ∈ {−1, 1}.

One can fit an L2-regularized linear logistic regression:

    min_w  (1/2) w^T w + C Σ_{i=1}^n log(1 + e^{−y_i w^T x_i}),

or the L2-regularized linear SVM:

    min_w  (1/2) w^T w + C Σ_{i=1}^n max{1 − y_i w^T x_i, 0},

where C > 0 is the penalty (regularization) parameter.

The weight vector w has length D, the same as the data dimensionality.
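The two objectives above map directly onto standard solvers. A minimal sketch, assuming scikit-learn's LIBLINEAR-backed implementations (the deck uses LIBLINEAR later) and a placeholder file name for a libsvm-format dataset:

```python
# Fit the L2-regularized linear SVM and logistic regression written above.
# "webspam.svmlight" is a hypothetical path; C is the penalty parameter.
from sklearn.datasets import load_svmlight_file
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

X, y = load_svmlight_file("webspam.svmlight")   # sparse n x D matrix, labels in {-1, +1}

C = 1.0
svm = LinearSVC(C=C, loss="hinge").fit(X, y)                    # hinge loss + (1/2) w^T w
logit = LogisticRegression(C=C, solver="liblinear").fit(X, y)   # logistic loss + (1/2) w^T w

print(svm.coef_.shape, logit.coef_.shape)   # each weight vector w has length D
```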


Challenges of Learning with Massive High-dimensional Data

• The data often cannot fit in memory (even when the data are sparse).

• Data loading (or transmission over the network) takes too long.

• Training can be expensive, even for simple linear models.

• Testing may be too slow to meet the demand, which is especially crucial for applications in search, high-speed trading, or interactive visual data analytics.

• The model itself can be too large to store; for example, we cannot really store a vector of weights for logistic regression on data of 2^64 dimensions.

• Near neighbor search, for example, finding the most similar document among billions of Web pages without scanning them all.


Dimensionality Reduction and Data Reduction

Dimensionality reduction: Reducing D, for example, from 2^64 to 10^5.

Data reduction: Reducing the # nonzeros is often more important. With modern linear learning algorithms, the cost (for storage, transmission, computation) is mainly determined by the # nonzeros, not so much by the dimensionality.

Is PCA enough? PCA is infeasible for big data, at least not in real time. Updating PCA is non-trivial. PCA does not provide a good indexing scheme. In extremely high-dimensional data, PCA may not be very useful even if it can be computed.


High-Dimensional Data Generated by Feature Expansions

• Certain datasets (e.g., genomics) are naturally high-dimensional.

• For some applications, one can choose either

  – Low-dimensional representations + complex & slow methods

  or

  – High-dimensional representations + simple & fast methods

• Two popular feature expansion methods (a small sketch of both follows)

  – Global expansion: e.g., pairwise (or higher-order) interactions

  – Local expansion: e.g., word/character n-grams using contiguous words/characters.
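Both expansion styles are easy to implement; below is a small illustrative sketch (not from the slides) of a global pairwise expansion and a local character n-gram expansion:

```python
# Global expansion: all pairwise interaction features x_i * x_j.
# Local expansion: contiguous character n-grams as binary presence features.
from itertools import combinations

def pairwise_expansion(x):
    return list(x) + [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]

def char_ngrams(text, n=3):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

print(len(pairwise_expansion([1.0, 0.0, 2.0, 3.0])))      # 4 + C(4,2) = 10 features
print(sorted(char_ngrams("today is a nice day", n=3))[:5])
```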


Webspam: Text Data and Local Expansion

                 Dim.         Time      Accuracy
  1-gram         254          20 sec    93.30%
  Binary 1-gram  254          20 sec    87.78%
  3-gram         16,609,143   200 sec   99.6%
  Binary 3-gram  16,609,143   200 sec   99.6%
  Kernel         16,609,143   Days      99.6%

Experiments were based on linear SVM (LIBLINEAR) unless marked as “kernel”.


MNIST: Digit Image Data and Global Expansion

                    Dim.      Time      Accuracy
  Original          768       80 sec    92.11%
  Binary original   768       30 sec    91.25%
  Pairwise          295,296   400 sec   98.33%
  Binary pairwise   295,296   400 sec   97.88%
  Kernel            768       Hours     98.6%

(It is actually not easy to achieve the well-known 98.6% accuracy; one must use very precise kernel parameters.)


Observations from Feature Expansion Experiments

• 3-grams + linear SVM on webspam produces excellent classification results.

• All pairwise features + linear SVM on MNIST also produces excellent classification results, close to the well-known result from the Gaussian kernel.

• For low-dimensional data, binary quantization of the features may lead to serious degradation of classification performance.

• Binary quantization on high-dimensional data either does not affect the results at all (e.g., webspam) or does not hurt the performance much (e.g., MNIST).


Major issues with feature expansions:

• high-dimensionality

• high storage cost

• relatively high training/testing cost

The search industry has adopted the practice of

high-dimensional data + linear algorithms + probabilistic hashing

——————

Random projection can be viewed as a hashing method.


A Popular Solution Based on Normal Random Projections

Random Projections: Replace the original data matrix A by B = A × R.

    A × R = B

R ∈ R^{D×k}: a random matrix with i.i.d. entries sampled from N(0, 1).

B ∈ R^{n×k}: the projected matrix, also random.

B approximately preserves the Euclidean distances and inner products between any two rows of A. In particular, E(BB^T) = E(ARR^T A^T) = AA^T.

Therefore, we can simply feed B into (e.g.,) SVM or logistic regression solvers.
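A quick numerical check of the claim above (a sketch, not from the slides); the sizes are arbitrary, and the inner-product estimate is normalized by 1/k, matching the estimator used a few slides later:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, k = 3, 10_000, 1024
A = rng.random((n, D))                 # toy data matrix

R = rng.standard_normal((D, k))        # i.i.d. N(0, 1) projection matrix
B = A @ R

exact = A[0] @ A[1]
approx = (B[0] @ B[1]) / k             # approximately preserves the inner product
print(exact, approx)
```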


Very Sparse Random Projections

The projection matrix: R = {r_ij} ∈ R^{D×k}. Instead of sampling from normals, we sample from a sparse distribution parameterized by s ≥ 1:

    r_ij =  −1   with prob. 1/(2s)
             0   with prob. 1 − 1/s
             1   with prob. 1/(2s)

If s = 100, then on average 99% of the entries are zero.

If s = 10000, then on average 99.99% of the entries are zero.

——————-

Ref: Li, Hastie, Church, Very Sparse Random Projections, KDD’06.

Ref: Li, Very Sparse Stable Random Projections, KDD’07.
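A sketch of how such a sparse projection matrix can be generated (sizes and the value of s are illustrative):

```python
import numpy as np

def very_sparse_projection(D, k, s, rng):
    """Entries are -1, 0, +1 with probabilities 1/(2s), 1 - 1/s, 1/(2s)."""
    u = rng.random((D, k))
    R = np.zeros((D, k))
    R[u < 1.0 / (2 * s)] = -1.0
    R[u > 1.0 - 1.0 / (2 * s)] = 1.0
    return R

rng = np.random.default_rng(0)
R = very_sparse_projection(D=10_000, k=256, s=1000, rng=rng)
print((R != 0).mean())   # roughly 1/s = 0.001 of the entries are nonzero
```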


A Running Example Using (Small) Webspam Data

Dataset: 350K text samples, 16 million dimensions, about 4000 nonzeros on average, 24 GB disk space.

[Figure: histogram of the number of nonzeros per document in the webspam dataset]

Task: Binary classification for spam vs. non-spam.

——————–

The dataset was generated using 3-grams.


Very Sparse Projections + Linear SVM on Webspam Data

Red dashed curves: results based on the original data

[Figure: classification accuracy (%) vs. SVM regularization parameter C on webspam, for s = 1 and s = 1000, with curves for k = 32, 64, 128, 256, 512, 1024, 4096]

Observations:

• We need a large number of projections (e.g., k ≥ 4096) for high accuracy.

• The sparsity parameter s matters little unless k is small.


Very Sparse Projections + Linear SVM on Webspam Data

[Figure: classification accuracy (%) vs. C on webspam, for s = 1 and s = 10000, with curves for k = 32 to 4096]

As long as k is large (necessary for high accuracy), the projection matrix can be extremely sparse, even with s = 10000.


Many more experimental results (classification, clustering, and regression) on very sparse random projections, as well as b-bit minwise hashing, can be found at the 2012 Stanford Modern Massive Data Sets (MMDS) Workshop:

http://www.stanford.edu/group/mmds/slides2012/s-pli.pdf


Disadvantages of Random Projections (and Variants)

Inaccurate, especially on binary data.


Variance Analysis for Inner Product Estimates

    A × R = B

First two rows of A: u1, u2 ∈ R^D (D is very large):

    u1 = [u_{1,1}, u_{1,2}, ..., u_{1,i}, ..., u_{1,D}]
    u2 = [u_{2,1}, u_{2,2}, ..., u_{2,i}, ..., u_{2,D}]

First two rows of B: v1, v2 ∈ R^k (k is small):

    v1 = [v_{1,1}, v_{1,2}, ..., v_{1,j}, ..., v_{1,k}]
    v2 = [v_{2,1}, v_{2,2}, ..., v_{2,j}, ..., v_{2,k}]

    E(v1^T v2) = u1^T u2 = a


    â = (1/k) Σ_{j=1}^{k} v_{1,j} v_{2,j}   (which is also an inner product)

    E(â) = a

    Var(â) = (1/k) (m1 m2 + a^2),   where  m1 = Σ_{i=1}^{D} |u_{1,i}|^2,  m2 = Σ_{i=1}^{D} |u_{2,i}|^2

———

Random projections may not be good for inner products because the variance is dominated by the marginal l2 norms m1 m2, especially when a ≈ 0.

For real-world datasets, most pairs are often close to orthogonal (a ≈ 0).

——————-

Ref: Li, Hastie, and Church, Very Sparse Random Projections, KDD’06.

Ref: Li, Shrivastava, Moore, Konig, Hashing Algorithms for Large-Scale Learning, NIPS’11
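A quick Monte Carlo check (a sketch with arbitrary toy vectors) of the variance formula above, Var(â) = (m1 m2 + a^2)/k:

```python
import numpy as np

rng = np.random.default_rng(1)
D, k, trials = 1000, 64, 1000
u1 = (rng.random(D) < 0.05).astype(float)   # sparse 0/1 toy vectors
u2 = (rng.random(D) < 0.05).astype(float)
a, m1, m2 = u1 @ u2, u1 @ u1, u2 @ u2

est = np.empty(trials)
for t in range(trials):
    R = rng.standard_normal((D, k))
    est[t] = (u1 @ R) @ (u2 @ R) / k        # projection-based inner product estimate

print("empirical var:", est.var(), " formula:", (m1 * m2 + a * a) / k)
```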


b-Bit Minwise Hashing

• Simple algorithm designed for massive, high-dimensional, binary data.

• Much more accurate than random projections for estimating inner products.

• Much smaller space requirement than the original minwise hashing algorithm.

• Can estimate 3-way similarities, which random projections cannot.

• Large-scale linear learning (and kernel learning of course).

• Sub-linear time near neighbor search. (Slides available in the appendix)

——————–

Major drawback: very expensive preprocessing. (Now solved!)


Massive, High-dimensional, Sparse, and Binary Data

Binary sparse data are very common in the real world:

• For many applications (such as text), binary sparse data are very natural.

• Many datasets can be quantized/thresholded to be binary without hurting the prediction accuracy.

• In some cases, even when the “original” data are not too sparse, they often become sparse when considering pairwise and higher-order interactions.


Binary Data Vectors and Sets

A binary (0/1) vector in D dimensions can be viewed as a set S ⊆ Ω = {0, 1, ..., D − 1}.

Example: S1, S2, S3 ⊆ Ω = {0, 1, ..., 15} (i.e., D = 16).

[Figure: the three sets shown as rows of a 0/1 data matrix with 16 columns]

S1 = {1, 4, 5, 8},  S2 = {8, 10, 12, 14},  S3 = {3, 6, 7, 14}


Binary (0/1) Massive Text Data by Shingling

Shingling: Each document (Web page) can be viewed as a set of w-shingles.

For example, after parsing, a sentence “today is a nice day” becomes

• w = 1: “today”, “is”, “a”, “nice”, “day”

• w = 2: “today is”, “is a”, “a nice”, “nice day”

• w = 3: “today is a”, “is a nice”, “a nice day”

Previous studies used w ≥ 5, as a single-word (unigram) model is not sufficient.

Shingling generates extremely high-dimensional vectors, e.g., D = (10^5)^w. Note that (10^5)^5 = 10^25 ≈ 2^83, although in current practice it seems D = 2^64 suffices.
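A small sketch of w-shingling; hashing each shingle to a 64-bit integer (so that Ω = {0, ..., 2^64 − 1}) is an assumption made here for illustration:

```python
import hashlib

def w_shingles(text, w):
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def shingle_ids(text, w):
    """Map each w-shingle to an integer in {0, ..., 2^64 - 1}."""
    return {int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")
            for s in w_shingles(text, w)}

print(sorted(w_shingles("today is a nice day", 2)))
# ['a nice', 'is a', 'nice day', 'today is']
```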


Notation

A binary (0/1) vector can be equivalently viewed as a set (the locations of its nonzeros).

Consider two sets S1, S2 ⊆ Ω = {0, 1, 2, ..., D − 1} (e.g., D = 2^64).

[Figure: two overlapping sets with sizes f1 and f2 and intersection size a]

    f1 = |S1|,  f2 = |S2|,  a = |S1 ∩ S2|.

The resemblance R is a popular measure of set similarity:

    R = |S1 ∩ S2| / |S1 ∪ S2| = a / (f1 + f2 − a).    (Is it more rational than a / √(f1 f2)?)


Minwise Hashing: Standard Algorithm in the Context of Search

The standard practice in the search industry:

Suppose a random permutation π is performed on Ω, i.e.,

    π : Ω −→ Ω,  where Ω = {0, 1, ..., D − 1}.

An elementary probability argument shows that

    Pr(min(π(S1)) = min(π(S2))) = |S1 ∩ S2| / |S1 ∪ S2| = R.


An Example

D = 5.  S1 = {0, 3, 4},  S2 = {1, 2, 3},  R = |S1 ∩ S2| / |S1 ∪ S2| = 1/5.

One realization of the permutation π can be

    0 =⇒ 3
    1 =⇒ 2
    2 =⇒ 0
    3 =⇒ 4
    4 =⇒ 1

π(S1) = {3, 4, 1} = {1, 3, 4},  π(S2) = {2, 0, 4} = {0, 2, 4}

In this example, min(π(S1)) ≠ min(π(S2)).
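An empirical check (a sketch) that the collision probability indeed equals R = 1/5 for this toy example, averaging over random permutations of {0, ..., 4}:

```python
import random

D, S1, S2 = 5, {0, 3, 4}, {1, 2, 3}
trials, hits = 100_000, 0
for _ in range(trials):
    perm = list(range(D))
    random.shuffle(perm)                 # one random permutation pi
    if min(perm[i] for i in S1) == min(perm[i] for i in S2):
        hits += 1
print(hits / trials)                     # close to R = 0.2
```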


Minwise Hashing In 0/1 Data Matrix

[Figure: the original 0/1 data matrix for S1, S2, S3 (as on the earlier slide) and the permuted data matrix π(S1), π(S2), π(S3)]

min(π(S1)) = 2, min(π(S2)) = 0, min(π(S3)) = 0


Minwise Hashing Estimator

After k permutations, π1, π2, ..., πk, one can estimate R without bias:

    R̂_M = (1/k) Σ_{j=1}^{k} 1{min(πj(S1)) = min(πj(S2))},

    Var(R̂_M) = (1/k) R(1 − R).

—————————-

A major problem: one needs 64 bits to store each hashed value.
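A compact sketch of the estimator R̂_M above, using k explicit random permutations of a small Ω (real systems use 64-bit hash functions rather than storing permutations):

```python
import random

def minhash_signature(S, perms):
    return [min(p[i] for i in S) for p in perms]

def estimate_R(sig1, sig2):
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

D, k = 1000, 500
rng = random.Random(0)
perms = []
for _ in range(k):
    p = list(range(D))
    rng.shuffle(p)
    perms.append(p)

S1 = set(range(0, 200))      # toy sets with R = 100 / 300 = 1/3
S2 = set(range(100, 300))
print(estimate_R(minhash_signature(S1, perms), minhash_signature(S2, perms)))
```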


b-Bit Minwise Hashing: the Intuition

Basic idea: Only store the lowest b bits of each hashed value, for small b.

Intuition:

• When two sets are identical, the lowest b bits of their hashed values are of course also equal. b = 1 only stores whether a number is even or odd.

• When two sets are similar, the lowest b bits of their hashed values “should be” also similar (True?? Need a proof).

• Thus, hopefully we do not need many bits to obtain useful information, as real applications often care about pairs with high similarities (e.g., R > 0.5).


The Basic Collision Probability Result

Consider two sets, S1 and S2,

    S1, S2 ⊆ Ω = {0, 1, 2, ..., D − 1},  f1 = |S1|,  f2 = |S2|,  a = |S1 ∩ S2|.

Define the minimum values under π : Ω → Ω to be z1 and z2:

    z1 = min(π(S1)),  z2 = min(π(S2)),

and their b-bit versions

    z1^(b) = the lowest b bits of z1,  z2^(b) = the lowest b bits of z2.

Example: if z1 = 7 (= 111 in binary), then z1^(1) = 1 and z1^(2) = 3.


Decimal   Binary (8 bits)   Lowest b = 2 bits   Corresp. decimal
0         00000000          00                  0
1         00000001          01                  1
2         00000010          10                  2
3         00000011          11                  3
4         00000100          00                  0
5         00000101          01                  1
6         00000110          10                  2
7         00000111          11                  3
8         00001000          00                  0
9         00001001          01                  1


Collision probability: Assume D is large (which is virtually always true). Then

    P_b = Pr(z1^(b) = z2^(b)) = C1,b + (1 − C2,b) R.

Note that the exact probability can be written as a complicated double summation.

——————–

Recall that (assuming infinite precision, or as many digits as needed) we have

    Pr(z1 = z2) = R.


Collision probability: Assume D is large (which is virtually always true). Then

    P_b = Pr(z1^(b) = z2^(b)) = C1,b + (1 − C2,b) R,

where

    r1 = f1/D,  r2 = f2/D,  f1 = |S1|,  f2 = |S2|,

    C1,b = A1,b · r2/(r1 + r2) + A2,b · r1/(r1 + r2),
    C2,b = A1,b · r1/(r1 + r2) + A2,b · r2/(r1 + r2),

    A1,b = r1 (1 − r1)^(2^b − 1) / (1 − (1 − r1)^(2^b)),
    A2,b = r2 (1 − r2)^(2^b − 1) / (1 − (1 − r2)^(2^b)).

——————–

Ref: Li and Konig, b-Bit Minwise Hashing, WWW’10.
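The formula above can be evaluated directly; solving the linear relation P_b = C1,b + (1 − C2,b) R for R also gives the natural estimator from an observed collision rate. A sketch (set sizes are illustrative):

```python
def b_bit_constants(f1, f2, D, b):
    r1, r2 = f1 / D, f2 / D
    A1 = r1 * (1 - r1) ** (2**b - 1) / (1 - (1 - r1) ** (2**b))
    A2 = r2 * (1 - r2) ** (2**b - 1) / (1 - (1 - r2) ** (2**b))
    C1 = A1 * r2 / (r1 + r2) + A2 * r1 / (r1 + r2)
    C2 = A1 * r1 / (r1 + r2) + A2 * r2 / (r1 + r2)
    return C1, C2

def collision_prob(R, f1, f2, D, b):
    C1, C2 = b_bit_constants(f1, f2, D, b)
    return C1 + (1 - C2) * R

def estimate_R(P_hat, f1, f2, D, b):
    # invert the linear relation P_b = C1 + (1 - C2) R
    C1, C2 = b_bit_constants(f1, f2, D, b)
    return (P_hat - C1) / (1 - C2)

print(collision_prob(R=0.5, f1=10_000, f2=8_000, D=2**30, b=1))
```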


The Exact Collision Probability for b = 1

    P_1 = Pr(z1^(1) = z2^(1)) = Pr(both even) + Pr(both odd)

        = R + Σ_{i=0,2,4,...} Σ_{j=0,2,4,..., j≠i} Pr(z1 = i, z2 = j) + Σ_{i=1,3,5,...} Σ_{j=1,3,5,..., j≠i} Pr(z1 = i, z2 = j),

where, for i < j,

    Pr(z1 = i, z2 = j) = (f2/D) · ((f1 − a)/(D − 1)) · ∏_{t=0}^{j−i−2} (D − f2 − i − 1 − t)/(D − 2 − t) · ∏_{t=0}^{i−1} (D − f1 − f2 + a − t)/(D + i − j − 1 − t).


The closed-form collision probability is remarkably accurate even for small D.

The absolute errors (approximate - exact) are very small even for D = 20.

[Figure: approximation error (approximate − exact) vs. a/f2 for b = 1, in two settings: D = 20, f1 = 4 (f2 = 2, 4; errors below 0.01) and D = 500, f1 = 50 (f2 = 2, 25, 50; errors below 4×10^−4)]

—————–

Ref: Li and Konig, Theory and Applications of b-Bit Minwise Hashing, CACM Research Highlights, 2011.


The Variance-Space Trade-off

Smaller b =⇒ smaller space for each sample. However, smaller b =⇒ larger variance.

To characterize the variance-space trade-off, we compare

    64 (i.e., space) × variance   versus   b (i.e., space) × variance.

In the least-favorable situation, the relative improvement (for b = 1) is

    (64 × variance) / (b × variance) = 64 R / (R + 1).

If R = 0.5, then the improvement will be 64/3 ≈ 21.3-fold.

    64 bits + k permutations  ⇔  1 bit + 3k permutations


Experiments: Duplicate Detection on Microsoft News Articles

The dataset was crawled as part of the BLEWS project at Microsoft. We computed pairwise resemblances for all documents and retrieved document pairs with resemblance R larger than a threshold R0.

[Figure: precision vs. sample size k for thresholds R0 = 0.3 and R0 = 0.4, comparing b = 1, 2, 4 bits and the original minwise hashing (M)]


[Figure: precision vs. sample size k for thresholds R0 = 0.5, 0.6, 0.7, and 0.8, comparing b = 1, 2, 4 bits and the original minwise hashing (M)]


1/2 Bit Suffices for High Similarity

Certain applications concern very high similarities. One can XOR two bits from two permutations into just one bit:

    1 XOR 1 = 0,  0 XOR 0 = 0,  1 XOR 0 = 1,  0 XOR 1 = 1

The new estimator is denoted by R̂_{1/2}. Compared to the 1-bit estimator R̂_1:

• A good idea for highly similar data:

    lim_{R→1} Var(R̂_1) / Var(R̂_{1/2}) = 2.

• Not so good an idea when the data are not very similar:

    Var(R̂_1) < Var(R̂_{1/2})  if R < 0.5774, assuming sparse data.

—————

Ref: Li and Konig, b-Bit Minwise Hashing, WWW’10.


b-Bit Minwise Hashing for Estimating 3-Way Similarities

Notation for 3-way set intersections: R123 = |S1 ∩ S2 ∩ S3| / |S1 ∪ S2 ∪ S3|

[Figure: three overlapping sets with sizes f1, f2, f3, pairwise intersections a12, a13, a23, and the normalized quantities r1, r2, r3, s12, s13, s23, s]

3-Way collision probability: Assume D is large. Then

    Pr(the lowest b bits of the 3 hashed values are equal) = R123 + Z/u,

where u = r1 + r2 + r3 − s12 − s13 − s23 + s123, ri = |Si|/D, sij = |Si ∩ Sj|/D, and ...


Z = (s12 − s) A3,b + [(r3 − s13 − s23 + s)/(r1 + r2 − s12)] s12 G12,b
  + (s13 − s) A2,b + [(r2 − s12 − s23 + s)/(r1 + r3 − s13)] s13 G13,b
  + (s23 − s) A1,b + [(r1 − s12 − s13 + s)/(r2 + r3 − s23)] s23 G23,b
  + [(r2 − s23) A3,b + (r3 − s23) A2,b] · [(r1 − s12 − s13 + s)/(r2 + r3 − s23)] G23,b
  + [(r1 − s13) A3,b + (r3 − s13) A1,b] · [(r2 − s12 − s23 + s)/(r1 + r3 − s13)] G13,b
  + [(r1 − s12) A2,b + (r2 − s12) A1,b] · [(r3 − s13 − s23 + s)/(r1 + r2 − s12)] G12,b,

    Aj,b = rj (1 − rj)^(2^b − 1) / (1 − (1 − rj)^(2^b)),

    Gij,b = (ri + rj − sij)(1 − ri − rj + sij)^(2^b − 1) / (1 − (1 − ri − rj + sij)^(2^b)),   i, j ∈ {1, 2, 3}, i ≠ j.

—————

Ref: Li, Konig, Gui, b-Bit Minwise Hashing for Estimating 3-Way Similarities, NIPS'10.


Useful messages:

1. Substantial improvement over using 64 bits, just like in the 2-way case.

2. Must use b ≥ 2 bits for estimating 3-way similarities.

——————

Next question: How to use b-bit minwise hashing for linear learning?


Using b-Bit Minwise Hashing for Learning

Random projection for linear learning is straightforward:

    A × R = B

because the projected data naturally live in an inner product space (linear kernel).

Natural questions about minwise hashing

• Is the n × n matrix of resemblances positive definite (PD)?

• Is the matrix of estimated resemblances PD?

We look for linear kernels to take advantage of modern linear learning algorithms.


Kernels from Minwise Hashing and b-Bit Minwise Hashing

Definition: A symmetric n × n matrix K satisfying Σ_{ij} c_i c_j K_ij ≥ 0 for all real vectors c is called positive definite (PD).

Result: Apply a permutation π to n sets S1, ..., Sn ⊆ Ω = {0, 1, ..., D − 1} and define zi = min(π(Si)). The following three matrices are all PD:

1. The resemblance matrix R ∈ R^{n×n}:  Rij = |Si ∩ Sj| / |Si ∪ Sj| = |Si ∩ Sj| / (|Si| + |Sj| − |Si ∩ Sj|)

2. The minwise hashing matrix M ∈ R^{n×n}:  Mij = 1{zi = zj}

3. The b-bit minwise hashing matrix M^(b) ∈ R^{n×n}:  M^(b)_ij = ∏_{t=1}^{b} 1{e_{i,t} = e_{j,t}}, where e_{i,t} is the t-th lowest bit of zi.

Consequently, for k independent permutations, the sum Σ_{s=1}^{k} M^(b)(s) is also PD.


b-Bit Minwise Hashing for Large-Scale Linear Learning

Linear learning algorithms require the estimators to be inner products.

The estimator of minwise hashing,

    R̂_M = (1/k) Σ_{j=1}^{k} 1{min(πj(S1)) = min(πj(S2))},

is indeed an inner product of two (extremely sparse) vectors in D × k dimensions.

Since z1 = min(π(S1)), z2 = min(π(S2)) ∈ Ω = {0, 1, ..., D − 1}, we can write

    1{min(πj(S1)) = min(πj(S2))} = Σ_{i=0}^{D−1} 1{z1 = i} × 1{z2 = i}.

———–

Ref: Li, Shrivastava, Moore, Konig, Hashing Algorithms for Large-Scale Learning, NIPS’11.


Consider D = 5. We can expand numbers into vectors of length 5:

    0 =⇒ [0, 0, 0, 0, 1],  1 =⇒ [0, 0, 0, 1, 0],  2 =⇒ [0, 0, 1, 0, 0],
    3 =⇒ [0, 1, 0, 0, 0],  4 =⇒ [1, 0, 0, 0, 0].

———————

If z1 = 2 and z2 = 3, then

    0 = 1{z1 = z2} = the inner product of [0, 0, 1, 0, 0] and [0, 1, 0, 0, 0].

If z1 = 2 and z2 = 2, then

    1 = 1{z1 = z2} = the inner product of [0, 0, 1, 0, 0] and [0, 0, 1, 0, 0].
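A tiny sketch of this one-hot expansion (D = 5), showing that the indicator 1{z1 = z2} is exactly an inner product:

```python
import numpy as np

def one_hot(z, D):
    v = np.zeros(D)
    v[D - 1 - z] = 1.0        # index chosen to match the layout printed above
    return v

D = 5
print(one_hot(2, D) @ one_hot(3, D))   # 0.0, since z1 != z2
print(one_hot(2, D) @ one_hot(2, D))   # 1.0, since z1 == z2
```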


Linear Learning Algorithms

Binary classification: Dataset {(x_i, y_i)}_{i=1}^n, x_i ∈ R^D, y_i ∈ {−1, 1}.

One can fit an L2-regularized linear SVM:

    min_w  (1/2) w^T w + C Σ_{i=1}^n max{1 − y_i w^T x_i, 0},

or the L2-regularized logistic regression:

    min_w  (1/2) w^T w + C Σ_{i=1}^n log(1 + e^{−y_i w^T x_i}),

where C > 0 is the penalty (regularization) parameter.


Integrating b-Bit Minwise Hashing for (Linear) Learning

Very simple:

1. Apply k independent random permutations to each (binary) feature vector xi and store the lowest b bits of each hashed value. The storage costs nbk bits.

2. At run-time, expand a hashed data point into a vector of length 2^b × k, i.e., concatenate k vectors of length 2^b. The new feature vector has exactly k 1's.


An example with k = 3 and b = 2

For set (vector) S1:

    Original hashed values (k = 3):        12013   25964   20191
    Original binary representations:       010111011101101   110010101101100   100111011011111
    Lowest b = 2 binary digits:            01   00   11
    Corresponding decimal values:          1   0   3
    Expanded 2^b = 4 binary digits:        0010   0001   1000
    New feature vector fed to a solver:    [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0] × 1/√3

The same procedure is applied to sets S2, S3, ... (a code sketch follows below).

——————

Note that (i) solvers require normalization, and (ii) the expansion 0000 is never used.
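A sketch of this expansion for one data point, reproducing the example above (the bit layout within each length-2^b block matches the printed string; any fixed layout works, since only inner products matter):

```python
import numpy as np

def expand_b_bit(hashed_values, b):
    k = len(hashed_values)
    block = 1 << b                                 # 2^b
    x = np.zeros(k * block)
    for j, z in enumerate(hashed_values):
        low = z & (block - 1)                      # lowest b bits
        x[j * block + (block - 1 - low)] = 1.0     # one-hot within the j-th block
    return x / np.sqrt(k)                          # normalization required by solvers

x1 = expand_b_bit([12013, 25964, 20191], b=2)
print(np.round(x1, 3))   # [0, 0, 0.577, 0, 0, 0, 0, 0.577, 0.577, 0, 0, 0]
```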


Datasets and Solver for Linear Learning

This talk presents the experiments on two text datasets.

Dataset            # Examples (n)   # Dims (D)       Avg # Nonzeros   Train / Test
Webspam (24 GB)     350,000          16,609,143       3728             80% / 20%
Rcv1 (200 GB)       781,265          1,010,017,424    12062            50% / 50%

To generate the Rcv1 dataset, we used the original features + all pairwise features + some 3-way features.

We chose LIBLINEAR as the basic solver for linear learning. Note that our method is purely statistical/probabilistic, independent of the underlying procedures.


Experimental Results on Linear SVM and Webspam

• We conducted extensive experiments for a wide range of regularization values C (from 10^−3 to 10^2), with fine spacings in [0.1, 10].

• We experimented with k = 30 to k = 500, and b = 1, 2, 4, 8, and 16.


Testing Accuracy

[Figure: test accuracy (%) vs. C for linear SVM on webspam with k = 200; curves for b = 1, 2, 4, 6, 8, 10, 16]

• Solid curves: b-bit hashing. Dashed (red) curve: the original data.

• Using b ≥ 8 and k ≥ 200 achieves about the same test accuracies as using the original data.

• The results are averaged over 50 runs.

[Figure: test accuracy (%) vs. C for linear SVM on webspam, for k = 50, 100, 150, 200, 300, 500 and b = 1 to 16]

Stability of Testing Accuracy (Standard Deviation)

Our method produces very stable results, especially for b ≥ 4.

[Figure: standard deviation of test accuracy vs. C for linear SVM on webspam, for k = 50 to 500 and b = 1 to 16]

Training Time

[Figure: training time (sec) vs. C for linear SVM on webspam, for k = 50 to 500 and b = 1 to 16]

• The reported times do not include data loading time (which is small for b-bit hashing).

• The original training time is about 100 seconds.

• b-bit minwise hashing needs about 3 ∼ 7 seconds (3 seconds when b = 8).

Testing Time

[Figure: testing time (sec) vs. C for linear SVM on webspam, for k = 50 to 500]

However, here we assume the test data have already been processed.

Experimental Results on L2-Regularized Logistic Regression

Testing Accuracy

[Figure: test accuracy (%) vs. C for logistic regression on webspam, for k = 50 to 500 and b = 1 to 16]

Training Time

[Figure: training time (sec) vs. C for logistic regression on webspam, for k = 50 to 500 and b = 1 to 16]


Comparisons with Other Algorithms

We conducted extensive experiments with the VW algorithm (Weinberger et al., ICML'09; not the VW online learning platform), which has the same variance as random projections.

Consider two sets S1, S2 with f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|, and R = |S1 ∩ S2| / |S1 ∪ S2| = a / (f1 + f2 − a). (Note a = 0 ⇔ R = 0.) Then, from their variances,

    Var(â_VW) ≈ (1/k) (f1 f2 + a^2),

    Var(R̂_MINWISE) = (1/k) R(1 − R),

we can immediately see the significant advantage of minwise hashing, especially when a ≈ 0 (which is common in practice).


Comparing b-Bit Minwise Hashing with VW

8-bit minwise hashing (dashed, red) with k ≥ 200 (and C ≥ 1) achieves about the same test accuracy as VW with k = 10^4 to 10^6.

[Figure: test accuracy (%) vs. k for VW and 8-bit minwise hashing on webspam, for linear SVM and logistic regression, at several values of C]


8-bit hashing is substantially faster than VW (to achieve the same accuracy).

[Figure: training time (sec) vs. k for VW and 8-bit minwise hashing on webspam, for linear SVM and logistic regression, at several values of C]

Experimental Results on Rcv1 (200 GB)

Test accuracy using linear SVM (the original data cannot be trained directly)

[Figure: test accuracy (%) vs. C for linear SVM on rcv1, for k = 50, 100, 150, 200, 300, 500 and b = 1, 2, 4, 8, 12, 16]

Training time using linear SVM

[Figure: training time (sec) vs. C for linear SVM on rcv1, for k = 50 to 500]

Test accuracy using logistic regression

[Figure: test accuracy (%) vs. C for logistic regression on rcv1, for k = 50 to 500 and b = 1, 2, 4, 8, 12, 16]

Training time using logistic regression

[Figure: training time (sec) vs. C for logistic regression on rcv1, for k = 50 to 500]

Comparisons with VW on Rcv1

Test accuracy using linear SVM

[Figure: test accuracy (%) vs. k for VW and b-bit hashing (b = 1, 2, 4, 8, 12, 16) on rcv1, linear SVM, at several values of C]

Test accuracy using logistic regression

[Figure: test accuracy (%) vs. k for VW and b-bit hashing (b = 1, 2, 4, 8, 12, 16) on rcv1, logistic regression, at several values of C]

Training time

[Figure: training time (sec) vs. k for VW and 8-bit hashing on rcv1, for linear SVM and logistic regression, at several values of C]


k-Permutation Hashing: Conclusions

• BigData everywhere. For many operations, including clustering, classification, near neighbor search, etc., exact answers are often not necessary.

• Very sparse random projections (KDD 2006) work as well as dense random projections. They do, however, require many projections (e.g., k = 10000).

• Minwise hashing is a standard procedure in the context of search. b-bit minwise hashing is a substantial improvement, using only b bits per hashed value instead of 64 bits (e.g., a 20-fold reduction in space).

• b-bit minwise hashing provides a simple strategy for efficient linear learning, by expanding the bits into vectors of length 2^b × k (originally 2^64 × k).

• We can build hash tables for achieving sub-linear time near neighbor search.

• It can be substantially more accurate than random projections (and variants).


• Major drawbacks of b-bit minwise hashing:

– It was designed (mostly) for binary data.

– It needs a fairly high-dimensional representation (2^b × k) after expansion.

– Expensive preprocessing and expensive testing for unprocessed data. Parallelization solutions (e.g., based on GPUs) offer good speed, but they are not energy-efficient.


Why Is Minwise Hashing Wasteful?

[Figure: an original binary data matrix with rows S1, S2, S3 over columns 0-15, and the corresponding permuted data matrix with rows π(S1), π(S2), π(S3) under a random permutation π.]

Keep only the minimums, and repeat the process k (e.g., 500) times.
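To make the baseline concrete, here is a minimal Python sketch (my own, not the authors' implementation) of the standard k-permutation scheme: for each of k independent permutations, only the minimum permuted index of the nonzeros is kept. The function name and the small example set are hypothetical.

```python
# A minimal sketch of standard k-permutation minwise hashing for a set S of
# nonzero column indices in {0, ..., D-1}.
import random

def minwise_hash(S, k, D, seed=0):
    """Return k hashed values: for each of k random permutations,
    the smallest permuted index among the nonzeros of S."""
    rng = random.Random(seed)
    hashes = []
    for _ in range(k):
        perm = list(range(D))
        rng.shuffle(perm)                       # one permutation of the D columns
        hashes.append(min(perm[i] for i in S))  # keep only the minimum
    return hashes

# Toy example with a small hypothetical set and D = 16:
print(minwise_hash({2, 4, 7, 13}, k=5, D=16))
```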


One Permutation Hashing

S1, S2, S3 ⊆ Ω = {0, 1, ..., 15} (i.e., D = 16). The figure presents the permuted sets as three binary (0/1) vectors:

π(S1) = {2, 4, 7, 13}, π(S2) = {0, 6, 13}, π(S3) = {0, 1, 10, 12}

[Figure: the three permuted binary vectors π(S1), π(S2), π(S3) over columns 0-15, with the 16 columns divided into k = 4 bins (labeled 1, 2, 3, 4) of width 4.]

One permutation hashing: divide the space Ω evenly into k = 4 bins and

select the smallest nonzero in each bin.
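A short Python sketch (mine, not from the slides) of this step on the example above: the D = 16 columns are split into k = 4 bins of width 4, and the smallest nonzero location in each bin is kept, with None marking an empty bin.

```python
# One permutation hashing on the example above: D = 16, k = 4 bins of width 4.
# For each bin, keep the smallest nonzero (permuted) index, or None if the bin is empty.
def one_perm_hash(permuted_set, D, k):
    width = D // k                      # assumes k divides D, as in the D = 16, k = 4 example
    mins = [None] * k
    for idx in permuted_set:
        j = idx // width                # which bin this nonzero falls into
        if mins[j] is None or idx < mins[j]:
            mins[j] = idx
    return mins

pi_S1 = {2, 4, 7, 13}
pi_S2 = {0, 6, 13}
pi_S3 = {0, 1, 10, 12}
print(one_perm_hash(pi_S1, D=16, k=4))   # [2, 4, None, 13]
print(one_perm_hash(pi_S2, D=16, k=4))   # [0, 6, None, 13]
print(one_perm_hash(pi_S3, D=16, k=4))   # [0, None, 10, 12]
```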


Two major tasks:

• Theoretical analysis and estimators

• Practical strategies to deal with empty bins, should they occur

Ref: P. Li, A. Owen, C-H Zhang, One Permutation Hashing, NIPS’12


Two Definitions

Recall: the space is divided evenly into k bins

# Jointly empty bins:   N_{emp} = \sum_{j=1}^{k} I_{emp,j}

# Matched bins:   N_{mat} = \sum_{j=1}^{k} I_{mat,j}


where I_{emp,j} and I_{mat,j} are defined for the j-th bin as

I_{emp,j} = 1 if both π(S1) and π(S2) are empty in the j-th bin, and 0 otherwise.

I_{mat,j} = 1 if both π(S1) and π(S2) are non-empty in the j-th bin and the smallest element of π(S1) matches the smallest element of π(S2), and 0 otherwise.

———————————

N_{emp} = \sum_{j=1}^{k} I_{emp,j},   N_{mat} = \sum_{j=1}^{k} I_{mat,j}
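As a small illustration (my own sketch), the two counts can be computed directly from the per-bin minima of the k = 4 example above, where None marks an empty bin:

```python
# Count jointly empty bins (N_emp) and matched bins (N_mat) from per-bin minima.
# None marks an empty bin; values are the smallest nonzero index in that bin.
def count_emp_mat(h1, h2):
    n_emp = sum(1 for a, b in zip(h1, h2) if a is None and b is None)
    n_mat = sum(1 for a, b in zip(h1, h2)
                if a is not None and b is not None and a == b)
    return n_emp, n_mat

h1 = [2, 4, None, 13]   # bin minima of pi(S1) = {2, 4, 7, 13}, k = 4
h2 = [0, 6, None, 13]   # bin minima of pi(S2) = {0, 6, 13}
print(count_emp_mat(h1, h2))   # (1, 1): the third bin is jointly empty; the fourth matches (13 == 13)
```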


Results for the Number of Jointly Empty Bins Nemp

Notation :

f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|, f = |S1 ∪ S2| = f1 + f2 − a.

Distribution:

Pr(N_{emp} = j) = \sum_{s=0}^{k-j} (-1)^s \frac{k!}{j!\, s!\, (k-j-s)!} \prod_{t=0}^{f-1} \frac{D\left(1 - \frac{j+s}{k}\right) - t}{D - t}

Expectation:

\frac{E(N_{emp})}{k} = \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j} \le \left(1 - \frac{1}{k}\right)^{f}


Approximation of Nemp in Sparse Data

f = |S1 ∪ S2| = f1 + f2 − a.

\frac{E(N_{emp})}{k} \approx \left(1 - \frac{1}{k}\right)^{f} \approx e^{-f/k}

(the second step uses f \log(1 - 1/k) \approx -f/k when k is large).

f/k = 5 ⟹ E(N_{emp})/k ≈ 0.0067 (negligible).

f/k = 1 ⟹ E(N_{emp})/k ≈ 0.3679 (noticeable, but then why use hashing?).


An Unbiased Estimator

Estimator:   R_{mat} = \frac{N_{mat}}{k - N_{emp}}

Expectation:   E(R_{mat}) = R

Variance:   Var(R_{mat}) < \frac{R(1 - R)}{k}

One-permutation hashing has smaller variance than k-permutation hashing.
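A self-contained Monte Carlo sketch (my own; D, k, the seed, and the two synthetic sets are arbitrary choices for illustration) suggesting that R_mat = N_mat / (k − N_emp) is centered at the true resemblance R = |S1 ∩ S2| / |S1 ∪ S2|:

```python
# Monte Carlo check that R_mat = N_mat / (k - N_emp) estimates the resemblance R.
import random

D, k = 1024, 64
rng = random.Random(0)
S1 = set(rng.sample(range(D), 100))
S2 = set(rng.sample(sorted(S1), 50)) | set(rng.sample(range(D), 50))  # forced overlap with S1
R_true = len(S1 & S2) / len(S1 | S2)

def bin_minima(S, perm, D, k):
    """Smallest permuted index of S in each of the k equal-width bins (None if empty)."""
    width = D // k
    mins = [None] * k
    for i in S:
        p = perm[i]                     # position of element i after the permutation
        j = p // width
        if mins[j] is None or p < mins[j]:
            mins[j] = p
    return mins

trials, acc = 2000, 0.0
for _ in range(trials):
    perm = list(range(D))
    rng.shuffle(perm)
    h1 = bin_minima(S1, perm, D, k)
    h2 = bin_minima(S2, perm, D, k)
    n_emp = sum(a is None and b is None for a, b in zip(h1, h2))
    n_mat = sum(a is not None and b is not None and a == b for a, b in zip(h1, h2))
    acc += n_mat / (k - n_emp)

print(f"true R = {R_true:.4f}   average R_mat = {acc / trials:.4f}")
```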


Empirical Validation

Based on many pairs of vectors, the empirical MSEs (MSE = bias^2 + variance) verify the theory:

[Figure: MSE(R_mat) vs. k (10^1 to 10^5) for three word pairs (RIGHTS–RESERVED, OF–AND, THIS–HAVE); each panel overlays the empirical MSE, the theoretical variance, and k-permutation minwise hashing.]


Summary of Advantages of One Permutation Hashing

[Figure: the permuted binary vectors π(S1), π(S2), π(S3) over columns 0-15, divided into k = 4 bins (same illustration as above).]

• Computationally much more efficient and energy-efficient.

• A parallelization-based solution requires additional hardware and implementation effort.

• Testing unprocessed data is much faster, which is crucial for user-facing applications.

• Implementation is easier from the perspective of random number generation; e.g., storing a permutation vector of length D = 10^9 (4GB) is not a big deal.

• One permutation hashing is a much better matrix sparsification scheme.


One Permutation Hashing for Linear Learning

Almost the same as k-permutation hashing, but we must deal with empty bins.

Two ideas:

1. Zero coding: encode an empty bin as all zeros (e.g., 00000000) in the expanded space.

2. Random coding: encode an empty bin as a random number. This did not work as well.


Statistics of # nonzeros in Webspam Data

[Figure: histogram of the number of nonzeros per sample in the Webspam data; x-axis: # nonzeros (0 to 10000), y-axis: frequency (up to about 4 × 10^4).]

The dataset has 350,000 samples. The average number of nonzeros is about 4000, which should be much larger than k (e.g., 200 to 500).

About 10% (or 2.8%) of the samples have < 500 (or < 200) nonzeros. We must deal with empty bins if we do not want to exclude those data points.


Review the original example with k = 3 permutations and b = 2

For set (vector) S1:

Original hashed values (k = 3): 12013   25964   20191

Original binary representations: 010111011101101   110010101101100   100111011011111

Lowest b = 2 binary digits: 01   00   11

Corresponding decimal values: 1   0   3

Expanded 2^b = 4 binary digits: 0010   0001   1000

New feature vector fed to a solver: [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0] × 1/√3

Same procedures on sets S2, S3, ...

0000 was never used!
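A minimal sketch (mine, not the authors' code) of this expansion: each of the k hashed values keeps its lowest b bits, which are one-hot coded into a block of length 2^b; the k blocks are concatenated and scaled by 1/√k. The slot ordering below is chosen so the blocks come out as 0010, 0001, 1000, matching the example.

```python
# Expand k minwise-hashed values into a 2^b * k feature vector (lowest b bits,
# one-hot coded per permutation), scaled by 1/sqrt(k).
from math import sqrt

def bbit_expand(hashed_values, b):
    k = len(hashed_values)
    dim = 1 << b                       # 2^b slots per permutation
    vec = [0.0] * (dim * k)
    for i, h in enumerate(hashed_values):
        low = h & (dim - 1)            # lowest b bits of the hashed value
        vec[i * dim + (dim - 1 - low)] = 1.0 / sqrt(k)   # matches the 0010 / 0001 / 1000 layout above
    return vec

print(bbit_expand([12013, 25964, 20191], b=2))
# -> [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0] scaled by 1/sqrt(3)
```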


Zero Coding Example

One permutation hashing with k = 4 and b = 2

For set (vector) S1:

Original hashed values (k = 4): 12013   25964   20191   ∗

Original binary representations: 010111011101101   110010101101100   100111011011111   ∗

Lowest b = 2 binary digits: 01   00   11   ∗

Corresponding decimal values: 1   0   3   ∗

Expanded 2^b = 4 binary digits: 0010   0001   1000   0000

New feature vector: [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0] × 1/√(4 − 1)

Here ∗ marks the empty bin, which zero coding expands to 0000; the scaling 1/√(4 − 1) is 1/√(k − N_emp).

Same procedures on sets S2, S3, ...
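Under the same conventions as the previous sketch, zero coding changes only two things: an empty bin (shown as ∗ above, represented as None here) contributes an all-zero block, and the scaling becomes 1/√(k − N_emp):

```python
# Zero coding: empty bins (None) map to an all-zero 2^b block; nonempty bins are
# one-hot coded as before, and the vector is scaled by 1/sqrt(k - N_emp).
from math import sqrt

def bbit_expand_zero_coding(bin_values, b):
    k = len(bin_values)
    dim = 1 << b
    n_emp = sum(v is None for v in bin_values)
    scale = 1.0 / sqrt(k - n_emp)
    vec = [0.0] * (dim * k)
    for i, v in enumerate(bin_values):
        if v is None:                  # empty bin: leave the whole block at zero
            continue
        low = v & (dim - 1)
        vec[i * dim + (dim - 1 - low)] = scale
    return vec

print(bbit_expand_zero_coding([12013, 25964, 20191, None], b=2))
# -> [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0] scaled by 1/sqrt(3)
```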


Experimental Results on Webspam Data

[Figure: Webspam test accuracy (linear SVM) vs. C (10^-3 to 10^2) for k = 256 and k = 512, with b = 1, 2, 4, 6, 8; curves compare the original data, one permutation hashing, and k-permutation hashing.]

When k = 512 (or even 256) and b = 8, b-bit one permutation hashing achieves test accuracy similar to using the original data.

One permutation hashing (zero coding) is even slightly more accurate than k-permutation hashing (at merely 1/k of the processing cost).


One Permutation vs. k-Permutation

Linear SVM

[Figure: Webspam test accuracy (linear SVM) vs. C (10^-3 to 10^2) for k = 32, 64, 256, 512 and b = 1, 2, 4, 6, 8; curves compare the original data, one permutation hashing, and k-permutation hashing.]

Logistic Regression

[Figure: Webspam test accuracy (logistic regression) vs. C (10^-3 to 10^2) for k = 32, 64, 256, 512 and b = 1, 2, 4, 6, 8; same comparison as above.]

————————

The influence of empty bins is not very noticeable, as expected.


One Permutation Hashing: Conclusion

One permutation hashing is at least as accurate as the standard k-permutation

minwise hashing at merely 1/k of the original processing cost.


Testing the Robustness of Zero Coding on 20Newsgroup Data

For the Webspam data, the number of empty bins may be too small to stress-test the algorithm.

The 20Newsgroups (News20) dataset has merely 20,000 samples in about one million dimensions, with on average about 500 nonzeros per sample.

Therefore, News20 serves as a toy example for testing the robustness of our algorithm. In fact, we let k be as large as 4096, i.e., most of the bins are empty.


20Newsgroups: One Permutation vs. k-Permutation (SVM)

[Figure: News20 test accuracy (linear SVM) vs. C (10^-1 to 10^3) for k = 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096 and b = 1, 2, 4, 6, 8; curves compare the original data, one permutation hashing, and k-permutation hashing.]

One permutation hashing with zero coding can reach the original accuracy of 98%.

The original k-permutation scheme only reaches 97.5%, even with k = 4096.


20Newsgroups: One Permutation vs. k-Permutation (Logit)

Logistic Regression

[Figure: News20 test accuracy (logistic regression) vs. C (10^-1 to 10^3) for k = 32, 64, 128, 256, 512, 1024, 2048, 4096 and b = 1, 2, 4, 6, 8; curves compare the original data, one permutation hashing, and k-permutation hashing.]


b-Bit Minwise Hashing for Efficient Near Neighbor Search

Near neighbor search is a much more frequent operation than training an SVM.

The bits from b-bit minwise hashing can be directly used to build hash tables to

enable sub-linear time near neighbor search.

This is an instance of the general family of locality-sensitive hashing (LSH).


An Example of b-Bit Hashing for Near Neighbor Search

We use b = 2 bits and k = 2 permutations to build a hash table indexed from 0000 to 1111, i.e., the table size is 2^{2×2} = 16.

[Figure: two hash tables, each indexed by the 4-bit codes 0000 to 1111; each bucket lists the IDs of the data points whose hashed value equals that index (e.g., 8, 13, 251 or 7, 24, 156), with some buckets empty.]

Then, the data points are placed in the buckets according to their hashed values.

To answer a query, look for near neighbors in the bucket that matches the hash value of the query.

Replicate the hash table (twice in this case) so that good near neighbors missed by one table can still be caught by another.

The final set of retrieved data points is the union of the matched buckets.
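A minimal sketch (my own; the helper names and toy numbers are hypothetical) of this table-based search: each of the L tables keys its buckets by a B = b × k bit code obtained by concatenating the lowest b bits of k hashed values, and a query returns the union of its matching buckets across tables.

```python
# Build L hash tables keyed by B = b*k-bit codes formed by concatenating the
# lowest b bits of k minwise-hashed values; query by taking the union of buckets.
from collections import defaultdict

def code(hashed_values, b):
    """Concatenate the lowest b bits of each hashed value into one integer key."""
    key = 0
    for h in hashed_values:
        key = (key << b) | (h & ((1 << b) - 1))
    return key

def build_tables(all_hashes, b):
    """all_hashes[t][i] = list of k hashed values of data point i under table t's permutations."""
    tables = []
    for per_table in all_hashes:                 # one dict per table (L tables total)
        table = defaultdict(list)
        for point_id, hv in enumerate(per_table):
            table[code(hv, b)].append(point_id)
        tables.append(table)
    return tables

def query(tables, query_hashes, b):
    """Union of the buckets matching the query's code in each of the L tables."""
    candidates = set()
    for table, hv in zip(tables, query_hashes):
        candidates.update(table.get(code(hv, b), []))
    return candidates

# Toy usage: L = 2 tables, k = 2 permutations, b = 2 bits (so B = 4-bit indices).
all_hashes = [
    [[12013, 25964], [20191, 7]],   # table 1: hashed values of data points 0 and 1
    [[25964, 20191], [12013, 8]],   # table 2
]
tables = build_tables(all_hashes, b=2)
print(query(tables, query_hashes=[[12013, 25964], [25964, 20191]], b=2))  # -> {0}
```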


Experiments on the Webspam Dataset

B = b × k: number of bits of the table index (table size 2^B); L: number of tables.

Fractions of retrieved data points

[Figure: fraction of data points evaluated vs. B (8 to 24 bits) on Webspam, for L = 25, 50, 100 tables; curves for b-bit hashing with b = 1, 2, 4 and for SRP.]

SRP (sign random projections) and b-bit hashing (with b = 1, 2) retrieved similar

numbers of data points.

————————

Ref: Shrivastava and Li, Fast Near Neighbor Search in High-Dimensional Binary

Data, ECML’12


Precision-Recall curves (the higher the better) for retrieving top-T data points.

[Figure: precision-recall curves on Webspam for retrieving the top T = 5, 10, 50 data points, with B = 24 and B = 16 bits and L = 100 tables; curves for b-bit hashing (b = 1, 4) and SRP.]

Using L = 100 tables, b-bit hashing considerably outperformed SRP (sign random projections) for index sizes B = 24 and B = 16 bits.