26
1 Low-Support, High- Correlation Finding rare, but very similar items

1 Low-Support, High-Correlation Finding rare, but very similar items

Embed Size (px)

Citation preview

Page 1: 1 Low-Support, High-Correlation Finding rare, but very similar items

1

Low-Support, High-Correlation

Finding rare, but very similar items

Page 2: 1 Low-Support, High-Correlation Finding rare, but very similar items

2

Assumptions

1. Number of items allows a small amount of main-memory/item.

2. Too many items to store anything in main-memory for each pair of items.

3. Too many baskets to store anything in main memory for each basket.

4. Data is very sparse: it is rare for an item to be in a basket.

Page 3: 1 Low-Support, High-Correlation Finding rare, but very similar items

3

Applications

While marketing may require high-support, or there’s no money to be made, mining customer behavior is often based on correlation, rather than support. Example: Few customers buy Handel’s

Watermusick, but of those who do, 20% buy Bach’s Brandenburg Concertos.

Page 4: 1 Low-Support, High-Correlation Finding rare, but very similar items

4

Matrix Representation

Columns = items. Baskets = rows. Entry (r , c ) = 1 if item c is in

basket r ; = 0 if not. Assume matrix is almost all 0’s.

Page 5: 1 Low-Support, High-Correlation Finding rare, but very similar items

5

In Matrix Formm c p b j

{m,c,b} 1 1 0 1 0{m,p,b} 1 0 1 1 0{m,b} 1 0 0 1 0{c,j} 0 1 0 0 1{m,p,j} 1 0 1 0 1{m,c,b,j} 1 1 0 1 1{c,b,j} 0 1 0 1 1{c,b} 0 1 0 1 0

Page 6: 1 Low-Support, High-Correlation Finding rare, but very similar items

6

Similarity of Columns

Think of a column as the set of rows in which it has 1.

The similarity of columns C1 and C2, sim (C1,C2), is the ratio of the sizes of the intersection and union of C1 and C2. (Jaccard measure)

Goal of finding correlated columns becomes finding similar columns.

Page 7: 1 Low-Support, High-Correlation Finding rare, but very similar items

7

Example

C1 C20 11 01 1 sim (C1, C2) =0 0 2/5 = 0.41 10 1

Page 8: 1 Low-Support, High-Correlation Finding rare, but very similar items

8

Signatures

Key idea: “hash” each column C to a small signature Sig (C), such that:1. Sig (C) is small enough that we

can fit a signature in main memory for each column.

2. Sim (C1, C2) is the same as the “similarity” of Sig (C1) and Sig (C2).

Page 9: 1 Low-Support, High-Correlation Finding rare, but very similar items

9

An Idea That Doesn’t Work

Pick 100 rows at random, and let the signature of column C be the 100 bits of C in those rows.

Because the matrix is sparse, many columns would have 00. . .0 as a signature, yet be very dissimilar because their 1’s are in different rows.

Page 10: 1 Low-Support, High-Correlation Finding rare, but very similar items

10

Four Types of Rows Given columns C1 and C2, rows may be

classified as:C1 C2

a 1 1b 1 0c 0 1d 0 0

Also, a = # rows of type a , etc. Note Sim (C1, C2) = a /(a +b +c ).

Page 11: 1 Low-Support, High-Correlation Finding rare, but very similar items

11

Min Hashing

Imagine the rows permuted randomly.

Define “hash” function h (C ) = the number of the first (in the permuted order) row in which column C has 1.

Page 12: 1 Low-Support, High-Correlation Finding rare, but very similar items

12

Surprising Property

The probability (over all permutations of the rows) that h (C1) = h (C2) is the same as Sim (C1, C2).

Both are a /(a +b +c )! Why?

Look down columns C1 and C2 until we see a 1.

If it’s a type a row, then h (C1) = h (C2). If a type b or c row, then not.

Page 13: 1 Low-Support, High-Correlation Finding rare, but very similar items

13

Min-Hash Signatures Pick (say) 100 random permutations

of the rows. Let Sig (C) = the list of 100 row

numbers that are the first rows with 1 in column C, for each permutation.

Similarity of signatures = fraction of permutations for which minhash values agree = (expected) similarity of columns.

Page 14: 1 Low-Support, High-Correlation Finding rare, but very similar items

14

Example C1 C2 C31 1 0 12 0 1 13 1 0 04 1 0 15 0 1 0

S1 S2 S3Perm 1 = (12345) 1 2 1Perm 2 = (54321) 4 5 4Perm 3 = (34512) 3 5 4

Similarities: 1-2 1-3 2-3Col.-Col. 0 0.5 0.25Sig.-Sig. 0 0.67 0

Page 15: 1 Low-Support, High-Correlation Finding rare, but very similar items

15

Important Trick Don’t actually permute the rows.

The number of passes would be prohibitive.

Rather, in one pass through the data:1. Pick (say) 100 hash functions.2. For each column and each hash function,

keep a “slot” for that min-hash value.3. For each row r , and for each column c with

1 in row r , and for each hash function h do: if h(r ) is a smaller value than slot(h ,c), replace that slot by h(r ).

Page 16: 1 Low-Support, High-Correlation Finding rare, but very similar items

16

Example

Row C1 C21 1 02 0 13 1 14 1 05 0 1

h(x) = x mod 5g(x) = 2x+1 mod 5

h(1) = 1 1 -g(1) = 3 3 -

h(2) = 2 1 2g(2) = 0 3 0

h(3) = 3 1 2g(3) = 2 2 0

h(4) = 4 1 2g(4) = 4 2 0

h(5) = 0 1 0g(5) = 1 2 0

Page 17: 1 Low-Support, High-Correlation Finding rare, but very similar items

17

Summary

Finding frequent pairs: A-priori --> PCY (hashing) -->

multistage. Finding all frequent itemsets Finding similar pairs:

Minhash

Page 18: 1 Low-Support, High-Correlation Finding rare, but very similar items

18

Public-Key Cryptosystems

Page 19: 1 Low-Support, High-Correlation Finding rare, but very similar items

19

Digital Cash

Page 20: 1 Low-Support, High-Correlation Finding rare, but very similar items

20

Blind signatures Bob has public key e, a private key d, and a

public modulus, n. Alice wants Bob blindly to sign message m. Alice chooses a value k, between 1 and n.

Then she blinds m by computing: t=mke(mod n) Bob signs t; td=(mke)d(mod n) Alice unblinds td by computing s= td/k(mod n) The result is s=md(mod n) True since: td=(mke)d=mdk(mod n), so td/k=mdxk/k=md(mod n)

Page 21: 1 Low-Support, High-Correlation Finding rare, but very similar items

21

More advanced scheme

Cash should be anonymous, so in parallel of preventing double spending, we wish also that A’s identity (if follows the protocol) will be kept secret.

We use blind signatures

Page 22: 1 Low-Support, High-Correlation Finding rare, but very similar items

22

Withdrawl

Let I be A’s identity description. 1. I= I(l1) I(r1) =I(l2) I(r2) =……….=I(l100)I(r100)

where each I(l) is a random string.2. Set M=[“$1”, randomseq#,h(I(l1)), h(I(r1)),…….h(I(l100)),h(I(r100))] where random seq# is generated by A.3. Genrate a blinding of M, B. 4. Repeat 1—3 100 times to get B1,….B100, for

M1…..M100.

Page 23: 1 Low-Support, High-Correlation Finding rare, but very similar items

23

Withdrawl (cont.)

5. A sends B1,….B100 to the bank (B). 6. B picks 99 out of the 100 (cut and choose) and ask A to unblind these 99 messages, and

prove they were built correctly. 7. B signs the only unopen blinded message (if all

rest were fine) -- Bk

8. A gets the signed Bk and unblind in order to get the signature on Mk

Alice now has a coin: [M,sigB(M)]

Page 24: 1 Low-Support, High-Correlation Finding rare, but very similar items

24

Spending

1. AV buy for $1 , [M,sigB(M)] 2. VA verifies B’s signature Pick r{0,1}100 randomly and send to Alice. 3. AV Set R=[I1,…..,I100]

where Ik=I(lk) if rk=0

and Ik=I(rk) if rk=1 and send R to V.4. VA V verifies that R matches hashes, and (if so) release the good (notice that V knows nothing about who Alice is!)

Page 25: 1 Low-Support, High-Correlation Finding rare, but very similar items

25

Deposit protocol

1. VB deposit $1 , [M,sigB(M),r,R]

2. B verifies its signature and that R=[I1,…..,I100] matches hashes in the coin. 3. B check if seq# from the coin is already spent. If it is not spent – ok. If the coin has been spent, let coin’= [M,sigB(M),r’,R’] be the coin in the database. Who is guity? V or A? If r=r’ then V is guilty; otherwise A is guilty. But if A is guilty, then who is he/she?

Page 26: 1 Low-Support, High-Correlation Finding rare, but very similar items

26

Deposit protocol (cont.)

If r=(r1,….,r100)r’=(r’1,….,r’100)

then ri r’i for some coordinate i,

but then Ii I’I can be computed and expose Alice.