
Page 1

Latent Semantic Indexing (mapping onto a smaller space of latent concepts)

Paolo Ferragina
Dipartimento di Informatica
Università di Pisa

Reading 18

Page 2

Speeding up cosine computation

What if we could take our vectors and “pack” them into fewer dimensions (say 50,000 → 100) while preserving distances? Now: O(nm). Then: O(km + kn), where k << n, m.

Two methods:
• “Latent semantic indexing”
• Random projection

Page 3

A sketch

LSI is data-dependent:
• Create a k-dim subspace by eliminating redundant axes
• Pull together “related” axes – hopefully car and automobile

Random projection is data-independent:
• Choose a k-dim subspace that guarantees good stretching properties with high probability between pairs of points

What about polysemy?

Page 4

Notions from linear algebra

Matrix A, vector v
Matrix transpose (A^t)
Matrix product
Rank
Eigenvalue λ and eigenvector v: Av = λv

Page 5

Overview of LSI

Pre-process docs using a technique from linear algebra called Singular Value Decomposition

Create a new (smaller) vector space

Queries handled (faster) in this new space

Page 6

Singular-Value Decomposition

Recall the m × n matrix of terms × docs, A. A has rank r ≤ m, n.

Define the term-term correlation matrix T = A A^t:
• T is a square, symmetric m × m matrix
• Let P be the m × r matrix of the eigenvectors of T

Define the doc-doc correlation matrix D = A^t A:
• D is a square, symmetric n × n matrix
• Let R be the n × r matrix of the eigenvectors of D

Page 7

A’s decomposition

There exist matrices P (for T, m × r) and R (for D, n × r) formed by orthonormal columns (unit dot-product). It turns out that

A = P Σ R^t

where Σ is the r × r diagonal matrix holding the singular values of A (the square roots of the eigenvalues of T = A A^t) in decreasing order.

A (m × n) = P (m × r) · Σ (r × r) · R^t (r × n)

Page 8

Dimensionality reduction

For some k << r, zero out all but the k biggest eigenvalues in Σ [the choice of k is crucial]. Denote by Σ_k this new version of Σ, having rank k. Typically k is about 100, while r (A’s rank) is > 10,000.

A_k (m × n) = P (m × r) · Σ_k (r × r) · R^t (r × n)

Since all but the first k rows/columns of Σ_k are zero, the last r − k columns of P and the last r − k rows of R^t are useless (they meet only 0-cols/0-rows of Σ_k); effectively

A_k (m × n) = P_k (m × k) · Σ_k (k × k) · R_k^t (k × n)

where each column of R^t corresponds to a document.
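To make the truncation concrete, here is a minimal NumPy sketch; the toy term-document matrix, the choice k = 2, and all variable names are invented for the example.

```python
import numpy as np

# Toy term-document matrix A: m terms (rows) x n docs (columns).
A = np.array([[1., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 1.],
              [1., 0., 0., 1.]])

# Full SVD: A = P @ diag(sigma) @ Rt, singular values in decreasing order.
P, sigma, Rt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep only the k largest singular values
A_k = P[:, :k] @ np.diag(sigma[:k]) @ Rt[:k, :]

# Eckart-Young: A_k is the best rank-k approximation of A,
# and ||A - A_k||_2 equals the (k+1)-th singular value.
print(np.linalg.norm(A - A_k, 2), sigma[k])  # the two values coincide
```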

Page 9

Guarantee

A_k is a pretty good approximation to A: relative distances are (approximately) preserved.

Of all m × n matrices of rank k, A_k is the best approximation to A wrt the following measures:

min_{B: rank(B)=k} ||A − B||_2 = ||A − A_k||_2 = σ_{k+1}

min_{B: rank(B)=k} ||A − B||²_F = ||A − A_k||²_F = σ²_{k+1} + σ²_{k+2} + … + σ²_r

where σ_i denotes the i-th singular value of A and the Frobenius norm is ||A||²_F = σ²_1 + σ²_2 + … + σ²_r.

Page 10

Reduction

X_k = Σ_k R^t is the doc matrix, k × n, hence reduced to k dimensions.

Take the doc-doc correlation matrix: D = A^t A = (P Σ R^t)^t (P Σ R^t) = (Σ R^t)^t (Σ R^t). Approximating Σ with Σ_k, we get A^t A ≈ X_k^t X_k (both are n × n matrices).

We use X_k to define how to project A and q: from A = P Σ R^t and the orthonormality of P’s columns we get Σ R^t = P^t A, so X_k = P_k^t A, where P_k^t is the k × m matrix formed by the first k rows of P^t.

This means that to reduce a doc/query vector it is enough to multiply it by P_k^t.

Cost of sim(q, d), for all d, is O(kn + km) instead of O(mn).

[Recall: R, P are formed by the orthonormal eigenvectors of the matrices D, T.]
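A hedged sketch of this projection step, reusing the toy matrix from the previous example (the query vector is again invented for illustration):

```python
import numpy as np

A = np.array([[1., 0., 1., 0.],   # same toy term-doc matrix as above
              [1., 1., 0., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 1.],
              [1., 0., 0., 1.]])
P, sigma, Rt = np.linalg.svd(A, full_matrices=False)
k = 2

Pk_t = P[:, :k].T                  # P_k^t: k x m projection matrix
X_k = Pk_t @ A                     # all docs reduced to k dims (k x n)

q = np.array([1., 0., 0., 0., 1.])  # toy m-dim query vector
q_k = Pk_t @ q                      # reduce the query: O(km)

# Cosine similarity against all n docs in the reduced space: O(kn).
sims = (X_k.T @ q_k) / (np.linalg.norm(X_k, axis=0) * np.linalg.norm(q_k))
print(sims)
```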

Page 11

Which are the concepts?

The c-th concept = the c-th row of P_k^t (which is k × m). Denote it by P_k^t[c], whose size is m = #terms.

P_k^t[c][i] = strength of association between the c-th concept and the i-th term.

Projected document: d’_j = P_k^t d_j
  d’_j[c] = strength of concept c in d_j

Projected query: q’ = P_k^t q
  q’[c] = strength of concept c in q

Page 12

Random Projections

Paolo Ferragina
Dipartimento di Informatica
Università di Pisa

Slides only!

Page 13

An interesting math result

Lemma (Johnson-Lindenstrauss, ’82). Let P be a set of n distinct points in m dimensions. Given ε > 0, there exists a function f : P → IR^k such that for every pair of points u, v in P it holds:

(1 − ε) ||u − v||² ≤ ||f(u) − f(v)||² ≤ (1 + ε) ||u − v||²

where k = O(ε⁻² log n).

f() is called a JL-embedding. Setting v = 0 we also get a bound on f(u)’s stretching!

Page 14

What about the cosine-distance?

[Formula slide: bounds on the cosine follow by applying the stretching bounds to f(u) and f(v) and substituting; the derivation was shown as images and is not recoverable here.]

Page 15

How to compute a JL-embedding?

Set R = (r_{i,j}) to be a random m × k matrix, where the components are independent random variables with:
• E[r_{i,j}] = 0
• Var[r_{i,j}] = 1

(The admissible distributions were shown in a figure that did not survive transcription; the standard choices satisfying these conditions are the Gaussian N(0,1), or ±1 each with probability 1/2.)
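A minimal sketch of such an embedding, assuming the ±1 (“sign”) distribution and the usual 1/√k scaling; all sizes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

m, k, n = 10_000, 400, 5          # original dims, reduced dims, #points
points = rng.normal(size=(n, m))

# Random sign matrix: entries +1/-1 with prob 1/2, so E = 0, Var = 1.
R = rng.choice([-1.0, 1.0], size=(m, k))

def f(u):
    # JL-embedding: scaling by 1/sqrt(k) preserves squared norms
    # in expectation.
    return (u @ R) / np.sqrt(k)

u, v = points[0], points[1]
print(np.sum((u - v) ** 2), np.sum((f(u) - f(v)) ** 2))  # close, w.h.p.
```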

Page 16

Finally...

Random projection:
• hides large constants: k ∝ (1/ε)² · log n, so it may be large…
• but it is simple and fast to compute

LSI:
• is intuitive and may scale to any k
• optimal under various metrics
• but costly to compute

Page 17

Document duplication (exact or approximate)

Paolo Ferragina
Dipartimento di Informatica
Università di Pisa

Slides only!

Page 18

Duplicate documents

The web is full of duplicated content:
• few exact duplicates
• many cases of near-duplicates, e.g., the Last-modified date is the only difference between two copies of a page

Sec. 19.6

Page 19

Natural Approaches

Fingerprinting: only works for exact matches, slow
• Checksum – no worst-case collision-probability guarantees
• MD5 – cryptographically-secure string hashes

Edit-distance metric for approximate string matching:
• expensive – even for one pair of strings
• impossible – for 10^32 web documents

Random sampling:
• sample substrings (phrases, sentences, etc.)
• hope: similar documents ⇒ similar samples
• but even samples of the same document will differ

Page 20

Exact-Duplicate Detection

Obvious techniques:
• Checksum – no worst-case collision-probability guarantees
• MD5 – cryptographically-secure string hashes, but relatively slow

Karp-Rabin’s scheme:
• Rolling hash: split the doc in many pieces
• Algebraic technique – arithmetic on primes
• Efficient, and with other nice properties…
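A hedged sketch of a Karp-Rabin-style rolling hash over the characters of a document; the base, the prime modulus, and q are arbitrary example choices, not the scheme’s prescribed constants:

```python
def karp_rabin_fingerprints(text, q=8, base=256, prime=(1 << 61) - 1):
    """Fingerprints of all q-grams of `text`, each computed in O(1)
    from the previous one (rolling hash)."""
    if len(text) < q:
        return []
    # Hash of the first q-gram.
    h = 0
    for ch in text[:q]:
        h = (h * base + ord(ch)) % prime
    out = [h]
    top = pow(base, q - 1, prime)  # weight of the outgoing character
    for i in range(q, len(text)):
        h = (h - ord(text[i - q]) * top) % prime  # drop leftmost char
        h = (h * base + ord(text[i])) % prime     # append new char
        out.append(h)
    return out

print(karp_rabin_fingerprints("the web is full of duplicated content")[:3])
```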

Page 21

Near-Duplicate Detection

Problem: given a large collection of documents, identify the near-duplicate documents.

Web search engines face a proliferation of near-duplicate documents:
• Legitimate – mirrors, local copies, updates, …
• Malicious – spam, spider-traps, dynamic URLs, …
• Mistaken – spider errors

30% of web pages are near-duplicates [1997]

Page 22

Desiderata

• Storage: only small sketches of each document
• Computation: the fastest possible
• Stream processing: once the sketch is computed, the source is unavailable
• Error guarantees: at this problem scale, small biases have large impact; we need formal guarantees – heuristics will not do

Page 23

Basic Idea [Broder 1997]

Shingling:
• dissect each document into q-grams (shingles)
• represent documents by shingle-sets
• reduce the problem to set intersection [Jaccard]

Two documents are near-duplicates if their (large) shingle-sets intersect enough.

Page 24

Similarity of Documents

[Figure: Doc_A → shingle-set S_A, Doc_B → shingle-set S_B]

• Jaccard measure – similarity of S_A, S_B:

  sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|

• Claim: A & B are near-duplicates if sim(S_A, S_B) is high
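A small sketch of shingling plus the Jaccard measure (q = 4 and the two strings are invented for the example):

```python
def shingles(text, q=4):
    """Set of character q-grams (shingles) of `text`."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}

def jaccard(sa, sb):
    """sim(S_A, S_B) = |S_A intersect S_B| / |S_A union S_B|."""
    return len(sa & sb) / len(sa | sb)

sa = shingles("the cat sat on the mat")
sb = shingles("the cat sat on a mat")
print(jaccard(sa, sb))  # close to 1 for near-duplicates
```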

Page 25

Basic Idea [Broder 1997] (cont.)

Recap: shingling reduces near-duplicate detection to set intersection [Jaccard].

We need to cope with “Set Intersection”:
• fingerprints of shingles (for space/time efficiency)
• min-hash to estimate intersection sizes (for further time and space efficiency)

Page 26

[Pipeline: Doc → shingling → multiset of shingles → fingerprint → multiset of fingerprints; documents become sets of 64-bit fingerprints]

Fingerprints:
• Use Karp-Rabin fingerprints over q-gram shingles (of 8q bits)
• Fingerprint space [0, …, U−1]
• In practice, use 64-bit fingerprints, i.e., U = 2^64
• Prob[collision] ≈ (8q)/2^64 << 1

This reduces the space for storing the multi-sets, and the time to intersect them, but...

Page 27

Speeding-up: sketch of a document

Intersecting shingle-sets is too costly.

Create a “sketch vector” (of size ~200) for each document, from its shingle-set.

Documents that share ≥ t (say 80%) corresponding vector elements are near-duplicates.

Sec. 19.6

Page 28

Sketching by Min-Hashing

Consider S_A, S_B ⊆ P.

Pick a random permutation π of P (such as x → ax + b mod |P|).

Define α = π⁻¹(min{π(S_A)}) and β = π⁻¹(min{π(S_B)}): the minimal elements of S_A and S_B under the permutation π.

Lemma: P[α = β] = |S_A ∩ S_B| / |S_A ∪ S_B|

Page 29

Strengthening it…

Similarity sketch sk(A) = the k minimal elements under π(S_A).
• Is k fixed, or a fixed ratio of |S_A|, |S_B|?
• We might also take k permutations and the min of each.

Similarity sketches sk(A):
• Succinct representation of the fingerprint set S_A
• Allow efficient estimation of sim(S_A, S_B)
• Basic idea is to use min-hash of fingerprints
• Note: we can reduce the variance by using a larger k
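A hedged sketch of the k-permutations variant (one min per permutation), using linear hash functions as stand-ins for random permutations; all parameters are illustrative. Using the same seed for both documents means both sketches use the same “permutations”, so positions are comparable:

```python
import random

def minhash_sketch(fingerprints, k=200, prime=(1 << 61) - 1, seed=42):
    """k min-hash values of a set, one per 'permutation'
    pi_i(x) = (a_i * x + b_i) mod prime."""
    rnd = random.Random(seed)
    coeffs = [(rnd.randrange(1, prime), rnd.randrange(prime))
              for _ in range(k)]
    return [min((a * x + b) % prime for x in fingerprints)
            for a, b in coeffs]

def estimate_sim(sk_a, sk_b):
    """Fraction of matching positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sk_a, sk_b)) / len(sk_a)

sa = {hash(s) for s in ["the cat", "cat sat", "sat on", "on the", "the mat"]}
sb = {hash(s) for s in ["the cat", "cat sat", "sat on", "on a", "a mat"]}
print(estimate_sim(minhash_sketch(sa), minhash_sketch(sb)))
```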

Page 30

Computing Sketch[i] for Doc1

[Figure: Document 1’s shingle fingerprints on the number line [0, 2^64)]
• Start with the 64-bit fingerprints f(shingles)
• Permute the number line [0, 2^64) with π_i
• Pick the min value

Sec. 19.6

Page 31

Test if Doc1.Sketch[i] = Doc2.Sketch[i]

[Figure: the permuted fingerprints of Document 1 and Document 2 on the number line [0, 2^64); α and β mark the respective minima]

Are these equal? Test for 200 random permutations: π_1, π_2, …, π_200.

Sec. 19.6

Page 32

However…

[Figure: Document 1 and Document 2 on the number line [0, 2^64)]

α = β iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection).

Claim: this happens with probability Size_of_intersection / Size_of_union.

Sec. 19.6

Page 33

Sum up…

Brute-force approach: compare sk(A) vs. sk(B) for all pairs of documents A and B.

Locality-sensitive hashing (LSH):
• Compute sk(A) for each document A
• Use LSH over all sketches, briefly:
  • Take h elements of sk(A) as an ID (may induce false positives)
  • Create t IDs (to reduce the false negatives)
  • If one ID matches another one (wrt the same h-selection), then the corresponding docs are probably near-duplicates; hence compare them.
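A hedged sketch of this banding idea: each of t bands of h sketch positions becomes an ID, and documents sharing an ID become candidate pairs. The function name and parameters are invented; with h = 4 and t = 50 it consumes exactly the 200-element sketches mentioned above:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(sketches, h=4, t=50):
    """sketches: dict doc_id -> min-hash sketch (list of length >= h*t).
    Returns pairs of docs sharing at least one band-ID."""
    buckets = defaultdict(set)
    for doc, sk in sketches.items():
        for band in range(t):
            band_id = (band, tuple(sk[band * h:(band + 1) * h]))
            buckets[band_id].add(doc)
    pairs = set()
    for docs in buckets.values():
        pairs.update(combinations(sorted(docs), 2))
    return pairs  # verify each candidate pair with the full sketches
```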

Page 34

Search Engines

“Semantic” searches?

Page 36

Vector Space model (classical approach)

“Diego Maradona won against Mexico”

Term vector over the dictionary of terms:

  against   2.2
  Diego     5.1
  Maradona  9.1
  Mexico    1.0
  won       0.1

Similarity(v, w) ≈ cos(α)  [figure: vectors v and w, and the angle α between them, in the term space t1, t2, t3]

Mainly term-based: polysemy and synonymy issues.

Page 37


A new approach: Massive graphs of entities and relations

May 2012

Page 38

A typical issue: polysemy

“the paparazzi photographed the star”
“the astronomer photographed the star”

Page 39

Another issue: synonymy

“He is using Microsoft’s browser”
“He is a fan of Internet Explorer”

Page 40

http://tagme.di.unipi.it

Page 41

“Diego Maradona won against Mexico”

Candidate senses per anchor:
• “won” → Korean won, Win-loss record, Only won, … (or No Annotation)
• “Maradona” → Diego A. Maradona, Diego Maradona jr., Maradona Stadium, Maradona Film, …
• “Mexico” → Mexico nation, Mexico state, Mexico football team, Mexico baseball team, …

TAGME: PARSING → DISAMBIGUATION by a voting scheme → PRUNING by 2 simple features (the ρ score)

Page 42

Why is it more powerful?

“obama asks iran for RQ-170 sentinel drone back”
→ Barack Obama, Iran, Lockheed Martin RQ-170 Sentinel

“us president issues Ahmadinejad ultimatum”
→ President of the United States, Mahmoud Ahmadinejad, Ultimatum

Page 43

Text as a sub-graph of topics

[Figure: a graph whose nodes are the topics Barack Obama, Iran, RQ-170 drone, President of the United States, Mahmoud Ahmadinejad, Ultimatum]

Page 44

Text as a sub-graph of topics

[Figure: the same topic graph, with edges weighted by any relatedness measure over the graph, e.g. [Milne & Witten, 2008]]

Graph analysis makes it possible to find similarities between texts and entities even when they do not match syntactically, i.e., at the concept level.
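As a hedged illustration, the [Milne & Witten, 2008] link-based relatedness can be computed from the sets of Wikipedia pages linking to each entity; the tiny in-link sets below are fabricated for the example:

```python
import math

def mw_relatedness(in_a, in_b, n_pages):
    """Milne & Witten (2008): relatedness of two entities from the
    sets of pages linking to each (in-links), out of n_pages total."""
    inter = len(in_a & in_b)
    if inter == 0:
        return 0.0
    d = (math.log(max(len(in_a), len(in_b))) - math.log(inter)) / \
        (math.log(n_pages) - math.log(min(len(in_a), len(in_b))))
    return max(0.0, 1.0 - d)

# Toy in-link sets (page ids) for two entities.
print(mw_relatedness({1, 2, 3, 4}, {3, 4, 5}, n_pages=10_000))
```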

Page 47

Search Results Clustering

TOPICS: Jaguar Cars, Panthera Onca, Mac OS X, Atari Jaguar, Jacksonville Jags, Fender Jaguar, …

Page 48

Paper at ACM WSDM 2012
Paper at ECIR 2012
Paper at IEEE Software 2012

Releasing open-source at http://acube.di.unipi.it/tagme

Please design your killer app...