
Cache-Conscious Performance Optimization for

Similarity Search

Maha Alabduljalil, Xun Tang, Tao Yang

Department of Computer Science, University of California at Santa Barbara

36th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2013)

All Pairs Similarity Search (APSS)

• Definition: finding all pairs of objects whose similarity is at or above a threshold τ:

  Sim(di, dj) = cos(di, dj) ≥ τ

• Application examples:
  • Collaborative filtering.
  • Spam and near-duplicate detection.
  • Image search.
  • Query suggestions.

• Motivation: APSS is still time-consuming for large datasets.
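A brute-force reading of the definition can be sketched in a few lines of Python (the documents and threshold below are made-up illustrations, not the paper's data):

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight} dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def apss(vectors, tau):
    """Return all pairs (i, j) with cos(d_i, d_j) >= tau."""
    return [(i, j)
            for i in range(len(vectors))
            for j in range(i + 1, len(vectors))
            if cosine(vectors[i], vectors[j]) >= tau]

docs = [{"a": 1.0, "b": 1.0},           # near-duplicate of the next document
        {"a": 1.0, "b": 1.0, "c": 0.1},
        {"c": 1.0}]
print(apss(docs, 0.9))  # [(0, 1)]
```

The quadratic pair loop is exactly why APSS becomes expensive at scale, which motivates the filtering and partitioning work below.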

Previous Work

• Approaches to speed up APSS:
  • Exact APSS:
    – Dynamic computation filtering [Bayardo et al. WWW'07]
    – Inverted indexing [Arasu et al. VLDB'06]
    – Parallelization with MapReduce [Lin SIGIR'09]
    – Partition-based similarity comparison [Alabduljalil et al. WSDM'13]
  • Approximate APSS via LSH: trades precision against recall and adds redundant computation.
• Approaches that exploit the memory hierarchy:
  • General query processing [Manegold VLDB'02]
  • Other computing problems.

Baseline: Partition-based Similarity Search (PSS) [WSDM'13]

[Figure: data is first partitioned with dissimilarity detection, then similarity comparison runs as parallel tasks.]

PSS Task

Memory areas: S = vectors owned, B = other vectors, C = temporary.

Task steps:
  Read the assigned partition into area S.
  Repeat:
    Read some vectors vi from other partitions into B.
    Compare vi with S.
    Output similar vector pairs.
  Until all other potentially similar vectors have been compared.
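The task loop above can be sketched as follows, assuming sparse vectors stored as {feature: weight} dicts that are already unit-normalized, so cosine similarity reduces to a dot product (names and data are illustrative):

```python
def pss_task(S, other_vectors, tau):
    """One PSS task: keep the owned partition in area S and stream the
    potentially similar vectors of other partitions through area B."""
    results = []
    for dj_id, dj in other_vectors:            # area B: visiting vector
        for di_id, di in S:                    # area S: owned vectors
            # area C: temporary score for the current pair
            sim = sum(w * dj[t] for t, w in di.items() if t in dj)
            if sim >= tau:
                results.append((di_id, dj_id))
    return results

# Unit-normalized sparse vectors (made-up data).
S = [("d0", {"a": 0.8, "b": 0.6})]
others = [("d1", {"a": 1.0}), ("d2", {"b": 1.0})]
print(pss_task(S, others, 0.7))  # [('d0', 'd1')]
```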

Focus and Contribution

• Contribution: analyze the memory-hierarchy behavior of PSS tasks and develop new data layout/traversal techniques for speedup:
  ① Splitting data blocks to fit in cache.
  ② Coalescing: reading a block of vectors from other partitions and processing them together.
• Algorithms:
  • Baseline: PSS [WSDM'13].
  • Cache-conscious designs: PSS1 & PSS2.

PROBLEM 1: PSS area S is too big to fit in cache

[Figure: area S holds an inverted index of the owned vectors plus the accumulator for S, next to area B (other vectors) and area C; S is too long to fit in cache.]

PSS1: Cache-conscious data splitting

[Figure: after splitting, S becomes splits S1, S2, …, Sq, each compared against B with its own accumulator in C. What split size should be used?]

PSS1 Task

Read S and divide it into many splits.
Read other vectors into B.
For each split Sx:
  Compare(Sx, B)
Output similarity scores.

Compare(Sx, B):
  for di in Sx:
    for dj in B:
      for each shared feature t:
        sim(di, dj) += wi,t * wj,t
        if sim(di, dj) + maxwdi * sumdj < τ then
          skip dj (dynamic filtering: the partial score plus an upper bound on the remaining features cannot reach τ)
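A minimal sketch of the split-and-compare structure, assuming unit-normalized sparse vectors; for brevity the filtering bound (sim(di, dj) ≤ maxwdi · sumdj) is applied once before the dot product rather than during accumulation as in the slide:

```python
def compare_split(Sx, B, tau):
    """Compare one split of S against the vectors in B."""
    out = []
    for di_id, di in Sx:
        maxw_di = max(di.values())             # largest weight in di
        for dj_id, dj in B:
            sum_dj = sum(dj.values())
            # upper bound: sim(di, dj) <= maxw_di * sum_dj, so this pair
            # can be skipped without computing the dot product
            if maxw_di * sum_dj < tau:
                continue
            sim = sum(w * dj[t] for t, w in di.items() if t in dj)
            if sim >= tau:
                out.append((di_id, dj_id))
    return out

def pss1_task(S, B, tau, split_size):
    """PSS1: process S as cache-sized splits S1..Sq."""
    results = []
    for start in range(0, len(S), split_size):
        results += compare_split(S[start:start + split_size], B, tau)
    return results

S = [("d0", {"a": 0.9, "b": 0.1})]
B = [("d1", {"a": 0.9}), ("d2", {"b": 0.2})]   # d2 is pruned by the bound
print(pss1_task(S, B, 0.5, split_size=1))  # [('d0', 'd1')]
```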

Modeling Memory/Cache Access of PSS1

[Figure: areas Si, B, and C are touched by the inner-loop operations
  sim(di, dj) += wi,t * wj,t
  if sim(di, dj) + maxwdi * sumdj < τ then …]

Total number of data accesses:
  D0 = D0(Si) + D0(B) + D0(C)

Cache misses and data access time

Memory and cache access counts:
  D0: total memory data accesses.
  D1: accesses missed at L1.
  D2: accesses missed at L2.
  D3: accesses missed at L3.

Cache/memory latencies:
  δi: access time at cache level i.
  δmem: access time in memory.

Total data access time = (D0 - D1)δ1 + (D1 - D2)δ2 + (D2 - D3)δ3 + D3 δmem
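The model translates directly into code; the miss counts and latencies below are illustrative assumptions, not measurements:

```python
def data_access_time(D, delta, delta_mem):
    """Total data access time from the cache-miss model.

    D = [D0, D1, D2, D3]: total accesses and misses at L1, L2, L3.
    delta = [d1, d2, d3]: hit latencies at L1, L2, L3 (cycles).
    delta_mem: memory latency (cycles).
    """
    D0, D1, D2, D3 = D
    return ((D0 - D1) * delta[0] +     # hits served by L1
            (D1 - D2) * delta[1] +     # misses caught by L2
            (D2 - D3) * delta[2] +     # misses caught by L3
            D3 * delta_mem)            # misses going to memory

# Example: 1M accesses, 5% miss L1, 1% miss L2, 0.2% miss L3 (assumed).
cycles = data_access_time([1_000_000, 50_000, 10_000, 2_000],
                          delta=[2, 8, 35], delta_mem=200)
print(cycles)  # 2900000
```

Even with only 0.2% of accesses reaching memory, the memory term contributes a sizable share of the total, which is why reducing slow-memory accesses pays off.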

Each term of the formula is weighted by where the data is found:
  Found in L1: δ1 ≈ 2 cycles.
  Found in L2: δ2 ≈ 6-10 cycles.
  Found in L3: δ3 ≈ 30-40 cycles.
  Found in memory: δmem ≈ 100-300 cycles.

Actual vs. Predicted

Avg. task time ≈ #features × (lookup + multiply + add) + access_mem

RECALL: split size s

[Figure repeated from PSS1: S is divided into splits S1, S2, …, Sq of size s, each with its own accumulator in C.]

Ratio of Data Access to Computation

Avg. task time ≈ #features × (lookup + add + multiply) + access_mem

[Figure: computation time and data access time as functions of split size s.]
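One simple way to act on this tradeoff is to pick the largest split size s such that a split of S plus its accumulator fits in the cache budget; all of the sizes below are made-up assumptions for illustration:

```python
def max_split_size(cache_bytes, bytes_per_vector, accumulator_bytes_per_vector):
    """Largest split size s such that one split of S plus its accumulator
    fits in the given cache budget."""
    return cache_bytes // (bytes_per_vector + accumulator_bytes_per_vector)

# Example: a 2 MB cache budget, ~4 KB per sparse vector in the split,
# ~1 KB of accumulator state per vector (all numbers are assumptions).
s = max_split_size(2 * 1024 * 1024, 4096, 1024)
print(s)  # 409
```

In the paper this choice is guided by the cost model rather than a single formula, but the fit-in-cache constraint is the core idea.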

PSS2: Vector coalescing

• Issues:
  • PSS1 focuses on splitting S to fit into cache.
  • PSS1 does not consider cache reuse to improve temporal locality in memory areas B and C.
• Solution: coalesce multiple vectors in B and process them together.

PSS2: Example of improved locality

[Figure: striped areas of Si, B, and C remain cache-resident while a coalesced block of B is processed.]
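The coalescing idea can be sketched as follows: build a small ("size-controlled") inverted index over each block of b vectors from B, then make one pass over the split Si per block, so Si and the accumulator stay warm in cache. This is an illustrative sketch under the same unit-normalized sparse-vector assumption, not the paper's implementation:

```python
from collections import defaultdict

def pss2_compare(Si, B, tau, b):
    """PSS2 sketch: coalesce b vectors of B and index them together."""
    out = []
    for start in range(0, len(B), b):
        block = B[start:start + b]
        # small inverted index over just this block (size-controlled)
        index = defaultdict(list)               # feature -> [(j, weight)]
        for j, (dj_id, dj) in enumerate(block):
            for t, w in dj.items():
                index[t].append((j, w))
        for di_id, di in Si:                    # one pass over Si per block,
            acc = [0.0] * len(block)            # reused accumulator (area C)
            for t, wi in di.items():
                for j, wj in index.get(t, []):
                    acc[j] += wi * wj
            for j, score in enumerate(acc):
                if score >= tau:
                    out.append((di_id, block[j][0]))
    return out

Si = [("d0", {"a": 0.8, "b": 0.6})]
B = [("d1", {"a": 1.0}), ("d2", {"b": 1.0}), ("d3", {"a": 0.6, "b": 0.8})]
print(pss2_compare(Si, B, 0.7, b=2))  # [('d0', 'd1'), ('d0', 'd3')]
```

Compared with visiting B one vector at a time, each feature of Si fetched into cache is reused across all b coalesced vectors, which is the temporal-locality gain PSS2 targets.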

Evaluation

• Implementation: Hadoop MapReduce.
• Objectives:
  • Effectiveness of PSS1 and PSS2 over PSS.
  • Benefits of modeling.
• Datasets: Twitter, ClueWeb, Enron emails, YahooMusic, Google news.
• Preprocessing:
  • Stopword removal + df-cut.
  • Static partitioning for dissimilarity detection.

Improvement Ratio of PSS1, PSS2 over PSS

[Figure: improvement ratios across datasets; up to 2.7x.]

RECALL: coalescing size b

[Figures: a coalesced block of b vectors from B compared against split Si, and the average number of shared features (avg. # of sharing ≈ 2).]

Overall performance

[Figure: overall performance on Clueweb.]

Impact of split size s in PSS1

[Figure: task time vs. split size s on the Clueweb, Twitter, and Emails datasets.]

RECALL: split size s & coalescing size b

[Figure: split Si of size s is compared against coalesced blocks of b vectors from B.]

Effect of s & b on PSS2 performance (Twitter)

[Figure: task time over combinations of s and b; the fastest configuration is marked.]

Conclusions

• Splitting hosted partitions to fit into cache reduces slow memory data access (PSS1)

• Coalescing vectors with size-controlled inverted indexing can improve the temporal locality of visited data.(PSS2)

• Cost modeling for memory hierarchy access is a guidance to optimize parameter setting.

• Experiments show cache-conscious design can be upto 2.74x as fast as the cache-oblivious baseline.