DBSocial 2013, New York

Scalable, Continuous Tracking of Tag Co-Occurrences between Short Sets using (Almost) Disjoint Tag Partitions

Foteini Alvanaki, Sebastian Michel

Excellence Cluster on Multimodal Computing and Interaction (MMCI)

Motivation: enBlogue (1)

[Figure: a stream of documents, each annotated with a set of hash-tags: {#flood, #Lourdes}, {#Algeria, Stratfor}, {#Asanz, #Wikileaks}, {#obamainBerlin, #Merkel}, {#Twisted, ABCFamily}, {#NSA, #Orwell}, {#Kim, #Kanye}, {#Kim, #Baby}, {#Bieber, #NBAFinals}, {#Rihanna, #Bieber, #Youtube}, {#HeatNation, #NBAFinals}, {#obama, #berlin}. Sets such as {#Kim, #Baby} and {#flood, #Lourdes} occur repeatedly.]

• enBlogue: Identifies emergent topics
• Input: A stream of documents annotated with hash-tags (e.g. Tweets)
• Restricts the focus to the more recent documents using a sliding time window

Motivation: enBlogue (2)

[Figure: the stream of hash-tag-annotated documents from the previous slide, e.g. {#Kim, #Baby}, {#Bieber, #NBAFinals}, {#flood, #Lourdes}, {#obama, #berlin}.]

• Tracks the correlation of co-occurring hash-tags over time
• Reports on unexpected changes in the correlation

[Figure: correlation of the tag pair {#Kim, #Baby} plotted over time.]

Jaccard Coefficient

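• $T$: the set of ids of the documents annotated with tag $t$

• Pair of tags $\{t_1, t_2\}$:

$$J(t_1, t_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$$

• Set of $n$ tags $\{t_1, t_2, \ldots, t_n\}$:

$$J(t_1, \ldots, t_n) = \frac{\left|\bigcap_{i=1}^{n} T_i\right|}{\left|\bigcup_{i=1}^{n} T_i\right|}$$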
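As a minimal sketch of the definition above, the coefficient can be computed directly from the per-tag sets of document ids. The function name `jaccard` and the sample ids are illustrative assumptions, not taken from the slides.

```python
def jaccard(doc_id_sets):
    """Jaccard coefficient of n tags, given one set of document ids per tag."""
    sets = [set(s) for s in doc_id_sets]
    if not sets:
        return 0.0
    union = set().union(*sets)
    if not union:
        return 0.0  # convention chosen here for tags without any documents
    intersection = set.intersection(*sets)
    return len(intersection) / len(union)


# Hypothetical document ids per tag.
T1 = {1, 2, 3}
T2 = {2, 3, 4}
T3 = {3, 4, 5}

pair = jaccard([T1, T2])        # 2 shared ids out of 4 distinct ids
triple = jaccard([T1, T2, T3])  # 1 shared id out of 5 distinct ids
```

Storing full id sets per tag is only practical for small inputs; the following slides replace them with counters.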

Jaccard Coefficient Computation

• Maintain counters for all subsets of co-occurring tags

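Example: the documents {a, b, c} and {b, c, d} give the co-occurring subsets {a, b}, {a, c}, {b, c}, {a, b, c}, {c, d}, {b, d}, {b, c, d}, each with a union counter and an intersection counter:

• {a, b}: $|A \cup B|$, $|A \cap B|$
• {a, c}: $|A \cup C|$, $|A \cap C|$
• {b, c}: $|B \cup C|$, $|B \cap C|$
• {c, d}: $|C \cup D|$, $|C \cap D|$
• {b, d}: $|B \cup D|$, $|B \cap D|$
• {a, b, c}: $|A \cup B \cup C|$, $|A \cap B \cap C|$
• {b, c, d}: $|B \cup C \cup D|$, $|B \cap C \cap D|$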

Inclusion – Exclusion Principle

• Compute the cardinality of the union of n sets using the cardinalities of the intersections of all its subsets:

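$$|X \cup Z| = |X| + |Z| - |X \cap Z|$$

$$\left|\bigcup_{i=1}^{n} T_i\right| = \sum_{k=1}^{n} (-1)^{k+1} \left( \sum_{1 \le i_1 < \cdots < i_k \le n} \left|T_{i_1} \cap \cdots \cap T_{i_k}\right| \right)$$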
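The principle can be checked on a small input. The sketch below computes the union cardinality only from intersection cardinalities and compares it with the directly computed union; the function name and the sample sets are illustrative assumptions.

```python
from itertools import combinations


def union_size_by_inclusion_exclusion(sets):
    """Cardinality of the union, using only cardinalities of intersections."""
    sets = [set(s) for s in sets]
    total = 0
    for k in range(1, len(sets) + 1):
        # Intersections of an odd number of sets are added, even ones subtracted.
        sign = 1 if k % 2 == 1 else -1
        for combo in combinations(sets, k):
            total += sign * len(set.intersection(*combo))
    return total


# Hypothetical document-id sets for three tags.
A = {1, 2, 3}
B = {2, 3, 4}
C = {3, 5}

via_principle = union_size_by_inclusion_exclusion([A, B, C])
direct = len(A | B | C)
```

Here the sum is 3 + 3 + 2 - (2 + 1 + 1) + 1 = 5, the same as the size of the direct union.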

Inclusion – Exclusion Principle: Advantages

• Needs to maintain fewer counters

• Adapts more easily to changes in the load

[Figure: the example documents {a, b, c} and {b, c, d} with their subsets and the union and intersection counters from the previous example, plus the single-tag counters $|A|$, $|B|$, $|C|$, $|D|$. A new document d' = {a, d} introduces the counter $|A \cap D|$.]

Problem

• For each subset of co-occurring tags $\{t_1, t_2, \ldots, t_n\}$:
  – Number of documents annotated with each tag: $|T_i|$
  – Number of documents annotated with all tags: $\left|\bigcap_{i=1}^{n} T_i\right|$

• The number of co-occurring tag sets is large
• New documents arrive fast, changing these numbers

Solution: Let multiple nodes compute the Jaccard coefficient for different tag sets

Outline

• Motivation
  – enBlogue
  – Jaccard Coefficient
  – Inclusion – Exclusion Principle
  – Problem

• Idea
  – Architecture
  – Partition Tags
  – Updating Counters

• Results
  – Theoretical Results
  – Experimental Results

• Conclusion

Architecture

Nodes computing the Jaccard coefficients

Nodes computing the partitions

Partition Tags: Requisites

1. Treat tag-sets as inseparable units

2. Minimise the overlap of single tags tracked by different nodes

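Tag sets: {a, b}, {c, d}, {a, c, d}

• N1: {a, b}, with counters $|A|$, $|B|$, $|A \cap B|$
• N2: {c, d}, with counters $|C|$, $|D|$, $|C \cap D|$

• {a, c, d} on N1: further counters $|C|$, $|D|$, $|A \cap C|$, $|A \cap D|$, $|C \cap D|$
• {a, c, d} on N2: further counters $|A|$, $|A \cap C|$, $|A \cap D|$

$$J(a, b) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$

$$J(c, d) = \frac{|C \cap D|}{|C| + |D| - |C \cap D|}$$

$$J(a, c, d) = \frac{|A \cap C \cap D|}{|A| + |C| + |D| - |A \cap C| - |A \cap D| - |C \cap D| + |A \cap C \cap D|}$$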

Partition Tags: Algorithm

Phase 1: Create an initial assignment of the tags to the nodes
• Max-k cover: selects k out of n sets that cover the maximum number of elements

Phase 2: Make sure all sets of tags are assigned to some node

Partition Tags: Example

Documents: d1 = {a, b, c}, d2 = {b, c}, d3 = {a, b, f}, d4 = {d, e, g}, d5 = {a, d, e}

PHASE 1: MAX-2 COVER

• {a, b, c}
• {a, d, e}

PHASE 2: ASSIGNING REMAINING SETS

• {a, b, f} joins {a, b, c}
• {d, e, g} joins {a, d, e}

Resulting tag partitions: {a, b, c, f} and {a, d, e, g}
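The two phases can be sketched as follows. Two points are assumptions rather than details from the slides: Phase 1 uses the standard greedy heuristic for max-k cover, and Phase 2 sends each remaining tag set to the node sharing the most tags with it (ties go to the node with fewer tags). All function names are hypothetical.

```python
def greedy_max_k_cover(tag_sets, k):
    """Phase 1 (greedy heuristic): pick k sets that together cover many tags."""
    covered, seeds = set(), []
    remaining = list(tag_sets)
    for _ in range(min(k, len(remaining))):
        # Choose the set contributing the most tags not covered so far.
        best = max(remaining, key=lambda s: len(s - covered))
        seeds.append(best)
        covered |= best
        remaining.remove(best)
    return seeds, remaining


def assign_remaining(seeds, remaining):
    """Phase 2 (assumed rule): place each set on the node with largest overlap."""
    nodes = [{"sets": [s], "tags": set(s)} for s in seeds]
    for tag_set in remaining:
        target = max(
            nodes,
            key=lambda n: (len(tag_set & n["tags"]), -len(n["tags"])),
        )
        target["sets"].append(tag_set)  # tag sets stay inseparable units
        target["tags"] |= tag_set
    return nodes


# The five documents of the example slide.
docs = [
    frozenset("abc"),
    frozenset("bc"),
    frozenset("abf"),
    frozenset("deg"),
    frozenset("ade"),
]

seeds, rest = greedy_max_k_cover(docs, k=2)
nodes = assign_remaining(seeds, rest)
partitions = [n["tags"] for n in nodes]
```

On this input the greedy heuristic seeds the nodes with {a, b, c} and {d, e, g}, which differs from the Phase 1 choice shown on the slide, yet the resulting tag partitions are the same: {a, b, c, f} and {a, d, e, g}, with only tag a replicated.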

Update Counters

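• N1: {a, b, c, d}, with counters $|A|$, $|B|$, $|A \cap B|$, $|C|$, $|D|$, $|C \cap D|$
• N2: {b, e, f}, with counters $|B|$, $|E|$, $|F|$, $|B \cap E|$, $|B \cap F|$, $|E \cap F|$, $|B \cap E \cap F|$

• d4 = {c, d}: $|C|$++, $|D|$++, $|C \cap D|$++
• d5 = {b, f}:
  – $|B|$++
  – $|B|$++, $|E|$++, $|B \cap E|$++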
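A minimal sketch of per-node counter maintenance, assuming each node keeps one counter for every non-empty subset of the tag sets assigned to it and derives the Jaccard coefficient from those counters via inclusion-exclusion. The class `Node` and its methods are hypothetical names, not the authors' implementation.

```python
from itertools import combinations


class Node:
    def __init__(self, tag_sets):
        # One counter per non-empty subset of every assigned tag set.
        self.counters = {}
        for tags in tag_sets:
            for size in range(1, len(tags) + 1):
                for subset in combinations(sorted(tags), size):
                    self.counters[frozenset(subset)] = 0

    def update(self, doc_tags):
        """Increment every tracked counter whose tags all occur in the document."""
        doc = frozenset(doc_tags)
        for subset in self.counters:
            if subset <= doc:
                self.counters[subset] += 1

    def jaccard(self, tags):
        """Jaccard coefficient from counters; the union uses inclusion-exclusion."""
        tags = sorted(tags)
        intersection = self.counters[frozenset(tags)]
        union = 0
        for size in range(1, len(tags) + 1):
            sign = 1 if size % 2 == 1 else -1
            for subset in combinations(tags, size):
                union += sign * self.counters[frozenset(subset)]
        return intersection / union if union else 0.0


# N1 tracks the tag sets {a, b} and {c, d}, as on the slide.
n1 = Node([{"a", "b"}, {"c", "d"}])
n1.update({"c", "d"})  # d4: |C|++, |D|++, |C ∩ D|++
n1.update({"b", "f"})  # d5: only |B|++ on N1, since f is not tracked there
n1.update({"c"})       # hypothetical extra document annotated with c only
j_cd = n1.jaccard({"c", "d"})  # 1 / (2 + 1 - 1)
```

Scanning all counters per document keeps the sketch short; a real node would more likely enumerate only the subsets of the incoming document's tracked tags.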

Finding nodes

Inverted Index

• a: {N1, N2}
• b: {N1}
• c: {N1}
• d: {N2}
• e: {N2}
• f: {N1}
• g: {N2}

d2000 = {a, c} ⇒ {N1, N2} ∪ {N1} = {N1, N2}
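The lookup can be sketched with a dictionary from tag to node names. The partitions below are the ones from the partitioning example ({a, b, c, f} on N1 and {a, d, e, g} on N2); the function names are illustrative assumptions.

```python
def build_inverted_index(partitions):
    """Map every tag to the set of nodes that track it."""
    index = {}
    for node, tags in partitions.items():
        for tag in tags:
            index.setdefault(tag, set()).add(node)
    return index


def nodes_for_document(index, doc_tags):
    """Union of the node sets of all tags in the document."""
    nodes = set()
    for tag in doc_tags:
        nodes |= index.get(tag, set())  # unknown tags contribute no node
    return nodes


partitions = {
    "N1": {"a", "b", "c", "f"},
    "N2": {"a", "d", "e", "g"},
}
index = build_inverted_index(partitions)
targets = nodes_for_document(index, {"a", "c"})  # document d2000
```

For d2000 = {a, c} the lookup yields {N1, N2}, because the replicated tag a is tracked by both nodes.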

Outline

• Motivation

• Idea
  – Architecture
  – Distributing Tags
  – Updating Counters

• Results
  – Theoretical Results
  – Experimental Results

• Conclusion

Theoretic expectation

$$E[\text{affected nodes}] = k \cdot \left[ 1 - \left( \frac{\binom{v-m}{m}}{\binom{v}{m}} \right)^{\frac{n}{k}} \right]$$

• $k$ partitions
• $v$ total tags (vocabulary)
• $m$ randomly selected tags per set
• $n$ total tag-sets
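The expectation can be evaluated numerically. The sketch below reads the slide's expression as k · [1 − (C(v−m, m) / C(v, m))^(n/k)], where C denotes the binomial coefficient; this reading is an interpretation of the slide, and the function name and the sample parameter values are assumptions.

```python
from math import comb


def expected_affected_nodes(k, v, m, n):
    """Expected number of nodes touched by a new tag set of m random tags."""
    # Probability that two random m-tag sets over v tags share no tag.
    p_disjoint = comb(v - m, m) / comb(v, m)
    # Each of the k partitions holds n / k tag sets; a partition is affected
    # unless the new set is disjoint from all of them.
    return k * (1.0 - p_disjoint ** (n / k))


# Small sanity check: 10 partitions, 10 tags, 1 tag per set, 10 tag sets.
small = expected_affected_nodes(k=10, v=10, m=1, n=10)

# Hypothetical setting with the slide's 10 partitions and 1,000,000 tags.
large = expected_affected_nodes(k=10, v=1_000_000, m=3, n=100_000)
```

In the small case each partition holds one single-tag set, so a new single-tag set hits it with probability 1/10, giving one affected node in expectation.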

Theoretical Results

Partitions: 10, Vocabulary Size: 1,000,000

Real Data Experiments

• Dataset: Tweets of 15th March 2013
• Partitions: 10

Outline

• Motivation

• Idea
  – Architecture
  – Distributing Tags
  – Updating Counters

• Results
  – Theoretical Results
  – Experimental Results

• Conclusion

Conclusion

• An algorithm to compute the Jaccard coefficient for tag-sets in a massive data stream.

• Applicable to all measures using intersections and/or unions of sets (e.g. Dice)

• Results show small replication

• Load is distributed equally across the nodes

Thank you!
