Scalable, Continuous Tracking of Tag Co-Occurrences between Short Sets using (Almost) Disjoint Tag Partitions

Foteini Alvanaki, Sebastian Michel
Excellence Cluster on Multimodal Computing and Interaction (MMCI)
DBSocial 2013, New York

Page 1: DBSocial  2013, New York

Scalable, Continuous Tracking of Tag Co-Occurrences between Short Sets using (Almost) Disjoint Tag Partitions

DBSocial 2013, New York

Foteini Alvanaki Sebastian Michel

Excellence Cluster on Multimodal Computing and Interaction (MMCI)

Page 2: DBSocial  2013, New York

Motivation: enBlogue (1)

• enBlogue identifies emergent topics
• Input: a stream of documents annotated with hash-tags (e.g. tweets)
• Restricts the focus to the more recent documents using a sliding time window

[Figure: a stream of tag sets, e.g. {#flood, #Lourdes}, {#Algeria, Stratfor}, {#Asanz, #Wikileaks}, {#obamainBerlin, #Merkel}, {#Twisted, ABCFamily}, {#NSA, #Orwell}, {#Kim, #Kanye}, {#Kim, #Baby}, {#Bieber, #NBAFinals}, {#Rihanna, #Bieber, #Youtube}, {#HeatNation, #NBAFinals}, {#obama, #berlin}. Sets such as {#Kim, #Baby}, {#flood, #Lourdes} and {#obamainBerlin, #Merkel} recur several times.]

Page 3: DBSocial  2013, New York

Motivation: enBlogue (2)

• Tracks the correlation of co-occurring hash-tags over time
• Reports on unexpected changes in the correlation

[Figure: the stream of tag sets from the previous slide, with a plot of the correlation of {#Kim, #Baby} over time (x-axis: time, y-axis: correlation).]

Page 4: DBSocial  2013, New York

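Jaccard Coefficient

• $T_i$: the set containing the ids of the documents annotated with tag $t_i$

• Pair of tags $\{t_1, t_2\}$:

$$
J(t_1, t_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}
$$

• Set of n tags $\{t_1, t_2, \ldots, t_n\}$:

$$
J(t_1, \ldots, t_n) = \frac{\left|\bigcap_{i=1}^{n} T_i\right|}{\left|\bigcup_{i=1}^{n} T_i\right|}
$$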
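The definition above can be tried out directly on small sets. The following is a minimal sketch, assuming each tag's documents fit in an in-memory Python set; the function name `jaccard` and the document ids are made up for illustration and do not come from the slides.

```python
from functools import reduce


def jaccard(*doc_sets):
    """Jaccard coefficient of one or more sets of document ids."""
    intersection = reduce(set.intersection, doc_sets)
    union = reduce(set.union, doc_sets)
    return len(intersection) / len(union) if union else 0.0


# Hypothetical data: ids of the documents annotated with each tag.
T1 = {1, 2, 3, 4}
T2 = {3, 4, 5}
T3 = {4, 5, 6}

pair_score = jaccard(T1, T2)        # |{3, 4}| / |{1, 2, 3, 4, 5}|
triple_score = jaccard(T1, T2, T3)  # |{4}| / |{1, 2, 3, 4, 5, 6}|
```

Storing the full id sets is only practical for toy data; the rest of the talk works with counters instead.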

Page 5: DBSocial  2013, New York

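Jaccard Coefficient Computation

• Maintain counters for all subsets of co-occurring tags

Example: the documents {a, b, c} and {b, c, d} contain the co-occurring subsets {a, b}, {a, c}, {b, c}, {c, d}, {b, d}, {a, b, c} and {b, c, d}. Each subset needs a union counter and an intersection counter:

{a, b}: $|A \cup B|$, $|A \cap B|$
{a, c}: $|A \cup C|$, $|A \cap C|$
{b, c}: $|B \cup C|$, $|B \cap C|$
{c, d}: $|C \cup D|$, $|C \cap D|$
{b, d}: $|B \cup D|$, $|B \cap D|$
{a, b, c}: $|A \cup B \cup C|$, $|A \cap B \cap C|$
{b, c, d}: $|B \cup C \cup D|$, $|B \cap C \cap D|$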

Page 6: DBSocial  2013, New York

Inclusion–Exclusion Principle

• Compute the cardinality of the union of n sets using the cardinalities of the intersections of all its subsets:

$$
|X \cup Z| = |X| + |Z| - |X \cap Z|
$$

$$
\left|\bigcup_{i=1}^{n} T_i\right| = \sum_{k=1}^{n} (-1)^{k+1} \left( \sum_{1 \le i_1 < \cdots < i_k \le n} \left|T_{i_1} \cap \cdots \cap T_{i_k}\right| \right)
$$
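To make the formula concrete, here is a small sketch that evaluates the inclusion–exclusion sum and compares it with a directly computed union. The helper name `union_size` and the three example sets are assumptions for illustration only.

```python
from itertools import combinations


def union_size(sets):
    """|T_1 ∪ ... ∪ T_n| computed only from intersection cardinalities."""
    total = 0
    n = len(sets)
    for k in range(1, n + 1):
        sign = 1 if k % 2 == 1 else -1  # (-1)^(k+1)
        for group in combinations(sets, k):
            total += sign * len(set.intersection(*group))
    return total


# Hypothetical document-id sets for three tags.
T = [{1, 2, 3, 4}, {3, 4, 5}, {4, 5, 6}]
via_inclusion_exclusion = union_size(T)  # 10 - 5 + 1
direct = len(set.union(*T))
```

For these sets the single-tag terms sum to 10, the pairwise intersections to 5, and the triple intersection has size 1, so both values agree.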

Page 7: DBSocial  2013, New York

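Inclusion–Exclusion Principle: Advantages

• Needs to maintain fewer counters
• Adapts more easily to changes in the load

Example with the documents {a, b, c} and {b, c, d}, whose co-occurring subsets are {a, b}, {a, c}, {b, c}, {c, d}, {b, d}, {a, b, c} and {b, c, d}:

• The union counters $|A \cup B|$, $|A \cup C|$, $|B \cup C|$, $|C \cup D|$, $|B \cup D|$, $|A \cup B \cup C|$, $|B \cup C \cup D|$ are replaced by the single-tag counters $|A|$, $|B|$, $|C|$, $|D|$
• The intersection counters $|A \cap B|$, $|A \cap C|$, $|B \cap C|$, $|C \cap D|$, $|B \cap D|$, $|A \cap B \cap C|$, $|B \cap C \cap D|$ are kept
• A new document d' = {a, d} adds only the counter $|A \cap D|$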
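The counter-based scheme can be sketched as follows. This is an illustrative single-process version, not the authors' implementation: the names `counts`, `subsets`, `add_document` and `jaccard` are hypothetical, and the sketch enumerates all subsets of a document's tags, which is only reasonable for short tag sets.

```python
from collections import Counter
from itertools import combinations

# frozenset of tags -> number of documents containing all of those tags
counts = Counter()


def subsets(tags):
    """All non-empty subsets of a tag collection, as frozensets."""
    tags = sorted(tags)
    for k in range(1, len(tags) + 1):
        for combo in combinations(tags, k):
            yield frozenset(combo)


def add_document(tags):
    """Increment the intersection counter of every subset of the document's tags."""
    for s in subsets(tags):
        counts[s] += 1


def jaccard(tags):
    """Jaccard coefficient from counters only; the union uses inclusion-exclusion."""
    union = sum((-1) ** (len(s) + 1) * counts[s] for s in subsets(tags))
    return counts[frozenset(tags)] / union if union else 0.0


for doc in [{"a", "b", "c"}, {"b", "c", "d"}]:
    add_document(doc)
counters_before = len(counts)  # 4 single-tag + 7 intersection counters

add_document({"a", "d"})  # d' = {a, d}
new_counters = len(counts) - counters_before
```

Only the intersection counter for {a, d} is created by the new document; the single-tag counters for a and d already exist and are merely incremented.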

Page 8: DBSocial  2013, New York

Problem

• For each subset $\{t_1, t_2, \ldots, t_n\}$ of co-occurring tags we need:
  – the number of documents annotated with each tag, $|T_i|$
  – the number of documents annotated with all tags, $\left|\bigcap_{i=1}^{n} T_i\right|$
• There is a large number of co-occurring tag sets
• New documents arrive fast, changing the numbers

Solution: Let multiple nodes compute the Jaccard coefficient for different tag sets

Page 9: DBSocial  2013, New York

Outline

• Motivation
  – enBlogue
  – Jaccard Coefficient
  – Inclusion–Exclusion Principle
  – Problem
• Idea
  – Architecture
  – Partition Tags
  – Updating Counters
• Results
  – Theoretical Results
  – Experimental Results
• Conclusion

Page 10: DBSocial  2013, New York

Architecture


Nodes computing the Jaccard coefficients

Nodes computing the partitions

Page 11: DBSocial  2013, New York

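Partition Tags: Requisites

1. Treat tag-sets as inseparable units

2. Minimise the overlap of single tags tracked by different nodes

Example tag-sets: {a, b}, {c, d}, {a, c, d}

N1: {a, b}, counters $|A|$, $|B|$, $|A \cap B|$
N2: {c, d}, counters $|C|$, $|D|$, $|C \cap D|$
{a, c, d}: counters $|C|$, $|D|$, $|A \cap C|$, $|A \cap D|$, $|C \cap D|$
{a, c, d}: counters $|A|$, $|A \cap C|$, $|A \cap D|$

$$
J(a, b) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
$$

$$
J(c, d) = \frac{|C \cap D|}{|C| + |D| - |C \cap D|}
$$

$$
J(a, c, d) = \frac{|A \cap C \cap D|}{|A| + |C| + |D| - |A \cap C| - |A \cap D| - |C \cap D| + |A \cap C \cap D|}
$$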

Page 12: DBSocial  2013, New York

Partition Tags: Algorithm

Phase 1: Create an initial assignment of the tags to the nodes
  – Max-k cover: selects k out of n sets that cover the maximum number of elements

Phase 2: Make sure all sets of tags are assigned to some node

Page 13: DBSocial  2013, New York

Partition Tags: Example

d1 = {a, b, c}
d2 = {b, c}
d3 = {a, b, f}
d4 = {d, e, g}
d5 = {a, d, e}

Phase 1: Max-2 cover
  – {a, b, c}
  – {a, d, e}

Phase 2: Assigning remaining sets
  – {a, b, f} joins {a, b, c}
  – {d, e, g} joins {a, d, e}

Resulting tag partitions:
  – {a, b, c, f}
  – {a, d, e, g}
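The two phases can be sketched roughly as below. Several parts are assumptions rather than the authors' method: Phase 1 uses the standard greedy approximation of max-k cover, and Phase 2 uses a guessed heuristic that places each remaining set on the node already tracking most of its tags, since the slides do not spell out the assignment rule. The function name `partition_tag_sets` is hypothetical.

```python
def partition_tag_sets(tag_sets, k):
    """Two-phase assignment of tag sets to k nodes (greedy sketch)."""
    remaining = [frozenset(s) for s in tag_sets]
    nodes = []       # nodes[i] is the list of tag sets assigned to node i
    covered = set()

    # Phase 1: greedy approximation of max-k cover. Each round picks the
    # set that covers the most tags not covered so far.
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda s: len(s - covered))
        remaining.remove(best)
        nodes.append([best])
        covered |= best

    # Phase 2 (assumed heuristic): put each remaining set on the node
    # that already tracks most of its tags, keeping the set in one piece.
    for s in remaining:
        tags_per_node = [set().union(*sets) for sets in nodes]
        target = max(range(len(nodes)), key=lambda i: len(s & tags_per_node[i]))
        nodes[target].append(s)

    return nodes


docs = [{"a", "b", "c"}, {"b", "c"}, {"a", "b", "f"}, {"d", "e", "g"}, {"a", "d", "e"}]
nodes = partition_tag_sets(docs, k=2)
node_tags = [set().union(*sets) for sets in nodes]
```

On the five example documents the greedy seeds chosen in Phase 1 differ from the two sets shown on the slide, but the resulting tag partitions are the same, {a, b, c, f} and {a, d, e, g}, with only tag a replicated on both nodes.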

Page 14: DBSocial  2013, New York

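Update Counters

N1: {a, b, c, d}, counters $|A|$, $|B|$, $|A \cap B|$, $|C|$, $|D|$, $|C \cap D|$
N2: {b, e, f}, counters $|B|$, $|E|$, $|F|$, $|B \cap E|$, $|B \cap F|$, $|E \cap F|$, $|B \cap E \cap F|$

d4 = {c, d}
  – N1: $|C|$++, $|D|$++, $|C \cap D|$++

d5 = {b, f}
  – N1: $|B|$++
  – N2: $|B|$++, $|F|$++, $|B \cap F|$++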
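A node's update step might look like the sketch below. The `Node` class is hypothetical, and it assumes that N1 was assigned the tag sets {a, b} and {c, d} (inferred from the counters listed for it) and N2 the tag set {b, e, f}. A node increments every counter it tracks whose tags all occur in the incoming document.

```python
from collections import Counter
from itertools import combinations


class Node:
    """Hypothetical counting node: tracks every subset of its assigned tag sets."""

    def __init__(self, tag_sets):
        self.counts = Counter()
        self.tracked = set()
        for tag_set in tag_sets:
            self.tracked.update(self._subsets(tag_set))

    @staticmethod
    def _subsets(tags):
        """All non-empty subsets of a tag collection, as frozensets."""
        tags = sorted(tags)
        for k in range(1, len(tags) + 1):
            for combo in combinations(tags, k):
                yield frozenset(combo)

    def update(self, doc_tags):
        """Increment every tracked counter whose tags all occur in the document."""
        for s in self._subsets(doc_tags):
            if s in self.tracked:
                self.counts[s] += 1


n1 = Node([{"a", "b"}, {"c", "d"}])  # tags {a, b, c, d}
n2 = Node([{"b", "e", "f"}])         # tags {b, e, f}

n1.update({"c", "d"})                # d4 only concerns N1
for node in (n1, n2):                # d5 reaches both nodes through tag b
    node.update({"b", "f"})
```

After these updates N1 has incremented the counters for c, d, {c, d} and b, while N2 has incremented b, f and {b, f}.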

Page 15: DBSocial  2013, New York

Finding nodes

Inverted index (tag: nodes tracking it)

a: {N1, N2}
b: {N1}
c: {N1}
d: {N2}
e: {N2}
f: {N1}
g: {N2}

d2000 = {a, c} ⇒ {N1, N2} ∪ {N1} = {N1, N2}
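The lookup amounts to a union over the index entries of the document's tags. A minimal sketch, with the index contents copied from the example and a hypothetical helper `nodes_for`:

```python
# Tag -> nodes index taken from the example above.
inverted_index = {
    "a": {"N1", "N2"},
    "b": {"N1"},
    "c": {"N1"},
    "d": {"N2"},
    "e": {"N2"},
    "f": {"N1"},
    "g": {"N2"},
}


def nodes_for(doc_tags, index):
    """Union of the node sets of all tags in the document.

    Tags missing from the index contribute no node (an assumption of this sketch).
    """
    affected = set()
    for tag in doc_tags:
        affected |= index.get(tag, set())
    return affected


d2000 = {"a", "c"}
affected = nodes_for(d2000, inverted_index)  # {N1, N2} ∪ {N1}
```

A document whose tags all live on one node, such as {d, g}, would be forwarded to that node alone.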

Page 16: DBSocial  2013, New York

Outline

• Motivation
  – enBlogue
  – Jaccard Coefficient
  – Inclusion–Exclusion Principle
  – Problem
• Idea
  – Architecture
  – Distributing Tags
  – Updating Counters
• Results
  – Theoretical Results
  – Experimental Results
• Conclusion

Page 17: DBSocial  2013, New York

Theoretical expectation

$$
E[\text{affected nodes}] = k \cdot \left( 1 - \left[ \frac{\binom{v-m}{m}}{\binom{v}{m}} \right]^{\lceil n/k \rceil} \right)
$$

• k: number of partitions
• v: total number of tags (vocabulary)
• m: number of randomly selected tags per set
• n: total number of tag-sets
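The expectation can be evaluated numerically. The sketch below assumes the reading $k \cdot (1 - [\binom{v-m}{m} / \binom{v}{m}]^{\lceil n/k \rceil})$ of the formula, i.e. a node is unaffected only if the new document shares no tag with any of the roughly n/k tag-sets it holds. The function name and the values of m and n in the usage line are hypothetical; k = 10 and v = 1,000,000 are the settings named in the theoretical results.

```python
from math import ceil


def expected_affected_nodes(k, v, m, n):
    """k * (1 - (C(v-m, m) / C(v, m)) ** ceil(n / k)), under the assumed reading."""
    # C(v-m, m) / C(v, m) written as a product of m factors,
    # which avoids building huge binomial coefficients.
    disjoint = 1.0
    for i in range(m):
        disjoint *= (v - m - i) / (v - i)
    return k * (1.0 - disjoint ** ceil(n / k))


# Tiny case that can be checked by hand: C(8, 2) / C(10, 2) = 28 / 45.
small = expected_affected_nodes(k=2, v=10, m=2, n=4)

# Hypothetical workload: m and n are made-up values.
estimate = expected_affected_nodes(k=10, v=1_000_000, m=3, n=100_000)
```

The result always lies between 0 and k, and it grows as tag-sets get longer or as more tag-sets are packed onto each node.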

Page 18: DBSocial  2013, New York

Theoretical Results

• Partitions: 10
• Vocabulary size: 1,000,000

Page 19: DBSocial  2013, New York

Real Data Experiments

• Dataset: Tweets of 15th March 2013
• Partitions: 10

Page 20: DBSocial  2013, New York

Outline

• Motivation
  – enBlogue
  – Jaccard Coefficient
  – Inclusion–Exclusion Principle
  – Problem
• Idea
  – Architecture
  – Distributing Tags
  – Updating Counters
• Results
  – Theoretical Results
  – Experimental Results
• Conclusion

Page 21: DBSocial  2013, New York

Conclusion

• An algorithm to compute the Jaccard coefficient for tag-sets in a massive data stream

• Applicable to all measures that use intersections and/or unions of sets (e.g. Dice)

• Results show little replication of tags across nodes

• Load is distributed equally across the nodes

Page 22: DBSocial  2013, New York


Thank you!