42
PMatch: Probabilistic Subgraph Matching on Huge Social Networks Matthias Bröcheler, Andrea Pugliese & V.S. Subrahmanian © Adam Perer

PMatch: Probabilistic Subgraph Matching on Huge Social Networks

Embed Size (px)

DESCRIPTION

Users querying massive social networks or RDF databases are often not 100% certain about what they are looking for due to the complexity of the query or heterogeneity of the data. In this paper, we propose “probabilistic subgraph” (PS) queries over a graph/network database, which afford users great flexibility in specifying “approximately” what they are looking for. We formally define the probability that a substitution satisfies a PS-query with respect to a graph database. We then present the PMATCH algorithm to answer such queries and prove its correctness. Our experimental evaluation demonstrates that PMATCH is efficient and scales to massive social networks with over a billion edges.

Citation preview

Page 1: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

PMatch: Probabilistic Subgraph Matching on Huge Social Networks

Matthias Bröcheler, Andrea Pugliese & V.S. Subrahmanian

© Adam Perer

Page 2: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Jenny

Peter

Alice

John

Emily

Jon

Francis

Melissa

Jessie

Mark

Bob

Ashley

Bob

likes

likes

likes

likes

likes

friend likes

likes

likes

likes

likes

likes

friend likes

likes

type type

type

type

attended

attended organized

friend

attended

likes

friend

organized

organized attended

friend

attended

likes

organized

attended

attended

likes

type

type

type

attended

organized

friend

friend

attended

friend

organized

attended

attended

organized

friend

type

attended

type

type

likes

friend

friend

friend

attended

organized

organized likes

likes

attended

attended

Peter‘s Bday party

Halloween 2008

Pizza Feast

Fundraiser for School

Sylvester 2009 Homecoming

09

Chill-out Night

Spring Break Trip

Alice Goodbye

The Godfather

Pulp Fiction

Thriller

Inception

Mystery

Harry Potter

Toy Story

Mrs. Doubtfire

Comedy

Family

The Lion King

Gone with the wind

Drama

Titanic

Sci-Fi Star Wars IV

Page 3: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Example Query

3

Simple query, yet complex structure and difficult to answer by hand

Francis

?f ?u Peter

friend

likes likes

attended

attended

organized

type ?b

?p

Drama

Page 4: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Query Uncertainty

4

Was it really a drama movie?

Francis

?f ?u Peter

friend

likes likes

attended

attended

organized

type ?b

?p

Drama

Page 5: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Probabilistic Matching Query

5

Francis

?f ?u Peter

friend

likes likes

attended

attended

organized

type ?b

?p

Drama

6

2 Certain

Uncertain Capture uncertainty in weighted query boxes

3

Page 6: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Intelligence and Security

6

?target

in ?person1

?bank

Very bad guy

?person1

Africa

?forum Radical

?person3 ?city

Middle East Radical

labeled

labeled

posted to posted to

located in

traveled

lives in

transfer

transfer

affiliated

affiliated

contacted

Page 7: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Intelligence and Security

7

?target

in ?person1

?bank

Very bad guy

?person1

Africa

?forum Radical

?person3 ?city

Middle East Radical

labeled

labeled

posted to posted to

located in

traveled

lives in

transfer

transfer

affiliated

affiliated

contacted

∞ 3

1

1 2

2

Page 8: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

PMatch Applications  Intelligence

-  Information retrieval on open source data

 Search -  Facilitate productive use of social network data

 Security -  Finding suspicious patterns

 Social Network Analysis -  Querying as a primitive operation for data retrieval

8

Page 9: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Outline

Motivation

PMatch Query Definition

PMatch Query Answering

Experiments

Related Work & Conclusion

Page 10: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

PMatch Query  Set of Boxes

- Each contains edges - Has a weight •  Infinity certainty or “core” of query

 Naturally extends exact subgraph matching queries

10

Francis

?f ?u Peter

friend

likes likes

attended

attended

organized

type ?b

?p

Drama

6

2

3

Page 11: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

PMatch Query  Matching boxes

- Let bi = 1 if box Bi is matched by a substitution else 0 for i=1 to n (= # of boxes)

- Weights wi, offset w0

 Probability of match (user defined) =0 if b0 = 0, core must be matched = 1

1 + e−(w0+�n

i=1 wibi)

11

Francis

?f ?u Peter

friend

likes likes

attended

attended

organized

type ?b

?p

Drama

6

2

3

Page 12: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

PMatch Query Answer  All substitutions with probability above user defined threshold τ

 Need to eliminate redundant answers - Substitutions that can be extended to

higher probability

12

Francis

?f ?u Peter

friend

likes likes

attended

attended

organized

type ?b

?p

Drama

6

2

3

Page 13: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

PMatch Answer

 1/(1+e-(-5+6)) ≈ 73% -  w0=-5

13

Francis

?f ?u Peter

friend

likes likes

attended

attended

organized

type ?b

?p

Drama

6

2

3

Francis

John

friend

likes

type The

Godfather Drama

Page 14: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Outline

Motivation

PMatch Query Definition

PMatch Query Answering

Experiments

Related Work & Conclusion

Page 15: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

PMatch Query Answering  Try to find all answers at once by

exploiting shared substructures and efficiently maintain data structures.

 Depth first-search algorithm - Terminates when search space is

exhausted or answer probability falls below threshold.

 Provably correct

15

Page 16: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

PMatch Initialization

16

Francis

?f

?b Drama

?p

?u Peter

Friend {0}

Likes {0} Likes {2,3}

Attended {2,3}

Attended {2,3}

organi zed {2}

type {6}

Rewrite query (0=certain)

Page 17: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

PMatch Data Structures Maintained per vertex:

 R: Set of substitution candidates -  Pairs of the form <x,Hx>, where x is a

potential substitution and Hx the set of box ids it matches. Hx maintained as bit vector

 S: Set of box ids for which R has been initialized

17

Page 18: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

PMatch Initialization

18

Francis

?f

?b Drama

?p

?u Peter

Friend {0}

Likes {0} Likes {2,3}

Attended {2,3}

Attended {2,3}

organi zed {2}

type {6}

R = {} S = {}

R = {} S = {}

R = {} S = {}

R = {} S = {}

R = {<peter|2>} S = {2}

R = {<drama|6>} S = {6}

R = {<francis|0>} S = {0}

θ = {}

τ = 0.75

Page 19: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Vertex selection heuristic  Goodness(v) =

 Progress(v) = cumulative box weight of edges to be processed - Normalize weight by box cardinality

 Cost(v) = use standard cost models from exact subgraph matching - Cardinality, estimated selectivity

19

progress(v) cost(v)

Page 20: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

PMatch Algorithm

20

Francis

?f

?b Drama

?p

?u Peter

Friend {0}

Likes {0} Likes {2,3}

Attended {2,3}

Attended {2,3}

organi zed {2}

type {6}

R = {} S = {}

R = {} S = {}

R = {} S = {}

R = {} S = {}

R = {<peter|2>} S = {2}

R = {<drama|6>} S = {6}

R = {<francis|0>} S = {0}

θ = {}

τ = 0.75

Initial vertex selection

Page 21: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

PMatch Algorithm

21

Francis

?f

?b Drama

?p

?u Peter Likes {0}

Likes {2,3}

Attended {2,3} organi zed {2}

type {6}

R = {<Peter’s Bday|2,3>, <Silvester 09|2,3>} S = {2,3}

R = {<John|0>,<Mark|0>, <Melissa|0>} S = {0}

R = {} S = {}

R = {} S = {}

R = {<peter|2>} S = {2}

R = {<drama|6>} S = {6}

θ = {}

τ = 0.75

Processing of „Francis“ vertex

Page 22: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

PMatch Algorithm

22

Francis

?f

?b Drama

?p

?u Peter

Likes {2,3}

Attended {2,3} organi zed {2}

type {6}

R = {<Peter’s Bday|2,3>, <Silvester 09|2,3>} S = {2,3}

R = {<Titanic,0>,<Star Wars,0>} S = {0}

R = {} S = {}

R = {<peter|2>} S = {2}

R = {<drama|6>} S = {6}

θ = {?f/Mark}

τ = 0.75

Threshold Pruning

Page 23: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

PMatch Algorithm

23

Francis

?f

?b Drama

?p

?u Peter

Attended {2,3} organi zed {2}

R = {<Peter’s Bday|2,3>, <Silvester 09|2,3>} S = {2,3}

R = {} S = {}

R = {<Jenny|2,3>} S = {2,3}

R = {<peter|2>} S = {2}

θ = {?f/Mark,?b/Titanic}

τ = 0.75

Processing vertex „?b“. Maintaining box indices

Page 24: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

PMatch Algorithm

24

Francis

?f

?b Drama

?p

?u Peter

organi zed {2}

R = {<Peter’s Bday|2,3>} S = {2,3}

R = {} S = {}

R = {<peter|2>} S = {2}

θ = {?f/Mark,?b/Titanic,?u/Jenny}

τ = 0.75

Crossed Threshold

Page 25: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

PMatch Algorithm

25

Francis

?f

?b Drama

?p

?u Peter

R = {} S = {}

R = {<peter|2>} S = {2}

τ = 0.75

Found Result

0.997

θ = {?f/Mark,?b/Titanic,?u/Jenny,?p/Peter’s Bday}

Page 26: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Query Splitting

26

Francis

?f

?b Drama

?p

?u Peter

Friend {0}

Likes {0} Likes {2,3}

Attended {2,3}

Attended {2,3}

type {6}

R = {} S = {}

R = {} S = {}

R = {} S = {}

R = {<drama|6>} S = {6}

R = {<francis|0>} S = {0}

θ = {}

τ = 0.75

R = {<Peter’s Bday|2>, <Silvester 09|2>} S = {2}

Selecting „Peter“ first causes query to be split.

Page 27: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Outline

Motivation

PMatch Query Definition

PMatch Query Answering

Experiments

Related Work & Conclusion

Page 28: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Experimental Setup I  Implementation in Java (approx 5500

loc) on top of Neo4j  Datasets

-  Friendship Network (Flickr, Orkut, Livejournal,

Youtube): 778 million edges -  Delicious Network: 1.1 billion edges

 Query Benchmark for each dataset: 9 PMatch queries each of increasing complexity

28

Page 29: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Experimental Setup II  Compared

-  PMatch-1: box weight -  PMatch-2: normalized box weight -  DOGMA: Exact matching for all combination -  Neo4j: Exact matching for all combinations

•  Too slow – not shown in the charts

  System: AMD Opteron, 15K RPM 300 GB SAS HD, 2GB of heap space

29

Page 30: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Friendship network (small queries)

For each query, we report query times with cold (right) and warm (left) caches.

30

0

50

100

150

200

250

300

350

Query SA: 6B,23E Query SB: 4B,6E Query SC: 5B,12E Query SD: 4B,6E Query SE: 5B,12E Query SF: 6B,23E

Queries

Que

ry T

ime

(ms)

PMatch-1 PMatch-2 DOGMA

Page 31: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Friendship network (large queries)

For each query, we report query times with cold (right) and warm (left) caches.

31

Queries

Que

ry T

ime

(ms)

PMatch-1 PMatch-2 DOGMA

0

5000

10000

15000

20000

25000

Query SG: 4B,6E Query SH: 5B,12E Query SI: 6B,23E

Page 32: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Delicious network (small queries)

For each query, we report query times with cold (right) and warm (left) caches.

32

Queries

Que

ry T

ime

(ms)

PMatch-1 PMatch-2 DOGMA

0

1000

2000

3000

4000

5000

6000

7000

Query TA: 5B,13E Query TB: 5B,17E Query TC: 4B,6E Query TD: 5B,12E Query TE:5B,12E

Page 33: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Delicious network (large queries)

For each query, we report query times with cold (right) and warm (left) caches.

33

Queries

Que

ry T

ime

(ms)

PMatch-1 PMatch-2 DOGMA

0

5000

10000

15000

20000

25000

30000

35000

40000

Query TF:4B,6E Query TG: 5B,18E Query TH: 5B,18E Query TI: 4B,7E

Page 34: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Outline

Motivation

PMatch Query Definition

PMatch Query Answering

Experiments

Related Work & Conclusion

Page 35: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Related Work I  Exact Subgraph Matching

- Many systems: RDF-3X, OWLIM, YARS (2), Sesame, Jena, DOGMA, COSI etc.

- Efficient on-disk index structures - Query answering algorithms focused on

exact queries (e.g. SPARQL).

 No treatment of query uncertainty

35

Page 36: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Related Work II  Approximate Subgraph Matching

- SAGA, TALE, Grafil, etc - Database containing many small graphs - Do not scale to large social networks - Query uncertainty is defined in the

system.  PMatch provides an expressive language to

specify probabilistic queries and answers them efficiently.

36

Page 37: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Conclusion  Query uncertainty often occurs in social network search, information retrieval and pattern matching.

 PMatch provides a language to concisely express and algorithms to efficiently answer probabilistic subgraph matching queries.

37

Page 38: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

dogma.umiacs.umd.edu

Page 39: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

? Questions? Comments?

Page 40: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Related Work III  Data Uncertainty

- Specified at the tuple or edge level - How to answer certain (SQL) queries

over uncertain data and what are the answer probabilities?

40

First Name Last Name Income Probability

John Smith 80,000 0.6

Alice Baker 85,000 0.5

Jennifer Temper 90,000 0.9

Page 41: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Related Work III  Data Uncertainty

- How to encode dependencies? - Need to make strong independence

assumptions to be efficient - Where are the probabilities coming form?

 Instead, we assume certain data and model uncertainty in the user specified query.

41

Page 42: PMatch: Probabilistic Subgraph Matching on Huge Social Networks

PMatch

Query Uncertainty Motivation  User Uncertainty

-  User does not know what she is looking for exactly

 Lack of schema -  Lack of schema complicates query design

 Data heterogeneity -  Can be caused by data integration

 Noisy or Missing Data -  It’s real world data

42