DESCRIPTION
Users querying massive social networks or RDF databases are often not 100% certain about what they are looking for due to the complexity of the query or heterogeneity of the data. In this paper, we propose “probabilistic subgraph” (PS) queries over a graph/network database, which afford users great flexibility in specifying “approximately” what they are looking for. We formally define the probability that a substitution satisfies a PS-query with respect to a graph database. We then present the PMATCH algorithm to answer such queries and prove its correctness. Our experimental evaluation demonstrates that PMATCH is efficient and scales to massive social networks with over a billion edges.
PMatch
PMatch: Probabilistic Subgraph Matching on Huge Social Networks
Matthias Bröcheler, Andrea Pugliese & V.S. Subrahmanian
© Adam Perer
[Figure: example social network graph. Vertex labels include people (Jenny, Peter, Alice, John, Emily, Jon, Francis, Melissa, Jessie, Mark, Bob, Ashley), events (Peter's Bday party, Halloween 2008, Pizza Feast, Fundraiser for School, Sylvester 2009, Homecoming 09, Chill-out Night, Spring Break Trip, Alice Goodbye), movies (The Godfather, Pulp Fiction, Inception, Harry Potter, Toy Story, Mrs. Doubtfire, The Lion King, Gone with the Wind, Titanic, Star Wars IV) and genres (Thriller, Mystery, Comedy, Family, Drama, Sci-Fi). Edge labels: friend, likes, attended, organized, type.]
Example Query
Simple query, yet complex structure and difficult to answer by hand
[Figure: query graph over the constants Francis, Peter and Drama and the variables ?f, ?u, ?b and ?p, with edges labeled friend, likes, attended, organized and type.]
Query Uncertainty
Was it really a drama movie?
[Figure: the example query graph, repeated.]
Probabilistic Matching Query
Capture uncertainty in weighted query boxes: one certain box (weight ∞) and uncertain boxes (weights 6, 2 and 3).
[Figure: the example query graph with its edges grouped into boxes weighted ∞, 6, 2 and 3.]
Intelligence and Security
[Figure: query graph with the variables ?target, ?person1, ?person3, ?bank, ?forum and ?city and the constants Very bad guy, Africa, Middle East and Radical. Edge labels: labeled, posted to, located in, traveled, lives in, transfer, affiliated, contacted.]
Intelligence and Security
[Figure: the same query graph with its edges grouped into weighted boxes; the weights shown are ∞, 3, 1, 1, 2 and 2.]
PMatch Applications
Intelligence
- Information retrieval on open source data
Search
- Facilitate productive use of social network data
Security
- Finding suspicious patterns
Social Network Analysis
- Querying as a primitive operation for data retrieval
Outline
Motivation
PMatch Query Definition
PMatch Query Answering
Experiments
Related Work & Conclusion
PMatch Query
Set of Boxes
- Each contains edges
- Has a weight
  • Infinity = certainty, or “core” of query
Naturally extends exact subgraph matching queries
[Figure: the example query graph with boxes weighted ∞, 6, 2 and 3.]
PMatch Query
Matching boxes
- Let $b_i = 1$ if box $B_i$ is matched by a substitution, else $0$, for $i = 1$ to $n$ (= # of boxes)
- Weights $w_i$, offset $w_0$
Probability of match (user defined)
- $P = 0$ if $b_0 = 0$ (the core must be matched)
- otherwise $P = \dfrac{1}{1 + e^{-\left(w_0 + \sum_{i=1}^{n} w_i b_i\right)}}$
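The match probability can be sketched in a few lines of Python. This is only an illustration of the formula on this slide; the function name and argument layout are assumptions, not part of PMatch.

```python
import math

def match_probability(core_matched, matched, weights, w0):
    """Probability that a substitution satisfies a PS-query (illustrative).

    core_matched: b_0, whether the certain (infinite-weight) box is matched.
    matched:      b_1..b_n, one boolean per uncertain box.
    weights:      w_1..w_n, the weights of the uncertain boxes.
    w0:           the offset.
    """
    if not core_matched:
        # The core must be matched, otherwise the probability is 0.
        return 0.0
    z = w0 + sum(w for w, b in zip(weights, matched) if b)
    return 1.0 / (1.0 + math.exp(-z))
```

With the example box weights 6, 2 and 3 and the offset -5 used in the deck's answer example, matching only the weight-6 box gives `match_probability(True, [True, False, False], [6, 2, 3], -5)`, roughly 0.73.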
PMatch Query Answer
All substitutions with probability above a user-defined threshold τ
Need to eliminate redundant answers
- Substitutions that can be extended to higher probability
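A minimal sketch of this answer definition, assuming substitutions are represented as Python dicts from variable names to values and that each candidate already carries its probability. The function names and the representation are hypothetical; the sketch enumerates candidates naively and says nothing about how PMatch actually computes answers.

```python
def extends(big, small):
    """True if `big` is a proper extension of the substitution `small`."""
    return len(big) > len(small) and all(big.get(k) == v for k, v in small.items())

def answers(candidates, tau):
    """candidates: list of (substitution dict, probability) pairs.

    Keeps substitutions with probability above tau and drops the redundant
    ones, i.e. those that can be extended to a higher probability.
    """
    above = [(s, p) for s, p in candidates if p > tau]
    return [
        (s, p)
        for s, p in above
        if not any(extends(s2, s) and p2 > p for s2, p2 in above)
    ]
```

Because any extension with a higher probability is itself above τ, it is enough to compare the surviving candidates with each other.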
PMatch Answer
1/(1+e^{-(-5+6)}) ≈ 73%, with w0 = -5
[Figure: the example query graph with boxes weighted ∞, 6, 2 and 3, next to a matching answer: Francis friend John, John likes The Godfather, The Godfather type Drama.]
Outline
Motivation
PMatch Query Definition
PMatch Query Answering
Experiments
Related Work & Conclusion
PMatch Query Answering
Try to find all answers at once by exploiting shared substructures and efficiently maintaining data structures.
Depth-first search algorithm
- Terminates when the search space is exhausted or the answer probability falls below the threshold.
Provably correct
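One way to read the threshold condition is as a best-case bound: if even matching every box that has not yet been ruled out cannot reach τ, the current branch of the depth-first search can be abandoned. The sketch below illustrates that reading only. It assumes all uncertain box weights are positive, and the function names are hypothetical rather than taken from PMatch.

```python
import math

def best_case_probability(w0, weights, still_possible):
    """Upper bound on the match probability of any extension of a partial
    substitution, assuming every box not yet ruled out ends up matched.

    still_possible[i] is False once box i can no longer be matched.
    Assumes positive weights, so matching more boxes never lowers the value.
    """
    z = w0 + sum(w for w, ok in zip(weights, still_possible) if ok)
    return 1.0 / (1.0 + math.exp(-z))

def can_prune(w0, weights, still_possible, tau):
    """True if no extension of the current branch can reach the threshold."""
    return best_case_probability(w0, weights, still_possible) < tau
```

For the example weights 6, 2 and 3 with offset -5 and τ = 0.75, losing the weight-6 box leaves a best case of 1/(1+e^0) = 0.5, so such a branch would be pruned under this reading.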
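PMatch Initialization
Rewrite query (0 = certain)
[Figure: query graph over Francis, ?f, ?b, Drama, ?p, ?u and Peter, with each edge annotated by the ids of the boxes that contain it: Friend {0}, Likes {0}, Likes {2,3}, Attended {2,3}, Attended {2,3}, organized {2}, type {6}.]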
PMatch Data Structures
Maintained per vertex:
R: Set of substitution candidates
- Pairs of the form <x,Hx>, where x is a potential substitution and Hx is the set of box ids it matches. Hx is maintained as a bit vector.
S: Set of box ids for which R has been initialized
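The per-vertex state can be sketched as follows, using a Python integer as the bit vector for Hx and for S. The class and method names are assumptions made for illustration; the actual implementation described later in the deck is in Java and may be organized differently.

```python
from dataclasses import dataclass, field

def box_mask(box_ids):
    """Encode a collection of box ids as a bit vector (bit i set = box i)."""
    mask = 0
    for b in box_ids:
        mask |= 1 << b
    return mask

@dataclass
class VertexState:
    # R: candidate x -> H_x, the bit vector of box ids that x matches.
    R: dict = field(default_factory=dict)
    # S: bit vector of box ids for which R has been initialized.
    S: int = 0

    def add_candidate(self, x, box_ids):
        self.R[x] = self.R.get(x, 0) | box_mask(box_ids)

    def mark_initialized(self, box_ids):
        self.S |= box_mask(box_ids)

    def boxes_of(self, x):
        """Decode H_x back into a sorted list of box ids."""
        mask = self.R[x]
        return [i for i in range(mask.bit_length()) if (mask >> i) & 1]
```

For example, the initial state shown for the constant vertex Peter, R = {<peter|2>} and S = {2}, corresponds to `add_candidate("peter", [2])` followed by `mark_initialized([2])`.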
PMatch Initialization
[Figure: the rewritten query graph (edges Friend {0}, Likes {0}, Likes {2,3}, Attended {2,3}, Attended {2,3}, organized {2}, type {6}) with each vertex annotated by its R and S sets.]
R = {} S = {}
R = {} S = {}
R = {} S = {}
R = {} S = {}
R = {<peter|2>} S = {2}
R = {<drama|6>} S = {6}
R = {<francis|0>} S = {0}
θ = {}
τ = 0.75
Vertex selection heuristic
Goodness(v) = progress(v) / cost(v)
Progress(v) = cumulative box weight of edges to be processed
- Normalize weight by box cardinality
Cost(v) = use standard cost models from exact subgraph matching
- Cardinality, estimated selectivity
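A rough sketch of the heuristic, under several assumptions that the slide does not spell out: an edge that lies in several boxes contributes the weight of each of them, the infinite core weight is replaced by a finite stand-in so the ratio stays comparable, and the cost estimate is supplied by the caller. All names are hypothetical.

```python
def progress(edges, box_weight, box_size, normalize=True):
    """Cumulative box weight of the unprocessed edges incident to a vertex.

    edges:      list of sets of box ids, one set per unprocessed edge.
    box_weight: box id -> weight (a finite stand-in for the core box).
    box_size:   box id -> number of edges in the box (its cardinality).
    normalize:  if True, divide each weight by the box cardinality.
    """
    total = 0.0
    for boxes in edges:
        for b in boxes:
            w = box_weight[b]
            total += w / box_size[b] if normalize else w
    return total

def goodness(edges, box_weight, box_size, cost, normalize=True):
    """Goodness(v) = progress(v) / cost(v); cost is an external estimate,
    for example a cardinality or selectivity figure."""
    return progress(edges, box_weight, box_size, normalize) / cost
```

The vertex with the highest goodness would be processed next. The `normalize` flag mirrors the distinction between plain and normalized box weight that the experiments later draw between PMatch-1 and PMatch-2.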
PMatch Algorithm
Initial vertex selection
[Figure: the rewritten query graph in its state after initialization: only the constant vertices have candidates.]
R = {} S = {}
R = {} S = {}
R = {} S = {}
R = {} S = {}
R = {<peter|2>} S = {2}
R = {<drama|6>} S = {6}
R = {<francis|0>} S = {0}
θ = {}
τ = 0.75
PMatch Algorithm
Processing of the “Francis” vertex
[Figure: query graph after processing Francis; remaining edges: Likes {0}, Likes {2,3}, Attended {2,3}, organized {2}, type {6}.]
R = {<Peter’s Bday|2,3>, <Silvester 09|2,3>} S = {2,3}
R = {<John|0>, <Mark|0>, <Melissa|0>} S = {0}
R = {} S = {}
R = {} S = {}
R = {<peter|2>} S = {2}
R = {<drama|6>} S = {6}
θ = {}
τ = 0.75
PMatch Algorithm
Threshold Pruning
[Figure: query graph after substituting ?f; remaining edges: Likes {2,3}, Attended {2,3}, organized {2}, type {6}.]
R = {<Peter’s Bday|2,3>, <Silvester 09|2,3>} S = {2,3}
R = {<Titanic|0>, <Star Wars|0>} S = {0}
R = {} S = {}
R = {<peter|2>} S = {2}
R = {<drama|6>} S = {6}
θ = {?f/Mark}
τ = 0.75
PMatch Algorithm
Processing vertex “?b”. Maintaining box indices
[Figure: query graph after substituting ?b; remaining edges: Attended {2,3}, organized {2}.]
R = {<Peter’s Bday|2,3>, <Silvester 09|2,3>} S = {2,3}
R = {} S = {}
R = {<Jenny|2,3>} S = {2,3}
R = {<peter|2>} S = {2}
θ = {?f/Mark, ?b/Titanic}
τ = 0.75
PMatch Algorithm
Crossed Threshold
[Figure: query graph after substituting ?u; remaining edge: organized {2}.]
R = {<Peter’s Bday|2,3>} S = {2,3}
R = {} S = {}
R = {<peter|2>} S = {2}
θ = {?f/Mark, ?b/Titanic, ?u/Jenny}
τ = 0.75
PMatch Algorithm
Found Result
[Figure: query graph with all variables substituted.]
R = {} S = {}
R = {<peter|2>} S = {2}
τ = 0.75
Probability: 0.997
θ = {?f/Mark, ?b/Titanic, ?u/Jenny, ?p/Peter’s Bday}
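The reported probability is consistent with the match probability formula if all three uncertain boxes (weights 6, 2 and 3) are matched and the offset is w0 = -5, as in the earlier answer example. That all three boxes are matched here is an inference from the numbers, not something the slide states:

```latex
\[
P \;=\; \frac{1}{1 + e^{-(w_0 + 6 + 2 + 3)}}
  \;=\; \frac{1}{1 + e^{-(-5 + 11)}}
  \;=\; \frac{1}{1 + e^{-6}}
  \;\approx\; 0.9975 \;>\; \tau = 0.75
\]
```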
Query Splitting
Selecting “Peter” first causes the query to be split.
[Figure: query graph after processing Peter; remaining edges: Friend {0}, Likes {0}, Likes {2,3}, Attended {2,3}, Attended {2,3}, type {6}.]
R = {} S = {}
R = {} S = {}
R = {} S = {}
R = {<drama|6>} S = {6}
R = {<francis|0>} S = {0}
R = {<Peter’s Bday|2>, <Silvester 09|2>} S = {2}
θ = {}
τ = 0.75
Outline
Motivation
PMatch Query Definition
PMatch Query Answering
Experiments
Related Work & Conclusion
Experimental Setup I
Implementation in Java (approx. 5,500 LOC) on top of Neo4j
Datasets
- Friendship Network (Flickr, Orkut, LiveJournal, YouTube): 778 million edges
- Delicious Network: 1.1 billion edges
Query Benchmark for each dataset: 9 PMatch queries of increasing complexity
Experimental Setup II
Compared
- PMatch-1: box weight
- PMatch-2: normalized box weight
- DOGMA: Exact matching for all combinations
- Neo4j: Exact matching for all combinations
  • Too slow: not shown in the charts
System: AMD Opteron, 15K RPM 300 GB SAS HD, 2 GB of heap space
Friendship network (small queries)
For each query, we report query times with cold (right) and warm (left) caches.
[Figure: bar chart of query time (ms), axis 0 to 350, for PMatch-1, PMatch-2 and DOGMA on queries SA (6B, 23E), SB (4B, 6E), SC (5B, 12E), SD (4B, 6E), SE (5B, 12E) and SF (6B, 23E).]
Friendship network (large queries)
For each query, we report query times with cold (right) and warm (left) caches.
[Figure: bar chart of query time (ms), axis 0 to 25,000, for PMatch-1, PMatch-2 and DOGMA on queries SG (4B, 6E), SH (5B, 12E) and SI (6B, 23E).]
Delicious network (small queries)
For each query, we report query times with cold (right) and warm (left) caches.
[Figure: bar chart of query time (ms), axis 0 to 7,000, for PMatch-1, PMatch-2 and DOGMA on queries TA (5B, 13E), TB (5B, 17E), TC (4B, 6E), TD (5B, 12E) and TE (5B, 12E).]
Delicious network (large queries)
For each query, we report query times with cold (right) and warm (left) caches.
[Figure: bar chart of query time (ms), axis 0 to 40,000, for PMatch-1, PMatch-2 and DOGMA on queries TF (4B, 6E), TG (5B, 18E), TH (5B, 18E) and TI (4B, 7E).]
Outline
Motivation
PMatch Query Definition
PMatch Query Answering
Experiments
Related Work & Conclusion
Related Work I
Exact Subgraph Matching
- Many systems: RDF-3X, OWLIM, YARS (2), Sesame, Jena, DOGMA, COSI, etc.
- Efficient on-disk index structures
- Query answering algorithms focused on exact queries (e.g. SPARQL)
No treatment of query uncertainty
Related Work II
Approximate Subgraph Matching
- SAGA, TALE, Grafil, etc.
- Database containing many small graphs
- Do not scale to large social networks
- Query uncertainty is defined in the system
PMatch provides an expressive language to specify probabilistic queries and answers them efficiently.
Conclusion
Query uncertainty often occurs in social network search, information retrieval and pattern matching.
PMatch provides a language to concisely express probabilistic subgraph matching queries and algorithms to answer them efficiently.
dogma.umiacs.umd.edu
Questions? Comments?
Related Work III
Data Uncertainty
- Specified at the tuple or edge level
- How to answer certain (SQL) queries over uncertain data and what are the answer probabilities?

First Name   Last Name   Income   Probability
John         Smith       80,000   0.6
Alice        Baker       85,000   0.5
Jennifer     Temper      90,000   0.9
Related Work III
Data Uncertainty
- How to encode dependencies?
- Need to make strong independence assumptions to be efficient
- Where are the probabilities coming from?
Instead, we assume certain data and model uncertainty in the user-specified query.
Query Uncertainty Motivation
User Uncertainty
- User does not know what she is looking for exactly
Lack of schema
- Lack of schema complicates query design
Data heterogeneity
- Can be caused by data integration
Noisy or Missing Data
- It’s real world data