Graph Databases
Marco Serafini
COMPSCI 532, Lecture 10
Graph DB Use Cases
• Social network queries
  • E.g. Facebook stores the entire metadata in a social graph
• Network security
  • Find sequence of steps that lead to intrusion
• Fraud detection
  • Find fraud rings
• Knowledge bases
  • Answer questions, language models
Resource Description Framework
• World Wide Web Consortium (W3C) specification
• Used for the Semantic Web
  • Web pages define human-readable content
  • Goal: add machine-readable metadata describing how pages relate
• Format to reuse and share data across the Web
• Examples
  • Wikipedia, census, life sciences, DBPedia
• Directed labeled multi-graph
RDF Format
• Graph is a set of triples = (Subject, Predicate, Object)
• Subject and predicate are resources
  • Associated with Uniform Resource Identifiers (URIs)
• Object can be a resource or a literal (string)
From S. Decker et al., “Framework for the Semantic Web: An RDF Tutorial”
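The triple format can be sketched in a few lines of Python. This is only an illustration of the data model, not how an RDF store is implemented; the `http://example.org/` URIs, the names, and the `objects` helper are all hypothetical.

```python
# Hypothetical example data: the URIs and names below are made up for illustration.
EX = "http://example.org/"

# An RDF graph is a set of (subject, predicate, object) triples.
triples = {
    (EX + "alice", EX + "knows", EX + "bob"),    # object is a resource (URI)
    (EX + "alice", EX + "knows", EX + "carol"),  # same predicate, second edge
    (EX + "alice", EX + "name", "Alice"),        # object is a literal (string)
    (EX + "bob", EX + "name", "Bob"),
}


def objects(subject, predicate):
    """Return all objects o such that (subject, predicate, o) is in the graph."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}


friends = objects(EX + "alice", EX + "knows")
```

Because the predicate labels each edge and several triples may connect the same pair of resources, the set of triples is a directed labeled multi-graph.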
Query Language: SPARQL
• Declarative
• Defines a query graph
• RDF store must find all instances in data graph
• Example
  • “Return friends of user alice01 who live in Paris”

PREFIX sn: <http://socialnetwork.com/ontology/>
SELECT ?friend
WHERE {
  ?user sn:hasName "alice01" ;
        sn:isFriendOf ?friend .
  ?friend sn:livesIn sn:Paris .
}
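To make "find all instances of the query graph in the data graph" concrete, here is a minimal sketch of a naive pattern matcher in Python. It assumes triples are plain tuples of strings and that terms starting with `?` are variables; the `match` function and the sample data (`sn:u1`, `sn:u2`, `sn:u3`) are hypothetical, and a real RDF store would use indexes rather than scanning every triple.

```python
def match(patterns, triples, binding=None):
    """Yield variable bindings under which every pattern is a triple in the data.

    Terms starting with '?' are variables; everything else is a constant.
    """
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        new = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, new)


# Hypothetical data graph.
data = [
    ("sn:u1", "sn:hasName", "alice01"),
    ("sn:u1", "sn:isFriendOf", "sn:u2"),
    ("sn:u1", "sn:isFriendOf", "sn:u3"),
    ("sn:u2", "sn:livesIn", "sn:Paris"),
    ("sn:u3", "sn:livesIn", "sn:Rome"),
]

# The three triple patterns of the query graph.
query = [
    ("?user", "sn:hasName", "alice01"),
    ("?user", "sn:isFriendOf", "?friend"),
    ("?friend", "sn:livesIn", "sn:Paris"),
]

friends = sorted({b["?friend"] for b in match(query, data)})
```

Each recursive call extends a partial match by one triple pattern, which is the edge-at-a-time strategy discussed later in the lecture.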
Property Graph Format
• Vertices and edges can have associated properties
  • Key-value pairs
• Vertices can be grouped by label
  • Similar to tables, e.g., employees
  • Properties are similar to columns of a table
• Not a “global” format: no URIs required
• Typically more compact than RDF
• Common in NoSQL graph databases
Query Languages
• Cypher
  • Originally used by Neo4j
  • Linear queries
• Previous example in Cypher

MATCH (u:User)-[:isFriend]->(f:User)-[:livesIn]->(:City {name: 'Paris'})
WHERE u.name = 'alice01'
RETURN f.name
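As a rough illustration of the property graph model and of what the path pattern above asks for, the sketch below stores labeled vertices with key-value properties and walks the two-hop path by hand. The `Vertex` class, the `out` and `friends_in` helpers, and all sample data are hypothetical; this is not how Neo4j evaluates Cypher.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    label: str                                  # groups vertices, like a table name
    props: dict = field(default_factory=dict)   # key-value properties


# Hypothetical sample graph.
vertices = {
    1: Vertex("User", {"name": "alice01"}),
    2: Vertex("User", {"name": "bob"}),
    3: Vertex("User", {"name": "carol"}),
    4: Vertex("City", {"name": "Paris"}),
    5: Vertex("City", {"name": "Rome"}),
}
# Edges as (source id, edge type, target id); edge properties are omitted here.
edges = [
    (1, "isFriend", 2),
    (1, "isFriend", 3),
    (2, "livesIn", 4),
    (3, "livesIn", 5),
]


def out(v, edge_type):
    """Targets of outgoing edges of the given type (linear scan, for brevity)."""
    return [t for (s, e, t) in edges if s == v and e == edge_type]


def friends_in(user_name, city_name):
    """Names f.name for paths (u:User)-[:isFriend]->(f:User)-[:livesIn]->(:City)."""
    result = []
    for uid, u in vertices.items():
        if u.label != "User" or u.props.get("name") != user_name:
            continue
        for fid in out(uid, "isFriend"):
            if vertices[fid].label != "User":
                continue
            for cid in out(fid, "livesIn"):
                city = vertices[cid]
                if city.label == "City" and city.props.get("name") == city_name:
                    result.append(vertices[fid].props["name"])
    return result


paris_friends = friends_in("alice01", "Paris")
```

Note that no URIs appear anywhere: identifiers are local to this graph, which is part of why the format tends to be more compact than RDF.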
Relational Representation of Graphs
• Graphs in a relational DBMS
  • Vertex table, edge table
  • Sometimes edges as triples
• Pattern matching
  • Maintain a set of partial matches
  • Extend by edge: self-join on edge table
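The "extend by edge" step can be tried directly with Python's built-in `sqlite3` module. This is a minimal sketch under assumptions: the `edge(src, dst)` schema and the four sample edges are made up, and the vertex table is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
conn.executemany(
    "INSERT INTO edge VALUES (?, ?)",
    [(1, 2), (2, 3), (3, 1), (3, 4)],  # hypothetical directed edges
)

# Partial matches of length 2: one self-join on the edge table.
two_hop = conn.execute(
    """
    SELECT e1.src, e1.dst, e2.dst
    FROM edge AS e1
    JOIN edge AS e2 ON e2.src = e1.dst
    ORDER BY e1.src, e1.dst, e2.dst
    """
).fetchall()

# Directed triangles: a second self-join extends each partial match
# and closes the cycle back to the starting vertex.
triangles = conn.execute(
    """
    SELECT e1.src, e2.src, e3.src
    FROM edge AS e1
    JOIN edge AS e2 ON e2.src = e1.dst
    JOIN edge AS e3 ON e3.src = e2.dst AND e3.dst = e1.src
    ORDER BY e1.src
    """
).fetchall()
```

With this data the single cycle 1 → 2 → 3 → 1 is returned three times, once per starting vertex, and every additional pattern edge costs another self-join.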
Why are Graph Workloads Hard?
• Many joins: difficult to estimate cardinality
  • Joins require random access
  • Cardinality estimation gets harder at every join
  • Skew: few vertices have very high degree
• Indexing
  • Adjacency list scans are very frequent
  • Graph-aware databases optimize these
• Some queries have very low selectivity
  • E.g. triangle closure (potential friends)
Worst-Case Optimal Joins
• Worst-case optimality
  • O(intermediate results) <= O(final results)
• Edge-at-a-time approach is not worst-case optimal
  • Number of triangles: $O(|E|^{3/2})$
  • Number of wedges: $O(|E|^{2})$
• Vertex-at-a-time (multi-way joins) are WCO
  • (v1,v2), (v1,v2,v3), (v1,v2,v3,v4), …
  • Will not materialize all wedges
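The gap between wedges and triangles is easy to see on a skewed graph. The sketch below compares an edge-at-a-time plan, which materializes every wedge before checking the closing edge, with a vertex-at-a-time plan that extends (v1, v2) to (v1, v2, v3) by intersecting adjacency sets. Function names and the star-shaped test graph are hypothetical, and the second function only illustrates the idea rather than a tuned worst-case optimal join.

```python
from collections import defaultdict


def build_adj(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def triangles_edge_at_a_time(edges):
    """Join edges pairwise: materialize every wedge, then check the closing edge."""
    adj = build_adj(edges)
    # A wedge (a, b, c) is a path a - b - c centered at b.
    wedges = [(a, b, c) for b in adj for a in adj[b] for c in adj[b] if a < c]
    tris = {tuple(sorted(w)) for w in wedges if w[2] in adj[w[0]]}
    return tris, len(wedges)


def triangles_vertex_at_a_time(edges):
    """Extend (v1, v2) to (v1, v2, v3) by intersecting adjacency sets."""
    adj = build_adj(edges)
    tris = set()
    for v1 in adj:
        for v2 in adj[v1]:
            if v1 < v2:
                # Only common neighbours are ever produced; no wedge list is built.
                for v3 in adj[v1] & adj[v2]:
                    if v2 < v3:
                        tris.add((v1, v2, v3))
    return tris


# Hypothetical skewed graph: a hub with 50 neighbours plus one closing edge.
star = [(0, i) for i in range(1, 51)] + [(1, 2)]
tris_edge, num_wedges = triangles_edge_at_a_time(star)
tris_vertex = triangles_vertex_at_a_time(star)
```

On this graph the edge-at-a-time plan builds 1227 wedges to find a single triangle, which is the kind of blow-up in intermediate results that the worst-case optimality condition rules out.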
Subgraph Isomorphism (TurboISO)
• SubTask 1: Match spanning tree from one starting vertex
• SubTask 2: Match cross-edges
[Figure: example with a single starting vertex v and multiple matching vertices (groups of 10 and 100 candidates). Labels: "heavyweight: $10^4$ subgraphs × 2 edge lookups" and "lightweight: 220 edge lookups".]
TurboISO: Flexible Join Order
Hard to Parallelize
[Figure: plot with y-axis "Running time (ms)"]
Subgraph Enumeration
• Count all instances of an unlabeled pattern
  • E.g. triangles, squares, cliques
• Important to rule out permutations
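Why permutations matter can be shown with triangle counting. In the sketch below, a naive enumeration of ordered vertex triples reports every triangle six times, while imposing an order a < b < c on the matched vertices reports each triangle once. The function names and the 4-clique test graph are hypothetical; imposing a vertex order is one common way to break symmetry, not necessarily the one any particular system uses.

```python
from itertools import permutations


def count_triangles_naive(adj):
    """Count ordered triples (a, b, c) forming a triangle: each appears 6 times."""
    return sum(
        1
        for a, b, c in permutations(adj, 3)
        if b in adj[a] and c in adj[b] and a in adj[c]
    )


def count_triangles_ordered(adj):
    """Count only matches with a < b < c, so each triangle appears exactly once."""
    return sum(
        1
        for a in adj
        for b in adj[a] if a < b
        for c in adj[b] if b < c and c in adj[a]
    )


# Hypothetical test graph: a 4-clique, which contains 4 distinct triangles.
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
naive = count_triangles_naive(k4)
ordered = count_triangles_ordered(k4)
```

The factor of six is the number of automorphisms of a triangle; larger patterns such as cliques have more, so the wasted work grows with the pattern.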
Reachability Queries
• Given two vertices v and u
• Find (and/or rank) paths connecting them
• Simplest approach: parallel BFS from both vertices
  • Expensive
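The BFS-from-both-vertices approach can be sketched as a bidirectional search that alternates between the two frontiers. This is a sequential, single-machine sketch, assuming an undirected graph stored as a dictionary of neighbour lists; the function name and the sample graph are hypothetical. It returns only the shortest-path length, not the paths or a ranking.

```python
from collections import deque


def bidirectional_bfs(adj, source, target):
    """Return the length (in edges) of a shortest path, or None if unreachable.

    Assumes an undirected graph given as {vertex: iterable of neighbours}.
    """
    if source == target:
        return 0
    dist_s, dist_t = {source: 0}, {target: 0}
    frontier_s, frontier_t = deque([source]), deque([target])
    while frontier_s and frontier_t:
        # Expand the smaller frontier by one full level.
        if len(frontier_s) <= len(frontier_t):
            frontier, dist, other = frontier_s, dist_s, dist_t
        else:
            frontier, dist, other = frontier_t, dist_t, dist_s
        best = None
        for _ in range(len(frontier)):
            v = frontier.popleft()
            for w in adj.get(v, ()):
                if w in other:
                    # The two searches meet on edge (v, w).
                    cand = dist[v] + 1 + other[w]
                    best = cand if best is None else min(best, cand)
                elif w not in dist:
                    dist[w] = dist[v] + 1
                    frontier.append(w)
        if best is not None:
            return best
    return None


# Hypothetical graph: a path 1-2-3-4-5 and an isolated vertex 6.
graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4], 6: []}
hops = bidirectional_bfs(graph, 1, 5)
```

Even with two frontiers, each level may touch a large share of the graph when high-degree vertices are reached, which is one reason the slide labels this approach expensive.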
Dynamic Graphs
• Temporal analysis → Deal with multiple snapshots
• Real-time analytics → Work on live graph data
• Storage implications
[Figure: UPDATES go to a TRANSACTIONAL SYSTEM (dynamic data structure + transactions, e.g., B-Tree, LSMT); data is LOADed into an ANALYTICAL SYSTEM (read-only data structure, no transactions, e.g., CSR), which produces RESULTS.]
Graph Storage for RT Analytics
• Sequential adjacency list scan is important
• CSR: Sequential scan but read-only
• TEL: Log-based adjacency list
[Figure: three plots comparing TEL, LSMT, B+Tree, and Linked List as graph scale |V| grows from $2^{20}$ to $2^{26}$. Panels: "Cache misses" (cache misses per edge), "Seek time" (µs per vertex, seeks), and "Edge scan" (ns per edge, scan).]
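To see why CSR gives sequential adjacency scans but is effectively read-only, here is a minimal sketch of building CSR arrays in Python. The function names and sample edges are hypothetical, and real systems use packed integer arrays rather than Python lists.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays (offsets, targets) from a list of directed (src, dst) edges."""
    # Count out-degrees, shifted by one so the prefix sum yields start offsets.
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]
    # Place each edge target into the contiguous slot range of its source.
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()  # next free slot per vertex
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def neighbours(offsets, targets, v):
    """Adjacency list of v: one contiguous slice, so the scan is sequential."""
    return targets[offsets[v]:offsets[v + 1]]


# Hypothetical directed graph with 4 vertices.
offsets, targets = build_csr(4, [(0, 1), (0, 2), (1, 2), (2, 0), (2, 3)])
```

Adding an edge for vertex 0 would shift every later entry of `targets` and change every later offset, so updates amount to rebuilding the arrays. A log-based layout accepts appends instead, at some cost to scan locality.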
Open Issues
• Graph analytics algorithms are diverse
• Still looking for good APIs
  • There is no “SQL for graphs”
• Hard to leverage hardware characteristics
  • Scale out to distributed systems: hard because of edge cut
  • SIMD: hard because of skew and random access
  • Caching: hard because of random access