
Graph Databases

Marco Serafini

COMPSCI 532, Lecture 10

Graph DB Use Cases
• Social network queries
  • E.g., Facebook stores the entire metadata in a social graph
• Network security
  • Find sequence of steps that lead to intrusion
• Fraud detection
  • Find fraud rings
• Knowledge bases
  • Answer questions, language models

Resource Description Framework
• World Wide Web Consortium specification
• Used for the Semantic Web
  • Web pages define human-readable content
  • Goal: add machine-readable metadata describing how pages relate
• Format to reuse and share data across the Web
• Examples
  • Wikipedia, census, life sciences, DBPedia
• Directed labeled multi-graph

RDF Format
• Graph is a set of triples = (Subject, Predicate, Object)
• Subject and predicate are resources
  • Associated with Uniform Resource Identifiers (URIs)
• Object can be a resource or a literal (string)

From S. Decker et al., “Framework for the Semantic Web: An RDF Tutorial”
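
As a minimal sketch of the triple model above, an RDF graph can be held in memory as a set of (subject, predicate, object) tuples. The `sn:` prefix mirrors the social-network ontology used in the SPARQL example of these slides. The specific resources (`sn:alice`, `sn:bob`) and the `Triple`, `Literal`, and `objects` names are hypothetical, and prefixed strings stand in for full URIs.

```python
# A sketch of an RDF graph as a set of (subject, predicate, object) triples.
# Resource names such as "sn:alice" are hypothetical stand-ins for full URIs.
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # always a resource
    predicate: str  # always a resource
    obj: str        # a resource or a literal


class Literal(str):
    """Marks an object that is a plain string value rather than a resource."""


graph = {
    Triple("sn:alice", "sn:hasName", Literal("alice01")),
    Triple("sn:alice", "sn:isFriendOf", "sn:bob"),
    Triple("sn:bob", "sn:livesIn", "sn:Paris"),
}


def objects(graph, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return {t.obj for t in graph if t.subject == subject and t.predicate == predicate}


def is_literal(value):
    """Distinguish literal objects from resource objects."""
    return isinstance(value, Literal)
```

Because the graph is just a set of triples, several edges with different predicates may connect the same pair of resources, which is what makes it a labeled multi-graph.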

Query Language: SPARQL
• Declarative
• Defines a query graph
• RDF store must find all instances in data graph
• Example
  • “Return friends of user alice01 who live in Paris”

    PREFIX sn: <http://socialnetwork.com/ontology/>
    SELECT ?friend
    WHERE {
      ?user sn:hasName "alice01" ;
            sn:isFriendOf ?friend .
      ?friend sn:livesIn sn:Paris .
    }
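
To illustrate what "find all instances of the query graph in the data graph" means, here is a naive sketch that evaluates a list of triple patterns against a list of triples. It is not how an RDF store is implemented; the data, the `match_pattern` and `evaluate` helpers, and the convention that terms starting with `?` are variables are all assumptions made for this example.

```python
# Sketch of evaluating a query graph (a list of triple patterns) against a
# data graph. Terms starting with "?" are variables. The data is made up.

def match_pattern(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`."""
    new_binding = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if new_binding.setdefault(term, value) != value:
                return None  # variable already bound to something else
        elif term != value:
            return None      # constant does not match
    return new_binding


def evaluate(patterns, data):
    """Return all variable bindings that satisfy every pattern."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in data
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


data = [
    ("sn:alice", "sn:hasName", "alice01"),
    ("sn:alice", "sn:isFriendOf", "sn:bob"),
    ("sn:alice", "sn:isFriendOf", "sn:carol"),
    ("sn:bob", "sn:livesIn", "sn:Paris"),
    ("sn:carol", "sn:livesIn", "sn:Rome"),
]

# The three patterns of the SPARQL query above.
query = [
    ("?user", "sn:hasName", "alice01"),
    ("?user", "sn:isFriendOf", "?friend"),
    ("?friend", "sn:livesIn", "sn:Paris"),
]

friends = sorted(b["?friend"] for b in evaluate(query, data))
```

Each pattern scans the whole data list here, so the cost grows with the number of partial bindings times the number of triples; a real store would use indexes instead.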

Property Graph Format
• Vertices and edges can have associated properties
  • Key-value pairs
• Vertices can be grouped by label
  • Similar to tables, e.g., employees
  • Properties are similar to columns of a table
• Not a “global” format: no URIs required
• Typically more compact than RDF
• Common in NoSQL graph databases
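
The property graph model can be sketched with a few small classes. This is an illustrative in-memory layout, not the storage format of any particular graph database; `Vertex`, `Edge`, `PropertyGraph`, and the sample data are all hypothetical names chosen for the example.

```python
# Sketch of a property graph: labeled vertices and edges, each carrying
# key-value properties. All class and field names here are illustrative.
from dataclasses import dataclass, field


@dataclass
class Vertex:
    vid: int
    label: str                                      # groups vertices, like a table name
    properties: dict = field(default_factory=dict)  # like the columns of a row


@dataclass
class Edge:
    src: int
    dst: int
    label: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.vertices = {}
        self.out_edges = {}

    def add_vertex(self, vid, label, **properties):
        self.vertices[vid] = Vertex(vid, label, properties)
        self.out_edges.setdefault(vid, [])

    def add_edge(self, src, dst, label, **properties):
        self.out_edges[src].append(Edge(src, dst, label, properties))

    def by_label(self, label):
        """Return all vertices with the given label, like scanning one table."""
        return [v for v in self.vertices.values() if v.label == label]

    def neighbors(self, vid, edge_label):
        """Follow outgoing edges of one label from vertex `vid`."""
        return [self.vertices[e.dst] for e in self.out_edges[vid] if e.label == edge_label]


g = PropertyGraph()
g.add_vertex(1, "User", name="alice01")
g.add_vertex(2, "User", name="bob02")
g.add_vertex(3, "City", name="Paris")
g.add_edge(1, 2, "isFriend", since=2019)
g.add_edge(2, 3, "livesIn")
```

Vertices are identified by local integer IDs rather than URIs, which reflects the "not a global format" point above.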

Query Languages
• Cypher
  • Originally used by Neo4j
  • Linear queries
• Previous example in Cypher

    MATCH (u:User)-[:isFriend]->(f:User)-[:livesIn]->(:City {name: 'Paris'})
    WHERE u.name = 'alice01'
    RETURN f.name

Relational Representation of Graphs
• Graphs in a relational DBMS
  • Vertex table, edge table
  • Sometimes edges as triples
• Pattern matching
  • Maintain a set of partial matches
  • Extend by edge: self-join on edge table
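
The relational encoding can be tried out with Python's built-in `sqlite3` module. The sketch below assumes a hypothetical schema (`vertex(id, label, name)` and `edge(src, dst, label)`) and made-up rows, and answers the "friends of alice01 who live in Paris" query with a self-join on the edge table.

```python
# Sketch: a graph stored as a vertex table and an edge table in SQLite,
# queried with a self-join on the edge table. Schema and rows are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertex (id INTEGER PRIMARY KEY, label TEXT, name TEXT);
    CREATE TABLE edge (src INTEGER, dst INTEGER, label TEXT);
    INSERT INTO vertex VALUES
        (1, 'User', 'alice01'), (2, 'User', 'bob02'),
        (3, 'User', 'carol03'), (4, 'City', 'Paris'), (5, 'City', 'Rome');
    INSERT INTO edge VALUES
        (1, 2, 'isFriend'), (1, 3, 'isFriend'),
        (2, 4, 'livesIn'), (3, 5, 'livesIn');
""")

# e1 holds the partial match (user)-[isFriend]->(friend); joining e2 on
# e1.dst = e2.src extends it by one more edge, (friend)-[livesIn]->(city).
rows = conn.execute("""
    SELECT f.name
    FROM vertex u
    JOIN edge e1 ON e1.src = u.id AND e1.label = 'isFriend'
    JOIN edge e2 ON e2.src = e1.dst AND e2.label = 'livesIn'
    JOIN vertex f ON f.id = e1.dst
    JOIN vertex c ON c.id = e2.dst
    WHERE u.name = 'alice01' AND c.label = 'City' AND c.name = 'Paris'
""").fetchall()

friend_names = [name for (name,) in rows]
```

Every additional edge in the query pattern adds another join against `edge`, which is why long path patterns turn into long join chains in this representation.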

Why are Graph Workloads Hard?
• Many joins: difficult to estimate cardinality
  • Joins require random access
  • Cardinality estimation gets harder at every join
• Skew: few vertices have very high degree
• Indexing
  • Adjacency list scans are very frequent
  • Graph-aware databases optimize these
• Some queries have very low selectivity
  • E.g., triangle closure (potential friends)

Worst-Case Optimal Joins
• Worst-case optimality
  • O(intermediate results) <= O(final results)
• Edge-at-a-time approach is not worst-case optimal
  • Number of triangles: O(|E|^{3/2})
  • Number of wedges: O(|E|^2)
• Vertex-at-a-time (multi-way joins) are WCO
  • (v1,v2), (v1,v2,v3), (v1,v2,v3,v4), …
  • Will not materialize all wedges
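
The difference between the two strategies can be seen on a small skewed graph. The sketch below is written in the spirit of the vertex-at-a-time approach (binding v1, then v2, then v3 by intersecting adjacency sets) and is not a tuned worst-case optimal join implementation. The function names and the star-shaped input are assumptions made for this example.

```python
# Sketch comparing edge-at-a-time and vertex-at-a-time triangle listing on an
# undirected graph. The star-plus-one-triangle input is a made-up example.
from collections import defaultdict


def build_adjacency(edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def triangles_edge_at_a_time(adj):
    """Materialize every wedge (a, b, c) with a < c, then check the closing edge."""
    wedges = [(a, b, c) for b in adj for a in adj[b] for c in adj[b] if a < c]
    triangles = {frozenset(w) for w in wedges if w[2] in adj[w[0]]}
    return triangles, len(wedges)


def triangles_vertex_at_a_time(adj):
    """Bind v1, v2, then v3 by intersecting adjacency sets; no wedge list is built."""
    triangles = set()
    for v1 in adj:
        for v2 in adj[v1]:
            if v1 < v2:
                for v3 in adj[v1] & adj[v2]:
                    if v2 < v3:
                        triangles.add(frozenset((v1, v2, v3)))
    return triangles


# Hub 0 connected to 1..50, plus a single edge (1, 2) that closes one triangle.
edges = [(0, i) for i in range(1, 51)] + [(1, 2)]
adj = build_adjacency(edges)

by_edge, wedge_count = triangles_edge_at_a_time(adj)
by_vertex = triangles_vertex_at_a_time(adj)
```

On this input the edge-at-a-time version builds more than a thousand wedges to find a single triangle, because the high-degree hub pairs up all of its neighbors, while the intersection-based version only touches candidates that are adjacent to both v1 and v2.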

Subgraph Isomorphism (TurboISO)
• SubTask 1: match spanning tree from one starting vertex
• SubTask 2: match cross-edges

[Figure: heavyweight case (10^4 subgraphs * 2 edge lookups) versus lightweight case (220 edge lookups); labels: single starting vertex v, multiple matching vertices.]

TurboISO: Flexible Join Order

Hard to Parallelize

[Figure: running time (ms).]

Subgraph Enumeration
• Count all instances of an unlabeled pattern
  • E.g., triangles, squares, cliques
• Important to rule out permutations
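
Why permutations matter can be shown with a brute-force count. This sketch, which assumes a made-up input (the complete graph on 4 vertices) and hypothetical helper names, counts triangles once over all ordered vertex assignments and once with the ordering constraint a < b < c.

```python
# Sketch: counting triangles with and without symmetry breaking.
# The complete graph on 4 vertices (K4) is a made-up input with 4 triangles.
from itertools import combinations, permutations


def is_triangle(adj, a, b, c):
    return b in adj[a] and c in adj[b] and a in adj[c]


def count_all_mappings(adj):
    """Count every ordered match; each triangle appears once per permutation."""
    return sum(1 for a, b, c in permutations(adj, 3) if is_triangle(adj, a, b, c))


def count_unique(adj):
    """Enforce a < b < c so each triangle is counted exactly once."""
    return sum(1 for a, b, c in combinations(sorted(adj), 3) if is_triangle(adj, a, b, c))


vertices = range(4)
adj = {v: {u for u in vertices if u != v} for v in vertices}

ordered_matches = count_all_mappings(adj)   # 6 matches per triangle
unique_triangles = count_unique(adj)
```

A triangle has 3! = 6 automorphisms, so the unconstrained count is six times too large; patterns with other symmetries (squares, cliques) need their own ordering constraints.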

Reachability Queries
• Given two vertices v and u
• Find (and/or rank) paths connecting them
• Simplest approach: parallel BFS from both vertices
  • Expensive
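
A minimal sketch of the "BFS from both vertices" idea follows. It assumes an undirected graph given as a dictionary of adjacency lists, only answers whether a path exists (it does not rank paths), and alternates between the two searches instead of running them in parallel. The `reachable` function and the sample graph are hypothetical.

```python
# Sketch of a bidirectional BFS reachability check on an undirected graph.
# It runs the two searches in alternation rather than truly in parallel.

def reachable(adj, v, u):
    """Return True if some path connects v and u."""
    if v == u:
        return True
    seen_a, seen_b = {v}, {u}
    frontier_a, frontier_b = [v], [u]
    while frontier_a and frontier_b:
        # Always expand the smaller frontier; swap roles so it is side "a".
        if len(frontier_a) > len(frontier_b):
            seen_a, seen_b = seen_b, seen_a
            frontier_a, frontier_b = frontier_b, frontier_a
        next_frontier = []
        for x in frontier_a:
            for y in adj.get(x, ()):
                if y in seen_b:
                    return True  # the two searches met
                if y not in seen_a:
                    seen_a.add(y)
                    next_frontier.append(y)
        frontier_a = next_frontier
    # One side exhausted its connected component without meeting the other.
    return False


adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3], 5: [6], 6: [5]}
```

Even with two frontiers, each search may still visit a large part of the graph when vertices have high degree, which is one reason the slide calls this approach expensive.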

Dynamic Graphs
• Temporal analysis → deal with multiple snapshots
• Real-time analytics → work on live graph data
• Storage implications

[Figure: UPDATES → TRANSACTIONAL SYSTEM (dynamic data structure + transactions; e.g., B-Tree, LSMT) → LOAD → ANALYTICAL SYSTEM (read-only data structure, no transactions; e.g., CSR) → RESULTS.]

Graph Storage for RT Analytics
• Sequential adjacency list scan is important
• CSR: sequential scan but read-only
• TEL: log-based adjacency list

[Figure: three plots comparing TEL, LSMT, B+Tree, and Linked List over graph scale V from 2^20 to 2^26. Panels: cache misses (cache miss/edge), seek time (µs/vertex), and edge scan (ns/edge).]
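
To make the CSR trade-off concrete, here is a small sketch of the compressed sparse row layout using plain Python lists. The `build_csr` and `neighbors` helpers and the sample edge list are assumptions for illustration; production systems would use packed integer arrays.

```python
# Sketch of the compressed sparse row (CSR) layout: all adjacency lists are
# packed into one array, so scanning a vertex's neighbors is sequential.
# Adding an edge would require rebuilding the arrays, hence "read-only".

def build_csr(num_vertices, edges):
    """Return (offsets, targets) for a directed edge list."""
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]          # prefix sum of out-degrees
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()              # next free slot per vertex
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def neighbors(offsets, targets, v):
    """Scan the contiguous slice holding v's adjacency list."""
    return targets[offsets[v]:offsets[v + 1]]


edges = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 3)]
offsets, targets = build_csr(4, edges)
```

The adjacency list of vertex `v` is the slice `targets[offsets[v]:offsets[v + 1]]`, so a scan touches consecutive memory, but inserting a single edge shifts every later entry.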

Open Issues
• Graph analytics algorithms are diverse
• Still looking for good APIs
  • There is no “SQL for graphs”
• Hard to leverage hardware characteristics
  • Scale out to distributed systems: hard because of edge cut
  • SIMD: hard because of skew and random access
  • Caching: hard because of random access
