email [email protected] Antonio Maccioni c p i o y ...torlone/bigdata/BigGraphDataManagement.pdf · Big Graph Data Management Antonio Maccioni h e l d-b y w h e r e Rome

Big Graph Data Management

Antonio Maccioni

held-by

where

Rome

14 May 2015 Big Data Course

email

locatedIn

topi

c

whe

n

[email protected]

affil

iate

d

is-a

this talk is about

● Graph Databases: models, languages, use cases

● Graph Database Management Systems

● Graph Processing Systems (intro)

● Open Problems in Graph Data Management ● ...and research project around us

scenario

is going towards graph search capabilities (Tao, Unicorn, Open Graph protocol, etc.)

is working on PGX (Parallel Graph Analytics), has launched Green-Marl and has implemented an RDF layer on top of Oracle NoSQL

developing a layer, VERTEXICA, for graph mining on top of the analytical database

launched the language interface SQL-GR to run graph analysis on top of its analytical database

has been working on the “Knowledge Graph” for a while, has created Pregel and has launched Cayley

has just aquired Aurelius TitanDB

has open-sourced its graph database FlockDB

has working on Trinity, an in-memory graph database

“Over 25 percent of enterprises will use graph databases by 2017”

- Enterprise DBMS, Q1 2014. Forrester Research

scenario

“graph databases are catching on commercially” - Michael Stonebraker (2014 ACM Turing Award)

nosql scenarioSimple data models

“Graph Databases are an odd fish in the NoSQL pond”

- P.J. Sadalage, M. Fowler - NoSQL Distilled

Simple data models

But if we want to represent connections we may opt for a graph database management systems:

nosql scenarioSimple data models

a graph

1

63

4

52

[1]->2->4->5[2]->3[3]->5[4]->5->6[5]->6

from\to 1 2 3 4 5 6

1 0 1 0 1 1 0

2 0 0 1 0 0 0

3 0 0 0 0 1 0

4 0 0 0 0 1 1

5 0 0 0 0 0 1

6 0 0 0 0 0 0

Adjacency matrix Adjacency lists

admin

works

belongsbelongs

admin

friends

married

belongs

belongs

likes

worked

friendslikes

follows

graph + data = graph database

why graph databases?

● More natural modeling

● Manage connections explicitly

● Run algorithms of network science (e.g., PageRank)

use case 1: semantic web

● A Web-scale architecture● for metadata and data

management (together)● for interoperarbility of data and services

● Web of Data● Compatible with other Web

technologies● Based a set of W3C standards

(HTTP, IRI/URI,RDF, SPARQL, OWL)

● Web 3.0● make information understandable by machines

● HTTP request of data by URI● You can follow links (the edges of the graph)


Semantic Web + Open Data = Linked Open Data



sparql

Which are the names of the people married with people born in Honolulu?

SELECT ?name{ ?person1 married ?person2 . ?person2 born ?city . ?city name “Honolulu” . ?person1 name ?name1 .}

?person1 ?person2married

?cityHonolulu name

name

?name1

name


sparql?person1 ?person2

married

?cityHonolulu name

name

?name1

Pattern matching style of querying

p2

n1

c2

p1

c1

name MichelleObama

marriedBarackObama

name

born

locatedIn

ChicagoHonolulu

USA

locatedIn

name

name

name

born

name

Pattern matching style of querying


p2

n1

c2

p1

c1

name MichelleObama

marriedBarackObama

name

born

locatedIn

ChicagoHonolulu

USA

locatedIn

name

name

name

born

sparql?person1 ?person2

married

?cityHonolulu name

name

?name1

name

triple stores

s p op1 married p2

p1 name Michelle Obama

p2 born c2

c2 name Honolulu

us name USA

... ... ...

Indexes on: (s,p,o), (s,o,p), (p,s,o), (p,o,s), (o,s,p), (o,p,s)

G:

query processing

G(P=name and O=Honolulu)

G(P=married)G(P=name)

G(P=name)

?person1 ?person2married

?city

Honoluluname

name

?name

?person1

?person1

?person2

?person2 ?city

?city

?person1

?name

name

name

?person1

?person2

married

?person2

?city

name

?city

Honolulu

use case 2: social networks

admin

works

belongsbelongs

admin

friends

married

belongs

belongs

likes

worked

friendslikes

follows

Find common friends for every single profile visit

~ 1.2 billions tuples * average number of friends per person

FRIEND 1 FROM_DAY FRIEND 2


Find common friends for every single profile visit

~ 1.2 billions tuples * average number of friends per person

FRIEND 1 FROM_DAY FRIEND 2


graph database management systems

Three main properties:

1. Property Graph (as data model)2. Index-free Adjacency (as physical

level organization)3. Path-traversal (as query language)

It's a schema-less

data model

A property graph is a directed multigraph g = (N, E) where every node n N∈ and every edge e E∈ is associated with a set of pairs <key, value>, called properties.

property graph data model

We say that a (graph) database g satisfies the index-free adjacency if the existence of an edge between two nodes n

1 and n

2 in g can

be tested on those nodes and does not require to access an external, global, index.

index-free adjacency

GOAL: make the cost of a basic traversal independent of the size of the database, in case keeping O(1)


...trying to keep the cost of a basic traversal O(1)


...trying to keep the cost of a basic traversal O(1)

neo4j physical layer

I. Robinson, J. Webber, E. Eifrem – Graph Databases, 2013.

● Store files for different parts of the graph

● Node store

● Relationship store

● Property store● Each record contains 4 properties● The properties of an element may use more records● Property's values can be either stored in the property store or stored in a dynamic

string store

titan, infinitegraph, levelgraph

LevelGraph

Built above the extensible column storeApache Cassandra

Built above the Object Oriented Database Objectivity/DB

Built on node.js above the key-value store LevelDB but pluggable to different stores

Server Mode1. Go to http://www.neo4j.org/, download Neo4J Server and unzip it

2. Run the command ./bin/neo4j start (use bin\Neo4j.bat on Windows)

3. Find a graphical dashboard at http://localhost:7474/

4. You can also use it with REST API: http://localhost:7474/db/data/

building a neo4j graph db

Embedded mode

1. Import in your java project the library neo4j-kernel-*-*-*.jar and its classes:

2. Create the database:

3. Create nodes and edges:

4. Set the properties:

import org.neo4j.graphdb.*;import org.neo4j.graphdb.factory.GraphDatabaseFactory;

GraphDatabaseService gdb = new GraphDatabaseFactory().newEmbeddedDatabase("/home/...");

Node n1 = gdb.createNode();Node n2 = gdb.createNode();Relationship e12 = n1.createRelationshipTo(n2, EdgeType.TYPE);

n1.setProperty(“name”, “Rome”);n2.setProperty(“name”, “Italy”);e12.setProperty(“type”, “locatedIn”);

Enum implementing RelationshipType

building a neo4j graph db

http://www.neo4j.org/

http://localhost:7474/

http://localhost:7474/db/data/

BLUEPRINTS

GREMLIN

PIPES

Blueprints is a property graph model interface with provided implementations.

Gremlin is a domain specific language for traversing property graphs

FRAMESFrames exposes the elements of a Blueprints graph as Java objects: software is written in terms of domain objects and their relationships to each other.

REXTER

FURNACEFurnace is a property graph algorithms package

http://www.tinkerpop.com/tinkerpop stack

building a graph db with blueprintshttps://github.com/tinkerpop/blueprints/wiki

1. Import in your java project the libraries blueprints-core-*.*.*.jar and blueprints-neo4j-graph-*.*.*.jar with their classes:

2. Create the database:

3. Create nodes and edges:

4. Set the properties:

import com.tinkerpop.blueprints.*;import com.tinkerpop.blueprints.impls.neo4j.Neo4jGraph;

Graph gdb = new Neo4jGraph("/home/...");

Vertex n1 = gdb.addVertex(null);Vertex n2 = gdb.addVertex(null);Edge e12 = gdb.addEdge(null, n1, n2, “locatedIn”);

n1.setProperty(“name”, “Rome”);n2.setProperty(“name”, “Italy”);e12.setProperty(“type”, “locatedIn”);

http://www.tinkerpop.com/

querying a graph database

●Gremlin:● Imperative query language● Descendant of languages such as XPATH

●Cypher:● Declarative query language● Descendant of languages such as SQL

https://github.com/tinkerpop/blueprints/wiki

gremlin> g = new Neo4jGraph("/home/...");

gremlin> g.V.outE.filter{it.edgeid == 'e1'}==>e[2][1EDGE>4]==>e[1][1EDGE>3]==>e[0][1EDGE>2]==>e[7][6EDGE>2]

gremlin: a path traversal QL

gremlin> g.V.outE.filter{it.edgeid == 'e1'}.inV.outE.filter{it.edgeid == 'e2'}.inV.nodeid

==>F==>E==>E

gremlin: a path traversal QL

cypher: a pattern matching QL

START: starting point in the graph

MATCH: the pattern to match, bound to the starting point

WHERE: filtering criteria

RETURN: what to return

MATCH: the pattern to match, bound to the starting point


node1edge1>node2edge2>node3node1[?]>node2[?]>node3

node1[*]>node3

Live hands-on a graph database about beers at http://console.neo4j.org/?id=beer

START n = node(*)MATCH n[r1:EDGE]>x[r2:EDGE]>mWHERE (r1.edgeid = 'e1') and (r2.edgeid = 'e2')RETURN m.nodeid==>”F”==>”E”==>”E”


http://console.neo4j.org/?id=beer

● Secondary Indexes: defined on properties

● Transactions: graph databases usually support ACID properties. In Neo4J all operations have to be performed in a transaction:

● Other programming language wrappers:

try ( Transaction tx = gdb.beginTx() ){ tx.success();}

other features

graph processing systems

Pregel/Giraph

● Frameworks to compute (distributed) graph analysis on large graphs:● have similar motivations of Hadoop, Spark, etc.● help programmers to focus on the algorithm rather

than on the implementation● support different types of graphs● provide a variety of algorithms already implemented

GraphLab/Dato Pegasus

google pregel/apache giraph● User specifies a vertex program

● Computation runs a sequence of supersteps

● In each superstep the program is executed over all the vertexes

● The program can use messages received in a previous superstep from the neighbors and can send messages to them for the next superstep

● A vertex can deactivate itself and the computation halts when all the vertexes are deactivated.

● How to model a graph database?

● How to migrate data and queries from existing databases?

● How to scale queries over large graphs?

research problems about graph DBs

Compact: Sparse: Dense: Reduces the number of data accesses

Can violate property graph constraints

Accesses and updates can be inefficient

Reduces number of joins

Needs human intervention for a semantic enrichment

modeling graph databases

Orienting the ER:


ENTITY 1

ENTITY 2

RELATIONSHIP

(0:1)

RELATIONSHIP : 0

(0:1)

ENTITY 1

ENTITY 2

ENTITY 1

ENTITY 2

RELATIONSHIP

(0:N)

RELATIONSHIP : 1

(0:1)

ENTITY 1

ENTITY 2

ENTITY 1

ENTITY 2

RELATIONSHIP

(0:N)

RELATIONSHIP : 2

(0:N)

ENTITY 1

ENTITY 2

R. De Virgilio, A. Maccioni, R. Torlone – Model-driven design of graph databases, ER International Conference on Conceptual Modeling, 2014.

Oriented-ER:



Partitioning:modeling graph databases

Rule 1: if a node n is disconnected then it forms a group by itself.

Rule 2: if a node n has w−(n)>1 and w+(n)>0 then n forms a group by itself.

Rule 3: if a node n has w−(n)<2 and w+(n)<2 then n is added to the group of a node m such that there exists the edge (m, n) in the O-ER diagram.


Partitioning:modeling graph databases


Graph Database Templatemodeling graph databases


SQLselect *from Twhere T.A1 = v1

R. De Virgilio, A. Maccioni, R. Torlone – R2G: a Tool for Migrating Relations to Graphs – EDBT International Conference on Extending Database Technology, 2014

R2G: from relations to graphs

R2G: unifiability of data values• Joinable tuples t1 R∈ 1 and t2 ∈ R2:

– there is a foreign key constraint between R1.A and R2.B and t1[A] = t2[B].

• Unifiability of data values t1[A] and t2[B]:– (i) t1=t2 and both A and B do not belong to a multi-attribute

key;– (ii) t1 and t2 are joinable and A belongs to a multi-attribute key;– (iii) t1 and t2 are joinable, A and B do not belong to a

multi-attribute key and there is no other tuple t3 that is joinable with t2.


R2G: schema graph

Full Schema Paths:

FR.fuser US.uid US.uname→ →FR.fuser FR.fblog BG.bid BG.bname→ → →FR.fuser FR.fblog BG.bid BG.admin US.uid US.uname→ → → → → ...


R2G: data migration


R2G: query migration


scalability over real-world graphs

Graph Databases are hard to scale

> 500 million users > 1.2 billion active users > 500 million users

power-law graphs

scale-free graphspreferential attachment

Graph Databases are hard to scale

10% of the users follow the same five users

real-world graphs

Replication Partitioning

Graph Databases are very hard to scale

10% of the users follow the same five users

real-world graphs

real-world graphs

A. Maccioni, D. J. Abadi – Scalable Pattern Matching over Compressed Graphs via Sparsification

follows

follows

follows

real-world graphs


real-world graphs


any redundancy?


high-degree node

low-degree nodecompressor node

SPARSIFICATION

sparsification


GREEDY SPARSIFICATION

2 B

1 C

A3 5

4

6

2B

1 C

A3

5

4

BC

AB

ABC

6

compression via sparsification

2B

1 C

A

3

5

4

ABC

6

SPACE-AWARE SPARSIFICATION


with Space-aware Compressed Graphs

graph pattern matchingwith Greedy Compressed Graphs


●How to shard/partition a graph database?

●How to visualize large graphs?● Specialized startups are addressing the problem

●Standardization of a query language

●Graph processing with GPUs● Medusa-gpu, MapGraph

●What is the best way to implements graph layer on top of SQL/NoSQL systems?● IBM, Oracle, Teradata, HP, ...

open problems

●Graphs are used in many fields● Social Networks, Bioinformatics, Semantic Web, Geo-informatics, ...

●Graphs are more complicated to manage than other types of data● and we need different considerations

●When we need to store a big graph● we have many options, each one with both advantages and disadvantages

●Scaling queries over graph databases is still an infant area of database industry and research

conclusion

thanks for the attention

Documents

email [email protected] Antonio Maccioni c p i o y ...torlone/bigdata/BigGraphDataManagement.pdf · Big Graph Data Management Antonio Maccioni h e l d-b y w h e r e Rome