Graph-like power
Roman R.
MATCH (a:Actor), (m:Movie) WHERE a.name = 'Keanu Reeves' AND m.title = 'The Matrix' CREATE (a)-[:ACTS_IN]->(m)
Today
○ Graphs in NoSQL world
○ classification
○ definition
○ components
○ Neo4j
○ nodes, rels, props, indexes
○ Cypher
○ PHP and Neo4j
○ Demo
○ Alternatives
○ Q/A
NoSQL Databases
Key-Value
Document
Graph
Column (BigTable)
MemcacheDB
Redis
Riak
Cassandra
CouchDB
Neo4j
TITAN
HBase/Hadoop
OrientDB
Elasticsearch
RavenDB
Tokyo Cabinet
InfiniteGraph
AllegroGraph
NoSQL
MongoDB
What is a Graph in math
● represents a set of connected objects
● graph:
  ○ vertex (node/point)
  ○ edge (arc/line/relationship/arrow) - undirected
  ○ attribute (property) - on a node/relationship
● types:
  ○ pair: G = (V, E)
  ○ digraph: D = (V, A)
  ○ mixed: G = (V, E, A)
V = {1, 2, 3, 4, 5, 6}
E = {{1, 2}, {1, 5}, {2, 3}, {2, 5}, {3, 4}, {4, 5}, {4, 6}}
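The vertex and edge sets above can be written down directly. A minimal Python sketch, assuming an adjacency map is an acceptable representation (the helper names `build_adjacency` and `degree` are illustrative and not from the slides):

```python
# Undirected graph G = (V, E) from the slide.
V = {1, 2, 3, 4, 5, 6}
E = [{1, 2}, {1, 5}, {2, 3}, {2, 5}, {3, 4}, {4, 5}, {4, 6}]

def build_adjacency(vertices, edges):
    """Map every vertex to the set of vertices it shares an edge with."""
    adjacency = {v: set() for v in vertices}
    for edge in edges:
        a, b = tuple(edge)
        adjacency[a].add(b)   # undirected: record the edge in both directions
        adjacency[b].add(a)
    return adjacency

def degree(adjacency, vertex):
    """Number of edges touching the vertex."""
    return len(adjacency[vertex])

adjacency = build_adjacency(V, E)
print(sorted(adjacency[4]))   # neighbours of vertex 4
print(degree(adjacency, 6))   # vertex 6 hangs off vertex 4 only
```

Every edge is counted once from each end, so the degrees add up to twice the number of edges.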
What is a Graph database
● stores data in a graph and retrieves vast networks of data
● shines when storing richly-connected data
● consists of nodes, connected by relationships
  ○ A Graph —records data in→ Nodes —which have→ Properties
  ○ Nodes —are organized by→ Rels —which also have→ Properties
  ○ Nodes —are grouped by→ Labels —into→ Sets
  ○ A Traversal —navigates→ a Graph; it —identifies→ Paths —which order→ Nodes
  ○ An Index —maps from→ Properties —to either→ Nodes or Rels
  ○ A Graph Database —manages a→ Graph and —also manages related→ Indexes
Nodes, Rels, Props, Labels
A Graph—records data in→ Nodes—which have→ Properties
Nodes—are organized by→ Relationships—which also have→ Properties
Nodes—are grouped by→ Labels—into→ Sets
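To make the node / relationship / property / label vocabulary concrete, here is a small in-memory sketch in Python. It is not Neo4j's API: the class names `Node` and `Relationship`, the helper `nodes_with_label`, and the sample data are assumptions made for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    labels: set = field(default_factory=set)         # labels group nodes into sets
    properties: dict = field(default_factory=dict)   # key/value properties

@dataclass
class Relationship:
    start: Node
    end: Node
    rel_type: str
    properties: dict = field(default_factory=dict)   # relationships carry properties too

def nodes_with_label(nodes, label):
    """Return the nodes grouped under one label."""
    return [n for n in nodes if label in n.labels]

# Illustrative sample data.
keanu = Node(1, {"Person", "Actor"}, {"name": "Keanu Reeves"})
matrix = Node(2, {"Movie"}, {"title": "The Matrix"})
acts_in = Relationship(keanu, matrix, "ACTS_IN", {"role": "Neo"})

actors = nodes_with_label([keanu, matrix], "Actor")
print([n.properties["name"] for n in actors])
print(acts_in.start.properties["name"], acts_in.rel_type, acts_in.end.properties["title"])
```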
Graph Traversal
A Traversal—navigates→ a Graph
it—identifies→ Paths—which order→ Nodes
what music do my friends like that I don't yet own?
if this power supply goes down, what web services are affected?
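The first question is a two-hop traversal followed by a filter. A rough Python sketch over plain dictionaries, assuming hypothetical FRIEND, LIKES and OWNS relationships and invented sample data (none of these names come from the slides):

```python
# Hypothetical sample data: who is friends with whom, who likes and owns which albums.
friends = {"me": ["ann", "bob"], "ann": ["me"], "bob": ["me"]}
likes = {"ann": ["Album A", "Album B"], "bob": ["Album B", "Album C"]}
owns = {"me": ["Album A"]}

def music_friends_like_that_i_lack(person):
    """Follow FRIEND then LIKES relationships and drop what `person` already owns."""
    owned = set(owns.get(person, []))
    recommended = set()
    for friend in friends.get(person, []):        # first hop: FRIEND
        for album in likes.get(friend, []):       # second hop: LIKES
            if album not in owned:
                recommended.add(album)
    return sorted(recommended)

print(music_friends_like_that_i_lack("me"))
```

Only the part of the graph reachable from the start node is touched, which is the point of a traversal.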
Graph Index
An Index—maps from→ Properties—to either→ Nodes or Rels
find the Account for username master-of-graphs
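An index in this sense is a map from property values to the nodes that hold them. A minimal sketch, assuming nodes are plain dictionaries (the function `build_index` and the second account are invented for illustration; Neo4j's own indexes are backed by Lucene, as noted later in the deck):

```python
from collections import defaultdict

# Hypothetical account nodes, keyed by node id.
nodes = {
    1: {"label": "Account", "username": "master-of-graphs"},
    2: {"label": "Account", "username": "graph-novice"},
}

def build_index(nodes, key):
    """Map each value of property `key` to the ids of the nodes that carry it."""
    index = defaultdict(list)
    for node_id, properties in nodes.items():
        if key in properties:
            index[properties[key]].append(node_id)
    return index

username_index = build_index(nodes, "username")
# Lookup is a single dictionary access instead of a scan over all nodes.
print(username_index["master-of-graphs"])
```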
Graph
A Graph Database—manages a→ Graph and—also manages related→ Indexes
What a graph database looks like
A Graph Database transforms a RDBMS
A Graph Database elaborates a Key-Value Store
K* = key, V* = value
A Graph Database relates Column-Family
● BigTable databases are an evolution of key-value, using "families" to allow grouping of rows
● stored in a graph, the families could become hierarchical, and the relationships among data become explicit
A Graph Database navigates a Document Store
D = Document, S = Subdocument, V = Value, D2/S2 = reference
NoSQL Data Models
90% of all use cases
Relational Databases
Neo4j features
● intuitive, using a graph model for data representation
● reliable, fully transactional, upholds ACID
● durable and fast, using a custom disk-based, native storage engine
● massively scalable, up to several billion nodes/relationships/properties
● highly available, when distributed across multiple machines
● expressive, with a powerful, human-readable declarative graph query language
● fast, with a powerful traversal framework for high-speed graph queries
● embeddable, with a few small jars
● simple, accessible through a convenient REST API or an object-oriented Java API
● indexes are based on Apache Lucene; secondary indexes are supported
● has been in commercial development for 10 years and in production for over 7 years; since 2003
● cross-platform; simple set-up; well documented; open source
● GPL for Community, AGPL for Enterprise
Neo4j requirements
● CPU - Intel Core i3/i7
● Memory - 2GB .. 16/32GB
● Disk - 10GB SATA .. SSD w/ SATA
● Filesystem - ext4 .. ext4/ZFS
● Software - Oracle Java 7
Neo4j license
● Neo4j Community
  ○ Open-Source High Performance
  ○ fully ACID transactional graph database
● Neo4j Enterprise
  ○ High-Performance Cache (up to 10x faster)
  ○ Horizontal scalability with Neo4j Clustering (predictable scalability)
  ○ High-availability and online backups
  ○ Cache based sharding (shard your graph in memory)
  ○ Advanced Monitoring (operational metrics)
  ○ Certified for Windows and Linux
  ○ Email/Phone Support (10x5, 24x7 hours)
  ○ Subscriptions
    ■ Personal (up to 3 devs, $100k annual revenue) = FREE
    ■ Startups (<$10M funding, <$5M annual revenue) = $12k
    ■ Business (medium, to Global 2000) = Contact Sales
Neo4j vs. MySQL
● for the simple friends-of-friends query, Neo4j is 60% faster than MySQL
● for friends of friends of friends, Neo4j is 180 times faster
● for the depth-four query, Neo4j is 1,135 times faster
● MySQL just chokes on the depth-five query
Neo4j: Nodes
● fundamental units that form a graph
● can have key/value-style properties
● nodes and relationships are indexed by {key, value} pairs
● represent entities
Neo4j: Relationships #1/2
● connect entities and structure the domain
● allow for finding related data
● are always directed (outgoing or incoming)
● are equally well traversed in either direction
● a node can have a relationship to itself
● have a relationship type (label)
Neo4j: Relationships #2/2
Neo4j: Properties
● nodes and relationships can have properties
● properties are key-value pairs
  ○ the key is a string
  ○ a value is either a primitive or an array of one primitive type
    ■ boolean, String, int, int[], etc.
    ■ types as defined in the Java Language Specification
● used for entity attributes, relationship qualities, and metadata
Neo4j: Labels
● used to group nodes into sets
● a node can have any number of labels, including none
● can be added and removed at runtime
● can be used to mark temporary states for nodes
● label names are case-sensitive
● CamelCase (convention)
Neo4j: Paths
● a path is one or more nodes with connecting relationships
● shortest path:
● a path of length one:
● a path of length one:
Neo4j: Traversal
● Traversal Framework available out of the box
● traversing means visiting nodes, following relationships according to rules
● in most cases only a subgraph is visited
● callback-based traversal API
  ○ you can specify the traversal rules
● traversing breadth-first or depth-first
● open Java API
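The ideas on this slide (rule callbacks, breadth-first versus depth-first order, visiting only a subgraph) can be sketched in a few lines of Python. This is not the Neo4j Traversal Framework; the function `traverse`, its `should_follow` callback and the sample graph are assumptions for illustration.

```python
from collections import deque

# Hypothetical directed graph as an adjacency map.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

def traverse(graph, start, should_follow, breadth_first=True):
    """Visit nodes from `start`; `should_follow(node, depth)` is the rule callback
    that decides whether the traversal continues past a node."""
    frontier = deque([(start, 0)])
    visited = []
    while frontier:
        # Taking from the front gives breadth-first order, from the back depth-first.
        node, depth = frontier.popleft() if breadth_first else frontier.pop()
        if node in visited:
            continue
        visited.append(node)
        if should_follow(node, depth):
            # reversed() keeps left-to-right neighbour order when used as a stack
            ordered = graph[node] if breadth_first else reversed(graph[node])
            frontier.extend((neighbour, depth + 1) for neighbour in ordered)
    return visited

print(traverse(graph, "a", lambda node, depth: True))                       # breadth-first
print(traverse(graph, "a", lambda node, depth: True, breadth_first=False))  # depth-first
print(traverse(graph, "a", lambda node, depth: depth < 1))                  # only a subgraph
```

The third call stops expanding below depth 1, so node `d` is never visited.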
Neo4j: graph algorithms
● A*: uses the A* algorithm to find the cheapest path between two nodes
● Dijkstra (dijkstra): uses the Dijkstra algorithm to find the cheapest path between two nodes
● PathWithLength: finds all paths of a certain length (depth) between two nodes
● Shortest paths (shortestPath, the default): finds all the shortest paths between two nodes
● All simple paths (allSimplePaths): finds all simple paths, without loops, between two nodes
● All paths (allPaths): finds all available paths between two nodes
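As an illustration of the "cheapest path" idea, here is a compact textbook Dijkstra in Python using the standard `heapq` module. It is a generic sketch, not Neo4j's implementation; the graph, its weights and the function name `cheapest_path` are invented for the example.

```python
import heapq

# Hypothetical weighted, directed graph: node -> list of (neighbour, cost).
graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}

def cheapest_path(graph, start, goal):
    """Dijkstra's algorithm: return (total_cost, path) or None if unreachable."""
    queue = [(0, start, [start])]        # (cost so far, node, path taken)
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbour, weight in graph[node]:
            if neighbour not in settled:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return None

print(cheapest_path(graph, "A", "D"))
```

The direct edge A to C costs 4, but going through B costs 3, so the cheapest route to D runs A, B, C, D.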
Neo4j: Schema
● Neo4j is a schema-optional graph database
Neo4j: Index
● introduced in Neo4j 2.0
● eventually available (populated in the background, not immediately available for querying)
  ○ comes online once fully populated
  ○ on failed status, drop and recreate the index
● can be created for the group of nodes that share a label
● indexes Nodes & Rels
● node_auto_indexing=false, node_keys_indexable
Neo4j: Constraints
● can help you keep your data clean
● specify the rules for what your data should look like
● unique constraints are the only available constraint type
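What a unique constraint enforces can be shown with a toy in-memory store. This is only a sketch of the rule, not how Neo4j implements it; `LabelStore` and `UniqueConstraintViolation` are hypothetical names.

```python
class UniqueConstraintViolation(Exception):
    """Raised when a second node would reuse a value that must be unique."""

class LabelStore:
    """Hypothetical in-memory store for the nodes of one label."""
    def __init__(self, unique_key):
        self.unique_key = unique_key
        self.nodes = []
        self._seen = set()        # values already taken for the unique property

    def create(self, **properties):
        value = properties.get(self.unique_key)
        if value is not None and value in self._seen:
            raise UniqueConstraintViolation(f"{self.unique_key}={value!r} already exists")
        if value is not None:
            self._seen.add(value)
        self.nodes.append(properties)
        return properties

people = LabelStore(unique_key="name")
people.create(name="Keanu Reeves")
try:
    people.create(name="Keanu Reeves")     # rejected: duplicate value
except UniqueConstraintViolation as error:
    print("rejected:", error)
```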
Neo4j: Data Size
● single server instance
  ○ nodes = 2^35 (~34 billion)
  ○ relationships = 2^35 (~34 billion)
  ○ labels = 2^31 (~2 billion)
  ○ properties = 2^36 to 2^38 depending on property types (maximum ~274 billion, always at least ~68 billion)
  ○ relationship types = 2^15 (~32'000)
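The rounded figures follow directly from the powers of two. A quick Python check, using only the exponents quoted on the slide (the `in_billions` helper is illustrative):

```python
# Exponents taken from the slide.
limits = {
    "nodes": 2 ** 35,
    "relationships": 2 ** 35,
    "labels": 2 ** 31,
    "properties (min)": 2 ** 36,
    "properties (max)": 2 ** 38,
    "relationship types": 2 ** 15,
}

def in_billions(value):
    """Express a count in billions, rounded to one decimal place."""
    return round(value / 1e9, 1)

for name, value in limits.items():
    print(f"{name}: {value:,}")
```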
Cypher Query Language
● powerful graph query language
● relatively simple
● declarative grammar (say what you want, not how)
● humane query language
● self-explanatory (based on English prose and neat iconography)
● written in Scala
● pattern-matching (borrows expression approaches from SPARQL)
● aggregation, ordering, limits
● create, update, delete
● structure and most keywords inspired by SQL
● changing rather rapidly (CYPHER 1.9 START ...)
"Makes the simple things easy, and the complex things possible"
Cypher patterns #1/2
● (a)
● (b)
● (a)-->(b)
● (a)-->(b)-->(c)
● (b)-->(c)<--(a)
● (b)-->()<--(a)
● (a)--(b)
● (a)-[*5]->(b)
● (a)-[*3..5]->(b)
  ○ (a)-[*3..]->(b)
  ○ (a)-[*..5]->(b)
  ○ (a)-[*]->(b)
Cypher patterns #2/2
● (a:Label)-->(m)
● (a:User:Admin)-->(m)
● (a)--(m)
● (a)-[r]->(m)
● (a)-[:ACTED_IN]->(m)
● (a)-[r:SOME|ELSE|WTH]->(m)
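A variable-length pattern asks for every path between two nodes whose number of relationships falls in a range, for example 3 to 5 hops from a to b. The Python sketch below mimics that idea on a small adjacency map; the function `paths_between` and the sample graph are assumptions for illustration, not Cypher's actual matcher.

```python
# Hypothetical directed graph as an adjacency map.
graph = {"a": ["x"], "x": ["y"], "y": ["b"], "b": []}

def paths_between(graph, start, end, min_hops, max_hops):
    """All directed paths from start to end using min_hops..max_hops relationships,
    never reusing a relationship within one path."""
    results = []

    def walk(node, path, used_edges):
        hops = len(path) - 1
        if node == end and min_hops <= hops <= max_hops:
            results.append(path)
        if hops == max_hops:
            return
        for neighbour in graph[node]:
            edge = (node, neighbour)
            if edge not in used_edges:
                walk(neighbour, path + [neighbour], used_edges | {edge})

    walk(start, [start], frozenset())
    return results

print(paths_between(graph, "a", "b", 3, 5))   # a range of 3 to 5 hops
print(paths_between(graph, "a", "b", 1, 2))   # range too short: no match
```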
Cypher: START / RETURN
"It all starts with the START"
Michael Hunger, Cypher webinar, Sep 2012
● designates the start points
● START is optional (in Neo4j >= 2.0)
Examples:
● START <lookup> RETURN <expression>
● START n=node(0) RETURN n
● START n=node(*) RETURN n.name
Cypher: MATCH
● primary way of getting data from the database
● START <lookup> MATCH <pattern> RETURN <expr>
● OPTIONAL MATCH <pattern> RETURN <expr>
Examples:
● MATCH (n) RETURN count(n)
● MATCH (actor:Actor) RETURN actor.name;
● START me=node(0) MATCH (me)--(f) RETURN f.name
● MATCH (n)-[r]->(m) RETURN n AS FROM, r AS `->`, m AS TO
Cypher: CREATE / SET
● creates nodes and relationships
● CREATE (<name>[:label] [properties,..])
● CREATE (<node-in>)-[<var>:RELATION [properties,..]]->(<node-out>);
● CREATE UNIQUE ...
Examples:
● CREATE (n:Actor { name:"Keanu Reeves" });
● CREATE (keanu)-[:ACTED_IN]->(matrix)
● MATCH (keanu {name:".."}) SET keanu.age=49 RETURN keanu
Cypher: WHERE
● filters the results
● MATCH <pattern> WHERE <condition> RETURN <expr>
Examples:
● WHERE n.name =~ "(?i)John.*"
● WHERE NOT ..
● WHERE type(rel) =~ "Perso.*"
Cypher: RETURN
● creates the result table
● any query can return data
● can be nodes, relationships, or properties on these
● RETURN DISTINCT <expression> AS x
● RETURN aggregate(expr) as alias
● RETURN nodes, rels, properties
● RETURN expressions of funcs and operators
● RETURN aggregation funcs on the above
Cypher: etc
● CASE / WHEN / ELSE
● ORDER BY node.key, node2.key, .. ASC|DESC
● LIMIT / SKIP
● WITH (WITH count(*) as c)
● UNION / UNION ALL (combining results from multiple queries)
● USING INDEX/SCAN
● MERGE / SET / DELETE / REMOVE / FOREACH
● Expressions
● Operators
● Comments
● Functions: ALL, ANY, LENGTH, {Math}, {String}, ...
Cypher: Transactions
● any updating query will run in a transaction
● ACID
● "it is very important to finish each transaction"
● write lock on node/rel:
  ○ adding, changing or removing a property on a node/rel
● write lock on node:
  ○ creating or deleting a node
● write lock on the relationship and both its nodes:
  ○ creating or deleting a relationship
Cypher: Aggregation
● count(node/rel/prop)
● count(n), count(n.prop)
● sum(n.prop)
● avg(n.prop)
● percentileDisc(n.prop, {median})
● stdev(n.prop) - calculates the standard deviation for the group
● max(n.prop)
● collect(n.prop)
● RETURN n, count(*)
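For readers who think in terms of ordinary collections, the aggregation functions map roughly onto Python's built-ins and the `statistics` module. The sample values are invented, and `percentile_disc` below uses one common nearest-rank definition; Neo4j's exact rounding rule may differ.

```python
import math
import statistics
from collections import Counter

# Hypothetical property values, e.g. n.age for six matched nodes.
ages = [25, 31, 31, 40, 47, 52]

def percentile_disc(values, percentile):
    """Discrete percentile: the smallest actual value whose rank covers `percentile` (0..1)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile * len(ordered)))
    return ordered[rank - 1]

print(len(ages))                       # count(n.prop)
print(sum(ages))                       # sum(n.prop)
print(statistics.mean(ages))           # avg(n.prop)
print(max(ages))                       # max(n.prop)
print(percentile_disc(ages, 0.5))      # discrete median
print(statistics.stdev(ages))          # sample standard deviation
print(Counter(ages))                   # grouping key plus count(*), one entry per group
```

The last line mirrors `RETURN n, count(*)`: the non-aggregated expression acts as the grouping key.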
Cypher: back to SQL #1/5
● SELECT *
  FROM Person
  WHERE name = 'Valentin' AND age > 30
● START person=node:Person(name="Valentin")
  WHERE person.age > 30
  RETURN person
Cypher: back to SQL #2/5
● SELECT "Email".*
  FROM "Person"
  JOIN "Email" ON "Person".id = "Email".person_id
  WHERE "Person".name = 'Benedikt'
● START person=node:Person(name="Benedikt")
  MATCH person-[:email]->email
  RETURN email
Cypher: back to SQL #3/5
● show me all people that are both actors and directors
● SELECT name FROM Person
  WHERE person_id IN (SELECT person_id FROM Actor)
    AND person_id IN (SELECT person_id FROM Director)
● START person=node:Person("name:*")
  WHERE (person)-[:ACTS_IN]->() AND (person)-[:DIRECTED]->()
  RETURN person.name
Cypher: back to SQL #4/5
● show me all Tom Hanks's co-actors
● SELECT DISTINCT co_actor.name FROM Person tom
  JOIN Actor a1 ON tom.person_id = a1.person_id
  JOIN Actor a2 ON a1.movie_id = a2.movie_id
  JOIN Person co_actor ON co_actor.person_id = a2.person_id
  WHERE tom.name = 'Tom Hanks'
● START tom=node:Person(name="Tom Hanks")
  MATCH tom-[:ACTS_IN]->movie, co_actor-[:ACTS_IN]->movie
  RETURN DISTINCT co_actor.name
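The co-actor query is two ACTS_IN hops that meet at a movie. A small Python sketch of the same logic over a list of (actor, movie) pairs; the sample rows and the function `co_actors` are illustrative only.

```python
# Sample ACTS_IN relationships as (actor, movie) pairs, for illustration.
acts_in = [
    ("Tom Hanks", "Apollo 13"),
    ("Kevin Bacon", "Apollo 13"),
    ("Tom Hanks", "Cast Away"),
    ("Helen Hunt", "Cast Away"),
    ("Lucy Liu", "Kill Bill"),
]

def co_actors(acts_in, name):
    """Actors sharing at least one movie with `name` (two ACTS_IN hops via a movie)."""
    movies = {movie for actor, movie in acts_in if actor == name}
    return sorted({actor for actor, movie in acts_in
                   if movie in movies and actor != name})

print(co_actors(acts_in, "Tom Hanks"))
```

Using a set plays the role of DISTINCT: an actor who shares several movies with Tom Hanks still appears once.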
Cypher: back to SQL #5/5
● show me all Lucy's favorite directors
● SELECT dir.name, count(*) FROM Person lucy
  JOIN Actor ON lucy.person_id = Actor.person_id
  JOIN Director ON Actor.movie_id = Director.movie_id
  JOIN Person dir ON Director.person_id = dir.person_id
  WHERE lucy.name = 'Lucy Liu'
  GROUP BY dir.name
  ORDER BY count(*) DESC
● START lucy=node:Person(name="Lucy Liu")
  MATCH lucy-[:ACTS_IN]->movie, director-[:DIRECTED]->movie
  RETURN director.name, count(*)
  ORDER BY count(*) DESC
Cypher: back to SQL #6/5
START lucy = node:Person(name="Lucy Liu"),
      kevin = node:Person(name="Kevin Bacon")
MATCH p = shortestPath( lucy-[:ACTS_IN*]-kevin )
RETURN EXTRACT (n in NODES(p):
       COALESCE(n.name?, n.title?))
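This is the query that has no practical SQL counterpart. What a shortest-path search does can be sketched as a breadth-first search over actors and movies, ignoring relationship direction. The data below is invented ("Actor B", "Movie X" and "Movie Y" are placeholders), and `shortest_path` is an illustrative helper rather than Neo4j's algorithm.

```python
from collections import deque

# Hypothetical ACTS_IN data: actor -> movies.
acts_in = {
    "Lucy Liu": ["Movie X"],
    "Actor B": ["Movie X", "Movie Y"],
    "Kevin Bacon": ["Movie Y"],
}

def shortest_path(acts_in, start, goal):
    """Breadth-first search over actors and movies, ignoring relationship direction."""
    neighbours = {}
    for actor, movies in acts_in.items():
        for movie in movies:
            neighbours.setdefault(actor, []).append(movie)
            neighbours.setdefault(movie, []).append(actor)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path(acts_in, "Lucy Liu", "Kevin Bacon"))
```

The returned list alternates actor names and movie titles, which is what the COALESCE over name and title produces in the Cypher version.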
Neo4j Shell
● command-line shell for running Cypher queries
● supports remote shell
● :schema
● bash# neo4j-shell -path data/graph.db -readonly -config conf/neo4j.properties -c "<command>"
Neo4j: Security
● does not deal with data encryption explicitly
● all security means built into Java can be used
● an encrypted datastore can be used
● webadmin over HTTPS
SPARQL
● manipulates data stored in RDF format
● focused on matching sets of triples
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
  ?person a foaf:Person.
  ?person foaf:name ?name.
  ?person foaf:mbox ?email.
}
Gremlin
● graph traversal language
● scripting language
● Pipe & Filter (similar to jQuery)
● works across different graph databases
● based on Groovy (limited to Java)
● not as stable in Neo4j
● XPath-like
● ./outE[label="family"]/inV/@name
● g.v(1).out('likes').in('likes').out('likes').groupCount(m)
● g.V.as('x').out.groupCount(m).loop('x'){c++ < 1000}
● g.v(1).in('LOVE_OF').out('SOME_IN').has('title','abc').back(2)
Neo4j and PHP
● everyman/neo4jphp (on packagist.org)
  ○ PHP wrapper for Neo4j using the REST interface
  ○ follows the PSR-0 autoloading standard
  ○ basic wrappers for all components
  ○ last update - a month ago
  ○ supports Gremlin
● Neo4j-PHP OGM (largely based on the above)
  ○ Object Graph Mapper, inspired by Doctrine
  ○ based on Doctrine\Common
  ○ borrows significantly from the Doctrine\ORM design
  ○ uses annotations on classes
  ○ MIT Licence
● Neo4J PHP REST API client
  ○ uses the Neo4j REST API
  ○ Node create/find/delete
  ○ Relationship create/list/filter
High Availability with Neo4j
● in HA - a single master and zero or more slaves
● slaves synchronize with the master to preserve consistency
● the master writes to a slave before the transaction completes
Demo
Neo4j.org Example Datasets:
● DrWho (nodes=1'060; rels=2'286)
● Cineasts Movies & Actors (nodes=64'069; rels=121'778)
● Hubway Data Challenge (nodes=554'674; rels=2'011'904)
GraphGist:
● JIRA and neo4j
● PHP and neo4j
● Kant in neo4j
XSS
Gephi (win, nix, mac)
Linkurious.us
Neoclipse (eclipse plugin)
KeyLines (JavaScript library)
Graffeine (npm package)
Neovigator (neography + processing.js)
Cloud
● Heroku
  ○ GrapheneDB beta
  ○ bash$ heroku addons:add graphenedb
● Jelastic Cloud PaaS
Analogues
● GrapheneDB - based on neo4j
● AllegroGraph - Closed Source, Commercial, RDF-QuadStore
  ○ graph database built around the W3C spec for the Resource Description Framework
  ○ supports SPARQL, RDFS++, and Prolog
● Sones - Closed Source, .NET focused
● Virtuoso - Closed Source, RDF focused
● GraphDB - graph database built in .NET by the German company sones
● InfiniteGraph - goal is to create a graph database with "virtually unlimited scalability"
● FlockDB
Docs
● http://docs.neo4j.org/chunked/snapshot/
● http://docs.neo4j.org/refcard/2.0/
● http://graphdatabases.com/ - book, O'Reilly
● http://www.cs.usfca.edu/~galles/visualization/Algorithms.html - Graph Algorithms visualization
● http://bit.ly/rr-neo4j
● https://github.com/itspoma/test-neo4j
Conclusion
● best used for graph-style, rich or complex, structured, dense, interconnected data: deep graphs with unlimited depth, cycles, and weighted connections
● quickly add new functionality without impacting existing deployments
● being schema-less forces you to rethink your entire approach to data
● not a silver bullet for all problems