Graph databases in computational biology: the case of Neo4j and TitanDB

Graph databases in computational biology: Neo4j and TitanDB

Andrei Kucharavy

23/08/2013

Rigid structure of Interactions = Interactome

Knowledge access structure = GO

Why even bother?

Those are Graphs

Why even bother?

~1 GB of raw data from Reactome

~300 MB of data from UniProt / GO / Ensembl mappings

=> Well over the conventional 1024 MB JVM heap limit => heap crash

~15 minutes to load

A nightmare to visualize and debug

Relational Databases

Intro to neo4j presentation jexp @ slideshare

Data models

Graph databases

Intro to neo4j presentation jexp @ slideshare

Core abstractions

Objects: nodes (vertices)

Relationships between nodes (edges)

Properties for vertices and edges

[Figure: nodes Node1, Node2 and Node3, each with its own properties (Property1, Property2, Property3)]

Core abstractions

Objects: nodes (vertices)

Relationships between nodes (edges)

Properties for vertices and edges

Operations:

Immediate relations

Traversals:

Get the shortest path from j to k

Get the path with the least weight from j to k, ...
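The abstractions and operations above can be sketched in a few lines of plain Python. This is a toy in-memory model, not how Neo4j stores data; the class name `PropertyGraph`, its methods and the `weight` property are all illustrative assumptions.

```python
import heapq
from collections import deque


class PropertyGraph:
    """Toy property graph: nodes and edges both carry property dicts."""

    def __init__(self):
        self.nodes = {}  # node id -> property dict
        self.adj = {}    # node id -> list of (neighbour id, edge property dict)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props
        self.adj.setdefault(node_id, [])

    def add_edge(self, src, dst, **props):
        # Undirected for simplicity: the edge is stored on both endpoints.
        self.adj[src].append((dst, props))
        self.adj[dst].append((src, props))

    def neighbours(self, node_id):
        """Immediate relations of a node."""
        return [n for n, _ in self.adj[node_id]]

    def shortest_path(self, j, k):
        """Fewest-hops path from j to k (breadth-first search)."""
        parent = {j: None}
        queue = deque([j])
        while queue:
            cur = queue.popleft()
            if cur == k:
                return self._unwind(parent, k)
            for nxt, _ in self.adj[cur]:
                if nxt not in parent:
                    parent[nxt] = cur
                    queue.append(nxt)
        return None

    def lightest_path(self, j, k, weight="weight"):
        """Least-total-weight path from j to k (Dijkstra)."""
        dist = {j: 0.0}
        parent = {j: None}
        heap = [(0.0, j)]
        done = set()
        while heap:
            d, cur = heapq.heappop(heap)
            if cur in done:
                continue
            done.add(cur)
            if cur == k:
                return self._unwind(parent, k)
            for nxt, props in self.adj[cur]:
                nd = d + props[weight]
                if nxt not in dist or nd < dist[nxt]:
                    dist[nxt] = nd
                    parent[nxt] = cur
                    heapq.heappush(heap, (nd, nxt))
        return None

    @staticmethod
    def _unwind(parent, node):
        # Walk the parent pointers back to the start node.
        path = []
        while node is not None:
            path.append(node)
            node = parent[node]
        return path[::-1]


g = PropertyGraph()
for name in ("Node1", "Node2", "Node3"):
    g.add_node(name, kind="protein")
g.add_edge("Node1", "Node2", weight=1.0)
g.add_edge("Node2", "Node3", weight=1.0)
g.add_edge("Node1", "Node3", weight=5.0)

print(g.shortest_path("Node1", "Node3"))
print(g.lightest_path("Node1", "Node3"))
```

The two traversals disagree on purpose: the direct edge has the fewest hops, while the detour through Node2 has the lower total weight.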

Main advantages promised

Increased speed for graph-type applications:

Avoid a join on 10M rows to get ~20 related elements

Traversals

Simplified programming:

Java objects

XML / RDF / OWL

Schema alterations

Why join millions of rows if only 10 relationships are interesting?

What to do if we want traversals?
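A rough illustration of the point about joins, in plain Python. The edge table, its size and the function names are made up for the example, and the scan is deliberately unindexed; a real relational database would use an index, so this only shows why keeping neighbours next to the node makes the lookup independent of the total number of rows.

```python
# Hypothetical edge table: 100 000 rows of (source id, target id),
# built so that every source node has exactly 20 related elements.
edge_rows = [(i % 5000, (i * 7 + 3) % 1000) for i in range(100_000)]


def related_by_scan(node_id):
    """Table-style lookup without an index: every row is inspected."""
    return [dst for src, dst in edge_rows if src == node_id]


# Graph-style storage: each node keeps direct references to its neighbours.
adjacency = {}
for src, dst in edge_rows:
    adjacency.setdefault(src, []).append(dst)


def related_by_adjacency(node_id):
    """Cost depends on the number of neighbours, not on the table size."""
    return adjacency.get(node_id, [])


print(len(related_by_scan(42)), len(related_by_adjacency(42)))
```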

Main advantages promised

Ease of deployment / maintenance:

Scalability

Complexity

Modifications

Schema migrations

Started in 2003

Schema-free

ACID transactions

Reasonably scalable, reasonably replicatable

10 000 open source projects, 1000 commercial customers

100 % open source

https://github.com/neo4j/neo4j

Master-slave replication

AGPLv3 license: if your project is open source, it is free, even the support

Plus a graphical interface => debugging!

Deployment Demo

cd to specific DB location (better as a special user)

./neo4j start

./neo4j stop

=> Serves localhost:7474

40 000 files => mainly indexes / user accesses

Under the hood

Java & JVM

Split in two: in-RAM pre-heated vs. whole on HDD

Scalability: 32 G nodes / 32 G relationships / 64 G properties

1 M traversals / sec, independent of graph size

Lucene index: instant search

Interfaces

Two-fold interface:

REST server

Local instance

Specific query language: Cypher

Interfaces

Local instance

Interoperability: support for TinkerPop stack

TinkerPop stack

REST APIs

What is Gremlin

Domain-specific graph language

Built atop Groovy (JVM)

Dynamically evaluated

~ scripting in Java

Core = Java:

Java

Scala / Clojure

JPype / Jython / JRuby

Supported by most graph databases

Interfaces

Local instance

Interoperability: support for TinkerPop stack

Native bindings:

Java

Python, PHP, Ruby / Rails, node.js, .Net

Scala, Clojure, Haskell, ...

My stack: native Python, and Python through Bulbs and REST

Python + Bulbs + REST + neo4j

Bulbs = Pythonic wrapper for Gremlin

Portability (Blueprints + Rexster):

TitanDB (discussed later on)

InfiniteGraph

ArangoDB

Class inheritance and DDT: Java-like class inheritance

Demo 2

Datatype declaration

GraphDB connection and declaration

Fill-in

Graphical Interface

neo4j-specific

Lucene index in the backend:

Exact indexing => constant-time retrieval

Full-text indexing => searching partial names and adding the missing links: SRC = SRC_HUMAN = SRC1

Constant node retrieval time / internode connection distance time

Performing the partial search

Adding missing links
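The difference between the two kinds of index can be sketched in plain Python. This is not Lucene: the prefix table below is only a crude stand-in for full-text indexing, and the node ids, names and function names are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical node records: node id -> display name.
names = {1: "SRC", 2: "SRC_HUMAN", 3: "SRC1", 4: "EGFR_HUMAN"}

# Exact index: one dictionary lookup per query, whatever the graph size.
exact_index = {name: node_id for node_id, name in names.items()}

# Crude stand-in for a full-text index: every lowercase prefix of a name
# points back to the nodes carrying that name.
prefix_index = defaultdict(set)
for node_id, name in names.items():
    lowered = name.lower()
    for end in range(1, len(lowered) + 1):
        prefix_index[lowered[:end]].add(node_id)


def partial_search(fragment):
    """Ids of all nodes whose name starts with the fragment."""
    return sorted(prefix_index.get(fragment.lower(), ()))


def missing_links(fragment):
    """Candidate 'same entity' pairs among the nodes matching the fragment."""
    hits = partial_search(fragment)
    return [(a, b) for i, a in enumerate(hits) for b in hits[i + 1:]]


print(exact_index["SRC_HUMAN"])
print(partial_search("src"))
print(missing_links("src"))
```

The candidate pairs would still need curation before being written back as relationships, since a shared prefix does not guarantee that two names denote the same protein.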

Neo4j server v.s. Local database

Performing simple Gremlin queries

Use Case:

Existing map of correlations:

[Figure: Protein, Domain, Domain Type, Protein function]

Use Case:

Wanted map of correlations:

[Figure: wanted map of correlations, with "function" label]

Use Case

SQL with Python / SQLAlchemy:

Create a new table

Add foreign keys, primary key, indexes, ...

Add the table to the data model

Create functions for access/update
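The relational steps listed above look roughly as follows. To stay self-contained this sketch uses the standard-library `sqlite3` module rather than SQLAlchemy, and the table names, column names and the protein-to-function correlation are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Steps 1-3: new table, keys, index, all declared up front in the schema.
conn.executescript("""
CREATE TABLE proteins  (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE functions (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE protein_functions (
    protein_id  INTEGER NOT NULL REFERENCES proteins(id),
    function_id INTEGER NOT NULL REFERENCES functions(id),
    PRIMARY KEY (protein_id, function_id)
);
CREATE INDEX idx_pf_function ON protein_functions(function_id);
""")


# Step 4: hand-written functions for update and access.
def link(protein, function):
    """Record that a protein has a function, creating rows as needed."""
    conn.execute("INSERT OR IGNORE INTO proteins(name) VALUES (?)", (protein,))
    conn.execute("INSERT OR IGNORE INTO functions(name) VALUES (?)", (function,))
    conn.execute(
        "INSERT OR IGNORE INTO protein_functions "
        "SELECT p.id, f.id FROM proteins p, functions f "
        "WHERE p.name = ? AND f.name = ?",
        (protein, function),
    )


def functions_of(protein):
    """All functions linked to a protein; two joins per lookup."""
    rows = conn.execute(
        "SELECT f.name FROM functions f "
        "JOIN protein_functions pf ON pf.function_id = f.id "
        "JOIN proteins p ON p.id = pf.protein_id "
        "WHERE p.name = ? ORDER BY f.name",
        (protein,),
    )
    return [row[0] for row in rows]


link("SRC", "kinase activity")
link("SRC", "signal transduction")
print(functions_of("SRC"))
```

In a schema-free graph database the same change amounts to creating a new relationship type between existing nodes, with no table or migration to add.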

Use Case

Bulbs / Neo4j => Live demo

Use case 2

In the human proteome, find all chemical groups A and B separated by less than x

Database structure:

Suppose all the proteins are connected to a Type node

Each protein is linked to its domains, each domain to its amino acids, each amino acid to its chemical groups and ultimately atoms

Chemical groups have an assigned distance between them and the groups they are close to

Algorithm:

Select a protein of interest

Get all of its chemical groups: 1000 (a.a.) * 3 (ch.gr. / a.a.)

Filter out all of the relations longer than x: 1000 * 3 * 100 (possible contacts per ch.gr.)

Recover the proteins: 1000 * 3 * 100 * 2

With 1M traversals per second => 0.6 sec. to execute the query

With TitanDB, ElasticSearch and geo-queries (all within a circle of radius x), higher speeds are possible
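The 0.6 sec. estimate can be re-derived step by step. The numbers are the ones from the slide; the variable names, and the reading of the final factor of 2 as the cost of recovering the proteins, are assumptions.

```python
amino_acids_per_protein = 1000
groups_per_amino_acid = 3
contacts_per_group = 100
recovery_factor = 2               # the slide's final "* 2" (assumed meaning)
traversals_per_second = 1_000_000

chemical_groups = amino_acids_per_protein * groups_per_amino_acid
candidate_contacts = chemical_groups * contacts_per_group
total_traversals = candidate_contacts * recovery_factor
query_seconds = total_traversals / traversals_per_second

print(chemical_groups, candidate_contacts, total_traversals, query_seconds)
```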

Limitations

Node number:

32 giga nodes / edges is a lot on servers: ~100 TB of data

1 Unix partition

40 000+ simultaneously opened files (indexes + users)

32 giga edges is relatively small in biology:

~43 M nodes in UniProt only

GO x UniProt x EMBL x GeneNames x interaction maps x localisations x names & accesses ...

All potentially druggable molecules, all protein atoms, all atom-atom interactions

Limitations

Absence of parallelism / distribution:

One process at a time: 1 traversal at a time

ACID => database locks

Though master-slave distribution is available

Single-partition replication

100 TB + RAID!?

Though full support for AWS and VMs

Limitations

Bulbs: Python over Gremlin scripts

Gremlin on the Groovy JVM lets you do whatever you want => SQL-style (Gremlin) injections

Request sanitization needed:

Hashes of the queries without variables

Pre-filtering before the query is referred to the server
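One possible reading of the two sanitization ideas above, as a minimal sketch: keep a whitelist of hashes of the query templates (the script text with variables left as named parameters), and pre-filter the parameter values before anything is sent to the server. The template strings, the allowed-value pattern and the function names are all hypothetical.

```python
import hashlib
import re


def template_hash(script):
    """Hash of a query template, i.e. the script text without its values."""
    return hashlib.sha256(script.encode("utf-8")).hexdigest()


# Hypothetical Gremlin templates; values travel separately as parameters.
ALLOWED_TEMPLATES = [
    "g.v(node_id).out(label)",
    "g.v(node_id).both(label).count()",
]
ALLOWED_HASHES = {template_hash(t) for t in ALLOWED_TEMPLATES}

# Assumed policy: string parameters are short identifiers only.
SAFE_VALUE = re.compile(r"[A-Za-z0-9_\-]{1,64}")


def check_request(script, params):
    """Reject unknown templates and suspicious parameter values."""
    if template_hash(script) not in ALLOWED_HASHES:
        raise ValueError("unknown query template")
    for key, value in params.items():
        if isinstance(value, int) and not isinstance(value, bool):
            continue
        if isinstance(value, str) and SAFE_VALUE.fullmatch(value):
            continue
        raise ValueError("rejected parameter: %s" % key)
    return script, params


print(check_request("g.v(node_id).out(label)",
                    {"node_id": 42, "label": "interacts"}))
```

Because only whole templates are hashed, a script with extra statements appended never matches the whitelist, whatever its parameters are.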

Limitations

Bulk insert is not natively implemented in Bulbs:

Insertion rate ~10 nodes / sec

Naive Python binding tests:

~60 msec for ACID compliance (HDD write)

~1.8 msec/node for cold insertion routines (HDD sequential write)

~0.3 msec/node for hot write insertion routines (RAM buffer)

500 - 1500 nodes/sec if inserted in packages of 1000

6 h to fill the database up to the theoretical limit

github.com/chefjerome/graphalchemy implements an efficient flush based on Bulbs (alpha and thus unstable right now)
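Why batching helps can be seen with a simple cost model built from the timings above: one fixed commit cost per batch plus a per-node write cost. The model itself is an assumption, not something measured in the talk, so it only indicates orders of magnitude.

```python
COMMIT_MS = 60.0      # fixed cost per transaction (ACID, HDD write)
COLD_WRITE_MS = 1.8   # per-node cost, cold insertion
HOT_WRITE_MS = 0.3    # per-node cost, hot insertion (RAM buffer)


def nodes_per_second(batch_size, write_ms_per_node):
    """Throughput when every batch pays one commit plus its node writes."""
    batch_ms = COMMIT_MS + batch_size * write_ms_per_node
    return batch_size / batch_ms * 1000.0


for batch_size in (1, 10, 100, 1000):
    cold = nodes_per_second(batch_size, COLD_WRITE_MS)
    hot = nodes_per_second(batch_size, HOT_WRITE_MS)
    print("%5d nodes/batch: %7.1f (cold) to %7.1f (hot) nodes/sec"
          % (batch_size, cold, hot))
```

With one node per transaction the commit cost dominates and the model stays in the tens of nodes per second; with packages of 1000 the commit is amortized and the per-node write cost becomes the limit.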

Port to TitanDB

TitanDB

HBase / Cassandra / BerkeleyDB as storage backend

Lucene / ElasticSearch as indexing backend

Served over Rexster server

Full distribution

> 500 simultaneous connections (5000 is still stable)

Automatic replication (Hadoop)

Multiple simultaneous queries

The sky is the only limit for storage quantities

=> TitanDB / HBase is stable up to 5 PB in production

Neo4j for bioinformatics:
parsing and curating Reactome.org

Reactome.org: BioPAX: XML / RDF / OWL

Reactome.org structure: BioPAX: XML / RDF / OWL

Physical entities:

Proteins, small molecules, complexes, RNA, DNA

Fragments of physical entities

Interactions:

Degradation / polymerisation / biochemical reactions

Molecular interaction

Genetic interaction

Pathways, genes, post-translational modifications, ...

Protege

Reality of Reactome.org:

Main connected component: ~22 000 entities, but 6 others with > 100 elements

Presence of generic classes: groups of objects

Proteins = mix of proteins, domains, groups, groups of domains

15 000 proteins, 5000 UniProt references

156 genes, 56 RNA molecules => translation / transcription regulation is not well described

Reality of Reactome.org: heavily comment-based: case of SRC

Completed with HINT protein-protein interactions from the Yu lab at Cornell

Re-indexed:

SwissProt protein names

Full names from SwissProt

Gene Names

KEGG, GO, EMBL, ChEBI cross-references

PDB implemented, not re-run

Example of pathway Parsing

Conclusion

Systems biology is more about graphs than about systems of tables

Graph Databases are awesome

Neo4j is terrific

TitanDB is cool

You should definitely pick one of them, load the Reactome.org dataset or whatever you are interested in, and play with it.

Questions ?

Thanks

Pr. Philp Bourne, Pr. Bart Deplanke, Cedric Merlot, Li Xie, Spencer Blieven

Jiang Wang, Julia Ponomarenko, Cole Christie, Andreas Prilic, Lilia Iakoucheva
