Graph databases in computational biology: Neo4j and TitanDB
Andrei Kucharavy
23/08/2013
Why even bother?
Rigid structure of interactions = interactome
Knowledge access structure = GO
Those are graphs
Why even bother?
~1 GB of raw data from Reactome
~300 MB of data from UniProt / GO / Ensembl mappings
=> this is well over the conventional 1024 MB JVM heap limit => heap crash
~ 15 minutes to load
Nightmare to visualize and debug
Relational Databases
Intro to neo4j presentation jexp @ slideshare
Data models
Graph databases
Intro to neo4j presentation jexp @ slideshare
Core abstractions
Objects:
  Nodes (Vertices)
  Relationships between nodes (Edges)
  Properties for Vertices and Edges
[Figure: nodes Node1, Node2 and Node3 connected by relationships, each annotated with properties]
Operations:
  Immediate relations
  Traversals:
    Get the shortest path from j to k
    Get the path with least weight from j to k, ...
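The two traversals named above can be sketched in plain Python. This is a minimal illustration, not what neo4j runs internally: the toy graph `GRAPH`, its weights, and the function names are assumptions made up for the example. Breadth-first search gives the path with the fewest hops, and Dijkstra's algorithm gives the path with the least total weight.

```python
import heapq
from collections import deque

# Hypothetical toy graph: node -> {neighbour: edge weight}
GRAPH = {
    "j": {"a": 1.0, "b": 5.0},
    "a": {"b": 1.0, "k": 10.0},
    "b": {"k": 1.0},
    "k": {},
}

def shortest_path(graph, start, goal):
    """Fewest-hops path via breadth-first search; None if unreachable."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph[node]:
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None

def least_weight_path(graph, start, goal):
    """Minimum total weight path via Dijkstra; (weight, path) or None."""
    heap = [(0.0, start, [start])]
    settled = set()
    while heap:
        weight, node, path = heapq.heappop(heap)
        if node == goal:
            return weight, path
        if node in settled:
            continue
        settled.add(node)
        for neighbour, edge_weight in graph[node].items():
            if neighbour not in settled:
                heapq.heappush(
                    heap, (weight + edge_weight, neighbour, path + [neighbour])
                )
    return None
```

On this toy graph the two answers differ: the fewest-hops path from j to k has two edges, while the least-weight path takes three cheaper edges.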
Main advantages promised
Increased speed for graph-type applications:
  Avoid a join on 10M rows to get ~20 related elements
  Traversals
Simplified programming:
  Java objects
  XML / RDF / OWL
  Schema alterations
Why join millions of rows if only 10 relationships are interesting?
What to do if we want traversals?
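A rough sketch of the difference in access pattern, under loud assumptions: the edge table `EDGE_ROWS` and all function names are invented for illustration, and the "join-like" variant is a deliberately unindexed scan (a real relational database would normally use an index for a single hop, so the gap shows mostly in multi-hop traversals).

```python
# Hypothetical edge table, as a relational store might hold it: (source, target) rows.
EDGE_ROWS = [(i, i + 1) for i in range(100_000)] + [(42, 7), (42, 99)]

def neighbours_by_scan(rows, node):
    """Join-like access: examine every row to find the few that match."""
    return sorted(target for source, target in rows if source == node)

def build_adjacency(rows):
    """Graph-like access: index the edges once, keyed by source node."""
    adjacency = {}
    for source, target in rows:
        adjacency.setdefault(source, []).append(target)
    return adjacency

ADJACENCY = build_adjacency(EDGE_ROWS)

def neighbours_by_adjacency(adjacency, node):
    """Touches only the edges attached to `node`, whatever the table size."""
    return sorted(adjacency.get(node, []))
```

Both functions return the same neighbours; the second does work proportional to the number of relationships of one node rather than to the size of the whole table.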
Main advantages promised
Ease of deployment / maintenance:
  Scalability
  Complexity
  Modifications
  Schema migrations
neo4j
Started in 2003
Schema-free
ACID transactions
Reasonably scalable, reasonably replicable
10 000 open source projects, 1000 commercial customers
100 % open source: https://github.com/neo4j/neo4j
Master-slave replication
AGPL 3 license: if you are open source, it is free, even the support
Plus graphical interface => debug!
Deployment Demo
cd to specific DB location (better as a special user)
./neo4j start
./neo4j stop
=> Serves localhost:7474
40 000 files => mainly indexes / user accesses
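After `./neo4j start`, a quick way to confirm that something is answering on the port is a plain HTTP probe. The sketch below uses only the Python standard library; the helper names `base_url` and `server_is_up` are made up for this example, and the only fact taken from the slide is the address localhost:7474.

```python
import urllib.error
import urllib.request

def base_url(host="localhost", port=7474):
    """Address the slide gives for a locally started server."""
    return "http://%s:%d/" % (host, port)

def server_is_up(url, timeout=2.0):
    """True if anything answers HTTP at `url`, False if the connection fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True   # the server answered, even if with an error status
    except (urllib.error.URLError, OSError):
        return False
```

Calling `server_is_up(base_url())` right after start and again after `./neo4j stop` should flip from True to False, assuming nothing else is bound to that port.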
Under the hood
Java & JVM
Split in two: in-RAM, pre-heated vs. whole on HDD
Scalability: 32 G nodes / 32 G relations / 64 G properties
1 M traversals / sec, independent of graph size
Lucene index: instant search
Interfaces
Two-fold interface:
  REST server
  Local instance
Specific query language: Cypher
Interoperability: support for the TinkerPop stack
TinkerPop stack
REST APIs
What is Gremlin
Domain-specific graph language
Built atop Groovy:
  JVM
  Dynamically evaluated
  ~ scripting in Java
Core = Java:
  Java
  Scala / Clojure
  JPype / Jython / JRuby
Supported by most graph databases
Interfaces
Native bindings:
  Java
  Python, PHP, Ruby / Rails, node.js, .Net
  Scala, Clojure, Haskell, ...
My stack:
  Native Python and Python through Bulbs and REST
Python + Bulbs + REST + neo4j
Bulbs = Pythonic wrapper for Gremlin
Portability (Blueprints + Rexster):
  TitanDB (will be discussed later on)
  Bitsy
  InfiniteGraph
  Sqrrl
  ArangoDB
Class inheritance and DDT:
  Java-like class inheritance
Demo 2
Datatype declaration
GraphDB connection and declaration
Fill-in
Graphical Interface
neo4j-specific
Lucene index in the backend:
  Exact indexing => constant-time retrieval
  Full-text indexing => searching partial names and adding the missing links: SRC = SRC_HUMAN = SRC1
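The contrast between the two kinds of index can be sketched without Lucene. In the example below a plain dictionary stands in for the exact index and a trigram index stands in for full-text search; this is a swapped-in technique for illustration, not how Lucene is implemented. The records in `NAMES` and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: several spellings refer to the same protein.
NAMES = {1: "SRC", 2: "SRC_HUMAN", 3: "SRC1", 4: "EGFR_HUMAN"}

# Exact index: one dictionary lookup per query.
EXACT = {name: node_id for node_id, name in NAMES.items()}

def build_trigram_index(names):
    """Map every 3-character substring to the ids of the names containing it."""
    index = defaultdict(set)
    for node_id, name in names.items():
        lowered = name.lower()
        for i in range(len(lowered) - 2):
            index[lowered[i:i + 3]].add(node_id)
    return index

TRIGRAMS = build_trigram_index(NAMES)

def partial_search(fragment):
    """Ids of names containing `fragment` (3+ characters), via the trigram index."""
    lowered = fragment.lower()
    grams = [lowered[i:i + 3] for i in range(len(lowered) - 2)]
    if not grams:
        raise ValueError("fragment must be at least 3 characters long")
    candidates = set.intersection(*(TRIGRAMS.get(g, set()) for g in grams))
    # Trigram overlap is necessary but not sufficient, so confirm the substring.
    return sorted(i for i in candidates if lowered in NAMES[i].lower())
```

A partial search for "src" returns all three SRC spellings, which are then candidates for the missing "same protein" links; the exact index only ever returns the one node whose name matches in full.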
Demo 3
Constant node retrieval time / internode connection distance time
Performing the partial search
Adding missing links
Neo4j server vs. local database
Performing simple Gremlin queries
Use Case:
Existing map of correlations:
[Figure: Protein, Domain, Domain Type, Protein function]
Wanted map of correlations:
[Figure: Protein, Domain, Domain Type, Protein function]
Use Case
SQL Python / SQLAlchemy:
  Create new table
  Add ForeignKeys, Primary key, indexes, ...
  Add the table to the data model,
  Create functions for access/update,
  ...
Use Case
Bulbs / Neo4j => Live demo
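What the live demo presumably showed can be approximated with a tiny in-memory property graph. Everything here is an assumption made for illustration: `PropertyGraph` is a stand-in class, not the Bulbs or Neo4j API, the node ids P1, D1, T1 and F1 are placeholders, and the edge labels (including the "wanted" direct Protein to Protein function link) are guesses, since the slide diagrams did not survive extraction.

```python
class PropertyGraph:
    """Minimal in-memory property graph; a stand-in, not the Bulbs or Neo4j API."""

    def __init__(self):
        self.nodes = {}   # node id -> {"type": ..., plus free-form properties}
        self.edges = []   # (source id, label, target id)

    def add_node(self, node_id, node_type, **properties):
        self.nodes[node_id] = dict(properties, type=node_type)

    def add_edge(self, source, label, target):
        self.edges.append((source, label, target))

    def out(self, source, label):
        """Targets reachable from `source` over edges carrying `label`."""
        return [t for s, l, t in self.edges if s == source and l == label]

g = PropertyGraph()
g.add_node("P1", "Protein")
g.add_node("D1", "Domain")
g.add_node("T1", "DomainType")
g.add_node("F1", "ProteinFunction")

# Existing correlations (hypothetical labels).
g.add_edge("P1", "has_domain", "D1")
g.add_edge("D1", "is_of_type", "T1")
g.add_edge("T1", "implies_function", "F1")

# Wanted correlation: a new edge label, usable at once, with no new table,
# foreign keys or migration.
g.add_edge("P1", "has_function", "F1")
```

The point of the comparison with the SQLAlchemy steps is that the new relationship type needs no schema change: the first edge carrying the new label is enough.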
Use case 2
In the human proteome, find all chemical groups A and B separated by less than x
Database structure:
  Suppose all the proteins are connected to a Type node
  Each protein is linked to its domains, each domain to its amino acids, each amino acid to its chemical groups and ultimately atoms
  Chemical groups have an assigned distance between them and the groups they are close to
Algorithm:
  Select a protein of interest
  Get all of its chemical groups: 1000 (a.a.) * 3 (ch.gr. / a.a.)
  Filter out all of the relations longer than x: 1000 * 3 * 100 (possible contacts per ch.gr.)
  Recover the proteins: 1000 * 3 * 100 * 2
With 1M traversals per second => 0.6 sec. to execute the query
With TitanDB, ElasticSearch and geo-queries (all within a circle of radius x), higher speeds are possible
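The 0.6 sec. estimate follows directly from the factors on the slide. The short calculation below just multiplies them out; the constant names are made up for readability, and the meaning of the final factor of 2 is not explained on the slide, so it is kept as an unnamed multiplier.

```python
AMINO_ACIDS_PER_PROTEIN = 1000      # from the slide
GROUPS_PER_AMINO_ACID = 3           # from the slide
CONTACTS_PER_GROUP = 100            # from the slide
FINAL_FACTOR = 2                    # the "* 2" in the last step of the slide
TRAVERSALS_PER_SECOND = 1_000_000   # rate quoted for neo4j

groups = AMINO_ACIDS_PER_PROTEIN * GROUPS_PER_AMINO_ACID   # 3 000
contacts = groups * CONTACTS_PER_GROUP                     # 300 000
traversals = contacts * FINAL_FACTOR                       # 600 000
seconds = traversals / TRAVERSALS_PER_SECOND               # 0.6
```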
Limitations
Node number:
  32 Giga nodes / edges is a lot on servers:
    ~100 TB of data
    1 Unix partition
    40 000+ simultaneously open files (indexes + users)
  32 Giga edges is relatively small in biology:
    ~43 M nodes in UniProt only
    GO x UniProt x EMBL x GeneNames x interaction maps x localisations x names & accesses ...
    All potentially druggable molecules, all protein atoms, all atom-atom interactions
Limitations
Absence of parallelism / distribution:
  One process at a time:
    1 traversal at a time
    ACID => database locks
    Though master-slave distribution
  Single partition replication:
    100 TB + RAID!?
    Though full support for AWS and VM
Limitations
Bulbs: Python over Gremlin scripts:
  Gremlin -> Groovy -> JVM: do what you want
  => SQL-style (Gremlin) injections
Request sanitization needed:
  Hashes of the queries without variables
  Pre-filtering before query referral to the server
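One way to read "hashes of the queries without variables" is a whitelist of query templates: the script text is hashed with its variables left as placeholders, and user input only ever travels as separate parameters. The sketch below is one possible interpretation, not the author's implementation; the template string, the parameter name `name_param` and the function names are hypothetical.

```python
import hashlib

def template_digest(template):
    """SHA-256 of a query template, i.e. the query text without its variables."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()

# Hypothetical Gremlin-style template; user input is never spliced into it.
FIND_BY_NAME = "g.idx('proteins')[[name: name_param]]"
ALLOWED_DIGESTS = {template_digest(FIND_BY_NAME)}

def prepare_request(template, params):
    """Return (script, params) for the server, or raise if the script is unknown."""
    if template_digest(template) not in ALLOWED_DIGESTS:
        raise PermissionError("query template is not on the whitelist")
    for key, value in params.items():
        # Pre-filter: only plain scalars may be sent along as parameters.
        if not isinstance(value, (str, int, float, bool)):
            raise TypeError("parameter %r must be a plain scalar" % key)
    return template, dict(params)
```

A script that has had anything appended to it hashes differently and is rejected before it reaches the server, while a hostile value passed as a parameter stays data and is never evaluated as Groovy.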
Limitations
Bulk insert not natively implemented in Bulbs:
  Insertion rate ~10 nodes / sec
Native Python binding tests:
  ~60 msec for ACID compliance (HDD write)
  ~1.8 msec/node cold insertion routines (HDD sequential write)
  ~0.3 msec/node hot write insertion routines (RAM buffer)
500 - 1500 nodes/sec if packages of 1000
6 h to fill the database up to theoretical limit
github.com/chefjerome/graphalchemy implements efficient flush based on Bulbs (alpha and thus unstable right now)
Port to TitanDB
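Why batches of 1000 help can be seen with a simple additive cost model: each batch pays the ~60 msec commit once, plus the per-node write cost. The model and the helper names below are assumptions for illustration, not measurements; only the 60, 1.8 and 0.3 msec figures come from the slide.

```python
def batches(items, size):
    """Yield consecutive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def nodes_per_second(batch_size, commit_ms, per_node_ms):
    """Throughput if each batch costs one commit plus a per-node write."""
    batch_ms = commit_ms + batch_size * per_node_ms
    return 1000.0 * batch_size / batch_ms
```

Under this model, committing one node at a time gives roughly 16 nodes/sec, while batches of 1000 give roughly 540 nodes/sec with cold writes and roughly 2800 nodes/sec with hot writes, the same order of magnitude as the rates quoted above.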
TitanDB
HBase / Cassandra / BerkeleyDB as storage backend
Lucene / ElasticSearch as indexing backend
Served over Rexster server:
  Full distribution
  > 500 simultaneous connections (5000 is still stable)
  Automatic replication (Hadoop)
  Multiple simultaneous queries
  Sky is the only limit for storage quantities
  => TitanDB / HBase is stable up to 5 PB in production
Neo4j for bioinformatics: parsing and curating Reactome.org
Reactome.org structure: BioPAX: XML / RDF / OWL
Physical entities:
  Proteins, small molecules, complexes, RNA, DNA
  Fragments of physical entities
Interaction:
  Degradation / polymerisation / biochemical reactions
  Molecular interaction
  Genetic interaction
Pathways, genes, post-translational modifications...
Protege
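A first pass over a BioPAX export can be done by streaming the XML and counting elements per class, which avoids holding the whole file in memory. The sketch below is a minimal illustration under stated assumptions: it assumes a BioPAX Level 3 file (hence the namespace constant), the function name is made up, and `SAMPLE` is a tiny hand-written document, not Reactome data.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace of BioPAX Level 3; a Level 2 export would need a different one.
BIOPAX_NS = "http://www.biopax.org/release/biopax-level3.owl#"

def count_biopax_classes(source):
    """Stream a BioPAX file and count top-level elements per BioPAX class."""
    counts = Counter()
    depth = 0
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        # Depth 1 after closing means a direct child of the rdf:RDF root.
        if depth == 1 and element.tag.startswith("{" + BIOPAX_NS + "}"):
            counts[element.tag.split("}", 1)[1]] += 1
            element.clear()  # drop the subtree's content to limit memory use
    return counts

# Hand-written sample for illustration only.
SAMPLE = b"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:bp="http://www.biopax.org/release/biopax-level3.owl#">
  <bp:Protein rdf:ID="Protein1"><bp:displayName>SRC</bp:displayName></bp:Protein>
  <bp:Protein rdf:ID="Protein2"/>
  <bp:SmallMolecule rdf:ID="SmallMolecule1"/>
</rdf:RDF>"""

sample_counts = count_biopax_classes(io.BytesIO(SAMPLE))
```

The same function accepts a file path instead of the in-memory sample, so it can be pointed at a downloaded export to get a quick census of proteins, small molecules, complexes and so on.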
Neo4j for bioinformatics: parsing and curating Reactome.org
Reality of Reactome.org:
  Main connected component: ~22 000 entities, but 6 others with >100 elements
  Presence of generic classes: groups of objects
  Proteins = mix of proteins, domains, groups, groups of domains
  15 000 proteins, 5000 UniProt references
  156 genes, 56 RNA molecules => translation / transcription regulation is not well described
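The sizes of the connected components quoted above are the kind of figure a short traversal produces once the entities are loaded as a graph. A minimal sketch, assuming the interactions are available as undirected (a, b) pairs; the function name is made up and the code is generic, not the author's pipeline.

```python
from collections import defaultdict, deque

def component_sizes(edges):
    """Sizes of the connected components of an undirected graph, largest first."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = set()
    sizes = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node covers one component.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        sizes.append(size)
    return sorted(sizes, reverse=True)
```

Entities with no interactions at all never appear in an edge list, so they would have to be counted separately.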
Neo4j for bioinformatics: parsing and curating Reactome.org
Reality of Reactome.org: heavily comment-based: case of SRC
Neo4j for bioinformatics: parsing and curating Reactome.org
Completed with HINT protein-protein interactions from the Yu lab at Cornell
Re-indexed:
  SwissProt protein names
  Full names from SwissProt
  Gene names
  KEGG, GO, EMBL, ChEBI cross-references
  PDB implemented, not re-run
Neo4j for bioinformatics: parsing and curating Reactome.org
Example of pathway parsing
Conclusion
Systems biology is more about graphs than about systems of tables
Graph Databases are awesome
Neo4j is terrific
TitanDB is cool
You should definitely pick one of them, load Reactome.org dataset or whatever you are interested in and play with it.
Questions ?
Thanks
Pr. Philp Bourne, Pr. Bart Deplanke, Cedric Merlot, Li Xie, Spencer Blieven,
Jiang Wang, Julia Ponomarenko, Cole Christie, Andreas Prilic, Lilia Iakoucheva