GRAPH DATABASES: THE SOLUTION FOR STORING SEMI-STRUCTURED BIG DATA Mohamed Taher Alrefaie

Graph databases: Tinkerpop and Titan DB



Page 2: Graph databases: Tinkerpop and Titan DB

DATA IS GETTING BIGGER
“Every two days, we create as much information as we did up to 2003.” Eric Schmidt, former Google CEO, 2010.

Page 3: Graph databases: Tinkerpop and Titan DB

DATA IS MORE CONNECTED
Having a look at the following proves it:
- Facebook Graph
- LinkedIn Graph
- Linked Data
- Blogs/Tagging

Page 4: Graph databases: Tinkerpop and Titan DB

DATA IS LESS STRUCTURED
Modelling the FB Graph? Persons, friendships, photos, locations, apps, pages, ads, interests, age range, etc.

Page 5: Graph databases: Tinkerpop and Titan DB

NOSQL DATABASES
Four types of databases that alleviate the performance issues of relational databases.

Page 6: Graph databases: Tinkerpop and Titan DB

KEY VALUE STORES
Data Model:
- Global key-value mapping
- Big scalable HashMap
- Highly fault tolerant (typically)
Examples: Redis, Riak, Voldemort, Dynamo

Page 7: Graph databases: Tinkerpop and Titan DB

KEY VALUE STORES: PROS AND CONS
Pros:
- Simple data model
- Scalable
Cons:
- Create your own “foreign keys”
- Poor for complex data
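To make the "create your own foreign keys" point concrete, here is a minimal sketch in plain Java that treats the store as one big HashMap. The key naming scheme (`user:1:employer`, `company:7:name`) and the sample values are assumptions made for illustration, not a convention of any particular store.

```java
import java.util.HashMap;
import java.util.Map;

public class KeyValueSketch {
    // The whole store is one global key -> value mapping.
    static final Map<String, String> store = new HashMap<>();

    // Follows a hand-rolled "foreign key": the value stored under
    // user:<id>:employer is itself the key prefix of another entry.
    // The store knows nothing about this link; the application does.
    static String employerName(String userId) {
        String companyKey = store.get("user:" + userId + ":employer");
        return companyKey == null ? null : store.get(companyKey + ":name");
    }

    public static void main(String[] args) {
        // Hypothetical sample data.
        store.put("user:1:name", "marko");
        store.put("user:1:employer", "company:7");
        store.put("company:7:name", "Acme");
        System.out.println(employerName("1"));
    }
}
```

Every relationship costs one extra lookup that the application has to code by hand, which is why the model gets awkward as the data becomes more connected.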

Page 8: Graph databases: Tinkerpop and Titan DB

COLUMN FAMILY
Main idea is based on BigTable, Google’s distributed storage system for structured data.
Data Model:
- A big table, with column families
- Map Reduce for querying/processing
Examples: HBase, HyperTable, Cassandra

Page 9: Graph databases: Tinkerpop and Titan DB

COLUMN FAMILY: PROS AND CONS
Pros:
- Supports semi-structured data
- Naturally indexed (columns)
- Scalable
Cons:
- Poor for interconnected data
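One way to picture the column-family model is as nested sorted maps: row key, then column family, then column name. The sketch below is a simplification under that assumption (real stores add timestamps, distribution and on-disk layout); the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

public class ColumnFamilySketch {
    // row key -> column family -> column name -> value, all kept sorted.
    static final TreeMap<String, TreeMap<String, TreeMap<String, String>>> table = new TreeMap<>();

    // Rows do not share a fixed schema: each row stores only the columns it has.
    static void put(String row, String family, String column, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(family, f -> new TreeMap<>())
             .put(column, value);
    }

    // Returns null when the row, family or column is missing.
    static String get(String row, String family, String column) {
        Map<String, TreeMap<String, String>> families = table.get(row);
        if (families == null) return null;
        Map<String, String> columns = families.get(family);
        return columns == null ? null : columns.get(column);
    }

    public static void main(String[] args) {
        put("row1", "profile", "name", "marko");
        put("row1", "profile", "age", "29");
        put("row2", "profile", "lang", "java"); // different columns, same table
        System.out.println(get("row1", "profile", "name"));
    }
}
```

Lookups by row and column are cheap because the maps are keyed on them, but a link from one row to another is still just a value the application must follow itself.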

Page 10: Graph databases: Tinkerpop and Titan DB

DOCUMENT DATABASES
Data Model:
- A collection of documents
- A document is a key value collection
- Index-centric, uses map-reduce extensively
Examples: CouchDB, MongoDB

Page 11: Graph databases: Tinkerpop and Titan DB

DOCUMENT DATABASES: PROS AND CONS
Pros:
- Simple, powerful data model
- Scalable
Cons:
- Poor for interconnected data
- Query model limited to keys and indexes
- Map reduce for larger queries
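The "limited to keys and indexes" point can be sketched with a collection of documents plus one secondary index. This is an illustrative toy, assuming a single indexed field called `lang`; the names are hypothetical and do not mirror the CouchDB or MongoDB APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DocumentSketch {
    // Document id -> document; a document is itself a key-value collection.
    static final Map<String, Map<String, Object>> collection = new HashMap<>();
    // Secondary index: value of the "lang" field -> ids of matching documents.
    static final Map<Object, List<String>> langIndex = new HashMap<>();

    // Stores the document and indexes it if it has a "lang" field.
    // (Re-inserting an existing id is not handled in this sketch.)
    static void insert(String id, Map<String, Object> doc) {
        collection.put(id, doc);
        if (doc.containsKey("lang")) {
            langIndex.computeIfAbsent(doc.get("lang"), k -> new ArrayList<>()).add(id);
        }
    }

    // Fast only because "lang" is indexed; any other field would need a full scan.
    static List<String> findByLang(String lang) {
        return langIndex.getOrDefault(lang, new ArrayList<>());
    }

    public static void main(String[] args) {
        insert("1", Map.of("name", "marko", "age", 29));
        insert("3", Map.of("name", "lop", "lang", "java"));
        System.out.println(findByLang("java"));
    }
}
```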

Page 12: Graph databases: Tinkerpop and Titan DB

GRAPH DATABASES
Data Model: Nodes and Relationships
Examples: Titan, Neo4j, OrientDB, etc.

Page 13: Graph databases: Tinkerpop and Titan DB

GRAPH DATABASES: PROS AND CONS
Pros:
- Powerful data model, as general as RDBMS
- Connected data locally indexed
- Easy to query
Cons:
- Sharding (partitioning a connected graph across machines is hard)
- Requires different data modelling

Page 14: Graph databases: Tinkerpop and Titan DB

LIVING IN A NOSQL WORLD
[Figure: chart of complexity versus size placing key-value stores, BigTable clones, document databases, graph databases and relational databases (RDBMS); annotations: "90% of use cases" and "9,223,372,036,854,775,807"]

Page 15: Graph databases: Tinkerpop and Titan DB

WHAT IS A GRAPH?
An abstract representation of a set of objects where some pairs are connected by links.
- Object (Vertex, Node)
- Link (Edge, Arc, Relationship)

Page 16: Graph databases: Tinkerpop and Titan DB

WHAT IS A GRAPH DATABASE?
- A database with an explicit graph structure
- Each node knows its adjacent nodes through edges
- As the number of nodes increases, the cost of a local step (or hop) remains the same
- Plus an index for lookups
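A minimal in-memory sketch of these properties, assuming nodes hold direct references to their neighbours and a separate name index is used only to find the starting node. The class and method names are hypothetical and this is not how any particular graph database lays out its storage.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AdjacencySketch {
    static class Node {
        final String name;
        // Each node knows its adjacent nodes directly.
        final List<Node> out = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    // Index used only for the initial lookup of a node by name.
    static final Map<String, Node> nameIndex = new HashMap<>();

    static Node addNode(String name) {
        Node n = new Node(name);
        nameIndex.put(name, n);
        return n;
    }

    static void connect(Node from, Node to) {
        from.out.add(to);
    }

    // One index lookup, then one hop. The hop only touches the start node's
    // own edge list, so its cost does not grow with the total number of nodes.
    static List<String> neighbours(String name) {
        List<String> result = new ArrayList<>();
        Node start = nameIndex.get(name);
        if (start == null) return result;
        for (Node n : start.out) result.add(n.name);
        return result;
    }

    public static void main(String[] args) {
        Node marko = addNode("marko");
        Node vadas = addNode("vadas");
        Node josh = addNode("josh");
        connect(marko, vadas);
        connect(marko, josh);
        System.out.println(neighbours("marko"));
    }
}
```

Compare this with a relational join, where finding neighbours typically means searching an index or table whose size depends on the whole data set.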

Page 17: Graph databases: Tinkerpop and Titan DB

APACHE TINKERPOP: A UNIFIED API
Dealing with such complex databases requires a well-implemented API from the vendor. But using a vendor-specific API makes migrating to another database very hard. The solution is provided by Apache Tinkerpop.

Page 18: Graph databases: Tinkerpop and Titan DB

WHAT IS APACHE TINKERPOP?
● A graph processing system
● Currently under Apache incubation (2015)
● Has Tinkerpop3 Structure API
    ● Graph, Element, Property
● Has Tinkerpop3 Process API
    ● TraversalSource, GraphComputer
● Gremlin query language
    ● A scripting language for graph traversal and mutation
● REST API

Page 19: Graph databases: Tinkerpop and Titan DB

WHY APACHE TINKERPOP?
Tinkerpop is a generic API for graph databases. Think ODBC, JDBC or Hibernate for relational databases.
Integrates with:
- Titan DB
- Neo4j
- OrientDB
- And many more
Uses the Gremlin graph scripting language.
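The JDBC analogy can be sketched as business logic written against an interface, with interchangeable backends behind it. Everything in this sketch is hypothetical: `GraphStore`, `MapStore` and `EdgeListStore` are invented for illustration and are not the Tinkerpop API or any vendor's implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class VendorNeutralSketch {
    // Hypothetical vendor-neutral interface (NOT the Tinkerpop API).
    interface GraphStore {
        void addEdge(String from, String to);
        Set<String> out(String from);
    }

    // "Vendor A": adjacency map.
    static class MapStore implements GraphStore {
        private final Map<String, Set<String>> adjacency = new HashMap<>();
        public void addEdge(String from, String to) {
            adjacency.computeIfAbsent(from, k -> new TreeSet<>()).add(to);
        }
        public Set<String> out(String from) {
            return adjacency.getOrDefault(from, Collections.emptySet());
        }
    }

    // "Vendor B": flat edge list, scanned on every query.
    static class EdgeListStore implements GraphStore {
        private final List<String[]> edges = new ArrayList<>();
        public void addEdge(String from, String to) {
            edges.add(new String[] {from, to});
        }
        public Set<String> out(String from) {
            Set<String> result = new TreeSet<>();
            for (String[] e : edges) {
                if (e[0].equals(from)) result.add(e[1]);
            }
            return result;
        }
    }

    // Business logic depends only on the interface, so swapping the
    // backend does not require touching this method.
    static Set<String> twoHops(GraphStore store, String start) {
        Set<String> result = new TreeSet<>();
        for (String mid : store.out(start)) {
            result.addAll(store.out(mid));
        }
        return result;
    }

    public static void main(String[] args) {
        GraphStore store = new MapStore(); // or new EdgeListStore()
        store.addEdge("marko", "josh");
        store.addEdge("josh", "lop");
        store.addEdge("josh", "ripple");
        System.out.println(twoHops(store, "marko"));
    }
}
```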

Page 20: Graph databases: Tinkerpop and Titan DB

TITAN DATABASE
Titan is a scalable graph database using Tinkerpop APIs, optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.
Supports Apache Spark and Hadoop (implicitly) for map-reduce operations.
Integrates with: Elasticsearch, Solr, Lucene
Uses as backend storage:
- Apache Cassandra
- Apache HBase
- Oracle BerkeleyDB

Page 21: Graph databases: Tinkerpop and Titan DB

PUTTING IT ALL TOGETHER
[Diagram: Apache Tinkerpop API (Gremlin server, graph traversal, Gremlin client, monitoring) on top of Titan DB, on top of a specific storage backend (Cassandra, HBase, BerkeleyDB)]

Page 22: Graph databases: Tinkerpop and Titan DB

TITAN: EXAMPLE
Download the Titan server and console here: https://github.com/thinkaurelius/titan/wiki/Downloads

$ cd titan-1.0.0-hadoop1
$ bin/gremlin.sh
gremlin> graph = TitanFactory.open("conf/titan-berkeleyje-es.properties")
gremlin> GraphOfTheGodsFactory.load(graph)
gremlin> g = graph.traversal()

Page 23: Graph databases: Tinkerpop and Titan DB

TINKERPOP: EXAMPLE
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.T;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

// (1) Create an in-memory TinkerGraph.
Graph graph = TinkerGraph.open();
// (2) Add the vertices, each with an explicit id and some properties.
Vertex marko = graph.addVertex(T.id, 1, "name", "marko", "age", 29);
Vertex vadas = graph.addVertex(T.id, 2, "name", "vadas", "age", 27);
Vertex lop = graph.addVertex(T.id, 3, "name", "lop", "lang", "java");
Vertex josh = graph.addVertex(T.id, 4, "name", "josh", "age", 32);
Vertex ripple = graph.addVertex(T.id, 5, "name", "ripple", "lang", "java");
Vertex peter = graph.addVertex(T.id, 6, "name", "peter", "age", 35);
// (3) Add the edges: label, target vertex, then an id and properties.
marko.addEdge("knows", vadas, T.id, 7, "weight", 0.5f);
marko.addEdge("knows", josh, T.id, 8, "weight", 1.0f);
marko.addEdge("created", lop, T.id, 9, "weight", 0.4f);
josh.addEdge("created", ripple, T.id, 10, "weight", 1.0f);
josh.addEdge("created", lop, T.id, 11, "weight", 0.4f);
peter.addEdge("created", lop, T.id, 12, "weight", 0.2f);
// Queries run on a traversal source obtained with graph.traversal().

Page 24: Graph databases: Tinkerpop and Titan DB

TINKERPOP: EXAMPLE (CONT.)
gremlin> g.V().has('name','marko').out('knows').values('name')
==>vadas
==>josh
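To show what each step of that traversal does, here is a rough plain-Java equivalent over the same toy data, using only the standard library. It is a sketch of the semantics, not of how Gremlin or Tinkerpop executes a traversal: the `Edge` class and the `out` helper are hypothetical, and the linear scan over all edges stands in for the per-vertex adjacency a real graph database would use.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TraversalSketch {
    static class Edge {
        final String label;
        final int from;
        final int to;
        Edge(String label, int from, int to) {
            this.label = label;
            this.from = from;
            this.to = to;
        }
    }

    // Vertex id -> "name" property, mirroring the example graph.
    static final Map<Integer, String> names = new LinkedHashMap<>();
    static final List<Edge> edges = new ArrayList<>();

    static {
        names.put(1, "marko");
        names.put(2, "vadas");
        names.put(3, "lop");
        names.put(4, "josh");
        names.put(5, "ripple");
        names.put(6, "peter");
        edges.add(new Edge("knows", 1, 2));
        edges.add(new Edge("knows", 1, 4));
        edges.add(new Edge("created", 1, 3));
        edges.add(new Edge("created", 4, 5));
        edges.add(new Edge("created", 4, 3));
        edges.add(new Edge("created", 6, 3));
    }

    static List<String> out(String name, String label) {
        return names.entrySet().stream()
                // has('name', name): keep vertices whose name matches
                .filter(v -> v.getValue().equals(name))
                // out(label): follow outgoing edges with that label
                .flatMap(v -> edges.stream()
                        .filter(e -> e.from == v.getKey() && e.label.equals(label)))
                // values('name'): read the name of each target vertex
                .map(e -> names.get(e.to))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(out("marko", "knows")); // [vadas, josh]
    }
}
```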

Page 25: Graph databases: Tinkerpop and Titan DB

SUMMARY
- Graph databases are the solution for highly scalable, semi-structured, connected data.
- Apache Tinkerpop is a generic API for graph databases that avoids DB-vendor-specific business logic code.
- Titan DB is a scalable distributed graph database built on top of several other databases. It uses Cassandra, HBase or BerkeleyDB as backend storage. This lets the database be as linearly scalable as you want it to be.

Page 27: Graph databases: Tinkerpop and Titan DB

MOHAMED TAHER ALREFAIE 07/12/2015