Employing Graph Databases as a Standardization Model towards Addressing Heterogeneity

Dippy Aggarwal and Karen C. Davis

University of Cincinnati, Cincinnati, Ohio

IEEE 17th International Conference on Information Reuse and Integration

July 28-30, 2016, Pittsburgh, USA

Agenda


• Motivation and Challenge
• Our Proposed Approach
• Results and Future Work
• A Short Example
• Architecture
• Novelty

Integration of data from multiple sources lays the foundation for building rich and effective analytics systems.

Schema heterogeneity has been perceived as a major challenge to data integration and exchange for more than two decades.

Proliferation in data models
• Relational databases: the de facto standard for decades
• RDF databases: the standard for linked data
• NoSQL family of data models

“Map/Reduce is a great hammer but not everything is a nail” – Benjamin Hindman (Co-Founder and Chief Architect at Mesosphere)

F. Özcan, N. Tatbul, D. J. Abadi, M. Kornacker, C. Mohan, K. Ramasamy, and J. Wiener. Are we experiencing a big data bubble? In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD ’14, pages 1407–1408, New York, NY, USA, 2014. ACM.

Our vision: It would be useful to have an approach that allows leveraging both schema-based and schemaless data stores.


Our research question

Given the unique advantages possessed by different classes of data stores, how can we bring them together under a homogeneous representation?

Image Credits: http://www.slideshare.net/jexp/intro-to-neo4j-presentation

Our Solution

Adopting graphs as a means towards standardization and integration of different data stores.

Why graphs?
1. A simple and flexible abstraction for modeling artifacts of different kinds (figure: Facebook Open Graph)
2. Attracting significant attention and interest in the past few years (figure: Trends in databases)

Leveraging Neo4j for graph implementation

Nodes and relationships can have properties (key-value pairs)

Image Credits: “Exploiting RDF Open Data Using NoSQL Graph Databases” – R. Bouhali and A. Laurent
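As a rough illustration of the property graph idea (nodes and relationships that both carry key-value properties), here is a minimal Python sketch. The class names, the `LIVES_IN` relationship, and the property values are assumptions made for illustration; this is not Neo4j's API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Node:
    """A graph node with a label and key-value properties."""
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed edge that can carry its own properties."""
    start: Node
    rel_type: str
    end: Node
    properties: Dict[str, Any] = field(default_factory=dict)


# Hypothetical data, for illustration only.
person = Node("Person", {"name": "Jason Doe"})
city = Node("City", {"name": "Cincinnati"})
lives_in = Relationship(person, "LIVES_IN", city, {"since": 2010})
```

Keeping properties in a plain dictionary mirrors the schemaless side of the model: two nodes with the same label need not share the same keys.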

Example of schema and data model heterogeneity

Relational schema excerpt

RDF excerpt

Addressing schema heterogeneity challenge

Relational schema excerpt

Neo4j representation

Key-value properties for a node – Jason Doe

Graph Representation for the RDF Schema Excerpt

What additional merit does the common graph representation offer over the knowledge that could be derived from the native model representations?

Name, homepage, gender, birthday etc.

Advantage of graph model towards unification

By unifying the two representations on common attributes such as date of birth or SkypeId, each node can incorporate information from the other schema.

Maps_With
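One way to picture the Maps_With link is the sketch below, which connects nodes from two source graphs whenever they agree on a shared attribute. How the links are actually created is not shown on the slide; the function name, node layout, identifiers, and sample values here are all hypothetical.

```python
from typing import Any, Dict, List, Tuple

Node = Dict[str, Any]


def maps_with(left: List[Node], right: List[Node], keys: List[str]) -> List[Tuple[str, str, str]]:
    """Return (left_id, "Maps_With", right_id) for node pairs that agree on any key."""
    links = []
    for a in left:
        for b in right:
            # A pair is linked if at least one shared key has an equal value.
            if any(k in a and k in b and a[k] == b[k] for k in keys):
                links.append((a["id"], "Maps_With", b["id"]))
    return links


# Hypothetical nodes: one derived from the relational schema, one from RDF.
relational_nodes = [
    {"id": "rel:1", "name": "Jason Doe", "skypeId": "jdoe", "dateOfBirth": "1980-01-01"}
]
rdf_nodes = [
    {"id": "rdf:1", "homepage": "http://example.org/jdoe", "skypeId": "jdoe"}
]

links = maps_with(relational_nodes, rdf_nodes, ["dateOfBirth", "skypeId"])
```

Once the link exists, a query starting from the relational node can reach the homepage stored only on the RDF side, which is the benefit described above.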

“Exploiting RDF Open Data Using NoSQL Graph Databases” – R. Bouhali and A. Laurent

R. Bouhali and A. Laurent. Artificial Intelligence Applications and Innovations: 11th IFIP WG 12.5 International Conference, AIAI 2015, Bayonne, France, September 14-17, 2015, Proceedings, chapter Exploiting RDF Open Data Using NoSQL Graph Databases, pages 177–190. Springer International Publishing, Cham, 2015.

Data expressed in RDF

RDF mapped to a property graph

Limitations: the approach focuses on converting only RDF data into a graph model, whereas we envision an extensible approach that embraces model diversity by allowing multiple models.

Novelty of our model: it preserves the concepts of the native model.

Architecture of our approach

Employs our transformation rules.

Export user-defined relational schemas in CSV format
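The slide does not spell out the transformation rules or the CSV layout, so the following Python sketch only suggests what such a pipeline could look like. It assumes a hypothetical export with one row per column (`table`, `column`, `references`), turns each table into a node, and turns each foreign key into a relationship. The `Table` label and `REFERENCES` relationship type are made up for illustration.

```python
import csv
import io

# Hypothetical CSV export: one row per column; the layout is assumed for this sketch.
SCHEMA_CSV = """table,column,references
film,film_id,
film,title,
film,language_id,language
language,language_id,
language,name,
"""


def schema_to_graph(csv_text):
    """Build table nodes (columns kept as a property) and foreign-key relationships."""
    nodes = {}
    relationships = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        node = nodes.setdefault(row["table"], {"label": "Table", "columns": []})
        node["columns"].append(row["column"])
        # A non-empty "references" cell is treated as a foreign key to that table.
        if row["references"]:
            relationships.append((row["table"], "REFERENCES", row["references"]))
    return nodes, relationships


nodes, relationships = schema_to_graph(SCHEMA_CSV)
```

Keeping the table as a labelled node is one simple way to stay close to the relational concept it came from.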

Evaluation

Evaluation metrics (proposed by Bouhali et al.)
• Conciseness: the total number of nodes and relationships, which gives the graph size.
• Connectivity: the number of relationships divided by the total number of nodes.
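The two metrics above can be written directly as functions of the node and relationship counts. This is a minimal sketch; the function names and the sample counts are illustrative and are not the counts measured in the study.

```python
def conciseness(num_nodes: int, num_relationships: int) -> int:
    """Graph size: total number of nodes plus relationships."""
    return num_nodes + num_relationships


def connectivity(num_nodes: int, num_relationships: int) -> float:
    """Number of relationships divided by the number of nodes."""
    if num_nodes == 0:
        raise ValueError("connectivity is undefined for a graph with no nodes")
    return num_relationships / num_nodes


# Hypothetical counts, for illustration only.
size = conciseness(10, 4)
ratio = connectivity(10, 4)
```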

Sakila database in MySQL

Bouhali et al.: connectivity should be at least 1.5.

Our results reflect a value (0.32) lower than the benchmark. Why so?

Sakila database: https://dev.mysql.com/doc/sakila/en/

Evaluation - trade-off between conciseness and connectivity

Modeling attributes as nodes

Increased conciseness

Evaluation metrics - trade-off between conciseness and connectivity

Conclusions:
• The connectivity depends on the nature of the original model.
• A higher connectivity may come at the cost of an increase in the graph size.

Strong connectivity between nodes in a graph is certainly good for processing, but it does not follow that a lower value is undesirable.
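The trade-off can be illustrated with a toy calculation. The sketch below assumes that each attribute promoted to a node adds exactly one node and one relationship; the function name and all counts are hypothetical, not taken from the Sakila experiment.

```python
def graph_metrics(num_entities, num_links, attrs_per_entity, attributes_as_nodes):
    """Return conciseness and connectivity for a toy graph.

    Assumption: an attribute modeled as a node contributes one extra node
    and one extra relationship connecting it to its owning entity.
    """
    nodes = num_entities
    rels = num_links
    if attributes_as_nodes:
        extra = num_entities * attrs_per_entity
        nodes += extra
        rels += extra
    return {"conciseness": nodes + rels, "connectivity": rels / nodes}


# Hypothetical schema: 10 entities, 4 links between them, 3 attributes each.
compact = graph_metrics(10, 4, 3, attributes_as_nodes=False)
expanded = graph_metrics(10, 4, 3, attributes_as_nodes=True)
```

Under this assumption, a graph whose connectivity starts below 1 moves toward 1 as attributes become nodes, while the graph size grows with every attribute added.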


Contributions
• The idea of employing graph databases as a means of bridging the gap between schema-based and schemaless data stores.
• A concept-preserving yet integrated graph model that addresses model heterogeneity and carries the potential for handling the variety dimension of the big data landscape.
• A proof of concept that illustrates the potential of graph-based solutions for addressing diversity in data representations.
• A software-oriented, automated approach to transform a relational database into a graph database.

The Path Forward
1. Extend our work by incorporating additional data stores and illustrating integration.
2. Incorporate an evaluation study of the transformation process to address the efficiency of the approach.
3. Study the performance of querying an integrated graph schema versus the disconnected original native schemas.
4. Explore reverse engineering the graph model to obtain the schemas in the original models, which can also be useful.

Selected References
• P. Atzeni, P. Cappellari, and P. A. Bernstein. ModelGen: Model independent schema translation. In Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on, pages 1111–1112. IEEE, 2005.
• R. Bouhali and A. Laurent. Artificial Intelligence Applications and Innovations: 11th IFIP WG 12.5 International Conference, AIAI 2015, Bayonne, France, September 14-17, 2015, Proceedings, chapter Exploiting RDF Open Data Using NoSQL Graph Databases, pages 177–190. Springer International Publishing, Cham, 2015.
• S. Bowers and L. Delcambre. The uni-level description: A uniform framework for representing information in multiple data models. In Conceptual Modeling-ER 2003, pages 45–58. Springer, 2003.

References (Image Credits)
• Facebook Open Graph: http://www.nanigans.com/2012/02/03/10-facebook-open-graph-apps-actions/
• Data Integration (Slide 3): http://www.dbta.com/BigDataQuarterly/Articles/The-New-Newly-Democratized-Data-Integration-109144.aspx
• Trends in databases: https://www.linkedin.com/pulse/future-decentralized-data-processing-architecture-raunak-jhawar and https://www.google.com/trends/

Thank you. Questions?
