
Analysis of Data Management and Query Handling in Social Networks using NoSQL Databases

Anita Brigit Mathew
Department of CSE
NIT Calicut
Email: [email protected]

S. D. Madhu Kumar
Department of CSE
NIT Calicut
Email: [email protected]

Abstract-Since the 1980s, traditional Relational Database Management Systems (RDBMS) have been the predominant technology for storing and retrieving structured data in web and business applications. However, relational databases have started to lose importance because of their strict schema reliance and costly infrastructure. Together, these have led to difficulties in upgrading hardware and software and in handling relationships between objects. Another major source of failure is the enormous growth of BigData. A new database model called NoSQL plays a vital role in BigData analytics. The main focus of this paper is on four different types of NoSQL databases used by social networking sites like Facebook, LinkedIn, Twitter, MySpace, Foursquare, Flickr and Friendfeed. The main features of these four types of NoSQL databases, such as scalability of the data model, concurrency control, consistency in storage, availability during partitioning, durability, transactions, implementation language, query support possibilities and programming characteristics, are compared and analyzed. These features are compared across the subcategories of the four NoSQL databases, with a view to swift querying from social networking sites. In the detailed analysis presented, data storage and the fast retrieval phase of query processing are given primary importance. We also present a comparison of the time taken by insert and read operations on Facebook social network data that associates friends, close friends, family and groups such as hometown and workplace. The results are compared, and the most suitable NoSQL graph database subcategory for insert and read operations on Facebook data is identified.

Keywords-NoSQL databases, Key Value Stores, Document Stores, Column Family Stores, Graph Databases, Social Networks (SN's).

978-1-4799-8792-4/15/$31.00 ©2015 IEEE

I. INTRODUCTION

One of the major computational challenges today is to store, manipulate and analyze the large amounts of data now described as BigData. According to the International Data Corporation (IDC), by the year 2020 the volume of BigData from daily transactions would reach up to 32.4 trillion [1]. In BigData analytics, there is a relationship between performance and data complexity. NoSQL database systems are among the most popular frameworks in the BigData space. Conventional data management tools are unable to handle the exponential growth of BigData. According to McKinsey [2], statistics on BigData taken from companies like Twitter, Facebook, YouTube, Friendfeed, LinkedIn, WhatsApp and Orkut in January 2014 can be managed using NoSQL databases in one way or another. NoSQL databases can easily handle complex data structures as well as hierarchical nested data structures by choosing or integrating different types of database models. To accomplish the same in SQL, we need multiple relational tables with unique and correlated keys. Moreover, storing massive amounts of data from social networking applications, the semantic web, etc. degrades the performance of a traditional RDBMS. The variety of data types, the growing number of large datasets from various data sources and the need to manipulate this unstructured data have shifted the platform from RDBMS to NoSQL databases.

In this paper, four different NoSQL databases are compared after identifying key features for the assessment, and a classification is provided depending upon their applicability in Social Networks (SN's). The comparison is based on parameters that support query processing in social networking sites, such as the type and structure of the data storage model and the relationships between attributes. We also present a detailed comparison of each division of NoSQL with the subcategories used in social networks like Facebook, LinkedIn, Twitter, YouTube, Friendfeed, MySpace, Flickr, Foursquare and WhatsApp. These comparisons are presented in Sections II and III. The features of each NoSQL subcategory are compared on the basis of scalability of the data model, data model design, index structure, statistical allocation, design stability, platform and user base, concurrency control, consistency in storage, availability during partitioning, durability, replication opportunities, implementation language, query support possibilities and programming characteristics. These parameters were selected with a focus on data storage, index creation and the fast data retrieval phase of query processing in SN's. In Section IV, we look at the performance of the four NoSQL databases and their subcategories for insert and read operations in a social networking site like Facebook. The analysis is based on the chosen comparison parameters, and the most suitable subcategory for each of the four NoSQL databases is selected based on its performance in insert and read operations. Section V concludes the paper.

II. THE FOUR NoSQL DATABASES

NoSQL databases have emerged to enable cost-effective management of data in cloud environments. These databases offer low-latency access to multiple datasets and are used in applications such as BigData processing, which involves large volumes of transactional content. To improve scalability in relational databases, bigger servers had to be added, whereas in NoSQL any number of commodity machines can be added. NoSQL databases can handle unstructured, semi-structured and structured data, and they are highly scalable, reliable and easy to use. Unstructured data can take any form, such as audio, video, social network data or documents. Figure 1 shows the growth of unstructured, semi-structured and structured data from the year 2000 to 2014, compiled from [3], [4], [5], [6], [7] based on the Facebook dataset [2]. Social networking sites like Facebook and Twitter handle millions of users and are major beneficiaries of NoSQL. Several NoSQL technologies play a major role in today's BigData analytics: DynamoDB [8] is used as a storage mechanism by MySpace; Cassandra [9], a column-oriented database, was open sourced by Facebook; MongoDB is used by Flickr; and the open-source graph database Neo4j [10] is used as a structure model. The NoSQL database subcategories used in social networking sites like Facebook, Twitter, LinkedIn, Foursquare, Friendfeed, Flickr and MySpace are summarized in Table I. Parameters such as the data storage model and the structure connection model are compared for each of the NoSQL databases in Table II. These parameters are the primary features required for efficient query processing from the NoSQL databases used in SN's. Each classification of NoSQL databases, namely Wide Column Family Store, Document Store, KeyValue and Graph, is further analyzed in the following subsections.

Fig. 1. Growth of BigData (x-axis: Year, 2000 to 2014)

TABLE I. NoSQL DATABASES USED IN SOCIAL NETWORKS

| Social Networking sites | NoSQL Database Subcategories used |
|---|---|
| Facebook | Cassandra, HBase, Neo4j |
| Twitter | FlockDB, Cassandra, HBase, Neo4j |
| LinkedIn | Voldemort, MongoDB, HBase, AllegroGraph |
| Flickr | MongoDB, Neo4j |
| Friendfeed | HBase, Cassandra, OrientDB |
| Foursquare | MongoDB, CouchDB, Riak, Cassandra, InfoGrid |
| MySpace | MongoDB, DynamoDB, Neo4j |

TABLE II. NoSQL DATABASE CATEGORIES

| Type | Data Storage Model | Structure Connection Model |
|---|---|---|
| Wide Column Family Store [10] | Distributed Column Oriented | Clustering column attributes |
| Document Store [11] | Centralised Document Structure | Congregate associated documents |
| Key Value / Tuple Value Store [12] | Centralised, Key for tuple-column | No, unique keys assigned |
| GraphDB [13] | Distributed, Nodes with Keys | Complete connected |
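To make the four categories in Table II more concrete, the following minimal Python sketch stores the same friendship record in each data model, using plain in-memory dictionaries as stand-ins. The user ids, field names and layouts are illustrative assumptions and do not reproduce the storage format of any particular product.

```python
# One hypothetical user record expressed in the four NoSQL data models.
# Plain dicts stand in for the stores; all names are illustrative.

# Key-value store: an opaque value looked up by a unique key.
key_value = {"user:1": '{"name": "alice", "hometown": "Calicut"}'}

# Document store: a nested, self-describing document per user.
document = {
    "user:1": {
        "name": "alice",
        "hometown": "Calicut",
        "friends": [{"id": "user:2", "kind": "family"}],
    }
}

# Wide column family store: row key -> column family -> column -> value.
wide_column = {
    "user:1": {
        "profile": {"name": "alice", "hometown": "Calicut"},
        "friends": {"user:2": "family"},
    }
}

# Graph database: nodes with properties plus typed edges between nodes.
graph_nodes = {"user:1": {"name": "alice"}, "user:2": {"name": "bob"}}
graph_edges = [("user:1", "FRIEND_OF", "user:2", {"kind": "family"})]


def friends_of(user_id):
    """Return ids of users reachable over one FRIEND_OF edge."""
    return [dst for src, rel, dst, _ in graph_edges
            if src == user_id and rel == "FRIEND_OF"]
```

In this sketch the relationship is a first-class item only in the graph layout; in the other three it is embedded in a value, which is one way to read the "Structure Connection Model" column.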

A. Wide Column Store / Column Families

These NoSQL databases belong to the column store family. They have powerful caches and are exabyte-scale systems that are resistant to failures. Column family stores are designed to store and process very large amounts of data distributed over numerous machines. Unique keys point to each column. Table III lists the databases of this kind used in SN's, together with parameters such as API, language used, level of concurrency control, consistency in storage and the replication factor that supports fault tolerance. These parameters were chosen based on the requirements for efficient query processing from NoSQL Wide Column Family Store databases in SN's. The most widely used database in this category is the open-source Apache Hadoop/HBase.

TABLE III. WIDE COLUMN FAMILY STORE DATABASE EXAMPLES

| Name | API / Language | Concurrency Control | Consistency in Storage / Replication |
|---|---|---|---|
| Hadoop / HBase [9] | JRuby IRB shell similar to SQL, Java, Python. | Locks attained separately for read and write. | Consistent at cluster level only. Has asynchronous replicate factor |
| Cassandra [9] | Java, Python. | Optimistic locking | Consistent at disk level and has asynchronous replication. |
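Table III names optimistic locking as a concurrency control method. As a rough illustration of the general idea only, the sketch below uses a version number per key and rejects a write that is based on a stale read. The class `VersionedStore` and the exception `VersionConflict` are hypothetical names; this is not the mechanism of any specific database listed in the table.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


class VersionedStore:
    """Toy in-memory store illustrating optimistic locking."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        # Return (version, value); a missing key has version 0.
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        # Commit only if nobody else wrote since our read.
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(key)
        self._data[key] = (current_version + 1, value)
        return current_version + 1


store = VersionedStore()
version, _ = store.read("user:1")
store.write("user:1", {"status": "online"}, version)      # succeeds
try:
    store.write("user:1", {"status": "away"}, version)    # stale version
    conflict = False
except VersionConflict:
    conflict = True
```

No lock is held between the read and the write; the conflict is detected only at commit time, which is what distinguishes this approach from the separate read and write locks listed for HBase.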

B. Document Store

Document stores support more complex data than key-value stores. They support secondary index mechanisms for efficient query processing. In this database model, multiple types of documents are nested in the form of lists to support querying. MongoDB, the open-source document store from 10gen [14], is the most widely used NoSQL database in SN's. Features such as a persistent data model, replication of documents, automatic distribution across servers and index structures apply to the document store model. Efficient processing of MapReduce jobs enables querying across multiple datasets.

Document databases, an approach introduced by Lotus Notes [11], are similar to key-value stores. In the fully structured model, each document is a collection of key-value pairs. Semi-structured documents are stored in formats like JSON and contain multiple levels of key-value pairs. Some of the document stores used in SN's are shown in Table IV, which also lists the parameters required for efficient query processing from NoSQL Document Store databases in SN's. These parameters include API, language used, level of concurrency control, consistency in storage and the replication factor that supports fault tolerance. We found that the most widely used database in the document store category is MongoDB.
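The combination of nested documents and a secondary index can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: the documents, field names and the helper `build_secondary_index` are hypothetical and do not use the API of MongoDB or CouchDB.

```python
from collections import defaultdict

# Hypothetical user documents; nested lists hold related sub-documents.
documents = {
    "u1": {"name": "alice", "hometown": "Calicut",
           "groups": [{"name": "NIT Calicut", "type": "workplace"}]},
    "u2": {"name": "bob", "hometown": "Kochi", "groups": []},
    "u3": {"name": "carol", "hometown": "Calicut", "groups": []},
}


def build_secondary_index(docs, field):
    """Map each value of `field` to the ids of documents holding it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        if field in doc:
            index[doc[field]].add(doc_id)
    return index


hometown_index = build_secondary_index(documents, "hometown")
# Lookup by a non-key field without scanning every document.
calicut_users = sorted(hometown_index["Calicut"])
```

The index trades extra storage and write-time work for reads on a field other than the primary key, which is the kind of lookup a "people from my hometown" query would need.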

TABLE IV. DOCUMENT STORE DATABASE EXAMPLES

| Name | API / Language | Concurrency Control | Consistency in Storage / Replication |
|---|---|---|---|
| MongoDB [15] | Written in C++. Supported by 10gen | Locks attained separately for read and write. | Consistent at disk level and has asynchronous replication. |
| CouchDB [16] | Erlang, Java | Multi-granularity locking | Asynchronous partial consistent framework. |

2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI)

C. KeyValue / TupleValue Store

TABLE V. KEY-TUPLE VALUE STORE DATABASES

| Name | API / Language | Concurrency Control | Consistency in Storage / Replication |
|---|---|---|---|
| DynamoDB [8] | Erlang, Java language with simple get put operations | Optimistic locking | Eventually consistent asynchronous replication |
| Riak [17] | C, C++ | Optimistic locking and plug-in data storage. | Consistent synchronous replication. |
| Voldemort [4] | Python and Java enterprise language. | ACID with no locks | Asynchronous replication with consistency only at document level. |

In this model, data is stored in unstructured form. The dataset is accessed through keys and the values associated with each record. The model provides a simple primary-key retrieval interface that serves queries with better reliability, performance and cost-effectiveness. Among Key-Tuple value store databases, the most popular and widely used in SN's are Amazon Dynamo and Voldemort. Table V lists examples of Key-Tuple value store databases used in SN's, together with parameters such as API, language used, level of concurrency control, consistency in storage and the replication factor that supports fault tolerance. These parameters were chosen based on the requirements for efficient query processing from NoSQL KeyValue Store databases in SN's.
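The get/put interface and the "eventually consistent asynchronous replication" entry of Table V can be illustrated with a toy model. The class `ReplicatedKeyValueStore` and its explicit `replicate()` step are assumptions made for this sketch; real systems ship writes to replicas in the background and use more elaborate protocols.

```python
class ReplicatedKeyValueStore:
    """Toy key-value store with a primary and lagging replicas."""

    def __init__(self, replica_count=2):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._pending = []  # writes not yet shipped to replicas

    def put(self, key, value):
        # Acknowledge after the primary write; replicate later.
        self.primary[key] = value
        self._pending.append((key, value))

    def get(self, key, replica=None):
        # Read from the primary, or from one replica if requested.
        source = self.primary if replica is None else self.replicas[replica]
        return source.get(key)

    def replicate(self):
        # Apply queued writes to every replica in order.
        for key, value in self._pending:
            for rep in self.replicas:
                rep[key] = value
        self._pending.clear()


kv = ReplicatedKeyValueStore()
kv.put("session:42", "alice")
stale = kv.get("session:42", replica=0)   # replica has not caught up yet
kv.replicate()
fresh = kv.get("session:42", replica=0)   # replicas converge
```

Between `put` and `replicate`, a read from a replica can return an older state; once replication has run, all copies agree, which is the sense of "eventually consistent" used here.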

D. GraphDB

A graph database uses graph structures, that is, nodes, edges and the properties of nodes and edges, to represent and store data in a graph-oriented data structure. A graph-oriented storage system provides index-free adjacency: every element contains a direct pointer to its adjacent elements. General-purpose graph databases can store any graph and differ from specialized graph databases such as triplestores and network databases. These databases can scale across many machines. Some of the general criteria fulfilled by graph databases are:

• Data storage in graphs is represented clearly with the help of nodes, edges and relationships in the form of tables.

• Traversal of the graph is optimized by following edges without using an index structure. Graph traversal algorithms therefore perform well for local reads.

• Classical graph theory algorithms such as single-pair shortest path, all-pairs shortest path and A* can be used for faster traversals.

• Availability for processing large datasets, flexibility in schema representation and difficulty in modelling complex structures can be easily identified.
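The traversal criteria above can be sketched with an adjacency structure and a breadth-first search, which solves the single-pair shortest path problem on an unweighted graph. The friendship data and the function name `shortest_path` are illustrative assumptions; a production graph database would run such traversals inside its own engine.

```python
from collections import deque

# Each node stores direct references (here: ids) to its neighbours,
# so a traversal follows edges instead of consulting a global index.
adjacency = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


path = shortest_path(adjacency, "alice", "erin")
```

The search touches only nodes near the start node, which is why local reads such as "how is this user connected to that one" tend to suit graph storage.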

Some of the NoSQL graph databases used in SN's are listed in Table VI. Among them, Neo4j was considered for experiments, and a new efficient index-based Neo4j model was suggested by Mathew et al. [18]. Table VI lists parameters such as level of concurrency control, consistency in storage and replication, which supports fault tolerance. These parameters are considered important because they are relevant to efficient query processing from NoSQL graph databases in social networking sites. The most widely used graph databases in social networks are Neo4j and FlockDB.

TABLE VI. EXAMPLES OF GRAPH DATABASES

| Name | Concurrency Control | Consistency in Storage | Replication |
|---|---|---|---|
| Neo4j [10] | Multi-version concurrency control | High at all node level | Asynchronous and persistent |
| FlockDB [10] | Support ACID properties, optimistic locking | High at all node level | Persistent, asynchronous |
| InfoGrid [10] | Lock with timestamps | High at all node level | Graphical, asynchronous |
| OrientDB [10] | Support ACID properties, optimistic locking | High at all node level | Persistent, asynchronous |
| AllegroGraph [10] | Locks for read and write | High at all node level | Effective memory utilized, asynchronous |

From a comparative study of social networking sites like Facebook and Twitter, we found that their data grows by more than 50 percent every year. This exponential growth of unstructured and semi-structured BigData has to be managed, and NoSQL databases are very useful in this context. NoSQL databases focus on analytical processing of large-scale structured, unstructured and semi-structured data, offering increased scalability over commodity machines. NoSQL systems can also store and index arbitrary multiple datasets while serving a large number of concurrent user requests.

III. FEATURE ANALYSIS OF NoSQL DATABASE CATEGORIES

Table VII in the Appendix covers the four chosen NoSQL database categories, namely document stores, wide column family stores, key-value stores and graph databases, that are widely used in social networks [19], [3], [20], [4], [21], [22], [12], [11], [23]. These four categories were chosen based on the application needs of client users in SN's, their popularity in the SN market sector and their support for BigData analytics on structured, unstructured and semi-structured data. We compare and analyze the attributes data model, index structure, statistical allocation, design stability, and platform and user base. These attributes were chosen because they promote efficient query processing from datasets stored in the four NoSQL databases frequently used in SN's. The features considered under each attribute are detailed in Table VII of the Appendix.

The features considered for the data model attribute are the storage data layout, the language used for query processing, the level of support for running MapReduce jobs, the method of concurrency control, the compression approach and the type of data partitioning. These features help to identify which NoSQL databases suit a specific application when choosing the data model for query processing in SN's.

Five features are considered for the index structure attribute. The first is support for a secondary storage index, which helps with fault tolerance. The second is the use of composite keys, which is not mandatory. The third is full-text search, which is important because its availability varies from database to database. The fourth is support for a geospatial index, needed for geographical information processing. The last is support for graphical structure representation. All these features contribute to index creation for each application, depending on the NoSQL database chosen in SN's.
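Two of the index features named above, composite keys and full-text search, can be sketched together with an inverted index. The posts, the `(user id, post number)` key layout and the helper names are hypothetical; real systems add tokenization, ranking and persistence that this sketch omits.

```python
from collections import defaultdict

# Hypothetical posts keyed by a composite key (user id, post number).
posts = {
    ("u1", 1): "NoSQL databases scale out",
    ("u1", 2): "graph databases model friends",
    ("u2", 1): "document databases store JSON",
}


def build_text_index(records):
    """Inverted index: lower-cased word -> set of composite keys."""
    index = defaultdict(set)
    for key, text in records.items():
        for word in text.lower().split():
            index[word].add(key)
    return index


text_index = build_text_index(posts)


def search(index, *words):
    """Return keys of records containing every given word."""
    sets = [index.get(w.lower(), set()) for w in words]
    return sorted(set.intersection(*sets)) if sets else []


hits = search(text_index, "graph", "databases")
```

Each query word is resolved with one dictionary lookup, and the per-word key sets are intersected, so the cost depends on the number of matching records rather than on the total number of posts.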

The features considered for the statistical allocation attribute are horizontal scalability of datasets, data layout design, mode of replication and replication factor. Sharding is also considered. Sharding means horizontal partitioning of a database: each partition is a shard, and shards can be searched in parallel, so data is retrieved faster. Support for a shared-nothing architecture is examined as well. In this architecture, the distributed computing environment consists of independent nodes that do not share memory.
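Sharding with a parallel search over independent partitions can be sketched as follows. The shard count, the CRC32-based placement function and the thread pool are assumptions chosen for a small runnable illustration; production systems typically use consistent hashing or range partitioning across separate machines.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4  # illustrative shard count


def shard_for(key):
    """Pick a shard with a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


# Shared-nothing: every shard owns a private dict and no shared state.
shards = [{} for _ in range(NUM_SHARDS)]
for i in range(100):
    key = f"user:{i}"
    hometown = "Calicut" if i % 10 == 0 else "Kochi"
    shards[shard_for(key)][key] = {"hometown": hometown}


def scan_shard(shard):
    """Search one shard independently of the others."""
    return [k for k, v in shard.items() if v["hometown"] == "Calicut"]


# Query all shards in parallel and merge the partial results.
with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
    partials = list(pool.map(scan_shard, shards))
matches = sorted(k for part in partials for k in part)
```

Because each shard is scanned without reference to the others, the work divides cleanly; the only coordination is the final merge of partial results.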

Design stability is another attribute considered. Here we look at model-level integrity and at support for ACID properties as in a traditional RDBMS. This helps each application to build a stable data model framework that supports efficient query processing in SN's.

The last attribute considered is platform and user base. Here we look at the platforms the design supports, the programming languages used and well-known sample deployments. This helps to select the most appropriate database for each application in SN's.

In summary, the comparative study shown in Table VII (Appendix) indicates that the graph database model is superior to the others because it supports many applications: Facebook uses it to find friends and relations through graph search, Google uses it for its Knowledge Graph, and Twitter and many other companies use graphs for recommendations. The alternative databases should not be ignored, because they are well suited to other applications. A document-oriented database is suited to storing, retrieving and manipulating document-oriented information such as semi-structured data. It is made up of a series of self-contained documents of similar form. A column family database is well suited to applications like data warehousing and customer relationship management systems. The Dynamo key-value database is preferred for single-query retrieval in real-time applications. Although Dynamo has a complex structure, it is used for controlling the session information of user applications on phones and other devices.

IV. EXPERIMENTAL EVALUATION OF NoSQL DATABASE CATEGORIES

The experimental evaluation of the NoSQL subcategories was carried out on a single-node setup. The memory utilized by each NoSQL subcategory (package) was measured using the dataset obtained from Facebook [24]. We conducted the experiments on the following subcategories of the four NoSQL database categories: Hadoop/HBase, Cassandra, MongoDB, CouchDB, DynamoDB, Riak, Voldemort, Neo4j, FlockDB, InfoGrid, AllegroGraph and OrientDB. The experiments were run on a Dell Precision T3610 Tower Workstation with an Intel Xeon E5-1620 v2 3.70 GHz processor, running Ubuntu 14.04 LTS. The results are shown in Figure 2. We found that document store databases use the least memory, whereas graph databases use the most.

Fig. 2. Memory Used by each of the NoSQL subcategories

After measuring the memory used by each of the NoSQL packages, we analyzed the performance of each of the graph NoSQL databases, because graph databases play an important role in query processing in SN's by connecting various groups of networks, such as networks of friends, mutual friends and family. We focused on the time taken for insert operations on nodes, on their relationships and on the properties associated with their edges. Properties of nodes and their relationships with other nodes are represented using edges and are inserted together with the nodes and relationships. We measured the time taken in seconds for inserting nodes with their properties. This is shown in Figure 3, from which one can observe that Neo4j takes less time for node insertion than the other graph databases studied.

Fig. 3. Time taken during insertion of nodes and their properties to the Graph databases

Figure 4 shows the time taken in seconds to insert the relationships of each node together with the properties associated with each relationship. Neo4j takes less time for relationship insertion than the other graph databases used in social networks. Figure 5 shows the time taken in minutes to insert the properties of each edge connecting the nodes. Again, Neo4j stands out by taking less time for edge property insertion than the other graph databases used in social networks.
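The paper does not give its measurement code. As a rough sketch of how insert times of this kind could be collected, the harness below times two insert phases against an in-memory stand-in. The class `InMemoryGraph`, its methods and the workload size are hypothetical and unrelated to the client APIs of the databases compared in Figures 3 to 5.

```python
import time


def timed(operation, *args):
    """Run operation(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = operation(*args)
    return result, time.perf_counter() - start


class InMemoryGraph:
    """Stand-in for a graph database client (hypothetical)."""

    def __init__(self):
        self.nodes = {}
        self.edges = {}

    def insert_nodes(self, count):
        # Insert nodes together with one property each.
        for i in range(count):
            self.nodes[i] = {"name": f"user{i}"}

    def insert_relationships(self, count):
        # Link consecutive nodes and attach a property to each edge.
        for i in range(count - 1):
            self.edges[(i, i + 1)] = {"type": "friend"}


graph = InMemoryGraph()
_, node_seconds = timed(graph.insert_nodes, 10_000)
_, edge_seconds = timed(graph.insert_relationships, 10_000)
```

Timing each phase separately mirrors the split between node, relationship and edge property insertion in the text; with a real database, the same wrapper would surround the corresponding client calls, and repeated runs would be averaged.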


Fig. 4. Time taken during insertion of relationships and their properties to the Graph databases

Fig. 5. Time taken during insertion of edges and their properties to the Graph databases

From the time taken to insert nodes, relationships and edge properties, we found that Neo4j performs better than the other graph databases. We also examined the time taken for read operations on the graph databases. Figure 6 shows the time taken in milliseconds to read the details of friends, mutual friends or family members. Neo4j and FlockDB take less time for read operations on the Facebook social network dataset than the other graph databases.
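The read operations described here, looking up friends, mutual friends and family members, reduce to neighbourhood queries on the graph. The sketch below shows them on hypothetical in-memory data; the names and the family tagging are assumptions, and no particular database query language is implied.

```python
# Hypothetical friendship data: user -> set of direct friends.
friends = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice", "carol", "erin"},
    "carol": {"alice", "bob"},
    "dave": {"alice"},
    "erin": {"bob"},
}
family = {("alice", "dave")}  # pairs tagged as family, illustrative


def mutual_friends(a, b):
    """Friends that users a and b have in common."""
    return friends.get(a, set()) & friends.get(b, set())


def family_of(user):
    """Users linked to `user` by a family-tagged pair."""
    return {y if x == user else x
            for x, y in family if user in (x, y)}


common = sorted(mutual_friends("alice", "bob"))
relatives = sorted(family_of("alice"))
```

Both queries inspect only the neighbourhoods of the users involved, so their cost is independent of the overall size of the network.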

V. CONCLUSION

BigData storage, management and retrieval have become critical given the wide range of storage locations and the complex types and enormous amounts of data involved. Data has already escaped the control of IT departments into the wider reaches of cloud-based services and social networking sites. This expansion and rapid growth of BigData strengthens the urgent need for reliable and fast NoSQL database services. Four NoSQL databases were analysed in this paper, selected for their wide application in social networking sites. A detailed inspection was carried out based on user requirements such as mapping 'friendship', 'relationships' and 'likes' in Facebook, searching for posts in Twitter and searching for jobs in specific skill areas in LinkedIn. We found that the features considered ranked higher in utilization for graph databases than for the other NoSQL databases in social networking sites like Facebook, MySpace, Twitter, Flickr and LinkedIn. An analysis was conducted on the performance of the graph databases on the Facebook dataset for insert and read operations. It was found that Neo4j performs better than the other NoSQL graph databases during insert and read operations. Various attributes, namely data storage design, index structure, statistical allocation, design stability, and platform and user base, were analyzed across the NoSQL subcategories. The comparisons indicate how well each of them is suited to fast data retrieval during query processing from social networking sites (SN's). It was also found that graph NoSQL databases play a pivotal part in social networks compared to other NoSQL categories. However, this is not the final set of parameters that can be chosen, since we focus primarily on query processing. Other parameters may need to be considered when addressing other aspects, such as data locality and efficient scheduling of different groups.

Fig. 6. Time taken during reading of nodes with their relationships and properties on edges in Graph databases

REFERENCES

[1] J. Gantz and D. Reinsel, "The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east," IDC iView: IDC Analyze the Future, vol. 2007, pp. 1-16, 2012.

[2] P. Groves, B. Kayyali, D. Knott, and S. Van Kuiken, "The 'big data' revolution in healthcare," McKinsey Quarterly, 2013.

[3] G. Deka, "A survey on cloud database," 2013.

[4] A. Moniruzzaman and S. A. Hossain, "Nosql database: New era of databases for big data analytics - classification, characteristics and comparison," International Journal of Database Theory & Application, vol. 6, no. 4, 2013.

[5] R. Hecht and S. Jablonski, "Nosql evaluation: A use case oriented survey," in Cloud and Service Computing (CSC), 2011 International Conference on, pp. 336-341, IEEE, 2011.

[6] C. Strauch, U.-L. S. Sites, and W. Kriha, "Nosql databases," URL: http://www.christof-strauch.de/nosqldbs.pdf (07.11.2012), 2011.

[7] C. J. Tauro, S. Aravindh, and A. Shreeharsha, "Comparative study of the new generation, agile, scalable, high performance nosql databases," International Journal of Computer Applications (0975-888) Volume, pp. 7461-0336, 2012.

[8] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: amazon's highly available key-value store," in SOSP, vol. 7, pp. 205-220, 2007.

[9] L. George, HBase: the definitive guide. O'Reilly Media, Inc., 2011.

[10] G. Vaish, Getting Started with Nosql. Packt Publishing, 2013.

[11] B. G. Tudorica and C. Bucur, "A comparison between several nosql databases with comments and notes," in Roedunet International Conference (RoEduNet), 2011 10th, pp. 1-5, IEEE, 2011.

[12] R. Cattell, "Scalable sql and nosql data stores," ACM SIGMOD Record, vol. 39, no. 4, pp. 12-27, 2011.

[13] N. Developers, "Neo4j," Graph NoSQL Database [online], 2012.

804 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI)

[14] W. Kim, "Web data stores (aka nosql databases): a data model and data management perspective," International Journal of Web and Grid Services, vol. 10, no. 1, pp. 100-110, 2014.

[15] J. Han, E. Haihong, G. Le, and J. Du, "Survey on nosql database," in Pervasive computing and applications (ICPCA), 2011 6th international conference on, pp. 363-366, IEEE, 2011.

[16] J. C. Anderson, J. Lehnardt, and N. Slater, CouchDB: the definitive guide. O'Reilly, 2010.

[17] D. Bartholomew, "Sql vs. nosql," Linux Journal, vol. 2010, no. 195, p. 4, 2010.

[18] A. B. Mathew and S. M. Kumar, "An efficient index based query handling model for neo4j," IJCST, vol. 3, no. 2, pp. 12-18, 2014.

[19] A. B. Mathew, P. Pattnaik, and S. Madhu Kumar, "Efficient information retrieval using lucene, lindex and hindex in hadoop," in Computer Systems and Applications (AICCSA), 2014 IEEE/ACS 11th International Conference on, pp. 333-340, IEEE, 2014.

[20] V. Bhatnagar, "Data mining-based big data analytics: parameters and layered framework," International Journal of Computational Systems Engineering, vol. 1, no. 4, pp. 265-276, 2013.

[21] Y. Zhang, Y.-P. Bai, D.-F. Zhu, and Z.-Y. Lv, "A big data based data storage systems for rock burst experiment," International Journal of Wireless and Mobile Computing, vol. 6, no. 5, pp. 463-472, 2013.

[22] C. Nance, T. Losser, R. Iype, and G. Harmon, "Nosql vs rdbms - why there is room for both," 2013.

[23] R. K. Lomotey and R. Deters, "Analytics-as-a-service framework for terms association mining in unstructured data," International Journal of Business Process Integration and Management, vol. 7, no. 1, pp. 49-61, 2014.

[24] K. Lewis, J. Kaufman, M. Gonzalez, A. Wimmer, and N. Christakis, "Tastes, ties, and time: A new social network dataset using facebook.com," Social networks, vol. 30, no. 4, pp. 330-342, 2014.


Appendix

TABLE VII. Categorisation and Assessment of four NoSQL Databases sub-categories used in Social Networks

Attributes | NoSQL Databases (columns, left to right)

Database Model: Document Stored (MongoDB, CouchDB) | Wide Column Family Stored (Hypertable, Hadoop/HBase, Cassandra) | Key Value Stored (DynamoDB, Voldemort, Riak) | GraphDB (Neo4j, AllegroGraph, FlockDB, InfoGrid, OrientDB)

Data Storage Style and Design used: MongoDB: Disk File System; CouchDB: Disk File System used; Hypertable: Log structured merge tree, tables with key value pairs; Hadoop/HBase: Row-Column oriented; Cassandra: Partitioned Row Store, Rows in tables with primary key; DynamoDB: Simple Key-value byte array stores, timestamp ordering; Voldemort: Disk File System; Riak: Buckets and Keys; Neo4j: Nodes, Edges, Properties, Relationships in tables; AllegroGraph, FlockDB, InfoGrid, OrientDB: Nodes, Edges, Properties, Relationships in Numerical Adjacency matrices

SPARQL, Language used for C++, Python, Java, REST, Avro SQL, Java, JRUBY, Cypher, Gremlin, Cypher, Gremlin, Scala, Cypher, Gremlin, RDFS++,

Querying Java Erlang, Java C, C++ or Thrift Python Java, Python Java C, C++ SPARQL, Lisp SPARQL Java Cypher, Java Prolog

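Support for MapReduce: MongoDB: 100%; CouchDB: Partially; Hypertable: Partially; Hadoop/HBase: 100%; Cassandra: 100%; DynamoDB: Partially; Voldemort: 100%; Riak: Partially; Neo4j: 100%; AllegroGraph: Partially; FlockDB: 100%; InfoGrid: Partially; OrientDB: 90-100%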

Method of Concurrency Control: Multi Version Concurrency Control | Read Locks and Write Append Locks controlled by Master node | Supports Optimistic Locking | Separate locks for read and write | Read write locks | MVCC | Read write locks

Compression PAX Gzip/PAX PAX PAX compression Gzip Cpio/Gzip compression PAX compression Huffman PAX

Huffman compression Approach followed compression Gzip compression compression compression compression compression compression

Supports range Supports Robin Partitioning

Robin hood Range based based hood Range based based on Does not Only consistent Partitioning based on related nodes of

partitioning and partitioning and partitioning and partitioning and Range based partitioning related keys, support hashing possible

consistent consistent consistent consistent partitioning and and Hopscotch open address partitioning No partitioning with clustered groups

Type of Partitioning hashing hashing hashing hashing consistent hashing hashing hashing only hashing only clustering nodes.

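Support for Secondary Index: MongoDB: >70% up to the size of document; CouchDB: >70% up to the size of document; Hypertable: Partially, due to table size overhead; Hadoop/HBase: Yes, all keys should be connected; Cassandra: Yes, all keys should be connected; DynamoDB: Partially; Voldemort: No; Riak: No; Neo4j: Full; AllegroGraph: Full; FlockDB: Full; InfoGrid: Full; OrientDB: Full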

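Composite Keys used: MongoDB: Yes; CouchDB: Yes; Hypertable: Yes; Hadoop/HBase: Yes; Cassandra: Yes; DynamoDB: Yes; Voldemort: No; Riak: Yes; Neo4j: Yes; AllegroGraph: Yes; FlockDB: Yes; InfoGrid: Yes; OrientDB: Yes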

Only within single Only within the Partially, only within the Level of Full Text All Inter-related document within all stored column value Partially, only cluster Partially, only within

Search Documents level subdivisions No data within all data sets family No No Yes within the cluster Yes the cluster

Support for Yes, all related Yes but within datasets belong to same Yes with clustering Yes with Yes with R Yes K Means Clustering

Geospatial Index documents No complicated column No No No using R Tree index clustering Index

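Graph Structure Support: MongoDB: No; CouchDB: No; Hypertable: No; Hadoop/HBase: Yes, each column represents a node; Cassandra: Yes, each column represents a node; DynamoDB: Yes, each key-value represents a node; Voldemort: No; Riak: No; Neo4j: Yes; AllegroGraph: Yes; FlockDB: Yes; InfoGrid: Yes; OrientDB: Yes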

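Horizontal Scalability: MongoDB: Yes, document by document; CouchDB: Yes, document by document; Hypertable: Yes, column by column; Hadoop/HBase: Yes, column by column; Cassandra: Yes, column by column; DynamoDB: Partially; Voldemort: No; Riak: No; Neo4j: Partially; AllegroGraph: No; FlockDB: Partially; InfoGrid: Partially; OrientDB: Partially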

Yes,Read lock Yes, Read and

Yes, Read Lock and Append Lock, Yes, asynchronous, Locks Yes, Consistent Read and Append Consistent read but eventual consistency in

only Write lock, Yes, Read Lock

Consistent for read and write operations write operations. Support interleaved read

Replication Support Consistent Consistent Yes and writes.

Single Master Multi Slave Single Master Multi Slave Replication

Data node multi server replication multi master replication multi slave/ data node replication

Mode of Replication Replication Replication within same cluster

Support for Based on Yes Document by Document Yes based on cluster record set Yes Graph cluster

Sharding Yes table wise No No Graph cluster adjacency matrix

Shared Nothing Auto Sharding of Documents as Auto Sharding horizontally of Supports cluster level based on relationships and properties in distributed Architecture single independent nodes Partially yes columns in a cluster recordset No No No computing environment

Level of Atomicity: Conditional | Partially | No | Only write operations that affect a single row in a single column family are atomic | Partially | Partially | No | Support high level atomicity

Consistency Level: Gradual Consistency maintained within the same documents | Eventual Consistency obtained throughout; if immediate changes have to be applied, the base system manager takes control | Eventual Consistency | Partial Consistency | No | Support high level Consistency throughout

Complete Complete Cluster wise isolation

Isolation Level No Supports partially Support No No No No Support

Durability Level Yes Yes Yes Yes Yes Yes Yes No Support high level durability

Transaction Support: Document level transactions only | Limited form of transactional consistency, group writes of multiple columns not possible, transaction support for data belonging to same shard | Yes | Yes | No | High level Transaction support with associated relationships and properties

Yes to relate contents of same useato Used during sharding of database

document or related referenced reference each consecutive columns are SUpport with foreign key constraints Referential Integrity ones replicas referenced Partially Partially No

accumulating,

MetLife, occasionally Aerospike, All social networking sites, process

Craigslist, New changing data Started from Amazon Web AT&T, AOL, geospatial information, routing

Johannes York Times, with pre-defined Powerset Apixio, AppScale, Cell-level Services, The weather Social Ernest Nuvolabase

Sourceforge, queries, master company, Facebook, Security, Rackspace, Channel, projects etc

Networking site ltd. Well known Sample Foursquare, replication, multi- in-house search, Cloudkick, IBM, Server-side Heroku, Symantec - Twitter. AGPL/

Deployments eBay etc. site deployments software Facebook Netflix, Digg etc. programming Aerospike etc. Apache Proprietary Apache

Ubuntu, Red Hat, Linux, Cross Cross Platform based on specific

Available Platforms Cross Platform Windows Cross Platform Cross Platform Cross Platform Cross Platform Windows Platform Cross Platform applications only Cross Platform

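Programming Language used: MongoDB: C++; CouchDB: C, C++, Erlang, Python; Hypertable: Java; Hadoop/HBase: Java; Cassandra: Java; DynamoDB: Java; Voldemort: C, C++; Riak: Erlang; Neo4j: Java; AllegroGraph: Java; FlockDB: Java, Scala, Ruby; InfoGrid: Java, Scala, Spark; OrientDB: Java, C++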
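Several entries in the partitioning row of Table VII refer to consistent hashing. As a reading aid, the following Python sketch shows the general technique with a hash ring and virtual nodes. It is an illustrative assumption rather than the scheme used by any database in the table: the HashRing class, the server names, the replica count and the choice of MD5 are all hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes (replicas)."""

    def __init__(self, nodes, replicas=50):
        ring = []
        for node in nodes:
            # Place several virtual points per node to even out the key spread.
            for r in range(replicas):
                ring.append((self._hash(f"{node}#{r}"), node))
        ring.sort()
        self._ring = ring
        self._hashes = [h for h, _ in ring]

    @staticmethod
    def _hash(key):
        # MD5 is used only as a well-mixed hash here, not for security.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        """Return the node owning the first ring point clockwise from the key."""
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"user:{i}" for i in range(1000)]
before = HashRing(["server-a", "server-b", "server-c"])
after = HashRing(["server-a", "server-b", "server-c", "server-d"])

# Keys whose owner changes after the fourth server joins the ring.
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
print(f"{len(moved)} of {len(keys)} keys moved")
```

The property the sketch is meant to show is that adding a server only reassigns keys to the new server; keys that stay on the existing servers keep their owner, which is what makes this style of partitioning attractive when nodes join or leave.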
