Data Management Project B

Scalability and Performance Comparison of the Features of MongoDB and Cassandra NoSQL Databases

Name: Rohit Maheshwari
Student Number: 14109701
Module: Data Storage and Management
Project: B

Introduction

MongoDB and Cassandra are NoSQL databases. Both are free and open-source software. The limitations in scalability and the complexity of traditional SQL databases have led to the evaluation of NoSQL databases. MongoDB was first developed by the software company 10gen (now MongoDB Inc.) as part of a planned platform as a service (PaaS) product. MongoDB is one of the most popular NoSQL database systems. It is a document database which provides high availability, high scalability and high performance. It is considered a good fit for Node.js applications, since JavaScript can then be used across the client, the backend and the database layer. The database uses references and embedded documents instead of joins. It provides drivers for Ruby, Python and other languages.

Apache Cassandra is another scalable NoSQL database. Its roots lie in organisations such as Google and Facebook, which are known for handling and managing massive amounts of data. It is an open-source, distributed, decentralized, elastically scalable, highly available, fault-tolerant and column-oriented database.

Cassandra is used today by a number of industries to manage critical data infrastructure. It is designed to provide high performance at massive scale with continuous availability. A Cassandra cluster is a set of nodes that uses consistent hashing to distribute rows based on the row ID.
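The idea of distributing rows by consistent hashing can be illustrated with a short Python sketch. This is a toy model rather than Cassandra's actual partitioner: the node names, the use of MD5 and the single token per node are assumptions made only for illustration.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a position on the hash ring (MD5 is used only for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring: each node owns the arc that ends at its token."""

    def __init__(self, nodes):
        self._tokens = sorted((ring_position(n), n) for n in nodes)
        self._positions = [pos for pos, _ in self._tokens]

    def node_for(self, row_id: str) -> str:
        # Pick the first node whose token is >= the row's position, wrapping around.
        idx = bisect.bisect_left(self._positions, ring_position(row_id))
        return self._tokens[idx % len(self._tokens)][1]


# Hypothetical node names and row IDs.
ring = HashRing(["node-a", "node-b", "node-c"])
placement = {row: ring.node_for(row) for row in ("user1", "user2", "user3")}
```

A useful property of this scheme is that adding a node only moves the rows that fall into the new node's arc; every other row stays where it was.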

Key Characteristics of MongoDB and Cassandra Database

There are a few important features which we should take into consideration when working with MongoDB. The main features include:

- Document-oriented: MongoDB stores a subject in a minimal number of documents instead of breaking it up and storing it across relational tables. For example, a traditional relational database stores title and author information in two different relational structures, whereas in MongoDB both are stored in a single document called Book, which makes the work easier and faster when we are dealing with big data.

- Ad hoc queries: MongoDB supports search by field, range queries and regular expression searches. Any field in a MongoDB document can be indexed.

- Replication: This feature provides high availability by keeping multiple copies of the data. It works on master-slave replication, where the master (primary) handles the read and write operations by default and the secondary nodes maintain copies of the data.

- Load Balancing: MongoDB can distribute data horizontally across servers using sharding. It runs over multiple servers, balancing the load to keep the system up and running.

- Aggregation: MapReduce can be used for batch processing of data and for aggregation operations. It offers functionality similar to the SQL GROUP BY clause.

- Server-Side JavaScript Execution: JavaScript can be used in queries and in MapReduce functions.

- Special Support for Locations: MongoDB natively understands latitude and longitude.
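The document-oriented point above can be sketched in plain Python, without a database driver. The table layout, field names and sample values are hypothetical; the sketch only contrasts a two-table relational lookup with reading a single embedded Book document.

```python
# Relational layout: title and author live in separate tables linked by a key.
books_table = [{"book_id": 1, "title": "Example Title", "author_id": 10}]
authors_table = [{"author_id": 10, "name": "Example Author"}]


def relational_lookup(book_id):
    """Join the two tables by hand to rebuild one book record."""
    book = next(b for b in books_table if b["book_id"] == book_id)
    author = next(a for a in authors_table if a["author_id"] == book["author_id"])
    return {"title": book["title"], "author": author["name"]}


# Document layout: a single Book document embeds the author information.
book_document = {
    "_id": 1,
    "title": "Example Title",
    "author": {"name": "Example Author"},
}


def document_lookup(doc):
    """No join is needed: everything is read from the one document."""
    return {"title": doc["title"], "author": doc["author"]["name"]}
```

Both lookups return the same record; the document version simply avoids the second search.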

Key characteristics of Apache Cassandra are:

- Elastic Scalability: Cassandra scales horizontally by adding more machines that hold all or some of the data. Adding nodes increases throughput linearly, with no downtime or interruption to applications.

- Distributed and Decentralized: Distributed means it is capable of running on multiple machines. Decentralized means there is no single point of failure. Because of the peer-to-peer architecture there is no master-slave issue, and a single Cassandra cluster can span geographically dispersed data centres.

- High availability and fault tolerance: Multiple networked computers operate in a cluster, and the cluster is capable of recognizing node failures, which keeps the data highly available.

- Tunable Consistency: We can choose between strong and eventual consistency, and the level is adjustable separately for read and write operations. Conflicts are resolved during reads, as the focus lies on write performance.

- MapReduce Support: Cassandra can be integrated with Hadoop and supports MapReduce.

- Query Language: Cassandra has CQL (Cassandra Query Language), an alternative to SQL. Drivers for it are available for various platforms.
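The tunable consistency point can be made concrete with the usual quorum arithmetic: a read is guaranteed to see the latest write when the number of replicas read plus the number of replicas written exceeds the replication factor. The helper names below are hypothetical; this is a sketch of the rule, not of Cassandra's API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """Reads overlap the latest write when R + W > RF."""
    return read_replicas + write_replicas > replication_factor


# With three replicas, quorum reads plus quorum writes overlap,
# while reading and writing a single replica each does not.
rf = 3
quorum_setting = is_strongly_consistent(quorum(rf), quorum(rf), rf)
one_one_setting = is_strongly_consistent(1, 1, rf)
```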

MongoDB and Cassandra Architecture

In the MongoDB architecture, the main components are the application driver, mongos (also known as the query router), the config database (configdb) and mongod (the MongoDB database process). In a production environment, applications connect to mongod through mongos. There are multiple mongod nodes, so mongos acts as the query router between the application driver and the mongod nodes of the sharded cluster, while configdb holds the cluster metadata, such as which data is kept on which shard or node. In the architecture diagram, the box at the bottom is a MongoDB cluster containing multiple shard nodes. Mongos distributes each query to the relevant shards and performs the read and write operations, using the metadata in the config database to do so.

Each shard, which acts as a node, has one primary and two secondary members. If the primary goes down, one of the two secondaries becomes primary and takes over its tasks. The replication mechanism is configured inside the shard.

Cassandra Database Architecture

Cassandra is designed to manage big data workloads across multiple nodes with no single point of failure. The architecture assumes that system and hardware failures do occur. Cassandra uses hashing to assign data to nodes; the hash input can include the column name, row ID and so on. The hash range is divided among the nodes in the cluster. A commit log on each node captures write activity for data durability. Data is also written to and indexed in an in-memory structure called a memtable, which is where data is updated and deleted. The memtable is a temporary location; when it is full, its contents are written to disk as an SSTable. All written data is partitioned and replicated automatically throughout the cluster.
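The write path described above (commit log, memtable, flush to an SSTable) can be sketched as a toy in-memory model. The class name, the flush threshold and the data structures are assumptions for illustration; real Cassandra writes these structures to disk and handles deletes, compaction and much more.

```python
class ToyNode:
    """Illustrative write path: commit log, memtable, flush to immutable SSTables."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only record of every write, for durability
        self.memtable = {}     # in-memory and mutable
        self.sstables = []     # flushed, read-only snapshots (newest last)
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Persist the memtable as a sorted, immutable table and start a fresh one.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # the newest SSTable wins
            for k, v in table:
                if k == key:
                    return v
        return None
```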

Cassandra is a row-oriented database which allows any user to connect to any node in the data centre using CQL (Cassandra Query Language), which has a syntax broadly similar to SQL. The main components of the Cassandra architecture are:

- Gossip: A peer-to-peer communication protocol through which nodes discover and share location information with the other nodes in a Cassandra cluster.

- Partitioner: This component determines how data is distributed across the cluster and on which node each piece of data should be kept. The partitioner works with a hash function.

- Replication Factor: The replication factor sets how many copies of the data are kept across the Cassandra cluster. A replication factor of 1 means that there is one copy of each row on one node; similarly, a replication factor of 2 means that there are two copies of each row, each on a different node. The replication factor should not be set higher than the number of nodes. There is no master replica; all replicas are equally important. The user defines the replication strategy of the Cassandra data centre environment.

- Replica placement strategy: Cassandra ensures reliability and fault tolerance by storing copies of data on multiple nodes. The replication strategy chooses the nodes on which the replicas are placed.

- Snitch: A snitch defines the groups of machines (the racks and data centres) that the replication strategy uses to place replicas. It is advisable to configure the snitch when creating a cluster. A default snitch is enabled and is used in most deployments.

- System keyspace table properties: We can set storage configuration attributes per keyspace or per table using a client application.
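The interaction between the partitioner and the replication factor can be sketched as follows. This is a simplified model that places replicas on consecutive nodes around a hash ring; the function names, node names and the use of MD5 are assumptions for illustration, and rack or data-centre awareness (the snitch) is ignored.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Hash a string to a ring position (MD5 is used only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas_for(row_id, nodes, replication_factor):
    """Return the nodes holding copies of row_id, walking the ring clockwise."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    ring = sorted((token(n), n) for n in nodes)
    start = bisect.bisect_left([t for t, _ in ring], token(row_id))
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]


# Hypothetical three-node cluster with two copies of each row.
cluster = ["node-a", "node-b", "node-c"]
copies = replicas_for("row-42", cluster, 2)
```

The explicit check mirrors the rule in the list above: a replication factor larger than the node count cannot be satisfied with distinct nodes.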

Test Plan

We tested the two NoSQL databases with the YCSB tool. Before going through the details of the installation, we describe the hardware on which both databases were installed and tested.

Hardware Specifications:

Node   Memory   Storage   Processor
1      2GB      30GB      Intel i5

Before testing the various metrics, we need to install the MongoDB and Cassandra databases and then perform the benchmarking with the YCSB tool.

Installation of YCSB Tool:

The goal of the Yahoo! Cloud Serving Benchmark (YCSB) is to provide a framework and a common set of workloads for benchmarking. The framework includes a workload generator, which makes it easy to define new workload types. The workloads are run against various databases to evaluate and compare the performance of different key-value stores. The YCSB framework is available as open source. The steps to install the YCSB tool are as follows:

Step 1: wget https://github.com/downloads/brianfrankcooper/YCSB/ycsb-0.1.4.tar.gz

Step 2: tar xfvz ycsb-0.1.4.tar.gz

We have installed the YCSB tool by creating a separate user called hduser.

Installation of MongoDB

MongoDB provides packages of the officially supported MongoDB builds in its own repository, and we downloaded the MongoDB packages from there. Before installing MongoDB, we had to install Java in our virtual machine. The steps are as follows:

Step 1: sudo apt-get update

Step 2: sudo apt-get install openjdk-7-jre

Step 3: sudo apt-get install openjdk-7-jdk

After installing Java, the steps below are used to install the MongoDB database in our VM.

Step 1:

cd /usr/local/

sudo apt-get install curl

sudo curl -O http://downloads.mongodb.org/linux/mongodb-linux-x86_64-2.6.4.tgz

sudo tar -zxvf mongodb-linux-x86_64-2.6.4.tgz

sudo chown -R hduser:hadoop mongodb-linux-x86_64-2.6.4

sudo ln -s mongodb-linux-x86_64-2.6.4 mongodb

cd ..

sudo mkdir mongodbdata

sudo chown -R hduser:hadoop mongodbdata

cd /

sudo ln -s /usr/mongodbdata data

cd data

mkdir -p db

After performing the above steps we need to export path in .bashrc file.

Step 2:

export PATH=$PATH:/usr/local/mongodb/bin

After exporting path, we need to source the .bashrc file and start the mongodb server.

Step 3:

. ~/.bashrc

mongod --bind_ip 127.0.0.1

Installation of Cassandra database

There are some prerequisites for installing Cassandra in the Virtual Machine. They are:

- Oracle Java 7 must be installed.

- There should be root and sudo access.

- At least 256MB of memory for testing light workloads.

Below are the steps which we used to install the Cassandra database.

First we created a directory called cassandra.

Step 1:

cd ~

mkdir cassandra

cd cassandra

After creating the directory, we download Cassandra from a mirror, extract it and move its contents into this personal folder.

Step 2:

wget http://www.eng.lsu.edu/mirrors/apache/cassandra/2.1.2/apache-cassandra-2.1.2-bin.tar.gz

tar -zxvf apache-cassandra-2.1.2-bin.tar.gz

mv apache-cassandra-2.1.2/* ~/cassandra

Then we create the folders that Cassandra accesses, such as the log folder, and make sure that Cassandra has the right to write to them:

Step 3:

sudo mkdir /var/lib/cassandra

sudo mkdir /var/log/cassandra

sudo chown -R $USER:$GROUP /var/lib/cassandra

sudo chown -R $USER:$GROUP /var/log/cassandra

Now we have to set Cassandra variables by running

Step 4:

export CASSANDRA_HOME=~/cassandra

export PATH=$PATH:$CASSANDRA_HOME/bin

We then increased the Cassandra per-thread stack size from the default of 180k to 280k.

Step 5:

nano ~/cassandra/conf/cassandra-env.sh

Search for the below line in this file.

JVM_OPTS="$JVM_OPTS -Xss180k"

Change it to

JVM_OPTS="$JVM_OPTS -Xss280k"

The next step is to run Cassandra and open the command-line client:

sudo sh ~/cassandra/bin/cassandra

sudo sh ~/cassandra/bin/cassandra-cli

We also created a user table in Cassandra for running the test:

create keyspace usertable;

use usertable;

create column family data;

To test both databases further, we perform a comparative analysis of the two NoSQL databases on the basis of the two characteristics below:

- Performance

- Scalability

The YCSB tool provides various types of workload for analysing databases. The table below describes the workloads which we used in this test.

Workload            Operations                Record Selection
A - Update Heavy    Read: 50%, Update: 50%    Zipfian
B - Read Heavy      Read: 95%, Update: 5%     Zipfian
C - Read Only       Read: 100%                Zipfian
D - Read Latest     Read: 95%, Insert: 5%     Latest
E - Short Ranges    Scan: 95%, Insert: 5%     Zipfian/Uniform
F - Write Heavily   Read: 10%, Insert: 90%    Zipfian
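The operation mixes in the workload table can be written down as data and sampled, which gives a feel for what each workload asks of the database. The proportions are copied from the table; the function name and the seed are hypothetical, and the sketch does not model the Zipfian or Latest record selection.

```python
import random

# Operation proportions copied from the workload table.
WORKLOADS = {
    "a": {"read": 0.50, "update": 0.50},
    "b": {"read": 0.95, "update": 0.05},
    "c": {"read": 1.00},
    "d": {"read": 0.95, "insert": 0.05},
    "e": {"scan": 0.95, "insert": 0.05},
    "f": {"read": 0.10, "insert": 0.90},
}


def sample_operations(workload, count, seed=0):
    """Draw `count` operation names according to the workload's proportions."""
    rng = random.Random(seed)
    mix = WORKLOADS[workload]
    return rng.choices(list(mix), weights=list(mix.values()), k=count)


# Example: a short stream of operations for the update-heavy workload.
stream = sample_operations("a", 10)
```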

In MongoDB we used the commands below to get the output.

- This command loads workload a into the MongoDB server:

./bin/ycsb load mongodb -s -p mongodb.url=mongodb://127.0.0.1:27017 -p mongodb.database=ycsb -p mongodb.writeconcern=strict -p mongodb.maxconnections=10 -P workloads/workloada

- This command runs workload a and saves the output to the workloada.txt file:

./bin/ycsb run mongodb -s -p mongodb.url=mongodb://127.0.0.1:27017 -p mongodb.database=ycsb -p mongodb.writeconcern=strict -p mongodb.maxconnections=10 -P workloads/workloada | tee workloada.txt

The same operations are performed for the remaining workloads b to f.
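Repeating the load and run steps for workloads a to f can be scripted. The sketch below only builds the argument lists and does not execute anything; the flags and property values are the ones given in the MongoDB commands above, while the helper name is hypothetical.

```python
def ycsb_command(phase, binding, workload, properties):
    """Build one YCSB command line as an argument list (nothing is executed)."""
    args = ["./bin/ycsb", phase, binding, "-s"]
    for name, value in properties.items():
        args += ["-p", f"{name}={value}"]
    args += ["-P", f"workloads/workload{workload}"]
    return args


# Property values taken from the MongoDB commands in this report.
MONGODB_PROPERTIES = {
    "mongodb.url": "mongodb://127.0.0.1:27017",
    "mongodb.database": "ycsb",
    "mongodb.writeconcern": "strict",
    "mongodb.maxconnections": "10",
}

# One load and one run command per workload, a to f.
commands = [
    ycsb_command(phase, "mongodb", workload, MONGODB_PROPERTIES)
    for workload in "abcdef"
    for phase in ("load", "run")
]
```

Each list could then be passed to a process runner, with the output of the run phase redirected to a per-workload text file.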

In Cassandra we used the commands below to get the output.

- This command loads workload a into the Cassandra server:

./bin/ycsb load cassandra-10 -p hosts="127.0.0.1" -P workloads/workloada

- This command runs workload a and saves the output to the workloada.txt file:

./bin/ycsb run cassandra-10 -p hosts="127.0.0.1" -P workloads/workloada | tee workloada.txt

The same operations are performed for the remaining workloads b to f.

Below are the results of the tests performed on the databases.

MongoDB

Workload     Runtime (ms)   Throughput (ops/sec)   Average Latency (us)
Workload a   814            1228                   1703
Workload b   1928           518                    4114
Workload c   1885           530                    1350
Workload d   2608           383                    8948
Workload e   3314           301                    4665
Workload f   1031           969                    702

Cassandra

Workload     Runtime (ms)   Throughput (ops/sec)   Average Latency (us)
Workload a   2317           431                    4251
Workload b   5943           168                    7951
Workload c   3224           310                    3031
Workload d   2340           427                    3680
Workload e   10304          97                     12696
Workload f   2411           414                    612.7
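The runtime and throughput columns are related by simple arithmetic: throughput is the number of operations divided by the runtime in seconds. The sketch below checks this against the reported figures. It takes the runtime to be in milliseconds and assumes 1000 operations per run, which is not stated in the report but is consistent with every row of both tables.

```python
OPERATION_COUNT = 1000  # assumption: not stated in the report


def throughput_ops_per_sec(runtime_ms, operations=OPERATION_COUNT):
    """Operations per second for a run that took `runtime_ms` milliseconds."""
    return operations / (runtime_ms / 1000.0)


# (runtime, reported throughput) pairs copied from the two result tables.
REPORTED = {
    "mongodb": {"a": (814, 1228), "b": (1928, 518), "c": (1885, 530),
                "d": (2608, 383), "e": (3314, 301), "f": (1031, 969)},
    "cassandra": {"a": (2317, 431), "b": (5943, 168), "c": (3224, 310),
                  "d": (2340, 427), "e": (10304, 97), "f": (2411, 414)},
}


def matches_report(runtime_ms, reported):
    """True when the computed value, rounded down, equals the reported one."""
    return int(throughput_ops_per_sec(runtime_ms)) == reported


all_consistent = all(
    matches_report(runtime, reported)
    for rows in REPORTED.values()
    for runtime, reported in rows.values()
)
```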

Evaluation and Results

Scalability

To test scalability, we increased the record count in workload A for both MongoDB and Cassandra and saved the results in a text file. The results for both databases are as follows:

MongoDB

Record Count   Runtime (ms)   Throughput (ops/sec)   Average Latency (us)
10000          820            1219                   1185
20000          838            1193                   1228
30000          843            1186                   1251
40000          2068           483                    3035

Cassandra

Record Count   Runtime (ms)   Throughput (ops/sec)   Average Latency (us)
10000          4536           220                    8265
20000          7918           126                    16194
30000          9022           110                    17528
40000          16698          59                     34012
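One way to read the scalability tables is to express each runtime relative to the 10000-record baseline, so that the two databases can be compared on how quickly runtime grows rather than on absolute values. The sketch below does this with the runtimes copied from the tables; the function name is hypothetical.

```python
# Runtimes copied from the scalability tables, keyed by record count.
mongodb_runtime = {10000: 820, 20000: 838, 30000: 843, 40000: 2068}
cassandra_runtime = {10000: 4536, 20000: 7918, 30000: 9022, 40000: 16698}


def growth_factor(series, baseline=10000):
    """Runtime at each record count divided by the runtime at the baseline count."""
    base = series[baseline]
    return {count: runtime / base for count, runtime in series.items()}


mongodb_growth = growth_factor(mongodb_runtime)
cassandra_growth = growth_factor(cassandra_runtime)
```

On these figures, MongoDB's runtime stays close to the baseline up to 30000 records and then rises sharply at 40000, while Cassandra's runtime grows at every step.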

Performance

The runtime figure shows that Cassandra takes longer than MongoDB in most cases. MongoDB performs better in five of the six workloads; the exception is workload d, where the runtimes of the two databases are almost the same (2340 for Cassandra versus 2608 for MongoDB).

[Figure: Runtime for workloads a to f, Cassandra Runtime versus MongoDB Runtime]

The throughput figure compares the two databases across the six workloads. Here too MongoDB performs better than Cassandra, with the largest differences in workload a (1228 versus 431 operations per second) and workload f (969 versus 414). Throughput is almost the same in workload d (383 versus 427), where Cassandra is slightly ahead.

The average latency figure compares the two databases on average latency. MongoDB again performs better than Cassandra in most workloads. The exception is workload d, which consists of read and insert operations, where Cassandra has the lower latency (3680us versus 8948us); in workload f the two are close (612.7us for Cassandra versus 702us for MongoDB).

[Figure: Throughput for workloads a to f, Cassandra Throughput versus MongoDB Throughput]

[Figure: Average Latency for workloads a to f, Cassandra Avg Latency versus MongoDB Avg Latency]

Scalability Graphs

[Figures: scalability charts comparing Runtime, Throughput and Average Latency for Cassandra and MongoDB; the surviving panel title is "Scalability for 10000 Record Count"]

The four scalability figures show how Cassandra and MongoDB behave at the various record counts. We can clearly see that MongoDB performs better than Cassandra at all four record counts, with lower runtime, higher throughput and lower average latency.

Conclusion

We performed the experiment on both NoSQL databases, MongoDB and Cassandra. Across the various experiments we observed that MongoDB performs much better than Cassandra in terms of runtime, throughput and average latency.

Overall, on a single server MongoDB performs better. In terms of reliability, however, Cassandra has no single point of failure and supports multiple data centres.


References

- Docs.datastax.com, (2015). Architecture in brief | DataStax Cassandra 2.0 Documentation. [online] Available at: http://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureIntro_c.html [Accessed 30 Apr. 2015].

- Docs.mongodb.org, (2015). Production Cluster Architecture - MongoDB Manual 3.0.2. [online] Available at: http://docs.mongodb.org/manual/core/sharded-cluster-architectures-production/ [Accessed 30 Apr. 2015].

- Mongodb.com, (2015). MongoDB Architecture. [online] Available at: http://www.mongodb.com/mongodb-architecture [Accessed 30 Apr. 2015].

- Wiki.apache.org, (2015). ArchitectureOverview - Cassandra Wiki. [online] Available at: http://wiki.apache.org/cassandra/ArchitectureOverview [Accessed 30 Apr. 2015].
