
Hadoop vs MongoDB


In recent years there has been an explosion of data. IDC predicts that the digital universe will grow to 2.7 zettabytes in 2012, up 48% from 2011. By 2015 this is expected to grow to 8 zettabytes of data (1 zettabyte = 1,000,000,000,000,000,000,000 bytes). More than half of this is unstructured data from social networks, mobile devices, web applications, and other similar sources. Traditional RDBMS systems are data storage solutions that were designed decades ago. At that time most data was structured and had a different flavor than today. We need solutions that address Big Data problems and were designed on the basis of recent changes in the nature of data. (1) (2)

There are many solutions that address the Big Data problem. Some of these solutions are tweaks and/or hacks around standard RDBMS solutions. Some are independent data processing frameworks, data storage solutions, or both. There is a website which lists many of them. (3)

According to Jaspersoft, there are two leading technologies in this new field. First is Apache Hadoop, a framework that allows for the distributed processing of large data sets using a simple programming model. The other is MongoDB, an open source NoSQL scalable distributed database. Both of these solutions have their strengths, weaknesses, and unique characteristics. (4) (5) (6)

MongoDB and Hadoop are fundamentally different systems. MongoDB is a database, while Hadoop is a data processing and analysis framework. MongoDB focuses on storage and efficient retrieval of data, while Hadoop focuses on data processing using MapReduce. In spite of this basic difference, both technologies have similar functionality: MongoDB has its own MapReduce framework, and Hadoop has HBase. HBase is a scalable database similar to MongoDB. (7) (8)

The main flaw in Hadoop is that it has a single point of failure, namely the "NameNode." If the NameNode goes down, the entire system becomes unavailable. There are a few workarounds for this, which involve manually restoring the NameNode. (9)

Regardless, the single point of failure exists. MongoDB has no such single point of failure: if at any point in time one of the primaries, config servers, or nodes goes down, there is a replicated resource which can take over the responsibility of the system automatically. (10) (11)

MongoDB supports rich queries, like traditional RDBMS systems, which are written in a standard JavaScript shell. Hadoop has two different components for writing MapReduce (MR) code: Pig and Hive. Pig is a scripting language (similar to Python or Perl) that generates MR code, while Hive is a more SQL-like language. Hive is mainly used to structure the data and provides a rich set of queries. Data has to be in JSON or CSV format to be imported into MongoDB; Hadoop, on the other hand, can accept data in almost any format. Hadoop structures data using Hive, but can handle unstructured data easily using Pig. With the help of Apache Sqoop, Pig can even translate between RDBMS and Hadoop. (12) (13) (14)
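To make the idea of a rich query concrete, here is a minimal sketch using MongoDB's Python driver, pymongo; the database, collection, and field names are hypothetical, not from the whitepaper.

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("localhost", 27017)
orders = client.shop.orders  # hypothetical database and collection

# A secondary index to support the query below.
orders.create_index([("customer", ASCENDING), ("total", ASCENDING)])

# Filter, projection, sort, and limit in one call -- roughly
# SELECT total, items FROM orders WHERE ... ORDER BY ... LIMIT 10.
expensive = orders.find(
    {"customer": "acme", "total": {"$gt": 100}},
    {"_id": 0, "total": 1, "items": 1},
).sort("total", -1).limit(10)

for doc in expensive:
    print(doc)
```

The same query could be typed interactively in the JavaScript shell; the point is that filtering, sorting, and indexing are first-class database features rather than something expressed as MapReduce code.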

What about transactions?

Database transactions: a transaction is a logical unit of work that usually consists of one or more database operations. (15)

MongoDB has no transactions. Initially, that fact may be a concern for people from an RDBMS background, but transactions are somewhat obsolete in the age of the web. When dealing with distributed systems, long-running database operations, and concurrent data contention, the concept of a "database transaction" may require a different strategy; see Jim Gray, "The Transaction Concept: Virtues and Limitations." (16)

MongoDB does have something that is "transaction-like," in that a database write can happen in a single blocking, synchronous "fsync." MongoDB supports atomic operations to some extent: as long as the schema is structured correctly, you can have a reliable write for a single entry.
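As an illustration, here is a minimal sketch of such an atomic single-document write using pymongo; the collection and fields are hypothetical, and the journaled write concern stands in for the blocking "fsync" behavior described above.

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient()
accounts = client.bank.get_collection(
    "accounts",
    # j=True blocks until the write is journaled to disk,
    # approximating the synchronous "fsync" described above.
    write_concern=WriteConcern(w=1, j=True),
)

# $inc is atomic for a single document: no reader or writer can
# observe a partially applied update to this entry.
accounts.update_one({"_id": "alice"}, {"$inc": {"balance": -25}})
```

Because the atomicity guarantee stops at the single document, the schema has to keep everything that must change together inside one document.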


1. http://www.smartercomputingblog.com/2012/03/21/finally-more-data-about-big-data
2. http://en.wikipedia.org/wiki/Big_data
3. http://nosql-database.org
4. http://nosql.mypopescu.com/post/20001178842/nosql-databases-adoption-in-numbers
5. http://hadoop.apache.org
6. http://www.mongodb.org
7. http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Overview
8. http://hbase.apache.org
9. http://wiki.apache.org/hadoop/NameNode
10. http://www.mongodb.org/display/DOCS/Replica+Sets
11. http://www.mongodb.org/display/DOCS/Configuring+Sharding#ConfiguringSharding-ConfigServers
12. http://pig.apache.org
13. http://hive.apache.org
14. http://sqoop.apache.org
15. http://en.wikipedia.org/wiki/Database_transaction
16. http://en.wikipedia.org/wiki/Database_transaction


MongoDB (written in C++) manages memory more cost-efficiently than Hadoop's HBase (written in Java). While Java garbage collection (GC) will, in theory, be as CPU-time efficient as unmanaged memory, it requires 5-10x as much memory to do so. In practice there is a large performance cost for GC on these types of large-scale distributed systems. Both systems also take a different approach to space utilization: MongoDB pre-allocates space for storage, improving performance but wasting space, while Hadoop optimizes space usage but ends up with lower write performance by comparison with MongoDB.

Hadoop is not a single product but rather a software family. Its common components consist of the following:
• Pig, a scripting language used to quickly write MapReduce code to handle unstructured sources
• Hive, used to facilitate structure for the data
• HCatalog, used to provide interoperability between these internal systems (17)
• HBase, which is essentially a database built on top of Hadoop
• HDFS, the actual file system for Hadoop (18)

MongoDB is a standalone product with supported binaries. The learning curve for MongoDB is generally lower than that of Hadoop.

Recently there has been a lot of talk about security with NoSQL databases. Both MongoDB and Hadoop have basic security: MongoDB has simple authentication and MD5 hashing, and Hadoop offers fairly rudimentary security in its various frameworks. This is not a flaw. NoSQL databases like MongoDB were never designed to handle security; they were designed to handle Big Data efficiently, which they effectively do. It is simple to implement security in your application instead of expecting it from your data solution. There is more on this in (19).

For data processing and data analysis there is almost no technology, Open Source or otherwise, that beats Hadoop. Hadoop was designed specifically to address this issue, and it therefore contains all the components necessary to rapidly process terabytes to petabytes of information. Writing MapReduce code in Hadoop is elementary: Pig is easy to learn and makes it uncomplicated to write user-defined functions. MongoDB has its own MapReduce framework which, though subpar to Hadoop's, does the job well. When it boils down to sheer numbers, Hadoop is ahead. (Continued below.)


What is MapReduce? Why do we want it?

MapReduce is a framework for processing highly distributable problems across huge datasets.

It divides the basic problem into a set of smaller, manageable tasks and assigns them to a large number of computers (nodes). An ideal MapReduce task is too large for any one node to process, but can be accomplished efficiently by multiple nodes.

MapReduce is named for the two steps at the heart of the framework:

• Map step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. Each worker node processes its smaller problem and passes the result back to its master node. There can be multiple levels of workers.

• Reduce step: The master node collects the results from all of the sub-problems, combines the results into groups based on the key, and then assigns them to worker nodes called reducers. Each reducer processes those values and sends the result back to the master node.
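To make the two steps concrete, here is a minimal word-count sketch in the style of Hadoop Streaming, where the mapper and reducer are ordinary programs that read stdin and write stdout; this is an illustration, not code from the whitepaper.

```python
#!/usr/bin/env python
# mapper.py -- emit (word, 1) for every word in the input split.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word.lower(), 1))
```

```python
#!/usr/bin/env python
# reducer.py -- the framework sorts mapper output by key, so all
# counts for one word arrive consecutively; sum them per word.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print("%s\t%d" % (current, count))
```

Submitted through the hadoop-streaming jar with its -mapper and -reducer flags, the framework handles the splitting, shuffling, and re-assembly described above.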

MapReduce can be a huge help in analyzing and processing large chunks of data:
• buying pattern analysis, customer usage, and interest patterns in e-commerce
• processing the large amounts of data generated in the fields of science and medicine
• processing and analyzing security data, credit scores, and other large data-sets in the financial industry

These and many other uses make MapReduce an indispensable tool in the software industry.


17. http://incubator.apache.org/hcatalog
18. http://hadoop.apache.org/hdfs
19. http://www.darkreading.com/blog/232600288/a-response-to-nosql-security-concerns.html
20. http://developer.yahoo.com/blogs/hadoop/posts/2008/09/scaling_hadoop_to_4000_nodes_a


Yahoo has a 4,000-node Hadoop cluster, and someone is aiming to test 10,000 nodes soon enough (20). MongoDB is typically used in clusters with around 30-50 shards, with a 100-shard cluster under testing. Typically, MongoDB is used with systems holding less than approximately 5 TB of data. Hadoop, on the other hand, has been used for systems larger than 100 TB, including systems containing petabytes of data.

There are several use cases for both systems:

Hadoop
• Log processing (best use case) - Log files are usually very large, and there are typically lots of them, which creates huge amounts of data. A single machine may not be able to process them efficiently. Hadoop is the best answer to this problem: splitting the log into smaller, workable chunks and assigning them to workers results in very fast processing.

• ETL - For unstructured data streaming in real time, say from a web application, Hadoop is a good choice to structure the data and then store it. Additionally, Hadoop provides ways to pre-process data prior to structuring it.

• Analytics - Various organizations are using Hadoop's superior processing capabilities to analyze huge amounts of data. The most famous usage is Facebook's 21 PB, 2,000-node cluster, which Facebook uses for several tasks.

• Genomics - Scientists working in this field constantly need to process DNA sequences. These are very long and complicated strands of information which require large amounts of storage and processing. Hadoop provides a simple, cheap solution to their processing problem.

• Machine learning - Apache Mahout is built on top of Hadoop and essentially works along with it to facilitate tasks such as targeted recommendations in e-commerce.

MongoDB
• Archiving - Craigslist uses MongoDB as an efficient storage for archives. They still use SQL for active data, while they archive most of their data using MongoDB.

• Flexibility in data handling - With data in a non-standard format (like photos), it is easy to leverage the unstructured, document-oriented storage of MongoDB, in addition to its scaling capacities.

• E-commerce - OpenSky had a very complicated e-commerce model. It was very difficult to design the correct schema for it, and even then tables had to be altered many times. MongoDB allows them to alter the schema as and when required, and simultaneously to scale.

• Online website data storage - MongoDB is very good at real-time inserts, updates, and queries. Scalability and replication are required for very large websites.

• Mobile systems - MongoDB is a great choice for mobile systems due to its geo-spatial index, as the sketch below illustrates.
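Here is a minimal sketch of that geo-spatial support using pymongo; the collection, documents, and coordinates are hypothetical. (Recent MongoDB versions use a 2dsphere index over GeoJSON points, as shown; early versions offered a flat 2d index instead.)

```python
from pymongo import MongoClient, GEOSPHERE

client = MongoClient()
places = client.mobile.places  # hypothetical collection

# Documents store GeoJSON points; the 2dsphere index enables
# proximity and containment queries over them.
places.create_index([("loc", GEOSPHERE)])
places.insert_one({
    "name": "coffee shop",
    "loc": {"type": "Point", "coordinates": [-78.90, 36.00]},  # lon, lat
})

# Find places near a user's position, closest first.
nearby = places.find({
    "loc": {"$near": {
        "$geometry": {"type": "Point", "coordinates": [-78.91, 36.01]},
        "$maxDistance": 5000,  # meters
    }}
})
for doc in nearby:
    print(doc["name"])
```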

Sometimes it is helpful to use MongoDB and Hadoop together in the same system. For instance, with a system that is reasonably large and requires rich queries, with indexes and effective retrieval, we can leverage the best qualities of both systems.


Why NoSQL?

Relational databases have been used for decades. NoSQL databases are radically different and less mature. Relational databases can scale and handle Big Data; the difference is that relational databases require more effort to scale.

Facebook uses MySQL-based database storage. Facebook, with 800 million users and their ever-growing numbers, is the best example of how relational databases can scale. (21)

The question arises: why do we need one of these new NoSQL databases? Is it to scale? Clearly not. Relational databases have been a part of large-scale systems for years. The issue is the amount of work to make them scale. For example, Twitter began as a relational database accessed by a Rails application. Amazon began as a relational database accessed by a C++ application. Facebook also began as a relational database accessed with PHP. While these giants have been successful at scaling, their success has not come easily. They had to make many tweaks and changes, and even implement a few software and hardware tricks, to achieve it.

Consider the simple act of "sharding" and what would be required to shard a set of customer records on a traditional RDBMS. What about the related data? How do you handle an unusual number of orders on the node for customers with names beginning with the letter S? You can shard an RDBMS, but a lot of man-hours will go into what a NoSQL database can handle automatically (a sketch follows below). (Continued below.)

21. http://gigaom.com/cloud/facebook-shares-some-secrets-on-making-mysql-scale
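By way of contrast, here is a minimal sketch of the MongoDB side of that automation, run against the mongos router of a hypothetical sharded cluster; the host, database, and shard key are illustrative.

```python
from pymongo import MongoClient

# Connect to the mongos query router of a sharded cluster
# (hypothetical host name).
client = MongoClient("mongos.example.com", 27017)

# Enable sharding for a database, then shard one collection on a
# chosen key. MongoDB splits the collection into chunks and balances
# them across shards automatically -- the work that costs many
# man-hours on a traditional RDBMS.
client.admin.command("enableSharding", "crm")
client.admin.command("shardCollection", "crm.customers",
                     key={"customer_id": 1})
```

From then on the balancer splits and migrates chunks as data grows, including the uneven hot spots the letter-S example describes.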


Why NoSQL? (Continued)

What of ETL? Processing terabytes of data is possible using parallel queries, star schemas, disk layouts, and other techniques. Using these traditional means, you can eke out as much CPU as possible and then push the data into separate databases for additional processing. However, it will still be hard to achieve through this manual structuring what you can achieve with Hadoop or a similar solution. It can be done, and has been many times before, but it will not be as easy or as quick as with one of these modern, automated solutions.

What about reliability? Oracle RAC is tried and true. However, it is expensive and difficult to work with on larger systems, and it requires abandoning many of the features of the Oracle RDBMS. Microsoft SQL Server also supports clustering, with a similar approach and similar limitations as Oracle. MongoDB and some (but not all) of the other NoSQL solutions support out-of-the-box reliability.

For a startup or a growing company, a scalable database which requires investing a lot of effort and time into making the storage solution scale is not an option. This is where NoSQL databases are the clear winner. For a larger company with IT budget cuts and increasing pressure to make the best use of staff and out-of-the-box solutions, NoSQL and Big Data solutions can no longer be ignored.

Hadoop + MongoDB

• The "Catching Osama" problem
  Problem:
  - A few billion pages with geospatial tags
  - Tremendous storage and processing required
  - MongoDB fails to process the data effectively due to a single-thread bottleneck per node
  - Hadoop does not have indexes, so data retrieval takes longer than usual
  Solution:
  - Store all images in MongoDB with geo-spatial indexes
  - Retrieve using the Hadoop framework and assign processing to multiple mappers and reducers

• Demographic data analysis
  Whenever there is a need for processing demographic data spread over a large geographical area, the geospatial indexes of MongoDB are unmatched and MongoDB becomes an ideal storage. Hadoop is essential when processing hundreds of terabytes of information.


Contact
www.osintegrators.com
345 W Main St, Suite 201
Durham, NC 27701
(919) 321-0119
info@osintegrators.com

Given an unstructured data source and the desire to store it in MongoDB, Hadoop can be used to first structure the data. This facilitates easier storage in MongoDB. Hadoop can then be used repeatedly to process it later in large chunks.
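A minimal sketch of that flow, assuming a Hadoop job has already reduced the raw source to tab-separated records fetched locally (for example with hadoop fs -getmerge); the file, field, and collection names are hypothetical.

```python
from pymongo import MongoClient

client = MongoClient()
events = client.warehouse.events  # hypothetical target collection

# Suppose the Hadoop job structured raw logs into lines of
# "user<TAB>action<TAB>count", merged locally as part-all.tsv.
batch = []
with open("part-all.tsv") as f:
    for line in f:
        user, action, count = line.rstrip("\n").split("\t")
        batch.append({"user": user, "action": action, "count": int(count)})
        if len(batch) >= 1000:  # insert in chunks to bound memory use
            events.insert_many(batch)
            batch = []
if batch:
    events.insert_many(batch)
```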

Both MongoDB and Hadoop have their place in modern systems. MongoDB is rapidly becoming a replacement for highly scalable operational systems and websites, while Hadoop offers exceptional features for applying MapReduce to Big Data and large analytical systems. Currently, both are displacing traditional RDBMS systems.

This white paper was written by Deep Mistry, Open Software Integrators. Open Software Integrators, LLC is a professional services company that provides consulting, training, support, and course development.
