
Prominence of MapReduce in BIG DATA Processing

Shweta Pandey
Shri Vaishnav Institute of Tech. & Science
Indore, India
[email protected]

Dr. Vrinda Tokekar
Institute of Engineering & Technology
Indore, India
[email protected]

Abstract— Big Data has emerged at a rapid pace as a key enabler of social business; it offers an opportunity to create extraordinary business advantage and better service delivery. Big Data is bringing a positive change to the decision-making processes of various business organizations. Along with its several offerings, Big Data brings several issues and challenges related to Big Data management, Big Data processing and Big Data analysis. Big Data faces challenges related to volume, velocity and variety, the 3Vs: volume means a large amount of data, velocity means data arriving at high speed, and variety means data coming from heterogeneous sources. In the definition of Big Data, "Big" denotes a dataset that has grown so much that it becomes difficult to manage using existing data management concepts and tools. MapReduce plays a very significant role in the processing of Big Data. This paper gives a brief overview of Big Data and its related issues and emphasizes the role of MapReduce, which is elastically scalable, efficient and fault tolerant for analysing large datasets, in Big Data processing; it highlights the features of MapReduce in comparison with other design models, which make it a popular tool for processing large-scale data. Analysis of the performance factors of MapReduce shows that eliminating their adverse effects through optimization improves the performance of MapReduce.

Keywords- Big Data, MapReduce, Hadoop Distributed File System, Google File System, Hadoop.

I. INTRODUCTION

Large volumes of data keep growing because organizations continuously capture collective data for better decision making. The volume of data increases through online content such as blogs, posts, social networking interactions and user-created photos, while servers continuously log what online users are doing. Today's business is unexpectedly affected by this growth of data. According to an estimate by IBM, 2.5 quintillion bytes of data are created every day, so large an amount that 90% of the data in the world has been created in the last two years [3]. It is a mind-boggling figure, but the sad part is that instead of having more information, people feel less informed.

Industrial organizations are essentially intent on making sense of the massive arrival of Big Data and on developing analytic platforms that combine traditional structured data with semi-structured and unstructured sources of information. In this manner, industrial organizations can take advantage of Big Data processing for a better decision-making process. The success of Big Data depends on its analysis. Big Data analysis is an ongoing process that requires a set of activities rather than an isolated activity. Therefore, finding the data, determining new insights, and then integrating and presenting those insights properly to serve specific goals requires a unified set of solutions [23].

MapReduce has emerged as a highly effective and efficient tool for Big Data analysis. The reasons behind its popularity are its unique features: the simplicity and expressiveness of its programming model, which has mainly two functions, map() and reduce(), even though a large number of data-analytical tasks can be expressed as a set of MapReduce jobs; a high degree of elastic scalability; and fault tolerance [4]. Yet the performance of MapReduce is still far from ideal in the database context. A recent study shows that Hadoop [1], the open-source implementation of MapReduce, is slower than two parallel database systems by a factor of 3.1 to 6.5 [5]. Organizations require a data processing system that is not only elastically scalable but also efficient. MapReduce can become more efficient through proper tuning of various performance-affecting parameters. This paper focuses on those parameters and their proper tuning.
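To make the programming model concrete, below is a minimal word-count job written against the Hadoop MapReduce Java API (Hadoop being the open-source implementation discussed here). This is the standard introductory example rather than anything taken from this paper; the input and output paths are supplied as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // map(): emits (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }
  // reduce(): sums the counts grouped under each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The framework handles partitioning, scheduling and fault tolerance; the programmer supplies only the two functions.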

The rest of this paper is organized as follows: Section II presents the importance of and challenges related to Big Data. Section III and its subsections present the contribution of MapReduce to Big Data processing, a comparative study between MapReduce and parallel DBMSs, a glance at the performance-affecting parameters of MapReduce, and the optimization of those parameters to improve MapReduce performance. Section IV highlights the literature review. Finally, Section V concludes the paper.

II. BIG DATA CHALLENGES

This is the world of Big Data: web logs, sensor networks, Internet texts and documents, Internet search indexing, CDRs (call detail records), and so on all constitute Big Data.


To make more effective decisions and take full advantage of the available information, Big Data must be processed efficiently, but Big Data brings problems in capturing, searching, storing, sharing, analyzing and visualizing such large amounts of data.

Here is a glance at the Big Data challenges:

(A) BIG DATA MANAGEMENT AND STORAGE

In Big Data, "Big" means the size of data is growing continuously, while storage capacity is increasing much more slowly than the amount of data. The available information framework needs to be reconstructed into a hierarchical framework, because researchers have concluded that available DBMSs are not adequate to process such large amounts of data. The architecture commonly used for data processing relies on a database server, which has constraints of scalability and cost, both of which are prime goals of Big Data.

Different business models have been suggested by database providers, but they are basically application specific; for example, Google seems to be more interested in small applications. Big Data storage is another big issue in Big Data management: available computer algorithms are sufficient to store homogeneous data, but they are not able to smartly store data arriving in real time because of its heterogeneous behaviour. How to rearrange data is therefore another big problem in the context of Big Data management. Virtual server technology can sharpen the problem, because it raises the issue of overcommitted resources, especially when there is a lack of communication between the application server and the storage administrator. The problems of concurrent I/O and of a single-node master/slave architecture also need to be solved.

(B) BIG DATA CALCULATION AND EXAMINATION

Speed is a prime requirement when processing a query in a database, but the process may take time because all the relevant data in the database cannot be traversed in a short time. The use of indices is the most favourable solution, but an index is only designed for simple types of data, while Big Data is not limited to simple types and faces further complications. A combination of indices for Big Data and advanced pre-processing techniques can therefore be used. Likewise, application parallelization and divide-and-conquer are natural computational approaches to Big Data problems, but they require adding extra computational resources, which is not easy. Traditional serial algorithms are not sufficient for Big Data. The cloud's reduced-cost model can be used to process an application with sufficient data parallelism, because it allows hundreds of computers to be used for a short-time cost.

(C) BIG DATA PRIVACY PROTECTION

IT companies use online Big Data applications for sharing information and reducing cost; in this setting, security and privacy concerns badly affect Big Data storage and processing, as third-party services are involved in hosting private information and performing computation on the available data. Big Data keeps growing and brings many challenges of dynamic data monitoring and security protection. Security in Big Data means performing data mining without exposing the sensitive information of individuals. Data changes dynamically, e.g. through variations of attributes and the addition of new data, so it is challenging to implement effective privacy protection in this complicated situation, while current privacy protection technologies are mainly based on static data.

III. CONTRIBUTIONS OF MAPREDUCE IN BIG DATA PROCESSING

Business and scientific applications need to process and analyse large volumes of data for better decision making. Terabytes of data must be processed efficiently on a daily basis; as explained in the introductory part of this paper, industries face the Big Data problem due to the inability of conventional database systems and software tools to manage and process such large amounts of data within tolerable time limits. MapReduce plays a promising role in processing and managing Big Data. The MapReduce system is well known for its elastic scalability and fault-tolerant behaviour.

For Big Data processing, parallel DBMSs and MapReduce (MR) are both available solutions, and each has its own importance. Subsection 1 below presents a table comparing MapReduce and parallel DBMSs, with reference to the work done in [7].

1. Comparison of MapReduce with Other Analytical Models

Table II-1 summarizes the comparison between the MR and parallel DBMS approaches on the basis of various parameters:

Table II-1: Comparison of MapReduce (MR) and parallel DBMSs

Complicated Transformations
MR: Complicated transformations are easier to express in MR.
Parallel DBMS: Complicated transformations are difficult to express in SQL; user-defined functions can be used for such cases, but UDF support is either buggy (DBMS-X) or missing (Vertica).

Compression Optimization
MR: Compression is less valuable in MR because records must be parsed at run time; this is one reason for the performance difference between MR and parallel DBMSs.
Parallel DBMS: Compression is more valuable, as parsing is done once at load time.

Data Format
MR: MR allows data in any arbitrary format.
Parallel DBMS: All DBMSs require that data conform to a well-defined schema.

Indexing
MR: MR does not use pre-generated indices [7].
Parallel DBMS: A parallel DBMS uses predefined indexes before processing the data; this is one reason why parallel DBMSs perform better.

Languages
MR: MR tasks use procedural, object-oriented languages, e.g. Java.
Parallel DBMS: Parallel DBMSs use the declarative language SQL.

Programming Model
MR: The MR model is simple, which is one of its benefits; MR has only two functions, map() and reduce(), and performs complex computations by simply chaining map() and reduce(). MR follows a modest and more comprehensible step-by-step manner.
Parallel DBMS: Not as simple as the MR model; SQL queries must be written, which can be hard if the query is complex.

Query Execution Strategies
MR: MR divides the dataset into smaller parts and distributes them. In contrast to the loading and indexing phases of a parallel DBMS, MR only loads data into the distributed file system and processes it on the fly; since loading is not required, MR takes less time to process the data than a parallel DBMS.
Parallel DBMS: For processing, the query system automatically computes a query plan, distributes it to the entire cluster, and executes the query in parallel over the cluster. It uses loading and indexing during data processing, which improves performance, but the loading phase takes a long time.

Storage Independence
MR: MR is storage independent and simple; data does not need to be loaded into a database before processing.
Parallel DBMS: A parallel DBMS is not storage independent and not as simple as MR; the programmer specifies the objective in a high-level language, which an SQL translator then processes.

Strengths
MR: Fault tolerance, storage-system independence, flexibility and simplicity.
Parallel DBMS: In contrast to MR, robustness and high performance.

Type of Dataset
MR: If the data is dynamic and the user is concerned with only a few queries, MR is more useful.
Parallel DBMS: A parallel DBMS is preferred for a static dataset that has to be queried frequently.

Use of Failure Model
MR: MR widely uses a fine-grained failure model, but its performance is low.
Parallel DBMS: On any failure, the whole task has to be re-executed, which increases cost but provides better performance than MR.

Implementation and Code
MR: The code is longer for an MR implementation than for a parallel DBMS, because the MR implementation handles parallelism, node failures, data distribution and load balancing.
Parallel DBMS: It requires less coding.

Time
MR: MR needs less time to process data than a parallel DBMS, because loading of data is not required before processing.
Parallel DBMS: Due to the load phase, it requires more time to tune the data.

One can improve the performance of MR by properly tuning various performance-affecting parameters.

2. MR's Performance Parameters

On the basis of the architectural design of MR, five parameters have been identified that affect the performance of MR: (I) I/O, (II) index structure, (III) record parsing, (IV) grouping algorithms and (V) block-level scheduling. Here is a brief look at these parameters:

(a) I/O interface: MR's storage system is independent, but MR still requires an I/O interface for scanning data. Two I/O modes are available: direct I/O and streaming I/O. Benchmarking of HDFS (Hadoop Distributed File System) shows that direct I/O outperforms streaming I/O by 10% [19].

(b) Index structure: MR performance can be further improved by using an index structure. MR can support three index structures: sorted files with a range index, B+-tree files, and database indexed tables. Experimental work shows that performance can be improved by a factor of 2.5 in the selection task and by a factor of up to 10 in the join task, depending on the selectivity of the filtering conditions.

(c) Record parsing: Earlier work [5][8] reported that record parsing is one reason for the poor performance of MR, which is not true; it has been found that only immutable decoding introduces a high performance overhead. A mutable decoding scheme is faster than the immutable scheme by a factor of 10, as shown in [9], so it is suggested that MR should strictly use the mutable decoding scheme for database-like workloads, which improves the performance of MR in the selection task by a factor of 2.

(d) Grouping algorithms: The programming model of MR focuses on data transformation, which is specified in the map() and reduce() functions, but the model itself does not specify how the intermediate results generated by map() are grouped for the reduce() functions to process. The merge-sort algorithm used by MR is not efficient for certain analytical tasks such as aggregation and join. It has been observed that performance can be improved by using different grouping algorithms; for this purpose, the Hadoop core needs to be modified to support an efficient grouping algorithm.

(e) Block-level scheduling: Scheduling at the block level incurs performance overhead: generally, the more data blocks there are to schedule, the higher the cost the scheduler incurs. The scheduling algorithm, which determines the assignment of map tasks to nodes, also affects performance. The scheduling strategy currently used in Hadoop is sensitive to the processing speed of slave nodes and may slow down the execution time of the entire job by 20% to 30% [9]. The conclusion is that, by proper tuning of the above five parameters, the performance of MR can be improved by a factor of 2.3 to 3.5 [9]. This is contrary to the studies stating that MR is inferior to parallel DBMSs in terms of performance; instead, MR can offer a competitive edge, as it is elastically scalable and efficient.

3. Optimization of Various Parameters

The parameters mentioned above have been identified on the basis of three design components: the programming model, the storage-independent design, and run-time scheduling.

a.1) Programming model: In MR, the data transformation logic is specified using the map() and reduce() functions. The programming model does not specify how to group the generated intermediate key-value pairs for reduce(). According to the designers, as noted in [4], specifying a data grouping algorithm is a difficult task, and this complexity should be hidden by the framework. MR's default sort-merge algorithm is not suitable for all analytical jobs, especially those that do not care about the order of intermediate keys, such as aggregation and join tasks, so other grouping algorithms are needed. For the aggregation task, a fingerprint-based algorithm has been suggested in [9]: for each key-value pair, a 32-bit integer fingerprint is generated; when a map task sorts the intermediate pairs, it compares the fingerprints of the keys and, only if they match, compares the original keys. In the same way, the reduce task merges the intermediate pairs, first by fingerprint and then by the original key of each group. For the join task in MR, a partition map-side join that uses the partition information has been evaluated in [9]; the implementation details of these algorithms on the Hadoop platform are given in [9]. MR needs to extend its programming model so that users can specify their own customized grouping algorithms. A sketch of the fingerprint comparison idea follows.
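As a rough illustration of the fingerprint idea, the sketch below compares 32-bit fingerprints first and falls back to a full byte-wise comparison only on a fingerprint match. [9] does not prescribe this exact code; the class name and the use of Arrays.hashCode as the fingerprint function are illustrative assumptions.

import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch of fingerprint-first key comparison for grouping.
public class FingerprintComparator implements Comparator<byte[]> {

    // Assumption: any stable 32-bit hash of the key bytes serves as the fingerprint.
    static int fingerprint(byte[] key) {
        return Arrays.hashCode(key);
    }

    @Override
    public int compare(byte[] a, byte[] b) {
        int fa = fingerprint(a);
        int fb = fingerprint(b);
        // Different fingerprints: the cheap integer comparison decides the
        // order, skipping the byte-by-byte scan of the original keys.
        if (fa != fb) {
            return Integer.compare(fa, fb);
        }
        // Equal fingerprints: distinct keys can collide, so compare the
        // original keys byte by byte to break the tie.
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return Byte.compare(a[i], b[i]);
            }
        }
        return Integer.compare(a.length, b.length);
    }
}

The resulting order is by fingerprint rather than lexicographic, which is acceptable for tasks such as aggregation that only need equal keys grouped together, not a particular key order.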

a.2) Storage independence: The MR programming model is independent of the storage system, which is beneficial in heterogeneous systems because it allows data from different storage systems to be analysed. In MR, a reader retrieves each record from the storage system and wraps it in a key-value pair. MR is a data processing system without any built-in storage engine, while parallel database systems use a query engine and a storage engine together: the query engine reads records directly from the storage engine, so no cross-engine calls are required, whereas in MR the processing engine needs to call the readers to load data, which results in lower performance. Three factors mentioned in subsection 2 have been identified on the basis of MR's storage-independent design: (I) the I/O mode, i.e. the way a reader retrieves data from the storage system; (II) data parsing, i.e. the scheme the reader uses to parse the format of records; and (III) indexing. The next subsections briefly describe the optimization of these three parameters to improve the performance of MR:

a.2.1) I/O mode: A reader in MR can choose either the direct I/O mode or the streaming I/O mode. In direct I/O mode, the reader reads data directly from the local disk through the DMA controller; this is more efficient because no IPC (inter-process communication) calls are required. In streaming I/O mode, the reader reads data from another running system process, such as the DataNode in Hadoop, through IPC schemes such as TCP/IP or JDBC. Streaming I/O is the only choice when the reader needs to read data from a remote node. As far as performance is concerned, direct I/O is more efficient than streaming I/O. That is one of the reasons for choosing direct I/O in parallel DBMSs, in which the query engine and the processing engine run inside the same database instance and share memory: when the query engine fills a memory buffer, it transfers the buffer directly to the processing engine for further processing. In MR, a hybrid I/O mode is implemented in HDFS, as Hadoop is the implementation platform of MR [9].
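As a rough sketch of the two modes, the snippet below contrasts reading a local block file straight from disk with reading through the HDFS client, which streams the bytes from the DataNode process over IPC. Both file paths are hypothetical.

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IoModes {
  public static void main(String[] args) throws Exception {
    byte[] buf = new byte[64 * 1024];

    // Direct I/O (sketch): read a local block file straight from disk,
    // with no inter-process communication. The path is hypothetical.
    try (InputStream direct = new FileInputStream("/data/dfs/blk_000001")) {
      while (direct.read(buf) != -1) { /* parse records here */ }
    }

    // Streaming I/O: go through the HDFS client, which fetches the bytes
    // from the DataNode process over a socket (an IPC/TCP round trip).
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf);
         InputStream streamed = fs.open(new Path("/user/input/part-00000"))) {
      while (streamed.read(buf) != -1) { /* parse records here */ }
    }
  }
}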

a.2.2) Record parsing: In MR, data parsing is the procedure by which the reader transforms the retrieved raw data into key-value pairs; the raw data must be decoded into data objects that can be processed further by a programming language such as Java. Two kinds of decoding schemes are used: (1) the immutable decoding scheme and (2) the mutable decoding scheme. The immutable scheme causes CPU overhead because it generates a unique immutable object for each record, a read-only object that cannot be modified, whereas the mutable decoding scheme decodes each record in its native storage format according to the schema and fills a mutable object with the new values. Only one data object is created for any number of decoded records, so the mutable scheme is faster than the immutable scheme, which pays the CPU overhead of the huge number of immutable objects generated, one for each decoded record.
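A minimal sketch of the two decoding schemes, using a hypothetical two-field record: the immutable path allocates a fresh object per record, while the mutable path reuses a single instance (much as Hadoop reuses its Writable objects).

// Hypothetical record with two fields, decoded from a line of raw input.
final class ImmutableRecord {
  final int id; final String name;
  ImmutableRecord(int id, String name) { this.id = id; this.name = name; }
}

class MutableRecord {
  int id; String name;
  void set(int id, String name) { this.id = id; this.name = name; }
}

public class Decoding {
  public static void main(String[] args) {
    String[] rawLines = {"1,alice", "2,bob", "3,carol"};

    // Immutable decoding: one fresh object per record, so allocation and
    // garbage-collection overhead grow with the number of records.
    for (String line : rawLines) {
      String[] f = line.split(",");
      ImmutableRecord r = new ImmutableRecord(Integer.parseInt(f[0]), f[1]);
      process(r.id, r.name);
    }

    // Mutable decoding: a single object is reused; each record simply
    // overwrites its fields, so no per-record allocation is needed.
    MutableRecord r = new MutableRecord();
    for (String line : rawLines) {
      String[] f = line.split(",");
      r.set(Integer.parseInt(f[0]), f[1]);
      process(r.id, r.name);
    }
  }
  static void process(int id, String name) { /* consume the record */ }
}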

a.2.3) Index: At first glance, MR may appear unable to utilize indexes [5], but this is not true: MR can utilize indexes to make data processing more efficient and faster. Indexes can be used in three ways. (A) By implementing a customized grouping algorithm, which can exploit the input in two ways depending on its form: (A1) range index: if the input consists of sorted files (as in HDFS or GFS), a range index can be used; (A2) naming information: if each input file follows certain naming rules, this naming information can be used by the grouping algorithm. The other two methods of indexing in MR are based on the type of input files. (B) If the input itself is a set of sorted, indexed files (B+-tree, hash), MR can process such files by implementing a new reader that takes a specific condition as input and applies it to the index to retrieve the needed records from each input file. (C) If the input is a set of indexed tables stored in n relational database servers, n map() tasks are needed to process those tables; each map() task submits a SQL query to one database server and thereby transparently utilizes the database indexes to retrieve the data. This scheme was first presented in [4] and demonstrated in [10].
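Scheme (C) can be sketched as a map-side reader that pushes a selection predicate down to a database server over JDBC, in the spirit of HadoopDB [10]; the connection URL, table name and predicate below are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class IndexedTableReader {
  // One map task handles one database server; the WHERE clause lets the
  // server use its own index instead of MR scanning the whole table.
  public static void readPartition(String jdbcUrl) throws Exception {
    String sql = "SELECT id, payload FROM events WHERE event_date = ?";
    try (Connection conn = DriverManager.getConnection(jdbcUrl);
         PreparedStatement stmt = conn.prepareStatement(sql)) {
      stmt.setString(1, "2014-04-07");
      try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
          // Wrap each row as a key-value pair for the rest of the MR job.
          emit(rs.getLong("id"), rs.getString("payload"));
        }
      }
    }
  }
  static void emit(long key, String value) { /* context.write(...) in a real job */ }
}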

a.2.4) Scheduling: MR uses run-time scheduling, in which the scheduler assigns data blocks to the available nodes for further processing, while parallel DBMSs use a compile-time scheduling strategy: when a query is submitted, the query optimizer generates a distributed query plan, and each node processes its data blocks according to that plan, so no scheduling cost is incurred after the plan is generated. MR's run-time scheduling strategy is more expensive and slows down performance, but it gives MR its elastically scalable behaviour, i.e. the ability to dynamically adjust resources during job execution. Run-time scheduling depends on two parameters: (1) the size of the data blocks and (2) the scheduling algorithm. To improve performance, it is recommended to use large data blocks, because larger blocks mean fewer blocks to schedule for the same volume of data. For the second parameter, an appropriate scheduling algorithm should be used to recover the performance lost to run-time scheduling; researchers are working on this. By proper tuning of the above parameters, the performance of MR can be improved.
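As one concrete knob, the sketch below raises the HDFS block size for files written by a job using Hadoop's Configuration API. This is an assumption-laden illustration: the property name varies across versions ("dfs.block.size" in Hadoop 1.x, "dfs.blocksize" in 2.x), the 256 MB value is arbitrary, and the setting affects newly written files rather than re-chunking existing ones.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class BlockSizeTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Larger blocks mean fewer map tasks for the same data volume, so the
    // run-time scheduler has fewer assignment decisions to make.
    conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // Hadoop 2.x property name
    Job job = Job.getInstance(conf, "tuned job");
    // ... configure mapper, reducer, input and output paths as usual ...
  }
}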

IV. LITERATURE SURVEY

McKinsey's Business Technology Office and the McKinsey Global Institute (MGI) have conducted a study on Big Data in five domains: healthcare in the US, the public sector in Europe, retail in the United States, manufacturing, and personal-location data globally. Their study shows that Big Data can generate value in each listed domain; for example, by using Big Data a retailer can increase its operating margin by 60% [11]. The same study found that the U.S. faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of Big Data [12]. MapReduce plays a significant role in solving the issues related to Big Data. MapReduce was proposed by Dean and Ghemawat in [4] as a programming model for processing large amounts of data. MapReduce can also be used as an analytic tool for processing relational queries, as has been confirmed in [13][14][15][16][17].

V. CONCLUSION AND FUTURE SCOPE

We conclude that Big Data plays a very significant role in industry for making better business decisions, but the main issue with Big Data is its analysis. MR plays a significant role in Big Data analysis and offers a competitive edge equal to that of parallel DBMSs. The performance of MR can be improved by proper tuning of its performance-affecting parameters, but further research is still required in this direction: researchers need to provide proper grouping algorithms and scheduling algorithms to make MR more effective performance-wise.

REFERENCES

[1] http://hadoop.apache.org.

[2] http://developer.yahoo.net/blogs/hadoop/2008/09/.

[3] Improving Decision Making in the World of Big Data. http://www.forbes.com/sites/christopherfrank/2012/03/25/improvingdecision-making-in-the-world-of-big-data/.

[4] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI, pages 137-150, 2004.

[5] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD, pages 165-178. ACM, 2009.

[6] J. Dean and S. Ghemawat. MapReduce: a flexible data processing tool. Commun. ACM, 53(1):72-77, 2010.

[7] Jeffery Dean and Sanjay Ghemawat. MapReduce: A Flexible Data Processing Tool. Commun. ACM, 53(1):72-75, ACM, 2010.

[8] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and parallel DBMSs: friends or foes? Commun. ACM, 53(1):64-71, 2010.

[9] Dawei Jiang, Beng Chin Ooi, Lei Shi, and Sai Wu. The performance of MapReduce: an in-depth study. Proc. VLDB Endow., 3(1):110-120, 2010.

[10] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow., 2(1):922-933, 2009.

[11] Big Data: The next frontier for innovation, competition and productivity. http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation.

[12] Big Data: The next frontier for competition. http://www.mckinsey.com/features/big_data.

[13] H.-C. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker. Map-Reduce-Merge: simplified relational data processing on large clusters. In SIGMOD, pages 121-134, 2007.

[14] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a not-so-foreign language for data processing. In SIGMOD, pages 1099-1110. ACM, 2008.

[15] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: easy and efficient parallel processing of massive data sets. PVLDB, 1(2):1265-1276, 2008.

[16] D. J. DeWitt, E. Paulson, E. Robinson, J. Naughton, J. Royalty, S. Shankar, and A. Krioukov. Clustera: an integrated computation and data management system. Proc. VLDB Endow., 1(1):28-41, 2008.

[17] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive - a warehousing solution over a map-reduce framework. PVLDB, 2(2):1626-1629, 2009.

[18] S. Babu. Towards automatic optimization of MapReduce programs. In SoCC, pages 137-142. ACM, 2010.

[19] Amazon Elastic Compute Cloud (Amazon EC2). http://aws.amazon.com/ec2/.

[20] Changqing Ji, Yu Li, Wenming Qiu, Uchechukwu Awada, and Keqiu Li. Big data processing in cloud computing environments. In International Symposium on Pervasive Systems, Algorithms and Networks. IEEE, 2012.

[21] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In OSDI, pages 29-42, 2008.

[22] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. Bigtable: a distributed storage system for structured data. In OSDI, pages 305-314, 2006.
