Hindawi BioMed Research International, Volume 2018, Article ID 7501042, 17 pages, https://doi.org/10.1155/2018/7501042

Research Article

Parallel MapReduce: Maximizing Cloud Resource Utilization and Performance Improvement Using Parallel Execution Strategies

Ahmed Abdulhakim Al-Absi,1,2 Najeeb Abbas Al-Sammarraie,2 Wael Mohamed Shaher Yafooz,2 and Dae-Ki Kang3

1 Department of Smart Computing, Kyungdong University, Global Campus, 46 4-gil, Gosung, Gangwondo 24764, Republic of Korea
2 Faculty of Computer and Information Technology, Al-Madinah International University, 2 Jalan Tengku Ampuan Zabedah E/9E, 40100 Shah Alam, Selangor, Malaysia
3 Department of Computer & Information Engineering, Dongseo University, 47 Jurye-ro, Sasang-gu, Busan 47011, Republic of Korea

Correspondence should be addressed to Dae-Ki Kang; [email protected]

Received 12 April 2018; Accepted 30 September 2018; Published 17 October 2018

Academic Editor: Gerald J. Wyckoff

Copyright © 2018 Ahmed Abdulhakim Al-Absi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

MapReduce is the preferred cloud computing framework for large-scale data analysis and application processing. MapReduce frameworks currently in place suffer performance degradation because they adopt sequential processing approaches with little modification and thus exhibit underutilization of cloud resources. To overcome this drawback and reduce costs, we introduce a Parallel MapReduce (PMR) framework in this paper. We design a novel parallel execution strategy for Map and Reduce worker nodes. Our strategy enables further performance improvement and efficient utilization of cloud resources by executing the Map and Reduce functions in parallel on the multicore environments available with computing nodes. We explain in detail the makespan modeling and working principle of the PMR framework. The performance of PMR is compared with Hadoop through experiments considering three biomedical applications. Experiments conducted for the BLAST, CAP3, and DeepBind biomedical applications report makespan time reductions of 38.92%, 18.00%, and 34.62%, respectively, for the PMR framework against the Hadoop framework. The experimental results prove that the proposed PMR cloud computing platform is robust, cost-effective, and scalable, and sufficiently supports diverse applications on public and private cloud platforms. Moreover, the overall results indicate a good match between the theoretical makespan model presented and the experimental values investigated.

1. Introduction

The delivery of data-intensive applications/services on cloud platforms is the new paradigm. The scalable storage and computing capabilities of cloud platforms aid such delivery models in various respects. The cloud is maintained using distributed computing frameworks capable of handling and processing large amounts of data. Of all the cloud frameworks available [1–5], Hadoop MapReduce is the most widely adopted [6, 7] owing to its ease of deployment, scalability, and open-source nature.

The Hadoop MapReduce model predominantly consists of the following phases: Setup, Map, Shuffle, Sort, and Reduce, as shown in Figure 1. The Hadoop framework consists of a master node and a cluster of computing nodes. Jobs submitted to Hadoop are further distributed into Map and Reduce tasks.
In the Setup phase, the input data of a job to be processed (generally residing on the Hadoop Distributed File System (HDFS)) is logically partitioned into homogeneous volumes called chunks for the Map worker nodes. Hadoop divides each MapReduce job into a set of tasks, where each chunk is processed by a Map worker. The Map phase takes a key/value pair $(k_1, v_1)$ as input and generates a list of intermediate key/value pairs $(k_2, v_2)$ as output. The Shuffle phase begins with the completion of the Map phase and collects the intermediate key/value pairs from all the Map tasks.




Figure 1: Hadoop MapReduce computation model. Input key/value pairs from data stores are processed by Map workers; intermediate values are shuffled and sorted by output key and aggregated by Reduce workers into final (k, v) outputs.

A Sort operation is performed on the intermediate key/value pairs of the Map phase. For simplicity, the Sort and Shuffle phases are cumulatively considered as the Shuffle phase. The Reduce phase processes the sorted intermediate data based on user-defined functions. The output of the Reduce phase is stored/written to HDFS.
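To make the phase flow above concrete, the following minimal Python sketch emulates the Map, Shuffle/Sort, and Reduce phases for a toy word-count job. The function names and the word-count example are illustrative only; they are not taken from the Hadoop codebase or from the paper.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical user-defined functions: Map emits intermediate (k2, v2)
# pairs; Reduce folds all values that share one intermediate key.
def map_fn(k1, v1):
    # word count example: k1 is a line offset, v1 is the line text
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, values):
    return (k2, sum(values))

def run_job(records):
    # Map phase: apply map_fn to every (k1, v1) input pair
    intermediate = [kv for k1, v1 in records for kv in map_fn(k1, v1)]
    # Shuffle/Sort phase: group the intermediate pairs by key
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: apply reduce_fn per key and collect the final output
    return [reduce_fn(k2, [v for _, v in group])
            for k2, group in groupby(intermediate, key=itemgetter(0))]

print(run_job([(0, "map reduce map")]))  # [('map', 2), ('reduce', 1)]
```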

The Hadoop MapReduce platform suffers from a number of drawbacks. The preconfigured memory allocator for Hadoop jobs leads to issues of buffer concurrency amongst jobs and heavy disk read seeks; these memory allocator issues increase makespan time and induce high input/output (IO) overheads [5]. The jobs scheduled on Hadoop cloud environments do not consider parameters such as memory requirements and multicore environments for linear scalability, which seriously affects performance [8]. In Hadoop, the Reduce tasks are started only after completion of all Map tasks. Hadoop assumes homogeneous Map execution times over homogeneously distributed data, which is not realistic [9]. The assumed homogeneous Map execution times and the serial execution strategy leave Map workers (and their resources) idle once they have completed their tasks, waiting for the other Map workers to complete theirs [10]. In cloud environments, where organizations/users are charged according to the (storage, computation, and communication) resources utilized, these issues burden costs in addition to affecting performance [11]. Hadoop platforms do not support flexible pricing [12]. Scalability is an issue owing to the cluster based nature of Hadoop platforms, and processing of streaming data is also an issue with Hadoop [10]. To overcome these drawbacks, researchers have adopted various techniques.

In [5], the authors addressed the issues related to Hadoop memory management by adopting a global memory management technique. They proposed a prioritization model of memory allocation and revocation by adopting a rule based heuristic approach. A multithread execution engine is used to achieve global memory management. To address the garbage collection issue of the Java virtual machine (JVM) and to improve the data access rate in Hadoop, they adopted a multicache mechanism for sequential and interleaved disk access. Their model improves memory utilization and balances the performance of IO and CPU. However, the authors of [5] did not take network IO performance into consideration.

In [8], a GPU based model to address the linear scalability issue of Hadoop is presented. The authors addressed the research challenges of integrating Hadoop with GPUs and of executing MapReduce jobs on CUDA based GPUs. In the Hadoop MapReduce framework, jobs run inside a JVM; managing, creating, and executing jobs suffer from computation overhead and reduce the efficiency of Just-In-Time (JIT) compilation due to the short-lived nature of jobs in the JVM. To overcome this, they adopted GPU based job execution approaches such as JNI, JCuda, Hadoop Pipes, and Hadoop Streaming, and they analyzed and evaluated these approaches in a detailed comparison of their pros and cons.

To address issues related to sequential execution, a Cloud MapReduce (CMR) framework is discussed in [13]. Here, the authors developed a parallelized model adopting a pipelined execution approach to process streaming and batch data. Their cloud based MapReduce model supports parallelism between the Map and Reduce phases and also among individual jobs.

The increased demand for data analytics on scientific/bioinformatics data has resulted in increased sizes of bioinformatics datasets. Computing and storing these huge data volumes require a huge infrastructure. Adopting a cloud platform to compute bioinformatics applications is a viable option for analyzing the genomic structure and evolutionary patterns of large bioinformatics datasets [14–18] generated by Next Generation Sequencing (NGS) technologies. Various cloud based bioinformatics applications have been developed to compute large bioinformatics data: CloudAligner [18], CloudBurst [19], Myrna [20], and Crossbow [21]. Cloud technologies allow users to compute bioinformatics applications and charge them based on their usage. Reducing the computation cost in such environments is an area that needs to be considered when designing a bioinformatics computation model.

Reduced execution times and effective resource utilization with minimal costs are always desired features of cloud computing frameworks. To achieve this goal, a Parallel MapReduce (PMR) framework is proposed in this paper. PMR adopts a parallel execution strategy similar to the technique presented in [13]. In conventional MapReduce systems, the Map phase is executed first and only then is Reduce phase execution considered. In the proposed PMR, Reduce phase execution is initiated in a parallel fashion as soon as two or more Map worker nodes have completed their tasks. The adoption of such an execution strategy reduces unutilized worker resources. To further reduce the makespan, parallel execution of the Map and Reduce functions is adopted, utilizing the multicore environments available with the nodes. A makespan model describing the operation of PMR is presented in the following sections. Bioinformatics applications are synonymous with big data, and processing of such computationally heavy applications is considered on cloud platforms, as investigated in [22, 23]. Performance evaluation of the PMR framework is carried out using bioinformatics applications. The major contributions can be summarized as follows:

(i) Makespan modeling and design of the PMR cloud framework
(ii) Parallel execution strategy of the Map and Reduce phases
(iii) Maximizing cloud resource utilization by computing on multicore environments in Map and Reduce
(iv) Performance evaluation on state-of-the-art biomedical applications like BLAST, CAP3, and DeepBind
(v) Experiments considering diverse cloud configurations and varied application configurations
(vi) Correlation between the theoretical makespan model and experimental values

Besides the data management and computing issues, there exist numerous security issues and challenges in provisioning security in cloud computing environments and in ensuring ethical treatment of biomedical data. When MapReduce is carried out in distributed settings, users maintain very little control over these computations, causing several security and privacy concerns. MapReduce activities may be subverted or compromised by malicious or cheating nodes. Such security issues have been discussed and highlighted by many researchers, as in [24–26]. However, addressing security issues is beyond the scope of this paper.

The paper is organized as follows. In Section 2, the related works are discussed. In Section 3, the proposed PMR framework is presented. The results and the experimental study are presented in the penultimate section. The concluding remarks are discussed in the last section.

2. Literature Review

D. Dahiphale et al. [13] presented a cloud based MapReduce model to overcome the shortcomings of the Hadoop MapReduce model, which are as follows: Hadoop processes the Map and Reduce phases sequentially; scalability is not efficient due to its cluster based computing mechanism; processing of stream data is not supported; and lastly, it does not support flexible pricing. To overcome the issue of sequential execution, they proposed a cloud based Parallel MapReduce model where the tasks are executed by Amazon EC2 instances (virtual machine workers). To process stream and batch data in a parallel manner, a pipelining model is adopted, which provides flexible pricing by using Amazon cloud Spot Instances. Experimental results show that the CMR model processes tasks in a parallel manner, improves throughput, and shows a speedup improvement of 30% over the Hadoop MapReduce model for larger datasets.

X. Shi et al. [5] presented a framework for memory intensive computing to overcome the shortcomings of the Hadoop MapReduce model. In Hadoop, tasks are executed based on the available CPU cores, and memory is allocated based on a preset configuration; this leads to memory bottlenecks due to buffer concurrency and heavy disk seeks, resulting in IO wait occupancy which further increases the makespan time. To address this, they presented a rule based heuristic model to prioritize memory allocation and revocation for global memory management. They presented a multithread approach for which they developed disk access serialization and a multicache technique for efficient garbage collection in the JVM. The experimental study shows that the execution time of memory intensive computation is improved by 40% over the Hadoop MapReduce model.

Babak Alipanahi et al. [27] presented a model adopting deep learning techniques for pattern discovery for DNA- and RNA-binding proteins. The specificity of a protein is generally described using position weight matrices (PWMs), and learning sequence specificity in high-throughput settings has the following challenges. Firstly, there is the varied nature of data from different sources; for example, chromatin immunoprecipitation provides a ranked list of putatively bound sequences of varying length for each experiment, RNAcompete assays and protein binding microarrays provide a specificity coefficient, and HT-SELEX produces a set of very highly similar sequences. Secondly, each data provider has its unique biases, artifacts, and limitations, for which the pertinent specificity needs to be identified. Lastly, the data are huge in size, which requires a computation model that integrates all data from the different sources. To overcome these challenges, they presented a model, namely, DeepBind, whose characteristics are as follows. It is applicable to both sequence and microarray data and works well across different technologies without correcting for technology-specific biases. Sequences are processed in a parallel manner using a graphics processing unit (GPU), which can train the prediction model automatically and can withstand a modest degree of noisy and incorrectly labeled training data. Experiments are conducted on in vitro data for both training and testing, which shows that the DeepBind model is a scalable, modular pattern discovery technique based on deep learning which does not depend on application-specific heuristics such as "seed finding".

K. Mahadik et al. [28] presented a parallelized BLAST model to overcome issues related to mpiBLAST, which are as follows. mpiBLAST segments the database and processes each short query in parallel, but the rapid growth of NGS has resulted in increased sequence sizes (long query sequences), which can be millions of protein/nucleotide sequences; this limits mpiBLAST, resulting in scheduling overhead and increased makespan time. mpiBLAST completes short queries faster than large queries, which creates improper load balancing among nodes. To address this, they presented a parallel model of BLAST, namely, ORION, splitting individual queries into overlapping fragments to process large query sequences on the Hadoop MapReduce platform. Experimental outcomes show that their model achieves a speedup of 12.3x over mpiBLAST without compromising accuracy.

J. Ekanayake et al. [29] presented cloud based MapReduce models, namely, Microsoft DryadLINQ and Apache Hadoop, for bioinformatics applications, and compared them with the existing MPI framework. Pairwise Alu sequence alignment and the CAP3 [30] application are considered. To evaluate the scheduling performance of these frameworks, an inhomogeneous dataset is considered. Their outcomes show that the two cloud frameworks have a significant advantage over MPI in terms of fault tolerance, parallel execution by adopting the MapReduce framework, robustness, and flexibility, since MPI is memory based whereas the DryadLINQ and Hadoop models are file oriented. Experimental analysis is conducted for varied sequence sizes, and the results show that Hadoop performs better than DryadLINQ on inhomogeneous data for both applications.

Y. Wu et al. [31] presented an outlier-based execution strategy, NO2, for computation intensive applications in order to reduce the makespan and overhead of computation. Many existing approaches that adopt a MapReduce framework are suitable only for data intensive applications, since their scheduler state is defined by IO status. They designed a framework for computation intensive tasks by adopting instrumentation to detect task progress and an automatic instrument point selector to reduce overhead; finally, for outlier detection without resorting to biased progress calculation, K-means is adopted. The NO2 framework is evaluated using the applications CAP3 and ImageMagick on both a local cluster and a cloud environment. Their threshold based outlier model improves task completion time by 25% with minimal overhead.

3. The Proposed PMR Framework

The PMR framework incorporates functions similar to those available in conventional MapReduce frameworks. Accordingly, the Map, Shuffle (including Sort), and Reduce phases exist in PMR. For the sake of representation simplicity, the Shuffle and Reduce phases are cumulatively considered in the Reduce phase. The Map phase takes input data for processing and generates a list of key/value pairs as results: $M(key_1, val_1) \rightarrow list(key_2, val_2)$. Each generated key $key_2$ and its list of associated values are passed to a reducer function. The reducer function takes the intermediate key $key_2$, processes its values, and generates a new set of values $list(val_3)$.

PMR job execution is performed on multiple virtual machines forming a computing cluster, where one is a master node and the others are worker (slave) nodes. The master node distributes and monitors tasks among the worker nodes. Worker nodes periodically send their resource utilization details to the master node, and the master node schedules tasks based on the availability of worker resources.

To minimize the makespan of job execution and maximize utilization of the cloud resources available with worker nodes, the proposed PMR adopts a parallel execution strategy, i.e., Reduce phase execution is initiated in a parallel fashion as soon as two or more Map worker nodes have completed their tasks. As the worker nodes are considered to have more than one computing core, the PMR framework also executes the Map and Reduce functions in parallel within each node, utilizing the available multicore environments. A makespan model of the proposed PMR is described and presented in Section 3.1.
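A minimal sketch of this early-Reduce rule, as we understand it from the description above, is given below; the class and callback names, and the trigger threshold parameter, are hypothetical, since the PMR source is not published.

```python
import threading

class PMRMaster:
    """Sketch of the PMR master's scheduling rule: Reduce workers are
    launched as soon as two or more Map workers finish, instead of
    waiting for the entire Map phase as in stock Hadoop."""

    def __init__(self, num_map_workers, start_reduce_phase, trigger=2):
        self.num_map_workers = num_map_workers
        self.completed = 0
        self.trigger = trigger                  # >= 2 finished Map workers
        self.start_reduce_phase = start_reduce_phase
        self.reduce_started = False
        self.lock = threading.Lock()

    def on_map_worker_done(self, worker_id, intermediate_path):
        # Called from each Map worker's heartbeat upon task completion.
        with self.lock:
            self.completed += 1
            if not self.reduce_started and self.completed >= self.trigger:
                self.reduce_started = True
                # Reduce workers begin shuffling already-available output
                self.start_reduce_phase(intermediate_path)
```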

A PMR job is a combination of Map tasks and Reduce tasks. The input dataset is split into uniform block-sized data called chunks, which are distributed among the computing nodes. In PMR, the chunk obtained by a node is further split to parallelize execution of the user-defined Map and Reduce functions. The user-defined Map function is applied to the input, and intermediate output is generated, which is the input data for the Reduce task. The Reduce stage is a combination of two phases, Shuffle and Reduce: output data generated by the Map tasks is fed as input to the Shuffle phase, where the output of completed Map tasks is shuffled and then sorted. The sorted data is then fed into the user-defined Reduce function, and the generated output is written back to cloud storage.
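The intra-worker splitting just described can be approximated as in the sketch below; process_chunk and user_map_fn are hypothetical names, and the even slicing is an illustrative simplification rather than the actual PMR implementation.

```python
from multiprocessing import Pool, cpu_count

def process_chunk(chunk, user_map_fn):
    """Sketch of PMR's intra-worker parallelism: the chunk assigned to a
    worker node is split again across the node's p cores, and the
    user-defined Map function runs on each sub-chunk in parallel."""
    p = cpu_count()                                   # cores per worker node
    size = max(1, len(chunk) // p)
    sub_chunks = [chunk[i:i + size] for i in range(0, len(chunk), size)]
    with Pool(processes=p) as pool:
        partial_results = pool.map(user_map_fn, sub_chunks)
    # Merge the per-core intermediate results ahead of the Shuffle phase
    return [kv for part in partial_results for kv in part]
```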

A Map function, in terms of computation time and input/output data dependencies, can be represented as a tuple

$$(\overrightarrow{Size}^{\,M}_{in},\ S_M,\ M_{\downarrow},\ M_{\rightarrow},\ M_{\uparrow}) \quad (1)$$

where $\overrightarrow{Size}^{\,M}_{in}$ is the average input data processed by each Map worker. The variables $M_{\downarrow}$, $M_{\rightarrow}$, and $M_{\uparrow}$ represent the minimum, average, and maximum computation times of the Map function. The output of the Map function, stored in the cloud to be processed by Reduce workers, is represented as a ratio between output and input data, $S_M$.

Similarly, the PMR Reduce function is represented as

$$(S_M,\ S_R,\ R_{\downarrow},\ R_{\rightarrow},\ R_{\uparrow}) \quad (2)$$

where $S_M$ is the output data of the Map functions stored in the cloud (represented as a ratio). The output of the Reduce function, or of the task assigned to PMR, is represented as $S_R$ (the ratio of Reduce output to input). The minimum, average, and maximum computation times of the Reduce function are $R_{\downarrow}$, $R_{\rightarrow}$, and $R_{\uparrow}$. The Reduce stage incorporates the Shuffle and Sort operations.

Reduced execution time and minimized cost of cloud usage are always desirable attributes. In this paper, a makespan model describing the operation of the PMR framework is presented. Obtaining actual makespan times is very complex and is always a challenge: a number of dependencies exist, such as hardware parameters, network conditions, cluster node performance, cloud storage parameters, and data transfer rates. The makespan model of PMR described below considers only the functional changes incorporated to improve performance over conventional MapReduce frameworks. The modeling described below is based on the work presented in [32].

3.1. PMR Makespan Model

3.1.1. Preliminaries and Makespan Bound Establishment. The makespan function is computed as the time required to complete a job of a given input data size with the number of resources allocated to PMR. Let us consider a job $J$ to be executed on the PMR cloud platform considering data $D$. Let the cloud platform have $n + 1$ nodes/workers, and let each worker have $p$ cores that can be utilized for computation. One worker node acts as the master node, leaving $n$ workers to perform Map and Reduce computation. Job $J$ is distributed and computed using $x$ Map and Reduce tasks. The data $D$ is accordingly split into $x$ chunks, represented as $D'$. In conventional MapReduce platforms, $D' = D/n$; PMR considers a similar approach for computing $D'$. The times utilized to complete the $x$ tasks are represented as $\mathcal{T}_1, \mathcal{T}_2, \dots, \mathcal{T}_x$. In the proposed PMR framework, the chunks $D'$ are further split into $D'' = D'/p$ for parallel execution. In PMR, execution of the $x$th task takes $\mathcal{T}_x = \mathcal{T}_{\uparrow} = \max\{t_1, \dots, t_p\}$, where $t_p$ represents execution of the task on the $p$th core considering the corresponding data $D''$.
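As an illustration, with numbers not taken from the paper, consider a 64 GB job on $n = 4$ worker nodes with $p = 4$ cores each:

$$D = 64\ \text{GB}, \qquad D' = \frac{D}{n} = 16\ \text{GB per worker}, \qquad D'' = \frac{D'}{p} = 4\ \text{GB per core}$$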

The average ($\sigma$) and maximum ($\beta$) time taken by the $x$ tasks to complete job $J$ can be represented as

$$\sigma = \frac{\sum_{j=1}^{x} \mathcal{T}_j}{x}, \qquad \beta = \max_{j} \mathcal{T}_j \quad (3)$$

Let us consider an optimistic scenario in which the $x$ tasks are uniformly distributed among $q$ worker nodes (the minimum time taken to process $(x \times \sigma)$ work). The overall time taken to compute these tasks is $(x \times \sigma)/q$, and it is the lower bound time:

$$\mathcal{T}^{lb} = \frac{x \times \sigma}{q} \quad (4)$$

To compute the upper bound time, a pessimistic scenario is considered in which the longest processing task $\overleftarrow{\mathcal{T}} \in \{\mathcal{T}_1, \mathcal{T}_2, \dots, \mathcal{T}_x\}$, with makespan $\beta$, is the last task processed. The time taken before the last task $\overleftarrow{\mathcal{T}}$ begins is upper bounded as follows:

$$\frac{\sum_{j=1}^{x-1} \mathcal{T}_j}{q} \le \frac{(x - 1) \times \sigma}{q} \quad (5)$$

Therefore, the overall timespan including this longest task $\overleftarrow{\mathcal{T}}$ is upper bounded as

$$\mathcal{T}^{ub} = \frac{(x - 1) \times \sigma}{q} + \beta \quad (6)$$

The probable range of job makespan due to nondeterminism and scheduling is obtained as the difference between the lower and upper bounds. This difference is small when the time taken by the longest task is trivial compared to the overall makespan, i.e., $\beta \ll (x \times \sigma)/q$.
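As a worked example with hypothetical durations, suppose $x = 4$ tasks with times $\{10, 12, 14, 24\}$ s run on $q = 2$ workers. Then $\sigma = (10 + 12 + 14 + 24)/4 = 15$ s and $\beta = 24$ s, so (4) and (6) give

$$\mathcal{T}^{lb} = \frac{4 \times 15}{2} = 30\ \text{s}, \qquad \mathcal{T}^{ub} = \frac{(4 - 1) \times 15}{2} + 24 = 46.5\ \text{s}$$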

3.1.2. Makespan of a Job on the PMR Framework. Let us consider a job $\mathcal{J}$ submitted to the PMR framework; the job $\mathcal{J}$ is split into $X^{\mathcal{J}}_{M}$ Map tasks and $X^{\mathcal{J}}_{R}$ Reduce tasks. Let $S^{\mathcal{J}}_{M}$ and $S^{\mathcal{J}}_{R}$ represent the numbers of Map and Reduce workers allocated for the $\mathcal{J}$th job.

To compute the makespan of the Map tasks, the lower and upper bounds are computed. Using (3), the average ($M_{\rightarrow}$) and maximum ($M_{\uparrow}$) makespans of the Map tasks of job $\mathcal{J}$ are computed. Using $M_{\uparrow}$ and $M_{\rightarrow}$ in (4), the lower bound of the Map phase, i.e., $\mathcal{T}^{lb}_{M}$, is defined as

$$\mathcal{T}^{lb}_{M} = \frac{X^{\mathcal{J}}_{M} \times M_{\rightarrow}}{S^{\mathcal{J}}_{M}} \quad (7)$$

Similarly, the upper bound $\mathcal{T}^{ub}_{M}$, or the maximum execution time of the Map phase in PMR, is defined using (6) as

$$\mathcal{T}^{ub}_{M} = \frac{(X^{\mathcal{J}}_{M} - 1) \times M_{\rightarrow}}{S^{\mathcal{J}}_{M}} + M_{\uparrow} \quad (8)$$

Considering the lower ($\mathcal{T}^{lb}_{M}$) and upper ($\mathcal{T}^{ub}_{M}$) bounds computed, the makespan of the Map phase in PMR is computed as

$$\overrightarrow{\mathcal{T}}_{M} = \frac{\mathcal{T}^{ub}_{M} + \mathcal{T}^{lb}_{M}}{2} \quad (9)$$

The average makespan of each Map worker node is computed as

$$\overrightarrow{\mathcal{T}}_{M_{\rightarrow}} = \frac{\overrightarrow{\mathcal{T}}_{M}}{S^{\mathcal{J}}_{M}} \quad (10)$$

The makespan of the PMR Map phase consisting of $S^{\mathcal{J}}_{M} = q$ worker nodes is shown in Figure 2, which also illustrates the makespan bounds, i.e., $\mathcal{T}^{ub}_{M}$ and $\mathcal{T}^{lb}_{M}$.

The Reduce workers are initiated when at least two Map worker nodes have finished their computational tasks.

Figure 2: Map phase makespan of the PMR framework, showing Map workers 1 through q, the bounds $\mathcal{T}^{lb}_{M}$ and $\mathcal{T}^{ub}_{M}$, and their difference $(\mathcal{T}^{ub}_{M} - \mathcal{T}^{lb}_{M})$.

The Reduce phase is initiated at the $(\mathcal{T}^{ub}_{M} - \mathcal{T}^{lb}_{M})$ time instant. Intermediate data generated by Map worker nodes is processed using the defined Shuffle, Sort, and Reduce functions. The average execution time $R_{\rightarrow}$ and the maximum execution time $R_{\uparrow}$ of the Reduce phase considering $S^{\mathcal{J}}_{R}$ workers are derived using (3). The makespan bounds of the Reduce phase (the lower bound is represented as $\mathcal{T}^{lb}_{R}$ and the upper bound as $\mathcal{T}^{ub}_{R}$) are computed as follows:

$$\mathcal{T}^{lb}_{R} = \frac{X^{\mathcal{J}}_{R} \cdot R_{\rightarrow}}{S^{\mathcal{J}}_{R}} \quad (11)$$

$$\mathcal{T}^{ub}_{R} = \frac{(X^{\mathcal{J}}_{R} - 1) \cdot R_{\rightarrow}}{S^{\mathcal{J}}_{R}} + R_{\uparrow} \quad (12)$$

The makespan of the $\mathcal{J}$th job on the PMR framework is the sum of the time taken to execute the Map tasks and the time taken to execute the Reduce tasks. Considering the best-case scenario (lower bound), the minimum makespan observed is

$$\mathcal{T}^{lb}_{\mathcal{J}} = \mathcal{T}^{lb}_{M} + \mathcal{T}^{lb}_{R} - (\mathcal{T}^{ub}_{M} - \mathcal{T}^{lb}_{M}) \quad (13)$$

Simplifying (13), we get

$$\mathcal{T}^{lb}_{\mathcal{J}} = 2\mathcal{T}^{lb}_{M} + \mathcal{T}^{lb}_{R} - \mathcal{T}^{ub}_{M} \quad (14)$$

Considering the worst computing performance, the upper bound, or maximum makespan observed, is

$$\mathcal{T}^{ub}_{\mathcal{J}} = \mathcal{T}^{ub}_{M} + \mathcal{T}^{ub}_{R} - (\mathcal{T}^{ub}_{M} - \mathcal{T}^{lb}_{M}) \quad (15)$$

$$\mathcal{T}^{ub}_{\mathcal{J}} = \mathcal{T}^{lb}_{M} + \mathcal{T}^{ub}_{R} \quad (16)$$

The makespan of job $\mathcal{J}$ on the PMR framework is defined as

$$\overrightarrow{\mathcal{T}}_{\mathcal{J}} = \frac{\mathcal{T}^{ub}_{\mathcal{J}} + \mathcal{T}^{lb}_{\mathcal{J}}}{2} \quad (17)$$

Using (14) and (16), the makespan of $\mathcal{J}$ is

$$\overrightarrow{\mathcal{T}}_{\mathcal{J}} = \frac{(\mathcal{T}^{lb}_{M} + \mathcal{T}^{ub}_{R}) + (2\mathcal{T}^{lb}_{M} + \mathcal{T}^{lb}_{R} - \mathcal{T}^{ub}_{M})}{2} = \frac{3\mathcal{T}^{lb}_{M} + \mathcal{T}^{lb}_{R} + \mathcal{T}^{ub}_{R} - \mathcal{T}^{ub}_{M}}{2} \quad (18)$$

3.1.3. Modeling Data Dependency on Makespan. According to [30, 31], data dependency can be modeled using linear regression; a similar approach is adopted here. The average makespan of the $x$th Map worker node is defined as

$$M_{x\rightarrow} = v_{M_0} + \sum_{w=1}^{S^{\mathcal{J}}_{M}} \left( v_{M_w} \left( \frac{D'}{p} \right) \right) \quad (19)$$

where the $v_{M_*}$ are application-specific variables, i.e., they are dependent on the user-defined Map function.

The average makespan of the $x$th Reduce worker node is defined as

$$R_{x\rightarrow} = v_{R_0} + \sum_{w=1}^{S^{\mathcal{J}}_{R}} \left( v_{R_w} \left( \frac{d'}{p} \right) \right) \quad (20)$$

where the $v_{R_*}$ are variables specific to the user-defined Reduce functions, and $d'$ represents the intermediate output data obtained from the Map phase; for parallel execution, and to utilize all resources, it is further split, similar to the Map phase.

Along similar lines, the average and maximum execution times of Map and Reduce workers are computed. The data-dependent computations of $M_{\rightarrow}$, $M_{\uparrow}$, $R_{\rightarrow}$, and $R_{\uparrow}$ are used in (14), (16), and (18) to compute the makespan of the $\mathcal{J}$th job on the PMR framework considering data $D$. Additional details of data dependency modeling using linear regression, along with the proof of the model, are presented in [33].
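A minimal sketch of how the coefficients of eq. (19) could be fitted from calibration runs is shown below. It fits only the simplified single-slope case $M \approx v_0 + v_w (D'/p)$; the chunk sizes, timings, and names are hypothetical.

```python
import numpy as np

def fit_map_makespan(chunk_sizes, observed_times, p):
    """Least-squares fit of v_0 and v_w in M ~= v_0 + v_w * (D'/p), using
    average Map makespans observed at several input sizes."""
    per_core = np.asarray(chunk_sizes, dtype=float) / p          # D'/p
    A = np.column_stack([np.ones_like(per_core), per_core])
    (v0, vw), *_ = np.linalg.lstsq(A, np.asarray(observed_times, dtype=float),
                                   rcond=None)
    return v0, vw

# Hypothetical calibration runs: chunk sizes in MB vs. measured seconds
v0, vw = fit_map_makespan([256, 512, 1024, 2048], [12.0, 21.5, 41.0, 80.5], p=4)
print(f"predicted makespan for a 4096 MB chunk: {v0 + vw * 4096 / 4:.1f} s")
```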


4. Performance Evaluation

Experiments conducted to evaluate the performance of PMR are presented in this section. The performance of PMR is compared with the state-of-the-art Hadoop framework; Hadoop is the most widely used/adopted MapReduce platform for computing in cloud environments [34], hence it is considered for comparisons. The PMR framework is developed using VC++, C, and Node.js and deployed on the Azure cloud. Hadoop 2 (version 2.6) is used and deployed on the Azure cloud using HDInsight. The PMR framework is deployed with one master node and 4 worker nodes. Each worker node is deployed on an A3 virtual machine instance; each A3 VM instance consists of 4 virtual computing cores, 7 GB of RAM, and 120 GB of local hard drive space. The Hadoop platform deployed for evaluation consists of one master and 4 worker nodes in the cluster. A uniform configuration of the PMR and Hadoop frameworks on the Azure cloud is thus considered.

Biomedical applications characterized by the processing of massive amounts of genetic data are considered in the experiments for performance evaluation. The computationally heavy biomedical applications BLAST [35] and CAP3 [30], and the recent state-of-the-art DeepBind [27], are adopted for evaluation. All the genomic sequences considered for the experimental analyses are obtained from the publicly available NCBI database [36]. For comprehensive performance evaluation, the authors have considered various application scenarios: in the experiments conducted using BLAST, both the Map and Reduce phases are involved; in the CAP3 application, the Map phase plays a predominant role; and in DeepBind, the Reduce phase is critical for analysis.

4.1. BLAST. Gene sequence alignment is a fundamental operation adopted to identify similarities that exist between a query protein sequence, DNA, or RNA and a database of maintained sequences. Sequence alignment is computationally heavy, and its computational complexity is relative to the product of the two sequences currently being analyzed. The massive volume of sequences maintained in the database to be searched induces an additional computational burden. BLAST is a widely adopted bioinformatics tool for sequence alignment which performs faster alignments at the expense of accuracy (possibly missing some potential hits) [35]. The drawbacks of BLAST and its improvements are discussed in [28]; for evaluation here, the improved BLAST algorithm of [28] is adopted. To improve computation time, a heuristic strategy is used, compromising accuracy minimally: an initial match is found and is later extended to obtain the complete matching sequence.

A three-stage approach is adopted in BLAST for sequence alignment. The query sequence is represented as q and the reference sequence as r. Sequences q and r are said to consist of k-length subsequences known as k-mers. In the initial stage, also known as the k-mer match stage, BLAST considers each of the k-mers of q and r and searches for k-mers that match in both. This process is repeated to build a scanner of all k-letter words in the query q. BLAST then searches the reference genome r using the scanner built, to find k-mers of r that match the query q; these matches are seeds of potential hits.

In the second stage, also known as the ungapped alignment stage, every seed identified previously is extended in both directions to include matches and mismatches. A match is found if the nucleotides in q and r are the same; a mismatch occurs if different nucleotides are observed in q and r. A mismatch reduces, and matches increase, the score of the candidate sequence alignment. The present score of the sequence alignment, a, and the highest score obtained for the present seed, a↑, are retained. The second phase is terminated if a↑ − a is higher than the predefined X-drop threshold h_x, returning the highest alignment score of the present seed. The alignment is passed to stage three if the returned score is higher than the predefined ungapped threshold H_y. The predefined thresholds establish the accuracy of alignment scores in BLAST. Computational optimization is achieved by skipping seeds already covered in previous alignments. The initial two phases of BLAST are executed in the Map workers of PMR and Hadoop.
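The first two stages can be sketched as follows. This is a simplified illustration (right-extension only, toy scoring and threshold values), not the production BLAST code.

```python
def find_seeds(q, r, k=11):
    """Stage 1: build a scanner of all k-mers of the query q, then scan
    the reference r for exact k-mer matches (seeds of potential hits)."""
    scanner = {}
    for i in range(len(q) - k + 1):
        scanner.setdefault(q[i:i + k], []).append(i)
    return [(i, j) for j in range(len(r) - k + 1)
            for i in scanner.get(r[j:j + k], [])]

def ungapped_extend(q, r, qi, rj, k, h_x=20, match=1, mismatch=-2):
    """Stage 2: extend a seed to the right without gaps, terminating when
    the running score a drops more than the X-drop threshold h_x below
    the best score seen for this seed (left extension is symmetric)."""
    a = a_best = k * match                 # the seed itself scores k matches
    i, j = qi + k, rj + k
    while i < len(q) and j < len(r):
        a += match if q[i] == r[j] else mismatch
        a_best = max(a_best, a)
        if a_best - a > h_x:               # X-drop termination
            break
        i, j = i + 1, j + 1
    return a_best

q, r = "ACGTACGTACGTAA", "TTACGTACGTACGTGG"
seeds = find_seeds(q, r, k=8)
print(seeds, [ungapped_extend(q, r, i, j, 8) for i, j in seeds])
```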

In stage three, gapped alignment is performed in the left and right directions, where deletions and insertions are considered during extension of the alignments. As in the previous stage, the highest alignment score a↑ is kept, and if the present score a is lower than a↑ by more than the X-drop threshold, the stage is terminated and the corresponding alignment outcome is obtained. The gapped alignment operation is carried out in the Reduce phase of the PMR and Hadoop frameworks. A schematic of the BLAST algorithm on the PMR framework is shown in Figure 3.

Experiments conducted to evaluate the performance of PMR and Hadoop considered the Drosophila database as the reference database. The query genomes of varied sizes are from Homo sapiens chromosomal sequences and genomic scaffolds. A total of six different query sequences are considered, similar to [28]. The configuration of each experiment is summarized in Table 1. All six experiments are conducted using the BLAST algorithm on the Hadoop and PMR frameworks. All observations retrieved through a set of log files generated during the Map and Reduce phases of Hadoop and PMR are noted and stored for further analysis. Using the log files, the total makespan, Map worker makespan, and Reduce worker makespan of Hadoop and PMR are noted for each experiment. It must be noted that the initialization time of the VM cluster is not considered in computing the makespan, as it is uniform in PMR and Hadoop owing to similar cluster configurations.

Individual task execution times of Map worker and Reduce worker nodes observed for each BLAST experiment executed on the Hadoop and PMR frameworks are graphically shown in Figure 4. Figure 4(a) represents results obtained for Hadoop, and Figure 4(b) represents results obtained on PMR. Execution times of Map workers in PMR and Hadoop are dominantly higher than Reduce worker times. This is due to the fact that the major computation intensive phases (i.e., Phase 1 and Phase 2) of the BLAST sequence alignment application are carried out in the Map phase. Parallel execution of BLAST sequence alignment utilizing all 4 cores available with each Map worker node, as adopted in PMR, results in lower execution times when compared to Hadoop Map worker nodes.


Figure 3: BLAST PMR framework. The query sequence q is split (with overlap) across Map workers running BLAST against the reference database sequence R; alignments are aggregated by the Reduce workers (Shuffle, Sort, Reduce), and the BLAST alignment result is written to cloud storage.

Table 1: Information of the genome sequences used as queries, considering equal section lengths from Homo sapiens chromosome 15 as a reference.

Experiment Id   Query genome   Query genome size (bp)   Database sequence     Reference genome size (bp)
1               NT_007914      14866257                 Drosophila database   122653977
2               AC_000156      19317006                 Drosophila database   122653977
3               NT_011512      33734175                 Drosophila database   122653977
4               NT_033899      47073726                 Drosophila database   122653977
5               NT_008413      43212167                 Drosophila database   122653977
6               NT_022517      90712458                 Drosophila database   122653977

The average reduction of execution time in Map workers of PMR is 34.19%, 34.15%, 34.78%, 35.29%, 35.76%, and 39.87% in the experiments conducted, when compared to Hadoop Map worker average execution times. As the query size increases, the performance improvement of PMR increases. The parallel execution strategy of Reduce worker nodes proposed in PMR is clearly visible in Figure 4(b); in other words, Reduce workers are initiated as soon as two or more Map worker nodes have completed their tasks. The execution time of Reduce worker nodes in PMR is marginally higher than that of Hadoop; waiting for all Map worker nodes to complete their tasks is the primary reason for this marginal increase in Reduce worker execution times in PMR. The sequential processing (i.e., Map workers first and then Reduce worker execution) of worker nodes in the Hadoop framework is evident from Figure 4(a).

The total makespan of PMR and Hadoop is dependent on the task execution times of worker nodes during the Map phase and Reduce phase. The total makespan observed in the BLAST sequence alignment experiments executed on the Hadoop and PMR frameworks is shown in Figure 5. Superior performance in terms of Reduce makespan times of PMR is evident when compared to Hadoop. Though a marginal increase in Reduce worker execution time is reported, the overall execution time, i.e., the total makespan, of PMR is less when compared to Hadoop. Reductions of 29.23%, 30.61%, 32.78%, 33.17%, 33.33%, and 38.23% are reported for the six experiments executed on the PMR framework when compared to similar experiments executed on the Hadoop framework. The average reduction of the total makespan across all experiments is 32.89%, proving the superior performance of PMR when compared to the Hadoop framework.
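The reported percentages are consistent with computing the relative makespan reduction against Hadoop, i.e.,

$$\text{Reduction}\ (\%) = \frac{\mathcal{T}_{\text{Hadoop}} - \mathcal{T}_{\text{PMR}}}{\mathcal{T}_{\text{Hadoop}}} \times 100$$

so, for a hypothetical pair of runs with $\mathcal{T}_{\text{Hadoop}} = 400$ s and $\mathcal{T}_{\text{PMR}} = 247$ s, the reduction is $(400 - 247)/400 \times 100 = 38.25\%$.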

The theoretical makespan of PMR, i.e., $\overrightarrow{\mathcal{T}}_{\mathcal{J}}$ given by (18), is computed and compared against the practical values observed in all the experiments. The results obtained are shown in Figure 6. Minor variations are observed between practical and theoretical makespan computations; overall, good correlation is reported between practical and theoretical makespan values. Based on the results presented, it is evident that execution of the BLAST sequence alignment algorithm on the proposed PMR yields superior results when compared to similar experiments conducted on the existing Hadoop framework. The accuracy and correctness of the theoretical makespan model of PMR are thus proved through correlation measures.

Figure 4: BLAST sequence alignment execution makespan of the Map and Reduce worker nodes for experiments 1-6. (a) Worker node execution times on the Hadoop cluster of 4 nodes. (b) Worker node execution times on the PMR cluster of 4 nodes.

Figure 5: BLAST sequence alignment total makespan time observed for experiments conducted on the PMR and Hadoop frameworks.


4.2. CAP3. DNA sequence assembly tools are used in bioinformatics for gene discovery and for understanding the genomes of existing/new organisms. CAP3 is one such popular tool used to assemble DNA sequences. DNA assembly is achieved by performing merging and aligning operations on smaller sequence fragments to build complete genome sequences. CAP3 eliminates poor sections observed within DNA fragments, computes overlaps amongst DNA fragments, identifies and eliminates false overlaps, accumulates fragments of one or multiple overlapping DNA segments to produce contigs, and performs multiple sequence alignments to produce consensus sequences. CAP3 reads multiple gene sequences from an input FASTA file and generates output consensus sequences written to multiple files and also to standard output.

The CAP3 gene sequence assembly working principle consists of the following key stages. Firstly, the poor regions of the 3' (three-prime) and 5' (five-prime) ends of each read are identified and eliminated, and false overlaps are identified and eliminated. Secondly, to form contigs, reads are combined based on overlapping scores in descending order; further, forward-reverse constraints are adopted to incorporate corrections to the constructed contigs. Lastly, multiple sequence alignments of reads are constructed per contig, resulting in consensus sequences characterized by a quality value for each base. Quality values of consensus sequences are used in the construction of the multiple sequence alignment operations and also in the computation of overlaps. The operational steps of the CAP3 assembly model are shown in Figure 7. A detailed explanation of CAP3 gene sequence assembly is provided in [30].
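The stages above can be condensed into a toy assembly sketch. The overlap scoring and greedy merging below are drastic simplifications of CAP3 (no quality values, no false-overlap filtering, no consensus step), and all names and parameters are illustrative.

```python
def clip_poor_regions(read):
    # Stage 1 stub: trimming of low-quality 5'/3' ends is omitted here.
    return read

def overlap_score(a, b, min_len):
    # Toy overlap: length of the longest suffix of a that prefixes b.
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def cap3_assemble(reads, min_overlap=5):
    """Greedily merge reads into contigs in decreasing overlap order."""
    contigs = [clip_poor_regions(r) for r in reads]
    while True:
        best = max(((i, j, overlap_score(a, b, min_overlap))
                    for i, a in enumerate(contigs)
                    for j, b in enumerate(contigs) if i != j),
                   key=lambda t: t[2], default=None)
        if best is None or best[2] < min_overlap:
            return contigs
        i, j, n = best
        contigs[i] = contigs[i] + contigs[j][n:]   # join along the overlap
        del contigs[j]

print(cap3_assemble(["ACGTACGTTT", "CGTTTGGAAC", "GGAACCTT"]))
# ['ACGTACGTTTGGAACCTT']
```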

In the experiments conducted, CAP3 gene sequence assembly is directly adopted in the Map phase of PMR and Hadoop; in the Reduce phase, result aggregation is considered. For performance evaluation of CAP3 execution on the PMR and Hadoop frameworks, Homo sapiens chromosome 15 is considered as a reference. Genome sequences of various sizes are considered as queries and submitted to the Azure cloud platform in the experiments.

Query sequences for the experiments are considered in accordance with [30]. The CAP3 experiments conducted with query genomic sequences (BAC datasets) are summarized in Table 2. All four experiments are conducted using the CAP3 algorithm on the Hadoop and PMR frameworks. Observations are retrieved through a set of log files generated during Map and Reduce phase execution on Hadoop and PMR. Using the log files, the total makespan, Map worker makespan, and Reduce worker makespan of Hadoop and PMR are noted for each experiment. It must be noted that the initialization time of the VM cluster is not considered in computing the makespan, as it is uniform in PMR and Hadoop owing to similar cluster configurations.

Task execution times of Map and Reduce worker nodes observed for the CAP3 experiments conducted on the Hadoop and PMR frameworks are shown in Figure 8. Figure 8(a) represents results obtained for Hadoop, and Figure 8(b) represents results obtained on PMR. Execution times of Map worker nodes are far greater than execution times of Reduce worker nodes, as CAP3 algorithm execution is carried out in the Map phase and result accumulation is considered in the Reduce phase. The parallel execution strategy (utilizing the 4 computing cores available with each Map worker) of the CAP3 algorithm on PMR enables lower execution times when compared to Hadoop Map worker nodes. The average reduction of execution time in Map workers of PMR is 19.33%, 20.07%, 15.09%, and 18.21% in the CAP3 experiments conducted, when compared to Hadoop Map worker average execution times. In PMR, the Reduce workers are initiated as soon as two or more Map worker nodes have completed their tasks, which is visible from Figure 8(b). The sequential processing strategy (i.e., Map workers first and then Reduce worker execution) of worker nodes in the Hadoop framework is evident from Figure 8(a). The execution time of Reduce worker nodes in PMR is marginally higher, by about 15.42%, than that of Hadoop; waiting for all Map worker nodes to complete their tasks is the primary reason for this marginal increase in Reduce worker execution times in PMR.

The total makespan observed in the CAP3 experiments executed on the Hadoop and PMR frameworks is presented in Figure 9. Superior performance in terms of Reduce makespan times of PMR is evident when compared to Hadoop.


Table 2: Information of the genome sequences used in the CAP3 experiments.

Experiment Number   Dataset   GenBank accession number   Number of reads   Average length of reads (bp)   Length of provided sequences (bp)
1                   203       AC004669                   1812              598                            89779
2                   216       AC004638                   2353              614                            124645
3                   322F16    AF111103                   4297              1011                           159179
4                   526N18    AF123462                   3221              965                            180182

Figure 6: Correlation between theoretical and practical makespan times for BLAST sequence alignment execution on the PMR framework.

Figure 7: Steps for CAP3 sequencing assembly: removal of poor regions of each read, computation of overlaps among reads, removal of wrongly identified overlaps, contig construction, and generation of consensus sequences with construction of multiple sequence alignments.

Though a marginal increase in Reduce worker execution time is reported, the overall execution time, i.e., the total makespan, of PMR is less when compared to Hadoop. Reductions of 18.97%, 20%, 15.03%, and 18.01% are reported for the four experiments executed on the PMR framework when compared to similar experiments executed on the Hadoop framework. The average reduction of the total makespan across all experiments is 18%, proving the superior performance of PMR when compared to the Hadoop framework. The makespan time for experiment 2 is greater than for the other experiments, as the number of differences considered in CAP3 is 17, larger than the values considered in the other experiments. A similar nature of execution times is reported in [29], validating the CAP3 execution experiments presented here.

The theoretical makespan of PMR for all four CAP3 experiments is computed using (18). The comparison between theoretical and experimental makespan values is presented in Figure 10. Minor differences are reported between practical and theoretical makespan computations, proving the correctness of the PMR makespan modeling presented.

The results presented in this section prove that CAP3 sequence assembly execution on the developed PMR cloud framework exhibits superior performance when compared to similar CAP3 experiments executed on the existing Hadoop cloud framework.

4.3. DeepBind Analysis to Identify Binding Sites. In recent times, deep learning techniques have been extensively used for various applications. Deep learning techniques are adopted predominantly when large amounts of data are to be processed or analyzed, and GPUs are used to meet their large computing needs. Motivated by this, the authors consider the very recent state-of-the-art "DeepBind" biomedical application for execution on a cloud platform. To the best of our knowledge, no such attempt to consider cloud platforms for DeepBind execution has been reported.

Biomedical operations such as alternative splicing, transcription, and gene regulation are dependent on DNA- and RNA-binding proteins. DNA- and RNA-binding proteins, described using sequence specificities, are critical in identifying diseases and deriving models of the regulatory processes that occur in biological systems. Position weight matrices are used in characterizing the specificities of a protein, and binding sites on genomic sequences are identified by scanning position weight matrices over the considered genomic sequences. DeepBind is used to predict sequence specificities, adopting deep convolutional neural networks to achieve accurate prediction. Comprehensive details and the sequence specificity prediction accuracy of the DeepBind application are available in [27].
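A minimal sketch of DeepBind-style scoring is shown below: learned motif detectors are convolved over a one-hot encoded sequence, rectified, and max-pooled. The filter values are illustrative, and the trained neural network stage of the real model [27] is omitted.

```python
import numpy as np

def one_hot(seq):
    # Encode a DNA sequence as a (len, 4) one-hot matrix over A, C, G, T
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    m = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        m[i, idx[base]] = 1.0
    return m

def deepbind_score(seq, motif_filters, bias=0.0):
    """Convolve (motif_len, 4) filters over the sequence, apply ReLU,
    max-pool each filter's activations, and sum into a binding score."""
    x = one_hot(seq)
    score = bias
    for w in motif_filters:
        m = len(w)
        acts = [np.sum(x[i:i + m] * w) for i in range(len(seq) - m + 1)]
        score += max(0.0, max(acts))        # ReLU followed by max pooling
    return score

# Hypothetical 4-mer filter favoring the motif "GATA"
gata = np.zeros((4, 4))
gata[0, 2] = gata[1, 0] = gata[2, 3] = gata[3, 0] = 1.0
print(deepbind_score("TTGATACC", [gata]))   # higher when GATA is present
```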


Figure 8: CAP3 sequence assembly execution makespan of the Map and Reduce worker nodes for experiments 1-4. (a) Worker node execution times on the Hadoop cluster of 4 nodes. (b) Worker node execution times on the PMR cluster of 4 nodes.


DeepBind is developed using a two-phase approach: a training phase and a testing phase. Training phase execution is carried out using Map workers in the Hadoop and PMR frameworks, and the trained weights are stored in cloud memory for further processing. The testing phase of DeepBind is carried out in the Reduce stage of the Hadoop and PMR frameworks. The execution strategy of the DeepBind algorithm on the PMR framework is shown in Figure 11. The DeepBind application is developed using the code provided in [27]. For performance evaluation on Hadoop and PMR, only the testing phase is discussed (i.e., Reduce-only mode). A custom cloud cluster of one master node and six worker nodes is deployed for DeepBind performance evaluation; a similar cloud cluster is considered for the Hadoop framework.


Table 3: Information on the disease-causing genomic variants used in the experiment (Experiment 1).

Case study | Genome variant | Experiment details
1 | SP1 | A disrupted SP1 binding site in the LDL-R promoter that leads to familial hypercholesterolemia
2 | TCF7L2 | A cancer risk variant in a MYC enhancer weakens a TCF7L2 binding site
3 | GATA1 | A gained GATA1 binding site that disrupts the original globin cluster promoters
4 | GATA4 | A lost GATA4 binding site in the BCL-2 promoter, potentially playing a role in ovarian granulosa cell tumors
5 | RFX3 | Loss of two potential RFX3 binding sites leads to abnormal cortical development
6 | GABPA | Gained GABP-α binding sites in the TERT promoter, which are linked to several types of aggressive cancer

[Figure 9: CAP3 sequence assembly total makespan time (s) versus experiment number (1-4), observed for experiments conducted on the PMR and Hadoop frameworks.]

[Figure 10: Correlation between theoretical (PMR-Theory) and practical (PMR) makespan times (s) for CAP3 sequence assembly execution on the PMR framework, Experiments 1-4.]

The experiment conducted to evaluate the performance of DeepBind on the Hadoop and PMR frameworks considers a set of six disease-causing genomic variants obtained from [27]. The disease-causing genomic variants to be analyzed using DeepBind are summarized in Table 3. DeepBind analysis is executed on the Hadoop and PMR frameworks deployed on a custom cloud cluster. The log data generated is stored and used in further analysis.

The results demonstrating the performance of the six worker cluster nodes of Hadoop and PMR during Map and Reduce phase execution are shown in Figure 12. Performance is presented in terms of the task execution times observed per worker node. For the Hadoop worker nodes, the execution time of each node during the Map and Reduce phases is shown in Figure 12(a). The execution time observed for each PMR worker node during the Map and Reduce phases is shown in Figure 12(b). In the Map phase execution, the genomic variants to be analyzed are obtained from cloud storage and are accumulated based on their defined identities [27]. The query sequences of the disease-causing genomic variants to be analyzed are split for parallelization. In the Reduce phase, the split query sequences are analyzed, and the results obtained are accumulated and stored in cloud storage.
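The paper does not spell out the splitting scheme, so the sketch below is one plausible Python rendering; the overlap parameter is our assumption, added so that candidate binding sites spanning a cut are not lost.

    def split_query(sequence, parts, overlap):
        # Split a query sequence into `parts` roughly equal fragments for
        # parallel analysis; `overlap` bases are repeated across each
        # boundary so motifs straddling a cut still appear in one fragment.
        step = -(-len(sequence) // parts)  # ceiling division
        fragments = []
        for p in range(parts):
            start = max(0, p * step - overlap)
            fragments.append(sequence[start:(p + 1) * step + overlap])
        return [f for f in fragments if f]

    # Example: split one variant's flanking sequence across 4 cores.
    print(split_query("ACGTACGTACGTACGTACGT", parts=4, overlap=2))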


[Figure 11: DeepBind PMR framework. The PMR master node dispatches training data and the query sequence q; Map worker virtual machines (running DeepBind on the PMR platform with local memory) produce trained data (weights) persisted to cloud storage, and Reduce worker virtual machines (Shuffle, Sort, Reduce) compute binding scores, accumulated through temporary cloud storage into the final computed score in cloud storage.]

[Figure 12: DeepBind execution makespan of the Map and Reduce worker nodes (MW1-MW6, RW1-RW6), plotted as task execution against makespan time (s), Experiment 1. (a) On a Hadoop cluster of 6 nodes. (b) On a PMR cluster of 6 nodes.]

Map workers in PMR exhibit better performance, and an average execution time reduction of 48.57% is reported when compared to the Hadoop Map worker nodes. The execution time of the six Reduce worker nodes in Hadoop and PMR is greater than that of the Map workers, as DeepBind analysis and identification of potential binding sites are carried out during this phase. The parallel execution strategy of the Reduce worker nodes is clear from Figure 12(b): the Reduce phase in PMR commences after 5 seconds, once Map worker node 1 (MW1) and Map worker node 5 (MW5) have completed their tasks. In Hadoop, which adopts a sequential approach, the Reduce phase is initiated only after all worker nodes have completed their tasks. Parallel execution of DeepBind analysis utilizing all 4 computing cores available with the Reduce worker nodes, together with parallel initiation of the Reduce phase in PMR, enables an average Reduce execution time reduction of 22.22% when compared to the Hadoop Reduce worker nodes. The total makespan observed for DeepBind experiment execution on the Hadoop and PMR cloud computing platforms is shown in Figure 13. A total makespan reduction of 34.62% is achieved using the PMR framework when compared to the Hadoop framework. Analysis results similar to [27] are reported for DeepBind analysis on both the Hadoop and PMR frameworks. The theoretical makespan computed using (18) for PMR is compared with the practical value observed in the experiment, and the results obtained are shown in Figure 14. A minor variation between theoretical and practical values is observed, predominantly due to application-dependent multiple cloud memory access operations. Based on the results obtained for DeepBind analysis, it is evident that performance on the PMR framework is far superior to execution on the existing Hadoop framework.
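To make the comparison in Figure 14 concrete, the following Python sketch evaluates the theoretical makespan of (18) from the phase bounds defined in Section 3; the input numbers are illustrative placeholders, not measurements from the experiment.

    def phase_bounds(avg, mx, tasks, workers):
        # Lower/upper makespan bounds for a phase with `tasks` tasks on
        # `workers` nodes, given average and maximum task execution times.
        lb = tasks * avg / workers
        ub = (tasks - 1) * avg / workers + mx
        return lb, ub

    def theoretical_makespan(map_avg, map_max, red_avg, red_max,
                             map_tasks, red_tasks, workers):
        # Job makespan following (18):
        # T = (3*T_lb_M + T_lb_R + T_ub_R - T_ub_M) / 2
        lb_m, ub_m = phase_bounds(map_avg, map_max, map_tasks, workers)
        lb_r, ub_r = phase_bounds(red_avg, red_max, red_tasks, workers)
        return (3 * lb_m + lb_r + ub_r - ub_m) / 2

    # Illustrative values only: 6 Map and 6 Reduce tasks on 6 workers.
    print(theoretical_makespan(3.0, 4.0, 15.0, 18.0, 6, 6, 6))  # -> 24.0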

On the basis of the biomedical applications considered for performance evaluation and the results obtained, it is evident that the proposed PMR framework exhibits superior performance


[Figure 13: DeepBind analysis total makespan time (s) observed for the experiment conducted on the PMR and Hadoop frameworks.]

[Figure 14: Correlation between theoretical (PMR-Theory) and practical (PMR) makespan times (s) for DeepBind analysis execution on the PMR framework.]

when compared to its existing Hadoop counterpart. In BLAST, both the Map and Reduce phases are utilized. In the CAP3 application, the Map phase plays the predominant role. In the DeepBind application, analysis is carried out in the Reduce phase. The proposed PMR cloud computing framework is robust, accommodates the diverse biomedical application scenarios presented, and supports deployment on public and custom cloud platforms. In addition, PMR exhibits low execution times and enables effective cloud resource utilization. Low execution times enable cost reduction, which is always a desired feature.

5. Conclusion and Future Work

The significance of cloud computing platforms is discussed. The working of the commonly adopted Hadoop MapReduce framework, along with its drawbacks, is presented. To lower execution times and enable effective utilization of cloud resources, this paper proposes the PMR cloud computing platform. A parallel execution strategy of the Map and Reduce phases is considered in the PMR framework. The Map and Reduce functions of PMR are designed to utilize the multicore environments available with worker nodes. The paper presents the proposed PMR framework architecture along with its makespan modeling. Performance of the PMR cloud computing framework is compared with the Hadoop framework. For performance evaluation, computationally heavy biomedical applications, namely, BLAST, CAP3, and DeepBind, are considered. Average overall makespan time reductions of 38.92%, 18.00%, and 34.62% are achieved using the PMR framework when compared to the Hadoop framework for the BLAST, CAP3, and DeepBind applications, respectively. The experiments presented prove the robustness of the PMR platform, its capability to handle diverse applications, and its ease of deployment on public and private cloud platforms. The results of the experiments conducted prove the superior performance of PMR against the Hadoop framework. Good matching is reported between the theoretical makespan model of PMR presented and the experimental values observed. In addition, adopting the PMR cloud computing framework also enables cost reduction and efficient utilization of cloud resources.

A performance study considering cloud clusters with many nodes, additional applications, and security provisioning for the cloud computing framework is considered as the future work of this paper.

Data Availability

The data is available at the National Center for Biotechnology Information (2015). [Online]. Available: http://www.ncbi.nlm.nih.gov/.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (no. NRF-2015R1D1A1A01061328).

References

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 137–150, 2004.

[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, Boston, MA, June 2010.

[3] G. Malewicz, M. H. Austern, A. J. C. Bik et al., "Pregel: a system for large-scale graph processing," in Proceedings of the International Conference on Management of Data (SIGMOD '10), pp. 135–146, June 2010.

[4] S. Ghemawat, H. Gobioff, and S. Leung, "The Google file system," in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, p. 29, Bolton Landing, NY, USA, October 2003.

[5] X. Shi, M. Chen, L. He et al., "Mammoth: gearing Hadoop towards memory-intensive MapReduce applications," IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 8, pp. 2300–2315, 2015.

[6] L. Person, "World Hadoop Market - Opportunities and Forecasts," Allied Market Research, p. 108, 2020.

[7] SNS Research, The Big Data Market 2014–2020: Opportunities, Challenges, Strategies, Industry Verticals and Forecasts, SNS Research, 2014.

[8] J. Zhu, J. Li, E. Hardesty, H. Jiang, and K.-C. Li, "GPU-in-Hadoop: enabling MapReduce across distributed heterogeneous platforms," in Proceedings of the 2014 13th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2014), pp. 321–326, China, June 2014.

[9] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proceedings of the Eighth USENIX Conference on Operating Systems Design and Implementation (OSDI '08), pp. 29–42, 2008.

[10] L. D. Stein, "The case for cloud computing in genome informatics," Genome Biology, vol. 11, no. 5, article 207, 2010.

[11] D. P. Wall, P. Kudtarkar, V. A. Fusaro, R. Pivovarov, P. Patil, and P. J. Tonellato, "Cloud computing for comparative genomics," BMC Bioinformatics, vol. 11, no. 1, article 259, 2010.

[12] A. Rosenthal, P. Mork, M. H. Li, J. Stanford, D. Koester, and P. Reynolds, "Cloud computing: a new business paradigm for biomedical information sharing," Journal of Biomedical Informatics, vol. 43, no. 2, pp. 342–353, 2010.

[13] D. Dahiphale, R. Karve, A. V. Vasilakos et al., "An advanced MapReduce: cloud MapReduce, enhancements and applications," IEEE Transactions on Network and Service Management, vol. 11, no. 1, pp. 101–115, 2014.

[14] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good, "The cost of doing science on the cloud: the Montage example," in Proceedings of the ACM/IEEE Conference on Supercomputing, IEEE Press, November 2008.

[15] N. Chohan, C. Castillo, M. Spreitzer, M. Steinder, A. Tantawi, and C. Krintz, "See spot run: using spot instances for MapReduce workflows," in Proceedings of the 2010 USENIX Conference on Hot Topics in Cloud Computing, p. 7, 2010.

[16] R. S. Thakur, R. Bandopadhyay, B. Chaudhary, and S. Chatterjee, "Now and next-generation sequencing techniques: future of sequence analysis using cloud computing," Frontiers in Genetics, vol. 3, 2012.

[17] J. Chen, F. Qian, W. Yan, and B. Shen, "Translational biomedical informatics in the cloud: present and future," BioMed Research International, vol. 2013, Article ID 658925, 8 pages, 2013.

[18] T. Nguyen, W. Shi, and D. Ruden, "CloudAligner: a fast and full-featured MapReduce based tool for sequence mapping," BMC Research Notes, vol. 4, article 171, 2011.

[19] M. C. Schatz, "CloudBurst: highly sensitive read mapping with MapReduce," Bioinformatics, vol. 25, no. 11, pp. 1363–1369, 2009.

[20] B. Langmead, K. D. Hansen, and J. T. Leek, "Cloud-scale RNA-sequencing differential expression analysis with Myrna," Genome Biology, vol. 11, no. 8, p. R83, 2010.

[21] B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L. Salzberg, "Searching for SNPs with cloud computing," Genome Biology, vol. 10, no. 11, article R134, 2009.

[22] A. A. Al-Absi and D. Kang, "A novel parallel computation model with efficient local memory management for data-intensive applications," in Proceedings of the 2015 IEEE 8th International Conference on Cloud Computing (CLOUD), pp. 958–963, New York City, NY, USA, June 2015.

[23] A. A. Al-Absi and D.-K. Kang, "Long read alignment with parallel MapReduce cloud platform," BioMed Research International, vol. 2015, Article ID 807407, 13 pages, 2015.

[24] E. Yoon and A. Squicciarini, "Toward detecting compromised MapReduce workers through log analysis," in Proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 41–50, Chicago, IL, USA, 2014.

[25] D. Dang, Y. Liu, X. Zhang, and S. Huang, "A crowdsourcing worker quality evaluation algorithm on MapReduce for big data applications," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 7, pp. 1879–1888, 2016.

[26] S. D. Tetali, M. Lesani, R. Majumdar, and T. Millstein, "MrCrypt: static analysis for secure cloud computations," in Proceedings of the 2013 28th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2013), pp. 271–286, USA, October 2013.

[27] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, "Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning," Nature Biotechnology, vol. 33, no. 8, pp. 831–838, 2015.

[28] K. Mahadik, S. Chaterji, B. Zhou, M. Kulkarni, and S. Bagchi, "Orion: scaling genomic sequence matching with fine-grained parallelization," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2014), pp. 449–460, USA, November 2014.

[29] J. Ekanayake, T. Gunarathne, and J. Qiu, "Cloud technologies for bioinformatics applications," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 6, pp. 998–1011, 2011.

[30] X. Huang and A. Madan, "CAP3: a DNA sequence assembly program," Genome Research, vol. 9, no. 9, pp. 868–877, 1999.

[31] Y. Wu, W. Guo, J. Ren, X. Zhao, and W. Zheng, "NO2: speeding up parallel processing of massive compute-intensive tasks," IEEE Transactions on Computers, vol. 63, no. 10, pp. 2487–2499, 2014.

[32] Z. Zhang, L. Cherkasova, and B. T. Loo, "Optimizing cost and performance trade-offs for MapReduce job processing in the cloud," in Proceedings of the IEEE/IFIP Network Operations and Management Symposium: Management in a Software Defined World (NOMS 2014), Poland, May 2014.

[33] K. Chen, J. Powers, S. Guo, and F. Tian, "CRESP: towards optimal resource provisioning for MapReduce computing in public clouds," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 6, pp. 1403–1412, 2014.

[34] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2009.

[35] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.

[36] National Center for Biotechnology Information, 2015, http://www.ncbi.nlm.nih.gov/.


Page 2: Parallel MapReduce: Maximizing Cloud Resource Utilization ...downloads.hindawi.com/journals/bmri/2018/7501042.pdf · MapReduce platform. Experimental outcomes show that theirmodelachievesaspeedupof.xovermpiBLASTwith-outcompromisingonaccuracy

2 BioMed Research International

Data Store 1

Data Store n

Input key value pairs Input key value pairs

Shuffle and Sort intermediate values by output key

MAPWORKER

MAP WORKER

helliphellip

REDUCE WORKER

REDUCE WORKER

REDUCE WORKER

Final k1 1

k1 1

Final k2 2

k2 2

Final k3 3

k3 3

(k1 1) (k2 2) (k3 3) (k1 1) (k2 2) (k3 3)

Figure 1 Hadoop MapReduce computation model

operation is performed on the intermediate keyvalue pairsof the Map phase For simplicity the Sort and Shuffle phasesare cumulatively considered in the Shuffle phaseThe Reducephase processes sorted intermediate data based on user de-fined functions The output of the Reduce phase is storedwritten to HDFS

The Hadoop MapReduce platform suffers from a num-ber of drawbacks The preconfigured memory allocator forHadoop jobs leads to issues of buffer concurrency amongstjobs and heavy disk read seeks The memory allocator issuesresult in increasing makespan time and induce high inputoutput (IO) overheads [5] The jobs scheduled on Hadoopcloud environments do not consider parameters such asmemory requirement and multicore environment for linearscalability which seriously affects performance [8] In Ha-doop the Reduce tasks are started after completion of all Maptasks Hadoop assumes homogenous Map execution timesconsidering homogenous distributed data which is not real-istic [9] Assumed homogenous Map execution times andserial execution strategy put forth utilized Map workers (andtheir resources) that have completed their tasks and arewaiting for the other Map workers to complete theirs [10] Incloud environments where organizationsusers are chargedaccording to (storage computation and communication) re-sources utilized these issues burden the costs in addition toaffecting performance [11] Hadoop platforms do not supportflexible pricing [12] Scalability is an issue owing to the clusterbased nature of Hadoop platforms Processing of streamingdata is also an issue with Hadoop [10] To overcome thesedrawbacks researchers have adopted various techniques

In [5] they addressed the issues related to Hadoop mem-ory management by adopting a global memory managementtechnique They proposed a prioritization model of memory

allocation and revocation by adopting a rule based heuristicapproach A multithread execution engine is used to achieveglobal memory management To address the garbage collec-tion issue of a Java virtual machine (JVM) and to improvethe data access rate in Hadoop they adopted a multicachemechanism for sequential and interleaved disk access Theirmodel improves the memory utilization and balances theperformance of IO and CPU In [5] the authors did not takethe network IO performance into consideration

In [8] aGPU basedmodel to address the linear scalabilityissue of Hadoop is presented They addressed the researchchallenges of integrating Hadoop and GPU and how theMapReduce job can be executed using CUDA based GPU InHadoop MapReduce framework the jobs run inside a JVMManaging of jobs creation of jobs and executing of jobs suf-fer from computation overhead and reduce the efficiency ofJust-In-Time (JIT) compilation due to the short-lived natureof jobs in the JVM To overcome this they adopted GPUbased job execution approaches such as JNI JCuda HadoopPipes and Hadoop Streaming They have analyzed and eva-luated detailed comparison of protocol of their pros and cons

To address issues related to sequential execution in [13]a Cloud MapReduce (CMR) framework is discussed Herethey developed a parallelized model by adopting a pipeliningexecution approach to process the streaming and batch dataTheir cloud based MapReduce model supports parallelismbetween Map and Reduce phases and also among individualjobs

The increased demand in data analytics for computingscientificbioinformatics data has resulted in increased sizeof bioinformatics data Computing and storing these hugedata require a huge infrastructure Computing bioinformaticsapplication by adopting Cloud platform is a viable option for

BioMed Research International 3

analyzing the genomic structure and its evolutionary patternof large bioinformatics data [14ndash18] which is generated by theNext Generation Sequencing (NGS) technologies Variouscloud based bioinformatics applications have been developedto compute large bioinformatics data CloudAligner [18]CloudBurst [19] Myrna [20] and Crossbow [21] Cloud tech-nologies allow the user to compute bioinformatics applicationand charges the user based on their usage Reducing the com-putation cost in such environment is an area that needs to beconsidered during designing a bioinformatics computationmodel

Reducing execution times and effective resource utiliza-tion with minimal costs are always a desired feature of cloud-computing frameworks To achieve this goal a ParallelMapReduce (119875119872119877) framework is proposed in this paperThe119875119872119877 adopts a parallel execution strategy similar to the tech-nique presented in [13] In conventional MapReduce sys-tems the Map phase is executed first and then Reduce phaseexecution is considered In the proposed PMR Reduce phaseexecution is initiated in a parallel fashion as soon as twoor more Map worker nodes have completed their tasks Theadoption of such execution strategies enables reduction of un-utilizedworker resources To further reducemakespan paral-lel execution of the Map and Reduce functions is adopted uti-lizing multicore environments available with nodes A make-spanmodel to describe operations of the 119875119872119877 is presented infuture sections Bioinformatics applications are synonymouswith big data Processing of such computationally heavyapplications is considered on cloud platforms as investigatedin [22 23] Performance evaluation of the PMR framework iscarried out using bioinformatics applications Themajor con-tributions can be summarized as follows

(i) Makespanmodeling and design of119875119872119877 cloud frame-work

(ii) Parallel execution strategy of the Map and Reducephase

(iii) Maximizing cloud resource utilization by computingon multicore environments in Map and Reduce

(iv) Performance evaluation on state-of-the-art biomedi-cal applications like BLAST CAP3 and DeepBind

(v) Experiments considering diverse cloud configura-tions and varied application configuration

(vi) Correlation between theoreticalmakespanmodel andexperimental values

Beside the data management and computing issues thereexist numerous security issues and challenges in provisioningsecurity in cloud computing environment and in ensuringethical treatment of biomedical data When MapReduce iscarried out in distributed settings users maintain very littlecontrol over these computations causing several security andprivacy concerns MapReduce activities may be subverted orcompromised by malicious or cheating nodes Such secu-rity issues have been discussed and highlighted by manyresearchers as in [24ndash26]However addressing security issuesis beyond the scope of this paper

The paper organization is as follows In Section 2 therelated works are discussed In Section 3 the proposed 119875119872119877framework is presented The results and the experimentalstudy are presented in the penultimate section The conclud-ing remarks are discussed in the last section

2 Literature Review

D Dahiphale et al [13] presented a cloud based MapReducemodel to overcome the shortcomings of the HadoopMapRe-duce model which are as follows Hadoop processes the Mapand Reduce phases in a sequential manner scalability is notefficient due to cluster based computing mechanism process-ing of stream data is not supported and lastly it does notsupport flexible pricing To overcome the issue of sequentialexecution they proposed a cloud based Parallel MapReducemodel where the tasks are executed by the Amazon EC2instances (virtual machine (worker)) to process stream andbatch data in a parallel manner a pipeliningmodel is adoptedwhich provides flexible pricing by using an Amazon cloudSpot Instance Experiment result shows that the CMRmodelprocesses tasks in a parallel manner improves the through-put and shows a speedup improvement of 30 over theHadoop MapReduce model for larger datasets

X Shi et al [5] presented a framework for memory inten-sive computing to overcome the shortcomings of the HadoopMapReduce model In Hadoop tasks are executed based onthe available CPU cores and memory is allocated based ona preset configuration which lead to memory bottleneck dueto buffer concurrency and heavy disk seeks resulting in IOwait occupancy which further increases the makespan timeTo address this they presented a rule based heuristicmodel toprioritize memory allocation and revocation for global mem-ory management They presented a multithread approachfor which they developed disk access serialization multi-cache technique for efficient garbage collection in JVM Theexperimental study shows that execution of memory inten-sive computation time is improved by 40 over the HadoopMapReduce model

Babak Alipanahi et al [27] presented a model by adopt-ing deep learning techniques for DNA- and RNA-bindingprotein for pattern discovery The specificity of protein isgenerally described using position weight matrices (PWMs)and the learning sequence specificity in the high throughputmodel has the following challenges Firstly there is thevaried nature data from different sources For example chro-matin immunoprecipitation provides varying putativelybound sequence length of ranked list for each sequenceRNAcompete assay and protein binding microarray providea specificity coefficient and HT-SELEX produces a very highsimilarity sequence set Secondly each data provider has itsunique biases artifacts and limitation for which it needsto identify the pertinent specificity Lastly the data are inhuge size which requires a computation model to integrateall data from different sources To overcome these challengesthey presented a model namely DeepBind whose charac-teristics are as follows It is applicable for both sequence andmicroarray data and works well across different technologieswithout correcting for technology-specific biases Sequences

4 BioMed Research International

are processed in a parallel manner by using a graphics pro-cessing unit (GPU) which can train the predicting modelautomatically and canwithstand amodest degree of noisy andincorrectly labeled trained data Experiments are conductedon in vitro data for both training and testing which shows thatthe DeepBind model is a scalable modular pattern discoverytechnique based on deep learning which does not depend onapplication specific heuristics such as ldquoseed findingrdquo

K Mahadik et al [28] presented a parallelized BLASTmodel to overcome issues related to mpiBLAST which areas follows It segments the database and processes each shortquery in parallel but due to rapid growth of NGS it has re-sulted in increased size of sequences (long query sequences)which can bemillions of proteinnucleotide sequences whichlimits the mpiBLAST resulting in scheduling overhead andincreasing the makespan time The mpiBLAST task com-pletion time of short queries is faster as compared to largequeries which create improper load balancing among nodesTo address this they presented a parallel model of BLASTnamely ORION splitting individual queries into overlappingfragments to process large query sequences on the HadoopMapReduce platform Experimental outcomes show thattheirmodel achieves a speedup of 123x overmpiBLASTwith-out compromising on accuracy

J Ekanayake et al [29] presented a cloud based MapRe-duce model namely Microsoft DryadLINQ and Apache Ha-doop for bioinformatics applications and it was comparedwith the existing MPI frameworkThe pairwise Alu sequencealignment and CAP3 [30] application is considered To eva-luate the scheduling performance of these frameworks aninhomogeneous dataset is considered Their outcomes showthat two cloud frameworks have a significant advantage overMPI in terms of fault tolerance parallel execution by adopt-ing the MapReduce framework robustness and flexibilitysince MPI is memory based whereas the DryadLINQ andHadoop model is file oriented based Experimental analysisis conducted for varied sequence sizes and the result showsthat Hadoop performs better than DryadLINQ for inhomo-geneous data for both applications

Y Wu et al [31] presented an outliers based executionstrategy1198731198742 for computation intensive applications in orderto reduce the makespan and overhead of computation manyexisting approaches that adopt a MapReduce framework aresuitable for data intensive application since their schedulerstate is defined by IO status They designed a framework forcomputation intensive tasks by adopting instrumentation todetect task progress and automatic instrument point selectorto reduce overhead and finally for outlierrsquos detection withoutresorting to biased progress calculation K-means is adoptedThe1198731198742 framework is evaluated by using application CAP3and ImageMagick on both local cluster and cloud environ-ment Their threshold based outlier model improves the taskcompletion time by 25 with minimal overhead

3 The Proposed PMR Framework

The119875119872119877 framework incorporates similar functions availablein conventional MapReduce frameworks Accordingly theMap Shuffle (including Sort) and Reduce phases exist in

119875119872119877 For the sake of representation simplicity the ShuffleandReduce phases are cumulatively considered in theReducephase The Map phase takes input data for processing andgenerates a list of key pair values of result 119872(1198961198901199101 V1198861198971) 997888rarr119897(1198961198901199102 V1198861198972) This generated key 1198961198901199102 and list of differentvalues are integrated together and put into a reducer func-tion The reducer function takes intermediate key 1198961198901199102 andprocesses the values and generates a new set of values 119897(V1198861198973)

The 119875119872119877 job execution is performed on multiple virtualmachines forming a computing cluster where one is a masternode and the others are worker nodesslave nodes The mas-ter node distributes andmonitors tasks among worker nodesWorker nodes periodically send their resource utilizationdetails to the master node Master nodes schedule the taskbased on availability of worker resources

To minimize makespan of job execution and maximizeutilization of cloud resource (available with worker nodes)the proposed 119875119872119877 adopts a parallel execution strategy ieReduce phase execution is initiated in a parallel fashion assoon as two or more Map worker nodes have completedtheir tasks The worker nodes are considered to have morethan one computing core the 119875119872119877 framework presentsparallel execution of the Map and Reduce functions adoptedutilizing multicore environments and a makespan model ofthe proposed 119875119872119877 is described and presented in Section 31

The 119875119872119877 function is a combination of the Map task andReduce task The input dataset is split into uniform blocksized data called chunks and is distributed among the com-puting nodes In 119875119872119877 the chunk obtained is further split toparallelize execution of user defined Map and Reduce func-tions The user defined Map function is applied on the inputand intermediate output is generated which is input data forthe Reduce task The Reduce stage is a combination of twophases Shuffle and Reduce Output data which is generatedfrom the Map task is fed as an input in the Shuffle phase thealready completedMap task is shuffled and then sorted in thisphaseNow the sorted data is fed into the user definedReducefunction and the generated output is written back to cloudstorage

A Map function in terms of computation time and inputoutput data dependencies can be represented as a tuple

(997888997888997888rarr119878119894119911119890119894119899M 119878MMdarrM997888rarrMuarr) (1)

where997888997888997888rarr119878119894119911119890119894119899M is the average input data processed by each Mapworker Variables MdarrM997888rarr and Muarr represent the maxi-mum average and minimum computation time of the MapfunctionOutput of theMap function stored in the cloud to beprocessed by Reduce workers is represented as a ratio be-tween output and input data 119878M

Similarly the 119875119872119877 Reduce function is represented as

(119878M 119878119877 119877darr 119877997888rarr 119877uarr) (2)

where 119878M is output data of Map functions stored in the cloud(represented as a ratio) Output of the Reduce function or thetask assigned to 119875119872119877 is represented as 119878119877 (ratio of Reduce

BioMed Research International 5

output to input) Minimum average and maximum compu-tation times of the Reduce function are 119877darr 119877997888rarr and 119877uarr TheReduce stage incorporates Shuffle and Sort operations

Reducing execution time and minimizing cost of cloudusage are always desirable attributes In this paper a make-span model to describe operation of the 119875119872119877 framework ispresented Obtaining actual makespan times is very complexand is always a challenge A number of dependencies existlike hardware parameters network conditions cluster nodeperformance cloud storage parameters data transfer ratesetc in obtaining makespans The makespan model of 119875119872119877described below only considers functional changes incorpo-rated to improve performance in conventional MapReduceframeworks Modeling described below is based on workpresented in [32]

31 PMRMakespan Model

311 Preliminaries and Makespan Bound Establishment Themakespan function is computed as the time required to com-plete a job of input data size and number of resources whichis allocated to 119875119872119877 Let us consider a job 119869 to be executedon the 119875119872119877 cloud platform considering data119863 Let the cloudplatform have 119899 + 1 number of nodesworkers Each workeris said to have 119901 cores that can be utilized for computationOne worker node acts as the master node leaving 119899 numberof workers to perform Map and Reduce computation Job119869 is considered to be distributed and computed using an 119909number of Map and Reduce tasks The data 119863 is also accord-ingly split into 119909 chunks represented as 1198631015840 In conventionalMapReduce platforms 1198631015840 = (119863119899) PMR considers a similarapproach for computing 1198631015840 The time utilized to complete 119909tasks is represented asT1T2 T119909 In the proposed PMRframework the chunks 1198631015840 are further split into11986310158401015840 = (1198631015840119901)for parallel execution In PMR execution of the 119909119905ℎ taskT119909 =Tuarr = maxt1 sdot sdot sdot t119901 where t119901 represents execution of task onthe 119901119905ℎ core considering corresponding data 11986310158401015840

Average (120590) and maximum time (120573) duration taken by 119909tasks to complete job 119869 can be represented as

120590 = sum119909119895=1T119895119909

120573 = max119895

T119895 (3)

Let us consider an optimistic scenario that 119909 tasks areuniformly distributed among 119902 worker nodes (minimumtime taken to process (119909 times 120590) work) Overall the time takento compute these tasks is (119909 times 120590)119902 and it is the lower boundtime

119897119887 = 119909 times 120590119902 (4)

To compute the upper bound time a pessimistic scenariois considered where the longest processing task

larr997888T isin (T1

T2 T119909) with makespan of 120573 is the last processed task

Therefore the time taken before the last tasklarr997888T is upper

bounded as follows

(sum119909119895=1T119895)119902 le (119909 minus 1) times 120590

119902 (5)

Therefore the overall timespan for this longest tasklarr997888T

is upper bounded as ((119909 minus 1) times 120590)119902 + 120573 The probable jobmakespan range due to nondeterminism and scheduling isobtained by the difference lower bound and upper boundThis is a key factor when the time taken of the longest taskis trivial as compared to the overall makespan ie 120573 ≪(119909 times 120590119902)

119906119887 = (119909 minus 1) times 120590119902 + 120573 (6)

312 Makespan of a Job on the PMR Framework Let us con-sider a jobJ which is submitted to the 119875119872119877 framework thejobJ is split into 119883J

Mnumber of Map tasks and 119883J

119877 numberof Reduce tasks Let SJ

Mand S

J

Rrepresent the number of

Map and Reduce workers allocated for theJ119905ℎ jobTo compute the makespan of the Map tasks the lower

and upper bounds are computed Using (3) average (M997888rarr)and maximum (Muarr) makespan of Map tasks of job J arecomputed Using Muarrand M997888rarr computed in (4) the lowerbound of the Map phase ieT119897119887119872 is defined as

T119897119887119872 = 119883J

MtimesM997888rarr

SJM

(7)

Similarly the upper bound T119906119887119872 or the maximum execu-tion time of the Map phase in 119875119872119877 using (6) is defined as

T119906119887119872 = (119883J

Mminus 1) timesM997888rarr

SJM

+Muarr (8)

Considering the lower (T119897119887119872) and upper (T119906119887119872) boundscomputed the makespan of the Map phase in 119875119872119877 is com-puted as

997888rarrT119872 = (T119906119887119872 +T119897119887119872)

2 (9)

The average makespan of each Map worker node is com-puted as

997888rarrT119872997888rarr =

997888rarrT119872

SJM

(10)

Themakespan of the PMRMap phase consisting ofSJM

=119902worker nodes is shown in Figure 2 of the paper Ascertainingbounds ofmakespan ieT119906119887119872 andT119897119887119872 is shown in the figure

The Reduce workers are initiated when at least two Mapworker nodes have finished their computational tasks The

6 BioMed Research International

MAP WORKER 1

Makespan Time (t)

MAP WORKER 2

MAP WORKER 3

MAP WORKER 4

MAP WORKER 5

MAP WORKER 6

minus

minus

minus

MAP WORKER q-2

MAP WORKER q-1

MAP WORKER q

lbM

ubM

(lbM minus ub

M)

Figure 2 Map phase makespan of PMR framework

Reduce phase is initiated at (T119906119887119872 minusT119897119887119872) time instance Inter-mediate data generated by Map worker nodes is processedusing the Shuffle Sort and Reduce functions defined Aver-age execution timeR997888rarr andmaximum execution timeRuarr ofthe Reduce phase considering SJ

Rworkers are derived using

(3) Makespan bounding of the Reduce phase is computed(the lower bound is represented asT119897119887119877 and the upper boundis represented asT119906119887119877 ) as follows

T119897119887119877 = 119883J

R∙R997888rarrS

J

R

(11)

T119906119887119877 = (119883J

Rminus 1) ∙R997888rarr

SJ

R+Ruarr

(12)

The makespan of theJ119905ℎ job on the 119875119872119877 framework is asum of time taken to execute Map tasks and time taken to ex-ecute Reduce tasks Considering the best case scenario (lowerbound) the minimum makespan observed is

T119897119887J = T

119897119887119872 +T

119897119887119877 minus (T119906119887119872 minusT

119897119887119872) (13)

Simplifying (13) we get

T119897119887J = 2T119897119887119872 +T

119897119887119877 minusT

119906119887119872 (14)

Considering the worst computing performance theupper bound or maximum makespan observed is

T119906119887J = T

119906119887119872 +T

119906119887119877 minus (T119906119887119872 minusT

119897119887119872) (15)

T119906119887J = T

119897119887119872 +T

119906119887119877 (16)

Themakespan of jobJ on the119875119872119877 framework is definedas

997888rarrTJ = (T119906119887J +T119897119887J)

2 (17)

Using (14) and (16) makespan J is

997888rarrTJ = ((T119897119887119872 +T119906119887119877 ) + (2T119897119887119872 +T119897119887119877 minusT119906119887119872))

2= (3T119897119887119872 +T119897119887119877 +T119906119887119877 minusT119906119887119872)

2 (18)

313ModelingDataDependency onMakespan According to[30 31] data dependency can bemodeled using linear regres-sion A similar approach is adopted here The average make-span of the 119909119905ℎ Map worker node is defined as

M119909997888rarr = V

1198720 +

SJ

Msum119908=1

(V119872w (1198631015840119901 )) (19)

where V119872lowast represent variables that are application specificie they are dependent on the Map user function

The average makespan of the 119909119905ℎ Reduce worker node isdefined as

R119909997888rarr = V

1198770 +

SJ

Rsum119908=1

(V119877w (1198891015840119901 )) (20)

where V119877lowast represent variables specific to the user definedReduce functions and 1198891015840represents intermediate output dataobtained from the Map phase For parallel execution and toutilize all resources it is further split similar to theMap phase

On similar lines the average and maximum executiontimes of Map and Reduce workers are computed Data de-pendent computations of M997888rarrMuarrR997888rarrRuarr are used in(14) (16) and (18) to compute makespan of the J119905ℎ job onthe 119875119872119877 framework considering data 119863 Additional detailsof data dependency modeling using linear regression arepresented in [33] The proof of the model is also presentedin [33]

BioMed Research International 7

4 Performance Evaluation

Experiments conducted to evaluate the performance of 119875119872119877are presented in this section Performance of 119875119872119877 is com-pared with the state-of-the-art Hadoop framework Hadoopis the most widely usedadopted MapReduce platform forcomputing in cloud environments [34] hence it is consid-ered for comparisons The 119875119872119877 framework is developedusing VC++ C and Nodejs and deployed on the Azurecloud Hadoop 2 ie version 26 is used and deployedon the Azure cloud using HDInsight The 119875119872119877 frameworkis deployed consisting of one master node and 4 workernodes Each worker node is deployed on A3 virtual machineinstances Each A3 VM instance consists of 4 virtual com-puting cores 7 GB of RAM and 120 GB of local hard drivespaceThe Hadoop platform deployed for evaluation consistsof one master and 4 worker nodes in the cluster Uniformconfiguration of 119875119872119877 and Hadoop frameworks on Azurecloud is considered

Biomedical applications characterized by processing ofmassive amounts of genetic data are considered in the ex-periments for performance evaluation A computationallyheavy biomedical application namely BLAST [35] CAP3[30] and state-of-the-art recent DeepBind [27] is adoptedfor evaluation All the genomic sequences considered for theexperimental analyses are obtained from the publicly avail-able NCBI database [36] For comprehensive performanceevaluations the authors have considered various applicationscenarios In experiments conducted using BLAST both theMap and Reduce phases are involved In CAP3 applicationthe Map phase plays a predominant role In DeepBind theReduce phase is critical for analysis

41 BLAST Gene sequence alignment is a fundamentaloperation adopted to identify similarities that exist betweena query protein sequence DNA or RNA and a database ofsequences maintained Sequence alignment is computation-ally heavy and its computation complexity is relative to theproduct of two sequences being currently analyzed Massivevolumes of sequences maintained in the database to besearched induce an additional computation burden BLAST isa widely adopted bioinformatics tool for sequence alignmentwhich performs faster alignments at the expense of accuracy(possibly missing some potential hits) [35]The drawbacks ofBLASTand its improvements are discussed in [28] For evalu-ation here the improved BLAST algorithm of [28] is adoptedTo improve computation time a heuristic strategy is usedcompromising accuracy minimally In the heuristic strategyan initial match is found and is later extended to obtain thecomplete matching sequence

A three-stage approach is adopted in BLAST for sequencealignment Query sequence is represented using q andreference sequence as r Sequences q and r are said to con-sist of 119896minuslength subsequences known as 119896 minus 119898119890119903119904 In theinitial stage also known as the 119896 minus 119898119890119903 match stage BLASTconsiders each of the 119896 minus 119898119890119903119904 of q and r and searches for119896minus119898119890119903119904 that match in both This process is repeated to builda scanner of all 119896minusletter words in query q Then BLASTsearches reference genomer by using the scanner built to find

119896 minus 119898119890119903119904 of r matches with query q and these matches are119904119890119890119889119904 of potential hitsIn the second stage also known as the ungapped align-

ment stage every seed identified previously is 119890119909119905119890119899119889119890119889 inboth directions respectively to include matches and mis-matches A match is found if nucleotides in q and r are thesame A mismatch occurs if varied nucleotides are observedin q and r The mismatch reduces the score and matchesincrease the score of candidate sequence alignment Thepresent score of sequence alignment 119886 and the highest scoreobtained for present seed 119886uarr are retained The second phaseis terminated if 119886uarr minus 119886 is higher than the predefined X-dropthreshold ℎ119909 and returns with the highest alignment scoreof the present seed The alignment is passed to stage three ifthe returned score is higher than the predefined ungappedthreshold 119867119910 The thresholds predefined establish accuracyof alignment scores in BLAST Computational optimizationis achieved by skipping seeds already available in previousalignments The initial two phases of BLAST are executed inthe Map workers of 119875119872119877 and Hadoop

In stage three gapped alignment is performed in theleft and right directions where deletion and insertion areperformed during extension of alignments The same as theprevious stage the highest score of alignment 119886uarr is kept andif the present score 119886 is lower than 119886uarr by more than the X-drop threshold the stage is terminated and the correspondingalignment outcome is obtained Gap alignment operation iscarried out in the Reduce phase of 119875119872119877 and Hadoop frame-work The schematic of BLAST algorithm on 119875119872119877 frame-work is shown in Figure 3

Experiments conducted to evaluate performance of 119875119872119877and Hadoop considered the Drosophila database as a refer-ence database The query genomics of varied sizes consideredis from Homo sapiens chromosomal sequences and genomicscaffolds A total of six different query sequences are con-sidered similar to [28] Configuration of each experiment issummarized in Table 1 All six experiments are conductedusing BLAST algorithm on Hadoop and 119875119872119877 frameworksAll observations retrieved through a set of log files generatedduring the Map and Reduce phases of Hadoop and 119875119872119877are noted and stored for further analysis Using the log filestotal makespan Map worker makespan and Reduce workermakespan of Hadoop and 119875119872119877 is noted for each experimentIt must be noted that the initialization time of the VM clusteris not considered in the computing makespan as it is uniformin 119875119872119877 and Hadoop owing to similar cluster configurations

Individual task execution times of Map worker andReduce worker nodes observed for each BLAST experimentexecuted on Hadoop and 119875119872119877 frameworks are graphicallyshown in Figure 4 Figure 4(a) represents results obtained forHadoop and Figure 4(b) represents results obtained on119875119872119877Execution times of Map workers in 119875119872119877 and Hadoop aredominantly higher than Reduce worker times This is due tothe fact that major computation intensive phases (ie Phase1 and Phase 2) of BLAST sequence alignment application arecarried out in the Map phase Parallel execution of BLASTsequence alignment utilizing all 4 cores available with eachMap worker node adopted in 119875119872119877 results in lower executiontimeswhen compared toHadoopMapworker nodes Average

8 BioMed Research International

Query sequence q(with overlap)

Referencedatabase sequence (R)

Local Memory

hellip

PMR Platform

Local Memory

hellip

PMR Platform

Alignment aggregation

Cloud storage

Shuffle Sort Reduce

Virtual Machine ndash Reduce Worker 1Shuffle Sort Reduce

Virtual Machine ndash Reduce Worker w

Virtual Machine ndash Map Worker wVirtual Machine ndash Map Worker 1

Blast alignment result

Temporary Cloud storage

Cloud storage

PMR Platform PMR Platform

PMR-Master Node

Blast Blast

q1R1 q1R2 q1R qR1 qR2 qR

Figure 3 BLAST PMR framework

Table 1 Information of the Genome Sequences used as queries considering equal section lengths from Homo sapiens chromosome 15 as areference

Experiment Id Query genome Query genome size(bp) Database sequence Reference genome size

(bp)1 NT 007914 14866257 Drosophila database 1226539772 AC 000156 19317006 Drosophila database 1226539773 NT 011512 33734175 Drosophila database 1226539774 NT 033899 47073726 Drosophila database 1226539775 NT 008413 43212167 Drosophila database 1226539776 NT 022517 90712458 Drosophila database 122653977

reduction of execution time in Map workers of 119875119872119877 is3419 3415 3478 3529 3576 and 3987 inexperiments conducted when compared to Hadoop Mapworker average execution times As the query size increasesperformance improvement of 119875119872119877 increases Parallel execu-tion strategy of Reduce worker nodes proposed in 119875119872119877 isclearly visible in Figure 4(b) In other words Reduce workersare initiated as soon as two or more Map worker nodes havecompleted their tasks The execution time of Reduce workernodes in 119875119872119877 is marginally higher than those of HadoopWaiting for all Map worker nodes to complete their tasks is aprimary reason for the marginal increase in Reduce workerexecution times in 119872119877 Sequential processing ie Mapworkers first and then Reduce worker execution of workernodes in Hadoop framework is evident from Figure 4(a)

The total makespan of119875119872119877 and Hadoop is dependent ontask execution time of worker nodes during the Map phaseand Reduce phase The total makespan observed in BLASTsequence alignment experiments executed on Hadoop and119875119872119877 frameworks is shown in Figure 5 Superior performancein terms of Reduce makespan times of 119875119872119877 is evident when

compared to HadoopThough a marginal increase in Reduceworker execution time is reported overall execution time ietotal makespan of 119875119872119877 is less when compared to HadoopA reduction of 2923 3061 3278 3317 3333 and3823 is reported for six experiments executed on the119875119872119877 framework when compared to similar experimentsexecuted on Hadoop framework Average reduction of thetotal makespan across all experiments is 3289 provingsuperior performance of 119875119872119877 when compared to Hadoopframework

Theoretical makespan of 119875119872119877 119894119890 J given by (18)is computed and compared against the practical valuesobserved in all the experiments Results obtained are shownin Figure 6 Minor variations are observed between practicaland theoretical makespan computations Overall good cor-relation is reported between practical makespan values andtheoretical makespan values Based on the results presentedit is evident that execution of BLAST sequence alignmentalgorithm on the proposed 119875119872119877 yields superior resultswhen compared to similar experiments conducted on theexisting Hadoop framework Accuracy and correctness of the

BioMed Research International 9

0 50 100 150 200

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - Hadoop (Experiment 3)

0 10 20 30 40 50 60 70R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - Hadoop (Experiment 1)

0 20 40 60 80 100 120

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - Hadoop (Experiment 2)

0 50 100 150 200 250

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - Hadoop (Experiment 4)

0 50 100 150 200 250 300

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - Hadoop (Experiment 5)

0 50 100 150 200 250 300 350 400

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution Considering Hadoop (Experiment 6)

(a) Worker node execution times onHadoop frame-work

0 20 40 60 80 100 120 140

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution -PMR (Experiment 3)

0 10 20 30 40 50R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution -PMR (Experiment 1)

0 10 20 30 40 50 60 70 80

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution -PMR (Experiment 2)

0 20 40 60 80 100 120 140 160

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - PMR (Experiment 4)

0 50 100 150 200

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - PMR (Experiment 5)

0 50 100 150 200 250 300

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution Considering PMR (Experiment 6)

(b) Worker node execution times on 119875119872119877 frame-work

Figure 4 BLAST sequence alignment execution makespan of the Map and Reduce worker nodes (a) On Hadoop cluster of 4 nodes (b) On119875119872119877 cluster of 4 nodes

10 BioMed Research International

1 2 3 4 5 6Experiment Number

HadoopPMR

050

100150200250300350400450

Exec

utio

n Ti

me

(s)

Figure 5 BLAST sequence alignment total makespan time observed for experiments conducted on 119875119872119877 and Hadoop frameworks

theoretical makespan model of 119875119872119877 presented are provedthrough correlation measures

42 CAP3 DNA sequence assembly tools are used in bioin-formatics for gene discovery and understanding genomesof existingnew organisms CAP3 is one such popular toolused to assemble DNA sequences DNA assembly is achievedby performing merging and aligning operations on smallersequence fragments to build complete genome sequencesCAP3 eliminates poor sections observed within DNA frag-ments computes overlaps amongst DNA fragments is capa-ble of identifying false overlaps eliminating false overlapsidentified accumulates fragments of multiple or one overlap-ping DNA segment to produce contigs and performs mul-tiple sequence alignments to produce consensus sequencesCAP3 reads multiple gene sequences from an input FASTAfile and generates output consensus sequences written tomultiple files and also to standard outputs

The CAP3 gene sequence assembly working principleconsists of the following key stages Firstly the poor regionsof 31015840 (three-prime) and 51015840 (five-prime) of each read areidentified and eliminated False overlaps are identified andeliminated Secondly to form contigs reads are combinedbased on overlapping scores in descending order Furtherto incorporate modifications to the contigs constructedforward-reverse constraints are adopted Lastly numerous se-quence alignments of reads are constructed per contig result-ing in consensus sequences characterized by a quality valuefor each base Quality values of consensus sequences are usedin construction of numerous sequence alignment operationsand also in computation of overlaps Operational steps ofCAP3 assembly model are shown in Figure 7 A detailed ex-planation of the CAP3 gene sequence assembly is providedin [30]

In the experiments conducted, CAP3 gene sequence assembly is directly adopted in the Map phase of PMR and Hadoop; in the Reduce phase, result aggregation is considered. For performance evaluation of CAP3 execution on the PMR and Hadoop frameworks, Homo sapiens chromosome 15 is considered as a reference. Genome sequences of various sizes are considered as queries and submitted to the Azure cloud platform in the experiments. Query sequences for the experiments are considered in accordance with [30]. CAP3 experiments conducted with query genomic sequences (BAC datasets) are summarized in Table 2. All four experiments are conducted using the CAP3 algorithm on the Hadoop and PMR frameworks. Observations are retrieved through a set of log files generated during Map and Reduce phase execution on Hadoop and PMR. Using the log files, the total makespan, Map worker makespan, and Reduce worker makespan of Hadoop and PMR are noted for each experiment. It must be noted that the initialization time of the VM cluster is not considered in computing the makespan, as it is uniform in PMR and Hadoop owing to similar cluster configurations.
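For illustration, per-worker makespans of the kind reported below can be accumulated from such log files. The log line format assumed here is hypothetical, as the actual log schema of the frameworks is not specified in the paper.

import re
from collections import defaultdict

# Hypothetical log line: "MW1 task map took 310.5s"
LINE = re.compile(r"(?P<worker>[MR]W\d+)\s+task\s+(?P<phase>map|reduce)\s+took\s+(?P<secs>\d+(\.\d+)?)s")

def worker_makespans(log_lines):
    """Sum task durations per worker from the (assumed) log format."""
    totals = defaultdict(float)
    for line in log_lines:
        match = LINE.search(line)
        if match:
            totals[match.group("worker")] += float(match.group("secs"))
    return dict(totals)

sample = ["MW1 task map took 310.5s", "MW2 task map took 298.0s",
          "RW1 task reduce took 88.2s"]
print(worker_makespans(sample))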

Task execution times of Map and Reduce worker nodes observed for CAP3 experiments conducted on the Hadoop and PMR frameworks are shown in Figure 8. Figure 8(a) represents results obtained for Hadoop and Figure 8(b) represents results obtained on PMR. Execution times of Map worker nodes are far greater than execution times of Reduce worker nodes, as CAP3 algorithm execution is carried out in the Map phase and result accumulation is considered in the Reduce phase. The parallel execution strategy (utilizing the 4 computing cores available with each Map worker) of the CAP3 algorithm on PMR enables lower execution times when compared to Hadoop Map worker nodes. Average reductions of execution time in Map workers of PMR of 19.33%, 20.07%, 15.09%, and 18.21% are reported in the CAP3 experiments conducted, when compared to Hadoop Map worker average execution times. In PMR, the Reduce workers are initiated as soon as two or more Map worker nodes have completed their tasks, which is visible from Figure 8(b). The sequential processing strategy (i.e., Map workers first and then Reduce workers) of worker nodes in the Hadoop framework is evident from Figure 8(a). Execution time of Reduce worker nodes in PMR is marginally higher, by about 15.42%, than those of Hadoop. Waiting for all Map worker nodes to complete their tasks is a primary reason for the marginal increase in Reduce worker execution times in PMR.
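The early-start rule described above can be sketched with a thread pool standing in for worker VMs: Reduce work is submitted as soon as two Map futures have resolved, rather than after the whole Map phase as in Hadoop. This is a simplified single-process illustration with placeholder task functions, not the PMR implementation.

from concurrent.futures import ThreadPoolExecutor, as_completed

def run_map(chunk):                      # placeholder Map task
    return [(token, 1) for token in chunk]

def run_reduce(pairs):                   # placeholder Reduce task
    return len(pairs)

chunks = [["a", "b"], ["b", "c"], ["c", "d"], ["d", "e"]]
pending, reduces = [], []
with ThreadPoolExecutor(max_workers=4) as pool:   # threads stand in for worker VMs
    map_futures = [pool.submit(run_map, c) for c in chunks]
    for finished in as_completed(map_futures):
        pending.append(finished.result())
        if len(pending) >= 2:            # PMR trigger: two or more Maps done
            batch, pending = pending, []
            reduces.append(pool.submit(run_reduce, sum(batch, [])))
    if pending:                          # flush any leftover Map output
        reduces.append(pool.submit(run_reduce, sum(pending, [])))
    print([r.result() for r in reduces])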

The total makespan observed in CAP3 experiments executed on the Hadoop and PMR frameworks is presented in Figure 9. Superior performance in terms of reduced makespan times of PMR is evident when compared to Hadoop. Though a marginal increase in Reduce worker execution time


Table 2: Information of the genome sequences used in CAP3 experiments.

Experiment Number | Dataset | GenBank accession number | Number of reads | Average length of reads (bp) | Length of provided sequences (bp)
1 | 203    | AC004669 | 1812 | 598  | 89779
2 | 216    | AC004638 | 2353 | 614  | 124645
3 | 322F16 | AF111103 | 4297 | 1011 | 159179
4 | 526N18 | AF123462 | 3221 | 965  | 180182


Figure 6: Correlation between theoretical and practical makespan times for BLAST sequence alignment execution on the PMR framework.

[Figure 7 flowchart: removal of poor regions of each read → computation of overlaps among reads → removal of wrongly identified overlaps → contig construction → generation of consensus sequences and construction of multiple sequence alignments.]

Figure 7: Steps for CAP3 sequence assembly.

is reported, the overall execution time, i.e., the total makespan, of PMR is less when compared to Hadoop. Reductions of 18.97%, 20.00%, 15.03%, and 18.01% are reported for the four experiments executed on the PMR framework when compared to similar experiments executed on the Hadoop framework. The average reduction of the total makespan across all experiments, (18.97 + 20.00 + 15.03 + 18.01)/4 ≈ 18.00%, proves the superior performance of PMR when compared to the Hadoop framework. The makespan time for experiment 2 is greater than in the other experiments, as the number of differences considered in CAP3 is 17, larger than the values considered in the other experiments. A similar nature of execution times is reported in [29], validating the CAP3 execution experiments presented here.

The theoretical makespan of PMR for all four CAP3 experiments is computed using (18). A comparison between theoretical and experimental makespan values is presented in Figure 10. Minor differences are reported between practical and theoretical makespan computations, proving the correctness of the PMR makespan modeling presented.
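For reference, the computation behind these theoretical values can be written directly from the phase bounds of (7)-(12) and the job makespan of (18); the task counts and timings below are illustrative placeholders, not the experimental parameters.

def phase_bounds(num_tasks, avg_t, max_t, num_workers):
    """Lower/upper makespan bounds of a phase (eqs. (7)-(8) for Map, (11)-(12) for Reduce)."""
    lb = num_tasks * avg_t / num_workers
    ub = (num_tasks - 1) * avg_t / num_workers + max_t
    return lb, ub

def pmr_makespan(map_bounds, reduce_bounds):
    """Eq. (18): overlapped Map/Reduce makespan of a job on PMR."""
    lb_m, ub_m = map_bounds
    lb_r, ub_r = reduce_bounds
    return (3 * lb_m + lb_r + ub_r - ub_m) / 2

# Illustrative job: 16 Map tasks on 4 Map workers, 8 Reduce tasks on 4 workers.
print(pmr_makespan(phase_bounds(16, 60.0, 75.0, 4),
                   phase_bounds(8, 20.0, 30.0, 4)))   # -> 262.5 s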

The results presented in this section prove that CAP3 sequence assembly execution on the developed PMR cloud framework exhibits superior performance when compared to similar CAP3 experiments executed on the existing Hadoop cloud framework.

4.3. DeepBind Analysis to Identify Binding Sites. In recent times, deep learning techniques have been extensively used for various applications. Deep learning techniques are adopted predominantly when large amounts of data are to be processed or analyzed. To meet the large computing needs of deep learning techniques, GPUs are used. Motivated by this, the authors of the paper consider the very recent state-of-the-art "DeepBind" biomedical application for execution on a cloud platform. To the best of our knowledge, no such attempt to consider cloud platforms for DeepBind execution has been reported.

Biomedical operations such as alternative splicing, transcription, and gene regulation are dependent on DNA- and RNA-binding proteins. DNA- and RNA-binding proteins described using sequence specificities are critical in identifying diseases and deriving models of regulatory processes that occur in biological systems. Position weight matrices are used in characterizing the specificities of a protein. Binding sites on genomic sequences are identified by scanning position weight matrices over the considered genomic sequences. DeepBind is used to predict sequence specificities and adopts deep convolutional neural networks to achieve accurate prediction. Comprehensive details and the sequence specificity


[Figure 8 panels: bar charts of per-worker task execution times (x-axis: Makespan Time (s); y-axis: Map workers MW1-MW4 and Reduce workers RW1-RW4) for CAP3 Experiments 1-4 on Hadoop and on PMR. (a) Worker node execution times on Hadoop framework. (b) Worker node execution times on PMR framework.]

Figure 8: CAP3 sequence assembly execution makespan of the Map and Reduce worker nodes. (a) On a Hadoop cluster of 4 nodes. (b) On a PMR cluster of 4 nodes.

prediction accuracy of the DeepBind application are available in [27].
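To make the PWM scanning step mentioned above concrete, the sketch below slides a log-odds position weight matrix across a sequence and reports the best-scoring site. The matrix and sequence are toy values, and DeepBind itself replaces this fixed-matrix scoring with a learned convolutional network [27].

import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def scan_pwm(sequence: str, pwm: np.ndarray):
    """Return (best_score, offset) of the PWM over all windows of the sequence."""
    width = pwm.shape[1]
    scores = [
        sum(pwm[BASES[base], j] for j, base in enumerate(sequence[i:i + width]))
        for i in range(len(sequence) - width + 1)
    ]
    best = int(np.argmax(scores))
    return scores[best], best

# Toy 3-column PWM (rows A, C, G, T) as log-odds vs. a uniform background.
pwm = np.log2(np.array([[0.80, 0.10, 0.70],
                        [0.10, 0.10, 0.10],
                        [0.05, 0.70, 0.10],
                        [0.05, 0.10, 0.10]]) / 0.25)
print(scan_pwm("TTGAGATTGA", pwm))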

DeepBind is developed using a two-phase approach: a training phase and a testing phase. Training phase execution is carried out using Map workers in the Hadoop and PMR frameworks. The trained weights are stored in cloud memory for further processing. The testing phase of DeepBind is carried out in the Reduce stage of the Hadoop and PMR frameworks. The execution strategy of the DeepBind algorithm on the PMR framework is shown in Figure 11. The DeepBind application is developed using the code provided in [27]. For performance evaluation on Hadoop and PMR, only the testing phase is discussed (i.e., Reduce-only mode). A custom cloud cluster of one master node and six worker nodes is deployed for DeepBind performance evaluation; a similar cloud cluster for the Hadoop framework is considered. The experiment


Table 3: Information of the disease-causing genomic variants used in the experiment (Experiment 1).

Case study | Genome variant | Experiment details
1 | SP1    | A disrupted SP1 binding site in the LDL-R promoter that leads to familial hypercholesterolemia
2 | TCF7L2 | A cancer risk variant in a MYC enhancer weakens a TCF7L2 binding site
3 | GATA1  | A gained GATA1 binding site that disrupts the original globin cluster promoters
4 | GATA4  | A lost GATA4 binding site in the BCL-2 promoter potentially playing a role in ovarian granulosa cell tumors
5 | RFX3   | Loss of two potential RFX3 binding sites leads to abnormal cortical development
6 | GABPA  | Gained GABP-α binding sites in the TERT promoter which are linked to several types of aggressive cancer


Figure 9: CAP3 sequence assembly total makespan time observed for experiments conducted on PMR and Hadoop frameworks.


Figure 10: Correlation between theoretical and practical makespan times for CAP3 sequence assembly execution on the PMR framework.

conducted to evaluate the performance of DeepBind on the Hadoop and PMR frameworks considers a set of six disease-causing genomic variants obtained from [27]. The disease-causing genomic variants to be analyzed using DeepBind are summarized in Table 3. DeepBind analysis is executed on the Hadoop and PMR frameworks deployed on a custom cloud cluster. Log data generated is stored and used in further analysis.

The results obtained to demonstrate the performance of the six worker cluster nodes of Hadoop and PMR during Map and Reduce phase execution are shown in Figure 12. Performance is presented in terms of task execution times observed per worker node. Considering Hadoop worker nodes, the execution time of each node during the Map and Reduce phases is shown in Figure 12(a). The execution time observed for each PMR worker node during the Map and Reduce phases is shown in Figure 12(b). In the Map phase execution, the genomic variants to be analyzed are obtained from the cloud storage and are accumulated based on their defined identities [27]. The query sequences of disease-causing genomic variants to be analyzed are split for parallelization. In the Reduce phase, the split query sequences are analyzed, and the results obtained are accumulated and stored in the cloud storage. Map workers in PMR exhibit


[Figure 11 diagram: the PMR master node distributes the query sequence q and training sequences to Map worker VMs, each running DeepBind on the PMR platform with local memory; trained data (weights) are written to cloud storage; Reduce worker VMs perform Shuffle, Sort, and Reduce to compute binding scores, with results passed through temporary cloud storage to cloud storage.]

Figure 11: DeepBind PMR framework.

[Figure 12 panels: bar charts of per-worker task execution times (x-axis: Makespan Time (s); y-axis: Map workers MW1-MW6 and Reduce workers RW1-RW6) for DeepBind Experiment 1. (a) Worker node execution times on Hadoop framework. (b) Worker node execution times on PMR framework.]

Figure 12: DeepBind execution makespan of the Map and Reduce worker nodes. (a) On a Hadoop cluster of 6 nodes. (b) On a PMR cluster of 6 nodes.

better performance, and an average execution time reduction of 48.57% is reported when compared to Hadoop Map worker nodes. The execution time of the six Reduce worker nodes in Hadoop and PMR is greater than that of the Map workers, as DeepBind analysis and identification of potential binding sites are carried out during this phase. The parallel execution strategy of Reduce worker nodes is clear from Figure 12(b). The Reduce phase in PMR commences after 5 seconds, once Map worker node 1 (MW1) and Map worker node 5 (MW5) have completed their tasks. In Hadoop, which adopts a sequential approach, the Reduce phase is initiated only after all worker nodes have completed their tasks. Parallel execution of DeepBind analysis utilizing all 4 computing cores available with Reduce worker nodes and parallel initiation of the Reduce phase in PMR enable an average Reduce execution time reduction of 22.22% when compared to Hadoop Reduce worker nodes. The total makespan observed for DeepBind experiment execution on the Hadoop and PMR cloud computing platforms is shown in Figure 13. A total makespan reduction of 34.62% is achieved using the PMR framework when compared to the Hadoop framework. Analysis results similar to [27] are reported for DeepBind analysis on the Hadoop and PMR frameworks. The theoretical makespan computed using (18) for PMR is compared with the practical value observed in the experiment, and the results obtained are shown in Figure 14. A minor variation between theoretical and practical values is observed; the variation is predominantly due to application-dependent multiple cloud memory access operations. Based on the results obtained for DeepBind analysis, it is evident that performance on the PMR framework is far superior to its execution on the existing Hadoop framework.

On the basis of the biomedical applications considered for performance evaluation and the results obtained, it is evident that the proposed PMR framework exhibits superior performance



Figure 13: DeepBind analysis total makespan time observed for experiments conducted on PMR and Hadoop frameworks.


Figure 14: Correlation between theoretical and practical makespan times for DeepBind analysis execution on the PMR framework.

when compared to its existing Hadoop counterpart. In BLAST, both the Map and Reduce phases are utilized. In the CAP3 application, the Map phase plays a predominant role. In the DeepBind application, analysis is carried out in the Reduce phase. The proposed PMR cloud computing framework is robust, is capable of handling the dynamic biomedical application scenarios presented, and supports deployment on public and custom cloud platforms. In addition, PMR exhibits low execution times and enables effective cloud resource utilization. Low execution times enable cost reduction, always a desired feature.

5. Conclusion and Future Work

The significance of cloud computing platforms is discussed. The working of the commonly adopted Hadoop MapReduce framework, along with its drawbacks, is presented. To lower execution times and enable effective utilization of cloud resources, this paper proposes the PMR cloud computing platform. A parallel execution strategy of the Map and Reduce phases is considered in the PMR framework. The Map and Reduce functions of PMR are designed to utilize the multicore environments available with worker nodes. The paper presents the proposed PMR framework architecture along with makespan modeling. Performance of the PMR cloud computing framework is compared with the Hadoop framework. For performance evaluation, computationally heavy biomedical applications like BLAST, CAP3, and DeepBind are considered. Average overall makespan time reductions of 38.92%, 18.00%, and 34.62% are achieved using the PMR framework when compared to the Hadoop framework for the BLAST, CAP3, and DeepBind applications. The experiments presented prove the robustness of the PMR platform, its capability to handle diverse applications, and its ease of deployment on public and private cloud platforms. The results presented through the experiments conducted prove the superior performance of PMR against the Hadoop framework. Good matching is reported between the theoretical makespan of PMR presented and the experimental values observed. In addition, adopting the PMR cloud computing framework also enables cost reduction and efficient utilization of cloud resources.

A performance study considering cloud clusters with many nodes, additional applications, and security provisioning for the cloud computing framework is considered as the future work of this paper.

Data Availability

The data is available at the National Center for Biotechnology Information (2015). [Online]. Available: http://www.ncbi.nlm.nih.gov.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (no. NRF-2015R1D1A1A01061328).

References

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 137-150, 2004.
[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, Boston, MA, June 2010.
[3] G. Malewicz, M. H. Austern, A. J. C. Bik et al., "Pregel: a system for large-scale graph processing," in Proceedings of the International Conference on Management of Data (SIGMOD '10), pp. 135-146, June 2010.
[4] S. Ghemawat, H. Gobioff, and S. Leung, "The Google file system," in Proceedings of the nineteenth ACM symposium, p. 29, Bolton Landing, NY, USA, October 2003.
[5] X. Shi, M. Chen, L. He et al., "Mammoth: gearing Hadoop towards memory-intensive MapReduce applications," IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 8, pp. 2300-2315, 2015.
[6] L. Person, "World Hadoop Market - Opportunities and Forecasts," Allied Market Research, p. 108, 2020.
[7] SNS Research, The Big Data Market 2014-2020: Opportunities, Challenges, Strategies, Industry Verticals and Forecasts, SNS Research, 2014.
[8] J. Zhu, J. Li, E. Hardesty, H. Jiang, and K.-C. Li, "GPU-in-Hadoop: enabling MapReduce across distributed heterogeneous platforms," in Proceedings of the 2014 13th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2014), pp. 321-326, China, June 2014.
[9] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proceedings of the Eighth USENIX Conference, pp. 29-42, 2008.
[10] L. D. Stein, "The case for cloud computing in genome informatics," Genome Biology, vol. 11, no. 5, article 207, 2010.
[11] D. P. Wall, P. Kudtarkar, V. A. Fusaro, R. Pivovarov, P. Patil, and P. J. Tonellato, "Cloud computing for comparative genomics," BMC Bioinformatics, vol. 11, no. 1, article 259, 2010.
[12] A. Rosenthal, P. Mork, M. H. Li, J. Stanford, D. Koester, and P. Reynolds, "Cloud computing: a new business paradigm for biomedical information sharing," Journal of Biomedical Informatics, vol. 43, no. 2, pp. 342-353, 2010.
[13] D. Dahiphale, R. Karve, A. V. Vasilakos et al., "An advanced MapReduce: cloud MapReduce, enhancements and applications," IEEE Transactions on Network and Service Management, vol. 11, no. 1, pp. 101-115, 2014.
[14] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good, "The cost of doing science on the cloud: the montage example," in Proceedings of the ACM/IEEE Conference on Supercomputing, IEEE Press, November 2008.
[15] N. Chohan, C. Castillo, M. Spreitzer, M. Steinder, A. Tantawi, and C. Krintz, "See spot run: using spot instances for MapReduce workflows," in Proceedings of the 2010 USENIX Conference on Hot Topics in Cloud Computing, p. 7, 2010.
[16] R. S. Thakur, R. Bandopadhyay, B. Chaudhary, and S. Chatterjee, "Now and next-generation sequencing techniques: future of sequence analysis using cloud computing," Frontiers in Genetics, vol. 3, 2012.
[17] J. Chen, F. Qian, W. Yan, and B. Shen, "Translational biomedical informatics in the cloud: present and future," BioMed Research International, vol. 2013, Article ID 658925, 8 pages, 2013.
[18] T. Nguyen, W. Shi, and D. Ruden, "CloudAligner: a fast and full-featured MapReduce based tool for sequence mapping," BMC Research Notes, vol. 4, article 171, 2011.
[19] M. C. Schatz, "CloudBurst: highly sensitive read mapping with MapReduce," Bioinformatics, vol. 25, no. 11, pp. 1363-1369, 2009.
[20] B. Langmead, K. D. Hansen, and J. T. Leek, "Cloud-scale RNA-sequencing differential expression analysis with Myrna," Genome Biology, vol. 11, no. 8, p. R83, 2010.
[21] B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L. Salzberg, "Searching for SNPs with cloud computing," Genome Biology, vol. 10, no. 11, article R134, 2009.
[22] A. A. Al-Absi and D. Kang, "A novel parallel computation model with efficient local memory management for data-intensive applications," in Proceedings of the 2015 IEEE 8th International Conference on Cloud Computing (CLOUD), pp. 958-963, New York City, NY, USA, June 2015.
[23] A. Al-Absi and Dae-Ki Kang, "Long read alignment with parallel MapReduce cloud platform," BioMed Research International, vol. 2015, Article ID 807407, 13 pages, 2015.
[24] E. Yoon and A. Squicciarini, "Toward detecting compromised MapReduce workers through log analysis," in Proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 41-50, Chicago, IL, USA, 2014.
[25] D. Dang, Y. Liu, X. Zhang, and S. Huang, "A crowdsourcing worker quality evaluation algorithm on MapReduce for big data applications," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 7, pp. 1879-1888, 2016.
[26] S. D. Tetali, M. Lesani, R. Majumdar, and T. Millstein, "MrCrypt: static analysis for secure cloud computations," in Proceedings of the 2013 28th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2013), pp. 271-286, USA, October 2013.
[27] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, "Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning," Nature Biotechnology, vol. 33, no. 8, pp. 831-838, 2015.
[28] K. Mahadik, S. Chaterji, B. Zhou, M. Kulkarni, and S. Bagchi, "Orion: scaling genomic sequence matching with fine-grained parallelization," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2014), pp. 449-460, USA, November 2014.
[29] J. Ekanayake, T. Gunarathne, and J. Qiu, "Cloud technologies for bioinformatics applications," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 6, pp. 998-1011, 2011.
[30] X. Huang and A. Madan, "CAP3: a DNA sequence assembly program," Genome Research, vol. 9, no. 9, pp. 868-877, 1999.
[31] Y. Wu, W. Guo, J. Ren, X. Zhao, and W. Zheng, "NO2: speeding up parallel processing of massive compute-intensive tasks," IEEE Transactions on Computers, vol. 63, no. 10, pp. 2487-2499, 2014.
[32] Z. Zhang, L. Cherkasova, and B. T. Loo, "Optimizing cost and performance trade-offs for MapReduce job processing in the cloud," in Proceedings of the IEEE/IFIP Network Operations and Management Symposium: Management in a Software Defined World (NOMS 2014), Poland, May 2014.
[33] K. Chen, J. Powers, S. Guo, and F. Tian, "CRESP: towards optimal resource provisioning for MapReduce computing in public clouds," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 6, pp. 1403-1412, 2014.
[34] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2009.
[35] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, vol. 215, no. 3, pp. 403-410, 1990.
[36] National Center for Biotechnology Information, 2015, http://www.ncbi.nlm.nih.gov.


Page 3: Parallel MapReduce: Maximizing Cloud Resource Utilization ...downloads.hindawi.com/journals/bmri/2018/7501042.pdf · MapReduce platform. Experimental outcomes show that theirmodelachievesaspeedupof.xovermpiBLASTwith-outcompromisingonaccuracy

BioMed Research International 3

analyzing the genomic structure and its evolutionary patternof large bioinformatics data [14ndash18] which is generated by theNext Generation Sequencing (NGS) technologies Variouscloud based bioinformatics applications have been developedto compute large bioinformatics data CloudAligner [18]CloudBurst [19] Myrna [20] and Crossbow [21] Cloud tech-nologies allow the user to compute bioinformatics applicationand charges the user based on their usage Reducing the com-putation cost in such environment is an area that needs to beconsidered during designing a bioinformatics computationmodel

Reducing execution times and effective resource utiliza-tion with minimal costs are always a desired feature of cloud-computing frameworks To achieve this goal a ParallelMapReduce (119875119872119877) framework is proposed in this paperThe119875119872119877 adopts a parallel execution strategy similar to the tech-nique presented in [13] In conventional MapReduce sys-tems the Map phase is executed first and then Reduce phaseexecution is considered In the proposed PMR Reduce phaseexecution is initiated in a parallel fashion as soon as twoor more Map worker nodes have completed their tasks Theadoption of such execution strategies enables reduction of un-utilizedworker resources To further reducemakespan paral-lel execution of the Map and Reduce functions is adopted uti-lizing multicore environments available with nodes A make-spanmodel to describe operations of the 119875119872119877 is presented infuture sections Bioinformatics applications are synonymouswith big data Processing of such computationally heavyapplications is considered on cloud platforms as investigatedin [22 23] Performance evaluation of the PMR framework iscarried out using bioinformatics applications Themajor con-tributions can be summarized as follows

(i) Makespanmodeling and design of119875119872119877 cloud frame-work

(ii) Parallel execution strategy of the Map and Reducephase

(iii) Maximizing cloud resource utilization by computingon multicore environments in Map and Reduce

(iv) Performance evaluation on state-of-the-art biomedi-cal applications like BLAST CAP3 and DeepBind

(v) Experiments considering diverse cloud configura-tions and varied application configuration

(vi) Correlation between theoreticalmakespanmodel andexperimental values

Beside the data management and computing issues thereexist numerous security issues and challenges in provisioningsecurity in cloud computing environment and in ensuringethical treatment of biomedical data When MapReduce iscarried out in distributed settings users maintain very littlecontrol over these computations causing several security andprivacy concerns MapReduce activities may be subverted orcompromised by malicious or cheating nodes Such secu-rity issues have been discussed and highlighted by manyresearchers as in [24ndash26]However addressing security issuesis beyond the scope of this paper

The paper organization is as follows In Section 2 therelated works are discussed In Section 3 the proposed 119875119872119877framework is presented The results and the experimentalstudy are presented in the penultimate section The conclud-ing remarks are discussed in the last section

2 Literature Review

D Dahiphale et al [13] presented a cloud based MapReducemodel to overcome the shortcomings of the HadoopMapRe-duce model which are as follows Hadoop processes the Mapand Reduce phases in a sequential manner scalability is notefficient due to cluster based computing mechanism process-ing of stream data is not supported and lastly it does notsupport flexible pricing To overcome the issue of sequentialexecution they proposed a cloud based Parallel MapReducemodel where the tasks are executed by the Amazon EC2instances (virtual machine (worker)) to process stream andbatch data in a parallel manner a pipeliningmodel is adoptedwhich provides flexible pricing by using an Amazon cloudSpot Instance Experiment result shows that the CMRmodelprocesses tasks in a parallel manner improves the through-put and shows a speedup improvement of 30 over theHadoop MapReduce model for larger datasets

X Shi et al [5] presented a framework for memory inten-sive computing to overcome the shortcomings of the HadoopMapReduce model In Hadoop tasks are executed based onthe available CPU cores and memory is allocated based ona preset configuration which lead to memory bottleneck dueto buffer concurrency and heavy disk seeks resulting in IOwait occupancy which further increases the makespan timeTo address this they presented a rule based heuristicmodel toprioritize memory allocation and revocation for global mem-ory management They presented a multithread approachfor which they developed disk access serialization multi-cache technique for efficient garbage collection in JVM Theexperimental study shows that execution of memory inten-sive computation time is improved by 40 over the HadoopMapReduce model

Babak Alipanahi et al [27] presented a model by adopt-ing deep learning techniques for DNA- and RNA-bindingprotein for pattern discovery The specificity of protein isgenerally described using position weight matrices (PWMs)and the learning sequence specificity in the high throughputmodel has the following challenges Firstly there is thevaried nature data from different sources For example chro-matin immunoprecipitation provides varying putativelybound sequence length of ranked list for each sequenceRNAcompete assay and protein binding microarray providea specificity coefficient and HT-SELEX produces a very highsimilarity sequence set Secondly each data provider has itsunique biases artifacts and limitation for which it needsto identify the pertinent specificity Lastly the data are inhuge size which requires a computation model to integrateall data from different sources To overcome these challengesthey presented a model namely DeepBind whose charac-teristics are as follows It is applicable for both sequence andmicroarray data and works well across different technologieswithout correcting for technology-specific biases Sequences

4 BioMed Research International

are processed in a parallel manner by using a graphics pro-cessing unit (GPU) which can train the predicting modelautomatically and canwithstand amodest degree of noisy andincorrectly labeled trained data Experiments are conductedon in vitro data for both training and testing which shows thatthe DeepBind model is a scalable modular pattern discoverytechnique based on deep learning which does not depend onapplication specific heuristics such as ldquoseed findingrdquo

K Mahadik et al [28] presented a parallelized BLASTmodel to overcome issues related to mpiBLAST which areas follows It segments the database and processes each shortquery in parallel but due to rapid growth of NGS it has re-sulted in increased size of sequences (long query sequences)which can bemillions of proteinnucleotide sequences whichlimits the mpiBLAST resulting in scheduling overhead andincreasing the makespan time The mpiBLAST task com-pletion time of short queries is faster as compared to largequeries which create improper load balancing among nodesTo address this they presented a parallel model of BLASTnamely ORION splitting individual queries into overlappingfragments to process large query sequences on the HadoopMapReduce platform Experimental outcomes show thattheirmodel achieves a speedup of 123x overmpiBLASTwith-out compromising on accuracy

J Ekanayake et al [29] presented a cloud based MapRe-duce model namely Microsoft DryadLINQ and Apache Ha-doop for bioinformatics applications and it was comparedwith the existing MPI frameworkThe pairwise Alu sequencealignment and CAP3 [30] application is considered To eva-luate the scheduling performance of these frameworks aninhomogeneous dataset is considered Their outcomes showthat two cloud frameworks have a significant advantage overMPI in terms of fault tolerance parallel execution by adopt-ing the MapReduce framework robustness and flexibilitysince MPI is memory based whereas the DryadLINQ andHadoop model is file oriented based Experimental analysisis conducted for varied sequence sizes and the result showsthat Hadoop performs better than DryadLINQ for inhomo-geneous data for both applications

Y Wu et al [31] presented an outliers based executionstrategy1198731198742 for computation intensive applications in orderto reduce the makespan and overhead of computation manyexisting approaches that adopt a MapReduce framework aresuitable for data intensive application since their schedulerstate is defined by IO status They designed a framework forcomputation intensive tasks by adopting instrumentation todetect task progress and automatic instrument point selectorto reduce overhead and finally for outlierrsquos detection withoutresorting to biased progress calculation K-means is adoptedThe1198731198742 framework is evaluated by using application CAP3and ImageMagick on both local cluster and cloud environ-ment Their threshold based outlier model improves the taskcompletion time by 25 with minimal overhead

3 The Proposed PMR Framework

The119875119872119877 framework incorporates similar functions availablein conventional MapReduce frameworks Accordingly theMap Shuffle (including Sort) and Reduce phases exist in

119875119872119877 For the sake of representation simplicity the ShuffleandReduce phases are cumulatively considered in theReducephase The Map phase takes input data for processing andgenerates a list of key pair values of result 119872(1198961198901199101 V1198861198971) 997888rarr119897(1198961198901199102 V1198861198972) This generated key 1198961198901199102 and list of differentvalues are integrated together and put into a reducer func-tion The reducer function takes intermediate key 1198961198901199102 andprocesses the values and generates a new set of values 119897(V1198861198973)

The 119875119872119877 job execution is performed on multiple virtualmachines forming a computing cluster where one is a masternode and the others are worker nodesslave nodes The mas-ter node distributes andmonitors tasks among worker nodesWorker nodes periodically send their resource utilizationdetails to the master node Master nodes schedule the taskbased on availability of worker resources

To minimize makespan of job execution and maximizeutilization of cloud resource (available with worker nodes)the proposed 119875119872119877 adopts a parallel execution strategy ieReduce phase execution is initiated in a parallel fashion assoon as two or more Map worker nodes have completedtheir tasks The worker nodes are considered to have morethan one computing core the 119875119872119877 framework presentsparallel execution of the Map and Reduce functions adoptedutilizing multicore environments and a makespan model ofthe proposed 119875119872119877 is described and presented in Section 31

The 119875119872119877 function is a combination of the Map task andReduce task The input dataset is split into uniform blocksized data called chunks and is distributed among the com-puting nodes In 119875119872119877 the chunk obtained is further split toparallelize execution of user defined Map and Reduce func-tions The user defined Map function is applied on the inputand intermediate output is generated which is input data forthe Reduce task The Reduce stage is a combination of twophases Shuffle and Reduce Output data which is generatedfrom the Map task is fed as an input in the Shuffle phase thealready completedMap task is shuffled and then sorted in thisphaseNow the sorted data is fed into the user definedReducefunction and the generated output is written back to cloudstorage

A Map function in terms of computation time and inputoutput data dependencies can be represented as a tuple

(997888997888997888rarr119878119894119911119890119894119899M 119878MMdarrM997888rarrMuarr) (1)

where997888997888997888rarr119878119894119911119890119894119899M is the average input data processed by each Mapworker Variables MdarrM997888rarr and Muarr represent the maxi-mum average and minimum computation time of the MapfunctionOutput of theMap function stored in the cloud to beprocessed by Reduce workers is represented as a ratio be-tween output and input data 119878M

Similarly the 119875119872119877 Reduce function is represented as

(119878M 119878119877 119877darr 119877997888rarr 119877uarr) (2)

where 119878M is output data of Map functions stored in the cloud(represented as a ratio) Output of the Reduce function or thetask assigned to 119875119872119877 is represented as 119878119877 (ratio of Reduce

BioMed Research International 5

output to input) Minimum average and maximum compu-tation times of the Reduce function are 119877darr 119877997888rarr and 119877uarr TheReduce stage incorporates Shuffle and Sort operations

Reducing execution time and minimizing cost of cloudusage are always desirable attributes In this paper a make-span model to describe operation of the 119875119872119877 framework ispresented Obtaining actual makespan times is very complexand is always a challenge A number of dependencies existlike hardware parameters network conditions cluster nodeperformance cloud storage parameters data transfer ratesetc in obtaining makespans The makespan model of 119875119872119877described below only considers functional changes incorpo-rated to improve performance in conventional MapReduceframeworks Modeling described below is based on workpresented in [32]

31 PMRMakespan Model

311 Preliminaries and Makespan Bound Establishment Themakespan function is computed as the time required to com-plete a job of input data size and number of resources whichis allocated to 119875119872119877 Let us consider a job 119869 to be executedon the 119875119872119877 cloud platform considering data119863 Let the cloudplatform have 119899 + 1 number of nodesworkers Each workeris said to have 119901 cores that can be utilized for computationOne worker node acts as the master node leaving 119899 numberof workers to perform Map and Reduce computation Job119869 is considered to be distributed and computed using an 119909number of Map and Reduce tasks The data 119863 is also accord-ingly split into 119909 chunks represented as 1198631015840 In conventionalMapReduce platforms 1198631015840 = (119863119899) PMR considers a similarapproach for computing 1198631015840 The time utilized to complete 119909tasks is represented asT1T2 T119909 In the proposed PMRframework the chunks 1198631015840 are further split into11986310158401015840 = (1198631015840119901)for parallel execution In PMR execution of the 119909119905ℎ taskT119909 =Tuarr = maxt1 sdot sdot sdot t119901 where t119901 represents execution of task onthe 119901119905ℎ core considering corresponding data 11986310158401015840

Average (120590) and maximum time (120573) duration taken by 119909tasks to complete job 119869 can be represented as

120590 = sum119909119895=1T119895119909

120573 = max119895

T119895 (3)

Let us consider an optimistic scenario that 119909 tasks areuniformly distributed among 119902 worker nodes (minimumtime taken to process (119909 times 120590) work) Overall the time takento compute these tasks is (119909 times 120590)119902 and it is the lower boundtime

119897119887 = 119909 times 120590119902 (4)

To compute the upper bound time a pessimistic scenariois considered where the longest processing task

larr997888T isin (T1

T2 T119909) with makespan of 120573 is the last processed task

Therefore the time taken before the last tasklarr997888T is upper

bounded as follows

(sum119909119895=1T119895)119902 le (119909 minus 1) times 120590

119902 (5)

Therefore the overall timespan for this longest tasklarr997888T

is upper bounded as ((119909 minus 1) times 120590)119902 + 120573 The probable jobmakespan range due to nondeterminism and scheduling isobtained by the difference lower bound and upper boundThis is a key factor when the time taken of the longest taskis trivial as compared to the overall makespan ie 120573 ≪(119909 times 120590119902)

119906119887 = (119909 minus 1) times 120590119902 + 120573 (6)

312 Makespan of a Job on the PMR Framework Let us con-sider a jobJ which is submitted to the 119875119872119877 framework thejobJ is split into 119883J

Mnumber of Map tasks and 119883J

119877 numberof Reduce tasks Let SJ

Mand S

J

Rrepresent the number of

Map and Reduce workers allocated for theJ119905ℎ jobTo compute the makespan of the Map tasks the lower

and upper bounds are computed Using (3) average (M997888rarr)and maximum (Muarr) makespan of Map tasks of job J arecomputed Using Muarrand M997888rarr computed in (4) the lowerbound of the Map phase ieT119897119887119872 is defined as

T119897119887119872 = 119883J

MtimesM997888rarr

SJM

(7)

Similarly the upper bound T119906119887119872 or the maximum execu-tion time of the Map phase in 119875119872119877 using (6) is defined as

T119906119887119872 = (119883J

Mminus 1) timesM997888rarr

SJM

+Muarr (8)

Considering the lower (T119897119887119872) and upper (T119906119887119872) boundscomputed the makespan of the Map phase in 119875119872119877 is com-puted as

997888rarrT119872 = (T119906119887119872 +T119897119887119872)

2 (9)

The average makespan of each Map worker node is com-puted as

997888rarrT119872997888rarr =

997888rarrT119872

SJM

(10)

Themakespan of the PMRMap phase consisting ofSJM

=119902worker nodes is shown in Figure 2 of the paper Ascertainingbounds ofmakespan ieT119906119887119872 andT119897119887119872 is shown in the figure

The Reduce workers are initiated when at least two Mapworker nodes have finished their computational tasks The

6 BioMed Research International

MAP WORKER 1

Makespan Time (t)

MAP WORKER 2

MAP WORKER 3

MAP WORKER 4

MAP WORKER 5

MAP WORKER 6

minus

minus

minus

MAP WORKER q-2

MAP WORKER q-1

MAP WORKER q

lbM

ubM

(lbM minus ub

M)

Figure 2 Map phase makespan of PMR framework

Reduce phase is initiated at (T119906119887119872 minusT119897119887119872) time instance Inter-mediate data generated by Map worker nodes is processedusing the Shuffle Sort and Reduce functions defined Aver-age execution timeR997888rarr andmaximum execution timeRuarr ofthe Reduce phase considering SJ

Rworkers are derived using

(3) Makespan bounding of the Reduce phase is computed(the lower bound is represented asT119897119887119877 and the upper boundis represented asT119906119887119877 ) as follows

T119897119887119877 = 119883J

R∙R997888rarrS

J

R

(11)

T119906119887119877 = (119883J

Rminus 1) ∙R997888rarr

SJ

R+Ruarr

(12)

The makespan of theJ119905ℎ job on the 119875119872119877 framework is asum of time taken to execute Map tasks and time taken to ex-ecute Reduce tasks Considering the best case scenario (lowerbound) the minimum makespan observed is

T119897119887J = T

119897119887119872 +T

119897119887119877 minus (T119906119887119872 minusT

119897119887119872) (13)

Simplifying (13) we get

T119897119887J = 2T119897119887119872 +T

119897119887119877 minusT

119906119887119872 (14)

Considering the worst computing performance theupper bound or maximum makespan observed is

T119906119887J = T

119906119887119872 +T

119906119887119877 minus (T119906119887119872 minusT

119897119887119872) (15)

T119906119887J = T

119897119887119872 +T

119906119887119877 (16)

Themakespan of jobJ on the119875119872119877 framework is definedas

997888rarrTJ = (T119906119887J +T119897119887J)

2 (17)

Using (14) and (16) makespan J is

997888rarrTJ = ((T119897119887119872 +T119906119887119877 ) + (2T119897119887119872 +T119897119887119877 minusT119906119887119872))

2= (3T119897119887119872 +T119897119887119877 +T119906119887119877 minusT119906119887119872)

2 (18)

313ModelingDataDependency onMakespan According to[30 31] data dependency can bemodeled using linear regres-sion A similar approach is adopted here The average make-span of the 119909119905ℎ Map worker node is defined as

M119909997888rarr = V

1198720 +

SJ

Msum119908=1

(V119872w (1198631015840119901 )) (19)

where V119872lowast represent variables that are application specificie they are dependent on the Map user function

The average makespan of the 119909119905ℎ Reduce worker node isdefined as

R119909997888rarr = V

1198770 +

SJ

Rsum119908=1

(V119877w (1198891015840119901 )) (20)

where V119877lowast represent variables specific to the user definedReduce functions and 1198891015840represents intermediate output dataobtained from the Map phase For parallel execution and toutilize all resources it is further split similar to theMap phase

On similar lines the average and maximum executiontimes of Map and Reduce workers are computed Data de-pendent computations of M997888rarrMuarrR997888rarrRuarr are used in(14) (16) and (18) to compute makespan of the J119905ℎ job onthe 119875119872119877 framework considering data 119863 Additional detailsof data dependency modeling using linear regression arepresented in [33] The proof of the model is also presentedin [33]

BioMed Research International 7

4 Performance Evaluation

Experiments conducted to evaluate the performance of 119875119872119877are presented in this section Performance of 119875119872119877 is com-pared with the state-of-the-art Hadoop framework Hadoopis the most widely usedadopted MapReduce platform forcomputing in cloud environments [34] hence it is consid-ered for comparisons The 119875119872119877 framework is developedusing VC++ C and Nodejs and deployed on the Azurecloud Hadoop 2 ie version 26 is used and deployedon the Azure cloud using HDInsight The 119875119872119877 frameworkis deployed consisting of one master node and 4 workernodes Each worker node is deployed on A3 virtual machineinstances Each A3 VM instance consists of 4 virtual com-puting cores 7 GB of RAM and 120 GB of local hard drivespaceThe Hadoop platform deployed for evaluation consistsof one master and 4 worker nodes in the cluster Uniformconfiguration of 119875119872119877 and Hadoop frameworks on Azurecloud is considered

Biomedical applications characterized by processing ofmassive amounts of genetic data are considered in the ex-periments for performance evaluation A computationallyheavy biomedical application namely BLAST [35] CAP3[30] and state-of-the-art recent DeepBind [27] is adoptedfor evaluation All the genomic sequences considered for theexperimental analyses are obtained from the publicly avail-able NCBI database [36] For comprehensive performanceevaluations the authors have considered various applicationscenarios In experiments conducted using BLAST both theMap and Reduce phases are involved In CAP3 applicationthe Map phase plays a predominant role In DeepBind theReduce phase is critical for analysis

41 BLAST Gene sequence alignment is a fundamentaloperation adopted to identify similarities that exist betweena query protein sequence DNA or RNA and a database ofsequences maintained Sequence alignment is computation-ally heavy and its computation complexity is relative to theproduct of two sequences being currently analyzed Massivevolumes of sequences maintained in the database to besearched induce an additional computation burden BLAST isa widely adopted bioinformatics tool for sequence alignmentwhich performs faster alignments at the expense of accuracy(possibly missing some potential hits) [35]The drawbacks ofBLASTand its improvements are discussed in [28] For evalu-ation here the improved BLAST algorithm of [28] is adoptedTo improve computation time a heuristic strategy is usedcompromising accuracy minimally In the heuristic strategyan initial match is found and is later extended to obtain thecomplete matching sequence

A three-stage approach is adopted in BLAST for sequencealignment Query sequence is represented using q andreference sequence as r Sequences q and r are said to con-sist of 119896minuslength subsequences known as 119896 minus 119898119890119903119904 In theinitial stage also known as the 119896 minus 119898119890119903 match stage BLASTconsiders each of the 119896 minus 119898119890119903119904 of q and r and searches for119896minus119898119890119903119904 that match in both This process is repeated to builda scanner of all 119896minusletter words in query q Then BLASTsearches reference genomer by using the scanner built to find

119896 minus 119898119890119903119904 of r matches with query q and these matches are119904119890119890119889119904 of potential hitsIn the second stage also known as the ungapped align-

ment stage every seed identified previously is 119890119909119905119890119899119889119890119889 inboth directions respectively to include matches and mis-matches A match is found if nucleotides in q and r are thesame A mismatch occurs if varied nucleotides are observedin q and r The mismatch reduces the score and matchesincrease the score of candidate sequence alignment Thepresent score of sequence alignment 119886 and the highest scoreobtained for present seed 119886uarr are retained The second phaseis terminated if 119886uarr minus 119886 is higher than the predefined X-dropthreshold ℎ119909 and returns with the highest alignment scoreof the present seed The alignment is passed to stage three ifthe returned score is higher than the predefined ungappedthreshold 119867119910 The thresholds predefined establish accuracyof alignment scores in BLAST Computational optimizationis achieved by skipping seeds already available in previousalignments The initial two phases of BLAST are executed inthe Map workers of 119875119872119877 and Hadoop

In stage three gapped alignment is performed in theleft and right directions where deletion and insertion areperformed during extension of alignments The same as theprevious stage the highest score of alignment 119886uarr is kept andif the present score 119886 is lower than 119886uarr by more than the X-drop threshold the stage is terminated and the correspondingalignment outcome is obtained Gap alignment operation iscarried out in the Reduce phase of 119875119872119877 and Hadoop frame-work The schematic of BLAST algorithm on 119875119872119877 frame-work is shown in Figure 3

Experiments conducted to evaluate performance of 119875119872119877and Hadoop considered the Drosophila database as a refer-ence database The query genomics of varied sizes consideredis from Homo sapiens chromosomal sequences and genomicscaffolds A total of six different query sequences are con-sidered similar to [28] Configuration of each experiment issummarized in Table 1 All six experiments are conductedusing BLAST algorithm on Hadoop and 119875119872119877 frameworksAll observations retrieved through a set of log files generatedduring the Map and Reduce phases of Hadoop and 119875119872119877are noted and stored for further analysis Using the log filestotal makespan Map worker makespan and Reduce workermakespan of Hadoop and 119875119872119877 is noted for each experimentIt must be noted that the initialization time of the VM clusteris not considered in the computing makespan as it is uniformin 119875119872119877 and Hadoop owing to similar cluster configurations

Individual task execution times of Map worker andReduce worker nodes observed for each BLAST experimentexecuted on Hadoop and 119875119872119877 frameworks are graphicallyshown in Figure 4 Figure 4(a) represents results obtained forHadoop and Figure 4(b) represents results obtained on119875119872119877Execution times of Map workers in 119875119872119877 and Hadoop aredominantly higher than Reduce worker times This is due tothe fact that major computation intensive phases (ie Phase1 and Phase 2) of BLAST sequence alignment application arecarried out in the Map phase Parallel execution of BLASTsequence alignment utilizing all 4 cores available with eachMap worker node adopted in 119875119872119877 results in lower executiontimeswhen compared toHadoopMapworker nodes Average

8 BioMed Research International

Query sequence q(with overlap)

Referencedatabase sequence (R)

Local Memory

hellip

PMR Platform

Local Memory

hellip

PMR Platform

Alignment aggregation

Cloud storage

Shuffle Sort Reduce

Virtual Machine ndash Reduce Worker 1Shuffle Sort Reduce

Virtual Machine ndash Reduce Worker w

Virtual Machine ndash Map Worker wVirtual Machine ndash Map Worker 1

Blast alignment result

Temporary Cloud storage

Cloud storage

PMR Platform PMR Platform

PMR-Master Node

Blast Blast

q1R1 q1R2 q1R qR1 qR2 qR

Figure 3 BLAST PMR framework

Table 1 Information of the Genome Sequences used as queries considering equal section lengths from Homo sapiens chromosome 15 as areference

Experiment Id Query genome Query genome size(bp) Database sequence Reference genome size

(bp)1 NT 007914 14866257 Drosophila database 1226539772 AC 000156 19317006 Drosophila database 1226539773 NT 011512 33734175 Drosophila database 1226539774 NT 033899 47073726 Drosophila database 1226539775 NT 008413 43212167 Drosophila database 1226539776 NT 022517 90712458 Drosophila database 122653977

reduction of execution time in Map workers of 119875119872119877 is3419 3415 3478 3529 3576 and 3987 inexperiments conducted when compared to Hadoop Mapworker average execution times As the query size increasesperformance improvement of 119875119872119877 increases Parallel execu-tion strategy of Reduce worker nodes proposed in 119875119872119877 isclearly visible in Figure 4(b) In other words Reduce workersare initiated as soon as two or more Map worker nodes havecompleted their tasks The execution time of Reduce workernodes in 119875119872119877 is marginally higher than those of HadoopWaiting for all Map worker nodes to complete their tasks is aprimary reason for the marginal increase in Reduce workerexecution times in 119872119877 Sequential processing ie Mapworkers first and then Reduce worker execution of workernodes in Hadoop framework is evident from Figure 4(a)

The total makespan of119875119872119877 and Hadoop is dependent ontask execution time of worker nodes during the Map phaseand Reduce phase The total makespan observed in BLASTsequence alignment experiments executed on Hadoop and119875119872119877 frameworks is shown in Figure 5 Superior performancein terms of Reduce makespan times of 119875119872119877 is evident when

compared to HadoopThough a marginal increase in Reduceworker execution time is reported overall execution time ietotal makespan of 119875119872119877 is less when compared to HadoopA reduction of 2923 3061 3278 3317 3333 and3823 is reported for six experiments executed on the119875119872119877 framework when compared to similar experimentsexecuted on Hadoop framework Average reduction of thetotal makespan across all experiments is 3289 provingsuperior performance of 119875119872119877 when compared to Hadoopframework

The theoretical makespan of PMR given by (18) is computed and compared against the practical values observed in all the experiments. The results obtained are shown in Figure 6. Minor variations are observed between practical and theoretical makespan computations. Overall, good correlation is reported between practical makespan values and theoretical makespan values. Based on the results presented, it is evident that execution of the BLAST sequence alignment algorithm on the proposed PMR yields superior results when compared to similar experiments conducted on the existing Hadoop framework.
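For reference, the computation behind these theoretical values can be reproduced directly from the bounds of Section 3.1 (lower bound x·σ/q, upper bound (x−1)·σ/q + β for x tasks of average time σ and longest time β on q workers) together with (18). The sketch below is a minimal Python rendering of those formulas; the numeric arguments are hypothetical stand-ins for log-derived task times, not the paper's measurements.

def phase_bounds(tasks, workers, avg_time, max_time):
    # Lower bound: tasks * avg / workers; upper bound adds the longest task
    lower = tasks * avg_time / workers
    upper = (tasks - 1) * avg_time / workers + max_time
    return lower, upper

def pmr_makespan(map_tasks, map_workers, m_avg, m_max,
                 reduce_tasks, reduce_workers, r_avg, r_max):
    m_lb, m_ub = phase_bounds(map_tasks, map_workers, m_avg, m_max)
    r_lb, r_ub = phase_bounds(reduce_tasks, reduce_workers, r_avg, r_max)
    # Eq. (18): T_J = (3*T_M_lb + T_R_lb + T_R_ub - T_M_ub) / 2; the negative
    # term reflects the Reduce phase starting (T_M_ub - T_M_lb) early
    return (3 * m_lb + r_lb + r_ub - m_ub) / 2

# Hypothetical log-derived task times (seconds) for a 4-worker cluster:
print(pmr_makespan(4, 4, 60.0, 70.0, 4, 4, 12.0, 15.0))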


Figure 4: BLAST sequence alignment execution makespan of the Map and Reduce worker nodes. (a) On Hadoop cluster of 4 nodes. (b) On PMR cluster of 4 nodes. (Each panel plots per-worker task execution times, in seconds, of the four Map workers (MW1-MW4) and four Reduce workers (RW1-RW4) for Experiments 1-6.)


Figure 5: BLAST sequence alignment total makespan time (s) observed for experiments 1-6 conducted on PMR and Hadoop frameworks.

Accuracy and correctness of the theoretical makespan model of PMR presented are proved through these correlation measures.
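A minimal way to reproduce such a correlation measure is the Pearson coefficient over the per-experiment makespan pairs. In the sketch below, the two value lists are hypothetical placeholders, not the measured data of the six experiments.

from statistics import correlation  # available in Python 3.10+

# Hypothetical placeholders for the six experiments' makespans (seconds)
theoretical = [48.0, 75.0, 130.0, 150.0, 190.0, 265.0]
measured = [50.0, 78.0, 133.0, 155.0, 196.0, 270.0]
print("Pearson r = %.4f" % correlation(theoretical, measured))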

4.2. CAP3. DNA sequence assembly tools are used in bioinformatics for gene discovery and for understanding the genomes of existing and new organisms. CAP3 is one such popular tool used to assemble DNA sequences. DNA assembly is achieved by performing merging and aligning operations on smaller sequence fragments to build complete genome sequences. CAP3 eliminates poor sections observed within DNA fragments, computes overlaps amongst DNA fragments, identifies and eliminates false overlaps, accumulates fragments of one or multiple overlapping DNA segments to produce contigs, and performs multiple sequence alignments to produce consensus sequences. CAP3 reads multiple gene sequences from an input FASTA file and generates output consensus sequences written to multiple files and also to standard output.

The CAP3 gene sequence assembly working principle consists of the following key stages. Firstly, the poor regions of the 3′ (three-prime) and 5′ (five-prime) ends of each read are identified and eliminated, and false overlaps are identified and eliminated. Secondly, to form contigs, reads are combined based on overlapping scores in descending order; further, to incorporate modifications to the contigs constructed, forward-reverse constraints are adopted. Lastly, multiple sequence alignments of reads are constructed per contig, resulting in consensus sequences characterized by a quality value for each base. Quality values of consensus sequences are used in the construction of multiple sequence alignment operations and also in the computation of overlaps. Operational steps of the CAP3 assembly model are shown in Figure 7. A detailed explanation of the CAP3 gene sequence assembly is provided in [30].

In the experiments conducted, CAP3 gene sequence assembly is directly adopted in the Map phase of PMR and Hadoop; in the Reduce phase, result aggregation is considered. For performance evaluation of CAP3 execution on the PMR and Hadoop frameworks, Homo sapiens chromosome 15 is considered as a reference. Genome sequences of various sizes are considered as queries and submitted to the Azure cloud platform in the experiments. Query sequences for the experiments are considered in accordance with [30]. The CAP3 experiments conducted with query genomic sequences (BAC datasets) are summarized in Table 2. All four experiments are conducted using the CAP3 algorithm on the Hadoop and PMR frameworks. Observations are retrieved through a set of log files generated during Map and Reduce phase execution on Hadoop and PMR. Using the log files, the total makespan, Map worker makespan, and Reduce worker makespan of Hadoop and PMR are noted for each experiment. It must be noted that the initialization time of the VM cluster is not considered in computing the makespan, as it is uniform in PMR and Hadoop owing to similar cluster configurations.
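Since CAP3 runs unmodified inside the Map phase and the Reduce phase only aggregates results, the division of work can be sketched as follows. This is an illustrative sketch under stated assumptions: it assumes a cap3 binary on the PATH and hypothetical chunk paths, and it runs the chunks sequentially rather than across PMR worker cores.

import glob
import shutil
import subprocess

def map_worker(fasta_chunk):
    # CAP3 writes '<input>.cap.contigs' and related files next to its input
    subprocess.run(["cap3", fasta_chunk], check=True)
    return fasta_chunk + ".cap.contigs"

def reduce_worker(contig_files, out_path="assembled.contigs"):
    # Result aggregation only: concatenate the per-chunk contig files
    with open(out_path, "wb") as out:
        for path in contig_files:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, out)
    return out_path

chunks = sorted(glob.glob("chunks/*.fasta"))   # hypothetical chunk layout
print(reduce_worker([map_worker(c) for c in chunks]))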

Task execution times of Map and Reduce worker nodes observed for CAP3 experiments conducted on the Hadoop and PMR frameworks are shown in Figure 8. Figure 8(a) presents the results obtained for Hadoop, and Figure 8(b) presents the results obtained on PMR. Execution times of Map worker nodes are far greater than execution times of Reduce worker nodes, as CAP3 algorithm execution is carried out in the Map phase and result accumulation is considered in the Reduce phase. The parallel execution strategy (utilizing the 4 computing cores available with each Map worker) of the CAP3 algorithm on PMR enables lower execution times when compared to Hadoop Map worker nodes. The average reduction of execution time in Map workers of PMR reported is 19.33%, 20.07%, 15.09%, and 18.21% in the CAP3 experiments conducted when compared to Hadoop Map worker average execution times. In PMR, the Reduce workers are initiated as soon as two or more Map worker nodes have completed their tasks, which is visible from Figure 8(b). The sequential processing strategy (i.e., Map workers first and then Reduce workers) of worker nodes in the Hadoop framework is evident from Figure 8(a). Execution time of Reduce worker nodes in PMR is marginally higher, by about 15.42%, than those of Hadoop. Waiting for all Map worker nodes to complete their tasks is a primary reason for the marginal increase in Reduce worker execution times in PMR.

The total makespan observed in CAP3 experiments executed on the Hadoop and PMR frameworks is presented in Figure 9. Superior performance in terms of Reduce makespan times of PMR is evident when compared to Hadoop. Though a marginal increase in Reduce worker execution time


Table 2: Information of the genome sequences used in CAP3 experiments.

Experiment Number | Dataset | GenBank accession number | Number of reads | Average length of reads (bp) | Length of provided sequences (bp)
1 | 203 | AC004669 | 1812 | 598 | 89779
2 | 216 | AC004638 | 2353 | 614 | 124645
3 | 322F16 | AF111103 | 4297 | 1011 | 159179
4 | 526N18 | AF123462 | 3221 | 965 | 180182

Figure 6: Correlation between theoretical and practical makespan times (s) for BLAST sequence alignment execution on the PMR framework (experiments 1-6, PMR vs. PMR-Theory).

Figure 7: Steps for CAP3 sequencing assembly: removal of poor regions of each read; computation of overlaps among reads; removal of wrongly identified overlaps; contig construction; generation of consensus sequences and construction of multiple sequence alignments.

is reported, the overall execution time, i.e., total makespan, of PMR is less when compared to Hadoop. A reduction of 18.97%, 20%, 15.03%, and 18.01% is reported for the four experiments executed on the PMR framework when compared to similar experiments executed on the Hadoop framework. The average reduction of the total makespan across all experiments is 18%, proving superior performance of PMR when compared to the Hadoop framework. The makespan time for experiment 2 is greater than that of the other experiments, as the number of differences considered in CAP3 is 17, larger than the values considered in the other experiments. A similar nature of execution times is reported in [29], validating the CAP3 execution experiments presented here.

The theoretical makespan of PMR for all four CAP3 experiments is computed using (18). A comparison between theoretical and experimental makespan values is presented in Figure 10. Minor differences are reported between practical and theoretical makespan computations, proving the correctness of the PMR makespan modeling presented.

The results presented in this section prove that CAP3 sequence assembly execution on the PMR cloud framework developed exhibits superior performance when compared to similar CAP3 experiments executed on the existing Hadoop cloud framework.

4.3. DeepBind Analysis to Identify Binding Sites. In recent times, deep learning techniques have been extensively used for various applications. Deep learning techniques are adopted predominantly when large amounts of data are to be processed or analyzed. To meet the large computing needs of deep learning techniques, GPUs are used. Motivated by this, the authors of the paper consider the very recent state-of-the-art "DeepBind" biomedical application for execution on a cloud platform. To the best of our knowledge, no such attempt to consider cloud platforms for DeepBind execution has been reported.

Biomedical operations such as alternative splicing, transcription, and gene regulation are dependent on DNA- and RNA-binding proteins. DNA- and RNA-binding proteins, described using sequence specificities, are critical in identifying diseases and deriving models of the regulatory processes that occur in biological systems. Position weight matrices are used in characterizing the specificities of a protein. Binding sites on genomic sequences are identified by scanning position weight matrices over the considered genomic sequences. DeepBind is used to predict sequence specificities, adopting deep convolutional neural networks to achieve accurate prediction.


Figure 8: CAP3 sequence assembly execution makespan of the Map and Reduce worker nodes. (a) On Hadoop cluster of 4 nodes. (b) On PMR cluster of 4 nodes. (Each panel plots per-worker task execution times, in seconds, of the four Map workers (MW1-MW4) and four Reduce workers (RW1-RW4) for Experiments 1-4.)

Comprehensive details and the sequence specificity prediction accuracy of the DeepBind application are available in [27].

DeepBind is developed using a two-phase approach: a training phase and a testing phase. Training phase execution is carried out using Map workers in the Hadoop and PMR frameworks. The trained weights are stored in the cloud memory for further processing. The testing phase of DeepBind is carried out at the Reduce stage in the Hadoop and PMR frameworks. The execution strategy of the DeepBind algorithm on the PMR framework is shown in Figure 11. The DeepBind application is developed using the code provided in [27]. For performance evaluation on Hadoop and PMR, only the testing phase is discussed (i.e., Reduce-only mode). A custom cloud cluster of one master node and six worker nodes is deployed for DeepBind performance evaluation, and a similar cloud cluster for the Hadoop framework is considered.
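To indicate what the Reduce-phase scoring step involves, the following toy sketch scores sequences against a single trained motif detector using a max-over-positions rule, loosely mimicking DeepBind's max-pooled convolution. This is not the DeepBind model itself (which is a deep convolutional network); the weight file name and sequences are hypothetical.

import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    # Encode a DNA sequence as a (length x 4) matrix
    m = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        m[i, BASES[base]] = 1.0
    return m

def binding_score(seq, pwm):
    # Max response of the detector over all sliding positions
    x, k = one_hot(seq), pwm.shape[0]
    return max(float(np.sum(x[i:i + k] * pwm))
               for i in range(len(seq) - k + 1))

pwm = np.load("trained_weights.npy")   # hypothetical Map-phase output
for variant in ["ACGTGGGCGTAT", "TTTACGCGCATA"]:
    print(variant, binding_score(variant, pwm))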


Table 3: Information of the disease-causing genomic variants used in the experiment (Experiment 1).

Case study | Genome variant | Experiment details
1 | SP1 | A disrupted SP1 binding site in the LDL-R promoter that leads to familial hypercholesterolemia
2 | TCF7L2 | A cancer risk variant in a MYC enhancer weakens a TCF7L2 binding site
3 | GATA1 | A gained GATA1 binding site that disrupts the original globin cluster promoters
4 | GATA4 | A lost GATA4 binding site in the BCL-2 promoter, potentially playing a role in ovarian granulosa cell tumors
5 | RFX3 | Loss of two potential RFX3 binding sites leads to abnormal cortical development
6 | GABPA | Gained GABP-α binding sites in the TERT promoter, which are linked to several types of aggressive cancer

Figure 9: CAP3 sequence assembly total makespan time (s) observed for experiments 1-4 conducted on PMR and Hadoop frameworks.

Figure 10: Correlation between theoretical and practical makespan times (s) for CAP3 sequence assembly execution on the PMR framework (experiments 1-4, PMR vs. PMR-Theory).

The experiment conducted to evaluate the performance of DeepBind on the Hadoop and PMR frameworks considers a set of six disease-causing genomic variants obtained from [27]. The disease-causing genomic variants to be analyzed using DeepBind are summarized in Table 3. DeepBind analysis is executed on the Hadoop and PMR frameworks deployed on a custom cloud cluster. The log data generated is stored and used in further analysis.

The results obtained to demonstrate the performance of the six worker cluster nodes of Hadoop and PMR during Map and Reduce phase execution are shown in Figure 12. Performance is presented in terms of task execution times observed per worker node. Considering Hadoop worker nodes, the execution time of each node during the Map and Reduce phases is shown in Figure 12(a). The execution time observed for each PMR worker node during the Map and Reduce phases is shown in Figure 12(b). In the Map phase execution, the genomic variants to be analyzed are obtained from the cloud storage and are accumulated based on their defined identities [27]. The query sequences of disease-causing genomic variants to be analyzed are split for parallelization. In the Reduce phase, the split query sequences are analyzed, and the results obtained are accumulated and stored in the cloud storage.


Figure 11: DeepBind PMR framework. (Map workers process the training data and store the trained weights in cloud storage via the PMR master node; Reduce workers perform Shuffle, Sort, and Reduce to compute binding scores, which are written to cloud storage.)

Figure 12: DeepBind execution makespan of the Map and Reduce worker nodes. (a) On Hadoop cluster of 6 nodes. (b) On PMR cluster of 6 nodes. (Each panel plots per-worker task execution times, in seconds, of the six Map workers (MW1-MW6) and six Reduce workers (RW1-RW6).)

Map workers in PMR exhibit better performance, and an average execution time reduction of 48.57% is reported when compared to Hadoop Map worker nodes. The execution time of the six Reduce worker nodes in Hadoop and PMR is greater than that of the Map workers, as DeepBind analysis and identification of potential binding sites are carried out during this phase. The parallel execution strategy of Reduce worker nodes is clear from Figure 12(b): the Reduce phase in PMR commences after 5 seconds, once Map worker node 1 (MW1) and Map worker node 5 (MW5) have completed their tasks. In Hadoop, which adopts a sequential approach, the Reduce phase is initiated only after all worker nodes have completed their tasks. Parallel execution of DeepBind analysis utilizing all 4 computing cores available with Reduce worker nodes and parallel initiation of the Reduce phase in PMR enable an average Reduce execution time reduction of 22.22% when compared to Hadoop Reduce worker nodes. The total makespan observed for DeepBind experiment execution on the Hadoop and PMR cloud computing platforms is shown in Figure 13. A total makespan reduction of 34.62% is achieved using the PMR framework when compared to the Hadoop framework. Analysis results similar to [27] are reported for DeepBind analysis on the Hadoop and PMR frameworks. The theoretical makespan computed using (18) for PMR is compared with the practical value observed in the experiment, and the results obtained are shown in Figure 14. A minor variation between theoretical and practical values is observed, predominantly due to application-dependent multiple cloud memory access operations. Based on the results obtained for DeepBind analysis, it is evident that performance on the PMR framework is far superior to its execution on the existing Hadoop framework.

On the basis of the biomedical applications considered for performance evaluation and the results obtained, it is evident that the proposed PMR framework exhibits superior performance


Figure 13: DeepBind analysis total makespan time (s) observed for the experiment conducted on PMR and Hadoop frameworks.

Figure 14: Correlation between theoretical and practical makespan times (s) for DeepBind analysis execution on the PMR framework.

when compared to its existing Hadoop counterpart. In BLAST, both the Map and Reduce phases are utilized. In the CAP3 application, the Map phase plays a predominant role. In the DeepBind application, analysis is carried out in the Reduce phase. The proposed PMR cloud computing framework is robust, is capable of handling the dynamic biomedical application scenarios presented, and supports deployment on public and custom cloud platforms. In addition, PMR exhibits low execution times and enables effective cloud resource utilization. Low execution times enable cost reduction, which is always a desired feature.

5. Conclusion and Future Work

The significance of cloud computing platforms is discussed. The working of the commonly adopted Hadoop MapReduce framework, along with its drawbacks, is presented. To lower execution times and enable effective utilization of cloud resources, this paper proposes the PMR cloud computing platform. A parallel execution strategy of the Map and Reduce phases is considered in the PMR framework, and the Map and Reduce functions of PMR are designed to utilize the multicore environments available with worker nodes. The paper presents the proposed PMR framework architecture along with makespan modeling. Performance of the PMR cloud computing framework is compared with the Hadoop framework. For performance evaluation, computationally heavy biomedical applications like BLAST, CAP3, and DeepBind are considered. Average overall makespan time reductions of 38.92%, 18.00%, and 34.62% are achieved using the PMR framework when compared to the Hadoop framework for the BLAST, CAP3, and DeepBind applications, respectively. The experiments presented prove the robustness of the PMR platform, its capability to handle diverse applications, and its ease of deployment on public and private cloud platforms. The results presented through the experiments conducted prove the superior performance of PMR against the Hadoop framework. Good matching is reported between the theoretical makespan of PMR presented and the experimental values observed. In addition, adopting the PMR cloud computing framework also enables cost reduction and efficient utilization of cloud resources.

Performance studies considering cloud clusters with many nodes, additional applications, and security provisioning for the cloud computing framework are considered as the future work of this paper.

Data Availability

The data is available at the National Center for Biotechnology Information (2015). [Online]. Available: http://www.ncbi.nlm.nih.gov.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (no. NRF-2015R1D1A1A01061328).

References

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 137–150, 2004.
[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, Boston, MA, June 2010.
[3] G. Malewicz, M. H. Austern, A. J. C. Bik et al., "Pregel: a system for large-scale graph processing," in Proceedings of the International Conference on Management of Data (SIGMOD '10), pp. 135–146, June 2010.
[4] S. Ghemawat, H. Gobioff, and S. Leung, "The Google file system," in Proceedings of the Nineteenth ACM Symposium, p. 29, Bolton Landing, NY, USA, October 2003.
[5] X. Shi, M. Chen, L. He et al., "Mammoth: gearing Hadoop towards memory-intensive MapReduce applications," IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 8, pp. 2300–2315, 2015.
[6] L. Person, "World Hadoop Market - Opportunities and Forecasts," Allied Market Research, p. 108, 2020.
[7] SNS Research, The Big Data Market 2014–2020: Opportunities, Challenges, Strategies, Industry Verticals and Forecasts, SNS Research, 2014.
[8] J. Zhu, J. Li, E. Hardesty, H. Jiang, and K.-C. Li, "GPU-in-Hadoop: enabling MapReduce across distributed heterogeneous platforms," in Proceedings of the 2014 13th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2014), pp. 321–326, China, June 2014.
[9] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proceedings of the Eighth USENIX Conference, pp. 29–42, 2008.
[10] L. D. Stein, "The case for cloud computing in genome informatics," Genome Biology, vol. 11, no. 5, article 207, 2010.
[11] D. P. Wall, P. Kudtarkar, V. A. Fusaro, R. Pivovarov, P. Patil, and P. J. Tonellato, "Cloud computing for comparative genomics," BMC Bioinformatics, vol. 11, no. 1, article 259, 2010.
[12] A. Rosenthal, P. Mork, M. H. Li, J. Stanford, D. Koester, and P. Reynolds, "Cloud computing: a new business paradigm for biomedical information sharing," Journal of Biomedical Informatics, vol. 43, no. 2, pp. 342–353, 2010.
[13] D. Dahiphale, R. Karve, A. V. Vasilakos et al., "An advanced MapReduce: cloud MapReduce, enhancements and applications," IEEE Transactions on Network and Service Management, vol. 11, no. 1, pp. 101–115, 2014.
[14] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good, "The cost of doing science on the cloud: the Montage example," in Proceedings of the ACM/IEEE Conference on Supercomputing, IEEE Press, November 2008.
[15] N. Chohan, C. Castillo, M. Spreitzer, M. Steinder, A. Tantawi, and C. Krintz, "See spot run: using spot instances for MapReduce workflows," in Proceedings of the 2010 USENIX Conference on Hot Topics in Cloud Computing, p. 7, 2010.
[16] R. S. Thakur, R. Bandopadhyay, B. Chaudhary, and S. Chatterjee, "Now and next-generation sequencing techniques: future of sequence analysis using cloud computing," Frontiers in Genetics, vol. 3, 2012.
[17] J. Chen, F. Qian, W. Yan, and B. Shen, "Translational biomedical informatics in the cloud: present and future," BioMed Research International, vol. 2013, Article ID 658925, 8 pages, 2013.
[18] T. Nguyen, W. Shi, and D. Ruden, "CloudAligner: a fast and full-featured MapReduce based tool for sequence mapping," BMC Research Notes, vol. 4, article 171, 2011.
[19] M. C. Schatz, "CloudBurst: highly sensitive read mapping with MapReduce," Bioinformatics, vol. 25, no. 11, pp. 1363–1369, 2009.
[20] B. Langmead, K. D. Hansen, and J. T. Leek, "Cloud-scale RNA-sequencing differential expression analysis with Myrna," Genome Biology, vol. 11, no. 8, article R83, 2010.
[21] B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L. Salzberg, "Searching for SNPs with cloud computing," Genome Biology, vol. 10, no. 11, article R134, 2009.
[22] A. A. Al-Absi and D. Kang, "A novel parallel computation model with efficient local memory management for data-intensive applications," in Proceedings of the 2015 IEEE 8th International Conference on Cloud Computing (CLOUD), pp. 958–963, New York City, NY, USA, June 2015.
[23] A. Al-Absi and D.-K. Kang, "Long read alignment with parallel MapReduce cloud platform," BioMed Research International, vol. 2015, Article ID 807407, 13 pages, 2015.
[24] E. Yoon and A. Squicciarini, "Toward detecting compromised MapReduce workers through log analysis," in Proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 41–50, Chicago, IL, USA, 2014.
[25] D. Dang, Y. Liu, X. Zhang, and S. Huang, "A crowdsourcing worker quality evaluation algorithm on MapReduce for big data applications," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 7, pp. 1879–1888, 2016.
[26] S. D. Tetali, M. Lesani, R. Majumdar, and T. Millstein, "MrCrypt: static analysis for secure cloud computations," in Proceedings of the 2013 28th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2013), pp. 271–286, USA, October 2013.
[27] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, "Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning," Nature Biotechnology, vol. 33, no. 8, pp. 831–838, 2015.
[28] K. Mahadik, S. Chaterji, B. Zhou, M. Kulkarni, and S. Bagchi, "Orion: scaling genomic sequence matching with fine-grained parallelization," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2014), pp. 449–460, USA, November 2014.
[29] J. Ekanayake, T. Gunarathne, and J. Qiu, "Cloud technologies for bioinformatics applications," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 6, pp. 998–1011, 2011.
[30] X. Huang and A. Madan, "CAP3: a DNA sequence assembly program," Genome Research, vol. 9, no. 9, pp. 868–877, 1999.
[31] Y. Wu, W. Guo, J. Ren, X. Zhao, and W. Zheng, "NO2: speeding up parallel processing of massive compute-intensive tasks," IEEE Transactions on Computers, vol. 63, no. 10, pp. 2487–2499, 2014.
[32] Z. Zhang, L. Cherkasova, and B. T. Loo, "Optimizing cost and performance trade-offs for MapReduce job processing in the cloud," in Proceedings of the IEEE/IFIP Network Operations and Management Symposium: Management in a Software Defined World (NOMS 2014), Poland, May 2014.
[33] K. Chen, J. Powers, S. Guo, and F. Tian, "CRESP: towards optimal resource provisioning for MapReduce computing in public clouds," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 6, pp. 1403–1412, 2014.
[34] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2009.
[35] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.
[36] National Center for Biotechnology Information, 2015, http://www.ncbi.nlm.nih.gov.


Individual task execution times of Map worker andReduce worker nodes observed for each BLAST experimentexecuted on Hadoop and 119875119872119877 frameworks are graphicallyshown in Figure 4 Figure 4(a) represents results obtained forHadoop and Figure 4(b) represents results obtained on119875119872119877Execution times of Map workers in 119875119872119877 and Hadoop aredominantly higher than Reduce worker times This is due tothe fact that major computation intensive phases (ie Phase1 and Phase 2) of BLAST sequence alignment application arecarried out in the Map phase Parallel execution of BLASTsequence alignment utilizing all 4 cores available with eachMap worker node adopted in 119875119872119877 results in lower executiontimeswhen compared toHadoopMapworker nodes Average

8 BioMed Research International

Query sequence q(with overlap)

Referencedatabase sequence (R)

Local Memory

hellip

PMR Platform

Local Memory

hellip

PMR Platform

Alignment aggregation

Cloud storage

Shuffle Sort Reduce

Virtual Machine ndash Reduce Worker 1Shuffle Sort Reduce

Virtual Machine ndash Reduce Worker w

Virtual Machine ndash Map Worker wVirtual Machine ndash Map Worker 1

Blast alignment result

Temporary Cloud storage

Cloud storage

PMR Platform PMR Platform

PMR-Master Node

Blast Blast

q1R1 q1R2 q1R qR1 qR2 qR

Figure 3 BLAST PMR framework

Table 1 Information of the Genome Sequences used as queries considering equal section lengths from Homo sapiens chromosome 15 as areference

Experiment Id Query genome Query genome size(bp) Database sequence Reference genome size

(bp)1 NT 007914 14866257 Drosophila database 1226539772 AC 000156 19317006 Drosophila database 1226539773 NT 011512 33734175 Drosophila database 1226539774 NT 033899 47073726 Drosophila database 1226539775 NT 008413 43212167 Drosophila database 1226539776 NT 022517 90712458 Drosophila database 122653977

reduction of execution time in Map workers of 119875119872119877 is3419 3415 3478 3529 3576 and 3987 inexperiments conducted when compared to Hadoop Mapworker average execution times As the query size increasesperformance improvement of 119875119872119877 increases Parallel execu-tion strategy of Reduce worker nodes proposed in 119875119872119877 isclearly visible in Figure 4(b) In other words Reduce workersare initiated as soon as two or more Map worker nodes havecompleted their tasks The execution time of Reduce workernodes in 119875119872119877 is marginally higher than those of HadoopWaiting for all Map worker nodes to complete their tasks is aprimary reason for the marginal increase in Reduce workerexecution times in 119872119877 Sequential processing ie Mapworkers first and then Reduce worker execution of workernodes in Hadoop framework is evident from Figure 4(a)

The total makespan of119875119872119877 and Hadoop is dependent ontask execution time of worker nodes during the Map phaseand Reduce phase The total makespan observed in BLASTsequence alignment experiments executed on Hadoop and119875119872119877 frameworks is shown in Figure 5 Superior performancein terms of Reduce makespan times of 119875119872119877 is evident when

compared to HadoopThough a marginal increase in Reduceworker execution time is reported overall execution time ietotal makespan of 119875119872119877 is less when compared to HadoopA reduction of 2923 3061 3278 3317 3333 and3823 is reported for six experiments executed on the119875119872119877 framework when compared to similar experimentsexecuted on Hadoop framework Average reduction of thetotal makespan across all experiments is 3289 provingsuperior performance of 119875119872119877 when compared to Hadoopframework

Theoretical makespan of 119875119872119877 119894119890 J given by (18)is computed and compared against the practical valuesobserved in all the experiments Results obtained are shownin Figure 6 Minor variations are observed between practicaland theoretical makespan computations Overall good cor-relation is reported between practical makespan values andtheoretical makespan values Based on the results presentedit is evident that execution of BLAST sequence alignmentalgorithm on the proposed 119875119872119877 yields superior resultswhen compared to similar experiments conducted on theexisting Hadoop framework Accuracy and correctness of the

Figure 4: BLAST sequence alignment execution makespan of the Map and Reduce worker nodes: (a) on a Hadoop cluster of 4 nodes; (b) on a PMR cluster of 4 nodes. (Each panel plots task execution of Map workers MW1–MW4 and Reduce workers RW1–RW4 against makespan time in seconds for Experiments 1–6.)

Figure 5: BLAST sequence alignment total makespan time (s) observed for Experiments 1–6 conducted on the PMR and Hadoop frameworks.

The accuracy and correctness of the theoretical makespan model of PMR presented are proved through correlation measures.
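One such measure is the Pearson correlation between the theoretical and practical makespan series; a minimal sketch, with placeholder values rather than the measured per-experiment series:

```python
# Sketch of one such measure: the Pearson correlation between the
# theoretical and practical makespan series. The two lists below are
# placeholders, not the measured per-experiment values.
from statistics import correlation  # Python 3.10+

theoretical = [45.0, 72.0, 120.0, 150.0, 190.0, 260.0]
practical = [47.0, 75.0, 126.0, 155.0, 198.0, 272.0]
print(round(correlation(theoretical, practical), 4))
```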

4.2. CAP3. DNA sequence assembly tools are used in bioinformatics for gene discovery and for understanding the genomes of existing/new organisms. CAP3 is one such popular tool used to assemble DNA sequences. DNA assembly is achieved by performing merging and aligning operations on smaller sequence fragments to build complete genome sequences. CAP3 eliminates poor sections observed within DNA fragments, computes overlaps amongst DNA fragments, identifies and eliminates false overlaps, accumulates fragments of one or multiple overlapping DNA segments to produce contigs, and performs multiple sequence alignments to produce consensus sequences. CAP3 reads multiple gene sequences from an input FASTA file and generates output consensus sequences written to multiple files and also to standard output.

The CAP3 gene sequence assembly working principle consists of the following key stages. Firstly, the poor regions of the 3′ (three-prime) and 5′ (five-prime) ends of each read are identified and eliminated, and false overlaps are identified and eliminated. Secondly, to form contigs, reads are combined based on overlapping scores in descending order; further, to incorporate modifications to the contigs constructed, forward-reverse constraints are adopted. Lastly, multiple sequence alignments of reads are constructed per contig, resulting in consensus sequences characterized by a quality value for each base. Quality values of consensus sequences are used in the construction of the multiple sequence alignment operations and also in the computation of overlaps. Operational steps of the CAP3 assembly model are shown in Figure 7. A detailed explanation of the CAP3 gene sequence assembly is provided in [30].

In the experiments conducted, CAP3 gene sequence assembly is directly adopted in the Map phase of PMR and Hadoop, and result aggregation is carried out in the Reduce phase. For performance evaluation of CAP3 execution on the PMR and Hadoop frameworks, Homo sapiens chromosome 15 is considered as a reference. Genome sequences of various sizes are considered as queries and submitted to the Azure cloud platform in the experiments.
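A minimal sketch of how a Map task can wrap the CAP3 binary is given below, assuming a cap3 executable on the worker's PATH invoked with a FASTA file argument (its conventional usage); the paths and the aggregation step are illustrative, not the paper's implementation.

```python
# Sketch of a PMR Map task wrapping the CAP3 binary, assuming a cap3
# executable on the worker's PATH invoked with a FASTA file argument.
import subprocess

def cap3_map_task(fasta_chunk_path):
    # Assemble this worker's chunk of reads; CAP3 writes assembly files
    # alongside the input and an assembly report to stdout.
    result = subprocess.run(["cap3", fasta_chunk_path],
                            capture_output=True, text=True, check=True)
    return result.stdout

def reduce_task(partial_reports):
    # Reduce phase: result aggregation only, as in the experiments.
    return "\n".join(partial_reports)
```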

Query sequences for the experiments are considered in accordance with [30]. The CAP3 experiments conducted with query genomic sequences (BAC datasets) are summarized in Table 2. All four experiments are conducted using the CAP3 algorithm on the Hadoop and PMR frameworks. Observations are retrieved through a set of log files generated during Map and Reduce phase execution on Hadoop and PMR. Using the log files, the total makespan, Map worker makespan, and Reduce worker makespan of Hadoop and PMR are noted for each experiment. It must be noted that the initialization time of the VM cluster is not considered in computing the makespan, as it is uniform in PMR and Hadoop owing to similar cluster configurations.
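A sketch of how per-worker makespans might be extracted from such logs is shown below; the comma-separated log layout is a hypothetical stand-in, as the actual Hadoop/PMR log formats differ.

```python
# Sketch of extracting per-worker makespans from execution logs.
# The assumed line layout (worker_id,phase,start_s,end_s) is a
# hypothetical stand-in for the actual Hadoop/PMR log format.
import csv
from collections import defaultdict

def worker_makespans(log_path):
    spans = defaultdict(float)
    with open(log_path) as f:
        for worker, phase, start, end in csv.reader(f):
            spans[(worker, phase)] += float(end) - float(start)
    return dict(spans)

# Example: makespans = worker_makespans("pmr_experiment1.log")
```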

Task execution times of Map and Reduce worker nodes observed for the CAP3 experiments conducted on the Hadoop and PMR frameworks are shown in Figure 8. Figure 8(a) represents results obtained for Hadoop, and Figure 8(b) represents results obtained on PMR. Execution times of Map worker nodes are far greater than execution times of Reduce worker nodes, as CAP3 algorithm execution is carried out in the Map phase and result accumulation in the Reduce phase. The parallel execution strategy (utilizing the 4 computing cores available with each Map worker) of the CAP3 algorithm on PMR enables lower execution times when compared to Hadoop Map worker nodes. The average reduction of execution time in Map workers of PMR reported is 19.33%, 20.07%, 15.09%, and 18.21% in the CAP3 experiments conducted when compared to Hadoop Map worker average execution times. In PMR, the Reduce workers are initiated as soon as two or more Map worker nodes have completed their tasks, which is visible from Figure 8(b). The sequential processing strategy (i.e., Map workers first and then Reduce workers) of worker nodes in the Hadoop framework is evident from Figure 8(a). The execution time of Reduce worker nodes in PMR is marginally higher, by about 15.42%, than those of Hadoop. Waiting for all Map worker nodes to complete their tasks is a primary reason for the marginal increase in Reduce worker execution times in PMR.

The total makespan observed in the CAP3 experiments executed on the Hadoop and PMR frameworks is presented in Figure 9. Superior performance of PMR in terms of makespan times is evident when compared to Hadoop.

Table 2: Information of the genome sequences used in CAP3 experiments.

Experiment Number | Dataset | GenBank accession number | Number of reads | Average length of reads (bp) | Length of provided sequences (bp)
1 | 203 | AC004669 | 1812 | 598 | 89,779
2 | 216 | AC004638 | 2353 | 614 | 124,645
3 | 322F16 | AF111103 | 4297 | 1011 | 159,179
4 | 526N18 | AF123462 | 3221 | 965 | 180,182

Figure 6: Correlation between theoretical and practical makespan times for BLAST sequence alignment execution on the PMR framework (Experiments 1–6, execution time in seconds).

Figure 7: Steps for CAP3 sequencing assembly: removal of poor regions of each read; computation of overlaps among reads; removal of wrongly identified overlaps; contig construction; generation of consensus sequences and construction of multiple sequence alignments.

Though a marginal increase in Reduce worker execution time is reported, the overall execution time, i.e., the total makespan, of PMR is less when compared to Hadoop. A reduction of 18.97%, 20%, 15.03%, and 18.01% is reported for the four experiments executed on the PMR framework when compared to similar experiments executed on the Hadoop framework. The average reduction of the total makespan across all experiments is 18%, proving the superior performance of PMR when compared to the Hadoop framework. The makespan time for experiment 2 is greater than for the other experiments, as the number of differences considered in CAP3 is 17, larger than the values considered in the other experiments. A similar nature of execution times is reported in [29], validating the CAP3 execution experiments presented here.
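Again, the reported average can be checked directly from the four per-experiment reductions:

```python
# Quick check of the reported CAP3 average from the four per-experiment
# total-makespan reductions quoted above.
reductions = [18.97, 20.00, 15.03, 18.01]  # percent
print(round(sum(reductions) / len(reductions), 2))  # ~18.00
```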

The theoretical makespan of PMR for all four CAP3 experiments is computed using (18). A comparison between theoretical and experimental makespan values is presented in Figure 10. Minor differences are reported between practical and theoretical makespan computations, proving the correctness of the PMR makespan modeling presented.

The results presented in this section prove that CAP3 sequence assembly execution on the PMR cloud framework developed exhibits superior performance when compared to similar CAP3 experiments executed on the existing Hadoop cloud framework.

4.3. DeepBind Analysis to Identify Binding Sites. In recent times, deep learning techniques have been extensively used for various applications. Deep learning techniques are adopted predominantly when large amounts of data are to be processed or analyzed, and GPUs are used to meet their large computing needs. Motivated by this, the authors consider the very recent state-of-the-art "DeepBind" biomedical application for execution on a cloud platform. To the best of our knowledge, no attempt to consider cloud platforms for DeepBind execution has been reported.

Biomedical operations such as alternative splicing, transcription, and gene regulation are dependent on DNA- and RNA-binding proteins. DNA- and RNA-binding proteins, described using sequence specificities, are critical in identifying diseases and deriving models of the regulatory processes that occur in biological systems. Position weight matrices are used in characterizing the specificities of a protein. Binding sites on genomic sequences are identified by scanning position weight matrices over the considered genomic sequences. DeepBind is used to predict sequence specificities, adopting deep convolutional neural networks to achieve accurate prediction.
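The position-weight-matrix scan described above can be sketched as a simple log-odds sliding window; the 4x3 motif below is illustrative only and is not DeepBind's learned model.

```python
# Minimal sketch of scanning a position weight matrix (PWM) over a
# genomic sequence to flag candidate binding sites.
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def pwm_scan(sequence, pwm):
    # Score every window of motif width; higher scores mark candidates.
    width = pwm.shape[1]
    return [sum(pwm[BASES[b], j] for j, b in enumerate(sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]

probs = np.array([[0.7, 0.1, 0.1],   # A
                  [0.1, 0.7, 0.1],   # C
                  [0.1, 0.1, 0.7],   # G
                  [0.1, 0.1, 0.1]])  # T
pwm = np.log2(probs / 0.25)          # log-odds against a uniform background
print(pwm_scan("ACGTACG", pwm))
```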

Figure 8: CAP3 sequence assembly execution makespan of the Map and Reduce worker nodes: (a) on a Hadoop cluster of 4 nodes; (b) on a PMR cluster of 4 nodes. (Each panel plots task execution of Map workers MW1–MW4 and Reduce workers RW1–RW4 against makespan time in seconds for Experiments 1–4.)

Comprehensive details and the sequence specificity prediction accuracy of the DeepBind application are available in [27].

DeepBind is developed using a two-phase approach: a training phase and a testing phase. Training phase execution is carried out using the Map workers in the Hadoop and PMR frameworks, and the trained weights are stored in cloud memory for further processing. The testing phase of DeepBind is carried out at the Reduce stage in the Hadoop and PMR frameworks.

The execution strategy of the DeepBind algorithm on the PMR framework is shown in Figure 11. The DeepBind application is developed using the code provided in [27]. For performance evaluation on Hadoop and PMR, only the testing phase is discussed (i.e., Reduce-only mode). A custom cloud cluster of one master node and six worker nodes is deployed for DeepBind performance evaluation; a similar cloud cluster is considered for the Hadoop framework.
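A structural sketch of this two-phase strategy is given below; the in-memory store dictionary and the toy per-base model stand in for cloud storage and DeepBind's convolutional network, which the paper takes from [27].

```python
# Structural sketch of the two-phase DeepBind strategy on PMR: Map
# workers "train" on a share of the training data and persist weights;
# Reduce workers combine the stored weights and score query sequences.
store = {}  # stands in for cloud storage

def map_task(worker_id, training_chunk):
    # Toy "training": per-base frequencies of the worker's chunk.
    weights = {b: training_chunk.count(b) / len(training_chunk) for b in "ACGT"}
    store[f"weights/{worker_id}"] = weights

def reduce_task(query):
    # Testing phase: average all persisted weights, then score the query.
    tables = list(store.values())
    avg = {b: sum(t[b] for t in tables) / len(tables) for b in "ACGT"}
    return sum(avg[b] for b in query)  # toy binding score

map_task(1, "ACGTAC")
map_task(2, "GGCATT")
print(reduce_task("GATA"))  # 1.0 for this toy example
```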

Table 3: Information of the disease-causing genomic variants used in the experiment (Experiment 1).

Case study | Genome variant | Experiment details
1 | SP1 | A disrupted SP1 binding site in the LDL-R promoter that leads to familial hypercholesterolemia
2 | TCF7L2 | A cancer risk variant in a MYC enhancer weakens a TCF7L2 binding site
3 | GATA1 | A gained GATA1 binding site that disrupts the original globin cluster promoters
4 | GATA4 | A lost GATA4 binding site in the BCL-2 promoter, potentially playing a role in ovarian granulosa cell tumors
5 | RFX3 | Loss of two potential RFX3 binding sites leads to abnormal cortical development
6 | GABPA | Gained GABP-α binding sites in the TERT promoter, which are linked to several types of aggressive cancer

Figure 9: CAP3 sequence assembly total makespan time (s) observed for Experiments 1–4 conducted on the PMR and Hadoop frameworks.

Figure 10: Correlation between theoretical and practical makespan times for CAP3 sequence assembly execution on the PMR framework (Experiments 1–4, execution time in seconds).

The experiment conducted to evaluate the performance of DeepBind on the Hadoop and PMR frameworks considers a set of six disease-causing genomic variants obtained from [27]. The disease-causing genomic variants to be analyzed using DeepBind are summarized in Table 3. DeepBind analysis is executed on the Hadoop and PMR frameworks deployed on a custom cloud cluster, and the log data generated is stored and used in further analysis.

The results obtained to demonstrate the performance of the six worker cluster nodes of Hadoop and PMR during Map and Reduce phase execution are shown in Figure 12. Performance is presented in terms of the task execution times observed per worker node. For the Hadoop worker nodes, the execution time of each node during the Map and Reduce phases is shown in Figure 12(a); the execution time observed for each PMR worker node during the Map and Reduce phases is shown in Figure 12(b). In the Map phase execution, the genomic variants to be analyzed are obtained from cloud storage and accumulated based on their defined identities [27]. The query sequences of the disease-causing genomic variants to be analyzed are split for parallelization. In the Reduce phase, the split query sequences are analyzed, and the results obtained are accumulated and stored in cloud storage.
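One plausible reading of this splitting step is a round-robin assignment of the variant query sequences across the available workers, sketched below; the worker count and sequence names are illustrative.

```python
# Round-robin assignment of variant query sequences across w workers;
# an illustrative reading of the splitting step, not the paper's code.
def split_queries(sequences, w):
    shards = [[] for _ in range(w)]
    for i, seq in enumerate(sequences):
        shards[i % w].append(seq)  # keeps per-worker load roughly even
    return shards

queries = ["SP1_var", "TCF7L2_var", "GATA1_var",
           "GATA4_var", "RFX3_var", "GABPA_var"]
print(split_queries(queries, 3))
```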

Figure 11: DeepBind PMR framework. (The PMR-Master Node distributes training data to Map worker VMs, which train DeepBind and persist the trained weights to cloud storage; Reduce worker VMs perform Shuffle, Sort, and Reduce to compute binding scores for the query sequence.)

Figure 12: DeepBind execution makespan of the Map and Reduce worker nodes: (a) on a Hadoop cluster of 6 nodes; (b) on a PMR cluster of 6 nodes. (Each panel plots task execution of Map workers MW1–MW6 and Reduce workers RW1–RW6 against makespan time in seconds.)

Map workers in PMR exhibit better performance, and an average execution time reduction of 48.57% is reported when compared to Hadoop Map worker nodes. The execution time of the six Reduce worker nodes in Hadoop and PMR is greater than that of the Map workers, as DeepBind analysis and the identification of potential binding sites are carried out during this phase. The parallel execution strategy of Reduce worker nodes is clear from Figure 12(b): the Reduce phase in PMR commences after 5 seconds, once Map worker node 1 (MW1) and Map worker node 5 (MW5) have completed their tasks. In Hadoop, which adopts a sequential approach, the Reduce phase is initiated only after all worker nodes have completed their tasks. Parallel execution of DeepBind analysis utilizing all 4 computing cores available with the Reduce worker nodes, together with parallel initiation of the Reduce phase in PMR, enables an average Reduce execution time reduction of 22.22% when compared to Hadoop Reduce worker nodes. The total makespan observed for DeepBind experiment execution on the Hadoop and PMR cloud computing platforms is shown in Figure 13. A total makespan reduction of 34.62% is achieved using the PMR framework when compared to the Hadoop framework. Analysis results similar to [27] are reported for DeepBind analysis on the Hadoop and PMR frameworks. The theoretical makespan computed using (18) for PMR is compared with the practical value observed in the experiment, and the results obtained are shown in Figure 14. A minor variation between theoretical and practical values is observed, predominantly due to application-dependent multiple cloud memory access operations. Based on the results obtained for DeepBind analysis, it is evident that performance on the PMR framework is far superior to its execution on the existing Hadoop framework.

On the basis of the biomedical applications considered for performance evaluation and the results obtained, it is evident that the proposed PMR framework exhibits superior performance when compared to its existing Hadoop counterpart.

Figure 13: DeepBind analysis total makespan time (s) observed for the experiment conducted on the PMR and Hadoop frameworks.

Figure 14: Correlation between theoretical and practical makespan times for DeepBind analysis execution on the PMR framework.

In BLAST, both the Map and Reduce phases are utilized; in the CAP3 application, the Map phase plays the predominant role; and in the DeepBind application, the analysis is carried out in the Reduce phase. The proposed PMR cloud computing framework is robust and capable of handling the dynamic biomedical application scenarios presented, supporting deployment of PMR on public and custom cloud platforms. In addition, PMR exhibits low execution times and enables effective cloud resource utilization. Low execution times enable cost reduction, always a desired feature.

5. Conclusion and Future Work

The significance of cloud computing platforms is discussed. The working of the commonly adopted Hadoop MapReduce framework, along with its drawbacks, is presented. To lower execution times and enable effective utilization of cloud resources, this paper proposes the PMR cloud computing platform. A parallel execution strategy of the Map and Reduce phases is considered in the PMR framework, and the Map and Reduce functions of PMR are designed to utilize the multicore environments available with worker nodes. The paper presents the proposed PMR framework architecture along with its makespan modeling. Performance of the PMR cloud computing framework is compared with the Hadoop framework. For performance evaluation, computationally heavy biomedical applications like BLAST, CAP3, and DeepBind are considered. Average overall makespan time reductions of 38.92%, 18.00%, and 34.62% are achieved using the PMR framework when compared to the Hadoop framework for the BLAST, CAP3, and DeepBind applications, respectively. The experiments presented prove the robustness of the PMR platform, its capability to handle diverse applications, and its ease of deployment on public and private cloud platforms. The results presented through the experiments conducted prove the superior performance of PMR against the Hadoop framework, and good matching is reported between the theoretical makespan of PMR presented and the experimental values observed. In addition, adopting the PMR cloud computing framework also enables cost reduction and efficient utilization of cloud resources.

A performance study considering cloud clusters with many nodes, additional applications, and security provisioning for the cloud computing framework is considered as the future work of this paper.

Data Availability

The data is available at the National Center for Biotechnology Information (2015), [Online], available: http://www.ncbi.nlm.nih.gov.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (no. NRF-2015R1D1A1A01061328).

References

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 137–150, 2004.
[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, Boston, MA, USA, June 2010.
[3] G. Malewicz, M. H. Austern, A. J. C. Bik et al., "Pregel: a system for large-scale graph processing," in Proceedings of the International Conference on Management of Data (SIGMOD '10), pp. 135–146, June 2010.
[4] S. Ghemawat, H. Gobioff, and S. Leung, "The Google file system," in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, p. 29, Bolton Landing, NY, USA, October 2003.
[5] X. Shi, M. Chen, L. He et al., "Mammoth: gearing Hadoop towards memory-intensive MapReduce applications," IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 8, pp. 2300–2315, 2015.
[6] L. Person, "World Hadoop Market – Opportunities and Forecasts," Allied Market Research, p. 108, 2020.
[7] SNS Research, The Big Data Market 2014–2020: Opportunities, Challenges, Strategies, Industry Verticals and Forecasts, SNS Research, 2014.
[8] J. Zhu, J. Li, E. Hardesty, H. Jiang, and K.-C. Li, "GPU-in-Hadoop: enabling MapReduce across distributed heterogeneous platforms," in Proceedings of the 2014 13th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2014), pp. 321–326, China, June 2014.
[9] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proceedings of the Eighth USENIX Conference on Operating Systems Design and Implementation, pp. 29–42, 2008.
[10] L. D. Stein, "The case for cloud computing in genome informatics," Genome Biology, vol. 11, no. 5, article 207, 2010.
[11] D. P. Wall, P. Kudtarkar, V. A. Fusaro, R. Pivovarov, P. Patil, and P. J. Tonellato, "Cloud computing for comparative genomics," BMC Bioinformatics, vol. 11, no. 1, article 259, 2010.
[12] A. Rosenthal, P. Mork, M. H. Li, J. Stanford, D. Koester, and P. Reynolds, "Cloud computing: a new business paradigm for biomedical information sharing," Journal of Biomedical Informatics, vol. 43, no. 2, pp. 342–353, 2010.
[13] D. Dahiphale, R. Karve, A. V. Vasilakos et al., "An advanced MapReduce: cloud MapReduce, enhancements and applications," IEEE Transactions on Network and Service Management, vol. 11, no. 1, pp. 101–115, 2014.
[14] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good, "The cost of doing science on the cloud: the Montage example," in Proceedings of the ACM/IEEE Conference on Supercomputing, IEEE Press, November 2008.
[15] N. Chohan, C. Castillo, M. Spreitzer, M. Steinder, A. Tantawi, and C. Krintz, "See spot run: using spot instances for MapReduce workflows," in Proceedings of the 2010 USENIX Conference on Hot Topics in Cloud Computing, p. 7, 2010.
[16] R. S. Thakur, R. Bandopadhyay, B. Chaudhary, and S. Chatterjee, "Now and next-generation sequencing techniques: future of sequence analysis using cloud computing," Frontiers in Genetics, vol. 3, 2012.
[17] J. Chen, F. Qian, W. Yan, and B. Shen, "Translational biomedical informatics in the cloud: present and future," BioMed Research International, vol. 2013, Article ID 658925, 8 pages, 2013.
[18] T. Nguyen, W. Shi, and D. Ruden, "CloudAligner: a fast and full-featured MapReduce based tool for sequence mapping," BMC Research Notes, vol. 4, article 171, 2011.
[19] M. C. Schatz, "CloudBurst: highly sensitive read mapping with MapReduce," Bioinformatics, vol. 25, no. 11, pp. 1363–1369, 2009.
[20] B. Langmead, K. D. Hansen, and J. T. Leek, "Cloud-scale RNA-sequencing differential expression analysis with Myrna," Genome Biology, vol. 11, no. 8, p. R83, 2010.
[21] B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L. Salzberg, "Searching for SNPs with cloud computing," Genome Biology, vol. 10, no. 11, article R134, 2009.
[22] A. A. Al-Absi and D. Kang, "A novel parallel computation model with efficient local memory management for data-intensive applications," in Proceedings of the 2015 IEEE 8th International Conference on Cloud Computing (CLOUD), pp. 958–963, New York City, NY, USA, June 2015.
[23] A. A. Al-Absi and D.-K. Kang, "Long read alignment with parallel MapReduce cloud platform," BioMed Research International, vol. 2015, Article ID 807407, 13 pages, 2015.
[24] E. Yoon and A. Squicciarini, "Toward detecting compromised MapReduce workers through log analysis," in Proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 41–50, Chicago, IL, USA, 2014.
[25] D. Dang, Y. Liu, X. Zhang, and S. Huang, "A crowdsourcing worker quality evaluation algorithm on MapReduce for big data applications," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 7, pp. 1879–1888, 2016.
[26] S. D. Tetali, M. Lesani, R. Majumdar, and T. Millstein, "MrCrypt: static analysis for secure cloud computations," in Proceedings of the 2013 28th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2013), pp. 271–286, USA, October 2013.
[27] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, "Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning," Nature Biotechnology, vol. 33, no. 8, pp. 831–838, 2015.
[28] K. Mahadik, S. Chaterji, B. Zhou, M. Kulkarni, and S. Bagchi, "Orion: scaling genomic sequence matching with fine-grained parallelization," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2014), pp. 449–460, USA, November 2014.
[29] J. Ekanayake, T. Gunarathne, and J. Qiu, "Cloud technologies for bioinformatics applications," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 6, pp. 998–1011, 2011.
[30] X. Huang and A. Madan, "CAP3: a DNA sequence assembly program," Genome Research, vol. 9, no. 9, pp. 868–877, 1999.
[31] Y. Wu, W. Guo, J. Ren, X. Zhao, and W. Zheng, "NO2: speeding up parallel processing of massive compute-intensive tasks," IEEE Transactions on Computers, vol. 63, no. 10, pp. 2487–2499, 2014.
[32] Z. Zhang, L. Cherkasova, and B. T. Loo, "Optimizing cost and performance trade-offs for MapReduce job processing in the cloud," in Proceedings of the IEEE/IFIP Network Operations and Management Symposium: Management in a Software Defined World (NOMS 2014), Poland, May 2014.
[33] K. Chen, J. Powers, S. Guo, and F. Tian, "CRESP: towards optimal resource provisioning for MapReduce computing in public clouds," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 6, pp. 1403–1412, 2014.
[34] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2009.
[35] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.
[36] National Center for Biotechnology Information, 2015, http://www.ncbi.nlm.nih.gov.




8 BioMed Research International

Query sequence q(with overlap)

Referencedatabase sequence (R)

Local Memory

hellip

PMR Platform

Local Memory

hellip

PMR Platform

Alignment aggregation

Cloud storage

Shuffle Sort Reduce

Virtual Machine ndash Reduce Worker 1Shuffle Sort Reduce

Virtual Machine ndash Reduce Worker w

Virtual Machine ndash Map Worker wVirtual Machine ndash Map Worker 1

Blast alignment result

Temporary Cloud storage

Cloud storage

PMR Platform PMR Platform

PMR-Master Node

Blast Blast

q1R1 q1R2 q1R qR1 qR2 qR

Figure 3 BLAST PMR framework

Table 1 Information of the Genome Sequences used as queries considering equal section lengths from Homo sapiens chromosome 15 as areference

Experiment Id Query genome Query genome size(bp) Database sequence Reference genome size

(bp)1 NT 007914 14866257 Drosophila database 1226539772 AC 000156 19317006 Drosophila database 1226539773 NT 011512 33734175 Drosophila database 1226539774 NT 033899 47073726 Drosophila database 1226539775 NT 008413 43212167 Drosophila database 1226539776 NT 022517 90712458 Drosophila database 122653977

reduction of execution time in Map workers of 119875119872119877 is3419 3415 3478 3529 3576 and 3987 inexperiments conducted when compared to Hadoop Mapworker average execution times As the query size increasesperformance improvement of 119875119872119877 increases Parallel execu-tion strategy of Reduce worker nodes proposed in 119875119872119877 isclearly visible in Figure 4(b) In other words Reduce workersare initiated as soon as two or more Map worker nodes havecompleted their tasks The execution time of Reduce workernodes in 119875119872119877 is marginally higher than those of HadoopWaiting for all Map worker nodes to complete their tasks is aprimary reason for the marginal increase in Reduce workerexecution times in 119872119877 Sequential processing ie Mapworkers first and then Reduce worker execution of workernodes in Hadoop framework is evident from Figure 4(a)

The total makespan of119875119872119877 and Hadoop is dependent ontask execution time of worker nodes during the Map phaseand Reduce phase The total makespan observed in BLASTsequence alignment experiments executed on Hadoop and119875119872119877 frameworks is shown in Figure 5 Superior performancein terms of Reduce makespan times of 119875119872119877 is evident when

compared to HadoopThough a marginal increase in Reduceworker execution time is reported overall execution time ietotal makespan of 119875119872119877 is less when compared to HadoopA reduction of 2923 3061 3278 3317 3333 and3823 is reported for six experiments executed on the119875119872119877 framework when compared to similar experimentsexecuted on Hadoop framework Average reduction of thetotal makespan across all experiments is 3289 provingsuperior performance of 119875119872119877 when compared to Hadoopframework

Theoretical makespan of 119875119872119877 119894119890 J given by (18)is computed and compared against the practical valuesobserved in all the experiments Results obtained are shownin Figure 6 Minor variations are observed between practicaland theoretical makespan computations Overall good cor-relation is reported between practical makespan values andtheoretical makespan values Based on the results presentedit is evident that execution of BLAST sequence alignmentalgorithm on the proposed 119875119872119877 yields superior resultswhen compared to similar experiments conducted on theexisting Hadoop framework Accuracy and correctness of the

BioMed Research International 9

0 50 100 150 200

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - Hadoop (Experiment 3)

0 10 20 30 40 50 60 70R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - Hadoop (Experiment 1)

0 20 40 60 80 100 120

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - Hadoop (Experiment 2)

0 50 100 150 200 250

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - Hadoop (Experiment 4)

0 50 100 150 200 250 300

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - Hadoop (Experiment 5)

0 50 100 150 200 250 300 350 400

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution Considering Hadoop (Experiment 6)

(a) Worker node execution times onHadoop frame-work

0 20 40 60 80 100 120 140

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution -PMR (Experiment 3)

0 10 20 30 40 50R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution -PMR (Experiment 1)

0 10 20 30 40 50 60 70 80

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution -PMR (Experiment 2)

0 20 40 60 80 100 120 140 160

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - PMR (Experiment 4)

0 50 100 150 200

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - PMR (Experiment 5)

0 50 100 150 200 250 300

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution Considering PMR (Experiment 6)

(b) Worker node execution times on 119875119872119877 frame-work

Figure 4 BLAST sequence alignment execution makespan of the Map and Reduce worker nodes (a) On Hadoop cluster of 4 nodes (b) On119875119872119877 cluster of 4 nodes

10 BioMed Research International

1 2 3 4 5 6Experiment Number

HadoopPMR

050

100150200250300350400450

Exec

utio

n Ti

me

(s)

Figure 5 BLAST sequence alignment total makespan time observed for experiments conducted on 119875119872119877 and Hadoop frameworks

theoretical makespan model of 119875119872119877 presented are provedthrough correlation measures

42 CAP3 DNA sequence assembly tools are used in bioin-formatics for gene discovery and understanding genomesof existingnew organisms CAP3 is one such popular toolused to assemble DNA sequences DNA assembly is achievedby performing merging and aligning operations on smallersequence fragments to build complete genome sequencesCAP3 eliminates poor sections observed within DNA frag-ments computes overlaps amongst DNA fragments is capa-ble of identifying false overlaps eliminating false overlapsidentified accumulates fragments of multiple or one overlap-ping DNA segment to produce contigs and performs mul-tiple sequence alignments to produce consensus sequencesCAP3 reads multiple gene sequences from an input FASTAfile and generates output consensus sequences written tomultiple files and also to standard outputs

The CAP3 gene sequence assembly working principleconsists of the following key stages Firstly the poor regionsof 31015840 (three-prime) and 51015840 (five-prime) of each read areidentified and eliminated False overlaps are identified andeliminated Secondly to form contigs reads are combinedbased on overlapping scores in descending order Furtherto incorporate modifications to the contigs constructedforward-reverse constraints are adopted Lastly numerous se-quence alignments of reads are constructed per contig result-ing in consensus sequences characterized by a quality valuefor each base Quality values of consensus sequences are usedin construction of numerous sequence alignment operationsand also in computation of overlaps Operational steps ofCAP3 assembly model are shown in Figure 7 A detailed ex-planation of the CAP3 gene sequence assembly is providedin [30]

In the experiments conducted CAP3 gene sequenceassembly is directly adopted in the Map phase of 119875119872119877 andHadoop In the Reduce phase result aggregation is consid-ered Performance evaluation of CAP3 execution on 119875119872119877and Hadoop frameworks Homo sapiens chromosome 15 isconsidered as a reference Genome sequences of various sizesare considered as queries and submitted to Azure cloud plat-form in the experiments Query sequences for experiments

are considered in accordance to [30] CAP3 experimentsconducted with query genomic sequences (BAC datasets) aresummarized in Table 2 All four experiments are conductedusing CAP3 algorithm on the Hadoop and119875119872119877 frameworksObservations are retrieved through a set of log files generatedduring Map and Reduce phase execution on Hadoop and119875119872119877 Using the log files total makespan Map worker makes-pan and Reduce worker makespan of Hadoop and 119875119872119877 arenoted for each experiment Itmust be noted that the initializa-tion time of the VM cluster is not considered in the comput-ing makespan as it is uniform in 119875119872119877 and Hadoop owing tosimilar cluster configurations

Task execution times of Map and Reduce worker nodesobserved for CAP3 experiments conducted on Hadoop and119875119872119877 frameworks are shown in Figure 8 Figure 8(a) repre-sents results obtained for Hadoop and Figure 8(b) representsresults obtained on 119872119877 Execution times of Map workernodes are far greater than execution times of Reduce workernodes as CAP3 algorithm execution is carried out in the Mapphase and result accumulation is considered in the Reducephase Parallel execution strategy (utilizing 4 computingcores available with each Map worker) of CAP3 algorithmon 119875119872119877 enables lower execution times when compared toHadoop Map worker nodes Average reduction of executiontime in Map workers of 119875119872119877 reported is 1933 20071509 and 1821 in CAP3 experiments conducted whencompared toHadoopMapworker average execution times In119875119872119877 theReduceworkers are initiated as soon as two ormoreMapworker nodes have completed their tasks which is visiblefrom Figure 8(b) Sequential processing strategy (ie Mapworkers first and then Reduce workers execution) of workernodes in the Hadoop framework is evident from Figure 8(a)Execution time of Reduceworker nodes in119875119872119877 ismarginallyhigher by about 1542 than those of HadoopWaiting for allMapworker nodes to complete their tasks is a primary reasonfor the marginal increase in Reduce worker execution timesin119872119877

The total makespan observed in CAP3 experiments exe-cuted on the Hadoop and 119875119872119877 frameworks is presented inFigure 9 Superior performance in terms of Reduce mak-espan times of 119875119872119877 is evident when compared to HadoopThough amarginal increase in Reduce worker execution time

BioMed Research International 11

Table 2 Information of the Genome Sequences used in CAP3 experiments

Experiment Number Dataset GenBankaccession number

Number of reads(bp)

Average Length ofreads (bp)

Length ofprovided

sequences (bp)1 203 AC004669 1812 598 897792 216 AC004638 2353 614 1246453 322F16 AF111103 4297 1011 1591794 526N18 AF123462 3221 965 180182

1 2 3 4 5 6

Experiment NumberPMRPMR -Theory

0

50

100

150

200

250

300Ex

ecut

ion

Tim

e (s

)

Figure 6 Correlation between theoretical and practical makespan times for BLAST sequence alignment execution on PMR framework

Figure 7: Steps for CAP3 sequencing assembly: removal of poor regions of each read; computation of overlaps among reads; wrongly identified overlap removal; contig construction; and generation of consensus sequences and construction of multiple sequence alignments.

is reported, the overall execution time, i.e., total makespan, of PMR is less when compared to Hadoop. A reduction of 18.97%, 20%, 15.03%, and 18.01% is reported for the four experiments executed on the PMR framework when compared to similar experiments executed on the Hadoop framework. The average reduction of the total makespan across all experiments is 18%, proving superior performance of PMR when compared to the Hadoop framework. The makespan time for experiment 2 is greater than for the other experiments, as the number-of-differences parameter considered in CAP3 is 17, larger than the values considered in the other experiments. A similar nature of execution times is reported in [29], validating the CAP3 execution experiments presented here.

The theoretical makespan of PMR for all four CAP3 experiments is computed using (18). A comparison between theoretical and experimental makespan values is presented in Figure 10. Minor differences are reported between practical and theoretical makespan computations, proving the correctness of the PMR makespan modeling presented.
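For reference, the computation behind these theoretical values is small enough to restate. The sketch below evaluates the Map and Reduce phase bounds in the form of (11) and (12) and combines them per equation (18); the numeric arguments in the example are illustrative, not measured values.

```python
def theoretical_makespan(x_map, s_map, map_avg, map_max,
                         x_red, s_red, red_avg, red_max):
    """Evaluate the phase bounds of (11)-(12) and equation (18).

    x_*  : number of Map/Reduce tasks of the job
    s_*  : number of Map/Reduce worker slots
    *_avg, *_max : average and maximum task execution times (s)
    """
    map_lb = x_map * map_avg / s_map
    map_ub = (x_map - 1) * map_avg / s_map + map_max
    red_lb = x_red * red_avg / s_red
    red_ub = (x_red - 1) * red_avg / s_red + red_max
    # Equation (18): overlapping the phases advances the Reduce start
    # by (map_ub - map_lb) relative to a sequential schedule.
    return (3 * map_lb + red_lb + red_ub - map_ub) / 2

# Illustrative (not measured) statistics for a 4-worker cluster:
print(theoretical_makespan(16, 4, 300.0, 340.0, 8, 4, 60.0, 75.0))
```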

The results presented in this section prove that CAP3 sequence assembly execution on the PMR cloud framework developed exhibits superior performance when compared to similar CAP3 experiments executed on the existing Hadoop cloud framework.

4.3. DeepBind Analysis to Identify Binding Sites. In recent times, deep learning techniques have been extensively used for various applications. Deep learning techniques are adopted predominantly when large amounts of data are to be processed or analyzed. To meet the large computing needs of deep learning techniques, GPUs are used. Motivated by this, the authors consider the very recent state-of-the-art "DeepBind" biomedical application for execution on a cloud platform. To the best of our knowledge, no such attempt to consider cloud platforms for DeepBind execution has been reported.

Biomedical operations such as alternative splicing, transcription, and gene regulation depend on DNA- and RNA-binding proteins. DNA- and RNA-binding proteins described using sequence specificities are critical in identifying diseases and deriving models of regulatory processes that occur in biological systems. Position weight matrices are used in characterizing the specificities of a protein. Binding sites on genomic sequences are identified by scanning position weight matrices over the considered genomic sequences. DeepBind is used to predict sequence specificities. DeepBind adopts deep convolutional neural networks to achieve accurate prediction. Comprehensive details and sequence specificity


Figure 8: CAP3 sequence assembly execution makespan of the Map and Reduce worker nodes. (a) Worker node execution times on the Hadoop cluster of 4 nodes (Experiments 1-4). (b) Worker node execution times on the PMR cluster of 4 nodes (Experiments 1-4).

prediction accuracy of the DeepBind application are available in [27].
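The PWM scanning that DeepBind improves upon can be stated in a few lines. The sketch below scores every window of a sequence against a toy position weight matrix; it is illustrative only, since DeepBind itself replaces the fixed matrix with learned convolutional filters [27].

```python
import numpy as np

def pwm_scan(sequence, pwm):
    """Slide a position weight matrix over a DNA sequence and return
    per-window log-odds scores; high-scoring windows are candidate
    binding sites."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    k = pwm.shape[1]                      # motif length
    scores = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        scores.append(sum(pwm[index[b], j] for j, b in enumerate(window)))
    return np.array(scores)

# Toy 4 x 3 PWM (rows A, C, G, T; columns are motif positions),
# converted to log-odds against a uniform 0.25 background.
pwm = np.log2(np.array([[0.7, 0.1, 0.1],
                        [0.1, 0.7, 0.1],
                        [0.1, 0.1, 0.7],
                        [0.1, 0.1, 0.1]]) / 0.25)
print(pwm_scan("ACGTACG", pwm).round(2))
```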

DeepBind is developed using a two-phase approach: a training phase and a testing phase. Training phase execution is carried out using Map workers in the Hadoop and PMR frameworks. The trained weights are stored in cloud memory for further processing. The testing phase of DeepBind is carried out at the Reduce stage in the Hadoop and

PMR frameworks. The execution strategy of the DeepBind algorithm on the PMR framework is shown in Figure 11; a schematic code sketch of this phase mapping follows the present paragraph. The DeepBind application is developed using the code provided in [27]. For performance evaluation on Hadoop and PMR, only the testing phase is discussed (i.e., Reduce-only mode). A custom cloud cluster of one master node and six worker nodes is deployed for DeepBind performance evaluation. A similar cloud cluster is considered for the Hadoop framework. The experiment


Table 3: Information of the disease-causing genomic variants used in the experiment (Experiment 1).

Case study | Genome variant | Experiment details
1 | SP1    | A disrupted SP1 binding site in the LDL-R promoter that leads to familial hypercholesterolemia
2 | TCF7L2 | A cancer risk variant in a MYC enhancer weakens a TCF7L2 binding site
3 | GATA1  | A gained GATA1 binding site that disrupts the original globin cluster promoters
4 | GATA4  | A lost GATA4 binding site in the BCL-2 promoter, potentially playing a role in ovarian granulosa cell tumors
5 | RFX3   | Loss of two potential RFX3 binding sites leads to abnormal cortical development
6 | GABPA  | Gained GABP-α binding sites in the TERT promoter, which are linked to several types of aggressive cancer

Figure 9: CAP3 sequence assembly total makespan time observed for experiments conducted on the PMR and Hadoop frameworks.

Figure 10: Correlation between theoretical and practical makespan times for CAP3 sequence assembly execution on the PMR framework.

conducted to evaluate the performance of DeepBind on the Hadoop and PMR frameworks considers a set of six disease-causing genomic variants obtained from [27]. The disease-causing genomic variants to be analyzed using DeepBind are summarized in Table 3. DeepBind analysis is executed on the Hadoop and PMR frameworks deployed on a custom cloud cluster. The log data generated is stored and used in further analysis.
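Recalling the phase mapping described above (training in the Map phase, scoring in the Reduce phase), the sketch below summarizes the division of labour. The functions train_deepbind and deepbind_score are hypothetical stand-ins for the actual DeepBind training and scoring code of [27], and the dictionary stands in for cloud storage.

```python
import hashlib

def train_deepbind(sequences):
    # Hypothetical stand-in: real training fits a deep convolutional
    # network to the training sequences [27]; here we derive a
    # deterministic token that plays the role of the learned weights.
    return hashlib.md5("".join(sequences).encode()).hexdigest()

def deepbind_score(weights, sequence):
    # Hypothetical stand-in for scoring a sequence with trained weights.
    digest = hashlib.md5((weights + sequence).encode()).hexdigest()
    return int(digest, 16) % 1000 / 1000.0

def map_phase(training_sets, storage):
    # Map workers: train one model per binding factor and persist the
    # trained weights to (simulated) cloud storage.
    for factor, seqs in training_sets.items():
        storage["weights/" + factor] = train_deepbind(seqs)

def reduce_phase(variants, storage):
    # Reduce workers: score each split query sequence against the
    # stored model and accumulate the binding scores.
    return {seq: deepbind_score(storage["weights/" + factor], seq)
            for seq, factor in variants}

storage = {}                                  # stands in for cloud storage
map_phase({"SP1": ["ACGT", "GGGA"]}, storage)
print(reduce_phase([("TTAGGC", "SP1")], storage))
```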

The results obtained to demonstrate the performance of the six worker cluster nodes of Hadoop and PMR during Map and Reduce phase execution are shown in Figure 12. Performance is presented in terms of task execution times

observed per worker node. Considering Hadoop worker nodes, the execution times of each node during the Map and Reduce phases are shown in Figure 12(a). The execution time observed for each PMR worker node during the Map and Reduce phases is shown in Figure 12(b). In the Map phase execution, the genomic variants to be analyzed are obtained from the cloud storage and are accumulated based on their identities as defined in [27]. The query sequences of disease-causing genomic variants to be analyzed are split for parallelization. In the Reduce phase, the split query sequences are analyzed, and the results obtained are accumulated and stored in the cloud storage. Map workers in PMR exhibit


Figure 11: DeepBind PMR framework. The PMR master node distributes training data to the Map worker VMs, which run DeepBind training in local memory and store the trained weights in cloud storage; the Reduce worker VMs then perform Shuffle, Sort, and Reduce to compute binding scores for the query sequence, with results written back to cloud storage.

Figure 12: DeepBind execution makespan of the Map and Reduce worker nodes. (a) Worker node execution times on the Hadoop cluster of 6 nodes. (b) Worker node execution times on the PMR cluster of 6 nodes.

better performance, and an average execution time reduction of 48.57% is reported when compared to Hadoop Map worker nodes. The execution time of the six Reduce worker nodes in Hadoop and PMR is greater than that of the Map workers, as DeepBind analysis and identification of potential binding sites are carried out during this phase. The parallel execution strategy of Reduce worker nodes is clear from Figure 12(b). The Reduce phase in PMR commences after 5 seconds, once Map worker node 1 (MW1) and Map worker node 5 (MW5) have completed their tasks. In Hadoop, which adopts a sequential approach, the Reduce phase is initiated only after all Map worker nodes have completed their tasks. Parallel execution of DeepBind analysis utilizing all 4 computing cores available with Reduce worker nodes and parallel initiation of the Reduce phase in PMR enable an average Reduce execution time reduction of 22.22% when compared to Hadoop Reduce worker nodes. The total makespan observed for DeepBind experiment execution on

the Hadoop and PMR cloud computing platforms is shown in Figure 13. A total makespan reduction of 34.62% is achieved using the PMR framework when compared to the Hadoop framework. Analysis results similar to [27] are reported for DeepBind analysis on the Hadoop and PMR frameworks. The theoretical makespan computed using (18) for PMR is compared with the practical value observed in the experiment, and the results obtained are shown in Figure 14. A minor variation between theoretical and practical values is observed. The variation observed is predominantly due to application-dependent multiple cloud memory access operations. Based on the results obtained for DeepBind analysis, it is evident that performance on the PMR framework is far superior to its execution on the existing Hadoop framework.

On the basis of the biomedical applications considered for performance evaluation and the results obtained, it is evident that the proposed PMR framework exhibits superior performance


Figure 13: DeepBind analysis total makespan time observed for experiments conducted on the PMR and Hadoop frameworks.

Figure 14: Correlation between theoretical and practical makespan times for DeepBind analysis execution on the PMR framework.

when compared to its existing Hadoop counterpart. In BLAST, both the Map and Reduce phases are utilized. In the CAP3 application, the Map phase plays the predominant role. In the DeepBind application, analysis is carried out in the Reduce phase. The proposed PMR cloud computing framework is robust, capable of handling the dynamic biomedical application scenarios presented, and easy to deploy on public and custom cloud platforms. In addition, PMR exhibits low execution times and enables effective cloud resource utilization. Low execution times enable cost reduction, which is always a desired feature.

5. Conclusion and Future Work

The significance of cloud computing platforms is discussed. The working of the commonly adopted Hadoop MapReduce framework, along with its drawbacks, is presented. To lower execution times and enable effective utilization of cloud resources, this paper proposes the PMR cloud computing platform. A parallel execution strategy of the Map and Reduce phases is considered in the PMR framework. The Map and Reduce functions of PMR are designed to utilize the multicore environments

available with worker nodes. The paper presents the proposed PMR framework architecture along with makespan modeling. Performance of the PMR cloud computing framework is compared with the Hadoop framework. For performance evaluation, computationally heavy biomedical applications, namely, BLAST, CAP3, and DeepBind, are considered. Average overall makespan time reductions of 38.92%, 18.00%, and 34.62% are achieved using the PMR framework when compared to the Hadoop framework for the BLAST, CAP3, and DeepBind applications, respectively. The experiments presented prove the robustness of the PMR platform, its capability to handle diverse applications, and its ease of deployment on public and private cloud platforms. The results presented through the experiments conducted prove the superior performance of PMR against the Hadoop framework. Good matching is reported between the theoretical makespan of PMR presented and the experimental values observed. In addition, adopting the PMR cloud computing framework also enables cost reduction and efficient utilization of cloud resources.

A performance study considering a cloud cluster with many nodes, additional applications, and security provisioning for the cloud computing framework is considered as the future work of this paper.

Data Availability

The data is available at the National Center for Biotechnology Information (2015), http://www.ncbi.nlm.nih.gov.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (no. NRF-2015R1D1A1A01061328).

References

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 137-150, 2004.

[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, Boston, MA, USA, June 2010.

[3] G. Malewicz, M. H. Austern, A. J. C. Bik et al., "Pregel: a system for large-scale graph processing," in Proceedings of the International Conference on Management of Data (SIGMOD '10), pp. 135-146, June 2010.

[4] S. Ghemawat, H. Gobioff, and S. Leung, "The Google file system," in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, p. 29, Bolton Landing, NY, USA, October 2003.

[5] X. Shi, M. Chen, L. He et al., "Mammoth: gearing Hadoop towards memory-intensive MapReduce applications," IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 8, pp. 2300-2315, 2015.

[6] L. Person, "World Hadoop Market - Opportunities and Forecasts," Allied Market Research, p. 108, 2020.

[7] SNS Research, The Big Data Market 2014-2020: Opportunities, Challenges, Strategies, Industry Verticals and Forecasts, SNS Research, 2014.

[8] J. Zhu, J. Li, E. Hardesty, H. Jiang, and K.-C. Li, "GPU-in-Hadoop: enabling MapReduce across distributed heterogeneous platforms," in Proceedings of the 13th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2014), pp. 321-326, China, June 2014.

[9] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proceedings of the Eighth USENIX Conference, pp. 29-42, 2008.

[10] L. D. Stein, "The case for cloud computing in genome informatics," Genome Biology, vol. 11, no. 5, article 207, 2010.

[11] D. P. Wall, P. Kudtarkar, V. A. Fusaro, R. Pivovarov, P. Patil, and P. J. Tonellato, "Cloud computing for comparative genomics," BMC Bioinformatics, vol. 11, no. 1, article 259, 2010.

[12] A. Rosenthal, P. Mork, M. H. Li, J. Stanford, D. Koester, and P. Reynolds, "Cloud computing: a new business paradigm for biomedical information sharing," Journal of Biomedical Informatics, vol. 43, no. 2, pp. 342-353, 2010.

[13] D. Dahiphale, R. Karve, A. V. Vasilakos et al., "An advanced MapReduce: cloud MapReduce, enhancements and applications," IEEE Transactions on Network and Service Management, vol. 11, no. 1, pp. 101-115, 2014.

[14] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good, "The cost of doing science on the cloud: the Montage example," in Proceedings of the ACM/IEEE Conference on Supercomputing, IEEE Press, November 2008.

[15] N. Chohan, C. Castillo, M. Spreitzer, M. Steinder, A. Tantawi, and C. Krintz, "See spot run: using spot instances for MapReduce workflows," in Proceedings of the 2010 USENIX Conference on Hot Topics in Cloud Computing, p. 7, 2010.

[16] R. S. Thakur, R. Bandopadhyay, B. Chaudhary, and S. Chatterjee, "Now and next-generation sequencing techniques: future of sequence analysis using cloud computing," Frontiers in Genetics, vol. 3, 2012.

[17] J. Chen, F. Qian, W. Yan, and B. Shen, "Translational biomedical informatics in the cloud: present and future," BioMed Research International, vol. 2013, Article ID 658925, 8 pages, 2013.

[18] T. Nguyen, W. Shi, and D. Ruden, "CloudAligner: a fast and full-featured MapReduce based tool for sequence mapping," BMC Research Notes, vol. 4, article 171, 2011.

[19] M. C. Schatz, "CloudBurst: highly sensitive read mapping with MapReduce," Bioinformatics, vol. 25, no. 11, pp. 1363-1369, 2009.

[20] B. Langmead, K. D. Hansen, and J. T. Leek, "Cloud-scale RNA-sequencing differential expression analysis with Myrna," Genome Biology, vol. 11, no. 8, p. R83, 2010.

[21] B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L. Salzberg, "Searching for SNPs with cloud computing," Genome Biology, vol. 10, no. 11, article R134, 2009.

[22] A. A. Al-Absi and D.-K. Kang, "A novel parallel computation model with efficient local memory management for data-intensive applications," in Proceedings of the 2015 IEEE 8th International Conference on Cloud Computing (CLOUD), pp. 958-963, New York City, NY, USA, June 2015.

[23] A. A. Al-Absi and D.-K. Kang, "Long read alignment with parallel MapReduce cloud platform," BioMed Research International, vol. 2015, Article ID 807407, 13 pages, 2015.

[24] E. Yoon and A. Squicciarini, "Toward detecting compromised MapReduce workers through log analysis," in Proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 41-50, Chicago, IL, USA, 2014.

[25] D. Dang, Y. Liu, X. Zhang, and S. Huang, "A crowdsourcing worker quality evaluation algorithm on MapReduce for big data applications," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 7, pp. 1879-1888, 2016.

[26] S. D. Tetali, M. Lesani, R. Majumdar, and T. Millstein, "MrCrypt: static analysis for secure cloud computations," in Proceedings of the 28th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2013), pp. 271-286, USA, October 2013.

[27] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, "Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning," Nature Biotechnology, vol. 33, no. 8, pp. 831-838, 2015.

[28] K. Mahadik, S. Chaterji, B. Zhou, M. Kulkarni, and S. Bagchi, "Orion: scaling genomic sequence matching with fine-grained parallelization," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2014), pp. 449-460, USA, November 2014.

[29] J. Ekanayake, T. Gunarathne, and J. Qiu, "Cloud technologies for bioinformatics applications," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 6, pp. 998-1011, 2011.

[30] X. Huang and A. Madan, "CAP3: a DNA sequence assembly program," Genome Research, vol. 9, no. 9, pp. 868-877, 1999.

[31] Y. Wu, W. Guo, J. Ren, X. Zhao, and W. Zheng, "NO2: speeding up parallel processing of massive compute-intensive tasks," IEEE Transactions on Computers, vol. 63, no. 10, pp. 2487-2499, 2014.

[32] Z. Zhang, L. Cherkasova, and B. T. Loo, "Optimizing cost and performance trade-offs for MapReduce job processing in the cloud," in Proceedings of the IEEE/IFIP Network Operations and Management Symposium: Management in a Software Defined World (NOMS 2014), Poland, May 2014.

[33] K. Chen, J. Powers, S. Guo, and F. Tian, "CRESP: towards optimal resource provisioning for MapReduce computing in public clouds," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 6, pp. 1403-1412, 2014.

[34] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2009.

[35] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, vol. 215, no. 3, pp. 403-410, 1990.

[36] National Center for Biotechnology Information, 2015, http://www.ncbi.nlm.nih.gov.



[11] D P Wall P Kudtarkar V A Fusaro R Pivovarov P Patil andP J Tonellato ldquoCloud computing for comparative genomicsrdquoBMC Bioinformatics vol 11 no 1 article 259 2010

[12] A Rosenthal P Mork M H Li J Stanford D Koester andP Reynolds ldquoCloud computing a new business paradigm forbiomedical information sharingrdquo Journal of Biomedical Infor-matics vol 43 no 2 pp 342ndash353 2010

[13] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014

[14] E Deelman G SinghM Livny B Berriman and J Good ldquoThecost of doing science on the cloud the montage examplerdquo inProceedings of the ACMIEEE Conference on SupercomputingIEEE Press November 2008

[15] N Chohan C Castillo M Spreitzer M Steinder A TantawiandCKrintz ldquoSee spot run using spot instances formapreduceworkflows inrdquo in Proceedings of the Proc 2010 USENIX Con-ference on Hot Topics in Cloud Computing p 7 2010

[16] R SThakur R Bandopadhyay B Chaudhary and S ChatterjeeldquoNow and next-generation sequencing techniques Future ofsequence analysis using cloud computingrdquo Frontiers in Geneticsvol 3 2012

[17] J Chen F QianW Yan and B Shen ldquoTranslational biomedicalinformatics in the cloud present and futurerdquo BioMed ResearchInternational vol 2013 Article ID 658925 8 pages 2013

[18] TNguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011

[19] M C Schatz ldquoCloudBurst highly sensitive read mapping withMapReducerdquoBioinformatics vol 25 no 11 pp 1363ndash1369 2009

[20] B Langmead K D Hansen and J T Leek ldquoCloud-scale RNA-sequencing differential expression analysis with Myrnardquo Ge-nome Biology vol 11 no 8 p R83 2010

[21] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 no 11 article R134 2009

[22] A A Al-Absi and D Kang ldquoA Novel Parallel ComputationModel with Efficient Local Memory Management for Data-Intensive Applicationsrdquo in Proceedings of the 2015 IEEE 8thInternational Conference on Cloud Computing (CLOUD) pp958ndash963 New York City NY USA June 2015

[23] A Al-Absi and Dae-K Kang I ldquoLong Read Alignment withParallelMapReduceCloud Platformrdquo BioMed Research Interna-tional vol 2015 Article ID 807407 13 pages 2015

[24] E Yoon and A Squicciarini ldquoToward Detecting CompromisedMapReduce Workers through Log Analysisrdquo in Proceedings ofthe 14th IEEEACM International Symposium on Cluster Cloudand Grid Computing pp 41ndash50 Chicago IL USA 2014

[25] D Dang Y Liu X Zhang and S Huang ldquoA CrowdsourcingWorker Quality Evaluation Algorithm on MapReduce for BigData Applicationsrdquo IEEE Transactions on Parallel and Dis-tributed Systems vol 27 no 7 pp 1879ndash1888 2016

[26] SDTetaliM Lesani RMajumdar andTMillstein ldquoMrCryptStatic analysis for secure cloud computationsrdquo in Proceedingsof the 2013 28th ACM SIGPLAN Conference on Object-OrientedProgramming Systems Languages and Applications OOPSLA2013 pp 271ndash286 USA October 2013

[27] B Alipanahi A Delong M T Weirauch and B J Frey ldquoPre-dicting the sequence specificities of DNA- and RNA-bindingproteins by deep learningrdquo Nature Biotechnology vol 33 no 8pp 831ndash838 2015

BioMed Research International 17

[28] K Mahadik S Chaterji B Zhou M Kulkarni and S BagchildquoOrion Scaling Genomic Sequence Matching with Fine-GrainedParallelizationrdquo inProceedings of the International Con-ference for High Performance Computing Networking Storageand Analysis SC 2014 pp 449ndash460 USA November 2014

[29] J Ekanayake T Gunarathne and J Qiu ldquoCloud technologiesfor bioinformatics applicationsrdquo IEEE Transactions on Paralleland Distributed Systems vol 22 no 6 pp 998ndash1011 2011

[30] X Huang and A Madan ldquoCAP3 a DNA sequence assemblyprogramrdquo Genome Research vol 9 no 9 pp 868ndash877 1999

[31] Y Wu W Guo J Ren X Zhao and W Zheng ldquo1198731198742 speedingup parallel processing of massive compute-intensive tasksrdquoInstitute of Electrical and Electronics Engineers Transactions onComputers vol 63 no 10 pp 2487ndash2499 2014

[32] Z Zhang L Cherkasova and B T Loo ldquoOptimizing cost andperformance trade-offs for MapReduce job processing in thecloudrdquo in Proceedings of the IEEEIFIP Network Operations andManagement Symposium Management in a Software DefinedWorld NOMS 2014 Poland May 2014

[33] K Chen J Powers S Guo and F Tian ldquoCRESP Towards opti-mal resource provisioning for MapReduce computing in publiccloudsrdquo IEEE Transactions on Parallel and Distributed Systemsvol 25 no 6 pp 1403ndash1412 2014

[34] T White Hadoop The Definitive Guide OReilly Media TheDefinitive Guide OrsquoReilly Media Hadoop 2009

[35] S F AltschulW Gish WMiller EWMyers andD J LipmanldquoBasic local alignment search toolrdquo Journal ofMolecular Biologyvol 215 no 3 pp 403ndash410 1990

[36] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov

Hindawiwwwhindawicom

International Journal of

Volume 2018

Zoology

Hindawiwwwhindawicom Volume 2018

Anatomy Research International

PeptidesInternational Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Journal of Parasitology Research

GenomicsInternational Journal of

Hindawiwwwhindawicom Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Hindawiwwwhindawicom Volume 2018

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Neuroscience Journal

Hindawiwwwhindawicom Volume 2018

BioMed Research International

Cell BiologyInternational Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Biochemistry Research International

ArchaeaHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Genetics Research International

Hindawiwwwhindawicom Volume 2018

Advances in

Virolog y Stem Cells International

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Enzyme Research

Hindawiwwwhindawicom Volume 2018

International Journal of

MicrobiologyHindawiwwwhindawicom

Nucleic AcidsJournal of

Volume 2018

Submit your manuscripts atwwwhindawicom

Page 7: Parallel MapReduce: Maximizing Cloud Resource Utilization ...downloads.hindawi.com/journals/bmri/2018/7501042.pdf · MapReduce platform. Experimental outcomes show that theirmodelachievesaspeedupof.xovermpiBLASTwith-outcompromisingonaccuracy


4. Performance Evaluation

Experiments conducted to evaluate the performance of PMR are presented in this section. Performance of PMR is compared with the state-of-the-art Hadoop framework. Hadoop is the most widely adopted MapReduce platform for computing in cloud environments [34]; hence it is considered for comparisons. The PMR framework is developed using VC++, C, and Node.js and deployed on the Azure cloud. Hadoop 2 (version 2.6) is used and deployed on the Azure cloud using HDInsight. The PMR framework is deployed with one master node and 4 worker nodes. Each worker node is deployed on an A3 virtual machine instance, which consists of 4 virtual computing cores, 7 GB of RAM, and 120 GB of local hard drive space. The Hadoop platform deployed for evaluation likewise consists of one master and 4 worker nodes in the cluster, giving a uniform configuration of the PMR and Hadoop frameworks on the Azure cloud.
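For orientation, the deployment described above can be captured as a small configuration record. This is purely a descriptive sketch of the stated parameters; the field names are illustrative and do not correspond to any Azure SDK or HDInsight API.

```python
# Descriptive sketch of the evaluation deployment (values as stated in
# the text; field names are illustrative, not an Azure/HDInsight API).
AZURE_A3_INSTANCE = {
    "virtual_cores": 4,     # computing cores per worker VM
    "ram_gb": 7,
    "local_disk_gb": 120,
}

CLUSTER = {
    "master_nodes": 1,
    "worker_nodes": 4,      # identical for PMR and Hadoop 2.6 (HDInsight)
    "worker_instance": AZURE_A3_INSTANCE,
}
```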

Biomedical applications characterized by processing of massive amounts of genetic data are considered in the experiments for performance evaluation. Computationally heavy biomedical applications, namely, BLAST [35], CAP3 [30], and the recent state-of-the-art DeepBind [27], are adopted for evaluation. All the genomic sequences considered for the experimental analyses are obtained from the publicly available NCBI database [36]. For comprehensive performance evaluation, various application scenarios are considered: in the experiments conducted using BLAST, both the Map and Reduce phases are involved; in the CAP3 application, the Map phase plays the predominant role; in DeepBind, the Reduce phase is critical for analysis.

4.1. BLAST. Gene sequence alignment is a fundamental operation adopted to identify similarities that exist between a query protein sequence, DNA, or RNA and a database of maintained sequences. Sequence alignment is computationally heavy, and its computation complexity is proportional to the product of the lengths of the two sequences being analyzed. The massive volume of sequences maintained in the database to be searched induces an additional computation burden. BLAST is a widely adopted bioinformatics tool for sequence alignment which performs faster alignments at the expense of accuracy (possibly missing some potential hits) [35]. The drawbacks of BLAST and its improvements are discussed in [28]. For the evaluation here, the improved BLAST algorithm of [28] is adopted. To improve computation time, a heuristic strategy is used that compromises accuracy minimally: an initial match is found and is later extended to obtain the complete matching sequence.

A three-stage approach is adopted in BLAST for sequence alignment. The query sequence is represented as q and the reference sequence as r. Sequences q and r are said to consist of k-length subsequences known as k-mers. In the initial stage, also known as the k-mer match stage, BLAST considers each of the k-mers of q and r and searches for k-mers that match in both. This process is repeated to build a scanner of all k-letter words in the query q. Then BLAST searches the reference genome r using the scanner built, to find k-mers of r that match the query q; these matches are seeds of potential hits.
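The k-mer match stage can be illustrated with a minimal sketch. The following Python fragment is a toy reconstruction, not the optimized BLAST implementation (which also uses neighborhood words and scoring thresholds): it builds the k-mer scanner over q and collects seed hits in r.

```python
from collections import defaultdict

def kmer_seeds(q: str, r: str, k: int = 11):
    """Stage 1 (k-mer match): index every k-mer of the query q, then scan
    the reference r; each shared k-mer yields a seed (q_pos, r_pos)."""
    index = defaultdict(list)                 # k-mer -> positions in q
    for i in range(len(q) - k + 1):
        index[q[i:i + k]].append(i)
    seeds = []
    for j in range(len(r) - k + 1):
        for i in index.get(r[j:j + k], ()):
            seeds.append((i, j))
    return seeds

# kmer_seeds("ACGTACGTGG", "TTACGTACGA", k=5) -> [(3, 1), (0, 2), (1, 3), (2, 4)]
```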

In the second stage, also known as the ungapped alignment stage, every seed identified previously is extended in both directions to include matches and mismatches. A match is found if the nucleotides in q and r are the same; a mismatch occurs if different nucleotides are observed. A mismatch reduces the score and a match increases the score of the candidate sequence alignment. The present score of the sequence alignment, a, and the highest score obtained for the present seed, a↑, are retained. The second phase is terminated if a↑ − a is higher than the predefined X-drop threshold h_x, and it returns the highest alignment score of the present seed. The alignment is passed to stage three if the returned score is higher than the predefined ungapped threshold H_y. The predefined thresholds establish the accuracy of alignment scores in BLAST. Computational optimization is achieved by skipping seeds already covered by previous alignments. The initial two phases of BLAST are executed in the Map workers of PMR and Hadoop.
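A minimal sketch of the ungapped X-drop extension just described follows; the scoring values are illustrative defaults, not the parameters used in [28].

```python
def ungapped_extend(q, r, qi, rj, k, match=1, mismatch=-3, h_x=20):
    """Stage 2 (ungapped alignment): extend the seed (qi, rj) of length k
    in both directions without gaps; a direction terminates when the best
    score seen exceeds the current score by more than the X-drop
    threshold h_x, and the best score is kept."""
    def extend(step, i, j):
        score = best = 0
        while 0 <= i < len(q) and 0 <= j < len(r):
            score += match if q[i] == r[j] else mismatch
            best = max(best, score)
            if best - score > h_x:            # X-drop termination
                break
            i += step
            j += step
        return best

    seed_score = k * match                    # the seeded k-mer matches exactly
    return seed_score + extend(-1, qi - 1, rj - 1) + extend(+1, qi + k, rj + k)
```

A seed whose returned score exceeds the ungapped threshold H_y would then be promoted to stage three.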

In stage three, gapped alignment is performed in the left and right directions, where deletions and insertions are allowed during extension of alignments. As in the previous stage, the highest alignment score a↑ is kept, and if the present score a is lower than a↑ by more than the X-drop threshold, the stage is terminated and the corresponding alignment outcome is obtained. The gapped alignment operation is carried out in the Reduce phase of the PMR and Hadoop frameworks. The schematic of the BLAST algorithm on the PMR framework is shown in Figure 3.

Experiments conducted to evaluate the performance of PMR and Hadoop considered the Drosophila database as the reference database. The query genomes of varied sizes are from Homo sapiens chromosomal sequences and genomic scaffolds. A total of six different query sequences are considered, similar to [28]. The configuration of each experiment is summarized in Table 1. All six experiments are conducted using the BLAST algorithm on the Hadoop and PMR frameworks. All observations retrieved through a set of log files generated during the Map and Reduce phases of Hadoop and PMR are noted and stored for further analysis. Using the log files, the total makespan, Map worker makespan, and Reduce worker makespan of Hadoop and PMR are noted for each experiment. It must be noted that the initialization time of the VM cluster is not considered in computing the makespan, as it is uniform in PMR and Hadoop owing to similar cluster configurations.
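The makespan bookkeeping described above reduces to interval arithmetic over task start and end times. A sketch follows, assuming a simplified CSV log schema (phase, worker, start_s, end_s) rather than the actual Hadoop/PMR log formats:

```python
import csv

def makespans(log_path):
    """Per-phase and total makespan from a simplified task log; VM-cluster
    initialization records are assumed to be filtered out upstream, as in
    the experiments."""
    spans = {"map": [], "reduce": []}
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            spans[row["phase"]].append((float(row["start_s"]), float(row["end_s"])))
    per_phase = {p: max(e for _, e in v) - min(s for s, _ in v)
                 for p, v in spans.items() if v}
    starts_ends = spans["map"] + spans["reduce"]
    total = max(e for _, e in starts_ends) - min(s for s, _ in starts_ends)
    return per_phase, total
```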

Individual task execution times of the Map worker and Reduce worker nodes observed for each BLAST experiment executed on the Hadoop and PMR frameworks are shown graphically in Figure 4: Figure 4(a) presents results obtained for Hadoop, and Figure 4(b) presents results obtained on PMR. Execution times of Map workers in PMR and Hadoop are dominantly higher than Reduce worker times, owing to the fact that the major computation-intensive phases (i.e., Phases 1 and 2) of the BLAST sequence alignment application are carried out in the Map phase. Parallel execution of BLAST sequence alignment utilizing all 4 cores available with each Map worker node in PMR results in lower execution times when compared to Hadoop Map worker nodes.

[Figure 3: BLAST PMR framework. The PMR master node distributes query sections q1..qn (with overlap) and the reference database sequence R to Map workers, each running BLAST on the PMR platform with local memory; intermediate results pass through temporary cloud storage to Reduce workers (Shuffle, Sort, Reduce) performing alignment aggregation, and the BLAST alignment result is written to cloud storage.]

Table 1: Information on the genome sequences used as queries, considering equal section lengths from Homo sapiens chromosome 15.

Experiment Id | Query genome | Query genome size (bp) | Database sequence | Reference genome size (bp)
1 | NT_007914 | 14866257 | Drosophila database | 122653977
2 | AC_000156 | 19317006 | Drosophila database | 122653977
3 | NT_011512 | 33734175 | Drosophila database | 122653977
4 | NT_033899 | 47073726 | Drosophila database | 122653977
5 | NT_008413 | 43212167 | Drosophila database | 122653977
6 | NT_022517 | 90712458 | Drosophila database | 122653977

Average reduction of execution time in Map workers of PMR is 34.19%, 34.15%, 34.78%, 35.29%, 35.76%, and 39.87% in the experiments conducted, when compared to Hadoop Map worker average execution times. As the query size increases, the performance improvement of PMR increases. The parallel execution strategy of Reduce worker nodes proposed in PMR is clearly visible in Figure 4(b): Reduce workers are initiated as soon as two or more Map worker nodes have completed their tasks (see the sketch below). The execution time of Reduce worker nodes in PMR is marginally higher than that of Hadoop; waiting for the remaining Map worker nodes to complete their tasks is the primary reason for this marginal increase. The sequential processing of worker nodes in the Hadoop framework, i.e., Map workers first and then Reduce worker execution, is evident from Figure 4(a).
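The early-start behaviour can be sketched as a small scheduling loop. This is an illustrative reconstruction of the policy described in the text, not the authors' implementation; it assumes the reduce function is associative so partial results can be folded in as Map outputs arrive.

```python
import queue
import threading

def run_pmr_job(map_tasks, reduce_fn, min_finished_maps=2):
    """PMR-style scheduling sketch: Reduce starts once min_finished_maps
    Map workers have finished and then folds in the remaining Map output
    as it arrives, instead of waiting behind Hadoop's full Map barrier."""
    intermediate = queue.Queue()

    def map_worker(task):
        intermediate.put(task())              # task() yields intermediate data

    threads = [threading.Thread(target=map_worker, args=(t,)) for t in map_tasks]
    for t in threads:
        t.start()

    # Begin reducing after the first min_finished_maps outputs exist
    # (the early Reduce start visible in Figure 4(b)).
    state = reduce_fn([intermediate.get() for _ in range(min_finished_maps)])
    for _ in range(len(map_tasks) - min_finished_maps):
        state = reduce_fn([state, intermediate.get()])  # assumes associativity

    for t in threads:
        t.join()
    return state
```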

The total makespan of PMR and Hadoop is dependent on the task execution times of worker nodes during the Map phase and Reduce phase. The total makespan observed in the BLAST sequence alignment experiments executed on the Hadoop and PMR frameworks is shown in Figure 5. Superior performance in terms of Reduce makespan times of PMR is evident when compared to Hadoop. Though a marginal increase in Reduce worker execution time is reported, the overall execution time, i.e., the total makespan, of PMR is lower when compared to Hadoop. A reduction of 29.23%, 30.61%, 32.78%, 33.17%, 33.33%, and 38.23% is reported for the six experiments executed on the PMR framework when compared to similar experiments executed on the Hadoop framework. The average reduction of the total makespan across all experiments is 32.89%, proving the superior performance of PMR when compared to the Hadoop framework.
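The reported average follows directly from the six per-experiment reductions:

```python
reductions = [29.23, 30.61, 32.78, 33.17, 33.33, 38.23]  # experiments 1-6 (%)
print(round(sum(reductions) / len(reductions), 2))       # 32.89
```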

The theoretical makespan of PMR given by (18) is computed and compared against the practical values observed in all the experiments. The results obtained are shown in Figure 6. Minor variations are observed between the practical and theoretical makespan computations; overall, good correlation is reported between the practical and theoretical makespan values. Based on the results presented, it is evident that execution of the BLAST sequence alignment algorithm on the proposed PMR yields superior results when compared to similar experiments conducted on the existing Hadoop framework. The accuracy and correctness of the theoretical makespan model of PMR are proved through correlation measures.
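The paper does not state which correlation measure was used; as one plausible choice, a Pearson coefficient over the per-experiment theoretical and measured makespans could be computed as follows (illustrative only):

```python
def pearson(xs, ys):
    """Pearson correlation between theoretical and measured makespans."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```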

[Figure 4: BLAST sequence alignment execution makespan of the Map and Reduce worker nodes for Experiments 1-6, plotting per-task execution against makespan time (s). (a) On a Hadoop cluster of 4 nodes. (b) On a PMR cluster of 4 nodes.]

[Figure 5: BLAST sequence alignment total makespan time (s) observed for Experiments 1-6 conducted on the PMR and Hadoop frameworks.]


4.2. CAP3. DNA sequence assembly tools are used in bioinformatics for gene discovery and for understanding the genomes of existing and new organisms. CAP3 is one such popular tool used to assemble DNA sequences. DNA assembly is achieved by performing merging and aligning operations on smaller sequence fragments to build complete genome sequences. CAP3 eliminates poor sections observed within DNA fragments, computes overlaps amongst DNA fragments, identifies and eliminates false overlaps, accumulates fragments of one or more overlapping DNA segments to produce contigs, and performs multiple sequence alignments to produce consensus sequences. CAP3 reads multiple gene sequences from an input FASTA file and generates output consensus sequences written to multiple files as well as to standard output.

The CAP3 gene sequence assembly working principle consists of the following key stages. Firstly, the poor regions of the 3′ (three-prime) and 5′ (five-prime) ends of each read are identified and eliminated, and false overlaps are identified and eliminated. Secondly, to form contigs, reads are combined in descending order of overlap scores. Further, to incorporate modifications to the contigs constructed, forward-reverse constraints are adopted. Lastly, multiple sequence alignments of reads are constructed per contig, resulting in consensus sequences characterized by a quality value for each base. Quality values of consensus sequences are used in the construction of multiple sequence alignment operations and also in the computation of overlaps. The operational steps of the CAP3 assembly model are shown in Figure 7. A detailed explanation of the CAP3 gene sequence assembly is provided in [30].
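The overlap-then-merge idea at the heart of the second stage can be shown with a toy greedy assembler. This sketch merges on raw overlap length only; CAP3 itself uses alignment-based overlap scores, base quality values, and forward-reverse constraints, all omitted here.

```python
def overlap_len(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def greedy_assemble(reads):
    """Toy greedy assembly in the spirit of CAP3's second stage: repeatedly
    merge the pair of fragments with the highest overlap score."""
    reads = list(reads)
    while len(reads) > 1:
        n, i, j = max(((overlap_len(a, b), i, j)
                       for i, a in enumerate(reads)
                       for j, b in enumerate(reads) if i != j),
                      default=(0, 0, 0))
        if n == 0:
            break                      # no remaining overlaps: reads stay as contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads                        # contigs

# greedy_assemble(["ACGTAC", "TACGGA", "GGATTT"]) -> ["ACGTACGGATTT"]
```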

In the experiments conducted, CAP3 gene sequence assembly is directly adopted in the Map phase of PMR and Hadoop, and result aggregation is considered in the Reduce phase (a sketch follows this paragraph). For performance evaluation of CAP3 execution on the PMR and Hadoop frameworks, Homo sapiens chromosome 15 is considered as a reference. Genome sequences of various sizes are considered as queries and submitted to the Azure cloud platform in the experiments. Query sequences for the experiments

are considered in accordance with [30]. The CAP3 experiments conducted with query genomic sequences (BAC datasets) are summarized in Table 2. All four experiments are conducted using the CAP3 algorithm on the Hadoop and PMR frameworks. Observations are retrieved through a set of log files generated during Map and Reduce phase execution on Hadoop and PMR. Using the log files, the total makespan, Map worker makespan, and Reduce worker makespan of Hadoop and PMR are noted for each experiment. It must be noted that the initialization time of the VM cluster is not considered in computing the makespan, as it is uniform in PMR and Hadoop owing to similar cluster configurations.
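The Map/Reduce placement described above (CAP3 in Map, aggregation in Reduce) could be sketched as follows, assuming the cap3 binary is on PATH and writes its contig output alongside the input file; chunking and worker dispatch are left to the framework.

```python
import subprocess
from pathlib import Path

def map_cap3(fasta_chunk: Path) -> Path:
    """Map task sketch: run CAP3 on one FASTA chunk; CAP3 writes
    <input>.cap.contigs next to the input file."""
    subprocess.run(["cap3", str(fasta_chunk)], check=True)
    return Path(str(fasta_chunk) + ".cap.contigs")

def reduce_aggregate(contig_files, out_path: Path) -> None:
    """Reduce task sketch: concatenate per-chunk contigs into one result."""
    with out_path.open("w") as out:
        for p in contig_files:
            out.write(Path(p).read_text())
```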

Task execution times of the Map and Reduce worker nodes observed for the CAP3 experiments conducted on the Hadoop and PMR frameworks are shown in Figure 8: Figure 8(a) presents results obtained for Hadoop, and Figure 8(b) presents results obtained on PMR. Execution times of Map worker nodes are far greater than execution times of Reduce worker nodes, as CAP3 algorithm execution is carried out in the Map phase and result accumulation in the Reduce phase. The parallel execution strategy (utilizing the 4 computing cores available with each Map worker) of the CAP3 algorithm on PMR enables lower execution times when compared to Hadoop Map worker nodes. The average reduction of execution time in Map workers of PMR is 19.33%, 20.07%, 15.09%, and 18.21% in the CAP3 experiments conducted, when compared to Hadoop Map worker average execution times. In PMR, the Reduce workers are initiated as soon as two or more Map worker nodes have completed their tasks, which is visible from Figure 8(b). The sequential processing strategy (i.e., Map workers first and then Reduce worker execution) of worker nodes in the Hadoop framework is evident from Figure 8(a). Execution time of Reduce worker nodes in PMR is marginally higher, by about 15.42%, than that of Hadoop; waiting for the remaining Map worker nodes to complete their tasks is the primary reason for this marginal increase.

The total makespan observed in the CAP3 experiments executed on the Hadoop and PMR frameworks is presented in Figure 9. Superior performance in terms of Reduce makespan times of PMR is evident when compared to Hadoop.

Table 2: Information on the genome sequences used in the CAP3 experiments.

Experiment Number | Dataset | GenBank accession number | Number of reads | Average length of reads (bp) | Length of provided sequences (bp)
1 | 203 | AC004669 | 1812 | 598 | 89779
2 | 216 | AC004638 | 2353 | 614 | 124645
3 | 322F16 | AF111103 | 4297 | 1011 | 159179
4 | 526N18 | AF123462 | 3221 | 965 | 180182

[Figure 6: Correlation between theoretical (PMR-Theory) and practical (PMR) makespan times (s) for BLAST sequence alignment execution on the PMR framework, Experiments 1-6.]

[Figure 7: Steps for CAP3 sequence assembly: removal of poor regions of each read; computation of overlaps among reads; removal of wrongly identified overlaps; contig construction; generation of consensus sequences and construction of multiple sequence alignments.]

Though a marginal increase in Reduce worker execution time is reported, the overall execution time, i.e., the total makespan, of PMR is lower when compared to Hadoop. A reduction of 18.97%, 20%, 15.03%, and 18.01% is reported for the four experiments executed on the PMR framework when compared to similar experiments executed on the Hadoop framework. The average reduction of the total makespan across all experiments is 18%, proving the superior performance of PMR when compared to the Hadoop framework. The makespan time for experiment 2 is greater than for the other experiments, as the number of differences considered in CAP3 is 17, larger than the values considered in the other experiments. A similar pattern of execution times is reported in [29], validating the CAP3 execution experiments presented here.

The theoretical makespan of PMR for all four CAP3 experiments is computed using (18). The comparison between theoretical and experimental makespan values is presented in Figure 10. Minor differences are reported between the practical and theoretical makespan computations, proving the correctness of the PMR makespan modeling presented.

The results presented in this section prove that CAP3 sequence assembly execution on the PMR cloud framework developed exhibits superior performance when compared to similar CAP3 experiments executed on the existing Hadoop cloud framework.

4.3. DeepBind Analysis to Identify Binding Sites. In recent times, deep learning techniques have been extensively used for various applications, predominantly when large amounts of data are to be processed or analyzed. To meet the large computing needs of deep learning techniques, GPUs are used. Motivated by this, the authors consider the very recent state-of-the-art "DeepBind" biomedical application for execution on a cloud platform. To the best of our knowledge, no attempt to consider cloud platforms for DeepBind execution has been reported.

Alternative splicing, transcription, and gene regulation biomedical operations are dependent on DNA- and RNA-binding proteins. DNA- and RNA-binding proteins, described using sequence specificities, are critical in identifying diseases and deriving models of the regulatory processes that occur in biological systems. Position weight matrices are used in characterizing the specificities of a protein: binding sites on genomic sequences are identified by scanning position weight matrices over the considered genomic sequences. DeepBind is used to predict sequence specificities and adopts deep convolutional neural networks to achieve accurate prediction. Comprehensive details and the sequence specificity prediction accuracy of the DeepBind application are available in [27].
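The classical PWM scan that DeepBind generalizes can be illustrated in a few lines. The matrix values below are invented for illustration; real PWMs are estimated from known binding sites.

```python
import numpy as np

# Toy position weight matrix (log-odds scores) for a motif of length 4;
# rows index the base (A, C, G, T), columns index the motif position.
PWM = np.array([[ 0.8, -1.2, -0.5, -1.0],   # A
                [-1.1,  0.9, -0.7, -1.3],   # C
                [-0.6, -0.8,  1.0, -0.9],   # G
                [-1.2, -1.0, -0.8,  1.1]])  # T

BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pwm_scan(seq: str, pwm: np.ndarray):
    """Slide the PWM over seq; return (best_score, offset) of the
    highest-scoring window, a candidate binding site."""
    w = pwm.shape[1]
    scores = [sum(pwm[BASE[b], k] for k, b in enumerate(seq[i:i + w]))
              for i in range(len(seq) - w + 1)]
    best = int(np.argmax(scores))
    return scores[best], best

# pwm_scan("TTACGTTT", PWM) -> (3.8, 2): the window "ACGT" scores highest
```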

[Figure 8: CAP3 sequence assembly execution makespan of the Map and Reduce worker nodes for Experiments 1-4, plotting per-task execution against makespan time (s). (a) On a Hadoop cluster of 4 nodes. (b) On a PMR cluster of 4 nodes.]


DeepBind is developed using a two-phase approach: a training phase and a testing phase. Training phase execution is carried out using Map workers in the Hadoop and PMR frameworks, and the trained weights are stored in cloud memory for further processing. The testing phase of DeepBind is carried out at the Reduce stage in the Hadoop and PMR frameworks. The execution strategy of the DeepBind algorithm on the PMR framework is shown in Figure 11. The DeepBind application is developed using the code provided in [27]. For performance evaluation on Hadoop and PMR, only the testing phase is discussed (i.e., Reduce-only mode). A custom cloud cluster of one master node and six worker nodes is deployed for DeepBind performance evaluation; a similar cloud cluster is considered for the Hadoop framework.
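Conceptually, the scoring applied at the Reduce stage follows the DeepBind pipeline of [27]: convolve learned motif detectors over a one-hot encoded sequence, rectify, pool, and combine linearly. A much-simplified sketch follows; the shapes and the linear stage are assumptions, and the real model's training, padding, and network details are omitted.

```python
import numpy as np

def deepbind_score(seq_onehot: np.ndarray, motifs: np.ndarray,
                   weights: np.ndarray, bias: float) -> float:
    """DeepBind-style scoring sketch. Shapes: seq_onehot (L, 4),
    motifs (D, m, 4) learned motif detectors, weights (D,)."""
    L, _ = seq_onehot.shape
    D, m, _ = motifs.shape
    pooled = np.empty(D)
    for d in range(D):
        # Convolve detector d over the sequence ...
        acts = [np.sum(motifs[d] * seq_onehot[i:i + m]) for i in range(L - m + 1)]
        # ... then rectify and max-pool its activations.
        pooled[d] = max(0.0, max(acts))
    return float(pooled @ weights + bias)   # assumed linear output stage
```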

Table 3: Information on the disease-causing genomic variants used in the experiment.

Case study | Genome variant | Experiment details
1 | SP1 | A disrupted SP1 binding site in the LDL-R promoter that leads to familial hypercholesterolemia
2 | TCF7L2 | A cancer risk variant in a MYC enhancer weakens a TCF7L2 binding site
3 | GATA1 | A gained GATA1 binding site that disrupts the original globin cluster promoters
4 | GATA4 | A lost GATA4 binding site in the BCL-2 promoter, potentially playing a role in ovarian granulosa cell tumors
5 | RFX3 | Loss of two potential RFX3 binding sites leads to abnormal cortical development
6 | GABPA | Gained GABP-α binding sites in the TERT promoter, which are linked to several types of aggressive cancer

[Figure 9: CAP3 sequence assembly total makespan time (s) observed for Experiments 1-4 conducted on the PMR and Hadoop frameworks.]

[Figure 10: Correlation between theoretical (PMR-Theory) and practical (PMR) makespan times (s) for CAP3 sequence assembly execution on the PMR framework, Experiments 1-4.]

The experiment conducted to evaluate the performance of DeepBind on the Hadoop and PMR frameworks considers a set of six disease-causing genomic variants obtained from [27]. The disease-causing genomic variants to be analyzed using DeepBind are summarized in Table 3. DeepBind analysis is executed on the Hadoop and PMR frameworks deployed on the custom cloud cluster, and the log data generated is stored and used in further analysis.

The results demonstrating the performance of the six worker cluster nodes of Hadoop and PMR during Map and Reduce phase execution are shown in Figure 12. Performance is presented in terms of the task execution times observed per worker node: the execution times of each Hadoop node during the Map and Reduce phases are shown in Figure 12(a), and the execution time observed for each PMR worker node during the Map and Reduce phases is shown in Figure 12(b). In the Map phase execution, the genomic variants to be analyzed are obtained from cloud storage and accumulated based on their defined identities [27], and the query sequences of the disease-causing genomic variants are split for parallelization. In the Reduce phase, the split query sequences are analyzed, and the results obtained are accumulated and stored in cloud storage.

[Figure 11: DeepBind PMR framework. The PMR master node distributes training data and query sequences to Map workers (DeepBind training on the PMR platform with local memory; trained weights stored in cloud storage); Reduce workers (Shuffle, Sort, Reduce) compute binding scores, and the computed scores are written via temporary cloud storage to cloud storage.]

[Figure 12: DeepBind execution makespan of the Map and Reduce worker nodes (Experiment 1), plotting per-task execution against makespan time (s). (a) On a Hadoop cluster of 6 nodes. (b) On a PMR cluster of 6 nodes.]

Map workers in PMR exhibit better performance, and an average execution time reduction of 48.57% is reported when compared to Hadoop Map worker nodes. The execution time of the six Reduce worker nodes in Hadoop and PMR is greater than that of the Map workers, as DeepBind analysis and identification of potential binding sites are carried out during this phase. The parallel execution strategy of Reduce worker nodes is clear from Figure 12(b): the Reduce phase in PMR commences after 5 seconds, once Map worker node 1 (MW1) and Map worker node 5 (MW5) have completed their tasks. In Hadoop, which adopts a sequential approach, the Reduce phase is initiated only after all Map worker nodes have completed their tasks. Parallel execution of DeepBind analysis utilizing all 4 computing cores available with the Reduce worker nodes, together with the parallel initiation of the Reduce phase in PMR, enables an average Reduce execution time reduction of 22.22% when compared to Hadoop Reduce worker nodes. The total makespan observed for DeepBind experiment execution on the Hadoop and PMR cloud computing platforms is shown in Figure 13. A total makespan reduction of 34.62% is achieved using the PMR framework when compared to the Hadoop framework. Analysis results similar to [27] are reported for DeepBind analysis on the Hadoop and PMR frameworks. The theoretical makespan computed using (18) for PMR is compared with the practical value observed in the experiment, and the results obtained are shown in Figure 14. A minor variation between theoretical and practical values is observed, predominantly due to application-dependent multiple cloud memory access operations. Based on the results obtained for DeepBind analysis, it is evident that performance on the PMR framework is far superior to execution on the existing Hadoop framework.


[Figure 13: DeepBind analysis total makespan time (s) observed for the experiment conducted on the PMR and Hadoop frameworks.]

[Figure 14: Correlation between theoretical (PMR-Theory) and practical (PMR) makespan times (s) for DeepBind analysis execution on the PMR framework.]

On the basis of the biomedical applications considered for performance evaluation and the results obtained, it is evident that the proposed PMR framework exhibits superior performance when compared to its existing Hadoop counterpart. In BLAST, both the Map and Reduce phases are utilized; in the CAP3 application, the Map phase plays the predominant role; in the DeepBind application, analysis is carried out in the Reduce phase. The proposed PMR cloud computing framework is robust, handles the dynamic biomedical application scenarios presented, and supports deployment on public and custom cloud platforms. In addition, PMR exhibits low execution times and enables effective cloud resource utilization; low execution times enable cost reduction, always a desired feature.

5. Conclusion and Future Work

The significance of cloud computing platforms is discussed, and the working of the commonly adopted Hadoop MapReduce framework, along with its drawbacks, is presented. To lower execution times and enable effective utilization of cloud resources, this paper proposes the PMR cloud computing platform. A parallel execution strategy of the Map and Reduce phases is considered in the PMR framework, and the Map and Reduce functions of PMR are designed to utilize the multicore environments available with worker nodes. The paper presents the proposed PMR framework architecture along with makespan modeling. Performance of the PMR cloud computing framework is compared with the Hadoop framework. For performance evaluation, computationally heavy biomedical applications, namely, BLAST, CAP3, and DeepBind, are considered. Average overall makespan time reductions of 38.92%, 18.00%, and 34.62% are achieved using the PMR framework when compared to the Hadoop framework for the BLAST, CAP3, and DeepBind applications, respectively. The experiments presented prove the robustness of the PMR platform, its capability to handle diverse applications, and its ease of deployment on public and private cloud platforms. The results presented through the experiments conducted prove the superior performance of PMR against the Hadoop framework, and good matching is reported between the theoretical makespan of PMR and the experimental values observed. In addition, adopting the PMR cloud computing framework also enables cost reduction and efficient utilization of cloud resources.

A performance study considering cloud clusters with many nodes, additional applications, and security provisioning for the cloud computing framework is considered as the future work of this paper.

Data Availability

The data is available at the National Center for Biotechnology Information (2015), http://www.ncbi.nlm.nih.gov.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (no. NRF-2015R1D1A1A01061328).

References

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 137–150, 2004.

[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, Boston, MA, June 2010.

[3] G. Malewicz, M. H. Austern, A. J. C. Bik et al., "Pregel: a system for large-scale graph processing," in Proceedings of the International Conference on Management of Data (SIGMOD '10), pp. 135–146, June 2010.

[4] S. Ghemawat, H. Gobioff, and S. Leung, "The Google file system," in Proceedings of the Nineteenth ACM Symposium, p. 29, Bolton Landing, NY, USA, October 2003.

[5] X. Shi, M. Chen, L. He et al., "Mammoth: gearing Hadoop towards memory-intensive MapReduce applications," IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 8, pp. 2300–2315, 2015.

[6] L. Person, "World Hadoop Market - Opportunities and Forecasts," Allied Market Research, p. 108, 2020.

[7] SNS Research, The Big Data Market 2014–2020: Opportunities, Challenges, Strategies, Industry Verticals and Forecasts, SNS Research, 2014.

[8] J. Zhu, J. Li, E. Hardesty, H. Jiang, and K.-C. Li, "GPU-in-Hadoop: enabling MapReduce across distributed heterogeneous platforms," in Proceedings of the 2014 13th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2014), pp. 321–326, China, June 2014.

[9] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proceedings of the Eighth USENIX Conference, pp. 29–42, 2008.

[10] L. D. Stein, "The case for cloud computing in genome informatics," Genome Biology, vol. 11, no. 5, article 207, 2010.

[11] D. P. Wall, P. Kudtarkar, V. A. Fusaro, R. Pivovarov, P. Patil, and P. J. Tonellato, "Cloud computing for comparative genomics," BMC Bioinformatics, vol. 11, no. 1, article 259, 2010.

[12] A. Rosenthal, P. Mork, M. H. Li, J. Stanford, D. Koester, and P. Reynolds, "Cloud computing: a new business paradigm for biomedical information sharing," Journal of Biomedical Informatics, vol. 43, no. 2, pp. 342–353, 2010.

[13] D. Dahiphale, R. Karve, A. V. Vasilakos et al., "An advanced MapReduce: cloud MapReduce, enhancements and applications," IEEE Transactions on Network and Service Management, vol. 11, no. 1, pp. 101–115, 2014.

[14] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good, "The cost of doing science on the cloud: the Montage example," in Proceedings of the ACM/IEEE Conference on Supercomputing, IEEE Press, November 2008.

[15] N. Chohan, C. Castillo, M. Spreitzer, M. Steinder, A. Tantawi, and C. Krintz, "See spot run: using spot instances for MapReduce workflows," in Proceedings of the 2010 USENIX Conference on Hot Topics in Cloud Computing, p. 7, 2010.

[16] R. S. Thakur, R. Bandopadhyay, B. Chaudhary, and S. Chatterjee, "Now and next-generation sequencing techniques: future of sequence analysis using cloud computing," Frontiers in Genetics, vol. 3, 2012.

[17] J. Chen, F. Qian, W. Yan, and B. Shen, "Translational biomedical informatics in the cloud: present and future," BioMed Research International, vol. 2013, Article ID 658925, 8 pages, 2013.

[18] T. Nguyen, W. Shi, and D. Ruden, "CloudAligner: a fast and full-featured MapReduce based tool for sequence mapping," BMC Research Notes, vol. 4, article 171, 2011.

[19] M. C. Schatz, "CloudBurst: highly sensitive read mapping with MapReduce," Bioinformatics, vol. 25, no. 11, pp. 1363–1369, 2009.

[20] B. Langmead, K. D. Hansen, and J. T. Leek, "Cloud-scale RNA-sequencing differential expression analysis with Myrna," Genome Biology, vol. 11, no. 8, p. R83, 2010.

[21] B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L. Salzberg, "Searching for SNPs with cloud computing," Genome Biology, vol. 10, no. 11, article R134, 2009.

[22] A. A. Al-Absi and D. Kang, "A novel parallel computation model with efficient local memory management for data-intensive applications," in Proceedings of the 2015 IEEE 8th International Conference on Cloud Computing (CLOUD), pp. 958–963, New York City, NY, USA, June 2015.

[23] A. Al-Absi and D.-K. Kang, "Long read alignment with parallel MapReduce cloud platform," BioMed Research International, vol. 2015, Article ID 807407, 13 pages, 2015.

[24] E. Yoon and A. Squicciarini, "Toward detecting compromised MapReduce workers through log analysis," in Proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 41–50, Chicago, IL, USA, 2014.

[25] D. Dang, Y. Liu, X. Zhang, and S. Huang, "A crowdsourcing worker quality evaluation algorithm on MapReduce for big data applications," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 7, pp. 1879–1888, 2016.

[26] S. D. Tetali, M. Lesani, R. Majumdar, and T. Millstein, "MrCrypt: static analysis for secure cloud computations," in Proceedings of the 2013 28th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2013), pp. 271–286, USA, October 2013.

[27] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, "Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning," Nature Biotechnology, vol. 33, no. 8, pp. 831–838, 2015.

[28] K. Mahadik, S. Chaterji, B. Zhou, M. Kulkarni, and S. Bagchi, "Orion: scaling genomic sequence matching with fine-grained parallelization," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2014), pp. 449–460, USA, November 2014.

[29] J. Ekanayake, T. Gunarathne, and J. Qiu, "Cloud technologies for bioinformatics applications," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 6, pp. 998–1011, 2011.

[30] X. Huang and A. Madan, "CAP3: a DNA sequence assembly program," Genome Research, vol. 9, no. 9, pp. 868–877, 1999.

[31] Y. Wu, W. Guo, J. Ren, X. Zhao, and W. Zheng, "NO2: speeding up parallel processing of massive compute-intensive tasks," IEEE Transactions on Computers, vol. 63, no. 10, pp. 2487–2499, 2014.

[32] Z. Zhang, L. Cherkasova, and B. T. Loo, "Optimizing cost and performance trade-offs for MapReduce job processing in the cloud," in Proceedings of the IEEE/IFIP Network Operations and Management Symposium: Management in a Software Defined World (NOMS 2014), Poland, May 2014.

[33] K. Chen, J. Powers, S. Guo, and F. Tian, "CRESP: towards optimal resource provisioning for MapReduce computing in public clouds," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 6, pp. 1403–1412, 2014.

[34] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2009.

[35] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.

[36] National Center for Biotechnology Information, 2015, http://www.ncbi.nlm.nih.gov.


Page 8: Parallel MapReduce: Maximizing Cloud Resource Utilization ...downloads.hindawi.com/journals/bmri/2018/7501042.pdf · MapReduce platform. Experimental outcomes show that theirmodelachievesaspeedupof.xovermpiBLASTwith-outcompromisingonaccuracy

8 BioMed Research International

Query sequence q(with overlap)

Referencedatabase sequence (R)

Local Memory

hellip

PMR Platform

Local Memory

hellip

PMR Platform

Alignment aggregation

Cloud storage

Shuffle Sort Reduce

Virtual Machine ndash Reduce Worker 1Shuffle Sort Reduce

Virtual Machine ndash Reduce Worker w

Virtual Machine ndash Map Worker wVirtual Machine ndash Map Worker 1

Blast alignment result

Temporary Cloud storage

Cloud storage

PMR Platform PMR Platform

PMR-Master Node

Blast Blast

q1R1 q1R2 q1R qR1 qR2 qR

Figure 3 BLAST PMR framework

Table 1 Information of the Genome Sequences used as queries considering equal section lengths from Homo sapiens chromosome 15 as areference

Experiment Id Query genome Query genome size(bp) Database sequence Reference genome size

(bp)1 NT 007914 14866257 Drosophila database 1226539772 AC 000156 19317006 Drosophila database 1226539773 NT 011512 33734175 Drosophila database 1226539774 NT 033899 47073726 Drosophila database 1226539775 NT 008413 43212167 Drosophila database 1226539776 NT 022517 90712458 Drosophila database 122653977

reduction of execution time in Map workers of 119875119872119877 is3419 3415 3478 3529 3576 and 3987 inexperiments conducted when compared to Hadoop Mapworker average execution times As the query size increasesperformance improvement of 119875119872119877 increases Parallel execu-tion strategy of Reduce worker nodes proposed in 119875119872119877 isclearly visible in Figure 4(b) In other words Reduce workersare initiated as soon as two or more Map worker nodes havecompleted their tasks The execution time of Reduce workernodes in 119875119872119877 is marginally higher than those of HadoopWaiting for all Map worker nodes to complete their tasks is aprimary reason for the marginal increase in Reduce workerexecution times in 119872119877 Sequential processing ie Mapworkers first and then Reduce worker execution of workernodes in Hadoop framework is evident from Figure 4(a)

The total makespan of119875119872119877 and Hadoop is dependent ontask execution time of worker nodes during the Map phaseand Reduce phase The total makespan observed in BLASTsequence alignment experiments executed on Hadoop and119875119872119877 frameworks is shown in Figure 5 Superior performancein terms of Reduce makespan times of 119875119872119877 is evident when

compared to HadoopThough a marginal increase in Reduceworker execution time is reported overall execution time ietotal makespan of 119875119872119877 is less when compared to HadoopA reduction of 2923 3061 3278 3317 3333 and3823 is reported for six experiments executed on the119875119872119877 framework when compared to similar experimentsexecuted on Hadoop framework Average reduction of thetotal makespan across all experiments is 3289 provingsuperior performance of 119875119872119877 when compared to Hadoopframework

Theoretical makespan of 119875119872119877 119894119890 J given by (18)is computed and compared against the practical valuesobserved in all the experiments Results obtained are shownin Figure 6 Minor variations are observed between practicaland theoretical makespan computations Overall good cor-relation is reported between practical makespan values andtheoretical makespan values Based on the results presentedit is evident that execution of BLAST sequence alignmentalgorithm on the proposed 119875119872119877 yields superior resultswhen compared to similar experiments conducted on theexisting Hadoop framework Accuracy and correctness of the

BioMed Research International 9

0 50 100 150 200

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - Hadoop (Experiment 3)

0 10 20 30 40 50 60 70R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - Hadoop (Experiment 1)

0 20 40 60 80 100 120

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - Hadoop (Experiment 2)

0 50 100 150 200 250

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - Hadoop (Experiment 4)

0 50 100 150 200 250 300

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - Hadoop (Experiment 5)

0 50 100 150 200 250 300 350 400

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution Considering Hadoop (Experiment 6)

(a) Worker node execution times onHadoop frame-work

0 20 40 60 80 100 120 140

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution -PMR (Experiment 3)

0 10 20 30 40 50R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution -PMR (Experiment 1)

0 10 20 30 40 50 60 70 80

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution -PMR (Experiment 2)

0 20 40 60 80 100 120 140 160

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - PMR (Experiment 4)

0 50 100 150 200

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - PMR (Experiment 5)

0 50 100 150 200 250 300

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution Considering PMR (Experiment 6)

(b) Worker node execution times on 119875119872119877 frame-work

Figure 4 BLAST sequence alignment execution makespan of the Map and Reduce worker nodes (a) On Hadoop cluster of 4 nodes (b) On119875119872119877 cluster of 4 nodes

Figure 5: BLAST sequence alignment total makespan time (s) observed for Experiments 1-6 conducted on the PMR and Hadoop frameworks.


4.2. CAP3. DNA sequence assembly tools are used in bioinformatics for gene discovery and for understanding the genomes of existing and new organisms. CAP3 is one such popular tool used to assemble DNA sequences. DNA assembly is achieved by performing merging and aligning operations on smaller sequence fragments to build complete genome sequences. CAP3 eliminates poor sections observed within DNA fragments, computes overlaps amongst DNA fragments, identifies and eliminates false overlaps, accumulates fragments of one or multiple overlapping DNA segments to produce contigs, and performs multiple sequence alignments to produce consensus sequences. CAP3 reads multiple gene sequences from an input FASTA file and generates output consensus sequences written to multiple files and also to standard output.

The CAP3 gene sequence assembly working principle consists of the following key stages. Firstly, the poor regions of the 3′ (three-prime) and 5′ (five-prime) ends of each read are identified and eliminated, and false overlaps are identified and eliminated. Secondly, to form contigs, reads are combined based on overlapping scores in descending order; further, to incorporate modifications to the contigs constructed, forward-reverse constraints are adopted. Lastly, multiple sequence alignments of reads are constructed per contig, resulting in consensus sequences characterized by a quality value for each base. Quality values of consensus sequences are used in the construction of multiple sequence alignment operations and also in the computation of overlaps. Operational steps of the CAP3 assembly model are shown in Figure 7. A detailed explanation of the CAP3 gene sequence assembly is provided in [30].

In the experiments conducted, CAP3 gene sequence assembly is directly adopted in the Map phase of PMR and Hadoop, and result aggregation is considered in the Reduce phase. For performance evaluation of CAP3 execution on the PMR and Hadoop frameworks, Homo sapiens chromosome 15 is considered as a reference. Genome sequences of various sizes are considered as queries and submitted to the Azure cloud platform in the experiments. Query sequences for the experiments are chosen in accordance with [30]. The CAP3 experiments conducted with query genomic sequences (BAC datasets) are summarized in Table 2. All four experiments are conducted using the CAP3 algorithm on the Hadoop and PMR frameworks. Observations are retrieved through a set of log files generated during Map and Reduce phase execution on Hadoop and PMR. Using the log files, the total makespan, Map worker makespan, and Reduce worker makespan of Hadoop and PMR are noted for each experiment. It must be noted that the initialization time of the VM cluster is not considered in computing the makespan, as it is uniform in PMR and Hadoop owing to similar cluster configurations.
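As a rough illustration of how CAP3 can be adopted directly in the Map phase, the following Python sketch wraps a cap3 binary as a streaming-style Map task. The binary path, chunk naming, and emitted key/value format are assumptions for illustration, not the exact implementation used in the experiments.

    #!/usr/bin/env python3
    # Streaming-style Map task that runs CAP3 on one FASTA chunk.
    # Assumptions (illustrative only): a cap3 executable is on PATH,
    # the chunk file path arrives as argv[1], and the Reduce phase
    # merely aggregates the per-chunk contig files emitted here.
    import subprocess
    import sys

    def run_cap3(fasta_chunk: str) -> str:
        # CAP3 writes its assembly next to the input, e.g. <chunk>.cap.contigs
        subprocess.run(["cap3", fasta_chunk], check=True,
                       stdout=subprocess.DEVNULL)
        return fasta_chunk + ".cap.contigs"

    if __name__ == "__main__":
        chunk = sys.argv[1]
        contigs = run_cap3(chunk)
        # Emit a key/value pair for the aggregation (Reduce) phase.
        print(f"{chunk}\t{contigs}")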

Task execution times of Map and Reduce worker nodes observed for the CAP3 experiments conducted on the Hadoop and PMR frameworks are shown in Figure 8. Figure 8(a) presents the results obtained for Hadoop, and Figure 8(b) presents the results obtained on PMR. Execution times of Map worker nodes are far greater than execution times of Reduce worker nodes, as CAP3 algorithm execution is carried out in the Map phase and result accumulation is considered in the Reduce phase. The parallel execution strategy (utilizing the 4 computing cores available with each Map worker) of the CAP3 algorithm on PMR enables lower execution times when compared to Hadoop Map worker nodes. The average reduction of execution time in Map workers of PMR is 19.33%, 20.07%, 15.09%, and 18.21% in the CAP3 experiments conducted when compared to Hadoop Map worker average execution times. In PMR, the Reduce workers are initiated as soon as two or more Map worker nodes have completed their tasks, which is visible from Figure 8(b). The sequential processing strategy (i.e., Map workers execute first and then Reduce workers) of worker nodes in the Hadoop framework is evident from Figure 8(a). Execution time of Reduce worker nodes in PMR is marginally higher, by about 15.42%, than those of Hadoop. Waiting for all Map worker nodes to complete their tasks is a primary reason for the marginal increase in Reduce worker execution times in PMR.
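The early initiation of Reduce work described above can be pictured with a small scheduling sketch: Reduce is launched as soon as two Map futures complete, instead of waiting for the full Map phase. This is an illustrative Python analogue of the strategy, not the PMR scheduler itself.

    import time
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def map_task(i: int) -> str:
        time.sleep(0.2 * i)                 # stand-in for Map work
        return f"intermediate-{i}"

    def reduce_task(chunks: list) -> None:
        print("Reduce started early on:", chunks)

    with ThreadPoolExecutor(max_workers=5) as pool:
        map_futures = [pool.submit(map_task, i) for i in range(1, 5)]
        finished, reduce_started = [], False
        for fut in as_completed(map_futures):
            finished.append(fut.result())
            # PMR-style rule: begin reducing once two or more Map
            # workers have completed, overlapping the two phases.
            if len(finished) >= 2 and not reduce_started:
                pool.submit(reduce_task, list(finished))
                reduce_started = True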

The total makespan observed in the CAP3 experiments executed on the Hadoop and PMR frameworks is presented in Figure 9. Superior performance in terms of Reduce makespan times of PMR is evident when compared to Hadoop.

Table 2: Information of the genome sequences used in CAP3 experiments.

Experiment Number | Dataset | GenBank accession number | Number of reads | Average length of reads (bp) | Length of provided sequences (bp)
1 | 203 | AC004669 | 1812 | 598 | 89779
2 | 216 | AC004638 | 2353 | 614 | 124645
3 | 322F16 | AF111103 | 4297 | 1011 | 159179
4 | 526N18 | AF123462 | 3221 | 965 | 180182

Figure 6: Correlation between theoretical and practical makespan times (s) for BLAST sequence alignment execution on the PMR framework, for Experiments 1-6 (series: PMR, PMR-Theory).

Figure 7: Steps for CAP3 sequencing assembly: removal of poor regions of each read; computation of overlaps among reads; removal of wrongly identified overlaps; contig construction; generation of consensus sequences and construction of multiple sequence alignments.

Though a marginal increase in Reduce worker execution time is reported, the overall execution time, i.e., total makespan, of PMR is less when compared to Hadoop. A reduction of 18.97%, 20%, 15.03%, and 18.01% is reported for the four experiments executed on the PMR framework when compared to similar experiments executed on the Hadoop framework. The average reduction of the total makespan across all experiments is 18%, proving superior performance of PMR when compared to the Hadoop framework. The makespan time for experiment 2 is greater than for the other experiments, as the number of differences considered in CAP3 is 17, larger than the values considered in the other experiments. A similar pattern of execution times is reported in [29], validating the CAP3 execution experiments presented here.

The theoretical makespan of PMR for all four CAP3 experiments is computed using (18). A comparison between theoretical and experimental makespan values is presented in Figure 10. Minor differences are reported between practical and theoretical makespan computations, proving correctness of the PMR makespan modeling presented.
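The agreement between theory and measurement can be quantified with an ordinary Pearson correlation over the paired makespans, as in this brief sketch; the values shown are placeholders, not the logged results.

    import numpy as np

    # Placeholder theoretical and measured total makespans (s) for the
    # four CAP3 experiments; the real values come from (18) and the logs.
    theoretical = np.array([1450.0, 5100.0, 4300.0, 2500.0])
    measured = np.array([1400.0, 5000.0, 4200.0, 2450.0])

    r = np.corrcoef(theoretical, measured)[0, 1]
    print(f"Pearson correlation: {r:.4f}")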

The results presented in this section prove that CAP3 sequence assembly execution on the PMR cloud framework developed exhibits superior performance when compared to similar CAP3 experiments executed on the existing Hadoop cloud framework.

4.3. DeepBind Analysis to Identify Binding Sites. In recent times, deep learning techniques have been extensively used for various applications. Deep learning techniques are adopted predominantly when large amounts of data are to be processed or analyzed. To meet the large computing needs of deep learning techniques, GPUs are used. Motivated by this, the authors consider the very recent state-of-the-art "DeepBind" biomedical application for execution on a cloud platform. To the best of our knowledge, no attempt to consider cloud platforms for DeepBind execution has been reported.

Biomedical operations such as alternative splicing, transcription, and gene regulation are dependent on DNA- and RNA-binding proteins. DNA- and RNA-binding proteins, described using sequence specificities, are critical in identifying diseases and deriving models of regulatory processes that occur in biological systems. Position weight matrices are used in characterizing the specificities of a protein. Binding sites on genomic sequences are identified by scanning position weight matrices over the considered genomic sequences. DeepBind is used to predict sequence specificities and adopts deep convolutional neural networks to achieve accurate prediction. Comprehensive details and the sequence specificity prediction accuracy of the DeepBind application are available in [27].
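Position-weight-matrix scanning, which DeepBind's convolutional filters generalize, amounts to sliding a scoring window along the sequence. The toy Python sketch below uses an invented matrix and sequence purely for illustration.

    import numpy as np

    # Toy position weight matrix (rows: A, C, G, T; columns: motif
    # positions 1-4), invented here to favour the motif GATA.
    PWM = np.array([
        [0.1, 0.7, 0.1, 0.7],   # A
        [0.1, 0.1, 0.1, 0.1],   # C
        [0.7, 0.1, 0.1, 0.1],   # G
        [0.1, 0.1, 0.7, 0.1],   # T
    ])
    BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

    def scan(sequence: str, pwm: np.ndarray) -> list:
        """Score every window of the sequence against the PWM."""
        width = pwm.shape[1]
        return [
            sum(pwm[BASE[b], j] for j, b in enumerate(sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)
        ]

    scores = scan("AGATAGCT", PWM)
    print("Best candidate site at offset", int(np.argmax(scores)))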

Figure 8: CAP3 sequence assembly execution makespan of the Map and Reduce worker nodes. (a) Worker node execution times on the Hadoop framework (cluster of 4 nodes). (b) Worker node execution times on the PMR framework (cluster of 4 nodes). Each panel plots per-task execution time (Makespan Time, s) for Map workers MW1-MW4 and Reduce workers RW1-RW4 in Experiments 1-4.


DeepBind is developed using a two-phase approach: a training phase and a testing phase. Training phase execution is carried out using Map workers in the Hadoop and PMR frameworks. The trained weights are stored in the cloud memory for further processing. The testing phase of DeepBind is carried out at the Reduce stage in the Hadoop and PMR frameworks. The execution strategy of the DeepBind algorithm on the PMR framework is shown in Figure 11. The DeepBind application is developed using the code provided in [27]. For performance evaluation on Hadoop and PMR, only the testing phase is discussed (i.e., Reduce-only mode). A custom cloud cluster of one master node and six worker nodes is deployed for DeepBind performance evaluation; a similar cloud cluster is considered for the Hadoop framework.
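The two-phase placement, training in the Map phase with weights passed through shared storage and scoring in the Reduce phase, can be summarized by the following Python skeleton; the storage path, toy model, and scoring rule are assumptions for illustration only.

    import json
    import pathlib

    # Stand-in for the shared cloud storage used to pass weights
    # from the Map (training) phase to the Reduce (testing) phase.
    STORE = pathlib.Path("/tmp/pmr-weights")   # assumed path, illustrative
    STORE.mkdir(exist_ok=True)

    def map_train(worker_id: int, shard: list) -> None:
        """Map phase: train on one data shard and persist the weights."""
        weights = {"mean": sum(shard) / len(shard)}      # toy "model"
        (STORE / f"worker{worker_id}.json").write_text(json.dumps(weights))

    def reduce_score(query: list) -> float:
        """Reduce phase: load all stored weights and score the query."""
        stored = [json.loads(p.read_text()) for p in STORE.glob("worker*.json")]
        mean = sum(w["mean"] for w in stored) / len(stored)
        return sum(q * mean for q in query)              # toy binding score

    map_train(1, [0.2, 0.4, 0.6])
    map_train(2, [0.1, 0.3, 0.5])
    print("Binding score:", reduce_score([1.0, 0.5]))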


Table 3: Information of the disease-causing genomic variants used in the experiment.

Case study | Genome variant | Experiment details
1 | SP1 | A disrupted SP1 binding site in the LDL-R promoter that leads to familial hypercholesterolemia
2 | TCF7L2 | A cancer risk variant in a MYC enhancer weakens a TCF7L2 binding site
3 | GATA1 | A gained GATA1 binding site that disrupts the original globin cluster promoters
4 | GATA4 | A lost GATA4 binding site in the BCL-2 promoter potentially playing a role in ovarian granulosa cell tumors
5 | RFX3 | Loss of two potential RFX3 binding sites leads to abnormal cortical development
6 | GABPA | Gained GABP-α binding sites in the TERT promoter, which are linked to several types of aggressive cancer

Figure 9: CAP3 sequence assembly total makespan time (s) observed for Experiments 1-4 conducted on the PMR and Hadoop frameworks.

Figure 10: Correlation between theoretical and practical makespan times (s) for CAP3 sequence assembly execution on the PMR framework, for Experiments 1-4 (series: PMR, PMR-Theory).

The experiment conducted to evaluate the performance of DeepBind on the Hadoop and PMR frameworks considers a set of six disease-causing genomic variants obtained from [27]. The disease-causing genomic variants to be analyzed using DeepBind are summarized in Table 3. DeepBind analysis is executed on the Hadoop and PMR frameworks deployed on a custom cloud cluster. The log data generated is stored and used in further analysis.

The results obtained to demonstrate the performance of the six worker cluster nodes of Hadoop and PMR during Map and Reduce phase execution are shown in Figure 12. Performance is presented in terms of task execution times observed per worker node. Considering Hadoop worker nodes, the execution times of each node during the Map and Reduce phases are shown in Figure 12(a). The execution time observed for each PMR worker node during the Map and Reduce phases is shown in Figure 12(b). In the Map phase execution, the genomic variants to be analyzed are obtained from the cloud storage and are accumulated based on their identities defined in [27]. The query sequences of disease-causing genomic variants to be analyzed are split for parallelization. In the Reduce phase, the split query sequences are analyzed, and the results obtained are accumulated and stored in the cloud storage.

Figure 11: DeepBind PMR framework. The PMR master node distributes training data to Map worker VMs, each running DeepBind on the PMR platform with local memory; the trained data (weights) are written to cloud storage. Reduce worker VMs perform Shuffle, Sort, and Reduce over the query sequence, compute the binding scores, and write the computed score to temporary cloud storage.

Figure 12: DeepBind execution makespan of the Map and Reduce worker nodes. (a) Worker node execution times on the Hadoop framework (cluster of 6 nodes). (b) Worker node execution times on the PMR framework (cluster of 6 nodes). Each panel plots per-task execution time (Makespan Time, s) for Map workers MW1-MW6 and Reduce workers RW1-RW6.

Map workers in PMR exhibit better performance, and an average execution time reduction of 48.57% is reported when compared to Hadoop Map worker nodes. The execution time of the six Reduce worker nodes in Hadoop and PMR is greater than that of the Map workers, as DeepBind analysis and identification of potential binding sites are carried out during this phase. The parallel execution strategy of Reduce worker nodes is clear from Figure 12(b). The Reduce phase in PMR commences after 5 seconds, once Map worker node 1 (MW1) and Map worker node 5 (MW5) have completed their tasks. In Hadoop, which adopts a sequential approach, the Reduce phase is initiated only after all worker nodes have completed their tasks. Parallel execution of DeepBind analysis utilizing all 4 computing cores available with Reduce worker nodes, together with the parallel initiation of the Reduce phase in PMR, enables an average Reduce execution time reduction of 22.22% when compared to Hadoop Reduce worker nodes. The total makespan observed for DeepBind experiment execution on the Hadoop and PMR cloud computing platforms is shown in Figure 13. A total makespan reduction of 34.62% is achieved using the PMR framework when compared to the Hadoop framework. Analysis results similar to [27] are reported for DeepBind analysis on the Hadoop and PMR frameworks. The theoretical makespan computed using (18) for PMR is compared with the practical value observed in the experiment, and the results obtained are shown in Figure 14. A minor variation between theoretical and practical values is observed. The variation observed is predominantly due to application-dependent multiple cloud memory access operations. Based on the results obtained for DeepBind analysis, it is evident that performance on the PMR framework is far superior to its execution on the existing Hadoop framework.
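The within-node use of all 4 computing cores during the Reduce phase can be sketched with standard multiprocessing: the split query sequences are fanned across the cores instead of being scored serially. The scoring function and sequences below are stand-ins for illustration, not DeepBind itself.

    import multiprocessing as mp

    def score_sequence(seq: str) -> float:
        # Stand-in for DeepBind scoring of one split query sequence.
        return sum(ord(c) for c in seq) / len(seq)

    if __name__ == "__main__":
        splits = ["GATTACA", "AGCTAGCT", "TTGACC", "CCGGAA"]
        # Fan the split query sequences across the Reduce worker's
        # 4 cores instead of scoring them serially.
        with mp.Pool(processes=4) as pool:
            scores = pool.map(score_sequence, splits)
        print(scores)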

On the basis of the biomedical applications considered for performance evaluation and the results obtained, it is evident that the proposed PMR framework exhibits superior performance when compared to its existing Hadoop counterpart.

Figure 13: DeepBind analysis total makespan time (s) observed for the experiment conducted on the PMR and Hadoop frameworks.

Figure 14: Correlation between theoretical and practical makespan times (s) for DeepBind analysis execution on the PMR framework.

In BLAST, both the Map and Reduce phases are utilized; in the CAP3 application, the Map phase plays the predominant role; and in the DeepBind application, analysis is carried out in the Reduce phase. The proposed PMR cloud computing framework is robust, capable of handling the diverse biomedical application scenarios presented, and supports deployment of PMR on public and custom cloud platforms. In addition, PMR exhibits low execution times and enables effective cloud resource utilization. Low execution times enable cost reduction, always a desired feature.

5. Conclusion and Future Work

The significance of cloud computing platforms is discussed. The working of the commonly adopted Hadoop MapReduce framework, along with its drawbacks, is presented. To lower execution times and enable effective utilization of cloud resources, this paper proposes the PMR cloud computing platform. A parallel execution strategy of the Map and Reduce phases is considered in the PMR framework. The Map and Reduce functions of PMR are designed to utilize the multicore environments available with worker nodes. The paper presents the proposed PMR framework architecture along with makespan modeling. Performance of the PMR cloud computing framework is compared with the Hadoop framework. For performance evaluation, computationally heavy biomedical applications like BLAST, CAP3, and DeepBind are considered. Average overall makespan time reductions of 38.92%, 18.00%, and 34.62% are achieved using the PMR framework when compared to the Hadoop framework for the BLAST, CAP3, and DeepBind applications. The experiments presented prove the robustness of the PMR platform, its capability to handle diverse applications, and its ease of deployment on public and private cloud platforms. The results presented through the experiments conducted prove the superior performance of PMR against the Hadoop framework. Good matching is reported between the theoretical makespan of PMR presented and the experimental values observed. In addition, adopting the PMR cloud computing framework also enables cost reduction and efficient utilization of cloud resources.

A performance study considering a cloud cluster with many nodes, additional applications, and security provisioning for the cloud computing framework is considered as the future work of this paper.

Data Availability

The data is available at the National Center for Biotechnology Information (2015). [Online]. Available: http://www.ncbi.nlm.nih.gov.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (no. NRF-2015R1D1A1A01061328).

References

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 137–150, 2004.

[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, Boston, MA, June 2010.

[3] G. Malewicz, M. H. Austern, A. J. C. Bik et al., "Pregel: a system for large-scale graph processing," in Proceedings of the International Conference on Management of Data (SIGMOD '10), pp. 135–146, June 2010.

[4] S. Ghemawat, H. Gobioff, and S. Leung, "The Google file system," in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, p. 29, Bolton Landing, NY, USA, October 2003.

[5] X. Shi, M. Chen, L. He et al., "Mammoth: gearing Hadoop towards memory-intensive MapReduce applications," IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 8, pp. 2300–2315, 2015.

[6] L. Person, "World Hadoop Market - Opportunities and Forecasts," Allied Market Research, p. 108, 2020.

[7] SNS Research, The Big Data Market 2014–2020: Opportunities, Challenges, Strategies, Industry Verticals and Forecasts, SNS Research, 2014.

[8] J. Zhu, J. Li, E. Hardesty, H. Jiang, and K.-C. Li, "GPU-in-Hadoop: enabling MapReduce across distributed heterogeneous platforms," in Proceedings of the 2014 13th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2014), pp. 321–326, China, June 2014.

[9] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proceedings of the Eighth USENIX Conference, pp. 29–42, 2008.

[10] L. D. Stein, "The case for cloud computing in genome informatics," Genome Biology, vol. 11, no. 5, article 207, 2010.

[11] D. P. Wall, P. Kudtarkar, V. A. Fusaro, R. Pivovarov, P. Patil, and P. J. Tonellato, "Cloud computing for comparative genomics," BMC Bioinformatics, vol. 11, no. 1, article 259, 2010.

[12] A. Rosenthal, P. Mork, M. H. Li, J. Stanford, D. Koester, and P. Reynolds, "Cloud computing: a new business paradigm for biomedical information sharing," Journal of Biomedical Informatics, vol. 43, no. 2, pp. 342–353, 2010.

[13] D. Dahiphale, R. Karve, A. V. Vasilakos et al., "An advanced MapReduce: cloud MapReduce, enhancements and applications," IEEE Transactions on Network and Service Management, vol. 11, no. 1, pp. 101–115, 2014.

[14] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good, "The cost of doing science on the cloud: the Montage example," in Proceedings of the ACM/IEEE Conference on Supercomputing, IEEE Press, November 2008.

[15] N. Chohan, C. Castillo, M. Spreitzer, M. Steinder, A. Tantawi, and C. Krintz, "See spot run: using spot instances for MapReduce workflows," in Proceedings of the 2010 USENIX Conference on Hot Topics in Cloud Computing, p. 7, 2010.

[16] R. S. Thakur, R. Bandopadhyay, B. Chaudhary, and S. Chatterjee, "Now and next-generation sequencing techniques: future of sequence analysis using cloud computing," Frontiers in Genetics, vol. 3, 2012.

[17] J. Chen, F. Qian, W. Yan, and B. Shen, "Translational biomedical informatics in the cloud: present and future," BioMed Research International, vol. 2013, Article ID 658925, 8 pages, 2013.

[18] T. Nguyen, W. Shi, and D. Ruden, "CloudAligner: a fast and full-featured MapReduce based tool for sequence mapping," BMC Research Notes, vol. 4, article 171, 2011.

[19] M. C. Schatz, "CloudBurst: highly sensitive read mapping with MapReduce," Bioinformatics, vol. 25, no. 11, pp. 1363–1369, 2009.

[20] B. Langmead, K. D. Hansen, and J. T. Leek, "Cloud-scale RNA-sequencing differential expression analysis with Myrna," Genome Biology, vol. 11, no. 8, p. R83, 2010.

[21] B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L. Salzberg, "Searching for SNPs with cloud computing," Genome Biology, vol. 10, no. 11, article R134, 2009.

[22] A. A. Al-Absi and D. Kang, "A novel parallel computation model with efficient local memory management for data-intensive applications," in Proceedings of the 2015 IEEE 8th International Conference on Cloud Computing (CLOUD), pp. 958–963, New York City, NY, USA, June 2015.

[23] A. Al-Absi and D.-K. Kang, "Long read alignment with parallel MapReduce cloud platform," BioMed Research International, vol. 2015, Article ID 807407, 13 pages, 2015.

[24] E. Yoon and A. Squicciarini, "Toward detecting compromised MapReduce workers through log analysis," in Proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 41–50, Chicago, IL, USA, 2014.

[25] D. Dang, Y. Liu, X. Zhang, and S. Huang, "A crowdsourcing worker quality evaluation algorithm on MapReduce for big data applications," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 7, pp. 1879–1888, 2016.

[26] S. D. Tetali, M. Lesani, R. Majumdar, and T. Millstein, "MrCrypt: static analysis for secure cloud computations," in Proceedings of the 2013 28th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2013), pp. 271–286, USA, October 2013.

[27] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, "Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning," Nature Biotechnology, vol. 33, no. 8, pp. 831–838, 2015.

[28] K. Mahadik, S. Chaterji, B. Zhou, M. Kulkarni, and S. Bagchi, "Orion: scaling genomic sequence matching with fine-grained parallelization," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2014), pp. 449–460, USA, November 2014.

[29] J. Ekanayake, T. Gunarathne, and J. Qiu, "Cloud technologies for bioinformatics applications," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 6, pp. 998–1011, 2011.

[30] X. Huang and A. Madan, "CAP3: a DNA sequence assembly program," Genome Research, vol. 9, no. 9, pp. 868–877, 1999.

[31] Y. Wu, W. Guo, J. Ren, X. Zhao, and W. Zheng, "NO2: speeding up parallel processing of massive compute-intensive tasks," IEEE Transactions on Computers, vol. 63, no. 10, pp. 2487–2499, 2014.

[32] Z. Zhang, L. Cherkasova, and B. T. Loo, "Optimizing cost and performance trade-offs for MapReduce job processing in the cloud," in Proceedings of the IEEE/IFIP Network Operations and Management Symposium: Management in a Software Defined World (NOMS 2014), Poland, May 2014.

[33] K. Chen, J. Powers, S. Guo, and F. Tian, "CRESP: towards optimal resource provisioning for MapReduce computing in public clouds," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 6, pp. 1403–1412, 2014.

[34] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2009.

[35] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.

[36] National Center for Biotechnology Information, 2015, http://www.ncbi.nlm.nih.gov.


Page 9: Parallel MapReduce: Maximizing Cloud Resource Utilization ...downloads.hindawi.com/journals/bmri/2018/7501042.pdf · MapReduce platform. Experimental outcomes show that theirmodelachievesaspeedupof.xovermpiBLASTwith-outcompromisingonaccuracy

BioMed Research International 9

0 50 100 150 200

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - Hadoop (Experiment 3)

0 10 20 30 40 50 60 70R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - Hadoop (Experiment 1)

0 20 40 60 80 100 120

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - Hadoop (Experiment 2)

0 50 100 150 200 250

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - Hadoop (Experiment 4)

0 50 100 150 200 250 300

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - Hadoop (Experiment 5)

0 50 100 150 200 250 300 350 400

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution Considering Hadoop (Experiment 6)

(a) Worker node execution times onHadoop frame-work

0 20 40 60 80 100 120 140

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution -PMR (Experiment 3)

0 10 20 30 40 50R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution -PMR (Experiment 1)

0 10 20 30 40 50 60 70 80

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution -PMR (Experiment 2)

0 20 40 60 80 100 120 140 160

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - PMR (Experiment 4)

0 50 100 150 200

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution - PMR (Experiment 5)

0 50 100 150 200 250 300

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

BLAST Execution Considering PMR (Experiment 6)

(b) Worker node execution times on 119875119872119877 frame-work

Figure 4 BLAST sequence alignment execution makespan of the Map and Reduce worker nodes (a) On Hadoop cluster of 4 nodes (b) On119875119872119877 cluster of 4 nodes

10 BioMed Research International

1 2 3 4 5 6Experiment Number

HadoopPMR

050

100150200250300350400450

Exec

utio

n Ti

me

(s)

Figure 5 BLAST sequence alignment total makespan time observed for experiments conducted on 119875119872119877 and Hadoop frameworks

theoretical makespan model of 119875119872119877 presented are provedthrough correlation measures

42 CAP3 DNA sequence assembly tools are used in bioin-formatics for gene discovery and understanding genomesof existingnew organisms CAP3 is one such popular toolused to assemble DNA sequences DNA assembly is achievedby performing merging and aligning operations on smallersequence fragments to build complete genome sequencesCAP3 eliminates poor sections observed within DNA frag-ments computes overlaps amongst DNA fragments is capa-ble of identifying false overlaps eliminating false overlapsidentified accumulates fragments of multiple or one overlap-ping DNA segment to produce contigs and performs mul-tiple sequence alignments to produce consensus sequencesCAP3 reads multiple gene sequences from an input FASTAfile and generates output consensus sequences written tomultiple files and also to standard outputs

The CAP3 gene sequence assembly working principleconsists of the following key stages Firstly the poor regionsof 31015840 (three-prime) and 51015840 (five-prime) of each read areidentified and eliminated False overlaps are identified andeliminated Secondly to form contigs reads are combinedbased on overlapping scores in descending order Furtherto incorporate modifications to the contigs constructedforward-reverse constraints are adopted Lastly numerous se-quence alignments of reads are constructed per contig result-ing in consensus sequences characterized by a quality valuefor each base Quality values of consensus sequences are usedin construction of numerous sequence alignment operationsand also in computation of overlaps Operational steps ofCAP3 assembly model are shown in Figure 7 A detailed ex-planation of the CAP3 gene sequence assembly is providedin [30]

In the experiments conducted CAP3 gene sequenceassembly is directly adopted in the Map phase of 119875119872119877 andHadoop In the Reduce phase result aggregation is consid-ered Performance evaluation of CAP3 execution on 119875119872119877and Hadoop frameworks Homo sapiens chromosome 15 isconsidered as a reference Genome sequences of various sizesare considered as queries and submitted to Azure cloud plat-form in the experiments Query sequences for experiments

are considered in accordance to [30] CAP3 experimentsconducted with query genomic sequences (BAC datasets) aresummarized in Table 2 All four experiments are conductedusing CAP3 algorithm on the Hadoop and119875119872119877 frameworksObservations are retrieved through a set of log files generatedduring Map and Reduce phase execution on Hadoop and119875119872119877 Using the log files total makespan Map worker makes-pan and Reduce worker makespan of Hadoop and 119875119872119877 arenoted for each experiment Itmust be noted that the initializa-tion time of the VM cluster is not considered in the comput-ing makespan as it is uniform in 119875119872119877 and Hadoop owing tosimilar cluster configurations

Task execution times of Map and Reduce worker nodesobserved for CAP3 experiments conducted on Hadoop and119875119872119877 frameworks are shown in Figure 8 Figure 8(a) repre-sents results obtained for Hadoop and Figure 8(b) representsresults obtained on 119872119877 Execution times of Map workernodes are far greater than execution times of Reduce workernodes as CAP3 algorithm execution is carried out in the Mapphase and result accumulation is considered in the Reducephase Parallel execution strategy (utilizing 4 computingcores available with each Map worker) of CAP3 algorithmon 119875119872119877 enables lower execution times when compared toHadoop Map worker nodes Average reduction of executiontime in Map workers of 119875119872119877 reported is 1933 20071509 and 1821 in CAP3 experiments conducted whencompared toHadoopMapworker average execution times In119875119872119877 theReduceworkers are initiated as soon as two ormoreMapworker nodes have completed their tasks which is visiblefrom Figure 8(b) Sequential processing strategy (ie Mapworkers first and then Reduce workers execution) of workernodes in the Hadoop framework is evident from Figure 8(a)Execution time of Reduceworker nodes in119875119872119877 ismarginallyhigher by about 1542 than those of HadoopWaiting for allMapworker nodes to complete their tasks is a primary reasonfor the marginal increase in Reduce worker execution timesin119872119877

The total makespan observed in CAP3 experiments exe-cuted on the Hadoop and 119875119872119877 frameworks is presented inFigure 9 Superior performance in terms of Reduce mak-espan times of 119875119872119877 is evident when compared to HadoopThough amarginal increase in Reduce worker execution time

BioMed Research International 11

Table 2 Information of the Genome Sequences used in CAP3 experiments

Experiment Number Dataset GenBankaccession number

Number of reads(bp)

Average Length ofreads (bp)

Length ofprovided

sequences (bp)1 203 AC004669 1812 598 897792 216 AC004638 2353 614 1246453 322F16 AF111103 4297 1011 1591794 526N18 AF123462 3221 965 180182

1 2 3 4 5 6

Experiment NumberPMRPMR -Theory

0

50

100

150

200

250

300Ex

ecut

ion

Tim

e (s

)

Figure 6 Correlation between theoretical and practical makespan times for BLAST sequence alignment execution on PMR framework

Removal of poor regions of

each read

Computation of overlaps

among reads

Wrongly identified overlap removal

Contig construction

Generation of consensus sequences and construction of multiple sequence

alignments

Figure 7 Steps for CAP3 sequencing assembly

is reported overall execution time ie total makespan of119875119872119877 is less when compared to Hadoop A reductionof 1897 20 1503 and 1801 is reported for thefour experiments executed on the 119875119872119877 framework whencompared to similar experiments executed on the Hadoopframework Average reduction of the total makespan acrossall experiments is 18 proving superior performance of 119875119872119877when compared to the Hadoop framework Makespan timefor experiment 2 is greater than other experiments as thenumber of differences considered in CAP3 is 17 larger thanvalues considered in other experiments Similar nature of ex-ecution times is reported in [29] validating CAP3 executionexperiments presented here

Theoretical makespan of 119875119872119877 for all four CAP3 exper-iments is computed using (18) Comparison between the-oretical and experimental makespan values is presented inFigure 10 Minor differences are reported between practicaland theoretical makespan computations proving correctnessof 119875119872119877makespan modeling presented

The results presented in this section prove that CAP3sequence assembly execution on the 119875119872119877 cloud frameworkdeveloped exhibits superior performance when compared tosimilar CAP3 experiments executed on the existing Hadoopcloud framework

43 DeepBind Analysis to Identify Binding Sites In recenttimes deep learning techniques have been extensively usedfor various applications Deep learning techniques areadopted predominantly when large amounts of data are tobe processed or analyzed To meet large computing needs ofdeep learning techniques GPU are used Motivated by thisthe authors of the paper consider very recent state-of-the-art ldquoDeepBindrdquo biomedical application execution on a cloudplatform To the best of our knowledge no such attempt toconsider cloud platforms for DeepBind execution has beenreported

Alternative splicing transcription and gene regulationsbiomedical operations are dependent on DNA- and RNA-binding proteins DNA- andRNA-binding proteins describedusing sequence specificities are critical in identifying diseasesand deriving models of regulatory processes that occur inbiological systems Position weight matrices are used incharacterizing specificities of a protein Binding sites ongenomic sequences are identified by scanning positionweightmatrices over the considered genomic sequences DeepBindis used to predict sequence specificities DeepBind adoptsdeep convolutional neural networks to achieve accurateprediction Comprehensive details and sequence specificity

12 BioMed Research International

0 500 1000 1500

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

CAP Execution - Hadoop (Experiment 1)

0 1000 2000 3000 4000 5000 6000 7000

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

CAP Execution - Hadoop (Experiment 2)

0 1000 2000 3000 4000 5000 6000

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

CAP Execution - Hadoop (Experiment 3)

0 500 1000 1500 2000 2500 3000 3500

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

CAP Execution - Hadoop (Experiment 4)

(a) Worker node execution times on Hadoop framework

0 200 400 600 800 1000 1200 1400

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

CAP Execution - PMR (Experiment 1)

0 1000 2000 3000 4000 5000

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s) Ta

sk E

xecu

tion

CAP Execution - PMR (Experiment 2)

0 1000 2000 3000 4000 5000

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

CAP Execution - PMR (Experiment 3)

0 500 1000 1500 2000 2500

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

CAP Execution Considering PMR (Experiment 4)

(b) Worker node execution times on 119875119872119877 framework

Figure 8 CAP3 sequence assembly execution makespan of the Map and Reduce worker nodes (a) On Hadoop cluster of 4 nodes (b) On119875119872119877 cluster of 4 nodes

prediction accuracy of the DeepBind application are availablein [27]

DeepBind is developed using a two-phase approach atraining phase and testing phase Training phase executionis carried out using Map workers in the Hadoop and 119875119872119877frameworks The trained weights are stored in the cloudmemory for further processing The testing phase of Deep-Bind is carried out at the Reduce stage in the Hadoop and

119875119872119877 frameworks Execution strategy of DeepBind algorithmon the 119875119872119877 framework is shown in Figure 11 DeepBindapplication is developed using the code provided in [27] Forperformance evaluation on Hadoop and 119875119872119877 only testingphase is discussed (ie Reduce only mode) A custom cloudcluster of one master node and six worker nodes is deployedfor DeepBind performance evaluation A similar cloud clus-ter for the Hadoop framework is considered The experiment

BioMed Research International 13

Table 3 Information of the disease-causing genomic variants used in the experiment

Experiment 1Case study Genome variant Experiment details

1 SP1 A disrupted SP1 binding site in the LDL-R promoter that leads to familialhypercholesterolemia

2 TCF7L2 A cancer risk variant in a MYC enhancer weakens a TCF7L2 binding site3 GATA1 A gained GATA1 binding site that disrupts the original globin cluster promoters

4 GATA4 A lost GATA4 binding site in the BCL-2 promoter potentially playing a role inovarian granulosa cell tumors

5 RFX3 Loss of two potential RFX3 binding sites leads to abnormal cortical development

6 GABPA Gained GABP-120572 binding sites in the TERT promoter which are linked to severaltypes of aggressive cancer

1 2 3 4Experiment Number

HadoopPMR

010002000300040005000600070008000

Exec

utio

n Ti

me (

s)

Figure 9 CAP3 sequence assembly total makespan time observed for experiments conducted on 119875119872119877 and Hadoop frameworks

1 2 3 4Experiment Number

PMRPMR -Theory

0

1000

2000

3000

4000

5000

6000

7000

Exec

utio

n Ti

me (

s)

Figure 10 Correlation between theoretical and practical makespan times for CAP3 sequence assembly execution on PMR framework

conducted to evaluate the performance of DeepBind on theHadoop and 119875119872119877 frameworks considers a set of six disease-causing genomic variants obtained from [27] The disease-causing genomic variants to be analyzed using DeepBind aresummarized in Table 3 DeepBind analysis is executed onthe Hadoop and 119875119872119877 frameworks deployed on a customcloud cluster Log data generated is stored and used in furtheranalysis

The results obtained to demonstrate the performanceof six worker cluster nodes of Hadoop and 119875119872119877 duringMap and Reduce phase execution are shown in Figure 12Performance is presented in terms of task execution times

observed per worker node Considering Hadoop workernodes execution times of each node during the Map andReduce phase is shown in Figure 12(a) The executiontime observed for each 119875119872119877 worker node during the Mapand Reduce phases is shown in Figure 12(b) In the Mapphase execution the genomic variants to be analyzed areobtained from the cloud storage and are accumulated basedon their identities defined [27] The query sequences ofdisease-causing genomic variants to be analyzed are split forparallelization In theReduce phase the split query sequencesare analyzed and results obtained are accumulated andstored in the cloud storage Map workers in 119875119872119877 exhibit

14 BioMed Research International

Query sequence q Training sequence

Local Memory

PMR Platform

Local Memory

PMR Platform

Trained data (weights)

Cloud storage

Shuffle Sort Reduce

Virtual Machine ndash Reduce Worker 1

Shuffle Sort Reduce

Virtual Machine ndash Reduce Worker w

Virtual Machine ndash Map Worker wVirtual Machine ndash Map Worker 1

Computed Score

Temporary Cloud storage

Cloud storage

PMR Platform PMR Platform

DeepBind DeepBindPM

R-M

aste

r Nod

eTraining data1

Training datan

Compute Binding Score1Compute Binding Scoren

Figure 11 DeepBind PMR framework

0 5 10 15 20 25 30R W 1R W 2R W 3R W 4R W 5R W 6

M W 1M W 2M W 3M W 4M W 5M W 6

Makespan Time (s)

Task

Exe

cutio

n

DeepBind Execution - Hadoop (Experiment 1)

(a) Worker node execution times on Hadoop framework

0 5 10 15 20R W 1R W 2R W 3R W 4R W 5R W 6

M W 1M W 2M W 3M W 4M W 5M W 6

Makespan Time (s)

Task

Exe

cutio

n

DeepBind Execution - PMR (Experiment 1)

(b) Worker node execution times on 119875119872119877 framework

Figure 12 DeepBind execution makespan of the Map and Reduce worker nodes (a) On Hadoop cluster of 6 nodes (b) On 119875119872119877 cluster of6 nodes

better performance and an average execution time reductionof 4857 is reported when compared to Hadoop Mapworker nodes Execution time of the six Reduce workernodes in Hadoop and 119875119872119877 is greater than Map workersas DeepBind analysis and identification of potential bindingsites is carried out during this phase Parallel executionstrategy of Reduce worker nodes is clear from Figure 12(b)The Reduce phase in 119875119872119877 commences after 5 seconds onceMap worker node 1 (MW1) and Map worker node 5 (MP5)have completed their task InHadoop that adopts a sequentialapproach the Reduce phase is initiated after all worker nodeshave completed their tasks Parallel execution of DeepBindanalysis utilizing all 4 computing cores available with Reduceworker nodes and parallel initiation of the Reduce phasein 119875119872119877 enable average Reduce execution time of 2222when compared to Hadoop Reduce worker nodes The totalmakespan observed for DeepBind experiment execution on

the Hadoop and 119875119872119877 cloud computing platforms is shownin Figure 13 Total makespan reduction of 3462 is achievedusing the 119875119872119877 framework when compared to the Hadoopframework Analysis results similar to [27] are reported forDeepBind analysis on the Hadoop and 119875119872119877 frameworksThe theoretical makespan computed using (18) for 119875119872119877 iscomparedwith the practical value observed in the experimentand the results obtained are shown in Figure 14 Aminor vari-ation between theoretical and practical values is observedThe variation observed is predominantly due to applicationdependent multiple cloud memory access operations Basedon results obtained for DeepBind analysis it is evident thatperformance on the 119875119872119877 framework is far superior to itsexecution on the existing Hadoop framework

On the basis of biomedical applications considered forperformance evaluation and results obtained it is evident thatthe proposed119875119872119877 framework exhibits superior performance

BioMed Research International 15

1Experiment Number

HadoopPMR

0

5

10

15

20

25

30

Exec

utio

n Ti

me (

s)

Figure 13 DeepBind analysis total makespan time observed for experiments conducted on 119875119872119877 and Hadoop frameworks

1Experiment Number

PMRPMR -Theory

0

5

10

15

20

25

30

Exec

utio

n Ti

me (

s)

Figure 14 Correlation between theoretical and practical makespan times for DeepBind analysis execution on PMR framework

when compared to its existing Hadoop counterpart InBLAST the Map and Reduce phases are utilized In CAP3application the Map phase plays a predominant role InDeepBind application analysis is carried out in the Reducephase The proposed 119875119872119877 cloud computing framework isrobust and is capable of the dynamic biomedical applicationscenarios presented deployment of 119875119872119877 on public andcustom cloud platforms In addition 119875119872119877 exhibits low exe-cution times and enables effective cloud resource utilizationLow execution times enable cost reduction always a desiredfeature

5 Conclusion and Future Work

The significance of cloud computing platforms is discussedThe commonly adopted Hadoop MapReduce frameworkworking with its drawbacks is presented To lower executiontimes and enable effective utilization of cloud resources thispaper proposes a 119875119872119877 cloud computing platform A parallelexecution strategy of the Map and Reduce phases is consid-ered in the 119875119872119877 framework TheMap and Reduce functionsof 119875119872119877 are designed to utilize multicore environments

available with worker nodesThe paper presents the proposed119875119872119877 framework architecture along with makespan model-ing Performance of the 119875119872119877 cloud computing frameworkis compared with the Hadoop framework For performanceevaluation computationally heavy biomedical applicationslike BLAST CAP3 and DeepBind are considered Averageoverall makespan times reduction of 3892 1800 and3462 is achieved using the 119875119872119877 framework when com-pared to the Hadoop framework for BLAST CAP3 andDeepBind applications The experiments presented prove therobustness of the 119875119872119877 platform its capability to handlediverse applications and ease of deployment on public andprivate cloud platformsThe results presented through the ex-periments conducted prove the superior performance of119875119872119877 against the Hadoop framework Good matching is re-ported between the theoretical makespan of the 119875119872119877 pre-sented and experimental values observed In addition adopt-ing the 119875119872119877 cloud computing framework also enables costreduction and efficient utilization of cloud resources

Performance study considering cloud cluster with manynodes additional applications and security provisioning to

16 BioMed Research International

cloud computing framework is considered as the future workof this paper

Data Availability

Thedata is available at the National Center for BiotechnologyInformation (2015) [Online] Available httpwwwncbinlmnihgov

Conflicts of Interest

The authors declare that there are no conflicts of interest re-garding the publication of this paper

Acknowledgments

This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2015R1D1A1A01061328)

References

[1] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 137ndash150 2004

[2] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark Cluster Computing with Working Setsrdquo inProceedings of the in Proceedings of the 2nd USENIX Conferenceon Hot topics in Cloud Computing Boston MA June 2010

[3] G Malewicz M H Austern A J C Bik et al ldquoPregel a sys-tem for large-scale graph processingrdquo inProceedings of the Inter-national Conference on Management of Data (SIGMOD rsquo10) pp135ndash146 June 2010

[4] S Ghemawat H Gobioff and S Leung ldquoThe Google file sys-temrdquo in Proceedings of the the nineteenth ACM symposium p29 Bolton Landing NY USA October 2003

[5] X Shi M Chen L He et al ldquoMammoth Gearing HadoopTowards Memory-Intensive MapReduce Applicationsrdquo IEEETransactions on Parallel and Distributed Systems vol 26 no 8pp 2300ndash2315 2015

[6] L Person ldquoWorld Hadoop Market - Opportunities and Fore-castsrdquo Allied Market Research p 108 2020

[7] SNS Research The Big Data Market 2014ndash2020 OpportunitiesChallenges Strategies Industry Verticals and Forecasts SNSResearch 2014

[8] J Zhu J Li E Hardesty H Jiang and K-C Li ldquoGPU-in-Ha-doop Enabling MapReduce across distributed heterogeneousplatformsrdquo in Proceedings of the 2014 13th IEEEACIS Interna-tional Conference on Computer and Information Science ICIS2014 - Proceedings pp 321ndash326 China June 2014

[9] M Zaharia A Konwinski A D Joseph R H Katz I Stoicaand ldquo ldquoImproving Mapreduce Performance in Heteroge-neous Environments rdquordquo in Proceedings of the Proc EighthUSENIX Conf pp 29ndash42 2008

[10] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010

[11] D P Wall P Kudtarkar V A Fusaro R Pivovarov P Patil andP J Tonellato ldquoCloud computing for comparative genomicsrdquoBMC Bioinformatics vol 11 no 1 article 259 2010

[12] A Rosenthal P Mork M H Li J Stanford D Koester andP Reynolds ldquoCloud computing a new business paradigm forbiomedical information sharingrdquo Journal of Biomedical Infor-matics vol 43 no 2 pp 342ndash353 2010

[13] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014

[14] E Deelman G SinghM Livny B Berriman and J Good ldquoThecost of doing science on the cloud the montage examplerdquo inProceedings of the ACMIEEE Conference on SupercomputingIEEE Press November 2008


Figure 5: BLAST sequence alignment total makespan time observed for experiments conducted on PMR and Hadoop frameworks.

The correctness of the theoretical makespan model of PMR presented is thus proved through correlation measures.

4.2. CAP3. DNA sequence assembly tools are used in bioinformatics for gene discovery and for understanding the genomes of existing and new organisms. CAP3 is one such popular tool used to assemble DNA sequences. DNA assembly is achieved by performing merging and aligning operations on smaller sequence fragments to build complete genome sequences. CAP3 eliminates poor sections observed within DNA fragments, computes overlaps among DNA fragments, identifies and eliminates false overlaps, accumulates fragments of one or more overlapping DNA segments to produce contigs, and performs multiple sequence alignments to produce consensus sequences. CAP3 reads multiple gene sequences from an input FASTA file and generates output consensus sequences written to multiple files and also to standard output.

The CAP3 gene sequence assembly working principle consists of the following key stages. Firstly, the poor regions of the 3′ (three-prime) and 5′ (five-prime) ends of each read are identified and eliminated, and false overlaps are identified and eliminated. Secondly, to form contigs, reads are combined based on overlapping scores in descending order; further, forward-reverse constraints are adopted to incorporate modifications to the contigs constructed. Lastly, multiple sequence alignments of reads are constructed per contig, resulting in consensus sequences characterized by a quality value for each base. Quality values of consensus sequences are used in the construction of multiple sequence alignment operations and also in the computation of overlaps. Operational steps of the CAP3 assembly model are shown in Figure 7. A detailed explanation of the CAP3 gene sequence assembly is provided in [30].

In the experiments conducted, CAP3 gene sequence assembly is directly adopted in the Map phase of PMR and Hadoop; in the Reduce phase, result aggregation is considered. For performance evaluation of CAP3 execution on the PMR and Hadoop frameworks, Homo sapiens chromosome 15 is considered as a reference. Genome sequences of various sizes are considered as queries and submitted to the Azure cloud platform in the experiments. Query sequences for the experiments are considered in accordance with [30]. CAP3 experiments conducted with query genomic sequences (BAC datasets) are summarized in Table 2. All four experiments are conducted using the CAP3 algorithm on the Hadoop and PMR frameworks. Observations are retrieved through a set of log files generated during Map and Reduce phase execution on Hadoop and PMR. Using the log files, the total makespan, Map worker makespan, and Reduce worker makespan of Hadoop and PMR are noted for each experiment. It must be noted that the initialization time of the VM cluster is not considered in computing the makespan, as it is uniform in PMR and Hadoop owing to similar cluster configurations.
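To make the Map-phase adoption concrete, the following is a minimal sketch, assuming a Hadoop-streaming-style mapper in Python where each input record is the path to a FASTA chunk; it is illustrative only and not the authors' implementation. The cap3 binary is invoked as an external command, and the emitted contig-file path is a convention assumed here for the Reduce-phase aggregation step.

    import subprocess
    import sys

    def map_cap3(chunk_path):
        # Run the CAP3 assembler on one FASTA chunk; CAP3 writes its
        # contigs to "<input>.cap.contigs" alongside the input file.
        subprocess.run(["cap3", chunk_path], check=True,
                       stdout=subprocess.DEVNULL)
        return chunk_path + ".cap.contigs"

    if __name__ == "__main__":
        # Assumed streaming convention: one FASTA chunk path per stdin line.
        for line in sys.stdin:
            chunk = line.strip()
            if chunk:
                # Emit a key/value pair for the Reduce phase to aggregate.
                print(chunk, map_cap3(chunk), sep="\t")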

Task execution times of Map and Reduce worker nodes observed for CAP3 experiments conducted on the Hadoop and PMR frameworks are shown in Figure 8. Figure 8(a) represents results obtained for Hadoop, and Figure 8(b) represents results obtained on PMR. Execution times of Map worker nodes are far greater than execution times of Reduce worker nodes, as CAP3 algorithm execution is carried out in the Map phase and result accumulation is considered in the Reduce phase. The parallel execution strategy of the CAP3 algorithm on PMR (utilizing the 4 computing cores available with each Map worker) enables lower execution times when compared to Hadoop Map worker nodes. The average reduction of execution time in Map workers of PMR reported is 19.33%, 20.07%, 15.09%, and 18.21% in the CAP3 experiments conducted, when compared to Hadoop Map worker average execution times. In PMR, the Reduce workers are initiated as soon as two or more Map worker nodes have completed their tasks, which is visible from Figure 8(b). The sequential processing strategy (i.e., Map workers first and then Reduce workers execution) of worker nodes in the Hadoop framework is evident from Figure 8(a). Execution time of Reduce worker nodes in PMR is marginally higher, by about 15.42%, than those of Hadoop. Waiting for all Map worker nodes to complete their tasks is a primary reason for the marginal increase in Reduce worker execution times in PMR.
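The early initiation of Reduce workers described above can be illustrated with a small scheduling sketch. This models only the policy (reducing begins once at least two Map tasks have finished, overlapping Reduce with the remaining Map work); it is not the PMR source code, and the task functions are hypothetical placeholders.

    import queue
    import threading
    from concurrent.futures import ThreadPoolExecutor, as_completed

    MIN_COMPLETED_MAPS = 2  # PMR-style policy: reduce after >= 2 maps finish

    def run_job(map_tasks, reduce_fn):
        # Requires at least one map task; reduce_fn merges two map outputs.
        outputs = queue.Queue()
        result = {}
        total = len(map_tasks)

        def reducer():
            # Consume map outputs as they arrive instead of waiting for all
            # Map workers to finish (the sequential Hadoop-style approach).
            acc = outputs.get()
            for _ in range(total - 1):
                acc = reduce_fn(acc, outputs.get())
            result["value"] = acc

        reducer_thread = None
        with ThreadPoolExecutor(max_workers=4) as pool:
            futures = [pool.submit(task) for task in map_tasks]
            done = 0
            for fut in as_completed(futures):
                outputs.put(fut.result())
                done += 1
                if reducer_thread is None and done >= MIN_COMPLETED_MAPS:
                    reducer_thread = threading.Thread(target=reducer)
                    reducer_thread.start()
        if reducer_thread is None:  # fewer map tasks than the threshold
            reducer_thread = threading.Thread(target=reducer)
            reducer_thread.start()
        reducer_thread.join()
        return result["value"]

    # Example with toy word-count map outputs:
    # merge = lambda a, b: {k: a.get(k, 0) + b.get(k, 0) for k in {**a, **b}}
    # print(run_job([lambda: {"a": 1}, lambda: {"a": 2, "b": 1}], merge))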

The total makespan observed in CAP3 experiments executed on the Hadoop and PMR frameworks is presented in Figure 9. Superior performance in terms of Reduce makespan times of PMR is evident when compared to Hadoop. Though a marginal increase in Reduce worker execution time


Table 2: Information of the genome sequences used in the CAP3 experiments.

Experiment Number | Dataset | GenBank accession number | Number of reads | Average length of reads (bp) | Length of provided sequences (bp)
1 | 203    | AC004669 | 1812 | 598  | 89779
2 | 216    | AC004638 | 2353 | 614  | 124645
3 | 322F16 | AF111103 | 4297 | 1011 | 159179
4 | 526N18 | AF123462 | 3221 | 965  | 180182

Figure 6: Correlation between theoretical and practical makespan times for BLAST sequence alignment execution on the PMR framework.

Figure 7: Steps for CAP3 sequencing assembly: removal of poor regions of each read; computation of overlaps among reads; removal of wrongly identified overlaps; contig construction; and generation of consensus sequences and construction of multiple sequence alignments.

is reported, the overall execution time, i.e., the total makespan, of PMR is less when compared to Hadoop. A reduction of 18.97%, 20%, 15.03%, and 18.01% is reported for the four experiments executed on the PMR framework when compared to similar experiments executed on the Hadoop framework. The average reduction of the total makespan across all experiments is 18%, proving superior performance of PMR when compared to the Hadoop framework. The makespan time for experiment 2 is greater than for the other experiments, as the number of differences considered in CAP3 is 17, larger than the values considered in the other experiments. A similar nature of execution times is reported in [29], validating the CAP3 execution experiments presented here.

The theoretical makespan of PMR for all four CAP3 experiments is computed using (18). A comparison between theoretical and experimental makespan values is presented in Figure 10. Minor differences are reported between practical and theoretical makespan computations, proving the correctness of the PMR makespan modeling presented.
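As a simple way to quantify this agreement, the Pearson correlation between theoretical and measured makespans can be computed; the sketch below uses obviously hypothetical placeholder values rather than the paper's measurements.

    import math

    def pearson(xs, ys):
        # Pearson correlation coefficient of two equal-length series.
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    # Hypothetical makespans in seconds for four experiments (placeholders).
    theoretical = [1400.0, 5100.0, 4800.0, 2500.0]
    measured = [1450.0, 5200.0, 4900.0, 2600.0]
    print(round(pearson(theoretical, measured), 4))

A correlation close to 1 indicates that the makespan model tracks the measured behavior well, which is the criterion used throughout this section.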

The results presented in this section prove that CAP3 sequence assembly execution on the developed PMR cloud framework exhibits superior performance when compared to similar CAP3 experiments executed on the existing Hadoop cloud framework.

4.3. DeepBind Analysis to Identify Binding Sites. In recent times, deep learning techniques have been extensively used for various applications. Deep learning techniques are adopted predominantly when large amounts of data are to be processed or analyzed. To meet the large computing needs of deep learning techniques, GPUs are used. Motivated by this, the authors of this paper consider execution of the very recent state-of-the-art "DeepBind" biomedical application on a cloud platform. To the best of our knowledge, no such attempt to consider cloud platforms for DeepBind execution has been reported.

Biomedical operations such as alternative splicing, transcription, and gene regulation are dependent on DNA- and RNA-binding proteins. DNA- and RNA-binding proteins, described using sequence specificities, are critical in identifying diseases and deriving models of the regulatory processes that occur in biological systems. Position weight matrices are used in characterizing the specificities of a protein. Binding sites on genomic sequences are identified by scanning position weight matrices over the considered genomic sequences. DeepBind is used to predict sequence specificities and adopts deep convolutional neural networks to achieve accurate prediction.

Figure 8: CAP3 sequence assembly execution makespan of the Map and Reduce worker nodes: (a) on a Hadoop cluster of 4 nodes; (b) on a PMR cluster of 4 nodes. Each panel plots per-worker task execution times (Map workers MW1–MW4, Reduce workers RW1–RW4) for Experiments 1–4.

Comprehensive details and the sequence specificity prediction accuracy of the DeepBind application are available in [27]. DeepBind is developed using a two-phase approach: a training phase and a testing phase. Training phase execution is carried out using Map workers in the Hadoop and PMR frameworks. The trained weights are stored in the cloud memory for further processing. The testing phase of DeepBind is carried out at the Reduce stage in the Hadoop and PMR frameworks. The execution strategy of the DeepBind algorithm on the PMR framework is shown in Figure 11. The DeepBind application is developed using the code provided in [27]. For performance evaluation on Hadoop and PMR, only the testing phase is discussed (i.e., Reduce-only mode). A custom cloud cluster of one master node and six worker nodes is deployed for DeepBind performance evaluation. A similar cloud cluster for the Hadoop framework is considered.
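The testing-phase computation can be pictured as scanning a learned motif detector across a one-hot-encoded sequence and keeping the best activation, loosely following the convolutional scoring idea of [27]; the single filter and its weights below are hypothetical placeholders, not trained DeepBind parameters.

    BASES = "ACGT"

    def one_hot(seq):
        # Encode a DNA string as rows of 4 indicator values (A, C, G, T).
        return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]

    def binding_score(seq, filt):
        # Slide a length-m filter (m rows of 4 weights) over the sequence;
        # the score is the maximum rectified activation over all positions.
        x, m = one_hot(seq), len(filt)
        best = 0.0
        for i in range(len(x) - m + 1):
            act = sum(x[i + j][k] * filt[j][k]
                      for j in range(m) for k in range(4))
            best = max(best, act)  # ReLU + max-pool, since best starts at 0
        return best

    # Hypothetical 3-mer detector favoring "GAT" (placeholder weights).
    gat = [[-1, -1, 2, -1], [2, -1, -1, -1], [-1, -1, -1, 2]]
    print(binding_score("TTGATAAGATC", gat))  # peaks at the GAT occurrences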


Table 3: Information of the disease-causing genomic variants used in the experiment (Experiment 1).

Case study | Genome variant | Experiment details
1 | SP1    | A disrupted SP1 binding site in the LDL-R promoter that leads to familial hypercholesterolemia
2 | TCF7L2 | A cancer risk variant in a MYC enhancer weakens a TCF7L2 binding site
3 | GATA1  | A gained GATA1 binding site that disrupts the original globin cluster promoters
4 | GATA4  | A lost GATA4 binding site in the BCL-2 promoter, potentially playing a role in ovarian granulosa cell tumors
5 | RFX3   | Loss of two potential RFX3 binding sites leads to abnormal cortical development
6 | GABPA  | Gained GABP-α binding sites in the TERT promoter, which are linked to several types of aggressive cancer

Figure 9: CAP3 sequence assembly total makespan time observed for experiments conducted on PMR and Hadoop frameworks.

Figure 10: Correlation between theoretical and practical makespan times for CAP3 sequence assembly execution on the PMR framework.

The experiment conducted to evaluate the performance of DeepBind on the Hadoop and PMR frameworks considers a set of six disease-causing genomic variants obtained from [27]. The disease-causing genomic variants to be analyzed using DeepBind are summarized in Table 3. DeepBind analysis is executed on the Hadoop and PMR frameworks deployed on a custom cloud cluster. The log data generated is stored and used in further analysis.

The results obtained to demonstrate the performance of the six worker cluster nodes of Hadoop and PMR during Map and Reduce phase execution are shown in Figure 12. Performance is presented in terms of task execution times observed per worker node. Considering Hadoop worker nodes, the execution time of each node during the Map and Reduce phases is shown in Figure 12(a). The execution time observed for each PMR worker node during the Map and Reduce phases is shown in Figure 12(b). In the Map phase execution, the genomic variants to be analyzed are obtained from the cloud storage and accumulated based on their defined identities [27]. The query sequences of the disease-causing genomic variants to be analyzed are split for parallelization. In the Reduce phase, the split query sequences are analyzed, and the results obtained are accumulated and stored in the cloud storage.
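The split step can be sketched as chunking each query into overlapping windows that Reduce workers score independently; the window and overlap sizes here are hypothetical choices, not values reported in the paper.

    def split_query(seq, window=100, overlap=20):
        # Overlapping windows ensure candidate binding sites spanning a
        # window boundary are still seen whole by some worker.
        if window <= overlap:
            raise ValueError("window must exceed overlap")
        step = window - overlap
        return [seq[i:i + window]
                for i in range(0, max(len(seq) - overlap, 1), step)]

    # Each chunk can then be scored independently (e.g., with the
    # binding_score sketch above) and the per-chunk scores accumulated
    # in the Reduce phase, for instance by taking their maximum.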


Figure 11: DeepBind PMR framework. The PMR master node distributes training data to Map workers (virtual machines with local memory running the PMR platform), which execute DeepBind training and store the trained weights in cloud storage; Reduce workers then shuffle, sort, and compute binding scores for the query sequence q, with computed scores passed through temporary cloud storage into cloud storage.

Figure 12: DeepBind execution makespan of the Map and Reduce worker nodes: (a) on a Hadoop cluster of 6 nodes; (b) on a PMR cluster of 6 nodes. Panels plot per-worker task execution times (MW1–MW6, RW1–RW6) for Experiment 1.

Map workers in PMR exhibit better performance, and an average execution time reduction of 48.57% is reported when compared to Hadoop Map worker nodes. The execution time of the six Reduce worker nodes in Hadoop and PMR is greater than that of the Map workers, as DeepBind analysis and identification of potential binding sites are carried out during this phase. The parallel execution strategy of Reduce worker nodes is clear from Figure 12(b). The Reduce phase in PMR commences after 5 seconds, once Map worker node 1 (MW1) and Map worker node 5 (MW5) have completed their tasks. In Hadoop, which adopts a sequential approach, the Reduce phase is initiated only after all worker nodes have completed their tasks. Parallel execution of DeepBind analysis utilizing all 4 computing cores available with Reduce worker nodes and parallel initiation of the Reduce phase in PMR enable an average Reduce execution time reduction of 22.22% when compared to Hadoop Reduce worker nodes. The total makespan observed for DeepBind experiment execution on the Hadoop and PMR cloud computing platforms is shown in Figure 13. A total makespan reduction of 34.62% is achieved using the PMR framework when compared to the Hadoop framework. Analysis results similar to [27] are reported for DeepBind analysis on the Hadoop and PMR frameworks. The theoretical makespan computed using (18) for PMR is compared with the practical value observed in the experiment, and the results obtained are shown in Figure 14. A minor variation between theoretical and practical values is observed; the variation is predominantly due to application-dependent multiple cloud memory access operations. Based on the results obtained for DeepBind analysis, it is evident that performance on the PMR framework is far superior to its execution on the existing Hadoop framework.

On the basis of the biomedical applications considered for performance evaluation and the results obtained, it is evident that the proposed PMR framework exhibits superior performance


Figure 13: DeepBind analysis total makespan time observed for experiments conducted on PMR and Hadoop frameworks.

Figure 14: Correlation between theoretical and practical makespan times for DeepBind analysis execution on the PMR framework.

when compared to its existing Hadoop counterpart. In BLAST, both the Map and Reduce phases are utilized. In the CAP3 application, the Map phase plays a predominant role. In the DeepBind application, analysis is carried out in the Reduce phase. The proposed PMR cloud computing framework is robust, is capable of handling the dynamic biomedical application scenarios presented, and supports deployment on public and custom cloud platforms. In addition, PMR exhibits low execution times and enables effective cloud resource utilization. Low execution times enable cost reduction, which is always a desired feature.

5. Conclusion and Future Work

The significance of cloud computing platforms is discussed. The working of the commonly adopted Hadoop MapReduce framework, with its drawbacks, is presented. To lower execution times and enable effective utilization of cloud resources, this paper proposes the PMR cloud computing platform. A parallel execution strategy of the Map and Reduce phases is considered in the PMR framework. The Map and Reduce functions of PMR are designed to utilize the multicore environments available with worker nodes. The paper presents the proposed PMR framework architecture along with makespan modeling. Performance of the PMR cloud computing framework is compared with the Hadoop framework. For performance evaluation, computationally heavy biomedical applications like BLAST, CAP3, and DeepBind are considered. Average overall makespan time reductions of 38.92%, 18.00%, and 34.62% are achieved using the PMR framework when compared to the Hadoop framework for the BLAST, CAP3, and DeepBind applications, respectively. The experiments presented prove the robustness of the PMR platform, its capability to handle diverse applications, and its ease of deployment on public and private cloud platforms. The results presented through the experiments conducted prove the superior performance of PMR against the Hadoop framework. Good matching is reported between the theoretical makespan of PMR presented and the experimental values observed. In addition, adopting the PMR cloud computing framework also enables cost reduction and efficient utilization of cloud resources.

A performance study considering cloud clusters with many nodes, additional applications, and security provisioning for the cloud computing framework is considered as the future work of this paper.

Data Availability

The data is available at the National Center for Biotechnology Information (2015), http://www.ncbi.nlm.nih.gov.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (no. NRF-2015R1D1A1A01061328).

References

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 137–150, 2004.
[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, Boston, MA, USA, June 2010.
[3] G. Malewicz, M. H. Austern, A. J. C. Bik et al., "Pregel: a system for large-scale graph processing," in Proceedings of the International Conference on Management of Data (SIGMOD '10), pp. 135–146, June 2010.
[4] S. Ghemawat, H. Gobioff, and S. Leung, "The Google file system," in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, p. 29, Bolton Landing, NY, USA, October 2003.
[5] X. Shi, M. Chen, L. He et al., "Mammoth: gearing Hadoop towards memory-intensive MapReduce applications," IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 8, pp. 2300–2315, 2015.
[6] L. Person, "World Hadoop market: opportunities and forecasts," Allied Market Research, p. 108, 2020.
[7] SNS Research, The Big Data Market 2014–2020: Opportunities, Challenges, Strategies, Industry Verticals and Forecasts, SNS Research, 2014.
[8] J. Zhu, J. Li, E. Hardesty, H. Jiang, and K.-C. Li, "GPU-in-Hadoop: enabling MapReduce across distributed heterogeneous platforms," in Proceedings of the 13th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2014), pp. 321–326, China, June 2014.
[9] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proceedings of the Eighth USENIX Conference on Operating Systems Design and Implementation, pp. 29–42, 2008.
[10] L. D. Stein, "The case for cloud computing in genome informatics," Genome Biology, vol. 11, no. 5, article 207, 2010.
[11] D. P. Wall, P. Kudtarkar, V. A. Fusaro, R. Pivovarov, P. Patil, and P. J. Tonellato, "Cloud computing for comparative genomics," BMC Bioinformatics, vol. 11, no. 1, article 259, 2010.
[12] A. Rosenthal, P. Mork, M. H. Li, J. Stanford, D. Koester, and P. Reynolds, "Cloud computing: a new business paradigm for biomedical information sharing," Journal of Biomedical Informatics, vol. 43, no. 2, pp. 342–353, 2010.
[13] D. Dahiphale, R. Karve, A. V. Vasilakos et al., "An advanced MapReduce: cloud MapReduce, enhancements and applications," IEEE Transactions on Network and Service Management, vol. 11, no. 1, pp. 101–115, 2014.
[14] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good, "The cost of doing science on the cloud: the Montage example," in Proceedings of the ACM/IEEE Conference on Supercomputing, IEEE Press, November 2008.
[15] N. Chohan, C. Castillo, M. Spreitzer, M. Steinder, A. Tantawi, and C. Krintz, "See spot run: using spot instances for MapReduce workflows," in Proceedings of the 2010 USENIX Conference on Hot Topics in Cloud Computing, p. 7, 2010.
[16] R. S. Thakur, R. Bandopadhyay, B. Chaudhary, and S. Chatterjee, "Now and next-generation sequencing techniques: future of sequence analysis using cloud computing," Frontiers in Genetics, vol. 3, 2012.
[17] J. Chen, F. Qian, W. Yan, and B. Shen, "Translational biomedical informatics in the cloud: present and future," BioMed Research International, vol. 2013, Article ID 658925, 8 pages, 2013.
[18] T. Nguyen, W. Shi, and D. Ruden, "CloudAligner: a fast and full-featured MapReduce based tool for sequence mapping," BMC Research Notes, vol. 4, article 171, 2011.
[19] M. C. Schatz, "CloudBurst: highly sensitive read mapping with MapReduce," Bioinformatics, vol. 25, no. 11, pp. 1363–1369, 2009.
[20] B. Langmead, K. D. Hansen, and J. T. Leek, "Cloud-scale RNA-sequencing differential expression analysis with Myrna," Genome Biology, vol. 11, no. 8, p. R83, 2010.
[21] B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L. Salzberg, "Searching for SNPs with cloud computing," Genome Biology, vol. 10, no. 11, article R134, 2009.
[22] A. A. Al-Absi and D. Kang, "A novel parallel computation model with efficient local memory management for data-intensive applications," in Proceedings of the 2015 IEEE 8th International Conference on Cloud Computing (CLOUD), pp. 958–963, New York City, NY, USA, June 2015.
[23] A. Al-Absi and D.-K. Kang, "Long read alignment with parallel MapReduce cloud platform," BioMed Research International, vol. 2015, Article ID 807407, 13 pages, 2015.
[24] E. Yoon and A. Squicciarini, "Toward detecting compromised MapReduce workers through log analysis," in Proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 41–50, Chicago, IL, USA, 2014.
[25] D. Dang, Y. Liu, X. Zhang, and S. Huang, "A crowdsourcing worker quality evaluation algorithm on MapReduce for big data applications," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 7, pp. 1879–1888, 2016.
[26] S. D. Tetali, M. Lesani, R. Majumdar, and T. Millstein, "MrCrypt: static analysis for secure cloud computations," in Proceedings of the 28th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2013), pp. 271–286, USA, October 2013.
[27] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, "Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning," Nature Biotechnology, vol. 33, no. 8, pp. 831–838, 2015.
[28] K. Mahadik, S. Chaterji, B. Zhou, M. Kulkarni, and S. Bagchi, "Orion: scaling genomic sequence matching with fine-grained parallelization," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2014), pp. 449–460, USA, November 2014.
[29] J. Ekanayake, T. Gunarathne, and J. Qiu, "Cloud technologies for bioinformatics applications," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 6, pp. 998–1011, 2011.
[30] X. Huang and A. Madan, "CAP3: a DNA sequence assembly program," Genome Research, vol. 9, no. 9, pp. 868–877, 1999.
[31] Y. Wu, W. Guo, J. Ren, X. Zhao, and W. Zheng, "NO2: speeding up parallel processing of massive compute-intensive tasks," IEEE Transactions on Computers, vol. 63, no. 10, pp. 2487–2499, 2014.
[32] Z. Zhang, L. Cherkasova, and B. T. Loo, "Optimizing cost and performance trade-offs for MapReduce job processing in the cloud," in Proceedings of the IEEE/IFIP Network Operations and Management Symposium: Management in a Software Defined World (NOMS 2014), Poland, May 2014.
[33] K. Chen, J. Powers, S. Guo, and F. Tian, "CRESP: towards optimal resource provisioning for MapReduce computing in public clouds," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 6, pp. 1403–1412, 2014.
[34] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2009.
[35] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.
[36] National Center for Biotechnology Information, 2015, http://www.ncbi.nlm.nih.gov.

Hindawiwwwhindawicom

International Journal of

Volume 2018

Zoology

Hindawiwwwhindawicom Volume 2018

Anatomy Research International

PeptidesInternational Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Journal of Parasitology Research

GenomicsInternational Journal of

Hindawiwwwhindawicom Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Hindawiwwwhindawicom Volume 2018

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Neuroscience Journal

Hindawiwwwhindawicom Volume 2018

BioMed Research International

Cell BiologyInternational Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Biochemistry Research International

ArchaeaHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Genetics Research International

Hindawiwwwhindawicom Volume 2018

Advances in

Virolog y Stem Cells International

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Enzyme Research

Hindawiwwwhindawicom Volume 2018

International Journal of

MicrobiologyHindawiwwwhindawicom

Nucleic AcidsJournal of

Volume 2018

Submit your manuscripts atwwwhindawicom

Page 11: Parallel MapReduce: Maximizing Cloud Resource Utilization ...downloads.hindawi.com/journals/bmri/2018/7501042.pdf · MapReduce platform. Experimental outcomes show that theirmodelachievesaspeedupof.xovermpiBLASTwith-outcompromisingonaccuracy

BioMed Research International 11

Table 2 Information of the Genome Sequences used in CAP3 experiments

Experiment Number Dataset GenBankaccession number

Number of reads(bp)

Average Length ofreads (bp)

Length ofprovided

sequences (bp)1 203 AC004669 1812 598 897792 216 AC004638 2353 614 1246453 322F16 AF111103 4297 1011 1591794 526N18 AF123462 3221 965 180182

1 2 3 4 5 6

Experiment NumberPMRPMR -Theory

0

50

100

150

200

250

300Ex

ecut

ion

Tim

e (s

)

Figure 6 Correlation between theoretical and practical makespan times for BLAST sequence alignment execution on PMR framework

Removal of poor regions of

each read

Computation of overlaps

among reads

Wrongly identified overlap removal

Contig construction

Generation of consensus sequences and construction of multiple sequence

alignments

Figure 7 Steps for CAP3 sequencing assembly

is reported overall execution time ie total makespan of119875119872119877 is less when compared to Hadoop A reductionof 1897 20 1503 and 1801 is reported for thefour experiments executed on the 119875119872119877 framework whencompared to similar experiments executed on the Hadoopframework Average reduction of the total makespan acrossall experiments is 18 proving superior performance of 119875119872119877when compared to the Hadoop framework Makespan timefor experiment 2 is greater than other experiments as thenumber of differences considered in CAP3 is 17 larger thanvalues considered in other experiments Similar nature of ex-ecution times is reported in [29] validating CAP3 executionexperiments presented here

Theoretical makespan of 119875119872119877 for all four CAP3 exper-iments is computed using (18) Comparison between the-oretical and experimental makespan values is presented inFigure 10 Minor differences are reported between practicaland theoretical makespan computations proving correctnessof 119875119872119877makespan modeling presented

The results presented in this section prove that CAP3sequence assembly execution on the 119875119872119877 cloud frameworkdeveloped exhibits superior performance when compared tosimilar CAP3 experiments executed on the existing Hadoopcloud framework

43 DeepBind Analysis to Identify Binding Sites In recenttimes deep learning techniques have been extensively usedfor various applications Deep learning techniques areadopted predominantly when large amounts of data are tobe processed or analyzed To meet large computing needs ofdeep learning techniques GPU are used Motivated by thisthe authors of the paper consider very recent state-of-the-art ldquoDeepBindrdquo biomedical application execution on a cloudplatform To the best of our knowledge no such attempt toconsider cloud platforms for DeepBind execution has beenreported

Alternative splicing transcription and gene regulationsbiomedical operations are dependent on DNA- and RNA-binding proteins DNA- andRNA-binding proteins describedusing sequence specificities are critical in identifying diseasesand deriving models of regulatory processes that occur inbiological systems Position weight matrices are used incharacterizing specificities of a protein Binding sites ongenomic sequences are identified by scanning positionweightmatrices over the considered genomic sequences DeepBindis used to predict sequence specificities DeepBind adoptsdeep convolutional neural networks to achieve accurateprediction Comprehensive details and sequence specificity

12 BioMed Research International

0 500 1000 1500

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

CAP Execution - Hadoop (Experiment 1)

0 1000 2000 3000 4000 5000 6000 7000

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

CAP Execution - Hadoop (Experiment 2)

0 1000 2000 3000 4000 5000 6000

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

CAP Execution - Hadoop (Experiment 3)

0 500 1000 1500 2000 2500 3000 3500

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

CAP Execution - Hadoop (Experiment 4)

(a) Worker node execution times on Hadoop framework

0 200 400 600 800 1000 1200 1400

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

CAP Execution - PMR (Experiment 1)

0 1000 2000 3000 4000 5000

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s) Ta

sk E

xecu

tion

CAP Execution - PMR (Experiment 2)

0 1000 2000 3000 4000 5000

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

CAP Execution - PMR (Experiment 3)

0 500 1000 1500 2000 2500

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

CAP Execution Considering PMR (Experiment 4)

(b) Worker node execution times on 119875119872119877 framework

Figure 8 CAP3 sequence assembly execution makespan of the Map and Reduce worker nodes (a) On Hadoop cluster of 4 nodes (b) On119875119872119877 cluster of 4 nodes

prediction accuracy of the DeepBind application are availablein [27]

DeepBind is developed using a two-phase approach atraining phase and testing phase Training phase executionis carried out using Map workers in the Hadoop and 119875119872119877frameworks The trained weights are stored in the cloudmemory for further processing The testing phase of Deep-Bind is carried out at the Reduce stage in the Hadoop and

119875119872119877 frameworks Execution strategy of DeepBind algorithmon the 119875119872119877 framework is shown in Figure 11 DeepBindapplication is developed using the code provided in [27] Forperformance evaluation on Hadoop and 119875119872119877 only testingphase is discussed (ie Reduce only mode) A custom cloudcluster of one master node and six worker nodes is deployedfor DeepBind performance evaluation A similar cloud clus-ter for the Hadoop framework is considered The experiment

BioMed Research International 13

Table 3 Information of the disease-causing genomic variants used in the experiment

Experiment 1Case study Genome variant Experiment details

1 SP1 A disrupted SP1 binding site in the LDL-R promoter that leads to familialhypercholesterolemia

2 TCF7L2 A cancer risk variant in a MYC enhancer weakens a TCF7L2 binding site3 GATA1 A gained GATA1 binding site that disrupts the original globin cluster promoters

4 GATA4 A lost GATA4 binding site in the BCL-2 promoter potentially playing a role inovarian granulosa cell tumors

5 RFX3 Loss of two potential RFX3 binding sites leads to abnormal cortical development

6 GABPA Gained GABP-120572 binding sites in the TERT promoter which are linked to severaltypes of aggressive cancer

1 2 3 4Experiment Number

HadoopPMR

010002000300040005000600070008000

Exec

utio

n Ti

me (

s)

Figure 9 CAP3 sequence assembly total makespan time observed for experiments conducted on 119875119872119877 and Hadoop frameworks

1 2 3 4Experiment Number

PMRPMR -Theory

0

1000

2000

3000

4000

5000

6000

7000

Exec

utio

n Ti

me (

s)

Figure 10 Correlation between theoretical and practical makespan times for CAP3 sequence assembly execution on PMR framework

conducted to evaluate the performance of DeepBind on theHadoop and 119875119872119877 frameworks considers a set of six disease-causing genomic variants obtained from [27] The disease-causing genomic variants to be analyzed using DeepBind aresummarized in Table 3 DeepBind analysis is executed onthe Hadoop and 119875119872119877 frameworks deployed on a customcloud cluster Log data generated is stored and used in furtheranalysis

The results obtained to demonstrate the performanceof six worker cluster nodes of Hadoop and 119875119872119877 duringMap and Reduce phase execution are shown in Figure 12Performance is presented in terms of task execution times

observed per worker node Considering Hadoop workernodes execution times of each node during the Map andReduce phase is shown in Figure 12(a) The executiontime observed for each 119875119872119877 worker node during the Mapand Reduce phases is shown in Figure 12(b) In the Mapphase execution the genomic variants to be analyzed areobtained from the cloud storage and are accumulated basedon their identities defined [27] The query sequences ofdisease-causing genomic variants to be analyzed are split forparallelization In theReduce phase the split query sequencesare analyzed and results obtained are accumulated andstored in the cloud storage Map workers in 119875119872119877 exhibit

14 BioMed Research International

Query sequence q Training sequence

Local Memory

PMR Platform

Local Memory

PMR Platform

Trained data (weights)

Cloud storage

Shuffle Sort Reduce

Virtual Machine ndash Reduce Worker 1

Shuffle Sort Reduce

Virtual Machine ndash Reduce Worker w

Virtual Machine ndash Map Worker wVirtual Machine ndash Map Worker 1

Computed Score

Temporary Cloud storage

Cloud storage

PMR Platform PMR Platform

DeepBind DeepBindPM

R-M

aste

r Nod

eTraining data1

Training datan

Compute Binding Score1Compute Binding Scoren

Figure 11 DeepBind PMR framework

0 5 10 15 20 25 30R W 1R W 2R W 3R W 4R W 5R W 6

M W 1M W 2M W 3M W 4M W 5M W 6

Makespan Time (s)

Task

Exe

cutio

n

DeepBind Execution - Hadoop (Experiment 1)

(a) Worker node execution times on Hadoop framework

0 5 10 15 20R W 1R W 2R W 3R W 4R W 5R W 6

M W 1M W 2M W 3M W 4M W 5M W 6

Makespan Time (s)

Task

Exe

cutio

n

DeepBind Execution - PMR (Experiment 1)

(b) Worker node execution times on 119875119872119877 framework

Figure 12 DeepBind execution makespan of the Map and Reduce worker nodes (a) On Hadoop cluster of 6 nodes (b) On 119875119872119877 cluster of6 nodes

better performance and an average execution time reductionof 4857 is reported when compared to Hadoop Mapworker nodes Execution time of the six Reduce workernodes in Hadoop and 119875119872119877 is greater than Map workersas DeepBind analysis and identification of potential bindingsites is carried out during this phase Parallel executionstrategy of Reduce worker nodes is clear from Figure 12(b)The Reduce phase in 119875119872119877 commences after 5 seconds onceMap worker node 1 (MW1) and Map worker node 5 (MP5)have completed their task InHadoop that adopts a sequentialapproach the Reduce phase is initiated after all worker nodeshave completed their tasks Parallel execution of DeepBindanalysis utilizing all 4 computing cores available with Reduceworker nodes and parallel initiation of the Reduce phasein 119875119872119877 enable average Reduce execution time of 2222when compared to Hadoop Reduce worker nodes The totalmakespan observed for DeepBind experiment execution on

the Hadoop and 119875119872119877 cloud computing platforms is shownin Figure 13 Total makespan reduction of 3462 is achievedusing the 119875119872119877 framework when compared to the Hadoopframework Analysis results similar to [27] are reported forDeepBind analysis on the Hadoop and 119875119872119877 frameworksThe theoretical makespan computed using (18) for 119875119872119877 iscomparedwith the practical value observed in the experimentand the results obtained are shown in Figure 14 Aminor vari-ation between theoretical and practical values is observedThe variation observed is predominantly due to applicationdependent multiple cloud memory access operations Basedon results obtained for DeepBind analysis it is evident thatperformance on the 119875119872119877 framework is far superior to itsexecution on the existing Hadoop framework

On the basis of biomedical applications considered forperformance evaluation and results obtained it is evident thatthe proposed119875119872119877 framework exhibits superior performance

BioMed Research International 15

1Experiment Number

HadoopPMR

0

5

10

15

20

25

30

Exec

utio

n Ti

me (

s)

Figure 13 DeepBind analysis total makespan time observed for experiments conducted on 119875119872119877 and Hadoop frameworks

1Experiment Number

PMRPMR -Theory

0

5

10

15

20

25

30

Exec

utio

n Ti

me (

s)

Figure 14 Correlation between theoretical and practical makespan times for DeepBind analysis execution on PMR framework

when compared to its existing Hadoop counterpart InBLAST the Map and Reduce phases are utilized In CAP3application the Map phase plays a predominant role InDeepBind application analysis is carried out in the Reducephase The proposed 119875119872119877 cloud computing framework isrobust and is capable of the dynamic biomedical applicationscenarios presented deployment of 119875119872119877 on public andcustom cloud platforms In addition 119875119872119877 exhibits low exe-cution times and enables effective cloud resource utilizationLow execution times enable cost reduction always a desiredfeature

5 Conclusion and Future Work

The significance of cloud computing platforms is discussedThe commonly adopted Hadoop MapReduce frameworkworking with its drawbacks is presented To lower executiontimes and enable effective utilization of cloud resources thispaper proposes a 119875119872119877 cloud computing platform A parallelexecution strategy of the Map and Reduce phases is consid-ered in the 119875119872119877 framework TheMap and Reduce functionsof 119875119872119877 are designed to utilize multicore environments

available with worker nodesThe paper presents the proposed119875119872119877 framework architecture along with makespan model-ing Performance of the 119875119872119877 cloud computing frameworkis compared with the Hadoop framework For performanceevaluation computationally heavy biomedical applicationslike BLAST CAP3 and DeepBind are considered Averageoverall makespan times reduction of 3892 1800 and3462 is achieved using the 119875119872119877 framework when com-pared to the Hadoop framework for BLAST CAP3 andDeepBind applications The experiments presented prove therobustness of the 119875119872119877 platform its capability to handlediverse applications and ease of deployment on public andprivate cloud platformsThe results presented through the ex-periments conducted prove the superior performance of119875119872119877 against the Hadoop framework Good matching is re-ported between the theoretical makespan of the 119875119872119877 pre-sented and experimental values observed In addition adopt-ing the 119875119872119877 cloud computing framework also enables costreduction and efficient utilization of cloud resources

Performance study considering cloud cluster with manynodes additional applications and security provisioning to

16 BioMed Research International

cloud computing framework is considered as the future workof this paper

Data Availability

Thedata is available at the National Center for BiotechnologyInformation (2015) [Online] Available httpwwwncbinlmnihgov

Conflicts of Interest

The authors declare that there are no conflicts of interest re-garding the publication of this paper

Acknowledgments

This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2015R1D1A1A01061328)

References

[1] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 137ndash150 2004

[2] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark Cluster Computing with Working Setsrdquo inProceedings of the in Proceedings of the 2nd USENIX Conferenceon Hot topics in Cloud Computing Boston MA June 2010

[3] G Malewicz M H Austern A J C Bik et al ldquoPregel a sys-tem for large-scale graph processingrdquo inProceedings of the Inter-national Conference on Management of Data (SIGMOD rsquo10) pp135ndash146 June 2010

[4] S Ghemawat H Gobioff and S Leung ldquoThe Google file sys-temrdquo in Proceedings of the the nineteenth ACM symposium p29 Bolton Landing NY USA October 2003

[5] X Shi M Chen L He et al ldquoMammoth Gearing HadoopTowards Memory-Intensive MapReduce Applicationsrdquo IEEETransactions on Parallel and Distributed Systems vol 26 no 8pp 2300ndash2315 2015

[6] L Person ldquoWorld Hadoop Market - Opportunities and Fore-castsrdquo Allied Market Research p 108 2020

[7] SNS Research The Big Data Market 2014ndash2020 OpportunitiesChallenges Strategies Industry Verticals and Forecasts SNSResearch 2014

[8] J Zhu J Li E Hardesty H Jiang and K-C Li ldquoGPU-in-Ha-doop Enabling MapReduce across distributed heterogeneousplatformsrdquo in Proceedings of the 2014 13th IEEEACIS Interna-tional Conference on Computer and Information Science ICIS2014 - Proceedings pp 321ndash326 China June 2014

[9] M Zaharia A Konwinski A D Joseph R H Katz I Stoicaand ldquo ldquoImproving Mapreduce Performance in Heteroge-neous Environments rdquordquo in Proceedings of the Proc EighthUSENIX Conf pp 29ndash42 2008

[10] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010

[11] D P Wall P Kudtarkar V A Fusaro R Pivovarov P Patil andP J Tonellato ldquoCloud computing for comparative genomicsrdquoBMC Bioinformatics vol 11 no 1 article 259 2010

[12] A Rosenthal P Mork M H Li J Stanford D Koester andP Reynolds ldquoCloud computing a new business paradigm forbiomedical information sharingrdquo Journal of Biomedical Infor-matics vol 43 no 2 pp 342ndash353 2010

[13] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014

[14] E Deelman G SinghM Livny B Berriman and J Good ldquoThecost of doing science on the cloud the montage examplerdquo inProceedings of the ACMIEEE Conference on SupercomputingIEEE Press November 2008

[15] N Chohan C Castillo M Spreitzer M Steinder A TantawiandCKrintz ldquoSee spot run using spot instances formapreduceworkflows inrdquo in Proceedings of the Proc 2010 USENIX Con-ference on Hot Topics in Cloud Computing p 7 2010

[16] R SThakur R Bandopadhyay B Chaudhary and S ChatterjeeldquoNow and next-generation sequencing techniques Future ofsequence analysis using cloud computingrdquo Frontiers in Geneticsvol 3 2012

[17] J Chen F QianW Yan and B Shen ldquoTranslational biomedicalinformatics in the cloud present and futurerdquo BioMed ResearchInternational vol 2013 Article ID 658925 8 pages 2013

[18] TNguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011

[19] M C Schatz ldquoCloudBurst highly sensitive read mapping withMapReducerdquoBioinformatics vol 25 no 11 pp 1363ndash1369 2009

[20] B Langmead K D Hansen and J T Leek ldquoCloud-scale RNA-sequencing differential expression analysis with Myrnardquo Ge-nome Biology vol 11 no 8 p R83 2010

[21] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 no 11 article R134 2009

[22] A A Al-Absi and D Kang ldquoA Novel Parallel ComputationModel with Efficient Local Memory Management for Data-Intensive Applicationsrdquo in Proceedings of the 2015 IEEE 8thInternational Conference on Cloud Computing (CLOUD) pp958ndash963 New York City NY USA June 2015

[23] A Al-Absi and Dae-K Kang I ldquoLong Read Alignment withParallelMapReduceCloud Platformrdquo BioMed Research Interna-tional vol 2015 Article ID 807407 13 pages 2015

[24] E Yoon and A Squicciarini ldquoToward Detecting CompromisedMapReduce Workers through Log Analysisrdquo in Proceedings ofthe 14th IEEEACM International Symposium on Cluster Cloudand Grid Computing pp 41ndash50 Chicago IL USA 2014

[25] D Dang Y Liu X Zhang and S Huang ldquoA CrowdsourcingWorker Quality Evaluation Algorithm on MapReduce for BigData Applicationsrdquo IEEE Transactions on Parallel and Dis-tributed Systems vol 27 no 7 pp 1879ndash1888 2016

[26] SDTetaliM Lesani RMajumdar andTMillstein ldquoMrCryptStatic analysis for secure cloud computationsrdquo in Proceedingsof the 2013 28th ACM SIGPLAN Conference on Object-OrientedProgramming Systems Languages and Applications OOPSLA2013 pp 271ndash286 USA October 2013

[27] B Alipanahi A Delong M T Weirauch and B J Frey ldquoPre-dicting the sequence specificities of DNA- and RNA-bindingproteins by deep learningrdquo Nature Biotechnology vol 33 no 8pp 831ndash838 2015

BioMed Research International 17

[28] K Mahadik S Chaterji B Zhou M Kulkarni and S BagchildquoOrion Scaling Genomic Sequence Matching with Fine-GrainedParallelizationrdquo inProceedings of the International Con-ference for High Performance Computing Networking Storageand Analysis SC 2014 pp 449ndash460 USA November 2014

[29] J Ekanayake T Gunarathne and J Qiu ldquoCloud technologiesfor bioinformatics applicationsrdquo IEEE Transactions on Paralleland Distributed Systems vol 22 no 6 pp 998ndash1011 2011

[30] X Huang and A Madan ldquoCAP3 a DNA sequence assemblyprogramrdquo Genome Research vol 9 no 9 pp 868ndash877 1999

[31] Y Wu W Guo J Ren X Zhao and W Zheng ldquo1198731198742 speedingup parallel processing of massive compute-intensive tasksrdquoInstitute of Electrical and Electronics Engineers Transactions onComputers vol 63 no 10 pp 2487ndash2499 2014

[32] Z Zhang L Cherkasova and B T Loo ldquoOptimizing cost andperformance trade-offs for MapReduce job processing in thecloudrdquo in Proceedings of the IEEEIFIP Network Operations andManagement Symposium Management in a Software DefinedWorld NOMS 2014 Poland May 2014

[33] K Chen J Powers S Guo and F Tian ldquoCRESP Towards opti-mal resource provisioning for MapReduce computing in publiccloudsrdquo IEEE Transactions on Parallel and Distributed Systemsvol 25 no 6 pp 1403ndash1412 2014

[34] T White Hadoop The Definitive Guide OReilly Media TheDefinitive Guide OrsquoReilly Media Hadoop 2009

[35] S F AltschulW Gish WMiller EWMyers andD J LipmanldquoBasic local alignment search toolrdquo Journal ofMolecular Biologyvol 215 no 3 pp 403ndash410 1990

[36] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov

Hindawiwwwhindawicom

International Journal of

Volume 2018

Zoology

Hindawiwwwhindawicom Volume 2018

Anatomy Research International

PeptidesInternational Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Journal of Parasitology Research

GenomicsInternational Journal of

Hindawiwwwhindawicom Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Hindawiwwwhindawicom Volume 2018

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Neuroscience Journal

Hindawiwwwhindawicom Volume 2018

BioMed Research International

Cell BiologyInternational Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Biochemistry Research International

ArchaeaHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Genetics Research International

Hindawiwwwhindawicom Volume 2018

Advances in

Virolog y Stem Cells International

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Enzyme Research

Hindawiwwwhindawicom Volume 2018

International Journal of

MicrobiologyHindawiwwwhindawicom

Nucleic AcidsJournal of

Volume 2018

Submit your manuscripts atwwwhindawicom

Page 12: Parallel MapReduce: Maximizing Cloud Resource Utilization ...downloads.hindawi.com/journals/bmri/2018/7501042.pdf · MapReduce platform. Experimental outcomes show that theirmodelachievesaspeedupof.xovermpiBLASTwith-outcompromisingonaccuracy

12 BioMed Research International

0 500 1000 1500

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

CAP Execution - Hadoop (Experiment 1)

0 1000 2000 3000 4000 5000 6000 7000

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

CAP Execution - Hadoop (Experiment 2)

0 1000 2000 3000 4000 5000 6000

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

CAP Execution - Hadoop (Experiment 3)

0 500 1000 1500 2000 2500 3000 3500

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

CAP Execution - Hadoop (Experiment 4)

(a) Worker node execution times on Hadoop framework

0 200 400 600 800 1000 1200 1400

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

CAP Execution - PMR (Experiment 1)

0 1000 2000 3000 4000 5000

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s) Ta

sk E

xecu

tion

CAP Execution - PMR (Experiment 2)

0 1000 2000 3000 4000 5000

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

CAP Execution - PMR (Experiment 3)

0 500 1000 1500 2000 2500

R W 1R W 2R W 3R W 4

M W 1M W 2M W 3M W 4

Makespan Time (s)

Task

Exe

cutio

n

CAP Execution Considering PMR (Experiment 4)

(b) Worker node execution times on 119875119872119877 framework

Figure 8 CAP3 sequence assembly execution makespan of the Map and Reduce worker nodes (a) On Hadoop cluster of 4 nodes (b) On119875119872119877 cluster of 4 nodes

prediction accuracy of the DeepBind application are availablein [27]

DeepBind is developed using a two-phase approach atraining phase and testing phase Training phase executionis carried out using Map workers in the Hadoop and 119875119872119877frameworks The trained weights are stored in the cloudmemory for further processing The testing phase of Deep-Bind is carried out at the Reduce stage in the Hadoop and

119875119872119877 frameworks Execution strategy of DeepBind algorithmon the 119875119872119877 framework is shown in Figure 11 DeepBindapplication is developed using the code provided in [27] Forperformance evaluation on Hadoop and 119875119872119877 only testingphase is discussed (ie Reduce only mode) A custom cloudcluster of one master node and six worker nodes is deployedfor DeepBind performance evaluation A similar cloud clus-ter for the Hadoop framework is considered The experiment

BioMed Research International 13

Table 3: Information of the disease-causing genomic variants used in the experiment (Experiment 1).

Case study | Genome variant | Experiment details
1 | SP1 | A disrupted SP1 binding site in the LDL-R promoter that leads to familial hypercholesterolemia
2 | TCF7L2 | A cancer risk variant in a MYC enhancer weakens a TCF7L2 binding site
3 | GATA1 | A gained GATA1 binding site that disrupts the original globin cluster promoters
4 | GATA4 | A lost GATA4 binding site in the BCL-2 promoter, potentially playing a role in ovarian granulosa cell tumors
5 | RFX3 | Loss of two potential RFX3 binding sites leads to abnormal cortical development
6 | GABPA | Gained GABP-alpha binding sites in the TERT promoter, which are linked to several types of aggressive cancer

Figure 9: CAP3 sequence assembly total makespan time (s) observed for Experiments 1-4 conducted on the PMR and Hadoop frameworks.

Figure 10: Correlation between theoretical and practical makespan times (s) for CAP3 sequence assembly execution on the PMR framework (Experiments 1-4).

The experiment conducted to evaluate the performance of DeepBind on the Hadoop and PMR frameworks considers a set of six disease-causing genomic variants obtained from [27]. The disease-causing genomic variants to be analyzed using DeepBind are summarized in Table 3. DeepBind analysis is executed on the Hadoop and PMR frameworks deployed on a custom cloud cluster. The log data generated is stored and used in further analysis.

The results obtained to demonstrate the performance of the six worker cluster nodes of Hadoop and PMR during Map and Reduce phase execution are shown in Figure 12. Performance is presented in terms of the task execution times observed per worker node. For the Hadoop worker nodes, the execution time of each node during the Map and Reduce phases is shown in Figure 12(a). The execution time observed for each PMR worker node during the Map and Reduce phases is shown in Figure 12(b). In the Map phase execution, the genomic variants to be analyzed are obtained from the cloud storage and are accumulated based on their defined identities [27]. The query sequences of the disease-causing genomic variants to be analyzed are split for parallelization. In the Reduce phase, the split query sequences are analyzed, and the results obtained are accumulated and stored in the cloud storage.
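The split step described above can be pictured as a simple partitioning of the query-sequence list across the available Map workers. The Python sketch below is illustrative only; the paper does not specify PMR's exact split policy, so a round-robin assignment is assumed.

    def split_queries(queries, num_workers):
        """Partition query sequences into near-equal chunks, one per
        Map worker, before dispatch for parallel analysis."""
        chunks = [[] for _ in range(num_workers)]
        for i, q in enumerate(queries):
            chunks[i % num_workers].append(q)  # round-robin assignment
        return chunks

    # Example: the six variant queries of Table 3 split across six
    # PMR Map workers.
    variants = ["SP1", "TCF7L2", "GATA1", "GATA4", "RFX3", "GABPA"]
    for worker_id, chunk in enumerate(split_queries(variants, 6), start=1):
        print(f"MW{worker_id} -> {chunk}")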


Figure 11: DeepBind PMR framework. The PMR master node dispatches training data (1 to n) to the Map worker virtual machines (Map Worker 1 to w), each running DeepBind on the PMR platform with local memory; the trained data (weights) are stored in cloud storage. Query sequence q is passed, via temporary cloud storage, to the Reduce worker virtual machines (Reduce Worker 1 to w), which perform Shuffle, Sort, and Reduce to compute the binding scores (1 to n); the computed scores are written back to cloud storage.

Figure 12: DeepBind execution makespan of the Map and Reduce worker nodes (MW1-MW6, RW1-RW6; makespan time in seconds). (a) Worker node execution times on the Hadoop cluster of 6 nodes. (b) Worker node execution times on the PMR cluster of 6 nodes.

Map workers in PMR exhibit better performance, and an average execution time reduction of 48.57% is reported when compared to the Hadoop Map worker nodes. The execution time of the six Reduce worker nodes in both Hadoop and PMR is greater than that of the Map workers, as DeepBind analysis and the identification of potential binding sites are carried out during this phase. The parallel execution strategy of the Reduce worker nodes is clear from Figure 12(b): the Reduce phase in PMR commences after 5 seconds, once Map worker node 1 (MW1) and Map worker node 5 (MW5) have completed their tasks. In Hadoop, which adopts a sequential approach, the Reduce phase is initiated only after all worker nodes have completed their tasks. Parallel execution of DeepBind analysis utilizing all 4 computing cores available with the Reduce worker nodes, together with parallel initiation of the Reduce phase in PMR, enables an average Reduce execution time reduction of 22.22% when compared to the Hadoop Reduce worker nodes. The total makespan observed for DeepBind experiment execution on the Hadoop and PMR cloud computing platforms is shown in Figure 13. A total makespan reduction of 34.62% is achieved using the PMR framework when compared to the Hadoop framework. Analysis results similar to [27] are reported for DeepBind analysis on both the Hadoop and PMR frameworks. The theoretical makespan computed using (18) for PMR is compared with the practical value observed in the experiment, and the results obtained are shown in Figure 14. A minor variation between the theoretical and practical values is observed, predominantly due to application-dependent multiple cloud memory access operations. Based on the results obtained for DeepBind analysis, it is evident that performance on the PMR framework is far superior to its execution on the existing Hadoop framework.
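The scheduling difference driving these numbers can be illustrated with a small thread-based Python sketch (illustrative only; PMR's actual implementation is not given in the paper). A PMR-style reducer starts consuming intermediate outputs as soon as the first Map tasks emit them, whereas a Hadoop-style run would join all Map tasks before starting the reducer.

    import queue
    import threading
    import time

    results = queue.Queue()
    DONE = object()  # sentinel marking the end of the Map stage

    def map_worker(wid, cost):
        time.sleep(cost)               # stand-in for Map computation
        results.put((wid, cost * 10))  # emit an intermediate key/value

    def reduce_worker():
        total = 0
        while (item := results.get()) is not DONE:
            wid, value = item
            total += value             # accumulate as outputs arrive
            print(f"reduced output of MW{wid}")
        print("total:", total)

    mappers = [threading.Thread(target=map_worker, args=(i, 0.1 * i))
               for i in range(1, 7)]
    reducer = threading.Thread(target=reduce_worker)
    reducer.start()                    # parallel initiation, PMR-style:
    for t in mappers:                  # the reducer is live before any
        t.start()                      # Map task has finished
    for t in mappers:
        t.join()
    results.put(DONE)
    reducer.join()

A Hadoop-style barrier would instead call reducer.start() only after all mappers have been joined, which is exactly the sequential behavior the measurements above penalize.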

On the basis of the biomedical applications considered for performance evaluation and the results obtained, it is evident that the proposed PMR framework exhibits superior performance when compared to its existing Hadoop counterpart.


Figure 13: DeepBind analysis total makespan time (s) observed for the experiment conducted on the PMR and Hadoop frameworks.

Figure 14: Correlation between theoretical and practical makespan times (s) for DeepBind analysis execution on the PMR framework.

In BLAST, both the Map and Reduce phases are utilized. In the CAP3 application, the Map phase plays the predominant role. In the DeepBind application, analysis is carried out in the Reduce phase. The proposed PMR cloud computing framework is robust and capable of handling the diverse biomedical application scenarios presented, supporting deployment of PMR on public and custom cloud platforms. In addition, PMR exhibits low execution times and enables effective cloud resource utilization; low execution times enable cost reduction, always a desired feature.

5. Conclusion and Future Work

The significance of cloud computing platforms is discussed, and the commonly adopted Hadoop MapReduce framework, its working, and its drawbacks are presented. To lower execution times and enable effective utilization of cloud resources, this paper proposes the PMR cloud computing platform. A parallel execution strategy of the Map and Reduce phases is considered in the PMR framework, and the Map and Reduce functions of PMR are designed to utilize the multicore environments available with worker nodes. The paper presents the proposed PMR framework architecture along with makespan modeling. Performance of the PMR cloud computing framework is compared with the Hadoop framework. For performance evaluation, computationally heavy biomedical applications, namely, BLAST, CAP3, and DeepBind, are considered. Average overall makespan time reductions of 38.92%, 18.00%, and 34.62% are achieved using the PMR framework when compared to the Hadoop framework for the BLAST, CAP3, and DeepBind applications, respectively. The experiments presented prove the robustness of the PMR platform, its capability to handle diverse applications, and its ease of deployment on public and private cloud platforms. The results presented through the experiments conducted prove the superior performance of PMR against the Hadoop framework. Good matching is reported between the theoretical makespan of PMR presented and the experimental values observed. In addition, adopting the PMR cloud computing framework also enables cost reduction and efficient utilization of cloud resources.

A performance study considering cloud clusters with many nodes, additional applications, and security provisioning for the cloud computing framework is considered as the future work of this paper.

Data Availability

The data is available at the National Center for Biotechnology Information (2015), http://www.ncbi.nlm.nih.gov.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (no. NRF-2015R1D1A1A01061328).

References

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 137-150, 2004.
[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, Boston, MA, June 2010.
[3] G. Malewicz, M. H. Austern, A. J. C. Bik et al., "Pregel: a system for large-scale graph processing," in Proceedings of the International Conference on Management of Data (SIGMOD '10), pp. 135-146, June 2010.
[4] S. Ghemawat, H. Gobioff, and S. Leung, "The Google file system," in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, p. 29, Bolton Landing, NY, USA, October 2003.
[5] X. Shi, M. Chen, L. He et al., "Mammoth: gearing Hadoop towards memory-intensive MapReduce applications," IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 8, pp. 2300-2315, 2015.
[6] L. Person, "World Hadoop Market - Opportunities and Forecasts," Allied Market Research, p. 108, 2020.
[7] SNS Research, The Big Data Market: 2014-2020 - Opportunities, Challenges, Strategies, Industry Verticals and Forecasts, SNS Research, 2014.
[8] J. Zhu, J. Li, E. Hardesty, H. Jiang, and K.-C. Li, "GPU-in-Hadoop: enabling MapReduce across distributed heterogeneous platforms," in Proceedings of the 2014 13th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2014), pp. 321-326, China, June 2014.
[9] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proceedings of the Eighth USENIX Conference on Operating Systems Design and Implementation, pp. 29-42, 2008.
[10] L. D. Stein, "The case for cloud computing in genome informatics," Genome Biology, vol. 11, no. 5, article 207, 2010.
[11] D. P. Wall, P. Kudtarkar, V. A. Fusaro, R. Pivovarov, P. Patil, and P. J. Tonellato, "Cloud computing for comparative genomics," BMC Bioinformatics, vol. 11, no. 1, article 259, 2010.
[12] A. Rosenthal, P. Mork, M. H. Li, J. Stanford, D. Koester, and P. Reynolds, "Cloud computing: a new business paradigm for biomedical information sharing," Journal of Biomedical Informatics, vol. 43, no. 2, pp. 342-353, 2010.
[13] D. Dahiphale, R. Karve, A. V. Vasilakos et al., "An advanced MapReduce: cloud MapReduce, enhancements and applications," IEEE Transactions on Network and Service Management, vol. 11, no. 1, pp. 101-115, 2014.
[14] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good, "The cost of doing science on the cloud: the Montage example," in Proceedings of the ACM/IEEE Conference on Supercomputing, IEEE Press, November 2008.
[15] N. Chohan, C. Castillo, M. Spreitzer, M. Steinder, A. Tantawi, and C. Krintz, "See spot run: using spot instances for MapReduce workflows," in Proceedings of the 2010 USENIX Conference on Hot Topics in Cloud Computing, p. 7, 2010.
[16] R. S. Thakur, R. Bandopadhyay, B. Chaudhary, and S. Chatterjee, "Now and next-generation sequencing techniques: future of sequence analysis using cloud computing," Frontiers in Genetics, vol. 3, 2012.
[17] J. Chen, F. Qian, W. Yan, and B. Shen, "Translational biomedical informatics in the cloud: present and future," BioMed Research International, vol. 2013, Article ID 658925, 8 pages, 2013.
[18] T. Nguyen, W. Shi, and D. Ruden, "CloudAligner: a fast and full-featured MapReduce based tool for sequence mapping," BMC Research Notes, vol. 4, article 171, 2011.
[19] M. C. Schatz, "CloudBurst: highly sensitive read mapping with MapReduce," Bioinformatics, vol. 25, no. 11, pp. 1363-1369, 2009.
[20] B. Langmead, K. D. Hansen, and J. T. Leek, "Cloud-scale RNA-sequencing differential expression analysis with Myrna," Genome Biology, vol. 11, no. 8, p. R83, 2010.
[21] B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L. Salzberg, "Searching for SNPs with cloud computing," Genome Biology, vol. 10, no. 11, article R134, 2009.
[22] A. A. Al-Absi and D. Kang, "A novel parallel computation model with efficient local memory management for data-intensive applications," in Proceedings of the 2015 IEEE 8th International Conference on Cloud Computing (CLOUD), pp. 958-963, New York City, NY, USA, June 2015.
[23] A. A. Al-Absi and D.-K. Kang, "Long read alignment with parallel MapReduce cloud platform," BioMed Research International, vol. 2015, Article ID 807407, 13 pages, 2015.
[24] E. Yoon and A. Squicciarini, "Toward detecting compromised MapReduce workers through log analysis," in Proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 41-50, Chicago, IL, USA, 2014.
[25] D. Dang, Y. Liu, X. Zhang, and S. Huang, "A crowdsourcing worker quality evaluation algorithm on MapReduce for big data applications," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 7, pp. 1879-1888, 2016.
[26] S. D. Tetali, M. Lesani, R. Majumdar, and T. Millstein, "MrCrypt: static analysis for secure cloud computations," in Proceedings of the 2013 28th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2013), pp. 271-286, USA, October 2013.
[27] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, "Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning," Nature Biotechnology, vol. 33, no. 8, pp. 831-838, 2015.
[28] K. Mahadik, S. Chaterji, B. Zhou, M. Kulkarni, and S. Bagchi, "Orion: scaling genomic sequence matching with fine-grained parallelization," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2014), pp. 449-460, USA, November 2014.
[29] J. Ekanayake, T. Gunarathne, and J. Qiu, "Cloud technologies for bioinformatics applications," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 6, pp. 998-1011, 2011.
[30] X. Huang and A. Madan, "CAP3: a DNA sequence assembly program," Genome Research, vol. 9, no. 9, pp. 868-877, 1999.
[31] Y. Wu, W. Guo, J. Ren, X. Zhao, and W. Zheng, "NO2: speeding up parallel processing of massive compute-intensive tasks," IEEE Transactions on Computers, vol. 63, no. 10, pp. 2487-2499, 2014.
[32] Z. Zhang, L. Cherkasova, and B. T. Loo, "Optimizing cost and performance trade-offs for MapReduce job processing in the cloud," in Proceedings of the IEEE/IFIP Network Operations and Management Symposium: Management in a Software Defined World (NOMS 2014), Poland, May 2014.
[33] K. Chen, J. Powers, S. Guo, and F. Tian, "CRESP: towards optimal resource provisioning for MapReduce computing in public clouds," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 6, pp. 1403-1412, 2014.
[34] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2009.
[35] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, vol. 215, no. 3, pp. 403-410, 1990.
[36] National Center for Biotechnology Information, 2015, http://www.ncbi.nlm.nih.gov.

Hindawiwwwhindawicom

International Journal of

Volume 2018

Zoology

Hindawiwwwhindawicom Volume 2018

Anatomy Research International

PeptidesInternational Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Journal of Parasitology Research

GenomicsInternational Journal of

Hindawiwwwhindawicom Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Hindawiwwwhindawicom Volume 2018

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Neuroscience Journal

Hindawiwwwhindawicom Volume 2018

BioMed Research International

Cell BiologyInternational Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Biochemistry Research International

ArchaeaHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Genetics Research International

Hindawiwwwhindawicom Volume 2018

Advances in

Virolog y Stem Cells International

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Enzyme Research

Hindawiwwwhindawicom Volume 2018

International Journal of

MicrobiologyHindawiwwwhindawicom

Nucleic AcidsJournal of

Volume 2018

Submit your manuscripts atwwwhindawicom

Page 13: Parallel MapReduce: Maximizing Cloud Resource Utilization ...downloads.hindawi.com/journals/bmri/2018/7501042.pdf · MapReduce platform. Experimental outcomes show that theirmodelachievesaspeedupof.xovermpiBLASTwith-outcompromisingonaccuracy

BioMed Research International 13

Table 3 Information of the disease-causing genomic variants used in the experiment

Experiment 1Case study Genome variant Experiment details

1 SP1 A disrupted SP1 binding site in the LDL-R promoter that leads to familialhypercholesterolemia

2 TCF7L2 A cancer risk variant in a MYC enhancer weakens a TCF7L2 binding site3 GATA1 A gained GATA1 binding site that disrupts the original globin cluster promoters

4 GATA4 A lost GATA4 binding site in the BCL-2 promoter potentially playing a role inovarian granulosa cell tumors

5 RFX3 Loss of two potential RFX3 binding sites leads to abnormal cortical development

6 GABPA Gained GABP-120572 binding sites in the TERT promoter which are linked to severaltypes of aggressive cancer

1 2 3 4Experiment Number

HadoopPMR

010002000300040005000600070008000

Exec

utio

n Ti

me (

s)

Figure 9 CAP3 sequence assembly total makespan time observed for experiments conducted on 119875119872119877 and Hadoop frameworks

1 2 3 4Experiment Number

PMRPMR -Theory

0

1000

2000

3000

4000

5000

6000

7000

Exec

utio

n Ti

me (

s)

Figure 10 Correlation between theoretical and practical makespan times for CAP3 sequence assembly execution on PMR framework

conducted to evaluate the performance of DeepBind on theHadoop and 119875119872119877 frameworks considers a set of six disease-causing genomic variants obtained from [27] The disease-causing genomic variants to be analyzed using DeepBind aresummarized in Table 3 DeepBind analysis is executed onthe Hadoop and 119875119872119877 frameworks deployed on a customcloud cluster Log data generated is stored and used in furtheranalysis

The results obtained to demonstrate the performanceof six worker cluster nodes of Hadoop and 119875119872119877 duringMap and Reduce phase execution are shown in Figure 12Performance is presented in terms of task execution times

observed per worker node Considering Hadoop workernodes execution times of each node during the Map andReduce phase is shown in Figure 12(a) The executiontime observed for each 119875119872119877 worker node during the Mapand Reduce phases is shown in Figure 12(b) In the Mapphase execution the genomic variants to be analyzed areobtained from the cloud storage and are accumulated basedon their identities defined [27] The query sequences ofdisease-causing genomic variants to be analyzed are split forparallelization In theReduce phase the split query sequencesare analyzed and results obtained are accumulated andstored in the cloud storage Map workers in 119875119872119877 exhibit

14 BioMed Research International

Query sequence q Training sequence

Local Memory

PMR Platform

Local Memory

PMR Platform

Trained data (weights)

Cloud storage

Shuffle Sort Reduce

Virtual Machine ndash Reduce Worker 1

Shuffle Sort Reduce

Virtual Machine ndash Reduce Worker w

Virtual Machine ndash Map Worker wVirtual Machine ndash Map Worker 1

Computed Score

Temporary Cloud storage

Cloud storage

PMR Platform PMR Platform

DeepBind DeepBindPM

R-M

aste

r Nod

eTraining data1

Training datan

Compute Binding Score1Compute Binding Scoren

Figure 11 DeepBind PMR framework

0 5 10 15 20 25 30R W 1R W 2R W 3R W 4R W 5R W 6

M W 1M W 2M W 3M W 4M W 5M W 6

Makespan Time (s)

Task

Exe

cutio

n

DeepBind Execution - Hadoop (Experiment 1)

(a) Worker node execution times on Hadoop framework

0 5 10 15 20R W 1R W 2R W 3R W 4R W 5R W 6

M W 1M W 2M W 3M W 4M W 5M W 6

Makespan Time (s)

Task

Exe

cutio

n

DeepBind Execution - PMR (Experiment 1)

(b) Worker node execution times on 119875119872119877 framework

Figure 12 DeepBind execution makespan of the Map and Reduce worker nodes (a) On Hadoop cluster of 6 nodes (b) On 119875119872119877 cluster of6 nodes

better performance and an average execution time reductionof 4857 is reported when compared to Hadoop Mapworker nodes Execution time of the six Reduce workernodes in Hadoop and 119875119872119877 is greater than Map workersas DeepBind analysis and identification of potential bindingsites is carried out during this phase Parallel executionstrategy of Reduce worker nodes is clear from Figure 12(b)The Reduce phase in 119875119872119877 commences after 5 seconds onceMap worker node 1 (MW1) and Map worker node 5 (MP5)have completed their task InHadoop that adopts a sequentialapproach the Reduce phase is initiated after all worker nodeshave completed their tasks Parallel execution of DeepBindanalysis utilizing all 4 computing cores available with Reduceworker nodes and parallel initiation of the Reduce phasein 119875119872119877 enable average Reduce execution time of 2222when compared to Hadoop Reduce worker nodes The totalmakespan observed for DeepBind experiment execution on

the Hadoop and 119875119872119877 cloud computing platforms is shownin Figure 13 Total makespan reduction of 3462 is achievedusing the 119875119872119877 framework when compared to the Hadoopframework Analysis results similar to [27] are reported forDeepBind analysis on the Hadoop and 119875119872119877 frameworksThe theoretical makespan computed using (18) for 119875119872119877 iscomparedwith the practical value observed in the experimentand the results obtained are shown in Figure 14 Aminor vari-ation between theoretical and practical values is observedThe variation observed is predominantly due to applicationdependent multiple cloud memory access operations Basedon results obtained for DeepBind analysis it is evident thatperformance on the 119875119872119877 framework is far superior to itsexecution on the existing Hadoop framework

On the basis of biomedical applications considered forperformance evaluation and results obtained it is evident thatthe proposed119875119872119877 framework exhibits superior performance

BioMed Research International 15

1Experiment Number

HadoopPMR

0

5

10

15

20

25

30

Exec

utio

n Ti

me (

s)

Figure 13 DeepBind analysis total makespan time observed for experiments conducted on 119875119872119877 and Hadoop frameworks

1Experiment Number

PMRPMR -Theory

0

5

10

15

20

25

30

Exec

utio

n Ti

me (

s)

Figure 14 Correlation between theoretical and practical makespan times for DeepBind analysis execution on PMR framework

when compared to its existing Hadoop counterpart InBLAST the Map and Reduce phases are utilized In CAP3application the Map phase plays a predominant role InDeepBind application analysis is carried out in the Reducephase The proposed 119875119872119877 cloud computing framework isrobust and is capable of the dynamic biomedical applicationscenarios presented deployment of 119875119872119877 on public andcustom cloud platforms In addition 119875119872119877 exhibits low exe-cution times and enables effective cloud resource utilizationLow execution times enable cost reduction always a desiredfeature

5 Conclusion and Future Work

The significance of cloud computing platforms is discussedThe commonly adopted Hadoop MapReduce frameworkworking with its drawbacks is presented To lower executiontimes and enable effective utilization of cloud resources thispaper proposes a 119875119872119877 cloud computing platform A parallelexecution strategy of the Map and Reduce phases is consid-ered in the 119875119872119877 framework TheMap and Reduce functionsof 119875119872119877 are designed to utilize multicore environments

available with worker nodesThe paper presents the proposed119875119872119877 framework architecture along with makespan model-ing Performance of the 119875119872119877 cloud computing frameworkis compared with the Hadoop framework For performanceevaluation computationally heavy biomedical applicationslike BLAST CAP3 and DeepBind are considered Averageoverall makespan times reduction of 3892 1800 and3462 is achieved using the 119875119872119877 framework when com-pared to the Hadoop framework for BLAST CAP3 andDeepBind applications The experiments presented prove therobustness of the 119875119872119877 platform its capability to handlediverse applications and ease of deployment on public andprivate cloud platformsThe results presented through the ex-periments conducted prove the superior performance of119875119872119877 against the Hadoop framework Good matching is re-ported between the theoretical makespan of the 119875119872119877 pre-sented and experimental values observed In addition adopt-ing the 119875119872119877 cloud computing framework also enables costreduction and efficient utilization of cloud resources

Performance study considering cloud cluster with manynodes additional applications and security provisioning to

16 BioMed Research International

cloud computing framework is considered as the future workof this paper

Data Availability

Thedata is available at the National Center for BiotechnologyInformation (2015) [Online] Available httpwwwncbinlmnihgov

Conflicts of Interest

The authors declare that there are no conflicts of interest re-garding the publication of this paper

Acknowledgments

This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2015R1D1A1A01061328)

References

[1] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 137ndash150 2004

[2] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark Cluster Computing with Working Setsrdquo inProceedings of the in Proceedings of the 2nd USENIX Conferenceon Hot topics in Cloud Computing Boston MA June 2010

[3] G Malewicz M H Austern A J C Bik et al ldquoPregel a sys-tem for large-scale graph processingrdquo inProceedings of the Inter-national Conference on Management of Data (SIGMOD rsquo10) pp135ndash146 June 2010

[4] S Ghemawat H Gobioff and S Leung ldquoThe Google file sys-temrdquo in Proceedings of the the nineteenth ACM symposium p29 Bolton Landing NY USA October 2003

[5] X Shi M Chen L He et al ldquoMammoth Gearing HadoopTowards Memory-Intensive MapReduce Applicationsrdquo IEEETransactions on Parallel and Distributed Systems vol 26 no 8pp 2300ndash2315 2015

[6] L Person ldquoWorld Hadoop Market - Opportunities and Fore-castsrdquo Allied Market Research p 108 2020

[7] SNS Research The Big Data Market 2014ndash2020 OpportunitiesChallenges Strategies Industry Verticals and Forecasts SNSResearch 2014

[8] J Zhu J Li E Hardesty H Jiang and K-C Li ldquoGPU-in-Ha-doop Enabling MapReduce across distributed heterogeneousplatformsrdquo in Proceedings of the 2014 13th IEEEACIS Interna-tional Conference on Computer and Information Science ICIS2014 - Proceedings pp 321ndash326 China June 2014

[9] M Zaharia A Konwinski A D Joseph R H Katz I Stoicaand ldquo ldquoImproving Mapreduce Performance in Heteroge-neous Environments rdquordquo in Proceedings of the Proc EighthUSENIX Conf pp 29ndash42 2008

[10] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010

[11] D P Wall P Kudtarkar V A Fusaro R Pivovarov P Patil andP J Tonellato ldquoCloud computing for comparative genomicsrdquoBMC Bioinformatics vol 11 no 1 article 259 2010

[12] A Rosenthal P Mork M H Li J Stanford D Koester andP Reynolds ldquoCloud computing a new business paradigm forbiomedical information sharingrdquo Journal of Biomedical Infor-matics vol 43 no 2 pp 342ndash353 2010

[13] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014

[14] E Deelman G SinghM Livny B Berriman and J Good ldquoThecost of doing science on the cloud the montage examplerdquo inProceedings of the ACMIEEE Conference on SupercomputingIEEE Press November 2008

[15] N Chohan C Castillo M Spreitzer M Steinder A TantawiandCKrintz ldquoSee spot run using spot instances formapreduceworkflows inrdquo in Proceedings of the Proc 2010 USENIX Con-ference on Hot Topics in Cloud Computing p 7 2010

[16] R SThakur R Bandopadhyay B Chaudhary and S ChatterjeeldquoNow and next-generation sequencing techniques Future ofsequence analysis using cloud computingrdquo Frontiers in Geneticsvol 3 2012

[17] J Chen F QianW Yan and B Shen ldquoTranslational biomedicalinformatics in the cloud present and futurerdquo BioMed ResearchInternational vol 2013 Article ID 658925 8 pages 2013

[18] TNguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011

[19] M C Schatz ldquoCloudBurst highly sensitive read mapping withMapReducerdquoBioinformatics vol 25 no 11 pp 1363ndash1369 2009

[20] B Langmead K D Hansen and J T Leek ldquoCloud-scale RNA-sequencing differential expression analysis with Myrnardquo Ge-nome Biology vol 11 no 8 p R83 2010

[21] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 no 11 article R134 2009

[22] A A Al-Absi and D Kang ldquoA Novel Parallel ComputationModel with Efficient Local Memory Management for Data-Intensive Applicationsrdquo in Proceedings of the 2015 IEEE 8thInternational Conference on Cloud Computing (CLOUD) pp958ndash963 New York City NY USA June 2015

[23] A Al-Absi and Dae-K Kang I ldquoLong Read Alignment withParallelMapReduceCloud Platformrdquo BioMed Research Interna-tional vol 2015 Article ID 807407 13 pages 2015

[24] E Yoon and A Squicciarini ldquoToward Detecting CompromisedMapReduce Workers through Log Analysisrdquo in Proceedings ofthe 14th IEEEACM International Symposium on Cluster Cloudand Grid Computing pp 41ndash50 Chicago IL USA 2014

[25] D Dang Y Liu X Zhang and S Huang ldquoA CrowdsourcingWorker Quality Evaluation Algorithm on MapReduce for BigData Applicationsrdquo IEEE Transactions on Parallel and Dis-tributed Systems vol 27 no 7 pp 1879ndash1888 2016

[26] SDTetaliM Lesani RMajumdar andTMillstein ldquoMrCryptStatic analysis for secure cloud computationsrdquo in Proceedingsof the 2013 28th ACM SIGPLAN Conference on Object-OrientedProgramming Systems Languages and Applications OOPSLA2013 pp 271ndash286 USA October 2013

[27] B Alipanahi A Delong M T Weirauch and B J Frey ldquoPre-dicting the sequence specificities of DNA- and RNA-bindingproteins by deep learningrdquo Nature Biotechnology vol 33 no 8pp 831ndash838 2015

BioMed Research International 17

[28] K Mahadik S Chaterji B Zhou M Kulkarni and S BagchildquoOrion Scaling Genomic Sequence Matching with Fine-GrainedParallelizationrdquo inProceedings of the International Con-ference for High Performance Computing Networking Storageand Analysis SC 2014 pp 449ndash460 USA November 2014

[29] J Ekanayake T Gunarathne and J Qiu ldquoCloud technologiesfor bioinformatics applicationsrdquo IEEE Transactions on Paralleland Distributed Systems vol 22 no 6 pp 998ndash1011 2011

[30] X Huang and A Madan ldquoCAP3 a DNA sequence assemblyprogramrdquo Genome Research vol 9 no 9 pp 868ndash877 1999

[31] Y Wu W Guo J Ren X Zhao and W Zheng ldquo1198731198742 speedingup parallel processing of massive compute-intensive tasksrdquoInstitute of Electrical and Electronics Engineers Transactions onComputers vol 63 no 10 pp 2487ndash2499 2014

[32] Z Zhang L Cherkasova and B T Loo ldquoOptimizing cost andperformance trade-offs for MapReduce job processing in thecloudrdquo in Proceedings of the IEEEIFIP Network Operations andManagement Symposium Management in a Software DefinedWorld NOMS 2014 Poland May 2014

[33] K Chen J Powers S Guo and F Tian ldquoCRESP Towards opti-mal resource provisioning for MapReduce computing in publiccloudsrdquo IEEE Transactions on Parallel and Distributed Systemsvol 25 no 6 pp 1403ndash1412 2014

[34] T White Hadoop The Definitive Guide OReilly Media TheDefinitive Guide OrsquoReilly Media Hadoop 2009

[35] S F AltschulW Gish WMiller EWMyers andD J LipmanldquoBasic local alignment search toolrdquo Journal ofMolecular Biologyvol 215 no 3 pp 403ndash410 1990

[36] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov

Hindawiwwwhindawicom

International Journal of

Volume 2018

Zoology

Hindawiwwwhindawicom Volume 2018

Anatomy Research International

PeptidesInternational Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Journal of Parasitology Research

GenomicsInternational Journal of

Hindawiwwwhindawicom Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Hindawiwwwhindawicom Volume 2018

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Neuroscience Journal

Hindawiwwwhindawicom Volume 2018

BioMed Research International

Cell BiologyInternational Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Biochemistry Research International

ArchaeaHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Genetics Research International

Hindawiwwwhindawicom Volume 2018

Advances in

Virolog y Stem Cells International

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Enzyme Research

Hindawiwwwhindawicom Volume 2018

International Journal of

MicrobiologyHindawiwwwhindawicom

Nucleic AcidsJournal of

Volume 2018

Submit your manuscripts atwwwhindawicom

Page 14: Parallel MapReduce: Maximizing Cloud Resource Utilization ...downloads.hindawi.com/journals/bmri/2018/7501042.pdf · MapReduce platform. Experimental outcomes show that theirmodelachievesaspeedupof.xovermpiBLASTwith-outcompromisingonaccuracy

14 BioMed Research International

Query sequence q Training sequence

Local Memory

PMR Platform

Local Memory

PMR Platform

Trained data (weights)

Cloud storage

Shuffle Sort Reduce

Virtual Machine ndash Reduce Worker 1

Shuffle Sort Reduce

Virtual Machine ndash Reduce Worker w

Virtual Machine ndash Map Worker wVirtual Machine ndash Map Worker 1

Computed Score

Temporary Cloud storage

Cloud storage

PMR Platform PMR Platform

DeepBind DeepBindPM

R-M

aste

r Nod

eTraining data1

Training datan

Compute Binding Score1Compute Binding Scoren

Figure 11 DeepBind PMR framework

0 5 10 15 20 25 30R W 1R W 2R W 3R W 4R W 5R W 6

M W 1M W 2M W 3M W 4M W 5M W 6

Makespan Time (s)

Task

Exe

cutio

n

DeepBind Execution - Hadoop (Experiment 1)

(a) Worker node execution times on Hadoop framework

0 5 10 15 20R W 1R W 2R W 3R W 4R W 5R W 6

M W 1M W 2M W 3M W 4M W 5M W 6

Makespan Time (s)

Task

Exe

cutio

n

DeepBind Execution - PMR (Experiment 1)

(b) Worker node execution times on 119875119872119877 framework

Figure 12 DeepBind execution makespan of the Map and Reduce worker nodes (a) On Hadoop cluster of 6 nodes (b) On 119875119872119877 cluster of6 nodes

better performance and an average execution time reductionof 4857 is reported when compared to Hadoop Mapworker nodes Execution time of the six Reduce workernodes in Hadoop and 119875119872119877 is greater than Map workersas DeepBind analysis and identification of potential bindingsites is carried out during this phase Parallel executionstrategy of Reduce worker nodes is clear from Figure 12(b)The Reduce phase in 119875119872119877 commences after 5 seconds onceMap worker node 1 (MW1) and Map worker node 5 (MP5)have completed their task InHadoop that adopts a sequentialapproach the Reduce phase is initiated after all worker nodeshave completed their tasks Parallel execution of DeepBindanalysis utilizing all 4 computing cores available with Reduceworker nodes and parallel initiation of the Reduce phasein 119875119872119877 enable average Reduce execution time of 2222when compared to Hadoop Reduce worker nodes The totalmakespan observed for DeepBind experiment execution on

the Hadoop and 119875119872119877 cloud computing platforms is shownin Figure 13 Total makespan reduction of 3462 is achievedusing the 119875119872119877 framework when compared to the Hadoopframework Analysis results similar to [27] are reported forDeepBind analysis on the Hadoop and 119875119872119877 frameworksThe theoretical makespan computed using (18) for 119875119872119877 iscomparedwith the practical value observed in the experimentand the results obtained are shown in Figure 14 Aminor vari-ation between theoretical and practical values is observedThe variation observed is predominantly due to applicationdependent multiple cloud memory access operations Basedon results obtained for DeepBind analysis it is evident thatperformance on the 119875119872119877 framework is far superior to itsexecution on the existing Hadoop framework

On the basis of biomedical applications considered forperformance evaluation and results obtained it is evident thatthe proposed119875119872119877 framework exhibits superior performance

BioMed Research International 15

1Experiment Number

HadoopPMR

0

5

10

15

20

25

30

Exec

utio

n Ti

me (

s)

Figure 13 DeepBind analysis total makespan time observed for experiments conducted on 119875119872119877 and Hadoop frameworks

1Experiment Number

PMRPMR -Theory

0

5

10

15

20

25

30

Exec

utio

n Ti

me (

s)

Figure 14 Correlation between theoretical and practical makespan times for DeepBind analysis execution on PMR framework

when compared to its existing Hadoop counterpart InBLAST the Map and Reduce phases are utilized In CAP3application the Map phase plays a predominant role InDeepBind application analysis is carried out in the Reducephase The proposed 119875119872119877 cloud computing framework isrobust and is capable of the dynamic biomedical applicationscenarios presented deployment of 119875119872119877 on public andcustom cloud platforms In addition 119875119872119877 exhibits low exe-cution times and enables effective cloud resource utilizationLow execution times enable cost reduction always a desiredfeature

5 Conclusion and Future Work

The significance of cloud computing platforms is discussedThe commonly adopted Hadoop MapReduce frameworkworking with its drawbacks is presented To lower executiontimes and enable effective utilization of cloud resources thispaper proposes a 119875119872119877 cloud computing platform A parallelexecution strategy of the Map and Reduce phases is consid-ered in the 119875119872119877 framework TheMap and Reduce functionsof 119875119872119877 are designed to utilize multicore environments

available with worker nodesThe paper presents the proposed119875119872119877 framework architecture along with makespan model-ing Performance of the 119875119872119877 cloud computing frameworkis compared with the Hadoop framework For performanceevaluation computationally heavy biomedical applicationslike BLAST CAP3 and DeepBind are considered Averageoverall makespan times reduction of 3892 1800 and3462 is achieved using the 119875119872119877 framework when com-pared to the Hadoop framework for BLAST CAP3 andDeepBind applications The experiments presented prove therobustness of the 119875119872119877 platform its capability to handlediverse applications and ease of deployment on public andprivate cloud platformsThe results presented through the ex-periments conducted prove the superior performance of119875119872119877 against the Hadoop framework Good matching is re-ported between the theoretical makespan of the 119875119872119877 pre-sented and experimental values observed In addition adopt-ing the 119875119872119877 cloud computing framework also enables costreduction and efficient utilization of cloud resources

Performance study considering cloud cluster with manynodes additional applications and security provisioning to

16 BioMed Research International

cloud computing framework is considered as the future workof this paper

Data Availability

Thedata is available at the National Center for BiotechnologyInformation (2015) [Online] Available httpwwwncbinlmnihgov

Conflicts of Interest

The authors declare that there are no conflicts of interest re-garding the publication of this paper

Acknowledgments

This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2015R1D1A1A01061328)

References

[1] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 137ndash150 2004

[2] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark Cluster Computing with Working Setsrdquo inProceedings of the in Proceedings of the 2nd USENIX Conferenceon Hot topics in Cloud Computing Boston MA June 2010

[3] G Malewicz M H Austern A J C Bik et al ldquoPregel a sys-tem for large-scale graph processingrdquo inProceedings of the Inter-national Conference on Management of Data (SIGMOD rsquo10) pp135ndash146 June 2010

[4] S Ghemawat H Gobioff and S Leung ldquoThe Google file sys-temrdquo in Proceedings of the the nineteenth ACM symposium p29 Bolton Landing NY USA October 2003

[5] X Shi M Chen L He et al ldquoMammoth Gearing HadoopTowards Memory-Intensive MapReduce Applicationsrdquo IEEETransactions on Parallel and Distributed Systems vol 26 no 8pp 2300ndash2315 2015

[6] L Person ldquoWorld Hadoop Market - Opportunities and Fore-castsrdquo Allied Market Research p 108 2020

[7] SNS Research The Big Data Market 2014ndash2020 OpportunitiesChallenges Strategies Industry Verticals and Forecasts SNSResearch 2014

[8] J Zhu J Li E Hardesty H Jiang and K-C Li ldquoGPU-in-Ha-doop Enabling MapReduce across distributed heterogeneousplatformsrdquo in Proceedings of the 2014 13th IEEEACIS Interna-tional Conference on Computer and Information Science ICIS2014 - Proceedings pp 321ndash326 China June 2014

[9] M Zaharia A Konwinski A D Joseph R H Katz I Stoicaand ldquo ldquoImproving Mapreduce Performance in Heteroge-neous Environments rdquordquo in Proceedings of the Proc EighthUSENIX Conf pp 29ndash42 2008

[10] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010

[11] D P Wall P Kudtarkar V A Fusaro R Pivovarov P Patil andP J Tonellato ldquoCloud computing for comparative genomicsrdquoBMC Bioinformatics vol 11 no 1 article 259 2010

[12] A Rosenthal P Mork M H Li J Stanford D Koester andP Reynolds ldquoCloud computing a new business paradigm forbiomedical information sharingrdquo Journal of Biomedical Infor-matics vol 43 no 2 pp 342ndash353 2010

[13] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014

[14] E Deelman G SinghM Livny B Berriman and J Good ldquoThecost of doing science on the cloud the montage examplerdquo inProceedings of the ACMIEEE Conference on SupercomputingIEEE Press November 2008

[15] N Chohan C Castillo M Spreitzer M Steinder A TantawiandCKrintz ldquoSee spot run using spot instances formapreduceworkflows inrdquo in Proceedings of the Proc 2010 USENIX Con-ference on Hot Topics in Cloud Computing p 7 2010

[16] R SThakur R Bandopadhyay B Chaudhary and S ChatterjeeldquoNow and next-generation sequencing techniques Future ofsequence analysis using cloud computingrdquo Frontiers in Geneticsvol 3 2012

[17] J Chen F QianW Yan and B Shen ldquoTranslational biomedicalinformatics in the cloud present and futurerdquo BioMed ResearchInternational vol 2013 Article ID 658925 8 pages 2013

[18] TNguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011

[19] M C Schatz ldquoCloudBurst highly sensitive read mapping withMapReducerdquoBioinformatics vol 25 no 11 pp 1363ndash1369 2009

[20] B Langmead K D Hansen and J T Leek ldquoCloud-scale RNA-sequencing differential expression analysis with Myrnardquo Ge-nome Biology vol 11 no 8 p R83 2010

[21] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 no 11 article R134 2009

[22] A A Al-Absi and D Kang ldquoA Novel Parallel ComputationModel with Efficient Local Memory Management for Data-Intensive Applicationsrdquo in Proceedings of the 2015 IEEE 8thInternational Conference on Cloud Computing (CLOUD) pp958ndash963 New York City NY USA June 2015

[23] A Al-Absi and Dae-K Kang I ldquoLong Read Alignment withParallelMapReduceCloud Platformrdquo BioMed Research Interna-tional vol 2015 Article ID 807407 13 pages 2015

[24] E Yoon and A Squicciarini ldquoToward Detecting CompromisedMapReduce Workers through Log Analysisrdquo in Proceedings ofthe 14th IEEEACM International Symposium on Cluster Cloudand Grid Computing pp 41ndash50 Chicago IL USA 2014

[25] D Dang Y Liu X Zhang and S Huang ldquoA CrowdsourcingWorker Quality Evaluation Algorithm on MapReduce for BigData Applicationsrdquo IEEE Transactions on Parallel and Dis-tributed Systems vol 27 no 7 pp 1879ndash1888 2016

[26] SDTetaliM Lesani RMajumdar andTMillstein ldquoMrCryptStatic analysis for secure cloud computationsrdquo in Proceedingsof the 2013 28th ACM SIGPLAN Conference on Object-OrientedProgramming Systems Languages and Applications OOPSLA2013 pp 271ndash286 USA October 2013

[27] B Alipanahi A Delong M T Weirauch and B J Frey ldquoPre-dicting the sequence specificities of DNA- and RNA-bindingproteins by deep learningrdquo Nature Biotechnology vol 33 no 8pp 831ndash838 2015

BioMed Research International 17

[28] K Mahadik S Chaterji B Zhou M Kulkarni and S BagchildquoOrion Scaling Genomic Sequence Matching with Fine-GrainedParallelizationrdquo inProceedings of the International Con-ference for High Performance Computing Networking Storageand Analysis SC 2014 pp 449ndash460 USA November 2014

[29] J Ekanayake T Gunarathne and J Qiu ldquoCloud technologiesfor bioinformatics applicationsrdquo IEEE Transactions on Paralleland Distributed Systems vol 22 no 6 pp 998ndash1011 2011

[30] X Huang and A Madan ldquoCAP3 a DNA sequence assemblyprogramrdquo Genome Research vol 9 no 9 pp 868ndash877 1999

[31] Y Wu W Guo J Ren X Zhao and W Zheng ldquo1198731198742 speedingup parallel processing of massive compute-intensive tasksrdquoInstitute of Electrical and Electronics Engineers Transactions onComputers vol 63 no 10 pp 2487ndash2499 2014

[32] Z Zhang L Cherkasova and B T Loo ldquoOptimizing cost andperformance trade-offs for MapReduce job processing in thecloudrdquo in Proceedings of the IEEEIFIP Network Operations andManagement Symposium Management in a Software DefinedWorld NOMS 2014 Poland May 2014

[33] K Chen J Powers S Guo and F Tian ldquoCRESP Towards opti-mal resource provisioning for MapReduce computing in publiccloudsrdquo IEEE Transactions on Parallel and Distributed Systemsvol 25 no 6 pp 1403ndash1412 2014

[34] T White Hadoop The Definitive Guide OReilly Media TheDefinitive Guide OrsquoReilly Media Hadoop 2009

[35] S F AltschulW Gish WMiller EWMyers andD J LipmanldquoBasic local alignment search toolrdquo Journal ofMolecular Biologyvol 215 no 3 pp 403ndash410 1990

[36] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov

Hindawiwwwhindawicom

International Journal of

Volume 2018

Zoology

Hindawiwwwhindawicom Volume 2018

Anatomy Research International

PeptidesInternational Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Journal of Parasitology Research

GenomicsInternational Journal of

Hindawiwwwhindawicom Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Hindawiwwwhindawicom Volume 2018

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Neuroscience Journal

Hindawiwwwhindawicom Volume 2018

BioMed Research International

Cell BiologyInternational Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Biochemistry Research International

ArchaeaHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Genetics Research International

Hindawiwwwhindawicom Volume 2018

Advances in

Virolog y Stem Cells International

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Enzyme Research

Hindawiwwwhindawicom Volume 2018

International Journal of

MicrobiologyHindawiwwwhindawicom

Nucleic AcidsJournal of

Volume 2018

Submit your manuscripts atwwwhindawicom

Page 15: Parallel MapReduce: Maximizing Cloud Resource Utilization ...downloads.hindawi.com/journals/bmri/2018/7501042.pdf · MapReduce platform. Experimental outcomes show that theirmodelachievesaspeedupof.xovermpiBLASTwith-outcompromisingonaccuracy

BioMed Research International 15

1Experiment Number

HadoopPMR

0

5

10

15

20

25

30

Exec

utio

n Ti

me (

s)

Figure 13 DeepBind analysis total makespan time observed for experiments conducted on 119875119872119877 and Hadoop frameworks

1Experiment Number

PMRPMR -Theory

0

5

10

15

20

25

30

Exec

utio

n Ti

me (

s)

Figure 14 Correlation between theoretical and practical makespan times for DeepBind analysis execution on PMR framework

when compared to its existing Hadoop counterpart InBLAST the Map and Reduce phases are utilized In CAP3application the Map phase plays a predominant role InDeepBind application analysis is carried out in the Reducephase The proposed 119875119872119877 cloud computing framework isrobust and is capable of the dynamic biomedical applicationscenarios presented deployment of 119875119872119877 on public andcustom cloud platforms In addition 119875119872119877 exhibits low exe-cution times and enables effective cloud resource utilizationLow execution times enable cost reduction always a desiredfeature

5 Conclusion and Future Work

The significance of cloud computing platforms is discussedThe commonly adopted Hadoop MapReduce frameworkworking with its drawbacks is presented To lower executiontimes and enable effective utilization of cloud resources thispaper proposes a 119875119872119877 cloud computing platform A parallelexecution strategy of the Map and Reduce phases is consid-ered in the 119875119872119877 framework TheMap and Reduce functionsof 119875119872119877 are designed to utilize multicore environments

available with worker nodesThe paper presents the proposed119875119872119877 framework architecture along with makespan model-ing Performance of the 119875119872119877 cloud computing frameworkis compared with the Hadoop framework For performanceevaluation computationally heavy biomedical applicationslike BLAST CAP3 and DeepBind are considered Averageoverall makespan times reduction of 3892 1800 and3462 is achieved using the 119875119872119877 framework when com-pared to the Hadoop framework for BLAST CAP3 andDeepBind applications The experiments presented prove therobustness of the 119875119872119877 platform its capability to handlediverse applications and ease of deployment on public andprivate cloud platformsThe results presented through the ex-periments conducted prove the superior performance of119875119872119877 against the Hadoop framework Good matching is re-ported between the theoretical makespan of the 119875119872119877 pre-sented and experimental values observed In addition adopt-ing the 119875119872119877 cloud computing framework also enables costreduction and efficient utilization of cloud resources

Performance study considering cloud cluster with manynodes additional applications and security provisioning to

16 BioMed Research International

cloud computing framework is considered as the future workof this paper

Data Availability

Thedata is available at the National Center for BiotechnologyInformation (2015) [Online] Available httpwwwncbinlmnihgov

Conflicts of Interest

The authors declare that there are no conflicts of interest re-garding the publication of this paper

Acknowledgments

This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2015R1D1A1A01061328)

References

[1] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 137–150, 2004.

[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: cluster computing with working sets,” in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, Boston, MA, USA, June 2010.

[3] G. Malewicz, M. H. Austern, A. J. C. Bik et al., “Pregel: a system for large-scale graph processing,” in Proceedings of the International Conference on Management of Data (SIGMOD ’10), pp. 135–146, June 2010.

[4] S. Ghemawat, H. Gobioff, and S. Leung, “The Google file system,” in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP ’03), p. 29, Bolton Landing, NY, USA, October 2003.

[5] X. Shi, M. Chen, L. He et al., “Mammoth: gearing Hadoop towards memory-intensive MapReduce applications,” IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 8, pp. 2300–2315, 2015.

[6] L. Person, “World Hadoop Market: Opportunities and Forecasts,” Allied Market Research, p. 108, 2020.

[7] SNS Research, The Big Data Market 2014–2020: Opportunities, Challenges, Strategies, Industry Verticals and Forecasts, SNS Research, 2014.

[8] J. Zhu, J. Li, E. Hardesty, H. Jiang, and K.-C. Li, “GPU-in-Hadoop: enabling MapReduce across distributed heterogeneous platforms,” in Proceedings of the 13th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2014), pp. 321–326, China, June 2014.

[9] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, “Improving MapReduce performance in heterogeneous environments,” in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI ’08), pp. 29–42, 2008.

[10] L. D. Stein, “The case for cloud computing in genome informatics,” Genome Biology, vol. 11, no. 5, article 207, 2010.

[11] D. P. Wall, P. Kudtarkar, V. A. Fusaro, R. Pivovarov, P. Patil, and P. J. Tonellato, “Cloud computing for comparative genomics,” BMC Bioinformatics, vol. 11, no. 1, article 259, 2010.

[12] A. Rosenthal, P. Mork, M. H. Li, J. Stanford, D. Koester, and P. Reynolds, “Cloud computing: a new business paradigm for biomedical information sharing,” Journal of Biomedical Informatics, vol. 43, no. 2, pp. 342–353, 2010.

[13] D. Dahiphale, R. Karve, A. V. Vasilakos et al., “An advanced MapReduce: cloud MapReduce, enhancements and applications,” IEEE Transactions on Network and Service Management, vol. 11, no. 1, pp. 101–115, 2014.

[14] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good, “The cost of doing science on the cloud: the Montage example,” in Proceedings of the ACM/IEEE Conference on Supercomputing, IEEE Press, November 2008.

[15] N. Chohan, C. Castillo, M. Spreitzer, M. Steinder, A. Tantawi, and C. Krintz, “See spot run: using spot instances for MapReduce workflows,” in Proceedings of the 2010 USENIX Conference on Hot Topics in Cloud Computing, p. 7, 2010.

[16] R. S. Thakur, R. Bandopadhyay, B. Chaudhary, and S. Chatterjee, “Now and next-generation sequencing techniques: future of sequence analysis using cloud computing,” Frontiers in Genetics, vol. 3, 2012.

[17] J. Chen, F. Qian, W. Yan, and B. Shen, “Translational biomedical informatics in the cloud: present and future,” BioMed Research International, vol. 2013, Article ID 658925, 8 pages, 2013.

[18] T. Nguyen, W. Shi, and D. Ruden, “CloudAligner: a fast and full-featured MapReduce based tool for sequence mapping,” BMC Research Notes, vol. 4, article 171, 2011.

[19] M. C. Schatz, “CloudBurst: highly sensitive read mapping with MapReduce,” Bioinformatics, vol. 25, no. 11, pp. 1363–1369, 2009.

[20] B. Langmead, K. D. Hansen, and J. T. Leek, “Cloud-scale RNA-sequencing differential expression analysis with Myrna,” Genome Biology, vol. 11, no. 8, article R83, 2010.

[21] B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L. Salzberg, “Searching for SNPs with cloud computing,” Genome Biology, vol. 10, no. 11, article R134, 2009.

[22] A. A. Al-Absi and D.-K. Kang, “A novel parallel computation model with efficient local memory management for data-intensive applications,” in Proceedings of the 2015 IEEE 8th International Conference on Cloud Computing (CLOUD), pp. 958–963, New York City, NY, USA, June 2015.

[23] A. A. Al-Absi and D.-K. Kang, “Long read alignment with Parallel MapReduce cloud platform,” BioMed Research International, vol. 2015, Article ID 807407, 13 pages, 2015.

[24] E. Yoon and A. Squicciarini, “Toward detecting compromised MapReduce workers through log analysis,” in Proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 41–50, Chicago, IL, USA, 2014.

[25] D. Dang, Y. Liu, X. Zhang, and S. Huang, “A crowdsourcing worker quality evaluation algorithm on MapReduce for big data applications,” IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 7, pp. 1879–1888, 2016.

[26] S. D. Tetali, M. Lesani, R. Majumdar, and T. Millstein, “MrCrypt: static analysis for secure cloud computations,” in Proceedings of the 28th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2013), pp. 271–286, USA, October 2013.

[27] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, “Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning,” Nature Biotechnology, vol. 33, no. 8, pp. 831–838, 2015.

[28] K. Mahadik, S. Chaterji, B. Zhou, M. Kulkarni, and S. Bagchi, “Orion: scaling genomic sequence matching with fine-grained parallelization,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2014), pp. 449–460, USA, November 2014.

[29] J. Ekanayake, T. Gunarathne, and J. Qiu, “Cloud technologies for bioinformatics applications,” IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 6, pp. 998–1011, 2011.

[30] X. Huang and A. Madan, “CAP3: a DNA sequence assembly program,” Genome Research, vol. 9, no. 9, pp. 868–877, 1999.

[31] Y. Wu, W. Guo, J. Ren, X. Zhao, and W. Zheng, “NO2: speeding up parallel processing of massive compute-intensive tasks,” IEEE Transactions on Computers, vol. 63, no. 10, pp. 2487–2499, 2014.

[32] Z. Zhang, L. Cherkasova, and B. T. Loo, “Optimizing cost and performance trade-offs for MapReduce job processing in the cloud,” in Proceedings of the IEEE/IFIP Network Operations and Management Symposium: Management in a Software Defined World (NOMS 2014), Poland, May 2014.

[33] K. Chen, J. Powers, S. Guo, and F. Tian, “CRESP: towards optimal resource provisioning for MapReduce computing in public clouds,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 6, pp. 1403–1412, 2014.

[34] T. White, Hadoop: The Definitive Guide, O’Reilly Media, 2009.

[35] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.

[36] National Center for Biotechnology Information, 2015, http://www.ncbi.nlm.nih.gov.


