
PIC: Partitioned Iterative Convergence for Clusters

Reza Farivar†, Anand Raghunathan‡, Srimat Chakradhar∗, Harshit Kharbanda†, Roy H. Campbell†

†University of Illinois, ‡Purdue University, ∗NEC Labs America

Email: [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract—Iterative-convergence algorithms are frequently used in a variety of domains to build models from large data sets. Cluster implementations of these algorithms are commonly realized using parallel programming models such as MapReduce. However, these implementations suffer from significant performance bottlenecks, especially due to large volumes of network traffic resulting from intermediate data and model updates during the iterations.

To address these challenges, we propose partitioned iterative convergence (PIC), a new approach to programming and executing iterative convergence algorithms on frameworks like MapReduce. In PIC, we execute the iterative-convergence computation in two phases - the best-effort phase, which quickly produces a good initial model, and the top-off phase, which further refines this model to produce the final solution. The best-effort phase iteratively performs the following steps: (a) partition the input data and the model to create several smaller, model-building sub-problems, (b) independently solve these sub-problems using iterative convergence computations, and (c) merge solutions of the sub-problems to create the next version of the model. This partitioned, loosely coupled execution of the computation produces a model of good quality, while drastically reducing network traffic due to intermediate data and model updates. The top-off phase further refines this model by employing the original iterative-convergence computation on the entire (un-partitioned) problem until convergence. However, the number of iterations executed in the top-off phase is quite small, resulting in a significant overall improvement in performance.

We have implemented a library for PIC on top of the Hadoop MapReduce framework, and evaluated it using five popular iterative-convergence algorithms (PageRank, K-Means clustering, neural network training, linear equation solver and image smoothing). Our evaluations on clusters ranging from 6 nodes to 256 nodes demonstrate a 2.5X-4X speedup compared to conventional implementations using Hadoop.

I. INTRODUCTION

Algorithms from a wide range of emerging domains, including data analytics, web search, social networks, and recognition, mining and synthesis (RMS) [1], build models from a large corpus of unstructured input data. To do so, they often use Iterative-Convergence (IC) algorithms, which iteratively execute a computation that refines the model until a specified convergence criterion is satisfied.

Due to the scale of data sets that they process, IC algorithms are frequently realized on clusters using high-level programming models like MapReduce [2]. For example, the Apache Mahout [3] project provides implementations of a wide range of IC algorithms using the Hadoop framework [4]. In these conventional implementations of IC algorithms on MapReduce, the data-parallel computations performed in each iteration are expressed as one or more MapReduce jobs. This approach offers a quick path to implementation and leverages the strengths of MapReduce such as ease of programming, load balancing, and fault tolerance. However, recent research has demonstrated that cluster implementations of iterative-convergence algorithms based on MapReduce suffer from performance degradation due to repeated reads of input data and repeated creation and termination of MapReduce jobs in each iteration [5], [6], [7]. These issues have been addressed by modifying the MapReduce framework to provide mechanisms for caching invariant data [5], [6], [7], long-running tasks [5], and by making the run-time scheduler loop aware [7], resulting in performance improvements.

In this paper, we identify two key bottlenecks (not addressed by previous work) that limit the performance of cluster implementations of IC algorithms with MapReduce. The first bottleneck is the large volume of intermediate data (also referred to as the shuffle data) between map and reduce tasks in each iteration. Note that, due to the all-to-all nature of the shuffle traffic, it stresses the cluster bisection, a resource that is both scarce and difficult to scale [2], [8]. Second, in applications where the model itself is large, there is a large amount of network traffic due to model updates in each iteration. These bottlenecks cannot be alleviated by previous work [5], [6], [7], since the shuffle data and model are not constant and therefore cannot be cached.

We propose a new approach called partitioned iterative convergence (PIC) to express iterative-convergence algorithms for parallel execution, and describe a programming framework for PIC that greatly improves performance while preserving the other benefits of MapReduce. We exploit a key property of most IC algorithms - that they start with an arbitrary initial model (often chosen randomly) and produce an acceptable model at convergence. However, the time to convergence depends on the specific choice of the initial model. The key insight in PIC is to view iterative convergence as a two-phase process. In the first phase, we use a transformed version of the original computation to quickly produce a good initial model. Then, in the second phase, we use this model as the starting point and further refine the model by using only a few iterations of the original computation.

In the first phase of PIC, which we refer to as the best-effort phase, we partition the problem (input data and model) into a number of smaller model-building sub-problems, by using a programmer-specified partition function. Although there are typically dependencies across sub-problems, we ignore these dependencies and perform independent iterative-convergence computations, in parallel, and without any synchronization or communication between them (we refer to these as "local iterations"). The models computed by the sub-problems are combined using a programmer-specified merge function into a single model. The above process is repeated with the new, single model as the starting point (we refer to this process as "best-effort iterations"), until the single model satisfies a best-effort convergence criterion. In the second phase of PIC, which we refer to as the top-off phase, we refine the model computed in the best-effort phase, by using iterations of the original computation (i.e., without partitioning or ignoring dependencies) until convergence.

Re-structuring iterative-convergence computations as espoused by PIC addresses the performance bottlenecks in MapReduce due to intermediate data and model updates, since (i) there is no traffic across partitions while the local iterations are executed in the best-effort phase, (ii) while cluster-wide communication is required once during each best-effort iteration in the best-effort phase, the number of best-effort iterations is typically quite small since most of the model refinement is accomplished by the local iterations, and (iii) the number of iterations required in the top-off phase is quite small in practice. Therefore, as shown in our results, PIC implementations achieve significant performance improvements over the state-of-the-art.

[Figure omitted.] Figure 1. Conventional implementation of an iterative convergence algorithm using MapReduce: (a) generic template, and (b) K-means clustering example

Our proposal requires no modifications to the underlying MapReduce runtime framework (we build on top of the MapReduce framework). Therefore, previous optimizations of MapReduce frameworks for iterative algorithms [5], [6], [7] can be fully leveraged. In addition, the effort required to migrate conventional implementations into the PIC framework is small since the original implementation is fully re-used to solve the sub-problems in the best-effort phase, as well as to realize the top-off phase.

To evaluate the benefits of our proposal, we have developed a library for PIC on top of the Hadoop MapReduce framework [4]. We have implemented five popular iterative-convergence algorithms (PageRank, K-Means clustering, neural network training, linear equation solver and image smoothing) using PIC. We compare the performance of PIC implementations to conventional MapReduce implementations that are already optimized to eliminate the overheads of repeated job creation and reads of input data [5], [6], [7]. Our results demonstrate that PIC achieves speedups of 2.5X-4X across clusters of 6-256 nodes.

The rest of the paper is organized as follows. Section II motivates our work by demonstrating the performance issues affecting conventional MapReduce based implementations of iterative-convergence algorithms. Section III describes the PIC framework. Section IV provides illustrative examples of how applications are specified using PIC. Section V evaluates the proposed approach by comparing PIC implementations with the baseline in terms of performance and scalability. Section VI discusses the effectiveness of the best-effort phase, which is key to the performance improvements of PIC. Section VIII discusses related efforts and places our contributions in their context, and Section IX concludes the paper.

II. MOTIVATION

Figure 1(a) shows a generic template of a MapReduce implementation of an iterative convergence algorithm. The iterations of the do-until loop generate successive versions of the model. In each iteration, the input data (d) and the current model (m) are processed in a data-parallel manner by a MapReduce job to update the model (in general, multiple chained MapReduce jobs may be used). The map functions produce intermediate data (key-value pairs). The reduce functions process the intermediate data to generate the next update of the model. Although the input data does not change from one iteration to the next, the model data is updated after every iteration. A convergence criterion is evaluated on the model to decide when to terminate the iterations. Figure 1(b) shows how the popular K-means clustering algorithm [9] is realized by using the template of Figure 1(a). The clustering algorithm partitions a set of points into clusters, where each cluster is represented by a centroid, which is the average of the points in the cluster. The map computation associates each point with the centroid that it is closest to, while the reduce computation re-computes the values of the centroids. The computation is repeated until the centroids do not change by more than a threshold from the previous iteration, or if no more than a certain fraction of the points change clusters from the previous iteration.
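To make the per-iteration structure concrete, the following is a minimal, single-machine sketch of the map and reduce logic just described for K-means. It is illustrative only: plain Java rather than the Hadoop implementation of Figure 1(b), and the class and method names are our own.

```java
import java.util.*;

// Single-machine sketch of one K-means iteration in the map/reduce style of
// Figure 1(b). Hypothetical helper names; not the paper's Hadoop code.
class KMeansIterationSketch {
    // "map": group every point under the id of its closest centroid
    static Map<Integer, List<double[]>> map(List<double[]> points, double[][] centroids) {
        Map<Integer, List<double[]>> groups = new HashMap<>();
        for (double[] p : points) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double d = 0;
                for (int i = 0; i < p.length; i++) {
                    double diff = p[i] - centroids[c][i];
                    d += diff * diff;
                }
                if (d < bestDist) { bestDist = d; best = c; }
            }
            groups.computeIfAbsent(best, k -> new ArrayList<>()).add(p);
        }
        return groups;
    }

    // "reduce": recompute each centroid as the mean of the points assigned to it
    static double[][] reduce(Map<Integer, List<double[]>> groups, double[][] oldCentroids) {
        double[][] next = new double[oldCentroids.length][];
        for (int c = 0; c < oldCentroids.length; c++) {
            List<double[]> pts = groups.getOrDefault(c, Collections.emptyList());
            if (pts.isEmpty()) { next[c] = oldCentroids[c].clone(); continue; }
            double[] mean = new double[oldCentroids[c].length];
            for (double[] p : pts)
                for (int i = 0; i < mean.length; i++) mean[i] += p[i];
            for (int i = 0; i < mean.length; i++) mean[i] /= pts.size();
            next[c] = mean;
        }
        return next;
    }
}
```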

We have observed that MapReduce based implementations of iterative-convergence algorithms suffer from performance degradation due to the following factors.

• MapReduce intermediate data. The intermediate key-value pairs have to be communicated across the cluster interconnect because of the all-to-all nature of this communication. Despite well-known optimizations (use of combiners and overlapping shuffle with the Map phase [2]), a large volume of intermediate data is quite common, with adverse impact on application performance.

• Model updates. Since we update the model in each iteration, it is necessary to synchronize at the end of each iteration and communicate data across the cluster interconnect to update the model. Note that the model is stored in the cluster file system with replicas (for fault tolerance), hence the performance impact of frequent model updates is significant, and severe for applications where the model is large.

Note that the above factors are over and beyond the issues addressed in previous work [5], [6], [7], such as repeated initialization of the MapReduce runtime and repeated reads of constant input data. Addressing these factors will therefore require new techniques that go beyond these prior proposals.

The key insight motivating the proposed work is that IC algorithms produce acceptable results while starting from an arbitrary model. However, the number of iterations necessary for convergence is strongly dependent on the initial starting point. Therefore, we view iterative-convergence computation as a two-phase process that consists of (i) coming up with a good starting point (initial model), and (ii) refining the initial model to generate the final solution. Of course, determining a good initial model, in general, can be as difficult as finding the solution in the first place. Moreover, we do not want to burden the programmer with the task of coming up with a new algorithm to generate an initial model. In this work, we demonstrate that breaking up the original problem into sub-problems, and executing iterative convergence on the sub-problems in a loosely coupled manner, can generate a very good initial model with drastic improvements in efficiency. The second phase starts off with the model produced by the first phase and runs the unmodified iterative convergence computation. However, this phase now requires far fewer iterations than a conventional implementation.

[Figure omitted: run time and shuffle traffic comparison.] Figure 2. Run time and shuffle traffic for K-means clustering (100 million points into 100 clusters, 64-node cluster).

Figure 2 quantitatively demonstrates this insight for K-means clustering of 100 million points into 100 clusters. The graph on the left compares the execution time of the conventional MapReduce implementation that follows Figure 1(b) with the implementation using the proposed PIC framework (described further in the following sections). The execution time for the PIC case is broken down into the time consumed in each of the phases. The graph on the right shows a similar comparison for the total volume of intermediate data and model updates. The graphs show that (i) the best-effort phase executes in around one-fifth the time of the conventional implementation, primarily due to a drastic reduction in cluster interconnect traffic due to shuffle and model updates, and (ii) the top-off phase requires around one-sixth the number of iterations of the conventional implementation, resulting in proportionally lower execution time and traffic. Overall, the PIC implementation achieves around a 3X speedup over the conventional implementation.

In the next section, we present a new programming framework called partitioned iterative convergence (PIC) that embodies the two-phase approach described above. We note that the PIC framework is built on top of MapReduce. Therefore, we leverage the inherent advantages of the MapReduce programming model, as well as all prior improvements [5], [6], [7] to the MapReduce run-times.

III. PARTITIONED ITERATIVE CONVERGENCE

In this section, we describe the proposed partitioned iterative convergence (PIC) approach, and discuss the programmer effort involved in expressing iterative-convergence applications using PIC.

A. Overview

Figure 3 presents a generic template for an algorithm expressed using the proposed partitioned iterative convergence (PIC) approach. Given an input data set (d), and an initial model (m), the best-effort phase partitions the input data and the model to create several smaller model-building sub-problems. Dependencies across the sub-problems are initially ignored, and conventional iterative convergence methods are used to solve the sub-problems. The solutions to the sub-problems (i.e., updated models) are then merged using problem-specific merge functions to create a single, merged model. If the merged model does not meet a convergence criterion, we continue the best-effort phase. Otherwise, we end the best-effort phase and enter the top-off phase. This second phase further refines the model produced by the best-effort phase by using conventional implementations of iterative convergence (i.e., without partitioning or ignoring dependencies). We end the top-off phase when the model satisfies the specified convergence criterion.

[Figure omitted.] Figure 3. Partitioned iterative-convergence: A programming template
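The control flow of this template can be summarized with the following sketch. It is purely illustrative: the Model and Data type parameters and the helper methods stand in for the PIC API and the Hadoop jobs described in the rest of this section, and are not the library's actual signatures.

```java
import java.util.List;

// Control-flow sketch of the PIC template of Figure 3. Model, Data, and the
// abstract helpers are hypothetical placeholders for the PIC API.
abstract class PicDriverSketch<Model, Data> {
    abstract List<Data>  partition(Data d, int numParts);          // split the input data
    abstract List<Model> partitionModel(Model m, int numParts);    // split or replicate the model
    abstract Model       runLocalIC(Data dPart, Model mPart);      // local iterations to convergence
    abstract Model       merge(List<Model> partials);              // combine partial models
    abstract boolean     beConverged(Model prev, Model next);      // best-effort convergence test
    abstract Model       runOneGlobalIteration(Data d, Model m);   // one unpartitioned IC iteration
    abstract boolean     converged(Model prev, Model next);        // original convergence test

    Model run(Data d, Model m, int numParts) {
        // Best-effort phase: solve sub-problems independently, then merge.
        Model current = m;
        while (true) {
            List<Data>  dParts = partition(d, numParts);
            List<Model> mParts = partitionModel(current, numParts);
            for (int i = 0; i < numParts; i++)
                mParts.set(i, runLocalIC(dParts.get(i), mParts.get(i)));  // no cross-partition traffic
            Model next = merge(mParts);
            boolean done = beConverged(current, next);
            current = next;
            if (done) break;
        }
        // Top-off phase: refine with the original, unpartitioned computation.
        while (true) {
            Model next = runOneGlobalIteration(d, current);
            boolean done = converged(current, next);
            current = next;
            if (done) break;
        }
        return current;
    }
}
```

In this view, only the merge and the best-effort convergence test touch the whole model, which is why cluster-wide communication is incurred just once per best-effort iteration.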

Figure 4 summarizes the PIC programming model API that application developers can use to realize their iterative convergence applications. Note that, except for three functions (partition, merge, and BE_converged), all the other functions in the API are standard, in the sense that they are necessary to realize any iterative convergence application on a MapReduce framework. The map, reduce and converged functions of a conventional implementation (Figure 1(a)) correspond to similarly named functions in Figure 4.

B. Best-effort phase

The best-effort phase is key to the performance improvements achieved by PIC, and is governed by three functions (partition, merge, and BE_converged). The application developer can either implement these functions, or use the default implementations in the PIC framework.

The partition function is useful to control the number and size of the sub-problems, as well as the degree of parallelism. For example, we can create more sub-problems than the number of nodes available in a cluster. Or, we can size a sub-problem so that a group of tightly-coupled nodes (e.g., a rack) can execute the sub-problem. However, more sub-problems of smaller size can increase the number of best-effort iterations that the best-effort phase may require to converge. For a given number of sub-problems, the partitioning function should try to reduce the dependencies between the sub-problems so that the number of best-effort iterations, as well as the number of iterations required in the top-off phase, are minimized.

The specific choice of the partition function is application-dependent, much like the contents of map and reduce in the MapReduce framework. In some problems (for example, the PageRank case study in Section IV-B), we partition both the input data and the model. In other cases (for example, the K-Means case study in Section IV-A), it is more appropriate to partition the input data, but create multiple copies of the model. The complexity of the partition function may range from simple techniques like randomly breaking up the input data and/or model (in which case the programmer can simply use the default partitioner classes provided by PIC), to sophisticated partitioning schemes such as min-cut graph partitioning.

[Figure omitted.] Figure 4. The Partitioned Iterative-Convergence user-facing API. Most of the API is anyway necessary for a MapReduce implementation of IC. Only three extra functions, shown in italics, are needed for the best-effort phase. Also, our library is based on templates, and any Hadoop-provided data structure object (e.g. Text, IntWritable, LongWritable, etc.) can be used within PIC for key/value pairs.
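As an illustration of the simplest strategy mentioned above, the following sketch randomly scatters input records across sub-problems. It is a hypothetical stand-in for the library's default partitioner classes, which operate on Hadoop key/value records rather than in-memory lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of a default random partition strategy: scatter input records
// uniformly across a chosen number of sub-problems. Illustrative only.
class RandomPartitionSketch {
    static <T> List<List<T>> partition(List<T> records, int numPartitions, long seed) {
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) parts.add(new ArrayList<>());
        Random rng = new Random(seed);
        for (T r : records) parts.get(rng.nextInt(numPartitions)).add(r);
        return parts;
    }
}
```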

The merge function depends on the strategy used by the partition function. For example, if the partition function divides the model into disjoint parts that are updated by the different sub-problems, then the merge function may simply piece them back together. On the other hand, if copies of the model are created by the partition function, then they may be "averaged" or aggregated to construct the merged model. Similar to the partition function, the programmer can either specify an application-specific merge function, or use one of the defaults provided by PIC. For models that can be represented as vectors, the default merge functions can concatenate the vectors from sub-problems into a single vector, sum the vectors, or average the respective entries in the vectors.
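A minimal sketch of these default merge strategies for vector models is shown below. Plain arrays are used for clarity; the library itself operates on key/value pairs, so this is an illustration of the idea rather than the library code.

```java
import java.util.List;

// Sketch of the three default merge strategies for vector models:
// concatenate disjoint parts, or sum / average replicated copies.
class VectorMergeSketch {
    // Disjoint model parts: piece them back together.
    static double[] concatenate(List<double[]> parts) {
        int total = 0;
        for (double[] p : parts) total += p.length;
        double[] out = new double[total];
        int pos = 0;
        for (double[] p : parts) {
            System.arraycopy(p, 0, out, pos, p.length);
            pos += p.length;
        }
        return out;
    }

    // Replicated model copies of equal length: sum the entries element-wise.
    static double[] sum(List<double[]> copies) {
        double[] out = new double[copies.get(0).length];   // assumes at least one copy
        for (double[] c : copies)
            for (int i = 0; i < out.length; i++) out[i] += c[i];
        return out;
    }

    // Replicated model copies: average the respective entries.
    static double[] average(List<double[]> copies) {
        double[] out = sum(copies);
        for (int i = 0; i < out.length; i++) out[i] /= copies.size();
        return out;
    }
}
```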

Finally, PIC uses the BE_converged function to determine if the best-effort phase can be terminated. In principle, the application developer can use the same criterion that they specified using the converged function of the conventional implementation, or they can specify a much looser criterion to quickly terminate the best-effort phase.

In the next section, we provide examples of how these three functions are specified for real applications.

Figure 5 compares the execution flow of the best-effort phase of PIC with the conventional implementation of Figure 1(a). By processing sub-problems while ignoring their dependencies, communication traffic on the cluster interconnect due to shuffle data and model updates is drastically reduced. As we show in Section V, this leads to drastically improved performance in the best-effort phase, while providing a very good starting model for the top-off phase. From a different perspective, conventional MapReduce implementations can only exploit parallelism within each iteration. The best-effort phase of PIC introduces an additional degree of parallelism - sub-problems that can be solved independently - beyond the opportunity of exploiting parallelism within each iteration of a sub-problem. By increasing the amount of parallelism, the best-effort phase can scale more easily than the conventional implementation.

[Figure omitted.] Figure 5. Comparison of conventional implementation of iterative convergence and the best-effort phase of PIC

Finally, an important special case of the best-effort phase of PIC is worth noting. If the number of partitions is one, the merge function is the identity function (i.e., it returns the only model it receives), and the BE_converged function terminates the best-effort process after only one iteration, then the best-effort phase of PIC degenerates to the conventional implementation of Figure 1(a). This is important because it implies that PIC requires no additional programmer effort to realize the top-off phase.

C. Programming with PIC

In MapReduce, the map function specifies the computation to be performed on one element of the input data, leaving the runtime framework to partition the input data set and invoke the map function on each data element. However, these semantics are insufficient for iterative convergence methods because they also have a need to:

• Pass the model, as well as the input data, to map functions.
• Keep two copies of the model - current and previous - to evaluate the convergence criterion.
• Replace the model from a previous iteration with the model computed in the current iteration.

In addition, the best-effort phase of PIC also has a need to:

• Generate models for each sub-problem, starting from a single, unified model at the beginning of a best-effort iteration.
• Collect models from sub-problems to implement the merge function.

While the above operations on the model can be implemented by application code that uses a conventional MapReduce framework like Hadoop, this is a significant burden on the application developer.

The best-effort phase can further introduce a burden on the programmer. Many implementations of partition and merge functions require that the model be divided into elements that are uniquely identifiable and operable. Given two models that need to be merged, we may have to first establish the correspondence of elements in the two models. For example, consider the case of K-means. Assume that we have two sets of centroids (models) that were computed by two different sub-problems. If the merge function requires that corresponding centroids be averaged, then we have to keep track of which centroid in one model (centroid set) corresponds to which centroid in the other.

PIC only requires that the model be expressed in the form of key/value pairs to facilitate splitting a model into elements and identification of correspondence between elements. Representing the model as key/value pairs also allows the merge function itself to execute in a distributed fashion as a MapReduce job.

IV. CASE STUDIES

To test the proposed framework, we have developed five iterative-convergence algorithms using PIC: K-means clustering, PageRank computation, neural network training using back propagation, image smoothing and a linear equation solver. In this section we briefly describe the map and reduce computations and the application-specific functions (partition(), merge(), and converged()) that we used for the PIC implementations of the first two examples.

A. K-Means Clustering

K-means is an iterative convergence algorithm designed to create a representative model (k centroids) from a data set (a body of "points" in a Cartesian space of n dimensions).

[Figure omitted.] Figure 6. PIC implementation of K-means (for the IC implementation, see Figure 1(b))

Figure 6 shows the PIC implementation of K-means. We fully re-use the IC implementation shown in Figure 1(b).

We used a simple random partition function for K-means. In a best-effort iteration of the PIC implementation, each sub-problem performs as many local iterations as necessary to obtain a converged partial model. The convergence criterion we used is the same as the criterion used in the IC implementation. Every iteration of the IC implementation starts with a model (i.e., a set of proposed values for the K centroids). At the end of the iteration, we have a refined model (i.e., a new set of values for the K centroids). If the change in the value of all the K centroids is within a pre-specified threshold, we conclude that the model has converged. We used the same convergence criterion for every sub-problem in PIC, and for detecting best-effort convergence in PIC. Each sub-problem computes a partial model (a set of K centroid values). So, for each centroid, we have a value from every sub-problem. Our merge function identifies corresponding centroid values from each partition and averages them to compute the centroid values for the unified model.
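A minimal sketch of this merge step is shown below, assuming each partial model is a map from centroid id to centroid coordinates. The in-memory types are simplified stand-ins for the library's key/value representation.

```java
import java.util.*;

// Sketch of the K-means merge described above: corresponding centroids
// (matched by their id) from the different sub-problems are averaged to
// form the unified model. Illustrative types, not the library's API.
class KMeansMergeSketch {
    // partialModels: one map per sub-problem, centroidId -> centroid coordinates
    static Map<Integer, double[]> merge(List<Map<Integer, double[]>> partialModels) {
        Map<Integer, double[]> merged = new HashMap<>();
        Map<Integer, Integer> counts = new HashMap<>();
        for (Map<Integer, double[]> partial : partialModels) {
            for (Map.Entry<Integer, double[]> e : partial.entrySet()) {
                double[] acc = merged.get(e.getKey());
                if (acc == null) {
                    merged.put(e.getKey(), e.getValue().clone());
                    counts.put(e.getKey(), 1);
                } else {
                    for (int i = 0; i < acc.length; i++) acc[i] += e.getValue()[i];
                    counts.put(e.getKey(), counts.get(e.getKey()) + 1);
                }
            }
        }
        // Divide each accumulated centroid by the number of partitions that reported it.
        for (Map.Entry<Integer, double[]> e : merged.entrySet()) {
            int n = counts.get(e.getKey());
            for (int i = 0; i < e.getValue().length; i++) e.getValue()[i] /= n;
        }
        return merged;
    }
}
```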

B. Page Rank

The PageRank algorithm is used to obtain a best-effort ordering of a set of web pages. The input to the algorithm is a web graph. There is one vertex in the graph for each web page (or URL). A directed edge from a vertex to another vertex implies that the source web page has a hyperlink to the destination web page. The PageRank value for a vertex is a function of the number of web pages that refer to the vertex, either directly or through other web pages. The solution (or model) created by the PageRank algorithm is a set of PageRanks for all vertices in the web graph. The PageRank algorithm also assigns a score to every edge in the graph. This information is not reported as a solution, but edge scores are used to compute the vertex PageRanks. Therefore, we consider the set of edge scores as part of the model in our PIC implementation.

As shown in Figure 7, every iteration in the PageRank algorithm consists of two phases: aggregation and propagation (this reflects the implementation in the Nutch [10] open source search engine). In the aggregation phase, we update the PageRank of every vertex by aggregating the scores of its incoming edges. The PageRank of vertex i is computed from the scores of its incoming edges, as follows:

$$\mathrm{PageRank}_i = (1 - c) + c \cdot \sum_{j} \mathrm{edge}_{ji}$$

In the above formula, c is a pre-specified constant (the damping factor, typically 0.85), and edge_{ji} is the score of the directed edge from vertex j to vertex i. In the propagation phase, we update the score of every edge. The score of edge_{ji} is the ratio of the PageRank of vertex j to the number of outgoing edges of vertex j.

[Figure omitted.] Figure 7. IC implementation of PageRank

[Figure omitted.] Figure 8. PIC implementation of PageRank
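The following sketch shows one such iteration (aggregation followed by propagation) on an in-memory graph. The data layout and names are illustrative and do not correspond to the Nutch code.

```java
import java.util.*;

// Sketch of one PageRank iteration as described above: aggregation updates
// each vertex's PageRank from its incoming edge scores, propagation updates
// each outgoing edge's score from its source vertex.
class PageRankIterationSketch {
    static final double C = 0.85;  // damping factor

    // edgeScore.get(j).get(i) = score of edge j -> i
    static void iterate(Map<Integer, Map<Integer, Double>> edgeScore,
                        Map<Integer, Double> pageRank) {
        // Aggregation: PageRank_i = (1 - c) + c * sum_j edge_{ji}
        Map<Integer, Double> incomingSum = new HashMap<>();
        for (Map<Integer, Double> out : edgeScore.values())
            for (Map.Entry<Integer, Double> e : out.entrySet())
                incomingSum.merge(e.getKey(), e.getValue(), Double::sum);
        for (Map.Entry<Integer, Double> v : pageRank.entrySet())
            v.setValue((1 - C) + C * incomingSum.getOrDefault(v.getKey(), 0.0));

        // Propagation: edge_{ji} = PageRank_j / outdegree(j)
        for (Map.Entry<Integer, Map<Integer, Double>> src : edgeScore.entrySet()) {
            double share = pageRank.getOrDefault(src.getKey(), 0.0) / src.getValue().size();
            for (Integer dst : src.getValue().keySet()) src.getValue().put(dst, share);
        }
    }
}
```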

Figure 8 shows the PIC implementation of PageRank. We partition the web graph into sub-graphs, by splitting the vertices into disjoint groups. Vertices and the edges that are fully contained in a group (i.e., the two vertices of the edge are in the same vertex group) form a sub-graph that is assigned to a partition. The convergence criterion for a sub-problem is the same as the criterion used in the conventional IC implementation. The PageRank implementation in Nutch automatically terminates after a pre-specified number of iterations, independent of the quality of the solution. For the PIC implementation, we also terminate the local and best-effort iterations after a pre-set iteration limit.

During the local iterations, every sub-problem updates the PageRank or edge scores of vertices and edges included in its partition. At the end of the local iterations, the various sub-problems have computed PageRanks for all vertices. However, edge scores have been computed only for edges that are fully inside a partition. In particular, no edge scores have been computed for edges that are between partitions. The merge function first computes the scores for all outgoing edges from a partition (i.e., the source vertex of the edge is in the partition but the destination vertex is in another partition). Then, the merge function also updates the PageRanks of the destination vertices of all outgoing edges. This is the only mechanism we have used to factor in the dependencies between the sub-problems. As in the K-means case, one can develop more complex mechanisms to consider the effect of other sub-problems.
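A sketch of this cross-partition step is shown below, under one plausible reading of the description above; the data layout, names, and the exact way the cross-edge contributions are folded into the destination PageRanks are our own simplifications, not the paper's code.

```java
import java.util.*;

// Sketch of the cross-partition work done by the PageRank merge function:
// score the edges that leave each partition using the locally computed
// PageRanks, then fold those scores into the destination vertices' ranks.
class PageRankMergeSketch {
    static final double C = 0.85;

    // crossEdges: (srcVertex, dstVertex) pairs whose endpoints lie in different partitions
    static void merge(List<int[]> crossEdges,
                      Map<Integer, Double> pageRank,      // merged ranks from all partitions
                      Map<Integer, Integer> outDegree) {  // total out-degree of each vertex
        // Score each outgoing cross-partition edge: edge_{ji} = PageRank_j / outdegree(j)
        Map<Integer, Double> extraIncoming = new HashMap<>();
        for (int[] e : crossEdges) {
            double score = pageRank.getOrDefault(e[0], 0.0) / outDegree.getOrDefault(e[0], 1);
            extraIncoming.merge(e[1], score, Double::sum);
        }
        // Fold the damped cross-partition contribution into each destination vertex's rank
        // (one plausible interpretation of "updates the PageRanks of the destination vertices").
        for (Map.Entry<Integer, Double> e : extraIncoming.entrySet())
            pageRank.merge(e.getKey(), C * e.getValue(), Double::sum);
    }
}
```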

V. EXPERIMENTAL RESULTS

In this section, we report the results of executing five applications using PIC on three clusters of different sizes - a small research cluster, a medium sized production cluster and a large virtual cluster hosted on Amazon's Elastic MapReduce service. Section V-A describes the experimental setup in detail. The speedups of PIC implementations compared to the baseline IC implementations (which already employ known optimizations such as combiners, and elimination of repeated initialization overheads and input reads) are reported in Sections V-B and V-C. In Section V-D we present insights into the speedups obtained by PIC. We show that the amount of communication traffic on the cluster interconnect due to MapReduce intermediate data and model updates is significantly reduced over the baseline implementations.

A. Experimental Setup

We have implemented the PIC library on top of the Apache Hadoop framework [4]. We ported conventional IC implementations of five applications (K-means clustering, PageRank, neural network training, linear equation solver, and image smoothing) into PIC. We compare IC (our baseline implementation, which was developed using Hadoop directly without PIC) and the PIC implementations. We ran experiments on three different clusters, which we refer to as small, medium and large, to demonstrate the use of the PIC library, and the benefits that accrue by using PIC at different cluster scales. The small cluster is a 6-node research testbed that uses Gigabit Ethernet as the cluster interconnect. Each node has two quad-core E5520 Xeon processors running at 2.27GHz (8 physical cores and hyper-threading support), with 48 GB of RAM. From Hadoop's point of view there are a total of 24 map and 24 reduce task slots on this cluster. The medium testbed is a 64-node shared production cluster. Each node has two quad-core E5430 Xeon processors running at 2.66GHz, with 16 GB of RAM. The medium cluster occupies 6 racks, and the nodes are connected to each other by using a Gigabit Ethernet switch. We used 330 map and 110 reduce task slots. Finally, the large testbed consists of 256 Amazon Elastic MapReduce extra-large instances. Each instance has 15 GB of memory and 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each).

[Figure omitted: bar chart of run times.] Figure 9. Performance of PIC and baseline IC on a small (6 node) cluster

Although recent proposals [5], [6], [7] have managed to reduce the overheads for repeatedly launching Map and Reduce tasks in each iteration, these optimizations are not yet available in Hadoop. Therefore, we took the following approach in order to remove the effect of these optimizations before evaluating PIC. We recorded the number of iterations that the baseline algorithm executed. We then ran a program that looped for the same number of iterations, and in each iteration created a MapReduce job that reads the input data but does not perform any processing. The execution time of this job was subtracted from the baseline to account for the performance improvements from elimination of repeated job initialization and repeated input reads.

B. Speedups

Figure 9 shows the speedup experienced in the small cluster for three different applications: K-means, PageRank and the linear equation solver. The K-means experiment was designed to cluster 5 million points into 100 clusters. The IC implementation of the PageRank algorithm was taken from the Nutch open source search engine (version 1.1). The Nutch implementation considers the results of the PageRank acceptable after 10 iterations. As the input web graph, we use the wikipedia.org website that contains 1.8 million documents. To implement the PageRank algorithm using PIC, our partitioning function randomly divides the web graph into 18 partitions, each having about 100,000 vertices. The cross-edges between partitions are also grouped into 18^2 = 324 sets. For the linear equation solver, we used an example of a linear system of 100 variables with a weakly diagonally dominant matrix. In each case, the problem size was chosen to ensure that the baseline execution took about 1 hour on the cluster (for practical reasons). Such a large time window was chosen to minimize the impact of Hadoop job starting and finishing overhead (which is on the order of seconds). The results presented in Figure 9 show that PIC results in a 2.5X-4X performance improvement over the baseline IC implementations.

Figure 10 shows the speedups for a medium sized cluster (64 nodes) for K-means, neural network training, and image smoothing. This time, a larger data set was used for the K-means experiment, namely 10 million points distributed in a 3-dimensional space. The neural network training application used a dataset of about 210,000 optical character recognition (OCR) training vectors. Finally, a large 40-megapixel image was used as the dataset for the image smoother. Once again, the problem sizes were chosen such that the baseline execution took about 1 hour. We see that PIC still manages to outperform the baseline IC implementation by a factor of between 2.5X and 4X.

[Figure omitted: bar chart of run times.] Figure 10. Performance of PIC vs. baseline IC on a medium (64 node) cluster

The dataset sizes of the K-means and neural network training experiments were increased when moving from the small cluster to the medium sized cluster. The main reason for this was to ensure that there is enough work to fully utilize the whole cluster. These results demonstrate weak scalability of the PIC library.

C. Speedup: Strong scaling on large clusters

To measure the impact of PIC on strong scalability, we performed experiments on Amazon Elastic MapReduce using 256 extra-large instances (256 nodes, 8 cores each). In these experiments, the dataset size was fixed, while we scaled the number of nodes from 64 to 128 to 192 and finally to 256 nodes. Figure 11 shows the results of our experiments for the image smoothing application; it shows that, for up to 256 nodes, the speedups of PIC over the baseline implementation are maintained. Furthermore, we can conclude that the PIC library does not have any negative impact on the scalability of Hadoop.

D. Analysis of speedups

The speedups obtained by PIC in practice are due to two key factors: (i) in the best-effort phase, the number of best-effort iterations is very small (relative to the number of iterations executed by a conventional implementation). This is critical since cluster-wide communication is incurred once per best-effort iteration. (ii) The number of iterations executed in the top-off phase is also small. In this section, we present data to quantitatively demonstrate these factors.

For K-means, Table I shows the number of best-effort iterations and the number of local iterations required in each best-effort iteration. Note that, except for the first best-effort iteration, only 2-3 local iterations are necessary in any best-effort iteration. Similar results were observed for all other applications.

Table I. ITERATIONS REQUIRED FOR IC AND BEST-EFFORT PHASE OF PIC (K-MEANS)

Dataset size                                0.5M            5M           50M        500M
Number of IC iterations                     32              32           31         31
Number of best-effort iterations (PIC)      5               4            3          3
(Max) number of local iterations (PIC),
per best-effort iteration                   34, 3, 3, 2, 2  34, 3, 2, 2  33, 2, 2   33, 2, 2

[Figure omitted: speedup vs. number of nodes.] Figure 11. Strong scalability of the PIC speedup over the IC baseline for the image smoothing application. The horizontal axis shows the number of computing nodes (each node has 8 processing cores) and the vertical axis shows the speedup of PIC vs. baseline IC.

Table II. BREAKDOWN OF DATA READ OR GENERATED DURING K-MEANS CLUSTERING OF 500 MILLION POINTS

                    1 Baseline It. (IC)   Total Baseline (IC)   Total PIC
Intermediate data   9.21 GB               285.68 GB             80.9 KB
Model updates       30 KB                 959.03 KB             92.23 KB

Table II shows the volume of intermediate data (mapper output) and model updates in the K-means application while clustering 500 million data points using the IC and PIC schemes on the small cluster (6 nodes, 8 cores each). The first column shows the breakdown for a single iteration of the baseline IC implementation. The second column shows the cumulative results for all the iterations required by the IC implementation. The third column corresponds to the PIC implementation. We can clearly see that the PIC implementation drastically reduces intermediate data and model updates. This disparity is in spite of the fact that all our baseline implementations utilize combiner optimizations.

In summary, our results indicate that the reduction in communication traffic is a primary contributing factor to the speedups achieved by PIC. Since communication only becomes a more severe bottleneck for larger clusters, we believe that the benefits of PIC will be sustained, if not enhanced, with an increase in cluster size.

VI. EFFECTIVENESS OF PIC’S BEST-EFFORT PHASE

The performance improvements obtained by PIC are in large part due to the effectiveness of the best-effort phase in computing a high-quality model, and the fact that it does so in a much shorter time than the original IC computation itself. For all our applications, we observed that the results produced by the best-effort phase are very close in quality to the final solution (in some cases, even identical), necessitating very few iterations in the top-off phase.

In this section we empirically evaluate the quality of the model generated by PIC's best-effort phase by comparing it to the results of the baseline IC implementation. We also briefly mention some analytical insights to explain our results. As a special case, we show that when the computations in an iteration are linear and the problem has the "nearly uncoupled" property, the best-effort phase of PIC can be analytically shown to converge to the same solution as the baseline IC implementation.

A. Error vs. time

To illustrate the effectiveness of the best-effort phase, we evaluated the error of the models computed in each best-effort iteration, and plotted error vs. the time at which each best-effort iteration completes. We compare these trajectories to the ones obtained from conventional IC implementations.

We note that the definition of quality or error is application-specific; therefore, we consider each application separately.

For neural network training, the error was evaluated during training by applying the model to a validation data set and measuring the fraction of data points that are mis-classified (model error). Figure 12(a) shows the trajectory of the model error vs. time for the neural network training algorithm. The figure shows that the PIC implementation manages to reach a model error that is virtually identical to what the baseline implementation eventually achieves, but in less than a quarter of the time.

For K-means clustering, we computed the final solution (centroids) produced by a sequential implementation and used the distance to this reference solution as the error metric. Figure 12(b) plots the centroid displacement from iteration to iteration in the best-effort phase of PIC as well as the baseline implementation. Again, we see that the centroids converge much faster in the best-effort phase of PIC.

[Figure omitted: error vs. time (seconds) for IC and PIC.] Figure 12. Accuracy of results vs. time for (a) neural network training, (b) K-means clustering and (c) solving a system of linear equations.

We also used the Jagota index [11], a popular metric to evaluate the quality of clustering algorithms, to compare the model computed by the best-effort phase of PIC with the model computed by the original IC implementation. The Jagota index measures the tightness or homogeneity of points within the clusters and is defined as follows:

$$Q = \sum_{i=1}^{k} \frac{1}{|C_i|} \sum_{x \in C_i} d(x, \mu_i)$$

where $d(x, \mu_i)$ is the distance between a point x and the centroid $\mu_i$ that it belongs to, and $|C_i|$ is the number of points in the i-th cluster. The results for two data sets, shown in Table III, suggest that the best-effort phase of PIC is able to produce a solution that is within 3% of the quality of the baseline IC implementation, enabling the top-off phase to terminate in a small number of iterations.
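For reference, a small sketch that computes this index for an in-memory clustering is shown below; Euclidean distance and the container types are our own illustrative assumptions.

```java
import java.util.List;
import java.util.Map;

// Sketch of the Jagota index Q defined above: for each cluster, average the
// distance of its points to their centroid, then sum over all clusters.
class JagotaIndexSketch {
    // key: centroid mu_i, value: points assigned to cluster C_i
    static double jagotaIndex(Map<double[], List<double[]>> clusters) {
        double q = 0.0;
        for (Map.Entry<double[], List<double[]>> cluster : clusters.entrySet()) {
            double[] mu = cluster.getKey();
            List<double[]> points = cluster.getValue();
            if (points.isEmpty()) continue;
            double sum = 0.0;
            for (double[] x : points) {
                double d = 0.0;
                for (int i = 0; i < x.length; i++) {
                    double diff = x[i] - mu[i];
                    d += diff * diff;
                }
                sum += Math.sqrt(d);          // Euclidean distance d(x, mu_i)
            }
            q += sum / points.size();         // (1/|C_i|) * sum over points in C_i
        }
        return q;
    }
}
```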

Table III. QUALITY OF BEST-EFFORT PHASE OF PIC IN TERMS OF JAGOTA INDEX (K-MEANS)

                         Dataset 1   Dataset 2
IC K-means               2.109       2.146
PIC BE Phase K-means     2.112       2.205
Difference (%)           0.14%       2.75%

For the system of linear equations, there exists a unique golden solution. We use the distance to this solution as the error metric, and plot error vs. time for the best-effort phase of PIC and the conventional IC implementation in Figure 12(c). Again, we see that the best-effort phase of PIC produces comparable quality to the conventional IC implementation in one-third the time.

Similar accuracy results were obtained for the other case studies, validating our hypothesis that the best-effort phase of PIC is effective in generating a high-quality model in a much shorter time than conventional IC implementations.

B. Analysis as a preconditioner

We provide some analytical insights to explain the effective-ness of the best-effort phase of PIC. In other words, how canwe know that an algorithm implemented in PIC’s best-effortphase will not diverge or create drastically different resultsthan the baseline.

In [12], we provide a mathematical framework by drawing an analogy between the best-effort phase of PIC, for algorithms that perform linear computations in each iteration (e.g., PageRank, the linear equation solver, image smoothing), and additive Schwarz preconditioners from the field of "Domain Decomposition". Without repeating the analysis in this paper, we summarize the key insights.

We believe that the effectiveness of the best-effort phase of PIC is explained by the nature of the applications that PIC targets, namely applications that are "nearly uncoupled". The dependency patterns of such applications can be approximated using nearly block-diagonal matrices, as depicted in Figure 13.

Additive Schwarz preconditioners deal with similar problems. Note that in differential equations (where domain decomposition techniques are traditionally used), the stencil-based computation ensures the desired dependency patterns. However, stencil-based algorithms are not the only algorithms with the desired dependency patterns.

Figure 13. The ideal dependency matrix of an application that PIC can successfully target. The dependency between different partitions, shown by ε_ij, i ≠ j, should be minimal (symbolized by 0) for PIC to be effective.
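To make the "nearly uncoupled" notion concrete, the following illustrative sketch measures how much of a dependency matrix's weight falls in the off-diagonal blocks ε_ij for a given partitioning; the dense-matrix representation and the use of absolute values are assumptions made for the sake of the example.

```python
import numpy as np

def cross_partition_weight(A, partition):
    """Fraction of |A|'s total weight lying in off-diagonal blocks.

    partition[i] is the partition index assigned to row/column i. A value
    close to 0 indicates a nearly block-diagonal (nearly uncoupled)
    dependency structure of the kind PIC targets.
    """
    A = np.abs(np.asarray(A, dtype=float))
    part = np.asarray(partition)
    same_block = part[:, None] == part[None, :]   # True inside diagonal blocks
    total = A.sum()
    return A[~same_block].sum() / total if total > 0 else 0.0
```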

As an example, consider the PageRank algorithm. In a parallel implementation, each node requires information from the adjacent nodes and edges for the computation. If the web were a complete graph with n(n−1)/2 edges, it would not be a good match for a PIC implementation. Fortunately, the web graph is typically local, and by properly partitioning it (for example, using the METIS package), the connectivity matrix of the graph becomes nearly uncoupled.
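The degree of locality after partitioning can be quantified as the fraction of edges that cross partition boundaries. The sketch below is illustrative only; the adjacency-list representation and the externally supplied partition assignment (e.g., one produced by METIS) are assumptions of the example.

```python
def cut_edge_fraction(adjacency, partition):
    """Fraction of undirected edges whose endpoints lie in different partitions.

    adjacency: {node_id: [neighbor_ids]}, with integer node ids and each
    edge listed in both directions. partition: {node_id: partition_id}.
    A small value indicates a nearly uncoupled graph.
    """
    cut = total = 0
    for u, neighbors in adjacency.items():
        for v in neighbors:
            if u < v:                      # count each undirected edge once
                total += 1
                if partition[u] != partition[v]:
                    cut += 1
    return cut / total if total else 0.0
```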

Similar arguments apply to K-means clustering. The impact of far-away points on a centroid is much smaller than the impact of points close to that centroid. As such, a rough first-pass partitioning can ensure that each sub-problem mostly relies on the points inside that partition.
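As a highly simplified, single-machine sketch of the partition/local-iterate/merge idea for K-means (not our Hadoop implementation), consider the following; `local_kmeans` is a hypothetical routine that runs ordinary K-means iterations on one partition, and the size-weighted averaging shown here is only one possible programmer-specified merge function.

```python
import numpy as np

def best_effort_kmeans(partitions, init_centroids, local_kmeans, best_effort_iters):
    """Sketch of best-effort iterations for K-means.

    partitions: list of (n_p, d) point arrays, one per sub-problem.
    local_kmeans(points, centroids) is assumed to return a pair
    (refined_centroids, cluster_sizes) as (k, d) and (k,) arrays.
    """
    centroids = np.asarray(init_centroids, dtype=float)
    for _ in range(best_effort_iters):
        local_results = [local_kmeans(p, centroids) for p in partitions]
        # Merge step: size-weighted average of the corresponding local centroids.
        weighted = sum(c * sizes[:, None] for c, sizes in local_results)
        counts = sum(sizes for _, sizes in local_results)
        centroids = weighted / np.maximum(counts, 1)[:, None]
    return centroids
```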

The image smoothing algorithm is stencil-based, and the dependencies are clearly local. In the case of the linear system of equations, a weakly diagonally dominant matrix guarantees the "nearly uncoupled" property. In fact, the weak diagonal dominance property is powerful enough to ensure even asynchronous convergence [13].
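For reference, weak diagonal dominance of a system matrix can be checked directly; the row-wise form of the test in the sketch below is our assumption of the exact criterion.

```python
import numpy as np

def is_weakly_diagonally_dominant(A):
    """True if |a_ii| >= sum_{j != i} |a_ij| holds for every row of A."""
    A = np.abs(np.asarray(A, dtype=float))
    diag = np.diag(A)
    off_diag = A.sum(axis=1) - diag
    return bool(np.all(diag >= off_diag))
```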

[12] shows that the convergence rate of the best-effort phase can be found from the convergence rate of the baseline algorithm with a scaling factor, as follows:

$$
\left( \omega \, \frac{\beta}{\alpha} \right)^{\frac{k-1}{k}}
$$

where β/α is the ratio of the maximum length of the partitioned input vectors to the length of the unpartitioned vector, ω is a measure of the converging power of the iterative function and can be derived from the "local stability" condition [12], and k is the number of local iterations.

It is evident that more partitions translate to a slower convergence rate in the best-effort phase, but as we have seen earlier, the increased locality of the sub-problems allows much faster local iterations, by reducing network traffic and performing computations locally.

VII. DISCUSSIONS

There are a few points that merit further discussion. The first is the impact of PIC on Hadoop's fault tolerance. Since PIC is implemented as a library on top of Hadoop, best-effort iterations can utilize the same fault-tolerance mechanisms that Hadoop provides. Therefore, if a node fails while running best-effort work, Hadoop will automatically re-execute its tasks.

We would also like to mention that PIC is currently implemented on Hadoop version 0.20.203. We have considered the new version of Hadoop (YARN, 0.23) and believe that its design architecture (resource manager, node managers and containers) is a good fit for PIC, and that PIC can be easily ported to it. We leave this as future work.

VIII. RELATED WORK

High-level programming frameworks [2], [14], [4] have emerged as the preferred programming model for developing and deploying applications on shared-nothing clusters of unreliable machines. Iterative-convergence algorithms are commonly implemented using these frameworks; yet, none of these frameworks provide explicit support or optimizations for iterative-convergence algorithms. For example, the Apache Mahout [3] project builds machine learning libraries on top of Hadoop. Most machine learning algorithms are iterative, and Mahout uses a driver program to launch new MapReduce jobs in each iteration, resulting in the inefficiencies described in earlier sections.

Recently, several noteworthy attempts have been made to augment MapReduce frameworks for iterative computations [5], [6], [7]. Twister [5] is a stream-based framework that extends the basic MapReduce framework by explicitly avoiding repeated mapper data loading from disks. Twister also uses long-running mappers and reducers with distributed caches. The Spark [6] framework targets data-intensive applications that reuse a working set of data across multiple parallel operations. It introduces the concept of resilient distributed datasets, read-only collections of objects partitioned across a set of machines that can be rebuilt if a particular partition is lost. HaLoop [7] extends Hadoop by introducing iterations into the programming model and several optimizations that include loop-aware task scheduling, loop-invariant data caching and strategies for efficient fixed-point verification. A recent effort reported in [15] demonstrates how asynchronous algorithms [13] can be realized using MapReduce with improved efficiency. A common attribute of the best-effort phase of PIC and [15] is that neither preserves numerical equivalence with a sequential implementation. However, unlike asynchronous algorithms, where the communication between parallel computations does not occur at pre-determined synchronization points and can lead to results that depend on execution timing, PIC is fully synchronous and deterministic. PIC exposes nested parallelism (across partitions and within a partition), which can also be achieved using frameworks such as NESL [16]. However, PIC goes well beyond nested parallelism frameworks by focusing on iterative-convergence algorithms and by leveraging the inherent forgiving nature of applications in the best-effort phase.

Similar problems have been studied in the Operations Research community [17], [18] for "nearly uncoupled" problems, also known as "nearly completely decomposable" problems. [18] shows that an aggregation/disaggregation technique (including techniques such as "Koury, McAllister, Stewart", "Takahashi" and "Vantilborgh") can compute Markov chain steady-state solutions through a process similar to PIC's best-effort phase. [17] extends this notion to a more general class of problems, namely fixed points of non-expansive mappings (a generalization of contraction mappings).

Meng et al. present a best-effort parallel execution framework for recognition and mining applications on multi-core platforms, based on the properties of iterative-convergence algorithms [19], [20]. They propose a variety of "best-effort" computing strategies. A number of compile-time optimizations are proposed in [21], including some that do not preserve the accuracy of the final results. Similar approaches have been applied in the context of stencil computations [22], [23]. These techniques differ significantly from our proposal, and are in fact of limited use for clusters.

None of the prior systems explicitly address the significant challenges of (a) reducing disk and network traffic due to MapReduce intermediate data and model updates or (b) introducing parallelism across partitions. Augmenting the basic MapReduce framework to support iterations is useful, but cannot overcome these challenges. By leveraging the unique forgiving nature of applications that employ iterative convergence, we are able to achieve significant performance and scalability advantages beyond most existing proposals. Moreover, PIC does not require any changes to the MapReduce framework, and it is possible to reuse an existing MapReduce-based implementation when developing a PIC implementation, reducing programmer effort.

IX. CONCLUSIONS

We proposed partitioned iterative convergence (PIC) as an approach to realize iterative algorithms on clusters. We observed that conventional implementations of iterative algorithms using MapReduce are quite inefficient as a result of several factors. Complementary to prior work, we focus on addressing the challenges of high network traffic due to frequent model updates, and the lack of parallelism across iterations. The proposed partitioned iterative convergence scheme exploits the forgiving nature of iterative-convergence algorithms. We realized our framework by adapting the open-source Hadoop cluster programming framework, and demonstrated its utility and performance benefits using five iterative-convergence applications. PIC implementations are significantly faster (up to 4X) when compared with conventional MapReduce-based implementations on clusters ranging from a research testbed of 6 nodes to an Amazon Elastic MapReduce instance of 256 nodes. Our results suggest that PIC has great potential in enabling efficient implementations of iterative-convergence algorithms on clusters.

REFERENCES

[1] Y.-K. Chen, J. Chhugani, P. Dubey, C. Hughes, D. Kim, S. Kumar, V. Lee, A. Nguyen, and M. Smelyanskiy, "Convergence of recognition, mining, and synthesis workloads and its implications," Proceedings of the IEEE, vol. 96, no. 5, pp. 790–807, May 2008.


[2] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.

[3] "Apache Mahout project," available at http://mahout.apache.org/.

[4] "The open source Apache Hadoop project," available at http://hadoop.apache.org/.

[5] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox, "Twister: a runtime for iterative MapReduce," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, ser. HPDC '10. New York, NY, USA: ACM, 2010, pp. 810–818. [Online]. Available: http://doi.acm.org/10.1145/1851476.1851593

[6] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, ser. HotCloud'10. Berkeley, CA, USA: USENIX Association, 2010, pp. 10–10. [Online]. Available: http://dl.acm.org/citation.cfm?id=1863103.1863113

[7] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst, "HaLoop: efficient iterative data processing on large clusters," Proc. VLDB Endow., vol. 3, no. 1-2, pp. 285–296, Sep. 2010. [Online]. Available: http://dl.acm.org/citation.cfm?id=1920841.1920881

[8] A. Vahdat, M. Al-Fares, N. Farrington, R. N. Mysore, G. Porter, and S. Radhakrishnan, "Scale-out networking in the data center," IEEE Micro, vol. 30, no. 4, pp. 29–41, Jul. 2010.

[9] J. MacQueen, Some methods for classification and analysis of multivariate observations. University of California Press, 1967, vol. 1, no. 233, pp. 281–297.

[10] "Apache Nutch open source search engine," available at http://nutch.apache.org/.

[11] A. Jagota, "Novelty detection on a very large number of memories stored in a Hopfield-style network," in Neural Networks, 1991., IJCNN-91-Seattle International Joint Conference on, vol. ii, Jul. 1991, p. 905 vol. 2.

[12] F. et al., "A theoretic treatment of the partitioned iterative convergence methods," Illinois Digital Environment for Access to Learning and Scholarship, June 2012. [Online]. Available: https://www.ideals.illinois.edu/handle/2142/31323

[13] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and distributed computation: numerical methods. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1989.

[14] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," in Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, ser. EuroSys '07. New York, NY, USA: ACM, 2007, pp. 59–72. [Online]. Available: http://doi.acm.org/10.1145/1272996.1273005

[15] K. Kambatla, N. Rapolu, S. Jagannathan, and A. Grama, "Asynchronous algorithms in MapReduce," in Proceedings of the 2010 IEEE International Conference on Cluster Computing, ser. CLUSTER '10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 245–254. [Online]. Available: http://dx.doi.org/10.1109/CLUSTER.2010.30

[16] G. E. Blelloch, S. Chatterjee, J. C. Hardwick, J. Sipelstein, and M. Zagha, "Implementation of a portable nested data-parallel language," Pittsburgh, PA, USA, Tech. Rep., 1993.

[17] T. W., "Iterative methods for approximation of fixed points and their applications," Journal of the Operations Research Society of Japan, vol. 43, no. 1, pp. 87–108, 2000.

[18] W.-L. Cao and W. J. Stewart, "Iterative aggregation/disaggregation techniques for nearly uncoupled Markov chains," J. ACM, vol. 32, no. 3, pp. 702–719, Jul. 1985. [Online]. Available: http://doi.acm.org/10.1145/3828.214137

[19] J. Meng, S. Chakradhar, and A. Raghunathan, "Best-effort parallel execution framework for recognition and mining applications," in Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, ser. IPDPS '09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 1–12.

[20] ——, "Exploiting the forgiving nature of applications for scalable parallel execution," in Proceedings of the 2010 IEEE International Symposium on Parallel & Distributed Processing, ser. IPDPS '10, 2010.

[21] J. Liu, N. Ravi, S. Chakradhar, and M. Kandemir, "Panacea: towards holistic optimization of MapReduce applications," in Proceedings of the Tenth International Symposium on Code Generation and Optimization, ser. CGO '12. New York, NY, USA: ACM, 2012, pp. 33–43. [Online]. Available: http://doi.acm.org/10.1145/2259016.2259022

[22] D. Chazan and W. Miranker, "Chaotic relaxation," Linear Algebra and its Applications, vol. 2, no. 2, pp. 199–222, 1969.

[23] S. Venkatasubramanian and R. W. Vuduc, "Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems," in Proceedings of the 23rd International Conference on Supercomputing, ser. ICS '09. New York, NY, USA: ACM, 2009, pp. 244–255. [Online]. Available: http://doi.acm.org/10.1145/1542275.1542312
