
Towards Energy Efficient MapReduce

Yanpei Chen, Laura Keys, Randy H. Katz

Electrical Engineering and Computer Sciences
University of California at Berkeley

Technical Report No. UCB/EECS-2009-109

http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-109.html

August 5, 2009


Copyright 2009, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.


Towards Energy Efficient MapReduce

Yanpei Chen, Laura Keys, Randy H. Katz
RAD Lab, EECS Department, UC Berkeley

{ychen2, laurak, randy}@eecs.berkeley.edu

ABSTRACT
Energy considerations are important for Internet datacenter operators, and MapReduce is a common Internet datacenter application. In this work, we use the energy efficiency of MapReduce as a new perspective for increasing Internet datacenter productivity. We offer a framework for analyzing software energy efficiency in general, and MapReduce energy efficiency in particular. We characterize the performance of the Hadoop implementation of MapReduce under different workloads, and introduce quantitative models to guide operators and developers in improving the performance of MapReduce/Hadoop. A major, though somewhat unsurprising, finding is that for workloads whose work rate is proportional to the amount of resources used, improving performance as measured by traditional metrics such as job duration is equivalent to improving performance as measured by energy consumed.

Categories and Subject Descriptors
C.2.4 [Computer-Communication Networks]: Distributed Systems—Distributed applications

General Terms
Design, Performance, Economics

Keywords
MapReduce, Hadoop, Energy, Data centers

1. INTRODUCTION
The energy consumption of Internet datacenters is an important aspect of datacenter efficiency. Research has shown that over the lifetime of IT equipment, the operating energy and cooling cost is now comparable to the initial equipment acquisition cost [13]. Also, most operators build datacenters to be as large as the local power grid can economically support. Thus, equipment that is not power or energy efficient limits the amount of IT equipment that can go into a particular physical datacenter site. As computational needs increase, this means more frequent, large scale investments in new datacenter sites.

At the same time, the total power consumption of datacenters in the United States doubled from 2000 to 2005, representing 1.5% of national power consumption in 2005, larger than that of color televisions [19]. During the same period and thereafter, national power production saw only marginal increases [1, 8], a sharp contrast with the 15% yearly increase in datacenter power consumption [19]. Thus, datacenter operators, with their long history of innovation, have a responsibility to lead the nation in using new technologies to reduce our carbon footprint.

Datacenter operators have already taken on this responsibility, and have developed industry power efficiency standards such as the power utilization efficiency (PUE) metric [4]. The PUE metric is the ratio of total facility power to total IT equipment power. In 2004, typical PUE values were 2 and above [21], suggesting that most datacenter power was wasted in physical, non-IT systems such as cooling and power distribution. More recently, PUE values have improved significantly, and are fast approaching the ideal value of 1 [18, 22]. This means that the vast majority of datacenter power consumption is now in the IT equipment. Consequently, the greatest opportunities for further improvement now lie in optimizing IT.

We can improve the energy efficiency of IT equipment through either hardware or software. We focus on software because we believe it presents a lower barrier to innovation and adoption. Of all the software paradigms, we analyze MapReduce [15] because it powers the core business of many key Internet enterprises. Of all the MapReduce implementations, we choose Hadoop [6] because it is open source, and thus offers a low barrier for innovation. We suspect there will be many opportunities to save energy, because existing Hadoop versions have not been designed with energy efficiency in mind.

To the best of our knowledge, our work is the first detailed study of the energy efficiency of datacenter applications like MapReduce. We seek to make several contributions. We present an analytical framework for software energy efficiency, including the efficiency of systems such as MapReduce. We provide experimental data to characterize the energy consumption of the Hadoop implementation of MapReduce. We also introduce quantitative models that help Hadoop operators and developers identify opportunities to improve energy efficiency.

This report is organized as follows. Section 2 gives a brief overview of MapReduce and quickly reviews related work in Internet datacenter energy efficiency. Section 3 presents our software energy efficiency framework, and discusses the complex inter-relationships between metrics, system parameters, workloads, and energy measurement. Section 4 describes our experimental setup and our energy measurement methodology. Section 5 provides descriptive data to characterize the energy consumption of several MapReduce workloads. Section 6 outlines quantitative models that allow Hadoop operators to tune their clusters, and Hadoop developers to identify potential improvements to the Hadoop system. Section 7 summarizes our recommendations and suggests directions for future work.

2. BACKGROUND
2.1 MapReduce Overview
MapReduce was initially developed at Google as a platform for parallel processing of large datasets [15]. Today, MapReduce powers Google's flagship web search service, as well as clusters at companies like Yahoo!, Facebook, and IBM [7]. MapReduce has several attractive qualities. Programs written using the MapReduce model are automatically executed in a parallelized fashion on the cluster. Also, MapReduce can run on clusters of cheap commodity machines, an increasingly attractive alternative to expensive, specialized clusters. Furthermore, MapReduce is highly scalable, allowing petabytes of data to be processed on thousands and even millions of machines. Most importantly for our work, MapReduce is a data intensive benchmark that generates a mixture of CPU, disk, and network activity that is representative of many real-life datacenter workloads.

At its core, MapReduce has two user-defined functions. The Map function takes in a key-value pair and generates a set of intermediate key-value pairs. The Reduce function takes in all intermediate pairs associated with a particular key and emits a final set of key-value pairs. Both the input pairs to Map and the output pairs of Reduce are placed in an underlying distributed file system (DFS). The run-time system takes care of retrieving from and outputting to the DFS, partitioning the data, scheduling parallel execution, coordinating network communication, and handling machine failures.
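As a concrete illustration of these two functions, the classic word-count example is sketched below using the Java interfaces of Hadoop releases of this era (the org.apache.hadoop.mapred API). It is a minimal sketch for orientation, not one of the workloads measured in this report.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {
  // Map: for each input line, emit a (word, 1) intermediate pair per word.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        out.collect(word, ONE);
      }
    }
  }

  // Reduce: sum the counts for each word and emit the final (word, count) pair.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      out.collect(key, new IntWritable(sum));
    }
  }
}
```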

A MapReduce task is executed in several stages. A master daemon is responsible for coordinating a job on a cluster of workers. The input data resides in the DFS. The master divides the input file into many splits, each to be read and processed by a Map worker. The intermediate key-value pairs are periodically written to the local disk at the Map workers, usually separate machines from the master, and the locations of the pairs are sent to the master. The master forwards these locations to the Reduce workers, which read the intermediate pairs from the Map workers using remote procedure calls (RPC). After a Reduce worker has read all the intermediate pairs, it sorts the data by the intermediate key, applies the Reduce function, and finally appends the output pairs to a final output file for the Reduce partition. The output file also resides in the DFS. If any of the Map or Reduce tasks lag far behind, backup tasks are executed speculatively, and the task is considered complete when any of the backup Map or Reduce workers finishes. To limit network traffic, users may additionally specify a Combine function that merges multiple intermediate key-value pairs before sending them off to the Reduce worker.

For our work, we select the Hadoop implementation of MapReduce [6]. This is one of the most popular MapReduce implementations. It is an open source Java application. The user-supplied Map and Reduce functions are both written in Java; the cluster configurations are set using XML. Extensive documentation exists online. The Hadoop DFS (HDFS) implements many features of the Google DFS [17], on which the original Google MapReduce runs.

2.2 Related Work
There has been a significant amount of prior work in energy efficient computing systems. First motivated by mobile applications, the interest in energy efficiency has been sustained by the proliferation of laptops, as well as the recent focus on Internet datacenters. It would be impossible to comprehensively review the existing literature, so we will highlight below several topics most closely related to our work.

Frequency scaling has long been a well known power saving technique in CPU architecture. However, with CPU frequencies scaled down, CPU-bound workloads take proportionally longer to run. Since total energy use is the product of power and time, a lower frequency actually imposes a latency cost with no benefit in terms of total energy use. Nevertheless, frequency scaling is still useful if the workload is not CPU-bound, or if power is a system design limitation, as is often the case in the CPU design space: the higher the CPU power, the greater its cooling needs. The now well-known tradeoff between power and latency in frequency scaling highlights the importance of looking at both energy and power, i.e., the rate of energy consumption.

In a different line of work, energy conscious communications for mobile architectures developed the energy per useful bit metric [11]. This metric highlighted the need for metrics that evaluate systems with regard to the energy used for each unit of useful work. In a similar fashion, we can also measure finishing time per unit of useful work. It is difficult, if not impossible, to come up with a general definition of "useful work". But for comparable workloads, the unit of useful work is usually easy to identify. For example, in HDFS read or write jobs, the unit of useful work would be the bytes read or written, and the corresponding workload-conscious energy efficiency metric would be energy per byte read or energy per byte written.

More recently, people have started developing benchmarks to measure datacenter energy efficiency. The SPEC Power benchmark [10] has been widely accepted by server vendors. As already mentioned, the PUE metric [4] is already a de-facto industry standard among datacenter operators. The EPA has also developed a new set of Energy Star criteria for enterprise servers [2]. A set of Energy Star criteria for enterprise storage is under development as well [3]. The natural progression of these workload-specific energy metrics would be some sort of unifying datacenter productivity measure. Our work offers insights that help guide incremental steps towards that goal.

Also, people have recently started speaking about energy proportional computing [12] as a worthy system design goal. The main argument is that current systems consume nearly as much power when idle as when they are actively doing work. Given that most machines are expected to see a significant amount of idleness during their lifetime, we should design computing systems such that they consume power proportional to how much they are utilized. Ideally, an energy proportional system consumes zero power when idle and full power under 100% utilization, with power increasing linearly between the two end points. The energy proportionality concept inspired the quantitative analysis leading to our major result that energy efficiency and traditional time efficiency are fundamentally equivalent.

Although we are the first to look at the energy efficiency of MapReduce, we are certainly not the first to use MapReduce in work on systems energy consumption. MapReduce has already become a standard workload in evaluating the power consumption of datacenters [16, 20]. We fully believe that MapReduce should be a part of any evaluation of datacenter efficiency. We differ from this line of work by focusing on improving the software energy efficiency of MapReduce in particular, instead of using MapReduce as a workload for a wider energy efficiency investigation.

Lastly, in a closely related work, JouleSort investigates the energy efficiency of sorting applications [23]. Sorting is a key part of the MapReduce system. In most implementations, including Hadoop, sorting occurs implicitly with every MapReduce job. The JouleSort performance metric of joules per sorted record echoes the energy per useful bit concept. The JouleSort experimental methodology also informed ours. Their key finding is that sorting is most energy efficient when it is CPU bound instead of IO bound, an insight that should help us improve the sorting part of MapReduce. Our investigation is a superset of the focus in JouleSort: JouleSort is one of several workloads that we run on MapReduce.

3. ANALYTICAL FRAMEWORK
We believe software energy efficiency studies should be guided by four considerations: the desired performance metrics, the available system parameters, the appropriate workload, and the method of energy measurement.

3.1 Metrics
The choice of performance metrics is critical to any discussion about datacenter energy efficiency. First, we must make a distinction between energy and power. Power is the rate of energy consumption, i.e., energy = power × time. Datacenter operators are charged for energy consumption, usually in kilo-Watt-hours (kWHr). However, each datacenter site is usually constructed for a certain power rating, in kilo-Watts (kW) or even mega-Watts (MW). For a constant power, jobs that run longer consume more energy. For jobs that consume a fixed amount of energy, a faster finishing time comes at the cost of higher power. Both metrics are important in a discussion about datacenter energy efficiency.

Another crucial metric is energy per unit of useful work. This metric is important because it gives us a way of talking about energy consumption independent of workload size. The difficulty in using this metric, however, is identifying the appropriate unit of useful work. Doing so is difficult across different types of workloads. However, for workloads of the same type, the appropriate unit of useful work is often obvious, e.g., joules per record sorted for sort, joules per byte written/read for write/read jobs, and the like.

A similar metric is work rate per Watt. While energy per unit of useful work measures energy efficiency independent of workload size, work rate per Watt measures power efficiency independent of workload size.

Also, improved energy efficiency should not come at the cost of decreased performance in the traditional sense. Thus, metrics like job finishing time and service level agreements (SLAs) remain relevant. We believe that energy efficiency should be an essential part of system performance, and energy efficiency metrics should complement traditional performance metrics instead of completely replacing them. The focus should not be on whether job finishing time or energy consumption is more important. Rather, we should be asking questions like whether we can use additional energy or power to buy improved finishing time or a better SLA, or whether we can conserve energy or power without harming finishing time or SLA. As we will discover later, for most systems, decreasing energy consumption is equivalent to decreasing finishing time.

To summarize, we believe that a bundle of metrics should be used, because we are not yet at the stage where we can devise a single metric that unifies the different system design priorities. We begin by analyzing the metrics that are easiest to collect: finishing time, energy, power, and energy per unit of work.
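To make the metric bundle concrete, the sketch below derives all four metrics from a wall-socket power trace sampled at a fixed interval, in the spirit of the measurement setup of Section 4. The helper name and the approximation of energy as the sum of power samples times the sampling interval are ours, not taken from the report.

```java
public final class EfficiencyMetrics {
  /**
   * Derive the metric bundle from a power trace.
   *
   * @param powerWatts  real-power samples from the wall socket, in watts
   * @param sampleSecs  sampling interval in seconds (1.0 for a 1 Hz meter)
   * @param unitsOfWork useful work done, e.g. records sorted or bytes written
   */
  public static void report(double[] powerWatts, double sampleSecs, long unitsOfWork) {
    double durationSecs = powerWatts.length * sampleSecs;

    // Energy is approximated by summing power samples times the sampling interval.
    double energyJoules = 0.0;
    for (double p : powerWatts) {
      energyJoules += p * sampleSecs;
    }

    double avgPowerWatts = energyJoules / durationSecs;
    double joulesPerUnit = energyJoules / unitsOfWork;
    double unitsPerJoule = unitsOfWork / energyJoules;

    System.out.printf("duration = %.0f s, energy = %.0f J, avg power = %.0f W%n",
        durationSecs, energyJoules, avgPowerWatts);
    System.out.printf("%.4f J per unit of work (%.1f units per joule)%n",
        joulesPerUnit, unitsPerJoule);
  }
}
```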

3.2 Parameters
In talking about the energy efficiency of software systems like MapReduce, it is also important to identify the parameters of the system that we can change. We believe there are several different types of parameters that we can optimize.

The easiest parameters to change are static software configuration parameters. For MapReduce, such parameters include the number of workers and HDFS data nodes, as well as the many Hadoop-specific parameters found in hadoop-default.xml. We believe the most interesting parameters are IO parameters, including HDFS block size and HDFS replication factor, because they directly impact the performance of the IO path. Since MapReduce is a data-centric paradigm, a sub-optimal IO path is likely to impact all performance metrics for all workloads.
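As an illustration, both IO parameters can also be overridden per job through the job configuration object; the sketch below assumes the key names exposed by the hadoop-default.xml of this era (dfs.replication and dfs.block.size) and uses the stock identity mapper and reducer to form an essentially IO-only job. The chosen values are illustrative, not the settings used in our experiments.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class TunedIdentityJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TunedIdentityJob.class);
    conf.setJobName("io-tuned-identity-job");

    // Lower replication for this job's output files (fewer replica writes on the IO path).
    conf.setInt("dfs.replication", 2);
    // Use larger HDFS blocks, here 128 MB instead of the 64 MB default of this era.
    conf.setLong("dfs.block.size", 128L * 1024 * 1024);

    // Identity map and reduce: the job does no computation, only read, shuffle, and write.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
```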

We must add a caveat that a key function of HDFS is to ensure fault tolerance. Changing the HDFS replication factor changes the resistance to faults, and decreasing the HDFS replication factor to 1 implies that the failure of any node would result in data loss. Thus, an interesting question is how much energy we can save if we are willing to compromise on data persistence. We provide data to quantify this trade-off.

Another set of parameters are dynamic software execution parameters. For MapReduce, such parameters include the number of map and reduce tasks, different scheduling algorithms, different HDFS block placement schemes, and different speculative execution models. Aside from the number of map and reduce tasks, these parameters are usually built directly into the software architecture, and optimizing them often requires more extensive code modifications. We do not implement or evaluate any software execution optimizations. However, our findings suggest several opportunities where such optimizations have good potential.

Last but not least, software requires hardware to run, thus hardware parameters are relevant. One can object that hardware is difficult to change, but we have seen that datacenter operators can and do customize their server hardware [24]. One can further object that hardware configurations are not a part of the software optimization problem, but different types of software may have different CPU/IO tradeoffs, or be biased towards different types of IO. Thus, understanding the software-hardware interaction is vital to understanding software energy efficiency. We believe the most important parameters are the choice of CPU, the speed of the various IO paths (disk, memory, network), and the size of memory. We have yet to gather data for different hardware.

3.3 Workload
There are two properties of a good workload for measuring datacenter software energy efficiency. First, the workload should be representative of actual production workloads. Second, the workload should exercise all parts of the system. In particular, the workload should exercise all major parts of the IO path, all common aspects of various scheduling/placement algorithms, and all common combinations of IO/CPU biases. Given these criteria, it is unlikely that a single type of workload can meet all of them, and a bundle of workloads is more realistic. MapReduce is a good starting point because it is representative of large, data-intensive computations. As outlined in Section 2.1, MapReduce powers production clusters at many leading datacenter operators. It also exercises all parts of the computation system by generating a mixture of CPU, disk, and network loads.

The most realistic insights would come from measuring production workloads on production clusters. Such measurements would reflect the effects of real-life utilization cycles, unpredictable failures, background workloads, administration interruptions, and the like. However, to understand the effects of various system design parameters, we believe we should first perform measurements in a controlled and relatively simple environment. In particular, the background CPU, disk, and network loads, and the level of machine virtualization should all be controlled. The simplest environment exercises the workload with no background load and no virtualization. In such an environment, the effect of software system design parameters is not distorted by complexities in the experimental setup.

3.4 Energy Measurement
In measuring software energy consumption, an important task is to define the system boundary. There are various granularities at which we can draw boundaries for the system. We believe the most useful and most convenient boundary is at the wall power socket. This way, the power-energy measurements capture the holistic system performance, including any components that may be idle and drawing wasted power. In particular, we can measure the baseline energy consumption by leaving the system on but idle.

The MapReduce master and all workers should be measured. Ideally, the network switch should be measured as well. However, monitoring the network switch on a shared cluster exposes our readings to interference from other network intensive jobs on the cluster. Therefore, we suggest that unless the cluster can be isolated, the network switch should not be monitored. This is purely prioritizing the quality of data over the comprehensiveness of data.

As we will discuss in more detail later, including the master in power-energy measurements distorts data trends in small clusters. We have only a small cluster of one master and up to twelve workers. Thus, for most of our experiments, we will in fact measure the energy consumed by workers only. Again, this is done to prioritize data quality over comprehensiveness.

We do not believe that power and energy measurements should include physical systems such as air conditioning, lighting, power distribution, and the like. The contributions from those systems are important, but it is much harder to draw a causal relationship, or even a correlation, between the power and energy consumed in physical systems and the choice of software design parameters. Also, the infrastructure to facilitate accurate measurement of physical systems is often not in place. Furthermore, with PUE values of around 1.2 [18, 22], the value of improving software energy efficiency is much greater than the value of improving physical systems.

Last but not least, power measurements should record real power and not apparent power. In AC systems, apparent power includes the energy flow caused by capacitive and inductive circuit elements. Such cyclic power flows return to the power source at the end of each cycle, resulting in net zero energy transfer. Power flows of this type are not available for doing any work, and thus should be excluded from power-energy measurements. Real power measures only the non-zero net power transfer that is available for doing useful work.

4. METHODOLOGY
Guided by the experimental philosophy in Section 3, we designed our experiments as described below.

Our cluster includes up to 13 machines, each a Sun Microsystems X2200 M2 with two dual-core AMD Opteron 2214 processors at 2.2GHz, 3.80GB RAM, and a 250GB SATA drive, running Linux 2.6.26-1-xen-amd64. Machines in the cluster are connected to each other via 1Gbps Ethernet through a single switch.


We use Hadoop distribution 0.18.2 running on Java 6.0, update 11. Hadoop 0.18.2 was the latest distribution when we began our experiments, and to ensure data comparability, we decided against switching to newer distributions. For our Nutch experiments, we used Hadoop distribution 0.19.1, which comes prepackaged with Nutch. Unless otherwise noted, we used default, out of the box configuration parameters for all systems.

We run Hadoop in a controlled environment, motivated by the reasons mentioned in Section 3.3. The machines we use are part of a shared cluster, so we could not guarantee that there were no background tasks on the machines we used, nor that there was no background network traffic. However, we performed our experiments during periods of low or no background load on the cluster, and closely monitored the CPU, disk, and network load during our experiments. When we detect any activity not due to Hadoop, we stop data collection and repeat the measurement at a later time. We run Hadoop with no virtualization.

Our Hadoop jobs include sort, an IO-bound task that exercises all IO paths and is comparable to the existing JouleSort benchmark. We used the standard terasort format of 100-byte records with 10-byte keys. We also developed our own artificial workloads to try to isolate different parts of the IO system: Hadoop jobs that perform a configurable amount of HDFS read, HDFS write, and data shuffle. In addition, we ran Nutch, a web crawler that is representative of common workloads at leading Internet search engine companies [9].

For energy measurement, we used a Brand Electronics Model 21-1850/CI power meter. It has 1W power resolution with logging capabilities, and we set it to sample at 1Hz. It is attached to the wall socket of one of the machines in our cluster. We do not yet have enough power meters to monitor all machines in the cluster, nor can we monitor the network switch. We make the methodological simplification that because the machines in the cluster are homogeneous, we expect the power consumption of each machine to be homogeneous also. We take multiple measurements. Thus, given the randomized nature of the Hadoop scheduling and HDFS block placement algorithms, the variation in the performance of the machines should be captured in the experimental confidence intervals.

We did not measure detailed CPU, memory, disk, and network utilization during our workloads. Collecting such data and correlating it with software parameters is the subject of future work.

Unless otherwise noted, all data points we present come from five repeated readings, and the error bars represent 95% confidence intervals.
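For reference, a 95% confidence interval over five repeated readings is conventionally a Student-t interval on the sample mean (t ≈ 2.776 for 4 degrees of freedom); the sketch below shows that calculation on made-up readings. The report does not spell out its exact procedure, so this is an assumption about standard practice.

```java
public final class ConfidenceInterval {
  // Two-sided 95% Student-t critical value for n = 5 readings (4 degrees of freedom).
  private static final double T_95_DF4 = 2.776;

  /** Returns {mean, halfWidth} for a 95% confidence interval on the mean of 5 readings. */
  public static double[] mean95(double[] samples) {
    int n = samples.length;
    double sum = 0.0;
    for (double s : samples) sum += s;
    double mean = sum / n;

    double sq = 0.0;
    for (double s : samples) sq += (s - mean) * (s - mean);
    double stdDev = Math.sqrt(sq / (n - 1));          // sample standard deviation
    double halfWidth = T_95_DF4 * stdDev / Math.sqrt(n);
    return new double[] { mean, halfWidth };
  }

  public static void main(String[] args) {
    double[] durationsSecs = { 430, 441, 436, 428, 445 };   // hypothetical repeated readings
    double[] ci = mean95(durationsSecs);
    System.out.printf("duration = %.0f s ± %.0f s (95%% confidence)%n", ci[0], ci[1]);
  }
}
```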

5. RESULTS
In this section, we characterize Hadoop energy efficiency from several perspectives. We begin by measuring the scalability behavior of sort and the Nutch web crawler, as well as artificial workloads constructed to stress different data paths. We then examine the effect of changing static Hadoop configuration parameters, including HDFS replication and block size. We also investigate whether the input size of jobs impacts performance.

5.1 Scalable Benchmark
Our first set of experiments involves changing the number of workers only. We ran sort and Nutch. These experiments investigate whether adding more workers improves the performance of each workload. Each worker runs MapReduce tasks and acts as an HDFS data node. We have a dedicated machine that acts as the master.

5.1.1 Sort
Figure 1 shows the performance of sort. The input data size is fixed at 10GB. There are several interesting features. Job duration, or finishing time, decreases with an increased number of workers. Total energy decreases with the number of workers, and, as expected, total power increases linearly with the number of workers.

Figure 2 shows the power per node for both workers and the master. The power remains nearly constant as we increase the number of workers in the cluster. We also added a data point for zero workers, measuring the average idle power of the machine. We see from the graphs that the average power when running a job is only marginally higher than the idle power. This echoes the finding in [12] that most systems today are not energy proportional.

Our results for sort invite the question of whether one can reduce energy consumption simply by adding more workers. We believe that this is not the case. Our results are somewhat distorted by the fact that we have included the energy consumption of the master in our measurements. We see from Figure 2 that the power consumed by the master is relatively fixed. So as we increase the number of workers, the relative power consumption of the master is "amortized" over the workers. To illustrate this point, we re-display in Figure 3 the data from Figure 1, with the contribution of the master subtracted away.

The job duration graph remains unchanged. However, we see that the total energy is much more of a flat line rather than a pronounced downward trend. We can reason that the one-worker case should consume less energy because it does no network data transfer. The total power is still a linear line, shifted downwards vertically from the line in Figure 1 by roughly 200W, i.e., the power consumption of a single machine.

When the energy consumption is relatively constant for a constant workload size, we can say that the workload has been designed to scale to more workers. For sort, this is the case.

In terms of energy per unit of work, Hadoop sort with default configuration parameters achieves around 100 sorted records per joule. For comparison, the current JouleSort benchmark winner can do 11,000 sorted records per joule. Default Hadoop is two orders of magnitude behind.
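The records-per-joule figure can be sanity-checked from numbers reported elsewhere in this report: 10GB of 100-byte records is 10^8 records, and approximating energy as duration times average active power for the 12-worker sort measurements quoted in Section 6.2 lands in the same ballpark. The sketch below redoes that arithmetic; the energy approximation is ours.

```java
public class RecordsPerJoule {
  public static void main(String[] args) {
    // 10 GB of terasort-format data at 100 bytes per record.
    long records = 10L * 1000 * 1000 * 1000 / 100;          // 1e8 records

    // 12-worker sort measurements quoted in Section 6.2.
    double durationSecs = 436;                               // D(12)
    double activePowerWatts = 2611;                          // Pa(12)

    // Approximate energy as duration times average active power.
    double energyJoules = durationSecs * activePowerWatts;   // ~1.14e6 J
    double recordsPerJoule = records / energyJoules;         // ~88 records per joule

    System.out.printf("~%.0f sorted records per joule (vs. ~11,000 for the JouleSort winner)%n",
        recordsPerJoule);
  }
}
```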

5.1.2 Nutch
Nutch is a web crawler and indexer.


Figure 1: Sort performance

Figure 2: Sort performance - power per machine

Figure 3: Sort performance - workers only


At its core, Nutch performs a loop with several iterations. It begins by injecting URLs into a crawl database in order to bootstrap it. It then generates a set of URLs to fetch, fetches the content at those URLs, parses the content, updates the crawl database with new URLs, inverts links from the content on those URLs, and finally indexes the URL content before beginning another iteration.

For our experiments, we bootstrap Nutch to crawl and index URLs anchored at www.berkeley.edu, going up to a depth of 7 and following the top 2000 links on each page. Our job duration, energy, and power measurements are in Figure 4.

These results were very surprising to us. They mean that Nutch, as we have run it, is not scalable to a large number of workers. The job still finishes, but using more workers gives no benefit in job duration, energy, or power. So we might as well always run Nutch on a single node.

It would be easy to dismiss these results either as a misconfiguration on our part, or as poor design on Nutch's part. As we will show later, the results are likely due to our crawl and index dataset not being large enough to take advantage of the full potential for parallelism offered by Nutch. For operators running workloads similar to Nutch but with significantly larger crawl and index datasets, it would be very valuable to repeat our experiments and investigate whether our observations are distorted by the small dataset, or whether they reveal design limitations in Nutch.

5.2 Isolated IO Tasks
We are unsatisfied with the somewhat "black-box" perspective that we initially took while looking at the sort and Nutch results. We would like to develop a general way in which we can isolate each part of a MapReduce/Hadoop job. The compute part of each MapReduce/Hadoop job is very job-specific and hard to generalize. However, the IO operations are common across all MapReduce/Hadoop jobs. Therefore, we devised three workloads to stress each major part of the MapReduce/Hadoop IO path:

HDFS read: Stressing the HDFS disk reads necessary before a Map task can begin.

Shuffle: Stressing the network, memory, and, if necessary, the disk used to shuffle data between Map tasks and Reduce tasks.

HDFS write: Stressing the HDFS disk writes necessary after a Reduce task completes.

Each of these workloads stresses one part of the IO path and does nothing else. They are all modified from the sort and randomwriter examples that come pre-packaged with recent Hadoop distributions. The measurements for the workloads are in Figure 5. To ensure comparability with our earlier sort results, we set the jobs to read, shuffle, or write 10GB of data using the terasort record format.
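The isolation jobs themselves are modifications of the stock Hadoop examples, as noted above. For a rough feel of what stressing only the HDFS write path involves, the standalone sketch below writes synthetic data straight to HDFS through the FileSystem API with an explicit replication level and block size; it is an illustrative stand-in, not the harness used for our measurements.

```java
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteStress {
  public static void main(String[] args) throws Exception {
    long bytesToWrite = 10L * 1024 * 1024 * 1024;   // 10 GB of synthetic data
    short replication = 3;                          // HDFS replication level under test
    long blockSize = 64L * 1024 * 1024;             // HDFS block size under test
    int bufferSize = 64 * 1024;

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/bench/hdfs-write-stress");

    byte[] buf = new byte[bufferSize];
    new Random(42).nextBytes(buf);

    long start = System.currentTimeMillis();
    FSDataOutputStream stream = fs.create(out, true, bufferSize, replication, blockSize);
    try {
      for (long written = 0; written < bytesToWrite; written += buf.length) {
        stream.write(buf);                          // exercises only the HDFS write path
      }
    } finally {
      stream.close();
    }
    System.out.printf("wrote %d bytes in %.1f s%n",
        bytesToWrite, (System.currentTimeMillis() - start) / 1000.0);
  }
}
```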

The most surprising result is that increasing the number of workers does not decrease the job duration of HDFS write, even though HDFS read and shuffle both run faster with more workers. HDFS write exhibits great variation in job duration, mirrored in the great variation in the energy consumed for HDFS write. HDFS write also consumes more energy at a large number of workers. For HDFS read and shuffle, the energy is relatively constant.

To more convincingly display the relative scalability behavior of the different isolation workloads, we constructed Figure 6, showing the duration and energy contributions of each of HDFS read, write, and shuffle as a fraction of the sum, e.g., for the fraction that HDFS read contributes to the duration,

Fraction(hdfsRead) = t_hdfsRead / (t_hdfsRead + t_shuffle + t_hdfsWrite)

where t_TASK is the duration of TASK.

In Figure 6, the job duration and energy fractions track each other. If all three jobs scaled in a similar fashion, the relative fractions would remain fixed as we increase the number of workers. We see that instead HDFS write contributes a larger and larger fraction of the duration and energy relative to the total sum. In other words, HDFS write does not scale to more workers at the same rate as HDFS read and shuffle.

We demonstrate in Section 5.3 that the poor scalability and great variation in HDFS write performance disappear when we reduce the HDFS replication level.

5.3 HDFS Replication
We want to investigate whether decreasing HDFS replication brings a clear benefit in terms of time efficiency and energy efficiency. We repeat our sort and isolation jobs using the same input size and configuration parameters, changing only the HDFS replication level. We run the experiments with 12 workers and a fixed data size of 10GB. Our results are in Figure 7.

The total power remains identical across different HDFS replication levels for all jobs, a result we expected. Thus, any energy savings come entirely from decreases in job duration. Indeed, there is a strong correlation between the shapes of the duration and energy graphs. In Section 6.4, we have a more extensive discussion of the tight correlation between energy and job duration.

The measurements show that decreasing the HDFS replication level brings negligible benefit for HDFS read and shuffle, but a clear benefit for HDFS write, both in terms of average performance and variability in performance. We can explain this observation by the way HDFS replicates blocks. When a block is written to HDFS, the machine doing the local write begins writing to disk immediately, and tells a different HDFS data node to begin writing the second replica. The second machine then begins writing, and tells a third machine to begin writing the third replica. The chain continues until HDFS has produced the required number of replicas. When each of the replica data nodes finishes writing, it notifies the data node that initially requested the replicas. The HDFS write call returns only when all replica data nodes have reported completion. When we decrease the replication level, the chain is "shorter", and both the average and the variation in finishing time decrease.


Figure 4: Nutch performance - workers only

Figure 5: Isolation jobs performance - workers only


Figure 6: Isolation jobs performance as relative fractions of sum of the jobs

Thus, we believe the HDFS replication mechanism is responsible for the poor scalability of HDFS write shown by the data in Section 5.2. In Section 6.3, we present a more detailed quantitative discussion of how to identify the optimal HDFS replication level.

The data point for 4-fold HDFS replication is off the trend for HDFS write and HDFS read. HDFS replication levels above 3 are usually not used, because they impose additional storage space overhead for negligible additional benefit in storage failure tolerance. We have yet to pinpoint why 4-fold HDFS replication exhibits behavior different from the other replication levels.

In Figures 8 and 9, we reproduce the duration and energy fraction graphs using data for HDFS replication levels of 2 and 1, respectively. Recall that in Figure 6, HDFS write is the dominant fraction at a large number of workers. In Figures 8 and 9, shuffle is the dominant fraction and the scaling bottleneck. This means that if an HDFS replication level lower than 3 is acceptable, then the greatest opportunity for further improvement is in the shuffle part of the IO path.

5.4 HDFS Block Size
Another HDFS parameter that can be easily changed is the HDFS block size. Different HDFS block sizes also have an impact on the performance of various workloads. We repeated our experiments with 12 workers, 10GB input data size, and the default 3-fold HDFS replication, varying the HDFS block size from 16MB to 1GB. Our results are in Figure 10.

The measurements indicate that a large HDFS block size brings a clear benefit for the HDFS read and write workloads, but negligible benefit for shuffle. Thus, for jobs that have a large amount of HDFS input/output data, such as sort, the HDFS block size should increase appropriately with the workload size.

5.5 Input Size
Our observations with regard to Nutch in Section 5.1.2 lead us to suspect that the size of a workload may impact energy efficiency. The inescapable overhead in MapReduce/Hadoop is amortized across large jobs, and prevents small jobs from scaling. To investigate the trade-off associated with workload size, we repeat the sort workload for input data sizes of 20GB, 5GB, 1GB, 500MB, and 100MB. To ensure comparability with our earlier sort results for 10GB, we run Hadoop with default parameters, in particular with an HDFS replication level of three. The results are in Figure 11. We omit the power results because we expect power to be independent of workload size, and the experimental measurements confirm this expectation.

The data is somewhat hard to read, because the input size varies across two orders of magnitude, and the duration and energy are expected to vary across the same scale. To make the results more readable, we normalize the duration and energy, such that the duration and energy of each data point are expressed as a fraction of the maximum duration and energy for that particular input size. This gives us a series of lines, one per input size, each bounded above by 1, allowing us to more easily identify trends in the data. The normalized graphs are in Figure 12.

In the normalized energy curves, we can see the effect of the inherent MapReduce/Hadoop overhead. The lines for 20GB and 10GB are generally constant: the increase in parallelism overrides the effect of the inherent overhead. The line for 5GB is constant up to 4 workers and increases thereafter: the increase in parallelism dominates until the work done on each worker becomes comparable to, and then negligible relative to, the overhead, after which the energy consumed increases. The 1GB and 500MB lines display similar behavior. For 100MB, the overhead dominates, i.e., the job takes nearly the same amount of time to finish regardless of the number of workers, and the energy consumed increases steadily.

Figure 13 shows sorted records per joule for the different input sizes. There are several noteworthy points. For small input sizes, we can increase the records sorted per joule by reducing the number of workers. This suggests that the overhead of coordinating workers and distributing data to the workers is a bottleneck in the system. For larger input sizes, the records sorted per joule is constant relative to the number of workers. This suggests that the overhead associated with many workers is balanced out by the increased parallelism for large input sizes. However, the records sorted per joule for large input sizes is still less than that for small input sizes with few workers. This suggests that the data locality associated with fewer workers is more important than the increased parallelism associated with many workers. Lastly, the curve for 100MB is significantly lower than the rest, suggesting that workloads involving many small jobs would benefit significantly from decreased overhead.


Figure 7: Sort and isolation jobs performance for different HDFS replication - workers only


Figure 8: Isolation jobs as relative fractions of sum of the jobs - HDFS replication of 2

Figure 9: Isolation jobs as relative fractions of sum of the jobs - HDFS replication of 1

Figure 13: Records sorted per joule for different input data size - workers only

5.6 Slow Nodes
We made an accidental discovery when monitoring HDFS block placement. One node in our cluster persistently receives a smaller number of assigned blocks. Since HDFS block placement is supposed to be random, this means that the node reports less frequently that it is ready to receive new blocks. To quantify this difference in block placement, we ran the HDFS write job for 10GB and 10 workers, with 20 repeated measurements. The average number of blocks written to each node is in the left graph of Figure 14, with the slow node highlighted. We normalize the block counts so that the amount of variation does not depend on the input data size. We see that node r32 is much slower, and the high end of its 95% confidence interval does not overlap with the low end of the 95% confidence interval of any other node.

We remove node r32 from the cluster and repeat the measurements. The block allocations are now more even, as evidenced by the right graph in Figure 14. Table 1 shows the job duration, energy, and power comparison before and after the slow node is removed. All metrics improve when the slow node is removed. Thus, it is important to identify slow nodes in the cluster and deal with them appropriately. We can get a performance improvement simply by removing slow nodes.
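The check we performed by hand can be automated along the following lines: compute a 95% confidence interval of the normalized block count for each node across repeated runs, and flag any node whose interval lies entirely below every other node's interval. The sketch below is our illustration of that rule, not part of the report's tooling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SlowNodeCheck {
  // Two-sided 95% Student-t critical value for 20 repeated runs (19 degrees of freedom).
  private static final double T_95_DF19 = 2.093;

  /** Returns {mean, halfWidth} of a 95% confidence interval over 20 repeated runs. */
  private static double[] mean95(double[] samples) {
    double sum = 0.0;
    for (double s : samples) sum += s;
    double mean = sum / samples.length;

    double sq = 0.0;
    for (double s : samples) sq += (s - mean) * (s - mean);
    double half = T_95_DF19 * Math.sqrt(sq / (samples.length - 1)) / Math.sqrt(samples.length);
    return new double[] { mean, half };
  }

  /**
   * @param normalizedBlocks map from node name to its normalized block counts across runs
   * @return nodes whose 95% CI upper bound lies below the CI lower bound of every other node
   */
  public static List<String> findSlowNodes(Map<String, double[]> normalizedBlocks) {
    List<String> slow = new ArrayList<String>();
    for (Map.Entry<String, double[]> candidate : normalizedBlocks.entrySet()) {
      double[] ci = mean95(candidate.getValue());
      double upper = ci[0] + ci[1];

      boolean belowAllOthers = true;
      for (Map.Entry<String, double[]> other : normalizedBlocks.entrySet()) {
        if (other.getKey().equals(candidate.getKey())) continue;
        double[] otherCi = mean95(other.getValue());
        if (upper >= otherCi[0] - otherCi[1]) {   // intervals overlap: not conclusively slower
          belowAllOthers = false;
          break;
        }
      }
      if (belowAllOthers) slow.add(candidate.getKey());
    }
    return slow;
  }
}
```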

6. QUANTITATIVE MODELS
In this section, we provide quantitative models to answer several questions critical to the efficient operation of a Hadoop cluster. We show that we can predict the scalability trends in the IO performance of a job from the amount of its input read, output write, and intermediate shuffle data. We also demonstrate that we can provision cluster size in a scientific manner based on expected workload characteristics. We further provide quantitative guidance on optimizing the level of HDFS replication without compromising reliability. Lastly, we explore the relationship between time efficiency and energy efficiency, and argue that under simplified conditions, the two are equivalent.

6.1 Predicting Energy Consumption
From our results, we noticed that a good way to predict the performance of a sort job is to add the performance of the HDFS read, shuffle, and HDFS write jobs of the same workload and cluster size.


Figure 10: Sort and isolation jobs performance for different HDFS block size - workers only

Table 1: Performance improvement of removing a slow node

Experiment  | Duration (s) | ±95% CI | Energy (J) | ±95% CI | Records / J | Power (W) | ±95% CI
With r32    | 388          | 73.7    | 827039     | 157170  | 129.8       | 2134      | 1.8
Without r32 | 301          | 81.2    | 648813     | 174540  | 165.4       | 1941      | 5.6


Figure 11: Sort performance with different data input size - workers only

Figure 12: Normalized sort performance with different data input size - workers only

Figure 14: HDFS block placement for HDFS write with slow node (left) and with slow node removed (right)


That is,

t_sort ≈ t_hdfsRead + t_shuffle + t_hdfsWrite
E_sort ≈ E_hdfsRead + E_shuffle + E_hdfsWrite
P_sort ≈ average[P_hdfsRead, P_shuffle, P_hdfsWrite]

where t, E, and P are respectively the time, energy, and power for each part of the data path, and the average for power is a weighted average with the weights determined by the relative durations of each part of the data path. We applied this model to data for HDFS replication levels of 3, 2, and 1, and the results are in Figures 15, 16, and 17.

This simple model is not exact, and in some cases there are significant differences between the predicted and measured performance. In particular, the model assumes no overlap in the execution of different IO tasks. Despite this, the model is effective at predicting trends. The reason is that sort does no computation at all, and all the work done in a sort job is exactly the sum of HDFS read, shuffle, and HDFS write. Over-predictions occur because IO tasks overlap, which the model does not account for. Under-predictions occur because there is coordination overhead in MapReduce, also unaccounted for in the model. The overall trend, however, is reflected in the predictions.
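Restated in code, the model simply adds the component durations and energies and takes a duration-weighted average of the component powers; the class and field names below are ours.

```java
public class IoPrediction {
  final double durationSecs;
  final double energyJoules;
  final double powerWatts;

  IoPrediction(double durationSecs, double energyJoules, double powerWatts) {
    this.durationSecs = durationSecs;
    this.energyJoules = energyJoules;
    this.powerWatts = powerWatts;
  }

  /** Predict a job's IO performance from its isolated read, shuffle, and write measurements. */
  static IoPrediction predict(IoPrediction hdfsRead, IoPrediction shuffle, IoPrediction hdfsWrite) {
    double duration = hdfsRead.durationSecs + shuffle.durationSecs + hdfsWrite.durationSecs;
    double energy = hdfsRead.energyJoules + shuffle.energyJoules + hdfsWrite.energyJoules;

    // Duration-weighted average of the component powers.
    double power = (hdfsRead.powerWatts * hdfsRead.durationSecs
                  + shuffle.powerWatts * shuffle.durationSecs
                  + hdfsWrite.powerWatts * hdfsWrite.durationSecs) / duration;

    return new IoPrediction(duration, energy, power);
  }
}
```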

For MapReduce/Hadoop jobs in general, if we know the amount of input, shuffle, and output data, we should be able to use this method to predict the duration, energy, and power trends associated with the IO parts of a MapReduce/Hadoop job, as below.

t_IO ≈ t_hdfsRead + t_shuffle + t_hdfsWrite
E_IO ≈ E_hdfsRead + E_shuffle + E_hdfsWrite
P_IO ≈ average[P_hdfsRead, P_shuffle, P_hdfsWrite]

For MapReduce jobs with straightforward map and reduce functions, where the amount of computation is proportional to the amount of data, the sums and average above would include an additional term representing the non-IO compute part of the job.

6.2 Provisioning Machines
A question that frequently faces Hadoop operators is how many machines to allocate to a cluster. Alternatively, for the Hadoop task scheduler, the question is how many machines to allocate to a particular job. Our measurements suggest a way to answer this question. We illustrate the method with a simplified example.

Suppose we have a cluster dedicated to running sort. The sort jobs have relatively constant sizes, and the jobs arrive at random times with a known average interval. Such job size and arrival characteristics are representative of analytic workloads that process data of regular sizes at regular intervals. The question to answer is how many machines to assign to the MapReduce/Hadoop cluster.

Job duration would be an irrelevant metric, unless there are constraints on workload response times. Power consumption would also be secondary, unless there are power constraints based on cooling limits and power provisioning ceilings. The primary metric in our example should be energy per job. We do not turn off the system when it is idle, because the jobs arrive at random times, even though the average rate of arrival is known. Thus, the energy consumed per job should include the energy consumed while the system is idle. We define the following variables and functions:

N = number of workers

D(N) = job duration, a function of N

Pa(N) = power when active, a function of N

Pi = power when idle, should be near constant

T = average job arrival interval

E(N) = energy consumed per job, a function of N

Using these functions and variables, we can express the expected energy per job as

E(N) = D(N)Pa(N) + (T − D(N))Pi

This is a straightforward linear sum: job duration multiplied by active power, plus idle duration multiplied by idle power. We optimize E(N) over the range of N such that D(N) ≤ T.

To give a numerical example using actual data, suppose the sort jobs are 10GB in terasort format, and T is 1000 seconds. Taking the data points in Figure 1, we compare the choices N = 8 and N = 12. The data for 8 and 12 workers are

D(8) = 661 s ± 15 s (95% confidence)
D(12) = 436 s ± 25 s (95% confidence)
Pa(8) = 1739 W ± 5 W (95% confidence)
Pa(12) = 2611 W ± 18 W (95% confidence)
Pi = 203 W ± 1 W (95% confidence, per worker)

Substituting, we get

E(8) = (661 s)(1739 W) + (1000 s − 661 s)(203 W)
     = 1218 kJ per job ± 33 kJ
     = 0.338 kWHr per job ± 0.009 kWHr

E(12) = (436 s)(2611 W) + (1000 s − 436 s)(203 W)
      = 1252 kJ per job ± 76 kJ
      = 0.348 kWHr per job ± 0.021 kWHr

Since E(8) < E(12), we should provision 8 machines for this particular job stream, even though it takes longer to finish any single job.
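The provisioning arithmetic is easy to script once D(N), Pa(N), Pi, and T are known. The sketch below reproduces the comparison above, plugging in the idle power value exactly as it is used in the calculation above.

```java
public class ProvisioningModel {
  /** Expected energy per job (joules): active period plus the idle gap until the next arrival. */
  static double energyPerJob(double durationSecs, double activePowerWatts,
                             double idlePowerWatts, double arrivalIntervalSecs) {
    return durationSecs * activePowerWatts
         + (arrivalIntervalSecs - durationSecs) * idlePowerWatts;
  }

  public static void main(String[] args) {
    double T = 1000;            // average job arrival interval (s)
    double idleWatts = 203;     // idle power value as used in the calculation above

    double e8  = energyPerJob(661, 1739, idleWatts, T);    // ~1.218e6 J = 0.338 kWHr
    double e12 = energyPerJob(436, 2611, idleWatts, T);    // ~1.253e6 J = 0.348 kWHr

    System.out.printf("E(8)  = %.0f kJ (%.3f kWHr)%n", e8 / 1e3, e8 / 3.6e6);
    System.out.printf("E(12) = %.0f kJ (%.3f kWHr)%n", e12 / 1e3, e12 / 3.6e6);
    System.out.println(e8 < e12 ? "Provision 8 workers." : "Provision 12 workers.");
  }
}
```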

We can quantify the potential savings of such optimizations. Suppose that we lack the information necessary to do the above calculation, and therefore provision 8 machines half the time and 12 machines half the time. This would mean an average of 0.05 kWHr saved for each job. For a job arrival rate of T = 1000 s per job, this translates to 1577 kWHr saved per year. At an average industrial electricity retail price of 7.01 cents per kWHr in 2008 [1], this means approximate annual savings of $110 per machine using well-informed, straightforward provisioning decisions.

The analysis process we illustrate here is extensible to workloads with different types of jobs, different job sizes, and different job arrival models. If the level of energy and financial savings here is representative, then for datacenters on the order of 100,000 servers, the aggregated savings are on the order of $11 million yearly.


Figure 15: Predicting sort performance - HDFS replication 3

Figure 16: Predicting sort performance - HDFS replication 2

Figure 17: Predicting sort performance - HDFS replication 1


6.3 Replication and Reliability
In light of the results in Section 5.3, we provide some methods to systematically optimize the HDFS replication level. For production Hadoop deployments, optimization of this type should be motivated by historical machine failure characteristics. In the absence of such published data, we focus on trends and trade-offs. To ensure the generality of our analysis, we consider two scenarios.

First, we will look at Hadoop deployments on top of a reliable, archival storage system. HDFS serves as the workspace of computation pipelines, and the archival storage system serves as the permanent repository for input and output data. Such deployments would be required if the Hadoop clusters are transient but the data needs to be always available. These deployments would also make sense for running Hadoop on bulk archival data, for which HDFS replication may be impossible due to storage capacity. Our lab uses a couple of Hadoop deployments of this type. On our local cluster, the archival storage is a networked drive backed by ZFS. Typically we copy input data to HDFS at the start of a computation process, and copy output data to ZFS at the end. We also have a deployment on Amazon's EC2 platform, where the archival storage is S3, and the Hadoop cluster runs on EC2 instances. We can either take S3 data and put it in HDFS, or use S3 directly in place of HDFS. In both deployments, we need the separate storage because the Hadoop cluster is transient. The optimization focus for these deployments is performance, since we can always recover from machine failures by re-copying the data from the separate reliable storage and repeating the computation.

Second, we will look at Hadoop deployments that have HDFS as their only storage system. These deployments typically represent persistent Hadoop clusters, and any machine failures that get past the HDFS replication mechanisms will result in data loss. The optimization focus for these deployments is preventing data loss without excessively harming performance.

Our analysis indicates that for Hadoop deployments on top of a reliable storage system, we should always run at low HDFS replication, even when the cost of staging data in and out of HDFS is considered. However, for Hadoop deployments that have HDFS as their only storage system, we should not lower the HDFS replication level, because doing so would significantly compromise data durability.

6.3.1 HDFS with separate reliable storage
We focus on a computation process with three stages: copy to HDFS from the reliable storage at the beginning, some computation tasks on Hadoop, and write to archival storage at the end. For simplicity, when machines in our Hadoop cluster fail and result in data loss in HDFS, we recover by restarting the computation process with input copied from the separate reliable storage. This recovery model would not apply if we use Hadoop to generate non-reproducible data such as logs, user interaction streams, etc.

In this scenario, we always want to run at lower HDFS replication, provided that we can use this recovery process to regenerate the lost data. The result is intuitive because the separate reliable storage makes it redundant for HDFS to achieve data durability through replication.

To derive this insight analytically, we introduce the following variables:

T_3fold = normal process duration with 3-fold replication
T_lower = normal process duration with lowered replication
T'_3fold = recovery time with 3-fold replication (at most 2 T_3fold for the recovery process described)
T'_lower = recovery time with lowered replication (at most 2 T_lower for the recovery process described)
E_3fold = normal energy consumed with 3-fold replication
E_lower = normal energy consumed with lowered replication
E'_3fold = recovery energy with 3-fold replication (at most 2 E_3fold for the recovery process described)
E'_lower = recovery energy with lowered replication (at most 2 E_lower for the recovery process described)
f_3fold = failure probability with 3-fold replication
f_lower = failure probability with lowered replication

The upper bounds for the recovery duration and energy istwice the normal duration and energy because for the sce-nario we described, recovery from failure is analogous toexecuting the job twice. Using these variables, the expectedprocess durations are

T_expected,3fold = (1 − f_3fold) T_3fold + f_3fold T'_3fold
                 = (1 − f_3fold) T_3fold + 2 f_3fold T_3fold
                 = (1 + f_3fold) T_3fold

T_expected,lower = (1 − f_lower) T_lower + f_lower T'_lower
                 = (1 − f_lower) T_lower + 2 f_lower T_lower
                 = (1 + f_lower) T_lower

Likewise, the expected energy consumption is

E_expected,3fold = (1 − f_3fold) E_3fold + f_3fold E'_3fold
                 = (1 − f_3fold) E_3fold + 2 f_3fold E_3fold
                 = (1 + f_3fold) E_3fold

E_expected,lower = (1 − f_lower) E_lower + f_lower E'_lower
                 = (1 − f_lower) E_lower + 2 f_lower E_lower
                 = (1 + f_lower) E_lower

The apparent performance-reliability tradeoff is that f_3fold < f_lower, but T_3fold > T_lower and E_3fold > E_lower. For most systems, failure probabilities are low. This means if we approximate

(1 + f_3fold) ≈ 1
(1 + f_lower) ≈ 1

then

T_expected,3fold ≈ T_3fold
T_expected,lower ≈ T_lower
E_expected,3fold ≈ E_3fold
E_expected,lower ≈ E_lower


In other words, if failure probabilities are small, T_3fold > T_lower, and E_3fold > E_lower, then we should always run at lowered HDFS replication.
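This expected-cost comparison is easy to check numerically. The following is a minimal Python sketch, not part of Hadoop; the function and all numbers are illustrative, with the lowered-replication speedup and the failure probabilities chosen as assumptions.

def expected_cost(normal, failure_prob):
    # Expected duration or energy when a failure forces the whole process
    # to be repeated once: (1 - f)*x + f*(2x) = (1 + f)*x.
    return (1.0 + failure_prob) * normal

# Illustrative values, normalized to the 3-fold case; the 15% improvement
# and the failure probabilities are assumptions, not measured data.
T_3fold, T_lower = 1.00, 0.85
f_3fold, f_lower = 0.001, 0.010

print(expected_cost(T_3fold, f_3fold))   # ~1.001
print(expected_cost(T_lower, f_lower))   # ~0.859, still well below the 3-fold case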

The cost of staging data in and out of HDFS has been implicitly accounted for in the analysis above. We can rewrite the process duration and energy consumed as

T_3fold = T_3fold,Hadoop + T_3fold,data staging
T_lower = T_lower,Hadoop + T_lower,data staging
E_3fold = E_3fold,Hadoop + E_3fold,data staging
E_lower = E_lower,Hadoop + E_lower,data staging

Thus, we can expand the final approximations as

T_expected,3fold ≈ T_3fold,Hadoop + T_3fold,data staging
T_expected,lower ≈ T_lower,Hadoop + T_lower,data staging
E_expected,3fold ≈ E_3fold,Hadoop + E_3fold,data staging
E_expected,lower ≈ E_lower,Hadoop + E_lower,data staging

Data staging involves two parts: reading from and writing to the separate storage system, which incurs a fixed cost regardless of HDFS replication level, and writing to and reading from HDFS, which incurs a lower or equal cost for lower HDFS replication levels, as indicated in Section 5.3. Hadoop jobs also involve two parts: computation of the map and reduce functions, which incurs a fixed cost regardless of HDFS replication level, and IO, which incurs a lower or equal cost for lower HDFS replication levels, also as indicated in Section 5.3. Thus we expect T_3fold > T_lower and E_3fold > E_lower, and our earlier conclusion already accounts for the cost of staging data in and out of HDFS.

To quantify the size of potential savings, let us consider a worst-case example with large cluster size, high failure probability, 1-fold replication, and marginal increase in performance from the reduced replication level. We can express f_lower as

f_lower = 1 − (1 − p)^N

where p is the probability of any single machine failing during a job, and N is the size of the cluster. This expression assumes that machine failures are independent. If they are highly dependent, then it does not make sense to have high replication, since the first failure will trigger failures of replicas. Let us assume N = 1000 for a large cluster and p = 0.0001, i.e., on average a machine fails every 10,000 jobs. For simplicity let us also assume T_lower = 0.85 T_3fold and E_lower = 0.85 E_3fold, i.e., a 15% improvement in duration and energy from lowered HDFS replication. Substituting, we get

f_lower = 1 − (1 − 0.0001)^1000 = 0.095 = 9.5%

i.e., we expect to repeat every 10th job because some machine has failed. Substituting again, we get

T_expected,lower = (1 + f_lower) T_lower = (1 + 0.095) · 0.85 T_3fold = 0.93 T_3fold

E_expected,lower = (1 + f_lower) E_lower = (1 + 0.095) · 0.85 E_3fold = 0.93 E_3fold

Thus, even in this artificial example where we assume failure characteristics that heavily favor increased HDFS replication, we still get a 7% benefit from operating at a lower replication. Note that there is a 7% benefit in increased value from reduced job duration, and a 7% benefit in decreased cost from reduced energy consumption. For a datacenter on the order of 100 MW and a retail price of electricity of 7.01 cents per kWh in 2008 [1], 7% savings are on the order of $5 million yearly.
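The arithmetic in this worst-case example can be reproduced in a few lines. This is a sketch; the facility power is an assumption used only to illustrate the order of magnitude of the savings.

# Worst-case example from the text: large cluster, pessimistic per-job
# machine failure probability, and only a 15% gain from lower replication.
p, N = 0.0001, 1000
f_lower = 1 - (1 - p) ** N               # ~0.095: roughly every 10th job repeats
cost_ratio = (1 + f_lower) * 0.85        # expected duration/energy vs. 3-fold
print(f"f_lower = {f_lower:.3f}, expected cost ratio = {cost_ratio:.2f}")

# Rough yearly savings; the facility power here is an assumed figure.
facility_kw = 100_000                    # an order-100 MW facility, expressed in kW
price_per_kwh = 0.0701                   # 2008 retail electricity price [1]
yearly_energy_cost = facility_kw * 24 * 365 * price_per_kwh
print(f"7% of yearly energy cost: ${0.07 * yearly_energy_cost:,.0f}")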

Our data from Section 5.3 shows that operating at 1-fold HDFS replication leads to T_lower ≈ 0.13 T_3fold and E_lower ≈ 0.13 E_3fold for no penalty in HDFS read and shuffle. So we expect the benefits in real deployments to be much larger than 7%.

Also, real machines are likely to have a mean time to failure longer than 10,000 jobs. When cluster size increases, the approximations involving f_3fold and f_lower become less exact, but not to an extent that would affect our conclusions. For example, if p = 0.00001, i.e., on average a machine fails every 100,000 jobs, then for N = 1000, representing a cluster of 1000 machines, we get

f_lower = 1 − (1 − 0.00001)^1000 ≈ 0.01 = 1%

For N = 10,000, representing a cluster ten times as large, we get

f_lower = 1 − (1 − 0.00001)^10000 ≈ 0.095 = 9.5%

The approximations

T_expected,lower ≈ T_lower
E_expected,lower ≈ E_lower

are less exact but still valid.

In short, when failure probabilities are low and HDFS is deployed in conjunction with a separate reliable storage system, it always makes sense to operate at 1-fold HDFS replication, and to recover from HDFS data losses by repeating the computation.

6.3.2 HDFS as primary storage
For Hadoop deployments that have HDFS as their only storage system, any machine failures that get past the HDFS replication mechanisms will result in data loss. Thus, data durability has priority over performance, and any improvements in performance should not come at the cost of increased risk of data loss. It turns out that when failure probabilities are constant, we need orders-of-magnitude decreases in cluster size for 2-fold replication to have the same fault resilience as 3-fold replication. For most deployments, making this change would not be worth the effort.

To illustrate, we use the variables from Section 6.3.1. Again, we assume that machine failures are independent. Otherwise, the benefit of replication is reduced, since one failure would be correlated with another.


We can express f_3fold as

f_3fold = 1 − (1 − p)^N − C(N,1) p (1 − p)^(N−1) − C(N,2) p^2 (1 − p)^(N−2)
        = 1 − (1 − p)^N − N p (1 − p)^(N−1) − [N(N − 1)/2] p^2 (1 − p)^(N−2)

where the C(N,n) are binomial coefficients. Each subtracted term corresponds to the probability of no losses, 1 loss, and 2 losses, respectively. Similarly, we can express f_2fold as

f_2fold = 1 − (1 − p)^N − C(N,1) p (1 − p)^(N−1)
        = 1 − (1 − p)^N − N p (1 − p)^(N−1)

In these expressions, the terms (1 − p)^(N−n) are the dominant terms while p is fixed, unless we change to different hardware. Thus, we can make f_2fold ≈ f_3fold if we decrease N, the size of the cluster. This makes intuitive sense because the smaller the cluster, the smaller the probability of simultaneous failures.

It turns out that with p fixed, the probability of a single loss is O(N) larger than the probability of no losses, and the probability of a double loss is O(N) larger than the probability of a single loss. So for large clusters, we need a huge decrease in N for 2-fold replication to have the same failure resilience as 3-fold replication.

To use a numerical example, let us assume p = 1 × 10^−6, i.e., a one-in-a-million chance of failure on a particular job, and N = 1000, representing a cluster with 1000 nodes. We find N' < N for which f_2fold ≈ f_3fold. For these values, f_3fold ≈ 1.66 × 10^−10 numerically. Equating this to f_2fold and solving for N', we find that at N' = 19, f_2fold ≈ 1.71 × 10^−10. Thus, when failure probabilities are constant, we need to decrease cluster size from 1000 to 19 for 2-fold replication to have the same fault resilience as 3-fold replication.
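A short numeric check of this comparison, written as a sketch; the helper function is ours and simply evaluates the binomial tail probabilities defined above, assuming independent machine failures.

from math import comb

def f_kfold(p, n, k):
    # Probability that k or more of n machines fail during a job, i.e. enough
    # simultaneous failures to defeat k-fold replication.
    return 1.0 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

p = 1e-6
target = f_kfold(p, 1000, 3)
print("f_3fold at N=1000:", target)                    # ~1.66e-10
for n_prime in (1000, 100, 19, 18):
    print("f_2fold at N'=%d:" % n_prime, f_kfold(p, n_prime, 2))
# f_2fold reaches the 3-fold target only when the cluster shrinks to
# roughly 19 nodes (~1.7e-10), matching the calculation above.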

A 1000-node cluster and a 19-node cluster represent two completely different operation modes. To achieve the same computational throughput, we would have to divide the 1000-node cluster into some 50 sub-clusters, and have each of them operate jobs with no data dependencies with jobs in other sub-clusters. This is often difficult to achieve. Even if it is achieved, the job duration would be orders of magnitude longer, and the coordination overhead would likely mitigate the marginal performance gains of using a lower HDFS replication.

In short, when HDFS is the primary storage system and data durability is important, the focus of improvement should be on doing 3-fold replication more efficiently instead of using a lower HDFS replication.

Figure 18: Power and work rate characteristics for energy proportional systems

6.4 Time Efficiency and Energy Efficiency
In various forms, we have alluded to the strong interplay between job duration, power, and energy. Since work rate is correlated with power, job duration is the inverse of work rate, and energy is the product of power and duration, one wonders whether reducing job duration is equivalent to reducing the energy consumed. We answer this question using some simple quantitative models.

We introduce the following variables and functions:

r        = resources used in a system, ranging from 0 to 1
R(r)     = work rate of the system, ranging from 0 to R_max
P(r)     = power of the system, ranging from 0 to P_max
r_l, r_u = lower and upper bounds of the operating region
W        = workload size
E(r)     = energy consumed at r

A fully idle system has r = 0, while a fully utilized system has r = 1. Note that usually r_l > 0 and r_u < 1. This reflects the fact that the system would always have some book-keeping tasks to do, and it is hard to drive the system to the full utilization of 1. In the operating region for our experiments, we observed P_max to be approximately 250 W.

For ideal, energy proportional systems, R(r) and P(r) are as below and in Figure 18.

R(r) = r R_max
P(r) = r P_max

So, for fixed workload sizes, the energy consumed is

E(r) = P(r) W / R(r) = (r P_max) W / (r R_max) = (W / R_max) P_max

That is, the energy consumed is independent of the level of utilization, and we can decrease job duration without additional energy costs. Thus, we should run our system at utilization level r_u, the upper bound of the operating region.

Realistic systems are not energy proportional. They are more appropriately described by

R(r) = r R_max
P(r) = P_idle + r (P_max − P_idle)

where P_idle is the idle power of the system, observed to be approximately 200 W in our experiments. In comparison, P_max is approximately 250 W. Graphically speaking, the behavior is as in Figure 19.


Figure 19: Power and work rate characteristics for systems that are not energy proportional

For such a system, the energy consumed at r is

E(r) = P(r) W / R(r) = W (P_idle + r (P_max − P_idle)) / (r R_max)
     = (W / R_max) (P_idle / r) + (W / R_max) (P_max − P_idle)

This is a decreasing function in r. So again, we should operate at r_u, the upper bound of the operating region.
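The contrast between the two models is easy to see numerically. The following sketch evaluates both E(r) expressions, using the approximate idle and peak powers observed in our experiments and an arbitrary normalization for workload size and work rate.

P_MAX, P_IDLE = 250.0, 200.0     # watts, approximate values from our experiments
R_MAX, W = 1.0, 1.0              # work rate and workload size, arbitrary normalization

def energy_ideal(r):
    # Energy for the ideal model: P(r) = r*P_MAX, R(r) = r*R_MAX.
    return (r * P_MAX) * W / (r * R_MAX)          # constant in r: W*P_MAX/R_MAX

def energy_realistic(r):
    # Energy for the realistic model with an idle power offset.
    return (P_IDLE + r * (P_MAX - P_IDLE)) * W / (r * R_MAX)

for r in (0.2, 0.5, 0.8, 1.0):
    print(f"r={r:.1f}  ideal={energy_ideal(r):6.1f}  realistic={energy_realistic(r):6.1f}")
# Ideal energy is flat in r; realistic energy falls as r rises, so both
# models favor operating at r_u, the upper bound of the operating region.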

The key result from our analysis is that if work rate is proportional to r, we should always run our workloads as fast as possible. We believe this analytical result to be highly significant, because it means that efforts to reduce energy consumption are equivalent to efforts to optimize for traditional performance metrics such as job duration. The prior work in system optimization does not need to be re-invented in light of energy efficiency. Our results are applicable to all real-life systems that match the quantitative models we have described.

An interesting side result is that if P(r) contains a piecewise linear region from r_1 to r_2 with slope greater than P_max, then it is always more energy efficient to operate at r_1, the lowest possible utilization. This is the direct opposite of our earlier result. We can intuitively understand why operating at r_1 makes sense here. As the slope of P(r) approaches infinity, i.e., as P(r) approaches a vertical line between r_1 and r_2, a marginal drop in utilization from r_2 to r_1 can decrease power consumption significantly while keeping the job duration near constant. In other words, we reduce energy consumption for free. To the best of our knowledge, no real-life systems exhibit this kind of power-utilization characteristic. But if one day such systems exist, then default operation should be at as low a utilization as possible, subject to job duration constraints.

One can object that power consumption is not linear. However, recent measurements in [14] showed that our models are quite sound. The authors used a multi-variable linear model to predict system power with a root mean square error of approximately 2 W for a 200 W system. The model included the power consumption of CPU, memory, disk, and network. The data also showed that for CPUs, if we take r to mean the number of instructions per second, then the system power consumption is exactly as shown in Figure 19: a linear power increase with a large idle power offset. Using a less detailed analysis, the authors also arrived at our conclusion that if work rate is proportional to resources used, we should always run our workloads as fast as possible.

Our own measurements from the previous sections both support the analysis here and reveal its limitations. For a MapReduce cluster, we can interpret r to mean the fraction of workers used. When the number of workers is constant, our model suggests that whatever techniques we use to decrease the job duration would also bring about decreased energy consumption. We find data supporting this tight connection between job duration and energy consumed in Figure 7 for different HDFS replication levels and Figure 10 for different HDFS block sizes.

In contrast, when the number of workers varies, r varies. Our model suggests that we should always use the maximum number of workers. However, our discussions of sort (Figure 3), nutch (Figure 4), slow nodes (Table 1), and intermittent job arrival rates (Section 6.2) contradict this result. The reason is that our model assumes that work rate is proportional to r. In the counter-examples we identified, the work rate is not proportional to the number of workers. This behavior is due to sub-optimal scaling in sort, a scaling bottleneck in nutch, a performance penalty for slow nodes, and a work rate capped by the job arrival rate.

If the work rate behavior is quantified, we still have a valid framework for looking at energy efficiency as the ratio of power and work rate, even if we have non-linear P(r) and R(r) functions. The behavior of E(r) would be more complicated, and we would have to optimize across the entire operating region from r_l to r_u.

We encounter further complications when we change the scope of the system under analysis. The meaning of r, P_max, and P_idle depends on whether we are talking about a single machine, a cluster, or an entire datacenter. The insights we get from the model would also be different. For example, when we are talking about a single machine, P_max would be the machine's power rating. Our model here highlights the importance of low-power hardware: when P_max is decreased, the entire curve for P(r) shifts vertically lower. When we are talking about a cluster, r would be the number of machines. Our model highlights the importance of system scalability and intelligent provisioning according to load; without them, the graph for R(r) would not be proportional. When we are talking about a datacenter, P_idle would be the total facility power excluding the IT equipment. Figure 19 highlights the importance of a low PUE: the closer the PUE is to 1, the lower the P_idle and the vertical intercept for P(r). The bottom line of our analysis here is that achieving traditional system design goals would benefit both time efficiency and energy efficiency.

7. RECOMMENDATIONS
Our findings are summarized in the following recommendations for MapReduce/Hadoop operators:

• Measure the duration, energy, and power of common MapReduce/Hadoop tasks to find out if they are scalable to more workers.

• Ensure that small-sized jobs are run on an appropriately small number of workers.

• Ensure that HDFS block sizes increase with increased data size.


• Consider using the decision process in Section 6.2 to provision the number of workers.

• If Hadoop is deployed concurrently with a reliable storage system separate from HDFS, then we should always operate with HDFS replication of 1, even when we incorporate the cost of data staging in and out of HDFS, per the analysis in Section 6.3.

• Measure the power-utilization and work rate-utilization characteristics for machines in the cluster, and use the framework in Section 6.4 to optimize the level of utilization.

• Continue to focus on all techniques for performance improvements, since energy efficiency is equivalent to time efficiency when work rate is proportional to the resources used.

For MapReduce/Hadoop designers, we make the following suggestions:

• Improve the HDFS replication process. The current "chain" of replication plays a significant part in the high variability in jobs involving large amounts of output. A lower HDFS replication level is not possible when data durability is important, per the analysis in Section 6.3.

• Decrease the job coordination overhead. As indicated in Section 5.5, overhead prevents small jobs from scaling to more workers.

• Improve data locality for jobs. As indicated in Section 5.5, increased data locality is as important as increased parallelism by adding more workers.

• Investigate ways to assign less work to slow nodes, or to remove them from the cluster when they impose more penalty than the speedup they contribute.

For researchers working on MapReduce/Hadoop energy efficiency, we propose the following work items:

• Repeat our experiments on clusters running different machines, in particular machines that are designed to be energy efficient, e.g., machines with Atom processors.

• Repeat our experiments on other MapReduce/Hadoop workloads, such as gridmix [5].

• Investigate the effect of changing other Hadoop parameters, in particular IO parameters like HDFS block size, number of parallel reduce copies, sort memory limit, and file buffer size.

• Compare the usefulness of MapReduce as a benchmark vis-à-vis established benchmarks such as SPECpower.

8. CLOSING REMARKS
We began our work with the goal of adding energy as another performance metric to consider in the design space for computer systems and Internet datacenters. We concluded our work with the thought that energy efficiency and traditional time efficiency are fundamentally different performance metrics for the same thing. Whatever that thing is, we do not yet have a well-articulated language to talk about it. We believe the natural continuation of our work is the development of MapReduce/Hadoop benchmarks, perhaps across a bundle of performance metrics at first. In the long term, it would be a tremendous contribution to develop a way to talk about computer system or Internet datacenter productivity that unifies the seemingly disparate concerns currently measured using different metrics. Taking a high-level view, the early single-machine SPEC benchmarks are the first step towards this goal, and the PUE metric is the second step, specific to Internet datacenters. We believe our work is making incremental progress towards understanding computer systems productivity in general. We encourage other researchers to continue in the same direction, whether they come from energy perspectives or from perspectives focused more on traditional performance.

9. ACKNOWLEDGMENTS
Thanks to the rest of the RAD Lab team for their frequent and insightful suggestions. Special thanks to our IT staff who provided tremendous help with cluster operation and power meter management.

10. REFERENCES
[1] Annual energy review 2008. Energy Information Administration, U.S. Department of Energy, June 2009.

[2] Energy Star enterprise server specification. United States Environmental Protection Agency, http://www.energystar.gov/index.cfm?c=new_specs.enterprise_servers, April 2009.

[3] Energy Star enterprise storage specification. United States Environmental Protection Agency, http://www.energystar.gov/index.cfm?c=new_specs.enterprise_storage, under development, April 2009.

[4] Green Grid data center efficiency metrics: PUE and DCiE. White Paper, The Green Grid, December 2008.

[5] Gridmix. HADOOP-HOME/src/benchmarks/gridmix in all recent Hadoop distributions.

[6] Hadoop. http://hadoop.apache.org.

[7] Hadoop Power-By Page. http://wiki.apache.org/hadoop/PoweredBy.

[8] International energy annual 2006. Energy Information Administration, U.S. Department of Energy, June 2008.

[9] Nutch. http://lucene.apache.org/nutch/.

[10] SPECpower_ssj2008. Standard Performance Evaluation Corporation, http://www.spec.org/power_ssj2008/, April 2009.

[11] J. Ammer and J. Rabaey. The energy-per-useful-bit metric for evaluating and optimizing sensor network physical layers. In 2006 3rd Annual IEEE Communications Society Conference on Sensor and Ad Hoc Communications and Networks (SECON '06), volume 2, pages 695–700, September 2006.


[12] L. A. Barroso and U. Hölzle. The case for energy-proportional computing. Computer, 40(12):33–37, 2007.

[13] C. Belady. In the data center, power and cooling costs more than the IT equipment it supports. Electronics Cooling Magazine, 13(1):24–27, February 2007.

[14] S. Dawson-Haggerty, A. Krioukov, and D. Culler. Power optimization - a reality check. Submitted to HotPower 2009, Workshop on Power Aware Computing and Systems.

[15] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, January 2008.

[16] X. Fan, W.-D. Weber, and L. A. Barroso. Power provisioning for a warehouse-sized computer. In ISCA '07: Proceedings of the 34th Annual International Symposium on Computer Architecture, pages 13–23, New York, NY, USA, 2007. ACM.

[17] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. SIGOPS Oper. Syst. Rev., 37(5):29–43, 2003.

[18] Google. Data center efficiency measurements. The Google Blog, 2009.

[19] J. Koomey. Worldwide electricity used in data centers. Environmental Research Letters, 3(3), September 2008.

[20] K. Lim, P. Ranganathan, J. Chang, C. Patel, T. Mudge, and S. Reinhardt. Understanding and designing new server architectures for emerging warehouse-computing environments. In ISCA '08: Proceedings of the 35th International Symposium on Computer Architecture, pages 315–326, Washington, DC, USA, 2008. IEEE Computer Society.

[21] M. Manos and C. Belady. What is your PUE strategy? The Power of Software blog, July 2008.

[22] R. Miller. Microsoft: PUE of 1.22 for data center containers. Data Center Knowledge, October 2008.

[23] S. Rivoire, M. A. Shah, P. Ranganathan, and C. Kozyrakis. JouleSort: a balanced energy-efficiency benchmark. In SIGMOD '07: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 365–376, New York, NY, USA, 2007. ACM.

[24] S. Shankland. Google uncloaks once-secret server. CNET News, April 2009.