
D. J. Bradley
R. E. Harper
S. W. Hunter

Workload-based power management for parallel computer systems

This paper describes and evaluates predictive power management algorithms that we have developed to minimize energy consumption and unmet demand in parallel computer systems. The algorithms are evaluated using workload data obtained from production servers running several applications, showing that energy savings of 20% or more can readily be achieved, with a small degree of unmet demand and acceptable reliability, availability, and serviceability (RAS) impact. The implementation of these algorithms in IBM system management software and possibilities for future work are discussed.

1. Introduction

Concern over the power consumption of computer systems is no longer limited to long-lived embedded applications such as spacecraft and aerospace systems, but is spreading to the general-purpose computer market. Power consumption and dissipation constraints are jeopardizing the ability of the IT industry to support the business demands of present-day workloads, especially in the areas of electronic commerce and Web hosting. Given existing physical infrastructure, it can be difficult to get power into and heat out of such large systems, while power shortages such as those recently experienced in California impose further, unpredictable constraints.

This is in part because server complex size is growing dramatically in response to skyrocketing electronic commerce and Web hosting computational needs, and in part because processor power consumption is also growing, to well over 100 watts per processor for some present-day processors. Modern-day server complexes for electronic commerce and Web hosting constitute literally thousands of servers operated in parallel, consuming thousands of square feet of computer room space and many kilowatts of power. In some cases there is simply no additional power and cooling capacity in a given physical environment to support increases in capacity.

Lower-power processors are now becoming available from the industry, but it is safe to say that a new market-acceptable price–power–performance equilibrium has yet to be demonstrated in the server space, and in fact the performance characteristics of lower-power processors may limit their ultimate penetration into this space. Moreover, processor power consumption, while significant, does not account for all the power consumed by a modern server. Memory controllers, I/O adapters, disk drives, memory, and other devices consume a non-negligible fraction of server power that cannot be disregarded. Finally, incorporation of advanced power management functionality into server hardware architectures, while proceeding because of market demand, must always be balanced against the reality of minimizing base server cost in an extremely price-competitive market.

Fortunately, electronic commerce and Web-serving workloads have certain characteristics that make them amenable to predictive system management techniques that can effectively reduce power consumption for a broad class of server architectures.1 They usually support periodic or otherwise variable workloads, with the peak workload being substantially higher than the minimum or average workload.

1 Other appropriately implemented and managed workloads may also have many of these characteristics.

©Copyright 2003 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.

0018-8646/03/$5.00 © 2003 IBM


The dynamic range of the workload is often a factor of 10; that is, the peak workload can be ten times the minimum workload. Furthermore, workload stampedes can cause the transition from low to high workload (and vice versa) to be abrupt. These workload attributes imply that keeping all of the servers powered on all of the time might be a waste of power, yet the possibility of failing to meet an unanticipated surge of demand must be minimized.

These important workloads have other promising characteristics from a power management perspective. They are highly parallel and relatively easy to balance. A typical Web-serving system has a large number of Web servers fronted by a load-balancing "IP sprayer," which provides a single Internet protocol (IP) address to the outside world and dispatches requests from the outside world to the many Web servers in the complex to balance the load among them [Figure 1(a)].

Simplistically, the IP (address) sprayer sends a given request to the server having the lowest utilization, and, in turn, the servers keep the IP sprayer updated with their utilization, response time, or other indication. Workload is merely routed around failed servers, and the users having transactions or sessions on a failed server can click again to have their request go to another server. This works quite well because a given request is locale-transparent: Assuming that all servers have access to the same back-end source of Web pages or database, as is common in practice, that request can be dispatched to any Web server in the complex. Finally, the requests are short-lived enough that if a given server is "condemned" and new workload is withheld from it, its utilization quickly falls, whereas if a new server is brought on-line, new workload can readily be dispatched to it and its utilization quickly rises. These workload attributes imply that it should be relatively straightforward to power individual systems off and on with minimal disruption to the overall application.

On the basis of these needs and observations, we have developed and evaluated an algorithm (called CoolRunnings) that exploits this environment to manage power on the basis of measured and predicted workload, such that both unmet demand and power consumption are minimized. It does this by

1. Measuring and characterizing the workload on all of the servers in a defined group.

2. Determining whether any servers need to be powered on or off in the near future by assessing the current capacity relative to the predicted capacity needs.

3. Manipulating the existing system and workload management functions to remove load from servers to be turned off.

4. Physically turning on or off (or switching into and out of standby or hibernation) designated servers using existing system management interfaces.

Thus, there is a measurement component, an algorithmic component, a workload management component, and a physical server power management component, as shown in Figure 1(b).

The CoolRunnings algorithm is not intended to replace emerging power management techniques internal to the architecture of a server, such as processor clock and voltage modulation, nor workload-balancing techniques such as those provided by existing workload and system management infrastructures. It is intended to supplement them with computing facility-scale power management techniques that are capable of supporting a wide range of systems. CoolRunnings also differs from recently deployed IBM functions and products, notably Capacity Manager [1] and Software Rejuvenation [1]. Capacity Manager is designed to measure long-term (e.g., weeks and months) trends in resource utilization, and to instruct the customer to procure additional appropriate resources before system performance suffers. Software Rejuvenation is designed to detect long-term (weeks) resource exhaustion due to bugs such as memory leaks. All or part of the system is restarted at a convenient time (automatically or by the customer) before exhaustion of the resources causes an unplanned outage.

[Figure 1: IP address sprayer: (a) Schema; (b) power management.]


CoolRunnings shares some technologies from these products, notably the workload characterization and saturation definitions of Capacity Management, and some prediction concepts from Software Rejuvenation, but it is primarily intended to provide short-term (e.g., minutes to hours in advance) projections of workload, and power systems off or on appropriately to meet the objective function. Thus, CoolRunnings is interested only in projecting workload far enough forward in time to allow adequate resources to be powered on and readied for work when the workload requires them.

This paper describes some predictive power management algorithms we have developed that attempt to minimize energy and unmet demand. These algorithms are evaluated using historical workload data obtained from production servers from several different application domains, and we show that energy savings of 20% or more can be readily achieved with a very small amount of unmet demand. We then describe how this technology can be implemented in IBM system management software, and point to interesting possibilities for future work.

2. Related work

There has been much activity in the study of power conservation in computing devices. This can be accomplished at the single system image level [2–4], as well as at the multisystem level [2, 5–8]. The application of power management to Web applications [9] is especially interesting because there is often a widely varying workload [10], such that dynamic decisions in support of power management policies can be made through the use of feedback mechanisms [11]. This paper focuses on the algorithms for making those decisions on the basis of current workload conditions in a multiserver infrastructure. This technique could be especially useful when applied to servers with a blade form factor [12] in a common chassis.

3. Power management algorithms

Objective

The objective of our power management algorithm is to meet the offered workload at all times, subject to minimizing power consumption and not unduly increasing server failure rate or client disconnection rate due to excessive power cycling.

Specifically, the figures of merit that the algorithm attempts to minimize are the following:

1. The energy consumption of the server aggregate, normalized to the energy consumption when all servers are powered on. This is computed as the energy consumption per server (including all of its ancillary components such as disks, fans, and additional logic components) times the number of servers that are powered on. We do not differentiate between the energy consumption of a completely utilized powered-on server and an unutilized powered-on server.

2. The unmet demand relative to integrated demand over a history window, suitably normalized to obtain a number between 0 and 1. As mentioned below, this can be extended to incorporate response time characteristics of met demand.

3. The number of power-on cycles incurred by each system, normalized to power-on/off cycles per day. This must be minimized in order to reduce reliability degradation due to power cycles.

Workload characterization

A server has several resources whose consumption renders it unable to accommodate additional load. On the basis of experience with server workloads, we use the following metrics for utilization:

● CPU utilization.
● Physical memory utilization.
● Local area network adapter bandwidth utilization.
● Disk bandwidth utilization.

These parameters are readily measured2 and are the same as those used to characterize server workload in the IBM Capacity Management product.

The system workload traces that we had available for this study always contained CPU utilization, and sometimes contained proxy information for these other parameters. We chose to use CPU utilization alone to test our algorithm, recognizing that other resources may ultimately constrain performance. Incorporating additional performance parameters into our algorithm is straightforward. If the utilization of any resource is greater than a threshold, additional capacity should be added; in our case, if any servers are powered off, enough should be powered on to reduce all utilizations on all servers to below that threshold. If all utilizations are lower than this threshold (correcting for hysteresis), and assuming that there is adequate capacity in the resulting server group to absorb the load of at least one server without any resource on any server being over-utilized, at least one server can be powered off.

For our work, we defined the saturation threshold for any resource to be 70%, as in the IBM Capacity Management product. This figure reflects the realities of the performance characteristics of industry-standard servers, which suffer rapidly degrading performance as any resource utilization approaches this value.

2 In the Microsoft** Windows** operating system, the parameters can be derived from the performance counters. In Linux**, they can be derived from data residing in the /proc directory structure.


Also, below this utilization, nonlinearities due to queuing are not so large that approximating response time from utilization is too misleading. Thresholds can be increased or decreased on the basis of the load curve of a given system and application or the degree of conservatism of a system administrator, and different resources can be given different thresholds.
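As a concrete illustration of this saturation rule, the sketch below (ours, not code from the paper) checks the four resource utilizations on every powered-on server against a common 70% threshold; the 10% hysteresis margin and the simple load-redistribution estimate used to decide whether a server can be shed are illustrative assumptions.

    # Illustrative sketch of the saturation rule described above; the hysteresis
    # margin and the load-redistribution estimate are assumptions, not values
    # taken from the paper.
    SATURATION = 0.70   # saturation threshold applied to every resource
    HYSTERESIS = 0.10   # margin below the threshold before shedding capacity

    def needs_more_capacity(servers):
        """servers: list of dicts of resource utilizations (0..1), one per powered-on server."""
        return any(u > SATURATION for s in servers for u in s.values())

    def can_shed_a_server(servers):
        """True if every resource is comfortably below the threshold and the remaining
        servers could absorb one server's share of the load without crossing it."""
        n = len(servers)
        if n <= 1:
            return False
        if any(u > SATURATION - HYSTERESIS for s in servers for u in s.values()):
            return False
        worst = max(u for s in servers for u in s.values())
        return worst * n / (n - 1) < SATURATION

    servers = [{"cpu": 0.35, "mem": 0.40, "net": 0.20, "disk": 0.15} for _ in range(4)]
    print(needs_more_capacity(servers), can_shed_a_server(servers))   # False True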

If enough is known about the system, the measured performance parameters can be combined into a fairly simple closed-form queuing model, and the projected response time can be incorporated into the objective function. In this case, the algorithm would attempt to minimize the unmet demand, plus the response time associated with the met demand. In some architectures, the response times can be measured directly, as in [5].
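The paper does not specify the closed-form model; as one hedged example, a single-queue M/M/1-style approximation relates utilization to response time and could be folded into the objective function:

    # Illustration only: an M/M/1-style estimate, one possible "fairly simple
    # closed-form queuing model"; the paper does not prescribe a specific model.
    def estimated_response_time(utilization, service_time=1.0):
        """R = S / (1 - U): response time grows sharply as utilization approaches 1."""
        if utilization >= 1.0:
            return float("inf")
        return service_time / (1.0 - utilization)

    # At the 70% saturation threshold the estimate is about 3.3x the unloaded service time.
    print(round(estimated_response_time(0.70), 2))   # 3.33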

Gain-based algorithm

The gain-based version of the CoolRunnings algorithm attempts to estimate a capacity envelope at a time in the near future, and power servers on or off to maintain that capacity envelope at a desired amount (but not too far) above the projected workload. The entire capacity envelope must always be above the current utilization; otherwise, unmet demand will result. The projection time is equal to the time required to power up a server and ready it for work.

The lower limit of the capacity envelope, that is, the minimum amount of capacity deemed necessary given the current workload, is projected by taking the current workload and adding to it an "uplift" that is based on the maximum sample-to-sample deviation that was observed over a given workload history. The upper limit of the capacity envelope, that is, the maximum amount of capacity deemed necessary given the current workload, is projected by taking the current workload and adding to it an "excess" that is similarly based on the maximum sample-to-sample deviation that was observed over a given workload history.

The uplift is equal to the "uplift gain" times this maximum deviation value, and the excess is equal to the "excess gain" times this maximum deviation value. If the current capacity is between the uplift and the excess, no action is taken. If the current capacity is less than the uplift, one or more servers are scheduled to be powered on; if the current capacity is greater than the excess, one or more servers are scheduled to be powered off. Note that if the workload is roughly constant and the uplift is equal to the excess, the algorithm will alternately power servers on and off at each sample point. The basics of the method are shown in Figure 2(a) for a sample window size of six points.

For example [Figure 2(b)], if the current capacity is 1100, the current workload is 1000, the history window is 20, the uplift gain is 20%, and the excess gain is 100%, the algorithm runs as follows:

1. Scan the last 20 workload samples and calculate the maximum point-to-point deviation. Suppose that this value is 200.

2. Calculate the projected capacity envelope. The uplift is 1000 + 20% × 200 = 1040, and the excess is 1000 + 100% × 200 = 1200.

3. The capacity is 1100, which is greater than the uplift yet less than the excess. Thus, no action has to be taken, as illustrated in Figure 2(b). If the capacity were less than 1040, one or more servers would have to be powered on to maintain it in the desired capacity range, and if the capacity were greater than 1200, one or more servers would have to be powered off.
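The decision logic of this example can be written down directly; the following sketch (function and variable names are ours) reproduces the numbers above:

    # Reproduces the numerical example above; names are ours, not the paper's.
    def gain_based_decision(workload_history, capacity, uplift_gain, excess_gain):
        # Maximum point-to-point deviation over the history window.
        dv = max(abs(b - a) for a, b in zip(workload_history, workload_history[1:]))
        current = workload_history[-1]
        lower = current + uplift_gain * dv   # "uplift" limit of the capacity envelope
        upper = current + excess_gain * dv   # "excess" limit of the capacity envelope
        if capacity < lower:
            return "power on one or more servers"
        if capacity > upper:
            return "power off one or more servers"
        return "no action"

    # A 20-sample window whose largest step is 200, with current workload 1000:
    history = [1000] * 18 + [800, 1000]
    print(gain_based_decision(history, capacity=1100, uplift_gain=0.2, excess_gain=1.0))
    # -> "no action"  (envelope is 1040 to 1200, and 1040 <= 1100 <= 1200)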

The history window size, uplift gain, and excess gain are fundamental to the performance of the algorithm, and must be judiciously chosen. We show the effects of these parameters on a variety of workloads below.

[Figure 2: (a) Short-term gain algorithm; (b) algorithm demonstration.]


Algorithm based on seasonal characterization

The short-term algorithm does not account for sudden spikes in workload, because these are not presaged by variations in the recent sample window. Figure 3(a) illustrates the unfortunate situation in which a short-term algorithm, applied to the numerical example above, fails to predict a sudden spike in workload, thus resulting in unmet demand.

Fortunately, many workload spikes are repetitious [Figure 3(b)], based on weekly or daily activity such as nightly backups, weekly logons on Monday, market openings and closings, etc. For epochs that are not daily or weekly, it is a straightforward matter to perform a calculation such as an autocorrelation to determine the workload periodicity, and define the epochs appropriately. For our initial exploratory purposes, we assume (and our data reinforces) that weekly and daily periods predominate.

To accommodate seasonal phenomena, we have developed another algorithm that is based on collecting workload data over a prior epoch in time, characterizing the workload of future epochs based on prior epochs, and setting up a short-term power-on/off schedule based on that characterization. This approach has the benefit that it can speculatively power on systems before periodic surges in workload. Such a quasi-static epochal capacity schedule can be overridden by exigencies of the moment, by augmenting it with a gain-based policy. For example, if in the next time increment the schedule states that certain capacity is needed, but a gain-based policy as described above indicates that more capacity than that is needed, the capacity indicated by the gain-based policy can be powered on.

Details of hybrid gain-based/seasonal algorithm

The details of an implementation of this algorithm are described below. As before, the algorithm consists of a component to monitor and characterize workload and a component to adjust capacity.

The utilization-monitoring component measures the difference in utilization from one point to the next, with the intent of detecting and recording for future reference a workload spike that may not have been accommodated by the short-term algorithm. It does this by detecting whether the difference in utilization is greater than a preprogrammed amount, and setting flags accordingly for future reference.

For example, if the most recent sample is greater than the previous sample by a given amount (called the threshold up parameter), the utilization-monitoring component can set a flag for that particular point in time, indicating that in one epoch minus one sample interval, the additional capacity should be added. The amount of capacity scheduled to be added in one epoch minus one sample interval depends on the difference in the most recent and the next most recent samples. If the most recent sample is less than the previous sample by a given amount (called the threshold down parameter), the utilization-monitoring component can set a flag for that particular time, indicating that in one epoch minus one sample interval from the current time, capacity should be removed. The utilization-monitoring component performs this characterization for every single sample and stores the results for future reference.

The capacity-adjustment component adjusts capacity for the next sample point on the basis of utilization from prior epochs. At each sample point, it examines the flags for the time point that is one epoch in the past. If the flags indicate that capacity must be added or removed, the capacity-adjustment component does so (Figure 4).

A workload usually exhibits multiple concurrent epochs of differing periodicities. For example, a workload may exhibit a daily, weekly, and monthly repetitiveness that can be detected and exploited. Thus, the capacity-adjustment component must look one day, one week, and possibly one month into the past to make the capacity-adjustment decision.

Because of sample time quantization, the monitoring system may estimate the occurrence of a spike inaccurately. Thus, when calculating the flags for a given point in time, it is useful for the algorithm to examine not only the sample immediately following the point one epoch back in time, but several samples after that point (lookahead horizon).

[Figure 3: Workload variations: (a) Unpredicted workload spike; (b) epochal spike prediction.]


The resulting hybrid algorithm has the capability of reacting to random workloads using the gain-based algorithm, and the capability of accommodating periodic workloads. The algorithm uses the gain-based algorithm as described above to decide whether recent variations in utilization justify adding or removing servers, and uses the epochal algorithm to make the determination on the basis of long-term (i.e., one or more epochs) variations in utilization.

Algorithm pseudocode

The following pseudocode outlines the algorithm:

    Workload characterization step:
        Measure Current Utilization
        If current utilization - last utilization > Threshold Up
            Record Adjustment Up = current utilization - last utilization for later use
        If last utilization - current utilization > Threshold Down
            Record Adjustment Down = current utilization - last utilization for later use

    Capacity adjustment step:
        Measure Current Utilization
        Short-Term Phase
            Over last N points, calculate DV = max(abs(point-to-point variation))
            Calculate Uplift = DV × Uplift Gain
            Calculate Excess = DV × Excess Gain
        Long-Term Phase
            For each epoch
                Retrieve Adjustment Up for sample one epoch back in time, minus one sample interval
                Retrieve Adjustment Down for sample one epoch back in time, minus one sample interval
            If any epoch has Adjustment Up,
                Adjustment = max(Adjustment Up over all epochs)
            Else
                Adjustment = max(Adjustment Down over all epochs)

    Compute thresholds step:
        If Adjustment >= 0 (i.e., Adjustment Up)
            minCapacity = Current Utilization + max(Adjustment, Uplift)
            maxCapacity = Current Utilization + max(Adjustment, Excess)
        Else (i.e., Adjustment Down)
            minCapacity = Current Utilization + max(0, Uplift + Adjustment)
            maxCapacity = Current Utilization + max(0, Excess + Adjustment)

    Power on/off processors step:
        Determine Capacity = Total Available processing power × Desired Utilization
        If Capacity < minCapacity
            Power On processors until Capacity >= minCapacity
        Else if Capacity > maxCapacity
            Power Off processors until Capacity <= maxCapacity, while ensuring that
            Capacity >= minCapacity (i.e., after powering off processors, Capacity may
            be > maxCapacity, but will definitely be >= minCapacity)
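To make the control flow concrete, the following Python sketch implements the hybrid policy essentially as given in the pseudocode. The class structure, the fixed capacity-per-server model, and the default epoch lengths are our own assumptions for illustration; they are not taken from the IBM implementation.

    # Hypothetical sketch of the hybrid gain-based/seasonal policy; names, the fixed
    # per-server capacity model, and default epoch lengths are illustrative assumptions.
    class HybridPowerPolicy:
        def __init__(self, uplift_gain=0.2, excess_gain=1.0, threshold_up=50.0,
                     threshold_down=50.0, desired_utilization=0.7, window=20,
                     epochs=(96, 672)):        # epoch lengths in samples (e.g., day, week)
            self.p = dict(uplift_gain=uplift_gain, excess_gain=excess_gain,
                          threshold_up=threshold_up, threshold_down=threshold_down,
                          desired_utilization=desired_utilization, window=window)
            self.epochs = epochs
            self.history = []                  # utilization samples
            self.adjustments = {}              # sample index -> recorded adjustment

        def observe(self, utilization):
            """Workload characterization step: record spikes for epochal reuse."""
            t = len(self.history)
            if self.history:
                delta = utilization - self.history[-1]
                if delta > self.p["threshold_up"] or -delta > self.p["threshold_down"]:
                    self.adjustments[t] = delta    # positive = up, negative = down
            self.history.append(utilization)

        def target_servers(self, servers_on, capacity_per_server, total_servers):
            """Capacity adjustment step: desired number of powered-on servers."""
            t = len(self.history) - 1
            current = self.history[-1]
            # Short-term phase: envelope from recent sample-to-sample variation.
            recent = self.history[-self.p["window"]:]
            dv = max((abs(b - a) for a, b in zip(recent, recent[1:])), default=0.0)
            uplift, excess = dv * self.p["uplift_gain"], dv * self.p["excess_gain"]
            # Long-term phase: adjustments recorded one epoch (minus one sample) ago.
            adjs = [self.adjustments.get(t - e + 1, 0.0) for e in self.epochs]
            ups = [a for a in adjs if a > 0]
            downs = [a for a in adjs if a < 0]
            adjustment = max(ups) if ups else (max(downs) if downs else 0.0)
            # Compute thresholds step.
            if adjustment >= 0:
                min_cap = current + max(adjustment, uplift)
                max_cap = current + max(adjustment, excess)
            else:
                min_cap = current + max(0.0, uplift + adjustment)
                max_cap = current + max(0.0, excess + adjustment)
            # Power on/off step: each powered-on server offers
            # capacity_per_server * desired_utilization of usable capacity.
            usable = capacity_per_server * self.p["desired_utilization"]
            n = servers_on
            while n * usable < min_cap and n < total_servers:
                n += 1                          # power on
            while n * usable > max_cap and (n - 1) * usable >= min_cap and n > 1:
                n -= 1                          # power off
            return n

A driver would call observe() at every sample and then target_servers() to decide the next power-on/off commands.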

Self-tuning implementation

The algorithm has the following tunable parameters:

● Uplift gain.
● Excess gain.
● Desired utilization.
● Window size.
● Threshold up.
● Threshold down.
● Lookahead horizon (not evaluated in this work).

Uplift gain, excess gain, threshold up, threshold down, desired utilization, and window size comprise a multidimensional search space which contains an optimum figure of merit that is dependent on the workload characteristics as well as the relative weighting of energy consumption and unmet demand.

[Figure 4: Predicted epochal spike.]


In general, finding the parameter settings that optimize these figures of merit within this search space is tedious and ad hoc at best, and certainly not practical or optimal for all workloads and system administration policies encountered in the field. Therefore, we developed a self-tuning gain-based algorithm that calculates energy consumption and unmet demand on the basis of a workload sample for a large set of values of the input parameters of the algorithm. Then, the algorithm searches through this set of input values to find the settings that optimize the figures of merit for the given workload. Any search algorithm can be used; typically, because the state space is small, an exhaustive enumeration could even be used. The self-tuning approach has the advantage that it can dynamically adapt not only to any workload that is encountered in the field, but to changes that occur to the workload over time on any given system.
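For illustration, the self-tuning search can be as simple as the exhaustive enumeration below; simulate() is a hypothetical stand-in for the simulator of Section 4, and both the parameter grid and the weight that trades energy against unmet demand are assumptions rather than values from the paper.

    # Sketch of the self-tuning step: exhaustive enumeration over a small parameter
    # grid. simulate() is a hypothetical stand-in for the workload simulator; the
    # grid and the energy/unmet-demand weighting are illustrative assumptions.
    from itertools import product

    def self_tune(workload, simulate, weight=10.0):
        grid = {
            "uplift_gain": [0.1, 0.2, 0.5, 1.0],
            "excess_gain": [0.5, 1.0, 2.0, 4.0],
            "threshold_up": [25, 50, 100],
            "threshold_down": [25, 50, 100],
            "desired_utilization": [0.6, 0.7, 0.8],
            "window": [10, 20, 40],
        }
        best_params, best_score = None, float("inf")
        for values in product(*grid.values()):
            params = dict(zip(grid.keys(), values))
            energy, unmet = simulate(workload, **params)   # both normalized to [0, 1]
            score = energy + weight * unmet
            if score < best_score:
                best_params, best_score = params, score
        return best_params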

4. Empirical workload measurements and algorithm performance

Simulation methodology

To evaluate candidate power management algorithms under controlled experimental conditions, we constructed a workload generator and algorithm simulator that can generate a variety of synthetic workloads, ranging from random to periodic, with varying amounts of noise. Various figures of merit such as energy consumption and unmet demand can be calculated and displayed. The algorithm simulator can also read files containing empirical workload trajectories, as discussed below. This simulator has been valuable in rapidly working out algorithm details under closely and easily controlled laboratory conditions.

The simulator makes certain assumptions about the dynamics of the workload. The algorithm works by taking a sample of utilization, looking at recent and epochal workload history, and computing whether one or more servers should be powered on or off. For energy calculation purposes, the simulator assumes that when we command a server to be powered on, it consumes power immediately but delivers no capacity until the next sample. This has the effect of overestimating the unmet demand, since it is likely that demand will be served in a shorter time than the sampling interval. Upon shutdown, the simulator assumes that once a server has been commanded to shut down, it requires an entire sample interval to shut down, during which it consumes power but provides no capacity. This simplification probably overemphasizes the energy consumption, since it is likely that a server can actually shut down in less time than one sample interval.

Startup and shutdown durations are also a function of workload inertia, which influences the time required for the workload from a given server to decay and for full load to be applied to a newly powered-on server. Even if a system can be put into and taken out of standby or hibernation in a few seconds (for servers, this remains to be seen), if the workload takes minutes to decay or ramp up, the effective startup or shutdown time will be constrained by the time constant of this workload. Furthermore, we do not discriminate between the power consumption of a fully utilized and an unutilized server, and the power consumption of the system is calculated as the power consumption of a single whole server times the number of powered-on servers.

For convenience, we set the power-on/off latency to be equal to the sample period of the raw data, which is 15 minutes for the IBM Domino* server [13], two minutes for the Divahouse server,3 and five minutes for the IBM server containing data from the 2000 Olympic Games held in Sydney, Australia4 (these workloads are described below). We have not explored the effect of the frequency of power-on/off opportunities on algorithm performance, except to note that as the algorithm tries to minimize power-on/off cycles, it will not be able to take advantage of frequent opportunities to cycle power, so we would expect the effect of power-on/off latency to be weak.

Further assumptions are made about the persistence of unmet demand. We assume that unmet demand is lost forever, whereas in actual usage it is possible that unmet demand could actually be deferred until the requisite capacity is brought on line. We do not model such demand deferral, which once again probably causes us to overstate unmet demand.
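Under these assumptions, the per-sample bookkeeping reduces to a few lines; the sketch below (names and example numbers are ours) charges full power to any server that is on or transitioning, credits capacity only to servers that are fully on, and drops whatever demand is left over:

    # Minimal per-sample accounting under the simulator's stated assumptions: servers
    # that are powering on or off draw power but serve no load, and unmet demand is
    # dropped rather than deferred. Names and example numbers are illustrative.
    def account_sample(demand, servers_serving, servers_drawing_power,
                       capacity_per_server, power_per_server, interval_hours):
        served = min(demand, servers_serving * capacity_per_server)
        unmet = demand - served
        energy = servers_drawing_power * power_per_server * interval_hours
        return energy, unmet

    # One 15-minute sample: 10 servers drawing power, 9 of them actually serving.
    print(account_sample(demand=850.0, servers_serving=9, servers_drawing_power=10,
                         capacity_per_server=100.0, power_per_server=300.0,
                         interval_hours=0.25))
    # -> (750.0, 0.0)   watt-hours consumed, workload units unmet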

To develop the results presented in this paper, the simulator was set up to automatically evaluate the algorithm using a large combination of input parameters, and compute the energy as a function of unmet demand for all of them. In all subsequent evaluations, no self-tuning of parameters was performed, in order to allow us to readily observe the effects of varying each parameter.

Lotus Domino database server workload

We obtained workload data as a function of time for a number of Lotus Domino servers in use at IBM facilities in Poughkeepsie, New York. These servers support business applications of the IBM Corporation such as databases and email. Although the workload is not necessarily fungible, in other respects as an aggregate it seems to exhibit realistic workload behavior such as periodicity, high dynamic range, and sudden workload transients.

3 Divahouse is the name of the University of North Carolina (UNC) server.
4 The site is no longer available.


A typical segment of this data is shown in Figure 5(a). The data show the total CPU utilization of the ten servers, sampled every 15 minutes.5 The time span is midnight Tuesday to midnight the subsequent Monday during January 2000. The periodic nature, high dynamic range, and sharp transient behavior of the workload are apparent, as is ample opportunity for power management.

The autocorrelation for this workload is shown in Figure 5(b). The "lag" is expressed in samples, so one day corresponds to 95 samples and one week corresponds to 665 samples. A strong daily autocorrelation is easily visible, as is a slightly elevated autocorrelation at one week.
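The epochs themselves can be found mechanically; as a sketch (ours, not the analysis code used for the paper), the autocorrelation of the utilization series can be computed and scanned for peaks near the one-day and one-week lags:

    # One way to look for daily/weekly epochs: compute the autocorrelation of the
    # utilization series and inspect the lags where it peaks. This sketch is ours,
    # not the analysis code used for the paper.
    import numpy as np

    def autocorrelation(samples, max_lag):
        x = np.asarray(samples, dtype=float)
        x = x - x.mean()
        denom = float(np.dot(x, x))
        return [1.0 if lag == 0 else float(np.dot(x[:-lag], x[lag:])) / denom
                for lag in range(max_lag + 1)]

    # Hypothetical usage: with 15-minute samples, a pronounced peak near the
    # one-day lag suggests a daily epoch.
    # acf = autocorrelation(cpu_utilization_samples, max_lag=700)
    # daily_lag = max(range(80, 110), key=lambda lag: acf[lag])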

Figure 6 shows the unmet demand versus energy tradeoff for this workload for a large set of input parameters. For Figure 6 [as well as Figures 7, 9(b), and 11(a), shown later], the ordinate shows energy consumption relative to the energy consumption if all servers were powered on all the time (with higher energy consumption at the top of the axis), and the abscissa shows unmet demand relative to the total demand applied to the system over the run duration (with higher unmet demand at the right of the axis). Each circle on the chart shows the energy and unmet demand for a given setting of the six parameters.

5 The maximum CPU utilization for the ten servers is 1000 "workload units."

[Figure 5: Lotus Domino server workloads: (a) CPU utilization for ten processors; (b) autocorrelation of workload.]

[Figure 6: Energy consumption vs. unmet demand for Domino workload.]


The energy savings range from 26% to 52%, while the unmet demand ranges from 0.1% to 1.2%.

The color of the circle indicates the number of power-on cycles. A blue circle indicates that the number of power-on cycles for the entire system for the entire evaluation interval (eight weeks) was less than one per processor per day; a green circle indicates one to two power-on cycles per processor per day; yellow indicates two to three power-on cycles, and red indicates more than three power-on cycles. For reference, a scenario in which each server is power-cycled once per day is not considered to unduly reduce server hardware reliability.

The twin-hyperbola curve has a "wishbone" shape, with energy vs. unmet demand exhibiting a hyperbolic tradeoff along two distinct arcs. To understand which parameters cause the figure of merit to traverse a particular arc of the curve, which ones cause it to cross from one arc to the other, and, ultimately, what constitute relatively good settings, a number of vector field plots were produced that show the effect of changing a single parameter at a time by the same increment.

The first plot [Figure 7(a)] shows that the effect of increasing uplift gain is to reduce unmet demand and increase energy within a hyperbolic arc, but without crossing to the other arc. An increase in uplift gain causes the algorithm to be more likely to turn on processors on the basis of a given degree of variation in the short-term window. Thus, energy consumption is higher, but the likelihood of unmet demand is lower.

The next chart [Figure 7(b)] shows the effect of increasing excess gain. As excess gain increases, the system is less likely to power off processors on the basis of a short-term projected capacity surfeit, so once again energy consumption increases while the likelihood of unmet demand decreases, and a change in excess gain does not cause the point to change arcs.

The next chart [Figure 7(c)] shows the effect of threshold up on energy consumption and unmet demand. Threshold up determines how sensitive the algorithm is to sudden increases in demand that occurred one epoch ago (for this workload, either one day or one week ago). If a workload increment in the previous epoch is greater than the threshold up, the algorithm will immediately schedule a commensurate increase in capacity at the current time. Thus, a low value of threshold up causes the algorithm to closely track the epochal variation, and a high value of threshold up causes the algorithm to ignore any epochal variation.

This explains the characteristic "wishbone" shape of the energy–unmet demand curve. The left arc of the curve corresponds to low values of threshold up: Because the algorithm is sensitive to the existing epochal variations in the workload, and proactively schedules capacity increases based on those variations, the unmet demand is lower than in the right fork of the curve. In the right arc of the curve, where threshold up is higher, the algorithm does not take epochal variations into account, is surprised by the unpredicted workload transients, and incurs larger amounts of unmet demand.

Thus, the threshold up vector field chart shows that, for workloads that exhibit periodic workload increments, the epochal algorithm can be extremely effective in reducing unmet demand.

[Figure 7: Energy consumption vs. unmet demand for (a) uplift gain vector field; (b) excess gain vector field; and (c) threshold up vector field. Vectors are in the direction of the increasing parameter.]


We also note that if the workload had no periodicity, the wishbone curve would collapse into a single hyperbola, and threshold up would have little if any effect. We actually see this for other, less periodic workloads.

The effects of changing the other parameters of the algorithm are similar to the effect of changing uplift gain and excess gain, and are not reproduced here. To summarize,

● Increases in threshold down cause energy to rise and unmet demand to fall along a given arc of the wishbone.

● Increases in desired utilization cause energy to fall and unmet demand to rise along a given arc of the wishbone.

● Increases in window size cause energy to rise and unmet demand to fall along a given arc of the wishbone.

Figure 8 shows energy consumption as a function of the number of power-on cycles per day per server for the Domino workload. The first observation is that for this workload, a large number of parameter settings result in significant energy savings, with well under one power-on cycle per server day, and quite reasonable unmet demand. For example, the point marked by the arrow shows one such setting, with a 45% energy savings, a 0.5% unmet demand, and 0.935 power-on cycles per server day.

UNC ibiblio.org Web server workload

The second workload that we used to evaluate the algorithm was obtained from the main ibiblio.org file transfer protocol and Web server called "Divahouse." This server, located at the University of North Carolina at Chapel Hill (UNC), receives several million hits per day. The workload, which consists of samples taken every two minutes, spans 27 days in January 2002.

The autocorrelation is shown in Figure 9(a). Because the samples are at two-minute intervals, a workload with a strong daily periodicity would have a peak at a lag of 720 samples, and a one-week periodicity would have a peak at a lag of 5040 samples. These periodicities are evident from the autocorrelation, but they are not as pronounced as in the Domino workload; therefore, we would not expect the long-term epochal prediction technique to be as effective as in that case.

Figure 9(b) shows the energy versus unmet demand tradeoff chart for Divahouse. A 20% energy savings is accompanied by 0.5% unmet demand, which on the surface appears to be very reasonable algorithmic performance. Note the absence of the wishbone shape, implying that the algorithm is not effectively exploiting what little epochal behavior is exhibited by the workload.

However, for this workload, it proved more difficult to obtain large energy savings with a small number of power-on cycles. As the indicated point in Figure 10 exemplifies, achieving energy savings of the order of 20% requires 3.29 power-on cycles per server day, although unmet demand remains at a comfortable 0.5%. We feel that this is partly because of the high-frequency components of the workload, and partly because the algorithm has (and takes) an opportunity to turn a server off or on every two minutes (compared to every fifteen minutes for the Domino workload), resulting in a higher frequency of power cycling. (The different colors for the points in this energy vs. power-on cycles chart reflect differing values of normalized unmet demand. A blue point signifies less than 1% unmet demand, green less than 2%, yellow less than 3%, and red more than 3% unmet demand.)

[Figure 8: Energy consumption vs. power-on cycles per server day for Domino workload.]


Sydney 2000 Summer Olympic Games Web site workload

The final workload we evaluated was obtained from 12 Web servers located in Sydney, Australia (Figure 11). This data spans 27 calendar days representing activity before and during the 2000 Summer Olympic Games, with five minutes between samples. The energy savings in this highly overprovisioned application range from 54% to 80%. Unmet demand [Figure 11(a)] was typically less than 1%, and many settings achieved less than one power-on cycle per server day [Figure 11(b)].

Summary of empirical results

Table 1 summarizes the performance of the algorithm for the workloads we considered. We also compare the energy consumption obtained via our algorithm to the energy consumption for a "perfect" algorithm, in which a) only sufficient servers are powered on to service the workload at any given time, b) servers power up or down instantaneously and can accept workload immediately upon power-up, and c) the number of power-on/off cycles is unconstrained. We think that the poor energy performance on the "Divahouse" workload, compared with the perfect algorithm, is largely due to its high-frequency components, which are difficult to accommodate with a power-on/off cycle minimization constraint.

[Figure 9: (a) Autocorrelation for Divahouse workload; (b) energy consumption vs. unmet demand for ibiblio.org Divahouse Web server.]


5. Practical considerations

In some architectures, power-based workload management can be accomplished by invoking existing workload-balancing functionality and informing it that a given server has to have its workload removed or that a new server is available for use. This is likely to be the case in the IBM Director/Network Dispatcher scenario outlined below. Power control can be accomplished by communicating the power control demands to a service processor resident in the server whose power is to be controlled. This is usually accomplished over either an in-band or out-of-band management network, using established application programmable interfaces (APIs). Depending on its functional capabilities, a server can be either powered off or put into a suspended/hibernation state. Note that for RAS purposes, hibernation is equivalent to a power-off cycle and thus should not be invoked with total abandon, but it could result in much shorter power-off/on cycles.

Standalone implementation

In one implementation, all of the functions of workload-aware power management could reside in a general-purpose distributed system environment having the workload execution, workload measurement, workload balancing, power management algorithm, and power control means.

[Figure 10: Energy consumption vs. power-on cycles per server day for Divahouse Web-serving workload.]

Table 1  Algorithm performance summary.

Workload | Energy consumption, compared with all servers powered on, "perfect" algorithm (%) | Energy consumption, compared with all servers powered on, CoolRunnings algorithm (%) | Unmet demand, compared with total demand (%) | Power-on cycles per server day | Comments
Lotus Domino database | 37 | 55 | 0.5 | <1 | Strongly seasonal enterprise database. High dynamic range.
"Divahouse" Web server | 32 | 80 | 0.5 | 3.29 | Not overprovisioned. Weak seasonality. Low dynamic range.
Sydney 2000 Summer Olympics Web server | 14 | 20 | <1 | <1 | Highly overprovisioned. Low seasonality. High dynamic range.


When no distinguished server is designated as being responsible for the collection of utilization data and the execution of the power management algorithm, a leader election algorithm can be run to determine which server of the currently extant group should run this algorithm and make the power control decisions. Subsequently, the leader can also invoke the workload management and power control functions to implement the decision.
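The paper does not prescribe a particular election algorithm; one minimal, commonly used policy is "lowest live server identifier wins," re-evaluated whenever group membership changes. In the sketch below, heartbeat_ok() is a hypothetical liveness check.

    # Illustrative only: a minimal "lowest live ID wins" leader election; the paper
    # does not specify which election algorithm to use. heartbeat_ok() is hypothetical.
    def elect_leader(server_ids, heartbeat_ok):
        live = [s for s in sorted(server_ids) if heartbeat_ok(s)]
        return live[0] if live else None

    def i_am_leader(my_id, server_ids, heartbeat_ok):
        """The elected server runs the power management algorithm and issues the
        workload management and power control commands."""
        return elect_leader(server_ids, heartbeat_ok) == my_id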

Implementation integrated into IBM Director and Network Dispatcher

Figure 12 illustrates how power management functionality could operate in an environment containing IBM Director [1] and Network Dispatcher, a component of IBM Websphere Edge Server [14].

There are several possible sources of utilization information for the algorithm. The figure shows it being provided by a utilization calculation thread that is shown as part of IBM Director, perhaps using the Capacity Manager database, but it could alternatively be obtained from Network Dispatcher as part of workload status, in which case actual response time measurements would be available; or, rather than having the utilization calculation thread obtain the data from the existing Director monitoring applications, it could obtain utilization data from the Network Dispatcher advisor.

[Figure 11: Olympic Games (Sydney, Australia, 2000): (a) Energy vs. unmet demand; (b) energy vs. power-on cycles.]


6. Conclusions and recommendations for future work

This paper describes a real-time power management algorithm that is applicable to a parallel complex of servers having a parallelizable, migratable workload. The algorithm is designed to minimize power utilization, unmet demand, and server power cycles. It accomplishes this by using short-term and long-term workload history and a simple workload prediction method to anticipate short-term fluctuations in workload. The algorithm then issues power cycling commands as appropriate to reduce power while meeting this predicted workload, with some upside reserve. Our experiments with this algorithm on email/database and Web-serving workloads show that energy savings of 20% or more can readily be achieved, with typically less than 1% of unmet or deferred demand and an average of one power cycle per server day. If the workload has a high dynamic range, or if the system is substantially overprovisioned, much higher energy savings are obtained.

The work we have described has led to several ideas for future work items. We have assumed daily and weekly epochs, which is a limiting assumption. It should be straightforward to improve the epochal algorithm to detect epochs automatically, perhaps using frequency analysis or autocorrelation of the workload. Another opportunity for improving algorithm behavior might be to look at the autocorrelation of upsets or spikes of missed demand, perhaps as a function of the lookahead window, and dynamically adjust parameters accordingly. It should also be possible to reduce the impact of sampling granularity on missed workload transients by enhancing the epochal algorithm to detect smeared or jittery epochal behavior, perhaps by further developing the lookahead concept discussed in the paper.

The dynamical model of the current algorithm assumes that it takes one sampling interval to power up a server and apply workload to it, or to power it down. In reality, this interval can be either significantly greater or significantly less than one sample interval. For example, a server may be able to be brought into and out of standby very quickly (instant on/off), or it may take a long time to adjust the workload for a given application (high workload inertia). The performance of the algorithm can be improved by incorporating these different dynamical constraints on a scenario-by-scenario basis.

We currently use easily measured indicators of server workload (CPU utilization, etc.) that are independent of any application. This of course has the advantage of generality, but perhaps better algorithm performance could be obtained for a particular application if application-specific indicators of utilization were used, such as transaction response time. However, because high-level indicators of performance may not offer diagnostic insight into the root causes of poor performance, it seems reasonable to formulate diagnostic correlations between basic utilization parameters that are easily measured on the server and the application-level figures of merit.
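
One plausible, deliberately simple form of such a correlation is sketched below, assuming paired samples of CPU utilization and transaction response time are available; both the Pearson correlation and the utilization-ceiling helper are illustrative assumptions, not part of the algorithm described here.

```python
import numpy as np

def utilization_response_correlation(cpu_util, response_time):
    """Pearson correlation between an easily measured server metric (CPU
    utilization) and an application-level figure of merit (transaction
    response time), sampled over the same intervals."""
    return float(np.corrcoef(np.asarray(cpu_util, dtype=float),
                             np.asarray(response_time, dtype=float))[0, 1])

def utilization_ceiling(cpu_util, response_time, target):
    """Highest observed utilization at which response time still met the
    target; usable as a proxy threshold when only server-level metrics are
    cheap to measure online."""
    ok = [u for u, r in zip(cpu_util, response_time) if r <= target]
    return max(ok, default=None)
```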

There are several factors to consider when deciding which machines in a complex to power on or off. Ultimately, a decision criterion must be formulated on the basis of a combination of factors such as trying to equalize power-on cycles across all machines, preferentially powering off machines in hot spots, preferentially powering off machines that are unhealthy or require rejuvenation, machine characteristics, and other aspects we have not yet considered. For example, if different machines have different power dissipations or different capacities, we need to adapt the algorithm to power on the machines that maximize capacity subject to minimizing power; this appears to be a linear programming problem. Our current position is that we should keep the more powerful machines powered on all the time, and preferentially cycle the lower-throughput machines for vernier capacity control, but analysis remains to be performed.
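
For heterogeneous machines an exact treatment would indeed be an integer/linear program; the sketch below is only a greedy capacity-per-watt heuristic, offered to make the selection problem concrete, with illustrative names and data rather than the authors' criterion.

```python
def select_servers(servers, demand):
    """Greedy heuristic: cover the predicted demand while favoring the most
    capacity per watt. 'servers' is a list of (name, capacity, watts) tuples.
    An exact formulation would be an integer/linear program, not this pass."""
    chosen, covered = [], 0.0
    # Most efficient machines first; break ties toward larger capacity.
    for name, capacity, watts in sorted(
            servers, key=lambda s: (s[1] / s[2], s[1]), reverse=True):
        if covered >= demand:
            break
        chosen.append(name)
        covered += capacity
    return chosen

# Example: three heterogeneous servers and a predicted demand of 150 units.
print(select_servers([("big", 100, 200), ("mid", 80, 120), ("small", 60, 100)], 150))
```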

Figure 12

Integration of intelligent power management into IBM Director. [Block diagram: on the IBM Director server, a utilization calculation thread collects utilization from the existing monitoring applications and feeds the power management algorithm, which drives a power control thread; workload control and workload status are exchanged with the Network Dispatcher and its advisor; power control commands are sent to the service processor of each managed server (one of many), which runs an IBM Director agent and the Web server receiving the workload.]


In reality, server group membership can change over time for a variety of reasons, such as failures, recoveries, new servers being added to a complex, etc. The algorithm currently assumes a static server group membership, but can easily be extended to accommodate dynamic groups.

Acknowledgments

We thank Paul Jones and John Reuning of the University of North Carolina at Chapel Hill for the Divahouse workload data, and Herb Lee of IBM Yorktown Research for the Domino workload data. Luis Lastras, Jim Challenger, and Paul Dantzig of IBM Yorktown Research provided access to the Sydney Olympics data, for which we are grateful. The comments of one of the reviewers were especially helpful in improving this paper.

*Trademark or registered trademark of International Business Machines Corporation.

**Trademark or registered trademark of Microsoft Corporation or Linus Torvalds.

References

1. See http://www-1.ibm.com/servers/eserver/xseries/systems_management/xseries_sm.html.
2. T. Mudge, "Power: A First Class Design Constraint for Future Architectures," presented at the High Performance Computer Conference, Bangalore, India, December 2000; IEEE Computer 34, No. 4, 52–58 (April 2001).
3. A. Vahdat, A. Lebeck, and C. Ellis, "Every Joule Is Precious: The Case for Revisiting Operating System Design for Energy Efficiency," Proceedings of the 9th ACM SIGOPS European Workshop, September 2000, pp. 31–36.
4. C. S. Ellis, "The Case for Higher Level Power Management," Proceedings of the 7th IEEE Workshop on Hot Topics in Operating Systems (HotOS-VII), 1999, pp. 162–167.
5. P. Bohrer, E. Elnozahy, T. Keller, M. Kistler, C. Lefurgy, C. McDowell, and R. Rajamony, "The Case for Power Management in Web Servers," Power-Aware Computing, Robert Graybill and Rami Melhem, Eds., Kluwer/Plenum Series in Computer Science, January 2002, Ch. 1, pp. 1–31.
6. J. Chase, D. Anderson, P. Thakar, A. Vahdat, and R. Doyle, "Managing Energy and Server Resources in Hosting Centers," Proceedings of the 18th Symposium on Operating Systems Principles (SOSP'01), October 2001, pp. 103–116.
7. J. Chase and R. Doyle, "Balance of Power: Energy Management for Server Clusters," Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), May 2001, pp. 163–165.
8. E. Pinheiro, R. Bianchini, E. V. Carrera, and T. Heath, "Load Balancing and Unbalancing for Power and Performance in Cluster-Based Systems," Proceedings of the Workshop on Compilers and Operating Systems for Low Power, September 2001; Technical Report DCS-TR-440, Department of Computer Science, Rutgers University, New Brunswick, NJ, May 2001.
9. International Business Machines Corporation, Websphere Software Platform; see http://www.ibm.com/.
10. A. Iyengar, J. Challenger, D. Dias, and P. Dantzig, "High-Performance Web Site Design Techniques," IEEE Internet Computing 4, No. 2, 17–26 (March 2000).
11. J. Aman, C. K. Eilert, D. Emmes, P. Yocom, and D. Dillenberger, "Adaptive Algorithms for Managing a Distributed Data Processing Workload," IBM Syst. J. 36, No. 2, 242–283 (1997).
12. International Business Machines Corporation, BladeCenter; see http://www.ibm.com/.
13. See http://www.ibm.com/products/; http://www.lotus.com/products/.
14. See http://www-3.ibm.com/software/webservers/edgeserver/index.html.

Received November 18, 2002; accepted for publication July 15, 2003


David J. Bradley IBM Systems Group, Research Triangle Park, P.O. Box 12195, Durham, North Carolina 27709 ([email protected]). Dr. Bradley is a Senior Technical Staff Member in the xSeries Architecture and Technology Department. He received a B.E.E. degree in 1971 from the University of Dayton, and an M.S. degree and a Ph.D. degree, both in electrical engineering, from Purdue University in 1972 and 1975, respectively. After joining IBM in 1975, he was one of the "original 12" engineers who developed the IBM Personal Computer in Boca Raton, Florida, where he invented "Ctl-Alt-Del." Since then he has developed a variety of Personal Computer systems and founded and managed the PC Architecture Department. Dr. Bradley has most recently appeared as the answer on Final Jeopardy during College Week.

Richard E. Harper IBM Research Division, Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, New York 10598 ([email protected]). Dr. Harper is a Research Staff Member in the Server Technologies Department at the Thomas J. Watson Research Center. He received a B.S. degree in physics in 1976 and an M.S. degree in aeronautical engineering in 1977 from Mississippi State University, and a Ph.D. degree in computer systems architecture from the Massachusetts Institute of Technology in 1987. Between 1987 and 1995 he developed parallel flight and mission-critical computing system architectures at the Charles Stark Draper Laboratory in Cambridge, Massachusetts. Between 1995 and joining the IBM Thomas J. Watson Research Center in 1998, he worked on system analysis and architecture at Stratus Computer in Marlboro, Massachusetts, a manufacturer of commercial fault-tolerant computers. At IBM, he has worked on performance analysis and architecture of high-end NUMA/SMP architectures, and proactive problem prediction and avoidance technologies. Dr. Harper is an author or coauthor of numerous patents and technical papers.

Steven W. Hunter IBM Systems Group, Research Triangle Park, P.O. Box 12195, Durham, North Carolina 27709 ([email protected]). Dr. Hunter is a Senior Technical Staff Member in the xSeries Architecture and Technology Department. He received a B.S. degree in electrical engineering from Auburn University in 1984, an M.S. degree in electrical engineering from North Carolina State University in 1988, and a Ph.D. degree in electrical engineering from Duke University in 1997. Since 1997 Dr. Hunter has been part of the xSeries organization, where he has most recently been a lead architect and designer for the BladeCenter product. He is licensed as a Professional Engineer, and is a Senior Member of the IEEE and an Adjunct Professor at North Carolina State University. He holds more than a dozen patents spanning hardware and software. Dr. Hunter has published numerous papers and has presented at a variety of conferences and symposiums.
