Cluster Comput (2013) 16:511–525 · DOI 10.1007/s10586-012-0219-6

Evaluating application performance and energy consumption on hybrid CPU + GPU architecture

Edson Luiz Padoin · Laércio Lima Pilla · Francieli Zanon Boito · Rodrigo Virote Kassick · Pedro Velho · Philippe O.A. Navaux

Received: 4 January 2012 / Accepted: 14 June 2012 / Published online: 30 June 2012
© Springer Science+Business Media, LLC 2012

Abstract The High Performance Computing (HPC) community aimed for many years to increase performance regardless of energy consumption. Until the end of the decade, a next generation of HPC systems is expected to reach sustained performances of the order of exaflops. This requires many times more performance compared to the fastest supercomputers of today. Achieving this goal is unthinkable with current technology due to strict constraints on supplied power. Therefore, finding ways to improve energy efficiency has become a main challenge in state-of-the-art research. The present paper investigates energy efficiency on heterogeneous CPU + GPU architectures using a scientific application from the agroforestry domain as a case study. Differently from other works, our work evaluates how the workload of the application may affect energy efficiency on hybrid architectures. Results point out that the power supply constraints also depend on the workload.

E.L. Padoin (✉) · L.L. Pilla · F.Z. Boito · R.V. Kassick · P. Velho · P.O.A. Navaux
Institute of Informatics, Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, RS, Brazil
e-mail: [email protected]

L.L. Pilla
e-mail: [email protected]

F.Z. Boito
e-mail: [email protected]

R.V. Kassick
e-mail: [email protected]

P. Velho
e-mail: [email protected]

P.O.A. Navaux
e-mail: [email protected]

E.L. Padoin
Department of Exact Sciences and Engineering, Regional University of Northwest of Rio Grande do Sul (UNIJUI), Ijuí, RS, Brazil

Keywords Energy efficiency · Heterogeneous architectures · CPU + GPU · Energy consumption

1 Introduction

A wide range of applications in areas such as physics [32], weather forecast [23], oil exploitation [28], and industry [12] require high processing power. To respond to this demand, High Performance Computing (HPC) systems gather the processing power of several computational resources. Recent changes in the way the microprocessor industry increases the performance of its products have influenced state-of-the-art HPC systems. Today's processors use many cores with reduced clock frequencies to increase performance. As a result, current HPC environments have many cores per processor.

Graphics Processing Units (GPUs) followed a similar tendency, having several processing elements (PEs) inside a single silicon die. Reported results indicate that the performance of commonly found computationally intensive kernels at least doubles when running on GPUs [20]. As a consequence, the combined use of CPUs and GPUs in HPC systems has become a popular choice among top-ranked and yet-to-come platforms [8].

Performance in the HPC community is often translated and measured in Flops (Floating Point Operations per Second). With the increasing demand for processing power, HPC architectures tend to grow in performance. Recently, both e-science and industry claim to be exascale ready (1 EFlops = 10^18 Flops), while the most powerful architectures are still on petascale (1 PFlops = 10^15 Flops).


However, this step ahead faces the limits of energy consumption. If an exascale machine was built using today's fastest supercomputer technology, it would consume more than 1 GW. Because of that, one of the current main concerns of the HPC community is to increase performance while respecting critical energy consumption constraints [11, 14].

The two main metrics of energy efficiency for HPC are picoJoules/Flop [7, 18, 19], the amount of energy spent per floating point operation, and Flops/W [15], the peak performance produced per Watt. Based on such metrics, research results point out directions to improve energy efficiency [16, 18, 22, 30]. In a general way, these works aim at evaluating the energy consumption while using Dynamic Voltage and Frequency Scaling (DVFS). Another approach aims at investigating the trends of frequency and performance [38]. However, this practice is restricted to platform maintainers.

Energy efficiency is a hard feasibility constraint of new HPC systems. Hence, several state-of-the-art studies evaluate software and hardware approaches to improve energy efficiency. In addition to these approaches, our paper tries to understand how the applications' high-level parameters, such as workload, can make wise use of electric power.

In the present paper we analyze performance and energy consumption of heterogeneous architectures based on CPU + GPU. We focus on understanding the cause-and-effect relationship between application parameters and energy efficiency. Specifically, we take an important application from the agroforestry domain and adapt it to hybrid CPU + GPU environments. By porting this application, we aim at understanding the trade-off between performance and energy consumption as a function of workload. Our findings point out that the use of the GPU globally leads to a better energy efficiency. However, the optimal workload varies depending on the application and GPU characteristics.

The remaining content of the paper is organized as follows. In Sect. 2, we present related work. In Sect. 3, we detail the problem of energy consumption in HPC. CPU + GPU heterogeneous architectures appear in Sect. 4. In Sect. 5, we describe the case-study application, showing the experimental methodology in Sect. 6 and experimental results in Sect. 7. Finally, in Sect. 8, we present a summary of the contributions as well as directions for future research.

2 Related work

Several researchers have evaluated the trade-off of performance and power consumption on CPU + GPU heterogeneous architectures.

Buck et al. [4] propose four applications (SAXPY and SGEMV BLAS operators, image segmentation, FFT, and ray tracing) to evaluate the performance of heterogeneous architectures. Jiao et al. [16] used a subset of these applications (FFT and BLAS operators) to analyze the power and energy efficiency of GPU and CPU when dynamically scaling voltage and clock frequency. Wang et al. [38] use a similar approach. They indicate that the use of GPUs is a straightforward path to achieve green computing. In spite of similarities, their focus is to fine-tune low-level parameters which are rarely available at user level. Our focus is to fine-tune the workload of the application directly, enabling the final user to explore power efficiency regardless of low-level details.

Ren and Suda [30] present an approach to automatically allocate workload on CPU + GPU architectures aiming at improving energy efficiency. To allocate the workload, they use a model based on the power consumption of distinct modules. One of the main concerns in this process is to measure separately the power consumption of CPU and GPU, as well as of some other devices consuming electric power during execution. To do so, they make approximations by measuring power changes on the main board. However, one should expect, or at least estimate, the bias of these approximations. To avoid bias while measuring power, we decided to sample power consumption directly from the power outlet. This method is broadly used in the maintenance of electric devices to monitor power consumption.

Another paper that follows a similar methodology proposes automatic mapping and allocation of workload on heterogeneous systems [22]. Their approach reduces the execution time by up to 25 % and the energy consumption by up to 20 % when compared to static workload allocation. Nevertheless, the device they used is unable to directly measure energy consumption. To overcome this limitation, they use the product of power and execution time to estimate the total energy consumption. This approximation is error prone since the machine time can present discrepancies with the measuring equipment time. To avoid this kind of issue, we use a power measuring pad that can directly estimate power consumption.

Liu et al. [21] propose a task mapping method for heterogeneous CPU + GPU clusters that aims at decreasing the energy consumption while respecting the tasks' deadlines. The proposed technique was able to save 50 % of the energy spent in a day by a cluster architecture with two GPU cards per CPU. This approach focuses more on the infrastructure level and the system's design than on the developer's point of view.

Our study is complementary to the related papers summarized here. While their focus is mainly on the architecture and infrastructure levels, we focus on the application level, studying the impact of the application workload.


3 Energy consumption

The energy consumption of computer systems has been the subject of research in digital circuits. Kogge [18] studied this aspect. In his findings, he states that the average demand of about 70 picoJoules (pJ) per floating point operation can decrease by a factor of seven (becoming 10 picoJoules/Flop). Scaling down the clock frequency is a technique that can improve energy efficiency; however, it decreases the processing power. The results of Younge et al. [39] show that when reducing the clock frequency by 18 %, power consumption decreases 20 % while performance decreases only 5 %. If this scenario arises at the level of digital circuits, it is expected that it will also be present in complex and heterogeneous HPC environments.

With today's technology, an exascale system would consume over a GW of power, making it economically and ecologically impracticable [2, 18, 29]. An indicator of this issue is the average power of the top 10 ranked machines in the Top500 list. In November 2009, it was roughly 2 MW [6]. Two years later this average had increased by 65 %. The K Computer of the RIKEN Advanced Institute for Computational Science, the current first ranked machine, has an energy consumption above 12 MW. This shows how the energy consumption of supercomputers is increasing to unfeasible thresholds.

Doing a simple extrapolation using the same technology as the K Computer (whose current performance is 10.5 PFlops), for instance, one would need 1.2 GW to reach an exaflop. While the performance of HPC computers increased 3000 times in 20 years, the performance per Watt increased only 65 times [39]. This disparity in the way performance and energy efficiency grow makes it impossible to reach exascale. In this context, the study of energy efficiency becomes mandatory to build exascale architectures [1].

Responding to the energy efficiency problem, the HPC community started looking at this matter. To do that, it is necessary to determine metrics that consider both performance and energy consumption. Flops per Watt is a measure of energy efficiency, as it shows the rate of computation provided by a processing element for every Watt of power consumed [30]. The Green500 lists use the Flops/Watt metric to express the efficiency of HPC systems. Because the industry commonly relies on the High-Performance Linpack benchmark (HPL), the Green500 lists rank systems using their HPL Flops results expressed in the Flops/Watt metric [5].

Another metric of energy efficiency that is broadly used is EPI (energy per instruction). This metric comes from microprocessor design, measuring the energy consumed per instruction. EPI registers the mean amount of energy spent per instruction [13], and is obtained through (1). We can also relate this metric to other energy efficiency metrics, among them MIPS (millions of instructions per second) and Flops per Watt, as presented in (2) and (3).

EPI = Joules / Instruction  (1)

where

Joules / Instruction = (Joules / Second) / (Instructions / Second) = Watt / IPS

Energy Efficiency = MIPS / Watt  (2)

Energy Efficiency = FLOPS / Watt  (3)
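The relations in (1)–(3) can be checked numerically. A minimal sketch in Python; the wattage and instruction rate below are illustrative placeholders, not values measured in this paper:

```python
def epi(joules: float, instructions: float) -> float:
    """Energy per instruction, equation (1): EPI = Joules / Instruction."""
    return joules / instructions

def epi_from_power(watts: float, ips: float) -> float:
    """Equivalent form of (1): EPI = Watt / IPS."""
    return watts / ips

def flops_per_watt(flops: float, watts: float) -> float:
    """Energy efficiency, equation (3)."""
    return flops / watts

# Illustrative numbers (not measurements): a device drawing 40 W while
# retiring 2e9 instructions per second over a 10 s run.
watts, ips, seconds = 40.0, 2e9, 10.0
joules = watts * seconds            # energy = power x time
instructions = ips * seconds

# Both forms of equation (1) agree: 2e-8 J = 20 nJ per instruction.
assert epi(joules, instructions) == epi_from_power(watts, ips)
```

Dividing both numerator and denominator of (1) by time is what links the energy view (Joules per instruction) to the power view (Watts per instruction rate).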

In recent Top500 lists, there is an increasing number of heterogeneous architectures using GPU + CPU processing nodes. In the 36th list, from November 2010, 8 of the 10 fastest machines rely on the use of GPUs and present superior Flops/W efficiency. In this list, the average efficiency of the heterogeneous systems is 756 MFlops/W, while that of conventional systems is 211 MFlops/W [31].

Several research projects emerged aiming at improving the energy efficiency of HPC architectures. The Barcelona Supercomputing Center (BSC), for instance, is building an HPC system aiming at power efficiency using ARM processors. With ARM Cortex-A9 processors, which require a maximum of 0.25 Watt per core, the Mont Blanc Zero project aims at reaching 200 PFlops using only 10 MW of power [37]. Mont Blanc Zero is part of a greater project called Mont Blanc [37] that aims to achieve 20 GFlops/W by 2020. To provide an idea of this goal, the current leader of the Green500 list, the IBM Rochester Blue Gene/Q, achieves about 2.0 GFlops/W. Therefore, the Mont Blanc project goal is to increase energy efficiency 10 times in the next decade.

Our research points in the direction of using heterogeneous CPU + GPU architectures and managing the application's parameters to achieve better energy efficiency in HPC systems. For this study, we use the Flops/W metric.

4 Heterogeneous architectures

Heterogeneous architectures are computing systems that use different processing units to maximize performance [3]. A typical example of a heterogeneous architecture involves the combined use of a multi-core CPU (Central Processing Unit) and a GPU (Graphics Processing Unit). Figure 1 depicts such an architecture. When compared to a traditional CPU, GPUs offer a higher peak performance and better energy and/or cost efficiency. However, they work as accelerators and require a CPU acting as a host.


Fig. 1 An example of heterogeneous architecture with CPU and GPU

These processors have completely different design goals. CPUs were built to provide performance for sequential applications. They are composed of few general purpose cores working at high frequencies with big cache memories. On the other hand, Graphics Processing Units were made aiming at rendering massively parallel graphics. They have hundreds of simple processing units working at low frequencies with small or no cache memories. These characteristics lead to their superior energy efficiency.

To use a GPU for computationally intensive algorithms, one generally needs a specific programming library. A state-of-the-art programming architecture and environment for GPUs is CUDA (Compute Unified Device Architecture) [26]. CUDA GPUs can be found in off-the-shelf computers and in supercomputers like Tianhe-1A [31]. These GPUs are used in our experiments, as can be seen in Sect. 6.1.

CUDA means both a SIMD (Single Instruction, Multiple Data) architecture and a programming model that extends languages (such as C) to use these GPUs. A process on the CPU runs a special function, named kernel, which executes on the GPU. All data has to be transferred to and from the GPU's memory, which incurs a communication overhead.

A CUDA GPU is composed of simple processing units called Scalar Processors (SPs) or CUDA cores. These SPs are organized in Streaming Multiprocessors (SMs). An SM can be composed of 8 up to 48 CUDA cores and 1 or 2 instruction units (or warp schedulers).

The CUDA programming model works with the abstraction of thousands of threads computing in parallel. These threads are organized in blocks. Threads in the same block can be easily synchronized with a barrier function. Multiple blocks organized as a grid compose the kernel. A block of threads runs on an SM and each of its threads executes on an SP. A thread block can have more threads than an SM can run in parallel.
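In CUDA C, each thread locates its data element through the standard index computation blockIdx.x * blockDim.x + threadIdx.x. A small Python sketch of this grid/block index arithmetic (the grid and data sizes are illustrative):

```python
def global_thread_index(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """1-D global index of a thread, as computed inside a CUDA kernel:
    blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx

# A grid of 4 blocks of 256 threads covers 1024 elements;
# thread 3 of block 2 handles element 515.
assert global_thread_index(2, 256, 3) == 515

# Kernels typically guard against the grid overshooting the data size
# (here 1024 threads for 1000 elements): only indices < n do work.
n = 1000
active = [i for b in range(4) for t in range(256)
          if (i := global_thread_index(b, 256, t)) < n]
assert len(active) == n
```

This mapping is why the number of root cells translates directly into the number of usable threads in the ported application.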

The CUDA architecture also has a memory hierarchy. All memories inside the SM have a small size and low latency. The caches accelerate the access to textures and constants. Each block of threads running on the same SM sees its own shared memory. The global memory has a large size and a high latency (from 400 to 600 cycles [26]) and can be accessed by all threads. To increase efficiency and better utilize the memory bandwidth, some access patterns can coalesce several memory requests into one request with a large word [26]. The newer version of this architecture also contains an L1 cache per SM and a shared L2 cache [27].

5 Case study: soil irrigation system

To analyze the energy consumption and energy efficiency of heterogeneous architectures, we use an application from the agroforestry domain. The case-study application searches for an optimal irrigation system model [9, 24, 25]. Studying the behavior of water absorption is of great interest in agroforestry and agriculture because it avoids wasting water while irrigating the soil. In previous works, this application was parallelized on a cluster using MPI. Results show a speedup of 6.98 with 20 nodes and 1.68 with 2 nodes on a cluster of workstations.

5.1 The irrigation model

The application uses a model that represents the water infiltrating a cylindrical tube. The cylindrical tube models the soil, which can be in two states regarding the water level: saturated and unsaturated. The model hence considers the irrigation time on the cylinder center with a fixed continuous flow of water, as depicted in Fig. 2. The water infiltrates the soil due to a gradient of total potential. This potential is the sum of matrix potentials and gravity potential in unsaturated soils. When the soil is saturated, the pressure is used instead of gravity. Through this relationship of potentials, water infiltrates through the tube in the directions r and z, varying the humidity of each root cell over time.

The absorption of water depends on the amount of water in the soil and also on daylight. So, in (4) we consider the changes caused by the amount of sunlight during a certain period of the day. For this purpose, parameter b is constant to guarantee that the maximum absorption is between 6 am and 6 pm. With a similar purpose, c is fixed to 12 pm to model the symmetry of humidity in time slots equally distant from midday. β is a constant of proportionality that optimizes the model. The problem of finding this parameter is iteratively solved by fitting it to the inverse problem.

S = β(θ − θr)b(t−c) (4)

where
S   roots' absorption rate (cm³/h)
β   proportionality constant (cm³/h)
b   obtained experimentally (l/h)
c   midday constant (h)
t   time (h)
θ   volumetric humidity content (VHC) (cm³)
θr  real measured VHC (cm³)


Fig. 2 The cylindrical tube which models the root while being infiltrated with water, and its sampled surface

5.2 Numerical method

Due to the way that the water infiltrates the soil, the problem has a diffusive nature. For this reason, we use a finite-difference method with central differences to solve the problem. The simulation is defined in r, z, δ and t. The VHC of each cell depends on the position of the cell (r and z) and the size of each cell δ. At each time slot t, the VHC in a root cell is θ(r, z, δ). Also, in each iteration the VHC depends on the previous state θ(r, z, δ, t − Δt), where Δt is the time variation between two time slots. The humidity θ of a cell is also driven by the humidity of the neighbor cells. This data dependency is a strong constraint of the parallel solution.
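The neighbor dependency described above can be illustrated with a minimal explicit central-difference update on a 2-D grid. This is a generic diffusion stencil sketched in Python, not the authors' actual solver; the grid size, diffusivity and boundary handling are placeholder choices:

```python
import numpy as np

def diffusion_step(theta: np.ndarray, d: float, dt: float, dx: float) -> np.ndarray:
    """One explicit finite-difference time step with central differences:
    each interior cell is updated from its previous value and its four
    neighbors, mirroring the dependency on theta(r, z, delta, t - dt)."""
    new = theta.copy()
    lap = (theta[:-2, 1:-1] + theta[2:, 1:-1] +
           theta[1:-1, :-2] + theta[1:-1, 2:] -
           4.0 * theta[1:-1, 1:-1]) / dx ** 2
    new[1:-1, 1:-1] += d * dt * lap
    return new

# Toy run: a humidity spike at the center of a 16 x 16 grid spreads to
# its neighbors over 10 time steps (d * dt / dx^2 = 0.01, a stable choice).
theta = np.zeros((16, 16))
theta[8, 8] = 1.0
for _ in range(10):
    theta = diffusion_step(theta, d=0.1, dt=0.1, dx=1.0)
```

Because every interior cell reads its four neighbors from the previous step, a parallel version must exchange boundary cells between workers at each iteration, which is exactly the constraint noted above.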

To compute the absorption rate S we rely on an active search network [35, 36] to determine the optimal β. The method considers a set of experimental data Sexp = {θ1, θ2, ..., θk}, as follows:

Step 1—Estimate a set of values βi where probably the optimal β is included.

Step 2—Compute the direct problem with the estimated β, finding the solution for θi(r, z, δ, t).

Step 3—Compute the differences between the estimated solution and the expected result:

di = |Sest_i − Sexp_i|  (5)

Step 4—Find the minimum dmin of the differences di of each solution with respect to the optimal β.

Step 5—Refine the solution, defining a new range to estimate β with respect to:

|βest − βoptimal| < Δβ  (6)

where

Δβ = βi+1 − βi  (7)

Step 6—Repeat steps 2 to 5 until dmin has converged to an accepted minimum error ε:

|d_min^(i+1) − d_min^(i)| < ε  (8)
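Steps 1–6 can be sketched as a simple grid-refinement search. The sketch below is a schematic Python rendering under stated assumptions: the direct problem (the full infiltration simulation) is replaced by a one-parameter stand-in function, and the grid size, refinement rule and tolerances are illustrative choices, not the authors':

```python
def fit_beta(simulate, s_exp, lo, hi, n=11, eps=1e-6, max_rounds=50):
    """Active-search fit: evaluate candidate betas on [lo, hi], keep the
    best one, shrink the interval around it, and repeat until the minimum
    difference d_min converges (steps 1-6 of the method)."""
    prev_dmin = float("inf")
    for _ in range(max_rounds):
        betas = [lo + i * (hi - lo) / (n - 1) for i in range(n)]   # step 1
        diffs = [abs(simulate(b) - s_exp) for b in betas]          # steps 2-3
        i_best = min(range(n), key=diffs.__getitem__)              # step 4
        dmin = diffs[i_best]
        if abs(prev_dmin - dmin) < eps:                            # step 6
            return betas[i_best]
        prev_dmin = dmin
        step = (hi - lo) / (n - 1)                                 # step 5
        lo, hi = betas[i_best] - step, betas[i_best] + step
    return betas[i_best]

# Toy direct problem (hypothetical): S(beta) = 2.5 * beta, with an
# observed rate of 1.0, so the search should recover beta = 0.4.
beta = fit_beta(lambda b: 2.5 * b, s_exp=1.0, lo=0.0, hi=1.0)
assert abs(beta - 0.4) < 1e-3
```

In the real application each call to `simulate` is a full direct-problem run, which is what makes the parameter fit computationally expensive.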

To summarize, the application simulates water infiltrating the soil iteratively with a fixed time slot. At each iteration the model is instantiated to match the results obtained with real experiments. So, the computation is divided in two parts: the first one is the simulation itself and the other is related to fitting the model to the real experiments. This process results in several iterations over the matrix. Therefore, the iterations' complexity is related to the matrices' orders, i.e., the number of cells on the cylinder.

The application requires huge amounts of processing power to simulate relevant (realistic) scenarios. To run several experiments in feasible time, our tests simulate the water absorption for 18 minutes with one cylinder whose radius is 15 cm and height is 33 cm. We fixed the time step Δt at 1/100 s, i.e., 100 iterations per simulated second. The input parameter of this application is hence the number of cells on the cylinder, N, which is the product of the numbers of cells in the directions r and z. Although these parameters define a simulation of a representative phenomenon, a specialist in the agroforestry domain would need to increase the scale of the described test cases to simulate complete environments.
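A straightforward reading of the stated parameters (18 simulated minutes at 100 iterations per second) gives the total number of time steps; the grid size below is only an example, the actual workloads are the matrix sizes evaluated in Sect. 7:

```python
# Total number of simulated time steps: 18 minutes of infiltration
# at 100 iterations per simulated second.
minutes = 18
iters_per_second = 100
time_steps = minutes * 60 * iters_per_second
print(time_steps)  # 108000

# Per-iteration work scales with N = (cells in r) x (cells in z);
# e.g., a 512 x 512 grid implies this many cell updates overall.
cell_updates = time_steps * 512 * 512
print(cell_updates)  # 28311552000
```

This back-of-the-envelope count is why even the reduced test scenario already demands substantial processing power.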

Studying ways of increasing this application's performance, we implemented a parallel version for cluster architectures. This version uses MPI, a popular standard to program on clusters. The MPI version of the application distributes slices of root cells among several nodes of a cluster. However, the MPI version has a significant overhead, hindering the exploitation of several cores located on the same computing node. Ideally, we can have as many processing units as the number of root cells. Because of this property, the application is highly suited to GPUs, which provide several independent CUDA cores (SPs). Many state-of-the-art applications were recently ported to GPU aiming at performance and energy consumption. However, these works rather fine-tune low-level parameters to improve the application's energy efficiency. We focus on changing the application workload to verify the trends in energy efficiency. Next, in Sect. 7, we describe our tests and experimental results using this application as case study.

6 Experimental method

This section describes the methodology of our study. We present the execution environment, and then discuss the energy measurement methodology.

6.1 Execution environment

The present paper aims at analyzing the energy efficiency of heterogeneous architectures composed of CPU + GPU. We evaluate our agroforestry application running sequentially, as well as using MPI and CUDA. We used the CUDA programming environment due to its availability and support for the NVIDIA architectures used in our experiments. To improve energy efficiency, we rather focus on tuning the workload of the application instead of relying on low-level parameters that end users are rarely able to modify.

Our evaluation environment is composed of 3 machines, hereafter called Test Platform A, Test Platform B and Test Platform C. They are described below, and their details are summarized in Table 1.

Test Platform A—This environment has an Intel Centrino 2 processor working at 2.4 GHz, with 3 GB of DDR3 memory. It also has an NVIDIA GeForce 9300M GS GPU with 512 MB of memory.

Test Platform B—This platform is composed of an Intel Core 2 Duo processor working at 2.93 GHz, 4 GB of DDR2 memory and an NVIDIA GeForce 8400 GS GPU with 512 MB of memory.

Test Platform C—This environment has an Intel i3 processor working at 3.2 GHz, with 3 GB of DDR3 memory, and an NVIDIA GeForce GTX 460 GPU with 1 GB of memory.

We used these three platforms to evaluate peak power consumption, but only Test Platforms A and C were used to evaluate energy consumption and energy efficiency with different GPU configurations, due to the similarities between Test Platform A and Test Platform B. Both Test Platforms A and B have GPUs with 8 CUDA cores, while Test Platform C's GPU has a total of 336 CUDA cores [17, 34].

All test environments use Ubuntu Linux with kernel 2.6.32-21, GNU compiler (gcc) version 4.4.3, and CUDA version 3.2 with nvcc compiler version 0.2.1221. The next section describes the methodology used in the measurements.

6.2 Measurement methodology

The metrics used in our evaluation and the way they are measured are described below.

Peak Power Consumption: to measure the peak power consumption, we used a Dranetz Powers 4300 [10]. This device works in VAC and directly computes the energy consumption using the voltage root mean square (VRMS). Moreover, it is equipped with an auxiliary memory to hold the collected data for posterior analysis, avoiding intrusiveness. To have an accurate evaluation of energy consumption, we connected the device between the wall outlet and the test platforms. After the experiments, we read the collected data with an auxiliary computer, as illustrated in Fig. 3. The results provide instantaneous power (peak power consumption, kW) and energy consumption (kWh).

Table 1 Detailed configuration of Test Platforms A, B and C

|                                  | Platform A CPU   | Platform A GPU | Platform B CPU   | Platform B GPU | Platform C CPU | Platform C GPU |
|----------------------------------|------------------|----------------|------------------|----------------|----------------|----------------|
| Vendor name                      | Intel Centrino 2 | NVIDIA GeForce | Intel Core 2 Duo | NVIDIA GeForce | Intel i3       | NVIDIA GeForce |
| Model                            | P8600            | 9300M GS       | E7500            | 8400M GS       | 550            | GTX 460        |
| Clock frequency                  | 2.4 GHz          | 580 MHz        | 2.93 GHz         | 400 MHz        | 3.2 GHz        | 1350 MHz       |
| Memory                           | 3 GB DDR3        | 512 MB DDR2    | 4 GB DDR2        | 512 MB DDR3    | 3 GB DDR3      | 1 GB DDR5      |
| Memory bus speed (MHz)           | 800              | 800            | 667              | 567            | 1333           | 1800           |
| Number of cores                  | 2                | –              | 2                | –              | 2              | –              |
| Number of CUDA cores (SPs)       | –                | 8              | –                | 8              | –              | 336            |
| Streaming multiprocessors (SMs)  | –                | 1              | –                | 1              | –              | 7              |
| CUDA cores (SPs) per SM          | –                | 8              | –                | 8              | –              | 48             |
| Word length (bits)               | 32               | 64             | 64               | 64             | 64             | 256            |
| Maximum power consumption (Watt) | 25               | 13             | 65               | 40             | 73             | 160            |

Execution Time: the measured time, in seconds, to execute the application. This value is obtained directly from the operating system through the time system call.

Energy Consumption: the energy consumption is also measured with the Dranetz equipment.

Energy Efficiency: to express the energy efficiency of the test platforms we used the Flops/W metric. This is the metric of choice in the Green500 lists. In practice, the Green500 lists use average power consumption to obtain energy efficiency. Balaji et al. [33] sustain that the HPC community should use peak instantaneous power, which is more suitable to obtain energy efficiency coping with limitations of the power supply. In our tests, to calculate the energy efficiency, we used the performance of the application described in Sect. 5 and the peak instantaneous power measured with the Dranetz power meter.
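The efficiency figure follows directly from these two quantities: achieved Flops divided by peak instantaneous power. A minimal sketch with illustrative numbers (not measured results from this paper):

```python
def mflops_per_watt(flop_count: float, seconds: float, peak_watts: float) -> float:
    """Energy efficiency as used here: achieved Flops (operations per
    second) divided by the peak instantaneous power, in MFlops/W."""
    return flop_count / seconds / peak_watts / 1e6

# Illustrative (not measured) numbers: 5e10 floating-point operations
# executed in 20 s with a 40 W peak give 2.5 GFlops / 40 W.
print(mflops_per_watt(5e10, 20.0, 40.0))  # 62.5
```

Using the peak rather than the average power makes the metric conservative with respect to power-supply limits, which is the point raised by Balaji et al. [33].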

Fig. 3 Hardware setup to measure energy consumption. We use a Dranetz measuring pad between the power outlet and the computer to sample energy consumption

Figure 4 presents an example of the Dranetz's output. The top graph shows the instantaneous power (kW) as a function of time. The bottom graph depicts the total energy consumed (kWh), i.e., the accumulated consumed energy. Both graphs use a sampling interval of 5 seconds.

7 Experimental results

In this section, we present the results obtained executing the application described in Sect. 5 on the platforms described in Sect. 6.1. Each of the following subsections presents the results for one of the metrics discussed in Sect. 6.2: peak power consumption in Watt, execution time in seconds, energy consumption in Wh and energy efficiency in MFlops/W.

7.1 Peak power consumption

Table 2 shows the peak power consumption observed in the execution of the sequential, MPI and GPU versions of the application with several workloads (matrix sizes) on Test Platform A.

The results shown in Table 2 indicate that the peak power consumption increases as the workload grows. With the GPU

Table 2 Peak power consumption (Watts) on Test Platform A

Matrix size Sequential MPI (2 processes) GPU

Idle 22.86 22.86 22.86

8 × 8 27.89 32.72 23.76

16 × 16 28.05 33.45 23.75

32 × 32 33.64 34.02 24.43

64 × 64 33.98 34.36 30.86

128 × 128 34.62 39.79 33.23

256 × 256 34.82 40.49 41.52

512 × 512 39.23 41.34 43.78

Fig. 4 Instantaneous power (kW) on top with energy consumption (kWh) in the bottom. Both graphs come from the Dranetz interface

Page 8: Evaluating application performance and energy consumption on hybrid CPU+GPU architecture

518 Cluster Comput (2013) 16:511–525

Table 3 Peak power consumption (Watts) on Test Platform B

Matrix size Sequential MPI (2 processes) GPU

Idle 68.90 68.90 68.90

8 × 8 81.58 87.53 72.92

16 × 16 82.13 89.07 73.02

32 × 32 82.52 88.18 71.31

64 × 64 82.95 89.15 74.84

128 × 128 83.51 91.96 79.35

256 × 256 82.73 92.04 75.37

512 × 512 83.08 88.16 87.79

Table 4 Peak power consumption (Watts) on Test Platform C

Matrix size Sequential MPI (2 processes) GPU

Idle 62.17 62.17 62.17

8 × 8 70.94 72.38 102.00

16 × 16 76.52 87.42 105.64

32 × 32 76.04 88.35 107.36

64 × 64 76.64 89.05 102.32

128 × 128 77.10 89.05 125.36

256 × 256 78.98 91.19 128.86

512 × 512 80.22 94.48 129.16

using matrices of order 8, the peak power consumption was 23.76 Watts. This value almost doubled with matrices of order 512 (43.78 Watts). We verify a similar behavior with the MPI version; however, the peak power consumption jump is lower. Peak power consumption increased by about 26 % comparing the minimum workload 8 × 8 (32.72 Watts) with the maximum workload 512 × 512 (41.34 Watts) on the MPI program.

The workload also affects the peak power consumption on the other test platforms. Table 3 shows the peak power consumption on Test Platform B. This time the increase in the GPU version's power consumption when growing the workload from the minimum (8) to the maximum (512) was much lower, precisely 20.4 %. However, using only the CPU, with the sequential and MPI versions, peak power consumption increased even less: 1.84 % and 0.72 %, respectively. These results indicate that the increase in peak power consumption of the GPU relative to the CPU remains around a factor of 10, very similar to Test Platform A. These results show that GPUs are much more power hungry than CPUs. However, advances in the conception of such architectures may affect this test. So, next we verify the peak power consumption on Test Platform C.

The peak power consumption observed on Test Platform C appears in Table 4. With this platform, the results present a behavior different from the one observed with Test Platform A and Test Platform B. With this newer GPU model, the peak power consumption increases by 26.6 % when comparing extreme changes in the workload, from 8 × 8 to 512 × 512. The sequential version increases the peak power consumption by 13.09 %, while the MPI version increases by about 30.5 %. So, on this last platform the relative increase for the GPU is smaller, no longer dominating the power consumption jump. Several optimizations in the GPU design may explain this improvement: the 256-bit word size, the number of CUDA cores, the improved bus speed, and so on. The exact reason for this improvement is hard to pinpoint due to constraints that arise from the proprietary nature of the CUDA architecture. However, we can verify that, when the workload increases, the peak power consumption also increases. The peak power consumption is the maximum sampled power consumption and determines a hard constraint on the power supply.

Fig. 5 Maximum peak power consumption

To understand the energy cost, in terms of peak power consumption, of running the application on the several test platforms, we fixed the size of the matrices at 512 × 512 and compared the peak power consumption. Our goal is to verify whether the nominal peak power consumption of each platform was reached with the maximum workload. Then we compare the increase of peak power consumption of the sequential version with the other test platforms. Figure 5 presents these results. We can observe that the GPU version on Test Platform A and Test Platform B is near the maximum potential consumption stated in the processor specification, respectively 43.78 Watts and 87.79 Watts (see Table 1). However, on Test Platform C the peak power consumption, 129.16 Watts, is greater than all the others. These experiments were designed to stress peak power consumption to evaluate the power supply. According to recent studies, such as the Exascale Technology Study of the Defense Advanced Research Projects Agency (DARPA) [18], power consumption is a limiting factor of future HPC systems. Throughout our analysis we saw that peak power consumption also depends on the application workload. This fact is often neglected in state-of-the-art studies.

Throughout this section we analyzed peak power as a function of the application workload. We also showed that the


Table 5 Execution time (seconds) on platforms A and C

Matrix size Test Platform A Test Platform C

Sequential MPI (2 processes) GPU Sequential MPI (2 processes) GPU

8 × 8 0.239 ± 0.052 1.342 ± 0.087 0.455 ± 0.176 0.640 ± 0.0043 1.015 ± 0.1957 0.264 ± 0.0251

16 × 16 4.729 ± 0.017 5.274 ± 0.105 1.120 ± 0.246 3.514 ± 0.018 4.761 ± 0.0036 0.294 ± 0.1279

32 × 32 5.715 ± 0.054 7.235 ± 0.046 2.138 ± 0.216 4.298 ± 0.0131 4.949 ± 0.0112 0.359 ± 0.1155

64 × 64 16.317 ± 0.149 21.326 ± 0.094 7.599 ± 0.153 11.954 ± 0.0086 14.342 ± 0.0179 0.553 ± 0.1185

128 × 128 55.446 ± 0.370 69.108 ± 0.016 49.231 ± 0.104 40.118 ± 0.1619 47.910 ± 0.0974 1.348 ± 0.1171

256 × 256 297.428 ± 0.824 245.624 ± 0.207 362.430 ± 0.228 150.288 ± 0.1134 170.347 ± 0.2334 4.565 ± 0.1182

512 × 512 946.160 ± 0.764 1033.326 ± 0.176 1478.420 ± 0.792 573.964 ± 0.1745 728.191 ± 0.2902 17.152 ± 0.1127

peak power consumption can reach the limits of the power supply and must be considered in the design of HPC systems. However, we need both peak power and time to achieve our goal of analyzing energy efficiency. In the next section, we present execution time results.

7.2 Execution time

Measuring energy consumption depends on time. Moreover, time is also used to estimate performance, generally measured as operations per second. The results presented in this section are necessary to estimate both metrics: energy consumption and performance. For the sake of clarity, we present in this section the application execution time alone, entering the details of energy consumption in the next section.

Table 5 shows the execution times of the three implementations (sequential, MPI and CUDA) on Test Platforms A and C. The results presented are the arithmetic average of 10 executions. The table also shows the standard deviation for all the values.

On Test Platform A, the MPI version was on average 1.78 times slower than the sequential one. The reason for such results with MPI compared to the sequential version comes from the overhead of running two processes on the same node and the MPI environment initialization. Even though the application is CPU bound, the CUDA version also presents lower speed results (compared to the sequential version on Test Platform A) due to overhead. For the implementation that uses the GPU, the speedup for matrices of sizes up to 128 × 128 was on average 3.05 compared with the MPI version, and on average 2.14 compared with the sequential version. However, when the matrix size is greater than 128 × 128, the execution times of the CUDA version are greater than both the MPI and sequential versions. This happens because of the data dependency, as discussed in Sect. 5.
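The average speedups quoted above can be checked directly against the Table 5 means. A minimal sketch (ours, not the authors' tooling), using the published Test Platform A times for the sizes 8 × 8 through 128 × 128:

```python
# Mean execution times (s) from Table 5, Test Platform A, sizes 8x8..128x128.
seq = [0.239, 4.729, 5.715, 16.317, 55.446]   # sequential
mpi = [1.342, 5.274, 7.235, 21.326, 69.108]   # MPI (2 processes)
gpu = [0.455, 1.120, 2.138, 7.599, 49.231]    # CUDA

def avg_speedup(ref, new):
    """Average of the per-size ratios ref_time / new_time."""
    ratios = [r / n for r, n in zip(ref, new)]
    return sum(ratios) / len(ratios)

print(round(avg_speedup(mpi, gpu), 2))   # 3.05: GPU speedup over MPI
print(round(avg_speedup(seq, gpu), 2))   # 2.14: GPU speedup over sequential
```

Note that the 8 × 8 case, where the GPU is actually slower than the sequential run, is included in the average, which is why the 2.14 figure is lower than the per-size ratios for the mid-range matrices.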

We observe a similar behavior on Test Platform C with the MPI version. In Table 5 we can observe that the MPI version's execution time on Test Platform C increases by a factor of almost 4 from one matrix size to the next, for matrices over 32 × 32. The MPI version was always slower than the sequential version due to the overhead. However, the CUDA version on Test Platform C achieved better speedups in all workload settings. On Test Platform C, the GPU implementation's speedup is on average 32.04 for matrices of order 128 and above. The main reason for such a high improvement on Test Platform C compared to Test Platform A comes from the GPU architecture, which has an improved clock frequency (from 580 MHz to 1.35 GHz) and memory clock (from 800 MHz to 1.8 GHz). Moreover, Test Platform C's GPU has 336 CUDA cores while Test Platform A's has 8 CUDA cores.

The CUDA version has better scalability on Test Platform C because of the number of available cores. We can see that, when multiplying the workload by a factor of n, the execution time increases by much less than a factor of n. In the best case, when increasing the workload from 16 × 16 to 32 × 32 (i.e., 4 times), the execution time increases by only about 20 %. Based on the execution time results, we conclude that Test Platform C is better suited in terms of scalability due to its CUDA architecture. However, Test Platform C has a higher clock frequency and memory bandwidth, which may increase the power consumption. Hence, an open question remains about the tradeoff between performance gain and power consumption. In the next section, we evaluate this tradeoff.

7.3 Energy consumption

Having measured peak power consumption and execution time, we present in this section the energy consumption on Test Platform A and Test Platform C. Tables 6 and 7 present the energy consumption (in Wh) on these test platforms with different workloads. The energy consumption of some of the experiments is not shown because their execution times were smaller than the integration interval of the measurement equipment. To compare the energy consumption of the MPI and GPU versions, we show the efficiency factor. It is computed by dividing the energy consumption of the MPI version by the consumption of the GPU one.
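The efficiency factor is a simple ratio; a one-line sketch (ours) reproduces the tabulated values from the published measurements, here the Test Platform A numbers for matrices of order 16:

```python
# Efficiency factor as used in Tables 6 and 7: MPI energy / GPU energy.
# A value above 1 means the GPU run used less energy than the MPI run.

def efficiency_factor(mpi_wh, gpu_wh):
    """How many times less energy the GPU version used than the MPI version."""
    return mpi_wh / gpu_wh

# Published Test Platform A measurements for 16x16: 0.0490 Wh (MPI), 0.0140 Wh (GPU).
print(round(efficiency_factor(0.0490, 0.0140), 2))   # 3.5
```

Values below 1 (as for the 256 and 512 cases on Test Platform A) indicate that the MPI version was the more energy-frugal of the two.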


Table 6 Energy consumption (Wh) of the sequential, MPI and GPU versions on Test Platform A

Matrix size Sequential consumption (Wh) MPI (2 processes) consumption (Wh) GPU consumption (Wh) Efficiency factor (MPI/GPU)

8 × 8 – – – –

16 × 16 0.0446 0.0490 0.0140 3.50

32 × 32 0.0520 0.0650 0.0290 2.24

64 × 64 0.1720 0.2240 0.0970 2.31

128 × 128 0.5150 0.6420 0.5090 1.26

256 × 256 2.6350 2.2100 3.8000 0.58

512 × 512 8.4930 9.2900 16.1940 0.57

Table 7 Energy consumption (Wh) of the sequential, MPI and GPU versions on Test Platform C

Matrix size Sequential consumption (Wh) MPI (2 processes) consumption (Wh) GPU consumption (Wh) Efficiency factor (MPI/GPU)

8 × 8 – – 0.570 –

16 × 16 0.075 0.190 0.570 0.33

32 × 32 0.091 0.190 0.380 0.50

64 × 64 0.200 0.380 0.770 0.49

128 × 128 0.950 1.150 0.770 1.49

256 × 256 1.530 4.200 0.760 5.53

512 × 512 12.430 18.240 1.340 13.61

The results on Test Platform A show that the parallel GPU version has a smaller energy consumption than the MPI version for matrices of order up to 128. For matrices of order 256 and 512, the MPI version spends less energy. This happens mostly because, for these two matrix sizes, the GPU version takes more time to finish its execution than the MPI version (as seen in Table 5).

On Test Platform C, the experiments with matrices larger than 64 × 64 using the GPU presented smaller energy consumptions than those presented by the MPI and sequential versions. The GPU version's biggest energy consumption (1.34 Wh, with the matrix of order 512) is only 17 % greater than the energy consumption of the MPI version for the matrix of order 128 (1.15 Wh), a much smaller instance. The energy consumed by the MPI version for the matrix of order 256 was 4.2 Wh, while the GPU version consumed only 0.76 Wh. This represents a 5.53 times smaller consumption. This difference increases to 13.61 when considering a matrix of order 512.

With workloads lower than 128, Test Platform A has a peak power consumption of on average 26 Watts, while Test Platform C averages 105 Watts. However, peak power consumption alone is insufficient to determine energy consumption. Considering execution time, we have lower energy consumption on Test Platform A than on Test Platform C. For workloads above 128, we see that Test Platform C outperforms Test Platform A. Figures 6(a) and 6(b) show the energy consumption and the execution time of the sequential, MPI and GPU versions on Test Platforms A and C.

When using the GPU, an increase in the peak power consumption can be seen. This happens mainly on Test Platform C. However, even with a higher peak power consumption, the energy consumption of some of these experiments was smaller than that of the sequential or MPI versions. This happens because the GPU was able to greatly reduce the execution time of the application. This means that we can spend less energy by using GPUs that consume more power.

The energy consumption results directly determine the effective cost of maintaining the platform. However, we aim to minimize energy consumption (the maintenance cost) while maximizing system speed (the performance benefit). Finding this tradeoff is of utmost importance to design HPC systems that respect the environment. With that goal in mind, in the next section we present results on energy efficiency.

7.4 Energy efficiency

In this section, to evaluate energy efficiency, we present results obtained from the number of operations performed by the application, the execution time and the energy consumption. Tables 8 and 9 present, for each matrix order, the number of operations, the number of operations executed per second (in MFlops) and the energy efficiency (in MFlops/W) for the three versions of the application on Test Platforms A and C.
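The two derived columns can be reconstructed from the measured quantities. In our reading of the tables, the tabulated MFlops/W values numerically match operations (M) divided by measured energy (Wh); the sketch below (ours, not the authors' code) reproduces the 64 × 64 GPU row of Table 8 for Test Platform A (47.254 M operations, 7.599 s, 0.097 Wh):

```python
# Derived metrics behind Tables 8 and 9: performance is operations over
# time; the tabulated efficiency matches operations over energy in Wh.

def mflops(operations_m, exec_time_s):
    """Performance: millions of operations per second."""
    return operations_m / exec_time_s

def mflops_per_wh(operations_m, energy_wh):
    """Efficiency: millions of operations per Wh of consumed energy."""
    return operations_m / energy_wh

print(round(mflops(47.254, 7.599), 2))        # 6.22, cf. Table 8
print(round(mflops_per_wh(47.254, 0.097), 2)) # 487.15, cf. 487.16 in Table 8
```

The small discrepancy in the last digit (487.15 vs. 487.16) is consistent with the published values being computed from unrounded measurements.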


Fig. 6 Energy consumption (bars) and execution time (lines) as a function of workload on the test platforms

The best energy efficiency observed on Test Platform A was 487.16 MFlops/W, with the CUDA version and matrices of order 64. The same test yielded the best performance on this platform, 6.22 MFlops. As the matrix order increases, the performance (in MFlops) of the GPU version decreases. The sequential version has a performance peak with 128 × 128 matrices. This workload minimizes cache effects due to the strong interaction between neighbor cells. The CUDA version presents better results with workloads smaller than the number of CUDA cores of Test Platform A.

With workloads above 128 × 128, performance decreases, implying an energy efficiency drop. With matrices of size 512 × 512, the GPU version has an energy efficiency of 184.31 MFlops/W, 1.74 times smaller than the efficiency of the MPI version, and 1.9 times smaller than the sequential version's.

On Test Platform C, we can observe that the energy efficiency increases with the matrix order, differently from what happened on Test Platform A. Therefore, with matrices of order 512, the best energy efficiency was reached,


Table 8 Detailed energy efficiency with Test Platform A

Matrix size Operations (M) Sequential MPI (2 processes) GPU

MFlops MFlops/W MFlops MFlops/W MFlops MFlops/W

8 × 8 0.815 3.42 – 0.61 – 1.79 –

16 × 16 3.086 0.65 69.20 0.58 62.98 2.76 220.44

32 × 32 11.990 2.10 230.59 1.66 184.47 5.61 413.47

64 × 64 47.254 2.90 274.74 2.22 210.96 6.22 487.16

128 × 128 187.603 3.38 364.28 2.71 292.22 3.81 368.57

256 × 256 747.582 2.51 283.71 3.04 338.27 2.06 196.73

512 × 512 2,984.664 3.15 351.43 2.89 321.28 2.02 184.31

Table 9 Detailed energy efficiency with Test Platform C

Matrix size Operations (M) Sequential MPI (2 processes) GPU

MFlops MFlops/W MFlops MFlops/W MFlops MFlops/W

8 × 8 0.815 1.27 – 0.80 – 3.08 –

16 × 16 3.086 0.88 41.15 0.65 16.24 10.48 5.41

32 × 32 11.990 2.79 131.77 2.42 63.11 33.35 31.55

64 × 64 47.254 3.95 236.27 3.29 124.36 85.37 61.37

128 × 128 187.603 4.68 197.48 3.92 163.13 139.11 243.64

256 × 256 747.582 4.97 488.62 4.39 178.00 163.71 983.66

512 × 512 2,984.664 5.20 240.31 4.10 163.63 174.01 2,227.36

2,227.36 MFlops/W. This value is 12 times greater than what was observed on Test Platform A (184.31 MFlops/W). Comparing the GPU version with the MPI and the sequential versions, the first was 13.6 times better than the MPI implementation, and 9.27 times better than the sequential one. This increase in energy efficiency happens because the GPU from Test Platform C has more blocks per grid, enabling several threads per block. So, the scalability obtained by adjusting those two parameters improves both energy consumption and performance.

We observed that the peak power consumption increases as the workload grows. On Test Platform A, the lowest peak power consumption was reached by the GPU version for matrices of size up to 128 × 128. For larger matrices, the best (lowest) peak power consumption is presented by the sequential version. The performance of the GPU version was better than the sequential one for matrices of order up to 128, and smaller for larger matrices. The results of the GPU version for energy consumption and energy efficiency follow the same behavior: the GPU version presents better results for matrices smaller than 256 × 256, and the worst results for large matrices.

Results reinforce that energy consumption depends on performance. The GPU version performs poorly for large matrices because of limitations of the GPU card from Test Platform A, leading to higher energy consumption. On Test Platform B, the GPU card has the same number of CUDA cores and a configuration slightly better than the card from Test Platform A. On this platform, the GPU version presented smaller peak power consumption than the sequential version's with matrices of size up to 256 × 256. Results also show that the GPU application has improved energy efficiency compared to the MPI version.

Test Platform C has a GPU card with more CUDA cores, which is considerably more sophisticated than the cards from Test Platform A and Test Platform B. On this platform, the peak power consumption of the GPU version was always greater than that presented by the sequential version. In spite of that, the GPU version presented better performance for all the workloads. However, its energy consumption and energy efficiency were better than the sequential version's only for matrices of order 128 and larger. We believe that this happens because we do not use all the available resources of the GPU card with small matrices. With workloads larger than what was tested in this work, we expect the same behavior observed on Test Platforms A and B: at some workload size, the GPU version will stop being advantageous and present worse energy consumption and energy efficiency than the sequential one.

8 Conclusion

The present paper describes an energy consumption study on CPU + GPU heterogeneous architectures. The evaluation used an important application from the agroforestry domain that optimizes the use of water during irrigation.


Current state-of-the-art research in the field has tried to fine-tune low-level parameters to improve the energy efficiency of CPU + GPU applications. In contrast, we rely on application-level parameters such as the workload. Our results show that changing the workload can drastically improve the energy efficiency of CPU + GPU heterogeneous architectures.

To analyze energy efficiency, we used the CUDA programming toolkit. We compare the energy consumption of the CPU + GPU version with a parallel MPI version. We also compared the CPU + GPU version with a sequential CPU version. Results point out an energy efficiency gain of up to 3 times with CPU + GPU. Partially, this gain is due to the performance increase over the sequential version. However, in some cases the data dependency was a strong limiting factor of energy efficiency. Tests presented less efficiency above a certain workload threshold (128). Characteristics of Test Platform A explain this limitation: Test Platform A has few CUDA cores, precisely eight, which justifies the limited scalability found with workloads greater than 128. However, the results globally show that the use of CPU + GPU is effective when the application workload is well dimensioned. With an increased number of CUDA cores (336), doubled memory (from 512 MBytes to 1024 MBytes) and improved bus speed (GDDR5 instead of GDDR3), Test Platform C always outperforms the MPI application.

In the near future, we intend to analyze several application parameters to evaluate CPU + GPU heterogeneous architectures. A first step in this direction is to verify possible optimizations of the data transfers in iterative methods from the literature. Another point to improve is to enable and evaluate the simulation of a complete agroforestry environment.

Acknowledgements This work was partially supported by several Brazilian research agencies: CNPq, CAPES, FAPERGS and FINEP. We would like to thank these agencies; their support made this work possible. We would also like to thank all members of the Parallel and Distributed Processing Group (GPPD) at the Federal University of Rio Grande do Sul (UFRGS); their help and expertise were of great value. This research has been partially supported by CAPES-BRAZIL under grants 5854/11-3 and 5847/11-7. Work developed in the context of the associated international laboratory between UFRGS and Université de Grenoble (LICIA).

References

1. Barker, K., Davis, K., Hoisie, A., Kerbyson, D., Lang, M., Pakin, S., Sancho, J.: Using performance modeling to design large-scale systems. IEEE Comput. 42(11), 42–49 (2009)

2. Beckman, P., Dally, B., Shainer, G., Dunning, T., Ahalt, S.C., Bernhardt, M.: On the road to exascale. Sci. Comput. World 116, 26–28 (2011)

3. Brodtkorb, A.R., Dyken, C., Hagen, T.R., Hjelmervik, J.M., Storaasli, O.O.: State-of-the-art in heterogeneous computing. Sci. Program. 18(1), 1–33 (2010)

4. Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., Hanrahan, P.: Brook for GPUs: stream computing on graphics hardware. In: ACM Transactions on Graphics (TOG), vol. 23, pp. 777–786. ACM, New York (2004)

5. Cameron, K.: A tale of two green lists. Computer 43(9), 86–88 (2010). doi:10.1109/MC.2010.246

6. Dong, Y., Chen, J., Tang, T.: Power measurements and analyses of massive object storage system. In: 2010 10th IEEE International Conference on Computer and Information Technology (CIT 2010), pp. 1317–1322. IEEE, New York (2010)

7. Dongarra, J., Beckman, P., Aerts, P., Cappello, F., Lippert, T., Matsuoka, S., Messina, P., Moore, T., Stevens, R., Trefethen, A., et al.: The international exascale software project: a call to cooperative action by the global high-performance community. Int. J. High Perform. Comput. Appl. 23(4), 309–322 (2009)

8. Dongarra, J.J.: The Top500 list—TOP500 supercomputer sites (2011). http://www.top500.org/

9. Doussan, C., Jouniaux, L., Thony, J.: Variations of self-potential and unsaturated water flow with time in sandy loam and clay loam soils. J. Hydrol. 267(3), 173–185 (2002)

10. DRANETZ: Power Platform PP-4300. Available at (2011). http://dranetz.com/old/powerplatform-pp4300

11. Feng, W., Cameron, K.: The Green500 list: encouraging sustainable supercomputing. Computer 40(12), 50–55 (2007)

12. Frachtenberg, E., Heydari, A., Li, H., Michael, A., Na, J., Nisbet, A., Sarti, P.: High-efficiency server design. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, p. 27. ACM, New York (2011)

13. Grochowski, E., Annavaram, M.: Energy per instruction trends in Intel microprocessors. Technol. Intel Mag. 4(3), 1–8 (2006)

14. Hsu, C., Feng, W., Archuleta, J.: Towards efficient supercomputing: a quest for the right metric. In: Proceedings 19th IEEE International Parallel and Distributed Processing Symposium, 2005, p. 8. IEEE, New York (2005)

15. Hsu, C.H., Feng, W.-C., Archuleta, J.S.: Towards efficient supercomputing: a quest for the right metric. In: Proc. 19th IEEE International Parallel & Distributed Processing Symposium, p. 8. Denver, Colorado, USA (2005). Technical report LA-UR05-0936

16. Jiao, Y., Lin, H., Balaji, P., Feng, W.: Power and performance characterization of computational kernels on the GPU. In: Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on & Int'l Conference on Cyber, Physical and Social Computing (CPSCom), pp. 221–228. IEEE, New York (2010)

17. Khairy, M., Mehlfuhrer, C., Rupp, M.: Boosting sphere decoding speed through graphic processing units. In: 2010 European Wireless Conference (EW), pp. 99–104. IEEE, New York (2010)

18. Kogge, P.: The tops in flops. IEEE Spectr. 48(2), 44–50 (2011)

19. Kogge, P., Bergman, K., Borkar, S., Campbell, D., Carson, W., Dally, W., Denneau, M., Franzon, P., Harrod, W., Hill, K., et al.: Exascale Computing Study: Technology Challenges in Achieving Exascale Systems, pp. 1–297 (2008)

20. Lee, V.W., Kim, C., Chhugani, J., Deisher, M., Kim, D., Nguyen, A.D., Satish, N., Smelyanskiy, M., Chennupaty, S., Hammarlund, P., Singhal, R., Dubey, P.: Debunking the 100x GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. In: Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA'10, pp. 451–460. ACM, New York (2010). doi:10.1145/1815961.1816021

21. Liu, W., Du, Z., Xiao, Y., Bader, D., Xu, C.: A waterfall model to achieve energy efficient tasks mapping for large scale GPU clusters. In: 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), pp. 82–92. IEEE, New York (2011)


22. Luk, C., Hong, S., Kim, H.: Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 45–55. ACM, New York (2009)

23. Michalakes, J., Vachharajani, M.: GPU acceleration of numerical weather prediction. Parallel Process. Lett. 18(04), 1–8 (2008). doi:10.1142/S0129626408003557. http://www.worldscinet.com/ppl/18/1804/S0129626408003557.html

24. Miyazaki, T.: Water flow in unsaturated soil in layered slopes. J. Hydrol. 102(1–4), 201–214 (1988)

25. Miyazaki, T.: Water Flow in Soils. CRC Press, Boca Raton (2006)

26. NVIDIA: NVIDIA CUDA Compute Unified Device Architecture Programming Guide (2009)

27. NVIDIA: Next Generation CUDA Compute Architecture: Fermi (2009)

28. Panetta, J., Teixeira, T., de Souza Filho, P.R., da Cunha Filho, C.A., Sotelo, D., da Motta, F.M.R., Pinheiro, S.S., Junior, I.P., Rosa, A.L.R., Monnerat, L.R., Carneiro, L.T., de Albrecht, C.H.: Accelerating Kirchhoff migration by CPU and GPU cooperation. In: 21st International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2009), pp. 26–32 (2009). doi:10.1109/SBAC-PAD.2009.29. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5336217

29. Pawlowski, S.S.: Exascale science: the next frontier in high performance computing. In: The 24th International Conference on Supercomputing (ICS), 2010, p. 1 (2010)

30. Ren, D.Q., Suda, R.: Investigation on the power efficiency of multi-core and GPU processing element in large scale SIMD computation with CUDA. In: International Conference on Green Computing, pp. 309–316. IEEE, New York (2010)

31. Schreier, P.: How cool are supercomputers? Sci. Comput. World 116, 22–24 (2011)

32. Shiers, J.: The worldwide LHC computing grid (worldwide LCG). Comput. Phys. Commun. 177(1–2), 219–223 (2007)

33. Subramaniam, B., Feng, W.: Understanding power measurement implications in the Green500 list. In: Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on & Int'l Conference on Cyber, Physical and Social Computing (CPSCom), pp. 245–251. IEEE Press, New York (2010)

34. Suda, R., Aoki, T., Hirasawa, S., Nukada, A., Honda, H., Matsuoka, S.: Aspects of GPU for general purpose high performance computing. In: Proceedings of the 2009 Asia and South Pacific Design Automation Conference, pp. 216–223. IEEE Press, New York (2009)

35. Tarantola, A.: Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM, Philadelphia (2005)

36. Tveito, A., Langtangen, H., Nielsen, B., Cai, X.: Parameter estimation and inverse problems. In: Elements of Scientific Computing, pp. 411–421 (2010)

37. Valero, M.: Towards exaflop supercomputers. In: Conference Center of the University of Patras—High Performance Computing Academic Research Network (HPC-net) (2011)

38. Wang, G., Ren, X.: Power-efficient work distribution method for CPU-GPU heterogeneous system. In: International Symposium on Parallel and Distributed Processing with Applications, pp. 122–129. IEEE, New York (2010)

39. Younge, A., von Laszewski, G., Wang, L., Lopez-Alarcon, S., Carithers, W.: Efficient resource management for cloud computing environments. In: International Conference on Green Computing, pp. 357–364. IEEE, New York (2010)

Edson Luiz Padoin received the MSc degree in Production Engineering from the Federal University of Santa Maria in 2001. He is a PhD student in the Institute of Informatics at the Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Brazil. He is also currently an assistant professor at the Regional University of the Northwest of the State of Rio Grande do Sul (UNIJUI), Ijuí, Brazil. His research interests are in the areas of energy efficiency, ARM processors, parallel and distributed processing, and high performance computing.

Laércio Lima Pilla is doing a joint doctorate between the Institute of Informatics at the Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Brazil, and the MSTII Doctoral School at the Grenoble University, Grenoble, France. He obtained his BSc in Computer Science at UFRGS in 2009. His research topics are load balancing, task mapping, and heterogeneous/hierarchical architectures.

Francieli Zanon Boito is a PhD student in a joint doctorate between the Institute of Informatics at the Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Brazil, and the MSTII Doctoral School at the Grenoble University, Grenoble, France. She has a bachelor degree in Computer Science from UFRGS (2009) and her main research topic is parallel file systems and the scheduling of I/O operations for HPC.

Rodrigo Virote Kassick is a PhD student in the joint doctorate program between the Federal University of Rio Grande do Sul, Brazil, and the MSTII Doctoral School at Grenoble University. He has a bachelor and a master degree in Computer Science from the Federal University of Rio Grande do Sul. His main research topic is parallel file systems and storage management for HPC.


Pedro Velho received a PhD degree in computer science from the Grenoble University, France, in 2011. His research career started in 2000; he received a BSc degree in Computer Science in 2004 and an MSc degree in Computer Science in 2006 from the Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS/Brazil). He is currently working with Prof. Philippe O.A. Navaux as a post-doctoral researcher at UFRGS, studying modeling alternatives to simulate the power consumption and performance of CPU+GPU hybrid applications. He is currently attached to SONGS (Simulation Of Next Generation Systems) and the CNPq/Brazil post-doctoral program. He is also an active member of the Latin America HPC research network, staffing the committee of the two scientific manifestations in this scope: CLCAR (Conferencia Latin-Americana de Alto-Rendimiento) and SC-Camp (Super Computing Camp). His research interest areas are: parallel and distributed computing, performance evaluation and simulation.

Philippe O.A. Navaux has been a professor at the Informatics Institute of UFRGS, Porto Alegre, Brazil, since 1971. He graduated in Electronic Engineering (UFRGS, Brazil, 1970), obtained a Master in Applied Physics (UFRGS, 1973) and a PhD in Computer Science from INPG, Grenoble, France (1979). He teaches graduate and undergraduate courses on Computer Architecture and High Performance Computing and leads the GPPD, the Parallel and Distributed Processing Group, with projects from the government agencies Finep, CNPq and Capes, and international cooperation with groups from France and Germany, with funding from CNPq and CAPES. He also participates in projects with Microsoft, Intel, HP, DELL, Altus and Itautec. He has advised more than 70 Master and PhD students and has published more than 300 papers in journals and conferences. He is a member of the SBC (Brazilian Computer Society), SBPC and IEEE, and a consultant to various national and international funding organizations: DoE (USA), ANR (FR), FINEP, CNPq, CAPES, FAPESP, FAPERGS, FAPEMIG, FACEPE and others. He was a member of the FAPERGS Superior Council and of the CTC, the Scientific and Technical Council of LNCC/MCT. He is currently the coordinator of the Computing Committee of Capes/MEC.