
Microprocessor Floorplanning with Power Load Aware Temporal Temperature Variation ∗

Chun-Ta Chu, Xinyi Zhang, Lei He, and Tom Tong Jing
Department of Electrical Engineering, University of California at Los Angeles, CA, 90095

{chunta, zxy, lhe, tomjing}@ee.ucla.edu

ABSTRACT

Traditional microprocessor floorplanning considers only area and wire length. Recent research has started to consider performance optimization due to interconnect pipelining and thermal hotspots caused by increased power density. However, there is no efficient approach to microprocessor floorplanning that considers application-dependent power load for performance and thermal optimization. This paper studies microprocessor floorplanning considering thermal and throughput optimization. We first develop a stochastic heat diffusion model that takes into account the application-dependent power load for thermal analysis. Then, we design the floorplanning algorithm based on this model. Experimental results show that, compared with the deterministic heat diffusion model, our model obtains up to 3.2°C reduction of the on-chip peak temperature, 1.29% reduction of the area, and 1.13x better CPI (cycles per instruction) performance, respectively. Compared with temperature-aware floorplanning in the HOTSPOT tool set, which ignores interconnect pipelining, our algorithm is up to 27x faster, reduces the peak temperature by up to 3°C, and also reduces CPI significantly with a negligible area overhead. We also use our thermal model to analyze floorplanning for a dual-core micro-architecture. The results show that optimizing the floorplan over two cores simultaneously can further reduce temperature by 3.36°C with slightly better CPI performance.

1. INTRODUCTION

Traditional microprocessor floorplanning considers only area and wire length, under the assumption that micro-architecture optimization decides the average cycles per instruction (CPI) and therefore system throughput, while floorplanning determines wire length and clock rate. Due to the increasing clock rate, the interconnect delay may become longer than the clock period. As a result, interconnect pipelining becomes a must and affects CPI as well [1] [2]. Several recent studies [1] [3] [4] optimized microprocessor floorplanning for area and CPI considering interconnect pipelining.

As devices keep shrinking, the decreasing rate of power consumption in a chip cannot catch up with the shrinking chip size. As a result, the power density within a chip keeps increasing, which leads to higher temperature and a higher possibility of degraded performance and chip reliability [5]. Floorplanning is the design stage with the largest impact on chip temperature. Therefore, the impact of thermal effects should be considered in the floorplanning phase, which has not been done in the above studies.

∗This paper is partially supported by NSF CAREER award CCR-0093273/0401682 and a UC MICRO grant sponsored by Altera and Intel. Address comments to [email protected].

Thermal-aware placement has been studied for several years [6] [7] [8] [9]. Different methods were used to smooth the thermal distribution of standard cells, which makes the temperature distribution quasi-uniform within one module. However, even though the thermal distribution across one module is smooth, without a thermal-aware floorplan it is still possible that the temperature gradient between modules is quite large, since different modules may have different power densities and thus different temperature distributions. Putting modules with high power density together may lead to higher temperature in those modules, since the horizontal heat diffusion across those modules is smaller. As a result, an improper floorplan may lead to a large temperature gradient across the modules and thus cause hotspots within a chip.

Several recent developments in microprocessor floorplanning have started to consider thermal modeling and thermal-aware optimization. The most direct way to estimate the thermal impact of a floorplan is to calculate its temperature. Given a micro-architecture, [10] first obtained the average power over all benchmarks for each block. For each new floorplan, it used HOTSPOT [11] to model the whole package as a thermal RC network with power as branch current and temperature as node voltage, and calculated the peak steady-state temperature with average power as input. As a result, the objective consists of the area, CPI, and the peak steady-state temperature. [12] further considered transient power and the dependency between power and CPI, with consideration of interconnect pipelining for floorplanning. In addition, for each new floorplan, the input is not just average power but transient power over time. In this case, the objective consists of not only the area and CPI but also the peak temperature and the average temperature over time. One major drawback of the aforementioned work is that the temperature is calculated for every new floorplan, which is time-consuming. [12] had an even longer running time than [10], since [12] applied transient power instead of the average power. Considering this, [13] proposed a simple deterministic heat diffusion model for the lateral heat diffusion between modules to avoid directly calculating temperature. However, this heat diffusion model is too simplified to guarantee a good solution. Also, [13] used only total wire length instead of CPI in the objective. More importantly, none of the above studies considers the application-dependent power load and its induced temperature variation. As shown in Fig. 1 [14], given a micro-architecture with different applications, the temperature distribution across the chip could be totally different. It is not clear how to extend the above thermal-aware microprocessor floorplanning to handle application-dependent power load.

Figure 1: Temperature distribution for different applications

In this paper, we develop an accurate yet efficient thermal-aware floorplanning approach that considers the correlation between power consumptions for different micro-architecture modules and for different microprocessor applications. Instead of calculating temperature directly during floorplanning, we develop a stochastic heat diffusion model with consideration of block geometry and the above power correlation. We apply this model to iterative-improvement-based floorplanning for thermal optimization, and also simultaneously optimize throughput using the trajectory piecewise linear model (TPWL) for pipelined interconnects developed in [3].

Multi-core micro-architecture has been studied for several years because of its smaller area and higher efficiency compared with single-core design. However, the thermal effect on floorplanning has only been studied for one core [15] [16]. In this work, we compare the thermal effect and performance on a dual-core floorplan, and show the difference when the design considers the thermal effect.

The rest of this paper is organized as follows. Section 2 introduces the background of floorplanning, microprocessor performance, and the relation between power and temperature. Section 3 reviews the deterministic heat diffusion model, points out its shortcomings, and then presents our stochastic heat diffusion model. Section 4 summarizes the flow of floorplanning, experiment settings, and experimental results. We conclude this work in Section 5.

2. BACKGROUND AND PROBLEM FORMULATION

2.1 Floorplanning

Floorplanning determines the locations and the shapes of the modules on chip subject to the minimization of a cost function. Traditionally, the cost function considers the chip area and total wire length. However, floorplanning at the micro-architecture level has to consider not only the area and the total wire length but also CPI. Moreover, in order to account for the thermal interaction between any two modules in different floorplans, we have to include a thermal estimate. Therefore, the objective function in our floorplanning algorithm consists of area, CPI, and thermal effect as follows.

$$\frac{W_{area} \cdot Area}{Area_{norm}} + \frac{W_{CPI} \cdot CPI}{CPI_{norm}} + \frac{W_{thermal} \cdot thermal}{thermal_{norm}} \tag{1}$$

where $W_{area}$, $W_{CPI}$, and $W_{thermal}$ are the weights for the corresponding terms and can be adjusted for different goals, and $Area_{norm}$, $CPI_{norm}$, and $thermal_{norm}$ are normalization terms.
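As a minimal sketch, the weighted sum in Equation (1) can be written as a small Python function. The function name, the argument layout, and the sample numbers are illustrative assumptions, not values from the experiments.

```python
def floorplan_cost(area, cpi, thermal, weights, norms):
    """Weighted, normalized sum in the form of Equation (1)."""
    w_area, w_cpi, w_thermal = weights
    area_norm, cpi_norm, thermal_norm = norms
    return (w_area * area / area_norm
            + w_cpi * cpi / cpi_norm
            + w_thermal * thermal / thermal_norm)


# Hypothetical inputs chosen only to exercise the formula.
cost = floorplan_cost(area=120.0, cpi=1.5, thermal=40.0,
                      weights=(0.6, 0.3, 0.2),
                      norms=(100.0, 1.0, 50.0))
print(round(cost, 4))
```

Normalizing each term keeps the three weights comparable even though area, CPI, and the thermal metric have very different magnitudes.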

Most current floorplanning solvers are based on the simulated annealing (SA) algorithm [17] [18], which is also used in this paper. SA starts with an arbitrary initial floorplan and changes the floorplan to a new one in the next iteration by changing either the positions or the shapes of a small set of modules. During each iteration, the cost is calculated using the cost function. The new floorplan is accepted either if its cost is smaller than the current best case or with a probability based on the annealing temperature. If the number of iterations is large enough, SA is able to find a high-quality solution.
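The acceptance rule described above can be sketched as follows. This is not the PARQUET implementation: the Metropolis-style probability, the geometric cooling schedule, and all names (`accept`, `anneal`, `perturb`) are assumptions made for illustration, and the "floorplan" is replaced by a toy integer so the example stays self-contained.

```python
import math
import random


def accept(new_cost, current_cost, temperature, rng):
    """Always accept improvements; accept worse moves with a probability
    that shrinks as the annealing temperature drops (Metropolis rule)."""
    if new_cost <= current_cost:
        return True
    return rng.random() < math.exp(-(new_cost - current_cost) / temperature)


def anneal(initial, cost_fn, perturb, t_start=1.0, t_end=1e-3,
           alpha=0.95, moves_per_t=50, seed=0):
    rng = random.Random(seed)
    current, current_cost = initial, cost_fn(initial)
    best, best_cost = current, current_cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_t):
            candidate = perturb(current, rng)
            candidate_cost = cost_fn(candidate)
            if accept(candidate_cost, current_cost, t, rng):
                current, current_cost = candidate, candidate_cost
                if candidate_cost < best_cost:
                    best, best_cost = candidate, candidate_cost
        t *= alpha  # geometric cooling (assumed schedule)
    return best, best_cost


# Toy stand-in for a floorplan: an integer whose cost is minimal at 3.
best, best_cost = anneal(10, lambda x: (x - 3) ** 2,
                         lambda x, rng: x + rng.choice((-1, 1)))
print(best, best_cost)
```

In a real floorplanner, `perturb` would move or reshape a small set of modules and `cost_fn` would evaluate an objective such as Equation (1).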

2.2 Microprocessor Performance

Microprocessor performance is measured by CPI (for a given clock rate) or millions of instructions per second (MIPS). In this paper, we use CPI under a fixed clock rate as the metric, which means the lower the CPI, the better the performance. The traditional objective of floorplanning is to minimize interconnect delay and obtain the targeted clock rate. Due to the ever increasing clock rate, interconnect delay may become longer than one clock cycle. In this case, interconnect pipelining is a must and it affects CPI. Note that floorplanning is the primary design stage that decides interconnect length and the need for interconnect pipelining.

Obtaining CPI with interconnect pipelining by micro-architecture-level simulation is time-consuming. In this paper, we apply the efficient TPWL model from [3]. It starts with a traditional floorplan with objectives of area and wire length to sample the bus latency vector, that is, the latencies between modules. A CPI table is then generated using the cycle-accurate simulation tool SimpleScalar [19]. The CPI of a new floorplan is approximated from the CPI values in the CPI table and TPWL.

2.3 Power and Temperature

To simulate the power consumption for a given micro-architecture, we use PTscalar [20], a cycle-accurate micro-architecture-level performance and power simulator for SuperScalar architectures. We obtain the power consumption for all modules by simulating SPEC2000 [21] benchmarks.

Given a micro-architecture floorplan and power consumption input, HOTSPOT [11] models the whole package shown in Fig. 2 as a thermal RC network [22] with power as a branch current and temperature as a node voltage. In the flip-chip package design shown in Fig. 2, the power (heat) generated by modules flows both horizontally, to the adjacent modules or to the adjacent heat spreader, and vertically, down to the PCB and up to the heat spreader and then to the heat sink. Without the horizontal flow, vertical flow directly dissipates heat from the modules and thus lowers the temperature of the modules. However, because of the horizontal flow between modules, the relative locations of modules may affect the final temperature of each module and thus generate a temperature gradient. Also, different applications with the same floorplan may have completely different power consumptions and thus different temperature gradients. Fig. 1 shows such an example. Therefore, because of lateral heat flow, we can arrange the locations of different modules to make the temperature distribution across the chip smooth, which means the variance of temperature across modules becomes smaller.

For each new floorplan, the transient power of each module may also change, since the change of wire length between modules affects CPI and CPI affects transient power. Also, a different floorplan leads to a different temperature, which affects leakage power. In this work, we assume the power of each module is invariant across floorplans and temperatures, but our work can be easily extended to consider the interdependency among leakage, temperature, and floorplanning [23].

Figure 2: Thermal RC model of flip-chip package

In order to make the temperature distribution across the die smooth, previous work [10] [12] used HOTSPOT [11] to directly calculate temperature during each iteration in SA and to evaluate the cost function, since the goal is to minimize the peak temperature and mean temperature across the die. The objective function with calculated temperature then becomes

$$\frac{W_{area} \cdot Area}{Area_{norm}} + \frac{W_{CPI} \cdot CPI}{CPI_{norm}} + \frac{W_{Temp} \cdot (T_{avg} + T_{max})}{T_{norm}} \tag{2}$$

The major drawback is the long computation time in the SA flow. For example, it took around forty minutes to complete the whole flow when considering the temperature effect [10], while it took less than five minutes to complete the flow without considering the temperature effect. Therefore, directly calculating temperature is not efficient for thermal-aware micro-architecture floorplanning. We describe how to accurately and efficiently model the thermal effect in the next section.

3. STOCHASTIC THERMAL-AWARE FLOORPLANNING

In this section, we first describe the deterministic heat diffusion model and point out why it is inaccurate based on several observations. We then show how to build an accurate stochastic heat diffusion model based on those observations.

3.1 Deterministic Heat Diffusion Modeling

3.1.1 Deterministic model definition

[13] improved the runtime of thermal-aware floorplanning by using a heat diffusion model to avoid directly calculating temperature. The concept is that the temperature of an isolated module depends linearly on its power density. Modules with high power density therefore tend to have high temperature. In order to lower the temperature of modules with high power density, it is better to place modules with low power density adjacent to them to help dissipate the heat. The temperature of modules with high power density then becomes lower because of the large heat diffusion to the modules with low power density. As a result, the more modules with lower power density are adjacent to the modules with high power density, the smoother the temperature distribution is across the chip.

The heat diffusion between two adjacent modules $M_i$ and $M_j$ can be represented as

$$h(M_i, M_j) = (PD_i - PD_j) \cdot shared\_length_{ij} \tag{3}$$

where $PD_i$ and $PD_j$ are the average power densities over time for modules $M_i$ and $M_j$, and $shared\_length_{ij}$ is the shared length between modules $M_i$ and $M_j$.

The total heat diffusion for module $i$ is

$$H_i = \sum_{j \text{ adjacent to } i} h(M_i, M_j) \tag{4}$$

Since the potentially hottest modules are limited in number, considering the total heat diffusion for those potentially hottest modules can be a good estimate. The total heat diffusion becomes

$$HeatDiff = \sum_{i} H_i \tag{5}$$

The objective function with heat diffusion becomes

$$\frac{W_{area} \cdot Area}{Area_{norm}} + \frac{W_{CPI} \cdot CPI}{CPI_{norm}} + \frac{W_{thermal} \cdot HeatDiff}{HeatDiff_{norm}} \tag{6}$$

Although heat diffusion is a good representation for estimating the lateral heat flow, this model is too simplified to guarantee finding a good floorplan considering the thermal effect, since it ignores other factors described below.
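To make Equations (3) to (5) concrete, the following sketch computes the deterministic heat diffusion for a tiny adjacency list. The data layout (a dictionary of shared lengths keyed by module pairs), the function name, and the sample values are assumptions chosen for illustration.

```python
def heat_diffusion(pd, shared_length, hot_modules):
    """Deterministic heat diffusion in the form of Equations (3)-(5).

    pd            -- average power density per module name
    shared_length -- {(a, b): length}, each adjacent pair listed once
    hot_modules   -- the potentially hottest modules to sum over
    """
    neighbours = {}
    for (a, b), length in shared_length.items():
        neighbours.setdefault(a, []).append((b, length))
        neighbours.setdefault(b, []).append((a, length))

    per_module = {}
    for i in hot_modules:
        # Equation (4): sum h(M_i, M_j) over the neighbours of M_i.
        per_module[i] = sum((pd[i] - pd[j]) * length
                            for j, length in neighbours.get(i, []))
    # Equation (5): total over the selected hot modules.
    return per_module, sum(per_module.values())


# Hypothetical power densities and shared boundary lengths.
pd = {"M1": 2.0, "M2": 0.5, "M3": 1.0}
shared = {("M1", "M2"): 3.0, ("M1", "M3"): 1.0}
per_module, total = heat_diffusion(pd, shared, ["M1"])
print(per_module, total)
```

A larger value means the hot module has more or longer boundaries with cooler neighbours, which the model treats as better heat removal.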

3.1.2 Shortcomings

First of all, given a micro-architecture and a series of applications, the transient power of one module may fluctuate considerably over time, since different applications place different workloads on that module and thus the transient power over time is quite nonuniform. Also, the power vectors over time for two modules may be either positively or negatively correlated. Using average power may underestimate the peak temperature for positively correlated modules, since the transient temperature is higher when both modules have high transient power consumption, as shown in Fig. 3. Fig. 3(a) and Fig. 3(b) have the same average power, but the peak temperatures and the lowest temperatures each differ by around 5°C, which cannot be ignored.

        positively correlated    negatively correlated
Tmax    92.51                    87.14
Tmin    51.59                    57

Figure 3: Temporal correlation between M1 and M2, (a) positively correlated and (b) negatively correlated. M1 has a higher transient temperature in (a) than in (b), although the average power is the same.

A real case from our simulation is shown in Fig. 4. Fig. 4(a) shows that the correlation between LSQ and IALU4 is almost perfect, while in Fig. 4(b) RUU and FPAdd have no correlation. As a result, without considering the correlation of power consumptions between modules, the maximal temperature is underestimated.

Figure 4: Power over time (x-axis: time in units of 10000 cycles; y-axis: power in watts) for (a) LSQ and IALU4 and (b) RUU and FPAdd. Different line styles denote different modules.
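The effect of temporal correlation can be illustrated with two synthetic power traces that share the same average. The traces and the helper `pearson` below are made up for illustration; they are not the simulated LSQ, IALU4, RUU, or FPAdd data.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length traces."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Synthetic traces: every trace has the same average power (2.0).
m1 = [1.0, 3.0, 1.0, 3.0]
m2_pos = [1.0, 3.0, 1.0, 3.0]   # peaks at the same time as m1
m2_neg = [3.0, 1.0, 3.0, 1.0]   # peaks when m1 is low

peak_pos = max(a + b for a, b in zip(m1, m2_pos))
peak_neg = max(a + b for a, b in zip(m1, m2_neg))
print(pearson(m1, m2_pos), peak_pos)
print(pearson(m1, m2_neg), peak_neg)
```

With identical averages, the positively correlated pair reaches a higher combined peak power than the negatively correlated pair, which is the situation an average-power model cannot distinguish.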

Second, a module next to the border of a die has extra heat flow to the heat spreader or the ambient. Fig. 5 shows that when three modules have similar power consumptions, the temperature of the two modules on the sides is lower than that of the module between them. The temperature difference is around 0.5°C in this case, and the difference may vary according to the design of the package.

Third, since there is no power consumption in dead space, the heat diffusion from one module to the dead space (shaded in Fig. 6) is much larger than that from one module to another module. As shown in Fig. 6, M1 has the highest power density, and it has a lower peak temperature (92.96°C) in Fig. 6(b) than in Fig. 6(a) (94.2°C), since the dead space can remove more heat than other modules do. During floorplanning, the thermal objective may try to move modules as far apart as possible, while the constraint on area attempts to minimize the total dead space. As a result, the weights of the area and temperature terms have to be adjusted for balance.

        M1      M2      M3
Tmax    93.4    93.9    93.4

Figure 5: The temperature of M2 is higher than that of M1 and M3

        (a)     (b)
Tmax    94.2    92.96
Tmin    72      71.02

Figure 6: Dead space effect: M1 has lower temperature in (b) than in (a)

Fourth, the heat diffusion between two modules is proportional to their shared length, since the lateral thermal conductance is proportional to the width of the module. However, the area of each adjacent module should also be considered. Given four modules M1, M2, M3, and M4 with power density PD1 > PD4 > PD2 = PD3 in Fig. 7, M1 may have a lower temperature when adjacent to M3, since M1 can diffuse more heat to M3 than to M2. This suggests that the heat diffusion should consider not only the shared length but also the depth of the adjacent module. As shown in Fig. 7, the difference in peak temperature is over 1°C, and the temperature gradient in Fig. 7(a) is sharper than that in Fig. 7(b). However, if the depth of one adjacent module is too large, this may lead to a wrong estimate. The reason is that another possible floorplan candidate with an adjacent module having lower power density may still be rejected because the depth of that adjacent module is much smaller than the current one. Considering this, we predefine a penetration window that encloses the target module. If the adjacent module extends beyond the window, we consider only the part of the module inside the window. On the other hand, if the adjacent module is entirely inside the window, we take into account the modules adjacent to it as well. For example, in Fig. 7(a), M2 is inside the red window (dashed line), so we have to consider the area of M4 inside the red window as well. Similarly, since M3 crosses the window, we consider only the area of M3 inside the window. Details are described in the next section.

        (a)      (b)
Tmax    78.39    77.01
Tmin    59.06    62.7

Figure 7: M1 has lower temperature in (b) since M3 and M2 have the same power density but M3 is larger than M2

Finally, when considering several hottest modules, just summing their heat diffusion may not guarantee a good solution. For example, suppose module A and module B are the two hottest modules, and $PD_A > PD_B$. After running the SA algorithm, we have $H_A = H_B \approx C$. This is not the optimal solution, since $H'_A = C + D$ and $H'_B = C - D$, where $C > D > 0$, can give a better solution because of more heat flow for module A, as shown in Fig. 8. Considering this effect, we should use the weighted sum of the heat flow for those hottest modules to reduce peak temperature more effectively.

Figure 8: Module deserves more heat flow due to higher power density

3.2 Stochastic Modeling

3.2.1 Stochastic heat diffusion model

Since directly calculating temperature is time-consuming and the deterministic heat diffusion model is not accurate enough, we propose an accurate and efficient stochastic heat diffusion model based on the observations in Sub-section 3.1.2. Consider a micro-architecture floorplan with $m$ modules, $n$ dead spaces, and a power vector $P_i = [p_{i1}, ..., p_{iT}]$ over $T$ time steps for each module $M_i$, $1 \le i \le m$.

The mean power density $\overline{PD}_i$ for module $M_i$ is

$$\overline{PD}_i = E(PD_i) = \frac{1}{A_i} \cdot \frac{1}{T} \cdot \sum_{j=1}^{T} p_{ij} \tag{7}$$

where $A_i$ is the area of module $M_i$, and $PD_i$ is the transient power density vector, which equals $P_i / A_i$. In this paper, $E(X)$ is the expectation value of vector $X$.

The power density covariance between any two modules $M_i$ and $M_j$ is

$$cov(PD_i, PD_j) = E(PD_i \cdot PD_j) - \overline{PD}_i \cdot \overline{PD}_j \tag{8}$$
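Equations (7) and (8) translate directly into a few lines of Python. The helper names and the sample power vectors below are assumptions for illustration, and the product inside the expectation is taken element-wise over time steps.

```python
def mean(values):
    return sum(values) / len(values)


def power_density(power, area):
    """Transient power density vector PD_i = P_i / A_i."""
    return [p / area for p in power]


def covariance(pd_i, pd_j):
    """Equation (8): E(PD_i * PD_j) - mean(PD_i) * mean(PD_j)."""
    product = [a * b for a, b in zip(pd_i, pd_j)]
    return mean(product) - mean(pd_i) * mean(pd_j)


# Hypothetical power vectors (watts) and areas (mm^2).
pd1 = power_density([2.0, 4.0, 6.0], area=2.0)   # [1.0, 2.0, 3.0]
pd2 = power_density([3.0, 2.0, 1.0], area=1.0)   # [3.0, 2.0, 1.0]

print(mean(pd1), mean(pd2))    # Equation (7) for each module
print(covariance(pd1, pd2))    # negative: the traces move in opposite directions
```

Because these quantities depend only on the power traces and module areas, they can be computed once before floorplanning starts.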

Given $x$ adjacent modules, $y$ adjacent dead spaces, and a penetration window of size $W \times L$, the heat diffusion vectors to the adjacent modules, $H_{i\_adj}$, and to the adjacent dead spaces, $H_{i\_dead}$, for module $M_i$ are defined as follows, respectively.

$$H_{i\_adj} = \sum_{j=1}^{x} (PD_i - PD_j) \cdot L_{ij} \tag{9}$$

$$H_{i\_dead} = \sum_{j=1}^{y} PD_i \cdot C_{ij} \tag{10}$$

where $L_{ij}$ is the shared length between modules $M_i$ and $M_j$, and $C_{ij}$ is the shared length between module $M_i$ and dead space $N_j$.

The heat diffusion vector to the border is

$$f(B_i) = PD_i \cdot B_i \cdot \frac{Con\_lateral}{Con\_adjacent} \tag{11}$$

where $B_i$ is the shared length between module $M_i$ and the border of the die, and $Con\_lateral$ and $Con\_adjacent$ are the unit lateral conductances between the heat spreader and module $M_i$ and between two adjacent modules, respectively, both of which can be calculated according to [11].

The standard deviation of the total heat diffusion for module $M_i$ is

$$\sigma_i = \sqrt{E\big((H_{i\_adj} + H_{i\_dead} + f(B_i))^2\big) - \big(E(H_{i\_adj} + H_{i\_dead} + f(B_i))\big)^2} \tag{12}$$

The stochastic heat diffusion model for module $M_i$ is

$$\tilde{H}_i = E(H_{i\_adj}) + E(H_{i\_dead}) + E(f(B_i)) + 3 \cdot \sigma_i \tag{13}$$

where the first two terms are the mean heat diffusion to the adjacent modules and dead space, respectively, the third term is the mean heat diffusion to the lateral heat spreader, and $3\sigma_i$ is the term for the correlation impact approximated by Equation (12). The larger the standard deviation between modules is, the smaller the correlation is.
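A minimal sketch of Equations (9) to (13) for a single module follows. It evaluates the heat diffusion vector time step by time step and then takes its mean and standard deviation; the function signature, the `con_ratio` parameter (standing for Con_lateral / Con_adjacent), and the sample values are assumptions. The paper instead derives the same statistics from precomputed means and covariances.

```python
import math


def stochastic_heat_diffusion(pd_i, neighbours, dead_lengths,
                              border_length, con_ratio):
    """Return (H_tilde, mean, sigma) for one module.

    pd_i          -- transient power density of the module, one value per step
    neighbours    -- list of (pd_j, L_ij): neighbour trace and shared length
    dead_lengths  -- list of C_ij, shared lengths with adjacent dead spaces
    border_length -- B_i, shared length with the die border
    con_ratio     -- Con_lateral / Con_adjacent (assumed given)
    """
    total = []
    for t, p in enumerate(pd_i):
        h_adj = sum((p - pd_j[t]) * l_ij for pd_j, l_ij in neighbours)  # Eq. (9)
        h_dead = sum(p * c_ij for c_ij in dead_lengths)                 # Eq. (10)
        f_border = p * border_length * con_ratio                        # Eq. (11)
        total.append(h_adj + h_dead + f_border)

    mean_h = sum(total) / len(total)
    variance = sum(h * h for h in total) / len(total) - mean_h ** 2
    sigma = math.sqrt(max(variance, 0.0))                               # Eq. (12)
    return mean_h + 3.0 * sigma, mean_h, sigma                          # Eq. (13)


# Hypothetical two-step trace with one neighbour and one dead space.
h_tilde, mean_h, sigma = stochastic_heat_diffusion(
    pd_i=[2.0, 4.0],
    neighbours=[([1.0, 1.0], 2.0)],
    dead_lengths=[1.0],
    border_length=0.0,
    con_ratio=1.0,
)
print(h_tilde, mean_h, sigma)
```

By linearity of expectation, the mean of the summed vector equals the sum of the three expectation terms in Equation (13), so computing it once on the total is equivalent.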

If dead space $N_j$ or module $M_j$ is totally inside the penetration window, we have to consider other modules that are partially inside the window. Then $PD_j$ is modified to

$$PD_j = \frac{\sum_{k=1}^{K} \widetilde{PD}_k \cdot D_k \cdot (K - k + 1)}{\sum_{k=1}^{K} D_k \cdot (K - k + 1)} \tag{14}$$

where $K$ is the number of levels between the target block and the window; the level contacting $M_i$ is level 1 and the level contacting the window is level $K$. $\widetilde{PD}_k$ is the average power density in level $k$, and $D_k$ is the depth of level $k$.
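Equation (14) is a weighted average in which levels closer to the target module count more. The sketch below assumes the per-level average power densities and depths are already known; the function name and the numbers are illustrative only.

```python
def modified_power_density(levels):
    """Equation (14): depth- and level-weighted average power density.

    levels -- list of (avg_pd_k, depth_k), ordered from level 1
              (touching the target module) to level K (touching the window)
    """
    k_max = len(levels)
    numerator = 0.0
    denominator = 0.0
    for k, (avg_pd, depth) in enumerate(levels, start=1):
        weight = depth * (k_max - k + 1)   # nearer levels get larger weight
        numerator += avg_pd * weight
        denominator += weight
    return numerator / denominator


# Hypothetical two-level case: level 1 is denser and deeper than level 2.
print(modified_power_density([(1.0, 2.0), (0.4, 1.0)]))
```

With a single level the expression reduces to that level's own average power density, so modules that cross the window are unaffected by the modification.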

In Fig. 9, the red window (dashed line) defines the modules involved, and the blue (slashed) one defines the modules used to calculate the modified $PD_1$. The modified $PD_1$ is derived from modules M1, M2, and M3; the first belongs to level 1, and the second and the third belong to level 2. Also, the power density of level 1 ($\widetilde{PD}_1$) is just $PD_1$, and the power density of level 2 ($\widetilde{PD}_2$) is composed of $PD_2$ and $PD_3$.

Figure 9: Illustration of calculation of modified PDj

Note that Equation (12) can be easily calculated once we have Equations (7) and (8), which are pre-calculated during the one-time cycle-accurate micro-architecture simulation for performance and power before entering the SA algorithm. Therefore, there is no significant runtime overhead.

Considering $Z$ potential hottest modules, the total stochastic heat flow then becomes

$$Stochastic\ HeatDiff = \sum_{i=1}^{Z} W_i \cdot \tilde{H}_i \tag{15}$$

where $W_i$ is the weight proportional to $PD_i$.

If the heat diffusion for module $M_i$ is positive, the total net heat diffusion flows outward. Similarly, a negative value means the total net heat diffusion flows inward to module $M_i$. As a result, the larger the heat diffusion, the more heat can be diffused to the adjacent modules with lower power densities, and thus the more the temperature can be lowered.
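The weighted sum of Equation (15) can be sketched as below. The paper states only that the weight is proportional to power density; normalizing the weights so they sum to one, and using the mean power density as the scalar, are assumptions made here so the example is concrete. The values of C and D are arbitrary.

```python
def stochastic_heat_diff(h_tilde, mean_pd):
    """Equation (15) with weights W_i = PD_i / sum(PD) (assumed normalization)."""
    total_pd = sum(mean_pd[m] for m in h_tilde)
    return sum(mean_pd[m] / total_pd * h for m, h in h_tilde.items())


# Two hottest modules with PD_A > PD_B, and C = 5, D = 2 (arbitrary values).
mean_pd = {"A": 3.0, "B": 1.0}
equal_split = stochastic_heat_diff({"A": 5.0, "B": 5.0}, mean_pd)    # H_A = H_B = C
skewed_split = stochastic_heat_diff({"A": 7.0, "B": 3.0}, mean_pd)   # C + D, C - D
print(equal_split, skewed_split)
```

An unweighted sum would rate both splits equally (10 in each case), whereas the weighted sum rewards giving the denser module A the larger share of heat flow.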

3.2.2 Hierarchical clustering

We only have to consider the heat diffusion for several potential hottest modules, and we use the K-means clustering algorithm to find the right number of potential hottest modules. In this paper, the objective is to minimize the variance $Var$

$$Var = \sum_{i=1}^{k} \sum_{PD_j \in S_i} (|PD_j - \mu_i|)^2 \tag{16}$$

where $PD_j$ is the power density of module $M_j$ and $\mu_i$ is the average power density within cluster $S_i$.

We use hierarchical K-means clustering to find the potential hottest modules. First we set a threshold, such as 30% of the total modules, as the maximum number we have to consider. Then we run K-means to cluster the modules into two clusters and apply the same procedure to the cluster with the higher power density. This recursive procedure stops when the number of modules in the recursively refined cluster with the higher power density is less than the threshold. Using this hierarchical method, we can find the optimal number of modules to be considered in the calculation of total heat diffusion.
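The recursive procedure can be sketched with a one-dimensional 2-means split. The initialization at the minimum and maximum value, the function names, and the power densities are assumptions for illustration; the threshold is passed as a module count (for example 3 for 30% of 10 modules).

```python
def two_means(items, max_iters=100):
    """Split (name, value) pairs into a low and a high cluster (1-D 2-means)."""
    values = [v for _, v in items]
    lo, hi = min(values), max(values)
    if lo == hi:
        return list(items), []
    low, high = [], []
    for _ in range(max_iters):
        low = [(n, v) for n, v in items if abs(v - lo) <= abs(v - hi)]
        high = [(n, v) for n, v in items if abs(v - lo) > abs(v - hi)]
        new_lo = sum(v for _, v in low) / len(low)
        new_hi = sum(v for _, v in high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # assignments are stable
        lo, hi = new_lo, new_hi
    return low, high


def hottest_modules(power_density, max_count):
    """Keep refining the high-density cluster until it has fewer than
    max_count modules."""
    cluster = list(power_density.items())
    while len(cluster) >= max_count:
        _, high = two_means(cluster)
        if not high:
            break  # cannot split any further
        cluster = high
    return dict(cluster)


# Hypothetical power densities (W/mm^2) for ten modules.
pd = {"L2": 0.05, "IL1": 0.10, "DL1": 0.12, "Decode": 0.30, "Branch": 0.35,
      "FPAdd": 0.40, "FPMul": 0.45, "RUU": 0.90, "IALU": 1.05, "LSQ": 1.10}
print(sorted(hottest_modules(pd, max_count=3)))
```

Each split keeps only the denser half, so the number of modules entering the heat diffusion sum adapts to how the power densities are distributed instead of being fixed in advance.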

3.3 Floorplanning with Stochastic Thermal Model

In Equation (6), the thermal term is deterministic and does not consider correlations of power densities between modules or the other effects. We propose the new objective function as

$$\frac{W_{area} \cdot Area}{Area_{norm}} + \frac{W_{CPI} \cdot CPI}{CPI_{norm}} + \frac{W_{thermal} \cdot Stochastic\ HeatDiff}{HeatDiff_{norm}} \tag{17}$$

where we replace $HeatDiff$ by $Stochastic\ HeatDiff$, which is calculated as described in Sub-section 3.2.1 and Sub-section 3.2.2.

4. ALGORITHM FLOW AND EXPERIMENTS

In this section, we compare floorplanning based on our proposed model with [13] and [10]. Moreover, we optimize the overall floorplan on dual-core micro-architectures as well as on only one core to study the impact. The experiment setting is described below.

4.1 Flow and Experiment Setting

Similar to [3], we assume two SuperScalar processors, one in 90nm and one in 65nm technology. The settings are shown in Table 1. We treat the blocks as soft, with aspect ratio between 0.33 and 3, and L2 is partitioned into three modules.

Table 1: Settings in 90nm and 65nm technology.

                         90nm               65nm
Issue Width              4                  8
Die Area (mm2)           100                200
Die Thickness (mm)       0.5 (both technologies)
Heat Spreader (mm2)      900                1600
Heat Sink (mm2)          2500               3600
Functional Units         3 Integer ALU      6 Integer ALU
                         1 Integer Mult     2 Integer Mult
                         1 FP Adder         2 FP Adder
                         1 FP Mult          2 FP Mult
Register Update Unit     64 Instructions    128 Instructions
Load Store Queue         32 Instructions    32 Instructions
Fetch Queue              8 Instructions     16 Instructions
Clock Frequency          3GHz               6GHz
FF Insertion Length      2000 um            707 um

Blocks composed of multiple modules are the RUU block, including the Register Update Unit and Load Store Queue; the Decode block, including the Fetch Queue and the Decoder; the Branch block, including the Fetch Unit and Branch Predictor; the DL1 block, including the Level 1 Data Cache and the DTLB; and the IL1 block, including the Level 1 Instruction Cache and the ITLB. The L2 unified cache and all functional units are treated as independent blocks. The block areas are summarized in Table 2.

Table 2: Area of logical block in 90nm and 65nm technology.

Block      Area (mm2) 90nm    Area (mm2) 65nm
IALU       1.00               0.65
IMULT      1.00               0.65
FPADD      1.94               1.27
FPMULT     2.07               1.29
RUU        4.48               5.88
Decode     1.44               1.98
Branch     2.27               3.11
L2         75.6               130
IL1        8.99               18.8
DL1        10.03              13.7

The length of the interconnect between two blocks in Table 2 is computed as the Manhattan distance between the centers of the two blocks. We view the latency of each such interconnect as an independent variable. Changing the latency of one of these interconnects is an effective change in the micro-architecture and may impact performance. In Table 3, we specify these interconnects with respect to their terminal blocks. We apply the TPWL model [3] as our CPI model.
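The interconnect length computation can be sketched as follows. The Manhattan distance follows the text; the mapping from length to latency (one cycle per flip-flop insertion length, rounded up) is an assumption added for illustration and is not stated in the paper, and the block centers are made-up coordinates.

```python
import math


def manhattan(center_a, center_b):
    """Manhattan distance between two block centers, in the same unit."""
    return abs(center_a[0] - center_b[0]) + abs(center_a[1] - center_b[1])


def bus_latency_cycles(center_a, center_b, ff_insertion_length_um):
    """Assumed mapping: one pipeline stage per FF insertion length."""
    return math.ceil(manhattan(center_a, center_b) / ff_insertion_length_um)


# Hypothetical block centers in micrometers.
ruu_center = (1000.0, 500.0)
ialu_center = (4000.0, 2500.0)

print(manhattan(ruu_center, ialu_center))                  # wire length in um
print(bus_latency_cycles(ruu_center, ialu_center, 2000))   # 90nm setting in Table 1
print(bus_latency_cycles(ruu_center, ialu_center, 707))    # 65nm setting in Table 1
```

Under this assumption the same physical distance costs more cycles at 65nm, because the FF insertion length in Table 1 shrinks from 2000 um to 707 um.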

Given the architecture configuration in Table 1, we use PTscalar [20] to simulate the power consumption for four integer applications (bzip2, gcc, gzip, and mcf) and three floating-point applications (art, equake, and mesa) in SPEC2000 [21]. With these power vectors, we calculate the mean power density (W/mm2) and standard deviation for each module.

Fig. 10 shows the correlation matrix for 90nm technology, which is acquired from the sampled power vectors of all modules. In order to get the real correlation, we count only the sampling points in active mode, since in sleep mode every module consumes only static power. If we included the sleep-mode sampling points, there would be a large error in estimating the correlation. As shown in Fig. 10, we can roughly partition all modules into three groups: the first group is from Decode(1) to DL1(11); the second group is IALU4(12), which does not have strong correlation to any other module; and the last one is from FPAdd(13) to L2 right(18). Modules in the same group are highly positively correlated, and modules in different groups are either uncorrelated or negatively correlated.

Table 3: Buses that potentially affect CPI

Bus ID    Terminal blocks    Bus ID    Terminal blocks
1         IALU,RUU           6         IL1,IL2
2         IMULT,RUU          7         DL1,L2
3         FPAdd,RUU          8         Branch,IL1
4         FPMul,RUU          9         Decode,Branch
5         LSQ,DL1            10        Decode,RUU

Number    Block Name    Description
1         decode        Instruction Decoder
2         branch        Branch Predictor
3         RAT           Rename Table
4         RUU           Register Update Unit
5         LSQ           Load Store Queue
6         IALU1         Integer Adder 1
7         IALU2         Integer Adder 2
8         IALU3         Integer Adder 3
9         IntReg        Integer Register
10        IL1           Level 1 Instruction Cache and ITLB
11        DL1           Level 1 Data Cache and DTLB
12        IALU4         Integer Multiplier
13        FPAdd         Floating-point Adder
14        FPMul         Floating-point Multiplier
15        FPReg         Floating-point Register
16        L2 left       Part of Level 2 Cache
17        L2 right      Part of Level 2 Cache
18        L2 bottom     Part of Level 2 Cache

Figure 10: Temporal correlation matrix of power consumption (both axes: block number 1 to 18; color scale from -0.5 to 1)

We use SA-based PARQUET [18] as our base floorplan solver, combined with the CPI model [3] and our stochastic heat diffusion model. We run the experiments on a Linux workstation. After completing the whole flow with different objectives, HOTSPOT [11] is used to calculate the temperature for verification purposes only. Also, because SA is a randomized algorithm, we run each objective ten times to obtain the best case and the average case. The whole flow of our work is summarized in Fig. 11.

Figure 11: Whole flow of thermal-aware floorplanning using TPWL to calculate CPI

4.2 Comparison between Thermal Models

We compare our stochastic heat diffusion model (SHDM) with [10], which calculates the maximum temperature in every SA iteration to estimate the cost of each new floorplan. We run seven benchmarks as described in Section 4.1. The objective function combines area and thermal effect with weights 0.6 and 0.3, respectively. Table 4 summarizes the final result. WS means the white space (dead space) percentage of the total area. For example, 217.4 (4.3%) means the total area is 217.4 and the white space percentage is 4.3%. From the table, SHDM can reduce Tmax by up to 3°C (from 93.0°C to 90.0°C, or 3.2%) with a 1.34% increase in area. These results show that our model is quite accurate, while it is 27x faster for the 90nm processor and 19x faster for the 65nm processor.
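As a quick arithmetic check, the speedups and the 90nm percentage changes can be recomputed from the Tmax, area, and runtime values reported in Table 4. The helper name is an assumption; the input numbers are the ones reported for [10] and SHDM.

```python
def percent_change(new, old):
    """Relative change of new with respect to old, in percent."""
    return 100.0 * (new - old) / old


# Runtime in seconds: [10] versus SHDM.
speedup_90nm = 2300 / 85
speedup_65nm = 2980 / 155

# 90nm peak temperature (deg C) and area (mm^2): SHDM versus [10].
tmax_change_90nm = percent_change(90.0, 93.0)
area_change_90nm = percent_change(121.0, 119.4)

print(round(speedup_90nm, 1), round(speedup_65nm, 1))
print(round(tmax_change_90nm, 1), round(area_change_90nm, 2))
```

The recomputed values round to the 27x and 19x speedups and to the -3.2% and +1.34% entries quoted for the 90nm processor.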

Table 4: Comparison between stochastic heat diffusion model and [10]

          90nm                                          65nm
          Tmax (°C)  Area (mm²) (WS)  Runtime (s)       Tmax (°C)  Area (mm²) (WS)  Runtime (s)
[10]      93.0       119.4 (4.7%)     2300              93.3       217.4 (4.3%)     2980
SHDM      90.0       121.0 (5.6%)     85                93.1       219.6 (5.8%)     155
impact    -3.2%      +1.34%           1/27x             -0.2%      +1.03%           1/19x

4.3 Comparison between Different Objectives

In this section, we compare the thermal impact on the floorplan under different objectives, and we also compare the results with [13]. We use the same settings as in Section 4.2 with different objective functions, and summarize the results in Table 5. Since our model is stochastic, we denote our model with the objectives area, CPI, and heat diffusion as ACHs, and [13] with the same objectives as ACHd. The weights for area, CPI, and heat diffusion in the objective function are 0.6, 0.3, and 0.2, respectively. We first compare the results with and without considering the thermal effect based on our stochastic model. As shown in Table 5, compared with objective AC in the best case, considering the thermal effect reduces the maximum temperature by 9.1%, from 97.7°C to 88.8°C, for 90nm and by 5.4%, from 102.8°C to 97.2°C, for 65nm, with negligible area overhead and a CPI increase of up to 7.3%. Clearly, there is a trade-off between lowering the temperature and reducing CPI: to lower the temperature, two hot modules connected by a critical bus or wire have to be placed as far apart as possible, which also increases CPI or wire length. However, the large reduction in temperature is worth the small loss in performance, since lower temperature implies better clock rate and clock reliability.
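The trade-off expressed by the weighted objective can be made concrete with a small sketch. The exact cost formulation is not restated in this section, so the code assumes a plain weighted sum of terms normalized to a reference floorplan; `weighted_cost` and both candidate floorplans are hypothetical, and only the weights 0.6, 0.3, and 0.2 come from the text.

```python
def weighted_cost(area, cpi, heat, weights=(0.6, 0.3, 0.2)):
    """Weighted-sum cost of a floorplan.

    Assumption: each term is already normalized to a reference floorplan
    (reference = 1.0), so the three terms are comparable in magnitude.
    """
    w_area, w_cpi, w_heat = weights
    return w_area * area + w_cpi * cpi + w_heat * heat

# Hypothetical candidates, normalized to candidate A.
candidate_a = {"area": 1.00, "cpi": 1.00, "heat": 1.00}  # compact, hot
candidate_b = {"area": 1.02, "cpi": 1.07, "heat": 0.70}  # spread out, cooler

cost_a = weighted_cost(**candidate_a)
cost_b = weighted_cost(**candidate_b)
print(f"A: {cost_a:.3f}  B: {cost_b:.3f}")
```

Under these assumed numbers, the cooler candidate B is preferred even though its area and CPI are slightly worse; with the heat weight set to zero, the same comparison would favor candidate A.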

The work in [13] used a simple heat diffusion model without considering correlation, dead space, border effect, and the geometry of the modules. It produced up to 2.9% area overhead and up to a 21.3% increase in CPI with similar or smaller temperature reduction compared with our model, which shows that our model is more accurate and robust. Although our runtime is longer than that of [13], as shown in Table 6, it remains on the order of minutes, so the difference has little practical impact. AR (aspect ratio) in Table 6 is the ratio of the length to the width of the final floorplan.

4.4 Dual Core Micro-architecture

In this section, we compare the thermal impacts on the dual-core floorplan for 90nm technology. We assume that both cores run the same application and have the same micro-architecture as described in Section 4.1. When optimizing only one core, we set its aspect ratio to two so that the aspect ratio becomes one after combining the two cores. In this experiment, the weights for area, CPI, and thermal effect are 0.5, 0.6, and 0.3, respectively. In Table 7 and Table 8, AC Separate and ACHs Separate mean that the floorplan is optimized on only one core, while AC Mixed and ACHs Mixed mean that the floorplan is optimized over two cores.
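The choice of an aspect ratio of two for a single core follows from simple geometry, sketched below. The sketch assumes the two identical cores are abutted along their long edges, which the text does not state explicitly; `combined_aspect_ratio` and the example dimensions are illustrative.

```python
def combined_aspect_ratio(core_width, core_height):
    """Aspect ratio (long side / short side) of two identical cores
    placed side by side along their long edges (assumed arrangement)."""
    long_side = max(core_width, core_height)
    short_side = min(core_width, core_height)
    combined_w = 2 * short_side  # the two short sides add up
    combined_h = long_side       # the shared long side is unchanged
    return max(combined_w, combined_h) / min(combined_w, combined_h)

# Hypothetical single core with aspect ratio two (15.5 mm by 7.75 mm).
print(combined_aspect_ratio(7.75, 15.5))
```

A core that is twice as tall as it is wide therefore yields a square dual-core die, which keeps the separately optimized floorplans comparable with the mixed ones.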

Figs. 12, 13, 14, and 15 show the final floorplans for the different objectives. Modules named * 1 (i.e., with 1 at the end), shown in blue (dark), belong to core one. Modules named * 2, shown in yellow (grey), belong to core two. White space denotes dead space. As can be seen from the figures, both AC Separate and AC Mixed tend to cluster the core modules together to reduce CPI, but AC Mixed has a smaller CPI than AC Separate, as shown in Table 7. Theoretically, AC Mixed should achieve the minimal CPI, or at least a result close to that of AC Separate, and the experiments agree. It should also be noted that both the average peak temperature and the best-case peak temperature for AC Mixed are lower than those for AC Separate, as shown in Table 7. The main reason is that modules with low power densities have more chances to be adjacent to the potentially hottest modules in the mixed floorplan. In brief, without considering the thermal effect, AC Mixed generally has a lower peak temperature with a slightly smaller CPI compared with AC Separate.

When the thermal effect is considered, ACHs Separate still has a well-clustered core with a lower peak temperature than AC Separate, while the core of ACHs Mixed is spread out and reaches the lowest peak temperature, below that of AC Mixed. Moreover, ACHs Mixed reduces the temperature by a further 3.36°C (98.57°C - 95.21°C) and even achieves 1.02x (0.926/0.910) better CPI performance compared with ACHs Separate. However, compared with the floorplan that does not consider the thermal effect, CPI increases by 7.5% (best, (0.910-0.846)/0.846) and 6.9% (average, (0.958-0.896)/0.896), and area increases by 1.7% (best, (244.5-240.3)/240.3) and 1.0% (average, (249.3-246.8)/246.8), which shows the trade-off between thermal and performance objectives, as shown in Table 7.

In conclusion, a floorplan optimized over two cores can slightly improve CPI performance compared with a floorplan optimized separately. When the thermal effect is considered, the mixed floorplan can further reduce the peak temperature compared with the floorplan optimized separately.

5. CONCLUSIONS AND DISCUSSIONS

We have proposed stochastic thermal-aware floorplanning with consideration of micro-architecture level throughput optimization. First, we have shown that there are correlations between the power of modules for different microprocessor applications. Second, considering dead space, border effect, and the geometry of modules, we have developed a stochastic heat diffusion model and applied it to microprocessor floorplanning. Compared with existing floorplanning using a deterministic heat diffusion model, our model obtains up to 3.2°C (92.0°C - 88.8°C) reduction of the on-chip peak temperature, 1.29% ((224.1-221.2)/224.1) reduction of the area, and 1.13x (0.995/0.880) better CPI performance. Moreover, compared with temperature-aware floorplanning in the HOTSPOT toolset, which ignores interconnect pipelining, our algorithm is up to 27x faster and reduces the peak temperature by up to 3°C with a negligible area overhead.

We also study the dual-core micro-architecture to see the effect of optimizing the floorplan over two cores together instead of optimizing each core separately. The results show that optimizing over two cores can further reduce the temperature by 3.36°C with slightly better CPI performance compared with optimizing only one core, which offers a different perspective on dual-core micro-architecture design. This concept may also be applied to multi-core designs to obtain a better balance between performance and thermal issues.

In the future, we will further apply our stochastic thermal model to rectilinear blocks, 3D stacking, multi-core micro-architecture, and other floorplanning methods.

Table 5: Comparison of stochastic and deterministic heat diffusion model with different objectives

90nm
Obj.    CPI               Tmax (°C)         Area (mm²)        WS (%)
        Best     Avg      Best     Avg      Best     Avg      Best    Avg
AC      0.820    0.890    97.7     96.7     118.5    122.4    3.05    6.89
ACHd    0.995    1.000    92.0     92.2     122.0    125.3    6.67    9.08
        +21.3%   +12.4%   -5.8%    -4.7%    +2.9%    +2.3%
ACHs    0.880    0.954    88.8     88.9     121.1    123.2    6.10    7.36
        +7.3%    +7.2%    -9.1%    -8.1%    +2.2%    +0.6%

65nm
Obj.    CPI               Tmax (°C)         Area (mm²)        WS (%)
        Best     Avg      Best     Avg      Best     Avg      Best    Avg
AC      0.730    0.770    102.8    105.6    217.8    223.6    4.37    7.00
ACHd    0.790    0.839    97.6     100.7    224.1    221.5    7.39    6.42
        +8.3%    +8.9%    -5.0%    -4.6%    +2.89%   -1.00%
ACHs    0.778    0.784    97.2     97.6     221.2    223.0    6.03    6.98
        +6.6%    +1.8%    -5.4%    -7.5%    +1.56%   -0.27%

Table 6: Comparison of runtime and aspect ratio (AR) under different objectives

90nm
Obj.    Runtime (s)    AR
                       Best    Avg
AC      212            1.10    1.08
ACHd    248            1.02    1.09
ACHs    298            1.00    1.06

65nm
Obj.    Runtime (s)    AR
                       Best    Avg
AC      483            1.01    1.08
ACHd    583            1.02    1.05
ACHs    634            1.04    1.02

Table 7: Comparison between mixed and separate dual core floorplan

Obj.             CPI               Tmax (°C)          Area (mm²)        WS (%)
                 Best     Avg      Best     Avg       Best     Avg      Best    Avg
AC Separate      0.849    0.939    110.69   105.8     240.3    250.1    4.93    8.70
AC Mixed         0.846    0.896    103.57   103.22    240.3    246.8    4.94    7.46
                 -0.35%   -4.57%   -6.43%   -2.44%    +0.00%   -1.32%
ACHs Separate    0.926    0.965    98.57    99.47     246.4    248.8    7.22    8.07
ACHs Mixed       0.910    0.958    95.21    97.63     244.5    249.3    6.62    8.68
                 -1.73%   -0.72%   -3.41%   -1.84%    -0.77%   +0.20%

Table 8: Comparison of runtime and aspect ratio (AR) under different objectives for dual core micro-architecture

Obj.             Runtime (s)    AR
                                Best    Avg
AC Separate      220            1.10    1.03
AC Mixed         775            1.00    1.00
ACHs Separate    286            1.15    1.09
ACHs Mixed       858            1.02    1.04

Figure 12: Floorplan: AC Separate (half)

Figure 13: Floorplan: AC Mixed

Figure 14: Floorplan: ACHs Separate (half)

Figure 15: Floorplan: ACHs Mixed

6. REFERENCES

[1] M. Ekpanyapong, J. R. Minz, T. Watewai, H. S. Lee, and S. K. Lim, “Profile-guided microarchitecture floorplanning for deep submicron processor design,” in IEEE/ACM Design Automation Conf., 2004.

[2] V. Nookala, Y. Chen, D. Lilja, and S. Sapatnekar, “Microarchitecture-aware floorplanning using a statistical design of experiments approach,” in IEEE/ACM Design Automation Conf., pp. 579-584, 2005.

[3] C. Long, L. Simonson, W. Liao, and L. He, “Floorplanning optimization with trajectory piecewise-linear model for pipelined interconnects,” in IEEE/ACM Design Automation Conf., 2004.

[4] A. Jagannathan, H. H. Yang, K. Konigsfeld, D. Milliron, M. Mohan, M. Romesis, G. Reinman, and J. Cong, “Microarchitecture evaluation with floorplanning and interconnect pipelining,” in Asia South Pacific Design Automation Conf., pp. 1-8, 2004.

[5] J. Srinivasan, S. Adve, P. Bose, and J. Rivers, “The impact of technology scaling on lifetime reliability,” in Proc. Dependable Systems and Networks, 2004.

[6] C. Tsai and S. Kang, “Cell-level placement for improving substrate thermal distribution,” in IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 19, no. 2, pp. 253-266, 2000.

[7] C. C. Chu and D. Wong, “A matrix synthesis approach to thermal placement,” in IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 11, pp. 1166-1174, 1998.

[8] B. Obermeier and F. Johannes, “Temperature-aware global placement,” in Asia South Pacific Design Automation Conf., pp. 143-148, 2004.

[9] G. Chen and S. Sapatnekar, “Partition-driven standard cell thermal placement,” in Int. Symp. on Physical Design, vol. 1, pp. 75-80, 2003.

[10] K. Sankaranarayanan, S. Velusamy, M. R. Stan, and K. Skadron, “A case for thermal-aware floorplanning at the microarchitectural level,” in Journal of Instruction-Level Parallelism, 2005.

[11] K. Skadron, M. R. Stan, K. Sankaranarayanan, W. Huang, S. Velusamy, and D. Tarjan, “Temperature-aware microarchitecture,” in Proc. IEEE Int. Symp. on Circuits and Systems, 2003.

[12] V. Nookala, D. J. Lilja, and S. S. Sapatnekar, “Temperature-aware floorplanning of microarchitecture blocks with IPC-power dependence modeling and transient analysis,” in Int. Symp. on Low Power Electronics and Design, 2006.

[13] Y. Han, I. Koren, and C. A. Moritz, “Temperature aware floorplanning,” in Temperature aware Computing Systems, 2005.

[14] H. Yu, Y. Hu, C. C. Liu, and L. He, “Minimal skew clock embedding considering time variant temperature variation with automatic correlation extraction,” in Int. Symp. on Physical Design, 2007.

[15] M. Monchiero, R. Canal, and A. Gonzalez, “Design space exploration for multicore architectures: a power/performance/thermal view,” in Proc. of Int. Conf. on Supercomputing, pp. 177-186, 2006.

[16] E. Humenay, D. Tarjan, and K. Skadron, “Impact of process variations on multicore performance symmetry,” in Proc. Design Automation and Test in Europe, 2007.

[17] N. Sherwani, “Algorithms for VLSI design automation,” Kluwer, 3rd ed., 1999.

[18] S. N. Adya and I. L. Markov, “Fixed-outline floorplanning through better local search,” in IEEE/ACM Int. Conf. on Computer Aided Design, pp. 328-334, 2001.

[19] D. Burger and T. Austin, “The SimpleScalar tool set version 2.0,” in University of Wisconsin-Madison, 1997.

[20] W. L. W. Liao, L. He, and K. Lepak, “PTscalar version 1.0,” in University of California-Los Angeles, 2003.

[21] J. L. Henning, “SPEC CPU 2000: Measuring CPU performance in the new millennium,” IEEE Trans. on Computers, vol. 33, pp. 28-35, 2000.

[22] A. Krum, “Thermal management,” in F. Kreith, ed., The CRC Handbook of Thermal Engineering, pp. 2.1-2.92. CRC Press, Boca Raton, FL, 2000.

[23] W. Liao, L. He, and K. Lepak, “Temperature and supply voltage aware performance and power modeling at microarchitecture level,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, pp. 1042-1053, 2005.
