Journal of Systems Architecture 51 (2005) 63–77
www.elsevier.com/locate/sysarc
The impact of x86 instruction set architecture on superscalar processing
Rafael Rico *, Juan-Ignacio Perez, Jose Antonio Frutos
Department of Computer Engineering, Universidad de Alcala, 28871 Alcala de Henares, Spain
Received 11 July 2003; received in revised form 14 February 2004; accepted 3 July 2004
Available online 7 October 2004
Abstract
Performance improvement of x86 processors is a relevant matter. From the point of view of superscalar processing,
it is necessary to complement the studies on instruction use with analogous ones on data use and, furthermore, to
analyze the data flow graphs, as their dependences are responsible for limitations on ILP. In this work, using
instruction traces from common applications, quantitative analyses have been performed of implicit operands,
memory addressing and condition codes, three sources of significant limitations on the maximum achievable
parallelism in the x86 architecture. In order to gain a deeper knowledge of these limitations, data dependence
graphs are built from the traces. By means of a matrix representation of the graphs, the potentially exploitable
parallelism is quantified and parallelism distributions for the traces are shown. The method has also been applied
to measure the impact of the use of condition codes. Results are compared with previous work and some conclusions
are presented relating the degree of parallelism obtained to negative characteristics of the x86 instruction set
architecture.
© 2004 Elsevier B.V. All rights reserved.
Keywords: Instruction level parallelism; Instruction set architecture; DDG-based quantification
1. Introduction
Some of the characteristic features of instruc-
tion sets have an effect on the fine-grain parallelism
available in the code in that they can impose data
dependences which, in turn, cause an over-ordering
of instructions in the program that is not strictly
necessary to preserve the computational meaning of
the compiled tasks. This effect can be critical in
a superscalar processing environment.

1383-7621/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.sysarc.2004.07.002
* Corresponding author. Tel.: +34 91 885 66 15; fax: +34 91 885 66 41.
E-mail address: [email protected] (R. Rico).

Data dependences can be classified as true or
false. True data dependences are not avoidable by
any hardware means, and they cause sequential
execution. Output- and anti-dependences, on the
other hand, can be overcome by hardware
techniques such as register renaming. However,
these techniques can sometimes entail greater
design complexity that is not fully justified by
the reduction in execution time.
Some of the instruction set features that provoke
extra dependences are implicit operands (dependent
on the operation and not specified by the
programmer), registers used for memory address
computation, condition codes, etc. The ISA of the
x86 family exhibits all of these features, a
necessary evil in this case if binary compatibility
is to be maintained. That is why, in order to
increase the throughput of x86-family processors
today, but at a reasonable cost, it is necessary
to attain a precise knowledge of the use that
instructions make of operands and of the sources
of the most critical dependences.
With the aim of quantifying the impact of these
instruction set characteristics, measurements have
been made based on Data Dependence Graphs (DDGs).
These graphs are built from the traces of the
programs' execution and are analyzed mathematically
through their matrix representation. This process
is beyond the scope of this paper, but more
information can be found on our web site [13],
where the application of graph theory to ILP
analysis and scheduling is explained in more
detail. This method makes reproducibility of the
experiments easier, since it rests on an algebraic
formalization. This is not the case with
measurements based on IPC or on execution time,
which need complex simulators where suppositions
or simplifications dramatically affect the final
results, and where it is extremely difficult to
take into account all the dynamic events typical
of real execution.
DDG-based quantification describes the code
objectively, since it is hardware-independent.
This is another difference from time-based
quantifications, which require the definition of
a specific physical layer to work with, and this
physical layer influences the code it executes.
Therefore, time-based measurements describe
hardware and software as a whole.
Objective evaluation of code, on the other hand,
has a great advantage: it shows how the hardware
must be designed to attain optimal performance,
by taking advantage of the positive elements found
in the code and by solving the problems it
highlights.
In this work a study of operand use in
representative DOS applications is presented,
together with a quantification of the parallelism
found in data dependence graphs, as a means of
reaching a better understanding of how different
pieces of data relate to each other. Specifically,
the impact of condition codes has been measured,
since they represent a source of output dependences
without the least computational meaning.
Distributions of parallelism are also presented.
2. Previous work on the x86 instruction set
Adams and Zimmerman performed a study of the
frequency of use of x86 instructions in DOS
applications [1]. However, their work does not
include operands. More recently, Huang and Peng
have counted the use of x86 instructions with
different operands [6]. Even so, they do not
analyze data dependence graphs, and they
concentrate on improving the execution time of
the most frequently used microoperations.
A different approach to the study of the x86
instruction set is the measurement of parallelism.
This is the case in the work presented by Huang
and Xie, which measures parallelism at the
microoperation level, taking into account the
operation as well as the operand type, including
the addressing mode [7].
Bhandarkar and Ding measure parallelism using
the hardware counters featured by the more recent
processors [3], covering a great variety of events.
Our research presents the frequency of use of
data items and concentrates on the analysis of
DDGs, something that is not to be found in other
works on the x86 instruction set.
3. The workbench
In view of the results obtained by Huang and
Peng, for DOS as well as for Windows 95 [6], and
considering the extra difficulty of the 32-bit
environment due to the great variety of operands,
a decision has been made to restrict the analysis
to 16-bit DOS real-mode applications and to defer
the study of 32-bit operands to later work.
Instruction trace generation is based, as was the
case in the work quoted above, on step-by-step
execution mode and modification of interrupt
service routine 1. Where source code was not
available, the executable image was "injected"
with a virus which configures step-by-step
execution.

Table 1 lists the DOS applications and programs
compiled for DOS in real mode which have been
selected for study, as well as the number of
instructions processed for each trace. It must be
noted that the traces contain complete sequences
of execution for a specific workload; that is,
whole program executions rather than partial code
sequences have been analyzed, so as not to work on
partial instruction mixes which do not represent
the usual behavior of the programs.
An integer processing test-bench has been
considered. Several operating system utilities
from MS-DOS 5.0 (comp, find and debug) have been
selected, as well as commonly used applications:
a compressing/decompressing utility (rar version
1.52) and a C language compiler (tcc version 1.0).
The program go, from the SPECint95 suite, has also
been included.
In the case of go, since the source code is
available, two completely opposite compilation
schemes have been used. In one of them,
optimization for size (flags -O1 -Os -G) has been
performed; in the other, optimization for speed
(flags -O2 -Ot -Ox -G). The aim is to evaluate the
possible impact of compilation. A compiler has
been chosen which is not specifically designed for
superscalar code generation: Borland C++
version 4.0.

Table 1
Workbench

Program                     Number of executed instructions
comp                                              689,866
debug                                           8,071,335
find                                            6,119,641
go (optimized for size)                        30,636,605
go (optimized for speed)                       30,290,351
rar (compressing)                              98,244,064
rar (decompressing)                            14,782,924
tcc                                             1,010,078
Total                                         189,844,864
The workload for all these programs has been
chosen so as to obtain a reduced instruction
count, in order for the traces to be manageable.
Even so, the traces represent almost 190 million
instructions.
Since the programs are compiled for the x86-
DOS platform, and our study is concerned with
the code, the workstation/PC environment where
the traces are obtained is not important, as it just
affects the speed at which traces are generated.
One last important issue must be mentioned.
Since our analysis deals with traces, every
conditional branch is already resolved when the
traces are formed, so we can build code sequences
as long as we need, certain that every branch
prediction will be a hit; in other words, the
traces assume perfect branch prediction.
4. The x86 instruction set and the superscalar model
In order to preserve binary compatibility with
previous processors, which in itself has yielded
undeniable benefits, the x86 instruction set has
inherited design characteristics suitable for past
requirements and clearly unfit for superscalar
processing. The original design followed two
guidelines: to minimize the instruction format and
to close the gap between high-level and machine
languages, thus easing the compilation process.
Nowadays these demands are not so important; on
the other hand, the limitations this instruction
set imposes on the exploitation of ILP are
commonly acknowledged. Dedicated use of registers,
implicit operands and state register updating are
potential sources of dependences which impose an
over-serialization of the code.
x86 family processors have improved performance
by devising a two-level microarchitecture. The
upper level works as an interface to the CISC
instruction set, translating those instructions
into RISC-like micro-operations which are executed
in the lower level. Decoding is performed by three
kinds of units: simple, general and sequencer
units. Instructions decoded by the sequencer lead
to a serial execution mode, whereas the other two
decoding units generate superscalar code.
True dependences induced by the x86 ISA are
passed on to the RISC level. False dependences
can be dealt with by renaming techniques, even if
they imply an additional cost.
The impact of the CISC-to-RISC translation on
the structure of the data dependence graphs in the
lower level should not mean great modifications,
if one bears in mind the average number of
microoperations per instruction reported in some
studies of this process. Thus, Huang and Xie point
out that on average each CISC instruction
generates 1.26 microoperations [7], Bhandarkar and
Ding calculate 1.35 [3], whereas Huang and Peng's
work suggests about 1.41 [6]. That is, the graph
for the microoperation sequence would comprise
somewhat less than 1.5 times the number of nodes
of the instruction sequence, which means that the
structure of the data dependence graph is not
significantly altered by this process.

Quantification of ILP is one of the most popular
subjects in computer architecture. For examples
see [8-11], [14,16,17] and [18].
Comparison of studies on RISC processors with
others performed on x86 processors shows a
noticeable difference in the degree of potential
parallelism. Thus, for RISC processors, Wall
reports an ILP from 2 to more than 10, depending
on the execution model and its configuration, with
an average of 7 and a median of 5 [18], and
Theobald et al. conclude that, with memory
disambiguation, 30 simultaneous operations can be
executed [16].
Studies on the x86 ISA do not get such positive
results. Y. Patt's group performed a set of
experiments under several configurations,
achieving IPCs ranging between 0.5 and 3.5 in the
best situations, with most values slightly above
1 [10]. Huang and Xie measure Microoperation Level
Parallelism (MLP) [7]. The average MLP is 1.32
without renaming and 2.20 with renaming, including
sequential instructions. The improvement due to
renaming is above what can be found in other
processors (MIPS, for instance) due to the fact
that x86 family processors have only eight general
purpose registers. Bhandarkar and Ding
characterize the performance of the Pentium Pro
based on the hardware counters this processor
includes [3]. Its CPI for the SPECint95 benchmark
ranges between 0.75 and 1.6.
5. Instruction frequency of use
Although it is not the aim of this work, a study
has been performed on the frequency of use of
instructions, classifying them according to their
mnemonics. Results agree with those by Adams and
Zimmerman [1], with the exception of the LOOP
instruction, which is not to be found, since
compilers tend not to use it due to the problems
that the dedicated use of CX poses, as explained
below. The distribution is also consistent with
the tables by Huang and Peng if care is taken to
add instruction percentages independently of the
data addressing modes [6].
6. Analysis of operand use
In this section, results for register use are
presented from three different perspectives:
implicit use, use related to memory address
arithmetic, and use derived from condition codes.
The information refers to the number of accesses,
not to how they relate to each other. This latter
approach will be taken below.
6.1. Use of implicit operands
Implicit operands are those indissolubly linked
to the mnemonic (the operation) and not explicitly
present in the instruction format. Their use is
justifiable when the objective is to minimize the
instruction format, but in a superscalar execution
environment it diminishes versatility and causes
a potential increase in data dependences by
reducing the number of possibilities for variable
allocation.

Fig. 2 illustrates two typical examples. In Fig.
2(a), the instruction specifies a single operand
(register BX). The other source operand and the
target operand are implicit. This means they
cannot be specified by the programmer, and they
are always the same in all possible instances of
the operation.
In Fig. 2(b) something similar happens. The loop
operation consists of two more basic operations:
counter decrement and conditional branch. The
counter is implicit; in all instances of the
operation the same counter is used. This causes a
serious problem when nested loops are needed.
Nowadays this instruction has almost totally
disappeared from all instruction counts.
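To make the effect concrete, the following is a minimal sketch of our own (the implicit register sets for the mnemonics of Fig. 2 follow the x86 manual, but the table coverage and the trace format are illustrative assumptions) showing how implicit operands alone serialize a trace:

```python
# Sketch: implicit operand sets for a few x86 mnemonics (simplified;
# register roles per the x86 manual, table coverage is illustrative).
IMPLICIT = {
    # mnemonic: (implicit reads, implicit writes)
    "MUL":  ({"AX"}, {"AX", "DX"}),    # MUL r/m16: AX * src -> DX:AX
    "DIV":  ({"AX", "DX"}, {"AX", "DX"}),
    "LOOP": ({"CX"}, {"CX"}),          # decrements CX, jumps while CX != 0
}

def hidden_true_deps(trace):
    """Count read-after-write dependences caused only by implicit
    registers in a linear trace of (mnemonic, explicit operand) pairs."""
    last_writer = {}   # register -> index of last instruction writing it
    deps = 0
    for i, (mnem, _explicit) in enumerate(trace):
        reads, writes = IMPLICIT.get(mnem, (set(), set()))
        deps += sum(1 for r in reads if r in last_writer)
        for w in writes:
            last_writer[w] = i
    return deps

# Two multiplications are serialized through AX even though their
# explicit operands (BX, CX) are different:
print(hidden_true_deps([("MUL", "BX"), ("MUL", "CX")]))   # -> 1
```

Even with unrelated explicit operands, the second MUL must wait for the first, precisely the kind of dependence without computational meaning discussed above.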
Fig. 1 shows the implicit use of registers in
thousands of references. Locality of the registers
used is very evident, as a consequence of
registers being associated with specific
operations. The number of true dependences must be
very large, since for many of the registers the
number of reads is very close to the number of
writes, and for others it is even much greater. A
great number of reads will follow some writes,
causing true dependences. It is reasonable to
assume that some of these are not a consequence of
the computational tasks at hand, but are caused by
the lack of flexibility of the ISA.
6.2. Memory address computation
Something similar happens with effective ad-
dress computation. Address modes are very strict
Fig. 1. Implicit use of registers in thousands of
references. Dark gray for reads and light gray
for writes.

MUL BX                  LOOP destination
AX x BX → DX:AX         CX - 1 → CX
                        if CX != 0 go to destination
(a)                     (b)

Fig. 2. Examples of implicit operand use.
Fig. 3. Memory addressing modes distribution (considering just offset registers).
as regards the registers that can be used. Each
addressing mode implicitly assumes one or two
fixed registers as offset, besides a segment
register. That means that the address arithmetic
always falls on the same operands time and again,
which increases the chances of data dependences.

Table 2 shows the register combinations
available for the different addressing modes,
including the default segment register (which is
thus implicit unless a prefix is used to specify
another one). As a consequence, Intel's memory
segmentation also helps to increase the number of
potential data dependences.

The number of registers which are involved in
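The register combinations of Table 2 can be sketched as a lookup table; the following Python fragment is our own illustrative encoding of the 16-bit ModR/M fields, not part of the authors' tooling:

```python
# Sketch: Table 2 as a lookup from the 16-bit ModR/M fields to the
# registers read when forming an effective address (our own encoding).
OFFSET_REGS = {
    0b000: ["BX", "SI"], 0b001: ["BX", "DI"],
    0b010: ["BP", "SI"], 0b011: ["BP", "DI"],
    0b100: ["SI"], 0b101: ["DI"], 0b110: ["BP"], 0b111: ["BX"],
}
DEFAULT_SEG = {0b000: "DS", 0b001: "DS", 0b010: "SS", 0b011: "SS",
               0b100: "DS", 0b101: "DS", 0b110: "SS", 0b111: "DS"}

def address_registers(mod, rm):
    """Registers implicitly involved in the address computation."""
    if mod == 0b00 and rm == 0b110:          # direct address: offset is
        return [DEFAULT_SEG[rm]]             # a displacement, no register
    return [DEFAULT_SEG[rm]] + OFFSET_REGS[rm]

# mod = 10, r/m = 000: three registers take part in one address:
print(address_registers(0b10, 0b000))   # -> ['DS', 'BX', 'SI']
```

Every instruction decoded to one of these modes implicitly reads the listed registers, which is how the address arithmetic concentrates traffic on the same few operands.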
the calculation of an effective memory address is
large (two at the minimum, considering segment and
offset). Fig. 3 shows the distribution for the
programs in the benchmark as far as the offset is
concerned.

As Fig. 3 illustrates, the column of averages at
the far right gives a general idea: in 70% of the
cases the effective memory address arithmetic
involves two registers, one for segment and one
for offset, and in 6% the behaviour is even worse,
because there are three registers involved, one
for segment and two for offset. The potential for
data dependences increases with the number of
registers used, and even more so if we realize
that they are not dedicated to address
calculation, but are utilized with a general
purpose, as we discuss in the following paragraph.

Table 2
Registers involved in memory addressing modes

r/m   Seg.   Mod = 00   Mod = 01       Mod = 10
000   DS     BX + SI    BX + SI + D8   BX + SI + D16
001   DS     BX + DI    BX + DI + D8   BX + DI + D16
010   SS     BP + SI    BP + SI + D8   BP + SI + D16
011   SS     BP + DI    BP + DI + D8   BP + DI + D16
100   DS     SI         SI + D8        SI + D16
101   DS     DI         DI + D8        DI + D16
110   SS     Direct     BP + D8        BP + D16
111   DS     BX         BX + D8        BX + D16

Fig. 4 shows the explicit accesses to registers
in thousands of references. The use of registers
for memory address computation has been included.
It can be seen that the use of registers is very
concentrated in a few of them, except in the
traces for rar and, to a lesser extent, debug.
Fig. 4 confirms that the registers used for
address arithmetic (BX, BP, SI, DI) also support
a heavy load of general-purpose data accesses
which cannot simply be counted as pointer
manipulation. The conclusion is the same again:
forcing operands to reside in the same logical
locations increases the risk of true dependences
without real computational meaning, impossible to
avoid even with renaming techniques.

6.3. Condition codes

Condition codes are used to store status
information which can be used later to evaluate a
Fig. 4. Explicit access to registers in thousands
of references. Dark gray for references used in
memory addressing.
(a) 0: r1 op_a r2 → r3 (→ state)    (b) 0: r1 op_a r2 → r3 ()
    1: r4 op_b r5 → r6 (→ state)        1: r4 op_b r5 → r6 (→ state)
    2: if state == cc go to             2: if state == cc go to
    dependences through the             only the dependence with
    status register                     computational meaning to preserve

Fig. 5. Impact of condition codes on parallelism.
branch condition. It is another example of
implicit use of operands.

The negative impact of condition codes on the
superscalar processing mode, through state
writing, is evident. Let us see an example. In
Fig. 5(a), the state is always stored. In Fig.
5(b) it is only stored when it has a computational
meaning. This latter case can be executed in fewer
control steps.

Specifically, the kind of data dependence
generated is an output dependence. The solution
can be to use run-time techniques like renaming,
at the cost of increased hardware complexity and
silicon consumption. As a consequence, the use of
condition codes increases code serialization in
the executable binary, although it has no
computational meaning.
This lack of computational meaning makes the
additional cost incurred unjustifiable.

In order to provide a glimpse of the magnitude
of the problem, Fig. 6 shows the implicit use of
condition codes (state flags) in thousands of
references.

A conclusion can be drawn from the graphs: more
codes are stored than necessary, and codes are
stored more times than they are later read. This
proves our point: most of the writes have no
computational meaning, since these values
Fig. 6. Implicit use of condition codes (OF, SF,
ZF, AF, PF, CF) in thousands of references. Dark
gray for writes and light gray for reads.
will never be read. Put another way, many values
will be overwritten without having been used,
requiring unnecessary renaming time and resources.

In this respect, some ISAs have chosen to avoid
condition codes altogether. This is the case of
the Alpha, where instructions relate to each other
only through explicit operands [4]. Other ISAs
have chosen to include a bit in the instruction
format which specifies whether the state register
is to be updated or not. Thus, the compiler can
decide if it must create
the data dependence or not. That is the case in
the PowerPC [12]. The x86 ISA has not avoided the
problem because it must maintain binary
compatibility. This characteristic causes data
dependences without computational meaning, which
make compilation more complex and require a
run-time solution at an additional cost.
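As an illustrative sketch of the claim that most flag writes are dead (the trace format and function name here are our own hypothetical choices, not the paper's tool), one could count condition-code writes that are never read:

```python
# Sketch: count condition-code writes that are never read before being
# overwritten or before the trace ends (trace format is hypothetical:
# one (flags_written, flags_read) pair per instruction).
def dead_flag_writes(trace):
    unread = set()     # flags whose last write has not been read yet
    dead = 0
    for written, read in trace:
        unread -= read                 # those last writes were useful
        dead += len(written & unread)  # overwritten without being read
        unread |= written
    return dead + len(unread)          # writes still unread at the end

# ADD writes all six flags, a branch reads only ZF, CMP rewrites all:
ALL = {"OF", "SF", "ZF", "AF", "PF", "CF"}
trace = [(ALL, set()), (set(), {"ZF"}), (ALL, set())]
print(dead_flag_writes(trace))   # -> 11 of the 12 writes are dead
```

In this toy trace only one of twelve flag writes is ever consumed, which is the pattern the charts of Fig. 6 suggest at full-trace scale.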
7. Quantification based on DDGs
Quantification of the degree of parallelism is a
relevant part of works on ILP. The usual way is to
take time-based measures: the task is to calculate
IPC from the execution time and the instruction
count. This method demands complex simulators, if
measurements are to be precise, where suppositions
and simplifications have an important effect on
the final result. Not without reason,
reproducibility of results is no easy task. The
measurements obtained are strongly dependent on
the simulated physical layer, and often they are
used to evaluate the performance of new hardware
proposals.
An alternative measurement method is one based
on data dependence graphs. DDGs are constructed
from real code traces and the length of the
critical path is calculated. The characterization
is, in this case, independent of the physical
implementation. It lies at a previous step of the
computation process: the machine language level.
However, it includes the impact of the compilation
process. These measurements indicate how the
hardware must be designed in order to take
advantage of the partial order specified by the
graph, without the additional restrictions that a
physical implementation imposes.

Quantification based on DDGs is a powerful
analysis tool when used together with the matrix
representation of the DDG. Thus, besides the
critical path length, the degree of parallelism of
an instruction window can be measured, as well as
the life span of operands, data sharing or reuse,
the most important sources of dependences, the
distribution of parallelism and other important
parameters.

Quantification of the parallelism available in
executables by means of DDG analysis has been used
in previous works, although time-based
measurements (IPC) are often preferred. Kumar used
it on source code written in FORTRAN [9]. Later,
it can be found in works by Austin and Sohi [2],
Postiff et al. [11] and Stefanovic and Martonosi
[15]. These studies start off from traces of real
executed code and later build graphs from them.
However, they adopt certain simplifications which
exclude some graph nodes from the analysis. We
have considered all operations, so that no
instruction is kept out of the DDG. In the
following paragraphs we discuss the
simplifications performed in other works.

Austin and Sohi [2] do not take branches into
account, since they do not produce data to be
consumed by other instructions. A branch alters
the program sequence, but in a trace the path
taken is known in advance. In our work, we
consider the instruction pointer as one more
operand and include the node for the instruction
in the graph.

Stefanovic and Martonosi [15] do not insert
data transfer operations into the graph, in order
to be independent of the store scheme, since they
measure the effect of address computation as a
limiting factor on the degree of parallelism.

In this work we consider all dependences (true
or false), since we are interested in quantifying
the degree of parallelism at the machine language
layer, although we are aware of the ability of
renaming techniques to partially unravel the code
at run time.
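A minimal sketch of this all-dependences DDG construction follows; the trace format and names are our own assumptions for illustration, not the authors' implementation:

```python
# Sketch: build a DDG over a trace and take the critical path length.
# Every dependence (true, anti and output) adds an edge, matching the
# all-operations policy above.
def critical_path(trace):
    """trace: list of (reads, writes) register-name sets."""
    n = len(trace)
    preds = [set() for _ in range(n)]
    last_write = {}                 # reg -> index of last writer
    readers = {}                    # reg -> readers since that write
    for i, (reads, writes) in enumerate(trace):
        for r in reads:             # true dependence (read after write)
            if r in last_write:
                preds[i].add(last_write[r])
        for w in writes:
            if w in last_write:     # output dependence (write after write)
                preds[i].add(last_write[w])
            for j in readers.get(w, ()):   # anti dependence (write after read)
                if j != i:
                    preds[i].add(j)
            last_write[w] = i
            readers[w] = []
        for r in reads:
            readers.setdefault(r, []).append(i)
    depth = [0] * n                 # longest dependence chain ending at i
    for i in range(n):
        depth[i] = 1 + max((depth[j] for j in preds[i]), default=0)
    return max(depth, default=0)

# Two independent operations feeding a third: critical path 2, not 3.
trace = [({"a", "b"}, {"c"}), ({"d", "e"}, {"f"}), ({"c", "f"}, {"g"})]
print(critical_path(trace))   # -> 2
```

The adjacency information in preds is exactly what a matrix representation of the graph encodes, so the same pass can also yield operand life spans or dependence-source counts.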
8. Experimental setup
Data dependence graphs for different static
instruction windows are constructed from real
traces. Instruction windows in a real execution
process are dynamic (sliding windows), in order to
increase the probability of finding independent
instructions which can be executed. However,
continuous processing is affected by the run-time
scheduling process, and that is something we are
trying to avoid, so that measurements are
independent of the physical implementation.
Besides, as far as measurements are concerned, it
is easier to find average values with discrete
processing than to overload computation with
continuous processing of sliding windows
(instantaneous degree of parallelism). On the
other hand, as D. Wall points out, when the window
size is big enough, results for continuous and
discrete windows are similar [18].

In order to obtain a measurement of parallelism
that is independent of the instruction window
size, we define the normalized critical path
length, LN, as the ratio of the critical path
length (l) to the instruction window size (n):
LN = l/n. When LN is 1 there is no parallelism,
and the closer the value is to 0, the more
parallelism there is. We also define the degree of
parallelism, Gp, as the reciprocal of LN
(Gp = 1/LN). Gp is a value between 1 (lack of
parallelism) and n (maximum degree of
parallelism).

In this work we measure the degree of
parallelism from the critical path length (l) and
the instruction window size (n) for each window,
and then we average over all the windows in the
trace.
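The window-by-window averaging just described can be sketched as follows; serial_chain is a hypothetical stand-in for the real DDG critical-path analysis, and the trace format is our own assumption:

```python
# Sketch: average Gp over discrete windows, per the definitions above
# (LN = l/n, Gp = 1/LN = n/l for each window).
def serial_chain(window):
    """Toy critical path: each instruction (a set of register names)
    depends on the latest earlier instruction sharing a register."""
    best = {}                       # reg -> depth of its latest toucher
    depth = 0
    for regs in window:
        d = 1 + max((best.get(r, 0) for r in regs), default=0)
        for r in regs:
            best[r] = d
        depth = max(depth, d)
    return depth

def average_gp(trace, n):
    gps = []
    for s in range(0, len(trace) - n + 1, n):   # discrete windows
        l = serial_chain(trace[s:s + n])
        gps.append(n / l)                        # Gp for this window
    return sum(gps) / len(gps)

# Four 2-instruction windows: dependent pairs give Gp = 1,
# independent pairs give Gp = 2.
trace = [{"AX"}, {"AX"}, {"BX"}, {"CX"}, {"DX"}, {"DX"}, {"SI"}, {"DI"}]
print(average_gp(trace, 2))   # -> (1 + 2 + 1 + 2) / 4 = 1.5
```

Discrete windows keep each measurement independent of any run-time scheduling, which is why the paper prefers them to sliding windows.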
Let us discuss an interesting point. The degree
of parallelism is a function of window size, that
is, Gp = f(n). It is reasonable to think that as n
increases, Gp does the same without limit.
Following this logic, the instruction window sizes
of some superscalar processors have grown in the
hope of getting more independent instructions
ready to be executed simultaneously. Contrary to
that idea, Wall's simulations of RISC processors
found an asymptotic behaviour [18]. We begin with
a hypothesis close to Wall's results, for two
reasons which we explain in the following
paragraphs.

First, as n increases towards the whole
instruction count of a program, Gp moves towards
the average degree of parallelism of that program.
When all instructions are taken into account, the
resulting Gp depends on the relations between the
instructions in that program, not on their number.
Second, our quantification is made over the
code, just after compiling for a target
processor/operating-system platform and just
before any physical implementation. The
parallelism we find in the code is moderated by
the ISA limitations imposed when we go from the
language layer to the code layer. As we explained
in the introductory section and have qualitatively
confirmed in Section 6, the extra dependences
forced by the ISA (with no computational meaning)
have the consequence that successive measures of
Gp show a smoother profile, rounded to an average
value. It is reasonable to think that if we
analyse the available parallelism at the language
layer we will find a greater degree of parallelism
than at the code layer. Although not directly
translatable to integer processing, Kumar's
studies at the language layer for scientific
applications confirm this idea [9].

To confirm our hypothesis about asymptotic
behaviour, we have studied Gp over different
window sizes and we have logged the histogram that
shows the distribution of the number of windows
vs. every possible normalized critical path length
(LN). The window size was increased as long as the
analysis time remained tolerable.
We are aware that there is another definition of
the degree of parallelism in the literature: the
ratio of the total number of instructions to the
number of scheduled time slots. This definition
assumes some physical configuration (instruction
latencies, memory access suppositions and so on,
even assuming no resource constraints) that we
want to avoid in order to analyse the instruction
set architecture (the x86 ISA specifically)
without any interference. In other words, we
prefer to talk about computational steps instead
of time slots, keeping closer to the graph theory
used to carry out the study.
9. Parallelism results
We present two kinds of results: the average
degree of parallelism for different instruction
window sizes, and the distribution of parallelism
for the biggest window size. The impact of the x86
ISA condition codes has also been measured by
repeating the measurement of the degree of
parallelism without taking into account the
dependences related to the state register.
9.1. Degree of parallelism
Fig. 7 shows how the degree of parallelism Gp evolves as the instruction window size varies, for both traces of go.
Fig. 7. Evolution of the degree of parallelism (Gp) vs. instruction window size (1 to 128) for both traces of go; the asymptotic limits are Gp ≈ 1.95 (optimized for size) and Gp ≈ 1.92 (optimized for speed).

We can see that, as the instruction window size increases, the available parallelism grows, but in an asymptotic manner. That is, the degree of potential parallelism at the code layer is bounded. This limit is marked in the charts. This asymptotic behaviour (when the window size increases) agrees with Wall's results for RISC processors [18].
It is important to remember at this point that we can increase the window size because we work with traces that have every conditional branch resolved.
The rest of the traces show a similar evolution,
reaching the limit values shown in Fig. 8 when the
window size is 128. The trace for comp shows the least degree of parallelism (Gp ≈ 1), with almost sequential graphs, whereas in the other traces the value of Gp is about two instructions per computational step.

Fig. 8. Highest degree of parallelism at window size 128 for each trace: comp, debug, find, go (size), go (speed), rar (comp.), rar (dec.) and tcc.
The discrepancy in comp coincides with a great locality of register use, both explicit and implicit, as can be seen in the charts in Section 6.

The two compiled versions of go do not present any significant difference, at least in the data dependence analysis.
These data agree with those of Section 4 for x86 family processors (see [10] and [3]). They especially agree with Huang and Xie, although they deal with microoperations [7]. They obtain an average MLP of 1.66 when sequential microoperations are not taken into consideration, no renaming is performed and branches are predicted, which are the conditions that come closest to our own assumptions in performing a quantification based on DDGs. We have obtained an average ILP of 1.77 considering all traces, and 1.89 for comp specifically.
Since each CISC instruction is translated into approximately 1.3 RISC microoperations (see Section 4 of this paper), it is reasonable to expect a small lengthening of the graph's critical path to be caused by the transformation from CISC to RISC. This would yield a slightly lower degree of parallelism. Fig. 9 sheds some light on the two possible transformations of a graph. The CISC graph has three nodes (instructions) and will be translated into four microoperations following the above rate. That means that one of the CISC nodes becomes two microoperations. The divided node in the RISC graph inherits a data flow edge that can be connected in two possible ways: (a) and (b). However, the most usual case is Fig. 9(b), since often the second microoperation cannot start until the first one is completed, as explained by Gochman [5].

Fig. 9. Possible CISC to RISC graph transformations.
From the point of view of physical layer design, these results mean that two functional units of each kind would be enough to absorb most of the available parallelism.
9.2. Distributions of parallelism
Fig. 10 shows the distribution of parallelism for each trace in number of windows vs. normalized critical path length (LN) for a window size of 128 instructions.

Fig. 10. Distribution in number of windows (thousands) vs. normalized critical path length (LN) for an instruction window of size 128, for each trace: comp, debug, find, go (optimized for size), go (optimized for speed), rar (compressing), rar (decompressing) and tcc.

Distributions show that parallelism appears spread over a range: there are peaks, and moments of sequentiality (LN ≈ 1). However, most analyzed windows are gathered around one or two maxima in the histogram and exhibit a degree of parallelism (Gp = 1/LN) related to the average.

At one end we can observe how the comp and find programs have a very narrow distribution, with almost every instruction window just at the average degree of parallelism. At the opposite end we
count rar, with both workloads, and debug, which show a wider distribution.

Fig. 11. Degree of parallelism without state writes (dark gray) and normal situation (light gray) for a window size of 128.

Fig. 12. Improvement in the degree of parallelism with no state writes and a window size of 128 (average 12.77%).
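The per-window tallies behind such histograms can be sketched as follows (a hypothetical illustration reusing the unit-latency DDG model above; the bin width and trace encoding are our own choices, not the authors'):

```python
# Hypothetical sketch: histogram of windows vs. normalized critical
# path length LN = L / N, one DDG per fixed-size window of a trace.
from collections import Counter

def ln_of_window(n, edges):
    level = [1] * n                          # unit latency per instruction
    for i, j in sorted(edges, key=lambda e: e[1]):
        level[j] = max(level[j], level[i] + 1)
    return max(level) / n                    # LN in (0, 1]; Gp = 1 / LN

def ln_histogram(windows, bins=20):
    """windows: iterable of (n, edges) pairs, one DDG per window."""
    hist = Counter()
    for n, edges in windows:
        hist[round(ln_of_window(n, edges) * bins) / bins] += 1
    return dict(sorted(hist.items()))

# Two toy 4-instruction windows: one fully sequential, one fully parallel.
windows = [(4, [(0, 1), (1, 2), (2, 3)]), (4, [])]
print(ln_histogram(windows))                 # {0.25: 1, 1.0: 1}
```

A narrow histogram (as for comp and find) concentrates almost all windows in one bin; a wide one (as for rar and debug) spreads them over many LN values.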
We know that Kumar, after studying the parallelism available in scientific applications at the language layer, asserts that parallelism appears to come in bursts; that is, there are phases when instantaneous parallelism is several orders of magnitude higher or lower than the average parallelism [9]. Our distribution range is not as wide as those reported for scientific applications: they cannot be translated to our case due to the specific character of integer processing (lower intrinsic parallelism, smaller basic blocks). Besides, the behaviour depicted by Kumar should be moderated by the ISA limitations imposed when we go from the language layer to the code layer, as we exposed before.
All the parallelism available in the range from Gp = 2 (LN = 0.5) to Gp = 1 (LN = 1) can be exploited with functional unit duplication, since there is a maximum of two available independent operations. The range between Gp = 4 (LN = 0.25) and Gp = 2 requires four functional units of each kind (a maximum of four available independent operations). However, the speed-up in this case would be small, since the number of instruction windows that could take advantage of it is residual. This assertion is supported by the fact that it is possible to defer operations from the peak times without increasing the final execution time. This concept is what Theobald et al. call smoothability [16].
We would like to point out that quantifications of parallelism based on simulators hide the presence of those windows with a higher degree of parallelism, due to the smoothability phenomenon and to their light contribution to speed-up.
Distributions for comp and find are especially narrow because they repeatedly perform a single task which, besides, is not many instructions long. Program comp is an anomaly again, as expected, because of its low level of parallelism.
Program rar executes a variety of computational tasks, which results in a wide distribution, but the final degree of parallelism is determined by the maximum: LN = 0.61 (Gp = 1.64) for rar compressing and LN = 0.5 (Gp = 2) for rar decompressing, values which fit quite well with those shown in Fig. 8.
9.3. Condition code impact quantification
In order to show the influence of the ISA on ILP we now quantify the negative effect of state register writes.

Data dependence graphs have been built again, but this time assuming that processing operations do not store condition codes. In this way an upper bound on parallelism can be established, since dependences related to the state register are not considered.

Fig. 11 shows the top degree of parallelism with and without the dependences generated by state writes.
Fig. 12 charts the improvement in the degree of
parallelism with data from Fig. 11. The average is
12.77%, although for debug and rar compressing
the effect is even stronger.
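The experiment can be sketched as follows: build the DDG of a trace twice, once tracking a pseudo-register that models the state register and once ignoring it. This is a hypothetical illustration (our own, simplified to true dependences only); the register sets and instruction examples are assumptions for the sketch:

```python
# Hypothetical sketch: bound the condition-code impact by building the
# DDG of a trace twice, with and without the pseudo-register 'flags'
# that models the x86 state register. Simplified to true (RAW)
# dependences only; instruction encodings are illustrative.

def build_edges(trace, track_flags=True):
    """trace: list of (reads, writes) register-name sets per instruction."""
    edges, last_writer = set(), {}
    for j, (reads, writes) in enumerate(trace):
        if not track_flags:
            reads, writes = reads - {"flags"}, writes - {"flags"}
        for r in reads:                      # RAW: reader depends on last writer
            if r in last_writer:
                edges.add((last_writer[r], j))
        for w in writes:
            last_writer[w] = j
    return edges

def gp(trace, track_flags=True):
    edges = build_edges(trace, track_flags)
    level = [1] * len(trace)                 # unit-latency critical path
    for i, j in sorted(edges, key=lambda e: e[1]):
        level[j] = max(level[j], level[i] + 1)
    return len(trace) / max(level)

trace = [({"ax"}, {"ax", "flags"}),          # add ax, 1
         ({"bx"}, {"flags"}),                # cmp bx, 0
         ({"flags"}, set()),                 # jz  somewhere
         (set(), {"cx"})]                    # mov cx, 0
print(gp(trace), gp(trace, track_flags=False))   # 2.0 4.0
```

Dropping the flags dependence between cmp and jz frees the whole toy window; over real traces the paper measures a more modest but still noticeable average gain of 12.77%.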
10. Conclusions and future work
This study reveals that there is a great locality in the frequency of use of registers. This characteristic of the x86 instruction set causes, as a first consequence, a potential increase in data dependences which, out of sheer probability, will turn into true dependences in some cases. These true dependences will impose a stricter ordering of the instructions in the code which will not be avoidable at run time, and the consequence will be worse performance in a superscalar environment.
A second consequence is that already pointed out by Adams and Zimmerman: the lack of storage locations causes an increase in the number of MOV, PUSH and POP instructions with respect to typical RISC processors [1]. This effect feeds back on the previous one, since it causes an extra load of address arithmetic, again with very localised operands, and stack operations with long dependence chains (see [2,11] and [16]). Besides, memory transfers cause data dependence graphs to be more complex, since memory is usually considered to be a single resource and the most conservative case is assumed. In this respect, memory disambiguation techniques are complex and not always applicable.
Counts of register use have been presented in three groups: those for implicit use (use in connection with a specific operation), explicit use and state register use. The third group, besides locality, shows how large the number of writes without computational meaning is. Output dependences are produced which are unnecessary from the perspective of the task being compiled. Although these dependences can be eliminated at run time, this still has a hardware cost that is not to be neglected, in terms of silicon, time and power consumption.
Quantification based on DDGs has been used in order to measure the degree of parallelism enclosed in the code of the traces of the benchmark. This quantification has the advantage of being independent of the processor architecture, reflecting instead, in an almost exclusive manner, the characteristics of the ISA. Besides, it does not require the use of complex simulators, and it provides a mathematical formalization which is ideal for quantitative analysis.
The average degree of parallelism has been calculated for each trace with different instruction window sizes, and the highest achievable degree has been found. Results have proven consistent with those obtained by other researchers with different techniques.

The distribution of parallelism for each trace has been presented, showing its irregularity and providing an estimate of the benefit which can be derived from functional multiplicity.
Finally, the effect of condition codes has been measured. These results show that the effect of state writes is noticeable and especially important for some of the programs considered.
The asymptotic behaviour of the degree of parallelism as window size increases shows that the potential parallelism at the code layer is bounded. As our quantification is independent of physical implementation, that limited parallelism must come from just two possible sources: inherited from the language layer or imposed by the instruction set architecture. There are several reasons to think that the second is the most likely: the previous work reported in Section 4, the analysis of operand use carried out by us and detailed in Section 6, the histograms of the degree of parallelism, and the quantification of the condition code impact (see Section 9).
The experience acquired with the proposed measurement technique, and its validation, will allow a deeper study of the x86 ISA by building DDGs which differentiate the sources of data dependences, with the aim of locating the most important sinks of parallelism.
Acknowledgement
The authors would like to thank Antonio Gonzalez and Jose Gonzalez, from the Intel Laboratory in Barcelona, for their suggestions, and especially their colleague Francisco Tirado from UCM, whose indications represented a valuable help during the elaboration of this work.
References
[1] T.L. Adams, R.E. Zimmerman, An analysis of 8086
instruction set usage in MS DOS programs, in: Proceedings
of the Third International Conference on Architectural
Support for Programming Languages and Operating
Systems (ASPLOS-III), April 1989, pp. 152–160.
[2] T.M. Austin, G.S. Sohi, Dynamic dependency analysis of
ordinary programs, in: Proceedings of the 19th Interna-
tional Symposium on Computer Architecture, 1992, pp.
342–351.
[3] D. Bhandarkar, J. Ding, Performance characterization of
the Pentium Pro processor, in: Proceedings of the Third
International Symposium on High-Performance Computer
Architecture, 1997, pp. 288–297.
[4] Compaq, Alpha Architecture Handbook, Order number: EC-QD2KC-TE, October 1998. Available from: <http://gatekeeper.dec.com/pub/Digital/info/semiconductor/literature/dsc-library.html>.
[5] S. Gochman, et al., The Intel Pentium M processor: microarchitecture and performance, Intel Technology Journal 7 (2) (2003) 21–36. Available from: <http://developer.intel.com/technology/itj/>.
[6] I.J. Huang, T.C. Peng, Analysis of x86 instruction set
usage for DOS/Windows applications and its implication
on superscalar design, IEICE Transactions on Information
and Systems E85-D (6) (2002) 929–939 (SCI).
[7] I.J. Huang, P.H. Xie, Application of instruction analysis/
scheduling techniques to resource allocation of superscalar
processors, IEEE Transactions on VLSI Systems 10 (1)
(2002) 44–54.
[8] N.P. Jouppi, D.W. Wall, Available instruction-level paral-
lelism for superscalar and superpipelined machines, in:
Proceedings of the Third International Conference on
Architectural Support for Programming Languages and
Operating Systems, April 1989, pp. 272–282.
[9] M. Kumar, Measuring parallelism in computation inten-
sive scientific/engineering applications, IEEE Transactions
on Computers 37 (9) (1988) 1088–1098.
[10] O. Mutlu, J. Stark, Ch. Wilkerson, Y.N. Patt, Runahead
execution: an alternative to very large instruction windows
for out-of-order processors, in: Proceedings of the 9th
International Symposium on High-Performance Computer
Architecture (HPCA'03), 2003, pp. 129–140.
[11] M.A. Postiff, D.A. Greene, G.S. Tyson, T.N. Mudge, The
limits of instruction level parallelism in SPEC95 applica-
tions, in: Proceedings of the 3rd Workshop on Interaction
Between Compilers and Computer Architecture, 1998.
[12] T. Potter, M. Vaden, J. Young, N. Ullah, Resolution of
data and control-flow dependencies in the PowerPC 601,
IEEE Micro (1994) 18–29.
[13] R. Rico, On applying graph theory to ILP analysis/
scheduling, Technical Note UAH-AUT-GAP-2003-01.
Available from: <http://atc2.aut.uah.es/~gap/>.
[14] J.E. Smith, G.S. Sohi, The microarchitecture of superscalar
processors, Proceedings of the IEEE 83 (12) (1995) 1609–
1624.
[15] D. Stefanovic, M. Martonosi, Limits and graph structure
of available instruction-level parallelism, in: Proceedings of
the European Conference on Parallel Computing (Euro-
Par 2000), 2000.
[16] K.B. Theobald, G.R. Gao, L.J. Hendren, On the limits of
program parallelism and its smoothability, in: Proceedings
of the 25th Annual International Symposium on Microar-
chitecture, 1992, pp. 10–19.
[17] D.M. Tullsen, S.J. Eggers, H.M. Levy, Simultaneous
multithreading: maximizing on-chip parallelism, in: Pro-
ceedings of the 22nd Annual International Symposium on
Computer Architecture, 1995, pp. 392–403.
[18] D.W. Wall, Limits of instruction-level parallelism, in:
Proceedings of the Fourth International Conference on
Architectural Support for Programming Languages and
Operating Systems, 1991, pp. 176–188.
Rafael Rico received the B.S. in Physics from the Universidad Complutense de Madrid, Spain, in 1988. He has been assistant professor in Computer Architecture with the Department of Computer Engineering at the University of Alcala since 1998. His research interests include microprocessors, parallel architectures, instruction level parallelism, and VHDL modeling.

Juan-Ignacio Perez received the B.S. in Physics from the Universidad Complutense de Madrid, Spain, in 1993. He has been assistant professor with the Department of Computer Engineering of the Universidad de Alcala since 2002. He has participated in several projects of the University of Alcala about instruction level parallelism. He is currently working on scheduling algorithms.

Jose-Antonio Frutos received the B.S. in Physics from the Universidad Complutense de Madrid, Spain, in 1982, and the Ph.D. from the Universidad de Alcala, Spain, in 1998. He has been assistant professor with the Department of Computer Engineering of the Universidad de Alcala since 1991. He participated in several projects of the University of Alcala about instruction level parallelism. He has been the author of several publications in conference proceedings and journals and holds a patent for a distributed computer control system. His research is related to parallel computer architecture and to applied automatic control and simulation. He is currently working as principal researcher for the University of Alcala in the European Commission project SmartFuel (Third Generation Digital Fluid Management System).