Journal of Systems Architecture 51 (2005) 63–77
www.elsevier.com/locate/sysarc
The impact of x86 instruction set architecture on superscalar processing
Rafael Rico *, Juan-Ignacio Perez, Jose Antonio Frutos
Department of Computer Engineering, Universidad de Alcala, 28871 Alcala de Henares, Spain
Received 11 July 2003; received in revised form 14 February 2004; accepted 3 July 2004
Available online 7 October 2004
Abstract
Performance improvement of x86 processors is a relevant matter. From the point of view of superscalar processing,
it is necessary to complement the studies on instruction use with analogous ones on data use and, furthermore, to
analyze the data flow graphs, as their dependences are responsible for limitations on ILP. In this work, using
instruction traces from common applications, quantitative analyses have been performed of implicit operands,
memory addressing and condition codes, three sources of significant limitations on the maximum achievable
parallelism in the x86 architecture. In order to gain a deeper knowledge of these limitations, data dependence
graphs are built from the traces. By means of a matrix representation of the graphs, the potentially exploitable
parallelism is quantified and parallelism distributions for the traces are shown. The method has also been applied
to measure the impact of the use of condition codes. Results are compared with previous work and some conclusions
are presented relating the degree of parallelism obtained to negative characteristics of the x86 instruction set
architecture.
© 2004 Elsevier B.V. All rights reserved.
Keywords: Instruction level parallelism; Instruction set architecture; DDG-based quantification
1. Introduction
Some of the characteristic features of instruc-
tion sets have an effect on the fine-grain parallelism
available in the code in that they can impose data
dependences which, in turn, cause an over-ordering
of instructions in the program that is not strictly
necessary to preserve the computational meaning of
the compiled tasks. This effect can be critical in
a superscalar processing environment.

1383-7621/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.sysarc.2004.07.002
* Corresponding author. Tel.: +34 91 885 66 15; fax: +34 91 885 66 41.
E-mail address: [email protected] (R. Rico).

Data dependences can be classified as true or
false. True data dependences are not avoidable by
any hardware means, and they cause sequential
execution. Output- and anti-dependences, on the
other hand, can be overcome by hardware
techniques such as register renaming. However,
these techniques can sometimes entail greater
design complexity that is not fully justified by
the reduction in execution time.
Some of the instruction set features that provoke
extra dependences are implicit operands (dependent
on the operation and not specified by the
programmer), registers used for memory address
computation, condition codes, etc. The ISA of the
x86 family exhibits all of these features, a
necessary evil in this case if binary compatibility
is to be maintained. That is why, in order to
increase the throughput of x86-family processors
today, but at a reasonable cost, it is necessary
to attain a precise knowledge of the use that
instructions make of operands and of the sources
of the most critical dependences.
With the aim of quantifying the impact of these
instruction set characteristics, measurements have
been made based on Data Dependence Graphs (DDGs).
These graphs are built from the traces of the
programs' execution and are analyzed mathematically
through their matrix representation. This process
is beyond the scope of this paper, but more
information can be found on our web site [13],
where the application of graph theory to ILP
analysis and scheduling is explained in more
detail. This method makes reproducibility of the
experiments easier, since it rests on an algebraic
formalization. This is not the case with
measurements based on IPC or on execution time,
which need complex simulators where suppositions
or simplifications dramatically affect the final
results, and where it is extremely difficult to
take into account all the dynamic events typical
of real execution.
DDG-based quantification describes the code
objectively, since it is hardware-independent.
This is another difference from time-based
quantifications, which require the definition of
a specific physical layer to work with, and this
physical layer influences the code it executes.
Therefore, time-based measurements describe
hardware and software as a whole.
Objective evaluation of code, on the other hand,
has a great advantage: it shows how the hardware
must be designed to attain optimal performance,
by taking advantage of the positive elements found
in the code and by solving the problems it
highlights.
In this work a study of operand use in
representative DOS applications is presented,
together with a quantification of the parallelism
found in data dependence graphs, as a means of
reaching a better understanding of how different
pieces of data relate to each other. Specifically,
the impact of condition codes has been measured,
since they represent a source of output dependences
without the least computational meaning.
Distributions of parallelism are also presented.
2. Previous work on the x86 instruction set
Adams and Zimmerman performed a study of the
frequency of use of x86 instructions in DOS
applications [1]. However, their work does not
include operands. More recently, Huang and Peng
have counted the use of x86 instructions with
different operands [6]. Even so, they do not
analyze data dependence graphs, and they
concentrate on improving the execution time of
the most frequently used microoperations.
A different approach to the study of the x86
instruction set is the measurement of parallelism.
This is the case in the work presented by Huang
and Xie, which measures parallelism at the
microoperation level, taking into account the
operation as well as the operand type, including
the addressing mode [7].
Bhandarkar and Ding measure parallelism using
the hardware counters featured by the more recent
processors [3], covering a great variety of events.
Our research presents the frequency of use of
data items and concentrates on the analysis of
DDGs, something that is not to be found in other
works on the x86 instruction set.
3. The workbench
In view of the results obtained by Huang and
Peng, for DOS as well as for Windows 95 [6], and
considering the extra difficulty of the 32-bit
environment due to the great variety of operands,
a decision has been made to restrict the analysis
to 16-bit DOS real-mode applications and to defer
the study of 32-bit operands to later work.
Instruction trace generation is based, as was the
case in the work quoted above, on step-by-step
execution mode and modification of interrupt
service routine 1. Where source code was not
available, the executable image was "injected"
with a virus which configures step-by-step
execution.

Table 1 lists the DOS applications and programs
compiled for DOS in real mode which have been
selected for study, as well as the number of
instructions processed for each trace. It must be
noted that the traces contain complete sequences
of execution for a specific workload; that is,
whole program executions rather than partial code
sequences have been analyzed, so as not to work on
partial instruction mixes which do not represent
the usual behavior of the programs.
An integer processing test-bench has been
considered. Several operating system utilities
from MS-DOS 5.0 (comp, find and debug) have been
selected, as well as commonly used applications:
a compressing/decompressing utility (rar version
1.52) and a C language compiler (tcc version 1.0).
The program go, from the SPECint95 suite, has also
been included.
In the case of go, since the source code is
available, two completely opposite compilation
schemes have been used. In one of them,
optimization for size (flags -O1 -Os -G) has been
performed; in the other, optimization for speed
(flags -O2 -Ot -Ox -G). The aim is to evaluate the
possible impact of compilation. A compiler has
been chosen which is not specifically designed for
superscalar code generation: Borland C++
version 4.0.

Table 1
Workbench

Program                     Number of executed instructions
comp                                              689,866
debug                                           8,071,335
find                                            6,119,641
go (optimized for size)                        30,636,605
go (optimized for speed)                       30,290,351
rar (compressing)                              98,244,064
rar (decompressing)                            14,782,924
tcc                                             1,010,078
Total                                         189,844,864
The workload for all these programs has been
chosen so as to obtain a reduced instruction
count, in order for the traces to be manageable.
Even so, the traces represent almost 190 million
instructions.
Since the programs are compiled for the x86-
DOS platform, and our study is concerned with
the code, the workstation/PC environment where
the traces are obtained is not important, as it just
affects the speed at which traces are generated.
One last important issue must be mentioned.
Since our analysis deals with traces, every
conditional branch is already resolved when the
traces are formed, so we can build code sequences
as long as we need, certain that every branch
prediction will be a hit; in other words, the
traces assume perfect branch prediction.
4. The x86 instruction set and the superscalar model
In order to preserve binary compatibility with
previous processors, which in itself has yielded
undeniable benefits, the x86 instruction set has
inherited design characteristics suitable for past
requirements and clearly unfit for superscalar
processing. The original design followed two
guidelines: to minimize the instruction format and
to close the gap between high-level and machine
languages, thus easing the compilation process.
Nowadays these demands are not so important; on
the other hand, the limitations this instruction
set imposes on the exploitation of ILP are
commonly acknowledged. Dedicated use of registers,
implicit operands and state register updating are
potential sources of dependences which impose an
over-serialization of the code.
x86 family processors have improved performance
by devising a two-level microarchitecture. The
upper level works as an interface to the CISC
instruction set, translating those instructions
into RISC-like micro-operations which are executed
in the lower level. Decoding is performed by three
kinds of units: simple, general and sequencer
units. Instructions decoded by the sequencer lead
to a serial execution mode, whereas the other two
decoding units generate superscalar code.
True dependences induced by the x86 ISA are
passed on to the RISC level. False dependences
can be dealt with by renaming techniques, even if
they imply an additional cost.
The impact of the CISC-to-RISC translation on
the structure of the data dependence graphs in the
lower level should not mean great modifications,
if one bears in mind the average number of
microoperations per instruction reported in some
studies of this process. Thus, Huang and Xie point
out that on average each CISC instruction
generates 1.26 microoperations [7], Bhandarkar and
Ding calculate 1.35 [3], whereas Huang and Peng's
work suggests about 1.41 [6]. That is, the graph
for the microoperation sequence would comprise
somewhat less than 1.5 times the number of nodes
of the instruction sequence, which means that the
structure of the data dependence graph is not
significantly altered by this process.

Quantification of ILP is one of the most popular
subjects in computer architecture. For examples
see [8-11], [14,16,17] and [18].
Comparison of studies on RISC processors with
others performed on x86 processors shows a
noticeable difference in the degree of potential
parallelism. Thus, for RISC processors, Wall
reports an ILP from 2 to more than 10, depending
on the execution model and its configuration, with
an average of 7 and a median of 5 [18], and
Theobald et al. conclude that, with memory
disambiguation, 30 simultaneous operations can be
executed [16].
Studies on the x86 ISA do not get such positive
results. Y. Patt's group performed a set of
experiments under several configurations,
achieving IPCs ranging between 0.5 and 3.5 in the
best situations, with most values slightly above
1 [10]. Huang and Xie measure Microoperation Level
Parallelism (MLP) [7]. The average MLP is 1.32
without renaming and 2.20 with renaming, including
sequential instructions. The improvement due to
renaming is above what can be found in other
processors (MIPS, for instance) due to the fact
that x86 family processors have only eight general
purpose registers. Bhandarkar and Ding
characterize the performance of the Pentium Pro
based on the hardware counters this processor
includes [3]. Its CPI for the SPECint95 benchmark
ranges between 0.75 and 1.6.
5. Instruction frequency of use
Although it is not the aim of this work, a study
has been performed on the frequency of use of
instructions, classifying them according to their
mnemonics. Results agree with those by Adams and
Zimmerman [1], with the exception of the LOOP
instruction, which is not to be found, since
compilers tend not to use it due to the problems
that the dedicated use of CX poses, as explained
below. The distribution is also consistent with
the tables by Huang and Peng if care is taken to
add instruction percentages independently of the
data addressing modes [6].
6. Analysis of operand use
In this section, results for register use are
presented from three different perspectives:
implicit use, use related to memory address
arithmetic, and use derived from condition codes.
The information refers to the number of accesses,
not to how they relate to each other. This latter
approach will be taken below.
6.1. Use of implicit operands
Implicit operands are those indissolubly linked
to the mnemonic (the operation) and not explicitly
present in the instruction format. Their use is
justifiable when the objective is to minimize the
instruction format, but in a superscalar execution
environment it diminishes versatility and causes
a potential increase in data dependences by
reducing the number of possibilities for variable
allocation.

Fig. 2 illustrates two typical examples. In Fig.
2(a), the instruction specifies a single operand
(register BX). The other source operand and the
target operand are implicit. This means they
cannot be specified by the programmer, and they
are always the same in all possible instances of
the operation.
In Fig. 2(b) something similar happens. The loop
operation consists of two more basic operations:
counter decrement and conditional branch. The
counter is implicit; in all instances of the
operation the same counter is used. This causes a
serious problem when nested loops are needed.
Nowadays this instruction has almost totally
disappeared from all instruction counts.
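To make the effect concrete, the following is a minimal sketch of our own (the implicit register sets for the mnemonics of Fig. 2 follow the x86 manual, but the table coverage and the trace format are illustrative assumptions) showing how implicit operands alone serialize a trace:

```python
# Sketch: implicit operand sets for a few x86 mnemonics (simplified;
# register roles per the x86 manual, table coverage is illustrative).
IMPLICIT = {
    # mnemonic: (implicit reads, implicit writes)
    "MUL":  ({"AX"}, {"AX", "DX"}),    # MUL r/m16: AX * src -> DX:AX
    "DIV":  ({"AX", "DX"}, {"AX", "DX"}),
    "LOOP": ({"CX"}, {"CX"}),          # decrements CX, jumps while CX != 0
}

def hidden_true_deps(trace):
    """Count read-after-write dependences caused only by implicit
    registers in a linear trace of (mnemonic, explicit operand) pairs."""
    last_writer = {}   # register -> index of last instruction writing it
    deps = 0
    for i, (mnem, _explicit) in enumerate(trace):
        reads, writes = IMPLICIT.get(mnem, (set(), set()))
        deps += sum(1 for r in reads if r in last_writer)
        for w in writes:
            last_writer[w] = i
    return deps

# Two multiplications are serialized through AX even though their
# explicit operands (BX, CX) are different:
print(hidden_true_deps([("MUL", "BX"), ("MUL", "CX")]))   # -> 1
```

Even with unrelated explicit operands, the second MUL must wait for the first, precisely the kind of dependence without computational meaning discussed above.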
Fig. 1 shows the implicit use of registers in
thousands of references. Locality of the registers
used is very evident, as a consequence of
registers being associated with specific
operations. The number of true dependences must be
very large, since for many of the registers the
number of reads is very close to the number of
writes, and for others it is even much greater. A
great number of reads will follow some writes,
causing true dependences. It is reasonable to
assume that some of these are not a consequence of
the computational tasks at hand, but are caused by
the lack of flexibility of the ISA.
6.2. Memory address computation
Something similar happens with effective ad-
dress computation. Address modes are very strict
Fig. 1. Implicit use of registers in thousands of
references. Dark gray for reads and light gray
for writes.

MUL BX                  LOOP destination
AX x BX → DX:AX         CX - 1 → CX
                        if CX != 0 go to destination
(a)                     (b)

Fig. 2. Examples of implicit operand use.
Fig. 3. Memory addressing modes distribution (considering just offset registers).
as regards the registers that can be used. Each
addressing mode implicitly assumes one or two
fixed registers as offset, besides a segment
register. That means that the address arithmetic
always falls on the same operands time and again,
which increases the chances of data dependences.

Table 2 shows the register combinations
available for the different addressing modes,
including the default segment register (which is
thus implicit unless a prefix is used to specify
another one). As a consequence, Intel's memory
segmentation also helps to increase the number of
potential data dependences.

The number of registers which are involved in
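The register combinations of Table 2 can be sketched as a lookup table; the following Python fragment is our own illustrative encoding of the 16-bit ModR/M fields, not part of the authors' tooling:

```python
# Sketch: Table 2 as a lookup from the 16-bit ModR/M fields to the
# registers read when forming an effective address (our own encoding).
OFFSET_REGS = {
    0b000: ["BX", "SI"], 0b001: ["BX", "DI"],
    0b010: ["BP", "SI"], 0b011: ["BP", "DI"],
    0b100: ["SI"], 0b101: ["DI"], 0b110: ["BP"], 0b111: ["BX"],
}
DEFAULT_SEG = {0b000: "DS", 0b001: "DS", 0b010: "SS", 0b011: "SS",
               0b100: "DS", 0b101: "DS", 0b110: "SS", 0b111: "DS"}

def address_registers(mod, rm):
    """Registers implicitly involved in the address computation."""
    if mod == 0b00 and rm == 0b110:          # direct address: offset is
        return [DEFAULT_SEG[rm]]             # a displacement, no register
    return [DEFAULT_SEG[rm]] + OFFSET_REGS[rm]

# mod = 10, r/m = 000: three registers take part in one address:
print(address_registers(0b10, 0b000))   # -> ['DS', 'BX', 'SI']
```

Every instruction decoded to one of these modes implicitly reads the listed registers, which is how the address arithmetic concentrates traffic on the same few operands.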
the calculation of an effective memory address is
large (two at the minimum, considering segment and
offset). Fig. 3 shows the distribution for the
programs in the benchmark as far as the offset is
concerned.

As Fig. 3 illustrates, the column of averages at
the far right gives a general idea: in 70% of the
cases the effective memory address arithmetic
involves two registers, one for segment and one
for offset, and in 6% the behaviour is even worse,
because there are three registers involved, one
for segment and two for offset. The potential for
data dependences increases with the number of
registers used, and even more so if we realize
that they are not dedicated to address
calculation, but are utilized with a general
purpose, as we discuss in the following paragraph.

Table 2
Registers involved in memory addressing modes

r/m   Seg.   Mod = 00   Mod = 01       Mod = 10
000   DS     BX + SI    BX + SI + D8   BX + SI + D16
001   DS     BX + DI    BX + DI + D8   BX + DI + D16
010   SS     BP + SI    BP + SI + D8   BP + SI + D16
011   SS     BP + DI    BP + DI + D8   BP + DI + D16
100   DS     SI         SI + D8        SI + D16
101   DS     DI         DI + D8        DI + D16
110   SS     Direct     BP + D8        BP + D16
111   DS     BX         BX + D8        BX + D16

Fig. 4 shows the explicit accesses to registers
in thousands of references. The use of registers
for memory address computation has been included.
It can be seen that the use of registers is very
concentrated in a few of them, except in the
traces for rar and, to a lesser extent, debug.
Fig. 4 confirms that the registers used for
address arithmetic (BX, BP, SI, DI) also support
a heavy load of general-purpose data accesses
which cannot simply be counted as pointer
manipulation. The conclusion is the same again:
forcing operands to reside in the same logical
locations increases the risk of true dependences
without real computational meaning, impossible to
avoid even with renaming techniques.

6.3. Condition codes

Condition codes are used to store status
information which can be used later to evaluate a
Fig. 4. Explicit access to registers in thousands
of references. Dark gray for references used in
memory addressing.
(a) 0: r1 op_a r2 → r3 (→ state)    (b) 0: r1 op_a r2 → r3 ()
    1: r4 op_b r5 → r6 (→ state)        1: r4 op_b r5 → r6 (→ state)
    2: if state == cc go to             2: if state == cc go to
    dependences through the             only the dependence with
    status register                     computational meaning to preserve

Fig. 5. Impact of condition codes on parallelism.
branch condition. It is another example of
implicit use of operands.

The negative impact of condition codes on the
superscalar processing mode, through state
writing, is evident. Let us see an example. In
Fig. 5(a), the state is always stored. In Fig.
5(b) it is only stored when it has a computational
meaning. This latter case can be executed in fewer
control steps.

Specifically, the kind of data dependence
generated is an output dependence. The solution
can be to use run-time techniques like renaming,
at the cost of increased hardware complexity and
silicon consumption. As a consequence, the use of
condition codes increases code serialization in
the executable binary, although it has no
computational meaning.
This lack of computational meaning makes the
additional cost incurred unjustifiable.

In order to provide a glimpse of the magnitude
of the problem, Fig. 6 shows the implicit use of
condition codes (state flags) in thousands of
references.

A conclusion can be drawn from the graphs: more
codes are stored than necessary, and codes are
stored more times than they are later read. This
proves our point: most of the writes have no
computational meaning, since these values
Fig. 6. Implicit use of condition codes (OF, SF,
ZF, AF, PF, CF) in thousands of references. Dark
gray for writes and light gray for reads.
will never be read. Put another way, many values
will be overwritten without having been used,
requiring unnecessary renaming time and resources.

In this respect, some ISAs have chosen to avoid
condition codes altogether. This is the case of
the Alpha, where instructions relate to each other
only through explicit operands [4]. Other ISAs
have chosen to include a bit in the instruction
format which specifies whether the state register
is to be updated or not. Thus, the compiler can
decide if it must create
the data dependence or not. That is the case in
the PowerPC [12]. The x86 ISA has not avoided the
problem because it must maintain binary
compatibility. This characteristic causes data
dependences without computational meaning, which
make compilation more complex and require a
run-time solution at an additional cost.
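As an illustrative sketch of the claim that most flag writes are dead (the trace format and function name here are our own hypothetical choices, not the paper's tool), one could count condition-code writes that are never read:

```python
# Sketch: count condition-code writes that are never read before being
# overwritten or before the trace ends (trace format is hypothetical:
# one (flags_written, flags_read) pair per instruction).
def dead_flag_writes(trace):
    unread = set()     # flags whose last write has not been read yet
    dead = 0
    for written, read in trace:
        unread -= read                 # those last writes were useful
        dead += len(written & unread)  # overwritten without being read
        unread |= written
    return dead + len(unread)          # writes still unread at the end

# ADD writes all six flags, a branch reads only ZF, CMP rewrites all:
ALL = {"OF", "SF", "ZF", "AF", "PF", "CF"}
trace = [(ALL, set()), (set(), {"ZF"}), (ALL, set())]
print(dead_flag_writes(trace))   # -> 11 of the 12 writes are dead
```

In this toy trace only one of twelve flag writes is ever consumed, which is the pattern the charts of Fig. 6 suggest at full-trace scale.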
7. Quantification based on DDGs
Quantification of the degree of parallelism is a
relevant part of works on ILP. The usual way is to
take time-based measures: the task is to calculate
IPC from the execution time and the instruction
count. This method demands complex simulators, if
measurements are to be precise, where suppositions
and simplifications have an important effect on
the final result. Not without reason,
reproducibility of results is no easy task. The
measurements obtained are strongly dependent on
the simulated physical layer, and often they are
used to evaluate the performance of new hardware
proposals.
An alternative measurement method is one based
on data dependence graphs. DDGs are constructed
from real code traces and the length of the
critical path is calculated. The characterization
is, in this case, independent of the physical
implementation. It lies at a previous step of the
computation process: the machine language level.
However, it includes the impact of the compilation
process. These measurements indicate how the
hardware must be designed in order to take
advantage of the partial order specified by the
graph, without the additional restrictions that a
physical implementation imposes.

Quantification based on DDGs is a powerful
analysis tool when used together with the matrix
representation of the DDG. Thus, besides the
critical path length, the degree of parallelism of
an instruction window can be measured, as well as
the life span of operands, data sharing or reuse,
the most important sources of dependences, the
distribution of parallelism and other important
parameters.

Quantification of the parallelism available in
executables by means of DDG analysis has been used
in previous works, although time-based
measurements (IPC) are often preferred. Kumar used
it on source code written in FORTRAN [9]. Later,
it can be found in works by Austin and Sohi [2],
Postiff et al. [11] and Stefanovic and Martonosi
[15]. These studies start off from traces of real
executed code and later build graphs from them.
However, they adopt certain simplifications which
exclude some graph nodes from the analysis. We
have considered all operations, so that no
instruction is kept out of the DDG. In the
following paragraphs we discuss the
simplifications performed in other works.

Austin and Sohi [2] do not take branches into
account, since they do not produce data to be
consumed by other instructions. A branch alters
the program sequence, but in a trace the path
taken is known in advance. In our work, we
consider the instruction pointer as one more
operand and include the node for the instruction
in the graph.

Stefanovic and Martonosi [15] do not insert
data transfer operations into the graph, in order
to be independent of the store scheme, since they
measure the effect of address computation as a
limiting factor on the degree of parallelism.

In this work we consider all dependences (true
or false), since we are interested in quantifying
the degree of parallelism at the machine language
layer, although we are aware of the ability of
renaming techniques to partially unravel the code
at run time.
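A minimal sketch of this all-dependences DDG construction follows; the trace format and names are our own assumptions for illustration, not the authors' implementation:

```python
# Sketch: build a DDG over a trace and take the critical path length.
# Every dependence (true, anti and output) adds an edge, matching the
# all-operations policy above.
def critical_path(trace):
    """trace: list of (reads, writes) register-name sets."""
    n = len(trace)
    preds = [set() for _ in range(n)]
    last_write = {}                 # reg -> index of last writer
    readers = {}                    # reg -> readers since that write
    for i, (reads, writes) in enumerate(trace):
        for r in reads:             # true dependence (read after write)
            if r in last_write:
                preds[i].add(last_write[r])
        for w in writes:
            if w in last_write:     # output dependence (write after write)
                preds[i].add(last_write[w])
            for j in readers.get(w, ()):   # anti dependence (write after read)
                if j != i:
                    preds[i].add(j)
            last_write[w] = i
            readers[w] = []
        for r in reads:
            readers.setdefault(r, []).append(i)
    depth = [0] * n                 # longest dependence chain ending at i
    for i in range(n):
        depth[i] = 1 + max((depth[j] for j in preds[i]), default=0)
    return max(depth, default=0)

# Two independent operations feeding a third: critical path 2, not 3.
trace = [({"a", "b"}, {"c"}), ({"d", "e"}, {"f"}), ({"c", "f"}, {"g"})]
print(critical_path(trace))   # -> 2
```

The adjacency information in preds is exactly what a matrix representation of the graph encodes, so the same pass can also yield operand life spans or dependence-source counts.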
8. Experimental setup
Data dependence graphs for different static
instruction windows are constructed from real
traces. Instruction windows in a real execution
process are dynamic (sliding windows), in order to
increase the probability of finding independent
instructions which can be executed. However,
continuous processing is affected by the run-time
scheduling process, and that is something we are
trying to avoid, so that measurements are
independent of the physical implementation.
Besides, as far as measurements are concerned, it
is easier to find average values with discrete
processing than to overload computation with
continuous processing of sliding windows
(instantaneous degree of parallelism). On the
other hand, as D. Wall points out, when the window
size is big enough, results for continuous and
discrete windows are similar [18].

In order to obtain a measurement of parallelism
that is independent of the instruction window
size, we define the normalized critical path
length, LN, as the ratio of the critical path
length (l) to the instruction window size (n):
LN = l/n. When LN is 1 there is no parallelism,
and the closer the value is to 0, the more
parallelism there is. We also define the degree of
parallelism, Gp, as the reciprocal of LN
(Gp = 1/LN). Gp is a value between 1 (lack of
parallelism) and n (maximum degree of
parallelism).

In this work we measure the degree of
parallelism from the critical path length (l) and
the instruction window size (n) for each window,
and then we average over all the windows in the
trace.
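The window-by-window averaging just described can be sketched as follows; serial_chain is a hypothetical stand-in for the real DDG critical-path analysis, and the trace format is our own assumption:

```python
# Sketch: average Gp over discrete windows, per the definitions above
# (LN = l/n, Gp = 1/LN = n/l for each window).
def serial_chain(window):
    """Toy critical path: each instruction (a set of register names)
    depends on the latest earlier instruction sharing a register."""
    best = {}                       # reg -> depth of its latest toucher
    depth = 0
    for regs in window:
        d = 1 + max((best.get(r, 0) for r in regs), default=0)
        for r in regs:
            best[r] = d
        depth = max(depth, d)
    return depth

def average_gp(trace, n):
    gps = []
    for s in range(0, len(trace) - n + 1, n):   # discrete windows
        l = serial_chain(trace[s:s + n])
        gps.append(n / l)                        # Gp for this window
    return sum(gps) / len(gps)

# Four 2-instruction windows: dependent pairs give Gp = 1,
# independent pairs give Gp = 2.
trace = [{"AX"}, {"AX"}, {"BX"}, {"CX"}, {"DX"}, {"DX"}, {"SI"}, {"DI"}]
print(average_gp(trace, 2))   # -> (1 + 2 + 1 + 2) / 4 = 1.5
```

Discrete windows keep each measurement independent of any run-time scheduling, which is why the paper prefers them to sliding windows.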
Let us discuss an interesting point. The degree
of parallelism is a function of window size, that
is, Gp = f(n). It is reasonable to think that as n
increases, Gp does the same without limit.
Following this logic, the instruction window sizes
of some superscalar processors have grown in the
hope of getting more independent instructions
ready to be executed simultaneously. Contrary to
that idea, Wall's simulations of RISC processors
found an asymptotic behaviour [18]. We begin with
a hypothesis close to Wall's results, for two
reasons which we explain in the following
paragraphs.

First, as n increases towards the whole
instruction count of a program, Gp moves towards
the average degree of parallelism of that program.
When all instructions are taken into account, the
resulting Gp depends on the relations between the
instructions in that program, not on their number.
Second, our quantification is made over the
code, just after compiling for a target
processor/operating-system platform and just
before any physical implementation. The
parallelism we find in the code is moderated by
the ISA limitations imposed when we go from the
language layer to the code layer. As we explained
in the introductory section and have qualitatively
confirmed in Section 6, the extra dependences
forced by the ISA (with no computational meaning)
have the consequence that successive measures of
Gp show a smoother profile, rounded to an average
value. It is reasonable to think that if we
analyse the available parallelism at the language
layer we will find a greater degree of parallelism
than at the code layer. Although not directly
translatable to integer processing, Kumar's
studies at the language layer for scientific
applications confirm this idea [9].

To confirm our hypothesis about asymptotic
behaviour, we have studied Gp over different
window sizes and we have logged the histogram that
shows the distribution of the number of windows
vs. every possible normalized critical path length
(LN). The window size was increased as long as the
analysis time remained tolerable.
We are aware that there is another definition of
the degree of parallelism in the literature: the
ratio of the total number of instructions to the
number of scheduled time slots. This definition
assumes some physical configuration (instruction
latencies, memory access suppositions and so on,
even assuming no resource constraints) that we
want to avoid in order to analyse the instruction
set architecture (the x86 ISA specifically)
without any interference. In other words, we
prefer to talk about computational steps instead
of time slots, keeping closer to the graph theory
used to carry out the study.
9. Parallelism results
We present two kinds of results: the average
degree of parallelism for different instruction
window sizes, and the distribution of parallelism
for the biggest window size. The impact of the x86
ISA condition codes has also been measured by
repeating the measurement of the degree of
parallelism without taking into account the
dependences related to the state register.
9.1. Degree of parallelism
Fig. 7 shows how the degree of parallelism Gp evolves as the instruction window size varies, for both traces of go.
Fig. 7. Evolution of the degree of parallelism (Gp) vs. instruction window size (1 to 128) for both traces of go; the asymptotic limits are Gp ≈ 1.95 (optimized for size) and Gp ≈ 1.92 (optimized for speed).

We can see that, as the instruction window size increases, the available parallelism grows, but in an asymptotic manner. That is, the degree of potential parallelism at the code layer is bounded. This limit is marked in the charts. This asymptotic behaviour (when the window size increases) agrees with Wall's results for RISC processors [18].
It is important to remember at this point that we can increase the window size because we work with traces that have every conditional branch resolved.
The rest of the traces show a similar evolution,
reaching the limit values shown in Fig. 8 when the
window size is 128. The trace for comp shows the least degree of parallelism (Gp ≈ 1), with almost sequential graphs, whereas in the other traces the value of Gp is about two instructions per computational step.

Fig. 8. Highest degree of parallelism at window size 128 for each trace: comp, debug, find, go (size), go (speed), rar (comp.), rar (dec.) and tcc.
The discrepancy in comp coincides with a great locality of register use, both explicit and implicit, as can be seen in the charts in Section 6.

The two compiled versions of go do not present any significant difference, at least in the data dependence analysis.
These data agree with those of Section 4 for x86 family processors (see [10] and [3]). They especially agree with Huang and Xie, although they deal with microoperations [7]. They obtain an average MLP of 1.66 when sequential microoperations are not taken into consideration, no renaming is performed and branches are predicted, which are the conditions that come closest to our own assumptions in performing a quantification based on DDGs. We have obtained an average ILP of 1.77 considering all traces, and 1.89 for comp specifically.
Since each CISC instruction is translated into approximately 1.3 RISC microoperations (see Section 4 of this paper), it is reasonable to expect a small lengthening of the graph's critical path to be caused by the transformation from CISC to RISC. This would yield a slightly lower degree of parallelism. Fig. 9 sheds some light on the two possible transformations of a graph. The CISC graph has three nodes (instructions) and will be translated into four microoperations following the above rate. That means that one of the CISC nodes becomes two microoperations. The divided node in the RISC graph inherits a data flow edge that can be connected in two possible ways: (a) and (b). However, the most usual case is Fig. 9(b), since often the second microoperation cannot start until the first one is completed, as explained by Gochman [5].

Fig. 9. Possible CISC to RISC graph transformations.
From the point of view of physical layer design, these results mean that two functional units of each kind would be enough to absorb most of the available parallelism.
9.2. Distributions of parallelism
Fig. 10 shows the distribution of parallelism for each trace in number of windows vs. normalized critical path length (LN) for a window size of 128 instructions.

Fig. 10. Distribution in number of windows (thousands) vs. normalized critical path length (LN) for an instruction window of size 128, for each trace: comp, debug, find, go (optimized for size), go (optimized for speed), rar (compressing), rar (decompressing) and tcc.

Distributions show that parallelism appears spread over a range: there are peaks, and moments of sequentiality (LN ≈ 1). However, most analyzed windows are gathered around one or two maxima in the histogram and exhibit a degree of parallelism (Gp = 1/LN) related to the average.

At one end we can observe how the comp and find programs have a very narrow distribution, with almost every instruction window just at the average degree of parallelism. At the opposite end we
count rar, with both workloads, and debug, which show a wider distribution.

Fig. 11. Degree of parallelism without state writes (dark gray) and normal situation (light gray) for a window size of 128.

Fig. 12. Improvement in the degree of parallelism with no state writes and a window size of 128 (average 12.77%).
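The per-window tallies behind such histograms can be sketched as follows (a hypothetical illustration reusing the unit-latency DDG model above; the bin width and trace encoding are our own choices, not the authors'):

```python
# Hypothetical sketch: histogram of windows vs. normalized critical
# path length LN = L / N, one DDG per fixed-size window of a trace.
from collections import Counter

def ln_of_window(n, edges):
    level = [1] * n                          # unit latency per instruction
    for i, j in sorted(edges, key=lambda e: e[1]):
        level[j] = max(level[j], level[i] + 1)
    return max(level) / n                    # LN in (0, 1]; Gp = 1 / LN

def ln_histogram(windows, bins=20):
    """windows: iterable of (n, edges) pairs, one DDG per window."""
    hist = Counter()
    for n, edges in windows:
        hist[round(ln_of_window(n, edges) * bins) / bins] += 1
    return dict(sorted(hist.items()))

# Two toy 4-instruction windows: one fully sequential, one fully parallel.
windows = [(4, [(0, 1), (1, 2), (2, 3)]), (4, [])]
print(ln_histogram(windows))                 # {0.25: 1, 1.0: 1}
```

A narrow histogram (as for comp and find) concentrates almost all windows in one bin; a wide one (as for rar and debug) spreads them over many LN values.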
We know that Kumar, after studying the parallelism available in scientific applications at the language layer, asserts that parallelism appears to come in bursts; that is, there are phases when instantaneous parallelism is several orders of magnitude higher or lower than the average parallelism [9]. Our distribution range is not as wide as those reported for scientific applications: they cannot be translated to our case due to the specific character of integer processing (lower intrinsic parallelism, smaller basic blocks). Besides, the behaviour depicted by Kumar should be moderated by the ISA limitations imposed when we go from the language layer to the code layer, as we exposed before.
All the parallelism available in the range from Gp = 2 (LN = 0.5) to Gp = 1 (LN = 1) can be exploited with functional unit duplication, since there is a maximum of two available independent operations. The range between Gp = 4 (LN = 0.25) and Gp = 2 requires four functional units of each kind (a maximum of four available independent operations). However, the speed-up in this case would be small, since the number of instruction windows that could take advantage of it is residual. This assertion is supported by the fact that it is possible to defer operations from the peak times without increasing the final execution time. This concept is what Theobald et al. call smoothability [16].
We would like to point out that quantifications of parallelism based on simulators hide the presence of those windows with a higher degree of parallelism, due to the smoothability phenomenon and to their light contribution to speed-up.
Distributions for comp and find are especially narrow because they repeatedly perform a single task which, besides, is not many instructions long. Program comp is an anomaly again, as expected, because of its low level of parallelism.
Program rar executes a variety of computational tasks, which results in a wide distribution, but the final degree of parallelism is determined by the maximum: LN = 0.61 (Gp = 1.64) for rar compressing and LN = 0.5 (Gp = 2) for rar decompressing, values which fit quite well with those shown in Fig. 8.
9.3. Condition code impact quantification
In order to show the influence of the ISA on ILP we now quantify the negative effect of state register writes.

Data dependence graphs have been built again, but this time assuming that processing operations do not store condition codes. In this way an upper bound on parallelism can be established, since dependences related to the state register are not considered.

Fig. 11 shows the top degree of parallelism with and without the dependences generated by state writes.
Fig. 12 charts the improvement in the degree of
parallelism with data from Fig. 11. The average is
12.77%, although for debug and rar compressing
the effect is even stronger.
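The experiment can be sketched as follows: build the DDG of a trace twice, once tracking a pseudo-register that models the state register and once ignoring it. This is a hypothetical illustration (our own, simplified to true dependences only); the register sets and instruction examples are assumptions for the sketch:

```python
# Hypothetical sketch: bound the condition-code impact by building the
# DDG of a trace twice, with and without the pseudo-register 'flags'
# that models the x86 state register. Simplified to true (RAW)
# dependences only; instruction encodings are illustrative.

def build_edges(trace, track_flags=True):
    """trace: list of (reads, writes) register-name sets per instruction."""
    edges, last_writer = set(), {}
    for j, (reads, writes) in enumerate(trace):
        if not track_flags:
            reads, writes = reads - {"flags"}, writes - {"flags"}
        for r in reads:                      # RAW: reader depends on last writer
            if r in last_writer:
                edges.add((last_writer[r], j))
        for w in writes:
            last_writer[w] = j
    return edges

def gp(trace, track_flags=True):
    edges = build_edges(trace, track_flags)
    level = [1] * len(trace)                 # unit-latency critical path
    for i, j in sorted(edges, key=lambda e: e[1]):
        level[j] = max(level[j], level[i] + 1)
    return len(trace) / max(level)

trace = [({"ax"}, {"ax", "flags"}),          # add ax, 1
         ({"bx"}, {"flags"}),                # cmp bx, 0
         ({"flags"}, set()),                 # jz  somewhere
         (set(), {"cx"})]                    # mov cx, 0
print(gp(trace), gp(trace, track_flags=False))   # 2.0 4.0
```

Dropping the flags dependence between cmp and jz frees the whole toy window; over real traces the paper measures a more modest but still noticeable average gain of 12.77%.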
10. Conclusions and future work
This study reveals that there is a great locality in the frequency of use of registers. This characteristic of the x86 instruction set causes, as a first consequence, a potential increase in data dependences which, out of sheer probability, will turn into true dependences in some cases. These true dependences will impose a stricter ordering of the instructions in the code which will not be avoidable at run time, and the consequence will be worse performance in a superscalar environment.
A second consequence is that already pointed out by Adams and Zimmerman: the lack of storage locations causes an increase in the number of MOV, PUSH and POP instructions with respect to typical RISC processors [1]. This effect feeds back on the previous one, since it causes an extra load of address arithmetic, again with very localised operands, and stack operations with long dependence chains (see [2,11] and [16]). Besides, memory transfers cause data dependence graphs to be more complex, since memory is usually considered to be a single resource and the most conservative case is assumed. In this respect, memory disambiguation techniques are complex and not always applicable.
Counts of register use have been presented in three groups: those for implicit use (use in connection with a specific operation), explicit use and state register use. The third group, besides locality, shows how large the number of writes without computational meaning is. Output dependences are produced which are unnecessary from the perspective of the task being compiled. Although these dependences can be eliminated at run time, this still has a hardware cost that is not to be neglected, in terms of silicon, time and power consumption.
Quantification based on DDGs has been used in order to measure the degree of parallelism enclosed in the code of the traces of the benchmark. This quantification has the advantage of being independent of the processor architecture, reflecting instead, in an almost exclusive manner, the characteristics of the ISA. Besides, it does not require the use of complex simulators, and it provides a mathematical formalization which is ideal for quantitative analysis.
The average degree of parallelism has been calculated for each trace with different instruction window sizes, and the highest achievable degree has been found. Results have proven consistent with those obtained by other researchers with different techniques.

The distribution of parallelism for each trace has been presented, showing its irregularity and providing an estimate of the benefit which can be derived from functional multiplicity.
Finally, the effect of condition codes has been measured. These results show that the effect of state writes is noticeable and especially important for some of the programs considered.
The asymptotic behaviour of the degree of parallelism as window size increases shows that the potential parallelism at the code layer is bounded. As our quantification is independent of physical implementation, that limited parallelism must come from just two possible sources: inherited from the language layer or imposed by the instruction set architecture. There are several reasons to think that the second is the most likely: the previous work reported in Section 4, the analysis of operand use carried out by us and detailed in Section 6, the histograms of the degree of parallelism, and the quantification of the condition code impact (see Section 9).
The experience acquired with the proposed measurement technique, and its validation, will allow a deeper study of the x86 ISA by building DDGs which differentiate the sources of data dependences, with the aim of locating the most important sinks of parallelism.
Acknowledgement
The authors would like to thank Antonio Gonzalez and Jose Gonzalez, from the Intel Laboratory in Barcelona, for their suggestions, and especially their colleague Francisco Tirado from UCM, whose indications represented a valuable help during the elaboration of this work.
References
[1] T.L. Adams, R.E. Zimmerman, An analysis of 8086
instruction set usage in MS DOS programs, in: Proceedings
of the Third International Conference on Architectural
Support for Programming Languages and Operating
Systems (ASPLOS-III), April 1989, pp. 152–160.
[2] T.M. Austin, G.S. Sohi, Dynamic dependency analysis of
ordinary programs, in: Proceedings of the 19th Interna-
tional Symposium on Computer Architecture, 1992, pp.
342–351.
[3] D. Bhandarkar, J. Ding, Performance characterization of
the Pentium Pro processor, in: Proceedings of the Third
International Symposium on High-Performance Computer
Architecture, 1997, pp. 288–297.
[4] Compaq, Alpha Architecture Handbook, Order number: EC-QD2KC-TE, October 1998. Available from: <http://gatekeeper.dec.com/pub/Digital/info/semiconductor/literature/dsc-library.html>.
[5] S. Gochman, et al., The Intel Pentium M processor: microarchitecture and performance, Intel Technology Journal 7 (2) (2003) 21–36. Available from: <http://developer.intel.com/technology/itj/>.
[6] I.J. Huang, T.C. Peng, Analysis of x86 instruction set
usage for DOS/Windows applications and its implication
on superscalar design, IEICE Transactions on Information
and Systems E85-D (6) (2002) 929–939 (SCI).
[7] I.J. Huang, P.H. Xie, Application of instruction analysis/
scheduling techniques to resource allocation of superscalar
processors, IEEE Transactions on VLSI Systems 10 (1)
(2002) 44–54.
[8] N.P. Jouppi, D.W. Wall, Available instruction-level paral-
lelism for superscalar and superpipelined machines, in:
Proceedings of the Third International Conference on
Architectural Support for Programming Languages and
Operating Systems, April 1989, pp. 272–282.
[9] M. Kumar, Measuring parallelism in computation inten-
sive scientific/engineering applications, IEEE Transactions
on Computers 37 (9) (1988) 1088–1098.
[10] O. Mutlu, J. Stark, Ch. Wilkerson, Y.N. Patt, Runahead
execution: an alternative to very large instruction windows
for out-of-order processors, in: Proceedings of the 9th
International Symposium on High-Performance Computer
Architecture (HPCA'03), 2003, pp. 129–140.
[11] M.A. Postiff, D.A. Greene, G.S. Tyson, T.N. Mudge, The
limits of instruction level parallelism in SPEC95 applica-
tions, in: Proceedings of the 3rd Workshop on Interaction
Between Compilers and Computer Architecture, 1998.
[12] T. Potter, M. Vaden, J. Young, N. Ullah, Resolution of
data and control-flow dependencies in the PowerPC 601,
IEEE Micro (1994) 18–29.
[13] R. Rico, On applying graph theory to ILP analysis/
scheduling, Technical Note UAH-AUT-GAP-2003-01.
Available from: <http://atc2.aut.uah.es/~gap/>.
[14] J.E. Smith, G.S. Sohi, The microarchitecture of superscalar
processors, Proceedings of the IEEE 83 (12) (1995) 1609–
1624.
[15] D. Stefanovic, M. Martonosi, Limits and graph structure
of available instruction-level parallelism, in: Proceedings of
the European Conference on Parallel Computing (Euro-
Par 2000), 2000.
[16] K.B. Theobald, G.R. Gao, L.J. Hendren, On the limits of
program parallelism and its smoothability, in: Proceedings
of the 25th Annual International Symposium on Microar-
chitecture, 1992, pp. 10–19.
[17] D.M. Tullsen, S.J. Eggers, H.M. Levy, Simultaneous
multithreading: maximizing on-chip parallelism, in: Pro-
ceedings of the 22nd Annual International Symposium on
Computer Architecture, 1995, pp. 392–403.
[18] D.W. Wall, Limits of instruction-level parallelism, in:
Proceedings of the Fourth International Conference on
Architectural Support for Programming Languages and
Operating Systems, 1991, pp. 176–188.
Rafael Rico received the B.S. in Physics from the Universidad Complutense de Madrid, Spain, in 1988. He has been assistant professor in Computer Architecture with the Department of Computer Engineering at the University of Alcala since 1998. His research interests include microprocessors, parallel architectures, instruction level parallelism, and VHDL modeling.

Juan-Ignacio Perez received the B.S. in Physics from the Universidad Complutense de Madrid, Spain, in 1993. He has been assistant professor with the Department of Computer Engineering of the Universidad de Alcala since 2002. He has participated in several projects of the University of Alcala about instruction level parallelism. He is currently working on scheduling algorithms.

Jose-Antonio Frutos received the B.S. in Physics from the Universidad Complutense de Madrid, Spain, in 1982, and the Ph.D. from the Universidad de Alcala, Spain, in 1998. He has been assistant professor with the Department of Computer Engineering of the Universidad de Alcala since 1991. He participated in several projects of the University of Alcala about instruction level parallelism. He has been the author of several publications in conference proceedings and journals and holds a patent for a distributed computer control system. His research is related to parallel computer architecture and to applied automatic control and simulation. He is currently working as principal researcher for the University of Alcala in the European Commission project SmartFuel (Third Generation Digital Fluid Management System).