
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 10, OCTOBER 2011 1861

On the Use of Simple Electrical Circuit Techniques for Performance Modeling and Optimization in VLSI Systems

G. Hazari and H. Narayanan

Abstract—Leading-edge VLSI systems, essentially multiprocessor systems-on-chip, have a wide range of components integrated together and operating in unison. They can be analyzed as flow networks in which the system performance depends on the bandwidth, transmission time, and queueing delay characteristics of the individual components, their connectivity and interactions, as well as the traffic patterns they encounter. The flow in various parts of the system must ideally be distributed so as to extract the maximum throughput possible with minimum end-to-end delays. Such an ideal distribution for flow networks has previously been obtained using simple electrical circuits. We demonstrate a similar methodology for typical VLSI systems and provide the necessary extensions of the theory. We empirically validate the methodology using a cycle-accurate simulation model as the reference. We find this methodology to supply better distributions in the average case and comparable distributions in the worst case as compared to standard search procedures such as random sampling and simulated annealing. The real strength is that it provides a speedup of several orders of magnitude, i.e., 3–5 orders in our experiments. Thus it is an elegant means for analyzing and optimizing the flow in VLSI systems, which can easily be incorporated into design procedures, compilers, and on-chip modules for real-time allocations.

Index Terms—Analytical models, circuits, system performance.

I. INTRODUCTION

VLSI systems for high performance applications such as networking, communications, servers, multimedia, and gaming, among others, are nowadays being built as multiprocessor systems-on-chip (SoC) [2]–[5]. Even general purpose computers are being built with multiple processor cores [6] and have similar levels of complexity. These systems integrate a large number of processors, memories, input/output interfaces, interconnects, and application specific hardware onto the same chip. In order to meet the performance demands, the architecture contains multiple instantiations of each of the components, which are made to operate in a highly parallel and pipelined manner.

Such systems can be visualized as shown in Fig. 1. The number of processors is expected to increase manifold in the near future [6]. The memories include register files, caches,

Manuscript received January 14, 2010; revised April 23, 2010; accepted July 07, 2010. Date of publication August 23, 2010; date of current version August 10, 2011. This work was supported by the sponsored project "VLSI Consortium at I.I.T. Bombay".

The authors are with the Department of Electrical Engineering, I.I.T. Bombay, Bombay 400076, India (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TVLSI.2010.2060502

Fig. 1. Visualization of multiprocessor SoC.

scratch pads, SRAMs, and DRAMs. As the number of components is increasing, the interconnect sub-systems are evolving towards elaborate networks with their own protocols [7]–[10]. Reconfigurable topologies are also available [8]–[10].

There is a tremendous performance demand on the memory and interconnect sub-systems [4]–[6], [8]–[10]. They are known to be performance bottlenecks in terms of both their bandwidth, i.e., the maximum service rate supported, and the access or transmission delays [5], [6], [8]–[11]. The situation is expected to worsen since memory performance is not scaling as fast as that of the processors [5], [6], [11]. The memory and interconnect sub-systems are often designed together while trading off performance against chip area and energy consumption [12]–[15].

The design process requires performance evaluation tools and optimization procedures. The most popular practice is to use cycle-accurate simulators for the evaluation and exploratory procedures for the optimization [9], [10], [12], [13].

We propose an alternative strategy for certain steps in the design process, wherein we view the system as a flow network. We first separate out the components that generate activity and those that simply respond to the activity generated by others. The network is composed of the second category, and we refer to the first category as its generators. The memories and interconnects go into the network. The hardware units typically go into the network. In some cases they may also generate activity, but we neglect such cases for the present discussion. The generators are primarily the processors and input/output interfaces. Thus we visualize the system as shown in Fig. 2.

The traffic is of various types, which includes memory accesses, application data that is transferred between processors or between a processor and an interface, as well as information pertaining to the management protocols in the system. Each traffic unit follows a particular path through the network which is defined by its source and destination. In the case of a memory access, the destination is a particular memory and either the data

1063-8210/$26.00 © 2010 IEEE


Fig. 2. Visualizing the system as a flow network.

Fig. 3. Example of a flow network.

or control information is returned to the source. The choice of memory depends on the data object accessed and the memory allocation, which assigns each data object to a physical memory. If there are multiple paths between a source-destination pair, the path allocation determines which one is taken. It may either be fixed before entry or decided along the way by considering the system state. Both the memory and path allocations have an impact on the traffic rate and delays through the network.

In general, the throughput of the network, i.e., the traffic rate supported, as well as the transmission delays depend on the network architecture, traffic patterns, and the allocations. The architecture places an upper bound on the throughput and a lower bound on the delay for every path. For example, the total bandwidth across the memories is an upper bound for the rate at which memory accesses can be serviced, and the total bandwidth of the interconnects is an upper bound for the rate across all traffic. The connectivity constraints reduce the bounds further. For example, when many memories are connected to a single interconnect, the access rate may get limited by the interconnect. The actual throughput may still be lower than the upper bound depending on the application design, allocations [14], [15], and management protocols. This typically happens when the traffic flow gets concentrated only in certain parts of the network. Similarly, the latency or minimum delay along any path depends only on the architecture parameters. The actual delays also include the time spent waiting in queues for busy resources to become available. Once again, the application design and allocations have a significant impact on the traffic patterns and thus the actual delays [8], [9].
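The architectural throughput bound described above is exactly the maximum-flow value of the corresponding network, with connectivity constraints appearing as bottleneck cuts. As a minimal illustration (the topology and capacities below are invented for this sketch, not taken from the paper), a breadth-first-search based Edmonds–Karp routine recovers the bound for a toy system in which two processors reach two memories through one shared interconnect:

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp: repeatedly push flow along shortest augmenting paths."""
    res = {u: dict(d) for u, d in cap.items()}  # residual capacities
    total = 0
    while True:
        # BFS for an augmenting path from s to t in the residual graph.
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in res.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return total
        # Trace the path back, find its bottleneck, update residuals.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= push
            res.setdefault(v, {}).setdefault(u, 0)
            res[v][u] += push
        total += push

# Toy system: two processors share one interconnect (bandwidth 5)
# to reach two memories (bandwidth 4 each).
cap = {
    "src": {"p0": 10, "p1": 10},
    "p0": {"bus_in": 10}, "p1": {"bus_in": 10},
    "bus_in": {"bus_out": 5},
    "bus_out": {"m0": 4, "m1": 4},
    "m0": {"sink": 4}, "m1": {"sink": 4},
}
print(max_flow(cap, "src", "sink"))  # the shared interconnect caps throughput at 5
```

Even though the memories together offer a bandwidth of 8, the single interconnect limits the achievable rate to 5, matching the connectivity argument above.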

In order to estimate the throughput and delays for a particular system, it can be modeled as a flow network [16]. An example is shown in Fig. 3. Flow networks are built up of nodes and branches. Each branch connects two nodes and has the following characteristics: the maximum flow permitted and the cost incurred as a function of the flow. Two types of optimization problems are generally solved for flow networks: 1) the maximum flow problem, which determines the maximum flow possible through a given network, and 2) the minimum cost flow problem, which determines the flow distribution that minimizes the total cost in the network when a specified amount of flow is injected into it. When the cost is a linear or quadratic function of the flow, the minimum cost flow problem can be solved using linear or quadratic programming, respectively [1].
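For the quadratic case, the optimality condition is simply that marginal costs are equalized across the branches that carry flow. A minimal sketch for two parallel branches with costs w_j(f) = a_j·f + b_j·f² (the function name and all parameter values are ours, invented for illustration):

```python
def split_two_branch(total, a1, b1, a2, b2):
    """Min-cost split of `total` flow across two parallel branches with
    costs w_j(f) = a_j*f + b_j*f**2 (b_j > 0).
    Interior optimum: equal marginal costs a_j + 2*b_j*f_j."""
    # Solve a1 + 2*b1*f1 = a2 + 2*b2*(total - f1) for f1.
    f1 = (a2 - a1 + 2 * b2 * total) / (2 * (b1 + b2))
    f1 = min(max(f1, 0.0), total)   # clamp if one branch is too expensive
    return f1, total - f1

f1, f2 = split_two_branch(10.0, a1=1.0, b1=0.5, a2=3.0, b2=0.5)
# Marginal costs match at the optimum: 1 + 2*0.5*f1 == 3 + 2*0.5*f2
```

The clamp handles the boundary case where the cheaper branch should carry everything; general networks need a proper quadratic program, but the equal-marginal-cost picture is the same.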

In [1], flow networks are converted to electrical circuits to solve these two problems. The power of this approach is that electrical devices can cover arbitrary cost functions as well. Thus we propose that the same technique be applied to VLSI systems, with data transfer rates being equated to flows and the delays being equated to the cost. We argue that the queueing delay functions are unlikely to be either linear or quadratic, since they are not for the M/M/1, M/G/1, and M/D/1 queues [17]. This transformation is convenient since there are a number of software packages available, both commercially and freely, for solving electrical circuits. We use a dc circuit solver developed in-house [35] for our purposes. Other solvers such as SPICE may be used just as well. However, with commercial solvers, the range of queueing delay functions that can be modeled is restricted by the electrical devices available. If the source code is accessible, additional devices having the required characteristics may be added. Since the networks are being solved only in software, hypothetical devices having arbitrary characteristics can also be used. It is not essential for a physical device with the same characteristics to exist.
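The claim about queueing delays is easy to check for the M/M/1 queue, whose mean sojourn time is the standard textbook expression 1/(μ − λ) for arrival rate λ and service rate μ (a known result, not derived in this paper; the function name and rates below are ours):

```python
def mm1_sojourn(lam, mu):
    """Mean time in system for an M/M/1 queue: 1 / (mu - lam)."""
    assert 0 <= lam < mu, "queue is unstable when lam >= mu"
    return 1.0 / (mu - lam)

mu = 10.0
delays = [mm1_sojourn(lam, mu) for lam in (0.0, 5.0, 9.0, 9.9)]
# delays ≈ 0.1, 0.2, 1.0, 10: the delay blows up near saturation,
# far faster than any linear or quadratic fit through the light-load points.
```

This hyperbolic blow-up near saturation is exactly the kind of characteristic that a fixed menu of linear/quadratic devices cannot represent, motivating custom or hypothetical devices in the solver.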

On mapping the flows to currents and the cost functions to voltage drops [1], an electrical circuit can be constructed to find the maximum flow or the minimum cost flow distribution in such networks. The branch currents automatically distribute themselves so as to attain the throughput upper bound and minimize the mean delay. This distribution indicates how the traffic in the original system should ideally be distributed. We shall refer to the circuit that models the flow network for a particular VLSI system as the electrical flow-delay model, or the flow-delay model in short.

An interesting aspect of this approach is that a simpler electrical circuit models a more complicated one. The currents and voltages in the model have no relation to those in the actual circuit; rather, they are related to the flow rates and delays. Since electrical network analysis is a well established field, there are optimized procedures for solving the circuits. In principle, we can directly apply the required procedures to the design of VLSI systems. The visualization as an electrical circuit is not absolutely essential, but there is an advantage in linking the two domains: first, we get a better understanding of the analysis and procedures, and second, as the techniques and software continue to evolve they can be applied directly. Thus, if empirical studies can confirm a correspondence between the two domains, then there are significant benefits in modeling VLSI system performance using simple electrical circuits. The electrical flow-delay model can then be integrated into the design flow for VLSI systems.

The flow-delay model can replace the simulation models and exploratory procedures in certain steps of the design flow. It can be solved much faster and can potentially lead to better designs within shorter time budgets. However, we need to understand the approximations it makes, which are essentially as follows.


1) The flow rates and delays are modeled as continuous quantities, whereas both of them are discrete in VLSI systems. For example, each memory access or data packet is a single unit that travels together. The circuit currents do not capture this effect. Second, VLSI chips are synchronous systems in which all delays are multiples of the system clock. Simple electrical devices do not have discrete current-voltage characteristics.

2) Non-deterministic delays, such as those encountered in caches and DRAMs with row-buffering, must be modeled either as their average values or separately for all cases. Such approximations can potentially introduce large inaccuracies with respect to the underlying system.

In addition, the approach has two fundamental limitations: 1) it requires that the queueing delay functions be characterized in advance, and 2) it can only give the ideal distribution. Additional procedures are required while optimizing the memory and path allocations. Constraints arise because each data object must reside in a unique physical memory; otherwise, the overheads incurred to maintain consistency must also be taken into account. For some applications, the ideal distribution may not actually be realizable. The simplest example of such a situation is when the proportion of accesses to one data object is larger than the proportion of flow through each of the memories. Further, the access patterns may be bursty, due to which the allocation which comes closest to the ideal distribution may not always be the best one.

The above issues must be kept in mind when integrating the flow-delay model into the design methodology. It is well suited for the following steps in the design process. First, it can evaluate the maximum throughput possible during architecture exploration and application design. Since the flow-delay model offers a substantial speedup over simulations, a larger number of candidates can be evaluated within similar time budgets. However, due to the approximations, one should not rely solely on this model when the differences in performance predictions are small. Second, it provides the ideal distribution for the memory and path allocations. When these are done at compile time, there is sufficient time to employ intensive search procedures such as simulated annealing and others. These procedures will almost always find better solutions, but the flow-delay model can potentially improve their speed by providing a good initial allocation. The speedup provided by the flow-delay model can make a significant contribution when the allocations are done in real-time, which becomes relevant when there are reconfigurable interconnects [18]–[20] or when the traffic patterns change significantly with the inputs. We note that when used in real-time, the flow-delay model would have to be solved on the chip itself, but the systems in use today do have processors that are capable of doing so. The experiments in this study are designed to gauge such contributions in the design process.

The objective of this paper is to develop the modeling methodology and conduct empirical studies to validate it. The organization is as follows. We compare this methodology against existing analytical modeling strategies in Section II. We develop the methodology in Section III and also extend the ideas in [1] to cover arbitrary queueing delay functions. In Section IV, we model the Intel IXP2800 architecture [3] and conduct a preliminary validation. In Section V, we conduct a more thorough empirical validation, including a comparison with random sampling and simulated annealing. We find that the flow-delay model indeed provides good distributions along with a substantial speedup.

II. POPULAR ANALYTICAL MODELING TECHNIQUES

The flow-delay model provides an analytical technique for modeling and optimizing performance in large distributed systems. In this section, we look at other analytical modeling techniques and discuss how they compare with it.

The most popular technique is queueing theory, which was conceived for manufacturing processes [21]–[24] and later applied to VLSI systems [25]. Queueing theory is a good means for modeling service time and waiting time distributions for components with queues, and then determining the sojourn time (or transmission delay) distributions for a network of components and queues. It is well suited for analyzing a given flow distribution, and for comparing dynamic allocation or scheduling policies when the path taken by each job is not fixed a priori. However, when it comes to finding the optimal flow distribution for a static allocation, additional algorithms and heuristics are required that are not well suited for highly distributed systems [22], [23], [25]. The optimization is far simpler for systems with up to two components or when all components have identical service time distributions [22]. The optimal distribution can be obtained by analytical means for certain networks [26]; however, the number of such cases is very limited. The advantage of the flow-delay model is that it has a naturally associated procedure for finding the optimal static flow distribution. Further, queueing theory often assumes Poisson arrivals to enable the analyses [22]. The analysis becomes difficult when the arrival traffic stresses the network and further becomes infeasible for general networks. In contrast, the flow-delay model easily captures stressed arrivals and its solution procedure is not limited by the complexity of the network.

Queueing theory has wider application than the flow-delay model in terms of, first, modeling the distributions for throughputs and delays rather than just the mean values, second, being able to analyze any given flow distribution and a variety of allocation or scheduling policies, and third, providing more information such as expected queue lengths. The flow-delay model can capture only a limited set of performance aspects and is well suited for exactly two purposes: obtaining the maximum possible network throughput and a good flow distribution; however, it has inherent advantages for each of these. The two techniques can also be used in conjunction, since the flow-delay model requires the delay versus flow characteristics of individual components, which can be obtained conveniently through queueing theory.

Another pair of similar techniques is real-time calculus [27], [28] and network calculus [29]. They provide an elegant means for modeling the throughput and delays, in terms of lower and upper bounds for individual components, and then determining the same for a network. However, once again there is no elegant procedure for optimizing the flow distribution. The advantages and disadvantages are similar to those with queueing theory.


Fig. 4. Current limiting circuit.

Recent research has also used concepts from thermodynamics and statistical physics to model complex VLSI systems [30]. These techniques are well suited for evaluating a particular design with a specified flow distribution, and then determining the queue capacities required at various points in the system. Optimizing the flow distribution remains a difficult task.

Large amounts of research have been directed towards simplifying the performance evaluation for specific systems into analytical models [30], [31]. The subsequent optimization is done either using intensive search algorithms or heuristics, such as greedy ones, simulated annealing, and genetic algorithms [30], or using linear programming, integer linear programming, or quadratic programming wherever applicable [30], [32]. It is infeasible for search algorithms or heuristics to cover the entire design space, which is typically very large due to its combinatorial nature [29]. In this regard, the second class of models is better; however, they are likely to be less accurate. Also, the possibility of expressing the performance metric as a linear or quadratic function of the design variables exists in a limited number of cases. Note that when all queueing delays are linear or quadratic functions of the flow rate, the flow-delay model can also be solved using linear or quadratic programming. Thus, under these conditions, it can be considered to be in that class of analytical techniques.

To summarize this comparison, the flow-delay model can capture a very limited set of performance features as against other techniques; however, it has an optimization procedure associated with it which makes it a stronger candidate for exactly two steps in the design process. In the next section we proceed with developing the flow-delay modeling technique.

III. FLOW-DELAY MODELING METHODOLOGY

The basic premise is that we are going to convert data flows in VLSI systems to electrical currents, and delays to voltage drops. We use simple electrical circuits to model the system components and then connect them together as in the architecture. We organize this section as follows. In Section III-A, we present the circuit models for individual components along with the formulation for generalized delay versus flow functions. In Section III-B, we describe the network construction.

A. Component Models

The models must essentially capture two features: the maximum flow rate and the relationship between delay and flow. We model the maximum rate using a current source in parallel with an ideal diode (which carries current only in the forward direction, with zero voltage drop when conducting), as shown in Fig. 4. The diode ensures that the

Fig. 5. Circuit model for a general component.

Fig. 6. Component with no queueing delay.

Fig. 7. Component with a linear queueing delay function.

current flows in the forward direction. Thus the current i through the component satisfies 0 ≤ i ≤ r, where r is the value of the current source.

We model the delay function by adding an appropriate device as indicated in Fig. 5. The simplest function is a constant one, which can be modeled as a voltage source. The next in the progression is a linear function, which can be modeled as a voltage source in series with a resistor. The device characteristics must be selected keeping the following point in mind. The current distribution in the resulting network is governed by Kirchhoff's laws, and satisfies the following property [1]: it minimizes the sum of the power in the voltage sources and half the power in the resistors.

Since we intend to minimize the mean delay through the network, we select the values of the voltage sources to be equal to the constant component of the delay and the values of the resistors to be twice the slope of the delay function. We elaborate on these two cases next.

Consider a component with a throughput or flow limit, a latency which is the minimum delay through it, and no additional queueing delay. This component is modeled as shown in Fig. 6. The value of the current source is selected to be equal to the throughput limit and that of the voltage source equal to the minimum delay. Note that the actual current through this circuit can be less than the source value and the voltage drop can be greater than the source voltage.

Next consider a component having a queueing delay that increases linearly with the flow. This is modeled as in Fig. 7. The resistor value corresponds to a queueing delay of 0 at zero flow, rising linearly to its maximum at the flow limit.
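These two component models can be captured in a few lines. The resistor is set to twice the slope s of the queueing-delay function because Kirchhoff's laws minimize E·i + (1/2)·R·i², which with R = 2s equals the delay–flow product (E + s·i)·i. A small sketch (the class name and all numbers are ours, invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Component:
    rate_limit: float   # maximum flow (current source value)
    latency: float      # constant delay (voltage source value)
    slope: float = 0.0  # slope s of the linear queueing-delay term

    def delay(self, flow):
        """Mean delay seen by traffic at the given flow rate."""
        assert 0 <= flow <= self.rate_limit
        return self.latency + self.slope * flow

    def resistor(self):
        """Resistor in the electrical model: twice the delay slope,
        so that E*i + 0.5*R*i**2 equals delay(i) * i."""
        return 2.0 * self.slope

mem = Component(rate_limit=4.0, latency=10.0, slope=1.5)
i = 2.0
circuit_cost = mem.latency * i + 0.5 * mem.resistor() * i**2
assert circuit_cost == mem.delay(i) * i   # the minimized quantity is delay*flow
```

The factor of two is therefore not a modeling fudge: it is exactly what makes the circuit's minimized quantity coincide with the total delay experienced by the traffic.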

Now let us illustrate a system with two components operating in parallel. If both of them have constant delay functions, the network is as shown in Fig. 8. Let the current entering be i, which splits up as i1 and i2, with current-source values r1 and r2 and constant delays E1 and E2. Let us assume E1 < E2; then the current distribution is as follows.


Fig. 8. Current distribution: Example-1.

Fig. 9. Current distribution: Example-2.

• If i ≤ r1, then i1 = i and i2 = 0; the diode in the second branch is reverse biased, which carries the voltage drop required for Kirchhoff's voltage law to be satisfied.

• If r1 < i < r1 + r2, then i1 = r1 and i2 = i − r1, and the diode in the first branch is reverse biased.

• If i = r1 + r2, then i1 = r1 and i2 = r2, while the diodes in both branches are reverse biased.

It is straightforward to see that this distribution minimizes the power in the voltage sources.
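The case analysis above is just "fill the cheaper branch first". A sketch of that rule, generalized greedily from the two-branch case in the text to any number of parallel constant-delay branches (the function name and numbers are ours):

```python
def split_constant_delay(i, branches):
    """Distribute total flow i across parallel branches with constant
    delays, cheapest-first (the distribution Kirchhoff's laws select
    in the two-branch example above).
    branches: list of (rate_limit, delay); returns per-branch flows."""
    assert i <= sum(r for r, _ in branches), "demand exceeds total capacity"
    flows = [0.0] * len(branches)
    # Sort branch indices by delay, then saturate them in order.
    for k in sorted(range(len(branches)), key=lambda k: branches[k][1]):
        take = min(i, branches[k][0])
        flows[k] = take
        i -= take
    return flows

branches = [(3.0, 1.0), (5.0, 4.0)]          # (r, E) pairs
assert split_constant_delay(2.0, branches) == [2.0, 0.0]
assert split_constant_delay(6.0, branches) == [3.0, 3.0]
# total cost = sum(E_k * i_k) is minimized by this greedy order
```

The three bullet cases correspond to the greedy loop stopping before, inside, or exactly at the saturation of the cheaper branch.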

If both components have linear delay functions, the network is as shown in Fig. 9. Assume again E1 < E2. When i is small, i1 = i and i2 = 0, which holds until E1 + R1 i reaches E2. As i increases beyond this value, there is a range of values where both i1 > 0 and i2 > 0. Then comes a point beyond which either i1 = r1 or i2 = r2. If i1 = r1 is reached first, then i2 = i − r1 from that point on. The situation in the converse condition is similar.

Let us look at the intermediate range more closely. Here, Kirchhoff's voltage law gives E1 + R1 i1 = E2 + R2 i2. Let W denote the sum of the power in the voltage sources and half the power in the resistors. Then, with i2 = i − i1,

W = E1 i1 + E2 (i − i1) + (1/2)[R1 i1² + R2 (i − i1)²]    (1)

dW/di1 = E1 − E2 + R1 i1 − R2 (i − i1)    (2)

d²W/di1² = R1 + R2 > 0    (3)

Thus dW/di1 = 0 is equivalent to solving Kirchhoff's voltage law, and since d²W/di1² > 0 it is a local minimum. This example clearly demonstrates the minimization property of Kirchhoff's laws, for both voltage sources and resistors.
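The minimization property in this example can also be checked numerically: brute-force minimizing the sum of source power plus half the resistor power lands on the equal-drop (KVL) split. The element values below are invented for this sketch:

```python
def kirchhoff_split(i, E1, R1, E2, R2):
    """Interior solution of E1 + R1*i1 = E2 + R2*(i - i1)."""
    return (E2 - E1 + R2 * i) / (R1 + R2)

def W(i1, i, E1, R1, E2, R2):
    """Power in the voltage sources plus half the power in the resistors."""
    i2 = i - i1
    return E1 * i1 + E2 * i2 + 0.5 * (R1 * i1**2 + R2 * i2**2)

i, E1, R1, E2, R2 = 4.0, 1.0, 2.0, 2.0, 2.0
best = min((W(k / 1000.0 * i, i, E1, R1, E2, R2), k / 1000.0 * i)
           for k in range(1001))[1]
# Brute-force minimizer of W lands on the KVL split (to grid resolution).
assert abs(best - kirchhoff_split(i, E1, R1, E2, R2)) < i / 1000.0
```

The same check works for any convex delay functions, which is precisely why the construction extends beyond linear devices in the remainder of this section.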

The focus of the rest of this section is to extend the property to general devices so that we can model any arbitrary delay-flow function. Consider an electrical network with n nodes and b branches. Let i be the vector of branch currents and v the vector of branch voltages. Let A be the incidence matrix and B be the fundamental circuit matrix.¹ Then Kirchhoff's laws can be expressed as

Kirchhoff's Current Law: A i = 0    (4)

Kirchhoff's Voltage Law: B v = 0    (5)

The currents in the network can also be expressed in terms of the current circulating in each loop of the fundamental circuit matrix. Let this vector of loop currents be i_L. The following equation relates the branch and loop currents:

i = B^T i_L    (6)

where B^T is the transpose of the fundamental circuit matrix.² Thus the rows of B form a basis for the vector space of branch currents that satisfy Kirchhoff's current law.

Let us separate the branches containing current sources from those that do not. Let the number of branches without current sources be b₂. Let us split the incidence matrix vertically into A1 and A2 to correspond to the current sources and the other branches, respectively. Similarly, let us split the current vector into i1 and i2. Then Kirchhoff's current law gives

A1 i1 + A2 i2 = 0    (7)

Let the current and voltage vectors that satisfy Kirchhoff's laws be i* and v*. Let i* be split as i1* and i2*. Let us assume that Kirchhoff's laws minimize a quantity W that can be expressed as

W(i) = Σ_k W_k    (8)

where the sum runs over the branches and W_k is a property of the device along the respective branch. In general W_k can depend on the entire current vector; however, in our case it depends only on i_k. Thus W(i) = Σ_k W_k(i_k). Our objective for the rest of this section is to derive W_k in terms of the current-voltage characteristics of the devices.

Since $\hat{i}$ minimizes $P$, $P(i) \geq P(\hat{i})$ for every current vector $i$ satisfying Kirchhoff's laws, which gives

$P(i) - P(\hat{i}) \geq 0$  (9)

$P(i)$ can be approximated using Taylor's expansion to give the following:

$P(i) \approx P(\hat{i}) + (i - \hat{i})^T \nabla P(\hat{i}) + \frac{1}{2}(i - \hat{i})^T J (i - \hat{i})$  (10)

where $J = \partial(\nabla P)/\partial i$ is the Jacobian matrix. It is a square matrix of size $b \times b$ and the element in row $j$, column $k$ is $\partial^2 P / \partial i_j \partial i_k$. Since $p_k$ depends only on $i_k$, $J$ becomes a diagonal matrix.

¹An introduction to electrical network theory, if required, can be found in [33].

²We use the notation $(\cdot)^T$ to denote the transpose of vectors or matrices throughout this section.


In the above equation we can neglect the second-order term, subtract the right hand side and reorganize the transposed vectors, to get

$(i - \hat{i})^T \nabla P(\hat{i}) = 0$  (11)

Let us now apply Kirchhoff's laws to the above expression. Since $\hat{i}$ satisfies Kirchhoff's current law

$A_s \hat{i}_s + A_r \hat{i}_r = 0$  (12)

Let the loop current vector be $i_L$, then

$i = \hat{i} + B^T i_L$  (13)

Note that the branch currents $i$ satisfy Kirchhoff's current law for all choices of $i_L$. We can replace the above relationship in (11) and reorganize the transposed matrices to get

$i_L^T B \nabla P(\hat{i}) = 0$  (14)

Since $i_L$ can be chosen arbitrarily, the following must hold:

$B \nabla P(\hat{i}) = 0$  (15)

The vector $\nabla P(\hat{i})$ therefore satisfies Kirchhoff's voltage law, as given in (5). Thus we get the following relationship:

$\nabla P(\hat{i}) = \hat{v}$  (16)

Now the elements of $\nabla P$ are $dp_k/di_k$ and the elements of $\hat{v}$ are $\hat{v}_k$. $p_k$ and $\hat{v}_k$ are functions of $i_k$, which is the corresponding element of $\hat{i}$. They can thus be related as follows:

$p_k(i_k) = \int_0^{i_k} v_k(x)\, dx$  (17)

This expression specifies how the circuit models should be developed since $P$ corresponds to the quantity to be minimized in the original system. Thus if $v_k$ corresponds to a constant $V_k$, $p_k = V_k i_k$, and if $v_k$ is a linear function $R_k i_k$, $p_k = \frac{1}{2} R_k i_k^2$. If all components have linear delay versus flow functions, the $d^2 p_k/di_k^2$ terms are constant and the current distribution in the resulting network can be obtained by solving a system of linear equations. If there are nonlinear functions, iterative techniques are required. In the next section, we describe how these component models are to be connected together to build the network.
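Relationship (17) extends the minimization check to nonlinear devices: each device contributes the integral of its voltage-current characteristic. The sketch below pairs a constant-delay device with a device having an illustrative quadratic characteristic $v(i) = a i^2$ (both parameter choices are ours, purely for demonstration):

```python
# By (17), p_k(i_k) = integral_0^{i_k} v_k(x) dx. For two branches in
# parallel carrying total current I, Kirchhoff's laws should equate the
# branch voltages at the minimizer of P = p1 + p2.
a, V, I = 0.5, 2.0, 5.0     # illustrative device parameters

v1 = lambda i: V            # constant-delay device  -> p1 = V*i1
v2 = lambda i: a * i**2     # nonlinear device       -> p2 = a*i2**3/3

def P(i2):
    i1 = I - i2
    return V*i1 + a*i2**3/3

# Minimize the (convex) P(i2) on [0, I] by ternary search.
lo, hi = 0.0, I
for _ in range(200):
    m1, m2 = lo + (hi - lo)/3, hi - (hi - lo)/3
    if P(m1) < P(m2):
        hi = m2
    else:
        lo = m1
i2_star = (lo + hi) / 2

# KVL prediction: v1 = v2  =>  V = a*i2^2  =>  i2 = sqrt(V/a) = 2.0
print(i2_star)
```

The numerically found minimizer equates the two device voltages, which is exactly the Kirchhoff's-voltage-law condition the derivation above establishes.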

 B. Network Construction

The architecture of the VLSI system defines the connections

between the various components. The circuits for the compo-

nent models must be connected in exactly the same way. For

example, if the architecture is as shown in Fig. 10, the corresponding flow-delay model is as shown in Fig. 11. The currents

Fig. 10. Network construction example: System architecture.

Fig. 11. Network construction example: Flow-delay model.

Fig. 12. Network for maximum flow.

Fig. 13. Network for minimum cost flow.

at points where multiple paths merge together automatically add

up, thus adjusting the queueing delays. The network is then to

be solved in the following two steps.

• First solve the maximum flow problem by modeling the

generators as voltage sources and removing the compo-

nents that model delays in the network, as indicated inFig. 12. The currents flowing through each of the gener-

ator models indicate the maximum throughput supported

for each of them.

• Next solve the minimum cost flow problem by modeling

the generators as current sources with values obtained from

the maximum flow, as shown in Fig. 13.

A few details must be kept in mind while solving these two flow

problems. We discuss them in the following paragraphs.

The selection of voltage source values for the first step can

play a role. If all generators see a symmetric view of the net-

work, sources of identical values should be used. We can expect

the resulting current flow through them to be equal. All genera-

tors that are connected to the network at the same points can also be merged together and modeled as a single voltage source. If


HAZARI AND NARAYANAN: ON THE USE OF SIMPLE ELECTRICAL CIRCUIT TECHNIQUES FOR PERFORMANCE MODELING AND OPTIMIZATION 1867

Fig. 14. Example of current distributions guiding the application design and allocations.

the generators see an asymmetric view of the network, the rela-

tive values of the voltage sources can affect the current distribu-

tion across them. We have not studied such situations as yet. Our

empirical studies cover only symmetric systems. We note that

a majority of systems are likely to be symmetric, as is the case

with Intel's IXP Network Processor [3], Sun Microsystems' Niagara Processor [4], and IBM's Cell Multiprocessor [5].

The currents that flow through the voltage sources give the

maximum flow rate through the corresponding generator or set

of generators. In the second step, the generator voltage sources

must be replaced by current sources having values exactly equal

to the current values found in the first step. The delay modeling

components inside the network, i.e., the voltage sources and re-

sistors, must be brought back. Then the current through the re-

spective branches of the flow-delay model gives the desired flow

distribution for the original system.

As an example consider a system with the processors connected to memories through point-to-point interconnects. If

the system has two processors and two memories as shown in

Fig. 14, let us denote the currents flowing out from the proces-

sors as $I_1$ and $I_2$. Then the application should be parallelized such that the number of memory accesses generated by the processors is in the ratio $I_1 : I_2$. Further, let $I_1$ split up as $I_{11}$ and $I_{12}$ in the branches going towards the two memories. Then the memory allocation should be such that the number of accesses flowing through the respective connections is also in the ratio $I_{11} : I_{12}$.
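Turning the solved currents into allocation ratios is direct. A sketch with hypothetical current values in the spirit of Fig. 14 (the names and numbers below are ours, for illustration only):

```python
# Hypothetical branch currents from a solved flow-delay model:
# processor p emits a total current that splits as S[p][m] towards memory m.
S = {"P1": {"M1": 3.0, "M2": 1.0},
     "P2": {"M1": 1.0, "M2": 3.0}}

# Parallelize the application so processors generate accesses in this ratio.
I = {p: sum(splits.values()) for p, splits in S.items()}

# Allocate memory so each processor's accesses follow its split ratios.
alloc = {p: {m: s / I[p] for m, s in splits.items()}
         for p, splits in S.items()}

print(I)       # {'P1': 4.0, 'P2': 4.0}
print(alloc)   # P1 sends 75% of its accesses to M1 and 25% to M2, etc.
```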

Note that multiple distributions are also possible in such net-

works. In the above scenario, let the processors and memories be connected through a common bus which offers a throughput

bottleneck. Then if the memories are identical and are modeled

as having constant delays, the maximum flow is governed by

the bus and can be distributed across the memories in any ratio.

All such distributions will give the same mean delay in the flow

network and satisfy Kirchhoff’s laws in the flow-delay model.

However if the memory models have resistive components then

the solution is unique. In general if a majority of the components

do not have constant delay functions we can expect a unique so-

lution whereas if a majority have constant delay functions mul-

tiple solutions are likely to exist.

This completes the construction of the flow-delay model and

an understanding of how the current distribution can help in designing the system. In the next two sections we validate the

Fig. 15. Intel’s IXP 2800 network processor block diagram [3].

methodology first on the Intel IXP2800 system and then through

a more detailed study on a cycle accurate simulator.

IV. PRELIMINARY DEMONSTRATION

In this section, we conduct a preliminary study to demonstrate

that the results from the flow-delay model are consistent with re-

sults obtained through intuitive analyses of simple systems. We use the Intel IXP architecture [3] as the basis for this exercise.

The IXP architecture has a single bottleneck that can be identi-

fied intuitively.

We show the architecture in Fig. 15. The salient features are

as follows [3].

• There are 16 micro-engines arranged as 2 clusters of 8

each. Let us assume that the processors generate memory

accesses faster than they can be serviced.

• There are three kinds of memories: SRAMs, DRAMs, and

a scratchpad. The SRAMs are 4 in number, each running

at 200 MHz and providing 800 MB/s read as well as write

bandwidth. The DRAMs are 3 in number, each running at 533 MHz DDR and providing approximately 2.12 GB/s of

bandwidth shared between the reads and writes. There is a

single scratchpad running at 700 MHz, providing 2.8 GB/s

bandwidth.

• The interconnect sub-system consists of four buses. The

first two are connected to the SRAMs and scratchpad. The

third and fourth to the DRAMs. The first and third are con-

nected to the first cluster of processors. The second and

fourth being connected to the second cluster. Each bus has

two units, one for each direction, both of which operate at

700 MHz. Thus a bandwidth of 2.8 GB/s is provided in

each direction. All connections have identical latency.

We show the flow-delay model in Fig. 16. Since all the processors within a cluster are connected to the rest of the system


Fig. 16. Flow-delay model for IXP 2800.

in an identical manner, we model each cluster as a single gen-

erator. We neglect the queueing delays for this study. Thus we

model the component delays using only a voltage source as dis-

cussed earlier through Fig. 6.

We analyze the following situation. The processors gen-

erate an equitable mix of read and write accesses very fast,

and we need to determine the maximum rate supported.

Now the total bandwidth or throughput provided by the respective sub-systems is as follows. The memories provide

approximately 15.7 GB/s3 while the buses provide 11.2 GB/s

both to and from the memories. Hence the buses are expected

to be the bottleneck. The analysis must also factor in the fact

that the buses connected to the different memories are separate.

The SRAMs and scratchpad provide a total bandwidth of 

9.2 GB/s while the DRAMs provide approximately 6.5 GB/s.3

The respective buses provide 5.6 GB/s. Thus the buses are

expected to limit the throughput towards both the SRAMs and

DRAMs.
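The bottleneck arithmetic above can be tabulated directly from the figures already given (including the rounded 6.5 GB/s DRAM value discussed in footnote 3):

```python
# Aggregate bandwidths in GB/s, from the IXP 2800 description above.
sram    = 4 * (0.8 + 0.8)   # 4 SRAMs, 800 MB/s read + 800 MB/s write each
scratch = 2.8               # one 700 MHz scratchpad
dram    = 6.5               # 3 DDR DRAMs, rounded model value (actual 6.4)
bus     = 2.8               # per bus, per direction

memories  = sram + scratch + dram   # ≈ 15.7 GB/s total memory bandwidth
sram_side = sram + scratch          # 9.2 GB/s, served by two buses = 5.6
dram_side = dram                    # 6.5 GB/s, served by two buses = 5.6

# Each side of the system is limited by its pair of buses, so the buses
# bound the overall throughput.
throughput = min(sram_side, 2 * bus) + min(dram_side, 2 * bus)
print(throughput)   # ≈ 11.2 GB/s
```

The result matches the intuitive conclusion that the buses, not the memories, limit the system.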

We find that our methodology predicts a system throughput

of exactly 11.2 GB/s. The flow distribution for all processors is

identical, the flow in all SRAMs is the same and the flow in all DRAMs is the same. We refer to such a distribution as being

balanced for the rest of this section.

We then experiment with the following modifications in the

architecture. Note that we start with the original architecture for

each of the modifications listed as follows.

• Mod-1: We increase the bandwidth of each bus by a factor

of 2. Now we expect the memories to become the bottle-

neck.

³While implementing the flow-delay model we go through an intermediate format wherein we model the throughput of each component as a cycle time parameter having an integer value. This introduces a rounding-off error which gives us a net DRAM bandwidth of 6.5 GB/s rather than the actual 6.4 GB/s. This is a limitation of our implementation and not the methodology. We continue with the modeled value for the rest of this section.

TABLE I
COMPARISON OF RESULTS FROM THE INTUITIVE ANALYSIS AND THE FLOW-DELAY MODEL

• Mod-2: We increase the bandwidth of each bus by a factor

of 2 and remove the scratchpad. We again expect the mem-

ories to be the bottleneck but the system throughput to

change.

• Mod-3: We decrease the bandwidth of the DRAMs by a

factor of 2. Now we expect the DRAMs to be the bottle-

neck in their part of the system and the buses to be the bot-

tleneck in the part of the system containing the SRAMs and

scratchpad.

• Mod-4: We consider the architecture with the memories as the bottleneck, i.e., Mod-1. We assign unequal connec-

tion latencies to the SRAMs as follows: We assume that the

first processor cluster is close to the first two SRAMs and

far from the third and fourth. The second cluster is con-

versely close to the third and fourth SRAMs and far from

the first two. We assume that the bus latencies for a close

and far connection are in the ratio . Now we expect

the accesses from the first cluster to be biased towards the

first two SRAMs and vice versa for the second cluster.

We present all the results in Table I.⁴ We find that the flow-delay model predicts the expected throughput very closely in all scenarios. When the connection latencies are equal, the flow distribution is also balanced. For Mod-4, in which the latencies are

unequal, the flow between the processors and SRAMs is biased

in the following manner: The entire flow from the first cluster

goes to the first and second SRAMs while the entire flow from

the second cluster goes to the third and fourth SRAMs. This

is the expected distribution in the absence of queueing delays.

With queueing delay functions, it is not possible to estimate the

optimal distribution in a simple manner.

Through this exercise we have demonstrated that the flow-

delay model gives correct throughput predictions in simple situ-

ations. It also gives the flow distribution that achieves the system

throughput and minimizes mean latencies. In the next section

we consider a wider range of system configurations and com-

pare this methodology against standard search procedures for

finding the optimal distributions.

V. EMPIRICAL STUDIES

The purpose of this section is a thorough validation of the

flow-delay model. We use a cycle accurate simulation frame-

work called MemSim [36]–[38] to model the reference system.

We start with simple regular configurations and then move on

to randomized ones. We compare the proposed methodology

with the following two procedures: 1) where we generate a large

number of random distributions and select the best one and 2)

⁴We explain the biased distribution later in this paragraph.


where we use an intensive search procedure, namely simulated

annealing to find a good distribution. We simulate the MemSim

model both for evaluating the random distributions and for eval-

uating the objective function during simulated annealing. We

compare the procedures based on two performance parameters,

namely system throughput and mean delay, with throughput

being the primary one. We organize this section as follows. In Section V-A, we

present relevant details for the construction of the flow-delay

model and the procedures involving random sampling and

simulated annealing. In Section V-B, we illustrate the type of 

systems considered by giving a brief overview of MemSim.

Then in Section V-C we present details of the configurations

used and study the distribution given by the flow-delay model

for a few of the simpler ones. Finally, in Section V-D, we

present the consolidated results across all configurations and

procedures.

 A. Details of Flow Optimization Procedures

Let us start with the flow-delay model. Since we do not have

techniques to characterize the queueing delay functions as yet,

we use the following three strategies to construct the model.

• FDM-L: We consider only the component latencies and

neglect the queueing delays.

• FDM-Q: We assume a linear dependence between the flow

through a component and its queueing delay. We assume

the queueing delay to be 0 when the flow rate is negli-

gibly small and the maximum to be the product of its queue

capacity and cycle time, when the flow rate reaches its

throughput.

• FDM-I : We iteratively determine the queueing delay

at which each component is operating in the following manner. We initially assume all queueing delay functions

to be linear and identical to FDM-Q. We consider the slope

to be a variable parameter. Once we get the distribution,

we simulate it on the MemSim model and note the actual

delays. We compute the difference between the delays

predicted by the flow-delay model and the observed delays

for each component. We adjust each slope by an amount

proportional to the difference and repeat the process. We

stop when the difference is within a specified tolerance for

all components or the number of iterations exceeds a spec-

ified limit. We use the following practical considerations

as the stopping criteria: either the difference in delay iswithin 5% for all components or the number of iterations

has exceeded 1000. This method simply ensures that the

queueing delays modeled are approximately equal to the

actual delays for the distribution at which the system is

operating. A possible limitation is that the flow distribution

may converge to one that is suboptimal for the underlying

system. We still use this method as a practical means in the

absence of an accurate characterization of the queueing

delay functions.
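The FDM-I loop can be sketched as follows, with the cycle-accurate simulator replaced by a stub. In the real procedure a MemSim run and a re-solve of the flow-delay model happen each iteration; everything below, including the `simulate_delay` response, is an illustrative stand-in:

```python
# Illustrative FDM-I style fixed-point loop for a single component.
# `simulate_delay` stands in for a MemSim run: it returns the observed
# queueing delay as a (unknown to the optimizer) function of the modeled slope.
def simulate_delay(slope):
    # Hypothetical system response: observed delay saturates with slope.
    return 10.0 * slope / (1.0 + slope)

slope = 1.0                               # initial linear-model slope
for it in range(1000):                    # iteration cap, as in the text
    flow = 1.0                            # flow given by the model (stubbed)
    predicted = slope * flow              # delay predicted by the linear model
    observed = simulate_delay(slope)      # delay seen in simulation
    diff = observed - predicted
    if abs(diff) <= 0.05 * observed:      # stop once within 5%
        break
    slope += 0.5 * diff                   # adjust slope proportionally

print(it, predicted, observed)
```

With several components, the same adjustment runs per component and the loop stops only when every difference is within tolerance.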

For the random samples, we generate a random probability

for each path through the MemSim model. Each access path

starts at a processor, goes to a memory, and returns to the same processor. Each path also involves an input port and an output port. The processor-port connectivity and memory-port connec-

Fig. 17. MemSim model overview.

tivity are specified as a part of the configuration. They determine

which out of the possible paths are present. We then normalize

the path probabilities to add up to 1 for each processor and sim-

ulate each distribution on the MemSim model. While presenting

the results, we refer to the random search procedures as RND-$N$, where $N$ is the number of distributions generated. We consider $N = 1000$ and $N = 10\,000$ in our study.
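Generating one random sample can be sketched as below. The path structure here is hypothetical; in the study the valid paths are dictated by each configuration's processor-port and memory-port connectivity:

```python
import random

# Hypothetical connectivity: the (port, memory) paths each processor may use.
paths = {"P1": ["port1-M1", "port1-M2", "port2-M3"],
         "P2": ["port2-M2", "port2-M4"]}

def random_distribution(paths, rng=random.Random(0)):
    """Draw a random weight per path, then normalize per processor
    so each processor's path probabilities add up to 1."""
    dist = {}
    for proc, plist in paths.items():
        raw = [rng.random() for _ in plist]
        total = sum(raw)
        dist[proc] = {p: w / total for p, w in zip(plist, raw)}
    return dist

dist = random_distribution(paths)
print({p: round(sum(d.values()), 6) for p, d in dist.items()})
```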

For simulated annealing we use the library available at [34]. We start with a random distribution and compute the objective

function as follows. We simulate the MemSim model and define

the objective that is to be maximized as

objective = system throughput $- \; w \times$ mean delay.  (18)

In this expression $w$ is a weight constant which we take to be 0.01. We simulate a short access trace for evaluating the objec-

tive, wherein each processor generates 1000 accesses. We refer

to this procedure as SA while presenting the results. In the next

section we quickly introduce MemSim, the modeling frame-

work within which we have conducted this validation.

 B. Reference Simulation Model

MemSim provides a cycle accurate modeling framework for

the memory accesses in multiprocessor systems having a shared

and distributed memory sub-system. The system is modeled as

shown in Fig. 17. It is assumed to be synchronous with a global

clock. The processors generate memory accesses each of which

is allocated to a path through the memory and interconnect sub-

systems. For this study we assume that the processors randomly

allocate the accesses based on a probability distribution.

The memory and interconnect sub-systems are modeled using

the template shown in Fig. 18. The components with perfor-

mance parameters are the memory modules, arbiters, distributors (which direct the accesses at points where multiple paths

diverge) and interconnect wires. The memories, arbiters, and

distributors are characterized by a cycle time which is the recip-

rocal of the throughput. Their latency is also assumed to be equal

to this cycle time. The wires are pipelined, they are character-

ized by the number of stages which we refer to as the length. An

access proceeds one stage in each clock, thus the throughput is 1

and the latency is equal to the length. There are queues at a number of

positions in the template. They are characterized by a capacity

in terms of the number of accesses. All components follow a

first-in-first-out policy with blocking.
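The component parameterization above can be summarized in code. The field names here are ours, not MemSim's, and are meant only to make the model concrete:

```python
from dataclasses import dataclass

@dataclass
class Component:
    """A MemSim memory module, arbiter, or distributor."""
    cycle_time: int            # reciprocal of throughput; latency equals this

    @property
    def throughput(self):
        return 1.0 / self.cycle_time

@dataclass
class Wire:
    """A pipelined interconnect wire."""
    length: int                # pipeline stages; throughput 1, latency = length

@dataclass
class Queue:
    capacity: int              # max accesses held; FIFO with blocking

mem = Component(cycle_time=4)
wire = Wire(length=3)
print(mem.throughput, wire.length)   # 0.25 accesses/cycle, 3-cycle latency
```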

We have developed the infrastructure to automatically con-

vert a MemSim description to a flow-delay model. The components are modeled as described in the previous section and then


Fig. 18. Model for the memory and interconnect sub-systems.

Fig. 19. Memory and interconnect parameters for Configuration-1.

TABLE II
DEFAULT PARAMETERS FOR CONFIGURATION-1

connected exactly as in Fig. 18. The description also contains

the valid path connections for the design, which gets reflected

in the flow-delay network constructed. In the next section, we

describe the configurations we have used for this study in terms

of the MemSim template.

C. Configurations Used 

The simplest configuration we consider has four processors,

four memories, one input port, and one output port. Since there

is a single input port and a single output port, the connectivity

is trivial. We show the memory cycle time and wire length pa-

rameters in Fig. 19. For the rest of the parameters, we use the defaults given in Table II. We label this configuration as C-1.

Fig. 20. Memory and interconnect parameters for Configuration-2.

In this configuration, the memories clearly offer a throughput

bottleneck. Since their cycle times are in the ratio 1:4 we can expect the ideal flow distribution to be in the inverse ratio 4:1, favouring

the faster memories. We do indeed observe that the distribution

in the flow-delay model directs 40% of the flow towards each of 

the faster memories and 10% towards each of the slower ones.

The distribution is identical for all processors. We observe the

same for all three strategies: FDM-L, FDM-Q, and FDM-I .
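The C-1 distribution can be predicted directly from the cycle times, since each memory's throughput is the reciprocal of its cycle time. The concrete values 1 and 4 below are assumed from the stated 1:4 ratio:

```python
# Memory cycle times for C-1: two fast memories and two slow, in ratio 1:4.
cycle_times = [1, 1, 4, 4]

# Throughput is the reciprocal of cycle time; the ideal distribution sends
# flow to each memory in proportion to its throughput.
thr = [1.0 / c for c in cycle_times]
share = [t / sum(thr) for t in thr]

print(share)   # [0.4, 0.4, 0.1, 0.1]
```

This reproduces the observed split: 40% of the flow to each fast memory and 10% to each slow one.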

The remaining configurations consist of four processors, four

memories, two input ports, and two output ports. In the first

such configuration we consider uniform memory cycle times

and nonuniform wire lengths as shown in Fig. 20. We use the

same defaults as in Table II for the remaining parameters. We

assume that all processors are connected to all ports and simi-

larly for the memories. We label this configuration C-2.

The memories still offer a throughput bottleneck but the first

input and output ports are close to the first two memories and

far from the other two, and vice versa for the second input and

output ports. This is reflected in the wire lengths. All three forms

of the flow-delay model result in the flows being distributed

equally across the memories but using only the short wires.

In the next configuration labeled C-3, we interchange the wire

lengths as follows. We assume that the first input and output

ports are close to the memories while the second input and

output ports are far. We find that FDM-L and FDM-I use only the

short wires whereas FDM-Q divides the flow, placing a larger

share on the short wires.

The subsequent configurations cover the following system

features.

• Nonuniform memory cycle times, as in Fig. 19 along with

the two previous wire length combinations. We label these

C-4 and C-5.

• An arbiter as the throughput bottleneck with the memory cycle times and wire lengths being selected randomly, as

indicated in Table III. We use the connectivity scheme

shown in Fig. 21 for such configurations. We consider 2

of them labeled C-6 and C-7.

• Partially randomized configurations in which each of the

memory cycle times and wire lengths is selected indepen-

dently from the values given in Table IV. The rest of the

parameters are taken to be the defaults in Table II. We con-

sider two connectivity schemes: 1) complete connectivity

as before and 2) all processors are connected to both ports

but the memories are connected as shown in Fig. 21. We

study four such configurations labeled C-8–C-11.

• Completely randomized configurations in which all parameters are selected independently from the ranges


Fig. 21. Alternate port-memory connectivity scheme.

TABLE III
PARAMETERS FOR CONFIGURATIONS WITH AN ARBITER AS A THROUGHPUT BOTTLENECK

TABLE IV
PARAMETERS FOR PARTIALLY RANDOMIZED CONFIGURATIONS

TABLE V
PARAMETERS FOR COMPLETELY RANDOMIZED CONFIGURATIONS

shown in Table V. We assume complete connectivity for

these configurations. We consider five of them labeled

C-12–C-16 .

We assume that the processors stress the memory and interconnect sub-systems. We generate 10 000 accesses per pro-

cessor and observe the time taken to service them. We also ob-

serve the mean access delay. We present a summary of the re-

sults across all configurations in the next section.

 D. Results

We have six flow distribution procedures: FDM-L, FDM-Q,

FDM-I , RND-1000, RND-10000, and SA. We say that one

procedure is better than another either if it results in a higher

throughput or if it results in approximately the same throughput

and a lower mean delay. We use the following two ratios to

compare two procedures, say $A$ and $B$, against each other:

$R_T = T_A / T_B$  (19)

$R_D = D_A / D_B$  (20)

where $T_X$ and $D_X$ denote the system throughput and mean delay obtained with procedure $X$. Thus $A$ is better than $B$ if $R_T > 1$, or if $R_T \approx 1$ and $R_D < 1$. If $R_T > 1$ we can expect $R_D > 1$ since we are stressing the network. In general if $A$ supports a higher throughput we expect it to be accompanied by larger delays.
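The comparison rule can be sketched as a predicate. The tolerance used to decide "approximately the same throughput" is our own illustrative choice:

```python
def better(thr_a, delay_a, thr_b, delay_b, tol=0.01):
    """Is procedure A better than B? Higher throughput wins; at approximately
    equal throughput, lower mean delay wins."""
    r_t = thr_a / thr_b          # throughput ratio, as in (19)
    r_d = delay_a / delay_b      # mean-delay ratio, as in (20)
    if r_t > 1 + tol:
        return True
    if abs(r_t - 1) <= tol and r_d < 1:
        return True
    return False

print(better(1.10, 1.3, 1.00, 1.0))   # True: higher throughput wins
print(better(1.00, 0.9, 1.00, 1.0))   # True: same throughput, lower delay
```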

We first compare the three forms of the flow-delay model

against each other. We present the detailed results in Table VI.

We find that the differences in resulting system performance are small throughout and that no single strategy is consistently

TABLE VI
DETAILED RESULTS FOR COMPARISON BETWEEN THE VARIOUS FORMS OF THE FLOW-DELAY MODEL

TABLE VII

SUMMARY OF THE COMPARISON BETWEEN THE FLOW-DELAY MODELS

better than the others. A more in-depth look at the results

gives the following observations: In terms of only throughput,

FDM-Q is better than the other two by 5% for exactly two

configurations, FDM-I  is better than FDM-L by 4% for one

configuration and all other differences are within 2%. The larger

throughput gains are accompanied by a significant increase in delay. In terms of only delay, the differences between the

strategies are larger for certain configurations. These cases also

show a difference in throughput which accounts for the effect

on mean delay. The configurations for which one strategy is

better than another in terms of both throughput and mean delay

are restricted to the following: FDM-L is better than FDM-Q

for C-3 and C-12 while FDM-Q is better than FDM-L for C-14.

We present a summary of the above results in Table VII. This

information is sufficient to conclude that: 1) the differences in

throughput are small; 2) the differences in mean delay are larger

and show greater variations; and 3) no particular strategy is con-

sistently better than the other two, across all configurations. Although FDM-I is consistently better than FDM-L, the average

throughput gain is negligible and it is not consistently better than

FDM-Q. Thus we continue with all three forms for the rest of 

this section.

Next we compare the flow-delay model with the random

search and simulated annealing procedures. Let us start by

studying the speedup provided. In our experiment setup, a

single simulation of the MemSim model is 4–5 times slower

than constructing and solving the flow-delay model. Thus the

speedup is almost 4000–5000 and 40 000–50 000 over the two random

procedures we use. The simulated annealing procedure is much

slower still, and the speedup over it is correspondingly larger. Thus the proposed methodology offers a substantial speedup over the other two procedures.


TABLE VIII
COMPARISON OF THE FLOW-DELAY MODEL WITH RANDOM SAMPLING AND SIMULATED ANNEALING

To compare the procedures in terms of system performance,

we modify the method followed so far in the following two

ways.

1) When computing the throughput ratio (19) for the random sampling we take the distribution giving the highest throughput. When computing the delay ratio (20) for the random sampling we take the distribution giving minimum mean delay with a throughput at least as high as that given by the flow-delay model it is being compared

against.

2) Then, for computing the maximum, minimum and average

statistics of the delay ratio we consider only those configurations for

which the random sampling or simulated annealing gives a

higher or equal throughput than the flow-delay model. We

present the results in Table VIII.

In terms of system throughput, we find cases where the flow-

delay models give better distributions as well as cases where

they give worse distributions. In the worst case, the flow-delay

models give a distribution that comes within 7% of the best

distribution the other two procedures can find. For the config-urations in which the flow-delay model finds a better distribu-

tion, the differences are much larger. In the average case they

give better distributions than the two random procedures by ap-

proximately 20% and 15%, respectively, and equivalent ones as

compared to simulated annealing. In terms of delays, we once

again find cases where the flow-delay models give better as well

as worse distributions. The differences are large when they are

better and small when they are worse. In the average case the

mean delays given by the proposed methodology are lower. Thus

we conclude that in general the flow-delay models do indeed

give flow distributions that achieve high throughput along with

low delays.

Our experiments clearly show that the proposed methodology can find comparable if not better distributions within much

smaller time budgets as compared to standard search proce-

dures. This result establishes it as a promising approach towards

designing the flow distributions in VLSI systems.

VI. CONCLUSION AND FURTHER DIRECTIONS

Through this paper we have demonstrated a methodology for

modeling VLSI system performance using simple electrical cir-

cuits. We have shown how the network solution provides a nat-

ural means for optimizing the flow distributions from the per-

spective of obtaining high throughputs with low end-to-end de-

lays. Although this approach can model only a limited set of features and makes a number of approximations, the empirical

results are promising. Our present studies show that the approx-

imation of discrete VLSI quantities, such as data transfer units

and synchronous time, as continuous currents and voltage drops

is valid.

We have built on the methodology proposed in [1]. The ad-

vancement from linear queueing delay functions to arbitrary

ones is important for VLSI systems. The next requirement is to be able to characterize the queueing delay functions. Although

our results show promise when we assume simple linear func-

tions or even neglect queueing delays, this may not be the case in

all systems. Further research on this characterization can make

the approach more robust and efficient.

We have conducted a substantial empirical study which

clearly shows that the flow-delay model gives good distribu-

tions as compared to standard search procedures. Its strength

is that it offers a substantial speedup, which makes it a good

candidate to be incorporated into design procedures, compilers,

and for on-chip allocations. Further studies can validate the

methodology on a wider range of systems. Nondeterministic

behavior such as caching and row-buffering can also be covered; however, we do not expect the flow-delay model to capture these

effects well. A final research direction is to develop algorithms

to construct good allocations using the ideal flow distribution

as a reference. This would complete the requirements for a

complete design flow.

ACKNOWLEDGMENT

The authors would like to thank Prof. M. P. Desai and his students for the simulation and optimization infrastructure. They are grateful to Y. Save for introducing them to the circuit simulator and for helping to revise the drafts. They would also like to thank the reviewers for suggesting significant improvements over the initial submission.

REFERENCES

[1] J. B. Dennis, Mathematical Programming and Electrical Networks. New York, London: MIT, Wiley, Chapman & Hall, 1959.
[2] W. Wolf, A. A. Jerraya, and G. Martin, "Multiprocessor system-on-chip (MPSoC) technology," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 27, no. 10, pp. 1701–1713, Oct. 2008.
[3] M. Adiletta, M. Rosenbluth, D. Bernstein, G. Wolrich, and H. Wilkinson, "The next generation of Intel IXP network processors," Intel Technol. J., vol. 6, no. 3, pp. 6–18, Aug. 2002.
[4] P. Kongetira, K. Aingaran, and K. Olukotun, "Niagara: A 32-way multithreaded SPARC processor," IEEE Micro, vol. 25, no. 2, pp. 21–29, Feb. 2005.
[5] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy, "Introduction to the Cell multiprocessor," IBM J. Res. Develop., vol. 49, no. 4/5, pp. 589–604, Jul./Sep. 2005.
[6] S. Borkar, "Thousand core chips—A technology perspective," in Proc. 44th ACM/IEEE Des. Autom. Conf., San Diego, CA, Jun. 2007, pp. 746–749.
[7] G. Micheli and L. Benini, "Networks on chip: A new paradigm for systems on chip design," in Proc. Des., Autom., Test Eur. Conf. Exhibition, Paris, France, Mar. 2002, pp. 418–419.
[8] G. Micheli and L. Benini, Networks on Chip. San Francisco, CA: Morgan Kaufmann, 2006.
[9] J. Nurmi, "Network-on-chip: A new paradigm for system-on-chip design," in Proc. Int. Symp. Syst.-on-Chip, Tampere, Finland, 2005, pp. 2–6.
[10] T. Bjerregaard and S. Mahadevan, "A survey of research and practices of network-on-chip," ACM Comput. Surveys, vol. 38, no. 1, Article 1, pp. 1–51, 2006.
[11] S. A. McKee, "Reflections on the memory wall," in Proc. 1st ACM Conf. Comput. Frontiers, Ischia, Italy, Apr. 2004, p. 162.



[12] S. Medardoni, M. Ruggiero, D. Bertozzi, L. Benini, G. Strano, and C. Pistritto, "Capturing the interaction of the communication, memory and I/O subsystems in memory-centric industrial MPSoC platforms," in Proc. Des., Autom. Test Eur., Nice, France, 2007, pp. 660–665.
[13] M. Monchiero, G. Palermo, C. Silvano, and O. Villa, "Exploration of distributed shared memory architectures for NoC-based multiprocessors," J. Syst. Arch., vol. 53, no. 10, pp. 719–732, Oct. 2007.
[14] B. H. Meyer and D. E. Thomas, "Simultaneous synthesis of buses, data mapping and memory allocation for MPSoC," in Proc. 5th IEEE/ACM Int. Conf. Hardw./Softw. Codes. Syst. Synth., Salzburg, Austria, Sep./Oct. 2007, pp. 3–8.
[15] S. Pasricha and N. Dutt, "COSMECA: Application specific co-synthesis of memory and communication architectures for MPSoC," in Proc. Des., Autom. Test Eur., Munich, Germany, Mar. 2006, pp. 1–6.
[16] R. Ahuja, T. Magnanti, and J. Orlin, Network Flows: Theory, Algorithms, and Applications. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[17] L. Kleinrock, Queueing Systems. New York: Wiley, 1975.
[18] P. K. F. Holzenspies, J. L. Hurink, J. Kuper, and G. J. M. Smit, "Run-time spatial mapping of streaming applications to a heterogeneous multi-processor system-on-chip (MPSOC)," in Proc. Des., Autom. Test Eur., Munich, Germany, 2008, pp. 212–217.
[19] K. Goossens, J. Dielissen, O. P. Gangwal, S. G. Pestana, A. Radulescu, and E. Rijpkema, "A design flow for application-specific networks on chip with guaranteed performance to accelerate SOC design and verification," in Proc. Des., Autom. Test Eur., 2005, pp. 1182–1187.
[20] C. Chou and R. Marculescu, "Incremental run-time application mapping for homogeneous NoCs with multiple voltage levels," in Proc. IEEE/ACM Int. Conf. Hardw./Softw. Codes. Syst. Synth., Salzburg, Austria, Sep.–Oct. 2007, pp. 161–166.
[21] J. A. Buzacott and D. D. Yao, "Flexible manufacturing systems: A review of analytical models," Management Sci., vol. 32, no. 7, pp. 890–905, Jul. 1986.
[22] A. Federgruen and H. Groenevelt, "Characterization and optimization of achievable performance in general queueing systems," Oper. Res., vol. 36, no. 5, pp. 733–741, Sep.–Oct. 1988.
[23] J. A. Buzacott and J. G. Shanthikumar, "Design of manufacturing systems using queueing models," Queueing Syst., vol. 12, no. 1–2, pp. 135–213, Mar. 1992.
[24] M. K. Govil and M. C. Fu, "Queueing theory in manufacturing: A survey," J. Manuf. Syst., vol. 18, no. 3, pp. 214–240, 1999.
[25] O. Boxma, G. Koole, and Z. Liu, "Queueing-theoretic solution methods for models of parallel and distributed systems," in Proc. 3rd QMIPS Workshop Perform. Evaluation Parallel Distrib. Syst.—Solution Methods, Torino, Italy, 1993, pp. 1–24.
[26] B. Gaujal and E. Hyon, "Optimal routing in several deterministic queues with two service times," J. Eur. Syst. Automat., vol. 36, no. 2, pp. 945–957, 2002.
[27] F. E. B. Ophelders, S. Chakraborty, and H. Corporaal, "Intra- and inter-processor hybrid performance modeling for MPSoC architectures," in Proc. 6th IEEE/ACM/IFIP Int. Conf. Hardw./Softw. Codes. Syst. Synth., Atlanta, GA, 2008, pp. 91–96.
[28] S. Schliecker, A. Hamann, R. Racu, and R. Ernst, "Formal methods for system level performance analysis and optimization," Des. Autom. Embed. Syst., vol. 13, no. 1–2, pp. 27–49, Jun. 2009.
[29] S. Chakraborty, S. Kunzli, L. Thiele, and P. Sagmeister, "Performance evaluation of network processor architectures: Combining simulation with analytical estimation," Comput. Netw., vol. 41, no. 5, pp. 641–665, Apr. 2003.
[30] R. Marculescu and P. Bogdan, "The chip is the network: Toward a science of network-on-chip design," Foundations Trends Electron. Des. Autom., vol. 2, no. 4, pp. 371–461, 2009.
[31] A. L. Varbanescu, H. Sips, and A. van Gemund, "PAM-SoC: A toolchain for predicting MPSoC performance," in Euro-Par 2006 Parallel Processing. Berlin, Germany: Springer, 2006, pp. 111–123.
[32] B. Ristau, T. Limberg, and G. Fettweis, "A mapping framework based on packing for design space exploration of heterogeneous MPSoCs," J. Signal Process. Syst., vol. 57, no. 1, pp. 45–56, Oct. 2009.
[33] N. Balabanian and T. Bickart, Electrical Network Theory. New York: Wiley, 1969.
[34] G. Kliewer and S. Tschoke, "Parallel simulated annealing library," 1998. [Online]. Available: http://wwwcs.uni-paderborn.de/fachbereich/AG/monien/SOFTWARE/PARSA/
[35] S. H. Batterywala and H. Narayanan, "Efficient DC analysis of RVJ circuits for moment and derivative computations of interconnect networks," in Proc. IEEE Int. Conf. VLSI Des., Goa, India, Jan. 1999, pp. 169–174.
[36] G. Hazari, M. P. Desai, and H. Kasture, "On the impact of address space assignment on performance in systems-on-chip," in Proc. IEEE Int. Conf. VLSI Des., Bangalore, India, 2007, pp. 540–545.
[37] G. Hazari, M. P. Desai, and G. Srinivas, "Bottleneck identification techniques leading to simplified performance models for efficient design space exploration in VLSI memory systems," presented at the IEEE Int. Conf. VLSI Des., Bangalore, India, 2010.
[38] H. Kasture, "A memory subsystem simulator for SoC applications," B.Tech. and M.Tech. dissertation, Dept. Elect. Eng., I.I.T. Bombay, India, Jul. 2006.

Gautam Hazari received the Dual Degree, comprising the B.Tech. degree in electrical engineering and the M.Tech. degree in microelectronics, from I.I.T. Bombay, India, in 2002, and the Ph.D. degree from the same institute in 2010.

His current research interests are centered around performance analysis and modeling at the system level. He has previously enjoyed working with digital circuits, especially asynchronous ones.

H. Narayanan received the B.Tech. and Ph.D. degrees from I.I.T. Bombay, India, in 1969 and 1974, respectively.

He has been a faculty member with the Department of Electrical Engineering, I.I.T. Bombay, since 1974. He was also a visiting faculty member with the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, from 1983 to 1985. From 2000 to 2003, he was the Head of the Department of Electrical Engineering, I.I.T. Bombay. His primary research interests are in the area of electrical network analysis, particularly in the use of topological methods for efficient analysis. He has supervised the building of the general purpose circuit simulator BITSIM at I.I.T. Bombay, which uses such methods. He has participated in the building of VLSI circuit partitioners for realization through FPGAs, in collaboration with industry partners from the United States and Japan. He is the author of a monograph titled Submodular Functions and Electrical Networks (North Holland, 1997), a revised edition of which is available online at http://www.ee.iitb.ac.in/~hn/book/.