

J. Parallel Distrib. Comput. 66 (2006) 1014–1024
www.elsevier.com/locate/jpdc

Stigmergic approaches applied to flexible fault-tolerant digital VLSI architectures

Danilo Pani∗, Luigi Raffo1

DIEE—Department of Electrical and Electronic Engineering, University of Cagliari, Piazza d’Armi, 09123 Cagliari, Italy

Received 2 May 2005; received in revised form 6 October 2005; accepted 2 November 2005
Available online 13 December 2005

Abstract

Parallel implementations are widely used in digital architectures to enhance computational performance, exploiting the number of involved processing units. Cooperative behaviors typical of swarm intelligence can enhance the performance of such systems, introducing an amplification effect due to the collective effort of a set of interacting hardware agents. Cooperation can also be exploited as a new means to achieve the fault-tolerance goal, with no need for expressly inserted redundant hardware resources. In this paper we present a novel architecture able to address these issues, exploiting all the potential exposed by this bioinspired approach. A first implementation in a CMOS 0.13 µm technology shows the feasibility of such a design style, allowing preliminary simulations and discussions.
© 2005 Elsevier Inc. All rights reserved.

Keywords: VLSI; Parallel architectures; Fault tolerance; Stigmergy; Swarm intelligence

1. Introduction

Parallel architectures are widely used to speed up the computation of parallelizable algorithms. Consolidated design strategies nowadays allow impressive improvements in terms of speed and efficiency. However, demands for flexible computational substrates and fault-tolerant systems generally exceed the possibilities offered by standard approaches to parallelism. Furthermore, even if VLSI technology keeps improving speed and reducing area, it is necessary to define novel approaches that could be exploited with other implementation techniques. Nature represents an important source of inspiration for researchers, because some very impressive results are surprisingly obtained by living organisms.

Swarm Intelligence (SI) is a bioinspired approach that originally comes from the observation of swarms: large sets of simple individuals with limited capabilities that can perform complex tasks without centralized control, taking advantage of the cooperation among them. Examples of swarms are

∗ Corresponding author. Fax: +39 070 675 5782.
E-mail addresses: [email protected] (D. Pani), [email protected]

(L. Raffo).
1 Luigi Raffo is also with CNISM, Section of Cagliari (Italy).

0743-7315/$ - see front matter © 2005 Elsevier Inc. All rights reserved.
doi:10.1016/j.jpdc.2005.11.001

ant colonies, bird flocks, fish schools, and so on. With respect to the different definitions of SI, we agree with the one proposed in [11], which relaxes the constraint of social-insect inspiration proposed by other authors while still respecting the five principles of proximity, quality, diverse response, stability and adaptability proposed by Millonas [14]. One of the most important characteristics of swarms is performance scalability: the performance in a task execution can be modulated by the number of involved individuals, until a maximum or minimum number is reached. The individuals interact either directly or indirectly. Direct interactions are not mediated by any medium, hence relying on visual or direct contacts. Indirect interactions are mediated by a medium, i.e. the environment. In social insects such a mechanism of interaction is called stigmergy [8]. An effective definition of stigmergy is the one proposed in [3]: “two individuals interact indirectly when one of them modifies the environment and the other responds to the new environment at a later time. Such an interaction is an example of stigmergy”. An example of stigmergy is the clustering behavior exhibited by some species of ants (which can lead to cemetery organization, larval sorting, etc.). Ants move randomly in the nest, picking up items (corpses, larvae, etc.) and putting them down next to other items. Ants perceive the presence of clusters of items following pheromone concentrations. Since pheromone is


attached to an item by the carrier, the largest clusters become more and more attractive for ants, and after a while it is possible to see large clusters of items in the nest. This is an example of a stigmergic system, since ants do not use direct interactions to accomplish the task, but perform it by perceiving the environmental modifications.

In this paper we consider the possibility of adopting stigmergy to design computational platforms flexible enough to support a large number of tasks, whether in multitasking or not, with intrinsic properties of fault tolerance. Performance scalability can be effectively exploited to accomplish the fault-tolerance goal, without any demand for spare resources. For this reason we have conceived a proof-of-concept computational fabric for parallel array processing based on stigmergy. The fabric is composed of an environment (memories and the interconnections between them) and computational hardware agents. Every agent sits on top of a memory and is able to carry out only the operations stored there. Stigmergy has been adopted to allow workload spreading inside the fabric without any need for direct communications, and hence synchronization, between agents. This allows a simple, fully decentralized control of the agents without arbitration and handshake protocols. Agents change the environment by reducing their workload (executing their operations) and by smoothing over the workload differences between adjacent zones by means of data movement. This is exactly the opposite behavior compared to the clustering behavior presented above. A small system composed of nine elements has been implemented in a standard-cell CMOS 0.13 µm technology and tested on simple array processing algorithms. The remainder of this paper is organized as follows. Section 2 introduces the current state of the art; Section 3 exposes the proposed approach with respect to parallel implementations and fault tolerance; Section 4 illustrates the proposed architecture, whereas Section 5 shows how tasks are executed in the architecture. Section 6 reports some synthesis and simulation results. Lastly, Section 7 concludes this work with a short analysis of the results.
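The workload-smoothing behavior described above can be sketched as a tiny discrete diffusion process. The following is an illustrative model only (not the authors' hardware): each agent moves one unit of work towards any neighbor whose well is appreciably less loaded, interacting only through the wells. The 1-D row, the threshold and the unit transfer size are assumptions made for this sketch.

```python
def smooth(load, threshold=2):
    """One smoothing step over a 1-D row of wells (total work is conserved).

    Each agent inspects its neighbors' loads (seen via the switches) and
    moves one unit of work towards the first neighbor whose load is lower
    than its own by more than `threshold`.
    """
    n = len(load)
    delta = [0] * n
    for i in range(n):
        for j in (i - 1, i + 1):
            if 0 <= j < n and load[i] - load[j] > threshold:
                delta[i] -= 1      # indirect interaction: the well is modified,
                delta[j] += 1      # no agent-to-agent message is exchanged
                break
    return [a + d for a, d in zip(load, delta)]

load = [12, 0, 0, 0]               # all the work initially in one well
for _ in range(10):
    load = smooth(load)
print(load)                        # -> [6, 4, 2, 0]
```

Iterating the step spreads the initial concentration until neighboring differences fall below the threshold, which is the fabric-level effect the paper attributes to stigmergy.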

2. Beyond parallel architectures

Traditional approaches to parallelization are based on a fixed partitioning of the work between the available units. This implies some drawbacks: the system is inflexible and performance does not adapt to the actual overall workload. Furthermore, the potential fault of a single unit entails the fault of the overall system.

Reconfigurable computing (RC) [6] has been proposed to overcome these limitations. RC tries to exploit parallelization by means of multiple processing modules connected to form a specialized datapath for a specific algorithm, with the possibility to change configuration at run time. Compared to field programmable gate arrays (FPGAs), RC usually makes use of coarser modules than configurable logic blocks (CLBs) [9], even in fine-grained architectures (e.g. [10,13]). Coarse-grained reconfigurable architectures (e.g. [1,7,15]) are less flexible but can rely on a lighter configuration. Requiring a centralized

(off-line) development of the configuration stream, these architectures are unable to adapt themselves to a mutable environment on their own (i.e. changes in the workload or adaptive speculation on data). More unconventional approaches descend from Von Neumann's original bioinspired studies on cellular automata (CA) [16]. CA are artificial-life systems characterized by an evolution of the state of a cell due to the states of its neighbors. The goal of the studies on CA is the development of universal computation substrates exploiting massive parallelism, locality of cellular interconnections and simplicity of cells [21]. Starting from the general CA idea, many researchers have investigated the possibility of extending the functionality of CA with the explicit property of fault tolerance. The concept of fault tolerance was originally proposed by Avizienis [2], stating that “a system is fault tolerant if its programs can be properly executed despite the occurrence of logic faults”. Embryonics (Embryonic Electronics) [12,17] is a fine-grained FPGA architecture (the basic element, MUXTREE [22], is a multiplexer-based structure) characterized by configuration and fault-recovery mechanisms inspired by the POE model (phylogenesis, ontogenesis and epigenesis) [23]. These approaches borrow from CA the simplicity of cells and the locality of the interconnection system. However, in these systems the memories of cells must be pre-loaded with a configuration: cells create by themselves an artificial organism for a specific computation (starting from an artificial genome). System functionality and interconnections between cells are decided at configuration time: after that, the system works like a traditional FPGA. The most important difference is in the fault-tolerance support. Fault tolerance can be obtained by means of fault masking or recovery [24] strategies.
In the first case reliability is due to a majority voting mechanism, whereas in the second case a first step of error detection is required before the recovery can take place. Embryonics [12] and POEtic [23] belong to the latter category. Generally, fault detection is performed by means of built-in self-tests (BISTs) or similar techniques, including special artificial immune modules [4], applied to an appreciable part of the cell (usually links are not tested). Fault recovery is usually accomplished by means of cell exclusion and reconfiguration of the routing resources. In [12,23], such a reconfiguration is needed because cells are constrained to fixed positions and the architecture requires the replication of the genome through the columns. Fault tolerance is supported at two levels: the molecular and the cellular one. Since every cell is composed of molecules, a first repair mechanism occurs at this level, reconfiguring the interconnections between the molecules of the same cell [22]. No more than one fault per molecular row is acceptable in the same cell; otherwise, the whole cell “dies”. A second repair mechanism works at cell level: if a cell “dies”, its whole column is replicated to the right, and the same happens to all the columns on its right until a spare column is reached and occupied. A similar approach, considering rows rather than columns, and only at cell level, is presented in [17]. Such approaches require spare columns and waste a lot of resources, since a column accommodates more than one cell. In [25], the fault of more than one element in a row is recovered by making the whole row transparent.


Fig. 1. An abstract view of the architecture. In (a) the cell fabric topology. In (b) the cell: at the top the switch, and below the memory well with the attached hardware agent.

3. Augmenting performance and reliability with cooperation

We are interested in defining an architecture that respects the requirements of flexibility, scalability and fault tolerance. To overcome the limits exposed by standard approaches to parallelization, we have taken inspiration from SI. The difference between this paradigm and the solutions briefly presented in Section 2 lies in the loose structure of swarms [11], where the agents are not constrained to fixed positions (even if we remove the property of mobility, to work at a higher level of abstraction). Interactions (direct or indirect) represent a key aspect, and generally they are simple, even if more complex than those typical of CA. Having in mind the goal of implementing such an architecture on silicon, a simple bi-dimensional array of locally interconnected undifferentiated cells is the most attractive topology (Fig. 1(a)). To respect the loose structure of swarm systems, links must not be configurable. The only possibility is to define a real (even if simple) packet switching network with only local interconnections. It is worth depicting the overall system abstractly, considering different layers. The first layer is the environment, a set of wells that can be filled with data. Wells are interconnected at an upper level by means of an interconnection layer composed of switches able to move data between adjacent wells. Switches simply perform routing operations without any initiative of their own. Hardware agents form the third layer, where data are processed. At this level, cooperative behaviors and computational capabilities are implemented. The abstract representation of a cell considering the three layers is depicted in Fig. 1(b). Hardware agents consume only the data of the attached well, but they are also responsible for data movement between adjacent wells with different loads. In this manner they modify the environment, triggering stigmergic behaviors. This allows a uniform workload distribution during processing, even starting from an originally non-uniform one.

3.1. Cooperative tasks distribution and execution

Every computation is physically represented by an amount of data: their distribution is a key issue in this kind of architecture, because it represents a potential bottleneck. Usually in parallel architectures data are assigned (more or less explicitly) to a

specific computational element, and this is decided at design time. In our approach, data (and the associated operation) are not assigned to a specific cell; data streams are sent to wells from the southern border of the fabric (see Fig. 1(a)) towards the inside. When a well is full, the next one in the same column is filled up, and so on. After processing, the obtained results are sent back to the border cells.
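The column-filling policy just described can be sketched in a few lines. This is a hedged illustration, not the implemented logic: the well capacity of 32 words comes from Section 4.2.1, while the 3-row column is an assumption for the example.

```python
def distribute(column_capacity, n_rows, packets):
    """Fill one column of wells from the southern border (row 0) upwards.

    Each packet goes to the first non-full well in the column; when a well
    is full, the next one up the column receives the data.
    """
    wells = [0] * n_rows          # index 0 = southernmost well
    for _ in range(packets):
        for r in range(n_rows):
            if wells[r] < column_capacity:
                wells[r] += 1
                break
        else:
            # in the real fabric the border cell would wait (or rely on
            # cooperation) instead of failing
            raise OverflowError("column full: distribution must wait")
    return wells

print(distribute(column_capacity=32, n_rows=3, packets=40))  # -> [32, 8, 0]
```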

During processing, a gradient in the concentration of data is detected by the hardware agents, which provide for data movement towards unloaded zones of the fabric. This happens exploiting only local knowledge. In this manner, computational resources not involved in any task can be effectively exploited to speed up the tasks actually in execution. Stigmergy is the basis of this mechanism, since there are no direct interactions between hardware agents, but every agent modifies the surrounding environment by moving data from overloaded zones to unloaded ones. Agents also modify the environment directly by performing the operations stored in their wells, since this alters workload concentrations. The movement of segments of computation is based only on algorithm and data complexities. The former is based on the definition of a computational workload for every operation, considering that different operations could involve different numbers of cycles. The latter is based on data complexity, that is, the computational effort required to elaborate a specific data item. There are examples of VLSI systems that consider this aspect, speculating on the distance between adjacent data or coefficients and adopting differential approaches (DCM [20], DCIM [5], DECOR [19], etc.). Nevertheless, they exploit characteristics typical of a data sequence and not of single data items. Consider an operation separable into elementary steps whose number results from data complexity: the latency of such an operation will depend on the number of actual steps required by the pair of input data. An example of data complexity exploitation applied to the Modified Booth Algorithm (MBA), used again for the same purpose in this work, can be found in [18].

3.2. A fault-tolerant architecture

We are interested in defining an architecture where every single cell (hardware agent) can be removed without any loss of functionality, only with a performance degradation. This is possible provided that the interconnections between cells do not consist of configurable links. This is the main difference between swarm systems and the other approaches discussed in Section 2. Transparency is the solution adopted to tackle fault events: our system does not require any reconfiguration to obtain transparency, since every hardware agent is physically attached to a well in a fixed position, but it can be active or not without affecting the activities of the other agents. Stigmergy can be useful for fault-tolerance purposes, since data movement is accomplished every time there is a gradient in the workload distribution, and this is true even in case of faults. Cooperation as a new way to tackle fault tolerance is absolutely normal in swarms, since the presence or absence of an individual can reduce the overall performance in a task execution but does not


c/r     w/s     x_disp   y_disp   vector_index   opcode   operand_A   operand_B
1 bit   1 bit   3 bits   3 bits   5 bits         3 bits   16 bits     16 bits

Fig. 2. Packet structure. Explanation of each field is given in Section 4.1.

Fig. 3. A simplified representation of two parts of the cell: (a) the memory well and (b) the switch (the interconnections with the memory well have been omitted).

entail any structural modification of the swarm. To achieve the stated goals the system must be able to identify faults, and this can be done during periods of inactivity by means of functional self-tests. If the hardware agent experiences some trouble during tests, it “kills” itself, and the switch makes the cell transparent by bypassing it. The transparency of a cell simply means that the well cannot be filled up, and the workloads shown to the adjacent agents are those of the next “healthy” agents. This allows stigmergy even in the presence of faults.

4. The proposed architecture

Following the ideas and guidelines presented in Section 3, we have designed a novel architecture composed of a fabric of undifferentiated, locally interconnected cells. Every cell implements a module for each layer introduced in Section 3, even if these modules should be conceived as parts of their respective layers and not simply as parts of a cell.

4.1. Topology and network details

All the communications occur by means of the interconnection layer, so different modules of the same layer do not interact physically at that level. Switches form a very simple packet switching network. The packet structure consists of different fields, some of which are modifiable by the switches or by the wells. It is 48 bits wide, organized in eight fields, as shown in Fig. 2. Operands are placed into their 16-bit fields, used jointly

for 32-bit results. The 3-bit opcode field provides support for eight operations. The 5-bit vector_index field is used jointly with the two 3-bit fields y_disp and x_disp to enumerate data packets during task distribution. The y_disp and x_disp fields are also essential for cooperation, since they store the relative displacement at every data movement, to allow the routing of the results back to the original column. Finally, two 1-bit flags are present in the packet structure: the w/s flag selects the saturation/wrap strategy in accumulation operations, and the c/r flag indicates whether the packet has been sent for cooperative purposes or for result return.
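The 48-bit layout of Fig. 2 can be modeled with straightforward bit packing; the field order and widths come from the text (1 + 1 + 3 + 3 + 5 + 3 + 16 + 16 = 48 bits), while treating c/r as the most significant field is an assumption of this sketch, since the paper does not specify bit ordering.

```python
# Field (name, width-in-bits) pairs, most significant first (assumed order).
FIELDS = [("c_r", 1), ("w_s", 1), ("x_disp", 3), ("y_disp", 3),
          ("vector_index", 5), ("opcode", 3),
          ("operand_A", 16), ("operand_B", 16)]

def pack(**values):
    """Build the 48-bit packet word; unspecified fields default to 0."""
    word = 0
    for name, width in FIELDS:
        v = values.get(name, 0)
        assert 0 <= v < (1 << width), f"{name} out of range"
        word = (word << width) | v
    return word

def unpack(word):
    """Recover the field dictionary from a 48-bit packet word."""
    out = {}
    for name, width in reversed(FIELDS):
        out[name] = word & ((1 << width) - 1)
        word >>= width
    return out

p = pack(opcode=0b010, x_disp=3, operand_A=0x1234, operand_B=0xBEEF)
assert p < (1 << 48)
assert unpack(p)["operand_A"] == 0x1234
```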

4.2. The cell architecture

Every cell consists of three modules: a memory well, a switch and a hardware agent. Let us analyze these modules separately.

4.2.1. Memory well

The structure of the memory well is depicted in Fig. 3(a). It consists of two FIFOs, one for the incoming data (WI) and the other for the outgoing data (WO). Both FIFOs are dual-ported to avoid bottlenecks in simultaneous read–write accesses. For this first implementation, the size of WI has been fixed to 32 memory words, whereas the size of WO has been limited to 8, since it serves only as a buffer to support the operations of the switch. A memory manager is responsible for the correct operation of the two FIFOs, including tag management and monitoring of the actual workload in WI.


Fig. 4. A simplified representation of the computational part of the hardware agent.

4.2.2. Switch

As stated in Section 3, the switch operates without any initiative of its own, simply applying invariable rules. The switch presents links in the four directions (North–East–South–West), allowing connections with its four neighbors and with the memory well beneath. Its schematic representation is depicted in Fig. 3(b). To allow one transfer per clock cycle, switches provide parallel links rather than serial ones. Links cover very small distances, since interconnections are only local and there are no long-distance connections in the fabric. Two different channels on each side of the switch allow full-duplex communication in every direction. Every input channel is equipped with a 4-location FIFO buffer. Every I/O port of the switch is controlled by a port interface, and this allows simultaneous I/O operations on all the ports. Since our approach is based on a loosely structured collection of agents, routing operations are performed working only with relative displacements. Switches also provide support for the routing of control signals and workload information, on dedicated links not shared with data (I/O load in Fig. 3). In particular, the stigmergic strategy exploits the information of the workload monitors: every switch exports this information to its neighbors and propagates it towards its hardware agent, which establishes whether or not it is worth performing data movement. Switches also support fault tolerance by allowing the transparency of a cell, as discussed in Section 4.3.
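Routing by relative displacement, as used by the switches and by the x_disp/y_disp packet fields, can be illustrated with a small sketch. This is an assumed model, not the actual switch logic: each hop forwards the packet in a direction that reduces the remaining signed offset, and the column-first ordering and compass naming are choices made for the example.

```python
def route(x_disp, y_disp):
    """Yield the hop directions a packet takes until both offsets reach 0.

    Positive x_disp is taken as East, positive y_disp as North (assumed
    conventions); each hop decrements the corresponding stored offset,
    exactly as a switch would update the packet's displacement fields.
    """
    while x_disp or y_disp:
        if x_disp:
            yield "E" if x_disp > 0 else "W"
            x_disp -= 1 if x_disp > 0 else -1
        else:
            yield "N" if y_disp > 0 else "S"
            y_disp -= 1 if y_disp > 0 else -1

print(list(route(2, -1)))  # -> ['E', 'E', 'S']
```

Because only the relative offset is carried, no switch needs a global coordinate map, which matches the loosely structured, locally interconnected fabric described above.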

4.2.3. Hardware agent

The hardware agent is composed of two main parts: a computational part and a smart part. The first one consists of an arithmetic unit that operates like a small processor, treating words basically composed of two 16-bit operands and an operator. Its simplified schematic representation is depicted in Fig. 4. The computational part consists of a flexible datapath able to carry out all the operations shown in Table 1. In particular, sequential multiplication has been implemented following the MBA approach augmented with null-triplet skipping [18] to allow

Table 1
Possible operations of the hardware agent with operands a and b

Operation   Description
MUL         res = a × b
ADD         res = a + b
SUB         res = a − b
CMP         res = a > b, a = b, a < b, and index of occurrence
SHR         res = a ≫ b
SHL         res = a ≪ b
MAC         res_n = res_{n−1} + (a × b)
ACC         res_n = res_{n−1} + (a + b)
a workload differentiation within the fabric, hence exploiting the stigmergic mechanism as much as possible. Multiplication is the only operation whose computational complexity explicitly depends on the data. To reduce this complexity, the hardware agent can swap multiplicand and multiplier whenever this reduces the overall latency. The sequential multiplier implemented by our hardware agent does not require the introduction of special modules, exploiting only the available arithmetic structures.
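The data-dependent latency behind null-triplet skipping can be illustrated with standard radix-4 Booth recoding: overlapping bit triplets of the multiplier map to digits in {−2, −1, 0, 1, 2}, and null digits need no add/subtract cycle, so the cycle count depends on the operand value. This sketch only illustrates the latency estimate and the operand-swap idea; it does not reproduce the actual circuit of [18].

```python
# Radix-4 Booth recoding table: triplet (b_{2i+1}, b_{2i}, b_{2i-1}) -> digit.
BOOTH = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
         0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}

def booth_digits(m, bits=16):
    """Radix-4 Booth digits of a `bits`-wide multiplier (b_{-1} = 0)."""
    m = (m & ((1 << bits) - 1)) << 1        # append the implicit 0 on the right
    return [BOOTH[(m >> i) & 0b111] for i in range(0, bits, 2)]

def active_cycles(m, bits=16):
    """Add/subtract cycles needed when null triplets are skipped."""
    return sum(d != 0 for d in booth_digits(m, bits))

# The agent could swap operands so that the one with fewer non-null
# digits drives the recoding, reducing the multiply latency:
a, b = 0x00FF, 0x5555
m = a if active_cycles(a) <= active_cycles(b) else b
assert m == 0x00FF                          # 2 active cycles instead of 8
```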

The smart part of the hardware agent is responsible for the self-test procedure (self-detection of faults in the arithmetic unit) and for the stigmergic behavior. Since data movement is not a job of the hardware agent (it is accomplished by the memory well and by the switch), during data processing this part of the cell monitors the residual workload in the attached memory well, comparing it to those of the neighboring wells. This happens continuously during the activity of the hardware agent. If it verifies that some wells have less workload than its own and the difference is appreciable, it can ask the memory well manager to move a specific amount of data towards the zone it indicates. In this manner, the hardware agent modifies the environment, allowing the other immovable hardware agents to cooperate in data processing. This choice avoids bottlenecks by disallowing simultaneous access to the same well for cooperative purposes by more than one agent. Stigmergy simplifies the control of the agents, since cooperation is driven by the activities of the agents and does not require direct synchronization. There is a built-in mechanism to inhibit stigmergy: a cell can be forced to refuse data movement towards its memory well (at the moment it cannot “decide” this autonomously). This allows, in a way, the definition of two priority levels for tasks: high-priority tasks are assigned to cells that can be active subjects of stigmergy but not passive ones; low-priority tasks are assigned to cells that can be both active and passive subjects.
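The agent-side decision just described, including the inhibit flag that implements the two priority levels, can be sketched as a single predicate. The threshold value and the move-half-the-gap policy are illustrative assumptions, not the paper's parameters.

```python
def plan_move(own_load, neighbor_loads, neighbor_accepts, threshold=4):
    """Return (direction, amount) for a data-movement request, or None.

    neighbor_loads:   {"N": int, ...} residual workloads seen via the switch
    neighbor_accepts: {"N": bool, ...} False when stigmergy is inhibited,
                      i.e. the neighbor hosts a high-priority task and
                      refuses incoming data
    """
    best = None
    for d, load in neighbor_loads.items():
        gap = own_load - load
        # only "appreciable" differences towards accepting wells trigger moves
        if neighbor_accepts.get(d, False) and gap > threshold:
            if best is None or gap > best[1]:
                best = (d, gap)
    if best is None:
        return None
    direction, gap = best
    return direction, gap // 2   # ask the well manager to move part of the surplus

print(plan_move(10, {"N": 1, "E": 8}, {"N": True, "E": True}))  # -> ('N', 4)
```

Note that the agent only issues the request; the movement itself is carried out by the memory well and the switch, in keeping with the layered design.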

4.3. Fault-tolerance mechanism

In this paper we apply a well-known approach to fault tolerance usually indicated as “cell exclusion”, that is, the virtual operation of cell removal, impossible on silicon. Since our links are not configurable but switches continuously perform routing operations, our cell exclusion works differently compared to the other approaches presented in Section 2. The search for faults occurs before a task distribution, by means of self-tests applied to the computational part of the hardware agent. Since


Fig. 5. Workload information propagation in case of fault. In (a) the workloads of five cells, in (b) the workload information propagation in the absence of faults, in (c) the workload information propagation in case of fault of the central cell.

it is impossible to recover a “partial fault” that could happen in a part of the datapath, the process tests the whole datapath at the same time, and if it experiences some trouble in obtaining the correct result, the hardware agent declares its inability to process data, communicating this state by means of a one-bit signal to both the memory well and the switch. Since stigmergic behaviors require a hardware agent to know the actual workload of its neighbors, some problems arise at this point:

• if other hardware agents see that a memory well is empty, they will try to fill it up with their data to allow cooperation;

• if the switch associated with the faulty agent blocks data movement towards the memory well beneath, the faulty cell determines a zone of no cooperation or prevents data distribution to the other cells in the column;

• if the switch bypasses the memory well beneath combinatorially, then the fault of a cell alters the critical path of the system.

To avoid the first problem, the switch, informed of a fault, plugs the memory well beneath, therefore avoiding further data loading. At the same time, to avoid blocks in the stigmergic process, the switch exports to its neighbors the workload information of the opposite neighbor, therefore masking the fact that the memory well beneath is empty (Fig. 5). If the memory well is not empty, it can be completely emptied. Finally, the switch routes the incoming data in the opposite direction without special paths, just introducing a few further clock cycles to accomplish the data movement. In sequential architectures there are two possible kinds of faults: temporary and permanent ones. Temporary faults are mainly due to invalid state transitions. Such faults can be detected, and the cell-exclusion mechanism isolates the faulty cell during task execution. By means of a reset procedure, or after power-off, the “fault” flags are cleared in all the cells and the self-test procedure starts again, so that temporary faults can be removed and permanent ones will be detected again. Permanent faults are due either to defects occurring during the manufacturing process or to electrical phenomena during the lifetime of the system. As in most fault-tolerant platforms, the computational part is the only one tested for faults, but for future

releases we are extending fault detection to the other parts ofthe cell.
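The fault-masking behavior of the switch can be sketched as a small behavioural model. This is an illustrative Python sketch, not the paper's RTL Verilog; the class and method names (`Switch`, `exported_workload`) are our own assumptions.

```python
class Switch:
    """Toy model of a cell's switch reacting to the one-bit fault signal."""

    def __init__(self, well_fill):
        self.well_fill = well_fill   # items in the memory well beneath
        self.faulty = False          # fault flag raised by the hardware agent

    def on_fault(self):
        # Plug the well: no further data loading; residual data are emptied.
        self.faulty = True
        self.well_fill = 0

    def exported_workload(self, opposite_neighbor_fill):
        # A faulty cell's switch advertises the opposite neighbor's workload,
        # masking the empty well so the stigmergic process is not misled.
        if self.faulty:
            return opposite_neighbor_fill
        return self.well_fill

healthy = Switch(well_fill=5)
broken = Switch(well_fill=3)
broken.on_fault()
print(healthy.exported_workload(8))  # 5: its own workload
print(broken.exported_workload(8))   # 8: the opposite neighbor's workload
```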

4.4. Border cells

Even if the fabric consists of undifferentiated cells, some extra border cells are needed for the distribution and finalization of tasks. These special cells work only with respect to their column, and are placed only on the southern border of the fabric, as shown in Fig. 1. During task distribution these cells create the packets and send them into the fabric. This process is accomplished knowing how many faulty cells are present in the column. Every border cell obtains this information from the cells that form its column by means of an incremental, fully decentralized counting procedure. If the column above a border cell is full and no more data can be passed to the array, the border cell enables cooperation, waiting until some space becomes free in the wells to continue distribution. In this manner, if there are some faulty cells in the column and a task cannot be completely allocated, the result is a temporary performance degradation, but the task is still executable. Stigmergy can completely mask this problem if the adjacent columns are sufficiently unloaded. Border cells accommodate an accumulator and a sorting unit for task finalization. Border cells involved in the same task cooperate at the end of it to extract the final result (see Section 5 for further details).
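The incremental counting procedure can be modelled as each cell adding its own fault bit to a running count received from the cell above and forwarding the sum, so that the border cell learns the number of faulty cells through purely local exchanges. A minimal sketch under these assumptions (the function name is ours, not the paper's):

```python
def column_fault_count(fault_flags):
    """fault_flags: per-cell fault bits (0/1), listed top to bottom.
    Each hop adds the local flag to the incoming count and passes it on;
    the returned value is what the border cell finally receives."""
    count = 0
    for flag in fault_flags:
        count += flag
    return count

print(column_fault_count([0, 1, 0]))  # 1 faulty cell in a 3-cell column
```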

5. Algorithm execution flow into the fabric

For this preliminary implementation, we have conceived a system able to work on some general array-processing operations, which can be grouped into two categories: element-by-element array operations, and cumulative ones. Further operations are under evaluation at the moment, even if the peculiarity of the architecture is to work with arrays and matrices. The fabric cannot accept more tasks than the number of its columns. From the outside, a task is simply a pair of arrays and the operands that must be applied to the pairs of corresponding elements. If the operation remains the same for all the pairs of elements, its opcode can be passed to the chosen border cell only at the beginning of the task. Sorted or scalar results are provided by the fabric at the end of the computation.


5.1. Element-by-element array operations

This kind of operation is characterized by the repetition of the same operation between a scalar and all the elements of an array, or between the homologous elements of two arrays. Typical examples are array summations or the multiplication of an array by a scalar. Since the result is an array of the same dimension as the one passed as second operand, and the operation is performed by a set of agents working on small sections of the array, the problem of result sorting arises (the element in position i leads to a result that must be placed in position i of the output array). For this reason, operations in this category require careful attention in data distribution. Even if cells start processing incoming data as soon as they have the first element, for element-by-element operations every memory well is assigned a fixed number of elementary operations equal to the memory well size. After this number has been reached, incoming data are passed by the switch to the next one, even if the memory well beneath has been partially emptied. The element index is embedded in the data packet, so that border cells can store the sorted results in the output buffer. The presence of faulty cells is negligible, since element index tags are attached to data packets and not to cells.
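The role of the embedded element index can be illustrated with a toy model: packets carry the index, agents may complete them in any order (e.g. after stigmergic data movement), and the border cell writes each result at its tagged position. An illustrative Python sketch, not the actual packet format:

```python
import random

def run_elementwise(a, b, op):
    # Each packet carries the element index of its operand pair.
    packets = [(i, x, y) for i, (x, y) in enumerate(zip(a, b))]
    random.shuffle(packets)          # out-of-order completion by the agents
    out = [None] * len(a)
    for i, x, y in packets:          # border cell stores results by tag
        out[i] = op(x, y)
    return out

print(run_elementwise([1, 2, 3], [10, 20, 30], lambda x, y: x + y))
# [11, 22, 33], regardless of the arrival order
```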

5.2. Cumulative operations

These operations always require an accumulation, producing a scalar result. They exploit the cooperative approach introduced in this paper better than the others. As in the previous case, cells do not need any information about their position in the column, and the output of a cell is the summation (saturated or not) of single additions or products. If a cooperation takes place, the involved cell accumulates in its output register the results obtained from data coming from the same column, even if it is different from the one where the cell resides. If a cell encounters an operation coming from a column different from the previous one, the accumulated result is returned and a new MAC or ACC starts. Every cell propagates the accumulated result, by means of its switch, towards the original column and hence towards the border cell of that column, which performs the final accumulation. If a task is spread over more than one column, the final accumulation is performed by the leftmost involved border cell.
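A cooperating cell's accumulation rule can be sketched as follows: consecutive packets tagged with the same source column are accumulated, and a change of column flushes the partial result back towards that column's border cell. This is an illustrative Python model under those assumptions; the names are ours:

```python
def cell_mac(stream):
    """stream: (src_column, a, b) packets reaching one cooperating cell.
    Yields (src_column, partial_sum) whenever the source column changes,
    modelling the partial result routed back to that column's border cell."""
    cur_col, acc = None, 0
    for col, a, b in stream:
        if cur_col is not None and col != cur_col:
            yield cur_col, acc       # flush the partial MAC
            acc = 0
        cur_col, acc = col, acc + a * b
    if cur_col is not None:
        yield cur_col, acc           # flush the last partial

print(list(cell_mac([(0, 1, 2), (0, 3, 4), (1, 5, 6)])))
# [(0, 14), (1, 30)]
```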

5.3. Main execution steps

Even if the decentralized nature of our approach prevents us from defining a “linear” execution flow, it is possible to describe the overall task execution as a series of steps. At the beginning, a border cell receives a request from the outside, with 2 arrays of data and 1 operand (or 1 array of operands). The border cell pre-alerts the cells in its column that a new task is available. They then block incoming data due to stigmergy, perform the residual operations stored in their memory wells, and hence perform a self-test. When they are ready,

the border cell creates the packets and sends them to the cells of its column. During packet distribution, stigmergy is inhibited to avoid useless data movement in a rapidly changing environment. At the end of the distribution of that task, the border cell enables the stigmergic behaviors, and hardware agents can perform data movement to smooth over the workload differences. At this time it is possible to spread data to other columns, or to unloaded wells of the same column. It should be noted that such operations are performed without any global knowledge of the actual state of the architecture. Hardware agents perform the computations and send the single/accumulated results back to the border cell. The mechanism of relative displacement indexes allows the results to be returned to the proper column even if they have been obtained in a different column. The border cell performs the final accumulations or the result sorting by itself.
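The steps above can be condensed into a toy behavioural simulation (here for a dot product, i.e. a MAC task). The round-robin distribution and the stigmergy flag are our own simplifications, not the hardware protocol:

```python
def execute_task(a, b, column_size=3):
    # 1. Pre-alert: cells drain their wells and self-test (modelled as reset).
    wells = [[] for _ in range(column_size)]
    stigmergy_enabled = False        # 2. inhibited during packet distribution
    for i, (x, y) in enumerate(zip(a, b)):
        wells[i % column_size].append((x, y))
    stigmergy_enabled = True         # 3. agents may now smooth the workload
    # 4. Agents compute; the border cell performs the final accumulation.
    return sum(x * y for w in wells for (x, y) in w)

print(execute_task([1, 2, 3, 4], [1, 1, 1, 1]))  # 10
```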

6. Experimental results

The proposed architecture is fully scalable and flexible. As a consequence, any fabric of the kind shown in Fig. 1 can be implemented without any modification to the cell. Obviously, some changes in the cell structure can cause changes in some parameters, supported by a fully flexible parametric RTL Verilog description. For this preliminary implementation we have created a fabric composed of nine cells organized in a square matrix, plus an additional row of special border cells on its bottom. The overall system has been synthesized using Synopsys 2003.06 on a standard-cell CMOS 0.13 µm technology. The synthesis results show that the area occupied by a fabric sized as above is about 360 K equivalent gates, and the clock frequency is 600 MHz. Even if the communication layer could run at 1.3 GHz, in this proof-of-concept implementation we have not applied any advanced clock strategy. More interesting are the synthesis results for a single cell, since the fabric is fully scalable. Every cell requires only 6800 equivalent gates of logic, and 30 800 equivalent gates for memories (82% of the overall cell area). Fig. 6 shows the percentage of area for the three main modules of the cell: memory well, switch and hardware agent. The figure shows how low the cost of smartness is in our system, and how lightweight the control for stigmergic behaviors is. Better area results can be obtained by using custom memories rather than flip-flops for memory synthesis.

6.1. Simulations results

We have tested this small system in different task-distribution situations, always obtaining interesting results. Having to choose the most representative among them, we have selected four configurations, illustrated in Fig. 7. In all the simulations, datasets have been generated randomly. The tasks involved in the simulations assume completely filled memories (hereafter called full tasks): this is the worst case, and it does not preclude the possibility of using the architecture with different loads. For every configuration we have performed six simulations with three different algorithms to test the system exhaustively. In particular,


Fig. 6. Relative silicon area for the different parts of a cell: (a) the three modules of the cell; (b) the two parts of every module.

Fig. 7. Performance comparison with and without the stigmergic approach in the four different task situations described in Section 6.1.

we have chosen two element-by-element array operations (a summation, i.e. the ADD operation, and a multiplication, i.e. the MUL operation) and one dot product (i.e. the MAC operation). To verify the influence of stigmergy, for every algorithm we have performed simulations allowing or inhibiting it. For every test, a histogram highlights the difference in terms of percentage latency in the two cases.

6.1.1. First simulation: 1-cell full task

Fig. 7(a) shows the results for this simulation. The central cell nearest to the southern border was involved in a task consisting of 32 elementary operations. The cell cannot show any stigmergic behavior until the end of the task distribution. Stigmergy allows impressive improvements in all cases. However, since the ADD operation requires a number of clock cycles close to that required for data movement, it is clear that stigmergy cannot improve performance to the fullest. Vice versa, MUL and MAC require more cycles than data movement, and hence stigmergy leads to better results. Regarding the difference between the MAC and MUL operations, it should be remembered that data complexity can influence latency in an unpredictable way, since multiplication latency depends on this parameter. Even if all cells cooperate in task execution, performance cannot reach the theoretical maximum, since all data movements



Fig. 8. Workload smoothing due to the stigmergic behaviors during the simulation presented in Section 6.1.4.

are generated by a single cell, and hence there is an overhead due to task spreading.

6.1.2. Second simulation: 3-cell full task

Fig. 7(b) shows the results for this simulation. The central cells were involved in a task consisting of 96 elementary operations. The evidently smaller influence of stigmergy is due to the fact that hardware agents cannot cooperate during task distribution, and if the operations are very fast (ADD) there are very few cooperations. This also happens for the other two operations, even if in their case the effect is less evident.

6.1.3. Third simulation: 2-cell full task

Fig. 7(c) shows the results for this simulation. The central cells were involved in a task consisting of 64 elementary operations, but there is a faulty cell in the middle. Simulations performed with and without the fault show no difference between the two cases. The presence of a fault neither prejudices the correct operation of the fabric nor increases latency.

6.1.4. Last simulation: two 3-cell full tasks and one 1-cell full task

Fig. 7(d) shows the results for this simulation. This is a multitasking situation where all the explicitly involved cells are performing the same kind of task. It should be noted that stigmergy still improves system performance even if the system is almost completely occupied. This histogram should be read carefully, since every pair of bars always represents the same comparison as the previous ones. It is obvious that task (3) ends before the others, but the histogram reports only the advantage of adopting stigmergy or not.

6.2. Stigmergic behaviors

To allow a better understanding of the stigmergic behaviors inside the fabric, we have taken four snapshots of the workload distribution at different moments of the last simulation (see Section 6.1.4), limited to the MAC operations and with stigmergy enabled. The results are depicted in Fig. 8. At the beginning (Fig. 8(a)), during data distribution, hardware agents cannot perform data movement for stigmergic purposes: data are moved only for the initial distribution. All the workload is concentrated on the 3 southern cells, near the left wall in the 3D plot. Later (Fig. 8(b)), task (3), which consists of only 32 elementary operations, is near its end. Since the distribution for that task is finished, the agents of that column can start stigmergic behaviors, whereas the other two columns are still involved in task distribution. After a while (Fig. 8(c)), task distribution is finished in all the columns, and task (3) is near its end. At this point the stigmergic behaviors are applied by all the hardware agents, to exploit the actually available


resources to the fullest. They smooth over the gradient in workload concentration, leading to a uniform distribution (Fig. 8(d)) that is maintained until the end of the computation.
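The smoothing behavior can be reproduced with a purely local rule: at every step, each well sends one item to an adjacent well holding at least two items fewer, using only neighbor information, until adjacent wells differ by at most one item. A minimal illustrative model (1-D, one entry per column), not the actual agent logic:

```python
def smooth(wells, max_steps=100):
    wells = list(wells)
    for _ in range(max_steps):
        moved = False
        for i in range(len(wells) - 1):
            # Move one item towards the emptier neighbor, local info only.
            if wells[i] - wells[i + 1] >= 2:
                wells[i] -= 1; wells[i + 1] += 1; moved = True
            elif wells[i + 1] - wells[i] >= 2:
                wells[i + 1] -= 1; wells[i] += 1; moved = True
        if not moved:                # no remaining gradient to smooth
            break
    return wells

print(smooth([9, 0, 0]))  # [4, 3, 2]: neighbors now differ by at most one
```

No item is created or destroyed, so the total workload is conserved while the gradient disappears, mirroring the snapshots in Fig. 8.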

7. Conclusions

In this paper we have presented a novel architecture that exploits a set of undifferentiated cells to perform complex computations. Beyond normal parallelization, our system exploits the bioinspired approach to collective labor that can be found in different swarm systems. In particular, the system is based on stigmergic interactions, i.e. the indirect interaction of individuals by means of modifications to the environment. These modifications can be due to actions aimed at obtaining this result (data movement, in our system) or to side effects of normal operation (data consumption and the consequent emptying of wells). The collective effort of the swarm in algorithm execution improves system performance, entails better hardware usage, and can deal with faults without any need for reconfiguration. In fact, the swarm composition can be varied even during system operation without any structural hardware modification. This is due to the loose structure of swarms, which are usually considered super-organisms even if they are not really organisms but just aggregations of individuals: their apparently coordinated behavior arises from the simple interaction rules that exist within the swarm.

Simulation results clearly show the potential of the approach, even in this preliminary implementation. The system shows: adaptability to the environment (e.g. data complexity, kind of operation required, presence of plugged wells due to faults, different task allocations, and so on), flexibility (the number of possible operations for the hardware agents can be extended) and scalability (the fabric size is independent of the structure of the cells). Furthermore, the power of stigmergic interactions keeps the control of the hardware agents simple compared to other techniques. Further improvements to speed up the operations and to analyze other strategies are in progress, such as the extension of the fault-detection procedure to test the communication layer rather than only the hardware agents' datapaths.

Acknowledgements

The authors wish to thank Gianmarco Angius for his invaluable contribution to this work.

References

[1] A. Abnous, Low-power domain-specific processors for digital signal processing, Ph.D. Dissertation, Department of EECS, UC Berkeley, CA, USA, 2001.

[2] A. Avizienis, Towards systematic design of fault-tolerant systems, IEEE Trans. Comput. 30 (4) (1997) 51–58.

[3] E. Bonabeau, M. Dorigo, G. Theraulaz, Swarm Intelligence: From Natural to Artificial Systems, Oxford University Press, Oxford, 1999.

[4] D. Bradley, A. Tyrrell, The architecture for a hardware immune system, in: Proceedings of the Third NASA/DoD Workshop on Evolvable Hardware, Long Beach, CA, USA.

[5] T. Chang, Y. Chu, C. Jen, Low power FIR filter realization with differential coefficients and inputs, IEEE Trans. Circuits Systems II 47 (2) (2000) 137–145.

[6] K. Compton, S. Hauck, Reconfigurable computing: a survey of systems and software, ACM Comput. Surveys 34 (2) (2002) 171–210.

[7] S. Goldstein, H. Schmit, M. Moe, M. Budiu, S. Cadambi, R. Taylor, R. Laufer, PipeRench: a coprocessor for streaming multimedia acceleration, in: Proceedings of the 26th Annual International Symposium on Computer Architecture, Atlanta, GA, 1999.

[8] P. Grassé, La reconstruction du nid et les coordinations interindividuelles chez Bellicositermes natalensis et Cubitermes sp. La théorie de la stigmergie: essai d'interprétation des termites constructeurs, Ins. Soc. 6 (1959) 41–83.

[9] R. Hartenstein, A decade of reconfigurable computing: a visionary retrospective, in: Proceedings of Design, Automation and Test in Europe (DATE'01), 2001, pp. 642–649.

[10] J.H. Hauser, Augmenting a microprocessor with reconfigurable hardware, Ph.D. Dissertation, Department of EECS, UC Berkeley, CA, USA, 2000.

[11] J. Kennedy, R. Eberhart, Y. Shi, Swarm Intelligence, Morgan Kaufmann, Academic Press, New York, 2001.

[12] D. Mange, M. Sipper, A. Stauffer, G. Tempesti, Toward robust integrated circuits: the embryonics approach, Proc. IEEE 88 (4) (2000) 516–541.

[13] A. Marshall, T. Stansfield, I. Kostarnov, J. Vuillemin, B. Hutchings, A reconfigurable arithmetic array for multimedia applications, in: Proceedings of the Seventh ACM International Symposium on Field-Programmable Gate Arrays, Monterey, CA, 1999, pp. 135–143.

[14] M. Millonas, Swarms, phase transitions, and collective intelligence, Proc. Artificial Life III (1994) 417–445.

[15] T. Miyamori, K. Olukotun, REMARC: reconfigurable multimedia array coprocessor, IEICE Trans. Inform. Systems E82-D (2) (1999) 389–397.

[16] J. von Neumann, in: A.W. Burks (Ed.), Theory of Self-Reproducing Automata, University of Illinois Press, Illinois, 1966.

[17] C. Ortega, A. Tyrrell, Biologically inspired fault-tolerant architectures for real-time control applications, Control Eng. Practice 30 (4) (1999) 673–678.

[18] D. Pani, L. Raffo, A swarm intelligence based VLSI multiplication-and-add scheme, in: Proceedings of the Eighth Parallel Problem Solving from Nature—PPSN VIII, Birmingham, UK, 2004, pp. 362–371.

[19] S. Ramprasad, N. Shanbhag, I. Hajj, Decorrelating (DECOR) transformations for low-power digital filters, IEEE Trans. Circuits Systems II 46 (6) (1999) 776–788.

[20] N. Sankarayya, K. Roy, D. Bhattacharya, Algorithm for low power and high speed FIR filter realization using differential coefficients, IEEE Trans. Circuits Systems II 44 (1997) 488–497.

[21] M. Sipper, Evolution of Parallel Cellular Machines: The Cellular Programming Approach, Springer, Heidelberg, 1997.

[22] G. Tempesti, A self-repairing multiplexer-based FPGA inspired by biological processes, Ph.D. Dissertation, EPFL, Lausanne, Switzerland, 1998.

[23] G. Tempesti, D. Roggen, E. Sanchez, Y. Thoma, R. Canham, A. Tyrrell, J. Moreno, A POEtic architecture for bio-inspired hardware, in: Proceedings of the Eighth International Conference on the Simulation and Synthesis of Living Systems (Artificial Life VIII), Sydney, Australia, 2002.

[24] A. Tyrrell, Computer know thy self!: a biological way to look at fault-tolerance, in: Proceedings of the 25th Euromicro Conference, vol. 2, 1999, pp. 129–135.

[25] X. Zhang, G. Dragffy, A.G. Pipe, N. Gunton, Q. Zhu, A reconfigurable self-healing embryonic cell architecture, in: Proceedings of ERSA'03: The 2003 International Conference on Engineering of Reconfigurable Systems and Algorithms, Las Vegas, USA.

Danilo Pani received his laurea degree in Electronic Engineering from the University of Cagliari (Italy) in 2002. In the same year he joined the Microelectronics Laboratory in the Department of Electrical and Electronic


Engineering of the University of Cagliari, where he is about to receive the Ph.D. degree in Electronic Engineering and Computer Science. His primary research interests are in the area of architectures and systems for digital signal processing, and bioinspired approaches applied to parallel fault-tolerant VLSI architectures.

Luigi Raffo received his laurea degree in Electronic Engineering in 1989, and the Ph.D. degree in Electronic Engineering and Computer Science in 1994,

from the University of Genoa, Italy. In 1994 he joined the Microelectronics Laboratory of the Electronic Engineering Department of the University of Cagliari as an Assistant Professor. Since 1998 he has been a Professor at the same University, where he teaches electronic and system design courses. His main research field is the design of digital/analog devices and systems. In this field he has authored more than 70 international publications and patents. He has been coordinator of EU, Italian Research Ministry, Italian Space Agency, and industrial projects.