
INTERNATIONAL JOURNAL OF COMMUNICATION SYSTEMS
Int. J. Commun. Syst. 2007; 20:1011-1036
Published online 24 October 2006 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/dac.856

Design and implementation of a scalable switch architecture for efficient high-speed data multicasting

Cheng Li*, R. Venkatesan and Howard M. Heys

Faculty of Engineering and Applied Science, Memorial University of Newfoundland, St. John’s, Nfld., Canada A1B 3X5

SUMMARY

This paper presents the design and implementation of a new scalable cell-based multicast switch fabric for broadband communications. Using distributed control and modular design, the multicast balanced gamma switch features a scalable, high-performance architecture for unicast, multicast and combined traffic under both uniform and non-uniform traffic conditions. The important design characteristic of the switch is that a distributed cell replication function for multicast cells is integrated into the functionality of the switch element with the self-routing and contention resolution functions. Thus, no dedicated copy network is required. In the paper, we discuss in detail the design issues associated with the multicast functionality of the switch using 0.18 μm CMOS technology and discuss the scalability of the switch in terms of architectural, implementation, and performance scalability. Synthesized results are provided for measures of circuit complexity and timing. Copyright © 2006 John Wiley & Sons, Ltd.

Received 8 February 2006; Revised 1 July 2006; Accepted 1 August 2006

KEY WORDS: balanced gamma (BG) switch; packet switching; multicast; switch fabric; multistage interconnection network (MIN); self-routing; self-replication; scalability; fault tolerance; VLSI design

1. INTRODUCTION

High-speed networks have become increasingly popular worldwide, driven by the Internet and multimedia applications. Many real-time applications, such as videoconferencing, music on demand, and video on demand, not only demand high bandwidth, but also require messages to

*Correspondence to: Cheng Li, Faculty of Engineering and Applied Science, Memorial University of Newfoundland, St. John's, Nfld., Canada A1B 3X5. E-mail: [email protected]

Contract/grant sponsor: Natural Sciences and Engineering Research Council of Canada
Contract/grant sponsor: Canadian Microelectronics Corporation

be sent to more than one destination. As a result, multicast has become a necessary feature for any switch designed for future broadband communication networks.

ATM-like fixed-size packet switching attracts much interest because of its application in advanced Internet routers and switches. A variable-size incoming IP packet (datagram) is internally segmented into fixed-size ATM-like cells which are switched to the output ports, where they are reassembled into the IP datagram. Borrowing from ATM terminology, we use the term cell to identify the fixed-size packet used in the switch, which can be an ATM cell or any other convenient data format [1].

In [2], an overview of the challenges of state-of-the-art packet switches and routers is provided. The authors addressed three major scalability aspects which are common to most popular switch architectures: implementation, multicasting, and quality-of-service (QoS) support. The first two are the major focus of this paper, which presents a practical, scalable switch architecture.

In this paper, we present the design and implementation of a new scalable cell-based multicast switch architecture that has input and output buffers, a backpressure mechanism and a very high throughput. The input-output buffer design and high throughput ensure that the switch can achieve high performance while keeping the hardware complexity reasonable. The switch is called the balanced gamma (BG) switch. The BG multicast switch uses a multi-path multistage interconnection network (MIN) design. The multicast cell replication function is integrated into the functionality of each switch element (SE) and is performed in a distributed fashion along with routing. Two algorithms, the dynamic-length self-routing and self-replication algorithm and the dynamic-length backpressure algorithm, are proposed to perform efficient routing, replication and acknowledgement. The use of unbuffered SEs ensures proper cell sequencing for all incoming traffic. Scalability is easy to achieve by using the SE as the basic building block.

This paper is organized as follows. In Section 2, we provide background on different approaches to building a multicast switch and some design considerations. Section 3 introduces the architecture of the multicast BG switch. Section 4 presents the self-routing and self-replication algorithm and the dynamic-length backpressure algorithm. Section 5 describes the VLSI design and functional verification of the switch following the Deep Sub-Micron (DSM) design flow recommended by the Canadian Microelectronics Corporation (CMC) [3]. The focus is on the basic building block: the 4 × 4 SE design. Hardware complexity and timing are highlighted during the discussion. Section 6 discusses the scalability of the multicast BG switch with a focus on architectural scalability, implementation scalability, and performance scalability. Conclusions are presented in Section 7.

2. BACKGROUND

There has been a great deal of research into multicast switch architectures since the late 1980s, and many multicast switches have been proposed [4-10]. Multistage interconnection network (MIN) design is preferred due to many desirable features such as self-routing, parallel structure, modularity, constant delay for all input-output pairs and suitability for VLSI implementation. Using MIN design, the multicast function can be achieved either by placing a copy network in front of the routing network (cascade approach) or by integrating the cell replication function into the SE of the MIN (integrated approach).


2.1. Cascade approach to constructing a multicast switch

The intuitive approach to building a multicast switch is to employ a copy network in tandem with a point-to-point unicast routing network. The copy network is a special network which can replicate cells according to the fanout number specified in the header. The routing network uses the output of the copy network as its input and routes each copy to its destination. Many of the proposed multicast switches follow this approach [4-10]. A typical example is Lee's multicast switch [4]. The basic structure of the copy network of Lee's multicast switch consists of a running adder network, dummy address encoders, a broadcast banyan network and trunk number translators. Theoretically, there is no special requirement for the point-to-point routing network: any routing network can be used to route cells from an output port of the copy network to an output port of the multicast switch. However, Lee's multicast switch suffers from two problems. One is overflow, which results when the total requested number of copies exceeds the available number of output ports of the copy network. In this situation, any cell whose fanout is larger than the number of remaining free output ports will be dropped [4, 6], which eventually decreases system performance and throughput. The other problem is output port conflict in the routing network, arising when multiple cells request the same output port simultaneously. Besides these two problems, the memory size of the trunk number translation tables increases significantly as the fanout and the switch size increase. Some modifications have been suggested to improve the design [4, 6]; however, they only mitigate the situation at best while increasing the hardware complexity of the switch.

2.2. Integrated approach to constructing a multicast switch

The integrated approach combines the routing and multicast cell replication functions into a single unified network. The functionality of the SE is enhanced to accommodate both routing and cell replication. To minimize the load that multicast cells bring to the switch, cell replication is performed within the switch fabric only when necessary, as the cell is routed through. This kind of design inherits most of the attractive features of the MIN design. Typical examples include the MOBAS switch [7], the Abacus switch [8], and the PINIUM switch [9]. The problems encountered by the cascade approach no longer exist in the integrated approach. Even though each individual SE must be enhanced to handle both functions, which increases its hardware complexity slightly, the overall complexity of the switch fabric is normally less than the sum of the copy and routing networks, because many resources originally required by both networks, such as the memory components used to store the routing and replication tags, are now shared. Besides reduced hardware complexity, the reliability, scalability, and fault tolerance of the single unified network are also easier to improve. All these benefits make the integrated solution attractive for the architectures of next-generation multicast switches. The multicast BG network proposed in this paper utilizes the integrated approach.

2.3. Buffering considerations

Blocking is a problem with which every switch design must deal. Blocking can happen either internally, when cells contend for the same internal links, or at the output port, when multiple cells request the same output port in one switching cycle. To improve the throughput and mitigate blocking, many solutions have been proposed. The effects of internal blocking can be


minimized by providing internal buffers, or by providing input buffers and incorporating a backpressure mechanism [11]. The former solution is expensive and leads to out-of-sequence receipt of cells in multi-path networks. The second solution is capable of handling incoming traffic efficiently and helps to maintain a high switch throughput, and thus is preferred in broadband communication networks. Output blocking can be controlled by choosing a MIN architecture capable of accepting multiple cells at each output line in each switching cycle. Such MINs, along with input buffers and a backpressure mechanism, are capable of overcoming the drop in performance due to internal blocking as well as output blocking. Any input-buffered switch architecture may suffer from head-of-line (HOL) blocking, in which a temporarily un-transmissible cell at the head of an input buffer impedes the transmission of cells behind it and thus reduces the switch throughput. However, if the MIN possesses very high throughput, e.g. better than 99%, HOL blocking does not significantly degrade performance, because most of the cells can be delivered immediately without being buffered and delayed [12, 13].

Output-buffered switches have been shown to provide the best delay and throughput performance [8]. It is costly to achieve a pure output-buffered switch because the output lines have to operate N times as fast as a normal link, where N is the size of the switch. However, with a high-throughput switch fabric, most of the cells are switched directly to the output buffer, and therefore only small amounts of input buffering are needed to temporarily store the cells that lose contention for an internal link or output port. The input-output buffering approach, when armed with a suitable backpressure mechanism, can provide satisfactory performance and reduce the speed requirements of the output buffer. Switches following this approach include the MOBAS switch [7], the Abacus switch [8], and the balanced gamma switch [14, 15].

3. SWITCH ARCHITECTURE

3.1. Switch architecture

Unlike the banyan network, which utilizes 2 × 2 SEs, the multicast BG switch is based on an integrated approach which utilizes 4 × 4 SEs [15, 16]. Figure 1 shows the architecture of an 8 × 8 multicast BG switch. The basic architecture of an N × N BG multicast switch consists of N input port controllers (IPCs), an N × N multistage interconnected switch fabric (SF) that supports self-routing, copy replication and delivery acknowledgement, and N output port controllers (OPCs). No dedicated copy network is required. The IPC terminates the input signals from the network, strips the information contained in the cell header, and uses a lookup table to determine the destinations. Buffering cells from the input link is considered part of the functionality of the IPC. The bufferless switch fabric is the core of the multicast BG switch. An N × N switch fabric consists of n + 1 stages, where n = log2 N, with each stage consisting of N 4 × 4 SEs numbered from 0 to N - 1. The interconnection pattern between adjacent stages is regular: for all stages, the outlets of each SE are evenly distributed to the SEs in the following stage. We denote by SE_{i,j} the ith SE from the top in the jth stage. The algorithms which specify the interconnection pattern can be found in Appendix A. The last stage is the output buffer stage, which can accept up to 4 cells per output port in one switching cycle. The output buffers can be considered either as the last stage of the SF or as part of an OPC along with other


scheduling hardware. The OPC updates each arriving cell with a new cell header and sends it onto the output link.

3.2. Justification for the architecture

An important decision which has enabled the architecture of the BG switch to produce near-optimal performance even under realistic traffic conditions is the design choice that the output stage of the SF is able to accept up to four cells per output line in one switching cycle. Under unicast uniform random traffic at full load, in any given switching cycle, the probability that i cells arriving at the input lines of a switch are destined to a particular output line is described by the binomial distribution

$$\Pr(i) = \binom{N}{i} \left(\frac{1}{N}\right)^{i} \left(1 - \frac{1}{N}\right)^{N-i} \qquad (1)$$

A similar distribution is obtained through simulation for multicast bursty traffic, as shown in Figure 2, in which an average burst length, $\bar{L}$, of 5 and a mean fanout, $\bar{F}$, of 2 are used. Details of the multicast bursty traffic model can be found in [17].

Equation (1) and Figure 2 clearly demonstrate that there is only approximately a 1% probability that more than four cells arriving at the input lines in any given cycle would try to reach the same output line for a fully loaded switch. It should be noted that this is irrespective of the size of the switch. This observation justifies the architectural decision of the BG switch accepting up to four cells at each output line in one cycle. For the rare cases of more than four cells requesting the same output, the backpressure scheme ensures that the excess cells are held in input queues. In addition, the BG network has a low probability of internal blocking, so the input buffers need only hold the few cells that would otherwise have been dropped.
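To make these numbers concrete, the following minimal sketch evaluates Equation (1) for several switch sizes; the program is illustrative and not part of the paper. For the uniform unicast case it reports well under 1%, consistent with the bound quoted above for the harsher bursty multicast case of Figure 2.

    // Evaluate Pr(more than four of N inputs request one output) from
    // Equation (1) under full-load uniform unicast traffic.
    #include <cmath>
    #include <cstdio>

    int main() {
        for (int N : {32, 64, 128, 256}) {
            double p_at_most_4 = 0.0;
            for (int i = 0; i <= 4; ++i) {
                // log C(N, i) via lgamma for numerical stability
                double log_binom = std::lgamma(N + 1.0) - std::lgamma(i + 1.0)
                                 - std::lgamma(N - i + 1.0);
                p_at_most_4 += std::exp(log_binom + i * std::log(1.0 / N)
                                        + (N - i) * std::log(1.0 - 1.0 / N));
            }
            std::printf("N = %3d: Pr(> 4 requests) = %.4f%%\n",
                        N, 100.0 * (1.0 - p_at_most_4));
        }
        return 0;
    }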

Figure 1. Architecture of an 8 × 8 multicast BG switch (eight IPCs feeding a switch fabric of 4 × 4 SEs in stages 0-2 plus an output stage, followed by eight OPCs).


4. SWITCHING OPERATION AND SUPPORTING ALGORITHMS

In this section, the concept of the three-phase switching operation is reviewed. Two important algorithms, namely the dynamic-length routing and replication algorithm and the dynamic-length backpressure algorithm, are proposed to facilitate switching and acknowledgement inside the multicast BG switch fabric. The tag encoding scheme and the fault-tolerant property of the multicast BG switch are also studied.

4.1. Three-phase switching operation

A three-phase switching operation is performed inside the multicast BG switch [11, 14, 17]. First is the reservation phase, during which the tag of the HOL cell is routed through the SF. Tag information is used internally by the switch fabric for connection setup and is generated and attached to each cell by the IPC. Multicast cell replication is performed implicitly by the SEs along with the routing operation, and only when necessary. Next comes the acknowledgement phase, during which cell delivery information is reported to the IPC by means of the backpressure mechanism. Based on that information, the IPC decides whether the HOL cell should be transmitted or kept in the input buffer for the next cycle, or possibly both in the case of a multicast cell. In the third phase, the payload is transmitted via the established path. Because of the memoryless design of the SF, a cell will either reach the desired output port(s) or be kept in the input buffer. Cell

Figure 2. Output request probability under 100% multicast bursty traffic (probability versus number of inputs requesting a particular output, for switch sizes 32 × 32 to 256 × 256).


sequence can thus be easily maintained. Cell loss occurs only when the input buffer is full and a new cell arrives.
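The three phases can be summarized from the IPC's point of view with the hedged sketch below; sendTag, readAck and sendPayload are hypothetical hooks standing in for the fabric interface, and a '1' in the returned bitmap marks a blocked output port (see Section 4.4). Both bitmaps are assumed to be the same length.

    // One switching cycle as seen by an IPC (illustrative names).
    #include <string>

    void sendTag(const std::string&) {}          // assumed hook: reservation
    std::string readAck() { return "0000"; }     // assumed hook: backpressure bits
    void sendPayload(const std::string&) {}      // assumed hook: data phase

    void switchingCycle(std::string& holTag, const std::string& payload,
                        bool& holRetained) {
        sendTag(holTag);                         // phase 1: reservation
        std::string ack = readAck();             // phase 2: acknowledgement
        sendPayload(payload);                    // phase 3: delivered copies get data
        if (ack.find('1') == std::string::npos) {
            holRetained = false;                 // all copies delivered
        } else {
            // Retain the HOL cell; retry only the blocked copies next cycle
            // (the IPC ANDs its request bitmap with the backpressure bits).
            for (size_t i = 0; i < holTag.size(); ++i)
                if (ack[i] == '0') holTag[i] = '0';
            holRetained = true;
        }
    }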

4.2. Self-routing and self-replication algorithm

Figure 3 shows the diagram of a 4 × 4 SE. The four output links are numbered 0-3 from top to bottom. Link 0 and link 2 are called the upper and lower regular links, respectively, while link 1 and link 3 are called the upper and lower alternate links, respectively. A regular link and its alternate link have the same capability of reaching the same destination. During switching, the regular links are always used first; an alternate link is used only when the corresponding regular link has already been assigned to a connection.

In the multicast environment, tag design becomes more challenging because not only the routing information but also the cell replication information must be carried, and the tag length should be minimized to shorten the reservation phase. In the multicast BG switch, for each SE to make the right routing and replication decision, a 2-bit tag is used by each SE for each input link. The four possible actions taken by an SE are summarized in Table I.

Priority switching is a feature of the multicast BG switch, with up to 8 priority levels supported in our implementation. The SE makes its decision in two steps. First, active cells are sorted based on their priority levels. Output link requests from cells with higher priority are always processed first, until either all incoming cells are processed or all the output links are used up; in the latter case, the remaining low-priority cells are blocked. Cells of the same priority level are processed in sequence in our implementation, i.e. from input link 0 to input link 3. In the second step, paths are established for the winning cells and their tags are halved and transmitted to the SEs in the next stage.

Figure 3. Self-routing and cell replication in the 4 × 4 SE (cells a, b and c arrive on the top three inputs carrying routing and replication tag pairs 11, 10 and 01, respectively).


An example is provided in Figure 3, in which cells arrive on the top three input links. After sorting on the priority tag, the processing order is c → b → a. Following the routing and replication table, cell c is a unicast cell which requests an upper output link and is switched to output 0. Similarly, cell b is switched to output 2. For cell a, the tag pair '11' indicates that replication is required. The available outputs are checked, and cell a is replicated and sent to both the upper and lower alternate output links, links 1 and 3, respectively.

4.3. Dynamic-length bitmap tag encoding scheme

To switch a cell from each input to the required destinations, an internal tag must be carefully and properly selected. First, it should contain enough information for each SE to make the routing and replication decision. However, the tag is an overhead to the system, so its length should be minimized to reduce communication latency and achieve a faster switch rate. In [18], the authors studied different multi-address encoding schemes for multicast and pointed out that a good encoding scheme should minimize the tag length, reduce tag processing time, and support cut-through switching.

The dynamic-length bitmap tag encoding scheme is similar to the bit string encoding scheme described in [18]. The tag size changes as the cell is switched through the switch fabric. In this scheme, the destination requests are represented in bitmap format, with each bit corresponding to an output port. Because fewer output ports are associated with each SE in later stages, the tag length is halved after passing through each stage. Based on the bitmap received, the SE generates the decision pair before sending it to the decoding circuit. The decision pair generation circuit is quite simple: the upper half of the destination bits is used to generate the first bit, and the lower half is used for the second bit. As shown in Figure 4 for the SEs in stage 1 of a 16 × 16 BG multicast switch, a simple ORing operation is sufficient. The sum of the tag lengths over all links along the path determines the duration of the reservation phase if no pipelining is used. Excluding the priority bits, this sum is

$$S_{Tag} = N + \frac{N}{2} + \frac{N}{4} + \cdots + 4 + 2 = 2^{n} + 2^{n-1} + 2^{n-2} + \cdots + 2^{2} + 2 = \sum_{i=1}^{n} 2^{i} = 2N - 2 \qquad (2)$$
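As a quick check of Equation (2), for an 8 × 8 switch (n = 3) the tag shrinks from 8 bits at stage 0 to 4 bits and then 2 bits, for a total of 8 + 4 + 2 = 14 = 2 · 8 - 2 bits.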

Figure 5 shows an example of a multicast cell from input 1 with output requests of 1, 4, 6 and 7. Using the dynamic-length bitmap tag encoding scheme, the decision pair can be extracted from the bitmap string using the circuit depicted in Figure 4. The SE controller sorts the four input

Table I. Routing and replication actions based on tag pair information.

Bit 1   Bit 0   Routing action      Replication action
0       0       Idle (no action)    Idle (no action)
0       1       Lower link          No replication
1       0       Upper link          No replication
1       1       Both links          Replication


lines on their priority first, then interprets and processes the decision pairs in the resulting sequence, i.e. from the highest priority to the lowest. The bitmap tag is then divided into two groups, each with the priority bits attached, and sent to the next stage. The first group is used as the tag for the cell sent to the upper link and the second group as the tag for the cell sent to the lower link. In the example, at stage 0, based on the incoming tag 11010010, SE_{1,0} generates a decision pair of (1,1). Using Table I, the SE knows that both routing and replication should be applied to this cell: one copy is sent to the upper link and the other copy to the lower link. The tag 11010010 is therefore split into two groups, 0010 and 1101, each of which is appended to the priority bits and sent to the upper and lower links, respectively. A similar approach is taken in the SEs of the following stages. In this example, priority bits are not shown. In general, when several cells appear at the inputs of an SE, the priority bits are considered before the routing and replication decisions are taken.
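The per-SE tag handling just described can be sketched in a few lines; the string representation and function names below are illustrative only. The bitmap is written MSB-first, so that, following the Figure 4 convention, the rightmost half covers the upper destinations.

    // Sketch: OR each half of the destination bitmap to obtain the 2-bit
    // decision pair (interpreted via Table I), then split the tag for the
    // next stage. Priority bits are omitted.
    #include <string>

    struct TagDecision {
        bool toUpper, toLower;   // decision pair
        std::string upperTag;    // half-length tag for the upper output link
        std::string lowerTag;    // half-length tag for the lower output link
    };

    TagDecision processTag(const std::string& tag) {   // e.g. "11010010"
        size_t half = tag.size() / 2;
        std::string lowerHalf = tag.substr(0, half);   // "1101" -> lower link
        std::string upperHalf = tag.substr(half);      // "0010" -> upper link
        return TagDecision{
            upperHalf.find('1') != std::string::npos,  // OR over the upper half
            lowerHalf.find('1') != std::string::npos,  // OR over the lower half
            upperHalf, lowerHalf
        };
    }

Applied to the tag 11010010 above, both decision bits come out 1, so the cell is replicated and the halves 0010 and 1101 travel on the upper and lower links, respectively.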

4.4. Dynamic-length backpressure algorithm

An acknowledgement mechanism to report blocking status is needed so that cells can be kept in the input buffer when blocking occurs. As long as cells are queued in the IPC, they will not be lost, although they might be delayed when blocking occurs. The backpressure unit in each SE forms a path to pass blocking information back to the IPC. Compared to that of a unicast switch, an efficient backpressure mechanism for a multicast switch is more complex to design, because the IPC must get an acknowledgement for all possible copies.

In the multicast BG switch, a dynamic-length backpressure algorithm is used. In this algorithm, the blocking status is encoded in bitmap format, with each bit representing the delivery status of one output port.

Figure 4. Decision pair generation in the variable bitmap tag scheme (the priority field and the two halves of the destination bitmap, b7-b0, feed a pair of OR gates producing decision bits T1 and T0 for the decoding circuit of the central control unit).


Each SE sends to the previous stage only the output information with which it is associated. At the output stage (stage n), the only information needed is whether the cell can be placed in the output buffer or not; hence, 1 bit suffices. In the second-to-last stage (stage n - 1), each SE is connected to two output queues in the last stage, so two bits are used as the backpressure output. A similar approach applies to the SEs at all other stages. At stage 0, the SE receives the acknowledgements from its downstream stages and generates its backpressure information of N bits based on the connection type and the blocking condition. The N-bit information is sufficient to decide the blocking status for any cell, unicast or multicast. Upon receiving this information, the IPC makes the final decision on whether all copies of the multicast cell have been successfully received at the output. If so, the IPC removes the HOL cell and tries the next cell during the next switching cycle. Otherwise, the cell is retained in the input buffer and retried until all copies are delivered. Using a work-conserving method as described in [19], for any multicast or broadcast cell, only the blocked copies are retransmitted.

The BG network architecture enables an efficient design and an easy implementation of the dynamic-length backpressure algorithm. The backpressure unit located at each SE builds its backpressure information in two steps: the first step decides whether the cell is blocked at the current stage, and the second step is a simple concatenation operation.

If there is no blocking, then in the case of a unicast cell the SE waits for the acknowledgement from the link where the tag was sent out and concatenates it with a string of zeros. Whether the string of zeros is concatenated before or after the received acknowledgement depends on whether the upper or lower output link was selected.

Figure 5. Dynamic-length bitmap tag and the self-routing and replication algorithm (the tag 1101 0010 from input 1 is halved stage by stage on its way to outputs 1, 4, 6 and 7).


In the case of a multicast cell, the SE simply waits for the acknowledgement strings from the downstream stages and concatenates that from the upper link with that from the lower link. The resulting string is sent to the SE in the upstream stage. If blocking occurs, the SE immediately generates a string of ones as the negative acknowledgement for the copy or cell that loses the link contention. A similar concatenation operation is then performed, with the acknowledgement string received from downstream replaced in this case by the string of ones generated locally at the SE.

For a multicast cell, if one copy loses contention for one link while the other copy wins contention for the other link, the control becomes more complex. In this case, in the backpressure information of this stage, the half of the outputs associated with the blocked link is marked as blocked using the string of ones. For the other half, associated with the successful copy, the SE waits for the downstream backpressure information to come back and concatenates it to generate the complete backpressure bits. The resulting acknowledgement is sent to the upstream stage.
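A hedged sketch of this two-step generation, under the paper's conventions (an all-ones string marks a blocked half; acknowledgement strings are written with the lower-half destinations leftmost), might look as follows; the function name is illustrative. For a unicast cell, the argument for the unused half would simply be a string of zeros, reproducing the zero-padding rule above.

    // AIG/BIG sketch: substitute a locally generated all-ones string for a
    // blocked copy, then concatenate the two halves for the upstream stage.
    #include <string>

    std::string makeAck(bool upperBlocked, bool lowerBlocked,
                        const std::string& ackFromUpper,  // upper downstream SE
                        const std::string& ackFromLower)  // lower downstream SE
    {
        const std::string ones(ackFromUpper.size(), '1'); // BIG: negative ack
        // Lower-half destinations occupy the leftmost bits of the result.
        return (lowerBlocked ? ones : ackFromLower)
             + (upperBlocked ? ones : ackFromUpper);
    }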

An SE needs to know its stage number so that it can determine the length of its acknowledgement information. For SEs at stage i (0 ≤ i ≤ n - 1), the incoming backpressure information is 2^{n-i-1} bits long and the outgoing backpressure information is 2^{n-i} bits long.
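Note that summing the outgoing acknowledgement lengths over all stages gives the same total as the forward tags of Equation (2):

$$\sum_{i=0}^{n-1} 2^{\,n-i} = 2^{n} + 2^{n-1} + \cdots + 2 = 2N - 2,$$

i.e. 30 bits in total for a 16 × 16 switch (n = 4).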

It might seem that more output ports than actually requested would be marked, since all output ports related to an SE are marked as blocked when blocking occurs on the corresponding link. In fact, when the IPC receives the acknowledgement from the SF, an ANDing operation is performed to determine which requests are actually blocked. For example, in a 16 × 16 BG switch, assume that the IPC has a HOL cell with port requests 0, 4, 6, 7, 12 and 15. In the IPC, the port request is represented as 1001000011010001. If the copy to port 4 is blocked at stage 2 while the other copies reach their destinations, the IPC receives the backpressure string 0000000000110000. After a logical AND operation, the result is 0000000000010000. It is then clear that the cell was correctly delivered to all requested ports except port 4. The cell is therefore kept in the input buffer, but only a unicast cell to port 4 is retransmitted.
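This check maps directly onto a couple of lines of code; a minimal sketch with port 15 in the leftmost bit position, matching the strings in the example:

    // AND the request bitmap with the returned backpressure string to find
    // which copies were actually blocked (values from the example above).
    #include <bitset>
    #include <iostream>

    int main() {
        std::bitset<16> request("1001000011010001");      // ports 0,4,6,7,12,15
        std::bitset<16> backpressure("0000000000110000"); // stage-2 SE covers 4,5
        std::bitset<16> blocked = request & backpressure;
        std::cout << blocked << '\n';                     // 0000000000010000
        // Only the copy to port 4 is retried, now as a unicast cell.
        return 0;
    }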

Figure 6 uses the dynamic-length backpressure algorithm to show how self-routing, cell replication, acknowledgement and blocking are handled in an 8 × 8 multicast BG switch. The output requests of the incoming cells are given in parentheses.

In this example, blocking occurs at SE_{2,2} for the unicast cell from input 5 and at SE_{4,2} for a multicast cell from input 4. Cells from the other input ports are all successfully transferred. Backpressure information, as shown on each SE in the diagram, is sent along the backward path until it reaches the IPC. The IPC then knows that the unicast cell is blocked and that the copy of the multicast cell heading towards output 4 is blocked as well. Since only one copy remains, it becomes a unicast cell and will be sent, along with the unicast cell left over at input 5 and the new incoming cells, in the next switching cycle.

4.5. Fault tolerance

Fault tolerance and reliability are desirable features for any practical switch. A fault-tolerant SF is able to send a cell to its destination in the presence of link or SE failures. In the multicast BG switch, each SE has four output links, which are divided into two groups with two links in each group: one is used as the regular link and the other as the alternate link. When the regular link or the downstream SE connected to that link fails, the SE can always re-route the cell to the


alternate link, which has the same capability of reaching the destination. When equipped with a proper fault detection mechanism, the BG architecture can tolerate a single fault and is robust in the presence of multiple faults. Further discussion of the reliability and fault tolerance properties of the BG architecture can be found in [20].

5. DESIGN AND IMPLEMENTATION

It can be shown by analysis [17] that the multicast BG switch demonstrates high performance under various unicast, multicast, and mixed traffic conditions. At the same time, it is scalable in terms of architecture and performance, as discussed in Section 6. Therefore, the research extends to the VLSI design in order to investigate the feasibility of building a practical switch using this architecture. The hardware description language VHDL and 0.18 μm CMOS technology are used. Modularity and implementation scalability are emphasized in the study.

5.1. Digital system design flow and methodology

Generally speaking, an initial design idea goes through several transformations before the final hardware implementation is obtained. At each transformation step, the designer checks the result of the last transformation, adds more information, and passes the design on to the next step. When all the design and verification steps are completed, a stream file describing the mask layer information for the circuit is created. This is the file used for chip fabrication after design rule checking (DRC) is completed.

Figure 6. Dynamic-length backpressure scheme for an 8 × 8 multicast BG switch.


Figure 7 shows the digital system design flow using the Deep Sub-Micron (DSM) technology recommended by CMC [3]. The design flow can be divided into two stages. The first four steps belong to the front-end design stage, in which VHDL and Synopsys tools are used. The remaining five steps comprise the back-end design stage, in which Verilog and Cadence tools are used. Simulation and synthesis are the major concerns of the front-end design: a design idea is converted into a gate-level netlist through this process. In the back-end design, the major interest is the placement and routing of the imported gate-level netlist onto the silicon wafer.

The main design work of the multicast BG switch described in this paper is on the front-end design aspect. In our research, a top-down design and bottom-up implementation approach was adopted. Following the divide-and-conquer strategy, the switch was recursively divided into smaller components, as shown in Figure 8 for a 16 × 16 multicast BG switch.

Figure 7. Digital IC design flow (RTL simulation and synthesis with VHDL and Synopsys tools; scan insertion, gate-level simulation, floorplanning, placement, clock tree generation, and routing and timing verification with Cadence tools, ending with the stream file).


Figure 8. Divide-and-conquer strategy for multicast BG switch design and implementation (top-down decomposition of the 16 × 16 switch into IPC, SF and OPC, down to terminal components such as the priority sorter, buffer banks, sequencers and acknowledgement logic; implementation then proceeds bottom-up).


At the end of the top-down design process, the whole switch design was broken down into a number of smaller but manageable terminal components of the partition tree, shown as the shaded boxes. All terminal components were first designed, simulated, and tested using VHDL. Then, following a bottom-up implementation approach, the lower-level components were used to build higher-level components until the top entity of the design was reached. The top-level design was imported into and synthesized by the Synopsys DesignCompiler [21], provided by CMC [3], and the resulting circuit was tested through a comprehensive functional simulation.

5.2. Switch element design

The switch element is the basic building block of the whole switch fabric. The work of the switch fabric is distributed into the functionality of each SE along the path; therefore, the SE design is the most important task in the switch design. Following the three-phase switching operation, three different paths are involved in the work of each SE: the forward path carries the tag traversing the switch fabric during the reservation phase, the backward path sends the acknowledgement information back to the IPC during the acknowledgement phase, and the data path sends the payload from the input buffer to the required output port during the data transmission phase. The data path is configured during the reservation phase. During data transmission, the whole switch fabric functions like a circuit switch and the payload data is transmitted with no processing.

The internal architecture of each SE contains three major functional blocks: the forward-path control unit (FCU), the admission control unit (ACU) and the backward-path control unit (BCU), as shown in Figure 9. The FCU is used for tag and payload transmission. It comprises the tag receiving buffer bank, the tag pushout buffer bank, the source/path select MUX group, the sequencer, and handshaking control logic. The select MUX group controls whether the tag or the payload is transmitted to the next stage. During payload transmission, the data input line is directly connected to the select MUX group and the two buffer banks are bypassed. The receiving buffer bank is formed by a shift register group sized to the length of the incoming tag. The tag is halved at each SE and loaded into the pushout buffer bank to be transmitted to the next stage.

The BCU is similar to the FCU except that two more circuits are needed: the blocking information pre-generation circuitry (BIG) and the acknowledgement information generation circuitry (AIG). The BIG generates the blocking information when an incoming cell loses its contention for an output at this SE; in this case, for a multicast cell, all copies to be sent from that output link are blocked. The AIG circuit generates the final acknowledgement bits that are passed back to the previous stage. Two scenarios must be considered: (1) a cell successfully gets through this SE, and (2) a unicast cell or part of a multicast cell is blocked during switching in this SE. In the first case, the AIG circuitry simply passes on the received acknowledgement bits (for a unicast cell) or concatenates the acknowledgement bits received from the upper and lower links of the adjacent downstream SEs (for a multicast cell) to form its output acknowledgement. In the second case, the acknowledgement bits for the blocked part are generated by the BIG circuit, while the acknowledgement bits for the part that gets through come from the downstream stages; the AIG circuit then uses both to generate the final acknowledgement of this stage.


The ACU is the heart of the SE: it co-ordinates the work of the different parts of the SE. Its most important task is to make the correct routing and cell replication decisions based on the priority bits and tag bits provided. Priority routing is a feature of the BG multicast switch used to resolve link contention and for quality-of-service purposes. Up to 8 priority levels are supported in our implementation, and only a slight modification is required to support more levels. A modified 3-bit bitonic sorter [22], shown in Figure 10, is used inside the ACU to pre-process incoming cells based on their priority levels. Let S_x, T_x and P_x, on the input side of the sorter, denote the line number, tag and priority of the cell on input link x, where x ∈ {0, 1, 2, 3}.

Figure 9. The internal architecture of a 4 × 4 SE (forward-path control unit with tag receiving and pushout buffer banks and source/path select MUX; backward-path control unit with ACK buffer banks, AIG and BIG circuits; admission control unit co-ordinating both; plus handshake signals to the upstream and downstream stages for data/tag and ACK).


After sorting, the priority bits become useless and are therefore discarded. The line number S_y and the corresponding tag T_y, where y ∈ {a, b, c, d}, appear on the output side of the sorter.

The ACU then processes the tags in the order T_a → T_b → T_c → T_d, with S_y providing the input link number associated with each tag. Output resources are assigned during this process until either all incoming requests are processed or all output links are used up.

For cells of the same priority, in the current design, the cell from the smaller input line number is given higher priority. Therefore, input 0 of each SE has the highest priority, input 1 has priority over inputs 2 and 3, and so on. To achieve better fairness, a pseudo-random number generator could be used inside each sorting element to randomly pick one cell among the cells of the same priority.
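The ordering rule is easy to capture in software; the sketch below replaces the hardware bitonic sorter with a library sort for illustration, keeping the same key (priority first, input line number as tie-break). The assumption that a numerically larger priority field is more urgent is ours, not the paper's.

    // Pre-sort the four input cells: highest priority first, ties broken by
    // the smaller input line number, idle inputs last.
    #include <algorithm>
    #include <array>

    struct Cell {
        int  line;       // S_x: input link number, 0..3
        int  priority;   // P_x: 3-bit priority (assumed: larger = more urgent)
        bool active;     // false for an idle input
    };

    void sortCells(std::array<Cell, 4>& cells) {
        std::sort(cells.begin(), cells.end(),
                  [](const Cell& a, const Cell& b) {
                      if (a.active != b.active) return a.active;  // idle last
                      if (a.priority != b.priority) return a.priority > b.priority;
                      return a.line < b.line;   // current design's tie-break
                  });
    }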

5.3. Functional verification and testing

During the hardware design and implementation process, simulation, testing and verification were carried out at different levels to ensure a correct result. All lower-level components were thoroughly tested and verified to be functionally correct before moving to higher-level components.

SEs at different stages of the implemented 16 × 16 BG switch were simulated to ensure that the routing and replication functions were performed correctly. Figure 11 shows one simulation example for a 4 × 4 SE in stage 2. A 5 ns clock signal is applied, and the signal CURRENT gives the current state of the SE. In this example, three active cells arrive at this SE from inputs 0, 2 and 3. Cell tag and priority are marked on the waveform for clarity. The cell from input 2 is

Figure 10. Bitonic sorter for cell priority sorting.

DESIGN AND IMPLEMENTATION 1027

Copyright # 2006 John Wiley & Sons, Ltd. Int. J. Commun. Syst. 2007; 20:1011–1036

DOI: 10.1002/dac

a multicast cell, but because it has the lowest priority, its destination request to the upper link cannot be satisfied. The acknowledgement for this cell is a concatenation of the received acknowledgement (01) for the successfully delivered copy and the blocking information from the BIG circuit (11) for the blocked copy, resulting in 1101 being sent to the upstream stage.

For the complete 16 × 16 multicast BG switch, it is too complex and difficult to confirm correct switching operation through waveform observation alone. Hence, an alternative computer-aided verification method, combining a high-level language with the hardware description language, was employed. A C/C++ program was developed to generate cells to be switched by the fabric. The generated data was stored in data files which emulate the input queues, one for each input port. During each switching cycle, the testbench program checks the acknowledgement output from the IPC. When it is negative, a cell is retained at the HOL position and will be switched again during the next cycle. Otherwise, the testbench reads a new cell from the data file, until either all data in the file has been read or the testing duration is reached.

Figure 11. Simulation of a single SE at stage 2 of the 16 × 16 BG multicast switch (the waveform shows the incoming tags and priorities, a partly blocked multicast cell, and the resulting acknowledgement strings, including the concatenated 1101).


To verify the correctness of switching by the hardware, data at different places is recorded into different output files by the VHDL testbench program. Data recording is performed on a switching-cycle basis. When simulation is completed, the data files are collected and fed to a C/C++ program for analysis and verification. The verification process is also carried out on a switching-cycle basis. Payload data is used to determine the input port of each cell that arrives at an output queue. Once all output matching is completed, the tag of the HOL cell is updated with the delivered cell(s) sent from its input line. If the resulting tag array elements are all zero, the next cell from the input queue is used as the HOL cell for the next switching cycle; otherwise, the HOL cell is retained and switched again. This process repeats for every switching cycle until all the data is processed. Whenever there is a mismatch, an error indication is generated by the C/C++ program.
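The cycle-by-cycle drive-and-check loop can be sketched as follows; switchCycle() stands in for the simulated fabric plus IPC, and the one-cell-per-line file layout is an assumption made for illustration.

    // Emulate one input queue from a data file: a negative IPC acknowledgement
    // retains the HOL cell for the next cycle, otherwise a new cell is read.
    #include <fstream>
    #include <string>

    bool switchCycle(const std::string& holCell) {
        (void)holCell;       // stub standing in for the VHDL fabric and IPC
        return false;        // true would mean a negative acknowledgement
    }

    void driveInputQueue(const std::string& fileName) {
        std::ifstream queue(fileName);   // data file emulating the input queue
        std::string holCell;
        bool retained = false;
        while (retained || std::getline(queue, holCell))
            retained = switchCycle(holCell);
    }

    int main() { driveInputQueue("input_port_0.dat"); }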

5.4. VLSI implementation

The modular structure of the multicast BG switch facilitates its VLSI implementation. Many aspects need to be considered during a digital design implementation, such as speed, power consumption, and chip size. Here, hardware complexity and timing are the two most important measures addressed. The results are collected from the Synopsys synthesis tool DesignCompiler [21], using the standard cell library targeting 0.18 μm CMOS technology. The core logic area is given in μm² and the gate count is given with reference to the area of a two-input NAND gate. Tables II and III summarize the complexity of the individual stage components of the 16 × 16 switch, as well as the whole SF for various switch sizes.

With the 5 ns clock, the 16 × 16 switch fabric can run at a link speed of 200 Mbps, and no negative slack was reported in the timing report files. Therefore, the switch fabric can comfortably serve OC-3 links, which yields a switching capacity of greater than 3 Gbps for the overall switch (16 links × 200 Mbps = 3.2 Gbps). By increasing parallelism and with more advanced fabrication technologies, such as 0.09 and 0.065 μm CMOS, the single-link and overall speeds that the design can support will be even higher.

Table II. Hardware complexity of the 16 × 16 multicast BG switch.

16 × 16            Stage 0    Stage 1    Stage 2    Stage 3    Total
Total area (μm²)   274 024    351 514    387 772    333 252    1 346 562
Gate count         22 466     28 819     31 792     27 322     110 400

Table III. Hardware complexity for switch fabrics of various sizes.

Switch size      Gate count
8 × 8            33 804
16 × 16          110 400
32 × 32          350 500
64 × 64          1 139 835
128 × 128        3 822 604
256 × 256        13 877 837
512 × 512        51 060 398
1024 × 1024      204 127 457


6. SCALABILITY

The scalability of the multicast BG network can be studied from three different standpoints, namely architectural scalability, implementation scalability and performance scalability.

6.1. Architectural scalability

The multicast BG switch has a scalable architecture: a larger switch can be efficiently constructed using a smaller switch as the building block. For example, the switch size is doubled by duplicating the smaller switch and adding one stage of 4 × 4 SEs at the front, as shown in Figure 12. Only one input link of each SE in stage 0 will be used.

Figure 12. Architectural scalability of the BG multicast switch (a 16 × 16 switch built from two 8 × 8 modules, themselves built recursively from 4 × 4 and 2 × 2 modules, with a new stage 0 of 4 × 4 SEs added at the front).


The remaining input links can either be grounded or be used as alternative links by the IPC to achieve better fault tolerance and reliability.

6.2. Implementation scalability

The main factor considered in implementation scalability is the overhead associated with the reservation and acknowledgement phases, which is directly proportional to the lengths of the routing tag and acknowledgement tag. The overhead associated with the multicast BG switch differs somewhat from that of the unicast BG switch [13, 14]. For both unicast and multicast BG switches, the tag received by SEs in the first stage is the longest. For the unicast BG switch, the longest tag is n bits, which is much shorter than the payload. For example, for a 256 × 256 switch, the tag sent to stage 0 is 8 bits for routing plus the priority bits and an active bit (which indicates whether the cell is active) [13, 20]. Compared to the payload, such as an ATM cell of 48 × 8 = 384 bits, this overhead is almost negligible. In the multicast BG switch, more information is carried by the tag: the maximum tag length becomes N bits, growing linearly with the switch size. Using the same example, for a 256 × 256 multicast BG switch the maximum tag length becomes 256 bits. In our research, we have attempted to improve this situation. The explicit active bit, which is required in the unicast switch, has been removed, and the dynamic-length algorithm halves the tag length at every stage along the path. For an incoming cell to traverse the whole SF, the sum of the tag bits to be transmitted is 2N - 2; this corresponds to n² in the case of a unicast switch. We conjecture that the total tag length for any integrated multicast MIN switch that supports implicit cell replication must be at least O(N).
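Putting numbers on this comparison for the 256 × 256 example (n = 8), with a 384-bit ATM payload per cell:

$$S_{Tag}^{\,multicast} = 2N - 2 = 510 \text{ bits}, \qquad S_{Tag}^{\,unicast} \approx n^2 = 64 \text{ bits},$$

so the multicast reservation overhead, although greatly reduced by the halving scheme, remains linear in N, which motivates the O(N) conjecture above.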

6.3. Performance scalability

Performance scalability describes how the BG performance is maintained as the switch size grows. Here, performance scalability is demonstrated in terms of the input buffer size needed to achieve a desired cell loss rate and the average cell delay. As the multicast BG network possesses a throughput only slightly inferior to that of an ideal pure output-buffered switch, the input buffer size is a suitable measure of the effectiveness of the switch. Performance measures are obtained through simulation experiments over a period of switching 10⁹ cells across the switch fabric. Figures 13 and 14 give the input buffer size (in cells) needed to achieve a cell loss rate better than 10⁻⁸ and the average cell delay, respectively, for various switch sizes. Even under multicast bursty traffic at a load of 90%, the required input buffering is much smaller than the output buffering. The input buffer requirement scales linearly, and the average cell delay saturates as the switch size grows. The high throughput of the switch fabric is the main reason for this result.

6.4. Scalability of multicast switches with copy networks

It might seem that for large networks, copy-network-based multicast switch architectures would be more attractive, given the overhead resulting from the large tag lengths of the BG architecture. Many research efforts have focused on copy network design and have tried to improve copy network throughput and fairness [4, 6]. However, for this type of switch, after passing through the copy network, replicated cells need to be processed by the trunk number translator before entering the point-to-point switching network. Based on the translation table, cell headers used for routing are appended to each incoming cell. The sizes of the translation tables


increase dramatically with the switch size as well as the cell fanout [23]. Moreover, for several different approaches to building the copy network, such as placing a sorting network in front of the broadcast network [24], placing a running adder network and dummy address encoders in front of the broadcast banyan network [4],

Figure 13. Performance scalability of the multicast BG switch: size of input buffer (in cells) needed to achieve a cell loss rate below 10⁻⁸ versus switch size (multicast bursty traffic, load 90%, mean burst length 5, mean fanout 2, output buffer size 850, 10⁹ cells simulated).

Figure 14. Performance scalability of the multicast BG switch: average cell delay (in switching cycles) versus switch size, under the same traffic conditions as Figure 13.


of the broadcast banyan network [4] or placing a copy network named PCN (plus 2i copynetwork) in front of the Batcher sorting network [25], the complexity of the sorting network orthe running adder network increases significantly for larger switches or higher fanouts.

The Batcher sorting network, which is commonly used as a key part of a copy network, has a complexity of O(N(log2 N)^2) measured in terms of the number of cross-points in the network. The broadcast-bus-based multicast switch and the Knockout switch [26] have a hardware complexity of O(N^2 + NM), where M is the number of broadcast buses used for transmitting multicast cells. Compared with those switches, the hardware complexity of the multicast BG switch, measured in the number of cross-points, is O(N log2 N) (exactly, 16N log2 N − 22N), which is much lower, especially for large switches [16].
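The following minimal Python sketch (our own illustration; the BG expression is exact as quoted above, while the Batcher and Knockout figures are order-of-growth values with constant factors omitted, and the bus count M = 8 is an arbitrary assumed value) evaluates these cross-point expressions side by side:

    import math

    def bg_crosspoints(N: int) -> int:
        # Multicast BG switch: 16*N*log2(N) - 22*N cross-points [16].
        n = round(math.log2(N))
        return 16 * N * n - 22 * N

    def batcher_crosspoints_order(N: int) -> int:
        # Batcher sorting network: O(N * (log2 N)^2); constant omitted.
        n = round(math.log2(N))
        return N * n * n

    def knockout_crosspoints_order(N: int, M: int) -> int:
        # Broadcast-bus / Knockout switch: O(N^2 + N*M), M buses.
        return N * N + N * M

    for N in (64, 256, 1024):
        print(N, bg_crosspoints(N), batcher_crosspoints_order(N),
              knockout_crosspoints_order(N, M=8))  # M = 8 assumed

Even with constants omitted from the competing expressions, the quadratic N^2 term dominates the Knockout count well before N = 1024, while the BG count grows only as N log2 N.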

In short, the difference between the cascaded approach with a separate copy network and the integrated approach is a tradeoff between hardware complexity and latency. As the size grows, the hardware complexity of the former approach becomes so large that the switch may be impractical to implement. For the latter approach, the hardware complexity remains reasonable, but the reservation phase for a cell, which is used to set up the internal path, becomes longer. However, the time taken by the reservation phase can be reduced through design enhancements such as inter-stage link dilation, cut-through switching, and pipelining.

7. CONCLUSION

In this paper, we have discussed the pros and cons of constructing a multicast switch fabric using the cascaded approach and the integrated approach. We have studied the design and implementation aspects of a practical multicast broadband switch fabric architecture that is modular, scalable, and fault-tolerant, and able to handle unicast, multicast, or mixed traffic with high performance. Distributed control is used in the design to meet the high-speed requirement. The similarity in functionality and structure of all the SEs facilitates a modular design and eases VLSI implementation. Multicast cell routing and replication are integrated into the functionality of the switch element. A dynamic-length tag encoding scheme is proposed to support self-routing and cell replication for multicast switching while reducing processing latency, and a dynamic-length backpressure algorithm is presented for the multicast switch fabric to support efficient delivery acknowledgement.

APPENDIX A: BALANCED GAMMA NETWORK TOPOLOGY

A.1. The notation used in the algorithm

SE_{i,j}: switch element at the ith position in the jth stage;
Queue_i: queue at the ith position in the last stage;
n: total number of stages of the switch fabric;
N: size of the switch fabric;
%: modulo operation;
⌊ ⌋: floor operation;
IL_i: input link i;
OL_i: output link i.


A.2. Algorithm pseudocode

For all middle stages:

for (j = 0; j < n − 1; j++)              // j is the stage index
    for (i = 0; i < N; i++)              // i is the row index
        k = i % 4;  l = ⌊i / 2^(n−1)⌋;  a = ⌊l / 4⌋ + l − N/2^i;  b = N/2^(i+2)
        if (0 ≤ k < 2)
            connect OL0 of SE_{i,j} to IL_k of SE_{a,j+1}
            connect OL1 of SE_{i,j} to IL_k of SE_{a+b,j+1}
            connect OL2 of SE_{i,j} to IL_k of SE_{a+2b,j+1}
            connect OL3 of SE_{i,j} to IL_k of SE_{a+3b,j+1}
        else                             // k = 2 or 3
            connect OL1 of SE_{i,j} to IL_k of SE_{a,j+1}
            connect OL0 of SE_{i,j} to IL_k of SE_{a+b,j+1}
            connect OL3 of SE_{i,j} to IL_k of SE_{a+2b,j+1}
            connect OL2 of SE_{i,j} to IL_k of SE_{a+3b,j+1}

For the last stage (j = n − 1):

for (i = 0; i < N; i++)                  // i is the row index
    if (i % 2 == 0)
        connect OL0 of SE_{i,j} to IL0 of Queue_i
        connect OL1 of SE_{i,j} to IL1 of Queue_i
        connect OL2 of SE_{i,j} to IL0 of Queue_{i+1}
        connect OL3 of SE_{i,j} to IL1 of Queue_{i+1}
    else                                 // i % 2 == 1
        connect OL0 of SE_{i,j} to IL2 of Queue_{i−1}
        connect OL1 of SE_{i,j} to IL3 of Queue_{i−1}
        connect OL2 of SE_{i,j} to IL2 of Queue_i
        connect OL3 of SE_{i,j} to IL3 of Queue_i
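As a sanity check of the last-stage wiring above, the following minimal Python sketch (a hypothetical rendering of the pseudocode; the dictionary representation and names are our own, not from the original implementation) builds the mapping from SE output links to queue input links and verifies that every queue receives exactly four feeds:

    # Minimal sketch (assumed representation): last-stage wiring table.
    # Each entry maps (SE row i, output link) -> (queue index, input link).
    def last_stage_wiring(N: int) -> dict:
        wiring = {}
        for i in range(N):
            if i % 2 == 0:
                # Even rows feed queues i and i+1 on input links 0 and 1.
                wiring[(i, 0)] = (i, 0)
                wiring[(i, 1)] = (i, 1)
                wiring[(i, 2)] = (i + 1, 0)
                wiring[(i, 3)] = (i + 1, 1)
            else:
                # Odd rows feed queues i-1 and i on input links 2 and 3.
                wiring[(i, 0)] = (i - 1, 2)
                wiring[(i, 1)] = (i - 1, 3)
                wiring[(i, 2)] = (i, 2)
                wiring[(i, 3)] = (i, 3)
        return wiring

    # Every queue ends up with exactly four feeds, one per input link.
    table = last_stage_wiring(8)
    assert all(sum(1 for v in table.values() if v[0] == q) == 4
               for q in range(8))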

ACKNOWLEDGEMENTS

This work is supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada and in part by VLSI design facilities sponsored by the Canadian Microelectronics Corporation (CMC). Some preliminary results of this work were presented at the 2002 IEEE International Conference on Computer Communications and Networks (IC3N'2002) and the 2006 Queen's Biennial Symposium on Communications (QBSC'2006).

REFERENCES

1. Marsan MA, Bianco A, Giaccone P, Leonardi E, Neri F. On the throughput of input-queued cell-based switches with multicast traffic. Proceedings of IEEE INFOCOM 2001, Alaska, U.S.A., 22-26 April 2001.
2. Chiussi FM, Francini A. Scalable electronic packet switches. IEEE Journal on Selected Areas in Communications 2003; 21(4):486-500.
3. Canadian Microelectronics Corporation (CMC) website: www.cmc.ca.
4. Lee TT. Nonblocking copy networks for multicast packet switching. IEEE Journal on Selected Areas in Communications 1988; 6:1445-1467.
5. Turner J. Design of a broadcast packet switching network. IEEE Transactions on Communications 1988; 36:734-743.
6. Turner J. A practical version of Lee's multicast architecture. IEEE Transactions on Communications 1993; 41:1166-1169.
7. Chao HJ, Choe BS. Design and analysis of a large scale multicast output buffered ATM switch. IEEE/ACM Transactions on Networking 1995; 3:126-138.
8. Chao HJ, Choe B-S, Park J-S, Uzun N. Design and implementation of Abacus switch: a scalable multicast ATM switch. IEEE Journal on Selected Areas in Communications 1997; 15:830-843.
9. Law KLE, Leon-Garcia A. A large scalable ATM multicast switch. IEEE Journal on Selected Areas in Communications 1997; 15:844-854.
10. Zhong WD, Onozato Y. A copy network with shared buffers for large scale multicast ATM switching. IEEE/ACM Transactions on Networking 1993; 1:157-165.
11. Pattavina A. Switching Theory: Architecture and Performance in Broadband ATM Networks. Wiley: New York, 1998.
12. El-Sayed Y, Venkatesan R. Modeling and simulation of the pipelined balanced gamma network. Proceedings of the Fourth IEEE International Conference on Electronics, Circuits, and Systems (ICECS'97), vol. 1, Cairo, Egypt, 1997; 97-101.
13. Venkatesan R, El-Sayed Y, Thuppal R, Sivakumar H. Performance analysis of pipelined multistage interconnection networks. Informatica: International Journal of Computing and Informatics 1999; 23:347-357.
14. Venkatesan R, Mouftah H. Balanced gamma network: a new candidate for broadband packet switching architectures. Proceedings of IEEE INFOCOM'92, vol. 3, 1992; 2482-2488.
15. Li C, Heys HM, Venkatesan R. Design and scalability of the multicast balanced gamma (BG) switch. Proceedings of the Eleventh International Conference on Computer Communications and Networks (IEEE ICCCN'2002), Miami, FL, U.S.A., October 2002; 518-521.
16. Li C, Venkatesan R, Heys HM. VLSI design and implementation of a high-speed multicast switch fabric. 2006 23rd Queen's Biennial Symposium on Communications (QBSC'06), Kingston, Canada, May 2006; 356-359.
17. Li C, Venkatesan R, Heys HM. Performance modelling of the high-speed multicast balanced gamma switch. Proceedings of the IEEE International Conference on Communications (ICC 2006), Istanbul, June 2006.
18. Chiang C-M, Ni LM. Multi-address encoding for multicast. http://citeseer.nj.nec.com/5854.htm, 1994.
19. Prabhakar B, McKeown N, Ahuja R. Multicast scheduling for input-queued switches. IEEE Journal on Selected Areas in Communications 1997; 15:855-866.
20. El-Sayed Y, Venkatesan R, Sivakumar H. Fault tolerance and reliability analysis of the balanced gamma network. International Journal of Parallel and Distributed Systems and Networks 1999; 2(4):244-254.
21. Synopsys Inc. website: www.synopsys.com.
22. Batcher K. Sorting networks and their applications. Proceedings of the AFIPS Spring Joint Computer Conference, Atlantic City, NJ, U.S.A., 1968; 307-314.
23. Chan KS, Yeung YL, Ko KT, Wong EWM. Clos-knockout: a large scale modular multicast ATM switch. IEICE Transactions on Communications 1998; E81-B(2):266-275.
24. Huang A, Knauer S. STARLITE: a wideband digital switch. Proceedings of the Global Telecommunications Conference (GLOBECOM'84), December 1984; 121-125.
25. Chen WT, Deng TW. PPCN: a high-performance copy network for large scale ATM switching systems. IEICE Transactions on Communications 1999; E82-B(1):1-13.
26. Eng KY, Hluchyj MG, Yeh YS. Multicast and broadcast services in a knockout packet switch. Proceedings of IEEE INFOCOM'88, San Francisco, CA, U.S.A., 1988; 29-34.

AUTHORS’ BIOGRAPHIES

Cheng Li obtained his BEng and MEng in Communications Engineering from Harbin Institute of Technology (HIT) in Harbin, China, and a PhD in Electrical and Computer Engineering from Memorial University of Newfoundland at St. John's, Newfoundland and Labrador, Canada. Dr Li is now an Assistant Professor of Electrical and Computer Engineering at Memorial University of Newfoundland. His current research interests include broadband communications networks, switching and routing, software defined radio, and underwater wireless sensor networks.


R. Venkatesan is currently a Professor and the Associate Dean of Graduate Studies and Research of the Faculty of Engineering and Applied Science of Memorial University of Newfoundland, St. John's, Canada. He received his BE (Hons) from Madurai University, and MScE and PhD degrees from the University of New Brunswick, all in electrical engineering. His research interests include architecture and application of parallel processing structures for broadband communication switch fabrics, private key ciphers, codecs, error control codes, fault-tolerant computing, and induction motor drives.

Howard M. Heys obtained a BESc in Electrical Engineering from the University of Western Ontario in London, Ontario, Canada, and a PhD in Electrical and Computer Engineering from Queen's University at Kingston, Ontario, Canada. Dr Heys is now a Professor of Electrical and Computer Engineering at Memorial University of Newfoundland. His current research interests include cryptography and communication networks.
