Network on Chip Based FPGA

Modelling and Prototyping of a Network on Chip

Rickard Holsmark

Magnus Högberg

Master of Science Thesis 2002

ELECTRONICS

Postal address: Box 1026, 551 11 Jönköping. Visiting address: Kyrkogatan 15. Telephone: 036-15 77 00 (switchboard). Fax: 036-12 00 65.

Modelling and Prototyping of a Network on Silicon

Rickard Holsmark

Magnus Högberg

This thesis was carried out at Ingenjörshögskolan i Jönköping (School of Engineering, Jönköping) within the subject area of Electrical Engineering. The work is part of a 60-credit Master's programme. The authors themselves are responsible for the opinions, conclusions and results presented. Supervisor: Shashi Kumar. Extent: 20 credits (D level). Date: Archive number:

Abstract

This thesis describes a design flow for a Network on Chip (NoC), a possible solution for communication in future System on Chip (SoC) designs. NoC is expected to become commercially viable within 5 to 10 years. Because little information is available on the performance of different NoC configurations, one important purpose of the design phase is to build a system-level model that can be used for performance simulations. The system-level model has been built in the language SDL, and a discrete event simulator for the SDL model has been used for the simulations.

The NoC is designed as a packet switched network, with micro-routers placed in a two-dimensional m*n mesh, in this case 4*4, which gives 16 micro-routers. Every router has a connection for a resource, which could be, for instance, a processor, a memory or an FPGA. Another objective is to make a prototype of a NoC in an FPGA. For that purpose VHDL has been used to describe the circuit at a synthesizable level of abstraction.

It is concluded that SDL is useful and relatively easy to use for performance simulations of a NoC, and that the simulations can be used to draw conclusions about design questions. For example, the simulation results showed that increasing the buffer of a switch output from 2 to 3 packets has only a marginal effect on performance. Once the behaviour and structure are described in SDL, the model also serves as a template for the design in VHDL. A small working NoC prototype has been built on an FPGA and tested using the serial port of a PC.

Acknowledgements

We would like to thank Professor Shashi Kumar for his invaluable guidance in this new NoC world. Alf Johansson, programme coordinator, has been very helpful, and for that we are very grateful. Magnus wants to thank his flatmates for putting up with the unwashed plates. Rickard sends special thanks to his wife and daughter and promises to spend more time at home.

Summary

This document describes a design project concerning Network on Chip (NoC), which is a possible solution for communication in future System on Chip (SoC). This solution is not yet on the market but is expected to be commercially usable in 5-10 years. The design is built as a packet switched network, with 16 micro-routers placed in a two-dimensional 4*4 matrix. Each router has a connection to a resource, which can be, for example, a processor, a memory or an FPGA. Because few performance measurements of different NoC configurations are available, simulations are an important part of the design phase. The programming language SDL has been used for these simulations. After that, a VHDL description of the circuit was made in order to allow an implementation in an FPGA.

It is concluded that SDL is relatively easy to use for performance measurements of a NoC. These measurements can then be used for design trade-offs. For example, the results show that it does not pay off to increase the size of the output buffers from 2 to 3 packets, since this has only a marginal effect on performance. Once the behaviour and structure are described in SDL, the description also serves as support for the design in VHDL. A small NoC prototype has been implemented in an FPGA and tested via the serial port of a PC.

Key words

Core Based Design

FPGA

Network on Chip (NoC)

On Chip Communication

Packet Switched Network

SDL

System on Chip (SoC)

VHDL

List of Contents

List of Figures

1 Introduction
1.1 System on Chip
1.2 Network on Chip
1.2.1 NoC Evaluation Tools
1.3 SDL as a modelling platform
1.4 Objectives of the project
1.5 Outline

2 Theoretical Background
2.1 Communication Networks
2.1.1 Communication Techniques
2.1.2 OSI Model
2.1.3 Routers
2.1.4 Buffers
2.1.5 Topologies
2.2 Network on Chip Concepts
2.2.1 Survey of Network on Chip Ideas
2.3 System Level Design
2.3.1 SDL

3 NoC: Design Decisions
3.1 Design Methodology
3.2 Network configuration
3.3 Route and Switch function
3.4 Connections between Nodes
3.4.1 Drop of Packets
3.4.2 Physical issues
3.4.3 Data-link layer connection
3.4.4 Network layer connection
3.4.5 Transport layer
3.5 Packet structures
3.6 Buffers
3.7 Routing Algorithm
3.8 RNI function
3.9 Resource
3.10 Connection to Environment

4 NoC: Modelling in SDL
4.1 Requirements of Model
4.2 System Structure
4.2.1 Design Blocks
4.2.2 Parameterised Mesh size
4.2.3 Micro-Router
4.2.4 RNI
4.2.5 Resource
4.2.6 Description of Common Types
4.3 Design Tool
4.3.1 Simulation Tool
4.4 Simulation Set-up
4.5 Simulation Results
4.5.1 Simulations with equal delay of Switch and Buffer
4.5.2 Simulations with unequal delay of Switch and Buffer
4.6 Chapter Discussion

5 NoC: Hardware Design
5.1 Model Requirements
5.2 Design Structure
5.2.1 Micro-Router
5.2.2 RNI
5.2.3 Resource
5.3 Design and Simulation Tool
5.4 Simulation Results
5.4.1 Simulated values
5.4.2 Simulated and Implemented values
5.5 Chapter Discussion

6 NoC: Prototyping on FPGA
6.1 Prototype Board
6.2 Functional Description
6.2.1 Communication
6.2.2 Resources
6.2.3 I/O-ports
6.3 Technology Mapping tool
6.4 Implementation Result
6.5 Chapter Discussion

7 Results
7.1 SDL Modelling and Simulation of NoC
7.2 Designing NoC using VHDL
7.3 Implementation of NoC prototype in FPGA

8 Conclusions

9 Vocabulary

List of Figures

Figure 1-1. Resources in a NoC
Figure 2-1. Layers in the OSI-model
Figure 2-2. Network topologies.
Figure 2-3. A simple SDL system.
Figure 3-1. Design refinement.
Figure 3-2. Blocks in micro-router
Figure 3-3. Layers in the NoC
Figure 3-4. Packet structures in different layers.
Figure 3-5. RNI interface
Figure 3-6. Resource
Figure 3-7. Connections to environment
Figure 4-1. A node in the NoC
Figure 4-2. Internal blocks of a node
Figure 4-3. SDL blocks in the network layer.
Figure 4-4. Network overview
Figure 4-5. Table of resource configuration
Figure 4-6. Transfer statistics for 1 continuous and 15 bursty resources
Figure 4-7. Transfer mean time for 1 continuous 15 bursty resources
Figure 4-8. Spreading factor with 1 continuous and 15 bursty resources
Figure 4-9. Simulation set-up for 16 vs. 14 bursty resources
Figure 4-10. Number of transferred packets, 16 vs. 14 bursty resources
Figure 4-11. Number of cancelled packets, 16 vs. 14 bursty resources
Figure 4-12. Number of dropped packets, 16 vs. 14 bursty resources
Figure 4-13. Transfer meantime, 16 vs. 14 bursty resources
Figure 4-14. Simulation results with different burst length
Figure 4-15. Transfer statistics for 1 continuous and 15 bursty resources
Figure 4-16. Transfer mean time for 1 continuous and 15 bursty resources
Figure 4-17. Number of transferred packets, 16 vs. 14 bursty resources
Figure 4-18. Number of cancelled packets, 14 vs. 16 bursty resources
Figure 4-19. Number of dropped packets, 14 vs. 16 bursty resources
Figure 4-20. Transfer meantime, 14 vs. 16 bursty resources
Figure 4-21. Simulation results with different burst length
Figure 5-1. VHDL block model of NoC at node level
Figure 5-2. VHDL model of NoC at network layer
Figure 5-3. VHDL block model of RNI and a resource
Figure 5-4. The communication process between a switch and buffers
Figure 6-1. Overview of Network on Chip prototype in FPGA
Figure 6-2. Bits in implementation
Figure 6-3. Implementation result
Figure 7-1. Description of communication in NoC-prototype.

1 Introduction

1.1 System on Chip

With the increase in the number of transistors that can be fabricated onto a single chip, the System on Chip (SoC) concept has become possible. Before this, chip design was aimed at producing chips that contained a single stand-alone design. SoC means that multiple stand-alone VLSI (Very Large Scale Integration) designs are stitched together on a chip to provide one functional system. When used in a SoC, the stand-alone designs are often referred to as cores or Intellectual Property (IP) blocks. Several vendors only design cores and let other companies buy these and manufacture the chips. A SoC design may thus contain cores from different vendors.

1.2 Network on Chip

Network on Chip (NoC) is one solution for designing communication among components in SoC circuits with several billion transistors, which are expected to reach the market in approximately 5-10 years from now. One reason for a new communication strategy is that it is too costly to use one global synchronous clock in such a circuit, since it will take several clock cycles just for a signal to travel across the chip. As a result, designs will approach a model that is locally synchronous in each component but globally asynchronous on the chip. This is called the GALS (Globally Asynchronous Locally Synchronous) paradigm.

A solution that exists in today’s SoC designs is to have dedicated buses between the communicating resources. This gives no flexibility, since the communication needs have to be worked out anew every time a design is made. Another possibility is the use of common buses, which have the problem that they do not scale well as the number of resources grows. NoC is intended to solve these shortcomings by implementing a communication network of switches/micro-routers and resources.

Research on NoC is now expanding very rapidly, and there are several companies and universities that are involved. The KTH/VTT Project Group, which this project is mostly influenced by, has suggested a mesh where each micro-router is connected to one resource and four other micro-routers. The resources can be IP-cores or in-house developed designs. Examples could be Signal Processors, RAM, FPGA or any kind of custom hardware block. Figure 1-1 shows how a NoC, in comparison with shared buses, could be occupied with various components as resources. RNI stands for Resource Network Interface and is a component that adapts the communication requirements of the resource to the network protocol.

Figure 1-1. Resources in a NoC

1.2.1 NoC Evaluation Tools

Today the development of NoC is focused on developing a suitable network configuration. The ideas have to be tested, and therefore tools for evaluating a NoC design have to be considered. If a NoC design is to claim some degree of efficiency, the claim has to be supported by performance simulations.

There are several network simulators available; for example, NS-2 [15] has been used for this purpose. NS-2 was, however, not designed to be used for NoC, and its configuration possibilities seem unable to meet the requirements of a NoC model.

Another idea is to use an ordinary high-level programming language, like C++, to build a simulator. Here there is, of course, a possibility to make the simulator as accurate as the developer wants, but in turn it will take a lot of time to develop.

The idea of this project is to use a system-level description language to build a model that will meet the requirements of a NoC simulator, in order to make the results valid for a specific design.

1.3 SDL as a modelling platform

The Specification and Description Language (SDL) is suited for describing reactive discrete systems. A NoC can be characterised as such a system: for example, a micro-router reacts to a discrete event such as the arrival of a packet, then makes a decision and performs some action accordingly. Because of the high level of description and the possibility to specify timing properties, it is relatively easy to build a model of a complex system. It is then possible to use a discrete event simulator to simulate and examine the behaviour of the system before it is implemented.
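
To illustrate what a reactive discrete system and a discrete event simulator mean in practice, the following is a minimal sketch in Python rather than SDL. The class names `EventSimulator` and `MicroRouter`, the fixed `switch_delay` and the packet labels are assumptions made for illustration only; they are not taken from the SDL model of this project.

```python
import heapq
from typing import Callable, List, Tuple


class EventSimulator:
    """Minimal discrete event simulator: events are ordered by simulated time."""

    def __init__(self) -> None:
        self.now = 0
        self._queue: List[Tuple[int, int, Callable[[], None]]] = []
        self._seq = 0  # tie-breaker so events at the same time run in insertion order

    def schedule(self, delay: int, action: Callable[[], None]) -> None:
        heapq.heappush(self._queue, (self.now + delay, self._seq, action))
        self._seq += 1

    def run(self) -> None:
        # Pop the earliest event, advance simulated time to it, and execute it.
        while self._queue:
            self.now, _, action = heapq.heappop(self._queue)
            action()


class MicroRouter:
    """Hypothetical reactive router: reacts to a packet arrival after a fixed delay."""

    def __init__(self, sim: EventSimulator, switch_delay: int) -> None:
        self.sim = sim
        self.switch_delay = switch_delay
        self.log: List[Tuple[int, str]] = []

    def packet_arrives(self, packet: str) -> None:
        # The reaction (forwarding) is itself a future event in simulated time.
        self.sim.schedule(self.switch_delay, lambda: self.forward(packet))

    def forward(self, packet: str) -> None:
        self.log.append((self.sim.now, packet))


def demo() -> List[Tuple[int, str]]:
    sim = EventSimulator()
    router = MicroRouter(sim, switch_delay=2)
    sim.schedule(0, lambda: router.packet_arrives("A"))
    sim.schedule(1, lambda: router.packet_arrives("B"))
    sim.run()
    return router.log
```

In this sketch nothing happens between events: simulated time jumps from one event to the next, which is the property that makes a discrete event simulator suitable for a system that only reacts to events.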

SDL supports hierarchical division of the system using blocks, which can be used to describe functional or physical units. The behaviour of the system can be specified using concurrent processes. The structural properties of the language make it possible to use a model in order to simplify the lower levels of the system design.

1.4 Objectives of the project

Since the NoC concept is relatively new, research in this area tends to concentrate on investigating individual details of the subject. When studying the published material, no attempt to model a NoC using SDL was found. Nor have any VHDL models been presented that use buffers in the micro-routers of the NoC.

In order to investigate this, the first objective of the project is to model and simulate a packet switched NoC with buffered micro-routers using SDL. Communication between micro-routers shall use a layered design similar to the OSI-model.

After this, a VHDL description of the NoC shall be developed and simulated. This will give information about timing properties in the NoC. The model should be synthesizable and possible to implement in an FPGA.

The third and last objective is to make a prototype of a NoC-system in an FPGA and verify the design. The resources could be very simple because of the limited space in the circuit, but it shall prove that the model works.

1.5 Outline

In this chapter the reader is introduced to the NoC concept and its motivation. The purpose of the project is also defined here. Chapter 2, entitled Theoretical Background, describes theories about networks in general and some different ideas on how a NoC could be designed. Its purpose is to give the reader an understanding of the area and to show that there are many different aspects to consider when designing a network. The theories presented there are also the basis on which the design is built, and after reading the chapter the reader should be able to understand the design decisions and limitations of the model.

In chapter 3, entitled NoC: Design Decisions, the overall decisions regarding all the stages of the design are presented and motivated. Chapter 4, entitled NoC: Modelling in SDL, describes the system-level design and functionality of the NoC in SDL. Chapter 5, entitled NoC: Hardware Design, discusses the VHDL design of various components of a NoC system. Chapter 6, entitled NoC: Prototyping on FPGA, discusses issues of implementing a small NoC prototype on a programmable platform such as an FPGA. These last three chapters also present partial results and conclusions about NoC from the respective design phases. Chapter 7, entitled Results, presents the overall results and summarises the results obtained in the previous chapters. In chapter 8, entitled Conclusions, some important thoughts about the project and the results are discussed, and proposals for future work are given.

2 Theoretical Background

This chapter describes the theoretical background and the programming language on which the project is based. It also gives a brief description of what other researchers have presented in the NoC area.

2.1 Communication Networks

2.1.1 Communication Techniques

The two main techniques for directing the communication in a network are called packet switching and circuit switching. Packet switching works as follows: when the router receives a packet, it looks at the destination address and then tries to forward the packet in that direction. If for some reason it is impossible to send the packet the best way, the packet may be redirected, dropped or buffered. The Internet is a good example of a network that uses this type of switching.

Circuit switching is the other type of switching. The main idea behind it is to set up a physical or logical connection between the nodes that want to communicate. An example of this is Time Division Multiplexing (TDM) systems, where nodes communicate with each other in specified time-slots.
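
The packet switching decision described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the names `OutputPort`, `Outcome` and `handle_packet`, the dictionary-based routing table and the buffer size are hypothetical, and the option of redirecting a packet is left out to keep the example short.

```python
from collections import deque
from enum import Enum
from typing import Deque, Dict


class Outcome(Enum):
    FORWARDED = "forwarded"
    BUFFERED = "buffered"
    DROPPED = "dropped"


class OutputPort:
    """Hypothetical output port with a small FIFO buffer."""

    def __init__(self, buffer_size: int) -> None:
        self.busy = False
        self.buffer: Deque[dict] = deque()
        self.buffer_size = buffer_size


def handle_packet(ports: Dict[str, OutputPort],
                  routing_table: Dict[int, str],
                  packet: dict) -> Outcome:
    # Look at the destination address to find the preferred direction.
    direction = routing_table[packet["dest"]]
    port = ports[direction]
    if not port.busy:
        port.busy = True          # the link is now occupied by this packet
        return Outcome.FORWARDED
    if len(port.buffer) < port.buffer_size:
        port.buffer.append(packet)  # wait until the link becomes free
        return Outcome.BUFFERED
    return Outcome.DROPPED          # no link and no buffer space available


def demo() -> list:
    ports = {"east": OutputPort(buffer_size=1)}
    routing_table = {5: "east"}
    return [handle_packet(ports, routing_table, {"dest": 5, "id": i})
            for i in range(3)]
```

With one free link and a buffer of one packet, three packets arriving back to back for the same direction are forwarded, buffered and dropped respectively.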

2.1.2 OSI Model

The International Organization for Standardization (ISO) has presented a reference model called Open Systems Interconnection (OSI) [5], [7]. The concept of this model is to separate the different functions of a network into 7 layers, see Figure 2-1, where each layer performs a certain service in the network. For each layer, the function of the layer below is not visible. A layer in one node communicates as if it were directly connected to the same layer in the node it is communicating with. This is called a logical connection, and the means of communication are called peer protocols.

Since it is not possible for a layer, except the physical layer, to reach its peer physically, it must make use of the service that a lower layer performs. Consider a situation where a packet from the transport layer is to be sent to another node in a network. The transport layer service cannot itself find the way to that node, so the packet is sent to the network layer, where an address corresponding to the receiving node is attached.

However, the network layer has no protocol for sending the packet to another node. It must send the packet via the data link layer, where the packet is framed and an error detection code is added. The data link layer then uses the service of the physical layer to transmit the frame to the next node chosen by the network layer. At the next network node the frame is unpacked up to the network layer, and if this node is not the destination the packet is re-routed and sent out on the network again. This continues until the intended destination is reached or the packet is dropped.
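
The way each layer wraps the data of the layer above can be sketched in a few lines of Python. The packet layout below (a one-byte destination address as network header and a CRC-32 as error detection code) and all function names are assumptions made for illustration; they do not describe the packet format used later in this thesis.

```python
import zlib
from typing import Tuple


def network_wrap(segment: bytes, dest: int) -> bytes:
    # Network layer: attach a destination address (assumed to fit in one byte).
    return bytes([dest]) + segment


def datalink_frame(packet: bytes) -> bytes:
    # Data link layer: frame the packet and append an error detection code.
    return packet + zlib.crc32(packet).to_bytes(4, "big")


def datalink_unframe(frame: bytes) -> bytes:
    packet, received_crc = frame[:-4], frame[-4:]
    if zlib.crc32(packet).to_bytes(4, "big") != received_crc:
        raise ValueError("frame corrupted")
    return packet


def network_unwrap(packet: bytes) -> Tuple[int, bytes]:
    return packet[0], packet[1:]


def receive(frame: bytes, my_address: int) -> Tuple[str, bytes]:
    # Unpack up to the network layer, then deliver upwards or send onwards.
    packet = datalink_unframe(frame)
    dest, segment = network_unwrap(packet)
    if dest == my_address:
        return ("deliver", segment)
    return ("reroute", datalink_frame(packet))


def is_rejected(frame: bytes, my_address: int) -> bool:
    try:
        receive(frame, my_address)
    except ValueError:
        return True
    return False
```

A frame built with `datalink_frame(network_wrap(b"hi", 7))` is delivered at the node with address 7 and re-framed for onward transmission at any other node, which mirrors the hop-by-hop behaviour described above.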

Figure 2-1. Layers in the OSI-model

2.1.3 Routers

In a switched network there is a need to find a route through several switches. Therefore the switches that are cross-points in the network also implement a routing function, and because of this functionality they are called routers. The routing algorithm is a very important part of the router, since its task is to route every packet in the right direction. Some routing algorithms are able to tell not only which route is the shortest, but also which route is the fastest. The two main kinds of routing algorithms are static and adaptive routing.

Static routing is when there is one fixed path, or possibly a few fixed paths, between sender and receiver. With a static routing algorithm the routing changes very slowly, if at all, over time. When the routing is changed, it is often a result of human intervention.

Adaptive routing, also called dynamic routing, is on the other hand when the routing algorithm alters the route of packets dynamically. A dynamic routing algorithm changes the routes according to, for example, network traffic or changes of the topology.

A global routing algorithm has complete information about connectivity and link costs in the network. The algorithm can thereby compute the least-cost path between source and receiver. The calculation itself can be run at one site or at multiple sites.

Decentralized routing algorithms calculate the least-cost path by letting each node communicate with its neighbours. In the beginning a node only knows the costs of its own directly attached links. Then, through an iterative process of communication between nodes, the least-cost path to a destination is calculated.
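
The iterative process of a decentralized algorithm can be sketched as a distance-vector exchange, which is one well-known instance of this class of algorithms. The function name `distance_vector`, the dictionary representation of links and the example costs are assumptions for illustration, and link costs are assumed to be positive.

```python
from typing import Dict, Tuple


def distance_vector(links: Dict[Tuple[str, str], int]) -> Dict[str, Dict[str, int]]:
    """Compute least costs between all nodes; links maps (u, v) to an undirected link cost."""
    nodes = sorted({n for edge in links for n in edge})
    neighbours: Dict[str, Dict[str, int]] = {n: {} for n in nodes}
    for (u, v), cost in links.items():
        neighbours[u][v] = cost
        neighbours[v][u] = cost

    # In the beginning each node only knows itself and its directly attached links.
    table = {n: {n: 0, **neighbours[n]} for n in nodes}

    changed = True
    while changed:
        changed = False
        # Every node receives the vectors its neighbours held at the start of the round.
        snapshot = {n: dict(t) for n, t in table.items()}
        for n in nodes:
            for nb, link_cost in neighbours[n].items():
                for dest, cost_from_nb in snapshot[nb].items():
                    candidate = link_cost + cost_from_nb
                    if candidate < table[n].get(dest, float("inf")):
                        table[n][dest] = candidate
                        changed = True
    return table
```

For example, with links A-B of cost 1, B-C of cost 2 and A-C of cost 5, node A initially believes that C costs 5, and after one exchange with B it learns the cheaper path of cost 3 via B.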

[Figure 2-1: two communicating nodes, each with the layer stack Application, Presentation, Session, Transport, Network, Data Link, Physical (top to bottom).]

2.1.4 Buffers

If buffers are added to a switch in order to store packets at times when the network is overloaded, the probability that packets will be dropped decreases. Some switches use only a single output buffer and multiple input buffers, which can cause the problem called “head of line blocking” [6] that is often seen in such switches. This problem appears when the first packet in the FIFO queue of an input buffer cannot be sent because its desired output is not available. The next packet cannot pass, since it has to wait for the packet first in line to be sent.

Multiple output buffers with single input buffers do not suffer from this kind of problem. Their main drawback is, however, that packets may be rejected if the rate of transmission to the router is higher than the router can handle.

Another method is to use shared memory in the switch. The problem with this type is that it may result in a slower system, since some amount of synchronization and organization of memory access is needed.
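
Head of line blocking can be shown with a very small Python sketch. Each packet is represented only by the name of the output it wants, and the function name `step` and the port names are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Optional, Set


def step(input_queue: Deque[str], free_outputs: Set[str]) -> Optional[str]:
    """Try to send the head of a FIFO input queue during one cycle.

    Returns the output the sent packet went to, or None if nothing could be sent.
    Only the head of the queue is ever considered, which is what causes the blocking.
    """
    if input_queue and input_queue[0] in free_outputs:
        return input_queue.popleft()
    return None


def demo() -> list:
    queue = deque(["east", "north"])
    results = []
    # Cycle 1: only "north" is free, but the head wants "east": nothing is sent,
    # although the second packet could have used the free output.
    results.append(step(queue, {"north"}))
    # Cycle 2: "east" becomes free, so the head leaves the queue.
    results.append(step(queue, {"east", "north"}))
    # Cycle 3: the packet for "north" is now first in line and can be sent.
    results.append(step(queue, {"north"}))
    return results
```

In the first cycle of `demo` the packet for the free output is held back by the packet in front of it, which is exactly the situation described above.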

2.1.5 Topologies

Network topology refers to the shape of the network. How the different nodes in a network are connected to each other and how they communicate are determined by the network's topology.

Figure 2-2. Network topologies.

Mesh topology comes in two types: full mesh and partial mesh. Full mesh means that a node is connected to every other node in the network. This is a very costly method and is mostly used to connect buses. Partial mesh means that a node doesn’t have to be directly connected to all other nodes. This type of mesh is not as costly as full mesh, but the disadvantage is less redundancy.

2D-array is a type of mesh in which the nodes form a two-dimensional grid where each node is connected to its four adjacent routers. The routers at the edges have only two or three connections, since they have fewer adjacent routers. The number of nodes is C x R, where C is the number of columns and R is the number of rows.

Torus is a topology similar to the 2D-array, in which the nodes form a regular cyclic two-dimensional grid. Here all routers have four connections, since a torus is basically a mesh with wrap-around at the edges.

Star topology uses a central hub to which all resources are connected. All communication between resources is then passed through the central hub.

Ring topology is when the resources are connected to each other in a ring. Every resource is then connected to its two neighbours, and communication with other resources has to pass through the neighbours.
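
The difference between the 2D-array and the torus can be made concrete by computing the neighbours of a node. The following Python sketch uses (column, row) coordinates starting at 0; the function names are assumptions made for illustration.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (column, row)


def mesh_neighbours(col: int, row: int, cols: int, rows: int) -> List[Node]:
    """Neighbours in a 2D-array: positions outside the grid are simply left out."""
    candidates = [(col - 1, row), (col + 1, row), (col, row - 1), (col, row + 1)]
    return [(c, r) for c, r in candidates if 0 <= c < cols and 0 <= r < rows]


def torus_neighbours(col: int, row: int, cols: int, rows: int) -> List[Node]:
    """Neighbours in a torus: the modulo operation gives the wrap-around at the edges."""
    return [((col - 1) % cols, row), ((col + 1) % cols, row),
            (col, (row - 1) % rows), (col, (row + 1) % rows)]


def node_count(cols: int, rows: int) -> int:
    return cols * rows
```

In a 4 x 4 grid, a corner node such as (0, 0) has two neighbours in the 2D-array, an edge node such as (1, 0) has three and an inner node has four, while every node of the torus has four.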

Bus topology means that several resources use the same communication channel. In an ordinary local area network this can result in collisions, caused by two resources sending a packet at the same time. To avoid collisions, it is possible to let each resource send its packets in a time slot that is unique to that resource.
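
The time slot scheme mentioned for the bus can be sketched as a simple round-robin schedule. This is only one possible way to assign unique slots, and the function names and the resource names in the example are assumptions made for illustration.

```python
from typing import List


def slot_owner(cycle: int, resources: List[str]) -> str:
    """Return the resource that owns the bus in the given cycle (round-robin slots)."""
    return resources[cycle % len(resources)]


def may_send(resource: str, cycle: int, resources: List[str]) -> bool:
    # A resource may only drive the shared bus in its own slot, so two
    # resources can never send in the same cycle.
    return slot_owner(cycle, resources) == resource
```

With the resources `["cpu", "mem", "fpga"]`, cycle 0 belongs to `cpu`, cycle 1 to `mem`, cycle 2 to `fpga`, and the pattern then repeats. The price of avoiding collisions this way is that a slot stays unused when its owner has nothing to send.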

2.2 Network on Chip Concepts

Many researchers are investigating how to design a network on chip. The issues cover aspects like design methodology, topology, physical constraints and switching techniques. The following subsection gives some examples of ideas that other researchers have published.

2.2.1 Survey of Network on Chip Ideas

Philips [2]

The increasing complexity of designs will promote the use of Intellectual Property (IP) blocks, so that a system becomes a composition of heterogeneous blocks. The challenges for system design will shift from the design of computation, the IP blocks, to communication and storage. As the relative cost of the wire network will increase, more efficient usage of the wires is suggested by introducing routers. To get a flexible and efficient solution, the network will have to support at least two traffic classes, called Best Effort (BE) and Guaranteed Throughput (GT). The suggested router is packet switched, with input queuing for BE traffic and time division multiplexing for GT traffic.

Stanford University [3]

The limiting factor in achieving the desired SoC is considered to be the interconnection of the components. To solve this, a layered design methodology is suggested. The authors propose to borrow design methods from the network design field and view the SoC as a micro network of components. Each level in the micro network stack is then optimised for the target application domain. Network reconfigurability will provide flexibility, where plug and play components can interact with each other through reconfigurable protocols.

The laws of nature will require a paradigm shift in the design of the physical layer. Because of their length, the global wires in the networks will function as lossy transmission lines. It may be too costly to try to design “ideal” wires as is done today; instead it will be necessary to accept that wires on the chip are not fully reliable and to deal with that fact at a higher layer of abstraction. The conclusion is that the layered micro network approach is likely the only approach that can master the complexity of future SoC.

KTH/VTT Project Group [1]

The NoC platform includes both architecture and design methodology. The architecture is a two-dimensional mesh of switches with slots for resources. The architecture is the communication infrastructure including the physical, data-link and network layer of the OSI protocol stack. Deriving the concrete architecture is the first of the two phases that form the foundation of the design methodology. The second phase is the mapping of applications, which together with the architecture form a complete product. Based on calculations of the switch size, it is assumed that the physical bus between switches can be about 300 wires wide. Therefore, it is possible to route 128 wires in each direction between neighbouring switches.


The general network is called CLICHÉ (Chip-Level Integration of Communicating Heterogeneous Elements). For more special purposes, where performance is of greater importance, the network may have to support the concept of regions. These regions will not necessarily have the same structure and communication mechanisms as the rest of the network.

2.3 System Level Design

If a system is to be constructed so that resources are used effectively, implementation issues should not be considered at the beginning. In this stage, called the specification stage, the focus should be on fulfilling the requirements and specifying the functions without stating whether they are implemented in hardware or software. After that, the functions are designed and implemented in the most efficient (least costly) way that can fulfil the demands.

A design at system level can also be useful for describing how the system performs its functions, in this case with an implementation in mind. Ideally, one could make a system-level specification and obtain an implementation directly from it. At this time, tools that go directly from specification to implementation are not available. Some vendors have tools that can convert SystemC to VHDL and in that way produce a hardware implementation. Tools for SDL can at this time compile to C/C++ code and produce a software implementation.

2.3.1 SDL

SDL is an abbreviation of Specification and Description Language and first became available in 1976. It was used by the telecommunication sector to help in the design of increasingly complex systems. The language has been constantly developed over the years, and the latest version dates from 2000. It is an object-oriented language that makes it possible to separate the structural, data and behavioural aspects of a system. SDL has both a graphical and a textual representation, and it is possible to translate between them.

The language is well suited for describing reactive discrete systems. A reactive system can be characterised as a system whose behaviour is dominated by its reactions to actions from its environment. The system is also discrete if these actions take place as discrete events and not continuously.

With SDL it is possible to build a model of a complete system and examine its behaviour before it is implemented. The structural aspects of a system are described in blocks, which can be organised hierarchically. Behaviour is described in processes, which cannot be organised hierarchically; instead all processes in a system are concurrent. Behaviour hierarchy can be described by using procedures in the processes.

Communication between processes takes place through signals that pass in channels between the communicating processes. Process behaviour is described by Extended Finite State Machines (EFSMs), in which transitions are caused by discrete events. Figure 2-3 shows an example of how a simple system can be modelled. It describes the behaviour of a slot machine that responds to the input of a coin and either returns the coin if it is not valid, or puts out a randomly chosen ware if the coin is a ten crown.


Figure 2-3. A simple SDL system.

In the upper left is the system specification, which in this case contains a block called Ware_Machine and channels that connect it to the environment. A signal called Coin can pass on the channel CoinInput; it is of the type CoinType, which is declared as a new type. The interior of block Ware_Machine is shown in the upper right, and in this case it contains one process called Ware_Machine_Process.


In this process the behaviour is described as an EFSM, shown in the lower part of the figure. Let us follow what happens if a coin is inserted on CoinInput. The EFSM makes a transition, leaves the state Wait_for_Coin and checks CoinSort in the decision box. If it is anything other than a TenCrown, it puts Coin on CoinOutput and returns to Wait_for_Coin. If CoinSort is TenCrown, a Task with the ANY operator randomly chooses one of the wares declared in WareType: a key ring, a plastic snake or a small doll. The chosen ware is assigned to WareSort and put out with the signal Ware on Ware_Output. After this the EFSM returns to Wait_for_Coin.
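The behaviour of the slot machine EFSM can be mimicked in a few lines of Python. This is only a sketch of the transitions described above, not generated from the SDL model. The coin values other than the ten crown are hypothetical, and random.choice stands in for the SDL ANY operator.

```python
import random
from enum import Enum


class CoinType(Enum):
    ONE_CROWN = 1     # hypothetical invalid coin
    FIVE_CROWN = 5    # hypothetical invalid coin
    TEN_CROWN = 10    # the only coin the machine accepts


class WareType(Enum):
    KEY_RING = "key ring"
    PLASTIC_SNAKE = "plastic snake"
    SMALL_DOLL = "small doll"


class WareMachineProcess:
    """One-state EFSM: waits for a coin, answers, and waits again."""

    def __init__(self, rng=None):
        self.state = "Wait_for_Coin"
        self.rng = rng or random.Random()

    def coin(self, coin_sort):
        """Handle the input signal Coin; return (output channel, value)."""
        assert self.state == "Wait_for_Coin"
        if coin_sort is not CoinType.TEN_CROWN:
            # Decision box: invalid coin, send it back on CoinOutput.
            output = ("CoinOutput", coin_sort)
        else:
            # Task: pick any ware at random and put it on Ware_Output.
            ware_sort = self.rng.choice(list(WareType))
            output = ("Ware_Output", ware_sort)
        self.state = "Wait_for_Coin"
        return output


machine = WareMachineProcess()
print(machine.coin(CoinType.FIVE_CROWN))
print(machine.coin(CoinType.TEN_CROWN))
```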


3 NoC: Design Decisions

Before and during the project it is necessary to make several design decisions, since detailed specifications for the project are not defined. The reason is that not enough information is available to make a detailed specification before the start.

3.1 Design Methodology

This project consists of three stages ordered from system level down to implementation, as shown in Figure 3-1. The network's behaviour is modelled at the system level in SDL. Certain details of the system's behaviour will come from the VHDL design. Since the VHDL model will be clock synchronous, the number of clock cycles a function takes will be used in the SDL model to get a more realistic timing behaviour. Figure 3-1 illustrates that the design is refined and that information from a lower stage is used to improve the model at a higher level.

Figure 3-1. Design Refinement.

3.2 Network configuration

Packet switching was chosen because of its scalability and flexibility. The drawback is that there will be no hard real-time properties in the network. The m*n 2D mesh structure is used since it makes the routing easy, even if the number of network nodes grows.


[Figure 3-2 block labels: Micro-Router containing Route-Control, four In/Out Buffers and one RNI In/Out Buffer]

3.3 Route and Switch function

A micro-router is built according to the OSI model, where different layers perform specified services. This gives a basic structure to design upon and makes it possible to change the service on one layer without affecting the others. The micro-routers should be able to communicate with each other on both the network layer and the data-link layer. The reason is that it should be possible to make a fast simulation in SDL using only the network layer, or a more accurate simulation with both layers.

The micro-router itself is divided into several functional blocks, shown in Figure 3-2:

• I/O-buffers for communication with other micro-routers

• I/O-buffer for communication with the RNI

• Route-Control unit for switching the packets in the right direction.

Figure 3-2. Blocks in Micro-Router

3.4 Connections between Nodes

This area can be divided into two parts because of the two objectives: to simulate and evaluate, and to support the VHDL design. These objectives are partly conflicting. A very detailed design in SDL would be more time consuming to simulate. On the other hand, with too few details design aspects may be overlooked, and dealing with these in VHDL can be very complex. The chosen solution is a design that can be simulated both down to the data-link layer and on the network layer only. Since the data-link layer does not add information on how the routing and buffers work, this layer can be left out when evaluating network performance.


3.4.1 Drop of Packets

The packet transferring components in the design are connected in such a way that a sender cannot send a packet unless it gets a ready-to-receive signal from the receiver. A receiver does not give this signal if it is full. No packets will therefore be dropped between these components. Drops are allowed in the micro-router when it has been impossible to route a packet for a certain amount of time, in order to prevent deadlock or other unwanted behaviour. This strategy is chosen because, when the network is heavily loaded, it prevents the resources from sending packets that would most likely be dropped. This brings the information closer to the source, and the decision on how to react is made by the resource.
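The ready-to-receive rule can be sketched as follows. The class ReceiveBuffer and the function try_send are hypothetical names chosen for this illustration; they do not correspond to blocks in the SDL or VHDL design.

```python
from collections import deque


class ReceiveBuffer:
    """Receiver side: signals ready only while it has a free slot."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = deque()

    def ready(self):
        # The ready-to-receive signal is withheld when the buffer is full.
        return len(self.slots) < self.capacity

    def accept(self, packet):
        if not self.ready():
            raise RuntimeError("sender ignored the ready-to-receive signal")
        self.slots.append(packet)

    def remove(self):
        # Called when the next stage takes the oldest packet.
        return self.slots.popleft()


def try_send(packet, receiver):
    """Sender side: transfer the packet only if the receiver is ready."""
    if receiver.ready():
        receiver.accept(packet)
        return True
    return False  # the packet stays with the sender, nothing is dropped


buffer = ReceiveBuffer(capacity=2)
for name in ("p1", "p2", "p3"):
    print(name, "sent" if try_send(name, buffer) else "held back")
```

With a capacity of two packets, the third packet is held back at the sender until the receiver has passed one packet on.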

3.4.2 Physical issues

In [1] the proposed bus width is set to 300 wires, including address and control wires. This is based on the assumption that each side of the router is physically large enough for this number of I/Os. Once a bus width is decided, it must also be decided in which direction the wires are used. All wires could be shared by both routers for sending and receiving data. Another way is to use half of the bus as output on one router and input on the other, and vice versa. This means that less data is transmitted in each direction at a time, but communication becomes simpler and it is possible to send and receive at the same time.

For simulation and implementation in this project it should not be necessary to use such a wide bus, as it would only slow down the simulation and take up a lot of space in the FPGA. It is however useful to have a bus width that is easily controlled, and it should be possible to set the width to any size.

The physical layer defines the actual physical connections that are the basis for a network. Since the implementation is intended for an existing FPGA with a fixed structure, the advantages of modelling the physical layer in SDL are not clear. It is therefore not included in the design.

3.4.3 Data-link layer connection

The data-link layer deals with how the transfer of data between two micro-routers can be made reliable. Data is grouped into frames that may consist of, for example, a header, payload and checksum. In this design the payload is a packet from the network layer. One data-link frame is sent in parallel to the neighbouring node. Every frame is checked for errors, and retransmission is requested if it is incorrect.

The transfer of bits on the bus has to be synchronised in some way. The method chosen is a simple handshaking protocol. For example, data is placed on the bus on a write signal from the sender and acknowledged by the receiver with another signal.

3.4.4 Network layer connection

This layer concerns the transfer of data between arbitrary nodes in the network. This is done by the routing function in the micro-routers. A message at this layer may consist of several packets, each carrying the address of its destination. The purpose of the network layer is to make sure that the packets reach their desired address in a way that is decided by the routing algorithm. The network layer is considered unreliable, because there is no confirmation that a message has reached its destination.

As mentioned earlier in this chapter, it should be possible to simulate the design without the data-link layer. In this case the connection is made directly between the buffers in the opposite nodes. Figure 3-3 gives a graphical explanation.


[Figure 3-3 labels: two protocol stacks side by side, each with the layers Transport, Network, Data Link and Physical]

Figure 3-3. Layers in the NoC

The vertical connection between the network and data-link layers is in this case broken. The transfer of data on the network layer is done in the same manner as if it were within the same node, i.e. as a transfer between the network and data-link layers. In Figure 3-3 this is shown as the horizontal network connection.

3.4.5 Transport layer

The purpose of the transport layer is to provide a reliable transfer of messages between the resources that use the network. For example, if a packet is dropped in the network, this would be detected and the transport layer would take the necessary actions. The micro-routers do not implement the transport layer, since they only operate up to the network layer. In the project design the resources are very simplified and transmit transport layer packets directly. Consequently, the RNI does not perform transport layer services but provides the service of the network layer to these packets.

3.5 Packet structures

Every layer has a different packet structure, based on the information needed to perform each layer's service. The highest level of packet structure that is defined is the transport layer. Much of the structure used comes from the KTH/VTT project. Figure 3-4 explains how the packets relate to each other.


Figure 3-4. Packet Structures In Different Layers.

In the transport layer it is necessary to be able to identify the Destination Process ID (DPID) and Source Process ID (SPID). Every message that a process sends has a Message Sequence Number. If a message is too large to fit into one packet, it is divided into several packets, and every packet therefore has a Packet Sequence Number (PSN). In a real situation the payload in the transport layer is a packet from a higher-level service, but in this model it is sent as test dummy bits.

The network layer adds the destination RID, with the actual row address and column address of the node in the network where the DPID resides. The payload of this layer is a packet from the transport layer. To make sure that a packet with errors does not stay in the network indefinitely, a Hop Counter (HC) can be included that is incremented for each router the packet passes.

The data-link layer frames consist of a payload in the form of a packet from the network layer and an error check field.

Example sizes in bits of the fields in each layer are shown in Figure 3-4. In the SDL model there is no reason to limit the size of fields, and therefore the type integer is used. The size of each field is limited only in VHDL, because a fixed size is needed in order to implement it.
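The nesting of the three packet structures can be sketched with Python dataclasses. The class and field names are assumptions made for this illustration, and all fields are plain integers as in the SDL model, so the bit widths of Figure 3-4 are not represented.

```python
from dataclasses import dataclass


@dataclass
class TransportPacket:
    dpid: int         # Destination Process ID
    spid: int         # Source Process ID
    message_seq: int  # Message Sequence Number
    psn: int          # Packet Sequence Number
    payload: int      # test dummy bits in this model


@dataclass
class NetworkPacket:
    dest_row: int     # row address of the destination node
    dest_col: int     # column address of the destination node
    hop_counter: int  # incremented for each router the packet passes
    payload: TransportPacket


@dataclass
class DataLinkFrame:
    payload: NetworkPacket
    error_check: int  # the error check method is not specified here


def encapsulate(packet, dest_row, dest_col):
    """Wrap a transport packet with network layer addressing."""
    return NetworkPacket(dest_row, dest_col, hop_counter=0, payload=packet)


transport = TransportPacket(dpid=7, spid=3, message_seq=1, psn=0, payload=42)
frame = DataLinkFrame(payload=encapsulate(transport, 2, 4), error_check=0)
print(frame)
```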

3.6 Buffers

Buffers make it possible to store packets for a while, instead of dropping them, if they cannot be transmitted further at the moment. A major design issue is how large the buffers can be while remaining cost effective. The larger the buffer, the higher the cost in terms of chip area. To balance these conflicting demands, it is decided to use a maximum size of 4 buffers for our evaluation purposes.


3.7 Routing Algorithm

A simple dynamic routing algorithm makes the route decisions. It first tries to route the packet in the right direction according to columns; if that is not possible, it tries to send the packet in the right direction according to rows. If it is not possible to route the packet in the right direction, it is sent to any free output buffer. If no output buffer is available for a certain time period, the packet is dropped.

The reason for choosing this type of algorithm is that it does not need large routing tables, which would take a lot of resources. Such tables are also unnecessary since the structure of the network is known. One drawback of this kind of routing algorithm is that no real-time properties can be guaranteed in the NoC, but it is small in area and quite fast.

3.8 RNI function

The purpose of the Resource Network Interface (RNI) is to be a link between a micro-router in the network and the resource connected to it. The idea behind the RNI is that a resource does not need to implement the network layer, which removes some complexity from the resource. In the model it is assumed that the transport layer is implemented in the resource part of the interface between the RNI and the resource. This is because the resources may not all make identical use of this layer's service. This means that the RNI receives transport layer packets and performs the network layer service, which is to add the network address of the desired destination to the transport packet.

Figure 3-5. RNI Interface

3.9 Resource

To get good simulation results, the resources in the model should resemble the behaviour of real resources. There can be different kinds of resources in a real system, such as DSPs, general-purpose processors or memory. In VHDL the resource will be made very simple due to lack of time.

[Figure 3-5 block labels: RNI containing an In/Out Buffer and Network Service Control]


Figure 3-6. Resource

3.10 Connection to Environment

There are at least two ways in which the network can connect to the environment outside the chip. One is to have dedicated components connecting the edge routers to the environment, see picture A in Figure 3-7. The other is to use resource space and connect via an RNI, see picture B. The first way is somewhat more complicated since special routers are needed for these connections: packets should not be routed to the environment unless it is their destination. The second suggestion may take up unnecessary space but gives a simpler configuration. The proposal in this project is that some resource spaces are used to connect to the environment. This can be achieved by making some pins available for resources to connect to. In this way there is flexibility to choose both the number and type of connections, for example an Ethernet controller.

A) Connecting to I/O-Interface via edge routers

B) Connecting to I/O-Interface via dedicated resources

Figure 3-7. Connections to Environment

[Figure 3-6 block labels: Resource containing Resource Behavior Imitation]


4 NoC: Modelling in SDL

There are many ways to look at a NoC, but since it is basically a communication network it is a good idea to divide it into structures using the layered network model. The hierarchical and regular structure of a NoC architecture is quite suitable for simulation using SDL.

4.1 Requirements of Model

The motivation for using SDL in the design process is to make a model of the system to support the design at a lower abstraction level, in this case RTL design using VHDL. This means that the SDL system is designed with a block structure that reflects the component structure in VHDL. Another aim is to make a system with good simulation possibilities. The SDL system should, in this sense, be easy to configure in order to obtain an effective design with regard to the behaviour of real resources.

4.2 System Structure

At the top level of the design the type definitions, signals and configuration constants are set. As an example, the simulation can be performed on different layers. One alternative is to simulate only the behaviour of the network layer, which is accomplished by setting the synonym DLS to 0. If the behaviour of the data-link layer is also to be simulated, DLS should be set to 1. Another configuration possibility is to choose the size and shape of the network: when the network row and column sizes are set, a mesh forming a complete network is created. The resources can be individually or collectively set to a specific behaviour in the resource type.

4.2.1 Design Blocks

The major principle for the block division is that a block should encapsulate a specific function of the network. To some extent the blocks should also reflect the physical units that form the network. The design makes frequent use of type definitions. The data-link layer is designed as a separate block that can optionally be connected to the buffers in the network layer connections.

4.2.2 Parameterised Mesh size

Since the model is to be used for simulation purposes, the network size is an important parameter. Figure 4-1 shows the node block. Before a simulation, a number of nodes equal to the product row size*column size is created. The connections between the nodes in the network are then set up automatically.


Figure 4-1. A node in the NoC

The internal structure of the Node block is shown in Figure 4-2. The IO_Switch is a connection used only during initialisation of the simulation, in order to get all the created routers connected appropriately. After that this connection is no longer used.

Figure 4-2. Internal blocks of a node


4.2.3 Micro-Router

The block called Router is a micro-router designed with functional blocks that separate the network and data-link layers. Figure 4-3 shows the network layer block, consisting of one switch/routing block, 4 in/out buffers and 1 RNI in/out buffer. This block has two optional modes: it can communicate with the neighbouring routers either directly at the network layer or through the data-link layer.

Figure 4-3. SDL blocks in the network layer.

Switch and Route unit

The block SWITCH_CONTROL contains the processes Switch and Route_Control. Though they have a sequential behaviour in this design, the division into two processes has been made in order to make it possible to run them independently of each other. For example, this is used when the Route_Control is updating the out-buffer states. The Switch receives packets from the in-buffers and sends information about the packet’s desired destination to the Route_Control.


The Route_Control has information about the state of the out-buffers and makes a decision about the route for the packet. The first choice is to send it north or south according to the destination. If that is not possible it investigates west or east. If none of the preferred buffers is available, it sends the packet to any other free buffer. If there is no free buffer the Switch is informed about this. When there are free buffers the Switch receives a message telling it in which buffer to put the packet. If at that time there is no free buffer, it starts a counter and sends a routing request for another packet. The counter is reset if a packet is routed to an out-buffer. If the routing request has failed and the counter has reached a timeout value (16 in our case), the packet is dropped.
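The route decision and the drop timeout can be sketched in Python as below. This is an illustration under several assumptions and is not derived from the SDL processes. Positions are (row, column) pairs where rows grow southwards and columns grow eastwards. A packet that has reached its destination node goes to an out-buffer called "RNI". The counter counts failed routing requests and is not reset by a drop. The names choose_output and RouteRequester are hypothetical.

```python
TIMEOUT = 16  # timeout value stated in the text


def choose_output(current, dest, free):
    """Pick an out-buffer for a packet, or None if no buffer is free.

    current, dest: (row, col) positions; free: set of free out-buffers.
    """
    row, col = current
    dest_row, dest_col = dest
    if (row, col) == (dest_row, dest_col):
        return "RNI" if "RNI" in free else None
    preferred = []
    if dest_row != row:  # first choice: north or south
        preferred.append("South" if dest_row > row else "North")
    if dest_col != col:  # second choice: west or east
        preferred.append("East" if dest_col > col else "West")
    for direction in preferred:
        if direction in free:
            return direction
    # No preferred buffer is free: take any other free direction.
    for direction in ("North", "South", "West", "East"):
        if direction in free:
            return direction
    return None


class RouteRequester:
    """Counts failed routing requests and drops packets at the timeout."""

    def __init__(self):
        self.failed = 0

    def route(self, current, dest, free):
        direction = choose_output(current, dest, free)
        if direction is not None:
            self.failed = 0  # a packet was routed, reset the counter
            return direction
        self.failed += 1
        return "DROP" if self.failed >= TIMEOUT else "RETRY"


requester = RouteRequester()
print(requester.route((1, 1), (3, 3), {"North", "East"}))
print(requester.route((1, 1), (3, 3), set()))
```

The caller is assumed to list only out-buffers that actually exist, so that an edge router never offers a direction without a neighbour.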

Buffers

The buffers use the BUFFER_TYPE block and each handles all communication in one direction (North, South, West or East). To make it possible to set a different configuration for the buffer connecting to the RNI, this buffer is of a special type called RNI_BUFFER_TYPE. These blocks are also able to communicate with the neighbouring router or RNI, both directly on the network layer and via the data-link layer.

Data-link layer

The layer is placed in a separate block that connects to the buffers and performs the services of the data-link layer. The type used is D_IO_TYPE.

4.2.4 RNI

The RNI contains the units that perform the actions of the Resource Network Interface. It encapsulates the block type N_LAYER with processes that perform the network layer services in the RNI. This block operates in two optional modes: it can communicate with the neighbouring router either directly at the network layer or through the data-link layer.

It contains one RNI service block and one in/out-buffer towards the micro-router. There is also a possibility to use a data-link layer block.

RNI service

The block RNI_SERVICE is responsible for the transformation of data between the resource and the network. In this design the resource sends transport layer (T_Layer) packets.

In the process RNI_OUT, network layer information is attached to the T_Layer packets, which are then sent out on the network. There is a possibility to set a certain delay for the operations in this process. In the process RNI_IN, the N_layer packets are unpacked to transport layer packets and sent to the resource. There is also a possibility to set a certain delay for the operations at this stage.

Buffer

The N_BUFFER_TYPE is used to handle the communication with the network router on the network layer or via the data-link layer. It is similar to the BUFFER_TYPE in the router, except that it is a separate type so that it can have its own configuration options.

Data-link layer

In the block D_LAYER there is one sub-block of the type D_IO_TYPE.


4.2.5 Resource

The block RESOURCE is a container of the type of resource that is connected. It is modelled using two processes, one each for sending packets and receiving packets.

Packet Sender

The process Sender simulates the behaviour of the resource. In this model the resource simulates the transport layer service, which means that data is sent and received as transport layer packets. Between the RNI and the resource there is only a transport layer connection. Adding lower layers here would not give any benefit when evaluating the model. Since the resource and RNI sit in the same node, there is no reason for implementing the network layer between them. Different behaviours can be set with procedures for individual resources.

We provide a possibility to choose between bursty and continuous base behaviour. Bursty behaviour means that a certain number of packets, called a burst, are put out one after another at maximum rate, followed by a delay called the burst gap before the process repeats itself. The number of packets in a burst is randomly selected according to the Poisson distribution.

The burst gap is calculated from a random delay that is uniformly distributed between a minimum and a maximum value. The length of the burst gap is weighted with the number of packets in the burst, so that a large burst is followed by a longer burst gap. The selection of addresses is random within the limits of the network, but addresses situated next to the sender have double the probability of being chosen, since this is the most likely scenario for communication. With the continuous behaviour it is possible to set the delay between two transmitted packets, which gives the output frequency of packets.
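The bursty sender can be sketched in Python as follows. Several details are assumptions made for this illustration, since they are not fixed by the description above: the Poisson number is drawn with Knuth's multiplication method, the burst gap is scaled linearly by the ratio of the burst size to the mean burst size, positions are 1-based (row, column) pairs, and the sender never picks itself as destination.

```python
import math
import random


def poisson(mean, rng):
    """Draw a Poisson-distributed integer using Knuth's method."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def burst_gap(n_packets, mean_packets, gap_min, gap_max, rng):
    """Uniform random delay, weighted with the size of the burst."""
    base = rng.uniform(gap_min, gap_max)
    # Assumed weighting: a burst twice the mean size doubles the gap.
    return base * n_packets / mean_packets


def pick_destination(sender, rows, cols, rng):
    """Random destination; nodes adjacent to the sender count double."""
    candidates, weights = [], []
    for row in range(1, rows + 1):
        for col in range(1, cols + 1):
            if (row, col) == sender:
                continue
            adjacent = abs(row - sender[0]) + abs(col - sender[1]) == 1
            candidates.append((row, col))
            weights.append(2 if adjacent else 1)
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random()
size = poisson(8, rng)
print("packets in burst:", size)
print("burst gap:", burst_gap(size, 8, 40.0, 80.0, rng))
print("destination:", pick_destination((1, 1), 4, 4, rng))
```

The minimum and maximum gap values in the usage lines are placeholders and are not taken from the simulation set-up.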

Packet Receiver

In the Receiver process the time of arrival of the packet is noted and the packet information is written to a file. It is now possible to compare the logged information from the receiver with other values. We can get a lot of information from these logs, for instance number of packets sent/received, transfer time and many other interesting figures.

4.2.6 Description of Common Types

The following is a description of some common types defined to model NoC.

Block type BUFFER_TYPE

Process DATA_IN_TYPE receives data from the data-link layer (D_N_TYPE) or the neighbouring router's buffer (DATA_OUT_TYPE) and buffers it. When SWITCH_TYPE is ready to pass it on, it is removed from the buffer.

DATA_OUT_TYPE receives data from SWITCH_TYPE and buffers it. Data is then passed on either to the neighbouring router (DATA_IN_TYPE) or to the Data-Link Layer (D_N_TYPE) and is thereafter removed from the buffer.

Block type D_IO_TYPE

This block contains processes that perform the data-link layer services. The processes in this block will not be active during network layer simulation.


The process N_D_TYPE consumes network layer data packets from the network layer and frames them with data-link layer information and an error check. After that it passes them to the data link when the receiver is ready to receive. If an error is introduced in the transmission, the failing message is retransmitted.

Process D_N_TYPE receives data-link layer data packets from the data link, removes the data-link layer framing and checks for errors. After that it passes them to the network layer. If there is an error in the transmission, a signal is sent to the sender.
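The interplay between framing, error check and retransmission can be sketched as follows. The error check method is not specified in the text, so a simple XOR over the payload bytes is used here purely as a stand-in. The function names, the link callable and the limit on the number of attempts are likewise assumptions made for this illustration.

```python
def checksum(payload):
    """Stand-in error check: XOR of all payload bytes."""
    value = 0
    for byte in payload:
        value ^= byte
    return value


def frame(payload):
    """Sender side (role of N_D_TYPE): append the error check."""
    return payload + bytes([checksum(payload)])


def unframe(data):
    """Receiver side (role of D_N_TYPE): return payload, or None on error."""
    payload, check = data[:-1], data[-1]
    return payload if checksum(payload) == check else None


def send_reliably(payload, link, max_attempts=4):
    """Retransmit until the receiver accepts the frame.

    link is a callable that models the data link and may corrupt data.
    Returns (received payload or None, number of attempts used).
    """
    for attempt in range(1, max_attempts + 1):
        received = unframe(link(frame(payload)))
        if received is not None:
            return received, attempt
    return None, max_attempts


def perfect_link(data):
    return data


print(send_reliably(b"\x10\x20\x30", perfect_link))
```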

4.3 Design Tool

The programming tool that was available and used in the project was Telelogic Tau 4.3. This package contains an SDL suite that fully supports SDL-92 and partially SDL-96. The design has been done with the graphical notation of SDL. After the design in SDL it is possible to make an implementation in C or C++ that can be compiled to an executable program, which can be tested with a simulator.

4.3.1 Simulation Tool

In the SDL suite there is a possibility to use a simulator that can simulate the function of the system. It is possible to stimulate inputs to the system and to examine data and state transitions.

In an SDL simulation there is a possibility to use real time or simulated time. Real time means that the time is updated according to the “wall clock”, that is, if an event is set to occur in 1 hour the user has to wait that amount of time. The simulated time is updated according to the discrete event simulation technique. This means that the current value of the simulation time, accessed with the NOW operator, is identical to the time at which the currently executing event is scheduled. The simulation time is updated when an event is finished and is then set to the time at which the next event is scheduled. Say for instance that the next event scheduled in a simulation is the expiration of a timer set to 1 hour. This will lead to an immediate execution of the event and a transition in the processes. The simulation time will increase by 1 hour, but without having to wait this amount of time.

The scheduling assumes that a transition takes no time and that a signal is placed in the input port of the receiver immediately as the sender makes an output of the signal. Due to these rules the only events that update the simulated time are the output of timers.
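How simulated time advances can be illustrated with a minimal discrete event loop in Python. This is a generic sketch of the technique and not the scheduler of the SDL simulator; the class Simulator and its method names are hypothetical.

```python
import heapq


class Simulator:
    """Minimal discrete event loop driven by timers."""

    def __init__(self):
        self.now = 0.0     # plays the role of the NOW operator
        self._queue = []   # pending timers ordered by expiry time
        self._count = 0    # tie-breaker for timers with equal expiry

    def set_timer(self, delay, action):
        heapq.heappush(self._queue, (self.now + delay, self._count, action))
        self._count += 1

    def run(self):
        while self._queue:
            expiry, _, action = heapq.heappop(self._queue)
            # Jump directly to the expiry time; no wall-clock waiting.
            self.now = expiry
            action(self)   # the transition itself takes no simulated time


sim = Simulator()
sim.set_timer(3600, lambda s: print("one hour timer expired at", s.now))
sim.set_timer(10, lambda s: print("short timer expired at", s.now))
sim.run()
```

The one hour timer expires immediately in wall-clock terms, while the simulated time jumps first to 10 and then to 3600.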

4.4 Simulation Set-up

One interesting issue to test is how differences in the buffer size affect the performance of the network. A comparison of configurations with the same area cost could for instance test how in-buffer=2 and out-buffer=2 performs relative to in-buffer=1 and out-buffer=3. Another factor that could have an impact on network performance is the delay in the different components that packets pass through. Comparing different delays reveals how much it is worth improving these delays further in VHDL.

The performance measures used are loss probability, in terms of packet drops per 100 packets, and mean packet delay over all packets sent and received in the network. One thing that must be considered is what kind of traffic load should be offered to the network. Of course one must limit the simulations, because the possibilities are so many that they could continue forever.

The parameters below are regarded as the most interesting to change; after the simulations it is investigated how network performance is affected with different buffer sizes.
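The two performance measures can be computed from sender and receiver logs as in the sketch below. The function names and the log format, a list of (send time, arrival time) pairs, are assumptions made for this illustration.

```python
def drops_per_100(sent, dropped):
    """Loss measure: packet drops per 100 packets sent."""
    return 100.0 * dropped / sent


def mean_packet_delay(records):
    """Mean delay over (send_time, arrival_time) pairs of received packets."""
    delays = [arrival - send for send, arrival in records]
    return sum(delays) / len(delays)


print(drops_per_100(sent=200, dropped=3))
print(mean_packet_delay([(0, 10), (5, 25)]))
```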


Burstiness: Same average packet rate but change in burst gap and packets/burst.

Network clock speed: Change the delays of the network components.

Network load: Change the number of sending resources.

Network Size

It was decided that a 4*4 node network, as in Figure 4-4, should be used for the simulations. This size is chosen because a smaller network would not test the routing thoroughly and a larger one would take too long to simulate and analyse. For a SoC it also seems to be a size that could be realistic in the near future. After studying some simulations the following set-ups were used for experiments:

1. A mixed set-up according to Figure 4-5 with 1 continuous and 15 bursty resources. The positions of the resources are the same for all simulations in order to make a fair comparison.

2. Set up all sending resources as bursty according to the Poisson distribution.

3. Set up fewer sending resources as bursty.

The last two set-ups show how a change in load, in terms of the number of sending resources, affects performance.

Figure 4-4. Network Overview

[Figure: 4*4 mesh of nodes labelled 1,1 to 4,4; each node is shown with a RESOURCE and an RNI.]

Figure 4-5 shows what kind of communication the resources in Figure 4-4 have.

Position  Resource  Destination  Behaviour   Max Rate/     Mean Rate  Mean Burst  Mean Nr of
                                             Packet (MHz)  (MHz)      Gap (us)    packets/Burst
1,1       0         3,1          Burst       384           100        59,2        8
1,2       1         1,1          Continuous  64            64         -           -
1,2       1         2,2          Continuous  64            64         -           -
1,2       1         4,4          Continuous  64            64         -           -
1,2       1         3,3          Continuous  64            64         -           -
1,2       1         2,3          Continuous  64            64         -           -
1,2       1         1,4          Continuous  64            64         -           -
1,3       2         2,4          Burst       384           100        59,2        8
1,4       3         1,3          Burst       384           100        59,2        8
2,1       4         3,1          Burst       384           100        59,2        8
2,2       5         4,1          Burst       384           100        59,2        8
2,3       6         2,1          Burst       384           100        59,2        8
2,4       7         2,1          Burst       384           100        59,2        8
3,1       8         4,3          Burst       384           100        59,2        8
3,2       9         4,3          Burst       384           100        59,2        8
3,3       10        3,2          Burst       384           100        59,2        8
3,4       11        3,2          Burst       384           100        59,2        8
4,1       12        3,4          Burst       384           100        59,2        8
4,2       13        4,3          Burst       384           100        59,2        8
4,3       14        4,1          Burst       384           100        59,2        8
4,4       15        4,2          Burst       384           100        59,2        8

Figure 4-5. Table of Resource Configuration

Simulation Period

The simulation length was set to 10 000 time-units (clock cycles). Tests of up to 200 000 time-units showed that 10 000 time-units were enough to produce reliable data. Since the time needed to process the data increased rapidly with the simulation length, 10 000 time-units were chosen.

4.5 Simulation Results

4.5.1 Simulations with equal delay of Switch and Buffer

The following simulations are made with faster settings than those measured in the VHDL simulations. The buffer and the switch have equal delay. In this case the minimum period of the micro-router is 3 clock cycles. For example, if the network clock period is 0,5 ns, this results in a maximum bandwidth of $1/(3 \cdot 0{,}5 \cdot 10^{-9}\,\mathrm{s}) \approx 667$ Mpackets/s. These results are presented here in order to show how the network could behave if the component delays are optimised to this speed.
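The bandwidth arithmetic above can be checked with a few lines of Python. The function name `max_bandwidth` is only an illustrative choice; the inputs are the micro-router period in clock cycles and the clock period in seconds.

```python
def max_bandwidth(cycles_per_packet, clock_period_s):
    """Packets per second for a router that needs `cycles_per_packet` clock cycles per packet."""
    return 1.0 / (cycles_per_packet * clock_period_s)


# Equal switch and buffer delay: 3 clock cycles per packet at a 0,5 ns clock.
equal_delay = max_bandwidth(3, 0.5e-9)
print(round(equal_delay / 1e6), "Mpackets/s")
```

The same function can be applied to the 5-cycle micro-router period of section 4.5.2, where it gives 400 Mpackets/s for the same clock period.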


4.5.1.1 Simulations with 1 Continuous and 15 Bursty Resources

This simulation is based on the resource behaviour shown in Figure 4-5. To evaluate what happens to the sent packets, we have divided them into three categories: Transferred, Cancelled and Dropped.

• Transferred packets are those that have been sent on time, that is, at the moment the resource wants to put them on the network, and have reached their destination.

• Cancelled packets are those that have not been sent on time because the buffers are full. These packets are dropped by the transmitting resource.

• Dropped packets are those that are sent on time but are later dropped by a micro-router because the required buffers are full.
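The three categories can be expressed as a small classification routine. The sketch below is only an illustration in Python; the record format, a pair of flags `(sent_on_time, reached_destination)`, is an assumption and not the format of the SDL simulation log. It also assumes that every packet sent on time either arrives or is dropped, that is, no packets are still in flight when the simulation ends.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    TRANSFERRED = "transferred"
    CANCELLED = "cancelled"
    DROPPED = "dropped"


def classify(sent_on_time, reached_destination):
    """Map one packet record to one of the three categories used in the text."""
    if not sent_on_time:
        return Outcome.CANCELLED      # never entered the network: buffers were full
    if reached_destination:
        return Outcome.TRANSFERRED    # sent on time and delivered
    return Outcome.DROPPED            # sent on time but discarded by a micro-router


# Hypothetical packet records: (sent_on_time, reached_destination)
packets = [(True, True), (True, False), (False, False), (True, True)]
tally = Counter(classify(sent, arrived) for sent, arrived in packets)
print(tally)
```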

The results show that a network working at 0,5 time units per clock cycle gets no dropped packets and only a few cancelled packets. The network is so fast that the buffer configuration has little or no influence on performance. When the speed of the network is lowered to 0,6 time units per clock cycle, the effects of the buffers start to show. The configurations with only one output buffer begin to drop packets and also cancel many packets.

When more than one output buffer is used, the network works quite satisfactorily, with no drops and only a few cancelled packets. The highest throughput is reached by [in-buffer:out-buffer] 02:02, since this configuration results in the largest number of transferred packets. The difference in performance between 01:02 and 02:02 is not very large, so the extra input buffer is probably not worth the extra cost in resources.

The reason that the network works better with more than one output buffer is that more packets can be routed along an optimal route, which lowers the number of routings along their paths. From what we have seen so far, 02:02 is the best configuration, because it combines the advantages of input and output buffers. The input buffers give the router a high throughput, because there are always messages in the buffers that are ready to be routed; the output buffers make it possible to route the packets in the correct direction.

The results for clk = 0,7 and clk = 0,8 once again show that the simulations with more than one output buffer are superior to those with only one output buffer. At clk = 0,8 the buffer configurations 01:03 and 02:02 get almost the same results. This can be connected to the higher spreading that 02:02 causes when the load is high; see Figure 4-8.

[Stacked bar charts, 0% to 100%, one per clock period; the underlying packet counts per buffer configuration (In:Out) are listed below.]

                 01:01   01:02   01:03   02:01   02:02   03:01
Clk=0,5
  Dropped            0       0       0       0       0       0
  Cancelled         66      19      13      17       2       3
  Transferred    18173   18220   18226   18222   18237   18236
Clk=0,6
  Dropped          953       0       0    2903       0    2411
  Cancelled       2671     140     118    4167      35    3078
  Transferred    14615   18099   18121   11169   18204   12750
Clk=0,7
  Dropped         3170       0       0    3763       0    4008
  Cancelled       7381     517     442    6788     194    6246
  Transferred     7688   17722   17797    7688   18045    7985
Clk=0,8
  Dropped         4593       0       0    4195       0    5027
  Cancelled      11050    1300    1084    8929    1096    9127
  Transferred     2595   16938   17154    5114   17142    4084

Figure 4-6. Transfer statistics for 1 continuous and 15 bursty resources
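As an illustration of how the loss measure (drops per 100 packets) follows from these counts, the sketch below uses the Clk=0,6 values of Figure 4-6. It assumes that the reference quantity is the total number of offered packets (dropped + cancelled + transferred); whether cancelled packets should be included in the denominator is an assumption made here.

```python
CONFIGS = ["01:01", "01:02", "01:03", "02:01", "02:02", "03:01"]

# Packet counts for Clk=0,6, taken from Figure 4-6.
DROPPED = [953, 0, 0, 2903, 0, 2411]
CANCELLED = [2671, 140, 118, 4167, 35, 3078]
TRANSFERRED = [14615, 18099, 18121, 11169, 18204, 12750]


def per_hundred(count, total):
    """Express `count` as a number per 100 packets of `total`."""
    return 100.0 * count / total


rows = {}
for cfg, d, c, t in zip(CONFIGS, DROPPED, CANCELLED, TRANSFERRED):
    offered = d + c + t   # assumption: every offered packet ends in exactly one category
    rows[cfg] = (offered, per_hundred(d, offered), per_hundred(t, offered))
    print(f"{cfg}: offered={offered}, "
          f"drops/100={rows[cfg][1]:.2f}, transferred/100={rows[cfg][2]:.2f}")
```

The three categories add up to the same number of offered packets for every configuration, which is a useful consistency check on the figures.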

The mean packet transfer time for the different network speeds and buffer configurations once again shows the configurations with several output buffers to be superior. The configuration 02:02 is slightly faster than 01:02 and 01:03 on networks working at 0,5-0,6 time units. When the period time is increased further, the mean transfer time of the configurations with one input buffer does not increase as much as that of 02:02.

A possible reason is that the probability that the wanted output buffer is full is lower, since more time passes between two routings from the same input buffer. Once a packet is routed, a new packet has to be transferred into the input buffer, and during this time space can be freed in the desired output buffer. This results in less spreading of the packets (see Figure 4-8) and a lower number of switchings for each packet, which in the end gives a lower transfer time.

Mean transfer time (time units) per buffer configuration (In:Out)

            01:01   01:02   01:03   02:01   02:02   03:01
Clk=0,5        18      14      14      18      14      17
Clk=0,6        41      19      19      58      19      59
Clk=0,7        81      26      26      97      27     113
Clk=0,8       230      35      35     158      47     236

Figure 4-7. Transfer mean time for 1 continuous and 15 bursty resources

Figure 4-8 shows how the packets travel within the NoC. We call this measurement the spreading factor. It is defined as the percentage of all packet switchings that are performed by switches outside the packet's optimal path. For example, a spreading of 60 % means that out of one hundred packet switchings, 60 are made by switches that lie outside the packet's optimal path, and 40 by switches that lie on it. If a packet does not travel the shortest possible path, the number of switchings increases rapidly, which may cause drops if there is heavy traffic.
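The definition can be written as a short calculation. The sketch below, in Python, assumes a hypothetical log in which each switching is recorded as a boolean that is true when the switch lies on the packet's optimal path; the function name `spreading_factor` is also only illustrative.

```python
def spreading_factor(switchings):
    """Percentage of switchings made by switches outside the packet's optimal path.

    `switchings` is an iterable of booleans, True when the switch that handled
    the packet lies on the packet's optimal path.
    """
    switchings = list(switchings)
    if not switchings:
        return 0.0
    off_path = sum(1 for on_path in switchings if not on_path)
    return 100.0 * off_path / len(switchings)


# The example from the text: 40 switchings on the optimal path, 60 outside it.
example = [True] * 40 + [False] * 60
print(spreading_factor(example))  # 60.0
```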


As Figure 4-8 shows, the buffer configuration affects the percentage of switchings caused by misdirected packets, which shows the importance of evaluating which buffer configuration to use. The importance of more than one output buffer can be seen especially clearly.

The buffer configuration 01:02 has a slightly higher spreading than 02:02 and 01:03 at high network frequencies (clk = 0,5 and 0,7). This can be explained by the fact that this configuration has only three buffers, compared to four in the other two. The probability that all buffers are full is higher when there are fewer buffers, and the router then has to find an alternative route.

When the network clock is set to 0,8, the configuration 02:02 suddenly begins to spread the packets more than 01:02, which seems rather odd, because more buffers should give a better result in most situations. The reason is that the router is now able to route the second packet in the input queue before there is any space in the desired output buffer, which causes the router to send the packet along a non-optimal route.

Spreading per buffer configuration (In:Out)

            01:01   01:02   01:03   02:01   02:02   03:01
Clk=0,8       48%      8%      5%     38%     10%     41%
Clk=0,7       34%      5%      3%     33%      3%     31%
Clk=0,5       11%      2%      1%      9%      1%      9%

Figure 4-8. Spreading factor with 1 continuous and 15 bursty resources

4.5.1.2 Simulation with 16 bursty resources versus 14 bursty and 2 inactive resources

Simulation Set-up: These simulations are based on the set-up table in Figure 4-9, with some differences between the two models. In the simulation model All Burst, the set-up of the resources is exactly as in the table. The simulation model All Burst (-2) is also based on this table, but the sending parts of the resources 1,2 and 3,3 are turned off.


Sender  Receiver  Behaviour  Max Rate/     Mean Rate  Mean Burst  Spreading  Period    Min  Mean Burst Gap   Max  Mean Nr of
                             Packet (MHz)  (MHz)      Gap (ns)               in ns     ns   (per packet) us  ns   packets/Burst
1,1     3,1       Burst      384           100        59,2        75%        2,60E-09  2    7,396            13   8
1,2     1,1       Burst      384           100        59,2        75%        2,60E-09  2    7,396            13   8
1,3     2,4       Burst      384           100        59,2        75%        2,60E-09  2    7,396            13   8
1,4     1,3       Burst      384           100        59,2        75%        2,60E-09  2    7,396            13   8
2,1     3,1       Burst      384           100        59,2        75%        2,60E-09  2    7,396            13   8
2,2     4,1       Burst      384           100        59,2        75%        2,60E-09  2    7,396            13   8
2,3     2,1       Burst      384           100        59,2        75%        2,60E-09  2    7,396            13   8
2,4     2,1       Burst      384           100        59,2        75%        2,60E-09  2    7,396            13   8
3,1     4,3       Burst      384           100        59,2        75%        2,60E-09  2    7,396            13   8
3,2     4,3       Burst      384           100        59,2        75%        2,60E-09  2    7,396            13   8
3,3     3,2       Burst      384           100        59,2        75%        2,60E-09  2    7,396            13   8
3,4     3,2       Burst      384           100        59,2        75%        2,60E-09  2    7,396            13   8
4,1     3,4       Burst      384           100        59,2        75%        2,60E-09  2    7,396            13   8
4,2     4,3       Burst      384           100        59,2        75%        2,60E-09  2    7,396            13   8
4,3     4,1       Burst      384           100        59,2        75%        2,60E-09  2    7,396            13   8
4,4     4,2       Burst      384           100        59,2        75%        2,60E-09  2    7,396            13   8

Figure 4-9. Simulation set-up for 16 vs. 14 bursty resources

A comparison of the number of transferred packets once again shows that a network running with a period of 0,5 time units is so fast that the buffers have very little influence on the behaviour of the network. There are no drops, and the number of cancelled packets is also very small at this speed. When the period time is increased to 0,7 time units, the influence of the buffers starts to increase. The simulations with more than one output buffer perform significantly better than those with only one output buffer. The 02:02 configuration is slightly better than 01:02 and 01:03 at this speed in the All Burst model, but when the period is increased to 0,9, 02:02 falls back and 01:02 and 01:03 are the best. In the All Burst (-2) model, 02:02 remains the best in all simulations.

[Bar charts: number of transferred packets per buffer configuration (In:Out, 01:01 to 03:01) at clock periods 0,5, 0,7 and 0,9. Panels: 16 Bursty Resources (All Burst) and 14 Bursty Resources (All Burst-2).]

Figure 4-10. Number of transferred packets, 16 vs. 14 bursty resources

The number of cancelled messages clearly shows which configurations are capable of maintaining the dedicated throughput. All buffer configurations are able to route without drops at a period of 0,5 time units. When the period is increased to 0,7, the configurations with multiple output buffers start to show their advantages. All three of them manage to distribute the packets with only a few cancelled packets.

The configuration with two input and two output buffers causes the smallest number of cancelled packets. When the period time is increased further to 0,9, the configuration 02:02 is no longer the best option in the All Burst simulation. It seems that 02:02 is the best configuration when the load is low, and that 01:02 and 01:03 become the best when the load increases, with the latter being slightly better. The reason can once again be connected to the higher spreading that 02:02 causes when the load is high.

[Bar charts: number of cancelled packets per buffer configuration (In:Out, 01:01 to 03:01) at clock periods 0,5, 0,7 and 0,9. Panels: 16 Bursty Resources (All Burst) and 14 Bursty Resources (All Burst-2).]

Figure 4-11. Number of cancelled packets, 16 vs. 14 bursty resources

Once again the importance of the output buffers can be seen. The configuration 02:02 starts to drop heavily when the load increases, whereas 01:02 and 01:03 are the only configurations that do not drop packets in either of these two simulations.

[Bar charts: number of dropped packets per buffer configuration (In:Out, 01:01 to 03:01) at clock periods 0,5, 0,7 and 0,9. Panels: 16 Bursty Resources (All Burst) and 14 Bursty Resources (All Burst-2).]

Figure 4-12. Number of dropped packets, 16 vs. 14 bursty resources

The mean transfer time of all the set-ups with one output buffer increases much faster than that of the other set-ups. The set-ups with more than one output buffer have almost the same performance as long as the load is not too high. When the load gets higher, the 02:02 set-up falls back, and the set-ups 01:02 and 01:03 are without doubt the fastest.

It seems a little strange that 02:02 suddenly changes from the fastest of the three set-ups with multiple output buffers to the slowest as the load increases. One theory is that when the load on the network is low, the two input buffers help the router to keep a high throughput. When the load increases, there is not enough time to free space in the output buffer, and the packets are sent along a non-optimal path. This would explain why the transfer time for 02:02 suddenly becomes longer than for 01:02 and 01:03.

[Line charts: mean transfer time for packets (0 to 700) versus network speed (period time), one curve per buffer configuration 01:01 to 03:01. Panels: 16 Bursty Resources (All Burst), periods 0,5, 0,7 and 0,9; 14 Bursty Resources (All Burst-2), periods 0,5, 0,7, 0,9, 1,1 and 1,3.]

Figure 4-13. Transfer Meantime, 16 vs. 14 bursty resources

4.5.1.3 Simulations with different burst length

To investigate the importance of buffers when sending messages in bursts, we decided to examine what happens when the mean burst length is increased. The mean number of packets per burst was changed from 8 to 16. The expected result was that a longer mean burst length would benefit from a larger number of buffers. The results in the figure below did not confirm this at all.

The numbers of transferred packets are very similar in the two cases, but when the network speed is set to 0,7, the simulations with a mean burst length of 16 are much better in two cases. It is when only one output buffer is used that the number of cancelled packets is significantly lower. A longer burst length also gives a longer time between the bursts, because the total number of packets sent must not increase. For this reason, the probability that some other resource is sending packets that degrade the performance of a transmission is lower when longer bursts are sent.


[Bar charts: number of transferred, cancelled and dropped packets per buffer configuration (In:Out, 01:01 to 03:01) at clock periods 0,5, 0,7 and 0,9. Panels: 16 Bursty Resources with burst length 8 and with burst length 16.]

Figure 4-14. Simulation results with different burst length

4.5.2 Simulations with unequal delay of Switch and Buffer

The following simulations show the results when the switch and buffer delays are set to the delays that were obtained in VHDL. The main difference in the delay configurations is that the switch is a little slower than the buffers. The minimum period of the micro-router is 5 clock cycles. For example, if the network clock period is 0,5 ns, this results in a maximum bandwidth of $1/(5 \cdot 0{,}5 \cdot 10^{-9}\,\mathrm{s}) = 400$ Mpackets/s. The set-ups are identical to those of the previous simulations.

4.5.2.1 Simulations with 1 Continuous and 15 Bursty Resources

The results show that a network clock of 0,3 time units gives no dropped packets and very few cancelled packets. The configurations with multiple in-buffers seem to transfer the most packets. At a network speed of 0,4 time units per clock cycle, small effects of the out-buffers can be seen. The configurations with several input buffers begin to drop packets and also to cancel many packets. The out-buffers do not have the large effect that was seen in the previous simulation. The switch cannot take advantage of the larger out-buffers, since it is slower and thus has difficulty filling them up. It is perhaps surprising that the in-buffers perform so badly, but the routing algorithm can explain it. At high traffic they make it possible to route packets in a non-optimal direction, thus creating a lot of extra traffic.

[Stacked bar charts, 0% to 100%, one per clock period; the underlying packet counts per buffer configuration (In:Out) are listed below.]

                 01:01   01:02   01:03   02:01   02:02   03:01
Clk=0,3
  Dropped            0       0       0       0       0       0
  Cancelled         82      60      52      23      16       6
  Transferred    18157   18179   18187   18216   18223   18233
Clk=0,4
  Dropped            0       0       0    3065    2020    3237
  Cancelled       1302     905     710    4152    3243    3593
  Transferred    16937   17334   17529   11022   12976   11409
Clk=0,5
  Dropped         2775    2565    2426    3972    3759    4354
  Cancelled       6840    6487    6049    7216    6862    7014
  Transferred     8624    9187    9764    7051    7618    6871

Figure 4-15. Transfer statistics for 1 continuous and 15 bursty resources

The mean packet transfer time for the different network speeds and buffer configurations shows that adding more buffers only increases the time. The configuration 01:01 is slightly faster than 01:02 and 01:03. Here the configurations with multiple input buffers perform worse at every clock setting.


[Line chart: transfer time (0 to 160) for 1 continuous and 15 bursty resources at clock periods 0,3, 0,4 and 0,5, one curve per buffer configuration 01:01 to 03:01.]

Figure 4-16. Transfer mean time for 1 continuous and 15 bursty resources

4.5.2.2 Simulation with 16 bursty resources versus 14 bursty and 2 inactive resources

A comparison of the number of transferred packets once again shows that a network running with a period of 0,3-0,4 time units is so fast that the buffers have very little influence on the behaviour of the network. There are no drops, and the number of cancelled packets is also very small at these speeds. When the period time is increased to 0,5, the influence of the buffers starts to increase.

The simulations with more than one output buffer perform slightly better than those with only one output buffer. In the 14 bursty model, the effects of multiple out-buffers appear more distinctly.

[Bar charts: number of transferred packets per buffer configuration. Panels: 14 Bursty Resources, 8 pkts per burst, at clock periods 0,4, 0,5 and 0,6; 16 Bursty Resources, 8 pkts per burst, at clock periods 0,3, 0,4 and 0,5.]

Figure 4-17. Number of transferred packets, 16 vs. 14 bursty resources

All buffer configurations are able to route without drops at a period of 0,3 time units. When the period is increased to 0,5, the configurations with multiple output buffers start to show their advantages, especially in the case with 14 resources. When the period time is increased further to 0,6, the configurations 01:02 and 01:03 cause the fewest cancelled packets, with the latter being slightly better.


[Bar charts: number of cancelled packets per buffer configuration (In:Out, 01:01 to 03:01). Panels: 14 Bursty Resources, 8 pkts per burst, at clock periods 0,4, 0,5 and 0,6; 16 Bursty Resources, 8 pkts per burst, at clock periods 0,3, 0,4 and 0,5.]

Figure 4-18. Number of cancelled packets, 14 vs. 16 bursty resources

The importance of the output buffers can be seen here: there are no drops with the configurations 01:02 and 01:03 up to a clock of 0,5. The configuration 03:01 starts to drop heavily when the load increases.

[Bar charts: number of dropped packets per buffer configuration. Panels: 14 Bursty Resources, 8 pkts per burst, at clock periods 0,4, 0,5 and 0,6; 16 Bursty Resources, 8 pkts per burst, at clock periods 0,3, 0,4 and 0,5.]

Figure 4-19. Number of dropped packets, 14 vs. 16 bursty resources

The mean transfer time diagram shows that the transfer time of all the set-ups with several in-buffers increases faster than that of the other set-ups. The set-ups with more than one output buffer have almost the same performance as long as the load is not too high. The configuration with a single in-buffer and a single out-buffer seems to have the fastest transfer time at all clock settings.

[Line charts: transfer time (0 to 400) versus network speed (period time), one curve per buffer configuration 01:01 to 03:01. Panels: 14 Bursty Resources, 8 pkts per burst, at periods 0,4, 0,5 and 0,6; 16 Bursty Resources (All Burst), 8 pkts per burst, at periods 0,3, 0,4 and 0,5.]

Figure 4-20. Transfer Meantime, 14 vs. 16 bursty resources

4.5.2.3 Simulations with different burst length

The number of transferred packets seems to be influenced more when the bursts are 8 packets long. The explanation is probably the same as in the simulation with equal delays.

[Bar charts: number of transferred, cancelled and dropped packets per buffer configuration (In:Out, 01:01 to 03:01) at clock periods 0,3, 0,4 and 0,5. Panels: 16 Bursty Resources with 8 pkts per burst and with 16 pkts per burst.]

Figure 4-21. Simulation results with different burst length

4.6 Chapter Discussion

SDL seems to be well suited for modelling and simulating this complex system. Simulating 10 000 clock cycles took about 2 minutes. Most of the time was spent on handling the data in Excel; it took about 5 minutes to calculate the results of one simulation.


One must not forget that using SDL may increase the total design time of a project. For example, it might have been faster to make a VHDL design directly at a higher abstraction level instead of using SDL. The benefit of SDL, however, is that the simulations can be performed using one tool. With VHDL it may be necessary to introduce other programs, such as Matlab or user-written functions in C, for generating simulation stimuli.

From the different simulations that have been made, one result is constant throughout: the advantage of using at least two output buffers. The throughput of the system increases, especially in the case of equal delays of buffers and switches. When the output buffer is increased further to three buffer spaces, there is only a small extra benefit in performance, and in most cases it will not be worth the extra cost of extending the buffer size.

It also seems that adding more input buffers only makes the network performance worse. One surprising result, shown in both delay configurations, was that a longer burst length did not benefit more from larger buffers than a shorter burst length. The configuration that we recommend for a general-purpose system is one input buffer and two output buffers. If a NoC is designed for a special purpose, simulations should be made with different configurations to see which one is the most suitable.


5 NoC: Hardware Design

It was decided to make the design in VHDL. It was not intended to make several designs in VHDL, and thus the design is described in Register Transfer Level (RTL) code. This makes it possible to translate it automatically to a netlist using a synthesis tool. The programming tool supports graphical description of synchronous Finite State Machines (FSM), and this has been used to specify the behaviour and to generate code.

5.1 Model Requirements

The model serves the purpose of describing a digital circuit at a synthesizable level of abstraction. Main objectives of the VHDL model are:

• To determine delay/execution time in terms of clock cycles for various blocks

• To determine hardware cost of various NoC blocks. Implementation results are shown in chapter 6.

5.2 Design Structure

Since the SDL model seems to serve its purpose, the block division used in that design is also used for encapsulating functional units in VHDL. There are, however, some differences. The units that are written as processes in SDL are described as components containing one state machine in VHDL. Also, the Switch Control and the RNI unit, which in SDL are internally divided into several processes, i.e. two separate state machines, are each represented by only one state machine in VHDL. The reason is that the design should not take up unnecessary space and time. Since these processes in SDL have a sequential functional behaviour, one state machine is enough in VHDL. If more state machines were used, synchronization signals would be needed between them.

The VHDL network model is not parameterised like the SDL model. The network mesh is created by instantiating components and then manually connecting them to each other, as in Figure 5-1. The package Noc_package has been created in order to gather type definitions and constants in one place.


[Block diagram: four NoC node instances (I0 to I3) with the component names NOC_Node_Transceiver0_1, NOC_Node_Transceiver1_1, NOC_Node_Uart_Rec and NOC_Node_Uart_Trans, connected manually through Indata/Outdata ports in the North, South, East and West directions, each with Wr, RTR and Enable signals, together with Clk, Reset, RouterID, Send_pid, Dest_PID, rx and Tx.]

Figure 5-1. VHDL block model of NoC at node level

5.2.1 Micro-Router

The Micro_Router in Figure 5-2 has the same component structure as the blocks in SDL.

Switch and Control unit

The objective of the Switch is to check whether the input buffers hold any messages waiting to be transmitted and, if so, to switch them to an appropriate output buffer in a safe way. Handshaking is used for packet transactions. When the Switch is ready to handle a packet, it sets the RTR signal to 1; after reset, all RTR signals are thus set to 1. An input buffer that has a packet sets its WR signal to 1. The Switch then tries to route this packet according to the same algorithm as in the SDL design. A free output buffer has its RTR signal at 1, and the Switch then asserts its WR signal to that buffer. At the same time, the Switch sets the RTR to the input buffer to 0 to indicate that the packet has been taken. After this, it waits for the WR from the input buffer and the RTR from the output buffer to become 0 before it sets the WR to the output buffer to 0. After that it is ready to handle a new packet.

A counter is advanced after each transfer so that the Switch does not start its next search at the buffer from which it just picked up a packet. This gives the behaviour a degree of fairness. It was not needed in the SDL design, since the signals there are events that are queued in the order in which they arrive at a process. In VHDL we cannot determine which signal came first if several arrive within the same clock period.
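The fairness counter can be illustrated with a small behavioural sketch in Python. It is not the thesis VHDL. The names `next_input` and `serve` are hypothetical, and the choice of five inputs (North, South, West, East and RNI) and of restarting the scan at the input directly after the one just served are assumptions made for the example.

```python
from typing import List, Optional, Sequence

# Assumption: five inputs per micro-router (North, South, West, East, RNI).
N_INPUTS = 5


def next_input(wr: Sequence[bool], start: int) -> Optional[int]:
    """Return the first input with WR asserted, scanning round-robin from `start`."""
    n = len(wr)
    for offset in range(n):
        idx = (start + offset) % n
        if wr[idx]:
            return idx
    return None  # no input buffer has a packet waiting


def serve(wr_per_step: Sequence[Sequence[bool]]) -> List[int]:
    """Serve at most one input per step and advance the counter past it."""
    counter = 0
    served = []
    for wr in wr_per_step:
        idx = next_input(wr, counter)
        if idx is not None:
            served.append(idx)
            # Assumed policy: the next search starts after the served input.
            counter = (idx + 1) % len(wr)
    return served
```

With all five inputs requesting in every step, `serve([[True] * N_INPUTS] * 6)` visits the inputs in the order 0, 1, 2, 3, 4, 0, so no single input can monopolise the Switch.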

Figure 5-2. VHDL model of NoC at network layer

Buffers

The buffers in the North, South, West and East directions are of the same type. A separate type is used for the buffer towards the RNI so that it can be configured independently. When the buffer is ready to receive a packet it sets In_RTR to 1. When a packet is ready to enter the buffer, the In_WR signal is set to 1. When In_WR is 1, In_RTR is set low to indicate that the buffer is handling the packet. The buffer then checks whether Out_RTR is set and, if so, sets Out_WR to 1 to transfer the packet immediately. If Out_RTR is low, the packet is stored in the buffer. When the buffer is not full and In_WR has gone low, the buffer sets In_RTR high again.
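The buffer handshake can be sketched as a step-wise behavioural model in Python. It is a simplification and not cycle-accurate with respect to the VHDL design. The class name `Buffer`, the `capacity` parameter and the rule that a stored packet is sent oldest-first as soon as Out_RTR is seen high are assumptions for illustration.

```python
from collections import deque


class Buffer:
    """Behavioural sketch of one router buffer (hypothetical model)."""

    def __init__(self, capacity: int = 2):
        self.capacity = capacity  # number of packets the buffer can hold
        self.slots = deque()      # stored packets, oldest first
        self.in_rtr = True        # ready to receive after reset
        self.out_wr = False
        self.out_data = None

    def step(self, in_wr: bool, in_data, out_rtr: bool):
        """Advance one step; return (In_RTR, Out_WR, Out data)."""
        # Accept a packet only while In_RTR is high, then pull In_RTR low.
        if in_wr and self.in_rtr:
            self.slots.append(in_data)
            self.in_rtr = False
        # Forward the oldest packet if the receiver is ready,
        # otherwise the packet stays stored in the buffer.
        if out_rtr and self.slots:
            self.out_wr, self.out_data = True, self.slots.popleft()
        else:
            self.out_wr, self.out_data = False, None
        # Raise In_RTR again once In_WR is low and there is room.
        if not in_wr and len(self.slots) < self.capacity:
            self.in_rtr = True
        return self.in_rtr, self.out_wr, self.out_data
```

When the buffer is empty and Out_RTR is high, a packet written in one step appears on the output in the same step, which corresponds to the immediate transfer described above. When the buffer is full, In_RTR stays low until a packet has been forwarded.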

5.2.2 RNI

For simplicity, the service and buffer functions of the RNI are combined in one FSM. The difference from a pure buffer is that the FSM, for instance in the out-buffer case, looks up a table and adds the network layer information while the packet from the resource is held in the buffer.

Figure 5-3. VHDL Block model of RNI and a Resource

5.2.3 Resource

In the design there are three types of resources, designed to meet the requirements of the implementation. The first is a UART receiver, which receives messages at 9600 bit/s and passes them to the RNI as simple transport layer messages. The second is the send ID resource, which receives a digit 0-9, forwards it and then sends its own ID that number of times to a specific node in the network. The third is the UART transmitter, which receives packets from the RNI and sends them at 9600 bit/s onto a serial channel.

5.3 Design and Simulation Tool

The design was made with FPGAdvantage 4.0 from Mentor Graphics, a set of design tools in an integrated environment. The design is entered in Renoir, and Leonardo Spectrum is used for target synthesis. The code is compiled separately in ModelSim for simulation.

5.4 Simulation Results

The design was simulated in several test benches to determine its timing. The VHDL model serves two purposes. The first was to find realistic timing values to feed back into the SDL model. The second was that the model should be implementable in an FPGA. As an FPGA is not optimal for this purpose, we had to accept that the implemented model needs more clock cycles.

5.4.1 Simulated values

This section shows the timing values obtained in simulation, which are considered realistic if, for instance, an ASIC were designed.

Buffers: Forwarding one packet: 1 clk

Turn-around time to receive a new packet: 3 clk

Switch: Load packet for address evaluation and send: 2 clk

Be ready to load a new packet: 2 clk

Figure 5-4 shows that these clock cycle counts are determined by the handshaking procedure. The figure also shows how the timing values from the VHDL simulations were transferred to the SDL timing, which is drawn in its upper part. Since it takes 2 clock cycles for the switch to send a packet and it then has to wait 3 clock cycles for the RTR from the buffer, the period is 5 clock cycles. The buffers are one clock cycle faster, with a period of 4 clock cycles, because a buffer sends in one clock cycle and its turn-around time is counted from when it receives. From these values the maximum bandwidth (BW) of the router can be derived, here with an example using a specific clock frequency.

Ex. F = 1 GHz

Max BW Buffers = $1/(4 \cdot 1 \cdot 10^{-9})$ = 250 MPackets/s

Max BW Switch = $1/(5 \cdot 1 \cdot 10^{-9})$ = 200 MPackets/s

Max BW overall = 200 MPackets/s

Since a single packet passes each stage faster than the full handshake period that limits the bandwidth, the minimum delay for one packet can be calculated as follows.

In-buffer delay = 1 ns

Switch delay = 2 ns

Out-buffer delay = 1 ns

Total min delay = 4 ns
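The bandwidth and delay figures above follow from simple arithmetic, which the following Python sketch reproduces. The function names `max_bandwidth` and `min_delay` are hypothetical. The handshake periods (4 and 5 clock cycles) and the forwarding times (1, 2 and 1 clock cycles) are taken directly from the values in this section.

```python
from typing import Sequence


def max_bandwidth(period_clk: int, f_clk_hz: float) -> float:
    """Packets per second for a stage that accepts one packet every `period_clk` cycles."""
    return f_clk_hz / period_clk


def min_delay(forward_clk: Sequence[int], f_clk_hz: float) -> float:
    """Seconds for one packet to pass all stages when nothing is waiting."""
    return sum(forward_clk) / f_clk_hz


F = 1e9  # example clock frequency from the text, 1 GHz

bw_buffer = max_bandwidth(4, F)        # 1 clk forwarding + 3 clk turn-around
bw_switch = max_bandwidth(5, F)        # 2 clk send + 3 clk wait for RTR
bw_router = min(bw_buffer, bw_switch)  # the slowest stage limits the router
delay = min_delay([1, 2, 1], F)        # in-buffer, switch, out-buffer

print(f"{bw_buffer / 1e6:.0f} {bw_switch / 1e6:.0f} {bw_router / 1e6:.0f} MPackets/s")
print(f"{delay * 1e9:.0f} ns")
```

The same two functions can be called with other periods and forwarding times to evaluate a different design point or clock frequency.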

Figure 5-4. The communication process between a switch and buffers

5.4.2 Simulated and Implemented values

The fast design described in the previous section did not work in the FPGA. It was therefore decided to clock more of the signals, which makes the design slower but its behaviour safer. The values that work in the FPGA are shown below.

Buffers: Forwarding one packet: 2 clk

Turn-around time to receive a new packet: 3 clk

Switch: Load packet for address evaluation and send: 4 clk

Be ready to load a new packet: 4 clk

The number of clock cycles needed to transport a packet increases, so both the maximum BW and the minimum delay get worse.

Ex. F = 1 GHz

Max BW Buffers = $1/(5 \cdot 1 \cdot 10^{-9})$ = 200 MPackets/s

Max BW Switch = $1/(10 \cdot 1 \cdot 10^{-9})$ = 100 MPackets/s

Max BW overall = 100 MPackets/s

The minimum delay for one packet is calculated as follows.

In-buffer delay = 2 ns

Switch delay = 4 ns

Out-buffer delay = 2 ns

Total min delay = 8 ns


Although it was not possible to make a working FPGA implementation with the faster values, we conclude that these are the more realistic ones for an ASIC implementation, because a VLSI designer has far more possibilities than were available in this project. The technology mapping could of course be adjusted manually, but this is too time consuming and was therefore left out.

6 NoC: Prototyping on FPGA

6.1 Prototype Board

The design is implemented on a prototyping board with a Xilinx Spartan II XC2S100-PQ208. This is an FPGA with 100k gates, 600 CLBs and 1200 FFs that is clocked at 40 MHz. The board has a driver for RS-232, which can be connected to the ports of the FPGA.

6.2 Functional Description

A 2*2 Network on Chip (NoC) was implemented in the FPGA. This size was chosen because there is no room for a bigger network in the target FPGA. The design structure is shown in Figure 6-1.

6.2.1 Communication

Communication between two micro-routers passes through one output-buffer. Because the FPGA is too small, there is no room to implement the data-link layer. Data also travels through an input-buffer before it reaches the receiving micro-router.

Between the resource and the micro-router there is a Resource Network Interface (RNI). The RNI converts data between the format used in the resource and the format used in the network layer. The RNI is connected to the micro-router in the same way as micro-routers are connected to each other.

6.2.2 Resources

Resource zero receives ASCII characters from the RS-232 interface and passes them on to resource one, followed by the ASCII code for zero as the code for resource number zero.

Resource one receives ASCII characters from resource zero and passes them on to resource three. For each message that passes through, the resource recognizes the number that the ASCII code represents, forwards it and then sends its own ID number in ASCII code that number of times.

Resource three receives ASCII characters from resource one and passes them on to resource two. After each message that passes through resource three, the ASCII code for three is sent in the same manner as in resource one.

Resource two receives ASCII characters from resource three and sends them via RS-232 to the PC.

6.2.3 I/O-ports

There are two connections to the PC through an RS-232 interface. Resource zero receives ASCII characters via RS-232. Resource two also has an RS-232 interface and sends ASCII characters to the PC.

Figure 6-1. Overview of Network on Chip Prototype in FPGA

Label   Size     Description
Data    8 bits
MSN     1 bit    Message Sequence Number
PSN     1 bit    Packet Sequence Number
DPID    2 bits   Destination Process Identification
SPID    2 bits   Source Process Identification
HC      1 bit    Hop Counter
RID     2 bits   Resource Identification
Check   1 bit    Check bit

Figure 6-2. Bits in implementation
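As an illustration of how the fields in Figure 6-2 could be combined into one word, the following Python sketch packs and unpacks them. The field widths come from the figure and add up to 18 bits. The bit ordering (Data in the most significant bits, Check in the least significant bit) and the function names `pack` and `unpack` are assumptions, since the ordering is not given here.

```python
from typing import Dict

# Field widths from Figure 6-2. The ordering is an assumption:
# the first field is placed in the most significant bits.
FIELDS = [
    ("Data", 8),
    ("MSN", 1),
    ("PSN", 1),
    ("DPID", 2),
    ("SPID", 2),
    ("HC", 1),
    ("RID", 2),
    ("Check", 1),
]

TOTAL_BITS = sum(width for _, width in FIELDS)


def pack(values: Dict[str, int]) -> int:
    """Pack the named field values into a single integer word."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bit(s)")
        word = (word << width) | value
    return word


def unpack(word: int) -> Dict[str, int]:
    """Split a packed word back into its named fields."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values
```

For example, `unpack(pack(p))` returns `p` again for any dictionary `p` whose values fit in the field widths.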

6.3 Technology Mapping tool

The technology mapping was made using Xilinx Design Manager 4.1i. This stage can also be done in Leonardo Spectrum. However, the prototype board requires a special configuration of the start-up clock, and we could not find this setting in Leonardo. For this reason Leonardo could not be used.

Blocks in Figure 6-1: Micro Routers (0:0), (0:1), (1:0) and (1:1), each with an RNI; Resource 0 (Rx) and Resource 2 (Tx), each with an RS-232 interface to the PC; Resource 1 and Resource 3 (store, modify, forward).

6.4 Implementation Result

Figure 6-3 summarises how much space the different components take up. The tool is set for area optimisation. For the buffers, the switch/route unit and the micro-router, the total packet size is 17 bits because they only operate on the network layer.

Component         Slices   Flip-Flops   Gates   Logic levels
Buffer (1 pkt)    20       24           312     4
Buffer (2 pkts)   32       41           574     5
Buffer (3 pkts)   41       58           824     5
Switch/Route      105      45           1659    10
Micro-Router      385      370          6245    9
Implementation    947      751          15509   5

Figure 6-3. Implementation Result

The micro-router is synthesized with an output-buffer of 2 packets and an input-buffer of 1 packet in each of the 5 directions. Summing these components manually gives, in slices:

32*5+20*5+105=365

In flip-flops:

24*5+41*5+45=370

In gates:

312*5+574*5+1659=6089
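The manual sums can be checked against the synthesized micro-router with a few lines of Python. The numbers are copied from Figure 6-3. The dictionary layout and the names `manual_sum` and `difference` are hypothetical.

```python
# Resource usage per component, copied from Figure 6-3.
BUFFER_1PKT = {"slices": 20, "flip_flops": 24, "gates": 312}
BUFFER_2PKT = {"slices": 32, "flip_flops": 41, "gates": 574}
SWITCH = {"slices": 105, "flip_flops": 45, "gates": 1659}
MICRO_ROUTER = {"slices": 385, "flip_flops": 370, "gates": 6245}

DIRECTIONS = 5  # North, South, West, East and RNI

# One 1-packet and one 2-packet buffer per direction, plus one switch.
manual_sum = {
    key: DIRECTIONS * (BUFFER_1PKT[key] + BUFFER_2PKT[key]) + SWITCH[key]
    for key in SWITCH
}

# How much larger the micro-router is when synthesized as a whole.
difference = {key: MICRO_ROUTER[key] - manual_sum[key] for key in SWITCH}

print(manual_sum)
print(difference)
```

The flip-flop count matches exactly, while the micro-router synthesized as a whole uses 20 more slices and 156 more gates than the sum of its parts.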

Because of the limited space, the implementation is done with only one buffer in each direction. Each micro-router has only 3 in/out-buffers, since the network is only 2*2 and unconnected buffers at the boundary are not included in the design.


The size of the components when synthesized separately seems to be roughly the same as when they are synthesized together. One explanation for the larger total size of the micro-router may be that the gate depth is lowered to meet timing constraints. It is unsatisfactory that the implementation had to be done with the slower design. However, it seems to work well and shows that the principles of the design work.

7 Results

The objective of this thesis was divided into three parts. The first and most important objective was to evaluate the SDL language for modelling and simulation of a NoC. The results of this part are described in section 7.1. The second objective was to make a model of the NoC in VHDL, based on the description made in SDL. The results are described in section 7.2. The third and last objective was to implement a small NoC in an FPGA, which is described in section 7.3.

7.1 SDL Modelling and Simulation of NoC

One part of the work was to build the simulator in SDL. It was important to make this model as flexible and reliable as possible. We achieved this by building every functional unit as a separate design block. The advantage of the block model is that it is very easy to change the functionality of a single part of the node, for example the micro-router. It is now very easy to build a new micro-router based on a different routing algorithm and then run new simulations to see how the behaviour has changed. Properties that were changed regularly were implemented as synonyms at the uppermost block level. Examples of such properties are the buffer size and the time spent in different situations.

From the simulation results it is easy to see that a NoC built with more than one input-buffer is not a very good solution. With this configuration the routers start to route packets in the wrong direction as soon as the load on the network increases. This is caused by the behaviour of the micro-router: when a packet cannot be routed towards its destination, it is passed on to any other available buffer. When the router has to send packets in the wrong direction, the extra traffic causes even more packets to be misdirected.

7.2 Designing NoC using VHDL

A VHDL model has been designed based on the SDL model. In SDL the model was divided into several functional units, called blocks. When we developed the VHDL model we translated every such block into a component. It was not possible to translate SDL processes and procedures directly into VHDL, but the behaviour is very similar to that in SDL. In this project we have designed only one base model in VHDL, and this model can be optimised further. The simulations of the VHDL model gave us information on how many clock cycles the different processes take.

7.3 Implementation of NoC prototype in FPGA

We chose to implement a NoC prototype in an FPGA because we wanted to verify that the code was implementable and to demonstrate visibly that our model works. The prototype had to be very simple, since we only had a simple prototyping board with a 100k gate Spartan II. We decided to build a simple UART that could communicate with a PC via the RS-232 interface, see Figure 6-1. This made it possible to use the PC both to input data to the system and to view the results on the screen. The size of the network could only be 2x2, and two of the resources were already used for the UART. The two remaining resources behave identically. When resource 1 receives a digit from resource 0, for example 3, it forwards the digit to resource 3, followed by 3 packets containing its resource number (1 in this case). Resource 3 works in exactly the same way: when a digit is received it is forwarded to resource 2, and resource 3 then sends its resource number as many times as the forwarded digit indicates. Resource 2 sends all the packets it receives from resource 3 to the PC.

PC → Resource 0: 2

Resource 0 → Resource 1: 2

Resource 1 → Resource 3: 2,1,1

Resource 3 → Resource 2: 2,3,3,1,3,1,3

Resource 2 → PC: 2,3,3,1,3,1,3

Figure 7-1. Description of communication in NoC-prototype.
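The sequence in Figure 7-1 can be reproduced with a short behavioural model in Python. The function name `store_modify_forward` is hypothetical, packets are modelled as plain integers rather than ASCII codes, and resource 0 is modelled as a pure pass-through, as in the figure.

```python
from typing import List


def store_modify_forward(resource_id: int, incoming: List[int]) -> List[int]:
    """Forward each received digit, then send the resource ID that many times."""
    outgoing = []
    for digit in incoming:
        outgoing.append(digit)
        outgoing.extend([resource_id] * digit)
    return outgoing


from_pc = [2]                               # the PC sends the digit 2
r0_out = list(from_pc)                      # resource 0 (UART receiver) passes it on
r1_out = store_modify_forward(1, r0_out)    # sent to resource 3
r3_out = store_modify_forward(3, r1_out)    # sent to resource 2
to_pc = list(r3_out)                        # resource 2 (UART transmitter) sends all to the PC

print(r1_out)
print(to_pc)
```

Resource 3 applies the same rule to every packet it receives, including the IDs inserted by resource 1, which is why the output to the PC contains a 3 after each 1.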

8 Conclusions

In this project we have developed a generic model of a NoC architecture using SDL. We have also designed a small NoC in VHDL and prototyped it in a 100k gate FPGA. Our project demonstrates the feasibility of the NoC concept.

Developing something like a NoC is very interesting since it is a relatively unexplored area. We have tried to carry out a complete design flow, which has forced us to keep things simple. Looking at every step in the design, we see that there are possibilities to improve the models at all levels of abstraction. Although most of the effort has been spent on the SDL model and the simulations, there is still a lot that can be done. For example, since we see that the model can be used for fast simulations, it may be worthwhile to test other routing algorithms.

When we were planning the NoC prototype we first thought that it would be possible to implement a larger network, but the buffers took a large amount of space. For this reason the network size was set to two by two. This is a little too small to draw any real conclusions from, but we thought that it would be fun to get a working network prototype into an FPGA anyhow. There was also very little space left for the resources, and we had to make them very simple. It would be very interesting to build a larger network with some more advanced resources, like processors and RAM, to see how a NoC could be used.

One thing that seems important to include in the network is a mechanism that gives hard real-time properties for messages between resources. Some researchers have proposed that a circuit switched network could be the solution for this [16]. A circuit switched network makes it possible to transfer data very fast as soon as the data path has been locked.

Another way to improve the real-time properties is to give packets in the NoC different priorities. Packets with high priority can pass packets with lower priority, for example in the buffer. It is also possible for the router to give advantages to packets with high priority. If there is a packet with lower priority in the output buffer, it can be dropped to make space. In this way it is possible to implement some sort of hard real-time properties in a packet-switched network.

Similar behaviour can be seen in all simulations made on routers with only one output buffer, and this is the reason why we recommend at least two output buffers.

Increasing the number of buffers to three output buffers, or to two input and two output buffers, gives only a very small extra benefit. In some simulations the number of transferred messages is even lower and the transmission times are longer. Because of this, and because of the limited resources on a chip, we propose the configuration with one input buffer and two output buffers for the NoC.

References

[1] Shashi Kumar, Axel Jantsch, Juha-Pekka Soininen, Martti Forsell, Mikael Millberg, Johnny Öberg, Kari Tiensyrjä, and Ahmed Hemani, A network on chip architecture and design methodology, In Proceedings of IEEE Computer Society Annual Symposium on VLSI, April 2002.

[2] Edwin Rijpkema, Kees Goossens, and Paul Wielage, A Router Architecture for Networks on Silicon, Proceedings of Progress, 2nd Workshop on Embedded Systems, 2001.

[3] Luca Benini, Giovanni De Micheli, Networks on Chips: A New SoC Paradigm, IEEE Computer Society, 2002.

[4] Christer Bohm, Circuit Switching for High Performance Integrated Services Networks, Royal Institute of Technology, Department of Teleinformatics, June 1996.

[5] James F. Kurose, Computer Networking, Addison Wesley Longman 2001, ISBN 0-201-47711-4.

[6] Dhiman Deb Chowdhury, High Speed Lan Technology Handbook, Springer 2000, ISBN 9-783540-665977.

[7] Gary N. Higginbottom, Performance Evaluation of Communication Networks, Artech House 1998, ISBN 0-89006-870-4.

[8] Roland Airiau et al., Circuit Synthesis with VHDL, Kluwer Academic Publishers 1999, ISBN 0-7923-9429-1.

[9] Stefan Sjöholm et al., VHDL för konstruktion, Studentlitteratur 1999, ISBN 91-44-01250-0.

[10] Douglas L. Perry, VHDL, McGraw-Hill Inc. 1994, ISBN 0-07-049434-7.

[11] A. Olsen et al., System Engineering Using SDL-92, North-Holland 1997, ISBN 0-444-89872-7.

[12] Jan Ellsberger et al., SDL Formal Object-oriented Language for Communicating Systems, Prentice Hall 1997, ISBN 0-13-621384-7.

[13] Paul Wielage and Kees Goossens, Networks on Silicon: Blessing or Nightmare?, Philips Research Laboratories, Eindhoven, The Netherlands.

[15] Yi-Ran Sun, Simulation and Performance Evaluation for Networks on Chip, M Sc Thesis, KTH, Sweden.

[16] Dake Liu et al., SoCBUS: The solution of high communication bandwidth on chip and short TTM, Proc of the Real-Time and Embedded Computing Conference, Gothenburg, Sweden, Sep 2002.

9 Vocabulary

Burst - Bursty behaviour means that a certain number of packets are sent one after another at maximum rate, called a burst, followed by a delay, called the burst gap, before the process repeats. In this project the number of packets in a burst is selected randomly according to the Poisson distribution.

Continuous - In this project, continuous behaviour means that the delay between two transmitted packets can be set, which gives the output frequency.

FPGA - Field Programmable Gate Array. A programmable logic device that uses static RAM to store its configuration. It has a high logic capacity, but the device has to be re-programmed after a power shutdown.

NoC - Network on Chip. NoC is one proposed solution for communication in future SoC designs. The main idea is that resources on the chip communicate with each other through a network.

RNI - Resource Network Interface. The RNI is an interface that adapts the interface of the resource to the network.

Router - In a switched network there is a need to find a route through several switches. The switches that are cross-points in the network therefore also implement a routing function. They are called routers because of this functionality.

SoC - System on Chip. Multiple stand-alone VLSI designs are stitched together on a chip to provide one functional system.

VHDL - Very high speed integrated circuit Hardware Description Language. Initially developed for the US Department of Defence in order to describe digital circuits. Now also widely used for design and synthesis of these circuits.

VLSI - Very Large Scale Integration. Term for the process used when manufacturing chips containing several hundred thousand up to a million transistors.
