[Tutorial] NoC the Next Generation of Multi-Processor SoC

Network-on-Chip: The Next Generation of Multi-Processor System-on-Chip

EAIT, 2011

Presenters

Dr. Santanu Chattopadhyay, Associate Professor

Santanu Kundu, Research Scholar

Dept. of Electronics and Electrical Communication Engineering

Indian Institute of Technology, Kharagpur

email: santanu, [email protected]

Feb, 2011

Lecture – 1

Introduction

Dept. of Electronics & Electrical Communication Engg., IIT Kharagpur

Introduction

After mass-market production of dual-core and quad-core processor chips, the trend towards multi-core processing is now well established.

In multi-core processing, multiple processors (e.g. CPU, DSP) along with multiple computer components (e.g. microcontroller, memory blocks, timers, etc.) are integrated onto a single silicon chip. This architecture is often called a Multi-Processor System-on-Chip (MPSoC).

[Figure: Architecture overview of a Multi-Processor System-on-Chip. Each end node (device, SW interface, HW interface) is attached through a link to the communication medium.]


Introduction

Each on-chip component is referred to as an Intellectual Property (IP) block.

The communication medium used in modern multi-processor chips is bus based.

Up to tens of cores on a single chip, the performance of these bus-based chips is satisfactory. Beyond that, performance degrades as the number of attached cores grows.

[Figure: System-on-Chip (SoC). The communication backbone used in modern SoCs is a shared bus.]


Limitation of Shared Global Bus

• Communication Bottleneck: A shared bus allows only one communication at a time, and even in a hierarchical bus, a single communication can block all buses of the hierarchy.

• Scalability: A bus-based SoC does not scale with the system size, and its bandwidth is shared by all the systems attached to it.

[Figure: nodes attached to a shared bus; while one pair communicates, the other nodes are blocked (X).]



• The intrinsic parasitic resistance and capacitance can be quite high for a long bus line.

• The global bus delay increases exponentially with decrease in process technology.

• Every additional IP block adds to parasitic capacitance and causes increased propagation delay.

• In the deep sub-micron era, 80% or more of the delay of critical paths will be due to global interconnects.

[Figure: Relative evolution of wire and gate delays]

Reference: International Technology Roadmap for Semiconductor (ITRS) Documents (2003), Available at: http://public.itrs.net/Files/2003ITRS/Home2003.htm.


Shared Global Bus to Segmented Bus

• The shared global bus is segmented by inserting repeaters (R).

• In a segmented bus, delay increases linearly with decrease in process technology.

• No improvement in bandwidth, as it is still shared by all the cores attached to it.

• At the system level, it has a profound effect in changing the focus from computation to communication.

[Figure: Segmented Bus; Multi-Level Segmented Bus]


Point to Point Dedicated Links

Advantage:

• Bandwidth is higher than the shared bus.

Drawback:

• Switch size increases with increase in number of cores.

• Number of links needed grows quadratically with the number of cores (N(N − 1)/2 links for N fully connected cores).

• More metal layers are required in placement and routing.

[Figure: eight cores (0 to 7) connected by dedicated point-to-point links]
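The growth in link count can be checked with a short sketch. It assumes one bidirectional link per pair of cores (full connectivity); the function name is illustrative only.

```python
def point_to_point_links(n_cores: int) -> int:
    """Links needed when every pair of cores gets its own dedicated link.

    Assumption: one bidirectional link per pair, i.e. N * (N - 1) / 2.
    """
    return n_cores * (n_cores - 1) // 2


# Link count for a few system sizes; a shared bus would need a single medium.
for n in (4, 8, 16, 32):
    print(f"{n:2d} cores -> {point_to_point_links(n)} links")
```

Doubling the number of cores roughly quadruples the number of dedicated links, which is why this style of interconnect becomes impractical for large systems.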


Centralized Crossbar Switch

Components:

• Crossbar switch and

• Point-to-point links.

Advantage:

• A crossbar switch enhances the scalability to some extent.

Drawback:

• However, connecting a large number of cores with a single switch is not very effective, as it is not ultimately scalable and is thus an intermediate solution.

[Figure: nodes connected through a central crossbar switch]


Network-on-Chip: A Paradigm Shift

Only 3 components:

1. Network Interface (NI)

2. Switch (Router)

3. Point-to-Point Links

Off-Chip vs. On-Chip Networks

o The bandwidth of off-chip networks is typically much lower than that of on-chip networks.

o Off-chip networks are often affected by clock skew, whereas the clock skew problem is less significant for on-chip networks.

o Off-chip networks have higher latency than their on-chip counterparts.

o Area is not a strong constraint for off-chip networks, but for on-chip networks it is one of the major constraints.

Reference: Benini, L. and Micheli, G.D. (2002) ‘Network on chips: a new SOC paradigm’, IEEE Computer, Vol. 35, No. 1, pp.70–78.


Layers of Abstraction in Network-on-Chip

Session Layer - NoC Abstraction (Open Core Protocol Standardization)

Transport Layer - Network Interface

Network Layer - Router / Switch

Data Link Layer - Flow Control Protocol, Error Handling

Physical Layer - Physical Wire Connection


SoC to NoC: An Evolution

SoC

• Bandwidth is limited, shared

• Speed goes down as N grows

• Central arbitration

• No layers of abstraction

However:

• Fairly simple.

NoC

• Aggregate bandwidth grows

• Speed unaffected by N

• Distributed arbitration

• Separate abstraction layers

However:

• Complex architecture.


Design Goal of Network-on-Chip

• High throughput

• Low latency

• Scalable architecture

• Less energy consumption

• Smaller area requirements

• Reliability in communication

• Quality-of-Service support

Lecture – 2

Architecture Design and Performance Evaluation of

Network-on-Chip


Design Issues in Network-on-Chip

• Topology Selection

• Switching Techniques

• Routing

• Flow Control Protocol & GALS Implementation

• Buffering

• Arbitration


Switching Techniques

• Circuit Switching

Step 1: Request for circuit establishment. The request token travels from the source end node to the destination end node; routing and arbitration are performed during this step. The switches need buffers only for “request” tokens.


Circuit Switching

Step 2: Acknowledgment and circuit establishment. As the “ack” token travels back from the destination end node to the source end node, connections are established. The switches need buffers for “ack” tokens.

Circuit Switching

Step 3: Message transport from the source end node to the destination end node over the established circuit. Neither routing nor arbitration is required during this step.


Circuit Switching

Drawback: while a circuit is held for packet transport, a request from another source end node that needs the same links is blocked (X). The result is high contention, low utilization and low throughput.



• Store-and-forward Packet Switching

Packets are completely stored before any portion is forwarded. The switches hold buffers for data packets.

Store step: the whole packet is stored at a switch between the source end node and the destination end node.

Drawback:

1. Larger Buffer

2. More Latency



• Store-and-forward Packet Switching

Packets are completely stored before any portion is forwarded.

Forward step: once the whole packet has been stored, it is forwarded to the next switch.

Requirement: buffers must be sized to hold the entire packet.

Latency per router depends on the size of the packet.

Drawback:

1. Larger Buffer

2. More Latency



• Virtual Cut-Through Packet Switching

Latency per router is reduced by forwarding the header flit of a packet as soon as space for the entire packet is available in the next router.

Requirement: buffers must be sized to hold the entire packet. When the link is busy, the packet is completely stored at the switch.

Advantage: Lesser Latency

Drawback: Larger Buffer

[Figure: packet travelling from source end node to destination end node, blocked by a busy link]



• Wormhole Packet Switching

Requirement: packets can be larger than the buffers. When the link is busy, the flits of a packet stay stored along the switches on its path.

Advantage: Lower Buffer Space, Lesser Latency.

Drawback: Throughput is lower than that of Virtual Cut-Through.

[Figure: packet travelling from source end node to destination end node, blocked by a busy link]
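The latency difference between store-and-forward and cut-through style switching can be illustrated with the usual idealized zero-load model. This is a sketch under stated assumptions, not a result from the tutorial: there is no contention, each link moves one flit per cycle, and `router_delay` (a hypothetical parameter) is the per-hop routing delay in cycles.

```python
def store_and_forward_cycles(hops: int, packet_flits: int, router_delay: int = 1) -> int:
    """Every hop waits for the whole packet before forwarding it."""
    return hops * (router_delay + packet_flits)


def cut_through_cycles(hops: int, packet_flits: int, router_delay: int = 1) -> int:
    """Header is forwarded at once; the body streams behind it (virtual
    cut-through and wormhole behave the same when nothing is blocked)."""
    return hops * router_delay + packet_flits


# Example: a 64-flit packet crossing 4 routers.
print(store_and_forward_cycles(4, 64))  # packet length is paid at every hop
print(cut_through_cycles(4, 64))        # packet length is paid only once
```

In this model store-and-forward latency scales with hops × packet length, while cut-through and wormhole latency scales with hops + packet length.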


Network Interface (NI) Module

• Protocol Conversion

• Clock Domain Shifting

Network Interface (NI) Module

Flitization: packet (64 × 32 bits) → flits

Header (32-bit): eop, bop, GT/BE, Src_add, Dest_add

Payload 1 (32-bit): eop, bop, GT/BE, DATA 1

Payload 2 (32-bit): eop, bop, GT/BE, DATA 2

…

Tailer (32-bit): eop, bop, GT/BE, DATA n

Deflitization: flits → packet (64 × 32 bits)

• 1 Packet = 64 Flits

• 1 Flit = 32 bits
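A minimal sketch of flitization and deflitization follows. The slide gives the field names (bop, eop, GT/BE, Src_add, Dest_add) but not their widths or bit positions, so the 8-bit address packing and the interpretation of GT/BE as a single traffic-class flag are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Flit:
    bop: bool   # beginning-of-packet marker (set on the header flit)
    eop: bool   # end-of-packet marker (set on the tail flit)
    gt: bool    # GT/BE flag; treated here as a one-bit traffic class (assumption)
    body: int   # header: packed addresses; otherwise one payload word


def flitize(src: int, dest: int, words: List[int], gt: bool = False) -> List[Flit]:
    """Split a packet into a header flit followed by one flit per payload word."""
    if not words:
        raise ValueError("a packet needs at least one payload word")
    # Assumed packing: 8-bit source and 8-bit destination address in the header.
    flits = [Flit(bop=True, eop=False, gt=gt, body=(src << 8) | dest)]
    last = len(words) - 1
    for i, word in enumerate(words):
        flits.append(Flit(bop=False, eop=(i == last), gt=gt, body=word))
    return flits


def deflitize(flits: List[Flit]) -> Tuple[int, int, List[int]]:
    """Rebuild (source, destination, payload words) from a flit sequence."""
    header, *rest = flits
    if not (header.bop and rest and rest[-1].eop):
        raise ValueError("malformed packet")
    return header.body >> 8, header.body & 0xFF, [f.body for f in rest]


# 1 header + 63 payload words = 64 flits, matching "1 Packet = 64 Flits".
packet = flitize(src=3, dest=9, words=list(range(63)))
print(len(packet), deflitize(packet)[:2])
```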



• Switching Techniques

• Topology Selection

• Routing

• Flow Control Protocol & GALS Implementation

• Buffering

• Arbitration


Topology Selection

• Number of Links: A topology with a large number of links can support high bandwidth.

• Diameter: Maximum shortest-path distance between two nodes in the network. Networks with small diameters are preferable.

• Average Distance: The average of the distances between all pairs of nodes of a graph. A topology with a smaller average distance is preferable.

• Bisection Width: Minimum number of wires that must be removed in order to bisect a network. A larger bisection width enables faster information exchange and is preferable.

• Node Degree: Number of channels connecting the node to its neighbors. The lower this number, the easier it is to build the network.

• Topology selection is application dependent.

[Figure: 2D Mesh with 16 cores]

Reference: Interconnection Network Architectures (2001) pp.26–49, Available at: www.wellesley.edu/cs/ courses/cs331/notes/notesnetworks.pdf


Existing Topologies in NoC

2D Mesh

All switches are connected to the four closest other switches and the target resource block via two opposite uni-directional links, except those switches on the edge of the layout.

For an M×N Mesh,

Diameter: (M + N - 2)

Bisection Width: min (M, N)

No. of routers required: (M * N)

Node Degree: 3 (corner), 4 (edge), 5 (central).

CLICHÉ: Chip-Level Integration of Communicating Heterogeneous Elements

[Figure: 2D mesh of 16 cores]

Reference: Kumar, S., Jantsch, A., Soininen, J. P., Forsell, M., Millberg, M., Oberg, J., Tiensyrja, K. andHemani, A. (2002) ‘A network on chip architecture and design methodology’, Proc. of. ISVLSI, pp.117–124.



2D Torus

Wires are wrapped around from the top component to the bottom and from the rightmost to the leftmost.

For an M×N Torus,

Diameter: ⌊M/2⌋ + ⌊N/2⌋

Bisection Width: 2 * min (M, N)

No. of routers required: (M * N)

Node Degree: 5

Disadvantage: The long end-around connections can yield excessive delays.

[Figure: 2D torus of 16 cores]

Reference: Dally, W. J. and Towles, B. (2001) ‘Route packets, not wires: on-chip interconnectionnetworks’, Proceedings of the 38th Design Automation Conference (DAC 2001), pp.684–689.
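The diameter and average-distance figures for mesh and torus can be cross-checked by breadth-first search over the switch graph. The sketch below counts only switch-to-switch hops (the link to the local core is ignored), and its function names are illustrative.

```python
from collections import deque
from itertools import product


def grid_neighbors(x, y, m, n, wrap):
    """Neighbouring switches of (x, y) in an m x n mesh (wrap=False) or torus (wrap=True)."""
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if wrap:
            yield nx % m, ny % n
        elif 0 <= nx < m and 0 <= ny < n:
            yield nx, ny


def hop_stats(m, n, wrap=False):
    """Return (diameter, average distance) measured in switch-to-switch hops."""
    nodes = list(product(range(m), range(n)))
    diameter, total = 0, 0
    for src in nodes:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # plain breadth-first search from src
            cur = queue.popleft()
            for nxt in grid_neighbors(*cur, m, n, wrap):
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        diameter = max(diameter, max(dist.values()))
        total += sum(dist.values())
    ordered_pairs = len(nodes) * (len(nodes) - 1)
    return diameter, total / ordered_pairs


print("4x4 mesh :", hop_stats(4, 4))
print("4x4 torus:", hop_stats(4, 4, wrap=True))
```

For a 4×4 network the search gives a mesh diameter of M + N − 2 = 6 and a torus diameter of 2 + 2 = 4, and the torus also has the smaller average distance.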




Existing Topologies in NoC: Solving Delay Problem of Torus

Reducing the maximum physical link length

Existing Topologies in NoC: Folded Torus

[Figure: 2D Torus of 16 cores; 2D Folded Torus of 16 cores]

Reference: Dally, W.J. and Seitz, C.L. (1986) ‘The torus routing chip’, Journal of DistributedComputing, Vol. 1, No. 4, pp.187–196.



Existing Topologies in NoC: Octagon

For a network having N number of IP blocks,

Diameter: 2 * N/8.

For a system consisting of more than eight nodes, the network is extended to multidimensional space.

Drawback:

Wiring complexity increases linearly with number of nodes.

[Figure: 2D Octagon of 8 cores]

Reference: Karim, F., Nguyen, A. and Dey, S. (2002) ‘An interconnect architecture for networkingsystems on chips’, IEEE Micro, Vol. 22, No. 5, pp.36–45.



Existing Topologies in NoC: Binary Tree

A binary tree-based network with N (power of 2) number of IP cores has,

Diameter: log2 N

Bisection Width: 1

No. of Routers required: (N/2 − 1)

Node Degree: 5 (leaf), 3 (stem), 2 (root)

Drawback: Bisection width is very low.

Advantage: Smaller diameter.

[Figure: 2D Binary Tree of 16 cores]

Reference: Jeang, Y. L., Huang, W. H. and Fang, W. F. (2004) ‘A binary tree architecture for application specificnetwork on chip (ASNOC) design’, IEEE Asia-Pacific Conference on Circuits and Systems, pp.877–880.



Existing Topologies in NoC: Fat Tree

SPIN: Scalable, Programmable, Integrated Network

Every level has the same number of switches. The functional IP blocks reside at the leaves and the switches reside at the vertices.

For N number of IP blocks, the network has,

Diameter: log2 N/4

Bisection Width: N/2

No. of Routers required: (N · log2 N)/8

Node Degree: 8 (non-root node), 4 (root node).

Advantage: Large Bisection Width, Smaller Diameter

Drawback: High Node Degree

[Figure: 2D Fat Tree of 16 cores]

Reference: Guerrier, P. and Greiner, A. (2000) ‘A generic architecture for on-chip packet-switchedinterconnections’, Proceedings of Design, Automation and Test in Europe (DATE 2000), pp.250–256.




Existing Topologies in NoC: Butterfly Fat Tree (BFT)

In the network, the IPs are placed at the leaves and the switches are placed at the vertices. For N number of IPs, the network has,

Diameter: log2 N/4

Bisection Width: √N

No. of Routers needed: (≈ N/2)

Node Degree: 6 (non-root), 4 (root)

Advantage

- Requires fewer switches

- Low diameter and large bisection width

Drawback

- High node degree.

[Figure: 2D BFT of 16 cores]

Reference: Pande, P. P., Grecu, C., Ivanov, A. and Saleh, R. (2003), ‘High-throughput switch-based interconnectfor future SoCs’, Proc. Int’l Workshop on System-on-Chip for Real Time Applications, pp.304–310.



Mesh-of-Tree Topology

- In an M × N MoT, M denotes the number of Row Trees and N denotes the number of Column Trees. Both M and N are powers of 2.

- Number of nodes = 3*M*N – (M + N).

- Small Diameter (2 log2 M + 2 log2 N).

- Large Bisection Width [min (M,N)].

Drawback

- Non-planar topology.

Reference: Kundu, S. and Chattopadhyay, S. (2008), “Mesh-of-Tree Deterministic Routing for Network-on-Chip Architecture”, ACM Great Lake Symposium on VLSI, pp. 343–346.

[Figure: 4 × 4 Mesh-of-Tree connecting 32 cores]



• Routing

• Flow Control Protocol & GALS Implementation

• Buffering

• Arbitration

Routing

Source Routing vs. Distributed Routing

Source routing

o The route is computed at the source, so the routing control unit in the switches is simplified.

o Headers containing the route tend to be larger, which increases overhead.

Distributed routing

o The next hop is computed at each switch by a finite-state machine or by a look-up table.

Deterministic Routing vs. Adaptive Routing

Deterministic routing

o Always follows a specified path.

o Easy to implement and supports in-order delivery.

Adaptive routing

o Takes different paths based on congestion and faults; this destroys in-order delivery.

o Decisions use historical channel load information, length of queues, and status of nodes and links.


Routing Challenges

• Livelock

o Arises from an unbounded number of allowed non-minimal hops.

o Solution: restrict the number of non-minimal hops allowed.

• Deadlock

o Arises from a set of packets being blocked, waiting only for network resources (i.e., links, buffers) held by other packets in the set.

o Probability increases with increased traffic and decreased availability.

[Figure: Live-lock in Adaptive Routing]


Routing Dependent Deadlock

ci = channel i, si = source node i, di = destination node i, pi = packet i

[Figure: Routing of packets in a 2D mesh (packets p1 to p5, sources s1 to s5, destinations d1 to d5, channels c0 to c12) and the corresponding channel dependency graph]


Routing Dependent Deadlock Avoidance

Deterministic Routing in 2D Mesh using Dimension Ordered Routing

Establish an ordering on all resources based on network dimension. Example:

X-Y Routing: First, route horizontally until the X co-ordinate matches; then route vertically until the Y co-ordinate matches.

X-Y Routing ⇒ No cycle in the Channel Dependency Graph
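Dimension-ordered routing is simple enough to sketch directly. The code below assumes that nodes are addressed as (x, y) with x as the horizontal co-ordinate; the function name is illustrative.

```python
def xy_route(src, dst):
    """Return the list of nodes visited by X-Y routing from src to dst.

    The packet first moves along X until the column matches, then along Y.
    Because a packet never turns from the Y dimension back to X, the
    channel dependency graph cannot contain a cycle.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:              # phase 1: horizontal movement only
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:              # phase 2: vertical movement only
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))
```

The path is always minimal and depends only on the source and destination, which is what makes the scheme deterministic.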



Routing Dependent Deadlock Avoidance

Deadlock Free Adaptive Routing in 2D Mesh: Turn Model

[Figure: allowed turns for deterministic routing and for the adaptive routing algorithms West First, North Last and Negative First]

Reference: Glass, C. J. and Ni, L. M. (1992), ‘Turn Model for Adaptive Routing’, Proceedings ofInternational Symposium on Computer Architecture, pp. 278 – 287.



Routing Dependent Deadlock Avoidance

Deadlock Free Adaptive Routing in 2D Mesh: Odd-Even Turn Model

Rule 1. No packet is allowed to take an EN turn or an ES turn at any node located in an even column.

Rule 2. No packet is allowed to take an NW turn or an SW turn at any node located in an odd column.

Reference: Chiu, G. M. (2000), ‘The Odd-Even Turn Model for Adaptive Routing’, IEEE Transactions on Parallel and Distributed Systems, pp. 729 – 738.
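The two odd-even rules amount to a small table lookup. In this sketch a turn is written as (current travel direction, new travel direction), so an "EN turn" is a packet travelling east that turns north; the names are illustrative and column numbering starts at 0.

```python
# Turns forbidden by the odd-even turn model, keyed by column parity.
FORBIDDEN_TURNS = {
    "even": {("E", "N"), ("E", "S")},   # Rule 1
    "odd": {("N", "W"), ("S", "W")},    # Rule 2
}


def turn_allowed(column: int, incoming: str, outgoing: str) -> bool:
    """True if a packet in `column` may turn from `incoming` to `outgoing`."""
    parity = "even" if column % 2 == 0 else "odd"
    return (incoming, outgoing) not in FORBIDDEN_TURNS[parity]


print(turn_allowed(2, "E", "N"))  # even column, EN turn
print(turn_allowed(3, "E", "N"))  # odd column, EN turn
```

An adaptive router can call such a check on every candidate output port and choose among the ports that remain.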



Routing Dependent Deadlock Avoidance

Deterministic Routing in 2D Torus and Folded Torus by using Virtual Channels

Messages at a node numbered less than their destination node are routed on the high channels, and messages at a node numbered greater than their destination node are routed on the low channels.

[Figure: four-node ring n0, n1, n2, n3 with the routes n0 → n2, n1 → n3, n2 → n0, n3 → n1]

Reference: Dally, W. J. and Seitz, C. L., (1987) ‘Deadlock Free Message Routing in Multiprocessor Interconnection Networks’, IEEE Transactions on Computers, vol. C-36, no. 5, pp. 547 – 553.
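The high/low channel rule can be traced hop by hop. The sketch assumes a unidirectional ring in which node i forwards to node (i + 1) mod N; that direction and the function name are assumptions for illustration.

```python
def ring_route(src: int, dest: int, n_nodes: int):
    """List (node, virtual channel) for every hop from src to dest on the ring."""
    hops, node = [], src
    while node != dest:
        # Rule from the slide: node < destination -> high channel, else low.
        hops.append((node, "high" if node < dest else "low"))
        node = (node + 1) % n_nodes   # assumed ring direction
    return hops


for s, d in ((0, 2), (1, 3), (2, 0), (3, 1)):
    print(f"n{s} -> n{d}:", ring_route(s, d, 4))
```

A message that crosses the wrap-around link starts on the low channels and switches to the high channels after the wrap, so the cyclic dependency around the ring is broken.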


Deadlock Recovery

Allow deadlock to occur, but once a potential deadlock situation is detected, break at least one of the cyclic dependencies to recover gracefully. The common techniques are:

Regressive recovery (abort-and-retry): Remove packet(s) from a dependency cycle by killing (aborting) and later re-injecting (retrying) the packet(s) into the network after some delay.

Progressive recovery (preemptive): Remove packet(s) from a dependency cycle by rerouting the packet(s) onto a deadlock-free lane.



• Routing

• Flow Control Protocol & GALS Implementation

• Buffering

• Arbitration

Flow Control Protocol


Globally Asynchronous Locally Synchronous (GALS) Style of Communication

Reference: Kundu, S. and Chattopadhyay, S. (2007) ‘Interfacing Cores and Routers in Network-on-Chip Using GALS’, IEEE International Symposium on Integrated Circuits (ISIC 2007), pp.



• Routing

• Buffering

• Arbitration


Counter based FIFO

• Binary Counter Based

Drawback:

1. There can be considerable ambiguity when a count is read during a count transition, because several bits may change at once.

• Gray Code Counter Based

Drawback:

1. The FIFO depth must be a power of 2, which wastes area when the required depth is not a power of 2.

Reference: Yi, Cheng, “Gray code sequences”, U. S. Patent 6703950, March 9, 2004.
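The reason a Gray-code counter avoids the read ambiguity is that consecutive counts differ in exactly one bit. A short sketch of the standard reflected binary Gray code conversion (function names are illustrative):

```python
def binary_to_gray(value: int) -> int:
    """Reflected binary Gray code of a non-negative integer."""
    return value ^ (value >> 1)


def gray_to_binary(gray: int) -> int:
    """Inverse conversion: XOR together all right shifts of the Gray code."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


# A 3-bit pointer for a depth-8 FIFO: each step flips exactly one bit,
# including the wrap from 7 back to 0.
for count in range(8):
    print(count, format(binary_to_gray(count), "03b"))
```

The single-bit wrap-around only holds for a full 2^n sequence, which is where the power-of-2 depth restriction comes from.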


Gray Counter Based Dual Clock FIFO


Reference: Cummings, C. E. and Alfke, P. (2002) ‘Simulation and Synthesis Techniques for Asynchronous FIFODesign with Asynchronous Pointer Comparisons’, Synopsys Users Group Conference, vol. User Papers.


Functionality of Asynchronous Comparator


Full = ( (waddr == raddr) && (wr_dir != rd_dir) )

Empty = ( (waddr == raddr) && (wr_dir == rd_dir) )
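The two flag equations can be exercised with a small behavioural model. The slide does not define wr_dir and rd_dir, so this sketch assumes they are wrap bits that toggle each time the corresponding pointer wraps around; the class and its methods are hypothetical and ignore clocking entirely.

```python
class FifoPointers:
    """Behavioural model of the read/write pointers of a FIFO (no data storage)."""

    def __init__(self, depth: int):
        self.depth = depth
        self.waddr = self.raddr = 0
        self.wr_dir = self.rd_dir = 0   # assumed: wrap bits, toggled on wrap-around

    def _advance(self, addr: int, direction: int):
        addr += 1
        if addr == self.depth:
            return 0, direction ^ 1
        return addr, direction

    def write(self):
        # The caller is expected to check `full` first; this model does not guard.
        self.waddr, self.wr_dir = self._advance(self.waddr, self.wr_dir)

    def read(self):
        # The caller is expected to check `empty` first.
        self.raddr, self.rd_dir = self._advance(self.raddr, self.rd_dir)

    @property
    def full(self) -> bool:
        return self.waddr == self.raddr and self.wr_dir != self.rd_dir

    @property
    def empty(self) -> bool:
        return self.waddr == self.raddr and self.wr_dir == self.rd_dir


fifo = FifoPointers(depth=4)
for _ in range(4):
    fifo.write()
print(fifo.full, fifo.empty)
```

Equal addresses alone cannot distinguish full from empty; the extra direction information resolves the ambiguity.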


Metastability

• The Full and Empty signals are controlled by both clocks, so there is a probability of metastable states arising.

• Two-stage synchronizers are used to reduce the probability of metastability.

• The Full signal is synchronized with the ‘wr-clk’ and the Empty signal is synchronized with the ‘rd-clk’.



• Routing

• Buffering

• Arbitration


Arbitration



Router Architecture

Input Channel

• Input Buffer

• Routing Computation Unit

• Control Unit

Output Channel

• Output Buffer

• Arbiter

• Control Unit

Reference: Kundu, S. and Chattopadhyay, S. (2008) ‘Network-on-chip architecture design based on Mesh-of-Treedeterministic routing topology’, Int’l Journal of High Performance Systems Architecture, Vol. 1, No. 3, pp. 163-182.


Wormhole Router Architecture Data Path

[Figure: Wormhole router data path. Each input port has a physical channel, link control and an input buffer (IB). The header flit goes to the routing control unit (RC), which runs the routing algorithm and produces the output port number. The arbitration unit (SA) drives the crossbar control. Flits cross the crossbar (ST) to the output buffer (OB), link control and physical channel. The critical path is marked.]

Pipeline stages: IB (Input Buffering), RC (Route Computation), SA (Switch Allocation), ST (Switch Traversal), OB (Output Buffering)


Flit Traversal Through Wormhole Router

[Figure: Wormhole router data path with input buffer (IB), routing algorithm (RC), arbitration unit (SA), crossbar (ST) and output buffer (OB)]

Pipeline stages: IB (Input Buffering), RC (Route Computation), SA (Switch Allocation), ST (Switch Traversal), OB (Output Buffering)

Packet Header: IB RC SA ST OB

Packet Payload 1: IB IB IB ST OB

Packet Payload 2: IB IB IB ST OB

Packet Payload 3: IB IB IB ST OB

Only the header flit goes through RC and SA; the payload flits wait in the input buffer and then follow the header through ST and OB.


Performance Evaluation

Performance Metrics

Throughput (unit: flits/cycle/IP):

\[ TP = \frac{(\text{Maximum Accepted Packets}) \times (\text{Packet length})}{(\text{Number of IP blocks}) \times (\text{Total time})} \]

Latency: The time (in clock cycles) that elapses between the injection of a message header into the network at the source node and the reception of the tail flit at the destination node.

\[ L_{avg} = \frac{\sum_{i=1}^{P} L_i}{P} \]

where P = total number of messages and L_i = latency of message i.

Bandwidth: The maximum number of bits that can be sent successfully to the destination through the network per second. It is expressed in bps (bits/sec).

Cost Metrics

Energy dissipation: Energy consumed by routers and links at different workloads. Average energy/packet and average energy/clock cycle are measured.

Area requirements: The percentage of chip area occupied by the switches and links is taken into consideration.

Simulator Design for Performance Evaluation

Types of Simulator

1. Cycle Accurate:

o Samples the state of the signals at every clock edge (positive or negative).

o Much faster than event-driven simulation.

2. Event Driven:

o Most accurate, as every active signal is calculated for every device during the clock cycle as it propagates.

o Each signal is simulated for its value and its time of occurrence.

o Excellent for timing analysis and for verifying race conditions.

o Computation intensive (depends on the number of activities) and hence very slow.

To calculate performance metrics like throughput, latency etc., the delay after each and every gate is not required. In that case a Cycle Accurate Simulator is the best choice.


Existing NoC Simulators

Some Existing NoC Simulators | Drawbacks

NIRGAM (University of Southampton, UK) | Limited to mesh topology; no power evaluation

MPARM - Xpipes (University of Bologna, Italy) | Not freely available

NS2 (Open Source) | Packet level transaction


Cycle Accurate Simulator for NoC Modeling

The simulator should operate at the granularity of individual architectural components of the router.

SystemC is normally preferred.

Traffic generators are used for evaluating the performance of the NoC.

Router

o Input Channel: Input Buffer, Routing Computation Unit, Control Unit

o Output Channel: Output Buffer, Arbiter, Control Unit

Traffic Generation

• Poisson Distribution

• Self-Similar Traffic

• Application Specific Traffic

Network

1. Throughput

2. Latency

3. Bandwidth


Traffic Generator

Application-driven traffic is best suited for performance evaluation.

Due to unavailability of the same, synthetic traffic source models are also used.

The nature of traffic in a NoC is generally bursty.

• A Poisson process

o When observed on a fine time scale, it will appear bursty.

o The burst length of a Poisson arrival process tends to be smoothed by averaging over a long enough time scale.

o A Poisson process fails to capture the actual burstiness of NoC traffic.

o Short range dependence

Reference: Varatkar, G.V. and Marculescu, R. (2004) ‘On-chip traffic modeling and synthesis for MPEG-2 videoapplications’, IEEE Trans. on Very Large Scale Integration (VLSI) Systems, Vol. 12, No. 1, pp. 108-119.


Traffic Generator

• A Self-Similar (fractal) process

When aggregated over a wide range of time scales, it will maintain its bursty characteristic. Self-similarity manifests itself in several equivalent fashions:

o Slowly decaying variance

o Long range dependence

o Non-degenerate autocorrelations

o Heavy tailed

A self-similar process can be generated by superposing ON-OFF Pareto sources.

Reference: Park, K. and Willinger, W. (2000) ‘Self-Similar network traffic and performance evaluation’, A Wiley-Interscience Publication, John Wiley & Sons, Inc.
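The superposition of ON-OFF Pareto sources can be sketched with the standard library. Each source alternates between ON periods (one packet per time slot) and OFF periods (silent), with period lengths drawn from Pareto distributions. The shape parameters 1.9 and 1.25 and all names here are illustrative assumptions, not values taken from the tutorial.

```python
import random
from typing import List


def on_off_source(n_slots: int, alpha_on: float, alpha_off: float,
                  rng: random.Random) -> List[int]:
    """0/1 activity trace of one source with Pareto-distributed period lengths."""
    trace: List[int] = []
    is_on = rng.random() < 0.5
    while len(trace) < n_slots:
        alpha = alpha_on if is_on else alpha_off
        # paretovariate returns a value >= 1; cap it so the trace stays bounded.
        period = min(int(rng.paretovariate(alpha)), n_slots - len(trace))
        trace.extend([1 if is_on else 0] * period)
        is_on = not is_on
    return trace


def aggregate_traffic(n_sources: int, n_slots: int, alpha_on: float = 1.9,
                      alpha_off: float = 1.25, seed: int = 0) -> List[int]:
    """Packets injected per time slot by the superposition of all sources."""
    rng = random.Random(seed)
    sources = [on_off_source(n_slots, alpha_on, alpha_off, rng)
               for _ in range(n_sources)]
    return [sum(slot) for slot in zip(*sources)]


load = aggregate_traffic(n_sources=8, n_slots=1000)
print(max(load), sum(load) / len(load))
```

Heavy-tailed period lengths are what keep the aggregate bursty across time scales; replacing them with exponential periods would give short-range dependent traffic instead.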


Traffic Parameter

Offered Load: Number of packets injected in a particular time interval.

Locality Factor: Ratio of the traffic from a core destined to the local cluster to the total traffic injected by that core.

Locality Factor = 0 signifies uniformly distributed traffic.

For example, in a 4x4 Mesh, the distances (d) of the destinations from one corner source (S) are d = 1, 2, 3, 4, 5 and 6. If locality factor = 0.5, then 50 percent of the traffic will go to the cluster having d = 1. The remaining 50 percent of the traffic will be distributed as follows:

o 15% will go to the cluster having d = 2

o 12.5% will go to the cluster having d = 3

o 10% will go to the cluster having d = 4

o 7.5% will go to the cluster having d = 5

o 5% will go to the cluster having d = 6

If there is more than one core in a cluster, the traffic will be randomly distributed among them.

[Figure: 4x4 Mesh with source S in one corner and clusters at d = 0 to d = 6]
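The percentages in the worked example are consistent with weights that fall off linearly with distance (6 : 5 : 4 : 3 : 2 for d = 2 to 6). The sketch below reproduces the example under that inferred weighting; the rule itself is an assumption read off the numbers, not a formula stated in the tutorial, and the sketch does not model the Locality Factor = 0 (uniform) case.

```python
from typing import Dict


def destination_shares(locality_factor: float, max_distance: int) -> Dict[int, float]:
    """Fraction of a core's traffic sent to the cluster at each distance d."""
    shares = {1: locality_factor}
    # Inferred weighting: weight(d) = max_distance + 2 - d for d >= 2.
    weights = {d: max_distance + 2 - d for d in range(2, max_distance + 1)}
    total_weight = sum(weights.values())
    for d, w in weights.items():
        shares[d] = (1.0 - locality_factor) * w / total_weight
    return shares


for d, share in destination_shares(0.5, 6).items():
    print(f"d = {d}: {100 * share:.1f}%")
```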



Performance Evaluation

Performance of any network depends on the following network parameters:

o Topology

o Locality factor of the traffic

o Buffer Position and Buffer Depth

o Switching Technique

o Number of cores attached

Theoretically,

Throughput ∝ (Number of Links) / (Average Distance)

Latency ∝ Average Distance

Here, the wormhole router architecture is used to form the network with the following parameters:

o Number of cores attached = 32

o Message Length = Packet Length = 64 flits

o Each flit consists of 32 bits

o Total simulation cycles = 2 lakh (200,000) with a 10,000-cycle settling time

[Figure: Mesh and BFT networks]


Performance Evaluation

Throughput varies with topology and locality factor.

Throughput = Maximum Accepted Traffic in flits/cycle/IP

We kept buffer depth = 6 in both input and output channels of the router in all the cases.

Performance Evaluation

Latency decreases with increase in locality factor in different topologies.



Power Evaluation Flow

Router Power Evaluation

Reference: Synopsys PrimePower, Design Vision manual (Version Y-2006.06)

Operating Condition: Process = 1, Voltage = 1 volt, Temp = 75 °C



Power Evaluation Flow: Link Length Estimation

Mesh

Estimated length of wires: 1.25 mm, 2.5 mm

Butterfly Fat Tree (BFT)

Estimated length of wires: 1.25 mm, 5.0 mm


Interconnect Modeling

Copper wire (resistivity = 17 nΩ-m) of Metal Layer 4 (semi-global) has been taken.

To reduce the wiring area, we have chosen the minimum dimensions of Metal Layer 4. The dimensions are:

Width (W) = 0.2 µm

Spacing (S) = 0.2 µm

Pitch = W + S = 0.4 µm

Thickness (T) = 0.5 µm

H = 0.75 µm

Dielectric Constant = 2.9

[Figure: Cross-section of interconnects (Layer 3, Layer 4, Layer 5)]

Link Energy Evaluation

Parasitic components (R, C, L) of the three-wire model have been extracted with the field solver tool of HSPICE. The energy consumption of the middle wire for different transitions is also obtained from HSPICE.



Three-wire modeling

o Data rate: 32 × 200 Mbit/s

o Driver sizes are designed based on the length of the wire.

o Load capacitance at the other end of the wire is 5 fF.

o A Look-Up Table (LUT) is made for the middle-line energy consumption.


Energy Consumption in Mesh Topology

Network Energy = Router Energy + Link Energy

The simulation runs for 2 lakh (200,000) clock cycles with a clock period of 5 ns.

Internal power dominates.


Comparison of Energy Consumption




Energy – Performance Trade-Off

Throughput Variation with FIFO Depth & Position in Mesh

FIFO_Depth_4-4 => Input Channel FIFO Depth = 4, Output Channel FIFO Depth = 4

FIFO_Depth_4-6 => Input Channel FIFO Depth = 4, Output Channel FIFO Depth = 6

FIFO_Depth_6-6 => Input Channel FIFO Depth = 6, Output Channel FIFO Depth = 6

FIFO_Depth_4-0 => Input Channel FIFO Depth = 4, No FIFO at Output Channel

FIFO_Depth_6-0 => Input Channel FIFO Depth = 6, No FIFO at Output Channel


Energy – Performance Trade-Off

Latency Variation with FIFO Depth & Position in Mesh

[Figure legend: FIFO_Depth_X-Y denotes input channel FIFO depth X and output channel FIFO depth Y; Y = 0 means no FIFO at the output channel. Configurations: 4-4, 4-6, 6-6, 4-0, 6-0.]


Energy – Performance Trade-Off
Energy Variation with FIFO Depth & Position in Mesh

Simulation runs for 2 lakh (200,000) clock cycles with a clock period of 5 ns.

Legend: FIFO_Depth_m-n denotes input channel FIFO depth m and output channel FIFO depth n (n = 0 means no FIFO at the output channel). The configurations compared are 4-4, 4-6, 6-6, 4-0, and 6-0.


Energy – Performance Trade-Off
Trade-Off in Mesh at Saturation (load = 160)

FIFO_Depth_6-0 shows the best energy-performance trade-off.


Network Energy Consumption in Mesh after FIFO Optimization

Simulation runs for 2 lakh (200,000) clock cycles with a clock period of 5 ns.

Internal power still dominates.

We kept FIFO depth = 6 in the input channel and no FIFO at the output channel.


Comparison of Energy Consumption after FIFO Optimization

We kept FIFO depth = 6 in the input channel and no FIFO at the output channel.


Internal Power

[Figure: Netlist view of a D-type flip-flop with synchronous clear input in Synopsys Design Vision]

• Internal power = short-circuit power + internal node switching power
• The output node of the clock buffer switches continuously with a free-running clock.

To minimize internal power: stop the clock when the network is idle.
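The effect of stopping the clock can be illustrated with a toy model that simply counts the cycles in which the FIFO flip-flops still see a clock edge. This is a behavioural sketch, not the gating circuit used in the tutorial; treating "idle" as "no read or write in that cycle" and the function name `clocked_cycles` are assumptions made for illustration.

```python
def clocked_cycles(activity, gated):
    """Count cycles in which the FIFO flip-flops receive a clock edge.

    activity: sequence of booleans, True when the FIFO is read or written
              in that cycle (assumed idle condition: no read and no write).
    gated:    True if the clock is stopped during idle cycles.
    """
    return sum(1 for active in activity if active or not gated)


# Hypothetical activity trace: 20 busy cycles followed by 80 idle cycles.
activity = [True] * 20 + [False] * 80
free_running = clocked_cycles(activity, gated=False)
with_gating = clocked_cycles(activity, gated=True)
```

In this model the clock-buffer switching activity, and hence its contribution to internal power, scales with the number of clocked cycles, which is why gating helps most when the network is lightly loaded.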


Internal Power Minimization

[Figure: Netlist view of FIFO memory]


Network Energy Consumption in Mesh after Clock Gating in FIFO

Simulation runs for 2 lakh (200,000) clock cycles with a clock period of 5 ns.

We kept FIFO depth = 6 in the input channel and no FIFO at the output channel.


Comparison of Energy Consumption after Clock Gating in FIFO

We kept FIFO depth = 6 in the input channel and no FIFO at the output channel.


Network Area Comparison

Total Core Area = (32 × 2.5 × 2.5) sq. mm = 200 sq. mm

% SoC Area Overhead
BFT      Mesh
2.424    3.701
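The core-area arithmetic can be checked with a few lines of code. The slide does not state how the overhead percentage is normalised; the sketch below assumes it is the network area expressed as a percentage of the total core area, so the derived network areas are illustrative only.

```python
NUM_CORES = 32       # from the slide
CORE_SIDE_MM = 2.5   # each core is 2.5 mm x 2.5 mm (from the slide)


def total_core_area(num_cores=NUM_CORES, side_mm=CORE_SIDE_MM):
    """Total core area in square millimetres."""
    return num_cores * side_mm * side_mm


def network_area_from_overhead(overhead_percent, core_area_mm2):
    """Network area implied by an overhead percentage.

    Assumption (not stated in the slides): overhead is measured
    relative to the total core area.
    """
    return core_area_mm2 * overhead_percent / 100.0


core_area = total_core_area()                           # 200 sq. mm
bft_area = network_area_from_overhead(2.424, core_area)
mesh_area = network_area_from_overhead(3.701, core_area)
```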


Scalability Measurement

Scalability is the property of exhibiting performance proportional to the number of cores employed. As the size of a scalable system is increased, a corresponding increase in performance is obtained.

BW = (Throughput × Number of cores attached × Number of bits in a flit) / Clock period
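The bandwidth expression above translates directly into code. In the sketch below the throughput value (0.3 flits per cycle per core) and the 32-bit flit width are assumed example inputs, not measurements from the tutorial; the 32 cores and the 5 ns clock period match figures used elsewhere in the slides.

```python
def network_bandwidth(throughput, num_cores, flit_bits, clock_period_s):
    """BW = (throughput * cores * bits per flit) / clock period, in bits/s.

    throughput is taken to be in flits per cycle per core.
    """
    return throughput * num_cores * flit_bits / clock_period_s


# Assumed example inputs: 0.3 flits/cycle/core, 32 cores, 32-bit flits, 5 ns clock.
bw = network_bandwidth(0.3, 32, 32, 5e-9)
```

For a scalable network, doubling the number of cores at the same per-core throughput doubles BW; a drop in per-core throughput as cores are added shows up as sub-linear growth.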


Head-of-Line Blocking in Wormhole Router

[Figure: 2D mesh, no VCs, XY routing]


Introduction of Virtual Channels

• Multiple virtual channels are multiplexed on a single physical link to improve performance.
• Payload flits use the VC acquired by the header flit, while the tail flit releases the VC.

[Figure: Switch A and Switch B connected by a physical data link; VC 0 and VC 1 are multiplexed (MUX) at Switch A and demultiplexed (DEMUX) at Switch B, with VC control at each switch and a VC scheduler]

Reference: Dally, W. J. (1992) ‘Virtual Channel Flow Control’, IEEE Trans. on Parallel and Distributed Systems, Vol. 3, No. 2, pp. 194–205.
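The acquire/use/release rule for virtual channels can be sketched as a small bookkeeping class. This is a simplified illustration, not the router logic from the tutorial or from Dally's paper: the class name, the flit-type strings, and the first-free allocation policy are all assumptions, and buffering and scheduling are left out.

```python
class VirtualChannelLink:
    """Tracks which packet owns each virtual channel of one physical link."""

    def __init__(self, num_vcs):
        self.owner = [None] * num_vcs  # packet id holding each VC, or None

    def accept(self, packet_id, flit_type):
        """Return the VC index used by this flit, or None if a header must wait."""
        if flit_type == "header":
            # The header flit acquires the first free VC (assumed policy).
            for vc, owner in enumerate(self.owner):
                if owner is None:
                    self.owner[vc] = packet_id
                    return vc
            return None  # all VCs busy: the header is blocked
        # Payload and tail flits follow the VC acquired by their header;
        # this raises ValueError if the packet holds no VC.
        vc = self.owner.index(packet_id)
        if flit_type == "tail":
            self.owner[vc] = None  # the tail flit releases the VC
        return vc


link = VirtualChannelLink(num_vcs=2)
```

A header that finds every VC taken must wait, which is the situation in which head-of-line blocking can still occur despite virtual channels.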


Virtual Channels

[Figure: 2D mesh, 2 VCs (VC0, VC1), XY routing]

VC avoids HOL blocking.


Virtual Channels

[Figure: 2D mesh, 2 VCs (VC0, VC1), XY routing; no VCs available]

VC mitigates HOL blocking but cannot eliminate it.


Virtual Channel Based Router Architecture

[Figure: Virtual-channel router block diagram: physical channels, link control, DEMUX, input buffers, MUX, crossbar, and routing control and arbitration unit]


Virtual Channel Based Router Architecture

[Figure: Virtual-channel router block diagram: physical channels, link control, DEMUX, input buffers, MUX, crossbar, and routing control and arbitration unit]

Reference: N. Kavaldjiev, G. J. M. Smit, and P. G. Jansen, “A Virtual Channel Router for On-Chip Networks”, in Proc. of IEEE Int’l SOC Conference. IEEE Computer Society Press, pp. 289–293, 2004.


Determination of Number of Virtual Channels

- Up to 4 virtual channels, throughput increases, but beyond that it saturates.
- Energy dissipation increases with the number of virtual channels.
- For the energy-performance trade-off, 4 virtual channels with each physical channel is preferred.

Reference: Pande, P. P., Grecu, C., Jones, M., Ivanov, A. and Saleh, R. (2005) “Performance evaluation and design trade-offs for MP-SOC interconnect architectures”, IEEE Trans. on Computers, Vol. 54, No. 8, pp. 1025–1040.


Throughput Improvement in Mesh using Virtual Channel Architecture

No. of Virtual Channels = 4


Latency Improvement in Mesh using Virtual Channel Architecture

No. of Virtual Channels = 4


Energy Overhead in Mesh using Virtual Channel Architecture



Performance of Some Other Topologies


Reference: Pande, P. P., Grecu, C., Jones, M., Ivanov, A. and Saleh, R. (2005) “Performance evaluation and design trade-offs for MP-SOC interconnect architectures”, IEEE Trans. on Computers, Vol. 54, No. 8, pp.1025–1040.


Network Area Comparison with Virtual Channel Architecture

Total Core Area = (32 × 2.5 × 2.5) sq. mm = 200 sq. mm

% SoC Area Overhead
          Without VC    With VC
Mesh      3.701         6.145
BFT       2.424         3.507


Quality of Service (QoS) Support

• Conceptually, two disjoint networks
  – a network with throughput and latency guarantees (guaranteed throughput, GT)
  – a network without those guarantees (best-effort, BE)
• Several types of commitment in the network
  – combine guaranteed worst-case behavior
  – with good average resource usage

[Figure: Architectural Modification for Supporting QoS]

Reference: Rijpkema, E., Goossens, K., Radulescu, A., Dielssen, J., Meerbergen, J. V., Wielage, P., and Waterlander, E. (2003) “Trade-offs in the Design of a Router with Both Guaranteed and Best-Effort Services for Networks on Chip”,IEE Proc. Computers and Digital Techniques, Vol. 150, No. 5, pp. 294-302.

Lecture – 3

Application Mapping


Task of Application Mapping



Mapping Problem Formulation – Core Graph

• Directed graph G = (V, E)
• Each vertex v_i represents a core
• Each edge e_i,j ∈ E represents communication between v_i and v_j
• Weight of edge e_i,j is comm_i,j, the bandwidth requirement


Mapping Problem Formulation – NoC Topology Graph

• A directed graph P = (U, F)
• Each vertex u_i ∈ U is a router
• Each edge f_i,j ∈ F represents a direct communication between the vertices
• Weight of edge f_i,j, denoted by bw_i,j, represents the available bandwidth across the edge


Map Function

• map: V → U
• Each edge k of the core graph represents a commodity d_k
• Each commodity has a value vl(d_k) representing the bandwidth requirement of the communication from v_i to v_j
• Bandwidth constraint: an edge in the topology graph must have enough bandwidth to accommodate all commodities passing through it
• Minimize communication cost: Σ_k vl(d_k) × dist(source(d_k), dest(d_k))
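The communication cost of a candidate mapping can be evaluated in a few lines. The sketch assumes a 2D mesh in which dist() is the Manhattan hop count between the routers that the two cores are mapped to; the dictionary layout, the core names, and the bandwidth values are hypothetical.

```python
def hop_distance(a, b):
    """Manhattan hop count between two mesh routers given as (row, col)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def communication_cost(core_graph, mapping):
    """Sum over commodities of vl(d_k) * dist(source(d_k), dest(d_k)).

    core_graph: {(src_core, dst_core): bandwidth requirement}
    mapping:    {core: (row, col)} (the map function V -> U)
    """
    return sum(
        bw * hop_distance(mapping[src], mapping[dst])
        for (src, dst), bw in core_graph.items()
    )


# Hypothetical 4-core application on a 2x2 mesh.
core_graph = {("c0", "c1"): 100, ("c1", "c2"): 50, ("c0", "c3"): 20}
good = {"c0": (0, 0), "c1": (0, 1), "c2": (1, 1), "c3": (1, 0)}
poor = {"c0": (0, 0), "c1": (1, 1), "c2": (0, 1), "c3": (1, 0)}
```

The `poor` mapping places the heaviest-communicating pair two hops apart, so its cost is higher than that of `good`; a mapping algorithm searches for the assignment with the lowest such cost that also meets the bandwidth constraint.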


Mapping Solution


Mapping Algorithms

• Mapping problem is intractable
• Several approaches are possible: ILP, heuristics (PMAP, GMAP, PBB, NMAP, BMAP, etc.), meta-search heuristics (GA, PSO, Simulated Annealing)
• Other variants of the problem combine:
  – Task scheduling
  – Power consumption
  – Alternative routing paths, etc.


Mapping with Minimum-Path Routing (NMAP)

• Three phases: Initialize, Minimum path computation, Iterative improvement
• Initialize:
  1. The core with maximum communication demand is placed onto the node with the maximum number of neighbors
  2. Select the core that communicates most with the mapped cores
  3. Place the selected core onto the node that minimizes communication cost with the mapped ones
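The three initialization steps can be sketched as a greedy placement on a mesh. This is an illustrative reading of the steps above, not the published NMAP code: the function name, the tie-breaking (first candidate wins), and the use of bandwidth × hop count as the placement cost are assumptions.

```python
def nmap_initialize(comm, rows, cols):
    """Greedy initial placement. comm maps (src, dst) -> bandwidth.

    Returns a dict core -> (row, col) on a rows x cols mesh.
    """
    cores = sorted({core for edge in comm for core in edge})
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    if len(cores) > len(nodes):
        raise ValueError("more cores than mesh nodes")

    def traffic(a, b):
        return comm.get((a, b), 0) + comm.get((b, a), 0)

    def demand(core):
        return sum(traffic(core, other) for other in cores if other != core)

    def degree(node):
        r, c = node
        around = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
        return sum(1 for n in around if n in nodes)

    def hops(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    # Step 1: busiest core goes to the node with the most neighbors.
    mapping = {max(cores, key=demand): max(nodes, key=degree)}
    while len(mapping) < len(cores):
        # Step 2: pick the unmapped core that talks most to mapped cores.
        unmapped = [c for c in cores if c not in mapping]
        core = max(unmapped, key=lambda c: sum(traffic(c, m) for m in mapping))
        # Step 3: place it on the free node with the lowest communication cost.
        free = [n for n in nodes if n not in mapping.values()]
        mapping[core] = min(
            free,
            key=lambda n: sum(traffic(core, m) * hops(n, mapping[m]) for m in mapping),
        )
    return mapping


# Hypothetical star-shaped traffic pattern on a 3x3 mesh.
comm = {("a", "b"): 10, ("a", "c"): 8, ("c", "a"): 2, ("d", "a"): 5, ("a", "e"): 1}
initial = nmap_initialize(comm, rows=3, cols=3)
```

In this example the hub core `a` lands on the centre router and every other core ends up one hop away from it.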



Mapping with Minimum-Path Routing

• Shortest Path:
  – Minimum path routing
  – Commodities are sorted in descending order of flows
  – For each commodity, the shortest path is identified
  – As soon as a commodity path is finalized, the cost of each edge on the path is increased by the value of the commodity
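The shortest-path phase can be sketched with Dijkstra's algorithm over accumulated edge loads. This is a simplified illustration under stated assumptions: the topology is a 2D mesh, only moves towards the destination are allowed (so every path has the minimum hop count), all edge costs start at zero, and the names used are hypothetical.

```python
import heapq


def route_commodities(commodities, mapping):
    """Route commodities in descending order of flow on least-loaded minimal paths.

    commodities: {(src_core, dst_core): flow value}
    mapping:     {core: (row, col)}
    Returns (paths, load), where load maps a directed mesh edge to its cost.
    """
    load = {}
    paths = {}
    for (src, dst), value in sorted(commodities.items(), key=lambda kv: -kv[1]):
        start, goal = mapping[src], mapping[dst]
        heap = [(0, start, [start])]
        done = set()
        while heap:
            cost, node, path = heapq.heappop(heap)
            if node == goal:
                break
            if node in done:
                continue
            done.add(node)
            r, c = node
            steps = []
            if r != goal[0]:  # move one row towards the destination
                steps.append((r + (1 if goal[0] > r else -1), c))
            if c != goal[1]:  # move one column towards the destination
                steps.append((r, c + (1 if goal[1] > c else -1)))
            for nxt in steps:
                edge_cost = load.get((node, nxt), 0)
                heapq.heappush(heap, (cost + edge_cost, nxt, path + [nxt]))
        # Path finalized: raise the cost of every edge on it by the commodity value.
        for edge in zip(path, path[1:]):
            load[edge] = load.get(edge, 0) + value
        paths[(src, dst)] = path
    return paths, load


# Hypothetical placement on a 2x3 mesh with two commodities leaving core "a".
mapping = {"a": (0, 0), "b": (1, 1), "e": (1, 2)}
paths, load = route_commodities({("a", "b"): 10, ("a", "e"): 4}, mapping)
```

Because the larger commodity is routed first and loads its edges, the second commodity picks a different minimal path that avoids them.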



Mapping with Minimum-Path Routing

• Iterative Improvement:
  – Iteratively swap vertices pair-wise to obtain a better mapping
  – Traffic splitting:
    • Multiple shortest paths may exist
    • Formulate a multi-commodity flow problem to satisfy bandwidth requirements for solutions that have lower communication costs but do not satisfy all the bandwidth requirements
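The pair-wise swapping step can be sketched as a simple local search. This illustration covers only the swap loop, with bandwidth × Manhattan hop count as the cost; it does not re-run the routing phase or the multi-commodity flow formulation, and it only swaps cores that are already placed. All names are hypothetical.

```python
from itertools import combinations


def mapping_cost(comm, mapping):
    """Sum of bandwidth * Manhattan hop count over all communicating pairs."""
    return sum(
        bw * (abs(mapping[s][0] - mapping[d][0]) + abs(mapping[s][1] - mapping[d][1]))
        for (s, d), bw in comm.items()
    )


def improve_by_swaps(comm, mapping):
    """Swap pairs of cores while any swap lowers the cost."""
    best = dict(mapping)
    best_cost = mapping_cost(comm, best)
    improved = True
    while improved:
        improved = False
        for a, b in combinations(sorted(best), 2):
            trial = dict(best)
            trial[a], trial[b] = trial[b], trial[a]
            trial_cost = mapping_cost(comm, trial)
            if trial_cost < best_cost:
                best, best_cost, improved = trial, trial_cost, True
    return best, best_cost


# Hypothetical starting point: the heaviest pair (c0, c1) is two hops apart.
comm = {("c0", "c1"): 100, ("c1", "c2"): 50, ("c0", "c3"): 20}
start = {"c0": (0, 0), "c1": (1, 1), "c2": (0, 1), "c3": (1, 0)}
final, final_cost = improve_by_swaps(comm, start)
```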


Binomial Mapping Algorithm (BMAP)

• NMAP algorithm is O(N^4 log N)
• BMAP is a three-stage algorithm with complexity O(N^2 log N)
  – Binomial Merging Iteration
  – Topology Mapping
  – Hardware Cost Optimization


BMAP: Binomial Merging Iteration

1. Calculate IP Ranking: the rank of IP core i is
   ranking(i) = Σ_{j = 1 to N} (requirement(i, j) + requirement(j, i))
   where requirement(i, j) is the bandwidth requirement from i to j
2. Merge IP Set: based on ranking, merge two IP sets at a time (log N iterations)
3. Refreshing IP Set: the ranking is recalculated. The ranking of IP Set k, generated by merging IP Set i and IP Set j, is
   ranking(k) = ranking(i) + ranking(j) – requirement(i, j) – requirement(j, i)
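The two ranking formulas can be written out directly. The sketch follows the formulas as they appear on the slide; the requirement matrix is a made-up example, and the choice of which sets to merge at each iteration is not modelled.

```python
def ranking(req, i):
    """ranking(i) = sum over j of requirement(i, j) + requirement(j, i)."""
    return sum(req[i][j] + req[j][i] for j in range(len(req)))


def merged_ranking(req, i, j):
    """Ranking of the set formed by merging i and j, as given on the slide."""
    return ranking(req, i) + ranking(req, j) - req[i][j] - req[j][i]


# Hypothetical bandwidth requirements: req[i][j] is the demand from i to j.
req = [
    [0, 5, 1],
    [2, 0, 3],
    [4, 0, 0],
]
```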


Merging: An Example


BMAP: Topology Mapping and Traffic Surface Creation

• After mapping, a traffic surface is generated
• It shows the traffic load of each router
• Minimal path routing is used
• Based on this surface, hardware can be optimized by selecting proper routers from the library
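A traffic surface can be sketched as a per-router accumulation of the bandwidth that passes through it. The slide only says that minimal path routing is used; XY routing (column first, then row) is assumed here as one concrete minimal-path scheme, and the example traffic is hypothetical.

```python
def traffic_surface(comm, mapping, rows, cols):
    """Per-router traffic load on a rows x cols mesh under XY routing.

    comm:    {(src_core, dst_core): bandwidth}
    mapping: {core: (row, col)}
    """
    surface = [[0] * cols for _ in range(rows)]
    for (src, dst), bw in comm.items():
        r, c = mapping[src]
        goal_r, goal_c = mapping[dst]
        surface[r][c] += bw          # source router
        while c != goal_c:           # X dimension first
            c += 1 if goal_c > c else -1
            surface[r][c] += bw
        while r != goal_r:           # then Y dimension
            r += 1 if goal_r > r else -1
            surface[r][c] += bw
    return surface


# Hypothetical traffic on a 2x2 mesh.
surface = traffic_surface(
    {("a", "b"): 10, ("c", "a"): 4},
    {"a": (0, 0), "b": (1, 1), "c": (0, 1)},
    rows=2,
    cols=2,
)
```

Routers with a low load on this surface are candidates for cheaper router variants from the library, while heavily loaded ones need more buffering or bandwidth.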


BMAP: Hardware Cost Optimization

1. Dummy Router Elimination:
   – Dummy routers are added at the start to have 4^n routers
   – BMAP puts these routers at the boundaries, hence they can be eliminated
2. Router Selection:
   – Sharing a single buffer among low-bandwidth input channels
   – Choice of router is made from the library
3. Unfolding:
   – Add additional routers and links for larger bandwidth requirements


BMAP: Hardware Optimization - An Example


Network on Chip Synthesis: SUNMAP + xpipes


SUNMAP: Topology Mapping

• Optimizes for area, power, or delay within design constraints
• Uses heuristics to perform mapping onto topologies: mesh, torus, hypercube, Clos, and butterfly
• Built-in floor-planner for area and power analysis
• Choice of different routing functions


SUNMAP: Topology Mapping

Heuristic approach with several phases:

• Initial mapping using a greedy algorithm (from communication graph)
  – Compute optimal routing (using flow formulation)
    1. Floorplan solution
    2. Check area and bandwidth constraints
    3. Compute mapping cost
• Iterative improvement loop (Tabu search)
• Allows manual and interactive topology creation


System configuration

• Specifies
  – NIs (I/Os, clocks, buffers)
  – switches (I/Os, buffers)
  – links
  – routes

// In this topology: 8 cores, 8 memories, 4x4 torus
// ----------------------------- IP cores
// name, switch number, clock divider, buffers, type
core(core_0, switch_0, 1, 6, initiator);
core(mem_8, switch_11, 1, 6, target:0x00);
[…]
// ----------------------------- switches
// name, input ports, output ports, buffers
switch(switch_0, 5, 5, 6);
switch(switch_1, 5, 5, 6);
[…]
// ----------------------------- links
// name, source, destination
link(link0, switch_0, switch_1);
link(link1, switch_1, switch_0);
[…]
// ----------------------------- routes
// source, destination, hops
route(core_0, pm_8, switches:0,1,5,6,7,11);
route(core_1, pm_9, switches:1,5,9,8);
route(core_2, pm_10, switches:2,6,5,9);
route(core_3, pm_11, switches:3,2,6,10);
[…]


xpipes Compiler: Platform Generation

• Input:
  – System configuration: topology, routing tables, parameters (flit width, buffering, …)
  – Component library
• Creates a class template for each type of network component based upon component configuration (I/O ports, buffer sizing)
• Hierarchical instantiation of the platform in SystemC


Network-on-Chip Synthesis Tool: xpipes

[Figure: MPARM Architecture]

Reference: Bertozzi, D. and Benini, L. (2004) “xpipes: A Network-on-Chip Architecture for Giga Scale Systems-on-Chips”, IEEE Circuits and Systems Magazine, pp. 18-31.

Lecture – 5

Conclusion and Future of Network-on-Chip


Network-on-Chip: At a Glance

Topics Covered

Some More Topics

Need of Network-on-Chip

NoC Architecture Design

Impact of Higher Communication Layers in NoC Performance

Test and Verification of NoC


Design Trade-Off

Thermal Modeling of NoC

Metrics and Benchmarks for NoC.

Application Mapping on NoC

Signal Integrity and Reliability Issues

Floorplan-aware NoC architecture optimization

Fault Tolerant Architecture in NoC

CAD Tools for NoC


Limitation of 2D Network-on-Chip

The conventional 2D integrated circuit (IC) has limited floor-planning choices, and consequently, it limits the performance enhancements arising out of NoC architectures.

There is a need for more and more bandwidth, but not at the cost of increased power consumption.

Reference: Carloni, L. P., Pande, P. P., Yuan, X. (2009) “Networks-on-Chip in emerging interconnect paradigms: Advantages and Challenges”, ACM/IEEE Int’l Symp. on Networks-on-Chip, pp. 93-102.


NoC Research Groups in Foreign Universities

1. Prof. Luca Benini, University of Bologna, Italy.
2. Prof. Giovanni De Micheli, EPFL, Switzerland.
3. Prof. William J. Dally, Stanford University, USA.
4. Prof. Partha Pratim Pande, Washington State University, USA.
5. Prof. Radu Marculescu, Carnegie Mellon University, USA.
6. Prof. Bashir M. Al-Hashimi, University of Southampton, UK.
7. Prof. Chita R. Das, Pennsylvania State University, USA.
8. Prof. Niraj K. Jha, Princeton University, USA.
9. Prof. Sashi Kumar, Jonkoping University, Sweden.
10. Prof. Axel Jantsch, Royal Institute of Technology (KTH), Sweden.
11. Prof. Jari Nurmi, Tampere University of Technology, Finland.
12. Prof. Andre Ivanov, University of British Columbia, Canada.
13. Prof. Resve Saleh, University of British Columbia, Canada.
14. Prof. Israel Cidon, Technion-Israel Institute of Technology, Israel.
15. Dr. Davide Bertozzi, University of Bologna, Italy.
16. Dr. Srinivasan Murali, EPFL, Switzerland.

and many more …


NoC Research in Indian Universities

1. Prof. Santanu Chattopadhyay, Indian Institute of Technology, Kharagpur.
2. Prof. S. K. Nandy, Indian Institute of Science, Bangalore.
3. Prof. Bharadwaj Amruthur, Indian Institute of Science, Bangalore.
4. Prof. M. R. Bhujade, Indian Institute of Technology, Bombay.

Journals, Conference, and Workshop on NoC

• Microprocessors and Microsystems Journal, Elsevier (MICPRO)
• IEEE/ACM International Symposium on Networks-on-Chip
• IEEE Int’l Workshop on Network on Chip Architectures (NoCArc)


NoC Research in Industries

• Tilera Corporation
• Arteris Inc.
• Silistix Inc.
• NXP Semiconductor
• IBM Corporation (Cyclops-64/Blue Gene)
• Aethereal


Network-on-Chip Books



Bibliography

For a detailed, updated reference list, the audience is directed to the following link:
http://www.cl.cam.ac.uk/~rdm34/onChipNetBib/onChipNetwork.pdf

Below are some of our contributions to NoC research:

[1] S. Kundu and S. Chattopadhyay, “Interfacing Cores and Routers in Network-on-Chip Using GALS”, IEEE International Symposium on Integrated Circuits (ISIC), 2007.

[2] S. Kundu and S. Chattopadhyay, “Mesh-of-Tree Deterministic Routing for Network-on-Chip Architecture”, ACM Great Lake Symposium on VLSI (GLSVLSI), 2008.

[3] S. Kundu, R. P. Dasari, K. Manna, and S. Chattopadhyay, “Mesh-of-Tree based scalable Network-on-Chip Architecture”, IEEE Region 10 Colloquium and International Conference on Industrial and Information Systems (ICIIS), 2008.

[4] S. Kundu and S. Chattopadhyay, “Mesh-of-Tree based Network-on-Chip Architecture Using Virtual Channel based Router”, IEEE VLSI Design and Test Conference (VDAT), 2008.

[5] S. Kundu and S. Chattopadhyay, “Network-on-chip architecture design based on mesh-of-tree deterministic routing topology”, International Journal for High Performance Systems Architecture, Vol. 1, No. 3, pp. 163–182, Inderscience Publisher, 2008.

[6] S. Kundu, R. P. Dasari, K. Manna, and S. Chattopadhyay, “Performance Evaluation of Mesh-of-Tree Based Network-on-Chip Using Wormhole Router with Poisson Distributed Traffic”, IEEE VLSI Design and Test Conference (VDAT), 2009.

[7] S. Kundu, K. Manna, S. Gupta, K. Kumar, R. Parikh, and S. Chattopadhyay, “A Comparative Performance Evaluation Of Network-on-Chip Architectures Under Self-Similar Traffic”, IEEE International Conference on Advances in Recent Technologies in Communication and Computing (ARTCom), 2009.


Microprocessor Research Laboratory

Thank You