NOC: Networks on Chip
MPSoC: Multiprocessor System on Chip
EE8205: Embedded Computer Systems
http://www.ee.ryerson.ca/~courses/EE8205/
Dr. Gul N. Khan
http://www.ee.ryerson.ca/~gnkhan
Electrical and Computer Engineering
Ryerson University
Overview
• Introduction to SoC and MPSoC
• Networks on a Chip
• Bus-based and Point-to-point NoC Systems
• Regular and Application Specific NoC Topologies
• Routing and Switching Techniques
• NOC Topology Generation and Analysis
Introductory articles on MPSoC and NoC are available at the course webpage.
NOC and SOC Design 2
System on a Chip
Systems-on-Chip (SoC)
• Advances in chip design and integration.
• Incorporate multiple components on a single chip.
• MPSoC has addressed ever-increasing performance requirements.
Samsung S3C6400 Platform
Samsung S3C6410 Platform
S3C6410 System on Chip
• A 16/32-bit RISC low-power, high-performance microprocessor.
• Applications include mobile phones, portable navigation devices and other general applications.
• Provides optimized H/W performance for 2.5G and 3G communication services.
• Includes many powerful hardware accelerators for motion video processing, display control and scaling.
• An integrated Multi Format Codec (MFC) supports encoding and decoding of MPEG4/H.263 and H.264.
• Many hardware peripherals such as camera interface, TFT 24-bit LCD controller, power management, etc.
ARM 11 (v6) based SOC
S3C6410 based Mobile Processor
[Figure labels: Navigation System; iPhone; based on ARM1176JZ, S3C6410]
System on Chip Design Flow
• Specify:
  * What does the customer really want?
• Architect:
  * What is the most cost- and performance-effective architecture to implement it?
  * What existing components can we adapt and re-use?
• Evaluate:
  * What is the performance impact of a cheaper architecture?
• Implement:
  * What can we generate automatically from libraries and customization?
Use separate computation, communication and performance
System-on-Chip and NoC
System-on-Chip to Network-on-Chip
[Figure: SoC components: Analog Component, ADC/DAC, VGA Core, DSP, CPU, MPEG Core]
SoC Structure
NoC-based System on a Chip
[Figure: a tile of the chip contains a computational block (core, Instr $, Data $, Network Interface) and a switch (Switch Fabric, Control Logic, ports p0 to p4); a communication link carries control, data, spare and parity wires. Other tiles shown: Proc blocks with an L2 cache, and a tile attached to a bus.]
System on Chip Design Flow
[Figure: System Behavior (Behavior Simulation) and System Architecture feed Mapping (Performance Simulation), followed by Communication Refinement and Flow To Implementation; the steps are numbered 1 to 4]
System on Chip Design Flow
• Annotation of architectural timing and energy onto behavior
• Performance Simulation: behavior annotated with architectural effects
• Analyze / Visualize Results
SoC Application: Wireless LAN Physical Layer
Protocol Stack: Application, Network, MAC, OFDM Physical Layer / Digital BB (OFDM TX, OFDM RX)
• HiperLan/2: Multi-media Wireless Networks; High Rate: 10 Mb/sec; Low Power: 10-100 mW
• PicoRadio: Ad Hoc Networks; Low Rate: b/sec - kb/sec; Low Power: 100 μW
Dynamic Reconfiguration
Wireless LAN SoC
[Figure: candidate architectures combining ASIC, FPGA, micro-controller, bus or crossbar bus (f0, f1, f2) and an analog front end (AD, DA, PA); a digital modem with protocol, user interface, clock manager, sleep mode management and Bloc Turbo Codec]
Main Points
• Which micro-controller to use?
• Do we need more FPGAs?
• DSP or ASIC?
• Which MAC?
• Where will the MAC run?
• Which other applications can I add?
• Is the chip reusable?
• Is there too much memory?
Wireless LAN Physical Layer Design Flow
[Figure: flow from Application Specification through Algorithm Exploration, Functional Simulation and Refinement, Architecture Exploration (Performance Simulation) and Architecture Refinement to Implementation, with Functional Partitioning, Functional IP Reuse and Mapping steps. Description languages and tools named: English (UML, SystemC, …), SystemC or C (Matlab/Simulink, …), CoWare (…). Example: OFDM Physical Layer (OFDM TX, OFDM RX) and Higher Layers.]
Physical Layer SoC Architecture
[Figure: block diagram with FPGA (FFT, FIR, UART, BUFFER), FPGA config. mem., Int. bridge, Micro- (Datapath, I/D caches), Clock gen., SPS2 (instruction/data RAM), DPR2/SPS2 Bridge, XBAR and Processor bus with their interfaces, Jtag Interface; pins TEST(0..2), Ck, reset, CK1, CK2, MCK, VDD, VSS, Reset]
Multiple Processor/Core System-on-Chip
Inter-node communication between CPUs/cores can be performed by message passing or shared memory. The number of processors on the same die increases with each technology node (CMP and MPSoC).
• Memory sharing (SHARED BUS) will require:
  * Large multiplexers
  * Cache coherence techniques
  * Not scalable
• Message passing (NOC):
  * Scalable
  * Requires data transfer transactions
  * Has overhead of extra communication
NOC: Network-on-Chip
Shared bus is not a long-term solution
• It has poor scalability
On-chip micro-networks suit the demand of scalability and performance
[Figure: System Bus]
NOC and Off-Chip Networks
NOC
• Sensitive to cost: area and power
• Wires are relatively cheap
• Latency is critical
• Traffic is known a priori
• Design-time specialization
• Custom NoCs are possible
Off-Chip Networks
• Cost is in the links
• Latency is tolerable
• Traffic/applications unknown
• Changes at runtime
• Adherence to networking standards
On-Chip Communication Structures
On-Chip Bus Interconnection
• For highly connected multi-core systems: communication bottleneck
• For multi-master buses: arbitration will become a complex problem
• Power grows for each communication event, because every additional attached unit increases the capacitive load.
• A crossbar switch can overcome some of these problems and limitations of buses
• Crossbar is not scalable
SOC Communication Structures
Dedicated Point-to-Point
• Advantages
  * Optimal in terms of bandwidth, availability, latency and power usage
  * Simple to design and verify, as well as easier to model
• Disadvantages
  * Number of links may grow quadratically with the number of cores (up to n(n-1)/2 links for n fully connected cores)
  * Hardware area
  * Routing problems
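As a rough illustration of the link-count disadvantage, the sketch below counts the links needed when every pair of cores gets its own dedicated connection. The function name `p2p_links` is hypothetical, and the count assumes one bidirectional link per core pair; a real application-specific design would normally wire only the pairs that actually communicate.

```python
def p2p_links(n_cores: int) -> int:
    """Bidirectional links needed to connect every pair of cores directly."""
    return n_cores * (n_cores - 1) // 2


# Print the link count for a few illustrative core counts.
for n in (4, 16, 64):
    print(n, "cores ->", p2p_links(n), "links")
```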
SOC Communication Structures
Network on Chip
• Advantages
  * Structured architecture – lower complexity and cost of SOC design
  * Reuse of components, architectures, design methods and tools
  * Efficient and high-performance interconnect
  * Scalability of communication architecture
• Disadvantages
  * Internal network contention can add latency
  * Bus-oriented IPs need smart wrapping hardware
  * Software needs clear synchronization in multiprocessor systems
Networks-on-Chip
• Interconnect for SoCs, CMPs, MPSoCs and FPGAs
  * Multi-hop, packet-based communication
  * Efficient resource sharing
• Scalable communication infrastructure provides scalable performance/efficiency in
  * Power
  * Hardware area
  * Design productivity
NoC?
A chip-wide network: Processing Elements (PEs) are interconnected via a packet-based network in the NoC architecture.
[Figure: 4x4 grid of routers, each attached to one PE (PE 1 to PE 16); a message (MSG) is sent as a packetized message and delivered as a decoded message]
Network-on-Chip vs. Bus Interconnection
Network-on-Chip
• Total bandwidth grows
• Link speed unaffected
• Concurrent spatial reuse
• Pipelining is built-in
• Distributed arbitration
• Separate abstraction layers
However
• No performance guarantee
• Extra delay in routers
• Area and power overhead?
• Modules need NI
• Unfamiliar methodology
Bus interconnection is fairly simple and familiar
However
• Bandwidth is limited, shared
• Speed goes down as N grows
• No concurrency
• Pipelining is tough
• Central arbitration
• No layers of abstraction (communication and computation are coupled)
On-Chip Buses
• Ad hoc Buses: Traditional Data/Address Buses
• ARM AMBA Bus: Advanced Microcontroller Bus Architecture
• IBM CoreConnect Bus: CoreConnect Bus Architecture
AMBA On-Chip Bus
AMBA evolved from ARM’s internal bus development:
• ASB/AHB: Advanced System Bus / Advanced High-performance Bus, with support for pipelining, burst transfers and multiple bus masters
• APB: Advanced Peripheral Bus, on which all devices are slaves
• Bridge: A slave on ASB that connects it to APB
AMBA based Single Chip GPS Controller
■ Suitable for handheld and personal navigation systems
■ ARM7TDMI 16/32-bit RISC CPU based host
■ Complete embedded memory system: Flash 256 KB, RAM 64 KB
■ 12-channel GPS correlation DSP
■ 4 channels A/D
■ 4 serial communication interfaces
■ One serial peripheral interface (SPI)
■ Real-time clock module
■ 16-bit watchdog timer
IBM CoreConnect On-Chip Bus
CoreConnect is an SOC bus proposed by IBM having:
• PLB: Processor Local Bus, PLB Arbiter, PLB to OPB Bridge
• OPB: On-Chip Peripheral Bus, OPB Arbiter
• DCR: Device Control Register Bus and a Bridge
CoreConnect Advanced Features
IBM CoreConnect bus with 32-, 64- and 128-bit versions to support a variety of applications
• PLB: Fully synchronous, supports up to 8 masters
  - Separate read/write data buses
  - Burst transfers, variable and fixed-length; pipelining
  - DMA transfers; no on-chip tri-states required
  - Overlapped arbitration, programmable priority fairness
• OPB: Fully synchronous, 32-bit address and data buses
  - Supports 1-cycle data transfers between master and slaves
  - Arbitration for up to 4 OPB master peripherals
  - Bridge function can be master on PLB or OPB
• DCR: Provides fully synchronous movement of GPR data between CPU and slave logic
CoreConnect Bus based SoC
Comparing AMBA and CoreConnect SoC Buses
NoC: Buses to Networks
Original Bus Features
• One transaction at a time
• Central arbiter
• Limited bandwidth
• Synchronous
• Low cost
[Figure: Shared Bus to Segmented Bus]
Advanced Bus
Segmented Bus
• More general/versatile bus architecture
• Pipelining capability
• Burst transfer
• Split transactions
• Overlapped arbitration
• Transaction preemption, resumption and reordering
[Figure: Shared Bus to Segmented Bus]
Buses to Networks
• Architectural paradigm shift: Replace wire spaghetti by network
• Usage paradigm shift: Pack everything in packets
• Organizational paradigm shift:
  * Confiscate communications from logic designers
  * Create a new discipline, a new infrastructure responsibility
NoC Related Main Problems
Global interconnect design problems:
• Delay
• Power
• Noise
• Scalability
• Reliability
System integration: productivity problem
Chip Multi Processors: for power-efficient computing
NoC and Global Connections Delay
Long wiring delay is dominated by resistance
• Add repeaters
• Repeaters will become latches (with clock frequency scaling)
• Latches can become NoC routers
[Figure: a long wire segmented by NoC routers]
NoC Wiring Design
• NoC links:
  – Regular
  – Point-to-point (no fan-out tree problem)
  – Can use transmission-line layout
  – Well-defined current return path
• Can be optimized for noise / speed / power
  – Low swing, current mode, …
NoC Scalability
Compare the wire area for the same performance (n modules):
• Bus: $O(n^3 \sqrt{n})$
• Segmented Bus: $O(n^2 \sqrt{n})$
• NoC: $O(n)$
• Pt-to-Pt: $O(n^2 \sqrt{n})$
[Figure: layouts of the n modules with link length d]
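To get a feel for these growth rates, the sketch below evaluates how much each wire-area expression grows when the module count is quadrupled. It assumes the orders are O(n^3 √n) for the bus, O(n^2 √n) for the segmented bus and point-to-point, and O(n) for the NoC, as read from the slide; constant factors are dropped, so only the ratios are meaningful. The names `WIRE_AREA` and `growth_factor` are hypothetical.

```python
import math

# Growth expressions as read from the slide (constant factors dropped).
WIRE_AREA = {
    "Bus": lambda n: n**3 * math.sqrt(n),
    "Segmented Bus": lambda n: n**2 * math.sqrt(n),
    "Pt-to-Pt": lambda n: n**2 * math.sqrt(n),
    "NoC": lambda n: n,
}


def growth_factor(arch: str, n_small: int, n_large: int) -> float:
    """How much the wire-area expression grows from n_small to n_large modules."""
    area = WIRE_AREA[arch]
    return area(n_large) / area(n_small)


# Compare the growth of each architecture when going from 16 to 64 modules.
for arch in WIRE_AREA:
    print(arch, growth_factor(arch, 16, 64))
```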
NoC: a Platform
• System modules may use different clocks/voltages.
• NoC can take care of synchronization.
• NoC design may be asynchronous.
• No waste of power when links/routers are idle.
• It eliminates ad-hoc global wire engineering.
• It separates computation from communication.
• It supports modularity and reuse of cores.
NoC: a platform for system integration, testing and debugging
CMP and NoC
• Uniprocessors cannot provide power-efficient performance growth
  * Interconnect dominates dynamic power
  * Global wire delay doesn’t scale
  * Instruction-level parallelism is limited
• Power-efficiency requires many parallel local computations
  * Chip Multi Processors (CMP)
  * Thread-Level Parallelism (TLP)
Network is another choice for CMP
[Figure: uni-processor dynamic power split into Interconnect, Gate and Diff. components (Magen et al., SLIP 2004)]
[Figure: uni-processor performance vs. die area (or power), “Pollack’s rule” (F. Pollack, Micro 32, 1999)]
Network-on-Chip Topologies
Application Specific Irregular Topologies
Irregular NoC Topologies
• Based on the concept of using only what is necessary.
• Application-specific topologies.
• Eliminate unneeded resources and bandwidth from the system.
• Leads to reduced power and area use.
• Requires additional design work.
NOC Topology
Mesh
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
Physical implementation
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
NOC Torus Topology
Torus (logical topology, with wrap-around links)
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
Physical implementation (folded layout)
1 2 4 3
13 14 16 15
5 6 8 7
9 10 12 11
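The practical difference between the two regular topologies shows up in hop counts. The following sketch, under the assumption of the 4x4 row-major node numbering used in the figures (1 to 16), compares the minimal hop distance in a mesh with that in a torus, where wrap-around links connect opposite edges. The helper names are hypothetical.

```python
def coords(node: int, cols: int = 4):
    """Map a 1-based, row-major node number to (column, row)."""
    return (node - 1) % cols, (node - 1) // cols


def mesh_hops(a: int, b: int, cols: int = 4) -> int:
    """Minimal hop count in a mesh: Manhattan distance."""
    ax, ay = coords(a, cols)
    bx, by = coords(b, cols)
    return abs(ax - bx) + abs(ay - by)


def torus_hops(a: int, b: int, cols: int = 4, rows: int = 4) -> int:
    """Minimal hop count in a torus: each dimension may wrap around."""
    ax, ay = coords(a, cols)
    bx, by = coords(b, cols)
    dx, dy = abs(ax - bx), abs(ay - by)
    return min(dx, cols - dx) + min(dy, rows - dy)


# Corner-to-corner traffic benefits most from the wrap-around links.
print("mesh 1 -> 16:", mesh_hops(1, 16))
print("torus 1 -> 16:", torus_hops(1, 16))
```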
NOC Abstraction Layers: Network Modeling
• Software Layers
  * O/S, application
• Network and Transport Layers
  * Network topology: e.g. crossbar, ring, mesh, torus, fat tree, …
  * Switching: circuit / packet switching (VCT, wormhole)
  * Addressing: logical/physical, source/destination, flow, transaction
  * Routing: static/dynamic, distributed/source, deadlock avoidance
  * Quality of Service: e.g. guaranteed-throughput, best-effort
  * Congestion control, end-to-end flow control
• Data Link Layer
  * Flow control (handshake)
  * Handling of contention
  * Correction of transmission errors
• Physical Layer
  * Wires, drivers, receivers, repeaters, signaling, circuits, …
Definitions and Terminology
Switch: The component of the network that is in charge of flit routing.
Flit Latency: The time needed for a FLIT to reach its target PE from its source PE.
Packet Latency: The time needed for a PACKET to reach its target PE from its source PE.
Packet Spread: The time from the reception of the first flit of a packet to the reception of the last one.
Message Abstraction
[Figure: a Message is split into Packets (Header, Payload); a Packet is split into Flits: a head flit (Type, Dest., VC), body flits (Type, VC, Body) and a tail flit (Type, VC, Tail)]
Packet: An element of information that a processing element (PE) sends to another PE. A packet may consist of a variable number of flits.
Flit: The elementary unit of information exchanged in the communication network in a clock cycle.
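A minimal sketch of this packet-to-flit decomposition is given below. The `Flit` fields mirror the Type, Dest. and VC fields named in the figure; everything else (the 4-byte flit payload, the head flit carrying no data, the function name `packetize`) is an assumption made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Flit:
    kind: str                   # "head", "body" or "tail"
    vc: int                     # virtual channel identifier
    payload: bytes = b""        # data carried by body/tail flits
    dest: Optional[int] = None  # destination PE, present only in the head flit


def packetize(data: bytes, dest: int, vc: int = 0, flit_bytes: int = 4) -> List[Flit]:
    """Split one packet's payload into a head flit plus body/tail flits."""
    chunks = [data[i:i + flit_bytes] for i in range(0, len(data), flit_bytes)] or [b""]
    flits = [Flit("head", vc, dest=dest)]
    for index, chunk in enumerate(chunks):
        kind = "tail" if index == len(chunks) - 1 else "body"
        flits.append(Flit(kind, vc, chunk))
    return flits


# Print the flit types and payloads for a 10-byte example packet.
for flit in packetize(b"0123456789", dest=7):
    print(flit.kind, flit.payload)
```

The number of flits varies with the payload length, which matches the statement that a packet may consist of a variable number of flits.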
Switching Techniques
Circuit Switching
Packet Switching – Routing Protocols
• Store and Forward: Router cost is packet based. Packet size also affects latency and buffering requirements. Stalling happens at two nodes and the link between them.
• Wormhole: Router cost is based on the header. The header can affect latency, and buffering at the router is based on the header size. Stalling can happen at all the nodes and links spanned by the packet.
• Virtual Cut-through: Router cost depends on header and packet size. Stalling happens at the local node level.
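The latency difference between these schemes can be sketched with the usual zero-load model. The code below assumes one flit crosses one link per clock cycle and ignores router delay and contention; under those assumptions store-and-forward pays the full packet time on every hop, while wormhole and virtual cut-through pipeline the flits behind the header. The function names are hypothetical.

```python
def store_and_forward_cycles(hops: int, packet_flits: int) -> int:
    """Each hop receives the whole packet before forwarding it."""
    return hops * packet_flits


def wormhole_cycles(hops: int, packet_flits: int) -> int:
    """The header takes `hops` cycles; the remaining flits follow in a pipeline."""
    return hops + packet_flits - 1


# Illustrative case: 4 hops, 8-flit packet.
hops, flits = 4, 8
print("store and forward:", store_and_forward_cycles(hops, flits))
print("wormhole / cut-through:", wormhole_cycles(hops, flits))
```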
Relevant Parameters: Routing
• Minimum latency is of paramount importance in NOC/SOC (inter-process communication).
• Ideally: 1 clock latency per switch/router (flit enters at time t and exits at t+1)
• Maximum switch clock frequency (technology + routing logic limits)
• Deadlock free
• No flits are ever lost; once a flit is injected in the NOC, it must reach its destination, possibly after a long time.
Fixed Shortest Path Routing
• Suitable for regular topologies, e.g. Mesh, Torus, Tree, etc.
• X-Y routing (first x, then y direction)
• Simple router
• No deadlock scenario
• No retransmission
• No reordering of messages
• Power-efficient
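X-Y routing is simple enough to sketch in a few lines. The example below assumes a 2D mesh with nodes addressed by (x, y) coordinates; the function name `xy_route` and the coordinate convention are illustrative assumptions, not taken from the slides.

```python
from typing import List, Tuple

Coord = Tuple[int, int]


def xy_route(src: Coord, dst: Coord) -> List[Coord]:
    """Return the nodes visited when routing first along x, then along y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:              # correct the x coordinate first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:              # then correct the y coordinate
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))
```

Because every packet between a given source and destination follows the same fixed path, messages are not reordered, and a packet never turns from the y direction back into the x direction, which is what rules out cyclic dependencies in a mesh.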
Wormhole Routing
• In wormhole routing, a header flit “digs” the path and holds it.
• Successive flits are routed along the same path or direction.
• In case of blocking, a loss-less NoC needs:
  * Buffers
  * A back-pressure mechanism, if we don’t have infinite-size FIFOs…
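One common way to provide back pressure with finite FIFOs is credit-based flow control; the slides do not say which mechanism is intended, so the sketch below is only one possible choice. The class name `CreditLink` and its methods are hypothetical. The sender holds one credit per free downstream buffer slot and must stall when it has none, so no flit is ever dropped.

```python
from collections import deque


class CreditLink:
    """Sender-side credit counter paired with the downstream input FIFO."""

    def __init__(self, depth: int):
        self.credits = depth          # free slots in the downstream FIFO
        self.fifo = deque()           # downstream input buffer

    def try_send(self, flit) -> bool:
        """Send a flit if a credit is available; otherwise signal a stall."""
        if self.credits == 0:
            return False              # back pressure: the sender must wait
        self.credits -= 1
        self.fifo.append(flit)
        return True

    def drain(self):
        """Downstream router forwards one flit and returns a credit."""
        flit = self.fifo.popleft()
        self.credits += 1
        return flit


link = CreditLink(depth=2)
print([link.try_send(f) for f in ("HF", "F2", "F3")])
```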
Wormhole
[Figure: animation frames of a five-flit packet (head flit HF, body flits F2 to F4, tail flit TF) advancing hop by hop from Src to Dest along the path opened by the head flit]
Deflection Routing
Hot Potato – Deadlock Free Routing
• Every flit can be routed to different directions (no packet notion at the switch level)
• If the optimal direction is blocked, the flit is “deflected” to another direction
• Switch latency of 1 clock cycle whatever the level of congestion
• Minimum buffer requirements
Hot-Potato Routing
• Packets reordering
• Adaptive routing
• No buffering
• No back pressure
• Works with Torus/Mesh
Wormhole Routing
• No packets reordering
• Static routing
• Buffering (≥ 2 flits/port)
• Back pressure
• XY routing needs mesh
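The port choice made by a deflection router can be sketched as follows. This is a simplified illustration under several assumptions: a 2D mesh with (x, y) coordinates where y grows towards north, port names "N", "S", "E", "W", at least one free output port, and a flit that has not yet reached its destination (ejection is not modelled). The function names are hypothetical.

```python
from typing import List, Set, Tuple

Coord = Tuple[int, int]


def productive_ports(cur: Coord, dst: Coord) -> List[str]:
    """Ports that move the flit closer to its destination."""
    ports = []
    if dst[0] > cur[0]:
        ports.append("E")
    elif dst[0] < cur[0]:
        ports.append("W")
    if dst[1] > cur[1]:
        ports.append("N")
    elif dst[1] < cur[1]:
        ports.append("S")
    return ports


def pick_port(cur: Coord, dst: Coord, free: Set[str]) -> str:
    """Take a productive port if one is free; otherwise deflect."""
    for port in productive_ports(cur, dst):
        if port in free:
            return port
    return sorted(free)[0]            # deflection: any free port, never buffer


print(pick_port((0, 0), (2, 0), {"W", "S"}))
```

Since the flit always leaves on some port in the same cycle, the router needs no flit buffering and no back pressure, at the price of possible out-of-order arrival.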
Hot-Potato
[Figure: animation frames of a packet (HF, F2, F3, TF) routed from Src to Dest with deflection; the flits take different paths and arrive at Dest out of order]
Network-on-Chip
Core to Network Connection
NOC Switch/Router
Generic Router/Switch
Another Generic Router with Virtual Channels
[Figure: Input 0 (from West), Input 1 (from North) … Input N (from PE), each demultiplexed by VCID into virtual-channel buffers VC0 … VC(V-1); Routing Logic, VC Allocator (VA) and Switch Allocator (SA) with scheduling; multiplexers feeding a full crossbar (5x5); signals Flit_in, Credit_out, Credit_in, Output VC Resv_State]
A Typical Router Pipeline
FLIT IN → ROUTING & BUFFERS → VC ALLOCATION → ARBITRATION → SWITCH TRAVERSAL → FLIT OUT
VC: Virtual-Channels
CAD Problems for NOC
• Application Mapping (map tasks to cores)
• Floorplanning/Placement (within the network)
• Routing (of messages)
• Buffer Sizing (size of FIFO queues in the routers)
• Timing Closure (link bandwidth capacity allocation)
• Simulation (network simulation for traffic, delay, power modeling)
• Testing
… combined with the problems of designing the NOC itself (topology synthesis, switching, virtual channels, arbitration, flow control, …)
Topology Generation and Analysis
• Aim:
  * Generate a viable network topology.
  * Analyze the generated topology.
• Targeted Network:
  * Best-effort, wormhole switched.
  * Lookup table based source routing.
  * No virtual channel support.
  * Round Robin switch output arbitration.
  * One NI per component master or slave interface.
  * All transactions converted to packets of the same length (flit count).
  * Burst beats converted to separate packets.
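Round-robin arbitration at a switch output, as named in the targeted network above, can be sketched as below. The class `RoundRobinArbiter` and its interface are hypothetical; the sketch only shows the rotating-priority idea, in which the input granted most recently gets the lowest priority in the next arbitration round.

```python
from typing import List, Optional


class RoundRobinArbiter:
    """Grants one requesting input per round, rotating the priority."""

    def __init__(self, n_inputs: int):
        self.n = n_inputs
        self.last = n_inputs - 1      # so that input 0 is checked first

    def grant(self, requests: List[bool]) -> Optional[int]:
        """Return the index of the granted input, or None if nobody requests."""
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if requests[candidate]:
                self.last = candidate
                return candidate
        return None


arbiter = RoundRobinArbiter(3)
print([arbiter.grant([True, True, False]) for _ in range(3)])
```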
System Input and Output
• Input:
  * Core Graph
  * Network Parameters
• Output:
  * Topology Graph
  * Route tables
  * Recommended Operating Clock Frequency
Topology Generation
• Aims:
  * Provide physical links.
  * Minimize latency on select paths.
  * Use a minimum of resources.
• Two algorithms are used.
  * ALG1: Point-to-Point Oriented Topologies.
  * ALG2: Partitioned Crossbar Topologies.
• Heuristic approach.
Point-to-Point Oriented Topologies
Partitioned Crossbar Topologies
• Initial topology: Fully-Connected Crossbar (single switch).
• Ideal latency situation.
• May violate maximum port requirement.
• Partitioning process.
Topology Analysis
• Aim:
  * Estimate achievable performance.
  * Account for interference in the system.
• Use of Petri Nets.
• Partitioned analysis.
  * Analyze components in isolation.
  * Sum contention effects across paths.
• Two Stages:
  * Frequency selection.
  * Path verification.
Verification Process
• Verify all path latencies.
  * Write packet latency.
  * Read packet latency.
• Adjust delays based on contention.
• Contention Areas:
  * Switch output.
  * Destination NI.
Contention Estimation
Frequency Selection
• Cyclical relation between contention and frequency.
• Frequency is fixed before contention is analyzed.
• To find minimum valid frequency:
  * Interval halving process.
  * Large number of frequency points.
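The interval halving search can be sketched as a bisection over the clock frequency. The sketch assumes that validity is monotonic (if a frequency satisfies all path latency constraints, every higher frequency does too) and that `is_valid` stands in for the full contention analysis and path verification at a fixed frequency; both the function name `min_valid_frequency` and this interface are hypothetical.

```python
from typing import Callable, Optional


def min_valid_frequency(is_valid: Callable[[float], bool],
                        low: float, high: float,
                        tolerance: float = 1.0) -> Optional[float]:
    """Bisect [low, high] for the lowest frequency accepted by is_valid."""
    if not is_valid(high):
        return None                   # no valid frequency in the range
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if is_valid(mid):
            high = mid                # mid works: search lower frequencies
        else:
            low = mid                 # mid fails: search higher frequencies
    return high


# Stand-in check: pretend the constraints are met from 500 MHz upwards.
print(min_valid_frequency(lambda f_mhz: f_mhz >= 500.0, 0.0, 4000.0, 0.1))
```

Each halving step requires one full analysis run, so the number of runs grows only logarithmically with the number of candidate frequency points.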
Simulation Environment
• SystemC based.
• Collection of models:
  * Generators and Sinks.
  * Master and Slave NIs.
  * Various Switches.
• AMBA AXI protocol implemented.
Results
• Applications and generated topologies.
• Comparative results.
• Resource usage.
• Accuracy tests.
MPEG4 Decoder
Clock Frequency: 3.43 GHz
[Figure: generated topologies A) and B)]
MWD Application
Clock Frequency: 573.4 MHz
[Figure: generated topologies A) and B)]
AV Benchmark
AV Topologies
Clock Frequency: 2.31 GHz
[Figure: generated topologies A) and B)]
Comparative Results I
Comparative Results II
Resource Usage
Topology          Mesh   Fat Tree   Custom 1   Custom 2
MPEG4 Decoder     46     44         22         14
MWD Application   59     47         13         17
AV Benchmark      87     67         25
Accuracy Test Results I
Accuracy Test Results II