pavel-gicevski
8/10/2019 Multicore Processors for Mobile Platforms - Future Systems on a Chip
Table of Contents
1 Project Overview
  Abstract
2 Project Objectives
3 Understanding System-on-a-chip (SoC)
  3.1 System-on-a-chip - basics
  3.2 System-on-a-Chip - Structure
  What's inside of a SoC?
4 Briefly about Multi-core processors
  4.1 Multithreading, Hyper-Threading, or Multi-Core?
5 On-Chip Network to a Multi-core System
  5.1 Abstract
  5.2 Topology
  5.3 Routing
  5.4 Flow Control
  5.5 Router Micro-architecture
6 Project Methodologies, Results, and Achievements
  6.1 Usage of Smartphones - Survey (Q & A)
  6.2 Analysis on the survey results
  6.3 Discrete event simulations
  6.4 Example of DES in real life
  6.5 Components of a discrete-event simulation
  6.6 Network Simulators as DES
  6.7 Network Simulations with OPNET
7 Implementing the project in OPNET
  7.1 Adding Traffic
  7.2 Network On-Chip realizations in OPNET
  7.3 Results Comparison
8 Conclusions
9 References
1 Project Overview
Abstract
Since smartphones and tablets are basically smaller computers, they require much the
same components we see in desktops and laptops in order to offer all the things
they can do (apps, music and video playback, 3D gaming support, advanced wireless features,
etc.).
But smartphones and tablets do not offer the same amount of internal space as desktops and
laptops for the various components needed, such as the logic board, the processor, the RAM,
the graphics card, and others. That means these internal parts need to be as small as possible,
so that device manufacturers can use the remaining space to fit the device with a long-lasting
battery.
In recent years, due to the continuous development of silicon technology, it has become
possible to implement complex electronic systems in a single integrated circuit. Systems-on-chips
(SoCs) are small, powerful multi-core systems that are being implemented in a vast
number of ways across the booming electronics market, primarily in small mobile devices. But
this raises a question: what architecture should be proposed to solve this problem, or
in other words, how can one tiny computer be designed so that smartphones can rise
to the level of PCs?
The architectural complexity of these SoCs requires new design methodologies and the
development of a seamless design flow that integrates existing and emerging tools. The
continuous progress of micro and nano technologies has led to growing integration and higher
clock frequencies in electronic systems. These combined effects have led to an increase
both in power density and in energy dissipation, which must consequently be managed, above all
in portable systems. Design and technology issues relating to power efficiency are crucial, in
particular power-optimized cell libraries, clock gating and clock tree optimization, and
dynamic power management. That's why this project discusses options for different low-power,
faster, and cheaper design techniques for systems-on-chips, at the design level and at the
architectural level.
2 Project Objectives
This project intends to contribute to solutions for the growing industrial need to design a reliable
network-on-a-chip with efficient multi-core processors for mobile platforms as future systems
on a chip. In particular, it intends to provide theory and practical examples of possible
designs, which may contribute to future SoC development.
System-on-a-chip design doesn't only require new design methodologies and the development
of a design flow from already known elements and tools. This leads to one of the main objectives of
this project: what are the needs of the users of this growing technology? This question gives rise to
many possible scenarios, and clearly suggests that not all PC components and
capabilities should simply be copied into our handheld devices. That's why this project is based on
previous research about smartphone users and how they view their devices (as a calling
device, a gaming platform, or a business tool).
Based on SoC structure, multi-core system architectures, and the research we have done, this
project will offer an on-chip network that brings high performance, smart use of space, and
efficient network interconnections between the components of the chip, which in turn means a
long-lasting battery.
3 Understanding System-on-a-chip (SoC)
3.1 System-on-a-chip - basics
System-on-a-chip (SoC) technology is the packaging of all the necessary electronic circuits and
parts for a "system" (such as a cell phone or digital camera) on a single integrated circuit (IC),
generally known as a microchip. For example, a system-on-a-chip for a sound-detecting device
might include an audio receiver, an analog-to-digital converter (ADC), a microprocessor,
necessary memory, and the input/output logic control for a user, all on a single microchip.
System-on-a-chip technology is used in small, increasingly complex consumer electronic
devices. Some such devices have more processing power and memory than a typical 10-year-old
desktop computer. In the future, SoC-equipped nanorobots (robots of microscopic
dimensions) might act as programmable antibodies to fend off previously incurable diseases.
SoC video devices might be embedded in the brains of blind people, allowing them to see; SoC
audio devices might allow deaf people to hear. Handheld computers with small whip antennas
might someday be capable of browsing the Internet at megabit-per-second speeds from any
point on the surface of the earth.
SoC is evolving along with other technologies such as silicon-on-insulator (SOI), which can
provide increased clock speed while reducing the power consumed by a microchip.
Image 1: How far has the technology come?
All the necessary components of a desktop computer packed into a chip a couple of centimeters long. (Image owned by WIKIPEDIA.COM)
3.2 System-on-a-Chip - Structure
What's inside of a SoC?
Now that we know what a SoC is, let's take a quick look at the components that can be found
inside it. Mind you, not all of the following parts are built into every SoC that we're going
to show you later on, but in order to better understand how a SoC works, you should have a
general picture of what goes inside it:
- CPU: the central processing unit. Whether it is single-core or multi-core, this is what
  makes everything possible on your smartphone. Most processors found inside the SoCs
  that we're going to look at are based on ARM technology, but more on that later.
- Memory: just like in a computer, memory is required to perform the various tasks
  smartphones and tablets are capable of, and therefore SoCs come with various memory
  architectures on board.
- GPU: the graphics processing unit is also an important component of the SoC, and it is
  responsible for handling complex 3D games on smartphones and tablets. As you
  can expect, there are various GPU architectures available, and we're going to
  detail them further in what follows.
- Northbridge: a component that handles communication between the CPU and
  other components of the SoC, including the southbridge.
- Southbridge: a second chipset, usually found in computers, that handles various I/O
  functions. In some cases the southbridge can be found on the SoC.
- Cellular radios: some SoCs also come with certain modems on board that are needed
  by mobile operators. Such is the case with the Snapdragon S4 from Qualcomm, which
  has an embedded LTE modem on board responsible for 4G LTE connectivity.
- Other radios: some SoCs may also have components responsible for other types
  of connectivity, including Wi-Fi, GPS/GLONASS, or Bluetooth. Again, the S4 is a good
  example in this regard.
- Timing sources, including oscillators and phase-locked loops.
- External interfaces, including industry standards such as USB, FireWire, Ethernet, USART, and SPI.
- Peripherals, including counter-timers, real-time timers, and power-on reset generators.
- Analog interfaces, including ADCs and DACs.
- Voltage regulators and power management circuits.
- Other circuitry.
Image 2: Simplified look at the layout of Samsung's Exynos 5 Dual. The CPU and GPU are
there, but they're just small pieces of the larger puzzle. (Image owned by http://www.intechopen.com)
4 Briefly about Multi-core processors
A multi-core processor is a single computing component with two or more independent
actual central processing units (called "cores"), which are the units that read and
execute program instructions. The instructions are ordinary CPU instructions such as add, move
data, and branch, but the multiple cores can run multiple instructions at the same time,
increasing overall speed for programs amenable to parallel computing. Manufacturers typically
integrate the cores onto a single integrated circuit die (known as a chip multiprocessor or CMP),
or onto multiple dies in a single chip package.
Processors were originally developed with only one core. Multi-core processors were
developed in the early 2000s by Intel, AMD, and others. They may have two cores (dual-core,
e.g. AMD Phenom II X2, Intel Core Duo), four cores (quad-core, e.g. AMD Phenom II X4, Intel's
quad-core processors, see i5 and i7 at Intel Core), six cores (e.g. AMD Phenom II X6, Intel Core i7
Extreme Edition 980X), eight cores (e.g. Intel Xeon E7-2820, AMD FX-8350), ten cores (e.g. Intel Xeon
E7-2850), or more.
A multi-core processor implements multiprocessing in a single physical package. Designers may
couple cores in a multi-core device tightly or loosely. For example, cores may or may not
share caches, and they may implement message passing or shared memory inter-core
communication methods. Common network topologies to interconnect cores include bus, ring,
two-dimensional mesh, and crossbar.
Homogeneous multi-core systems include only identical cores, while
heterogeneous multi-core systems have cores that are not identical. Just as with single-processor
systems, cores in multi-core systems may implement architectures such
as superscalar, VLIW, vector processing, SIMD, or multithreading.
Multi-core processors are widely used across many application domains, including general-purpose,
embedded, network, digital signal processing (DSP), and graphics.
The improvement in performance gained by the use of a multi-core processor depends very
much on the software algorithms used and their implementation. In particular, possible gains
are limited by the fraction of the software that can be run in parallel simultaneously on multiple
cores; this effect is described by Amdahl's law. In the best case, so-called embarrassingly
parallel problems may realize speedup factors near the number of cores, or even more if the
problem is split up enough to fit within each core's cache(s), avoiding use of much slower main
system memory. Most applications, however, are not accelerated so much unless programmers
invest a prohibitive amount of effort in re-factoring the whole problem. The parallelization of
software is a significant ongoing topic of research.
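As a minimal sketch of Amdahl's law mentioned above: if a fraction p of a program can run in parallel on n cores, the best achievable speedup is 1 / ((1 - p) + p / n). The function name and the 90% example figure below are illustrative assumptions, not values from this project.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when `parallel_fraction` of the work scales over `cores`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is not accelerated; only the parallel part is divided among cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Assumed example: 90% of the program is parallelizable.
    for cores in (2, 4, 8):
        print(cores, "cores ->", round(amdahl_speedup(0.9, cores), 2))
```

Even with 90% of the work parallelized, the serial remainder keeps the speedup well below the core count, which is why most applications do not scale linearly with cores.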
4.1 Multithreading, Hyper-Threading, or Multi-Core?
Programs are made up of execution threads. These threads are sequences of related
instructions. In the early days of the PC, most programs consisted of a single thread. The
operating systems in those days were capable of running only one such program at a time. The
result was, as some of us painfully recall, that your PC would freeze while it printed a document
or a spreadsheet. The system was incapable of doing two things simultaneously. Innovations in
the operating system introduced multitasking, in which one program could be briefly suspended
and another one run. By quickly swapping programs in and out in this manner, the system gave
the appearance of running the programs simultaneously.
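The swapping described above can be illustrated with a toy round-robin scheduler. This is only a sketch under simplifying assumptions: each "program" is a Python generator, every `yield` stands for a point where it may be suspended, and the names `program` and `run_round_robin` are hypothetical. A real operating system preempts programs with timer interrupts rather than cooperative yields.

```python
from collections import deque


def program(name, steps):
    """A toy 'program': each yield marks a point where it can be suspended."""
    for i in range(1, steps + 1):
        yield f"{name}:{i}"


def run_round_robin(programs):
    """Run one step of each program in turn until all of them have finished."""
    ready = deque(programs)
    trace = []
    while ready:
        current = ready.popleft()
        try:
            trace.append(next(current))  # run the program for one time slice
        except StopIteration:
            continue  # program finished; do not reschedule it
        ready.append(current)  # suspend it and move it to the back of the queue
    return trace


if __name__ == "__main__":
    print(run_round_robin([program("print", 2), program("edit", 3)]))
```

Only one program runs at any instant, yet the interleaved trace makes both appear to progress together.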
By the early 2000s, processor design had gained additional execution resources
(such as logic dedicated to floating-point and integer math) to support executing multiple
instructions in parallel. Intel saw an opportunity in these extra facilities. The company reasoned
it could make better use of these resources by employing them to execute two separate
threads simultaneously on the same processor core. Intel named this simultaneous processing
Hyper-Threading Technology and released it on the Intel Xeon processors in 2002. According to
Intel benchmarks, applications that were written using multiple threads could see
improvements of up to 30% by running on processors with HT Technology. More important,
however, two programs could now run simultaneously on a processor without having to be
swapped in and out (see Fig. 4.1). To induce the operating system to recognize one processor
as two possible execution pipelines, the new chips were made to appear as two logical
processors to the operating system.
Fig. 4.1 HT Technology enables two threads to execute simultaneously on a single processor core
The performance boost of HT Technology was limited by the availability of shared resources to
the two executing threads. As a result, HT Technology cannot approach the processing
throughput of two distinct processors because of the contention for these shared resources. To
achieve greater performance gains on a single chip, a processor would require two separate
cores, such that each thread would have its own complete set of execution resources. Enter
multi-core.
Multi-Core Processors
Multi-core processors, as the name implies, contain two or more distinct cores in the same
physical package. Fig. 4.2 shows how this appears in relation to previous technologies.
Fig. 4.2 Multi-Core processors have multiple execution cores on a single chip
In this design, each core has its own execution pipeline, and each core has the resources
required to run without blocking resources needed by the other software threads.
While the example in Fig. 4.2 shows a two-core design, there is no inherent limitation on the
number of cores that can be placed on a single chip. Intel committed to shipping dual-core
processors in 2005, with additional cores to follow. Mainframe processors already
use more than two cores, so there is precedent for this kind of development.
The multi-core design enables two or more cores to run at somewhat slower speeds and at
much lower temperatures. The combined throughput of these cores delivers processing power
greater than the maximum available today on single-core processors, and at a much lower level
of power consumption. In this way, Intel increases the capabilities of server platforms as
predicted by Moore's Law, while the technology no longer pushes the outer limits of physical
constraints.
5 On-Chip Network to a Multi-core System
5.1 Abstract
An on-chip network architecture can be defined by four parameters: its topology, routing
algorithm, flow control protocol, and router micro-architecture.
Throughout this section, we will discuss how different choices of the above four parameters
affect the overall cost-performance of an on-chip network. Clearly, the cost-performance of an
on-chip network depends on the requirements faced by its designers. Latency is a key
requirement in many on-chip network designs, where network latency refers to the delay
experienced by messages as they travel from source to destination. Most on-chip networks
must also ensure high throughput, where network throughput is the peak bandwidth the
network can handle.
Another metric that is particularly critical in on-chip network design is network power, which
approximately correlates with the activity in the network as well as its complexity.
5.2 Topology
The effect of a topology on overall network cost-performance is profound. A topology
determines the number of hops (or routers) a message must traverse as well as the
interconnect lengths between hops, thus influencing network latency significantly.
As traversing routers and links consumes energy, a topology's effect on hop count also directly
affects network energy consumption. As for its effect on throughput, since a topology dictates
the total number of alternate paths between nodes, it affects how well the network can spread
out traffic and thus the effective bandwidth a network can support. Network reliability is also
greatly influenced by the topology, as it dictates the number of alternative paths for routing
around faults. The implementation complexity cost of a topology depends on two factors: the
number of links at each node (node degree) and the ease of laying out a topology on a chip
(wire lengths and the number of metal layers required). Figure 2.1 shows three commonly used
on-chip topologies. For the same number of nodes, and assuming uniform random traffic where
every node has an equal probability of sending to every other node, a ring (Fig. 2.1.a) will lead
to a higher hop count than a mesh (Fig. 2.1.b) or a torus [11] (Fig. 2.1.c). For instance, in the
figure shown, assuming bidirectional links and shortest-path routing, the maximum hop count
of the ring is 4, that of the mesh is also 4, while the torus improves it to 2. A ring topology also
offers fewer alternate paths between nodes than a mesh or torus, and thus saturates at a lower
network throughput for most traffic patterns. For instance, a message between nodes A and B
in the ring topology can only traverse one of two paths, but in a 3x3 mesh topology
there are six possible paths. As for network reliability, among these three networks, a torus
offers the most tolerance to faults because it has the highest number of alternative paths
between nodes.
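The hop counts quoted above can be checked with a short breadth-first search. The sketch below assumes nine nodes per topology (a 9-node ring, a 3x3 mesh, and a 3x3 torus) with bidirectional links; the helper names are illustrative and not taken from the project.

```python
from collections import deque
from itertools import product


def ring(n):
    """Adjacency list of a bidirectional ring with n nodes."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def grid(k, wrap):
    """Adjacency list of a k x k mesh (wrap=False) or torus (wrap=True)."""
    adj = {}
    for x, y in product(range(k), repeat=2):
        nbrs = set()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if wrap:
                nx, ny = nx % k, ny % k
            if 0 <= nx < k and 0 <= ny < k:
                nbrs.add((nx, ny))
        adj[(x, y)] = sorted(nbrs)
    return adj


def bfs(adj, src):
    """Return shortest hop counts and the number of shortest paths from src."""
    hops, paths = {src: 0}, {src: 1}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                paths[nxt] = 0
                queue.append(nxt)
            if hops[nxt] == hops[node] + 1:
                paths[nxt] += paths[node]
    return hops, paths


def max_hops(adj):
    """Largest shortest-path hop count over all node pairs (network diameter)."""
    return max(max(bfs(adj, s)[0].values()) for s in adj)


if __name__ == "__main__":
    print("ring :", max_hops(ring(9)))
    print("mesh :", max_hops(grid(3, wrap=False)))
    print("torus:", max_hops(grid(3, wrap=True)))
    print("corner-to-corner mesh paths:", bfs(grid(3, wrap=False), (0, 0))[1][(2, 2)])
```

Under these assumptions the search reproduces the maximum hop counts of 4, 4, and 2, and finds six shortest paths between opposite corners of the 3x3 mesh.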
Figure 2.1 Common on-chip network topologies: (a) ring, (b) mesh, and (c) torus
While rings have poorer performance (latency, throughput, energy, and reliability) when
compared to higher-dimensional networks, they have lower implementation overhead. A ring
has a node degree of 2 while a mesh or torus has a node degree of 4, where node degree refers
to the number of links in and out of a node. A higher node degree requires more links and
higher port counts at routers. All three topologies featured are two-dimensional planar
topologies that map readily to a single metal layer, with a layout similar to that shown in the
figure, except for the torus, which should be arranged physically in a folded manner to equalize wire
lengths (see Fig. 2.2) instead of employing long wrap-around links between edge nodes. A
torus illustrates the importance of considering implementation details when comparing alternative
topologies. While a torus has a lower hop count (which leads to lower delay and energy)
compared to a mesh, wire lengths in a folded torus are twice those in a mesh of the same size, so
per-hop latency and energy are actually higher. Furthermore, a torus requires twice the number
of links, which must be factored into the wiring budget. If the available wiring bisection
bandwidth is fixed, a torus will be restricted to narrower links than a mesh, thus lowering per-link
bandwidth and increasing transmission delay. Determining the best topology for an on-chip
network subject to the physical and technology constraints is an area of active research.
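One way to read the "twice the number of links" remark is in terms of the links that cross the bisection of the chip. The sketch below counts those links for a k x k mesh and torus and divides an assumed wire budget among them. The function names and the budget of 1024 wires are hypothetical, chosen only to illustrate the trade-off.

```python
def links(k, wrap):
    """Set of bidirectional links in a k x k mesh (wrap=False) or torus (wrap=True), k >= 3."""
    edges = set()
    for x in range(k):
        for y in range(k):
            for dx, dy in ((1, 0), (0, 1)):
                nx, ny = x + dx, y + dy
                if wrap:
                    nx, ny = nx % k, ny % k
                if nx < k and ny < k:
                    edges.add(frozenset(((x, y), (nx, ny))))
    return edges


def bisection_links(k, wrap):
    """Count links whose endpoints lie on opposite sides of the vertical mid-cut."""
    half = k // 2
    return sum(1 for e in links(k, wrap) if len({x < half for x, _ in e}) == 2)


def link_width(wire_budget, k, wrap):
    """Wires available per link if the bisection wire budget is shared equally."""
    return wire_budget // bisection_links(k, wrap)


if __name__ == "__main__":
    budget = 1024  # assumed number of wires allowed across the bisection
    for name, wrap in (("mesh", False), ("torus", True)):
        print(name, bisection_links(8, wrap), "links,", link_width(budget, 8, wrap), "wires each")
```

For an 8x8 network the torus has twice as many links crossing the cut as the mesh, so with the same wire budget each torus link gets half the width.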
Figure 2.2 Layout of an 8x8 folded torus
5.3 Routing
The goal of the routing algorithm is to distribute traffic evenly among the paths supplied by the
network topology, so as to avoid hotspots and minimize contention, thus improving network
latency and throughput. In addition, the routing algorithm is the critical component in fault
tolerance: once faults are identified, the routing algorithm must be able to skirt the faulty
nodes and links without substantially affecting network performance. All of these performance
goals must be achieved while adhering to tight constraints on implementation complexity:
routing circuitry can stretch critical path delay and add to a router's area footprint. While
energy overhead of routing circuitry is typically low, the specific route chosen affects hop count
directly and thus substantially affects energy consumption.
While numerous routing algorithms have been proposed, the most commonly used routing
algorithm in on-chip networks is dimension-ordered routing (DOR) due to its simplicity. With
DOR, a message traverses the network dimension by dimension, reaching the coordinate
matching its destination before switching to the next dimension.
In a two-dimensional topology such as the mesh in Fig. 2.3, dimension-ordered routing, say XY
routing, sends packets along the X-dimension first, followed by the Y-dimension. A packet
traveling from (0,0) to (2,3) will first traverse two hops along the X-dimension, arriving at (2,0),
before traversing three hops along the Y-dimension to its destination. Dimension-ordered
routing is an example of a deterministic routing algorithm, in which all messages from node A
to B will always traverse the same path. Another class of routing algorithms is oblivious ones,
where messages traverse different paths from A to B, but the path is selected without regard
to the actual network situation at transmission time. For instance, a router could randomly
choose among alternative paths prior to sending a message. Figure 2.3 shows an example
where messages from (0,0) to (2,3) can be randomly sent along either the YX route or the XY
route. A more sophisticated routing algorithm can be adaptive, in which the path a message
takes from A to B depends on the network traffic situation. For instance, a message travelling
along the XY route can see congestion at the east outgoing link of (1,0) and instead choose to
take the north outgoing link towards the destination (see Fig. 2.3).
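The deterministic and oblivious schemes can be sketched in a few lines of Python. This is a minimal illustration, assuming nodes are addressed by (x, y) coordinates in a mesh; the function names are hypothetical and not taken from any router implementation.

```python
import random

def xy_route(src, dst):
    """Dimension-ordered (XY) routing: finish the X dimension, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

def yx_route(src, dst):
    """YX routing: the same algorithm with the dimension order swapped."""
    swapped = xy_route((src[1], src[0]), (dst[1], dst[0]))
    return [(x, y) for (y, x) in swapped]

def oblivious_route(src, dst, rng):
    """Pick XY or YX at random, without looking at the network load."""
    return rng.choice([xy_route, yx_route])(src, dst)

print(xy_route((0, 0), (2, 3)))
print(yx_route((0, 0), (2, 3)))
print(oblivious_route((0, 0), (2, 3), random.Random(1)))
```

The XY call visits (1,0) and (2,0) before moving north, matching the two X hops followed by three Y hops described above. An adaptive router would additionally need a congestion signal as input, which is left out of this sketch.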
Fig. 2.3 DOR illustrates an XY route from (0,0) to (2,3) in a mesh, while Oblivious shows two
alternative routes (XY and YX) between the same source-destination pair that can be chosen
obliviously prior to message transmission. Adaptive shows a possible adaptive route that
branches away from the XY route if congestion is encountered at (1,0)
In selecting or designing a routing algorithm, not only must its effect on delay, energy,
throughput, and reliability be taken into account, most applications also require the network to
guarantee deadlock freedom. A deadlock occurs when a cycle exists among the paths of
multiple messages. Figure 2.4 shows four gridlocked (and deadlocked) messages waiting for
links that are currently held by other messages and prevented from making forward progress:
The packet entering router A from the South input port is waiting to leave through the East
output port, but another packet is holding onto that exact link while waiting at router B to leave
via the South output port, which is again held by another packet that is waiting at router C to
leave via the West output port and so on.
Fig. 2.4 A classic network deadlock where four packets cannot make forward progress as they
are waiting for links that other packets are holding on to
Deadlock freedom can be ensured in the routing algorithm by preventing cycles among the
routes generated by the algorithm, or in the flow control protocol by preventing router buffers
from being acquired and held in a cyclic manner. Using the routing algorithms we discussed
above as examples, dimension-ordered routing is deadlock-free since in XY routing, there
cannot be a turn from a Y link to an X link, so cycles cannot occur. Two of the four turns in Fig.
2.4 will not be permitted, so a cycle is not possible. The oblivious algorithm that randomly
chooses between XY or YX routes is not deadlock-free because all four turns from Fig. 2.4 are
possible leading to potential cycles in the link acquisition graph. Likewise, the adaptive route
shown in Fig. 2.3 is a superset of oblivious routing and is subject to potential deadlock. A
network that uses a deadlock-prone routing algorithm requires a flow control protocol that
ensures deadlock freedom.
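The turn argument above can be written as a small check. The sketch below is a simplification made for illustration: it only looks at the two elementary four-turn cycles of a two-dimensional mesh (clockwise and counterclockwise) and asks whether a routing algorithm permits every turn of either one. It is not a full channel-dependency analysis, and the names are hypothetical.

```python
X_DIRS = {"E", "W"}
Y_DIRS = {"N", "S"}

# A turn is (direction of travel before the turn, direction after it).
CLOCKWISE = [("N", "E"), ("E", "S"), ("S", "W"), ("W", "N")]
COUNTERCLOCKWISE = [("N", "W"), ("W", "S"), ("S", "E"), ("E", "N")]

def xy_allows(turn):
    """XY routing never turns from a Y link back onto an X link."""
    before, after = turn
    return not (before in Y_DIRS and after in X_DIRS)

def yx_allows(turn):
    """YX routing never turns from an X link back onto a Y link."""
    before, after = turn
    return not (before in X_DIRS and after in Y_DIRS)

def xy_or_yx_allows(turn):
    """Oblivious choice between XY and YX: a turn is legal if either allows it."""
    return xy_allows(turn) or yx_allows(turn)

def blocked_turns(allows, cycle):
    """Turns of the given cycle that the routing algorithm forbids."""
    return [t for t in cycle if not allows(t)]

def cycle_possible(allows):
    """A cycle can form only if all four turns of one rotation are allowed."""
    return any(all(allows(t) for t in cycle)
               for cycle in (CLOCKWISE, COUNTERCLOCKWISE))

print(blocked_turns(xy_allows, CLOCKWISE))
print(cycle_possible(xy_allows), cycle_possible(xy_or_yx_allows))
```

Under XY routing two of the four clockwise turns are blocked, so the cycle cannot close, while the oblivious XY-or-YX choice permits all four.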
Routing algorithms can be implemented in several ways. First, the route can be embedded in
the packet header at the source, known as source routing. For instance, the XY route in Fig. 2.3
can be encoded as , while the YX route can be encoded as . At each hop, the router will read the
left-most direction off the route header, send the packet towards the specified outgoing link,
and strip off the portion of the header corresponding to the current hop. Alternatively, the
message can encode the coordinates of the destination, and comparators at each router
determine whether to accept or forward the message. Simple routing algorithms are typically
implemented as combinational circuits within
the router due to the low overhead, while more sophisticated algorithms are realized using
routing tables at each hop which store the outgoing link a message should take to reach a
particular destination. Adaptive routing algorithms need mechanisms to track network
congestion levels, and update the route. Route adjustments can be implemented by modifying
the header, employing combinational circuitry that accepts as input these congestion signals, or
updating entries in a routing table. Many congestion-sensitive mechanisms have been
proposed, with the simplest being tapping into information that is already captured and used
by the flow control protocol, such as buffer occupancy or credits.
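Source routing, as described above, amounts to consuming a list of directions one hop at a time. The Python sketch below illustrates the idea; the single-letter direction labels and the header layout are assumptions made for this example and are not necessarily the encoding used in the original figure.

```python
# Hypothetical direction labels mapped to (dx, dy) steps in a mesh.
MOVES = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}

def forward(position, header):
    """One router hop: read the left-most direction, move, strip it off."""
    direction, rest = header[0], header[1:]
    dx, dy = MOVES[direction]
    return (position[0] + dx, position[1] + dy), rest

def deliver(src, header):
    """Follow the header hop by hop until it is empty; return the visited nodes."""
    position, visited = src, [src]
    while header:
        position, header = forward(position, header)
        visited.append(position)
    return visited

xy_header = ["E", "E", "N", "N", "N"]   # assumed encoding of an XY route
yx_header = ["N", "N", "N", "E", "E"]   # assumed encoding of a YX route
print(deliver((0, 0), xy_header))
print(deliver((0, 0), yx_header))
```

Both headers end at (2,3). The router needs no routing table in this scheme, at the cost of carrying the route in every packet.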
5.4 Flow Control
Flow control governs the allocation of network buffers and links. It determines when buffers
and links are assigned to which messages, the granularity at which they are allocated, and how
these resources are shared among the many messages using the network. A good flow control
protocol lowers the latency experienced by messages at low loads by not imposing high
overhead in resource allocation, and pushes network throughput through enabling effective
sharing of buffers and links across messages. In determining the rate at which packets access
buffers (or skip buffer access altogether) and traverse links, flow control is instrumental in
determining network energy and power. Flow control also critically affects network quality of
service since it determines the arrival and departure time of packets at each hop. The
implementation complexity of a flow control protocol includes the complexity of the router
micro-architecture as well as the wiring overhead imposed in communicating resource
information between routers.
In store-and-forward flow control, each node waits until an entire packet has been received
before forwarding any part of the packet to the next node. As a result, long delays are incurred
at each hop, which makes this scheme unsuitable for on-chip networks that are usually delay-critical.
To reduce the delay packets experience at each hop, virtual cut-through flow control allows
transmission of a packet to begin before the entire packet is received. Latency experienced by a
packet is thus drastically reduced, as shown in Fig. 2.5. However, bandwidth and storage are
still allocated in packet-sized units. Packets still only move forward if there is enough storage to
hold the entire packet. On-chip networks with tight area and power constraints find it difficult
to accommodate the large buffers needed to support virtual cut-through (assuming large
packets).
Like virtual cut-through flow control, wormhole flow control cuts through flits, allowing flits to
move on to the next router as soon as there is sufficient buffering for this flit. However, unlike
store-and-forward and virtual cut-through flow control, wormhole flow control allocates
storage and bandwidth to flits that are smaller than a packet. This allows relatively small flit-
buffers to be used in each router, even for large packet sizes. While wormhole flow control uses
buffers effectively, it makes inefficient use of link bandwidth. Though it allocates storage and
bandwidth in flit-sized units, a link is held for the duration of a packet's lifetime in the router.
As a result, when a packet is blocked, all of the physical links held by that packet are left idle.
Throughput suffers because other packets queued behind the blocked packet are unable to use
the idle physical links.
Fig. 2.5 Timing for (a) store-and-forward and (b) cut-through flow control at low loads, where
tr refers to the delay routing the head flit through each router, ts refers to the serialization delay
transmitting the remaining flits of the packet through each router, and tw refers to the time
involved in propagating bits across the wires between adjacent routers. Wormhole and virtual-
channel flow control have the same timing as cut-through flow control at low loads.
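The difference between the two timing diagrams can be expressed as a simple zero-load latency model. The Python sketch below is a simplified illustration: it assumes no contention, treats the caption's tr, ts, and tw as the parameters t_r, t_s, and t_w, and uses made-up cycle counts.

```python
def store_and_forward_latency(hops, t_r, t_s, t_w):
    """Each hop waits for the whole packet: routing, serialization, and wire delay."""
    return hops * (t_r + t_s + t_w)

def cut_through_latency(hops, t_r, t_s, t_w):
    """Only the head flit pays routing and wire delay per hop; serialization is paid once."""
    return hops * (t_r + t_w) + t_s

# Assumed example values, in cycles.
hops, t_r, t_s, t_w = 4, 2, 8, 1
print(store_and_forward_latency(hops, t_r, t_s, t_w))  # 4 * (2 + 8 + 1) = 44
print(cut_through_latency(hops, t_r, t_s, t_w))        # 4 * (2 + 1) + 8 = 20
```

The gap grows with both hop count and packet length, because store-and-forward multiplies the serialization delay by the number of hops.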
Virtual-channel flow control improves upon the link utilization of wormhole flow control,
allowing blocked packets to be passed by other packets. A virtual channel (VC) consists merely
of a separate flit queue in the router; multiple VCs share the physical wires (physical link)
between two routers. Virtual channels arbitrate for physical link bandwidth on a flit-by-flit
basis. When a packet holding a virtual channel becomes blocked, other packets can still
traverse the physical link through other virtual channels. Thus, VCs increase the utilization of
the critical physical links and extend overall network throughput. Current on-chip network
designs overwhelmingly adopt wormhole flow control for its small area and power footprint,
and use virtual channels to extend the bandwidth where needed. Virtual channels are also
widely used to break deadlocks, both within the network, and for handling system level
deadlocks.
Figure 2.6 illustrates how two virtual channels can be used to break a cyclic deadlock in the
network when the routing protocol permits a cycle.
Fig. 2.6 Two virtual channels (denoted by solid and dashed lines) and their associated separate
buffer queues (denoted as two circles at each router) used to break the cyclic route deadlock in
Fig. 2.4
Since each VC is time-multiplexed onto the physical link cycle-by-cycle, holding onto a VC
implies holding on to its associated buffer queue rather than locking down a physical link. By
enforcing an order on VCs, so that lower-priority VCs cannot request and wait for
higher-priority VCs, there cannot be a cycle in resource usage. At the system level, messages
that can potentially block on each other can be assigned to different message classes that are
mapped onto different virtual channels within the network, such as request and
acknowledgment messages of coherence protocols. Implementation complexity of virtual
channel routers will be discussed in detail next in Sect. 5.5 on router micro-architecture.
Buffer backpressure:
Unlike broadband networks, on-chip networks are typically not designed to tolerate dropping
of packets in response to congestion. Instead, buffer backpressure mechanisms stall flits from
being transmitted when buffer space is not available at the next hop. Two commonly used
buffer backpressure mechanisms are credits and on/off signaling. Credits keep track of the
number of buffers available at the next hop, by sending a credit to the previous hop when a flit
leaves and vacates a buffer, and incrementing the credit count at the previous hop upon
receiving the credit. On/off signaling involves a signal between adjacent routers that is turned
off to stop the router at the previous hop from transmitting flits when the number of free
buffers drops below a threshold, with the threshold set to ensure that all in-flight flits will have
buffers upon arrival.
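The credit mechanism can be illustrated with a small Python model of one link. This is a sketch under simplifying assumptions: credits return instantly (a real router has a credit round-trip delay), and the class and method names are invented for this example.

```python
class CreditedLink:
    """Credit-based backpressure between an upstream and a downstream router."""

    def __init__(self, buffers):
        self.credits = buffers   # free downstream buffers, as seen upstream
        self.downstream = []     # flits currently held in downstream buffers

    def try_send(self, flit):
        """Upstream side: send only if a credit is available, otherwise stall."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.downstream.append(flit)
        return True

    def drain_one(self):
        """Downstream side: a flit leaves and vacates a buffer, returning a credit."""
        flit = self.downstream.pop(0)
        self.credits += 1
        return flit

link = CreditedLink(buffers=2)
sent = [link.try_send(f) for f in ("f0", "f1", "f2")]
print(sent)                  # the third flit is stalled, not dropped
print(link.drain_one())      # f0 leaves, one credit comes back
print(link.try_send("f2"))   # the stalled flit can now be sent
```

No flit is ever dropped: the sender simply waits until a credit returns, which is the behaviour the paragraph above contrasts with broadband networks.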
5.5 Router Micro-architecture
How a router is built determines its critical path delay, per-hop delay, and overall network
latency. Router micro-architecture also affects network energy as it determines the circuit
components in a router and their activity. The implementation of the routing and flow control
and the actual router pipeline will affect the efficiency at which buffers and links are used and
thus overall network throughput. In terms of reliability, faults in the router datapath will lead
to errors in the transmitted message, while errors in the control circuitry can lead to lost and
mis-routed messages. The area footprint of the router is clearly highly determined by the
chosen router micro-architecture and underlying circuits.
Figure 2.7 shows the micro-architecture of a state-of-the-art credit-based virtual
channel (VC) router to explain how typical routers work. The example assumes a
two-dimensional mesh, so that the router has five input and output ports corresponding
to the four neighboring directions and the local processing element (PE) port.
The major components which constitute the router are the input buffers, route computation
logic, virtual channel allocator, switch allocator, and the crossbar switch. Most on-chip
network routers are input-buffered, in which packets are stored in buffers only at the input
ports.
Fig. 2.7 Credit-based virtual channel router micro-architecture
6 Project Methodologies, Results, and Achievements
The purpose of this section is to summarize the developments that took place within this
project and put them in a larger scientific and technological context. The principles underlying
this approach and all the methods used in this project are derived from previously conducted
research.
As already explained at the beginning of this project, our purpose is to concentrate on the
customers' needs. To accomplish that, a survey was conducted over the internet, so we could gain
another point of view that may differ from ours, or in other words, to find out what
smartphones are mostly used for.
Later on, according to the results of the survey and based on the SoC structure and
multi-core system architectures, this project will offer an on-chip network that brings high
performance, efficient use of space, and efficient network interconnections between the
components of the chip, which means a long-lasting battery.
All testing of the proposed design will be done on a discrete event simulator, which is
described later in this section.
6.1 Usage of Smartphones Survey (Q & A)
The survey took place from 4 March 2014 to 15 March 2014 and was created on the free
survey website www.freeonlinesurveys.com. It was then shared over social media and was
answered by 43 people. The questions and results of the survey are presented in the following
sections.
6.2 Analysis on the survey results
From the results we can see that all generations have some representatives, but the survey
was mainly answered by people aged 15 to 22 (59.5 %). Also, most of the respondents are
users of the Android operating system (69 % vs. 19 % Apple). From the questions on how
much time the users spend daily on their smartphones (1-4 hours: 59.6 %) and how long their
smartphone batteries last (10-12 hours: 41.9 %), it can be concluded that the battery is
probably one of the biggest problems of today's smartphones, which makes it our priority in
the new architecture plan.
Internet connectivity is also a main factor, so we must arrive at a design that supports both
Wi-Fi and 3G/4G connections.
From the ranking of items offered to the respondents, we can clearly see what most of them
use their phones for. It looks like a new era for this technology, since 40.48 % said that their
first usage priority is social media (Facebook, Twitter, Foursquare, etc.), while 38.10 % stated
that their first priority is phone calls rather than the other items in the list.
Another thing we can learn from this question is that people also use their phones largely for
Internet calls (Viber, Skype, etc.), Instagram (as a social network and picture editor), Internet
surfing, music players, etc. Clearly, many people also use their phones as cameras and for
playing games. From the last question of the survey we can see that almost half of the
respondents (41.9 %) prefer playing games with HD graphics.
What did we learn from the collected results?
Light and heavy data transfer should be offered in our plan
Internet browsing
HD Graphic Card
Conference calling
HD editing
Quick, low-delay database
Medium and Fast data links
Cellular support
From the results it can be concluded that processors with different performance levels would
be needed. For example, it is not the same if two processors execute a picture-editing program
or if one dedicated graphics processor executes the code.
Also, if we want the battery to last longer, it would be useful to use only regular processors
for easy tasks (e.g. light database transfer).
We can now conclude that it would be better to design a heterogeneous multi-core system
that includes non-identical processors. With that in mind, the interconnections of the network
and all its components become of great importance.
To present the planned design, the next section first presents the simulator on which the tests
will be done, followed by the architecture itself.
6.3 Discrete event simulations
In the field of simulation, a discrete-event simulation (DES) models the operation of
a system as a discrete sequence of events in time. Each event occurs at a particular instant in
time and marks a change of state in the system. Between consecutive events, no change in the
system is assumed to occur; thus the simulation can directly jump in time from one event to the
next.
This contrasts with continuous simulation, in which the simulation continuously tracks the
system dynamics over time. Instead of being event-based, this is called an activity-based
simulation; time is broken up into small time slices and the system state is updated according to
the set of activities happening in the time slice. Because discrete-event simulations do not have
to simulate every time slice, they can typically run much faster than the corresponding
continuous simulation.
Another alternative to event-based simulation is process-based simulation. In this approach,
each activity in a system corresponds to a separate process, where a process is typically
simulated by a thread in the simulation program. In this case, the discrete events, which are
generated by threads, would cause other threads to sleep, wake, and update the system state.
A more recent method is the three-phased approach to discrete event simulation (Pidd, 1998).
In this approach, the first phase is to jump to the next chronological event. The second phase is
to execute all events that unconditionally occur at that time (these are called B-events). The
third phase is to execute all events that conditionally occur at that time (these are called
C-events). The three-phase approach is a refinement of the event-based approach in which
simultaneous events are ordered so as to make the most efficient use of computer resources.
The three-phase approach is used by a number of commercial simulation software packages,
but from the user's point of view, the specifics of the underlying simulation method are
generally hidden.
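The three phases can be sketched as a short loop. The Python code below is an illustrative reading of the description above, not the implementation of any particular simulation package; the teller model, the fixed service time of 2.0 time units, and all names are assumptions made for the example.

```python
import heapq
from itertools import count

_seq = count()  # tie-breaker so the heap never has to compare action functions

def schedule(agenda, time, action):
    """Add a bound (B) event to the agenda."""
    heapq.heappush(agenda, (time, next(_seq), action))

def run_three_phase(agenda, c_events, state):
    clock = 0.0
    while agenda:
        clock = agenda[0][0]                      # A phase: jump to next event time
        while agenda and agenda[0][0] == clock:   # B phase: all events due now
            _, _, action = heapq.heappop(agenda)
            action(state, clock)
        fired = True                              # C phase: retry until none fires
        while fired:
            fired = False
            for condition, action in c_events:
                if condition(state):
                    action(state, clock)
                    fired = True
    return clock

# Hypothetical single-teller model used to exercise the loop.
agenda = []
state = {"waiting": 0, "teller_idle": True, "service_starts": []}

def arrival(state, now):
    state["waiting"] += 1

def end_service(state, now):
    state["teller_idle"] = True

def can_start(state):
    return state["teller_idle"] and state["waiting"] > 0

def start_service(state, now):
    state["waiting"] -= 1
    state["teller_idle"] = False
    state["service_starts"].append(now)
    schedule(agenda, now + 2.0, end_service)      # assumed fixed service time

schedule(agenda, 1.0, arrival)
schedule(agenda, 1.0, arrival)
final_clock = run_three_phase(agenda, [(can_start, start_service)], state)
print(state["service_starts"], final_clock)
```

Two customers arrive at the same instant. The B phase registers both arrivals before the C phase decides who starts service, which is the ordering of simultaneous events that the three-phase approach is meant to provide.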
6.4 Example of DES in real life
A common exercise in learning how to build discrete-event simulations is to model a queue,
such as customers arriving at a bank to be served by a teller. In this example, the system
entities are Customer-queue and Tellers. The system events are Customer-Arrival and
Customer-Departure. (The event of Teller-Begins-Service can be part of the logic of
the arrival and departure events.) The system states, which are changed by these events,
are Number-of-Customers-in-the-Queue (an integer from 0 to n) and Teller-Status (busy or
idle). The random variables that need to be characterized to model this system stochastically
are Customer-Interarrival-Time and Teller-Service-Time. An agent-based
framework for performance modeling of an optimistic parallel discrete event simulator is
another example of a discrete event simulation.
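The bank example maps directly onto a few lines of Python. The sketch below is a minimal illustration using only the standard library; it assumes a single teller and exponentially distributed interarrival and service times, neither of which is stated in the text, and the parameter values are arbitrary.

```python
import heapq
import random

def simulate_bank(n_customers, mean_interarrival, mean_service, seed=0):
    """Single-teller queue; returns the waiting time of each customer."""
    rng = random.Random(seed)
    events = []                      # pending events: (time, order, kind)
    order = 0
    t = 0.0
    for _ in range(n_customers):     # Customer-Interarrival-Time
        t += rng.expovariate(1.0 / mean_interarrival)
        heapq.heappush(events, (t, order, "arrival"))
        order += 1

    queue = []                       # arrival times of waiting customers
    teller_busy = False              # Teller-Status
    waits = []
    while events:
        clock, _, kind = heapq.heappop(events)   # clock jumps to the next event
        if kind == "arrival":
            queue.append(clock)
        else:                                    # "departure"
            teller_busy = False
        # Teller-Begins-Service is folded into both event handlers.
        if not teller_busy and queue:
            waits.append(clock - queue.pop(0))
            teller_busy = True
            service = rng.expovariate(1.0 / mean_service)  # Teller-Service-Time
            heapq.heappush(events, (clock + service, order, "departure"))
            order += 1
    return waits

waits = simulate_bank(n_customers=50, mean_interarrival=2.0, mean_service=1.5, seed=1)
print(f"mean wait: {sum(waits) / len(waits):.2f}")
```

The length of `queue` plays the role of Number-of-Customers-in-the-Queue, and the simulation clock only ever takes the values of event times.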
6.5 Components of a discrete-event simulation
State
A system state is a set of variables that captures the salient properties of the system to be
studied. The state trajectory over time S(t) can be mathematically represented by a step
function whose values change in correspondence with discrete events.
Clock
The simulation must keep track of the current simulation time, in whatever measurement units
are suitable for the system being modeled. In discrete-event simulations, as opposed to
real-time simulations, time hops because events are instantaneous: the clock skips to the next
event start time as the simulation proceeds.
Events list
The simulation maintains at least one list of simulation events. This is sometimes called
the pending event set because it lists events that are pending as a result of previously simulated
events but have yet to be simulated themselves. An event is described by the time at which it
occurs and a type, indicating the code that will be used to simulate that event. It is common for
the event code to be parametrized, in which case, the event description also contains
parameters to the event code.
When events are instantaneous, activities that extend over time are modeled as sequences of
events. Some simulation frameworks allow the time of an event to be specified as an interval,
giving the start time and the end time of each event.
Random-number generators
The simulation needs to generate random variables of various kinds, depending on the system
model. This is accomplished by one or more pseudo-random number generators. The use of
pseudo-random numbers as opposed to true random numbers is a benefit should a simulation
need a rerun with exactly the same behavior.
One of the problems with the random number distributions used in discrete-event simulation is
that the steady-state distributions of event times may not be known in advance. As a result, the
initial set of events placed into the pending event set will not have arrival times representative
of the steady-state distribution. This problem is typically solved by bootstrapping the simulation
model. Only a limited effort is made to assign realistic times to the initial set of pending events.
These events, however, schedule additional events, and with time, the distribution of event
times approaches its steady state. This is called bootstrapping the simulation model. In
gathering statistics from the running model, it is important to either disregard events that occur
before the steady state is reached or to run the simulation for long enough that the
bootstrapping behavior is overwhelmed by steady-state behavior. (This use of the
term bootstrapping can be contrasted with its use in both statistics and computing.)
Statistics
The simulation typically keeps track of the system's statistics, which quantify the aspects of
interest. In the bank example, it is of interest to track the mean waiting times. In a simulation
model, performance metrics are not analytically derived from probability distributions, but
rather as averages over replications, that is, different runs of the model. Confidence
intervals are usually constructed to help assess the quality of the output.
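Replications, warm-up removal, and confidence intervals can be combined in a short Python sketch. The example below is illustrative only: the "model" is a stand-in that draws exponential samples, the warm-up length and the number of replications are arbitrary, and the interval uses a normal approximation rather than a Student t quantile.

```python
import random
import statistics

def replication(seed, n_events=2000, warmup=200):
    """One run of a toy model; returns the mean after discarding the warm-up."""
    rng = random.Random(seed)   # seeded, so the same run can be repeated exactly
    samples = [rng.expovariate(1.0) for _ in range(n_events)]
    return statistics.fmean(samples[warmup:])

def confidence_interval(values, level=0.95):
    """Normal-approximation confidence interval for the mean of `values`."""
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / len(values) ** 0.5
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * sem, mean + z * sem

means = [replication(seed) for seed in range(30)]   # 30 independent replications
low, high = confidence_interval(means)
print(f"mean of replications: {statistics.fmean(means):.3f}")
print(f"95% confidence interval: [{low:.3f}, {high:.3f}]")
```

Each replication uses its own seed, so the runs are independent of each other while any single run can still be reproduced.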
Ending condition
Because events are bootstrapped, theoretically a discrete-event simulation could run forever.
So the simulation designer must decide when the simulation will end. Typical choices are at
time t or after processing n number of events or, more generally, when statistical measure
X reaches the value x.
6.6 Network Simulators as DES
Discrete event simulation is used in computer networks to simulate new protocols for different
network traffic scenarios before deployment.
In communication and computer network research, network simulation is a technique where a
program models the behavior of a network either by calculating the interaction between the
different network entities (hosts/packets, etc.) using mathematical formulas, or by actually
capturing and playing back observations from a production network. The behavior of the
network and the various applications and services it supports can then be observed in a test
lab; various attributes of the environment can also be modified in a controlled manner to assess
how the network would behave under different conditions.
There are many network simulators, both free/open-source and proprietary. Examples of
notable network simulation software, ordered by how often they are mentioned in
research papers, are:
ns (open source)
OPNET (proprietary software)
NetSim (proprietary software)
6.7 Network Simulations with OPNET
OPNET Technologies, Inc. is a software business that provides performance management for
computer networks and applications. The company was founded in 1986 and went public in 2000.
OPNET can serve a variety of needs. Compared to the cost and time involved in setting up
an entire test bed containing multiple networked computers, routers and data links, OPNET is
relatively fast and inexpensive. It allows engineers and researchers to test scenarios that might
be particularly difficult or expensive to emulate using real hardware, for instance simulating a
scenario with several nodes or experimenting with a new protocol in the network. Network
simulators are particularly useful in allowing researchers to test new networking protocols or
changes to existing protocols in a controlled and reproducible environment. A typical network
simulator encompasses a wide range of networking technologies and can help the users to
build complex networks from basic building blocks such as a variety of nodes and links. With the
help of simulators, one can design hierarchical networks using various types of nodes like
computers, hubs, bridges, routers, switches, links, mobile units etc.
Various types of Wide Area Network (WAN) technologies like TCP, ATM, IP etc. and Local Area
Network (LAN) technologies like Ethernet, token rings etc. can all be simulated with a typical
simulator, and the user can test and analyze various standard results apart from devising some
novel protocol or strategy for routing etc. Network simulators are also widely used to simulate
battlefield networks in network-centric warfare.
Minimally, a network simulator must enable a user to represent a network topology, specifying
the nodes on the network, the links between those nodes and the traffic between the nodes.
More complicated systems like OPNET allow the user to specify everything about the protocols
used to handle traffic in a network. Graphical applications allow users to easily visualize the
workings of their simulated environment. Text-based applications may provide a less intuitive
interface, but may permit more advanced forms of customization.
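The minimal input of a network simulator (nodes, links, and traffic) can be represented with a few plain data structures. The Python sketch below is a generic illustration and does not reflect OPNET's model format or API; the class names, the ring of eight nodes, and the bandwidth and rate values are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Link:
    a: str
    b: str
    bandwidth_mbps: float

@dataclass(frozen=True)
class Flow:
    src: str
    dst: str
    rate_mbps: float

@dataclass
class Scenario:
    nodes: set = field(default_factory=set)
    links: list = field(default_factory=list)
    flows: list = field(default_factory=list)

    def add_link(self, a, b, bandwidth_mbps):
        """Add a bidirectional link; its endpoints become nodes of the topology."""
        self.nodes.update((a, b))
        self.links.append(Link(a, b, bandwidth_mbps))

    def add_flow(self, src, dst, rate_mbps):
        """Add traffic between two nodes that already exist in the topology."""
        if src not in self.nodes or dst not in self.nodes:
            raise ValueError("flow endpoints must be existing nodes")
        self.flows.append(Flow(src, dst, rate_mbps))

    def neighbors(self, node):
        """Sorted list of nodes directly linked to `node`."""
        return sorted({l.b for l in self.links if l.a == node} |
                      {l.a for l in self.links if l.b == node})

# Hypothetical scenario: eight CPUs connected as a ring, with one traffic flow.
ring = Scenario()
names = [f"cpu{i}" for i in range(8)]
for i, name in enumerate(names):
    ring.add_link(name, names[(i + 1) % 8], bandwidth_mbps=100.0)
ring.add_flow("cpu0", "cpu4", rate_mbps=10.0)
print(ring.neighbors("cpu0"))
```

A graphical tool hides this description behind a drawing canvas, while a text-based tool exposes something closer to it directly.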
7 Implementing the project in OPNET
7.1 Adding Traffic
Based on the theory presented so far and the results gained from the survey, we made
three architectural designs using OPNET.
We stated that it would be better to organize a heterogeneous network (one with different
processors), but we must pay attention to the interconnections between the
components of our System-on-a-Chip. That is why this section presents different
interconnections of 8 CPUs. Like most modern smartphones, our proposed architecture
is built from basic CPUs that implement simple tasks (keeping the system running, light
data transfer, cellular calls, etc.), and additional CPUs in charge of more
complicated tasks (heavy data transfer, picture editing, playing HD videos, etc.). In the OPNET
testing, two of Intel's full node models were used: Intel_D875PBZ_P4 (3200MHz) and
Intel_VC820 (800MHz), where the first model represents the architecture's basic CPU and the
second model represents the additional CPU.
For the testing to be complete, the simulation implemented traffic data and real-life events.
Figure 7.1 provides the profile configuration table with the profiles created specifically for
testing events over the network. Five profiles were created in such a way that both basic and
additional CPUs would be tested on different tasks.
Fig. 7.1 Profile configuration table
Brief documentation follows for all five profiles created, including
their application tables, start time offsets, and the durations of the applications. Each of these
applications generates and simulates events over the network. The traffic will be tested over
different network configurations and evaluated on different performance measures.
1. Telecom
2. Data Transfer
3. Video Conference
4. Web Browsing
5. Gaming
7.2 Network On-Chip realizations in OPNET
In this section different network architectures are tested in OPNET. All of them are designed
and theoretically justified in Section 5 (On-Chip Network to a Multi-core System) by four
parameters: topology, routing algorithm, flow control protocol, and router microarchitecture.
Statistics about the networks will be collected based on the following information:
Simulation process of execution
queuing delay of every node in the network
point-to-point throughput of the channels between the CPUs
throughput of the channels between the CPUs and the router
and the global delay of the network
7.2.1 Star network (and 4 basic CPUs interconnected forming a ring network)
Fig. 7.2 Combination of Star and Ring networks in OPNET Modeler
This network is actually a combination of two networks. First, all 8 nodes form a Star
network (since all of them are connected directly to the router in the middle), and a second
network (Ring) is formed by the basic CPUs, which are interconnected with each other
(every node is connected to its neighbors).
Characteristics of the network:
Server
Application Configuration
Profile Configuration
Router to assign tasks to the nodes
1000BaseX (1 Gbps) duplex links to interconnect additional nodes with the router
10GbpsBaseT (10 Gbps) duplex links to interconnect basic nodes with the router and
with each other
Max Hop Count of the Network - 2
Max Node Degree - 3
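The two topology figures in the list above can be checked on a small graph model. The sketch below is our own reading of the textual description, not an export of the OPNET model: it assumes one router connected to all 8 CPUs, a ring among the 4 basic CPUs, and a node degree that is counted over the CPU nodes only (the router is excluded). The node names are hypothetical.

```python
from collections import deque


def build_star_with_basic_ring():
    """Adjacency sets for the assumed topology: router + 4 basic + 4 additional CPUs."""
    basic = [f"basic_{i}" for i in range(1, 5)]
    additional = [f"additional_{i}" for i in range(1, 5)]
    adj = {name: set() for name in ["router", *basic, *additional]}

    def connect(a, b):
        adj[a].add(b)
        adj[b].add(a)

    # Star part: every CPU is linked directly to the router.
    for cpu in basic + additional:
        connect("router", cpu)
    # Ring part: each basic CPU is linked to the next one, wrapping around.
    for i, cpu in enumerate(basic):
        connect(cpu, basic[(i + 1) % len(basic)])
    return adj


def hops_from(adj, start):
    """Breadth-first search: minimal hop count from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def max_hop_count(adj):
    """Largest minimal hop count over all node pairs (the graph diameter)."""
    return max(max(hops_from(adj, node).values()) for node in adj)


def max_cpu_degree(adj):
    """Largest number of links at any CPU node; the router is left out by assumption."""
    return max(len(neigh) for name, neigh in adj.items() if name != "router")


adj = build_star_with_basic_ring()
print(max_hop_count(adj), max_cpu_degree(adj))
```

Under these assumptions the model gives a maximum hop count of 2 and a maximum CPU node degree of 3, in agreement with the listed characteristics. If the router were included in the degree count, it would have 8 links.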
Simulation Process:
The execution lasted 90 simulation seconds and completed 37,475,887 events, at an average speed of 309,876 events/sec.
Queuing delay of the nodes in the network:
This statistic is collected on every CPU node in the network in order to compare the
queuing delay of the Basic and Additional CPUs. Clearly, all the basic
processors show a larger delay in queuing packets.
*Note that Basic CPUs are running at 3200 MHz and Additional CPUs at 800 MHz.
Fig. 7.3 Queuing delay of the nodes in the network.
Point-to-point throughput of the channels between the CPUs:
Only the Basic Intel nodes are interconnected with their neighbors, which means that
every node is connected through a channel with two other nodes. For simplicity, these are the
throughput statistics from two pairs of nodes (between Intel 1 and Intel 2, and between Intel 1
and Intel 3). From the figure below it can be seen that the rate of successful message delivery
differs between the two channels, but when analyzing a single channel it can be seen that
the rate of exchange is nearly the same.
Fig. 7.4 Point-to-point throughput of the channels between Basic CPUs
Throughput of the channels between the CPUs and the router:
One node of each CPU type is taken as a representative. The figure below shows
that more tasks were assigned to the basic nodes than to the additional ones, or it may mean that
the channel connecting the Additional nodes with the router does not allow such a high data transfer rate.
Fig. 7.5 Throughput of the channels between the Router and both types of nodes
Global Delay of the Network:
The delay of the network peaks at 700 microseconds, which is quite a good
result.
Fig. 7.6 Global delay of Star Network
7.2.2 Ring network (and interconnections of all nodes with the router)
Fig. 7.7 An image of the Ring architecture designed in OPNET
In the previous network only the Basic nodes were interconnected; now both models of nodes
are interconnected. Every node is now connected with the router and also
with both neighbor nodes. At two points there is a connection between Basic and Additional
nodes (Intel Basic 1 - Additional CPU 1, and Intel Basic 4 - Additional CPU 4). The idea is to
see how the nodes will interact, or in other words how the events will now be assigned.
Characteristics of the network:
Server
Application Configuration
Profile Configuration
Router to assign tasks to the nodes
1000BaseX (1 Gbps) duplex links to interconnect all the nodes with each other and with
the Router
10GbpsBaseT (10 Gbps) duplex links to interconnect the Router and the Server
Max Hop Count of the Network - 1
Max Node Degree - 3
Simulation Process:
The execution lasted 90 simulation seconds and completed 40,614,395 events, at an
average speed of 358,979 events/sec.
Queuing delay of the nodes in the network:
No major difference can be seen in the queuing delay between the two types of nodes.
Fig. 7.8 Queuing delay on the nodes in the network
Point-to-point throughput of the channels between the CPUs:
The figure below shows a representative of each kind of interconnection in this particular
network, since the Basic CPUs are connected with each other, the Additional CPUs likewise, and
there are two connections between Basic and Additional CPUs. The rate of successful message
delivery reaches 2000 bits/sec in all of the cases.
Fig. 7.9 Point-to-point throughput on channels connecting Basic, Additional, and Basic-Additional nodes
Throughput of the channels between the CPUs and the router:
Fig. 7.9 Both the channels between the Basic and Additional nodes and the router reached a
successful delivery rate of 5.9 MB/sec, but the channels to the Basic nodes delivered 4 times
more packets than the other ones.
Global Delay of the Network:
Fig. 7.10 The delay of the network differs by only 30 microseconds from that of the Star
architecture
7.2.3 Mesh network
In this architecture all the nodes are again connected with the router, but every node is also
connected with its neighbors and with one node of the other type (additional with basic, and
vice versa). The idea is to make a safer network in which, even if more than one link fails,
there are still other paths that can reach the goal node.
*Note that this is not a full mesh network, in which all the nodes would be interconnected with
each other.
Fig. 7.11 An image of the mesh architecture designed in OPNET
Characteristics of the network:
Server
Application Configuration
Profile Configuration
Router to assign tasks to the nodes
10GbpsBaseT (10 Gbps) duplex links to interconnect Basic CPUs with the Router and with each other
1000BaseX (1 Gbps) duplex links to interconnect Additional CPUs with Basic CPUs and
with each other
Max Hop Count of the Network - 2
Max Node Degree - 4
Simulation Process:
The execution lasted 90 simulation seconds and completed 55,280,320 events, at an
average speed of 328,347 events/sec.
Queuing delay of the nodes in the network:
Fig. 7.12 The queuing delay of packets is considerably larger in the Additional nodes than in the
Basic ones.
Point-to-point throughput of the channels between the CPUs:
An interesting situation is presented in the figure below. Both of the channels can deliver up to
1 Gbps of data transfer, but in the first case (Additional CPU 1 - Basic Intel 4) the Basic CPU is
only sending 3000 bits/sec. This suggests that the node does not need much help in executing its
tasks.
Fig. 7.13 Point-to-point throughput on channels connecting different CPUs
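The gap between what a channel can carry and what it actually carries can be expressed as link utilization, i.e. measured throughput divided by link capacity. The following is a minimal sketch; the function name link_utilization is our own, and the only inputs used are the 1 Gbps capacity and the 3000 bits/sec rate quoted above.

```python
def link_utilization(throughput_bps, capacity_bps):
    """Fraction of the link capacity in use (0.0 = idle, 1.0 = saturated)."""
    if capacity_bps <= 0:
        raise ValueError("capacity must be positive")
    return throughput_bps / capacity_bps


GBPS = 1_000_000_000  # 1 Gbps link capacity, in bits per second

# Basic CPU sending 3000 bits/sec over a 1 Gbps channel.
utilization = link_utilization(3_000, GBPS)
print(f"utilization: {utilization:.6%}")
```

A utilization this far below 1 indicates that the channel is not the limiting factor for this pair of nodes, which is consistent with the interpretation that the Basic CPU needs little help from the Additional CPU.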
Throughput of the channels between the CPUs and the router:
In the channel between the router and the Additional CPU we can see that the successful
message delivery reaches 250,000,000 bits/sec in both directions of the link. On the other
channel, where the Router is connected with a Basic CPU, the situation is reversed for the
delivery from the Router to the node.
Fig. 7.14 Throughput on channels connecting the Router and the nodes
Global Delay of the Network:
Fig. 7.15 The global delay of the network reaches almost 1000 microseconds, which is almost
identical to the previously offered architectures.
7.3 Results Comparison
Network Delay:
Fig. 7.16 The delay is almost identical in the Star and Ring topologies
Simulation Events:
The execution time of all the simulations is fixed at 90 seconds, but the architectures differ in
the number of events they can complete. Figure 7.17 shows the difference between the
networks in the number of events completed. This should be taken seriously as an
influential factor in deciding which architecture to propose.
Fig. 7.17 Number of events completed
[Charts for Fig. 7.16 (network delay, scale 0 to 1000, over 20 to 100 sim/s) and Fig. 7.17 (Number of Events, scale 0 to 60,000,000), each comparing the Star, Ring, and Mesh networks]
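The event counts reported in the three Simulation Process paragraphs of Section 7.2 can be compared directly. The sketch below is a worked calculation on those reported numbers only. One assumption is labeled in the code: the "average speed" is taken to be events per second of real (wall-clock) run time, which is why dividing events by speed does not give the 90 simulated seconds.

```python
# Event counts and average speeds as reported for the three simulated networks.
runs = {
    "Star": {"events": 37_475_887, "events_per_sec": 309_876},
    "Ring": {"events": 40_614_395, "events_per_sec": 358_979},
    "Mesh": {"events": 55_280_320, "events_per_sec": 328_347},
}


def extra_events(name, baseline):
    """Relative surplus of events completed by `name` compared with `baseline`."""
    return runs[name]["events"] / runs[baseline]["events"] - 1.0


def implied_run_time(name):
    """Run time in seconds implied by events / speed.

    Assumption: the reported speed is measured against wall-clock time,
    not against the 90 simulated seconds.
    """
    run = runs[name]
    return run["events"] / run["events_per_sec"]


for name in ("Star", "Ring"):
    print(f"Mesh completes {extra_events('Mesh', name):.1%} more events than {name}")
for name in runs:
    print(f"{name}: about {implied_run_time(name):.0f} s implied run time")
```

The relative surplus is the quantity that matters for the comparison: it expresses how many more events the Mesh network completes within the same 90 simulated seconds.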
Queuing Delay of the nodes:
To create the ideal architecture, close attention should be paid to the queuing delay of the nodes.
Generally speaking, the nodes in the Mesh network have the lowest queuing delay: Additional CPUs
have a delay of 5.7 microseconds, and Basic CPUs have a delay of 400 microseconds.
By comparison, the nodes in the Star and Ring networks have delays of 20000 microseconds
(Additional nodes) and 10000 microseconds (Basic nodes).
Point-to-point throughput of channels:
The point-to-point throughput is almost the same in the first two networks, as shown in the
figure below. On the other hand, when the Mesh topology is implemented, the throughput of the
channels ranges from 375 Bps up to 50 Mbps. We must also note that in the Mesh topology
there are link connections between two Basic nodes, between two Additional nodes, and between
Basic and Additional nodes.
Fig. 7.18 Throughput of channels between nodes. The successful delivery rate is given in bits/s.
Hop count, Max node degree, and Alternative paths:
As for its effect on throughput, since a topology dictates the total number of alternate paths
between nodes, it affects how well the network can spread out traffic and thus the effective
bandwidth a network can support. Network reliability is also greatly influenced by the topology
as it dictates the number of alternative paths for routing around faults.
[Chart for Fig. 7.18: throughput in bits/s (scale 0 to 2500) for the Star and Ring networks, with series Basic, Additional, and Additional and Basic]
Attributes/Topologies   Star          Ring          Mesh
Hop Count               2             1             2
Node Degree             3             3             4
Alternative Paths       More than 3   More than 5   More than 5
A star topology offers fewer alternate paths between nodes than a mesh or ring, and thus
saturates at a lower network throughput for most traffic patterns. In case of a fault, the mesh
topology offers the most alternate paths. A higher hop count also implies lower network
throughput, which is another drawback of the Star network.
While star networks have poorer performance (latency, throughput, energy, and reliability)
than higher-dimensional networks, they have lower implementation overhead. A
ring has a node degree of 3 while a mesh has a node degree of 4, where node degree refers to
the number of links in and out of a node. A higher node degree requires more links and higher
port counts at routers.
So it can be concluded that, as far as topology is concerned, the most suitable network is
the Mesh network, since we are looking for fast performance and reliable bandwidth.
8 Conclusions
With all the effort invested in this project, there are reasons to believe that the
architectures provided, together with the supporting theory, would be closer to industrial acceptance.
We summarize the progress with respect to the main objectives of the project, namely
reliability, high performance, and long-lasting battery.
Reliability: This is a major obstacle to the acceptance of a particular design. The proposed
solution should be a reliable network-on-a-chip with efficient multi-core processors for
mobile platforms as future systems on a chip. That is why eight processors and links
with high bandwidth were offered. Also, with its high number of alternate paths, Mesh
offers tolerance for faults, which again makes the network reliable.
High Performance: In line with the trends in today's smartphone technology, where
high performance is needed, every proposed network designed in OPNET
offered high performance because of the eight processors. Four processors at 3200MHz
and four additional ones at 800MHz would be enough for completing events
quickly. Comparing the results we gained, the overall performance
depends on latency, throughput, and energy.
The point-to-point throughput on channels in the three proposed networks showed
that the nodes are working together in processing events. In mesh, there is a higher
deviation in the throughputs of the channels, from 375 Bps up to 50 Mbps, which is a
lot more than the throughput in the star and ring networks.
The latency (delay) of the proposed networks showed that the ring is the most
suitable network. However, we should also keep in mind the number of events
completed by the networks. Mesh executes more than one third more events than the other
networks in the same time (55,280,320 against 40,614,395 for Ring and 37,475,887 for Star),
so again it is the more suitable solution.
Long Lasting Battery: The battery of a system-on-a-chip cannot be the same as the ones
in laptops, tablets, etc. However, with smart usage of the power of the chip, a lot of
energy can be saved. In the section on OPNET testing we saw that all the processors
in the network divide the tasks between them, depending on how occupied a
particular processor is. The plan is also not to use all of the processing power when
there are not enough events to be executed.