SoC - Intro, Design Aspects, HLS, TLM

Delivered by:

Subhash Iyer,

Program Head,

Soft Polynomials (I) Pvt. Ltd., Nagpur

(CDAC ATC)

Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.

Introduction

What is SoC ?

SoC characteristics

Benefits and drawbacks

Solution

Major SoC Applications

Summary


Technological Advances

Today’s chips can contain 100M transistors

Transistor gate lengths are now measured in nanometers

Approximately every 18 months the number of transistors on a chip doubles (Moore’s law)

The Consequences

Components connected on a Printed Circuit Board can now be integrated onto a single chip

Hence the development of System-on-Chip design
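As a rough illustration of the doubling rule quoted above, the Python sketch below projects a transistor count forward in time. The function name `projected_transistors` is a hypothetical helper, and the 100M starting point is simply the figure from the slide; real process roadmaps do not follow the rule exactly.

```python
def projected_transistors(start_count: float, months: float,
                          doubling_period_months: float = 18.0) -> float:
    """Project a transistor count, assuming it doubles once per doubling period."""
    return start_count * 2 ** (months / doubling_period_months)


if __name__ == "__main__":
    start = 100e6  # 100M transistors, the figure quoted on the slide
    for years in (0, 3, 6):
        count = projected_transistors(start, years * 12)
        print(f"after {years} years: {count / 1e6:.0f}M transistors")
```

Under this assumption, three years (two doubling periods) turn 100M transistors into 400M.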


System on a board

System on a Chip






Version A: Advances in VLSI manufacturing technology have made it possible to put millions of transistors on a single die. This enables designers to build systems-on-a-chip that eventually move everything from the board onto the chip.

Version B: An SoC is a high-performance microprocessor, since we can program the uP and give it instructions to do whatever we want.

Version C: SoC is the effort to integrate heterogeneous silicon IPs, such as memory, uP, random logic, and analog circuitry, onto the same chip.

All of the above are partially right, but not very accurate!


• An SoC is not only a chip; the emphasis is on the “system”.

• SoC = Chip + Software + Integration

• The SoC chip includes:

• Embedded processor

• ASIC logic and analog circuitry

• Embedded memory

• The SoC Software includes:

• OS, compiler, simulator, firmware, driver, protocol stack

• Integrated development environment (debugger, linker, ICE)

• Application interface (C/C++, assembly)

• The SoC Integration includes :

• The whole system solution

• Manufacturing consultation

• Technical support


A typical digital system design involves a significant amount of custom logic circuitry, but also includes pre-designed major components, such as processors, memory units and various types of input/output (I/O) interfaces.

In the traditional approach for designing such systems, a new integrated circuit (IC) chip is created for the custom logic circuits, but each pre-designed component is included as a separate chip.

A different approach for realizing digital systems is called embedded system design. It leverages the advanced capabilities of today's IC technology by implementing many of the components of the system within a single chip, such as a field programmable gate array (FPGA).


Offer large logic capacity, exceeding several million equivalent logic gates, and include dedicated memory resources

Include special hardware circuitry that is often needed in digital systems, such as digital signal processing (DSP) blocks (with multiply and accumulate functionality) and phase-locked loops (PLLs) (or delay-locked loops (DLLs)) that support complex clocking schemes

Support a wide range of interconnection standards, such as double data rate (DDR SRAM) memory, PCI and high-speed serial protocols.





ASIC Typical Design Steps

• Typical ASIC design can take up to two years to complete

[Figure: ASIC design flow timeline. Steps shown: Top Level Design, Unit Block Design, Unit Block Verification, Integration and Synthesis, Trial Netlists, System Level Verification, Timing Convergence & Verification, Fabrication, DVT Prep, DVT. Step durations in weeks as labeled: 6, 12, 12, 4, 14, ??, 5, 8. Time to mask order: 48 weeks; the figure also carries the label 61.]


SoC Typical Design Steps

[Figure: SoC design flow timeline. Steps shown: Top Level Design, Unit Block Design, Unit Block Verification, Integration and Synthesis, Trial Netlists, System Level Verification, Timing Convergence & Verification, Fabrication, DVT Prep, DVT. Step durations in weeks as labeled: 4, 14, 5, 4, 4, 2. Time to mask order: 24 weeks; the figure also carries the label 33.]

• With the increasing complexity of ICs and decreasing geometry, the IC vendor steps of placement, layout and fabrication are unlikely to be greatly reduced.

• In fact, there is a greater risk that the timing convergence steps will involve more iterations.

• Need to reduce the time spent before the vendor steps.

• Need to consider layout issues up front.


Design reuse is facilitated if “standard” internal connection buses are used.

All cores connect to the bus via a standard interface.

Any-to-any connections are easy, but …

Not all connections are necessary.

Global clocking scheme.

Power consumption.

Standardization is being addressed by the Virtual Socket Interface Alliance (VSIA)


• AMBA (Advanced Microcontroller Bus Architecture) is a collection of buses from ARM for satisfying a range of different criteria.

• APB (Advanced Peripheral Bus): simple strobed-access bus with minimal interface complexity. Suitable for hosting peripherals.

• ASB (Advanced System Bus): a multimaster synchronous system bus.

• AHB (Advanced High Performance Bus): a high-throughput synchronous system backbone. Burst transfers and split transactions.


• One solution to the design productivity gap is to make ASIC designs more standardized by reusing segments of previously manufactured chips.

• These segments are known as “blocks”, “macros”, “cores” or “cells”.

• The blocks can either be developed in-house or licensed from an IP company.

• Cores are the basic building blocks .


• Soft Macro

– Reusable synthesizable RTL or netlist of generic library elements

– User of the core is responsible for the implementation and layout

• Firm Macro

– Structurally and topologically optimized for performance and area through floor planning and placement

– Exist as synthesized code or as a netlist of generic library elements

• Hard Macro

– Reusable blocks optimized for performance, power, size and mapped to a specific process technology

– Exist as fully placed and routed netlist and as a fixed layout such as in GDSII format.


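[Figure: core type trade-off. Reusability, portability and flexibility are highest for soft cores and lowest for hard cores; predictability, performance and time-to-market advantage are highest for hard cores and lowest for soft cores; firm cores lie in between.]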


• Locating the required cores and associated contract discussions can be a lengthy process

– Identification of IP vendors

– Evaluation criteria

– Comparative evaluation exercise

– Choice of core

– Contract negotiations

• Reuse restrictions

• Costs: license, royalty, tool costs

– Core integration, simulation and verification



MPSoC is a system-on-chip that contains multiple instruction-set processors (CPUs).

The typical MPSoC is a heterogeneous multiprocessor: there may be several different types of processing elements (PEs), the memory system may be heterogeneously distributed around the machine, and the interconnection network between the PEs and the memory may also be heterogeneous.

MPSoCs often require large amounts of memory. The device may have embedded memory on-chip as well as relying on off-chip commodity memory.


These chips have:

• one or several processors

• large amounts of memory

• bus-based architectures

• peripherals

• coprocessors

• and I/O channels




• There are several benefits in integrating a large digital system into a single integrated circuit.

• These include:

– Lower cost per gate

– Lower power consumption

– Faster circuit operation

– More reliable implementation

– Smaller physical size

– Greater design security


• The principal drawbacks of SoC design are associated with the design pressures imposed on today’s engineers, such as:

– Time-to-market demands

– Exponential fabrication cost

– Increased system complexity

– Increased verification requirements


Why does it take longer to design SOCs compared to traditional ASICs?

We must examine factors influencing the degree of difficulty and Turn Around Time (TAT) (the time taken from gate-level netlist to metal mask-ready stage) for designing ASICs and SOCs.

For an ASIC, the following factors influence TAT:

• Frequency of the design

• Number of clock domains

• Number of gates

• Density

• Number of blocks and sub-blocks

The key factor that influences TAT for SOCs is system integration (integrating different silicon IPs on the same IC).




• Overcome complexity and verification issues by designing Intellectual Property (IP) to be reusable.

• Done on such a scale that a new industry has developed.

• Design activity is split into two groups:

– IP Authors: producers

– IP Integrators: consumers

• IP Authors produce fully verified IP libraries

– Thus making the overall verification task more manageable

• IP Integrators select, evaluate and integrate IP from multiple vendors

– IP is integrated onto an Integration Platform designed with a specific application in mind



IP cores are classified into three distinct categories:

Hard IP Cores

Firm IP Cores

Soft IP Cores


Hard IP cores consist of hard layouts using particular physical design libraries and are delivered as mask-level designed blocks (GDSII format). The integration of hard IP cores is quite simple, but hard cores are technology dependent and provide minimum flexibility and portability in reconfiguration and integration.


Soft IP cores are delivered as RTL VHDL/Verilog code to provide functional descriptions of IPs. These cores offer maximum flexibility and reconfigurability to match the requirements of a specific design application, but they must be synthesized, optimized, and verified by their user before integration into designs.


Firm IP cores bring the best of both worlds and balance the high performance and optimization properties of hard IPs with the flexibility of soft IPs. These cores are delivered in the form of netlists targeted to specific physical libraries after going through synthesis without performing the physical layout.







eS/W (embedded software): current application complexity

Set-top box: >1 million lines of code

Digital audio processing: >1 million lines of code

Recordable DVD: Over 100 person-years effort

Hard-disk drive: Over 100 person-years effort

In multimedia systems S/W cost (licenses) 6X larger than H/W chip cost

eS/W uses 50% to 80% of design resources

eS/W now an essential part of SoC products


Speech Signal Processing

Image and Video Signal Processing

Information Technologies

PC interfaces (USB, PCI, PCI-Express, IDE, etc.)

Computer peripherals (printer control, LCD monitor controller, DVD controller, etc.)

Data Communication

Wireline communication: 10/100 Base-T, xDSL, Gigabit Ethernet, etc.

Wireless communication: Bluetooth, WLAN, 2G/3G/4G, WiMAX, UWB, etc.


• Consumer devices,

• Networking,

• Communications, and

• other segments of the electronics industry.

microprocessor, media processor,

GPS controllers, cellular phones,

GSM phones, smart pager ASICs,

digital television, video games,

PC-on-a-chip





Systems on chip are everywhere

Technology advances enable increasingly more complex designs

Central Question: how to exploit deep-submicron technologies efficiently?




Technological advances mean that complete systems can now be implemented on a single chip.

The benefits that this brings are significant in terms of speed, area and power.

The drawbacks are that these systems are extremely complex, requiring large amounts of verification.

The solution is to design and verify reusable IP.





Introduction to SoC Design Aspects


At each level of circuit abstraction, the circuit is equivalent and performs the same target operation, but its structural components (and hence the component’s granularity) are different, and the design issues may be different


Embedded applications in the multimedia, wireless communications and networking domains were traditionally implemented on Printed Circuit Boards (PCBs).

These boards are composed of discrete Integrated Circuits (ICs):

General Purpose Processors

Digital Signal Processors

Application Specific Integrated Circuits

Memories

Further peripherals

Communication between the discrete processing elements and memories is realized by shared bus architectures (such as PCI).


The transition is from board-level integration towards System-on-Chip (SoC) implementations of embedded applications.

Today, multiple heterogeneous processing elements and memories can be integrated on a single chip, which brings:

Increased performance

Reduced cost

Improved energy efficiency

This trend originates from a tremendous increase in features as well as the multitude of co-existing standards.

The resulting functional complexity clearly promotes software-enabled solutions to achieve the required flexibility and cope with the demanding time-to-market conditions.

However, the stringent energy efficiency constraints of mobile applications and cost-sensitive consumer devices prohibit the use of general purpose processors.

The tight cost and performance requirements of versatile embedded systems lead to application-specific heterogeneous multi-processor architectures.



The classical vertical partitioning approach to HW/SW Co-design, where the performance-critical parts are implemented as dedicated HW blocks and the rest is executed in SW, is no longer applicable.

Instead, HW/SW Co-design can be seen as a multi-dimensional horizontal mapping problem of an application running on a heterogeneous multiprocessor platform.

During the mapping process, the parallelism inherent in the application is exploited to achieve performance at reasonable cost.

For the computationally intensive portions of typical embedded applications, the extraction of Task Level Parallelism (TLP) is mostly straightforward:

The partitioning into a set of loosely coupled functional blocks can be naturally derived from the algorithmic block diagram.

Two major aspects:

Processing: a set of processing elements has to be provided for the efficient execution of the functional tasks.

Communication mapping: the inter-task data exchange has to be mapped to a communication architecture.

Only a joint consideration of architectural choices in both areas offers the opportunity for near-optimal quality of results.

Recent architectural advances offer a huge design space with enormous potential for optimization.


The bus paradigm inherited from the PCB era constitutes the major power and performance bottleneck.

Chip-wide communication is envisioned to be handled by full-scale Network-on-Chip (NoC) architectures.

Network-on-Chip architectures:

Resolve the physical issues

Address the functional aspects of on-chip communication


So far, the dynamic priority based arbitration scheme of shared busses creates a mutual dependency between all components connected to the bus.

Due to this lack of traffic management capabilities, every change in the traffic requirements of the application requires a re-design of the bus architecture.

Instead, NoC architectures take advantage of sophisticated networking algorithms to provide elaborate traffic-management capabilities.

In this way, the ad-hoc communication mapping is replaced with a disciplined allocation of the required communication services, and the on-chip network takes care of providing the required resources.

From the system architecture perspective, this separation of the offered communication services from the architectural resources can be considered as a virtualization of the actual communication architecture.

This virtualization effectively decouples the mapping problem for communication and computation.

The price to pay for the physical and functional benefits of NoC based communication is a significant penalty in terms of chip area as well as transfer latency.
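The mutual dependency created by priority based bus arbitration can be made concrete with a small Python sketch. The master names, priorities and the one-transfer-per-cycle rule are all illustrative assumptions; this is not a model of any particular bus standard, and the priorities are passed in as plain numbers that a dynamic scheme would recompute over time.

```python
from typing import Dict, List, Optional


def arbitrate(requests: Dict[str, int]) -> Optional[str]:
    """Grant the bus to the requesting master with the highest priority."""
    if not requests:
        return None
    # sorted() makes ties deterministic (alphabetical order wins)
    return max(sorted(requests), key=lambda master: requests[master])


def simulate(pending: Dict[str, int], priority: Dict[str, int]) -> List[str]:
    """Serve one transfer per cycle until no master has transfers pending.

    Returns the order in which the bus was granted, one entry per cycle.
    """
    pending = dict(pending)  # do not modify the caller's dictionary
    grants: List[str] = []
    while any(count > 0 for count in pending.values()):
        requesting = {m: priority[m] for m, count in pending.items() if count > 0}
        winner = arbitrate(requesting)
        grants.append(winner)
        pending[winner] -= 1
    return grants


if __name__ == "__main__":
    priority = {"cpu": 2, "dma": 1}  # hypothetical masters
    print(simulate({"cpu": 2, "dma": 1}, priority))
    print(simulate({"cpu": 4, "dma": 1}, priority))
```

In the second run the `dma` transfer is unchanged, yet it completes two cycles later because the `cpu` traffic grew. This is the kind of coupling between bus components that the slide describes.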

Programmable processing elements achieve significant gains in performance and computational efficiency by tailoring the instruction set and micro-architecture to the respective set of tasks.

Examples are innovative architectures exploiting:

Instruction Level Parallelism (ILP)

Data Level Parallelism (DLP)

Despite the increased computational performance, the effective performance is often constricted by the communication architecture, since memory access latency does not keep pace with the processing power.


General purpose processors resolve the memory access bottleneck by using sophisticated cache and memory hierarchies.

This is generally not applicable for embedded applications due to the poor memory locality of stream driven and packet based data processing.

Instead, processor architectures are equipped with hardware supported Multi-Threading (HW-MT) to perform task switches with virtually no performance overhead.

By that, the application inherent TLP is exploited with the purpose of hiding memory latency, which effectively leads to a significant increase in the processor utilization.

This technique is already widely employed in the network processor domain and has recently found its way into advanced multimedia and signal processing platforms.

In the light of the latency issue caused by NoC architectures, the importance of memory latency hiding techniques is likely to increase in the future.
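The latency hiding effect of hardware multi-threading can be estimated with a simple back-of-the-envelope model, sketched below in Python. It assumes that every thread alternates between a fixed number of compute cycles and a fixed memory stall, and that thread switches cost nothing; the function name and the cycle counts are illustrative assumptions, not measurements.

```python
def processor_utilization(threads: int, compute_cycles: int,
                          memory_latency_cycles: int) -> float:
    """Fraction of cycles in which the processor does useful work.

    Each thread computes for `compute_cycles`, then waits
    `memory_latency_cycles` for memory. While one thread waits, the
    hardware switches to another ready thread at no cost.
    """
    if threads < 1:
        raise ValueError("at least one thread is required")
    period = compute_cycles + memory_latency_cycles
    return min(1.0, threads * compute_cycles / period)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        share = processor_utilization(n, compute_cycles=20, memory_latency_cycles=60)
        print(f"{n} thread(s): {share:.0%} utilization")
```

With 20 compute cycles per 60 cycles of memory latency, a single thread keeps the processor busy only a quarter of the time, while four threads are enough to hide the latency completely in this idealized model.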


Taking the above considerations together, future SoCs can be considered as NoC enabled multi-processor architectures.

An on-chip communication backbone connects a large number of heterogeneous processing clusters and global storage elements.

Individual processing clusters consist of one or a few application specific programmable kernels together with tightly coupled instruction and data memories as well as local peripherals.


To cope with the resulting design complexity:

Achieve virtualization of the architectural resources, so that they can be allocated by the system architect in a deterministic way.

This virtualization is provided by:

The NoC approach for the communication part

SW and HW operating systems for the control and data processing respectively

Divide-and-conquer oriented design paradigm

Enables individual optimization of the architectural elements

The price for these benefits:

A penalty in terms of chip area, which is generally considered to be of constantly decreasing importance.


HW/SW Co-design of a given embedded application is defined as the task to:

Architect a heterogeneous MP-SoC platform

Allocate the architectural resources for the execution of the application

Architecture virtualization resolves the mutual dependencies in the mapping process.

Trade-offs in the design space still require a joint consideration of application and architecture as well as communication.


For example:

The latency of a more complex on-chip network can be compensated by either:

introducing a memory hierarchy

employing hardware multi-threaded processor kernels

Obviously, the resulting design space is virtually infinite.

The architecting and mapping phases cannot be considered independently without sacrificing quality of results.


What is needed is:

A system level design methodology

A corresponding tool-supported modeling framework

Transaction-Level Modeling (TLM)

Advocated by the SystemC language

The system level design paradigm

Already incorporated into state-of-the-art Electronic System Level (ESL) tools


TLM greatly improves:

Modeling efficiency

Simulation speed

Abstracts from the low-level communication details of the Register Transfer Level (RTL) to complete transactions

Is usually employed in a byte- and cycle-accurate fashion

We will look more at the packet-level TLM paradigm

Cycle-level TLM is still too detailed to explore large design spaces.


Communication becomes the driving design paradigm for MP-SoC, so the exploration framework is based on a sophisticated, communication-centric timing model.

Generic synchronization interface:

Defines a concise set of communication primitives

Follows the Open Core Protocol (OCP)

Not biased towards any specific communication architecture

Additionally, the primitives incorporate timing annotation to achieve reasonable timing accuracy at the highly abstract packet-level TLM layer.

The communication timing model captures the impact of the interconnection architecture on performance.

This communication timing model supports the full spectrum of available and proposed communication architectures, ranging from today’s shared busses to the emerging NoC paradigm.
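The idea of a timing-annotated, packet-level transaction can be sketched in a few lines of Python. This is only an illustration of the modeling style, not SystemC and not the OCP interface: the class names, the `transport` method, the 4-byte bus width and the 2-cycle setup time are all hypothetical. The point is that the initiator makes one call per complete transfer and gets back a completion time, instead of driving signals cycle by cycle.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    """One complete transfer, for example a packet written to a memory."""
    address: int
    size_bytes: int


class SharedBusModel:
    """Timing-annotated model of a shared bus (all parameters are assumptions)."""

    def __init__(self, width_bytes: int = 4, setup_cycles: int = 2) -> None:
        self.width_bytes = width_bytes
        self.setup_cycles = setup_cycles
        self.busy_until = 0  # cycle at which the bus becomes free again

    def transport(self, txn: Transaction, now: int) -> int:
        """Return the cycle at which the whole transaction completes."""
        start = max(now, self.busy_until)  # wait while the bus is occupied
        beats = -(-txn.size_bytes // self.width_bytes)  # ceiling division
        self.busy_until = start + self.setup_cycles + beats
        return self.busy_until


if __name__ == "__main__":
    bus = SharedBusModel()
    done_a = bus.transport(Transaction(address=0x1000, size_bytes=64), now=0)
    done_b = bus.transport(Transaction(address=0x2000, size_bytes=16), now=5)
    print(done_a, done_b)
```

Swapping `SharedBusModel` for a class with the same `transport` method but a different timing rule (for instance a network model with per-hop latency) would leave the initiator code unchanged, which is the sense in which the interface is not biased towards a specific communication architecture.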


Implemented by means of a versatile modeling framework for architecture exploration and hardware/software partitioning

Key advantages:

Modeling efficiency

Higher simulation speed

A declarative specification mechanism for better design space exploration


TLM is a method used for SoC design

To specify at a higher level of abstraction

Involves communication and computation architectures

The Unified Timing Model aims to standardize the TLM approach



Need to know why before what & how!!!


Networking Domain

Multimedia Domain

Wireless Communications


Constitutes implementation of networking standards

IEEE, ITU, ETSI, etc work out communication standards

The purpose of these standards is to achieve a high degree of interoperability

ISO/OSI reference model has been providing a common terminology


Networking layer standards in the middle of the ISO/OSI stack address a multitude of higher layer application standards as well as lower physical/link layer standards

The major implementation challenge and effort lies in the networking layer

Layer three multi-service access switches are considered as one of the potential killer applications for MP-SoC platforms, since they combine the physical wire speed throughput requirements with flexibility constraints imposed by the individual treatment of different service classes and application characteristics.

Today’s de facto networking layer standard is given by the rather simplistic Internet Protocol (IP).

Lower level layers are nowadays built in as ready made blocks

Physical & link layer data rates of core network equipment are imposing demanding performance requirements

Higher application layers are only present in the terminal devices,

So the relatively low to medium throughput requirements allow for a software implementation of the flexible and control dominated functionality.


Processing of all kinds of media data

Pictures

Audio

Video decoding

Video pixel processing

2D/3D graphics

Standards enable the exchange of media data as well as device inter-operability

MOPS: Mega Operations Per Second

Advances in processing capabilities and multimedia algorithms, together with increased user expectations, fuel a constant proliferation of new multimedia standards:

Digital audio decoding (AC3, OGG, MP3)

Video decoding (MPEG2, MPEG4, H.263, H.264, DivX, QuickTime)

3D graphic processing (DirectX 9)

Apart from the multitude and dynamics of multimedia standards, a flexible implementation platform is also mandatory to meet demanding cost constraints of converging consumer electronics devices such as the Advanced Set-Top Box (ASTB).

Here the processing and communication fabrics have to be shared among the multitude of supported multimedia applications to limit implementation cost.


Wireless communication applications aggressively use digital signal processing to maximize bandwidth efficiency

Again, a multitude of standards exists

Each marks a local optimum in implementation cost

Mobility

power dissipation

performance bandwidth efficiency

Multimedia and wireless communication domains are converging into a new generation of Personal Digital Assistant (PDA) or SmartPhone devices

PDAs have started to support a huge variety of travel and fun related applications with much higher processing requirements, such as localization, navigation, travel assistant, video camera, digital camera, picture editing, MP3 player or games.

Additionally, portable, multimedia-enabled PDA devices of this kind are obliged to support multiple communication standards, both cable (USB, FireWire) and wireless (3G, WLAN).


Summary of common trends:

New features and value added services: lead to exponentially increasing processing performance and communication requirements.

Standards become more dynamic and sophisticated and are introduced more rapidly: this calls for high flexibility of the SoC implementation to meet the resulting time-to-market as well as time-in-market requirements.

For mobile applications and cost sensitive consumer electronic devices: energy efficiency becomes the prevailing cost factor

Heterogeneous Multi-Processor SoC (MP-SoC) platforms are generally believed to meet the above mentioned conflicting performance, flexibility and energy efficiency requirements of demanding embedded applications

Hence, in the course of an MP-SoC platform design, the partitioning of a specific application is a task of major importance


Main Partitioning Principle

Control dominated domain

Data dominated domain

This first order partitioning has a major influence on both the target processing and communication elements as well as on the appropriate design methodology.



Examples


Control-plane processing is characterized by:

Moderate performance requirements

Huge amounts of functionality

Calling for maximum flexibility

Developed using an Integrated Development Environment (IDE) which is:

Architecture agnostic

Software centric

Software engineering techniques:

Object Oriented Programming (OOP) using:

Unified Modeling Language (UML)

C++

Java


To increase the reuse of the control-plane software (across multiple MP-SoC platform generations), Hardware-dependent Software (HdS) portions are wrapped into a stack of:

Middleware

Real Time Operating System (RTOS)

Device driver layers

Parallelism in control-plane processing:

Instruction Level Parallelism (ILP): extracted by a VLIW compiler or a superscalar processor architecture; helps gain performance

Task Level Parallelism (TLP): generally not possible due to the huge amount of functionality


Data-plane processing is characterized by:

Computationally intensive data manipulations

Performance at high data rates

Demand for high processing performance

Demand for high communication performance

Rapidly evolving standards in all application domains impose increasing flexibility constraints.


Need to reach performance requirements of networking, multimedia and wireless communications applications

This requires aggressively exploiting the abundant inherent parallelism available in data-plane processing tasks, which is possible because:

Functionality can be straightforwardly partitioned into a set of loosely coupled tasks with well predictable or even cyclo-stationary execution timing.

A well confined data set is associated with a single activation of an individual task.

Data sets associated with successive activations of an individual task are mostly independent.

These spatial and temporal properties with respect to second order task partitioning and data dependency can already be identified during the algorithm development stage and lead to an identification of coarse grain TLP.

This application inherent TLP enables the concurrent and parallel execution on MP-SoC platforms.
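A minimal Python sketch of this kind of coarse grain TLP is shown below: each packet is a well confined data set, each call of the task is one activation, and activations share no data, so they can be handed to a pool of workers. The checksum task and the function names are stand-ins chosen for illustration. The thread pool only shows the structure of the mapping; in CPython it does not by itself give a parallel speed-up for compute-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def process_packet(payload: bytes) -> int:
    """Stand-in data-plane task: a simple checksum over one packet."""
    return sum(payload) % 256


def process_stream(packets: List[bytes], workers: int = 4) -> List[int]:
    """Run one task activation per packet; activations are independent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the results in the order of the input packets
        return list(pool.map(process_packet, packets))


if __name__ == "__main__":
    stream = [b"\x01\x02", b"\xff\x01", b"abc"]
    print(process_stream(stream))
```

Because no activation depends on the result of another, the same `process_stream` structure works unchanged for any number of workers.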


More about SoC design concepts next !!!


The main aspects of SoC architectural elements


Macroscopic metrics for the classification and evaluation of architectural elements

Cost

Performance

Power Dissipation

Computational Efficiency

Flexibility


The cost of an embedded architecture is separated into:

Non-Recurring Engineering (NRE) cost for the initial design

Recurring chip fabrication cost

NRE cost is caused by:

Design effort for HW

SW development

Fabrication of the initial mask set

Typical NRE cost for a 90 nm SoC:

10-100 million USD design effort

1 million USD per mask set

Fabrication cost is determined by:

Silicon die area

Packaging

Number of pins

Power dissipation requirements
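How the two cost components combine can be illustrated with a short Python sketch that spreads the one-time NRE cost over the production volume. The NRE figure uses the low end of the numbers above (10 million USD design effort plus one 1 million USD mask set); the 5 USD recurring cost per chip and the function name are assumptions made only for this example.

```python
def cost_per_chip(nre_usd: float, fabrication_usd_per_chip: float,
                  volume: int) -> float:
    """Total cost per chip: amortized NRE plus recurring fabrication cost."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    return nre_usd / volume + fabrication_usd_per_chip


if __name__ == "__main__":
    nre = 10e6 + 1e6  # design effort plus one mask set (low end of the slide)
    for volume in (100_000, 1_000_000, 10_000_000):
        price = cost_per_chip(nre, fabrication_usd_per_chip=5.0, volume=volume)
        print(f"{volume:>10,} chips: {price:.2f} USD per chip")
```

At low volumes the NRE share dominates the cost per chip, while at high volumes the recurring fabrication cost does.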


Performance of both computational and communication architectures is classified into:

Latency

Throughput

Latency: the absolute time passing between the start and completion of a task

Throughput: the number of accomplished tasks per unit of time

Communication throughput is measured in bits per second (bps).

Throughput of programmable processing elements is measured in Million Instructions Per Second (MIPS).

The MIPS measurement is not very accurate, since the amount of work done per instruction differs between instruction sets
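Latency and throughput are separate quantities, which a small Python sketch can show for a hypothetical three-stage processing pipeline. The stage times and function names are assumptions for illustration: latency adds up the stage times for one task, while steady-state throughput is limited by the slowest stage only.

```python
from typing import Sequence


def pipeline_latency_ms(stage_times_ms: Sequence[int]) -> int:
    """Latency: time from start to completion of one task through all stages."""
    return sum(stage_times_ms)


def pipeline_throughput_per_s(stage_times_ms: Sequence[int]) -> float:
    """Throughput: tasks completed per second once the pipeline is full.

    The slowest stage limits how often a finished task leaves the pipeline.
    """
    return 1000.0 / max(stage_times_ms)


if __name__ == "__main__":
    stages = [2, 4, 2]  # hypothetical per-stage processing times in ms
    print(pipeline_latency_ms(stages), "ms latency")
    print(pipeline_throughput_per_s(stages), "tasks per second")
```

Here each task takes 8 ms from start to finish, yet a task completes every 4 ms, so the throughput is 250 tasks per second rather than 125.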


Power Dissipation

Measured in Watt

Denotes the energy per time required to operate an embedded system

Is an architecture metric of growing importance

Battery lifetime of mobile devices directly depends on the energy consumption.

Packaging cost depends on the heat dissipation properties, which in turn depend on the power consumption.

Striving for low power and energy consumption constitutes the key driver for architecture differentiation of embedded SoC platforms


Computational Efficiency

Derived from performance and power consumption

Characterizes the efficiency of a given architectural element with a single value

Computational efficiency of programmable architectures is predominantly measured in MIPS/Watt.

Alternatively measured in energy consumption per task (since the MIPS measurement is not very accurate)
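Both efficiency measures are simple ratios or products, as the Python sketch below shows. The function names and the example figures (400 MIPS at 0.5 W, a task taking 20 ms) are made-up values for illustration only.

```python
def mips_per_watt(mips: float, power_watts: float) -> float:
    """Computational efficiency as throughput divided by power."""
    return mips / power_watts


def energy_per_task_joules(power_watts: float, task_time_seconds: float) -> float:
    """Energy used by one task: power multiplied by the task's execution time."""
    return power_watts * task_time_seconds


if __name__ == "__main__":
    # Hypothetical processing element: 400 MIPS at 0.5 W, one task takes 20 ms
    print(mips_per_watt(400, 0.5), "MIPS/Watt")
    print(energy_per_task_joules(0.5, 0.02), "J per task")
```

Energy per task has the advantage that it compares architectures on the work actually done, independent of how many instructions each one needs for it.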


Flexibility

Related to the effort to change the functionality of a given architectural element

In contrast to the previous metrics, flexibility can hardly be measured in an accurate way.

Nonetheless, in the context of rapidly evolving functionality and standards of embedded applications, architectural flexibility is of major importance to achieve both decreasing time-to-market and increasing time-in-market.


A processing element (PE) provides the computational resource to execute a given portion of the application

Dedicated hardware implementation yields best performance

Programmable PEs are controlled by an instruction stream in a highly flexible way

The rather poor performance of programmable PEs has long fueled computer architecture research towards parallelizing the execution of instructions

Early efforts in parallel computer architectures are classified according to the deployment of control- and data-level parallelism:

SISD

SIMD

MIMD

MISD


SISD: Single Instruction Single Data

Traditional von-Neumann kind of computer architectures

Sequentially execute a single instruction stream on a single processing resource

SIMD: Single Instruction Multiple Data

Vector processing machines

Perform a single instruction on multiple data items in parallel

Used in architectures for embedded DSP and graphic applications

Exploit inherent data-level parallelism (DLP)

MIMD: Multiple Instruction Multiple Data

Traditional homogeneous multi-processor type of architectures

Employed in scientific supercomputers

MISD: Multiple Instruction Single Data

Rarely encountered class of architectures

Exploit temporal ILP by:

Setting pipeline stages

Executing several instructions simultaneously
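The difference between the SISD and SIMD classes described above can be sketched in Python by counting how many instruction steps an element-wise addition needs. This is a conceptual model only: the 4-lane width and the function names are assumptions, and ordinary Python code does not actually issue vector instructions.

```python
from typing import List, Tuple


def sisd_add(a: List[int], b: List[int]) -> Tuple[List[int], int]:
    """One add instruction per data item; returns (result, steps)."""
    result: List[int] = []
    steps = 0
    for x, y in zip(a, b):
        result.append(x + y)
        steps += 1
    return result, steps


def simd_add(a: List[int], b: List[int], lanes: int = 4) -> Tuple[List[int], int]:
    """One add instruction per group of `lanes` items; returns (result, steps)."""
    result: List[int] = []
    steps = 0
    for i in range(0, len(a), lanes):
        # one vector instruction: all lanes perform their add in the same step
        result.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        steps += 1
    return result, steps


if __name__ == "__main__":
    a, b = list(range(8)), [10] * 8
    print(sisd_add(a, b))
    print(simd_add(a, b))
```

Both functions produce the same sums, but the SIMD variant needs two steps instead of eight for eight data items, which is the data-level parallelism exploited in DSP and graphics architectures.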


Superpipelining:

Uses deep execution pipelines to increase the clock frequency

Superscalarity

Employs parallel functional units and complex dispatcher architectures to dynamically extract Instruction Level Parallelism (ILP)

Very Long Instruction Word (VLIW)

Execute several statically scheduled instructions on parallel functional units,

Hence the effort for ILP extraction is moved into the compiler

Hardware Multi-Threading (HW-MT)

Such architectures are able to concurrently pursue two or more threads of control by providing separate register resources for each thread context

Domain Specific (DS) Instruction Set

Tailors the programmable PE to a specific application domain

Provide specialized functional units.

DS processor examples are Digital Signal Processors (DSPs) employed in multimedia and wireless communications, or Network Processing Units (NPUs) for networking applications


The applicability of the above listed performance improvement techniques depends on the considered set of target applications.

Superpipelining and superscalarity are heavily used in high-performance General Purpose Processor (GPP) architectures to increase the single-thread performance of arbitrary applications, at the vast expense of silicon area and power dissipation.

Embedded applications, on the other hand, are severely energy and cost constrained, but still have significant performance and flexibility requirements.

The most promising approach to jointly optimize flexibility and performance is to exploit coarse-grain task-level parallelism (TLP) instead of ILP and map the loosely coupled tasks to individually optimized PEs.

Embedded PEs of this kind mostly rely on the more power-aware performance optimization techniques, like VLIW, multi-threading and a domain-specific or even application-specific instruction set.


MIMD control parallelism plays an important role in embedded SoC architectures

Parallel execution of specialized PEs offers a chance to improve application performance without sacrificing power efficiency


Homogeneous multi-processing refers to the multiple instantiation of identical PEs

Corresponds to a single-chip implementation of the MIMD principle

Homogeneous multi-processing of general-purpose embedded microcontrollers achieves the performance scaling required for the control-plane processing portion of embedded applications

Also found for data-plane processing in domain-specific MP-SoC platforms, where the identical instruction set of the PEs is tailored to a certain application domain


Heterogeneous multi-processing employs multiple PEs

Different PEs are individually tailored to a certain task or task set

Dedicated optimization

Applicable to data-plane processing, as it allows for a manual and static task allocation

The high degree of specialization in heterogeneous multi-processing further optimizes computational efficiency for a well defined set of target applications at the expense of generality


Parallel execution

Requires multiple computational resources

More than one task can be active at the same point in time.

Concurrent execution

Interleaved processing of several tasks on a single resource

At any time only one task can be active


The benefit of concurrent execution is depicted in the figure

Two tasks are mapped to a single processing element

Both tasks are divided into two processing portions

These are separated by a communication request

After Δt_delay the processing of the first portion is finished and the task is blocked for Δt_response until the request is accomplished.

Instead of wasting the processor resource during this period, the processor context is swapped to the second task by a scheduler.

Utilization of the processor is increased and the request latency is hidden
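The latency-hiding effect described above can be checked with a small calculation. The sketch below is only an illustration under stated assumptions: both tasks are identical, the context switch costs nothing, scheduling is non-preemptive, and the function and parameter names (`utilization`, `t_delay`, `t_response`, `t_rest`) are made up for this example.

```python
def utilization(t_delay, t_response, t_rest, concurrent):
    """Processor utilization for two identical tasks on one PE.

    Each task computes for t_delay, issues a request, waits t_response,
    then computes for t_rest. Context-switch cost is assumed to be zero.
    """
    busy = 2 * (t_delay + t_rest)  # useful work of both tasks
    if concurrent:
        # Task B runs its first portion while task A waits for its response.
        a_resume = max(2 * t_delay, t_delay + t_response)
        a_done = a_resume + t_rest
        # Task B resumes once the PE is free and its own response has arrived.
        b_resume = max(a_done, 2 * t_delay + t_response)
        total = b_resume + t_rest
    else:
        # Tasks run back to back; the PE idles during every response wait.
        total = 2 * (t_delay + t_response + t_rest)
    return busy / total


sequential = utilization(4, 3, 4, concurrent=False)   # 16 busy cycles out of 22
interleaved = utilization(4, 3, 4, concurrent=True)   # 16 busy cycles out of 16
```

With these numbers the response time is shorter than the other task's processing portion, so the wait is hidden completely; a longer response time would leave some idle cycles even with interleaving.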



The main aspects of SoC on-chip communication elements


Basic cost, performance, power, and flexibility metrics apply.

Additionally, Quality of Service (QoS) metrics known from the networking application domain are of increasing importance to manage complex on-chip traffic

The scalability of the communication architecture gains growing attention


Bus based on-chip communication paradigm is derived from the Printed Circuit Board (PCB) domain.

Examples:

VME (Versa Module Eurocard bus)

PCI (Peripheral Component Interconnect)

Advantages:

Easy programming model

High flexibility

Abundant availability of Intellectual Property (IP)

Suited for small and medium scale embedded systems where a small number of blocks exchange moderate amounts of data.


Implement a master-slave communication scheme

Active initiators along with passive target modules are hooked to a shared communication medium

Typical masters:

Processors

DMA controllers

Autonomous ASIC blocks

Typical slaves:

Memories

Co-processors

Other peripherals

Other components:

Arbitration units: Grant access to the communication medium to one of the competing master modules

Decoder units: Activate the target module based on the actual address and the address map, which maps the target modules into the bus address space
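The role of the decoder unit can be sketched as a lookup in the address map. The map below is purely hypothetical (base addresses, sizes and target names are invented for illustration); a real decoder is combinational hardware, not software.

```python
# Hypothetical address map: (base address, size in bytes, target name).
ADDRESS_MAP = [
    (0x0000_0000, 0x1_0000, "boot_rom"),
    (0x2000_0000, 0x8_0000, "sram"),
    (0x4000_0000, 0x1000, "uart"),
]


def decode(address):
    """Return the name of the slave selected by this address, or None."""
    for base, size, target in ADDRESS_MAP:
        if base <= address < base + size:
            return target
    return None  # unmapped address; a real bus would signal a decode error


selected = decode(0x2000_0010)
```

Here `selected` is `"sram"`, because the address falls into the second region of the map.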


Bandwidth

Is the premier performance metric

Denotes the maximum transfer capacity of the bus

Available bandwidth is measured in bits per second

Corresponds to the number of parallel data wires divided by the bus clock period
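As a worked instance of this relation, the sketch below computes the peak bandwidth of a bus; the function name and the example figures (32 data wires, 10 ns clock period) are assumptions chosen for illustration.

```python
def bus_bandwidth_bits_per_s(data_wires, clock_period_s):
    """Peak bandwidth: one bit per data wire per clock cycle."""
    return data_wires / clock_period_s


# A 32-bit bus clocked at 100 MHz (10 ns period) peaks at about 3.2 Gbit/s.
peak = bus_bandwidth_bits_per_s(32, 10e-9)
```

This is an upper bound; arbitration, address phases and wait states reduce the throughput seen by a master.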


Pipelining:

Well-known technique to improve the communication throughput

Clock frequency is limited by the critical path

Inserting an additional pipeline stage into the critical path allows a higher clock frequency

Yields a higher communication bandwidth

Since the address decoder is usually an integral part of the critical path, bus transactions in high-performance buses are executed in separate address and data stages


Burst modes:

Improve communication throughput for the linear access of subsequent addresses by a single master

Address counter is incremented automatically

Next data item is transferred with every cycle without renewed arbitration
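The gain from burst transfers can be estimated with a simple cycle count. The cost model below is an assumption made only for this sketch: every single transfer pays for arbitration, an address phase and one data cycle, while a burst pays for arbitration and the address phase once.

```python
def cycles_single(n_items, arb_cycles=1, addr_cycles=1):
    """Cycles for n separate transfers, each arbitrated and addressed anew."""
    return n_items * (arb_cycles + addr_cycles + 1)


def cycles_burst(n_items, arb_cycles=1, addr_cycles=1):
    """Cycles for one burst: a single arbitration and address phase,
    then one data item per cycle."""
    return arb_cycles + addr_cycles + n_items


single = cycles_single(8)  # 8 * 3 cycles
burst = cycles_burst(8)    # 2 + 8 cycles
```

Under these assumptions an eight-item transfer drops from 24 to 10 cycles; actual figures depend on the bus protocol.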


Unidirectional data links

Distinguish on-chip buses from most on-board buses

The latter are based on tristate data wires to maximize the utilization of expensive on-board wires


Hierarchy

Common bus systems separate high-performance from low-performance communication

Two buses with different speed characteristics


Multilayer bus architectures

Provide dedicated point-to-point connections between distinct initiators and targets to eliminate bandwidth bottlenecks

The required de-multiplexer at the initiator side is called the input stage; the respective target multiplexer is called the output stage



Crossbar bus architectures:

Provide multiple parallel resources between initiators and targets

Significantly improve the traffic throughput

Degree of parallelism may vary from partial crossbar to full crossbar architectures, where the latter provides an individual resource for each connected target

Arbitration:

Can be based on various algorithms:

Simple round robin

Fixed, configurable or dynamic priority schemes

Static or dynamic Time Division Multiple Access (TDMA)

Even more advanced algorithms are known to further improve the quality of service.
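The simplest of these schemes, round robin, can be sketched in a few lines. The class below is an illustrative behavioral model, not a description of any particular bus standard; the class and method names are invented for this example.

```python
class RoundRobinArbiter:
    """Grants the bus to requesting masters in rotating order."""

    def __init__(self, n_masters):
        self.n = n_masters
        self.last = n_masters - 1  # so that master 0 is checked first

    def grant(self, requests):
        """requests: one bool per master. Returns the granted index or None."""
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if requests[candidate]:
                self.last = candidate  # the search restarts after the winner
                return candidate
        return None


arbiter = RoundRobinArbiter(3)
grants = [arbiter.grant([True, True, True]) for _ in range(4)]
```

With all three masters requesting continuously, `grants` is `[0, 1, 2, 0]`: no master can starve, but no master can be prioritized either, which is where the priority and TDMA schemes listed above come in.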


Locking of a bus:

Locking by a single master is a necessary feature to support read-modify-write semaphore operations.

This feature is required by most microcontroller architectures that run operating systems

Split transaction buses:

Allow the master to issue multiple requests without waiting for a response, i.e. request and response are separated

Out-of-order execution:

Improves the bus throughput by reordering the sequence of responses, depending on the availability of the slave component

This feature requires advanced state machines in the master modules to cope with the non-deterministic sequence of responses



Physical Issues.

Buses are implemented using a standard-cell-based semi-custom implementation flow

Transmission wires are not physically optimized, which leads to timing closure issues and unreliable communication links.

Examples of physical effects are crosstalk noise, electromagnetic interference, and radiation-induced charge injection

Synchronous Design.

Most current bus architectures require all connected modules to be in a single clock domain.

Due to the parasitic capacitances of long bus wires, strong driver transistors are necessary to achieve timing closure

This leads to increased power dissipation

Future SoC designs will follow the Globally Asynchronous Locally Synchronous (GALS) paradigm

Chip-wide wires will span multiple clock domains, which disqualifies bus architectures as the future chip-level transport mechanism


Traffic Management.

Due to the rather simple arbitration mechanisms, shared buses provide only rudimentary traffic management support.

Since the communication pattern highly depends on the spatial and temporal execution of the application tasks, meeting the individual QoS requirements of the respective tasks, like throughput, jitter, or ordering, is very challenging.

This also causes the poor scalability of bus-based communication infrastructures, since every change in the traffic profile of one part of the application and every additional component influences the other parts and requires renewed balancing of the bus architecture.

Interoperability.

Although simple standard peripherals, like DMA, IRC, or memories are available for the respective bus systems, it is a tedious and error-prone task to adapt complex IP blocks to a specific bus architecture.

So far, efforts to create standard bus interfaces have not been successful


Alternative on-chip communication concepts

The Networks-on-Chip (NoC) design paradigm has formed to cope with the limitations of shared bus architectures

Aims to replace the current ad hoc wiring of IP blocks with a disciplined approach where full-scale on-chip networks provide communication services according to the ISO/OSI reference model

Problems in on-chip communication like signal integrity issues, link reliability, or Quality of Service (QoS) are resolved separately on the respective OSI layer


The four lower layers of the OSI reference model are of interest

Physical Layer

Deals with the electrical aspects of the data transmission

E.g. signal voltages, clock recovery, and pulse shape

Data Link Layer

Provides a reliable data transfer over the physical link.

Error detection by means of block codes, and error correction mechanisms like:

Automatic Repeat Request (ARQ)

Forward Error Correction (FEC)

Network Layer

Implements the arbitration algorithms, buffering strategies and flow-control mechanisms

So, the network layer has a dominant impact on the performance and functional behavior of the network.

Transport Layer

Transport layer protocols establish and maintain end-to-end connections.

The transport layer manages rate-based flow control, performs packet segmentation and reassembly, and ensures message ordering

This abstraction hides the topology of the network and the implementation of the links that make up the network



The challenge in the development of Network-on-Chip architectures is to combine the know-how from both the networking and VLSI domains.

The users of on-chip networks also have to understand basic networking principles:

First, the system architect has to specify design-time parameters of the selected NoC architecture, like topology, buffer sizes, and arbitration algorithm.

Later, the platform programmer has to configure runtime parameters like priorities, routing tables, and buffer management thresholds to take advantage of the capabilities


Transport layer is the first to provide services which are independent of the implementation of the network

Enables the platform programmer to develop embedded software independently from the interconnect architecture

A key ingredient in tackling the challenge of decoupling the computation from communication

Interaction with the network becomes deterministic, rather than prognostic or reactive as in today’s bus-based communication architectures

For complex multi-hop networks it is difficult to provide uniform Quality of Service (QoS) guarantees, like lower bandwidth bounds or packet ordering, for the complete on-chip traffic

To combine high resource utilization with the high QoS requirements of certain traffic types, researchers in the field of computer networks distinguish guaranteed-service and best-effort service classes


Guaranteed Services

Require resource reservation for worst-case scenarios

Can be expensive, as guaranteeing the throughput for a stream of data implies reserving bandwidth for the peak throughput, even when its average is much lower.

So, resources are often underutilized

Best-effort Services

Do not reserve any resources, and hence provide no guarantees.

Best-effort services utilize resources well, as they are typically designed for average-case scenarios instead of worst-case scenarios.

Are also easy to configure, since they require no resource reservation

Main disadvantage: unpredictability of the effective performance



The network layer is implemented by the routing nodes of the NoC.

Router-based network implementations are classified by:

Switching Mode

Routing Mode

Queuing

Congestion Control

Switching mode:

Circuit switching

Connections are set up by establishing a conceptual physical path from a source to a destination.

Links can be shared between two connections only at different points in time, by using a time-division multiplexing (TDM) scheme

Packet switching

Data is divided into packets and every packet is composed of a header and the payload.

The header contains information that is used by the router to switch the packet to the appropriate output port


Routing mode:

Applies to packet-switched networks and defines the way packets are transmitted and buffered between network nodes

Store-and-forward

An incoming packet is received and stored entirely before it is forwarded to the next node.

Wormhole routing

An incoming packet is forwarded as soon as the packet header is evaluated and the next router guarantees that the complete packet will be accepted.

In case the next hop is blocked, the packet tail potentially blocks other resources

Virtual cut-through

An incoming packet is forwarded as soon as the next router guarantees that the complete packet will be accepted.

In case the next hop is blocked, the packet tail is stored in a local buffer
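The latency difference between the routing modes can be illustrated with a back-of-the-envelope model. The sketch assumes an uncongested path, one flit per cycle per link, a one-flit header and zero router processing delay; under these assumptions wormhole routing and virtual cut-through behave identically, so a single function covers both. All names are hypothetical.

```python
def store_and_forward_latency(hops, packet_flits):
    """Every hop receives the whole packet before forwarding it."""
    return hops * packet_flits


def cut_through_latency(hops, packet_flits):
    """The header advances one hop per cycle and the remaining
    flits follow it in a pipelined fashion."""
    return hops + packet_flits - 1


saf = store_and_forward_latency(hops=4, packet_flits=16)  # 4 * 16 cycles
ct = cut_through_latency(hops=4, packet_flits=16)         # 4 + 15 cycles
```

For a 16-flit packet over 4 hops the model gives 64 cycles for store-and-forward and 19 cycles for the cut-through variants. The two cut-through modes differ only when the next hop is blocked, which this model does not cover.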



Queuing:

Buffering strategies can be distinguished by the location of the buffers inside the router.

In the following, N denotes the number of bi-directional router ports.

Input queuing:

A router has a single input queue for every incoming link.

Suffers from the so-called head-of-line blocking problem, i.e. the router utilization saturates at about 59%

Weak link utilization.

Output queuing:

There are N output queues for every outgoing link, resulting in N² queues.

Yields optimal performance

The costly N²-fold storage and wiring effort prohibits the implementation for a large number of ports

Virtual output queuing (VOQ):

Combines the advantages of input queuing and output queuing

Avoids the head-of-line blocking problem.

Each input port maintains a separate queue for each output port

The key factor in achieving high performance using VOQ switches is the scheduling algorithm
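Head-of-line blocking and its removal by virtual output queues can be shown with a one-cycle model of a 2x2 switch. This is a minimal sketch under stated assumptions: each output accepts one packet per cycle, and the VOQ scheduler is a simple greedy matcher invented for this example, not one of the scheduling algorithms used in real switches.

```python
from collections import deque


def fifo_input_queuing_cycle(input_queues):
    """One cycle with a single FIFO per input: only queue heads can be served.

    input_queues: list of deques holding destination port numbers.
    """
    taken_outputs = set()
    transfers = []
    for i, queue in enumerate(input_queues):
        if queue and queue[0] not in taken_outputs:
            out = queue.popleft()
            taken_outputs.add(out)
            transfers.append((i, out))
    return transfers


def voq_cycle(voqs):
    """One cycle with virtual output queues.

    voqs[i][o] is a deque of packets waiting at input i for output o.
    Greedy matching: each input takes the first free output it has traffic for.
    """
    taken_outputs = set()
    transfers = []
    for i, per_output in enumerate(voqs):
        for o, queue in enumerate(per_output):
            if queue and o not in taken_outputs:
                queue.popleft()
                taken_outputs.add(o)
                transfers.append((i, o))
                break
    return transfers


# Both inputs hold a packet for output 0 followed by a packet for output 1.
fifo = [deque([0, 1]), deque([0, 1])]
voq = [[deque(["p"]), deque(["p"])], [deque(["p"]), deque(["p"])]]

fifo_result = fifo_input_queuing_cycle(fifo)  # input 1 is stuck behind its head
voq_result = voq_cycle(voq)                   # input 1 uses the idle output 1
```

With FIFO input queues only one packet crosses the switch in this cycle, because the packet for the idle output 1 waits behind a blocked head. With VOQs both outputs are used.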

Congestion control:

Packet-switched networks without mechanisms for bandwidth reservation may run into resource contention and subsequent buffer overflow.

Several solutions prevent packets from entering until contention is reduced

Packet discarding: Simply drops packets in case of buffer overflow

Credit-based flow control: Packet loss is prevented in a deterministic way, either by signaling congestion via separate wires (back-pressure) or by the receiver regularly informing the sender about the available buffer space (window).

Rate-based flow control: The sender gradually adjusts the traffic generation rate in response to flow control messages from the receiver. Rate-based flow control has to be implemented by the transport layer and potentially suffers from instability due to long control loops
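The window variant of credit-based flow control can be sketched as a counter on the sender side. The class below is a hypothetical illustration: the sender starts with one credit per receiver buffer slot, spends a credit per packet, and regains credits when the receiver reports freed slots.

```python
class CreditSender:
    """Sender side of a credit-based link; never overruns the receiver."""

    def __init__(self, receiver_buffer_slots):
        self.credits = receiver_buffer_slots

    def try_send(self):
        """Send one packet if a credit is available; otherwise stall."""
        if self.credits == 0:
            return False
        self.credits -= 1
        return True

    def credit_returned(self, n=1):
        """Called when the receiver signals that n buffer slots were freed."""
        self.credits += n


sender = CreditSender(receiver_buffer_slots=2)
attempts = [sender.try_send() for _ in range(3)]  # third attempt must stall
sender.credit_returned()
resumed = sender.try_send()
```

Because the sender stalls instead of transmitting into a full buffer, no packet is ever dropped, which is the deterministic loss prevention mentioned above.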



Architectural trends

Set the stage for the discussion of appropriate system level design methodologies

Processing elements

Requirements for performance, power efficiency and flexibility

SIMD, VLIW, super-pipelining, and hardware multi-threading exploit application-inherent instruction-, data-, and task-level parallelism

Communication: Bus Architectures vs. Network-on-Chip

Virtualization of architectural resources enables ’divide-and-conquer’

Embedded control-plane processing tasks are executed in the user space of the Real Time Operating System (RTOS)

Embedded data-plane processing tasks are executed on HW multi-threaded processing elements

Global communication of control- and data-plane processing elements is performed by elaborate on-chip networks





High Level Synthesis

Low Power Design


At each level of circuit abstraction, the circuit is equivalent and performs the same target operation, but its structural components (and hence the component’s granularity) are different, and the design issues may be different



System level:

Highest level of circuit abstraction

The system is specified as processes and tasks

A mix of hardware and software.

Concerned with overall system structure and information flow.

Computer systems are described as an interconnected set of processors, memories and switches

Behavioral level, algorithmic level or high level:

Also called the instruction set level.

Focus is on the computations performed by an individual processor; i.e., the way it maps sequences of inputs to sequences of outputs

Architecture, microarchitecture, RTL:

Viewed as a set of interconnected storage elements and functional blocks.

Behavior of the system is described as a series of data transfers and transformations between the storage elements

The microarchitectural-level representation of the chip resources, such as adders and subtractors, is determined along with decisions such as single-cycle, multicycle, pipelined or superscalar implementation


Logic level:

System is described as a network of gates and flip-flops

Behavior is specified by logic equations

Circuit is represented in the form of a netlist, at which level the logic realizations of functional blocks are determined

Circuit or transistor level:

Circuit is a netlist of transistors

The main issues are decisions such as how and what types of transistors will be used (complementary CMOS, pass transistors, etc.)

Physical or layout level:

System is specified in terms of the individual transistors of which it is composed

Behavior of the system can be described in terms of the network equations

Lowest level of circuit abstraction

Chip is a sequence of layers (masks), each layer of which is composed of polygons.

It is this level that is transferred to the manufacturing process


Design automation terminology:

Optimization

Synthesis

Analysis

In circuit analysis, the behavior or characteristics of a circuit are studied

The task of synthesis is to take the specifications of the behavior required for a system and a set of constraints and goals to be satisfied, and to find a structure that implements the behavior while satisfying the goals and constraints

Behavior, structure and physical design: the 3 domains in which hardware is described

“Behavior”:

Refers to the ways in which the system or its components interact with their environment (mapping from inputs to outputs)

Interest is in what a design does, not in how it is built

“Structure”:

Refers to the set of interconnected components that constitute the system (described by a netlist)

Focus on constraints, such as area, cost and delay.

“Physical” design:

Mapping of the structure onto the technology

Ignores what the design is supposed to do and binds its structure in space or to silicon


The automatic design process of VLSI circuits is called synthesis


The system-synthesis process partitions the tasks into hardware, software and their communications

The high-level synthesis process is the translation from a behavioral description to its equivalent structural description

Logic synthesis is the process of mapping from the design at the RTL to a gate-level representation that is suitable for input to physical design


Physical design then addresses aspects of chip implementation:

Floor planning

Placement

Routing

Extraction

Performance analysis

Output of physical design is the handoff (“tapeout”) to manufacturing

Typically a GDSII stream file

Verification of correctness:

Design rules

Layout versus schematic

Constraints (timing, power, reliability, etc.)


During each phase of the synthesis process, the descriptions produced by two consecutive phases are checked for functional equivalence

A power and timing analysis can be done by using compact models at the transistor level

At the physical level, more accurate power and timing analysis is possible through the extraction of accurate parasitics


High-level synthesis is the translation process from a behavioral description to a structural description


Analogous to “compilation”, which translates a high-level language program in C/C++ to an assembly language program

HLS is also known as behavioral-level synthesis or algorithmic-level synthesis.

Constraints to be considered in HLS are:

Area

Performance

Power consumption

Reliability

Testability

Cost

HLS allows a design engineer to make decisions at an early stage of the design cycle, which helps to catch design errors early.

Typical steps involved are scheduling, allocation, binding, etc.


Advantages:

Continuous and reliable design flow

From system-level abstraction to RTL abstraction automatically without manual handling

Automatic translations from high-level specifications in the form of C or SystemC to RTL description of the circuit in the form of VHDL or Verilog.

Shorter design cycle

More automation: faster designs, lower cost

Fewer errors

Synthesis process can be verified easily, so the chances of errors will be smaller.

Correct design decisions at the higher levels of circuit abstraction can ensure that the errors are not propagated to the lower levels, which are too detailed and costly to correct

Easy and flexible to search the design space

Synthesis system can produce several designs in a short time

So, the designer has more flexibility to choose the proper design considering different trade-offs of power, leakage, area and delay.

Balanced degree of freedom for power optimization

Power and performance optimization can be performed at any level of circuit abstraction

As the level of abstraction goes lower, the complexity of the circuit increases

Additionally, the degrees of freedom, and thus power reduction opportunities decrease

Hence, high level or behavioral level is an attractive level and provides a balanced degree of freedom for design space exploration.

Documenting the design process

Automated system can track design decisions and their effects

Design debugging and continuation by third parties can be easily done

Useful for macrocell-based design and the sale of designs as intellectual property cores

Availability of circuit technology to more people

Design expertise is moved into synthesis systems

It becomes easier for a non-expert to produce a chip that meets a given set of specifications

The cost of the required manpower is reduced


The high-level synthesis process takes a behavioral description of the system, typically written in a hardware description language (HDL), as input and generates an optimized RTL description by:

Compilation

Transformation

Scheduling

Allocation

Binding

Other steps

Power optimization

Leakage optimization

Register optimization

Interconnect optimization

These take place in synthesis either sequentially or along with the fundamental steps

There is no fixed sequence for performing the various high-level synthesis tasks

They can be carried out as separate steps

Yet, because they influence each other, these tasks should be performed simultaneously for effective optimization


The behavior of a system to be synthesized is usually specified at the algorithmic level using a high-level programming language like C/C++ or a hardware description language (HDL) such as VHDL and Verilog.

The behavior of the system is then compiled into internal representations, which are usually data flow graphs (DFGs) and control flow graphs (CFGs).

Each behavioral specification is transformed into a unique graphical representation.

The DFG is a directed graph that represents data movement, whereas the CFG is a directed graph that indicates the sequence of operations.


In the transformation step, the initial DFG is transformed so that the resultant DFG is more suitable for scheduling and allocation.

These transformations include compiler-like optimizations such as dead-code elimination, common sub-expression elimination, loop unrolling, constant propagation and code motion.

In addition, some hardware-specific transformations like minimization of syntactic variances and retiming may be applied to take advantage of the associativity and commutativity of certain operations
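One of the compiler-like transformations named above, constant propagation, can be sketched on a small three-address representation. The representation (tuples of destination, operator and two operands) and the function name are assumptions made for this example; each variable is assumed to be assigned only once.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def propagate_constants(code):
    """code: list of (dest, op, a, b); operands are ints or variable names.

    Returns the remaining operations and the variables proven constant.
    """
    known = {}   # variable name -> constant value
    result = []
    for dest, op, a, b in code:
        # Replace operands that are already known to be constant.
        a = known.get(a, a) if isinstance(a, str) else a
        b = known.get(b, b) if isinstance(b, str) else b
        if isinstance(a, int) and isinstance(b, int):
            known[dest] = OPS[op](a, b)  # folded: no hardware operator needed
        else:
            result.append((dest, op, a, b))
    return result, known


code = [
    ("t1", "*", 2, 4),
    ("t2", "+", "x", "t1"),
    ("t3", "-", "t1", 3),
    ("y", "*", "t2", "t3"),
]
reduced, constants = propagate_constants(code)
```

Two of the four operations disappear: only `t2 = x + 8` and `y = t2 * 5` remain to be scheduled and bound to functional units.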


Scheduling is the process of partitioning the set of arithmetic and logical operations in the DFG into groups so that the operations in the same group can be executed concurrently, while taking into consideration possible trade-offs between the total execution cost and hardware cost.

A group of concurrent computations to be executed simultaneously is referred to as a control step.

The total number of control steps needed to execute all operations in the DFG, the minimum number of functional units of each type to be used in the design and the lifetimes of the variables generated during the computation of operations are determined in the scheduling step.
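A basic scheduling strategy is as-soon-as-possible (ASAP) scheduling, which places every operation in the earliest control step allowed by its data dependencies. The sketch below assumes an acyclic DFG given as a dictionary from each operation to its predecessors, with every operation taking one control step; the names are hypothetical. Note that ASAP reports the units needed by this particular schedule, which is not necessarily the minimum achievable with a different schedule.

```python
from collections import Counter


def asap_schedule(dfg):
    """dfg: operation -> list of operations it depends on (must be acyclic).

    Returns operation -> control step (1-based).
    """
    step = {}

    def visit(op):
        if op not in step:
            step[op] = 1 + max((visit(p) for p in dfg[op]), default=0)
        return step[op]

    for op in dfg:
        visit(op)
    return step


def units_needed(schedule, op_type):
    """Functional units of each type required by the given schedule."""
    per_step = Counter((schedule[op], op_type[op]) for op in schedule)
    need = {}
    for (_, kind), count in per_step.items():
        need[kind] = max(need.get(kind, 0), count)
    return need


# y = (a*b + c*d) - e*f
dfg = {"m1": [], "m2": [], "m3": [], "a1": ["m1", "m2"], "s1": ["a1", "m3"]}
op_type = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "add", "s1": "sub"}

schedule = asap_schedule(dfg)
control_steps = max(schedule.values())
units = units_needed(schedule, op_type)
```

Here ASAP needs three control steps and three multipliers. Delaying `m3` to step 2 would keep three steps while needing only two multipliers, which is the kind of execution-cost versus hardware-cost trade-off described above.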


Selection is the process of choosing resources from the library, which involves tradeoffs according to different features like delay, area, power and leakage.

Resource allocation is the process of determining the number of functional units of each type for performing operations, memory units (registers) for storing data values and interconnects for data transportation.

Often, the selection and allocation processes are a single task.

Allocation is further divided into sub-tasks, such as functional unit allocation, memory unit allocation and interconnect allocation.

Resource allocation and binding may share resources so that the same hardware can be used to execute different operations or so that the same register can be used to store more than one variable.


Binding or assignment is the process of assigning operations to functional units, variables to memory units and data transfers to interconnections.

Binding is further divided into several sub-tasks, such as functional unit binding, memory unit binding and interconnect binding.

Functional unit binding involves the mapping of operations in the behavioral description into a set of selected functional units.

Memory unit binding maps data carriers (constants, variables, arrays) in the behavioral description onto storage elements (read-only memories, registers, memory units) in the data path.

The interconnect binding task maps every data transfer in the behavior onto a set of interconnection units for data routing.
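Register sharing during memory unit binding can be illustrated with the classical left-edge algorithm, which packs variables with non-overlapping lifetimes into the same register. The sketch assumes each lifetime is a pair (birth step, death step), and that a register becomes free again in the step in which its variable is last read; the variable names and lifetimes are invented for this example.

```python
def left_edge_bind(lifetimes):
    """lifetimes: variable -> (birth_step, death_step).

    Returns a list of registers, each a list of the variables sharing it.
    """
    pending = sorted(lifetimes.items(), key=lambda kv: kv[1])  # by birth step
    registers = []
    while pending:
        current, free_from, rest = [], float("-inf"), []
        for var, (birth, death) in pending:
            if birth >= free_from:
                current.append(var)   # fits after the previous occupant
                free_from = death
            else:
                rest.append((var, (birth, death)))
        registers.append(current)
        pending = rest                # leftovers go to the next register
    return registers


lifetimes = {"m1": (1, 2), "m2": (1, 2), "m3": (1, 3), "a1": (2, 3)}
binding = left_edge_bind(lifetimes)
```

Four variables fit into three registers, because `a1` is born in the step in which `m1` dies and can reuse its register.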


In the output generation phase, the design output is generated.

The output should be in a form such that logic-level synthesis tools can optimize the combinational logic and layout synthesis tools can design the chip geometry.

The generated output is generally in a low-level HDL, such as structural VHDL



Data Path Synthesis

Control Synthesis

The controller is typically a finite state machine that is either microcoded or hardwired


HLS is important for several reasons:

Reduction of design cycle time

Rapid design space exploration at the higher level of abstraction

Wrong decisions are not propagated to lower levels of design abstraction

HLS involves several important steps, such as:

Scheduling

Allocation

Binding

Several graph-theoretical algorithms are available that can perform optimization while performing these tasks.

Two types:

Data path synthesis

Control synthesis

Some existing tools perform high-level synthesis explicitly, and others perform the behavioral-to-RTL compilation as an intermediate process.





Introduction to SoC Design Methodology


Design flow of integrated circuits

Application phase

Implementation phase

Both are decoupled

Application to implementation

A specification document written by:

Application team

System architecture specialist

Ad-hoc and informal approach


Problems

Ambiguity of the informal specification document leads to misinterpretations and implementation errors

Lack of reliable performance information before the implementation often causes an over- or under-provisioning of processing and communication resources

Quality of results mainly depends on the intuition and experience of the system architect

Manual creation of the verification environment requires significant effort and again represents a potential source of inconsistencies with the original design intent



Electronic System Level (ESL)

The application is jointly considered with the system architecture to find a feasible and cost-effective application-to-architecture mapping

The declared goal of ESL design is to increase the engineering productivity and quality of results during the specification of the MP-SoC platform architecture and application mapping

New design paradigm to cope with the complexity and economics of the emerging billion-transistor System-on-Chip era.

Architecture-centric definition:

We define platform-based design as the creation of a stable microprocessor-based architecture that can be rapidly extended, customized for a range of applications, and delivered to customers for quick deployment

Design-process-based definition:

The general definition of a platform is an abstraction layer in the design flow that facilitates a number of possible refinements into a subsequent abstraction layer in the design flow


Multiple, almost orthogonal phases

Functional phase

Performed by application specialists

Completely agnostic to architectural considerations.

Includes

Embedded SW development of the control-plane portion

Data-plane algorithm development

The latter is carried out using highly application-domain-specific tools and methodologies

MP-SoC platform phase

All design tasks that have to be performed under consideration of the full functional and architectural complexity of the MP-SoC platforms

Example

Specification of the system architecture

Mapping of the application onto the MP-SoC platform

Development of the hardware-dependent software layers

High-level IP creation phase

Design of processing elements (RISC, DSP, MCU, ASIPs)

On-chip interconnect technologies (buses, NoC)

Domain-specific standard I/O (PCI variants, SPIx variants, HyperTransport, I2C, FireWire, QDR, etc.)

Creation of well-defined ASIC IP blocks (e.g. an MPEG4 video codec).

Not completely orthogonal to the functional phase, since the design of application-specific processing elements and communication IP indeed depends on the considered application

Semiconductor technology and basic IP creation phase

Covers standard cells, I/O, memories and the basic technology processes supporting them.

More heterogeneous technologies, combining embedded DRAM, embedded Flash, mixed-signal BiCMOS, RF, and analog

More to do with fabrication technologies

Represent the results of the functional phase as a well defined application model as the Executable Specification of the system

System architecture needs to be defined in terms of mapping the application model to the hardware (Main Task)

Embedded SW development

Hardware-Software co-verification task: RTL is verified along with embedded software

Methodology used: Transaction Level Modeling (TLM)


Engineering of integrated circuits has always employed models on different levels of abstraction

Model: unique, idealized description of the considered system

Degree of abstraction characterizes the type of model used in the respective design phase

Goal of abstraction is to provide a description of the system that is simple enough, yet sufficiently accurate, to:

enable the necessary investigations

take design decisions

proceed to the next design phase.

Indeed, the design-flow of an embedded system can be considered as a sequence of steps which successively reduce the degree of abstraction in the system model


Functionality refers to the modeling of the system behavior On the highest level of abstraction, the

functionality is condensed to pure mathematic expressions.

Later the functionality is refined to operators,

Finally mapped to logic gates

Timing model captures the temporal properties of the system

Degree of abstraction ranges from causality of events to physical timing of transistors and wires

Data representation

On higher levels, data resolution is reduced to Tokens and Abstract Data Types (ADT)

Lower levels employ word or bit representations.

The Component granularity describes the finest resolution of the sub-blocks

First the component resolution is restricted to coarse-grain building blocks,

Finally the complete embedded system is composed of fine-grain silicon transistors.
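The refinement of functionality and data representation described above can be made concrete with a small sketch. It shows the same addition at three levels: a mathematical expression, a fixed-width operator, and a chain of logic gates. The 4-bit width and all function names are assumptions chosen for illustration only.

```python
WIDTH = 4  # assumed word width for this illustration


def add_math(a, b):
    """Highest abstraction: a pure mathematical expression on integers."""
    return a + b


def add_word(a, b, width=WIDTH):
    """Operator level: a fixed-width '+' operator that wraps on overflow."""
    return (a + b) & ((1 << width) - 1)


def full_adder(x, y, cin):
    """Gate-level building block built from XOR, AND and OR."""
    s = x ^ y ^ cin
    cout = (x & y) | (cin & (x ^ y))
    return s, cout


def add_gates(a, b, width=WIDTH):
    """Gate level: ripple-carry chain of full adders working on single bits."""
    carry, result = 0, 0
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result
```

The word-level and gate-level versions agree for every 4-bit input pair, while the mathematical version differs as soon as the sum overflows the word width. That difference is exactly the kind of detail a lower abstraction level adds.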


Creation of a system model requires:

Modeling language

Well defined execution semantics coordinating the activation of the individual blocks

Model of Computation (MoC) is composed of two parts:

The coordination language describes the basic execution semantics with respect to properties like parallelism, synchronism and reactivity, and provides the abstracted communication mechanism

The host language provides the language elements for the specification of the system models


Timed MoCs are characterized by the total temporal ordering of all occurring communication events

An example is the discrete event simulation MoC, which defines the execution semantics for HDL simulators

Further examples of timed MoCs are synchronous languages like Esterel, Lustre, or Signal, where the events of all communication signals are constrained to occur at identical time stamps

Thanks to their sound mathematical foundation, synchronous languages have gained adoption for the specification, analysis and code-generation of reactive control-dominated applications
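A minimal sketch of the discrete event MoC mentioned above: every event carries a time stamp, and the kernel always processes the earliest pending event, which yields a total temporal ordering. The class name, the delays and the event labels are hypothetical; real HDL simulators add delta cycles and signal update phases that are omitted here.

```python
import heapq


class EventQueue:
    """Minimal discrete-event kernel: events are processed in time order."""

    def __init__(self):
        self.now = 0
        self._queue = []  # heap of (time, sequence number, callback)
        self._seq = 0     # tie-breaker keeps insertion order at equal times

    def schedule(self, delay, callback):
        heapq.heappush(self._queue, (self.now + delay, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._queue:
            self.now, _, callback = heapq.heappop(self._queue)
            callback()


def simulate():
    kernel = EventQueue()
    trace = []

    def clock_edge():
        trace.append((kernel.now, "clk"))
        if kernel.now < 30:
            kernel.schedule(10, clock_edge)  # re-arm the clock

    def reset_release():
        trace.append((kernel.now, "rst_n"))

    kernel.schedule(10, clock_edge)
    kernel.schedule(5, reset_release)  # scheduled later, but occurs earlier
    kernel.run()
    return trace
```

Calling `simulate()` returns the events sorted by time stamp, regardless of the order in which they were scheduled.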


Untimed MoCs are characterized by the fact that communication events are only partially ordered

Various untimed MoCs are popular for the specification of both data- and control-dominated applications

Data-Flow MoCs are heavily employed for algorithmic modeling and analysis of signal processing applications

Communicating Sequential Processes (CSP) and Calculus for Communicating Systems (CCS) are prominent untimed MoCs which are based on sequential processes that communicate using a rendezvous communication mechanism.
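As an illustration of an untimed Data-Flow MoC, the sketch below lets actors fire whenever their input FIFOs hold enough tokens. There is no notion of time, only the partial order imposed by token availability. The actor names, token rates and the simple scheduling loop are assumptions made for this example.

```python
from collections import deque


class Actor:
    """Data-flow actor: fires when every input FIFO holds enough tokens."""

    def __init__(self, name, func, inputs, output, consume=1):
        self.name, self.func = name, func
        self.inputs, self.output, self.consume = inputs, output, consume

    def can_fire(self):
        return all(len(fifo) >= self.consume for fifo in self.inputs)

    def fire(self):
        args = [[fifo.popleft() for _ in range(self.consume)]
                for fifo in self.inputs]
        self.output.append(self.func(*args))


def run_graph(samples):
    src, mid, out = deque(samples), deque(), deque()
    # 'scale' consumes 1 token per firing, 'pair_sum' consumes 2 per firing.
    scale = Actor("scale", lambda xs: 2 * xs[0], [src], mid, consume=1)
    pair_sum = Actor("pair_sum", lambda xs: xs[0] + xs[1], [mid], out, consume=2)
    actors = [pair_sum, scale]  # the listing order is deliberately arbitrary
    fired = True
    while fired:                # keep going until no actor is enabled
        fired = False
        for actor in actors:
            if actor.can_fire():
                actor.fire()
                fired = True
    return list(out)
```

For the input `[1, 2, 3, 4]` the graph produces `[6, 14]`. The result does not depend on the order in which enabled actors are visited, which is one reason data-flow models are convenient for analysis.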


The definition of a proper MoC has long been considered to be the silver bullet for system level design, and thereby for solving the design productivity crisis

Initially, the complete system functionality is to be created using the ideal MoC, which provides highest modeling efficiency, simulation speed, and smooth IP reuse

Next, the initial specification would be automatically verified using formal verification technology and metrics like determinism, causality, dead-lock absence, consistency, completeness, and fairness. The golden system specification would then provide the foundation for an automated path to design space exploration to take functional and architectural design decisions

Finally, system level synthesis would be applied to the partitioned system specification providing an automated path to implementation.


Object Oriented Programming (OOP) is a powerful abstraction mechanism: data and functionality are partitioned and encapsulated inside classes

OOP based languages: UML, C++, or Java

Widely adopted in engineering of arbitrary SW

Gaining importance for the specification of embedded control-plane processing

OOP components interact primarily by sequentially transferring control through method calls

Sequential nature of OOP hinders the intuitive specification, analysis and refinement of the inherent parallel data-plane processing tasks


For this purpose the actor-oriented abstraction scheme has been conceived, where parallel objects interact by sending and receiving messages

Within an actor-oriented design environment, the designer can focus on the specification and analysis of the algorithmic behavior of the individual tasks whereas the communication and synchronization aspects are handled by the underlying parallel Model of Computation

SystemC allows Actor Oriented Programming


Actor-based design languages achieve high modularity in communication modeling by using the Interface Method Call (IMC) principle

The IMC mechanism is realized by a set of language elements for:

Modules

Ports

Interfaces

Channels.

Processes modeling the behavior are wrapped into modules and access communication services through ports

Available methods are

Declared in the interface specification

Implemented by the channel

Thus the access methods in an interface reflect the specialized properties of the communication style implemented by a particular channel

Actor-oriented design languages offer a generic Model of Computation, which in case of SystemC is based on an event driven simulation kernel

Channels serve as containers for communication and synchronization

The user can extend the generic MoC by creating their own methodology-specific channel library
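The roles of modules, ports, interfaces and channels can be sketched in a few lines. This is plain Python, not SystemC, and every class name here is hypothetical. The point is only to show that a module calls methods declared in an interface through a port, while the channel bound to that port supplies the implementation.

```python
from abc import ABC, abstractmethod
from collections import deque


class WriteIf(ABC):
    """Interface: declares the access method, provides no implementation."""
    @abstractmethod
    def write(self, value): ...


class ReadIf(ABC):
    @abstractmethod
    def read(self): ...


class FifoChannel(WriteIf, ReadIf):
    """Channel: implements the interfaces and holds the communication state."""
    def __init__(self):
        self._items = deque()

    def write(self, value):
        self._items.append(value)

    def read(self):
        return self._items.popleft()


class Port:
    """Port: forwards interface method calls to the bound channel."""
    def __init__(self, interface):
        self._interface, self._channel = interface, None

    def bind(self, channel):
        if not isinstance(channel, self._interface):
            raise TypeError("channel does not implement the required interface")
        self._channel = channel

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._channel, name)


class Producer:
    """Module: its process sees only the port, never the channel type."""
    def __init__(self):
        self.out = Port(WriteIf)

    def process(self):
        for value in (1, 2, 3):
            self.out.write(value)


class Consumer:
    def __init__(self):
        self.inp = Port(ReadIf)
        self.received = []

    def process(self):
        for _ in range(3):
            self.received.append(self.inp.read())


def run_demo():
    channel = FifoChannel()
    producer, consumer = Producer(), Consumer()
    producer.out.bind(channel)
    consumer.inp.bind(channel)
    producer.process()
    consumer.process()
    return consumer.received
```

Replacing `FifoChannel` with another channel that implements the same interfaces would change the communication behavior without touching `Producer` or `Consumer`, which is the modularity the IMC principle aims for.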


Challenge of System Level Design

The architecture definition and application mapping have to be considered jointly by taking the full functional and architectural complexity into account

In case of a fixed target platform, SLD is reduced to the application mapping task, which is also called the partitioning of the application


Orthogonalization of concerns with respect to all modeling attributes generally enables a divide-and-conquer approach to System Level Design

Separation of interfaces and behavior according to the interface based design paradigm fosters successive communication and structural refinement as well as IP reuse

High modeling efficiency and simulation speed is mandatory to handle the high complexity of SoC designs

Incorporation of hardware specific concepts like timing, reactivity, parallelism, and determinism to express the impact of the platform architecture

Incorporation of software specific concepts like Object Oriented Programming, Operating System (OS) encapsulation, Inter Process Communication (IPC), process concurrency, as well as the creation, mutual preemption, and termination of processes to enable smooth integration of the embedded Software part.

Support for validation and verification: first, gain evidence on the highest possible level of abstraction that the correct system is being developed and that all performance and cost requirements are met (validation). Later, the validated specification should be reused as a golden reference model for the subsequent refinement, IP integration and implementation steps (verification).

Seamless transition between design phases and abstraction levels from system to gates to avoid long iteration cycles caused by gaps in the design flow.


Question Remains - - - How to do it???


More design aspects


HW/SW Co-simulation has been recognized as a necessary ingredient for HW/SW Co-design.

First HW/SW Co-simulation prototypes linked Hardware Description Language (HDL) simulators to an ISS (Instruction Set Simulators) executing the Software part.

Soon, HDL/ISS Co-simulation environments became commercially available and are still widely employed.

This HDL/ISS approach is severely limited by the slow simulation speed of the HDL simulator, especially in case of large systems with several ISSes and significant hardware portions.

The concept of flexible hardware abstraction levels has been developed to address this: accuracy can be traded against simulation speed.


Maximum simulation speed can be achieved by using compiled ISS technology together with highly abstract functional SystemC models of the hardware part
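To make the pairing of an ISS with an abstract functional hardware model more tangible, here is a toy sketch. The accumulator instruction set, the address map and the `Timer` peripheral are all made up for illustration. The simulator is interpretive rather than compiled, so it shows the structure of such a co-simulation, not its speed.

```python
class Timer:
    """Purely functional hardware model: no pins, no clock, just behaviour."""

    def __init__(self):
        self.count = 0

    def read(self, offset):
        return self.count

    def write(self, offset, value):
        self.count = value


class Iss:
    """Instruction-accurate simulator for a made-up accumulator machine."""

    def __init__(self, program, memory_map):
        self.program, self.memory_map = program, memory_map
        self.acc, self.pc, self.executed = 0, 0, 0

    def _device(self, addr):
        # Decode the address against the (base, size, device) entries.
        for base, size, device in self.memory_map:
            if base <= addr < base + size:
                return device, addr - base
        raise ValueError(f"unmapped address {addr:#x}")

    def run(self):
        while self.pc < len(self.program):
            op, arg = self.program[self.pc]
            self.pc += 1
            self.executed += 1
            if op == "LOADI":
                self.acc = arg
            elif op == "ADDI":
                self.acc += arg
            elif op == "STORE":
                device, offset = self._device(arg)
                device.write(offset, self.acc)
            elif op == "LOAD":
                device, offset = self._device(arg)
                self.acc = device.read(offset)
            else:
                raise ValueError(f"unknown opcode {op}")


def run_demo():
    timer = Timer()
    program = [("LOADI", 40), ("ADDI", 2), ("STORE", 0x1000),
               ("LOADI", 0), ("LOAD", 0x1000)]
    cpu = Iss(program, [(0x1000, 4, timer)])
    cpu.run()
    return cpu.acc, timer.count, cpu.executed
```

The software running on the ISS reaches the hardware model through ordinary method calls, with no signal-level activity to simulate. This is where the speed advantage over HDL/ISS co-simulation comes from.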


The original goal of HW/SW Co-design was to reach the same degree of tool automation known from RTL synthesis, i.e. a formalized system specification is automatically partitioned and synthesized to the optimal target architecture

Automated HW/SW partitioning and System Synthesis have never gained industrial relevance

Partitioning decision metric is restricted to worst case execution time,

Other important metrics like average performance, cost, and power dissipation are not taken into account.

Even the worst case execution time proved to be hard to estimate in the general case of parallel, data dependent, and interleaved software execution

HW/SW partitioning and automated synthesis are still not recognized as a dominant issue

System architects are rather interested in the performance impact of a specific target architecture

To partly automate this mapping, Communication Synthesis and HW/SW Interface Synthesis emerged as new branches of HW/SW Co-design


Techniques for the analysis of communication requirements and synthesis of the communication architecture

As of today, Communication Analysis and Synthesis techniques need further advancement to cope with emerging Network-on-Chip architectures.

One attempt is to instantiate the NoC library elements (routers, network interfaces, links) from a high-level view of the SoC floorplan

Selection of the actual library elements can be done in different ways:

In an application-centric approach, the network topology can be generated from a communication graph of the application

In an architecture-centric approach, the communication architecture can be refined from an abstract channel view via a network topology view towards a micro-architecture view.

So far the analysis of Network on Chip architectures is performed using handcrafted simulation models, which are mostly based on SystemC

The absence of standardized APIs, abstraction levels and modeling frameworks beyond the plain SystemC language so far hinders the creation of interoperable IP models for NoC architectures.

Some of the current projects working on a unified modeling environment for the exploration of NoC architectures are discussed in section 5.3.3 below.


In HW/SW Interface Synthesis, the designer decides on the partitioning and architecture mapping

The realization of these decisions is supported by automating the tedious task of generating the required Software driver functions as well as the Hardware glue-logic

Recently the technology has been ported to SystemC



MP-SoC platform phase is concerned with:

System architecture specification

Application mapping

Abstraction concepts on this level have to support the joint consideration of application and architecture

High level of detail inherent to Register Transfer Level (RTL) implementation models prohibits the investigation and optimization across heterogeneous communication and processing elements

Significant research has been spent on the definition of the appropriate System Level Design language.

Today SystemC is generally considered as the standard language for all kinds of SLD tasks.


SystemC was initially conceived to replace VHDL and Verilog as a Hardware Description Language

For this reason it naturally provides all hardware specific concepts e.g., time, parallelism, and hierarchy

With version 2.0 SystemC has been thoroughly revised to become a fully elaborated actor oriented design language

The incorporated Interface Method Call (IMC) principle enables a clean separation of interfaces and behavior as well as orthogonalization of further modeling attributes

All kinds of methodology and application domain specific Models of Computation (MoC) can be implemented on top of the generic event-driven SystemC simulator

SystemC 2.0 enables a smooth transition from functional phase to the MP-SoC platform phase, e.g. hybrid simulation of an architecture model in the context of an algorithmic Data-Flow model


Since SystemC is a native C++ library, it inherently supports Object Oriented Programming

Final version 2.1 of the language has become an official IEEE standard

Development of the Transaction Level Modeling (TLM) kit

Synthesizable subset of SystemC



The characteristic property of TLM:

Pin-level communication interface of RTL models is replaced by a set of interface methods.

This IMC based communication mechanism is provided by all actor-oriented specification languages


SystemC based TLM has demonstrated the potential in terms of increased simulation speed and modeling efficiency

The basic TLM API consists of a bidirectional transport and a set of unidirectional put and get interfaces

The bidirectional transport has blocking synchronization: the implementation of the interface is allowed to call wait(.)

The unidirectional interfaces are available in a blocking and a non-blocking version

These interfaces can be seen as a foundation layer for the creation of more advanced TLM interfaces, which serve a specific methodology or model a specific communication protocol
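The shape of these interfaces can be sketched as follows. This is a plain Python approximation, not the SystemC TLM library. The method names loosely follow the put, get and transport terminology above, while the request tuple layout and the `Fifo` and `Memory` classes are assumptions. Only the non-blocking unidirectional variants are shown, because blocking variants would need a simulation kernel that can suspend the calling process.

```python
from collections import deque


class Fifo:
    """Unidirectional channel offering non-blocking put and get."""

    def __init__(self, capacity):
        self._items, self._capacity = deque(), capacity

    def nb_put(self, item):
        if len(self._items) >= self._capacity:
            return False  # full: the caller has to retry later
        self._items.append(item)
        return True

    def nb_get(self):
        if not self._items:
            return None   # empty: nothing to deliver
        return self._items.popleft()


class Memory:
    """Target of the bidirectional interface: one call carries the request
    in and the response out."""

    def __init__(self):
        self._data = {}

    def transport(self, request):
        kind, addr, value = request  # assumed request layout
        if kind == "write":
            self._data[addr] = value
            return ("ok", None)
        return ("ok", self._data.get(addr, 0))
```

A short usage example: `Memory().transport(("read", 0x10, None))` returns `("ok", 0)` for an address that was never written, and a `Fifo(2)` accepts two `nb_put` calls before the third returns `False`.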


The two cycle-level TLM layers:

Bus Accurate (BA)

Cycle Callable (CC)

These levels are particularly suitable to create a cycle-accurate prototype of the system architecture

The (usually cycle-accurate) Instruction Set Simulators (ISS) of the programmable architectures are connected to cycle- and bit-accurate models of memories, communication resources and peripherals


BA and CC difference:

BA captures a transaction within a single method call,

CC models provide separate methods for every phase of a transaction.
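The difference between the two styles shows up directly in the shape of the API. In the sketch below, the phase names `send_address` and `receive_data` are hypothetical and stand in for whatever phases a concrete bus protocol defines.

```python
class BusAccurateMemory:
    """BA style: the whole transaction is captured by one method call."""

    def __init__(self):
        self.data = {}

    def read(self, addr):
        return self.data.get(addr, 0)


class CycleCallableMemory:
    """CC style: one method per transaction phase, called in separate cycles."""

    def __init__(self):
        self.data = {}
        self._pending = None

    def send_address(self, addr):
        """Address phase: rejected while a transaction is outstanding."""
        if self._pending is not None:
            return False
        self._pending = addr
        return True

    def receive_data(self):
        """Data phase: completes the outstanding transaction, if any."""
        if self._pending is None:
            return None
        addr, self._pending = self._pending, None
        return self.data.get(addr, 0)
```

Both models return the same data for the same address. The CC model additionally exposes when each phase happens, which is what allows an initiator to interleave or stall phases cycle by cycle.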

The Programmer’s View (PV) abstraction levels address early integration of (usually instruction accurate) ISSes for SW development purposes

PV provides a bit and address-map accurate view of the MP-SoC architecture context for the programmable processing elements

PV is based on the bidirectional blocking transport API


The Open Core Protocol International Partnership (OCP-IP) is getting a lot of traction throughout the industry

OCP-IP provides a highly configurable SoC protocol, and its System Level Design working group has worked on Transaction Level Modeling from the early days


Lowest level: Transaction Layer 1 (TL1)

provides a fully cycle accurate model of the OCP protocol

Fully aligned with the CC abstraction level from OSCI.

Next higher level: Transaction Layer 2 (TL2)

Represents basically a cycle-approximate abstraction of the OCP protocol.

The API contains a large number of OCP specific features, such as thread-busy, handshake timing, or sideband signals.

The timing is not cycle accurate, but can be annotated to a near-cycle accurate level

Highest Level: Transaction Layer 3 (TL3)

protocol agnostic subset of TL2

API is limited to a concise set of primitives,

Models timing-approximate on-chip communication


PV TLM platforms for early SW development as well as cycle-level TLM for HW/SW and TLM/RTL co-verification are successfully deployed throughout the industry

However, both use-cases solve only parts of the challenges during the MP-SoC design phase

Especially the architecture definition and task partitioning are not adequately addressed

PV platforms simulate very fast and are well suited for SW development

Unfortunately they do not contain sufficient timing information for architectural investigations

The blocking semantics of the underlying bidirectional transport API hinders the smooth annotation of further timing information


Cycle accurate models of the SoC platform are too detailed and too slow for architecture definition and task partitioning

First, the effort to create such a cycle-accurate model of the complete platform is way too high to allow for the investigation of a large number of architecture and application mapping alternatives

Second, the achievable simulation speed, in the order of 100k cycles per second, is not sufficient for the analysis of a large number of design parameter choices

As a result, the exploration of broad design spaces is still a cumbersome process in cycle-level TLM based design flows

Cycle-level TLM communication models have architecture specific interfaces.

Thus, every time the designer wants to explore a new communication architecture, the interface of the connected functional models has to be changed
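A quick back-of-the-envelope calculation shows why 100k cycles per second limits exploration. The 200 MHz platform clock, the one second of simulated time and the 50 design points are assumed values picked purely for illustration. Only the simulation speed is the figure quoted above.

```python
def wall_clock_seconds(simulated_seconds, clock_hz, sim_cycles_per_second):
    """Host time needed to simulate a given stretch of target time."""
    cycles = simulated_seconds * clock_hz
    return cycles / sim_cycles_per_second


SIM_SPEED = 100_000     # cycles per second, the figure quoted above
CLOCK_HZ = 200_000_000  # assumed 200 MHz platform clock (hypothetical)

one_run = wall_clock_seconds(1.0, CLOCK_HZ, SIM_SPEED)  # seconds per design point
sweep_hours = 50 * one_run / 3600                       # 50 assumed design points
```

Under these assumptions a single run takes 2000 seconds of host time, and sweeping 50 alternatives takes roughly 28 hours, before counting the effort to build each cycle-accurate model.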


For this reason the Design Space Exploration framework deploys a generic synchronization interface, which provides the same primitives as the newly standardized OCP TL3 API

Obviously, the TL3 API presents the best fit for this purpose

It is compliant with the OSCI TLM standard

Additionally, it is of reasonable complexity, and yet offers sufficient expressiveness to meet the accuracy requirements for design space exploration

By deploying SystemC based Transaction Level Modeling the framework is nicely integrated into the flourishing ESL ecosystem.

This method is interoperable with the PV and cycle-accurate modeling methodologies and can benefit from the commercial tool support, available IP models, and established ESL design methodologies


Component Based Design: founded on the assumption that the processing elements and communication templates are available IP blocks

Communication Based Design: envisions MP-SoC platform design as a composition of reusable IP blocks

Different from Component Based Design

Omits the consideration of processing elements

Is exclusively focused on the conceptualization and implementation of the communication architecture.

Communication Based Design can be seen as the corresponding design paradigm to match emerging NoC architectures.

Design Space Exploration (DSE) Environment

The goal is to take early design decisions with respect to system architecture and application mapping on the basis of an abstract performance model.

The embedded application needs to be modeled together with the MP-SoC architecture at a high level of abstraction


Introduction to SoC

Design Space Exploration (DSE)

Methodology


Ultimate goal is to meet the System Level Design requirements as specified and to cope with the full architectural complexity of emerging MP-SoC architectures


MP-SoC Framework follows the y-chart principle

Set of functional application models is merged with a set of architecture models in a dedicated mapping step

Developed embodiment of the y-chart principle is called Virtual Architecture Mapping (VAM), which comprises:

Well defined abstraction level above cycle-level TLM for efficient modeling of embedded applications

Set of generic, parameterizable architecture models, which capture the notion of shared and resource limited architectural fabrics for communication and computation

Rigorous definition of a timing model, that embodies the performance of a selected application-architecture-mapping

MP-SoC simulation framework featuring a declarative mapping mechanism to minimize turn-around times during the iterative architecture exploration cycle

Comprehensive set of analysis tools for functional and performance validation
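The y-chart idea of keeping application, architecture and mapping separate can be sketched with three dictionaries and one evaluation function. This is not the VAM framework itself. The task names, workloads and resource speeds are invented, and the timing model (tasks on a shared resource are simply serialized, with no dependencies or communication cost) is a deliberate simplification.

```python
def evaluate(tasks, resources, mapping):
    """Return the busy time per resource and the resulting makespan.

    tasks:     {task name: workload in abstract operations}
    resources: {resource name: operations per time unit}
    mapping:   {task name: resource name}, the declarative mapping step
    """
    busy = {name: 0.0 for name in resources}
    for task, workload in tasks.items():
        resource = mapping[task]
        # Tasks sharing a resource are serialized, so their times add up.
        busy[resource] += workload / resources[resource]
    return busy, max(busy.values())


# Application model (hypothetical workloads)
TASKS = {"fft": 400, "filter": 200, "control": 50}
# Architecture model (hypothetical processing speeds)
RESOURCES = {"dsp": 4.0, "risc": 1.0}
# Two mapping alternatives to compare
MAPPING_A = {"fft": "dsp", "filter": "dsp", "control": "risc"}
MAPPING_B = {"fft": "dsp", "filter": "risc", "control": "risc"}
```

Evaluating both mappings gives a makespan of 150 time units for `MAPPING_A` and 250 for `MAPPING_B`. Because the mapping is just data, trying another alternative requires no change to the application or architecture models, which is the short turn-around the declarative mapping mechanism is after.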


