
CONTREX/OFFIS/R/D3.1.3 Public Extra-functional property models (final)

Page 1

Public

FP7-ICT-2013- 10 (611146) CONTREX

Design of embedded mixed-criticality CONTRol systems under consideration of EXtra-functional

properties

Project Duration 2013-10-01 – 2016-09-30 Type IP

WP no. Deliverable no. Lead participant

WP3 D3.1.3 OFFIS

Extra-functional property models (final)

Prepared by OFFIS, POLITO, POLIMI, KTH, DOCEA, GMV, UC, EDALab, IX

Issued by Ralph Görgen, Kim Grüttner (OFFIS)

Document Number/Rev. CONTREX/OFFIS/R/D3.1.3/1.3

Classification CONTREX Public

Submission Date 2016-09-30

Due Date 2016-09-30

Project co-funded by the European Commission within the Seventh Framework Programme (2007-2013)

© Copyright 2016 OFFIS e.V., STMicroelectronics srl., GMV Aerospace and Defence SA, Vodafone Automotive SpA, Eurotech SPA, Intecs SPA, iXtronics GmbH, EDALab srl, Docea Power, Politecnico di Milano, Politecnico di Torino, Universidad de Cantabria, Kungliga Tekniska Hoegskolan, European Electronic Chips & Systems design Initiative, ST-Polito Societa’ consortile a r.l., Intel Corporation SAS.

This document may be copied freely for use in the public domain. Sections of it may be copied provided that acknowledgement is given of this original work. No responsibility is assumed by CONTREX or its members for any application or design, nor for any infringements of patents or rights of others which may result from the use of this document.


History of Changes

ED. REV. DATE PAGES REASON FOR CHANGES

OFFIS 0.1 2016-08-11 87 Initial document (based on D3.1.2)

OFFIS 0.2 2016-08-25 88 OFFIS contribution

OFFIS 0.3 2016-09-19 88 Update OFFIS contribution

UC 0.4 2016-09-21 94 Update UC contribution

PoliTo 0.5 2016-09-22 95 Update PoliTo contribution

POLIMI 0.6 2016-09-26 103 Update POLIMI contribution

INTEL 0.7 2016-09-27 107 Update Intel contribution

OFFIS 1.0 2016-09-30 106 Clean-up

IX 1.2 2016-09-30 106 Update IX Contribution

OFFIS 1.3 2016-09-30 106 Final version


Changelog

2016-08-16: Added graphic to 4.1.3.2 showing the preprocessing steps for automatic synthesis and adapted the text accordingly. Typo fixes in some parts.

2016-09-14: Added more details about implementation of Timing Model in 4.1.2

2016-09-19: Update of ZYNQ Floorplan and Package model in Sec 4.1.4

2016-09-21: Update of the VIPPE improvements and extensions during the last period of CONTREX (kernel refactoring for efficient cache estimation, several cache levels, interface with SystemC)

2016-09-22: Update of “Modelling of Power and Energy for Batteries” under Sec 3.2.6

2016-09-26: Update of Sec 4.2

2016-09-27: Update of Intel sections in 3.2.2, 3.3, 4.1 and 4.2


Contents

1 Introduction ........................................................................................................................ 5

2 Interdependencies of extra-functional properties as a motivation for their consideration during the design ........................................................................................................................ 6

3 State of the art and newly developed quantitative analysis methods of extra-functional properties .................................................................................................................................... 8

3.1 Modelling of timing ..................................................................................................... 8

3.2 Modelling of power and energy ................................................................................. 10

3.3 Modelling of thermal behaviour ................................................................................ 55

4 Description of integrated modelling approaches .............................................................. 60

4.1 Extra-functional property estimation to be applied in UC1 and UC3 ....................... 60

4.2 Extra-functional property estimation to be applied in UC2 ...................................... 84

5 Conclusions .................................................................................................................... 101

6 References ...................................................................................................................... 102


1 Introduction

In contrast to the top-down system modelling targeted in WP2, WP3 has the goal to define and develop models, methods and tools supporting the design, analysis and optimization of the node-level execution platform.

As a part of WP3, this deliverable describes the state of the art concerning bottom-up quantitative modelling of extra-functional properties and provides the final description of the modelling approaches that have been developed, combined and applied. It is the final deliverable in Task 3.1, Modelling of execution platform's extra-functional properties.

The following Section 2 motivates the importance of considering extra-functional properties during the design. It summarizes the interdependencies and correlations that exist among the properties.

Section 3 describes the state-of-the-art and newly developed bottom-up quantitative modelling of extra-functional properties. It focuses on the following properties which have been identified to be relevant for the design of mixed-criticality systems: timing, energy, and temperature. These properties have been covered by the project in the different use cases.

Based on this, Section 4 describes the modelling techniques developed in the CONTREX project for the estimation of extra-functional properties and their integration within the use cases and for the different hardware platforms. The flows described here are instances of the general CONTREX system modelling methodology described in D2.2.2.


2 Interdependencies of extra-functional properties as a motivation for their consideration during the design

The main focus of CONTREX and one of its distinguishing features with respect to other mixed-criticality projects is the consideration of extra-functional properties. This section motivates the need for the consideration of extra-functional properties as a result of the strong coupling among the main properties, i.e., time, power, temperature, and degradation.

While the most prominent coupling is the electro-thermal one between static power consumption and temperature, there are many others. In general, positive and negative correlations can be distinguished. A positive correlation means that the higher the source parameter is, the larger the target parameter will be; for example, the further the degradation of the chip has progressed, the slower the circuit becomes and thus the larger the delay. A negative correlation represents a "the more – the less" dependency; for example, the higher the supply voltage is, the smaller the delay.

Figure 2-1 visualizes all main positive and negative correlations among the extra-functional properties timing, power (separated into different static/leakage power contributions as well as dynamic power and short circuit power), temperature, and degradation as well as some key technology parameters such as the gate length or oxide thickness and other design parameters such as the supply voltage and body-biasing voltage.

Furthermore, the coupling can be considered from a temporal and a spatial perspective. From a temporal perspective, a power and temperature estimation can focus on seconds or minutes of operation to cover average and peak temperature and power values; in contrast, degradation only becomes visible after months and years of operation. From a spatial granularity viewpoint, analysing the overall system at a coarse-grained level is sufficient for power and temperature to reflect electro-thermal coupling, boundary conditions, cooling capabilities, and energy budgets, whereas degradation takes place at the transistor level. Nevertheless, a degradation analysis

Figure 2-1: Parameter dependencies


depends on an accurate temperature prediction. In order to explore the design space and meet all top-level requirements, every analysis of these extra-functional properties should provide fast estimation results. A scalable level of granularity is therefore crucial.

The delay/timing, the power consumption with its different causes, the temperature, and the degradation are highly coupled, and estimating them in isolation may reduce estimation accuracy. Furthermore, the coupling may also have an impact on the functionality; for instance, the functional operation may be affected by thermally induced throttling or by a reduced battery capacity.

The unavailability of processing elements due to an artificial but necessary power-down, as a result of power limitations or a temperature problem, is also known as Dark Silicon. Figure 2-2 is taken from [5] and postulates the so-called dark silicon gap, which will further increase due to the projected device scaling, core scaling, and multicore scaling. In other words, dark silicon refers to those parts of a chip that cannot be powered at a certain point in time because the remaining part of the chip already exhausts the power budget.

Figure 2-2: The impact of dark silicon, taken from [5]

Against this background, it is obvious that in mixed-criticality applications it must be ensured that non-critical applications cannot influence or disturb the execution of the critical tasks, neither through direct resource conflicts nor through indirect coupling effects via extra-functional properties.


3 State of the art and newly developed quantitative analysis methods of extra-functional properties

This section summarizes the existing state of the art as well as newly developed modelling techniques and tools for the consideration of extra-functional properties such as timing, power/energy, thermal behaviour, and reliability/degradation. The overview does not aim to be complete; rather, it covers the relevant models, concepts and tools for obtaining quantitative figures for the extra-functional properties addressed in the CONTREX project.

3.1 Modelling of timing

Many embedded systems are real-time systems, i.e. they operate under timing constraints. Considering the time-related metrics relevant for the analysis and design of a mixed-criticality embedded system (MCeS) requires considering the two classical types of real-time systems, which are converging in an MCeS.

A wide spectrum of systems can be considered soft real-time systems. Soft real-time systems may occasionally violate a deadline or other requirements associated with time-related constraints. Throughput and latency are the relevant metrics in these types of systems. An example regarding throughput is HDTV, where a clear time-related constraint is the need to refresh a frame at a 100 Hz rate. However, the user will not notice a degraded viewing experience if a refresh is occasionally delayed or missed, for instance once every 1E6 frames (i.e., roughly once every 3 h). Latency is also relevant: in the TV example, the latency between channel zapping and the moment the first frame of the selected channel is shown on the screen matters for the user experience.
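The "once every 3 h" figure follows directly from the refresh rate; a quick arithmetic check (illustrative only):

```python
# At a 100 Hz refresh rate, tolerating one missed refresh per 10^6
# frames corresponds to one miss every few hours of viewing.
REFRESH_HZ = 100
FRAMES_PER_MISS = 1_000_000

seconds_per_miss = FRAMES_PER_MISS / REFRESH_HZ  # 10 000 s
hours_per_miss = seconds_per_miss / 3600.0       # about 2.8 h, i.e. roughly 3 h
```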

Moreover, soft real-time systems have clear needs for cost-effective designs that also mind other extra-functional properties, like power consumption, temperature, etc., and many of these extra-functional properties are in turn related to metrics with a close relation to time. Specifically, we are referring to metrics like number of executions, cache misses, number of bus accesses, utilization (of CPUs, of buses), etc.

While in soft real-time system design there is less pressure on the strict fulfilment of time-related constraints (compared to hard real-time system design), the search for cost-effective, power-efficient designs has led to highly complex architectures that optimize the average or expected use cases. This has driven the development of different simulation-based technologies, i.e. based on instruction-set simulators (e.g., RealView ARMulator [42]), on binary translation (e.g., QEMU [43], OVP [44]), and on native simulation (e.g., SCoPE or its successor VIPPE [45]), capable of providing estimations of these performance metrics and enabling a trade-off between estimation (simulation) speed and accuracy. In all these performance estimation approaches, the ability to estimate, with sufficient accuracy and with sufficient speed (depending on the design space explored), the complex implementation architectures that can be involved in an actual implementation of a soft real-time system is crucial.

Table 1 summarizes the information that the VIPPE tool is capable of reporting in its current state. In CONTREX it supports the estimation of extra-functional performance properties based on native estimation technology (thus favouring speed over accuracy, e.g. to enable fast DSE flows). A main reported property is the thread time (i.e. the time consumed by each thread), since it is directly associated with the application, and thus with the time-related metrics which can be


defined relying on the application, e.g. throughputs and end-to-end latencies. The rest of the metrics reflect a breakdown of relevant activities within the different layers of the system, which overall contribute to the reported time. They not only give hints on the causes of the time performance of a given system configuration, but are also the basis for reporting other extra-functional properties, such as energy and power consumption.

Table 1. Information currently reported by VIPPE.

Per Thread                   Per CPU                      Bus info
Time                         Utilization                  Congestion
Executed instructions        Executed instructions        Accesses (per CPU)
Instruction cache accesses   Instruction cache accesses
Instruction cache misses     Instruction cache misses
Data cache accesses          Data cache accesses
Data cache misses            Data cache misses
Memory accesses              Memory accesses

On the other hand, there are many safety-critical applications with hard real-time requirements, where a single deadline violation may have disastrous consequences. Thus, timing, and in particular the worst-case execution time (WCET), is a critical property within CONTREX and will be used as a key property in the analytical design space exploration phase (Task 2.3).

Unfortunately, it is not easy to determine the WCET for today's embedded microprocessor platforms. Advanced embedded processors have features like caches or pipelines that make it very hard to determine the WCET even for a single processor. Tools like aiT [6] from AbsInt can make good predictions, but they operate not on the source code but on the generated binary executables; thus, if the source code or the compiler is changed, the WCET analysis has to be redone. The academic SWEET tool [7] from Mälardalen University, in contrast, analyses the high-level source code but cannot give absolute guarantees. WCET analysis is still a very important research topic, and [8] offers an excellent survey. Another, more practical possibility is to determine the WCET by measurement; this, however, has the disadvantage that the longest observed execution time may still fall short of the theoretical WCET.
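The measurement-based approach can be sketched as high-water-mark timing with a heuristic safety margin; both the number of runs and the margin value below are assumptions, not guarantees:

```python
import time

def measured_wcet(task, runs=1000, margin=1.2):
    """Run `task` repeatedly and keep the longest observed execution
    time, inflated by a heuristic safety margin.  As noted in the
    text, the result is only a lower bound on the true WCET: the
    worst-case path may simply never have been exercised."""
    worst = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        task()
        elapsed = time.perf_counter() - start
        worst = max(worst, elapsed)
    return worst * margin
```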

An inherent problem in determining the WCET lies in current embedded processor architectures, which have been optimized for average-case performance and whose behaviour is extremely difficult to predict. In recent years, academia has proposed predictable processor architectures like the precision-timed architecture [9], the JOP processor [10] or the PATMOS architecture [11], which can give guarantees. The advent of multicores in the embedded systems domain has further complicated WCET analysis, since the execution of a program on a shared-memory multiprocessor cannot be analysed independently,


but is influenced by the memory accesses of the programs running on the other cores. Here, predictable bus architectures, such as a time-division multiplexed bus, can provide predictability.

The CONTREX project does not deal with WCET analysis itself; rather, it uses the results of the WCET research community. However, in order to make the design space exploration of Task 2.3 predictable, CONTREX will validate its design space exploration methods on predictable architectures.

3.2 Modelling of power and energy

3.2.1 Steady state power estimation for complex SoCs

Estimating the power consumption or energy demand of a complex integrated SoC, potentially consisting of heterogeneous computing elements, has been the target of many past research activities and remains a challenging task, often leading to constrained solutions with a limited spectrum of applicability. While pure datasheets are sufficient when only worst-case considerations are of interest, they can reflect neither the numerous parameters that impact the overall power consumption nor the many configurations in which the SoC can operate.

Within CONTREX, the Xilinx Zynq platform is used in Use Cases 1 and 3. This execution platform is described in D3.2.1 [2]. For this SoC, as for all other Xilinx SoCs, Xilinx offers an Excel-based power calculation tool called Xilinx Power Estimator (XPE) [1]. Figure 3-1 shows the overview page of the complex and detailed spreadsheet. Besides the general settings, the spreadsheet includes several dedicated estimation sheets for configuring the processing system (ARM cores) and the programmable logic (FPGA part). Furthermore, it covers the clock tree, I/O, DSPs, and the RAM. For the latter, Figure 3-2 shows the estimation sheet.

The main drawback of this spreadsheet is that it can only produce steady-state power estimates and cannot provide transient traces by default. This is because the sheet requires average utilization data for the ARM cores, the memory, and the communication infrastructure. In order to cover the transient coupling in a closed-loop simulation, as indicated in Section 2, the sheet needs to be updated frequently with the current utilization rates.
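Such a closed-loop update can be mimicked with a simple utilization-driven model that is re-evaluated per time window; the coefficients in the following sketch are purely illustrative placeholders, not actual XPE values:

```python
def steady_state_power(util_cpu, util_mem, util_bus):
    """Steady-state power as a linear function of average utilization,
    in the spirit of a spreadsheet model.  All coefficients are
    illustrative placeholders (watts), not Xilinx XPE data."""
    P_STATIC = 0.35                                      # always-on leakage
    P_CPU_MAX, P_MEM_MAX, P_BUS_MAX = 1.20, 0.40, 0.15   # at 100 % load
    return (P_STATIC
            + util_cpu * P_CPU_MAX
            + util_mem * P_MEM_MAX
            + util_bus * P_BUS_MAX)

# Closed-loop use: re-evaluate whenever the simulation reports fresh
# utilization figures for the current time window.
window_utilizations = [(0.2, 0.1, 0.05), (0.9, 0.6, 0.30)]
power_trace = [steady_state_power(*u) for u in window_utilizations]
```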


Figure 3-1: Xilinx Power Estimation overview page

Figure 3-2: Block RAM Power Estimation within XPE

3.2.2 Architectural level power modelling and estimation framework

To overcome the limitations of the spreadsheet approach, one solution is to use a dedicated framework to create a power model of the SoC architecture and enable dynamic power estimation using timed stimuli. By accounting for the major power-consuming components (not only digital but also analogue and mixed-signal components) and for all power consumption mechanisms in a design (e.g. dynamic power and leakage power, including its dependence on temperature, and power losses in regulators), an architectural power model can give SoC designers and architects the insight into the SoC power behaviour required to assess whether their products will meet the defined power budget requirements. This process is called power budget tracking.


Figure 3-3: Tracking power budget

As can be seen in Figure 3-3, power budget tracking applies not only at the early design stage, but throughout the whole design flow. During the pre-design stage, power tracking consists of establishing a power budget prediction matching the architecture definition, after several architecture variants have been tried out and compared through architectural exploration. During the design and implementation stage, it consists of detecting as early as possible any unexpected deviation (or regression) of the power budget from the target determined at design project start. Finally, during the post-silicon stage (once engineering circuit samples are back from the fab), it consists of correlating measurements performed on real silicon with the estimates provided by the architectural model, in order to detect any issues in the actual circuit.

Beyond estimating the power budget (and thus providing data to help assess the expected lifetime of battery-operated devices), an architectural power model can serve a variety of other purposes:

• Provide design guidance, i.e. help make design decisions such as where and how to use low-power techniques (system clock gating, power switching, multiple voltages and frequencies) and how to determine an energy-efficient partitioning of software and hardware.

• Perform reliability analysis using temperature and voltage variation estimates computed from the power consumption estimation.

• Prototype, develop and validate power management and thermal mitigation policies that aim at reducing power budget and thermal constraints.

• Detect and fix bugs affecting power (when power is higher than expected because of functional issues, e.g. when a power domain cannot be switched off when idle because a signal is not issued by hardware) and bugs caused by power (when functionality is jeopardized because of power attributes, e.g. when performing some processing in a block while its supply is shut down).


An architectural power model consists of the following parts:

1. A description of the block/sub-system design hierarchy.

2. A description of the voltage distribution. This is an abstract representation of the power distribution network, from a primary supply source (e.g. a battery), through regulation stages (off- and on-chip regulators), supply muxes and switches, down to leaf power consumers.

3. A description of the clock distribution. This is an abstract representation of the clock tree, from a primary frequency generator (e.g. a crystal), through frequency synthesizers (off- and on-chip), clock muxes and gates, down to leaf clocked power consumers.

4. For each block and sub-system, the listing of their power states. Transitions from one state to another are commanded either internally, through state machines designed into the block, or externally, through registers and/or pins receiving signals.

5. A set of power functions, attached to each power state, which enable the estimation of the power consumed in that state. The functions are devised from a power characterization process. For blocks designed in CMOS technology (considered the default semiconductor process), there is in each state a power function that defines the static power consumption (caused by the leakage currents). In the active states there is in addition a function that defines the dynamic power consumption, which usually depends on the block activity.

Below are some instances of power functions:

Static power consumption (due to leakage) typically depends on the voltage and the temperature, hence:

Pleak = I_ref * 2 ^ ((Temp - 300) / 20) * Volt (leak)

I_ref is the reference current when Temp = 300 K. It is part of the characterization and is defined in the model.

Dynamic power consumption typically depends on the voltage and frequency supplying a digital block, hence:

Pdyn = Cdyn * Volt^2 * Freq (dyn_1)

Cdyn is the equivalent dynamic (switched) capacitance, characterizing the block's switching activity. It is part of the characterization and is defined in the model.

Dynamic power consumption can alternatively be defined according to the block activity when it does not process at its maximum capability, hence another possible form:

Pdyn = ActivityRatio * Cdyn * Volt^2 * Freq (dyn_2)

The architectural power model presents a so-called stimulation interface, which consists of the parameters necessary to set the block power states and to set the variables in the power functions of the selected states, thereby triggering the evaluation of the functions. For instance, for a block having two power states, with power consumption defined by the functions leak and dyn_2 described above, the block stimulation interface is:

{Current state (= "state1" or "state2"), Volt, Freq, ActivityRatio}
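A minimal sketch of such a block model in Python, directly implementing the (leak) and (dyn_2) power functions from above; the class name and the characterization constants passed to the constructor are illustrative assumptions:

```python
class BlockPowerModel:
    """One block with two power states ("idle", "active"), evaluated
    through a stimulation interface as described in the text."""

    def __init__(self, i_ref, cdyn):
        self.i_ref = i_ref   # reference leakage current at Temp = 300 K
        self.cdyn = cdyn     # characterized dynamic capacitance

    def p_leak(self, temp, volt):
        # Pleak = I_ref * 2 ^ ((Temp - 300) / 20) * Volt          (leak)
        return self.i_ref * 2 ** ((temp - 300) / 20) * volt

    def p_dyn(self, volt, freq, activity_ratio):
        # Pdyn = ActivityRatio * Cdyn * Volt^2 * Freq             (dyn_2)
        return activity_ratio * self.cdyn * volt ** 2 * freq

    def evaluate(self, stim):
        """`stim` mirrors the stimulation interface:
        {state, Volt, Freq, ActivityRatio, Temp}."""
        leak = self.p_leak(stim["Temp"], stim["Volt"])
        if stim["state"] == "active":
            return leak + self.p_dyn(stim["Volt"], stim["Freq"],
                                     stim["ActivityRatio"])
        return leak  # idle: static consumption only
```

Passing a stimulus such as `{"state": "active", "Volt": 1.0, "Freq": 100e6, "ActivityRatio": 0.5, "Temp": 300}` to `evaluate` then yields the sum of the leakage and dynamic contributions for that operating point.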


A simulation framework can then give access to the stimulation interface. Stimuli varying over time are passed through the framework to the interface to trigger the evaluation of the power functions. Time-varying power consumption estimation is thus performed.

The stimuli passed to the interface and the resulting computed values are reported in time-series format. A mix of proprietary formats (such as the Docea simulation format, .dsi) and standard formats (such as the Value Change Dump format, .vcd) can be used for this purpose.
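Once such a time series is available, the total energy follows by integrating the power trace over time. A sketch using rectangular (sample-and-hold) integration, assuming a plain list of (time, power) pairs rather than any particular .dsi or .vcd parser:

```python
def energy_from_trace(samples):
    """Integrate a time-varying power trace into total energy (joules).
    Each power value is assumed to hold until the next timestamp
    (sample-and-hold), matching how state-based power traces behave.
    `samples` is a chronologically sorted list of (time_s, power_w)."""
    energy = 0.0
    for (t0, p0), (t1, _) in zip(samples, samples[1:]):
        energy += p0 * (t1 - t0)
    return energy

trace = [(0.0, 0.5), (1.0, 1.5), (3.0, 0.2), (4.0, 0.2)]
total_j = energy_from_trace(trace)  # 0.5*1 + 1.5*2 + 0.2*1 = 3.7 J
```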

Figure 3-4: Power and temperature over time with Docea Simulation viewer

Figure 3-5: Some results in VCD formats, displayed with Impulse VCD viewer

The architectural power modelling as depicted above allows assessing the impact of some low power optimization techniques used in a design. The major techniques used in the industry are presented in Figure 3-6 below. Clock control techniques are explicitly captured in the architectural clock distribution representation, voltage control techniques in the architectural voltage distribution representation. Process oriented techniques are accounted for through the power functions attached to the states, while activity control techniques are considered through the activity aware characterization and stimulation of the active states.


Figure 3-6: Usual low power design techniques

In CONTREX, a dual approach is exploited, combining the gate level (using tools available on the market) and the architectural level. The architectural model is populated with power characterization data coming either from gate-level analysis or from other possible sources: power measurements performed on previous design versions, product datasheets, or designer experience. The model is stimulated first with manually written scenarios representing basic functional sequences (listing power state and functional mode changes over time), and then with traces generated by a functional simulator (using TLM-level models). This way, functional behaviour and power consumption can be jointly analysed and verified.

3.2.3 Power characterisation and simulation of black-box IP components at electronic system level

An essential requirement for complete system power simulation is the power estimation of all components, including black-box IP components. The internal structure of a black-box component cannot be annotated or modified to record the switching activity from which energy consumption would be derived. For this reason, the internal state of an IP component should be abstracted based on a correlation of its observable I/O behaviour with the estimated power consumption, generating a so-called Power State Machine (PSM). For each state, the estimated energy consumption can be obtained from data sheets, inferred from an estimate of the design size or the functional complexity of the design (top-down), or extracted from lower-level (e.g. register transfer or gate level) simulations (bottom-up). With the latter approach, the most accurate model can be built [52]. For this reason, we abstract power values from gate-level simulations to create PSM models.


Figure 3-7: PSM approach overview for non-invasive simulation of energy consumption

Figure 3-7 gives an overview of our I/O observation-based approach for annotating energy information to black-box IP simulation models. The ports of the IP component are observed over time to approximate the internal functionality. Based on these observations, a Protocol State Machine (PrSM) is controlled. The main task of the PrSM is to trigger state transitions in the Power State Machine (PSM) based on the observation and interpretation of the interaction between component and environment. By modelling the interdependencies between I/O and internal states, the PrSM extracts the energetically relevant events and thereby orthogonalises the communication artefacts from the functional artefacts of the non-functional PSM model. This may reduce the complexity of the PSM, because it only describes the different internal operation modes, whereas the PrSM covers the access protocol of the component. Furthermore, the separation of PrSM and PSM has the advantage that components with the same access protocol but different internal implementations can use the same PrSM; only the PSM has to be changed.

Furthermore, there may be some functional state transitions in the IP component that cannot be detected by observing the interaction with the environment; e.g. when a component finishes its calculation, it may not send data indicating this through one of its ports. For this reason, the PSM and PrSM are modelled as timed automata that can execute state transitions after a given delay/timeout. PSM and PrSM are each modelled as an Extended Finite State Machine (EFSM), which extends a simple FSM with (shared) state variables to reduce complexity. EFSMs extend state transitions with enabling functions and update functions. Enabling functions check conditions on the state variables; only when the condition is true is the transition executed. Update functions modify the content of state variables when the related transition is executed. The input parameters of an update function can be both input symbols of the state machine and the values of state variables. The PSM has the restriction that it only allows enabling functions and not update functions, i.e. the PSM can only read the shared state variables but not modify them [53].
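The enabling/update mechanism of an EFSM can be sketched as follows. The states, symbols and the `count` variable are hypothetical, not taken from any CONTREX model; a PSM would use the same scheme with the update functions removed (read-only access to the shared variables):

```python
# Minimal EFSM sketch: each transition carries an enabling function (a guard
# over the shared state variables) and an update function.
shared = {"count": 0}

# (source, input symbol, enabling function, update function, target)
transitions = [
    ("WAIT", "data", lambda v: True,            lambda v: v.update(count=v["count"] + 1), "BUSY"),
    ("BUSY", "data", lambda v: True,            lambda v: v.update(count=v["count"] + 1), "BUSY"),
    ("BUSY", "done", lambda v: v["count"] >= 2, lambda v: v.update(count=0),              "WAIT"),
]

def step(state, symbol):
    for src, sym, guard, update, dst in transitions:
        if src == state and sym == symbol and guard(shared):
            update(shared)
            return dst
    return state  # no enabled transition: stay in the current state

state = "WAIT"
for sym in ["data", "data", "done"]:
    state = step(state, sym)
# back in WAIT with count reset to 0
```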

As described, PrSM and PSM allow transitions to be executed after a defined delay. Therefore, the automata are modelled as Extended Communicating Event Clock Automata (ECECA), a combination of EFSMs with Communicating Finite State Machines (CFSM) [54] and Event Clock Automata (ECA) [55]. An ECA introduces clocks that are tightly associated with certain symbols of the input alphabet. An ECA may have event-recording and event-predicting clocks. An event-recording clock x_a is always reset when the automaton is triggered with the input symbol a. An event-predicting clock y_b is reset to a nondeterministic negative value when the automaton is triggered with the input symbol b; the value is checked against 0 at the next occurrence of b. Due to this association, every non-deterministic ECA can be transformed into a deterministic ECA, which is necessary to implement it as an executable model. The transitions of an ECA are annotated both with input symbols and clock constraints [55]. To enable synchronisation between PrSM and PSM, we use the channel mechanism of CFSMs presented in [54]; instead of a full-duplex channel, we utilise a unidirectional channel.
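A minimal illustration of an event-recording clock: the timestamps and the 5-time-unit bound in the guard are assumptions made for this sketch only:

```python
# Toy event-recording clock of an ECA: x_a is reset whenever the symbol 'a'
# is consumed, so a transition guard can constrain the time elapsed since
# the last occurrence of 'a'.
class EventClock:
    def __init__(self):
        self.last = None
    def reset(self, now):
        self.last = now
    def value(self, now):
        return None if self.last is None else now - self.last

x_a = EventClock()

def consume(symbol, now):
    if symbol == "a":          # clock tightly associated with symbol 'a'
        x_a.reset(now)

consume("a", 0.0)
elapsed = x_a.value(6.0)
timeout_fired = elapsed is not None and elapsed > 5.0   # guard "x_a > 5"
```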

PSM and PrSM share state variables, which means that the state variables of the PrSM, V_PrSM, are the same as those of the PSM, V_PSM. This implies that the transfer messages of the synchronisation channel between PrSM and PSM are extended by the same value range. Hence, a dynamic, state-variable-dependent output can be applied, meaning that every extended PSM state, i.e. the PSM state in combination with the current values of the state variables, has a unique and deterministic output.

3.2.3.1 Data dependency extension

As stated in the last section, the PSM model can only express state-based power consumption, but many designs have data-dependent power consumption. In most cases the reason is that the amount of switching activity in the component can depend heavily on the switching activity at its inputs. Furthermore, switching activity can depend on more than only the last input vector. This is caused by input data that is stored and processed in several stages, thus influencing the power behaviour over a longer time interval. The different stages are influenced by the data in different ways, and the impact of the stages differs, too.


Figure 3-8: Modelling of data-dependent power in the PSM model

In the past, several metrics were developed to express data-dependent power, for example average values, the sum of the values, the Hamming distance (HD), or the signal distance (SD). The HD is the number of positions at which the corresponding values differ between two data vectors of the same size. The SD is the number of positions at which the corresponding values stay at the logic value '1' between two data vectors of the same size. Since the internal structure of black-box IP components is not known, different metrics may give the best accuracy for different components. This results in three requirements for modelling these conditions:

1. Storing the required amount of recent input data on which the power consumption depends

2. Applying the required metric to the stored data

3. Evaluating the impact of the metric calculation result correctly

To fulfil these requirements, the following modifications and extensions are applied to the PSM model, as shown in Figure 3-8. First, the constant outputs of the PSM states are replaced by dynamic outputs. The dynamic output is calculated using the applied metric. The shared variables are used to store the required amount of input data and provide it to the PSM. The task of the PrSM is to store the relevant data in the shared state variables and to shift them at every occurrence of new input data. This way, one state variable always stores the last input vector, another state variable always stores the previous input vector, and so on.
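The two distance metrics and the PrSM shift behaviour described above can be sketched as follows (the function names are ours):

```python
def hamming_distance(a, b, width):
    """Number of bit positions where the two vectors differ."""
    return bin((a ^ b) & ((1 << width) - 1)).count("1")

def signal_distance(a, b, width):
    """Number of bit positions that stay at logic '1' in both vectors."""
    return bin(a & b & ((1 << width) - 1)).count("1")

def shift_in(history, new_vector):
    """PrSM behaviour: history[0] is always the latest input vector."""
    history.insert(0, new_vector)
    history.pop()

hist = [0b0000, 0b0000]
shift_in(hist, 0b1010)                        # hist is now [0b1010, 0b0000]
hd = hamming_distance(hist[0], hist[1], 4)    # 2 positions differ
sd = signal_distance(0b1010, 0b0110, 4)       # 1 position stays at '1'
```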


The dynamic output is only a conceptual modification, not a semantic one. A unique output value is mapped to every valuation of the shared variables, calculated by the output function of each state. Instead of using an output function, the related outputs could be instantiated as separate states; since the value range of the state variables could be enormous, this would lead to a state explosion. By applying the output function, we reduce only the displayed state space while maintaining the complete state space.

For the dynamic output calculation, the current state and the values of the state variables are the determining factors. Nevertheless, the PSM can still have states with a constant output. Furthermore, every state can have a different function mapping the valuation of the state variables to an output value. This function is determined by two questions:

1. Which difference metric is used between two input data vectors?

2. What is the impact of each difference calculation?

The difference metric is strongly dependent on the implementation. Since the implementation is not known to the user, different metrics have to be tested against the resulting power values to determine the one with the best accuracy. In some cases, even a combination of multiple metrics shows the best results. It is then important how the resulting value is transformed into a power or activity value: for example, the dependence on the difference value could be linear, negative linear, quadratic, or exponential.

Since input data is not processed at a single point in time, but forwarded through the design and influencing different processing stages at later times, multiple input values have to be stored. These different stages have a different impact on the resulting power. If more than two input values are considered, we have more than one difference value. Each of the difference values can have a different kind of impact on the resulting power, one quadratic and the other linear, for example; and even if all have the same kind of impact, the corresponding magnitudes can differ. Often a constant part exists as well, due to clock-induced activity. Formally expressed, we have for every state s a mapping out_s between state variables and outputs of the following form:

output_s = out_s(var) = sum_{i=0}^{n} x_i(d_i(var_i, var_{i+1})) + c,

where d_i is the difference metric function, x_i the impact calculation function, and c the optional constant part. var denotes the vector of all state variables and var_i the individual state variable with index i.
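A direct transcription of this mapping; the concrete metric, impact functions and constant below are illustrative choices, not characterised values:

```python
# out_s: impact functions x_i applied to pairwise difference metrics d_i over
# the stored input vectors, plus an optional constant part c (e.g. for
# clock-induced activity).
def out_s(var, d, x, c=0.0):
    """var: stored input vectors, latest first;
       d[i]/x[i]: per-stage difference metric and impact function."""
    return sum(x[i](d[i](var[i], var[i + 1]))
               for i in range(len(var) - 1)) + c

hd = lambda a, b: bin(a ^ b).count("1")          # Hamming distance

power = out_s([0b1111, 0b0000, 0b0011],
              d=[hd, hd],
              x=[lambda v: 2.0 * v,              # linear impact stage
                 lambda v: 0.5 * v * v],         # quadratic impact stage
              c=1.0)
# 2.0*4 + 0.5*2*2 + 1.0 = 11.0
```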

To be able to apply the data-dependent modelling at any abstraction level, we impose no restrictions concerning the timing of the input. That means the input can be cycle accurate, cycle approximate, or even loosely timed. Furthermore, the input data can arrive at different intervals. Between abstraction levels, the difference metrics or the impact calculation may have to be adapted, because the number of events considered within a specific time interval may differ between the abstraction levels; for example, this applies when comparing register transfer level and transaction level.

3.2.3.2 Characterisation of the PSM model

The characterisation is always based on power estimations generated in gate-level simulations. Since we consider only clocked circuits, our components generate most of their dynamic power when the clock signal has a rising edge. Therefore, we generate gate-level power traces with clock-cycle accuracy to achieve the best results. This power trace is our reference and the basis for the PSM model characterisation. Depending on which part of the PSM is characterised, the choice of stimuli matters. In the next subsection, only the PSM with constant output values is generated. This is done by using input data with no, or at least little, switching activity, in order to eliminate or minimise the data-dependent share of the power consumption. Nevertheless, control data has to be changed to activate all operation modes. In the subsequent subsection, we try to keep the PSM in a specific power state (which is not always possible, depending on the functional behaviour) and change the input data over a defined value range to eliminate the control-path-driven share of the switching activity. An overview of the PSM characterisation flow is shown in Figure 3-9.

Figure 3-9: Overview of PSM characterisation flow

PSM characterisation

The PrSM is constructed based on the protocol behaviour of all ports. If more than one port exists, the corresponding state machines are combined into one product state machine and non-reachable states are removed; e.g. some protocol states cannot be active at the same time, depending on the functional behaviour. In this step, the gate-level power trace with eliminated data dependencies is used to find plateaus in the power trace. A plateau is a part of the power trace in which the average power consumption stays nearly constant over time; this average power consumption is represented by the states in the PSM. To find all plateaus, and thus all power states, stimuli are applied such that the functional component exercises all available operation modes. The PSM is built based on the PrSM, because the PrSM defines the trigger events and times. The values and plateaus of the power trace are compared to the trigger points, and based on this the PSM is created. This way it is identified whether more or fewer states than in the PrSM are needed. Some control-path-dependent state transitions may depend on the communication data of the functional component; in this case, these data have to be stored in the shared state variables and corresponding state transitions and states must be inserted.
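One simple way to find such plateaus is to grow a segment while new samples stay close to the running average. The tolerance value and the trace below are arbitrary assumptions for this sketch:

```python
# Hypothetical plateau finder: split a cycle-accurate power trace into
# segments whose running average stays within a tolerance; each plateau
# becomes a candidate PSM power state.
def find_plateaus(trace, tol=0.5):
    plateaus = []
    start, total = 0, trace[0]
    for i in range(1, len(trace)):
        avg = total / (i - start)
        if abs(trace[i] - avg) > tol:
            plateaus.append((start, i, avg))    # (first cycle, last+1, avg power)
            start, total = i, trace[i]
        else:
            total += trace[i]
    plateaus.append((start, len(trace), total / (len(trace) - start)))
    return plateaus

trace = [1.0, 1.1, 0.9, 5.0, 5.2, 4.9, 1.0, 1.0]
plateaus = find_plateaus(trace)   # low-power, high-power, low-power
```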


Data-dependent characterisation

For the characterisation process, it is most important to find a difference metric that fits well and to determine the number of input values to be considered. Since there is no rule of thumb for finding these, the user should make some assumptions based on the behaviour of the functional design: how is input data processed inside the functional component, how long does it take to process the input data until the related output is generated, and are there relations between different input data that can influence the switching activity and thus the power behaviour? Once the metric, the number of considered input values, and the impact factor are defined, the missing parameters can be determined; for example, linear parameters are calculated by a linear regression. If the error is too high, a parameter may be changed and the resulting model tested again until it achieves an acceptable accuracy.
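For instance, assuming a Hamming-distance metric with a linear impact, the impact factor and the constant part can be fitted by ordinary least squares. The data points below are invented for illustration:

```python
# Fit a linear impact model power = a * metric + c against reference
# gate-level power values by ordinary least squares.
def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx        # (impact factor, constant part c)

hd_per_cycle = [0, 1, 2, 3, 4]             # metric values per clock cycle
power_ref = [1.0, 3.0, 5.0, 7.0, 9.0]      # reference power (e.g. in mW)
impact, const = fit_linear(hd_per_cycle, power_ref)   # 2.0 and 1.0
```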

3.2.4 VIPPE: Timing, Energy and Power based on Native Estimation

VIPPE [45][80] is a tool for performance estimation of systems consisting of complex concurrent applications running on multi-core embedded systems. VIPPE relies on native simulation. Roughly, native simulation consists in statically annotating the source code with the associated target-dependent computational loads. This annotated code is then compiled and executed natively on the host. This way, the high computation capabilities of current host platforms are exploited and a fast simulation speed can be achieved.
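In miniature, the native-simulation idea looks like this: a statically estimated, target-dependent cost is attached to each piece of code, and the natively executed code accumulates it at run time. The decorator and the cost of 12 cycles are our own illustration, not VIPPE's actual annotation mechanism:

```python
# Native simulation in miniature: static performance annotation plus
# native functional execution.
cycle_counter = 0

def annotate(cost):
    """Attach a static cycle cost to a natively executed function."""
    def wrap(fn):
        def run(*args, **kwargs):
            global cycle_counter
            cycle_counter += cost       # performance model (static annotation)
            return fn(*args, **kwargs)  # functional model (native execution)
        return run
    return wrap

@annotate(cost=12)
def saturating_add(a, b, limit=255):
    return min(a + b, limit)

result = saturating_add(200, 100)   # functional result 255, 12 cycles accounted
```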

This high simulation and estimation speed is of great interest for the exploration of implementation alternatives. Moreover, native simulation has the potential to obtain accuracies comparable to competing technologies such as binary translation (e.g. tools like Qemu [43] and OVPsim [44]).

In addition, VIPPE has the potential to offer higher speed/accuracy ratios than competing simulation technologies, because VIPPE is oriented towards maximising the exploitation of the underlying parallelism of the host platform. This gives VIPPE a competitive advantage over non-parallelized native-simulation-based simulators such as SCoPE [46][47] (VIPPE's ancestor tool) and over the aforementioned binary-translation-based simulators. Moreover, VIPPE also outperforms the parallelized versions of binary translation tools, e.g. Coremu [56][57] (a parallelized version of Qemu) and CPUManager+QuantumLeap [58][83] (a parallelized version of OVPsim). A first reason is the speed-up obtained because efficient instrumentation of the native code enables a faster run than binary translation of the code. This is especially true when the code runs on top of an embedded OS and caches in the target architecture are considered: VIPPE relies on an OS model, while binary translation tools need to run the OS code too, and VIPPE employs advanced lightweight cache models, easily instantiated in the model and automatically integrated into the time performance figures. Environments like OVP require the integration of a heavier memory model, even when only a performance model is needed (transparent memory models). These aspects are substantiated in the evaluation reports of CONTREX.

Finally, as discussed at the end of this section, the work in COMPLEX leads to the conclusion that future research can produce a hybrid version of the VIPPE kernel (which has been fully re-factored in CONTREX) in order to combine the best of the old and the new versions, for optimum simulation performance in all modelling scenarios.

VIPPE requires the application sources and relies on an abstract RTOS model. This abstract approach frees the user from tackling many SW development details (e.g. tuning linker scripts) and portability issues, which facilitates the DSE activity.


All these benefits enabled us to show VIPPE as a competitive simulation and performance estimation tool, especially in a Design Space Exploration scenario, where many possible implementation solutions need to be considered at an early design stage (prior to SW targeting and refinement).

At the same time, the work in CONTREX has enabled an objective comparison with other technologies, which showed the convenience of binary-translation technologies and tools like OVP for VP-based SW development, or even for refining the results of higher-level DSE. However, these technologies require many implementation details and the solving of portability issues1. Because of that, CONTREX research points to the VIPPE technology as the choice for architectural DSE.

VIPPE takes a full system description as input, which includes the application architecture (threads and their communication), the instance of a software platform layer (specifically, of an abstract RTOS model), and a model of the multi-core hardware architecture, including the hardly predictable impact of elements such as instruction caches, data caches and shared buses. Section 3.1 already showed the range of performance metrics currently estimated by VIPPE.

A first novelty of VIPPE with respect to its ancestor SCoPE is that, like other tools within the CONTREX framework, VIPPE supports LLVM as its basic annotation infrastructure2.

Figure 3-10: Example of code instrumentation performed by VIPPE

VIPPE performs the annotation on LLVM intermediate code (LLVM IR). A main advantage of this annotation mechanism is its portability across target architectures. Before CONTREX, the supported annotations concerned only timing estimations. As a first step covered during the last reporting period, the mechanism for annotation on LLVM intermediate code has been extended to also support energy consumption annotations. This has enabled VIPPE to provide system-level estimation of energy/power consumption taking the SW application code into account. Moreover, the annotation infrastructure of VIPPE has been extended to support other sources of energy consumption, namely the OS- and HW-dependent software, cache activities and bus accesses (considering the activity generated by bus masters). Figure 3-11 shows an example of the extended report now given by VIPPE, which provides a breakdown of energy consumption for the computation, communication and memory elements of the system architecture.

1 For instance, bare-metal developments require modelling memory sizes, and the functionality is affected by them; these details are omitted in VIPPE as they are not relevant for assessing time or energy performance. In RTOS-based development we found that incomplete or bad portings of the targeted RTOS, e.g. of the shutdown procedure, prevented automated DSE.
2 At the same time, VIPPE also supports a specific compiler to enable the assessment of developments done with non-LLVM cross-development tools, e.g. gcc-based compilers.

Figure 3-11: The VIPPE instrumentation infrastructure already supports energy consumption estimations.

Therefore, in its current status, VIPPE already provides a first and portable (across target processor architectures) way of producing timing and energy/power consumption estimates based on native simulation. In this scheme, the instruction count disregards the target processor architecture; the cycle estimations depend on this instruction count and on the cycles associated with the generic "LLVM-IR target", and it is this association that serves to distinguish between targets. The tuning to obtain that association can be done through regression tests, for instance.
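The resulting cycle estimation is then essentially a weighted sum of per-opcode instruction counts. A sketch with invented costs in the style of the opcode_costs.xml file described later:

```python
# Cycle estimation as a weighted sum of per-opcode instruction counts.
costs = {"mov": 1, "sub": 1, "ldm": 2, "blt": 3, "default": 1}

def estimate_cycles(opcode_counts):
    """Unknown opcodes fall back to the 'default' cost entry."""
    return sum(n * costs.get(op, costs["default"])
               for op, n in opcode_counts.items())

cycles = estimate_cycles({"mov": 10, "sub": 4, "ldm": 2, "blt": 1, "mul": 5})
# 10*1 + 4*1 + 2*2 + 1*3 + 5*1 = 26
```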

Another completed activity has been the extension of VIPPE to support improved estimations for specific target architectures of interest (specifically, ARM and MicroBlaze). The extension consists in analysing the assembler produced for the target in a target-dependent analysis, in order to extract from it accurate annotation information for the native compilation branch. This way, a significant gain in accuracy in the estimation of the instruction count is obtained (and, by extension, in the cycle estimations).

The implemented mechanism is sketched in Figure 3-12. The left-hand side sketches the flow shown in Figure 3-11, supporting target-independent annotations. They are performed on the .bc file, which contains the LLVM IR obtained from the native compilation (using only the "-emit-llvm" switch). Figure 3-12 also shows that the LLVM flow is required only until the production of the assembler code; later on, it is possible to apply other toolchains, e.g. gcc.


Figure 3-12: VIPPE instrumentation supporting ASM analyzer.

The aforementioned extension is represented on the right-hand side of Figure 3-12. In addition to the generation of the LLVM IR, a second branch is applied in which the source code is compiled for the specific target. The basic blocks of the generated target-dependent assembler file (.s) already contain the instructions of the instruction set of the target processor architecture, e.g. armv7a. From this assembler file, the ASM analysis is performed by the VIPPE ASM analyser tool ("asmanalyzer" in Figure 3-12). The ASM analyser produces a file with accurate basic block annotations, labelled "bba" in Figure 3-12. This information is used for the annotation of the .bc file of the native compilation branch.

Figure 3-13: Basic block annotations file obtained by the VIPPE ASM analyser for a bubble sort example.

@main_%entry 8 20 10 0
@main_%for.body 6 16 8 2
@main_%for.cond 3 12 6 1
@main_%for.end 4 12 6 0
@main_%for.inc20 2 6 3 1
@main_%for.cond1 4 14 7 1
@main_%for.body3 2 8 4 0
@main_%for.inc17 3 8 4 1
@main_%for.cond4 5 18 9 2
@main_%for.body7 6 22 11 3
@main_%if.then 12 40 20 6
@main_%for.end22 3 6 3 0


Figure 3-13 shows an example of the "bba" file obtained by the VIPPE ASM analysis tool for a simple example (bubble sort). The first column contains the basic block identifier; then four additional columns are provided. The first of these columns accounts for the number of target instructions. Notice that, compared to the generic flow, this is a precise annotation, because the source of the annotations is the ".s" file for the target processor architecture (armv7a in this example), and because it is assumed that the same LLVM-based cross-development will later be performed. The second column provides the number of cycles associated with the computation of the basic block on the target. For that, VIPPE relies on a list of operation costs, stored in a configuration XML file (opcode_costs.xml). An excerpt of this file is shown in Figure 3-14. The opcode_costs.xml file has one "processor" entry for the "llvm" target, which represents the generic, target-independent annotation mechanism. For the target-specific annotation mechanism, based on the ASM analysis, one specific "processor" entry is required per supported target instruction set, e.g. LEON2, MicroBlaze, armv7a, etc. For each "processor" entry, a list of assembler instructions, i.e. the instructions of the target instruction set, is stored; there is one "cost" entry per instruction (identified by its "opcode")3. In addition, the assembler analyser generates another two columns that annotate the number of memory accesses and the energy consumption. For the latter, the assembler analyser relies on static per-instruction energy information, which can also be captured in the opcode_costs.xml file, as Figure 3-14 shows.

3 If this list is incomplete, e.g. because the user has added a new instruction to a specific target processor but did not update the VIPPE "opcode_costs.xml" file, or because not all the information for the target processor is available, then VIPPE will not account for those instructions. This will, of course, introduce some error in the instruction count, but VIPPE will still work and provide estimations. In most cases, especially for widespread and open target instruction sets, generating a complete list should not be an issue.
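A possible parser for the "bba" file format, based on our reading of the column description above (the interpretation of the last two columns as memory accesses and energy follows the text):

```python
# Parse "bba" entries: a basic-block identifier starting with '@' followed
# by four numeric columns (instructions, cycles, memory accesses, energy).
def parse_bba(text):
    blocks = {}
    for tok in text.split("@"):
        if not tok.strip():
            continue
        fields = tok.split()
        blocks["@" + fields[0]] = tuple(int(f) for f in fields[1:5])
    return blocks

bba = parse_bba("@main_%entry 8 20 10 0 @main_%for.body 6 16 8 2")
# bba["@main_%entry"] -> (8, 20, 10, 0)
```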


Figure 3-14: Excerpt of the VIPPE opcode_costs.xml file.

As mentioned, all the annotation information obtained by the ASM analyser and dumped to the "bba" file enables the annotation of the native .bc file. Since this information is statically annotated, the accounting cost during the native simulation is minimised.

<processors>
  <processor type="llvm" >
    <costs> … </costs>
  </processor>
  <processor type="LEON2" >
    <costs> … </costs>
  </processor>
  <processor type="microblaze" >
    <costs> … </costs>
  </processor>
  <processor type="armv7a" >
    <costs>
      <cost opcode="sub" cycles="1" energy_pj="2" />
      <cost opcode="str" cycles="1" energy_pj="2" />
      <cost opcode="mov" cycles="1" energy_pj="2" />
      …
      <cost opcode="ldm" cycles="2" energy_pj="4" />
      <cost opcode="pop" cycles="1" energy_pj="2" />
      <cost opcode="blt" cycles="3" energy_pj="6" />
      <cost opcode="push" cycles="1" energy_pj="2" />
      <cost opcode="default" cycles="1" energy_pj="2" />
    </costs>
  </processor>
</processors>
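A file of this shape can be read with the standard XML library; the abridged fragment and the fallback to the explicit "default" entry below are our own sketch:

```python
# Read an opcode_costs.xml-style fragment into a lookup table of
# (cycles, energy in pJ) per opcode, with a "default" fallback.
import xml.etree.ElementTree as ET

xml_text = """<processors>
  <processor type="armv7a">
    <costs>
      <cost opcode="sub" cycles="1" energy_pj="2"/>
      <cost opcode="ldm" cycles="2" energy_pj="4"/>
      <cost opcode="default" cycles="1" energy_pj="2"/>
    </costs>
  </processor>
</processors>"""

root = ET.fromstring(xml_text)
table = {c.get("opcode"): (int(c.get("cycles")), int(c.get("energy_pj")))
         for p in root if p.get("type") == "armv7a"
         for c in p.find("costs")}

def cost(opcode):
    """(cycles, energy_pj) for an opcode, falling back to 'default'."""
    return table.get(opcode, table["default"])
# cost("ldm") -> (2, 4); cost("mul") falls back to (1, 2)
```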


Figure 3-15: VIPPE currently supports cache estimations for single-core architectures.

3.2.4.1 Cache estimation on multi-core platforms and kernel refactoring

In D2.3.2, it was reported that VIPPE supported performance estimation accounting for the impact of instruction and data caches for single-core models, a feature already covered by SCoPE. The challenge tackled in CONTREX was the support of scalable cache estimation for multi-core target platforms, including both types of L1 caches (instruction and data) and further cache levels, e.g. L2 caches (unified or not), on parallelized simulations, i.e. simulations executed on top of multi-core host platforms.

The activity completed in this last reporting period confirmed the scalability challenge already posed by L1 cache estimations. In a first step, the cache performance assessment was implemented on top of the existing VIPPE kernel and the simulation performance was characterised. This activity led to the conclusion that the structure of the parallel simulation kernel used so far was not suitable. Because of that, a refactoring of the simulation kernel was decided, designed and carried out in the remaining time of the last CONTREX period. The following paragraphs discuss the reasons for this important change in the internal structure of the VIPPE simulation kernel; afterwards, the structure of the new kernel is shown.

As mentioned, a solution for L1 instruction and data cache estimation was formerly implemented for the thread-based parallel VIPPE simulation kernel. It is sketched in Figure 3-16. The thread-based parallelized kernel was used to enable speed-ups that scale with the number of threads in the application model. This brings further scalability, i.e. better simulation speed-ups than simulators exploiting only target parallelism, whenever the number of threads in the application is bigger than the number of target cores (and whenever the number of host cores is bigger than the number of target cores). In the solution shown in Figure 3-16, each thread of the application is simulated by a host thread. Each of these threads is instrumented to automate the performance annotations related to execution on the target processor. In this scheme, a kernel thread is in charge of scheduling the computation slices. The distinguishing aspect of this architecture is that the functional simulation of the application runs in parallel, as an asynchronous program in which user processes are synchronized only on explicit synchronizations (semaphores), while the performance simulation is realized by the kernel thread, which considers the scheduling policy, the performance annotations and the hardware resources to place the computation slices as they would occur in reality. Notice that under this scheme the simulation will always respect the synchronizations made explicit in the application, but races existing in the application can lead to different results (functional indeterminism) between reality and the simulated model. This is an acceptable cost in a DSE context, assuming that race problems are solved, e.g. with verification and safe concurrent programming techniques, and/or that the present races do not affect the required behaviour.

Figure 3-16: Sketch of cache performance simulation in the old VIPPE kernel. The sketch shows the communications for transferring cache events from the user processes to the kernel process.

Although performance, i.e. the delays in a given time slice considering the execution on a given target processor and for a given compiler, is accounted in the user process, cache time penalties are not computed there. In order to consider cache performance, the accesses to memory (both load and store instruction types) generated by the user processes need to be considered too. In the centralized architecture of the simulation kernel shown in Figure 3-16, the user processes need to transfer to the kernel thread all the information describing each cache access (I or D cache access, address, etc.), the so-called cache event. That is, the performance accounting and the cache events (accesses to cache memory) are sent to the kernel process (instead of being computed within the user process). This is required because, in that architecture, the user processes have no notion of the global time (recall that they perform only a functional simulation). Only the kernel process has that notion and knows how to coherently locate the accounted computational effort and the events of the user processes on the simulated time axis. This is a coherent consequence of the full decoupling of the functional simulation (user processes) and the performance simulation (kernel process). In order to decouple the kernel thread and the user threads as much as possible, cache events are dumped to a queue whose implementation relies on shared memory (to reduce communication times). This approach works well for models of multi-core systems without caches: with this implementation, VIPPE outperforms competing technologies, e.g. OVP. However, the implementation of cache estimation sketched in Figure 3-16 has enabled us to check that the simulation performance is greatly degraded when the model integrates L1 caches and their impact is estimated. A first important problem of the simulation kernel architecture of Figure 3-16 is that, given reasonable parameters for the VIPPE slice time, and realistic inputs for the frequency of event generation related to both the instruction and data caches, a huge amount of events can be generated in each time slice, thus demanding a huge amount of shared memory. In the implementation, this demand grows linearly with the number of user processes. The amount of shared memory is limited to 4 GB for 32-bit target architectures, given the annotation method of VIPPE. In the former implementations, only 4 user threads could be simulated. In later versions, work to compress the size associated with each cache event (e.g. to avoid transferring the whole access address) was tried and assessed, which yielded a VIPPE version capable of simulating up to 20 user processes and thus of effectively running the CONTREX use cases.
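The centralized transfer of cache events can be sketched in a few lines of Python; the queue below stands in for the shared-memory channel, and all names and figures are ours, not VIPPE's:

```python
from queue import Queue
from threading import Thread

# Sketch (ours, not VIPPE code) of the centralized scheme: every user thread
# pushes its cache events into one shared queue; a single kernel thread
# consumes and orders all of them, which is the bottleneck described above.

def user_thread(tid, accesses, event_q):
    for addr in accesses:
        # the functional simulation would run here; we only emit the event
        event_q.put((tid, 'D', addr))        # (thread, cache type, address)
    event_q.put((tid, None, None))           # end-of-slice marker

def kernel_thread(n_users, event_q):
    """Central consumer: the only place with a notion of global time."""
    done = processed = 0
    while done < n_users:
        tid, kind, addr = event_q.get()
        if kind is None:
            done += 1
        else:
            processed += 1                   # hit/miss lookup would go here
    return processed

q = Queue()
users = [Thread(target=user_thread, args=(i, range(1000), q)) for i in range(4)]
for t in users:
    t.start()
n = kernel_thread(4, q)
for t in users:
    t.join()
print(n)  # 4000: every event funnels through the single kernel consumer
```

However many user threads run in parallel, all events converge on the single consumer, which illustrates both the sequentialization and the growth of the buffered event volume with the number of user processes.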
Further improvements to compress the cache event information further, or to consider customized sizes of the shared memory used for the user process – kernel process communication, were devised. A study of scalability would still be required for them. In addition, there are some pathological scenarios to be detected and solved. For instance, a user process with an infinite loop and no synchronizations could act as an infinite source of cache events, overflowing the simulator. This can be handled by forcing a synchronization of user processes after a given number of cache events. In any case, even after solving the aforementioned issues, the critical issue with the thread-based kernel implementation of Figure 3-16 turned out to be the simulation speed degradation in the scenarios where cache memories are present. Caches are a main platform element in terms of impact on performance. Skipping their estimation could be considered in strict real-time systems, whenever it is assumed that any considered/assessed solution will rely on predictable platforms with processors without L1 caches. However, a distinguishing and relevant aspect of mixed-criticality systems is, in fact, the possibility of implementing strict real-time applications and best-effort applications on cost-effective platforms, typically with shared resources and performance-oriented architectural elements, i.e. caches, which allow efficient implementations, but which make accurate prediction difficult. Because of that, enabling L1 instruction & data cache estimation can be considered a must in an MCS design context. Therefore, a full kernel refactoring, i.e. a deep restructuring of the simulation kernel without changes to the external behaviour of the tool, was decided and implemented in the last period of the project.


Our analysis and experiments led us to conclude that there were two main aspects, both apparent in Figure 3-16, causing the aforementioned speed degradation of the thread-based parallelized kernel:

The centralization of cache events in the kernel thread involves a sequentialization which dominates the remaining aspects of the simulation, due to the huge amount of events to be processed.

The transmission of events between the user processes and the kernel also adds an important time cost (even though shared memory was selected as a lightweight communication method between processes).

A possible solution considered was to rely on N threads (at least as many as processing elements) for the cache event processing. This cache event processing must convert each cache access into the actual time penalty, which depends on whether a hit or a miss is assessed and on the cost of retrieving the cache line either from L2 or from memory. However, this would still suffer from the communication overhead. The new simulation kernel solves the two problems causing simulation speed degradation and, in addition, the problem of excessive shared memory demand. The newly implemented kernel is sketched in Figure 3-17. A first main difference is that the kernel process directs the execution of the user threads by telling them the next time wall up to which to execute. Similarly to the previous version, the kernel works in periodic steps, called time slices, such that the user threads are synchronized on those periodic time stamps. However, in the new version, each user thread has, at the beginning of the time slice, a reference of the simulation time synchronized with the other user threads, and thus a valid reference to compute the simulation time of any event within the time slice by adding the locally accounted time. (In the previous version of VIPPE, that synchronized reference existed only at the beginning of the user process, and the time slice was basically used by the kernel process for the scheduling in the performance simulation.) This allows the user processes, in the new version, to be in charge of processing the cache events. It eliminates the communication overhead problem and the shared memory demand, as the cache events no longer need to be transferred to the kernel process. Moreover, at the same time, the sequentialization problem is solved, as cache event processing is distributed among the user processes.
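The new slice protocol can be sketched as follows (an illustrative Python model, not VIPPE code; the slice length and penalty values are invented):

```python
from threading import Barrier, Thread

# Sketch (ours) of the new scheme: the kernel imposes a "time wall" per
# slice; each user thread starts the slice with a synchronized simulation
# time, so it can timestamp and process its own cache events locally.

SLICE_NS = 1000          # slice length, illustrative value
MISS_PENALTY_NS = 10     # penalty accounted locally per slice, illustrative

class UserThread(Thread):
    def __init__(self, barrier, n_slices, results, idx):
        super().__init__()
        self.barrier, self.n_slices = barrier, n_slices
        self.results, self.idx = results, idx
    def run(self):
        sim_time = 0                 # synchronized reference at slice start
        penalty = 0
        for _ in range(self.n_slices):
            # cache events of this slice are processed here, in this thread:
            # timestamp = sim_time + local offset, no transfer to the kernel
            penalty += MISS_PENALTY_NS
            self.barrier.wait()      # time wall: all threads re-synchronize
            sim_time += SLICE_NS
        self.results[self.idx] = (sim_time, penalty)

results = {}
wall = Barrier(3)
threads = [UserThread(wall, 5, results, i) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results[0])  # (5000, 50): 5 slices of 1000 ns, 50 ns of local penalties
```

All penalty accounting stays in the user threads; the only global interaction left is the barrier at each time wall.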


Figure 3-17: Sketch of cache performance simulation in the new VIPPE kernel.

Although the new kernel performs thread-based parallelization in the sense that there is a host thread per simulated target thread, the effective parallelization is per target core. In other words, the effective maximum parallelization that can be exploited by the underlying host cores is limited by the number of processing cores in the simulated target platform (and no longer by the number of application threads). In this sense, VIPPE adopts the same limitation as the most advanced binary-translation-based simulation frameworks, i.e. OVP with the QuantumLeap simulation technology [83], and QEMU-based parallelizations, e.g. Coremu [81] or PQEMU [82]. The target-core limitation of the parallelization stems from the need to consider cache estimation. Thread-based parallelization relies on the possibility of executing the functional model in parallel to the performance model. The additional parallelism of thread-based parallelization (vs. target-core-based parallelization) comes from the fact that threads in the functional model are blocked only by explicit synchronizations (in charge of ensuring a partial order which guarantees functional correctness). In the old kernel, this meant that a piece of code of a user process could be executed in the host before it was actually scheduled in the performance simulation. In the new kernel, this means that when a cache event is produced in a user process, the cache state at that simulated time should be known. This allows the user process to process the cache event accurately in order to compute the actual time penalty. In the current kernel, the precise knowledge of the cache state when the cache event is produced in the user process is achieved by adopting the aforementioned target-core-based parallelization.
That is, whatever the number of threads, M, the kernel simulates the schedule of those M threads on the N processors by allowing the concurrent execution of N threads and blocking the remaining ones. Therefore, whenever a thread is scheduled to execute in a new slice, the state of the core cache will be known, and consequently the state at any middle point of the slice, where a cache event can happen. This scheme enables reusing the same abstract cache model as for single-thread, single-core cache performance assessments. In order to optimize simulation speed, the kernel performs the annotation at a different level of granularity depending on the L1 cache type. Instruction caches have only read accesses, with a more regular access pattern. Because of that, a coarser granularity has been assumed for the estimation of instruction caches. Specifically, the annotation of hits/misses is done per basic block. One or more annotations are used, depending on whether the basic block overlaps one or more cache lines. The access pattern of data caches is richer (write and read accesses) and less regular. Because of that, the annotation is performed for every cache access (cache event), and the performance penalty is computed for each of those cache events.
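The two annotation granularities can be illustrated with a toy direct-mapped cache model (our sketch; VIPPE's actual cache model, sizes and policies differ):

```python
# Sketch (ours) of the two annotation granularities: a per-basic-block check
# for the instruction cache and a per-access check for the data cache.
# The cache is a minimal direct-mapped model; all sizes are illustrative.

LINE = 32  # cache line size in bytes

class DirectMappedCache:
    def __init__(self, n_lines):
        self.tags = [None] * n_lines
    def access(self, addr):
        line = addr // LINE
        idx = line % len(self.tags)
        hit = self.tags[idx] == line
        self.tags[idx] = line
        return hit

icache = DirectMappedCache(64)

def annotate_basic_block(start, size):
    """I-cache: one annotation per basic block, one lookup per line spanned."""
    first, last = start // LINE, (start + size - 1) // LINE
    return sum(0 if icache.access(l * LINE) else 1
               for l in range(first, last + 1))

dcache = DirectMappedCache(64)

def annotate_access(addr):
    """D-cache: one annotation (cache event) per individual load/store."""
    return 0 if dcache.access(addr) else 1

m1 = annotate_basic_block(0x1000, 100)  # block spans 4 lines: cold misses
m2 = annotate_basic_block(0x1000, 100)  # re-executed: lines now resident
d1 = annotate_access(0x2000)            # first access: miss
d2 = annotate_access(0x2000)            # repeated access: hit
print(m1, m2, d1, d2)  # 4 0 1 0
```

A 100-byte basic block starting at 0x1000 touches four 32-byte lines, so its single annotation resolves four lookups at once, while each data access is checked individually.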

3.2.4.2 Support of several levels of cache

Modelling several levels of cache memories becomes a necessity in MCS design, as embedded platforms become more complex and integrate more levels in the memory hierarchy. A proper configuration and dimensioning of the different levels of cache memories will have, in general, an important impact on the system performance. In fact, the Zynq platform selected as reference platform in the project integrates an L2 cache memory. In CONTREX, VIPPE has also been extended to support performance modelling of L2 and further levels of cache. VIPPE employs the same cache models for the estimation. The idea is that VIPPE is capable of chaining the data fetch across the different levels of cache towards the bus. For that, an attribute has been added to the cache model to point to the identifier of the cache instance (cache id) at the next cache level where, in case of a miss, the data is fetched. This way, VIPPE is capable of a very abstract performance modelling of cache hierarchies, where the access times to the memories of the different cache levels are not accounted, but where a model with an L2 cache will in general present fewer accesses to the bus to retrieve line data, and thus a smaller time penalty.
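The chaining of cache levels described above can be sketched as follows (an illustrative model with invented sizes and a direct-mapped policy, not the actual VIPPE cache model):

```python
# Sketch (ours) of chained cache levels: each cache holds a reference to the
# next level, a miss is forwarded down, and the last level counts bus
# accesses, mirroring the "next cache id" attribute described above.

class Cache:
    LINE = 32
    def __init__(self, n_lines, next_level=None):
        self.tags = [None] * n_lines
        self.next_level = next_level     # None: next stop is the bus/memory
        self.bus_accesses = 0
    def fetch(self, addr):
        line = addr // self.LINE
        idx = line % len(self.tags)
        if self.tags[idx] == line:
            return                       # hit at this level
        self.tags[idx] = line
        if self.next_level is not None:
            self.next_level.fetch(addr)  # miss: fetch from the next level
        else:
            self.bus_accesses += 1       # last level: line comes from the bus

l2 = Cache(1024)                  # large L2, backed directly by the bus
l1 = Cache(4, next_level=l2)      # tiny L1, to force conflict misses
for addr in [0, 256, 0, 256, 0]:  # 0 and 256 conflict in L1 but not in L2
    l1.fetch(addr)
print(l2.bus_accesses)  # 2: the L2 absorbs the repeated L1 conflict misses
```

Access times per level are not modelled, exactly as in the abstract scheme above: the only effect of the L2 is that fewer fetches reach the bus.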

3.2.4.3 Interface with SystemC

VIPPE has been extended in CONTREX to support an interface with SystemC. This interface enables the co-simulation of any generic SystemC code with a VIPPE simulation. VIPPE offers a simple programming API that enables an application to:

- Write (send to SystemC) data
- Read (receive from SystemC) data
- Declare an interrupt handler and receive an event from SystemC

With this simple interface it is possible to model a component in SystemC, integrated, for instance, into a SystemC description of part of the platform (using TLM, raw SystemC signals, or SystemC-AMS). The SW application can send and receive events and data to/from this SystemC part of the platform through the aforementioned interface. This enables the integration of VIPPE models, capable of efficiently simulating the applications and the processing platform, with digital and analog parts of the platform.


Moreover, it also enables the integration of VIPPE with other tools to support cyber-physical system modelling. Specifically, in CONTREX, the SystemC interface has enabled the co-simulation of VIPPE with CamelView models. CamelView enables the modelling of the physical part of the system. For that, the SystemC interface is used to conveniently model the I/O peripheral models. These peripheral models rely on the CamelView stub (a UC tool also developed in CONTREX) to connect VIPPE with CamelView models. The interface handles the synchronization of the SystemC time axis with the VIPPE time axis, or in other words, of the SystemC time with the time advance of the VIPPE simulation. The final evaluation of VIPPE is reported in the work package 5 reports.
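The kind of time-axis synchronization described here can be sketched, under a simple lockstep assumption, as follows (names and the fixed-quantum policy are ours, not necessarily the tools' actual mechanism):

```python
# Sketch (ours, not the actual VIPPE/SystemC bridge) of keeping two
# simulation time axes aligned: both sides advance in fixed quanta so
# neither runs ahead of the other by more than one quantum.

class Sim:
    def __init__(self, name):
        self.name = name
        self.time = 0
    def advance_to(self, t):
        # local events up to t would be executed here; we only move the clock
        self.time = t

def cosimulate(sim_a, sim_b, quantum, horizon):
    t = 0
    while t < horizon:
        t += quantum
        sim_a.advance_to(t)  # data/event exchange happens at each boundary
        sim_b.advance_to(t)
    return sim_a.time, sim_b.time

times = cosimulate(Sim("vippe"), Sim("systemc"), quantum=100, horizon=1000)
print(times)  # (1000, 1000): both axes meet at every quantum boundary
```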

3.2.5 Node-level energy/timing estimation

3.2.5.1 SWAT: Software Analysis Toolchain

Energy-efficient hardware and software design is required to ensure continuous scalability [18], and power optimization is now an issue spanning all design phases and abstraction levels [19]. As a matter of fact, due to the ever increasing complexity and density of embedded software in modern embedded systems, energy optimization is gaining utmost attention, and it is of paramount importance to be able to characterize the software early, at source-code level.

The early-stage characterization of embedded software power requirements can be performed either statically or dynamically. Static characterization has no overhead on the run-time performance of the application, since the analysis and processing phases are done off-line without, in general, executing the binary code. However, an off-line analysis does not consider input data and interaction with the operating environment. On the other hand, dynamic characterization is able to consider data-dependent energy requirements, at the cost of increased overhead.

Traditionally, the estimation of software power requirements has been performed by means of ad-hoc, energy- and cycle-accurate instruction-set simulators (ISS). These simulators provide precise estimates of the power and performance characteristics [20], but are generally very computationally demanding; in addition, they are not able to propagate and redistribute the power consumption metrics to the basic entities (i.e., source code) that are responsible for that consumption. To solve this issue, and to try to raise the abstraction level of the estimates, the authors in [19] propose to use a sampling-based profiling tool (e.g., gprof) providing coarse-grained estimates, and to merge those estimates with the output of a traditional ISS. The rough estimates were done on a per-function basis, thus without any possibility to propagate the estimate to finer elements. Fine-grained estimates are very important for optimization purposes, since most optimizations are done inside function bodies.

Compilation-based techniques are indeed an interesting approach to guide optimization starting from the analysis of the entire source code. A standard compiler is employed to generate an intermediate representation of the code, which is in turn transformed into a control flow graph, on which relevant power and performance figures are annotated to obtain cycle-accurate models. A great advantage of these techniques lies in their ability to account for the effects of different compilation optimizations. Nevertheless, they cannot redistribute the estimate to the source-level entities causing power consumption. A first attempt at redistributing energy estimates to source-level entities is proposed in [21], through instrumentation promotion of C code to C++ code.

The SWAT methodology, described in more detail in the following, tries to take advantage of both approaches, while mitigating their disadvantages. The SWAT tool-chain is based on a static characterization of the structural features of the source code and a statistical characterization of the power consumption figures of the microprocessor, while data dependency is accounted for through appropriate instrumentation.

In CONTREX, the SWAT methodology and flow will have the primary goal of providing estimations of the size, execution time and energy consumption of a given application for a specific target core processor. The general SWAT methodology and flow developed during a previous project (COMPLEX, FP7-ICT-2009-4 - 247999) is split into different phases, schematically shown in Figure 3-18 and described in the following.

Figure 3-18: Overall detailed software estimation flow.

Target processor characterization. This phase has the goal of providing a simple and static, yet accurate, model of the behaviour of the target processor. This model allows linking the abstract, target-independent, source-level model of the application to the specific executor. It is important to note that this phase needs to be executed only once for each target processor being considered.

Source-level static modelling. This step builds a data-independent, static model of the source code based on an intermediate representation expressed in the form of pseudo-assembly LLVM code. Such a model is then transformed into a simplified representation of the characteristics of the single basic blocks. Conceptually, such models are totally decoupled from the specific characteristics of the target processor. In practice, though, target executor information is combined with the code model in this phase. Static modelling does not depend on data and thus needs to be executed only once for each application.

Application dynamic modelling. The dynamic modelling phase, in its simplest form, collects profiling information at basic block and/or function levels. Clearly, the dynamic phase depends on the actual data fed to the application.

Analysis and post-processing. This phase concludes the estimation flow by combining the static source code models with the dynamic information. The outputs are size, execution time and energy estimates of the target application at different levels of abstraction, namely basic block, source code line, function and entire application. In addition to these overall figures, a set of analyses of the static and dynamic structure of the code is performed to derive a detailed characterization of the application.
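The combination performed in this phase can be sketched as follows (illustrative Python with invented per-block figures, not SWAT code):

```python
# Sketch (ours) of the combination step: static per-basic-block cost models
# multiplied by dynamic execution counts, then aggregated per function and
# for the whole application. All numbers are invented for illustration.

static_model = {   # basic block -> (cycles, energy in nJ, enclosing function)
    "bb0": (12, 4.0, "main"),
    "bb1": (30, 9.5, "filter"),
    "bb2": ( 7, 2.0, "filter"),
}
dynamic_counts = {"bb0": 1, "bb1": 1000, "bb2": 1000}  # from one input set

def estimate(model, counts):
    per_function = {}
    total_cycles, total_energy = 0, 0.0
    for bb, (cycles, energy, fn) in model.items():
        n = counts.get(bb, 0)
        c, e = per_function.get(fn, (0, 0.0))
        per_function[fn] = (c + n * cycles, e + n * energy)
        total_cycles += n * cycles
        total_energy += n * energy
    return total_cycles, total_energy, per_function

cycles, energy, by_fn = estimate(static_model, dynamic_counts)
print(cycles, energy, by_fn["filter"])  # 37012 11504.0 (37000, 11500.0)
```

The same accumulation can be repeated per source line, which is why the flow can report estimates at all four levels of abstraction.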

Figure 3-19, Figure 3-20 and Figure 3-21 show some examples of reports produced by SWAT.

Figure 3-19: SWAT: Assembly-level statistics


Figure 3-20: SWAT: Basic-block profiling results

Figure 3-21: SWAT: Dynamic stack usage trace


From experimentation with the Keil ISS integrated into the Keil IDE, we concluded that the information provided directly by Keil is sufficient for timing – and consequently power – estimation as required by the CONTREX flow to be applied to the automotive telematics use case. For this reason, the SWAT toolchain will no longer be used in the flow. This is possible since the function-level figures provided by Keil can be used directly in the model fed as input to the non-functional simulator N2Sim.

3.2.5.2 N2Sim: Non-functional Node Simulator

Sensor nodes constitute a specific class of embedded systems, usually characterized by some distinctive traits such as very limited computational and memory resources, a periodic mode of operation with a very low duty cycle (often referred to as normally-off computing), several sensing devices and a network interface, either wired or wireless. While a variety of general purpose sensor nodes exists on the market, specific, highly-constrained applications require custom nodes explicitly designed for the application at hand.

Since their appearance, sensor nodes have been studied with a particular focus on communication aspects, namely synchronization, protocols, routing algorithms and so on [27][28]. For this reason, most simulation approaches and tools have concentrated on system-wide networking issues, while almost ignoring node-specific aspects. Table 2 summarizes the main characteristics of some popular wireless sensor network simulators.

Table 2. Sensor nodes simulators summary

Name | Type | Class | WSN-specific | Notes

NS-2 [22][23][24] | Simulator | Discrete-Event | No | Maximum 100 nodes. No bandwidth or power consumption simulation.

TOSSIM [25][26][28][30][31] | Emulator | Discrete-Event | Yes | Up to thousands of nodes. Can emulate radio models and code execution. Only emulates homogeneous applications. Needs PowerTOSSIM to simulate power consumption.

EmStar [32][33] | Emulator | Trace-Driven | Yes | Cannot support a large number of sensor nodes. Only runs in real-time simulation and only applies to iPAQ-class sensor nodes and MICA2 motes.

OMNeT++ [34][35] | Simulator | Discrete-Event | No | Supports MAC protocols and some localized protocols in WSN. Simulates power consumption and channel controls. Limited available protocols.

J-Sim [36] | Simulator | Discrete-Event | No | Up to 500 sensor nodes. Can simulate radio channels and power consumption. Long execution time.

ATEMU [37] | Emulator | Discrete-Event | Yes | Can emulate different sensor nodes in homogeneous and heterogeneous networks. Can emulate power consumption of radio channels. Long simulation time.

Avrora [38] | Simulator | Discrete-Event | Yes | Up to thousands of nodes. ISS-based code timing simulation. Specific to AVR-based nodes.

SystemC Network Simulation Library [50] | Simulator | Discrete-Event | No | Open-source extension of the SystemC library providing classes for packet transmission and reception, channels (both shared and point-to-point) and protocols, as well as for tracing statistics.

The N2Sim simulator developed by PoliMi tackles the simulation problem from a different perspective, as it concentrates on the characteristics of the node rather than on those of the communication interface and of the entire network. The simulator, described in more detail in the following, can model periodic and asynchronous tasks, and generic sensing devices, both from the timing and from the power consumption point of view.

N2Sim is a non-functional event-driven sensor node simulator specifically designed to evaluate high-level extra-functional properties of an application. Typical analyses include duty-cycle estimation, scheduling and task-merging algorithms and policies, power tracing and so on.

As anticipated at the end of Section 3.2.5.1, function-level timing data provided by Keil can be directly used as input to build the model that feeds non-functional simulation. This, though, required changing the specification format provided as input to N2Sim.

In particular, the original version of the simulator only considered two types of entities, namely components (i.e. devices) and tasks. The new version of the simulator combines hardware and software behaviours into a single concept of “event” and, to this purpose, it requires a more detailed model that explicitly describes the following entities:

Hardware event. An event that involves a specific hardware device (e.g. accelerometer sampling, accelerometer interrupt request, microcontroller execution …)

Job. A job is associated with the execution of some code on the microcontroller. Though jobs are conceived as functions whose execution time is data-independent, minimal variability can be introduced by means of global entities such as FIFOs, flags and global variables.

Task. A task is simply a collection of jobs. It is modelled directly using C code.

The configuration is split into two files: a header file containing the declarations and a source file containing the implementation of the model. It is worth noting that most of the C source code needed by the configuration file can easily be generated automatically from a higher-level model. The general format of the configuration header file is shown in Figure 3-22.

// Device States
[Declarations]
// Events
[Declarations]
// Jobs
[Declarations]
// Tasks
[Declarations]

Figure 3-22 – General structure of N2Sim configuration header file

Page 39: Design of embedded mixed-criticality CONTRol systems under ...€¦ · Many embedded systems are real-time systems, i.e. they operate with timing constraints. The consideration of

CONTREX/OFFIS/R/D3.1.3 Public Extra-functional property models (final)

Page 39

The source code counterpart has a similar structure, with the addition of an initial section dedicated to global entities and the explicit declaration of the initial event that triggers the start of the simulation. This structure is depicted in Figure 3-23.

// Globals
[FIFOs declarations]
[Flags declarations]
[Variables declarations]

// Start event
void __start__(EventMode mode)
{
    S.post(<delay>,ACTION(<event>,<mode>),<duration>);
}

// Events
EVENT_ENTRY(<event>)
EVENT_START(<event>) { ... }
EVENT_END(<event>) { ... }
...

// Jobs
JOB(<job>) { ... }
...

// Tasks
TASK_ENTRY(<event>)
TASK_START(<event>) { ... }
TASK_END(<event>) { ... }
...

Figure 3-23 – General structure of N2Sim configuration source file

According to this model, a hardware device is characterized by the timing and electrical properties of its operating states. As an example consider the model used for the MEMS accelerometer shown in Figure 3-24.

// ------------------------------------------------------------------------
// Device States
// ------------------------------------------------------------------------
//          Device  State     V    mA
// ------------------------------------------------------------------------
DECLARE_VI( LSM303, idle,     1.8, 0.55 )
DECLARE_VI( LSM303, sampling, 1.8, 3.25 )
DECLARE_VI( LSM303, irq,      1.8, 0.00 )
DECLARE_VI( LSM303, reading,  1.8, 5.18 )

// ------------------------------------------------------------------------
// Hardware Events
// ------------------------------------------------------------------------
//             Device  Event       Duration Period V/I
// ------------------------------------------------------------------------
DECLARE_EVENT( LSM303, lsm_sample, 20,      800,   VI(LSM303,sampling) )
DECLARE_EVENT( LSM303, lsm_irq,    1,       0,     VI(LSM303,irq) )
DECLARE_EVENT( LSM303, lsm_read,   85,      0,     VI(LSM303,reading) )

Figure 3-24 – Example of device and event description

Page 40: Design of embedded mixed-criticality CONTRol systems under ...€¦ · Many embedded systems are real-time systems, i.e. they operate with timing constraints. The consideration of

CONTREX/OFFIS/R/D3.1.3 Public Extra-functional property models (final)

Page 40

Relevant properties of the components are the power consumption levels (voltage and current) in the different operating modes and the times associated with the significant events.

Similarly, jobs are described as functions and characterized by an execution time only. The energy figures associated with jobs are derived by associating the jobs with the microcontroller, which is described in the device section exactly as any other device. It is worth noting that some activities involve both the execution of code on the microcontroller and the execution of some hardware activity on a certain device. Consider, as an example, the operation of reading the acceleration samples from the hardware FIFO of the device into a software queue. This activity is modelled as the combination of the lsm_read event (see Figure 3-24) with a software job described as reported in Figure 3-25. Purely software jobs, such as digital data filtering, are described exactly in the same way, with the difference that there is no corresponding hardware event.

// ------------------------------------------------------------------------
// Jobs
// ------------------------------------------------------------------------
//             Device  Event     Duration Period V/I
// ------------------------------------------------------------------------
DECLARE_EVENT( CPU,    lsm_read, 115,     0,     VI(CPU,running) )

Figure 3-25 – Example of job description

Tasks are mainly characterized by their C source code. In its simplest form, a task is a sequence of jobs. Since the task model can contain arbitrary C code, it is possible to alter the sequence of jobs executed when the task is scheduled. The simulator is non-functional and event-driven and allows for very fast execution. The output is a lengthy and verbose trace that can then be further processed to extract only the events of interest.
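As an illustration of such a non-functional event-driven engine, the following sketch (ours, not N2Sim internals) replays a periodic hardware event with the lsm_sample timing of Figure 3-24:

```python
import heapq

# Minimal non-functional event engine (our sketch, not N2Sim code): events
# carry only timing data, and periodic hardware events re-post themselves,
# reproducing the 800 us lsm_sample period of Figure 3-24.

def simulate(initial_events, horizon_us):
    pq = list(initial_events)  # entries: (time, device, duration, name, period)
    heapq.heapify(pq)
    trace = []
    while pq:
        t, dev, dur, name, period = heapq.heappop(pq)
        if t > horizon_us:
            break
        trace.append((t, dev, dur, name))
        if period:             # periodic events re-post themselves
            heapq.heappush(pq, (t + period, dev, dur, name, period))
    return trace

trace = simulate([(800, "LSM303", 20, "lsm_sample", 800)], horizon_us=4000)
print([t for t, *_ in trace])  # [800, 1600, 2400, 3200, 4000]
```

Because no functionality is executed, only the event queue is processed, which is what makes this style of simulation so fast.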

The figure below reports an excerpt of a raw (i.e. without any post-processing) simulation trace, where hardware events are highlighted in red, task start/end events are shown in blue and jobs are reported in black.

######################################################################################## # Time Device Duration Event M Voltage Current ######################################################################################## 800 us LIS 20 us lis_sample S 1.8 V 6.5 mA 1600 us LIS 20 us lis_sample S 1.8 V 6.5 mA 2400 us LIS 20 us lis_sample S 1.8 V 6.5 mA 3200 us LIS 20 us lis_sample S 1.8 V 6.5 mA 4000 us LIS 20 us lis_sample S 1.8 V 6.5 mA 4020 us LIS 5 us lis_irq S 1.8 V 1.05 mA 4025 us CPU 50 us lis_isr S 1.8 V 9 mA 4075 us CPU 0 us task_acquisition S 0 V 0 mA 4075 us LIS 120 us lis_read X 1.8 V 7.75 mA 4075 us CPU 120 us lis_read X 1.8 V 9 mA 4195 us CPU 15 us filter X 1.8 V 9 mA 4210 us CPU 0 us task_acquisition E 0 V 0 mA ... 20000 us LIS 20 us lis_sample S 1.8 V 6.5 mA 20020 us LIS 5 us lis_irq S 1.8 V 1.05 mA 20025 us CPU 50 us lis_isr S 1.8 V 9 mA 20075 us CPU 0 us task_acquisition S 0 V 0 mA 20075 us LIS 120 us lis_read X 1.8 V 7.75 mA 20075 us CPU 120 us lis_read X 1.8 V 9 mA 20195 us CPU 15 us filter X 1.8 V 9 mA 20210 us CPU 0 us task_acquisition E 0 V 0 mA 20210 us CPU 0 us task_analysis S 0 V 0 mA 20210 us CPU 18 us crash X 1.8 V 9 mA 20228 us CPU 90 us lowenergy X 1.8 V 9 mA 20318 us CPU 0 us task_analysis E 0 V 0 mA 20426 us CPU 0 us task_dump S 0 V 0 mA

  20426 us   CPU      1000 us    dump_lowenergy     X   1.8 V     9 mA
  20800 us   LIS        20 us    lis_sample         S   1.8 V     6.5 mA
  21426 us   CPU         0 us    task_dump          E   0 V       0 mA
  21600 us   LIS        20 us    lis_sample         S   1.8 V     6.5 mA
  22400 us   LIS        20 us    lis_sample         S   1.8 V     6.5 mA
  23200 us   LIS        20 us    lis_sample         S   1.8 V     6.5 mA
  24000 us   LIS        20 us    lis_sample         S   1.8 V     6.5 mA
  24020 us   LIS         5 us    lis_irq            S   1.8 V     1.05 mA
  24025 us   CPU        50 us    lis_isr            S   1.8 V     9 mA
  24075 us   CPU         0 us    task_acquisition   S   0 V       0 mA
  24075 us   LIS       120 us    lis_read           X   1.8 V     7.75 mA
  24075 us   CPU       120 us    lis_read           X   1.8 V     9 mA
  24195 us   CPU        15 us    filter             X   1.8 V     9 mA
  24210 us   CPU         0 us    task_acquisition   E   0 V       0 mA
  24800 us   LIS        20 us    lis_sample         S   1.8 V     6.5 mA

Figure 3-26: N2Sim raw execution trace excerpt
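Post-processing of such a raw trace can be sketched as follows; this is an illustrative Python filter, not a CONTREX tool, and it assumes the whitespace-separated record layout visible in the excerpt above (time, device, duration, event name, S/E/X mark, voltage, current).

```python
# Illustrative filter over an N2Sim-style raw trace (hypothetical parser,
# assuming the column layout shown in Figure 3-26).
def parse_line(line):
    """Parse one trace record into a dict; returns None for non-records."""
    tok = line.split()
    if len(tok) != 11 or tok[1] != "us":
        return None
    return {
        "time_us": int(tok[0]),
        "device": tok[2],
        "duration_us": int(tok[3]),
        "event": tok[5],
        "mark": tok[6],            # S/E = task start/end, X = job execution
        "voltage_v": float(tok[7]),
        "current_ma": float(tok[9]),
    }

def events_of(trace, device=None, event=None):
    """Keep only the records matching the requested device and/or event."""
    recs = (parse_line(l) for l in trace.splitlines())
    return [r for r in recs if r is not None
            and (device is None or r["device"] == device)
            and (event is None or r["event"] == event)]

trace = """\
800 us LIS 20 us lis_sample S 1.8 V 6.5 mA
4025 us CPU 50 us lis_isr S 1.8 V 9 mA
4075 us CPU 0 us task_acquisition S 0 V 0 mA
"""
cpu = events_of(trace, device="CPU")
print(len(cpu))  # 2
```

A filter like this is what turns the "lengthy and verbose" trace into, e.g., a per-device current profile for later power analysis.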

3.2.5.3 Keil uVision

In addition to the abovementioned academic tools, the estimation of timing figures of the computational kernels, as well as of the operating system overheads (when applicable), will be performed through cycle-accurate simulation of the code using the ARM Keil uVision debugging environment [51]. When suitably configured, Keil uVision can produce profiling, coverage and timing figures at different levels of abstraction, namely assembly level, source code level and function level. Figure 3-27 shows a screenshot of the Keil GUI after a simulation session. In particular, in the topmost portion of the window, the function profiling, code coverage and instruction execution trace are visible.

Figure 3-27: Keil uVision simulation results

The Keil uVision debugger, though, is not power-aware and thus the accurate timing figures resulting from assembly-level simulation will be combined with static energy figures provided by STMicroelectronics and ST-PoliTo to derive accurate dynamic energy profiles.

3.2.6 Modelling of power and energy for batteries

Energy storage devices have a crucial role in determining the lifetime of a system, i.e., how long the system can operate autonomously from the grid or from power sources. To this end, the modelling and simulation of energy components, including batteries, is an important dimension in system design. Building an overall simulation of the energy flow would indeed allow an accurate estimation of energy consumption and losses in the system, and it would provide a forecast of system lifetime.

To accurately model battery behaviour, it is necessary to take into account not only its dynamics (as done in deliverable D3.1.2), but also its operating conditions. Battery behaviour is affected, e.g., by the runtime evolution of loads and by the power dissipation occurring in the system. For this reason, this deliverable proposes to enlarge the scope and to take into account all components, thus creating a power perspective of the system under analysis.

The goal is to trace power flows, achieving an early assessment of system behaviour and validating the dimensioning of the energy providers. In the proposed framework, power flows are represented in terms of voltage levels and current demand/production over time of the various components. For this reason, voltage (V) and current (I) are the main dimensions of each component and will be the focus of the overall simulation.

3.2.6.1 Module classification and interface modelling

Components naturally have different roles w.r.t. the power flow, i.e., each system includes components that either consume, generate, distribute, or store energy. This difference is reflected in the simulation: components with different roles have different interfaces and models.

Figure 3-28 shows the different classes of components [67]. Each system features a certain number of loads, i.e., components that require a given amount of power to implement a certain functionality (e.g., digital cores, MEMS, and RF components). Power can be provided by either energy storage devices (ESDs) or power sources. ESDs can be of different natures, ranging from batteries to supercapacitors and fuel cells. Power sources are almost infinite sources of power, such as photovoltaic cells or thermo-electric energy generators, used either for satisfying the loads' power demand or for charging the ESDs. ESDs and power sources are managed by an arbiter. The arbiter monitors the state of charge (SOC) of the ESDs, to determine whether they can provide energy or have to be charged. Furthermore, the arbiter determines which ESD or power source to use, based on the loads' requests for power. All components are connected through a power bus, which allows the energy to combine and propagate within the system (either as an ideal conductor or with some power loss). Each component is connected to the power bus through a converter module, necessary to maintain compatibility of voltage levels.
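The arbitration described above can be sketched as follows. This is a hypothetical policy written for illustration only: the function name, the preference for the power source, and the SOC threshold are assumptions, not the project's implementation.

```python
# Hypothetical arbiter step: prefer the power source when it can cover the
# load demand; surplus source power recharges the ESD; otherwise the ESD
# supplies the deficit, as long as its state of charge allows it.
def arbiter_step(load_power_w, source_power_w, esd_soc, soc_min=0.05):
    """Return (esd_power_w, charge_power_w): ESD discharge and recharge power."""
    if source_power_w >= load_power_w:
        surplus = source_power_w - load_power_w
        charge = surplus if esd_soc < 1.0 else 0.0
        return 0.0, charge
    deficit = load_power_w - source_power_w
    if esd_soc > soc_min:          # ESD may still provide energy
        return deficit, 0.0
    return 0.0, 0.0                # demand cannot be satisfied

print(arbiter_step(0.5, 0.75, 0.4))  # (0.0, 0.25): source covers the load
print(arbiter_step(0.5, 0.25, 0.4))  # (0.25, 0.0): ESD supplies the deficit
```

In the actual framework this decision is taken every simulation step, driving the En control ports of the ESDs and power sources.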

Figure 3-28. Template of the reference architecture of a system from the power perspective, including the power bus and the connected components.

It is important to note that any system may contain any number of components of each type. The fundamental requirement is that a system contains at least one ESD or one power source, to provide energy to the load components.

All components feature voltage (V) and current (I) ports. However, the different roles of components impact their simulation characteristics. Energy providers may be enabled or disabled, depending on the operating conditions (e.g., to prefer a power source over an ESD, whenever possible). They are thus provided with a control port (En) that is used by the arbiter to apply management policies. Furthermore, ESDs require intrinsic status information, to trace the available charge over time. For this reason, they have two additional status ports: the capacity (E) and the state of charge (SOC). As a result, the interface of a component strictly depends on its role inside the system, thus reflecting the information and energy flow w.r.t. the other components. Figure 3-28 recaps the resulting interfaces.

3.2.6.2 Implementation in SystemC-AMS

Components are implemented as SystemC modules (SC_MODULE).

The interface of each component reflects the standard one defined for the corresponding class, as defined in Figure 3-28. Ports are declared as SystemC-AMS TDF ports (sca_tdf::sca_in or sca_tdf::sca_out), as adopting a TDF interface is crucial to speed up simulation. The port type depends on the modelled quantity. Ports V, I, SOC and E are declared of type double (e.g., sca_tdf::sca_in<double>), as they model physical quantities or percentages. On the contrary, En ports represent activation signals and are thus declared as Boolean signals (e.g., sca_tdf::sca_in<bool>). Port direction strictly depends on the information flow.

Despite adopting a TDF interface, the modularity of SystemC allows the preferred model to be implemented at any abstraction level. The TDF interface may thus encapsulate a TDF model or an ELN model. This is a valuable feature of SystemC, since, from the modelling standpoint, energy/power models usable in system-level simulation typically belong to two main categories. Functional models implement a generic evolution of the energy flow through a function, such as an equation, a state machine or even a simple waveform over time [59][60]. On the other hand, circuit-level models emulate the behaviour of a component through an equivalent electrical circuit that mimics the component dynamics in response to the evolving system conditions by an interconnection of electrical components (resistors, capacitors, voltage/current sources) [61][62].

Functional components are implemented at the TDF level of abstraction. In TDF, the evolution of the output signals is modelled in a processing() function that is repeatedly executed at fixed time steps. The implementation of the processing() function resembles a traditional HDL process and can exploit all C++ math libraries. This makes it easy to reproduce functional models. In case more complex constructs are needed, LSF primitives can be adopted (e.g., derivatives or Laplace transfer functions). The initialize() function allows model parameters to be set (e.g., the battery nominal voltage). An example of implementation of a functional model is provided on the left-hand side of Figure 3-29, reporting the SystemC-AMS code for Peukert's functional model of a battery.
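The initialize()/processing() split can be mimicked outside SystemC-AMS; the following Python sketch reproduces the structure (parameters set once, outputs recomputed at fixed time steps) for a simplified constant-voltage battery. The class name and parameter values are illustrative, and the Peukert capacity correction is omitted here for brevity.

```python
# Rough Python analogue of a TDF functional model (not the SystemC-AMS code):
# initialize() sets model parameters, processing() runs at fixed time steps.
class PeukertBatteryTDF:
    def initialize(self, v_nom=3.6, capacity_as=8100.0):
        self.v_nom = v_nom            # nominal voltage [V]
        self.capacity = capacity_as   # nominal capacity [A*s]
        self.drawn = 0.0              # charge drawn so far [A*s]

    def processing(self, i_load, dt):
        """One fixed time step: account the drawn charge, output V and SOC."""
        self.drawn += i_load * dt
        soc = max(0.0, 1.0 - self.drawn / self.capacity)
        return self.v_nom, soc

bat = PeukertBatteryTDF()
bat.initialize()
for _ in range(100):                  # 100 steps of 1 s at 0.81 A
    v, soc = bat.processing(0.81, 1.0)
print(round(soc, 2))  # 0.99
```

The fixed-step loop plays the role of the TDF scheduler invoking processing() at every time step.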

Circuit models are instead implemented by adopting the ELN level of abstraction. ELN provides primitives for modelling electrical network elements (e.g., resistors or capacitors). Thus, circuit models can be straightforwardly implemented as an interconnection of ELN primitives, whose binding reflects the network topology. The ELN components are instantiated, configured and connected in the SystemC-AMS module constructor (SC_CTOR). A brief snapshot of code is provided on the right-hand side of Figure 3-29, reporting the SystemC-AMS code for a circuit model of a battery. The effectiveness of the proposed approach is enhanced by SystemC-AMS, as it provides a set of primitives that can easily be instantiated, configured and connected to realize the desired circuit. This eases the implementation process, as the designer does not have to manage low-level details, including energy conservation laws.

Figure 3-29. SystemC-AMS implementation of a battery: interface (top), alternative models (middle, functional and circuit, respectively), and model implementation (bottom, in TDF and ELN, respectively).

Figure 3-29 highlights that the proposed approach is highly composable and that it enhances design space exploration [88]. Indeed, both battery models have the same interface, compliant with the standard ESD interface defined in Figure 3-28. SystemC-AMS then allows the module implementation to be easily replaced, to evaluate different configurations or alternative models at different levels of detail. This is possible thanks to the flexibility of SystemC-AMS, which allows the interface definition to be separated from the implementation and different levels of abstraction to be mixed in a single module (e.g., a TDF interface for an ELN circuit model).

3.2.6.3 Models for power components

Models available in the literature strictly depend on the target class of components. This section sketches how each class can be modelled. Concerning how these models are typically built from actual devices, the standard approach is to empirically measure the physical parameters needed to populate the models. On the other hand, Deliverable 3.1.1 presented an alternative solution, based on data publicly available in battery datasheets. This approach can be adopted for any kind of device. The key idea is to classify the available model templates depending on what information is necessary to populate them. The templates are thus categorized into various levels with progressively increasing accuracy. As a result, the adopted model strictly depends on a trade-off between available information and desired level of accuracy.

The remainder of this section outlines the main models available for batteries, loads and converters that will be adopted for the experimental validation.

Batteries

We consider two popular models with different trade-offs between accuracy and complexity, namely a functional model based on Peukert’s law and a behavioural model based on a circuit equivalent of a battery. These two models are general enough to be applied to any type of battery (chemistry or form factor) and they can easily be identified based on a limited set of information typically available in battery datasheets [63].

The functional model based on Peukert's equation expresses the non-linear relation between the battery current I and the effective capacity. The lifetime LT of a battery with nominal capacity C discharged by a (constant) current I is given by

LT = C / I^k

where k > 1 is Peukert's coefficient.
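Peukert's lifetime law, LT = C / I^k with k > 1, can be evaluated directly; the sketch below uses an illustrative coefficient k = 1.2 and a capacity in A·h, neither of which is taken from a specific datasheet.

```python
# Peukert's lifetime law, LT = C / I**k (illustrative k, not fitted data).
def lifetime_h(capacity_ah, current_a, k=1.2):
    """Battery lifetime in hours under a constant discharge current."""
    return capacity_ah / current_a ** k

print(lifetime_h(2.25, 1.0))             # 2.25: at 1 A the exponent is neutral
print(lifetime_h(2.25, 2.0) < 2.25 / 2)  # True: high currents cost extra capacity
```

The second line shows the non-linearity the model captures: doubling the current more than halves the lifetime.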

Figure 3-30. Data and model population for Functional Models (Level 1).

The construction of level 1 models requires the datasheet to provide voltage-time curves (left-hand side of Figure 3-30), which contain information about battery lifetime at given discharge rates. From these curves, it is possible to fit the data onto an equation template corresponding to Peukert's curve.

Battery capacity and state of charge are estimated by reformulating Peukert’s law:

CI = C0 · (I0 / I)^(k−1)          SOC = (CI − I · t) / CI

where I is the discharge current, CI is the capacity at discharge current I, I0 is a discharge current used in the datasheets to provide the curves, and C0 is the capacity measured at I0.
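The reformulated Peukert equations, CI = C0·(I0/I)^(k−1) and SOC = (CI − I·t)/CI, translate directly into code; the sketch below uses illustrative parameter values, not data from a specific battery.

```python
# Reformulated Peukert equations as code (illustrative parameter values).
def capacity_at(c0_ah, i0_a, i_a, k=1.2):
    """Effective capacity [A*h] at discharge current i_a."""
    return c0_ah * (i0_a / i_a) ** (k - 1)

def soc(c0_ah, i0_a, i_a, t_h, k=1.2):
    """State of charge after t_h hours of constant discharge at i_a."""
    ci = capacity_at(c0_ah, i0_a, i_a, k)
    return max(0.0, (ci - i_a * t_h) / ci)

# At the reference current I0 the effective capacity equals C0:
print(capacity_at(2.25, 0.45, 0.45))  # 2.25
print(soc(2.25, 0.45, 0.45, 0.0))     # 1.0
```

A fixed-step loop calling soc() with the evolving load current is essentially what the TDF processing() function of the battery model computes.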

These equations are used to implement a SystemC-AMS TDF model that repeatedly computes the values of capacity and state of charge, given the evolving system conditions. The TDF semantics was preferred over the continuous-time LSF and ELN abstraction levels, as TDF allows the equations to be computed at fixed time steps, thus avoiding scheduling management overheads. Note that the battery voltage is constant, and its value is equal to the nominal operating voltage taken from the datasheet. The functionality is encapsulated inside the processing() TDF function, as shown in the code excerpt in Figure 3-29.

The second and more accurate model is an electrical circuit equivalent that mimics the battery behaviour. The circuit-level model used as reference for the CONTREX project has intermediate complexity and requires information provided in the vast majority of datasheets; it is shown in Figure 3-31 [63]. CM denotes the nominal capacity, IB the current drawn from the battery, VOC the open-circuit voltage of the battery, and R its internal resistance. Both the resistance and the open-circuit voltage depend on the state of charge (SOC) of the battery, represented electrically as the voltage VSOC.

Population of the model requires the availability of voltage discharge curves vs. SOC (or capacity), possibly for different values of discharge current, as shown in Figure 3-31.

Figure 3-31. Chosen circuit template for Circuit-Level Models at Level 2 (left). Data and datasheet curves used for model population.

The circuit elements are populated as follows:

Capacity CM is simply obtained by converting the nominal battery capacity CNOM (in A·h) provided with the datasheet into A·s, i.e., as CM = 3600 · CNOM.

The voltage-controlled voltage generator VB is implemented as a function VB,C1(VSOC), derived automatically by means of a curve-fitting process applied to the (voltage, SOC) points in the datasheet curve corresponding to C-rate C1. If the discharge curves for two C-rates are available, another curve fitting derives a second function VB,C2(VSOC). This allows VOC(VSOC) and R(VSOC) to be derived by solving the equations associated with the electrical circuit:

VB,C1(VSOC) = VOC(VSOC) − R(VSOC) · IC1
VB,C2(VSOC) = VOC(VSOC) − R(VSOC) · IC2

I.e., by solving:

R(VSOC) = (VB,C1(VSOC) − VB,C2(VSOC)) / (IC2 − IC1)
VOC(VSOC) = VB,C1(VSOC) + R(VSOC) · IC1
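The identification step, solving R(VSOC) = (VB,C1 − VB,C2)/(IC2 − IC1) and VOC(VSOC) = VB,C1 + R·IC1, can be sketched as follows. The two fitted curves and the discharge currents below are hypothetical stand-ins, not the result of an actual fit.

```python
# Deriving R(V_SOC) and V_OC(V_SOC) from two fitted discharge curves.
# The fitted curves below are hypothetical linear stand-ins.
def vb_c1(v_soc):          # fitted terminal voltage at C-rate C1 (I = 0.45 A)
    return 3.0 + 1.2 * v_soc - 0.045
def vb_c2(v_soc):          # fitted terminal voltage at C-rate C2 (I = 0.90 A)
    return 3.0 + 1.2 * v_soc - 0.090

I_C1, I_C2 = 0.45, 0.90    # discharge currents of the two datasheet curves [A]

def identify(v_soc):
    """Solve the two-equation system for R and V_OC at a given V_SOC."""
    r = (vb_c1(v_soc) - vb_c2(v_soc)) / (I_C2 - I_C1)
    v_oc = vb_c1(v_soc) + r * I_C1
    return r, v_oc

r, v_oc = identify(0.5)
print(round(r, 3), round(v_oc, 3))  # 0.1 3.6
```

As a sanity check, with these stand-in curves both discharge curves are consistent with a single VOC = 3.6 V and R = 0.1 Ω at VSOC = 0.5, as the circuit equations require.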

For this type of model, SystemC-AMS uses the ELN semantics, because the circuit components must be explicitly instantiated and simulated.

Loads

The term “load” comprises any component requiring a given amount of power to implement a certain functionality, e.g., digital cores, MEMS, analogue, and RF components. The functional evolution of these components is disregarded, and they are seen as black boxes that require a certain amount of current (I) at a given voltage level (V), as shown in Figure 3-28. The interface adopted for loads highlights that the focus is solely on tracing the smart system energy flows, represented in terms of voltage levels and current demand/production over time of the various components. The main consequence of this view is that models for functional components are simplistic, to the point that they may fall back to synthetic V and I traces over time.

Figure 3-32. Examples of power models for loads: execution traces for current and voltage (a), synthetic statistic traces (b), and a power state machine (c).

The most accurate models are execution traces, obtained with experimental measurements applied to the component during a typical excerpt of its execution. These models are made up of a pair of waveforms reproducing how current demand and voltage actually evolved over time during the sampled execution excerpt. The traces may be repeated periodically, to simulate longer executions. A drawback is that traces cannot be affected by the application of energy management policies, including, e.g., the application of DVFS techniques in low energy availability periods.

Figure 3-32.a exemplifies this by showing a trace for current demand (left) and one for voltage level of the component (right).

In case experimental measurements are not possible, or they are more detailed than needed for simulation purposes, they can be replaced with synthetic traces (Figure 3-32.b). The accuracy of such models strictly depends on their construction process. Typical consumption and voltage values can be extracted from the component datasheet, and the trace may be built by relying on statistical information, or on the identification of different consumption levels associated with the flow of functional execution (e.g., the fetched opcode of a core). Figure 3-32.b shows an example trace modelled as a bimodal distribution (left), where the most frequent values correspond to the typical active and idle current demands (right).

Finally, loads may be modelled as state machines listing the internal states of the component (Figure 3-32.c) [87]. Transition from one state to another may depend on overall system information or on timers, reproducing a typical execution flow of the component and its dynamic power management policy.

All these kinds of models can be implemented in SystemC or SystemC-AMS TDF.
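The power state machine model (Figure 3-32.c) can be sketched as a small class; the sleep current and the 10% SOC threshold below are illustrative assumptions, not values identified for a particular device.

```python
# A minimal power state machine load, in the spirit of Figure 3-32.c: a
# system-level signal (battery SOC) drives the state transition. Current
# values and the 10% threshold are illustrative assumptions.
class PsmLoad:
    CURRENT_MA = {"normal": 24.2, "sleep": 1.5}   # per-state current demand

    def __init__(self):
        self.state = "normal"

    def on_soc(self, soc):
        """Transition driven by overall system information (battery SOC)."""
        self.state = "sleep" if soc < 0.10 else "normal"

    def current_ma(self):
        return self.CURRENT_MA[self.state]

load = PsmLoad()
load.on_soc(0.50)
print(load.current_ma())  # 24.2
load.on_soc(0.08)
print(load.current_ma())  # 1.5
```

Transitions could equally be driven by timers, reproducing a typical execution flow, as the text notes.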

DC-DC converters

Power flows are expressed as the current drawn or generated by a component at some voltage level; it is therefore not possible to directly interface components that operate at different voltage levels, and appropriate voltage conversion is needed. Notice that the various functional components often have disparate supply voltage levels (e.g., digital, analogue, I/Os). Voltage conversion is typically implemented by converters.

The generic interface of a DC-DC converter (Figure 3-28) consists of two pairs of voltage and current signals. Functionally speaking, the DC-DC converter simply adapts input power to match output power by means of appropriate circuitry. This process can be characterized by the conversion efficiency η:

η = Pout / Pin = (Vout · Iout) / (Vin · Iin)

The difference between Pin and Pout represents the losses of the converter.

Since the DC-DC converter is an electronic device, in principle a circuit-level model consisting of the interconnection of its discrete components would guarantee the highest accuracy. However, this would require a specific model for each type of converter (e.g., switching vs. linear) and would slow down the overall simulation. Conversely, a system-level, functional model of the DC-DC converter that describes its efficiency captures the relevant non-idealities and is also independent of its implementation.

Conversion efficiency is affected, in order of relevance, by the output current Iout, the difference between input and output voltage ΔV = Vin − Vout, and the absolute values of Vin and Vout (as shown in Figure 3-33). Approximating converters with their nominal efficiency is therefore too far from the real behaviour of the device. For this reason, the efficiency is modelled as a polynomial function of the two most relevant parameters, Iout and ΔV, e.g.:

η(Iout, ΔV) = a4 · Iout² + a3 · ΔV² + a2 · Iout · ΔV + a1 · Iout + a0
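Evaluating such a polynomial efficiency template, and using it to derive the input power a converter must draw, can be sketched as follows. The coefficient values are placeholders: in the framework they would be obtained by fitting the converter datasheet curves.

```python
# Polynomial efficiency template with placeholder coefficients (to be
# obtained by fitting the datasheet efficiency curves of the converter).
def eta(i_out, dv, a=(-0.5, -0.01, -0.02, 0.1, 0.85)):
    """eta(I_out, dV) = a4*I_out**2 + a3*dV**2 + a2*I_out*dV + a1*I_out + a0."""
    a4, a3, a2, a1, a0 = a
    return a4 * i_out**2 + a3 * dv**2 + a2 * i_out * dv + a1 * i_out + a0

def input_power(v_out, i_out, dv):
    """Input power needed to deliver (v_out, i_out), from eta = Pout/Pin."""
    return v_out * i_out / eta(i_out, dv)

print(round(eta(0.1, 1.0), 4))  # 0.843
```

Note how input_power() embodies the efficiency definition η = Pout/Pin: the losses are Pin − Pout.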

Figure 3-33. Converter efficiency curves for the Linear Technology LT1613 DC-DC converter.

3.2.6.4 Experimental validation

This section analyses the proposed framework for the simulation of batteries in the context of a typical system adopted in the CONTREX project. The idea is to compare different models in terms of accuracy, and to understand how they impact the estimation of the power flows of the system.

The system under analysis reproduces an iNEMO system-on-board [84]. The iNEMO includes four main components:

An ARM®-based 32-bit microcontroller;

A 6-axis digital e-compass module;

A 3-axis digital gyroscope;

A flash memory.

All these components are grouped into two loads: Load1 is the microcontroller, while Load2 encapsulates both the sensors and the memory. The system is powered by a Panasonic CG18650CG lithium-ion battery [85]. The battery converter is a Linear Technology LT1613 [86], while the load converters are modelled as fixed-efficiency converters. The overall system is depicted in Figure 3-34.

The starting configuration is as follows. The loads have fixed voltage (3.3V) and current consumption (24.2mA for Load1, 21.9mA for Load2). The DC-DC converters have fixed efficiency: η = 84.57% for the loads and η = 83.34% for the battery. The battery is modelled with the circuit-equivalent model, and the arbiter implements no management policy (i.e., the loads draw power from the battery until its state of charge reaches 0%).
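The power-flow chain of this starting configuration can be traced back-of-the-envelope as follows. Note the caveats: the 3.6 V battery voltage is a nominal-value assumption, whereas the actual experiments use the circuit-equivalent model with a varying voltage; this snippet only illustrates how the converter efficiencies stack, and does not reproduce the lifetime figures reported below.

```python
# Back-of-the-envelope power-flow chain for the starting configuration:
# two fixed loads behind fixed-efficiency converters, then the battery
# converter. The 3.6 V battery voltage is an assumed nominal value.
ETA_LOAD, ETA_BAT = 0.8457, 0.8334
V_LOAD, V_BAT = 3.3, 3.6

p_loads = V_LOAD * (0.0242 + 0.0219)   # power consumed by the loads [W]
p_bus = p_loads / ETA_LOAD             # power on the bus, after load converters
p_batt = p_bus / ETA_BAT               # power drawn from the battery
i_batt = p_batt / V_BAT                # battery-side current [A]
print(round(p_loads, 3), round(i_batt * 1000, 1))  # ~0.152 W, ~60 mA
```

Each division by an efficiency adds the corresponding converter's losses to the demand seen upstream.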

Figure 3-34. Proposed case study. Components with the blue border are used in the exploration of models.

Load models

Figure 3-35. Comparison of different load models.

For loads, we compare three different models. Figure 3-35 compares the resulting evolution in terms of current over time (top three plots) and of state of charge over time (bottom plot) given the three configurations.

The simplest model for loads is to use their rated power consumption as fixed voltage and current values, extrapolated from the datasheets of the functional components. Assuming an operating frequency of 48MHz, Load1 has a fixed current demand of 24.2mA, while Load2 has an overall current demand of 21.9mA. The constant demand is highlighted by the first plot of Figure 3-35. The corresponding system lifetime is 60,480s (i.e., 16h 48mins), corresponding to the blue line in the bottom plot of Figure 3-35.

The second model is a power state machine model. All devices on the iNEMO system-on-board provide a normal state and a sleep state. The state is managed by the arbiter: as long as the battery state of charge is higher than 10%, devices are in the normal state; when it goes below 10%, all devices are moved to their sleep state to prolong system lifetime. The application of this policy is evident from the second plot of Figure 3-35: the loads' power consumption is the same as in the former case until time 58,800s, and until this time step the state-of-charge curve completely overlaps the former one (orange line). At time 58,800s, the state of charge of the battery crosses the 10% threshold, and the loads move to the sleep state. This lowers the power consumption of both loads, and it prolongs the lifetime of the system to 65,460s (i.e., 18h 11mins), an extension of 8% w.r.t. the former configuration.

The third model is applied only to Load1, as it is a synthetic load model based on the microcontroller execution flow. The model is instruction-based, i.e., power consumption depends on the fetched instruction. To implement this model, opcodes have been partitioned into four classes: memory opcodes (i.e., load and store), arithmetic, branches and conditions, and NOPs. As a result, the current demand flickers over time (third plot of Figure 3-35). The power management policy is still applied, and it moves the device to the sleep state as soon as the battery state of charge goes below 10%. This new model prolongs the system lifetime to 88,380s (i.e., 24h 34mins), an extension of 46% with respect to the starting configuration. This is evident from the bottom plot of Figure 3-35 (purple line).
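An instruction-based load model of this kind boils down to a per-opcode-class current table; the sketch below uses illustrative per-class currents, not the values identified for the iNEMO microcontroller.

```python
# Sketch of an instruction-based load model: the current demand depends on
# the class of the fetched opcode. Per-class currents are illustrative.
CLASS_CURRENT_MA = {"mem": 28.0, "alu": 24.0, "branch": 22.0, "nop": 15.0}

def trace_current(opcode_classes):
    """Map a fetched-instruction trace to a flickering current profile [mA]."""
    return [CLASS_CURRENT_MA[c] for c in opcode_classes]

profile = trace_current(["mem", "alu", "alu", "branch", "nop"])
print(profile)                      # [28.0, 24.0, 24.0, 22.0, 15.0]
print(sum(profile) / len(profile))  # 22.6 (average demand in mA)
```

Feeding such a profile to the battery model, one step per fetched instruction, produces the flickering current visible in the third plot of Figure 3-35.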

This experiment proves that the model adopted for loads affects the accuracy of system estimation. Also, the possibility of controlling loads (e.g., to switch them to a low power state) is crucial to prolong system lifetime and to enable more realistic scenarios, where the system should be energy-autonomous for at least 24 hours.

In the following experiments, Load1 is modelled with its instruction-based model, and Load2 is modelled with the power state machine.

DC-DC converter models

Figure 3-36. Comparison of different DC-DC converter models.

This experiment shows that an accurate model of the DC-DC converter is necessary to correctly estimate the energy dissipation in the system. For this reason, we compared the adoption of a fixed efficiency (taken from the converter datasheet, by looking at the average operating conditions of the loads) with a functional approximation of the efficiency curve (built through curve fitting from the DC-DC converter datasheet). To reduce the design space, the load converters are not part of the exploration, and their efficiency remains fixed (η = 84.57%).

The battery DC-DC converter is a Linear Technology LT1613 [86]. The functional model is built with the Matlab Curve Fitting Toolbox, from the efficiency versus load graph. The resulting model is a function of the load current demand:

η = 0.8965 · e^(b1·Iload) − 0.0631 · e^(b2·Iload)

where b1 and b2 are the exponential coefficients resulting from the fit.

Figure 3-37. Construction of the polynomial model of the battery DC-DC converter.

Figure 3-36 shows the resulting evolution of efficiency over time (top) and of the state of charge of the battery (bottom), when using the fixed efficiency (η = 83.34%) or the functional estimation. Figure 3-36 highlights that the fixed efficiency does not approximate the converter behaviour well. The flickering power consumption of the microcontroller (third plot of Figure 3-35) forces the DC-DC converter to work also in sub-optimal conditions, reaching efficiencies as low as 30%. This implies that a larger portion of power is dissipated, and it shortens the system lifetime by 17.5% with respect to the configuration with fixed efficiency (lifetime is shortened to 72,900s, i.e., 20h 15mins).

An accurate estimation of the power dissipation in DC-DC converters is thus necessary, to obtain a clearer picture of the system evolution under changing conditions. As a result, the following experiments use the functional DC-DC converter model for the battery.

Battery models

As a final experiment, we compare the accuracy of the functional Peukert’s model with the circuit-equivalent model. The system is powered by a Panasonic CG18650CG lithium-ion battery, whose main characteristics are depicted in Figure 3-38.

The Peukert’s model of the battery is built from datasheet specification, by configuring the model with the nominal voltage and capacity of the battery. The circuit-equivalent model is built from the voltage versus discharge capacity plot, reported in Figure 3-38. The resulting populated circuit is depicted in Figure 3-39.

Figure 3-40 highlights that the functional model for the battery does not approximate battery behaviour well, as it overestimates system lifetime (242,700s, i.e., 67h 25mins). This is mainly caused by the incorrect estimation of battery voltage, which is fixed at 3.6V for Peukert's model, while it degrades from 3.0V to 0.0V for the more accurate circuit-equivalent model. This affects not only battery behaviour, but also the efficiency of the DC-DC converter, as the model underestimates the power dissipation. Finally, Peukert's model estimates system lifetime by averaging the load demand, thus losing the link with load dynamics. This factor is instead captured by the circuit-equivalent model, which adapts battery voltage and state of charge accordingly, and thus provides a more faithful picture of the evolution of the state of charge over time.

Figure 3-38. Characteristics of the Panasonic CG18650CG lithium-ion battery from the datasheet.

Figure 3-39. Circuit-equivalent model for the Panasonic CG18650CG lithium-ion battery.

Figure 3-40. Comparison of different battery models.

Concluding message on the modelling of power and energy for batteries

To conclude, we may observe that the choice of power and energy models is crucial to estimate the extra-functional behaviour of a system. Battery modelling is important, as battery dynamics are complex and depend on a number of factors. However, the modelling of all system components is fundamental to have an accurate simulation of the operating conditions of the battery (see the following table, where the most faithful configuration is highlighted in blue). Thus, the designer should always consider the runtime characteristics of loads and the power dissipation caused by the DC-DC converters.

| # | Load1               | Load2               | DC-DC converter   | Battery            | Battery lifetime |
|---|---------------------|---------------------|-------------------|--------------------|------------------|
| 1 | Fixed, 24.2 mA      | Fixed, 21.9 mA      | Fixed, η = 83.34% | Circuit-equivalent | 60,480 s         |
| 2 | Power state machine | Power state machine | Fixed, η = 83.34% | Circuit-equivalent | 65,460 s         |
| 3 | Instruction-based   | Power state machine | Fixed, η = 83.34% | Circuit-equivalent | 88,380 s         |
| 4 | Instruction-based   | Power state machine | Functional        | Circuit-equivalent | 72,900 s         |
| 5 | Instruction-based   | Power state machine | Functional        | Peukert's model    | 242,700 s        |
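The Peukert-based lifetime estimation used in configuration 5 can be sketched as follows; the cell parameters (capacity, rated discharge time, Peukert constant) are illustrative placeholders, not values from the Panasonic datasheet.

```python
# Sketch of Peukert's-law lifetime estimation for a constant (averaged)
# load current. All parameter values are illustrative placeholders.

def peukert_lifetime_s(i_load_a, c_rated_ah=2.25, h_rated_h=5.0, k=1.1):
    """Estimated discharge time in seconds for a constant load current.

    Classic Peukert formulation: t = H * (C / (I * H))^k, where C is the
    rated capacity over the rated discharge time H, and k > 1 penalizes
    high discharge currents.
    """
    t_h = h_rated_h * (c_rated_ah / (i_load_a * h_rated_h)) ** k
    return t_h * 3600.0
```

Because the model sees only the averaged current, it hides the superlinear penalty of high-current phases, which is one reason a Peukert-based flow tends to overestimate lifetime for dynamic loads compared to the circuit-equivalent model.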


3.3 Modelling of thermal behaviour

There are basically three fundamental modes of heat transfer:

Conduction: the transfer of energy in solid objects and between objects that are in physical contact

Convection: the transfer of energy between an object and its environment, due to fluid motion

Radiation: the transfer of energy from the movement of charged particles within atoms that is converted to electromagnetic radiation

Because in CONTREX the modelling of temperature will essentially concern the modelling of electronic components and devices (solid objects), conduction is the primary heat transfer mechanism to be modelled. Convection and radiation occurring in the surrounding environment will nonetheless be accounted for through conditions determining how heat flows through the boundaries of the modelled object.

Heat transfer by conduction follows a parabolic Partial Differential Equation (PDE) that specifies the complete spatial and time profile of a temperature distribution within a computational domain, limited by a boundary. It has the following form:

    ρ · Cp · ∂T/∂t = ∇ · (Κ ∇T) + Q

where:

Κ is the thermal conductivity in W/mK,

Cp is the specific heat capacity (a material property that indicates the amount of energy a body stores for each degree increase in temperature, on a per unit mass basis) in J/kgK,

ρ is the mass density in kg/m3,

T is the temperature distribution, and

Q is the internal heat generation rate per unit volume in W/m3.

The solution of the PDE requires the determination of initial conditions (temperature distribution at time t0) and boundary conditions (specifying how heat is transferred to the surrounding environment) for the computational domain. The computational domain in CONTREX is the modelled electronic component.

Figure 3-41 illustrates the most common techniques that exist to solve a PDE.


Figure 3-41: Techniques to solve heat transfer PDE

As can be seen, there are basically two categories of methods: the analytical methods and the numerical methods.

Analytical methods only exist for restricted forms of the PDE, corresponding to simple geometries and simple boundary conditions. They are thus non-generic approaches to the thermal estimation problem. The use of Green's functions is likely the most popular analytical method. In mathematics, a Green's function is the impulse response of an inhomogeneous differential equation defined on a domain, with specified initial or boundary conditions. Green's functions thus provide solutions to the PDE. Their usefulness is limited, however, because unless the application domain can be readily decomposed into one-variable problems, it may not be possible to write them down explicitly.

Numerical methods are a more generic approach. They consist in first discretizing the heat transfer PDE system on a grid. This stage is called mesh generation; it divides a complex problem over a large domain into many simple equations (each relating a single independent variable and its derivatives) over many small subdomains. Then the resulting Ordinary Differential Equation (ODE) system is solved by recombining all sets of equations into a global system of equations for the final calculation. The well-known Finite Element Method (FEM), Finite Difference Method (FDM) and Finite Volume Method (FVM) exploit this principle [12]. They are implemented in several commercial tools.
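The discretization principle behind these methods can be illustrated with a minimal 1-D finite-difference example; the material constants and grid parameters below are illustrative (roughly silicon-like) and not taken from any CONTREX platform.

```python
# Minimal 1-D explicit finite-difference sketch of the heat conduction
# PDE  rho*Cp*dT/dt = k*d2T/dx2 + Q  on a rod with fixed-temperature
# (Dirichlet) boundaries. All values are illustrative placeholders.

def simulate_rod(n=20, steps=2000, k=150.0, rho=2330.0, cp=700.0,
                 q=1e7, dx=1e-4, t_amb=300.0):
    alpha = k / (rho * cp)                 # thermal diffusivity (m^2/s)
    dt = 0.4 * dx * dx / alpha             # respect explicit stability limit
    T = [t_amb] * n                        # initial condition: uniform
    for _ in range(steps):
        Tn = T[:]
        for i in range(1, n - 1):          # update interior nodes only
            Tn[i] = T[i] + dt * (alpha * (T[i-1] - 2*T[i] + T[i+1]) / dx**2
                                 + q / (rho * cp))
        Tn[0] = Tn[-1] = t_amb             # Dirichlet boundary conditions
        T = Tn
    return T
```

Even this toy example shows why the approach explodes for real 3-D geometries: the node count (and hence the ODE system size) grows with the cube of the spatial resolution.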

A major drawback of the numerical methods, however, is that the number of equations in the ODE system can easily exceed several hundreds of thousands to reach a reasonable accuracy of the thermal estimation. Solving for this huge number of unknowns requires significant CPU time and memory. While this can still be acceptable if few thermal simulations are to be run, it often proves prohibitive and limits the thermal engineer's ability to analyse the device's thermal behaviour, especially when numerous and time-varying power stimulus profiles exist and steady-state analysis is no longer sufficient.

For this reason additional techniques have been developed to overcome this limitation.


A first technique uses optimization as a means to generate a simplified model (called a Compact Thermal Model, CTM) from a qualified set of numerical simulations covering a large range of environmental conditions surrounding the modelled device. The optimization process is used to minimize the difference between the thermal estimates obtained by solving the CTM and reference estimates obtained either from simulating the original numerical model or from temperature measurements on the real device. Among these techniques the DELPHI method is by far the most popular. The method, created 20 years ago and standardized by the Joint Electron Device Engineering Council (JEDEC) [13], enables the generation of resistor-based models (exploiting the well-known analogy between the electrical and thermal domains) of discrete electronic components, seen as a single power dissipation source. Because the models are resistive only, transient simulation is not possible; only steady-state analysis can be performed. Several improvements to the original method have since been made to enable transient analysis (through the addition of capacitances to the generated models) and to address the modelling of complex components where several sources of power dissipation must be considered (as is typically the case for multi-function System-on-Chips and for System-in-Packages having several silicon chips in a single package) [14][15]. However, these improved techniques require additional reference estimation points, making the preparatory process to generate CTM models more complex and time consuming. Besides, none of these techniques allows any trade-off between accuracy and speed of simulation: the form of the compact model is entirely determined a priori by the number of power sources and the number of thermal exchange surfaces defined between the modelled device and its environment.
Thus there is no degree of freedom to improve or degrade the thermal estimation accuracy, or to refine the spatial granularity at which the thermal behaviour is observed. Last, the optimization-based methods do not generally yield satisfying results when applied to board-level devices or electronic equipment, because their simplistic modelling assumptions do not scale well with geometrical complexity.

Another approach consists in coarsening the mesh. The objective is to substitute coarse meshed structures for finely meshed ones, in other words to lump the distributed thermal system. The coarse structures are typically RC ladder networks, such as the Cauer or Foster structures. While the approach is rather straightforward for 1-D thermal conduction, where the main question is how to determine the proper number of RC pairs, it becomes much more complex for the 2-D and 3-D cases, where the major question is how to choose a proper network topology. In spite of numerous research works, there is still no generic answer to this. The RC network lumping process remains non-automatic: it requires a designer to choose the correct number and position of the RC ladders without strict guidelines, and to perform a time-consuming parametrization. This process is made even more complex by the distributed nature of thermal conduction (compared to electrical conduction), which makes lumping methods hard to apply.
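For the straightforward 1-D case, the lumped Foster network can be sketched as follows; the R/C values are arbitrary placeholders chosen for illustration, not a parametrized model of any device.

```python
# Sketch of a lumped Foster RC network: the thermal step response to a
# constant power P is  Z(t) = sum_i R_i * (1 - exp(-t / (R_i * C_i))).
# The R (K/W) and C (J/K) values below are illustrative placeholders.
import math

def foster_temp_rise(t, power_w, rc_pairs):
    """Temperature rise (K) above ambient at time t for a power step."""
    return power_w * sum(r * (1.0 - math.exp(-t / (r * c)))
                         for r, c in rc_pairs)

rc = [(2.0, 0.05), (5.0, 1.0), (10.0, 20.0)]   # three-stage Foster ladder
```

At t → ∞ the rise approaches P · ΣR_i, the steady-state thermal resistance; the designer's problem mentioned above is choosing how many such pairs to use and how to fit their values.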

Figure 3-42: Electrical flow vs. heat flow - thermal conduction is much more distributed


A third approach is the modal approach. It uses the eigenvalues, which are the “vibration” modes, and the eigenfunctions, which are the spatial forms of “vibration” for each mode, of the discretized thermal system, as one would do to study how an elastic string vibrates under the action of an arbitrary external force. It is indeed well known that the actual response of the string can be approximated quite well by considering only the first few harmonics. The same principle is applied to the study of heat conduction in order to construct a CTM. Nevertheless, the question that still remains is how to choose the “dominant” modes. There is actually no general answer to this, so the choice of important modes unfortunately requires the designer's action, which makes the modal approach a manual one. Moreover, because of the distributed nature of thermal conduction, the eigenvalues tend to be uniformly spread over a large range of values, which makes it even harder for designers to isolate the relevant modes.

A last approach relies on mathematical treatments to reduce the original ODE system into a lower-dimensional system, while still preserving essential properties of the original system, so that the reduced model and the original one exhibit very close behaviours. The reduced ODE system, which typically has a few thousand degrees of freedom at most instead of several hundreds of thousands, can be solved in a timely and easy manner using a classical ODE solver [16][17]. The Model Order Reduction (MOR) technique is certainly the most widely used method for this. It was originally developed in control theory and is applicable to first-order ODEs.

In control theory the ODE is rewritten using the state-space formulation as:

    dT(t)/dt = A · T(t) + B · u(t)
    y(t) = E · T(t)

where:

T is the internal state vector of temperatures

A is the thermal system matrix

u(t) is the power input vector

B is the input distribution matrix

y(t) is the output vector to observe the system behaviour

E is the output selection matrix

As illustrated above, the state-space formulation is made of two parts. The first one is used to calculate the state variables (T) of the system based on the power input vector (u(t)). This input vector contains the value of each heat source and varies over time. The value assigned to each heat source of the thermal model is directly the power dissipated by the corresponding component, and its time variations follow the component power profile, basically driven by the activity variations induced by the software execution in the context of CONTREX. The second part of the state-space model aims to calculate the temperature outputs from the state variable values. These outputs must be formulated as linear combinations of the state variables. Thus, the outputs can represent average temperatures, e.g. over the component area, or local temperatures, e.g. at thermal sensors approximated by a point.


The MOR basically consists in determining a low-dimensional subspace of the original system space and projecting the original system onto that subspace. As can be seen in Figure 3-44, the number of inputs and outputs is the same before and after the reduction process, while the number of equations (the dimension of the state vector T) is much smaller afterwards. The tricky aspect of any MOR is finding a subspace such that solving the projected system yields a good approximation to the original problem.
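A minimal sketch of such a projection is given below, assuming a symmetric thermal system matrix and using a naive eigenvector basis; production MOR tools use more sophisticated Krylov-subspace or balanced-truncation bases, but the projection mechanics are the same.

```python
# Sketch of projection-based model order reduction for
# dT/dt = A*T + B*u, y = E*T: project onto an orthonormal basis V.
# The eigenvector basis used here is a naive illustration only.
import numpy as np

def reduce_system(A, B, E, r):
    """Project (A, B, E) onto the r slowest-decaying eigenmodes of A."""
    w, vecs = np.linalg.eigh(A)            # assumes A symmetric (RC thermal nets)
    V = vecs[:, np.argsort(w)[-r:]]        # keep r modes closest to zero
    Ar = V.T @ A @ V                       # r x r instead of n x n
    Br = V.T @ B                           # same number of inputs
    Er = E @ V                             # same number of outputs
    return Ar, Br, Er, V
```

Note how the reduced matrices keep the input and output dimensions of the original system, while only the internal state dimension shrinks, matching the schema in Figure 3-44.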

Figure 3-43: State-space formulations before and after reduction process

Figure 3-44: General application schema of reduction

MOR can be applied to a large range of problems, basically to every description that can be analysed with standard numerical analysis (hence any type of system, from a SoC architecture to a complete piece of electronic equipment). The mathematical process exploited in this method moreover has very valuable properties: it preserves the passivity and stability of the original system, making the method very robust, and it offers means to tune the trade-off between the model size (and so the simulation speed) and the expected accuracy. The process, however, is difficult to master, and its software implementation demands a high level of quality to maintain an adequate balance between numerical precision and computation load.

In CONTREX, a MOR based approach will be used. Thanks to its high performance, it will make it possible to run transient (dynamic) thermal simulation to feed the degradation analysis and to simulate coupled functional-thermal behaviour, including run-time reaction to temperature where appropriate.


4 Description of integrated modelling approaches

This section describes how the existing and newly developed methodologies of Section 3 have been combined and used for the purposes of CONTREX.

4.1 Extra-functional property estimation to be applied in UC1 and UC3

In the following, an extra-functional property estimation flow is described that combines several existing point tools and techniques described in Section 3 and addresses the above-mentioned problems. It uses a hierarchical composition of traces as the fundamental data structure and exchange format. These traces can be tailored to the needs of the different analysis steps via stream processing. In the following, an overview of the proposed flow is given, describing its main elements, tools, and interfaces. The individual point tools and transformations used in the implementation of the flow are either commercially available, in the public domain, or results of former research projects. Figure 4-1 shows an overview of the flow. The simulation, in a virtual platform or in the form of a native simulation, gets the application, the extra-functional property models, and the platform model as input, and is driven by an environment model. This simulation generates primary traces such as clock frequency, component utilization, etc., and stream processors compute the second-level traces, such as power per component. The power mapper can process these traces, together with a component-level floorplan, into a power map, which can be used for thermal simulations. On the right side of the picture, we can see the tertiary traces. They are specific quantities whose behaviour is described in the contract specification. Contract satisfaction monitors can be used with them to validate the contracts during simulation runtime.


Figure 4-1: Extra-functional property estimation flow

4.1.1 System simulation

The system simulation step is aimed at gathering the evolution of extra-functional properties over time. Two different simulation methods will be used: high-level native simulation and virtual platform (VP) based simulation. These methods provide different trade-offs between accuracy and speed: while VP-based simulation is expected to provide more accurate estimations of system performance than native simulation, its simulation time is expected to be significantly greater. Thus, they should not be strictly considered mutually exclusive alternatives, but rather different simulation approaches at different abstraction levels.

These different approaches are addressed below.

4.1.1.1 High Level Native Simulation

In this approach, a native simulation of the system's execution is performed. Figure 4-1 shows how native simulation is integrated in the generic extra-functional properties flow. In this variant, native simulation enables a fast and early estimation of time and power performance metrics, and plays the role of a “top-down estimation” approach which meets the bottom-up trace-based approach in the middle.


Specifically, the system model serves to derive an early and fast native simulation (blue box) capable of providing high-level estimations of time and power performance (green box to the right of the “Native Simulation” blue box).

As represented by the downwards arrow from the “Native Simulation” blue box, the native simulation can serve to report primary traces (Section 4.1.1.2). The traces derived from VIPPE shall have the same components, while their accuracy may be lower than that of the traces derived from binary translation (i.e., the Open Virtual Platform, OVP). However, these traces should also enable a high-level estimation of temperature, and thus the possibility to consider these extra-functional properties when filtering the design space already at the system level enabled by native simulation.

A first assessment of the possibility to derive the primary traces from VIPPE indicates that at least most of the components of the traces, such as the dynamics of the voltages and frequencies and the activity of the system, can be supported. For instance, it has been concluded that VIPPE can be extended to support the modelling of Dynamic Voltage and Frequency Scaling (DVFS), and thus to support the reporting of voltage and frequency traces. The completion of this assessment, the eventual development of the tracing, and the evaluation of the accuracy achieved are ongoing tasks.

There are additional specificities in the flow shown in Figure 4-1. Regarding the model, it will include sufficient information to enable the generation of an executable simulation-based performance model. Such information consists of a system model coupled to an environment model (e.g., captured in UML/MARTE in Use Case 1). The system model, in turn, encompasses the description of the application and platform architectures, and the different mappings at the different layers of the system description (e.g., thread to RTOS instance, RTOS instance to processing resource). Moreover, the functional code associated with each application component will be part of the input model, since this is a requirement for native estimation.

In Use Case 1, the complete input model is composed, on one side, of the system and environment model, captured in UML/MARTE, and on the other, of the source code of the functionality associated with the application components. All this information is used as input to a performance model generator (which will be part of the CONTREX Eclipse Plugin, in short CONTREP, reported in D2.2.1). The performance model generator of CONTREP will produce the VIPPE executable performance model, which can be connected to and interact with the exploration tool. The native simulation technology underlying the VIPPE performance estimation tool will enable a fast estimation of the time- and power-related extra-functional properties.

4.1.1.2 Extended Virtual Platform Simulation

In this approach, the basis for gathering the application-dependent evolution of extra-functional properties over time is the execution of the application in a virtual platform that is extended by three means.

First, power models are attached to the simulator defining possible power modes for each component that are characterized by a supply voltage, a clock frequency, and activity metrics.

Secondly, the simulator itself needs to have an introspection mechanism in order to be aware of power mode changes that may either be taken automatically due to inherently implemented power management policies or due to sleep state transitions that are triggered by the application.


Within CONTREX, the Open Virtual Platform (OVP) simulator will be used and adapted, offering an extended API via the M*SIM extension.

Thirdly, the simulation is extended by a tracing engine that records the extra-functional properties as they appear, as described in the following.

Virtual Platform for ZYNQ provided by Cadence

A preconfigured virtual platform for the Xilinx ZYNQ SoC is provided by Cadence (Figure 4-2).

Figure 4-2: Cadence Virtual Platform for ZYNQ

The virtual platform uses OVP components, for instance the ARM processor models, but also other components added by Cadence. It implements the main parts of the ZYNQ processing system, such as the ARM dual core, the communication system, and a large number of interfaces. Further custom components can be integrated in the form of TLM-2.0-conformant SystemC modules. Finally, the platform allows connecting some of the virtual interfaces to real interfaces of the host computer; e.g., the Ethernet interface of the virtual platform can be mapped to the network interface of the host computer to enable access to a real network from within the virtual platform. In CONTREX, this virtual platform will be extended as described above.

4.1.1.3 IP-XACT enabled integration of extra-functional models into virtual platforms

We propose a framework for the concurrent simulation of both functionality and extra-functional properties. The latter are modelled as different information flows, managed by dedicated “virtual buses” and formalized through the adoption of IP-XACT.

The goal of this activity consists in reducing both functionality and extra-functional properties to a single simulation framework. The proposed approach envisions a multi-layer, bus-centric simulation framework. The approach is multi-layer in that it is structured hierarchically, with each property corresponding to a simulation layer; each component is then implemented as a set of property-specific models. The approach is bus-centric in that each property is simulated by adopting a specific “virtual bus” that conveys and elaborates property-specific information to derive the property-specific status of the overall system. This allows reducing all layers to a common structure, easing synchronization and information exchange.

The resulting framework is thus composed of a number of property-specific models per component, managed by property-specific buses and bus managers in charge of aggregating information [67]. The layered, bus-centric approach allows each property to be considered separately, even when layers are simulated simultaneously in a single simulator instance. To this purpose, a fundamental step is to determine, for each property, (1) the information flows, (2) the role of the components, and (3) the interfaces, together with (4) the information exchanged with other layers, and (5) the specific role of both the bus and the manager. Each layer is characterized by four main kinds of information:

Layer-Specific Signals: These are the main property-specific signals that are tracked by the simulation engine in each layer. Power behaviour is determined by simulating voltage and current signals. Temperature is the defining characteristic of the temperature layer, while the reliability layer traces the failure rate of system components.

Inter-Layer Signals: These signals are used to exchange information between layers. Since all layers are simulated in a single simulation run, these inter-layer information exchanges occur at run time, and each layer reacts to changes in those signals instantaneously. Figure 4-3.c shows these inter-relations among layers: the power layer requires workload information from the functional layer (e.g., idle/active duty cycle). The temperature layer requires both workload and power information, while at the same time temperature information is needed to estimate static power at the power layer. Finally, the reliability layer requires information produced by all the other layers (reliability is in fact affected by component usage, temperature, or current density).

Role of Bus and Manager: The role of the bus and of the manager determines the simulation semantics of each layer. For instance, in the temperature layer, the semantics corresponds to a circuit-level simulation equivalent of the thermal network, while at the reliability layer the semantics is an aggregation function that determines the overall failure rate from local values. The role of the manager is related to

o possible “protocols” governing the interconnection among the components (as in functional simulation, where the bus is a physical component implementing specified rules)

o possible policies that govern the system evolution (e.g., disabling a component based on some condition).

Layer-Specific Data/Information: These data refer to additional information essential for the simulation but not involved with the simulation semantics. As an example, when simulating temperature we need to know both the floorplan of the components (to determine heat exchange) and the three-dimensional geometrical structure of the system.
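The power/temperature coupling between layers mentioned above (temperature feeding back into static power) can be sketched with a simple exponential leakage model; the coefficients are placeholders for illustration, not a characterized device model.

```python
# Illustrative sketch of the inter-layer power/temperature coupling:
# static (leakage) power rising roughly exponentially with temperature.
# p0_w, t0_k and beta are placeholder values, not a real device model.
import math

def static_power_w(temp_k, p0_w=0.1, t0_k=300.0, beta=0.02):
    """Leakage power at temp_k, normalized to p0_w at reference t0_k."""
    return p0_w * math.exp(beta * (temp_k - t0_k))
```

In the layered framework, the temperature layer would feed `temp_k` back to the power layer at run time, closing the loop that makes single-run co-simulation of the layers worthwhile.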


Figure 4-3 IP-XACT based generation of a mixed functional/extra-functional virtual platform.

To ease formalization and automation of the proposed framework and to generalize its applicability, interfaces are described by means of IP-XACT [68], a standard XML format for describing interfaces of digital IPs and systems, with mechanisms for its extension, thus allowing support for extra-functional domains to be added. IP-XACT supports three main description schemas. A component description essentially contains the interface of an IP, provided as the list of its ports. A design description represents the instances of components in a system and the interconnections between such instances. A design configuration description stores additional configuration information that can be used at later design stages, e.g., by tool-chains.

Each system component is provided with one IP-XACT component description per layer, describing the layer-specific interface. IP-XACT component descriptions continue by listing the layer-specific ports of the component, i.e., the signals used for intra-layer communication. Extra-functional ports cannot be reduced to integers or bit vectors, as they are rather real values that model a continuous physical evolution. They are thus represented using the IP-XACT extension for analogue mixed-signal modelling defined in [69], and they are associated with a default value, annotated with the measure unit (e.g., Volt or Ampere) and a prefix (e.g., kilo or milli). The overall system is provided with one IP-XACT design description per layer, defining the connections between the layer-specific ports of components. Each description lists the components involved in the layer, i.e., those implementing a model for the corresponding extra-functional property. IP-XACT design configuration descriptions are adopted for modelling layer-specific data, e.g., floorplan coordinates and material characteristics, to preserve homogeneity in information management.

All these IP-XACT descriptions, together with traditional IP-XACT descriptions for functional components, represent the set of information (depicted in Figure 4-3.a) which should be used to generate the virtual platform configuration using the HIFSuite tool (Figure 4-3.b). This activity is ongoing. The proposed approach is used to integrate power models in the virtual platform. Such models can be in the form of power state machines generated by different techniques (e.g., those described in Section 4.1.3.2).

4.1.1.4 Observable Properties

Primary traces are transient traces of observable properties that can be directly observed during the execution of an application on the extended virtual platform or during the high-level native simulation. The purposes of these traces are to derive the dissipated power over time per component, in order to feed a subsequent transient temperature estimation, and to establish a link between the application execution and its power consumption, to validate whether defined constraints on the power budget are met. Thus, the primary traces need to cover every parameter that is necessary to compute the power consumption trace per component and to link it with the application execution:

Transient supply voltage traces per component: Multiple voltage domains in a heterogeneous SoC (e.g. different cores, DSPs or IO blocks) can operate at different supply voltages that may even change over time due to power management techniques. A single voltage domain can cover multiple components.

Transient clock frequency traces per component: Power management techniques can be applied at component level and often rely on a reduction of the clock frequency. Thus the clock frequency needs to be traced in order to derive the power consumption.

Transient activity information per component: The activity is the fundamental primary parameter for dynamic power. The concrete metric of activity being traced within the system simulation depends on the needs of the targeted power model and will be elaborated within the project. One possible implementation would be to trace the switched capacitance as a measure of activity.

Transient function call traces per component: To establish a link between the power consumption and the application, tracing function calls is an adequate approach. Furthermore, it enables the use of design space exploration methods.
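A stream processor combining these primary traces into a secondary power trace could be sketched as below; the trace representation (lists of time/value samples, aligned across traces) and the sample values are hypothetical, using switched capacitance as the activity metric and the classic P_dyn = C_sw · V² · f relation.

```python
# Sketch of a stream processor turning primary traces (supply voltage,
# clock frequency, switched capacitance as the activity metric) into a
# secondary dynamic-power trace per component. The trace format here
# (aligned lists of (time, value) samples) is a hypothetical choice.

def dynamic_power_trace(volt, freq, csw):
    """Combine aligned primary traces via P_dyn = C_sw * V^2 * f."""
    return [(t, c * v * v * f)
            for (t, v), (_, f), (_, c) in zip(volt, freq, csw)]

volt = [(0.0, 1.0), (1.0, 0.8)]        # V; DVFS steps down at t = 1 s
freq = [(0.0, 1.0e9), (1.0, 0.5e9)]    # Hz
csw  = [(0.0, 1.0e-9), (1.0, 1.0e-9)]  # F switched per cycle (activity)
```

The example shows why voltage, frequency, and activity must all be traced: the DVFS step at t = 1 s reduces the dynamic power through both the quadratic voltage term and the frequency term.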

4.1.2 Modelling of timing

4.1.2.1 Model of time for OVPsim

OVPsim is basically a functional platform simulator that uses binary translation to reach its high simulation performance. The simulated time of the platform is estimated using the processor model's specific instructions-per-second parameter: the simulation kernel counts the executed instructions and divides them by this rate to give an approximate timing of the simulation. This is not very accurate, since in most processor architectures many instructions need more than one clock cycle to execute. Use case 1a, the multi-rotor demonstrator described in [4], uses Xilinx MicroBlaze soft cores to process its safety-critical flight algorithms. OVP offers a processor model of this MicroBlaze soft core that is able to execute binary code generated by the original Xilinx toolchain for the real system.

To obtain a better estimate of the executed cycles and the timing of the virtual platform, a quasi-cycle-accurate timing model [79] was developed for the MicroBlaze processor model. Figure 4-4 presents an overview of the connection of the timing model to the virtual platform.


The structure of the virtual platform is only an example which was used for the evaluation of the timing model.

Figure 4-4 Overview of the test virtual platform and the connection of the timing model

The timing model uses the Innovative CpuManager Interface (ICM) API [3]. This API offers functions to obtain the needed run-time information about the virtual platform. No further connection of the timing model to the virtual communication channels or buses is needed.

The goal of the quasi-cycle-accurate timing model is to improve the accuracy without incurring the simulation slow-down of a truly cycle-accurate processor model. This is done by using information about the instructions executed on the processor model, the processor instruction set, and its microarchitecture.

The instruction-accurate timing model assumes that each instruction is executed in one processor clock cycle. This is not the case for most real processors, where instructions may require multiple clock cycles. In addition, there are effects like pipeline hazards that prevent the execution of one instruction in each cycle. Hence, the execution of n instructions normally requires n + x cycles. The quasi-cycle-accurate timing model estimates the x by taking the above-mentioned effects into account. In very complex processor architectures, it is also possible that more than one instruction is executed in a single cycle. This is not considered in our work, because our focus is on simple processor architectures.
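The cycle accounting described above can be sketched as follows. The instruction record and the values in the usage example are illustrative assumptions, not the actual MicroBlaze instruction data used in the timing model:

```c
#include <assert.h>

/* Sketch of the quasi-cycle-accurate accounting: each executed
 * instruction contributes its base cycles, plus optional extra cycles
 * (e.g. for a taken conditional branch), plus pipeline stall cycles,
 * so n instructions cost n + x cycles overall. */
typedef struct {
    unsigned base_cycles;   /* minimum cycles for this instruction  */
    unsigned extra_cycles;  /* added when a special condition holds */
} instr_timing_t;

/* Accumulate the cycles of one executed instruction into the running
 * total; stall cycles are estimated separately by the pipeline model. */
unsigned long account_instruction(unsigned long total,
                                  const instr_timing_t *it,
                                  int condition_taken,
                                  unsigned stall_cycles)
{
    total += it->base_cycles;
    if (condition_taken)
        total += it->extra_cycles;  /* e.g. taken conditional branch */
    return total + stall_cycles;
}
```

Dividing the accumulated cycle count by the platform clock frequency then yields the simulated time that is passed to the simulator.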

The internal structure of the overall timing model is shown in Figure 4-5 (for the present implementation, OVP version 20150901 was used).


Figure 4-5 Structure and more detailed information about the components of the timing model

Its first stage is the Instruction and Event Sniffer that is integrated into the virtual platform by using the ICM API. It is called for every executed instruction on the processor model. A fetch callback, pointing to the specified address range of the processor, is registered to get access to the executed instructions:

void icmAddFetchCallback (
    icmProcessorP processor ,
    Addr          lowAddr ,
    Addr          highAddr ,
    icmMemWatchFn writeCB ,
    void*         userData );

For the timing model, the range embraces the entire instruction memory area. That way, the callback function writeCB is called upon every instruction fetch. The callback function's signature is:

#define ICM_MEM_WATCH_FN(_name) \
    void _name ( \
        icmProcessorP processor , \
        Addr          address , \
        Uns32         bytes , \
        const void*   value , \
        void*         userData , \
        Addr          VA )


typedef ICM_MEM_WATCH_FN((*icmMemWatchFn));

With the processor pointer and the fetch address, it is possible to access the current instruction opcode. The fetch callback function uses another ICM API function to read the value at the specified address:

bool icmReadProcessorMemory (
    icmProcessorP processor ,
    Addr          simAddress ,
    void*         buffer ,
    Uns32         bytes );

The fetch callback function forwards the returned instruction opcode to the Processor Instruction Model, which selects the correct Instruction Class object. For each instruction, a single instruction object is stored that includes all needed data: name, instruction size (16/32 bit), bitmask type for further decoding of registers, needed registers, result return stage of the pipeline, needed number of base cycles, number of extra cycles (for example, if a conditional branch is taken), and Boolean flags for special instruction handling (like branch-if or division). These Instruction Class objects are used in the Cycle Model and the Pipeline Model. The Cycle Model uses the number of base cycles and the Boolean flags for special instruction handling. If special instruction handling is indicated, the Cycle Model analyses whether extra cycles are needed. If necessary, it adds them to the number of base cycles for the current instruction and returns the cycle count value to the Processor Instruction Model. To decide whether extra cycles are required for an instruction, register values are needed in some cases. The following function writes the value of a named register into a specified buffer:

bool icmReadReg (
    icmProcessorP processor ,
    const char*   name ,
    void*         buffer );

The Pipeline Model estimates the effects of pipeline hazards. For that purpose, it stores the last i fetched instructions, where i is the number of available pipeline stages. It analyses whether the current instruction needs to read or write any registers that are in use in other pipeline stages. If the model finds such dependencies, it estimates the occurrence of pipeline stalls using the current number of instruction cycles and the distance between the pipeline stages. Furthermore, a simple branch prediction model is integrated to handle the flushing of the pipeline in the case of a wrong prediction. The stall cycles are returned to the Processor Instruction Model and accumulated with the cycles determined by the Cycle Model. The Instruction and Event Sniffer retrieves the final cycle value for the current instruction and computes the overall consumed time (overall cycles divided by the defined clock frequency of the platform). The following API call uses this absolute time to advance the OVP simulation time:

bool icmAdvanceTime ( icmTime time );
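The hazard check of the Pipeline Model can be sketched as follows. The pipeline depth, field names, and stall formula are illustrative assumptions, not the actual MicroBlaze pipeline model:

```c
#include <assert.h>

#define STAGES 5  /* assumed pipeline depth */

/* Sketch of the Pipeline Model's hazard check: the last STAGES issued
 * instructions are kept in a ring buffer, and a stall is estimated
 * when the current instruction reads a register whose value an
 * in-flight instruction has not yet produced. */
typedef struct {
    int dest_reg;        /* register written by the instruction, -1 if none */
    int writeback_stage; /* pipeline stage in which the result is ready     */
} in_flight_t;

typedef struct {
    in_flight_t slot[STAGES];
    int head;            /* index of the most recently issued instruction   */
} pipeline_t;

void pipeline_init(pipeline_t *p)
{
    for (int i = 0; i < STAGES; i++) {
        p->slot[i].dest_reg = -1;
        p->slot[i].writeback_stage = 0;
    }
    p->head = 0;
}

/* Estimate the stall cycles for an instruction reading src_reg, then
 * record its own destination register in the ring buffer. */
int pipeline_issue(pipeline_t *p, int src_reg,
                   int dest_reg, int writeback_stage)
{
    int stalls = 0;
    for (int d = 1; d < STAGES; d++) {  /* d = distance in program order */
        const in_flight_t *prev =
            &p->slot[(p->head - (d - 1) + STAGES) % STAGES];
        if (prev->dest_reg >= 0 && prev->dest_reg == src_reg) {
            int gap = prev->writeback_stage - d; /* cycles until ready */
            if (gap > stalls)
                stalls = gap;
        }
    }
    p->head = (p->head + 1) % STAGES;
    p->slot[p->head].dest_reg = dest_reg;
    p->slot[p->head].writeback_stage = writeback_stage;
    return stalls;
}
```

A real model would additionally distinguish read and write ports and integrate the branch prediction handling mentioned above.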

Furthermore, for the evaluation of our timing model, a memory read callback is registered (analogous to the fetch callback) at the address of the timer count register to recognize the start and end points of the benchmarks.


4.1.3 Modelling of power and energy

4.1.3.1 Abstracted power model for ZYNQ platform

Next to the timing, the power consumed by the embedded platform is of interest. Since the Xilinx ZYNQ is a very complex system composed of many individual components in one package, its power consumption depends on many different factors. To estimate the power consumption of the ZYNQ, Xilinx offers the Xilinx Power Estimator (XPE) [1], which was mentioned in section 3.2.1. It is an Excel spreadsheet that computes the overall power consumption of the ZYNQ system for a given static configuration. This makes it too inflexible to be used as a power model connected to the virtual platform, whose state changes while executing applications. To solve this problem, OFFIS developed a power model that is based on the XPE with some minor simplifications. It is available as a C implementation. Like the ZYNQ itself, the power model is divided into two parts: the processing system (mainly the ARM dual-core) and the programmable logic. These two parts are again divided into subparts. The power estimation of the ARM processing system is computed from the required dynamic power as well as the leakage. Input parameters are the current clock rates of the ARM cores, the memory, the AXI bus, and the I/O ports, each in [MHz]. In addition, the number of used cores (0, 1 or 2) and the bandwidth of the AXI bus (32 or 64 bit) have to be chosen. Furthermore, some data about the utilization of the processing system has to be given to the model: the load of the ARM cores, the read and write rates of the memory, and the utilization of the AXI bus, each as a fraction between 0 and 1. Much of this information is available from the configuration of the Xilinx Vivado project for the given platform, such as the number of used ARM cores, their clock frequency, the bandwidth and clock frequency of the AXI bus, and the clock frequency of the I/O ports.
The utilization parameters have to be estimated from the given application that is executed by the processing system. The static power consumption (leakage) of the processing system needs no further input parameters.

// Compute the dynamic power of the processing system within the
// Zynq 7020 under a few assumptions
// int nCores           number of active cores [0-2]
// double clk_cpu       clock frequency of the CPU in [MHz]
// double load_cpu      load of the processors as a fraction [0-1]
// double clk_mem       clock frequency of the memory in [MHz]
// double readrate_mem  read rate of the external DDR3 memory [0-1]
// double writerate_mem write rate of the external DDR3 memory [0-1]
// double clk_axi       AXI clock frequency in [MHz] --> 0 if not used
// double usage_axi     usage rate of the AXI interface [0-1]
// int axi_bw           bit width of the AXI interface; can be either 32 or 64
// double clk_io        clock frequency of the IO in [MHz]
double ps_dynpower_estimation( int nCores , double clk_cpu ,
    double load_cpu , double clk_mem , double readrate_mem ,
    double writerate_mem , double clk_axi , double usage_axi ,
    int axi_bw , double clk_io );

// compute the leakage power of the processing system within the


// Zynq 7020 under a few assumptions
double ps_leakpower_estimation();
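To illustrate how this interface is meant to be driven, the following stand-in implements a drastically simplified linear power model behind the same kind of signature. The coefficients and the linear form are invented for illustration only; they are NOT the calibrated XPE-derived values of the actual model:

```c
#include <assert.h>

/* Invented per-MHz coefficients (mW/MHz), for illustration only. */
static const double K_CPU = 0.55; /* per loaded core        */
static const double K_MEM = 0.20; /* per unit memory traffic */
static const double K_AXI = 0.05; /* per AXI byte lane       */
static const double K_IO  = 0.01; /* per MHz of IO clock     */

/* Simplified stand-in for the dynamic power estimation of the
 * processing system; same parameter list as the real function. */
double ps_dynpower_estimation_sketch(int nCores, double clk_cpu,
                                     double load_cpu, double clk_mem,
                                     double readrate_mem,
                                     double writerate_mem,
                                     double clk_axi, double usage_axi,
                                     int axi_bw, double clk_io)
{
    double p_cpu = nCores * load_cpu * K_CPU * clk_cpu;
    double p_mem = (readrate_mem + writerate_mem) * K_MEM * clk_mem;
    double p_axi = usage_axi * K_AXI * clk_axi * (axi_bw / 8);
    double p_io  = K_IO * clk_io;
    return p_cpu + p_mem + p_axi + p_io; /* dynamic power in mW */
}
```

The virtual platform would call such a function periodically, feeding it the utilization values observed during simulation.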

The power estimation of the programmable logic is much more complex, since it has many configurable parameters. Its power computation is therefore divided into six functions: five for the dynamic power and one for the static power consumption. Many of the needed input parameters of the dynamic power estimation of the programmable logic are also available in the corresponding Xilinx Vivado project, e.g. clock frequencies and the quantities of elements such as digital signal processors (DSPs) and look-up tables (LUTs). Again, the utilization parameters have to be estimated from the given application. The following function computes the dynamic power consumption of each clock tree used in the programmable logic. It requires the clock frequency and the fan-out of the clock tree; the fan-out describes the number of elements synchronized by the clock tree. Furthermore, the type of the clock tree is needed to obtain information about the dimension of the clocked area. The average rates of enabling the clock buffer and the slice clock have to be estimated by the user from the given application.

// compute the dynamic power of the programmable logic CLOCK TREE
// within the Zynq 7020 under a few assumptions
// double clk                 clock frequency of the programmable logic in [MHz]
// int fanout                 number of registers and other synchronous
//                            elements clocked
// double clock_buffer_enable average fraction [0-1] of time the clock
//                            buffer is enabled
// double slice_clock_enable  average fraction of time the slices are
//                            clock enabled
// int type                   type of the clock: 0=Global, 1=Regional,
//                            2=IO, 3=CMT, 4=Other
double pl_dynpower_core_clock_estimation( double clk , int fanout ,
    double clock_buffer_enable , double slice_clock_enable , int type );

The power consumption of the DSP slices of the programmable logic can also be computed by a dedicated function. The number of DSP slices, the clock frequency, and the information about activated multipliers, pipeline registers and pre-adders can be obtained from the Xilinx Vivado toolchain. The toggle rate of the outputs has to be estimated from the executed application.

// compute the dynamic power of the programmable logic DSP within
// the Zynq 7020 under a few assumptions
// int dsp_slices     count of DSP slices with the following configuration
// double clk         clock frequency of the DSP slices in [MHz]
// double toggle_rate toggle rate of the DSPs' outputs [0-1]
// bool mult_used     enabling the multiplier in DSPs
// bool mreg_used     enabling the pipeline registers in DSPs
// bool preadd_used   enabling the pre-adder in DSPs
double pl_dynpower_core_dsp_estimation( int dsp_slices , double clk ,
    double toggle_rate , bool mult_used , bool mreg_used ,
    bool preadd_used );


The programmable logic also contains LUTs that can be used to implement logic functions, shift registers, SRAM cells, or standard registers. The dynamic power consumption of each can be estimated by the core logic function. The quantities of the mentioned elements are listed in the synthesis report of Xilinx Vivado, as is the clock frequency. The toggle rate of the logic elements and the average fan-out of the logic elements have to be estimated. The fan-out of each element can be dynamic, since the path length and the amount of switched capacitance can change over time.

// compute the dynamic power of the programmable logic LOGIC
// within the Zynq 7020 under a few assumptions
// double clk        clock frequency of the programmable logic in [MHz]
// int logic_luts    count of used logic LUTs
// int shiftreg_luts count of used shift registers
// int sram_luts     count of distributed RAMs
// int registers     count of used registers
// double toggle     average toggle rate of the described LOGIC [0-1]
// double avg_fanout average fan-out of the signals
double pl_dynpower_core_logic_estimation( double clk , int logic_luts ,
    int shiftreg_luts , int sram_luts , int registers , double toggle ,
    double avg_fanout );

The Mixed-Mode Clock Manager (MMCM) is able to generate multiple clocks from a given clock source. Its power estimation can be computed by the following function. All needed information, such as modes, clock frequencies, and other parameters, e.g. clock dividers and multipliers, is configured by the user in the Xilinx toolchain project. The average power-down time of each used MMCM has to be estimated by the user, like all other average rates.

// compute the dynamic power of the programmable logic MMCM within
// the Zynq 7020 under a few assumptions
// int mmcm_or_pll_mode    0=mixed-mode clock manager (MMCM),
//                         1=phase-locked loop (PLL)
// double clk              clock frequency of the clock manager [MHz]
// int phase_shift         phase shift of the clock: 0=None, 1=Fixed, 2=Dynamic
// double divide_counter   clock frequency divider [1-128]
// double multiply_counter clock frequency multiplier [1-64]
// int clock{0-6}_divide   explicit clock dividers for outputs 0-6 [1-128]
// double power_down       average power-down time [0-1]
double pl_dynpower_core_mmcm_estimation( int mmcm_or_pll_mode , double clk ,
    int phase_shift , double divide_counter , double multiply_counter ,
    int clock0_divide , int clock1_divide , int clock2_divide ,
    int clock3_divide , int clock4_divide , int clock5_divide ,
    int clock6_divide , double power_down );


The programmable logic also contains Block RAM (BRAM) components for applications that need RAM structures. The BRAMs can be configured in many different modes. The BRAM components are dual-ported, so the two ports can be parametrized independently. The following function is available for each BRAM configuration.

// compute the dynamic power of the programmable logic BRAM within
// the Zynq 7020 under a few assumptions
// int blockrams      count of used BRAMs
// int mode           mode in which the BRAM is used: 0=RAMB18, 1=RAMB36,
//                    2=RAMB18SDP, 3=RAMB36SDP, 4=RAMB36SDP_ECC, 5=FIFO18,
//                    6=FIFO36, 7=FIFO18_36, 8=FIFO36_72, 9=FIFO36_72_ECC,
//                    10=CASC (pair)
// double togglerate  average toggle rate of the data signals [0-1]
// double clkA        clock frequency of port A
// double enablerateA average enable rate of port A [0-1]
// int bitwidthA      bit width of port A: 0,1,2,4,9,18,36,72
// int writemodeA     write mode of port A: 0=READ_FIRST, 1=WRITE_FIRST,
//                    2=NO_CHANGE
// double writerateA  average write rate of port A [0-1]
//                    (read rate = 1 - write rate)
// double clkB        clock frequency of port B
// double enablerateB average enable rate of port B [0-1]
// int bitwidthB      bit width of port B: 0,1,2,4,9,18,36,72
// int writemodeB     write mode of port B: 0=READ_FIRST, 1=WRITE_FIRST,
//                    2=NO_CHANGE
// double writerateB  average write rate of port B [0-1]
//                    (read rate = 1 - write rate)
double pl_dynpower_core_bram_estimation( int blockrams , int mode ,
    double togglerate , double clkA , double enablerateA , int bitwidthA ,
    int writemodeA , double writerateA , double clkB , double enablerateB ,
    int bitwidthB , int writemodeB , double writerateB );

Finally, the static power consumption (leakage) of the programmable logic is computed by a function that needs no further inputs.

// compute the static power of the programmable logic within
// the Zynq 7020 under a few assumptions
double pl_leakpower_estimation();

4.1.3.2 Automatic generation of PSMs

In most cases, black-box IP components from third-party vendors do not come with power information. The reason is that power strongly depends on the target technology used, and the vendor cannot provide power and energy values for every target technology. Therefore, the user has to 1) estimate the power for the used technology, depending on the usage scenarios, and 2) create the corresponding PSM.


In a PSM, the energy behaviour of the IP is associated with a set of states. In its simplest form, the power consumption of each PSM state is modelled as a constant value derived from a designer estimate or from the IP's data sheet [70], [71]. When a higher level of accuracy is desired and more precise information about the IP's energy behaviour is available, the power consumption of a PSM state is computed by a more complex function. For example, in [72], [74], such a function is derived by means of a calibration process based on linear regression, which exploits, as a golden reference, power traces generated at gate level, where the IP's power consumption can be estimated more precisely. However, despite the wide adoption of PSMs, in most works either the presence of PSMs is assumed [70], [71], [76] or they are manually defined starting from a more or less precise knowledge of the functional blocks composing the target IP [73], [75]. In the past, OFFIS proposed automatic approaches to create the association between PSM states and their power consumptions, but the identification of such states remained manual [73], [74]. To ease the manual characterisation, an Eclipse plug-in was developed as described below.

To allow a more precise definition of PSMs, in CONTREX, EDALab and OFFIS have started the definition of a fully-automatic methodology for the generation of PSMs. A visiting researcher from OFFIS spent three months at EDALab for setting up the cooperation. The collaboration between OFFIS and EDALab is based on the integration between an assertion mining approach defined by EDALab for extracting functional behaviours of the design and mapping them on the states of a PSM, and a calibration approach identified by OFFIS that allows characterizing the energy consumption of data- and time-dependent IP components. Some preliminary information on this approach is reported below.

Manual characterization through Eclipse plug-in

This plug-in can be used by the designer of an IP component for a systematic power characterization and state-based power model abstraction. Furthermore, the plug-in can be used to share the PSM model together with its characterization data for future use and refinement. The inputs required by the plug-in are a reference gate-level power trace and I/O traces for specific use cases. Based on these data, the PSM and PrSM, including shared state variables, are developed in an incremental process. Initially a PSM, a PrSM, and optionally state variables are created. Based on these, a system-level power trace can be generated. This trace can be compared against (i.e. overlaid with) the gate-level power trace. By further correlating the I/O traces with the gate-level power trace, the state machines and state variables can be further improved and refined until the desired model accuracy has been reached.


Figure 4-6: PSM plug-in GUI

As shown in Figure 4-6, the plug-in has five windows. The Package Explorer window handles all projects, including the model files in XML format and the input traces in VCD format representing the individual use cases. The Power Simulation Editor shows the gate-level and system-level power traces, which are graphically overlaid for comparison. Furthermore, this window shows selected I/O traces on the time axis of their corresponding power traces; this way, I/O and power traces can be directly related and correlated. The user interface allows scaling and zooming the traces as well as enabling/disabling individual traces, so that defined periods can be examined in detail. In the Protocol State Machine and Power State Machine windows, the PrSM and PSM models are built. They show all states and state transitions. Furthermore, each transition shows the triggering/activating event used for synchronization between PrSM and PSM. The Properties window shows all information about the state machines, such as states, transitions, guards, outputs, updates on the state variables, events, and state variables, as well as the I/O traces (wires), the power traces, and an error value between the system-level and gate-level power traces. The user can select a specific instant of time in the Power Simulation Editor; the Properties window then updates all values relating to this timestamp. This way, the user can exactly trace the value changes in the model. The outputs and updates are user-written functions that enable complex operations on the shared state variables.


Figure 4-7: Preprocessing for automatic synthesis of PSM

To support the user with the characterisation, we implemented a proof-of-concept algorithm for the automatic synthesis of the PSM that connects to the PSM plug-in. For this process, some preliminaries, shown in Figure 4-7, have to be met:

For IP blocks, it is assumed that the protocol is completely defined. This could be an exact textual description, a message sequence chart, or a state machine, and it can be used to build a PrSM, as described in section 3.2.3. Therefore, for IP blocks the PrSM is already defined.

There exist multiple use cases, including stimuli, which cover all operational states. Furthermore, the states should be traversed in different orders to detect dependencies between states. These use cases are executed in a functional and timed simulation with the design synthesized to gate level. For each use case, this results in

o timed traces for the inputs and outputs,

o a timed execution order of the PrSM including related PSM events,

o and an activity trace describing all gate switches of the IP component.

Since the PrSM is at system level, a transactor may be needed to transform the gate-level I/O into system-level I/O, so that the PrSM reproduces the same functional behaviour as the gate-level simulation.

The activity trace can be transformed into a power-over-time trace with the help of the target technology and the synthesized gate-level design. Since each PSM state represents an average power consumption, it is assumed that the mean of the power is nearly constant for the complete time the state is active. For that reason, the trace is separated into segments at the points where the mean changes by more than a specified threshold. These points are called change points. The change points are matched with the PSM events emitted by the PrSM. In the best case, all change points can be matched. If not all change points could be matched, the threshold was too low and has to be increased so that fewer change points are found. If some PSM events could not be matched, there are two options: either this is acceptable, because not every protocol state change must lead to a PSM change, or the threshold was too high and has to be decreased to find more change points. In the next step, these segments are mapped onto states. For each transition between two segments, a transition is created between the two representing states. If two states have the same input transitions and their annotated mean power is within a specific range, these states are merged.


This way, a chain of states and transitions is transformed into a state machine. Additionally, the variance can be considered when comparing states for a possible merge. In all cases, it is important to consider possible data dependencies in the power behaviour, because these may influence both mean and variance.
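The change-point segmentation described above can be sketched as follows. The running-mean criterion is an illustrative simplification; the actual tool's windowing and threshold handling may be more elaborate:

```c
#include <assert.h>
#include <math.h>

/* Walk a power trace with a running segment mean and cut a new segment
 * whenever the next sample deviates from the current mean by more than
 * the given threshold. Returns the number of change points found and
 * stores up to max_cuts of their indices in cut_points. */
int segment_power_trace(const double *trace, int n, double threshold,
                        int *cut_points, int max_cuts)
{
    int cuts = 0;
    double sum = trace[0];  /* running sum of the current segment */
    int len = 1;            /* length of the current segment      */
    for (int i = 1; i < n; i++) {
        double mean = sum / len;
        if (fabs(trace[i] - mean) > threshold) {
            if (cuts < max_cuts)
                cut_points[cuts] = i;  /* change point: segment boundary */
            cuts++;
            sum = trace[i];            /* start a new segment */
            len = 1;
        } else {
            sum += trace[i];
            len++;
        }
    }
    return cuts;
}
```

Raising the threshold yields fewer change points, lowering it yields more, matching the tuning rule described in the text.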

Property mining of extra-functional properties

Figure 4-8 shows an overview of the methodology. The proposed methodology does not require instrumenting the functional model of the target component and can be applied even in the case of black-box IPs. It only assumes the availability of two kinds of training traces: a set of functional traces exposing the IP's behaviours, and a corresponding set of power traces over time that characterize the dynamic part of the IP's energy consumption depending on its switching activity. It starts by dynamically mining temporal assertions from the functional traces. Such assertions are logic formulas that capture the functional behaviours of the IP over time, as proposed by EDALab in [77] and [78]. From them, the states and the transitions of a corresponding set of PSMs are generated, under the hypothesis that different functional behaviours may expose different power consumptions. Then, each state is associated with a power consumption by exploiting the reference power traces. The obtained PSMs are then merged and optimized to generate a compact set of more accurate PSMs. These PSMs are finally implemented in a SystemC module, in the form of a hidden Markov model, to allow their efficient and effective simulation concurrently with the simulation of the IP's functional model.

Currently, the part concerning the extraction of functional behaviours has been defined by EDALab and a prototype tool has been realized. Given a set of functional traces, the tool captures the behaviours of the corresponding IP through a set of proposition traces that are automatically generated by a mining procedure. It works in two phases. In the first phase, for each functional trace Φ, the procedure extracts a set of atomic propositions that hold frequently on Φ, predicating over the primary inputs (PIs) and primary outputs (POs) of the IP. The atomic propositions represent relations between PIs and POs of the IP that hold in a set of subtraces of Φ. The output of this phase is represented by a matrix m, where the generic element in position [i,j] reports the truth value of the j-th atomic proposition at the i-th time instant of the functional trace. In the second phase, the atomic propositions are combined into a set Prop of propositions, such that in each simulation instant of Φ one and only one of the propositions belonging to Prop holds. In particular, a composition procedure generates one proposition from each row of the matrix m by composing in an AND formula all atomic propositions that are marked as true. Finally, the proposition trace is obtained by identifying which proposition is true in each simulation instant of the functional trace. The extracted propositions are used to generate temporal assertions that capture the functional behaviours of the IP exposed by the functional trace. Such behaviours will then be mapped onto states of the PSM, which will be calibrated in cooperation with OFFIS.
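The second mining phase can be sketched as follows. The bitmask encoding is an illustrative choice, not the tool's actual representation: each row of the matrix m holds the truth values of the atomic propositions at one simulation instant, and AND-composing the true entries yields exactly one composite proposition per instant.

```c
#include <assert.h>
#include <stdint.h>

/* Compose one row of the matrix m into a single proposition: the
 * bitmask has bit j set iff atomic proposition j holds at that
 * instant, so identical masks denote the same member of Prop. */
uint32_t compose_proposition(const int *row, int n_atomic)
{
    uint32_t mask = 0;
    for (int j = 0; j < n_atomic; j++)
        if (row[j])
            mask |= 1u << j;
    return mask;
}

/* Build the proposition trace: one composite proposition per instant
 * of the functional trace (m is stored row-major). */
void proposition_trace(const int *m, int n_instants, int n_atomic,
                       uint32_t *out)
{
    for (int i = 0; i < n_instants; i++)
        out[i] = compose_proposition(m + i * n_atomic, n_atomic);
}
```

Instants with equal masks expose the same functional behaviour and would later map onto the same PSM state.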


Figure 4-8: Overview of the proposed approach for extraction of PSMs through assertion mining.

4.1.4 Modelling of thermal behaviour

4.1.4.1 Power Mapping

In order to feed a subsequent temperature simulation, a two-dimensional power map trace needs to be generated. Therefore, the component-level traces are mapped onto a component-level floorplan, as visualized in Figure 4-9. In this step, the power is averaged over the total component area, because neither the power consumption nor the place-and-route data is available at a more fine-grained level for commercially available SoCs. The power mapping results in two map traces: one for the dynamic power and one for the leakage power distribution.
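The averaging step can be sketched as follows; the rectangle representation and units are assumptions for illustration:

```c
#include <assert.h>

/* A floorplan component: the power of a component is spread uniformly
 * over its rectangle, since no finer-grained place-and-route data is
 * available for commercial SoCs. Units (mm, mW) are assumed. */
typedef struct {
    double x, y, w, h;  /* component rectangle in the floorplan [mm] */
    double power;       /* current power sample of the component [mW] */
} component_t;

/* Uniform power density of a component in mW/mm^2; evaluating this for
 * every component and time step yields the two-dimensional power map
 * trace fed into the thermal simulation. */
double power_density(const component_t *c)
{
    return c->power / (c->w * c->h);
}
```

Applying this separately to the dynamic and the leakage trace produces the two map traces mentioned above.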

Figure 4-10 shows the described mapping of power traces to components as it is done in the Intel® Docea™ Thermal Profiler tool. It supports multiple trace formats, such as VCD and CSV. Furthermore, it supports equations that can be entered as power models to cover the electrothermal coupling.


Figure 4-9: Component level floorplan

Figure 4-10: Mapping of power traces to components

4.1.4.2 Floorplan for ZYNQ platform

To run thermal simulations for the ZYNQ platform, a floorplan for the ZYNQ SoC is required. Unfortunately, such details are not available from Xilinx. That is why we implemented an artificial floorplan. It contains the relevant elements of the SoC and is, from our point of view, realistic, although we cannot tell how close it is to the real ZYNQ floorplan. Nevertheless, the floorplan can be used to show the application and benefits of the proposed design flow. A schematic view of the floorplan is shown in Figure 4-11. It shows the two ARM cores and two MicroBlaze cores implemented in the programmable logic area (MB1, MB2). The unused programmable logic areas are filled with Fill_area1 and Fill_area2. These dummy components represent the static power consumption of the unused FPGA area.


Figure 4-11: Artificial ZYNQ Floorplan

4.1.4.3 Thermal Model Generation of IC Package

Thermal estimation requires not only the power traces but also a compact thermal model of the targeted IC package. This model reflects the geometry, the materials used and their thermal properties, and the environment, such as applied active or passive cooling measures.

Figure 4-12: Material properties

Figure 4-12 shows an excerpt from a material database of Intel® Docea™ Thermal Profiler defining all physical material properties that are relevant for a thermal estimation: thermal capacity, conductivity in different directions, and density. Based on this database, the geometries of the IC package and its different layers are defined as shown in Figure 4-13. It covers the die stack that is embedded in the surrounding package materials as well as the connection to the PCB, the PCB itself, and the boundary conditions.


Figure 4-13: IC package geometry editor

Based on this IC package description the compact thermal model is created to be used in the following thermal estimation. The characterization step needs to be done only once for each targeted IC package.
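While the compact thermal models generated by the tool are far more detailed, the underlying principle can be illustrated with a minimal lumped-RC sketch, a single thermal node coupled to ambient; all parameter values below are invented, not derived from a real package characterization:

```python
# Minimal lumped-RC compact thermal model: one node obeying
# C * dT/dt = P - (T - Tamb) / R, integrated with forward Euler.
# Rth, Cth and the power step are illustrative assumptions.

def simulate(power_trace, rth=2.0, cth=0.05, t_amb=25.0, dt=0.01):
    """Return the temperature trace for a piecewise-constant power trace."""
    temp = t_amb
    out = []
    for p in power_trace:
        temp += dt * (p - (temp - t_amb) / rth) / cth
        out.append(temp)
    return out

# A 1 W step settles towards Tamb + P * Rth = 27 degC
trace = simulate([1.0] * 2000)
```

The steady-state value `Tamb + P * Rth` and the time constant `Rth * Cth` are the two quantities a per-node characterization must capture.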

4.1.4.4 IC package for ZYNQ platform

Similar to the floorplan, a package description for the ZYNQ SoC has been realized in Intel® Docea™ Thermal Profiler. Geometries and materials for the ZYNQ are not available from Xilinx either, but package descriptions are available for Virtex-6 ICs. We adapted these descriptions and adjusted them according to information available from the ZYNQ datasheet. The result is not an exact IC package model of the ZYNQ, but a realistic one (Figure 4-14).

Figure 4-14: ZYNQ Package description


4.1.4.5 Thermal Estimation

A thermal simulation in Intel® Docea™ Thermal Profiler consists in reading a compact thermal model of the IC package and then stimulating it with power figures varying over time. This stimulation can be performed in two different modes:

1. Live stimulation: the power values are not stored in a file but are calculated during the execution of an external power simulator and communicated live to Intel® Docea™ Thermal Profiler. At specific times, the external simulator can request temperature feedback from Thermal Profiler's simulator, to enable thermal-aware decision making or to update temperature-dependent parameters. In addition, the thermal results are stored in a .dts (Docea Thermal Simulation) file for offline post-processing. The communication between both simulators is enabled by the Intel® Docea™ Thermal Profiler simulation C++ API (Application Programming Interface). An example of a power simulator that can be used is the OVP/Cadence virtual platform featuring power estimation capabilities.

2. Off-line stimulation: the power values, per component and over time, must have been previously recorded as a power trace, e.g. in CSV (Comma-Separated Values) format. The trace is then loaded into Intel® Docea™ Thermal Profiler and the thermal simulation over time is performed. The thermal results and the input power stimulus are saved in the .dts file format for later post-processing.
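A per-component power trace of the kind used for off-line stimulation might look as follows; the column layout, component names and values are illustrative assumptions, not the tool's actual input specification:

```python
import csv
import io

# Hypothetical power trace: one timestamp column (s) plus one power
# column (W) per component. All names and values are invented.
rows = [
    ("time_s", "ARM0", "ARM1", "PL"),
    (0.000, 0.80, 0.40, 1.60),
    (0.001, 0.85, 0.40, 1.55),
    (0.002, 0.20, 0.20, 1.60),
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
trace_csv = buf.getvalue()

# Reading the trace back, as a thermal-simulation front-end might do:
parsed = list(csv.reader(io.StringIO(trace_csv)))
header = parsed[0]
samples = [[float(v) for v in row] for row in parsed[1:]]
```
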

The second mode is implemented in CONTREX, as illustrated in the figures below. Thermal Profiler computes a temperature distribution map that evolves over time, as shown in Figure 4-15.

Figure 4-15: Thermal map of floorplan within Intel® Docea™ Thermal Profiler

Furthermore, it also outputs aggregated temperature traces at component-level granularity that contain only the minimum, maximum, and average temperature for each component, since constraints can be placed on these properties.


Figure 4-16 shows a transient power and temperature trace of a single component over time. It shows the electro-thermal coupling due to temperature-dependent leakage currents.
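The coupling arises because leakage power grows roughly exponentially with temperature, while temperature in turn grows with total power. A minimal sketch with invented coefficients, not characterized SoC data:

```python
import math

# Illustrative electro-thermal coupling: leakage rises exponentially with
# temperature; temperature depends on total power. p0, k, Rth are made-up
# example values.

def leakage_power(t_celsius, p0=0.1, t0=25.0, k=0.03):
    """Leakage in W, normalized to p0 at the reference temperature t0."""
    return p0 * math.exp(k * (t_celsius - t0))

def coupled_steady_state(p_dyn=1.0, rth=2.0, t_amb=25.0, iters=100):
    """Fixed-point iteration of T = Tamb + (Pdyn + Pleak(T)) * Rth."""
    t = t_amb
    for _ in range(iters):
        t = t_amb + (p_dyn + leakage_power(t)) * rth
    return t

t_ss = coupled_steady_state()
```

The fixed point sits slightly above the leakage-free value `Tamb + Pdyn * Rth`, which is exactly the effect visible in a transient trace like Figure 4-16.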

Figure 4-16: Transient analysis of power and temperature within Intel® Docea™ Thermal Profiler

In its latest version, Intel® Docea™ Thermal Profiler features additional result analysis capabilities. As can be seen in Figure 4-17, one can now get an overview of key thermal indicators such as the maximum temperature per observable (at component, floorplan and probe level). The result viewer can also be used to analyse temperature in time and in space by means of the chart and floorplan views. These tools enable quick localization of thermal hotspots.

The result viewer moreover integrates a feature for threshold checking. One can specify temperature thresholds (in °C) at different levels (system, floorplan, single component and probe) and get reports related to these thresholds. In the key indicators table, metrics such as the time spent above a threshold or the time of the first threshold crossing are reported. In the chart view, one can see the time slots where the threshold has been crossed. Finally, in the floorplan view, the tool can display only the areas where the temperature is above the specified threshold.
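Metrics of this kind (time spent above a threshold, time of the first crossing) can be computed from a sampled temperature trace as follows; the sample period and trace values are invented:

```python
# Sketch of threshold-checking metrics over a sampled temperature trace (degC).
# The 10 ms sample period and the trace itself are illustrative assumptions.

def threshold_metrics(trace, threshold, dt=0.01):
    """Return (time above threshold in s, time of first crossing or None)."""
    above = [t > threshold for t in trace]
    time_above = sum(above) * dt
    first = next((i * dt for i, a in enumerate(above) if a), None)
    return time_above, first

temps = [24.0, 26.0, 31.0, 33.0, 29.0, 32.0]
time_above, first_cross = threshold_metrics(temps, 30.0)
```
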

Figure 4-17: Hotspot location and threshold crossing checking in Intel® Docea™ Thermal Profiler


4.1.5 Checking the models

The partners provided different models of extra-functional properties. These models were checked on the simulation platform of iXtronics CAMeL-View in different combinations of test cases. The test cases included discrete as well as continuous system models and combinations of both. It could be shown that in all test cases both the performance and the memory usage were sufficient for the class of applications addressed by the CONTREX platform. The planning of the tests and the design of the test environments were performed in close cooperation with the partners.

4.1.6 Conclusion

Sections 4.1.1 to 4.1.4 describe the extra-functional properties estimation flow that is applied in UC1 and UC3.

4.2 Extra-functional property estimation to be applied in UC2

The automotive domain use-case covers several system layers, from the individual MEMS inertial sensors, through the sensing and processing nodes, up to the server and cloud infrastructure and applications. For this reason, the use-case has been divided into the four "scenarios" described in detail in Section 3 of deliverable D1.2.1.

Due to this heterogeneity of the use-case, the set of models and tools involved in the estimation, design and validation phases differs from scenario to scenario. As a starting point for describing the estimation methodology applied to this use-case, it is useful to refer to the Vodafone Automotive design flow, as it is expected to be extended with the CONTREX tools. This flow, already described in the use-case-related deliverables, is reported in Figure 4-18 for the sake of clarity.


Figure 4-18: Vodafone Automotive extended design flow

The models involved cover several non-functional properties and span different levels of the system hierarchy. Table 3 summarizes the models adopted, the related tools and the partners actively involved.

Table 3. Components, models and tools in UC2

Component | Properties | Model | Tools | Partners
Sensors | Time, Energy | PSM | N2Sim | PoliMi, STM, ST-Polito
Battery / Power supply | Charge/discharge, Efficiency | Mathematical | N/A | PoliTo
Microprocessor | Time, Energy | Mathematical, Lookup tables | Keil uVision, AcePlorer | PoliMi, Docea, STM
Software drivers | Energy, Timing | Mathematical, Lookup tables | Keil uVision | PoliMi, ST-Polito
Software computational kernels | Energy, Timing | Mathematical, Lookup tables | Keil uVision | PoliMi, ST-Polito
Software tasks | Energy, Timing | Mathematical, Lookup tables | N2Sim | PoliMi
Node (hw/sw) | Energy, Timing | Discrete event | N2Sim | PoliMi
Communication | Size overhead, Latency | Lookup tables | N/A | EUTH


A different view of the adopted models/tools highlights the modelling aspects and the related tools with respect to the system architecture. Figure 4-19 shows the overall architecture of the entire system as initially conceived.

Figure 4-19: Overall system architecture (old architecture)

Due to resource limitations of the current implementation of the Vodafone Automotive Main ECU (originally labelled Cobra Main ECU), it has been decided to replace it with the Kura Pervasive Platform. Furthermore, in the initial architecture, the low-cost sensor node was intended to execute only the self-calibration, while the main crash detection functions were intended to be executed on the Cobra sensor node. Since the iNemo platform has sufficient memory and computational resources, the entire sensing application has been ported to the iNemo platform. The new architecture after these improvements is shown in Figure 4-20.

Figure 4-20: Revised Vodafone Automotive overall system architecture

The modelling, simulation and estimation flow is structured according to a bottom-up approach. A schematic view of the approach is shown in Figure 4-21, where the models and tools related to the cloud portion of the end-to-end application are omitted, as they are largely independent of the underlying node architecture.


Figure 4-21: Bottom-up modelling, simulation and estimation approach

The hardware components are logically the lowest architectural level of the system and are modelled according to the following principles.

1. At the lowest logical level of the system architecture there is the node's analogue hardware, namely the battery (or batteries), the power supply circuitry and the voltage regulators. These components are modelled by PoliTo and will constitute one of the elements of the node model to be simulated with N2Sim.

2. The next logical level collects auxiliary digital or analogue/digital components such as sensors, external flash memories and communication interfaces. All such components integrate a digital interface, typically through a bus such as SPI, I2C or UART, and can thus be conceived as black boxes providing some "functions" (e.g. power-on, power-off, change power mode, initialize, configure, read, ...) and being characterized by some internal "logical state" (e.g. off, on, full-power, low-power, idle, deep-sleep, sensing, communicating, ...). According to this view, it is natural to adopt a finite state machine model where "functions" represent FSM transitions between states. To account for non-functional aspects, states are characterized by an average power consumption, and transitions are characterized by a latency and an average power consumption. Note that alternative but equivalent sets of figures can be used (e.g. frequency, effective capacitance, switching activity, current, ...). Characterization of these components will be performed mostly through explicit measurements and data provided by vendors, and will lead to power models based on state machines to be integrated into the system description provided as input to the N2Sim simulator.

3. A similar approach can be used for the microprocessor model, provided that a suitable set of states and actions is identified. To characterize the states of a simple microcontroller it is often sufficient to consider the power supply voltage (sometimes variable), the clock frequency and the operating mode (the latter applies when the microcontroller supports power optimization techniques such as power and/or clock gating). The actions supported by such simple microprocessors/microcontrollers are simply the transitions between operating modes (active, idle, sleeping, ...) and the assembly instruction execution cycle. For more complex architectures, such as multicores and ultra-low-power devices, the number of states and transitions may increase significantly due to the finer control over peripherals and the possibility of dynamically changing the state of individual cores. At the finest level of granularity, the execution time and power consumption of individual assembly instructions can be derived. Within the context of this use-case, and due to the significant contributions of sensors and communication interfaces, it is considered sufficient to adopt a coarser-grained model in which the small energy consumption differences between instructions can be neglected (it has been shown that in RISC or RISC-like architectures the maximum energy consumption difference over the entire instruction set tends to be less than 10%, with a Gaussian distribution). Such coarse-grained models will be used in a twofold way: firstly, they will be combined with execution time and profiling data provided by the cycle-accurate assembly-level simulation performed by the Keil uVision debugger; secondly, they will be used for the task-level energy estimation performed with SWAT. It will be evaluated during the development of the project whether or not to use the processor model in the N2Sim simulator, thus characterizing tasks, operating system and drivers only in terms of time and relying on the simulator to derive energy figures and traces.
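The state/transition power model described in point 2 (and reused for simple microcontrollers in point 3) can be sketched as follows; the states, functions, power figures and latencies below are invented examples, not characterized component data:

```python
# Sketch of a state-machine power model: states carry an average power,
# transitions carry a latency and an average power. All numbers are invented.

STATE_POWER_MW = {"off": 0.0, "idle": 0.5, "sensing": 3.2}

# (from_state, function) -> (to_state, latency_s, transition_power_mw)
TRANSITIONS = {
    ("off", "power_on"): ("idle", 0.002, 1.0),
    ("idle", "start"):   ("sensing", 0.0005, 2.0),
    ("sensing", "stop"): ("idle", 0.0005, 2.0),
}

def run(sequence, state="off"):
    """Replay (function, dwell_time_s) pairs; return (final_state, energy_mJ)."""
    energy = 0.0
    for func, dwell in sequence:
        state, latency, p_transition = TRANSITIONS[(state, func)]
        energy += latency * p_transition          # energy of the transition
        energy += dwell * STATE_POWER_MW[state]   # energy in the new state
    return state, energy

final, energy_mj = run([("power_on", 1.0), ("start", 0.5), ("stop", 2.0)])
```

Replaying a trace of function calls and dwell times yields the kind of energy estimate that such state-machine models enable in a system-level simulator.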

From the software point of view, the hierarchical levels that can be identified are the following:

- Interrupt service routines

- Device drivers

- Operating system primitives

- Computational kernels

- Tasks

- Application

For these components, different modelling approaches have been identified.

1. Interrupt service routines and device drivers are considered here only from the software point of view, that is, ignoring the energy contributions related to the effects the software produces on the device being controlled. Such contributions (in terms of states/actions) are, in fact, already modelled in the hardware portion of the system. The energy consumed by the software alone is thus related to the function (or function tree) execution time and the microprocessor's average power consumption in the specific state. The execution time of such low-level and small (in terms of LOCs and/or object code size) software components will be obtained by instruction-set simulation using Keil uVision and enriched with power figures through simple (ad-hoc) interface filters. Similarly, computational kernels are time-characterized using the same approach, with the difference that such primitives are not associated with any external hardware contributions. The output of this modelling phase is a table (or, equivalently, a configuration file) to be used for simulation with N2Sim. It is worth noting that most of the functions considered in this modelling phase have (almost) constant execution time, i.e. they either have exactly one execution path or a few, often well-balanced, execution paths with slightly different execution times. We expect such differences to be negligible, allowing an average execution time to be used.

2. With the exception of interrupt service routines (asynchronous and periodic), all functions characterized in the previous phase are collected into tasks, whose execution time and, consequently, power consumption depend not only on the structure of the task itself, but also, and mostly, on the data being processed. A set of task models depending on the "operating state" of the system (that is, indirectly, on input data) will be produced as output and converted into a form suitable for integration within the N2Sim input model.

3. The application is constituted by a set of tasks, synchronized by events triggered either by external interrupts or by logical data-dependent conditions. Knowledge of the typical operating conditions of the Vodafone Automotive black box allows building one or more models of the entire node and simulating them using N2Sim. A minor extension of the previous model now allows N2Sim to explicitly account for the power-management policies that are implemented at run-time.
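The software characterization described in points 1 and 2 boils down to a simple composition rule: a function's energy is its execution time times the processor's average power in the active state, and a task's energy is the sum over its constituent functions. A minimal sketch with invented numbers:

```python
# Sketch of the time-times-power energy characterization of software
# components. The power figure and execution times are invented placeholders,
# not measured values.

CPU_ACTIVE_POWER_MW = 12.0

# function -> average execution time in ms (e.g. from instruction-set simulation)
EXEC_TIME_MS = {"read_accel": 0.08, "fft_kernel": 1.30, "write_flash": 2.50}

def function_energy_uj(name):
    """Energy in microjoules: t[ms] * P[mW] = E[uJ]."""
    return EXEC_TIME_MS[name] * CPU_ACTIVE_POWER_MW

def task_energy_uj(functions):
    """A task's energy is the sum over its constituent functions."""
    return sum(function_energy_uj(f) for f in functions)

acq_task_uj = task_energy_uj(["read_accel", "fft_kernel"])
```

A table of such per-function figures is exactly the kind of configuration input that can be fed to a node-level simulator.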

The overall output of the estimation will consist of simulation traces and summaries produced by the N2Sim simulator.

To integrate and automate the described flow, some extensions were needed:

- Integration with Keil uVision. Data provided by the instruction-set simulation of the computational kernels and by the on-target execution of the drivers' functions and interrupt service routines has been integrated directly into the N2Sim simulator model. This phase is currently performed manually, since the amount of data to be passed from Keil to N2Sim is very limited.

- N2Sim has been extended with the explicit notion of interrupt service routines and with support for modelling a certain degree of data dependency in task models.

- N2Sim task synchronization has been introduced. While totally general support for such features is out of the scope of the project and goes beyond the purposes of the simulator itself, basic synchronization support allows a more realistic modelling of the real application.

The final model of the application (an improvement of the model previously defined) has been completed and fed to the simulator under different configurations, corresponding to the actual operating modes of the system. The figures in the following show the results of a few simulation runs, whose operating conditions and meaning are also described.

CASE 0. Normal operating mode, no events. In this case the system is fully active, with all "key-on" functions being executed. This is the normal operating condition while the vehicle is "key-on". It is worth noting that the functions do not depend on whether the car is moving or not.


Figure 4-22: Normal operating mode, no events – Acquisition and analysis task

Zooming into a shorter time period, the activities related to acceleration sampling, interrupt requests, interrupt service routines and acceleration data read operations can be seen, as the following figure shows.

Figure 4-23: Normal operating mode, no events – Low-level acquisition activities

Finally, a zoom on the analysis task shows its duration during normal mode operation. This is shown in the following figure.


Figure 4-24: Normal operating mode, no events – Duration of analysis task

CASE 1. Normal operating mode, with self-calibration convergence. The figure shows a portion of the trace where the self-calibration algorithm reaches convergence and needs to perform the following additional operations, with respect to CASE 0:

- Computation of the Euler angles

- Analysis of statistical data accumulated since the last convergence event

- Writing the statistical information (angle density functions, in the form of histograms) and the new angle triple to flash memory.

The following figure shows that when a self-calibration event occurs, additional operations are needed to compute the Euler angles. Comparing this figure with Figure 4-24, the different duration can clearly be noticed.

Figure 4-25: Normal operating mode, with self-calibration event


CASE 2. Normal operating mode, with crash event. The figure shows a portion of the trace where the crash algorithm detects an accident and the following additional operations are needed, with respect to CASE 0:

- Computation of crash features

- Writing the crash picture to flash memory.

The tasks and their durations are shown in the figure below.

Figure 4-26: Normal operating mode, with crash event

CASE 3. Normal operating mode, with polling. Periodically (every 1, 5 or 60 s, depending on the application), a polling command is sent from the Main ECU to the iNemo node to retrieve information concerning the logical state of the device and a summary of the dynamic conditions over the period of time since the previous command. When a polling command is received, the following additional operations are needed, with respect to CASE 0:

- Communication with the Main ECU (command receive)

- Finalization of feature calculations (most features are computed at acquisition time, but need to be finalized at the end of the "polling period")

- Communication with the Main ECU (answer send)


The following two figures show the communication activities. The first shows the polling command and answer on a large time-scale, while the second shows the details at driver and interrupt level.

Figure 4-27: Normal operating mode, with polling, at task level

Figure 4-28: Normal operating mode, with polling, low-level


CASE 4. Low-battery operating mode, no events. In this case the system activity is reduced in order to save energy. This operating mode is only active (apart from very short transients) when the vehicle is "key-off". It is worth noting that the functions are expected to be meaningful only when the vehicle is not moving (the filtering of "false" events is mostly performed server-side).

Figure 4-29: Low-battery operating mode, with no events

CASE 5. Low-battery operating mode, with crash event. This is similar to CASE 2, but with the difference that the vehicle is “key-off” and thus some functions (e.g. the self-calibration) are disabled.

Figure 4-30: Low-battery operating mode, with crash event


CASE 6. Low-battery operating mode, with low-energy event. This is similar to CASE 5, but with the difference that the event is not a crash, but rather a low-energy impact. In this case, the following additional functions need to be performed:

- Analysis of the event over time (in real time, from the moment when the event is supposed to begin)

- Identification of the end of the event

- Post-processing of event data for feature extraction

- Event pre-filtering to discard false positives

- Writing the event summary and event features to flash memory.

Figure 4-31: Low-battery operating mode, with a low-energy event

The simulation traces can also be post-processed to derive instantaneous power consumption over time, as well as global figures such as average power consumption, peak power, duty-cycle, and so on.

The figures in the following show the power consumption in several operating conditions. Since the time granularity of the simulation is rather fine, meaningful portions of the traces have been selected. Furthermore, the power consumption has been plotted on a logarithmic scale to cover the wide range of values without loss of visual information.
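The trace post-processing mentioned above (average power, peak power, duty cycle) can be sketched as follows for a piecewise-constant power trace; the trace values are invented:

```python
# Sketch: derive average power, peak power and duty cycle from a simulated
# (time, power) trace. Trace values and the activity threshold are invented.

def summarize(times, powers, active_threshold=0.5):
    """times[i] starts a segment of constant power powers[i]; the last
    timestamp only closes the final segment."""
    energy = 0.0
    active = 0.0
    for i in range(len(powers) - 1):
        dt = times[i + 1] - times[i]
        energy += powers[i] * dt
        if powers[i] > active_threshold:
            active += dt
    span = times[-1] - times[0]
    return {"avg_w": energy / span,
            "peak_w": max(powers[:-1]),
            "duty_cycle": active / span}

stats = summarize([0.0, 1.0, 3.0, 4.0], [2.0, 0.1, 1.5, 0.0])
```
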


Figure 4-32: Normal operating mode, with no events

Figure 4-33: Normal operating mode, with self-calibration


Figure 4-34: Normal operating mode, with crash event

Figure 4-35: Normal operating mode, with polling command


Figure 4-36: Low-battery operating mode, with no events

Average Power Evaluation. Collecting the results of the execution traces and computing the total energy consumed over a fixed period of time in the different operating modes yields the average power consumption figures reported in the figure below (Figure 4-37). The real power consumption of the iNemo alone is represented in light blue, while the deep blue bars represent a constant consumption related to the electronics of the evaluation board used to perform the measurements (this contribution has been measured separately, as it is clearly not part of the simulated system). In view of industrialization, this constant contribution will be dramatically reduced, probably to less than 1 mA. Note that these power consumption estimates are confirmed by the average current measurements discussed in Deliverable D3.3.3.


4.2.1 Power modelling and analysis of the SeCSoC device

The SeCSoC power model is built following the architectural modelling described in Section 3.2.2, using available specification documents and existing data sheets. A simple model, in which each power function in a given power state consists simply of a fixed average power figure, was first created for every SeCSoC IP. Then the power functions were made dependent on the voltages and clock frequencies available at the IP boundaries (quadratic dependency on voltage, linear dependency on frequency). This way, the influence of voltage and frequency planning on power consumption can be simulated through what-if analysis.
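The described dependency (quadratic in voltage, linear in frequency) can be sketched as follows; the reference point and power figure are invented examples, not SeCSoC characterization data:

```python
# Sketch of a voltage/frequency-dependent power function relative to a
# characterized reference point. Reference values are invented.

def dynamic_power(v, f, p_ref=0.200, v_ref=1.0, f_ref=400e6):
    """P = P_ref * (V / V_ref)^2 * (f / f_ref), in watts."""
    return p_ref * (v / v_ref) ** 2 * (f / f_ref)

# What-if analysis: lowering 1.0 V / 400 MHz to 0.8 V / 200 MHz
p_nominal = dynamic_power(1.0, 400e6)
p_scaled = dynamic_power(0.8, 200e6)
```

Halving the frequency and dropping the voltage to 0.8 V cuts the dynamic power to 32% of nominal in this sketch, which is the kind of trade-off the what-if analysis explores.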

The average power values used in the power functions come from a pre-characterization of each IP block. This characterization can rely on different sources: projections from previous projects, educated guesses or marketing requirements, RTL or gate-level power simulations, or silicon measurements. In CONTREX, gate-level power simulations were used. The SeCSoC was put into a variety of typical processing and low-power states, and in each state the average power consumption of each IP was captured.

In addition, the SeCSoC model contains a description of the voltage and frequency distribution. The description consists of a representation of the voltage and clock networks, together with the models of the blocks located on the networks that generate or transform the voltage levels and frequencies:

- The voltage regulators, battery and power gates on the voltage networks

- The PLLs, clock synthesizers and clock gates on the clock networks

Figure 4-37: Average current consumption in the normal and low-power modes


With this, we are able to apply adequate voltage and clock values to the design domains during simulation and to refine the power consumption estimation. For instance, the power consumption of a voltage regulator depends on the current load at its output. By means of the voltage network description, as well as the regulator power consumption model (using the regulator efficiency extracted from data-sheet curves), this dependency is automatically resolved by the Intel® Docea™ Power Simulator tool. Thus, without changing anything in the stimulation (see next paragraph), we can estimate the voltage regulator power consumption and check whether the maximum output current is within the specified values.
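The load-dependent regulator model described above can be sketched with a piecewise-linear efficiency curve; the curve points and operating point below are invented, not taken from a real data sheet:

```python
# Sketch: regulator input power = output power / efficiency(load current),
# with the efficiency curve interpolated from (invented) data-sheet points.

# (load current A, efficiency) points, monotonic in current
EFF_CURVE = [(0.01, 0.60), (0.10, 0.85), (0.50, 0.90), (1.00, 0.88)]

def efficiency(i_load):
    """Piecewise-linear interpolation, clamped at the curve ends."""
    if i_load <= EFF_CURVE[0][0]:
        return EFF_CURVE[0][1]
    if i_load >= EFF_CURVE[-1][0]:
        return EFF_CURVE[-1][1]
    for (i0, e0), (i1, e1) in zip(EFF_CURVE, EFF_CURVE[1:]):
        if i0 <= i_load <= i1:
            return e0 + (e1 - e0) * (i_load - i0) / (i1 - i0)

def regulator_input_power(v_out, i_load):
    """Return (input power, power lost in the regulator) in watts."""
    p_out = v_out * i_load
    eff = efficiency(i_load)
    return p_out / eff, p_out * (1.0 / eff - 1.0)

p_in, p_loss = regulator_input_power(3.3, 0.10)
```
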

Figure 4-38: Power model creation (1 and 2) and stimulation (3) in Intel® Docea™ Power Simulator

Finally, power simulation is performed using stimuli coming from functional or performance simulation, from RTL emulation, or from measurements captured on boards from previous projects (early in the design flow) or from the current project (later in the design flow). In CONTREX, we used TLM (Transaction-Level Modelling) functional models that persist signals in VCD (Value Change Dump) files while they simulate. The signals represent power-related information such as state residencies and voltage and frequency values. A mapping file is used to convert the signal names to match the parameter names as they appear in the model's stimulation interface.

With this procedure in place, Intel® Docea™ Power Simulator runs a power simulation over the VCD trace and generates a variety of power consumption results: the consumption variations over time as well as statistics on average consumption, for each IP, each domain, and per power type (static, a.k.a. leakage, and dynamic). This includes the power consumption of the voltage regulators and their output currents.


5 Conclusions

This deliverable describes the state of the art in bottom-up quantitative modelling of extra-functional properties and provides the final description of the modelling approaches that were developed, combined, and applied. It is the final deliverable of Task 3.1, Modelling of execution platform's extra-functional properties.


6 References

[1] Xilinx Power Estimator (XPE) (http://www.xilinx.com/products/design_tools/logic_design/xpe.htm)

[2] Deliverable D3.2.1: Definition of execution platform and cloud service abstraction

[3] Deliverable D3.2.2: Implementation of execution platform and cloud service models (preliminary)

[4] Deliverable D4.1.1: Definitions and intermediate implementation of demonstrator’s applications

[5] Hadi Esmaeilzadeh, Emily Blem, Renée St. Amant, Karthikeyan Sankaralingam and Doug Burger, "Power Challenges May End the Multicore Era", Communications of the ACM, Vol. 56, No. 2, February 2013.

[6] aiT. http://www.absint.com/ait/, Last visit. Oct. 2014

[7] SWEET (SWEdish Execution Time tool). http://www.mrtc.mdh.se/projects/wcet/sweet/index.html, Last visit. Oct. 2014

[8] R. Wilhelm et al., "The worst-case execution-time problem — overview of methods and survey of tools," ACM Trans. on Embedded Computing Systems, vol. 7, no. 3, pp. 1–53, 2008

[9] Michael Zimmer, David Broman, Chris Shaver, and Edward A. Lee. FlexPRET: A Processor Platform for Mixed-Criticality Systems, In Proceedings of the 20th IEEE Real-Time and Embedded Technology and Application Symposium (RTAS), Berlin, Germany, April 15-17, 2014

[10] M. Schoeberl, JOP: A Java Optimized Processor for Embedded Real-Time Systems. PhD thesis, Vienna University of Technology, 2005

[11] PATMOS - A Time-predictable Processor for Real-Time Systems. http://patmos.compute.dtu.dk/, Last visit. Oct. 2014

[12] J. G. Korvink, E. B. Rudnyi, A. Greiner, Z. Liu, “MEMS and NEMS Simulation. In MEMS: A Practical Guide to Design, Analysis, and Applications”, Noyes Publications, 2005

[13] JESD15-4 DELPHI Compact Thermal Model Guideline, October 2008

[14] Cheikh Dia, Najib Laraqi, Eric Monier-Vinard, Valentin Bissuel, Olivier Daniel "Extension of the DELPHI methodology to Dynamic Compact Thermal Model of electronic Component ", THERMINIC 2011, 17th International Workshop on Thermal investigations of ICs and Systems, Paris France, September 2011

[15] Cheikh Dia, Eric Monier-Vinard, Valentin Bissuel, Olivier Daniel, “DELPHI style compact modeling by means of genetic algorithms of System in Package devices using composite sub-compact thermal models dedicated to model order reduction”, ITHERM 2012, San Diego USA, May 30-June 1 2012

[16] Tamara Bechtold, “Model Order Reduction of Electro-thermal MEMS”, Thesis, 2005

[17] Pieter Jacob Heres, “Robust and efficient Krylov subspace methods for Model Order Reduction”, Thesis, 2005

[18] International Technology Roadmap for Semiconductors, Design chapter, 2010. Available at http://www.itrs.net/

[19] T. Simunic, L. Benini and G. De Micheli, Energy-efficient design of battery-powered embedded systems, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 9, No. 1, pp. 15–28, 2001


[20] D. Brooks, V. Tiwari, M. Martonosi. Wattch: a framework for architectural-level power analysis and optimizations. Proceedings of International Symposium on Computer Architecture. New York, USA, 2000

[21] M. Ravasi and M. Mattavelli. High-level algorithmic complexity evaluation for system design. Journal of Systems Architecture, Vol. 48, No. 13–15, pp. 403–427, 2003

[22] Yunjiao Xue, Ho Sung Lee, Ming Yang, P. Kumarawadu, H.H. Ghenniwa, Weiming Shen, "Performance Evaluation of NS-2 Simulator for Wireless Sensor Networks", Canadian Conference on Electrical and Computer Engineering (CCECE), 22-26 April 2007, pp. 1372–1375, ISBN: 1-4244-1020-7

[23] http://en.wikipedia.org/wiki/Ns-2, Last visit. Oct. 2014

[24] http://www.isi.edu/nsnam/ns, Last visit. Oct. 2014

[25] http://docs.tinyos.net/index.php/TOSSIM, Last visit. Oct. 2014

[26] Muhammad Imran, Abas Md Said, Halabi Hasbullah, "A Survey of Simulators, Emulators and Testbeds for Wireless Sensor Networks", 2010 International Symposium in Information Technology (ITSim), June 2010, pp. 897–902. ISBN: 978-1-4244-6715-0

[27] E. Egea-Lopez, J. Vales-Alonso, A. S. Martinez-Sala, P. Pavon-Marino, J. Garcia-Haro, "Simulation Tools for Wireless Sensor Networks", Summer Simulation Multiconference, SPECTS, 2005, pp. 2–9

[28] Sangho Yi, Hong Min, Yookun Cho, Jiman Hong, "SensorMaker: A Wireless Sensor Network Simulator for Scalable and Fine-Grained Instrumentation", Computational Science and Its Applications (ICCSA 2008), Volume 5072/2008, pp. 800–810

[29] Clay Stevens, Colin Lyons, Ronny Hendrych, Ricardo Simon Carbajo, Meriel Huggard, Ciaran Mc Goldrick, “Simulating Mobility in WSNs: Bridging the gap between ns-2 and TOSSIM 2.x”, 13th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications, 2009, ISBN: 978-0-7695-3868-6

[30] J. Polley, D. Blazakis, J. McGee, D. Rusk, J.S. Baras, "ATEMU: A Fine-grained Sensor Network Simulator", First Annual IEEE Communications Society Conference on Sensor and Ad Hoc Communications and Networks, Santa Clara, CA, October 4-7, 2004

[31] Philip Levis, Nelson Lee, Matt Welsh, David Culler, "TOSSIM: Accurate and Scalable Simulation of Entire TinyOS Applications", SenSys, 2003, ISBN: 1-58113-707-9

[32] J. Elson, S. Bien, N. Busek, V. Bychkovskiy, A. Cerpa, D. Ganesan, L. Girod, B. Greenstein, T. Schoellhammer, T. Stathopoulos, D. Estrin, "EmStar: An Environment for Developing Wireless Embedded Systems Software", 2003.

[33] Lewis Girod, Jeremy Elson, Alberto Cerpa, Thanos Stathopoulos, Nithya Ramanathan, Deborah Estrin, “ EmStar: a Software Environment for Developing and Deploying Wireless Sensor Networks” , USENIX Technical Conference, 2004

[34] http://en.wikipedia.org/wiki/Omnet%2B%2B, Last visit. Oct. 2014

[35] http://www.omnetpp.org/home/what-is-omnet, Last visit. Oct. 2014

[36] http://sites.google.com/site/jsimofficial, Last visit. Oct. 2014

[37] http://www.hynet.umd.edu/research/atemu, Last visit. Oct. 2014

[38] http://compilers.cs.ucla.edu/avrora, Last visit. Oct. 2014

[39] E. Macii, M. Pedram, and F. Somenzi, "High-Level Power Modeling, Estimation, and Optimization", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 17, No. 11, November 1998


[40] R.L. Wright, & M.A. Shanblatt, “Improved power estimation for behavioral and gate level designs,” in IEEE Computer Society Workshop on VLSI, pp. 102-107, 2001

[41] R. Burch, F. Najm, P. Yang, & T. Trick, “A Monte Carlo approach for power estimation,” in IEEE Transactions on VLSI Systems, vol. 1, pp. 63-71, 1993

[42] RealView ARMulator ISS Version 1.4.3 User Guide. Available at http://infocenter.arm.com/help/topic/com.arm.doc.dui0207d/DUI0207D_rviss_user_guide.pdf

[43] QEMU website. http://wiki.qemu.org/Main_Page. Last visit. Oct., 2014.

[44] OVP website. http://www.ovpworld.org/. Last visit. Oct., 2014.

[45] L. Diaz, P. Sánchez. "Host-compiled Parallel Simulation of Many-core Embedded Systems". San Francisco, DAC2014. 2014-06

[46] J. Castillo, H. Posadas, E. Villar, M. Martinez. “Energy Consumption Estimation Technique in Embedded Processors with Stable Power Consumption based on Source-Code Operator Energy Figures”. In proceedings of XXII Conference on Design of Circuits and Integrated Systems, DCIS'07. Nov., 2007

[47] S. Real, H. Posadas, E. Villar. "L2 Cache Modeling for Native Co-Simulation in SystemC". Symposium on Industrial Embedded Systems (SIES 2010). Jun. 2010

[48] G. Ammons, R. Bodík, and J. R. Larus, “Mining specifications,” in Proc. of ACM POPL, 2002, pp. 4–16.

[49] M. Bonato, G. Di Guglielmo, M. Fujita, F. Fummi, and G. Pravadelli, “Dynamic property mining for embedded software,” in Proc. of ACM/IEEE CODES+ISSS, 2012, pp. 187–196.

[50] SystemC Network Simulation Library – version 2, 2013, URL: http://sourceforge.net/projects/scnsl, last visit Oct. 2014.

[51] KEIL Tools by ARM - µVision IDE overview, URL: http://www.keil.com/uvision/, last visit Oct. 2014.

[52] H. Lebreton, P. Vivet, Power Modeling in SystemC at Transaction Level, Application to a DVFS Architecture, in: IEEE Annual Symposium on VLSI, ISVLSI’08, 2008, pp. 463–466.

[53] G. J. Holzmann, Design and Validation of Computer Protocols, Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1991, pp. 176–178.

[54] D. Brand, P. Zafiropulo, On communicating finite-state machines, J. ACM 30 (2) (1983) 323–342. doi:10.1145/322374.322380.

[55] R. Alur, L. Fix, T. A. Henzinger, Event-clock automata: a determinizable class of timed automata, Theoretical Computer Science 211 (1) (1999) 253–273.

[56] Coremu website. http://sourceforge.net/p/coremu/home/Home/. Last visited, August, 2015.

[57] Z. Wang et al. "COREMU: A Scalable and Portable Parallel Full-system Emulator". In PPoPP'11, February 12–16, 2011, San Antonio, Texas, USA.

[58] Imperas SW Limited. Imperas Installation and Getting Started Guide. Version 1.4.25.3. Oct. 2014.

[59] S. Park, Y. Wang, et al. Battery Management for Grid-connected PV Systems with a Battery. In Proc. Of ACM/IEEE ISLPED, pages 115–120, 2012.

[60] W. Peukert. Über die Abhängigkeit der Kapazität von der Entladestromstärke bei Bleiakkumulatoren. In Elektrotechnische Zeitschrift, page 20, 1897.


[61] L. Benini, G. Castelli, A. Macii, E. Macii, M. Poncino, and R. Scarsi, "Discrete-Time Battery Models for System-Level Low-Power Design", IEEE Transactions on VLSI, Vol. 9, No. 5, October 2001, pp. 630-640.

[62] M. Chen and G. A. Rincon-Mora, "Accurate Electrical Battery Model Capable of Predicting Runtime and I-V Performance," IEEE Transactions on Energy Conversion, vol. 21, no. 2, pp. 504-511, Jun. 2006.

[63] M. Petricca, D. Shin, et al. An automated framework for generating variable-accuracy battery models from datasheet information. In ACM/IEEE ISLPED, pages 365–370, 2013.

[64] M. A. Faruque and F. Ahourai. A Model-Based Design of Cyber-Physical Energy Systems. In Proc. of IEEE ASPDAC, pages 97–105, 2014.

[65] "ST EFL700A39 EnFilm rechargeable solid state lithium thin film battery datasheet", ST, Nov. 2013.

[66] European Union, FP7 ENIAC MOTORBRAIN project, grant agreement number 270693, www.motorbrain.eu.

[67] S. Vinco, A. Sassone, F. Fummi, et al. An open-source framework for formal specification and simulation of electrical energy systems. In IEEE/ACM ISLPED, pages 287–290, 2014.

[68] Accellera. IEEE Standard 1685-2009 for IP-XACT, 2010.

[69] Accellera. Recommended Vendor Extensions to IEEE 1685-2009 (IP-XACT), 2013. www.accellera.org.

[70] L. Benini, R. Hodgson, and P. Siegel, “System-level power estimation and optimization,” in Proc. of IEEE ISLPED, 1998, pp. 173–178.

[71] R. Bergamaschi and Y. Jiang, “State-based power analysis for systems- on-chip,” in Proc. of ACM/IEEE DAC, 2003, pp. 638–641.

[72] S. Schurmans, D. Zhang, D. Auras, R. Leupers, G. Ascheid, X. Chen, and L. Wang, "Creation of ESL power models for communication architectures using automatic calibration," in Proc. of ACM/IEEE DAC, 2013.

[73] D. Lorenz, P. A. Hartmann, K. Grüttner, and W. Nebel, “Non-invasive power simulation at system-level with SystemC,” in Integrated Circuit and System Design. Power and Timing Modeling, Optimization and Simulation, ser. LNCS. Springer, 2013, vol. 7606, pp. 21–31.

[74] D. Lorenz, K. Gruettner, and W. Nebel, “Data-and state-dependent power characterisation and simulation of black-box RTL IP components at system level,” in Proc. of Euromicro DSD, 2014, pp. 129–136.

[75] H. Lebreton and P. Vivet, “Power modeling in SystemC at transaction level, application to a DVFS architecture,” in Proc. of IEEE ISVLSI, 2008, pp. 463–466.

[76] L. Benini, A. Bogliolo, and G. De Micheli, “A survey of design techniques for system-level dynamic power management,” IEEE Trans. on VLSI, vol. 8, no. 3, pp. 299–316, 2000.

[77] A. Danese, T. Ghasempouri, and G. Pravadelli, “Automatic extraction of assertions from execution traces of behavioural models,” in Proc. of ACM/IEEE DATE, 2015, pp. 67–72.

[78] N. Bombieri, F. Busato, A. Danese, L. Piccolboni, and G. Pravadelli, “Exploiting GPU Architectures for Dynamic Invariant Mining,” in Proc. of IEEE ICCD, 2015.


[79] Sören Schreiner, Ralph Görgen, Kim Grüttner and Wolfgang Nebel, “A Quasi-Cycle Accurate Timing Model for Binary Translation Based Instruction Set Simulators”, in 4th Workshop on Virtual Prototyping of Parallel and Embedded Systems (ViPES), 2016

[80] VIPPE website. http://vippe.teisa.unican.es. Last visited, Sept. 2016.

[81] Z. Wang, R. Liu, Y. Chen, X. Wu, H, Chen, W. Zhang and B. Zang. “COREMU: A Scalable and Portable Parallel Full-system Emulator”. In PPoPP’11, February 12–16, 2011, San Antonio, Texas, USA.

[82] J.H. Ding, P. Chang, W. Hsu, Y.C. Chung. "PQEMU: A Parallel System Emulator based on QEMU". 2011 IEEE 17th International Conference on Parallel and Distributed Systems.

[83] Imperas QuantumLeap. http://www.imperas.com/articles/imperas-delivers-quantumleap-simulation-synchronization-%E2%80%93-industrys-first-parallel-virtual.

[84] iNEMO system-on-board, STMicroelectronics, http://www.st.com/content/st_com/en/products/mems-and-sensors/inemo-inertial-modules/inemo-m1.html

[85] Panasonic, CG18650CG lithium-ion battery, http://www.meircell.co.il/files/Panasonic%20CGR18650CG.pdf

[86] Linear Technology, LT1613 converter, http://cds.linear.com/docs/en/datasheet/1613fs.pdf

[87] L. Benini, A. Bogliolo, G.D. Micheli, A survey of design techniques for system-level dynamic power management. IEEE Trans. Very Large Scale Integr. Syst. 8(3), 299–316 (2000)

[88] S. Vinco, A. Sassone, D. Lasorsa, E. Macii and M. Poncino, "A framework for efficient evaluation and comparison of EES Models," IEEE PATMOS, 2014, pp. 1-8.