
On The Analysis Of Hardware Event

Monitors Accuracy In MPSoCs For Real-time

Computing Systems

Author:

Javier Enrique Barrera Herrera

Supervisor:

Hamid Tabani Barcelona Supercomputing Center

Co-supervisor:

Francisco J. Cazorla Barcelona Supercomputing Center

Tutor:

Leonidas Kosmidis Department of Computer Architecture

Universitat Politècnica de Catalunya

Barcelona Supercomputing Center

High Performance Computing

Master in Innovation and Research in Informatics

Facultat d’Informàtica de Barcelona

Universitat Politècnica de Catalunya

Computer Architecture - Operating Systems Department

Barcelona Supercomputing Center

June 23, 2020


Acknowledgements

In the first place, I would like to thank my advisors, Hamid, Fran and Leonidas for their guidance

and mentoring through the development of this Thesis.

I also want to thank the rest of the people of the CAOS group at BSC who have always offered

help when needed.

Moreover, I would like to acknowledge the BSC institution for financially supporting my Master studies,

and also to the following institutions that have partially supported this work: the Spanish Ministry

of Economy and Competitiveness (MINECO) under grant TIN2015-65316-P, the UP2DATE
European Union’s Horizon 2020 (H2020) research and innovation programme under grant agreement

No 871465, the SuPerCom European Research Council (ERC) project under the European Union’s

Horizon 2020 research and innovation programme (grant agreement No. 772773), and the HiPEAC

Network of Excellence.

Last but not least, I would like to thank my family for their unconditional support in my life and

studies.


Abstract

The number of mechanical subsystems enhanced or completely replaced by electrical/electronic

components is on the rise in critical real-time embedded systems (CRTES) like those in cars,

planes, trains, and satellites. In this line, software is increasingly used to control (safety-related)

critical aspects of CRTES. More complex software imposes unprecedented computing performance
requirements, which can only be met by deploying aggressive processor designs, namely multiprocessor

system on chips (SoCs or MPSoCs). The other side of the coin is that MPSoCs make software

timing analysis – a mandatory pre-requisite for CRTES – more complex.

Performance Monitoring Units (PMUs) are at the heart of most advanced software timing analysis

techniques to control and bound the impact of contention in Commercial Off-The-Shelf (COTS)

System-on-Chips (SoCs) with shared resources (e.g., GPUs and multicore CPUs). However, PMUs

are designed with an assurance level below the role they assume in software timing analysis.

In this Thesis, we aim at taking an initial step toward reconciling PMU verification with its key

role for timing analysis. In particular, this Thesis covers the analysis of the correctness of hardware

event monitors (HEMs) in embedded processors for CRTES domains. This Thesis illustrates that

some event monitors do not behave as expected in their specification, which can in turn invalidate

the software timing analysis process performed building on those HEMs. For three real processors

used in different CRTES domains, we report discrepancies between the values obtained from the PMU’s
HEMs and the number of events expected based on the HEM descriptions in the processor’s official

documentation. Discrepancies, which may be either due to actual errors or inaccurate specifications,

make PMU readings unreliable. This is particularly problematic in consideration of the critical role

played by event monitors for timing analysis in domains such as automotive and avionics.

This Thesis proposes a systematic procedure for event monitor validation. We apply this
procedure to validate event monitors in the NVIDIA AGX Xavier, NVIDIA TX2, and the Xilinx Zynq

UltraScale+ MPSoC. We show that while some event monitors count as expected, this is not the

case for others whose discrepancies with expected values we analyze.


Glossary

Term Definition

HEM Hardware Event Monitor

SoC System on Chip

MPSoC MultiProcessor System on Chip

WCET Worst-Case Execution Time

RTES Real-Time Embedded Systems

CRTES Critical Real-Time Embedded Systems

ISO-26262 Safety standard for CRTES in the automotive domain

DO-178C Safety standard for CRTES in the avionics domain

CAST-32A Certification Guidance for the use of multicores in the avionics domain

EN 50128 Safety standard for CRTES in the railway domain

ASIL Automotive Safety Integrity Level, a risk classification scheme

V&V Validation and Verification process

GPU Graphics Processing Unit

COTS Commercial Off-The-Shelf

PMU Performance Monitoring Unit

SDTA Static Deterministic Timing Analysis

MBDTA Measurement-Based Deterministic Timing Analysis

HDTA Hybrid Deterministic Timing Analysis

SPTA Static Probabilistic Timing Analysis

MBPTA Measurement-Based Probabilistic Timing Analysis

HYPTA Hybrid Probabilistic Timing Analysis

PMC Performance Monitoring Counter

CPU Central Processing Unit

pWCET probabilistic Worst-Case Execution Time

ILP Integer Linear Programming

EMVP Event Monitor Validation Process

PUD Platform Usage Domain

rbe representative benchmark

ISA Instruction Set Architecture

OEM Original Equipment Manufacturer

CUDA Compute Unified Device Architecture

SASS GPU Assembly Code for NVIDIA GPUs


Contents

1 Introduction 5

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2 Background and Related Work 11

2.1 Timing Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.1.1 SDTA: Static Deterministic Timing Analysis . . . . . . . . . . . . . . . . . . 11

2.1.2 MBDTA: Measurement-Based Deterministic Timing Analysis . . . . . . . . . 12

2.1.3 HDTA: Hybrid Deterministic Timing Analysis . . . . . . . . . . . . . . . . . 12

2.1.4 SPTA: Static Probabilistic Timing Analysis . . . . . . . . . . . . . . . . . . . 13

2.1.5 MBPTA: Measurement-Based Probabilistic Timing Analysis . . . . . . . . . 13

2.1.6 HYPTA: Hybrid Probabilistic Timing Analysis . . . . . . . . . . . . . . . . . 14

2.2 Using PMCs in the Timing Analysis Process . . . . . . . . . . . . . . . . . . . . . . 14

3 Event Monitor Validation Process 17

3.1 Event Monitor Validation Process (EMVP) . . . . . . . . . . . . . . . . . . . . . . . 17

3.2 Systematic and Automated Validation . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.3 Automation Opportunities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20


4 Assessment of EMVP on the NVIDIA Jetson AGX Xavier 22

4.1 Experiment and representative benchmark design . . . . . . . . . . . . . . . . . . . . 23

4.2 First Validation Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.3 Second Validation Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.4 Third Validation Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.5 Assessment on Complex Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

5 Assessment of EMVP on the NVIDIA Jetson TX2 32

5.1 Experiment and representative benchmark design . . . . . . . . . . . . . . . . . . . . 33

5.2 First Validation Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

6 Assessment of EMVP on the Xilinx Zynq Ultrascale+ 37

6.1 Experiment and representative benchmark design . . . . . . . . . . . . . . . . . . . . 37

6.2 Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

7 Conclusions and Future Work 42

7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

7.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43


List of Figures

2.1 Example of pWCET curve [44] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.1 EMVP Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4.1 CUDA/SASS code of matrix copy benchmark for AGX Xavier. . . . . . . . . . . . . 24

4.2 SASS code of the combined example. . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

4.3 MISC inst. counted and expected (example in Fig. 4.2). . . . . . . . . . . . . . . . . . 27

4.4 SASS code of two examples with NOP instructions. . . . . . . . . . . . . . . . . . . . 29

4.5 SASS code for the vector addition in Global Mem. . . . . . . . . . . . . . . . . . . . 30

5.1 CUDA/SASS code of matrix copy benchmark at TX2. . . . . . . . . . . . . . . . . . 36

6.1 Diagram of the Cortex-A53 CPU cluster [1] . . . . . . . . . . . . . . . . . . . . . . . 38

6.2 C/ARM Assembly code of matrix copy in the Zynq. . . . . . . . . . . . . . . . . . . 39


List of Tables

2.1 Taxonomy of WCET techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

4.1 Instruction types used in this analysis for the NVIDIA Jetson AGX Xavier GPU. . . 23

4.2 Measured/Expected values for matrix copy benchmark for Nvidia Jetson AGX Xavier 25

4.3 Instruction types in Figure 4.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.4 Event counts for the vector addition benchmarks. . . . . . . . . . . . . . . . . . . . . 30

5.1 Instruction types used in this analysis for the NVIDIA Jetson TX2 GPU. . . . . . . 32

5.2 Measured/Expected values for matrix copy benchmark at Nvidia Jetson TX2 . . . . 33

5.3 Measured/Expected values for matrix copy benchmark at Nvidia Jetson TX2 after

applying PMC correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

6.1 Instruction types used in the analysis for the Xilinx Ultrascale+ ARM Cortex-A53

CPUs [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

6.2 Measured/Expected values for matrix copy . . . . . . . . . . . . . . . . . . . . . . . 40


Chapter 1

Introduction

The Real-Time Embedded Systems (RTES) industry represents a key part of the global chip market

and some predictions point out that it will drive the global chip demand in the following years [39].

RTES comprise a wide range of commercial products: from low-cost commodity appliances such as

microwave ovens to expensive and critical systems like cars or planes.

In RTES, the timely execution of software is as important as its correctness. In particular, evidence

must be provided that the software finishes its processing before a given time bound which is

called the deadline. Depending on their criticality, real-time systems can be broadly classified into the
following main categories:

• Hard real-time systems: these systems control critical operations in which a systematic failure
of the system, such as frequent deadline misses, can result in a catastrophic event, e.g. the
Anti-lock Braking System (ABS) of a car or the flight control system of a plane. These systems
are also known as Critical Real-Time Embedded Systems (CRTES).

• Soft real-time systems: the system can afford to miss several deadlines, since doing so does
not result in a critical outcome. For example, in a video decoding processor, if the processing
of a video frame misses its deadline, this will likely not be noticeable to the end user. Even
if the deadline is missed for several consecutive frames, although undesirable, the system will
only experience a degraded quality of service and the user will still be able to continue watching
the movie.

• Firm real-time systems: the system can miss an occasional deadline without causing a critical
outcome, but the results are discarded since they are of no use after the deadline. An example
is software-defined radio (SDR), where a missed deadline results in some part of the audio
stream not being heard by the user.


CRTES comprise safety-critical systems, whose failure could cause fatalities, injuries or severe damage
to objects (including the system itself), and mission-critical systems, whose failure typically causes
economic losses, such as systems controlling measurement instruments in a satellite. Even if those
systems do not compromise the integrity of the satellite itself, they may lead to a failure to accomplish
the mission, which ultimately is a severe consequence.

The expected correct behavior of CRTES is defined in generic [30] or domain-specific safety standards
such as ISO-26262 [31] for road vehicles, DO-178C [58] and CAST-32A [22] for airborne systems, and
EN 50128 [14] for railway. Those standards describe the required functional and timing verification
needed to provide evidence – qualitative and quantitative – that failures are absent or that the risk
of failure can be regarded as residual. In other words, the validation and verification (V&V) process provides

evidence that all relevant scenarios have been considered and safety measures have been put in

place to mitigate risks.

Certification processes are required in order to guarantee that a certain system or software is safe
to be used in the target domain. An example of a certification standard for automotive electrical
and electronic systems is the ISO-26262 [31] safety standard, which defines the Automotive Safety
Integrity Level (ASIL), a risk classification scheme. It is an adaptation to the automotive industry
of the Safety Integrity Level used in IEC 61508 [30]. This classification helps define the safety
requirements necessary to comply with the ISO-26262 standard. The ASIL is established by
performing a risk analysis of a potential hazard, looking at the Severity, Exposure and Controllability
of the vehicle operating scenario. The safety goal for that hazard in turn carries the ASIL
requirements. The standard identifies four ASILs: ASIL A, ASIL B, ASIL C and ASIL D, with
ASIL D dictating the highest integrity requirements on the product and ASIL A the lowest.
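As a concrete illustration of this classification, the sketch below encodes the ASIL determination step as a lookup. It is not part of this Thesis and does not replace the standard's own table; it uses the commonly cited observation that the ISO-26262 table is equivalent to summing the class indices of Severity (S1-S3), Exposure (E1-E4) and Controllability (C1-C3).

```python
# Hedged sketch of ASIL determination per ISO-26262, using the sum-of-class
# indices encoding of the standard's lookup table (an illustration only).
def asil(severity, exposure, controllability):
    """Return the ASIL for classes S1-S3, E1-E4, C1-C3."""
    assert 1 <= severity <= 3 and 1 <= exposure <= 4 and 1 <= controllability <= 3
    total = severity + exposure + controllability
    # Sums of 6 or less fall outside ASIL A-D and are handled under normal
    # quality management (QM) rather than a safety integrity level.
    return {7: "ASIL A", 8: "ASIL B", 9: "ASIL C", 10: "ASIL D"}.get(total, "QM")
```

For instance, the most severe, most frequent, least controllable hazard (S3, E4, C3) maps to ASIL D, while (S1, E1, C1) requires no ASIL at all.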

Until recently, CRTES built upon relatively-simple software running on relatively low-performance

(and low-complexity) hardware. For instance, many avionics systems still today are built upon

single-core processors with an in-order execution pipeline and without cache memories or many

other advanced microarchitecture techniques. The advantage of those systems is that timing verifi-

cation is relatively simple since execution time variability is low and the system’s behavior is quite

predictable. However, the increasing automation of systems first, and the trend towards fully
autonomous systems later, push the CRTES industry to adopt hardware platforms delivering much
higher performance in order to respond to the demands of complex functionalities. Multicore
and manycore processors are one such type of hardware platform. They consist of a number of cores
capable of executing software simultaneously, as well as an interconnection network that connects
the cores with one another and with neighboring devices (e.g. main memory) [26, 33, 53, 59, 60].


1.1 Motivation

Complex state-of-the-art microprocessors offer performance-improving features that have traditionally
been reserved for the high-performance domain but are increasingly being adopted in processors for
domains like automotive [13]. Those features include multicores, multi-level cache hierarchies,
complex on-chip networks and accelerators, among which GPUs have a dominant position
[5, 40, 62]. This transition from the usage of simple microcontrollers to complex microprocessors in

the CRTES is driven by the unprecedented performance requirements of complex critical software

in order to support advanced functionalities such as autonomous driving in automotive and more

autonomous missions in space [7, 67].

Timing Verification and Validation (V&V) provides evidence for the correct temporal schedulability
of the system. This builds on deriving tight and reliable Worst-Case Execution Time (WCET)
estimates (budgets) for software execution time. The quality of the WCET estimates often depends
on the engineer’s previous experience. For instance, common industrial practice for timing analysis
consists of running several tests, measuring the highest execution time (high watermark) and adding
an experience-based safety margin to it to cover the impact of ‘unobserved’ effects [65]. However,
multicore systems, although enabling higher performance, introduce timing variability due to
contention between the different cores when accessing shared resources. Timing is therefore not
deterministic and time predictability is needed; however, COTS processors in critical domains have
limited hardware support for time predictability. This includes automotive processors and SoCs

such as the NVIDIA Drive SoCs (Parker [47] and Xavier [46] SoCs), RENESAS R-Car H3 [57],

QUALCOMM SnapDragon 820 [55], and Intel Go [29]. Similar concerns also arise on SoCs such as

the Xilinx Zynq UltraScale+, which is increasingly considered for avionics and railway applications

among others [69].

Attempting to achieve full isolation in software, resorting, for example, to page (memory) coloring
techniques1, has been shown to be insufficient, since interference remains at the shared queues and
buffers [63]. Software solutions for quota monitoring and enforcement have been proposed to handle
contention in generic multicore processors with limited hardware support for time predictability
[18, 45, 54, 72]. Quota enforcement approaches build on limiting the maximum shared-resource
utilization per task (core). The operating system monitors the task’s activities via the hardware
event monitors offered by the processor’s PMU and suspends or slows down the task’s execution
when its assigned budget is about to be exhausted.
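The quota-enforcement mechanism just described can be sketched with a toy simulation. The function and trace below are illustrative assumptions, not an actual OS implementation: the trace stands in for PMC deltas read at each monitoring interval.

```python
# Toy simulation of PMC-based quota enforcement: the OS periodically samples
# a per-core event counter (e.g. bus accesses) and suspends the task once its
# contention budget for the current period is exhausted. Illustrative only.
def enforce_quota(access_trace, budget):
    """Given per-tick shared-resource access counts of a task, return the tick
    at which the task would be suspended, or None if it stays within budget."""
    consumed = 0
    for tick, accesses in enumerate(access_trace):
        consumed += accesses  # PMC delta read at this monitoring interval
        if consumed > budget:
            return tick  # suspend the task until the next replenishment period
    return None
```

For example, a task issuing 10 accesses per tick against a budget of 25 would be suspended at the third tick. Note that the correctness of such enforcement rests entirely on the sampled counter being accurate, which is precisely what this Thesis questions.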

Existing software approaches and solutions for quota event monitoring and enforcement, as well as
software debugging processes, build on the naive assumption that event monitors and their
documentation are always correct. In fact, the trustworthiness of event monitors in COTS processors

1 Coloring is a well-known technique to segregate accesses to the different blocks of memory-like resources [38],

like banks of the shared last-level on-chip cache, the banks and ranks in a DDR memory system [37, 52, 71], or even

combined cache-memory segregation [61].


has not been questioned yet in the real-time research community, despite their critical role as
functional and non-functional verification means. The validity of all quota-based software solutions

cannot be sustained without providing evidence of a correct functioning of the event monitors,

according to the specification available in the official documentation. The lack of such supportive
evidence ultimately jeopardizes the timing arguments and potentially invalidates the evidence

gathered to successfully undergo the mandatory timing Verification and Validation (V&V) process,

in accordance with safety regulations.

While the PMUs in mainstream processors do offer a promising baseline for this low-level analysis,

the historical role and limited relevance PMUs have been given in mainstream systems – from which
PMU design in CRTES chips is inherited – are in stark contrast with the critical role they would
acquire for timing analysis. In fact, PMUs and Performance Monitoring Counters (PMCs2) have

been traditionally intended to capture average behavior rather than the worst-case one and have

been used as cursory, low-level debugging support by the chip manufacturer (hence, with reduced

need for detailed documentation). Moreover, the fact that PMU and PMCs do not directly impact

the timing and functional behavior of applications running on top of the platform has a twofold

consequence:

1. PMU inclusion in the hardware design usually occurs in late design phases, with reduced
flexibility to incorporate new counters or to fix potential deviations.

2. The PMU does not need to comply with high-integrity constraints and can be designed

according to low-integrity (e.g., ASIL-A) requirements.

This difference in integrity level exposes system designers to the evident paradox of using
low-integrity, poorly documented PMUs as the basis for timing analysis mechanisms that are expected

to guarantee that the system achieves enough freedom from interference for higher-integrity tasks

(e.g., ASIL-C/D). High-integrity (i.e. ASIL C/D), WCET-aware, well-documented PMUs will

become an instrumental tool to simplify and consolidate the arguments in support of timing V&V

in the presence of automotive multicore complex processors [42].

1.2 Contribution

In this Thesis, we take a step towards reconciling PMU verification, which is often disregarded,

with its critical role for timing analysis. Our contributions are as follows:

1. Analysis of Event Monitor Correctness. We perform an analysis for several event monitors

which are present i) in the GPU of the NVIDIA AGX Xavier and TX2 development boards,

2 PMCs are the software-visible and programmable registers used to read HEMs. The latter store counts of events that

are made visible to the software via PMCs. For simplicity, we refer to both indistinctly.


and ii) in the CPU of the Xilinx UltraScale+ SoC, and we assess them against their technical
specification provided by the manufacturer. Our goal is not to cover all event monitors

supported by those architectures, since they comprise several hundreds [32]. Our focus is,

instead, illustrating that some event monitors might not behave as one would expect, and, for

specific code snippets, we show that discrepancies occur between observed event counts and
the values that a performance analyst would expect based on the event monitor specifications
provided in the corresponding product manuals. Such evidence supports

our claim that OEMs/TIER/timing analysis companies cannot blindly trust event monitors

without a preliminary validation process.

2. Monitor Validation Process. We describe the steps to follow in a manual validation process

that helps in the validation of the event monitors of COTS SoCs. We also show a practical

application of this process to a small subset of monitors in i) the NVIDIA Jetson AGX

Xavier, ii) the TX2 and in iii) the Zynq UltraScale+ MPSoC. Those event monitors, for

which discrepancies are detected w.r.t. the expected values, are put under quarantine and

investigated. For some of them, and as a result of the application of the validation process,

we show that discrepancies can be explained, hence regaining trust in the correctness of the

hardware event monitor.

3. Assessment of an automatic validation process. We discuss the difficulties of developing a

systematic and automatic process for event monitor validation. In contrast with other verifi-

cation activities (e.g., unit testing), the PMU validation process cannot be easily automated

because event counters are extremely target-specific and their operation may differ depending

on the processor vendor and the specific hardware/software configuration. However, manual

procedures are frequent in verification and certification processes. This includes all safety-

related software in an automotive system that needs to undergo a manual inspection process

in order to be certified.
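The kind of preliminary validation argued for above can be given a rough first-cut shape: compare a count measured through the Linux `perf` CLI against the value expected from the benchmark's code. In this sketch the event name, the benchmark path and the 5% tolerance are assumptions for illustration, not the Thesis's actual methodology (which is detailed in Chapter 3).

```python
# First-cut HEM plausibility check via `perf stat`. Assumptions: `perf` is
# installed, the event name exists on the target, and ./bench is a benchmark
# whose expected count is known from its (disassembled) code.
import subprocess

def parse_perf_csv(stderr_text):
    """Parse `perf stat -x,` CSV output (value,unit,event,...) into {event: count}."""
    counts = {}
    for line in stderr_text.splitlines():
        fields = line.split(",")
        # Skip headers and "<not counted>" / "<not supported>" rows.
        if len(fields) >= 3 and fields[0].strip().isdigit():
            counts[fields[2].strip()] = int(fields[0])
    return counts

def check_event(counts, event, expected, tolerance=0.05):
    """Flag an event whose measured count deviates from the expected value."""
    measured = counts.get(event)
    if measured is None:
        return "not counted"
    deviation = abs(measured - expected) / expected
    return "ok" if deviation <= tolerance else "quarantine"

if __name__ == "__main__":
    try:  # Hypothetical invocation; replace ./bench and the expected value.
        result = subprocess.run(
            ["perf", "stat", "-x,", "-e", "instructions", "./bench"],
            capture_output=True, text=True)
        print(check_event(parse_perf_csv(result.stderr),
                          "instructions", expected=1_000_000))
    except OSError:
        pass  # perf not installed on this machine
```

Events flagged as "quarantine" would then be investigated further, mirroring the quarantine step of contribution 2.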

The contribution of this Thesis has been published in the following paper [8]:

• Barrera, J.; Kosmidis, L.; Tabani, H.; Mezzetti, E.; Abella, J.; Fernandez, M.; Bernat, G.;
Cazorla, F. J. “On the reliability of hardware event monitors in MPSoCs for critical domains”.
In: Proceedings of the 35th Annual ACM Symposium on Applied Computing (SAC 2020),
Brno, Czech Republic, March 30 – April 3, 2020. New York: Association for Computing
Machinery (ACM), 2020, pp. 580-589.

1.3 Thesis Organization

The rest of this Thesis is organized as follows:


• Chapter 2 provides the necessary background to understand the context of the field and the
purpose of this Thesis, and reviews the main related works.

• Chapter 3 presents a methodological approach to validate event monitors against their spec-

ification and discusses the difficulties of making this process fully automated.

• Chapter 4, Chapter 5, and Chapter 6 report on the application of the proposed validation

process to a selection of event monitors in the NVIDIA Jetson AGX Xavier, the NVIDIA

Jetson TX2, and the Xilinx Zynq Ultrascale+ respectively.

• Chapter 7 concludes the Thesis, presenting the main take-away messages.


Chapter 2

Background and Related Work

In this Chapter, we present a brief summary of timing analysis approaches for real-time systems,

as well as the role that performance counters play in the validity of these methods. Finally, we

present the works most relevant to ours.

2.1 Timing Analysis

In this Thesis, we focus on the timing verification of CRTES, which is as important as their
functional verification. The purpose of timing verification is to ensure that the software complies
with its timing requirements, which are expressed in terms of tasks’ periods and deadlines. This
is achieved through a process known as timing analysis, which computes (or estimates) the
Worst-Case Execution Time (WCET) of the task under analysis, i.e. the maximum time the task
can take under any circumstances. However, determining this time precisely is very hard – if at
all possible for large, complex programs – so in practice an upper bound of this time is selected
and used as the WCET.

There are several methods in the literature which can be used for computing the WCET, which are

summarized in Table 2.1. It is worth noting that there is no perfect WCET computation technique,
since each one is based on a set of assumptions [4]. Whether these assumptions are satisfied affects
the soundness and accuracy of each method. In the following sections we briefly examine

the characteristics of each one.

2.1.1 SDTA: Static Deterministic Timing Analysis

SDTA techniques derive WCET bounds for a given task without executing it on the target
platform; instead, they combine the results from two models: an abstract hardware model and


                   Deterministic   Probabilistic
Static             SDTA            SPTA
Measurement-based  MBDTA           MBPTA
Hybrid             HDTA            HYPTA

Table 2.1: Taxonomy of WCET techniques.

the structural representation of the task under analysis. SDTA approaches consider all possible

inputs for a program and the search space is kept within a tractable dimension only by using

safe abstractions of the software and hardware. Despite the precision of SDTA, the inputs and
assumptions in the analysis steps may result in inaccuracies if they are defective. Nevertheless,
SDTA is an industrially viable option for timing analysis if the hardware and software are simple

and well documented.

2.1.2 MBDTA: Measurement-Based Deterministic Timing Analysis

MBDTA derives the WCET estimates by collecting measurements from the execution on top of

the target platform. MBDTA trustworthiness depends on whether the target platform is the
same as the one used for deployment, whether the input data include the scenario leading to the
WCET and, finally, whether the measurements are accurate and actually used for the WCET
estimation. Since it is

unknown whether the input data includes the scenario leading to the WCET – an input known

as the Worst-Case Input, whose identification is an open problem that usually depends on the

software developer’s knowledge of the software –, the gap between the highest observed execution

time (HOET) and the WCET is not known. Therefore, it is common to apply a certain engineering

factor to the HOET to estimate the WCET if the user has a non-negligible knowledge of both

hardware and software being analysed. For example, a common engineering margin used in the

avionics sector is 20%, for the particular software and hardware used in this sector [65].
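This engineering-margin practice amounts to simple arithmetic over the measurements; the one-liner below makes it explicit. The 20% default mirrors the avionics figure cited above, and the numbers in the example are purely illustrative.

```python
# Sketch of the industrial MBDTA practice: take the highest observed
# execution time (HOET) across test runs and apply an experience-based
# engineering margin (20% by default, as cited for avionics).
def wcet_estimate(observed_times_us, margin=0.20):
    """Return HOET inflated by the engineering margin (times in microseconds)."""
    hoet = max(observed_times_us)
    return hoet * (1.0 + margin)
```

For example, observed times of 100, 120 and 115 us give a HOET of 120 us and a WCET estimate of 144 us. The soundness of the margin itself remains a matter of engineering judgment, which is precisely MBDTA's weakness.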

2.1.3 HDTA: Hybrid Deterministic Timing Analysis

The purpose of HDTA approaches is to combine the best of SDTA and MBDTA in a single technique,
increasing the confidence in the measurements with static information while preserving the industrial
viability of the approach. Hybrid approaches achieve better trustworthiness than MBDTA
and help alleviate the user’s workload when it comes to producing tests. Hybrid

analysis techniques use measurements to infer hardware properties required to build the static

model [45] which makes them susceptible to the accuracy of the performance counters used to

derive the model.


Figure 2.1: Example of pWCET curve [44]

2.1.4 SPTA: Static Probabilistic Timing Analysis

Like SDTA, SPTA is limited to simple processor models, with additional limitations on cache
associativity. The applicability of SPTA to more realistic processor designs has

not been proven yet, which makes it a non-viable alternative for industrial use. Nevertheless, the

trustworthiness of SPTA suffers from the same challenges as SDTA techniques.

2.1.5 MBPTA: Measurement-Based Probabilistic Timing Analysis

In contrast with static techniques, Measurement-Based Probabilistic Timing Analysis (MBPTA)
executes the program on the real system or on a simulator, measuring the time it takes to execute.
After collecting several execution times, an upper-bounding WCET is derived from the execution
time distribution using appropriate probabilistic and statistical methods that ensure that the
WCET estimate is representative. MBPTA on top of MBPTA-friendly hardware keeps
the industrial viability of MBDTA while gaining a higher level of trustworthiness, as has been
shown in several industrial case studies [25, 64, 65]. Moreover, the same effect can be achieved
by using software-only techniques, such as software randomization, to enable MBPTA on
conventional architectures [17, 35, 36, 64].

The idea behind the probabilistic approach is to produce more than one WCET, each with an assigned exceedance probability, which enables the system designer to choose a pessimism level. The MBPTA approach collects execution times and, by applying Extreme Value Theory (EVT), can predict how the execution time behaves in extreme cases. With a high enough number of runs, it is possible to obtain the curve that, given a probability, provides a WCET value, as Figure 2.1 shows. The WCET takes into account the deterministic base time and the additional delay (worst impact) due to contention


that happens in the execution.

MBPTA is capable of deriving a tight and reliable WCET with less information than other timing analysis techniques. To achieve this, some hardware components are randomized, since truly random behavior is independent and identically distributed.
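As an illustration of the EVT step, the sketch below fits a Gumbel distribution to a set of execution-time samples by the method of moments and reads off a pWCET value for a target exceedance probability. This is a simplified sketch, not the full MBPTA protocol (which also includes independence and identical-distribution tests and a careful choice of the tail fit), and the sample values are synthetic.

```python
import math

def gumbel_fit(samples):
    """Method-of-moments Gumbel fit: location and scale from mean and stddev."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)
    scale = math.sqrt(6.0 * var) / math.pi
    loc = mean - 0.57721566 * scale  # Euler-Mascheroni constant
    return loc, scale

def pwcet(loc, scale, p):
    """Execution-time bound exceeded with probability at most p (Gumbel quantile)."""
    return loc - scale * math.log(-math.log(1.0 - p))

# Synthetic execution-time samples (cycles); real MBPTA uses measured runs.
times = [1000 + 7 * ((13 * i) % 29) for i in range(1000)]
loc, scale = gumbel_fit(times)
bound_1e3 = pwcet(loc, scale, 1e-3)  # bound exceeded at most once in 10^3 runs
bound_1e9 = pwcet(loc, scale, 1e-9)
```

Lowering the exceedance probability moves the bound further up the tail, which is the pessimism trade-off shown by the pWCET curve in Figure 2.1.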

2.1.6 HYPTA: Hybrid Probabilistic Timing Analysis

HYPTA is still in its infancy and thus there is no viable alternative for industrial use yet. To the best of our knowledge, PUB [34] and EPC [41, 74] are the main HYPTA techniques so far. Although PUB increases the path coverage with respect to MBPTA, it relies on automatic code modifications. Some work has been done to increase the path coverage of MBPTA by means of HYPTA approaches without needing to modify the application under analysis [43]. However, how to link these new methods to certification processes is still an issue to be tackled.

2.2 Using PMCs in the Timing Analysis Process

As the complexity of MPSoCs in CRTES continues to increase, we are witnessing a growing number of works that propose building on PMCs to support timing analysis. These works cover both static and measurement-based timing analysis techniques.

The increase in complexity of multicore hardware and advanced software functionalities jeopardizes the applicability and effectiveness of conventional timing analysis approaches [4, 56, 68]. It is becoming increasingly evident that novel forms of timing analysis are required that capture the peculiarities of multicore execution [42]. Specific MPSoC execution aspects like the utilization of shared resources and contention delay are captured to meet emerging certification and qualification requirements (e.g., interference channel characterization [22] and freedom from interference [31]).

Monitoring and profiling solutions are becoming fundamental aspects of timing verification. While several profiling and monitoring solutions exist, they have been designed and deployed for software/hardware debugging and (average) performance optimization purposes, and are not particularly tailored to timing analysis. In the following, we cover some of the key trade-offs when considering different tracing solutions, with particular focus on the specific end-user requirements.

• Static timing analysis techniques are migrating towards hybrid approaches in which the measurement of PMCs is used to validate the predictions made for metrics like execution time or access counts [20, 21].

• For measurement-based techniques, a full breadth of approaches build on PMCs to derive quotas on the maximum number of events of a given type that tasks can generate [18, 19, 45, 54, 72].


This includes cache access counts and misses.

In this line, we can find a handful of prior works that build on performance counters for software timing estimation, both for deterministic and probabilistic timing analysis methods. Paulisch et al. [45] build on performance counter events to create an analysis and runtime monitoring solution for limiting task contention in multicore CPU architectures. Diaz et al. [?] inflate multicore pWCET estimations derived by MBPTA based on the number of cache events obtained with performance counters, to account for cache contention. In [19], an ILP-based contention model is proposed for the AURIX automotive microcontroller, building on the performance counters available on that platform. In addition, the authors identified limitations in the counting of events of interest using the available performance counters. The authors in [23, 24] show the importance of documentation, since the lack of it leads to uncontrolled and unknown activities that jeopardize the WCET estimations for MBTA.

In [28], the authors use performance counters in the CPU of multicore systems for WCET estimation using measurement-based probabilistic timing analysis. The authors of [27] also used performance counters for WCET estimation of CPU tasks on multicore systems, proposing a method to select the performance counter with the highest contribution and a forecast model to predict execution time under unseen configurations. The authors in [66, 73] study the variability caused by non-deterministic performance counter implementations in CPUs, without analyzing whether values are as expected, which is instead the target of our work. Nevertheless, in these works, the authors observe non-null but relatively low variability across measurements.

Several works [6, 51] have focused on automotive platforms featuring GPUs such as NVIDIA’s

TX1 and TX2. Some of them have discovered undocumented features of those hardware platforms

like the scheduling policy [6] or exposed mismatches in the software documentation regarding

blocking or asynchronous behavior of CUDA API calls [70]. However, none of these works studies

event monitors, whose behavior and documentation mismatches we expose in this work. To our

knowledge, there is no other work in the real-time literature which considers GPU performance

counters.

In addition, [12, 51] present benchmarking and platform characterization studies of automotive platforms. Regarding timing modeling of GPUs, in the literature we can find the seminal works [9–11]. The first two papers build on a simulated GPU for WCET estimation, while our work uses a real GPU for event monitor validation. On the other hand, [10] relies on end-to-end measurements on a real GPU platform for WCET estimation and timing analysis, not validation of performance counters.

All these works build on the counters from the PMU, trusting that this unit is a reliable source of information, but no work has assessed whether the PMU can be trusted as-is. This lack of verification leaves open the possibility of errors, which can affect the trustworthiness of the timing analysis.


Conclusion. While PMUs may vary depending on the architecture and family of processors and platforms, they generally offer the capability to track a large number of events, typically in the range of a few hundred or even thousands, related to multiple aspects of execution: from cache-hierarchy statistics to accesses over the interconnects, as well as instruction counts for different instruction types. Instruction counts are fundamental to assess that the program has been executed correctly. At the type level, memory operations such as loads and stores are needed to derive cache miss rates. Likewise, uncacheable loads and stores allow assessing the memory accesses of the program.

The fine-grained information that can be obtained from hardware event monitors can be used to improve the understanding of the timing behavior of an application [18, 19], to enforce usage thresholds for shared components [45], and to define a more accurate timing model of contention-prone hardware resources [19]. Ultimately, these aspects concur with the sought-after property of freedom from interference in ISO-26262 (and interference channel identification in CAST-32A) to guarantee that timing faults cannot propagate across software elements with different criticality levels. However, the question arises of whether the information derived from event monitors in PMUs can be trusted to support timing evidence for certification purposes [42].

The critical role of PMU information clashes with their intended purpose, as PMUs were originally devised as a means to support low-level performance tuning and to provide rough outlines of the average behavior of the software running on top of them. In fact, PMUs have traditionally been developed at the lowest integrity levels (if any), under quite relaxed V&V criteria, and are thus more error-prone than components intended for higher integrity levels [42]. Moreover, PMUs are generally accompanied by scarce and inaccurate documentation [50]. Therefore, PMU information cannot be straightforwardly used as a cornerstone for the provision of solid certification arguments on the timing behavior. Instead, PMUs must undergo a rigorous validation process to guarantee that the information they provide can be trusted for timing V&V.


Chapter 3

Event Monitor Validation Process

In this Thesis we contend that a methodological approach is required to validate event monitors in order to use Performance Monitoring Counters (PMCs) with high confidence as part of MPSoC timing verification [18, 45, 54, 72]. Defining a generic toolkit for validation is usually impractical due to the large number of available events, which differ in operation and characteristics across processor vendors, or even across models from the same vendor, and depend on the hardware and system software configuration. Nevertheless, what can be done instead is to define a general methodological process that can later be refined, based on expert knowledge, for a specific event monitor and platform configuration. We call this general methodology the Event Monitor Validation Process (EMVP).

3.1 Event Monitor Validation Process (EMVP)

The validation of event monitors is a test-driven process in which each monitor is exercised while running specifically designed programs. The value counted by the monitor is compared with an expected value, estimated based on the target platform hardware and software and the test program, to assess whether the monitor can be deemed trusted, i.e. whether the counted value and the expected value match or present a gap within an acceptable threshold. The proposed EMVP comprises several steps, as shown in Figure 3.1. Some of the steps require technical knowledge in order to do an informed tailoring; hence, an expert analyst performs those activities.

Event Selection. Following the trend of processors in the high-performance domain, the number of event monitors in the latest processors in domains such as automotive is in the order of hundreds. As an example, the Xavier SoC offers 273 event monitors accessible from the profiler and the debugger for its Volta GPU. Hence, an exhaustive validation of all event monitors can be too costly in general. Instead, the analyst can discard those event monitors that do not affect the


Figure 3.1: EMVP Diagram

timing/safety argumentation based on requirements coming from the upper timing V&V, and hence do not require any validation. Also, in some architectures, the hardware allows multiple configurations (a.k.a. platform usage domain or Critical Configuration Settings [22]), which impact the event monitors to validate. For instance, if a given resource is partitioned (segregated), it might not be needed to track per-core/task access counts to it. Note that, strictly speaking, this step, represented as 0 in Figure 3.1, is not part of the monitor validation process, which only focuses on the validation of the events provided as input. We have added this preliminary step to the diagram for completeness.

Experiment and representative benchmark (rbe) design. From the description of the events in the processor manuals or programmers’ guidelines and the understanding of the processor architecture, the analyst designs one or several baseline representative benchmarks, or rbe (step 1). The rbe must have two key characteristics. First, the rbe needs to exercise the event monitor. Second, the analyst must be able to derive the expected value of the event monitor for that rbe, which means that the rbe must be simple enough to allow the analyst to place enough confidence in the expected values. For a certification argument, the completeness of the rbe used to exercise the event monitor under validation must be justified.

Validation campaign. Empirical evidence is collected on the target. The rbe is executed in controlled scenarios (step 2) configured by the analyst on the target platform to reduce as much as


possible external sources of variability, e.g. the operating system. In each run, the PMU is configured to read the event monitor under validation.

Acceptance criteria. Next, the analyst compares the expected results with those captured by the event monitors (step 3). In case a discrepancy is detected, it can be due either to imprecise technical documentation of the event monitor in the users’ manual, or to an actual misbehavior in the counter logic. In either case, the counter cannot be used as-is for timing V&V purposes, and further investigation is required to understand, and possibly resolve, the cause of the inconsistency. If no discrepancy is detected in the tests carried out, the counter is deemed trustable (step 4) based on the tests performed.
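As a minimal sketch of this acceptance step, the helper below compares a measured counter value against its expected value under a relative tolerance; the tolerance itself is an assumption that the analyst must justify per event and platform, and here defaults to requiring an exact match.

```python
def accept(expected, measured, rel_tol=0.0):
    """Deem a counter trustable for this test if the measured value matches the
    expected one, or if their gap stays within the given relative tolerance."""
    if expected == measured:
        return True
    ref = max(abs(expected), 1)  # avoid division by zero for zero-count events
    return abs(measured - expected) / ref <= rel_tol

# Values from Table 4.2: inst_integer matches exactly, inst_misc does not.
integer_ok = accept(5242880, 5242880)  # exact match: trustable for this test
misc_ok = accept(4194304, 6291456)     # 50% discrepancy: needs investigation
```

A rejected counter is then fed to the hypothesis-formulation step rather than discarded outright.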

Formulate hypotheses. For those counters whose measured values do not match the expected ones, the analyst formulates hypotheses (step 5) on the causes of the observed misbehaviour. This requires understanding the experiment, the architecture, and the expected results. For instance, by determining the magnitude of the discrepancy and the expected values of other related events, the analyst can formulate further hypotheses to be verified. The process then continues, going back to step 1, in which the same or new rbe are used to accept or reject the hypotheses. If a hypothesis is accepted, the discrepancy between the observed and the expected values is understood and can be corrected. If it is rejected, time/effort allowing, new hypotheses are formulated and the whole process starts over. If no further hypotheses can be formulated and/or tested, the event monitor is regarded as untrusted (step 6).

3.2 Systematic and Automated Validation

The apparently simple assessment process is inherently platform-specific and requires deep technical knowledge of both the nominal behavior of the target hardware components and the manifold platform and PMU configurations. Hardware and software development have benefited from some form of automated functional verification based on relatively high-level models of both hardware and software. However, no abstraction model is available for the verification of PMUs. PMUs touch the lowest levels of hardware design, and their black-box verification can only be performed by building on the understanding and expertise of a hardware expert. In particular, expertise is required in order to select the subset of relevant event monitors to be empirically validated. Furthermore, it is not possible to automatically generate the platform configuration and verification snippets necessary to validate a given monitor, because both vary across ISAs, platforms, models, and versions.

Having an expert supervising a verification or certification process, however, is consolidated practice. Several aspects of testing are delegated to the expertise of testing engineers, especially for the verification of system-wide properties. Several objectives in CAST-32A rely on the guidance of an external assessment, for example, the identification of interference channels, the verification of inter-core data and control coupling, or the implementation (and coverage) of the safety


net [22]. In some of these cases, there is not even a metric or criterion (such as MC/DC, Modified Condition/Decision Coverage, or branch coverage for structural coverage) to determine when testing can be deemed sufficient.

3.3 Automation Opportunities

Following the discussion in Section 3.2, a question that arises is whether some of the steps in the

proposed methodological approach can benefit from some form of automation.

Regarding step 1, on experiment and rbe design, while specific procedures can be set, we are not aware of any technology that, from the technical reference manual of a processor and the event monitor to validate, can systematically and automatically define a (set of) rbe(s) to validate it. Instead, this task is to be done manually by a performance analyst, i.e. following predefined procedures, as for design inspection and walkthrough in functional safety verification processes.

Once the rbe(s) are defined, tool support can be used to derive the expected value of some of the event monitors for that rbe. The analyst could also exploit a database of rbe(s) with precomputed event monitor values, which can be obtained through state-of-the-art simulators or assembly-code analyzers. This, of course, implies that the tools used shall be qualified to the appropriate criticality level according to the applicable safety standards.

Step 2 is mostly procedural and can be largely automated by building on an automated test framework.

In terms of acceptance criteria (step 3), similarly to step 1, there is no systematic approach to determine which acceptance criterion is correct to apply in each case. In fact, such a criterion is to be assessed by the expert analyst and properly described and sustained in front of the certification authorities, building upon repeatable protocols.

Likewise, in step 5, once a deviation is detected in the event counter, we are not aware of any solution to automatically formulate hypotheses to explain the observed behavior and design new experiments (and likely an rbe) to assess them. Hence, it also requires human intervention.

3.4 Conclusion

While full automation is not possible, it is important to establish well-defined procedures that allow the verification process to be performed exhaustively, reviewed easily, and kept free of ambiguities and misunderstandings. In the particular case of the reliability of event monitors, the focus of this work, we propose a specific procedure that we apply to specific events and platform examples. This procedure and its results, in the form of evidence verifying what each event monitor counts in practice, are the basis upon which OEMs/TIERs/tool vendors can build timing analysis


methods and tools for complex SoCs where timing guarantees build upon event quota budgeting,

monitoring, and enforcement [18,45,54,72].

In the following three chapters, we apply EMVP to platforms from different vendors and architectures. The assessments on the platforms will be performed as follows:

1. NVIDIA Jetson AGX Xavier: assessment on its NVIDIA Volta GPU

2. NVIDIA Jetson TX2: assessment on its NVIDIA Pascal GPU

3. Xilinx Zynq UltraScale+: assessment on its Cortex-A53 CPU


Chapter 4

Assessment of EMVP on the NVIDIA Jetson AGX Xavier

We start by assessing our validation approach on a selection of event monitors in the NVIDIA Jetson AGX Xavier. In particular, we focus on type-based instruction counts, a basic information element used for several aspects of timing analysis. This includes the following:

1. For quota monitoring, store counts are important when first-level data caches are write-through, as each store causes a transfer to the inter-core shared interconnect or to the next (second-level) shared cache.

2. Instruction counts for uncacheable loads and stores determine how many times specific devices, subject to contention, are used.

3. Instruction counts are also used for timing validation, as they allow assessing whether programs experience preemption by comparing instruction counts between runs on bare metal and on top of the analysed RTOS.

Table 4.1 shows the instruction types used in this analysis for the NVIDIA Jetson AGX Xavier GPU. The first column lists the particular event monitors to validate, while the second column provides the description in the official GPU provider documentation. To obtain this information, we used the NVPROF tool [48] from the CUDA 10.0 toolkit as follows: nvprof --query-events --query-metrics. As can be seen, each event monitor counts certain instruction types. The particular operation codes under each instruction type are provided in a different document [2]. The third column lists the subset of opcodes under each instruction type on which we focus (extending this to other opcodes is engineering work following the same EMVP approach). For instance, inst_integer captures the following opcodes: BMSK, BREV, FLO, IABS, IADD, IADD3, IADD32I, IDP, IDP4A, IMAD, IMMA, IMNMX, IMUL, IMUL32I, ISCADD, ISCADD32I, ISETP, LEA, LOP, LOP3, LOP32I,

POPC, SHF, SHL, SHR, VABSDIFF, VABSDIFF4. From those, we focus on IMAD, IADD3, SHF, LOP3 and ISETP (the ones listed in Table 4.1), as they are the only ones that appear in our tests. Interestingly, there is no event counter to track MOV and SHFL instructions.

Table 4.1: Instruction types used in this analysis for the NVIDIA Jetson AGX Xavier GPU.

Event [2]                               | Official Description [2]                                                                                   | Opcodes counted [2]
inst_integer                            | Number of integer instructions executed by non-predicated threads                                          | IMAD, IADD3, SHF, LOP3, ISETP
inst_fp_32                              | Number of single-precision FP instructions executed by non-predicated threads (arithmetic, compare, etc.)  | FSETP, FMUL, FADD, FSEL
inst_compute_ld_st                      | Number of compute load/store instructions executed by non-predicated threads                               | LDS, LDG, STS, STG
inst_control                            | Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)                | BRA, EXIT
inst_bit_convert                        | Number of bit-conversion instructions executed by non-predicated threads                                   | I2F
(no event)                              | Instructions that move data across registers                                                               | MOV, SHFL
inst_misc                               | Number of miscellaneous instructions executed by non-predicated threads                                    | NOP, S2R, BAR
not_predicated_off_thread_inst_executed | Number of thread instructions executed that are not predicated off                                         | Total
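To read these monitors for a given rbe run, the values can be collected through nvprof. The helper below only assembles the invocation; the binary name is a placeholder, and it assumes (per NVIDIA's profiler documentation) that these type-based counts are exposed as metrics via the nvprof --metrics flag.

```python
# Metric names as listed by `nvprof --query-events --query-metrics` (cf. Table 4.1).
METRICS = ["inst_integer", "inst_fp_32", "inst_compute_ld_st",
           "inst_control", "inst_bit_convert", "inst_misc"]

def nvprof_command(binary, metrics=METRICS):
    """Assemble the nvprof invocation that reads the selected monitors in one run."""
    return ["nvprof", "--metrics", ",".join(metrics), binary]

# On the board this list would be passed to subprocess.run(); "./rbe" is a placeholder.
cmd = nvprof_command("./rbe")
```

Collecting all metrics of interest in one invocation keeps the number of kernel replays, and hence the run-to-run variability the analyst must control for, to a minimum.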

4.1 Experiment and representative benchmark design

We build on a matrix copy program for which we can derive the expected number of instructions of each type. Figure 4.1 (top) shows the C code with CUDA calls of the program, and the corresponding GPU assembly (SASS) code produced for this specific GPU using cuobjdump (bottom). Instructions 1 and 2 in the SASS code comprise the kernel’s prologue, performing the kernel initialization. Instructions 3 to 6 load into registers the thread and block identifiers, which are used in the right-hand side of the CUDA source code in lines 4 and 5. Instructions 7 to 9 in the SASS code compute the thread access positions stored in the variables on the left-hand side of source code lines 4 and 5. Instruction 10 calculates the index within the brackets of source code line 6. Instructions 11 and 13 calculate the memory addresses for arrays d_x and d_y, respectively. Instruction 12 performs the load access from d_x, while instruction 14 carries out the store access to d_y. Finally, instruction 15 terminates the kernel.

As shown in the kernel invocation in line 22 of the source code, the kernel is launched with 1024x1024 threads. Each instruction is executed by all threads, which allows us to compute the expected number of instructions of each type, in order to validate it against the measurements

of those instructions obtained with performance counters in the next step. Therefore, we expect the SASS code to be executed 1,048,576 times, thus leading to 16,777,216 (16 · 2^20)

 1  #include <stdio.h>
 2
 3  __global__ void copy(int N, float *d_x, float *d_y) {
 4      int x = blockDim.x*blockIdx.x + threadIdx.x;
 5      int y = blockDim.y*blockIdx.y + threadIdx.y;
 6      d_y[N*y + x] = d_x[N*y + x];
 7  }
 8
 9  int main(void) {
10      int N = 1024;
11      float *x, *y, *d_x, *d_y;
12      x = (float *) malloc(N*N*sizeof(float));
13      y = (float *) malloc(N*N*sizeof(float));
14      dim3 grid(32, 32);
15      dim3 block(N/32, N/32);
16      cudaMalloc(&d_x, N*N*sizeof(float));
17      cudaMalloc(&d_y, N*N*sizeof(float));
18      for (int i = 0; i < N*N; i++) {
19          x[i] = 42.0f;
20      }
21      cudaMemcpy(d_x, x, N*N*sizeof(float), cudaMemcpyHostToDevice);
22      copy<<<grid, block>>>(N, d_x, d_y);
23      cudaMemcpy(y, d_y, N*N*sizeof(float), cudaMemcpyDeviceToHost);
24      cudaFree(d_x);
25      cudaFree(d_y);
26      free(x);
27      free(y);
28  }

 1  /*0000*/ MOV R1, c[0x0][0x28];
 2  /*0010*/ @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ;
 3  /*0020*/ S2R R0, SR_CTAID.X;
 4  /*0030*/ S2R R2, SR_TID.X;
 5  /*0040*/ S2R R3, SR_CTAID.Y;
 6  /*0050*/ S2R R4, SR_TID.Y;
 7  /*0060*/ MOV R5, 0x4;
 8  /*0070*/ IMAD R0, R0, c[0x0][0x0], R2;
 9  /*0080*/ IMAD R2, R3, c[0x0][0x4], R4;
10  /*0090*/ IMAD R0, R2, c[0x0][0x160], R0;
11  /*00a0*/ IMAD.WIDE R2, R0, R5, c[0x0][0x168];
12  /*00b0*/ LDG.E.SYS R2, [R2];
13  /*00c0*/ IMAD.WIDE R4, R0, R5, c[0x0][0x170];
14  /*00d0*/ STG.E.SYS [R4], R2;
15  /*00e0*/ EXIT;
16  /*00f0*/ BRA 0xf0;

Figure 4.1: CUDA/SASS code of the matrix copy benchmark for the AGX Xavier.

instructions. Those instructions are broken down into 3 · 2^20 data movement (MOV and SHFL), 4 · 2^20 miscellaneous (S2R), 5 · 2^20 integer (IMAD), 2 · 2^20 load/store (LDG and STG), and 2 · 2^20 control flow (EXIT and BRA). Note that the final BRA acts as a safeguard after the kernel termination (EXIT).
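This expected-value derivation is simple enough to script. The sketch below, a sanity check rather than part of any toolchain, multiplies the per-thread counts read off the SASS listing by the number of launched threads to reproduce the expected column of Table 4.2 (counting both EXIT and the trailing BRA as control flow, as Table 4.2 does).

```python
# Per-thread instruction counts read off the SASS listing of the matrix copy kernel.
PER_THREAD = {
    "DMOV": 3,                # MOV (x2) and SHFL: no event monitor covers these
    "inst_misc": 4,           # S2R (x4)
    "inst_integer": 5,        # IMAD (x5)
    "inst_compute_ld_st": 2,  # LDG and STG
    "inst_control": 2,        # EXIT plus the trailing BRA safeguard
}
THREADS = 1024 * 1024  # the kernel is launched with 1024x1024 threads

expected = {event: count * THREADS for event, count in PER_THREAD.items()}
expected["Total"] = sum(expected.values())
```

The resulting values match the expected column of Table 4.2, against which the counters are then judged.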


Table 4.2: Measured/expected values for the matrix copy benchmark on the NVIDIA Jetson AGX Xavier.

Event                  | Expected   | Measured   | Discrepancy
(1) ‘DMOV’             | 3,145,728  | 0          | -3,145,728
(2) inst_misc          | 4,194,304  | 6,291,456  | +2,097,152
(3) inst_integer       | 5,242,880  | 5,242,880  | 0
(4) inst_compute_ld_st | 2,097,152  | 2,097,152  | 0
(5) inst_control       | 2,097,152  | 1,048,576  | -1,048,576
(6) Total              | 16,777,216 | 14,680,064 | -2,097,152

4.2 First Validation Step

From the collected values we have detected several discrepancies with respect to the expected values, as shown in Table 4.2. For each instruction type we report the number of instructions expected based on our analysis of the SASS code, the number counted with the event monitors, and the discrepancy. Note that we exclude those types for which we both expect and count zero instructions. We extract the following conclusions:

(1) Data movement instructions, as expected, are not counted at all, since there is no specific event to count them.

(2) Surprisingly, the number of miscellaneous instructions measured is higher than that in the SASS code. In particular, there are 4 S2R instructions in the SASS code, executed ≈ 1 million times each (1,048,576 threads), so we would expect ≈ 4 million MISC instructions to be counted. However, inst_misc reports ≈ 6 million MISC instructions, as if there were 2 additional MISC instructions per thread in the SASS code.

(3), (4) Integer and load/store instructions are counted properly.

(6) The total number of instructions measured matches the sum of the individual types counted. However, this number differs from the total number of expected instructions. Hence, we need to further analyse the event counters inst_misc, ‘DMOV’, and Total. On the contrary, for inst_integer, inst_control and inst_compute_ld_st, since the counts we observe for both experiments in Figure 4.1 and Figure 4.2 (explained later) are precise, we consider them reliable.

First set of hypotheses. From these results, we formulate the following hypotheses. The inst_misc monitor counts two instructions per thread beyond those appearing in the SASS code and regarded as MISC according to NVIDIA’s documentation [2]. We hypothesize that other instructions are counted as MISC:

• Hypothesis 1a. Either those other instructions correspond to a different category, but are counted as MISC.

• Hypothesis 1b. Or they are instructions not shown in the SASS code. After reviewing the

Figure 4.2: SASS code of the combined example, with its pre-loop, loop prolog, loop body, and post-loop regions.

semantics of the program in the SASS code, we verify that addresses are properly computed, and that data is read from the source matrix and written to the destination matrix. Thus, we cannot attribute any specific operation to the potentially hidden instructions (e.g. they could be NOP instructions).


Figure 4.3: MISC inst. counted and expected (example in Fig. 4.2).

4.3 Second Validation Step

In order to test the hypotheses above, we have performed a number of individual experiments. Each of them aims at varying the instruction counts of the different instruction types whose counters report discrepancies w.r.t. the expected values. By doing so, and comparing the expected number of instructions of those types against the actual event counts, we expect to discern which of the formulated hypotheses is the right one in each case and, if all of them are rejected, obtain additional information to raise new, informed hypotheses. For the sake of simplicity, we have merged all experiments into a single one. The combined experiment contains a loop in which we can vary the number of iterations and, hence, the number of executed instructions of each type. The SASS code of this example is shown in Figure 4.2. Hexadecimal numbers on the left show the

instruction address. Arrows indicate the direction of the conditional branches, which in fact are predicated unconditional branches. Predicates are shown as @!PT, @!P0 and @!P1. The program starts by executing instructions 10h-30h. When the loop is executed at least once, the BRA at 40h is not taken and execution continues until the BRA at 120h. That branch is taken for each additional iteration, thus looping over instructions 90h-120h. Whenever it is not taken, the instructions from 130h until the end of the program are executed. Therefore, instructions 10h-40h and 130h-1F0h are executed exactly once. Instructions 50h-80h are executed exactly once as long as the loop iterates at least once. Instructions 90h-120h are executed as many times as the loop is intended to execute. Note that, in theory, instructions 200h-270h should not be executed, since the EXIT instruction at address 1F0h should terminate the kernel execution. Why those instructions are part of the SASS code is not documented by NVIDIA and, in any case, they should not have any functional effect.

For runs with 0, 1, 2 and 10 iterations, Table 4.3 shows that the inst_control, inst_compute_ld_st, inst_fp_32, inst_integer, and inst_bit_convert event counters match exactly the number of instructions executed. For instance, in the case of 1 loop iteration, where all instructions in the SASS code are executed exactly once, one would expect 9 INT, 7 FP32, 3 LDST and 1 CONV instructions (see Table 4.1) for each of the 1,024 threads, which matches exactly the corresponding event counters. Also for inst_control, when the number of iterations is 0, the BRA at 40h is taken, and then only


Table 4.3: Instruction types in Figure 4.2.

Event                   Exp.    Meas.    Exp.    Meas.    Exp.     Meas.
                       0 iter  0 iter   1 iter  1 iter  10 iter  10 iter
'DMOV'                  4,096       0    6,144       0   15,360        0
inst_misc               9,216   5,120    9,216   7,168    9,216   16,384
inst_integer            4,096   4,096    9,216   9,216   36,864   36,864
inst_fp_32              3,072   3,072    7,168   7,168   34,816   34,816
inst_compute_ld_st      3,072   3,072    3,072   3,072    3,072    3,072
inst_control            3,072   2,048    2,048   1,024   11,264   10,240
inst_bit_convert            0       0    1,024   1,024   10,240   10,240
Total                  26,624  17,408   37,888  28,672  120,832  111,616

the EXIT instruction at the end is executed (the BRA at 200h is never executed). When the number of iterations is N, N > 0, the BRA at 40h is not taken, the BRA at 120h is taken N − 1 times and not taken once, and the EXIT and the BRA at the end are each executed once. Overall, we expect N + 1 (BRA+EXIT) control instructions per thread.

Assessing hypotheses 1a and 1b. In order to determine the source of the unexpected MISC

instructions, we build upon the example in Figure 4.2. In particular, consider the case with 1

loop iteration for simplicity, so that all instructions are executed exactly once. The event counter

indicates that there are 7 MISC instructions per thread. In the SASS code, we can identify 2 S2R and 7 NOP instructions. Thus, unlike in the previous example, where the event monitor reported more events than expected, this time the monitor is undercounting. To gather additional information, we also include the result of executing the loop 10 times, where we would still expect 9 MISC instructions per thread, since the S2R and NOP instructions are outside the loop, but the MISC counter then counts 16 instructions. Thus, with 9 additional iterations,

the counter increases by 9. This behaviour also holds for other numbers of loop iterations. We

conclude that:

(1) Exactly one instruction in the loop (90h-120h) is counted as MISC. If we discard all INT, FP32

and CONV instructions in the loop, which we regarded as precisely counted by their corresponding

event counters, we get only a MOV instruction. Thus, we consider that MOV instructions are counted

as MISC.

(2) We revisit the example of the matrix copy in Figure 4.1, where we have exactly 4 S2R and

2 MOV instructions per thread. If we analyse the MISC counter in that case, which overcounted 2

instructions per thread, we realize it is fully precise if we include MOV instructions. Therefore, we

conclude that both S2R and MOV instructions are counted as MISC.

(3) We compare the measured MISC count against the theoretical MISC count (S2R+NOP), MISC plus MOV (S2R+NOP+MOV), and only S2R+MOV; see Figure 4.3.

Overall, we conclude that although MOV instructions are classified as data movement instructions


Left example:

/*0000*/ MOV R1, c[0x0][0x28];
/*0010*/ @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ;
/*0020*/ S2R R4, SR_CTAID.X;
/*0030*/ S2R R2, SR_TID.X;
/*0040*/ MOV R5, 0x4;
/*0050*/ NOP;
/*0060*/ IMAD R4, R4, c[0x0][0x0], R2;
/*0070*/ IMAD.WIDE R2, R4, R5, c[0x0][0x168];
/*0080*/ LDG.E.SYS R2, [R2];
/*0090*/ IMAD.WIDE R4, R4, R5, c[0x0][0x170];
/*00a0*/ STG.E.SYS [R4], R2;
/*00b0*/ EXIT;
/*00c0*/ BRA 0xc0;
/*00d0*/ NOP;
/*00e0*/ NOP;
/*00f0*/ NOP;

Right example:

/*0000*/ MOV R1, c[0x0][0x28];
/*0010*/ @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ;
/*0020*/ S2R R4, SR_CTAID.X;
/*0030*/ S2R R2, SR_TID.X;
/*0040*/ MOV R5, 0x4;
/*0050*/ NOP;
/*0060*/ NOP;
/*0070*/ NOP;
/*0080*/ IMAD R4, R4, c[0x0][0x0], R2;
/*0090*/ IMAD.WIDE R2, R4, R5, c[0x0][0x168];
/*00a0*/ LDG.E.SYS R2, [R2];
/*00b0*/ IMAD.WIDE R4, R4, R5, c[0x0][0x170];
/*00c0*/ STG.E.SYS [R4], R2;
/*00d0*/ EXIT;
/*00e0*/ BRA 0xe0;
/*00f0*/ NOP;

Figure 4.4: SASS code of two examples with NOP instructions.

in [2], they are effectively counted as MISC instructions. Instead, NOP instructions, although classified as MISC, are not counted. However, all those NOP instructions are located right after the last BRA instruction and are thus never executed in practice. Hence, it remains unknown whether MISC counts executed NOP instructions or never counts them.

Observation 1: inst_misc counts S2R and MOV instructions, and it remains unknown whether it counts executed NOP instructions.

Second set of Hypotheses. We formulate two hypotheses as a consequence of the investigation of hypotheses 1a and 1b:

• Hypothesis 2a. MISC does not count NOPs, which matches the fact that, so far, those NOPs

found in the SASS code have not been counted in any experiment.

• Hypothesis 2b. MISC counts NOP instructions only if effectively executed, which would be in

line with NVIDIA documentation [2] for executed NOPs.

4.4 Third Validation Step

Assessing hypotheses 2a and 2b. To assess whether NOP instructions before the final BRA are counted under inst_misc, we have performed several experiments; for the sake of brevity, we present only the simplest ones that resolve the unknown. In particular, we manipulated the source code of the example to enforce the use of NOP instructions, which do not have any functional impact.

As shown in Figure 4.4, the SASS code of these programs includes NOP instructions before and after

the final BRA instruction. According to the previous observations, the MISC count must be at least 4 per thread.


/*0000*/ MOV R1, c[0x0][0x28];
/*0010*/ @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ;
/*0020*/ S2R R6, SR_CTAID.X;
/*0030*/ S2R R0, SR_TID.X;
/*0040*/ MOV R7, 0x4;
/*0050*/ IMAD R6, R6, c[0x0][0x0], R0;
/*0060*/ IMAD.WIDE R2, R6.reuse, R7.reuse, c[0x0][0x168];
/*0070*/ IMAD.WIDE R4, R6, R7, c[0x0][0x170];
/*0080*/ LDG.E.SYS R2, [R2];
/*0090*/ LDG.E.SYS R4, [R4];
/*00a0*/ IMAD.WIDE R6, R6, R7, c[0x0][0x178];
/*00b0*/ FADD R0, R2, R4;
/*00c0*/ STG.E.SYS [R6], R0;
/*00d0*/ EXIT;
/*00e0*/ BRA 0xe0;
/*00f0*/ NOP;

Figure 4.5: SASS code for the vector addition in Global Mem.

In particular, MISC must count the 2 S2R and the 2 MOV instructions, and exclude the NOP(s) after the final BRA. If NOP instructions before the final BRA were not counted, we would obtain in both cases that MISC is exactly 4. However, in the left example MISC is 5, whereas in the right example it is 7, thus including the 1 and 3 NOPs before the final BRA, respectively.

Table 4.4: Event counts for the vector addition benchmarks.

Event                  Gmem     Smem    VSmem     Smem     Smem
                               0 sync   0 sync   1 sync   2 sync
inst_misc             4,096    4,096    4,096    5,120    6,144
inst_integer          4,096    5,120    5,120    5,120    5,120
inst_fp_32            1,024    1,024    1,024    1,024    1,024
inst_compute_ld_st    3,072    6,144    9,216    8,192    9,216
inst_control          1,024    1,024    1,024    1,024    1,024
Total                13,312   17,408   20,480   20,480   22,528

Observation 2: inst_misc also counts NOP instructions if executed (thus excluding those after the final BRA).

4.5 Assessment on Complex Code

In order to further assess our findings, we have evaluated several benchmarks, as well as kernels extracted from the Rodinia benchmark suite [15, 16], a widely used benchmark suite for GPUs. In this Thesis, we report the results obtained on a subset of benchmarks, which suffices for illustrative purposes. In particular, we analyse a vector addition benchmark whose SASS code has no loops and whose only predicated instruction is a DMOV instruction (hence not counted). Figure 4.5 shows the global memory (Gmem) incarnation of that benchmark; the other variants (shared memory (Smem) with a variable number of synchronization (sync) points) are not listed due to space constraints.

Event counts are shown in Table 4.4, with each benchmark executing 1,024 threads. Hence, per-thread instruction counts can be obtained by dividing the values in the table by 1,024. For instance,


the MISC count comprises the 2 MOV and 2 S2R instructions in all cases, plus 1 and 2 BAR instructions in the last two cases, respectively.

In all experiments, the observations we made while applying our process hold; hence, the event monitor readings match the (new) expected values in all cases:

(i) inst_misc includes MOV instructions as well as MISC instructions (excluding NOPs after the final BRA);

(ii) inst_integer, inst_control, inst_fp_32, inst_bit_convert, and inst_compute_ld_st count their expected instruction types precisely;

(iii) the total instruction count matches the sum of the other counters.

Overall, the large set of tests conducted for the validation of the event monitors of the Xavier, the most relevant subset of which is presented in this Thesis, reveals that a methodology like the one we propose is a prerequisite for a reliable use of the event monitors of GPUs in the timing V&V process.


Chapter 5

Assessment of EMVP on the NVIDIA

Jetson TX2

In order to verify that the validation process works for more than one GPU family, we have also assessed the PMCs available on the NVIDIA Jetson TX2 Development Board, from the NVIDIA Pascal family. As we did in the assessment of the NVIDIA AGX Xavier, we use the instruction-count PMCs as the example PMC set.

Table 5.1: Instruction types used in this analysis for the NVIDIA Jetson TX2 GPU.

Event [2]                      | Official Description [2]                                                                     | Opcodes counted [2]
inst_integer                   | Number of integer instructions executed by non-predicated threads                            | XMAD, IADD, SHL, SHR
inst_compute_ld_st             | Number of compute load/store instructions executed by non-predicated threads                 | LDS, LDG, STS, STG
inst_control                   | Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)  | BRA, EXIT
no event                       | Instructions that move data across registers                                                 | MOV
inst_misc                      | Number of miscellaneous instructions executed by non-predicated threads                      | NOP, S2R, BAR
not_pred_off_thread_inst_exec  | Number of thread instructions executed that are not predicated off                           | Total

The first column in Table 5.1 lists the particular event monitors to validate, while the second column provides the description from the official GPU provider documentation. To obtain this information we used the NVPROF tool [49] from the CUDA 9.2 toolkit as follows: nvprof --query-events --query-metrics. As can be seen, each event monitor counts certain instruction types. The particular operation codes under each instruction type are provided in a different document [2]. Column three lists the subset of opcodes under each instruction type on which we focus (extending this to other opcodes is an engineering task following the same EMVP approach). For instance, inst_integer captures the following opcodes: BFE, BFI, FLO, IADD, IADD3, ICMP, IMAD, IMADSP, IMNMX, IMUL, ISCADD, ISET, ISETP, LEA, LOP, LOP3, POPC, SHF, SHL, SHR and XMAD. From those, we focus on the ones listed in the third column of Table 5.1, as they are the only ones that appear in our tests. Interestingly, there is no event counter to track MOV instructions.

5.1 Experiment and representative benchmark design

Since we build the analysis of the NVIDIA Jetson TX2 on the previous assessment of the NVIDIA Jetson AGX Xavier, the experimental source code is the same, although, when compiled, the generated SASS code differs due to the change of GPU generation (from Volta to Pascal) and CUDA version used (from 10.0 to 9.2). The experiment shown in this section is the same matrix copy benchmark presented in Section 4.1. Figure 5.1 (top) shows the C code with CUDA calls of the program, and the corresponding GPU assembly (SASS) code produced for this specific GPU by using cuobjdump (bottom).

Instruction 1 in the SASS code comprises the kernel's prologue, performing the kernel initialization. Instructions 2 to 5 load into registers the thread and block identifiers, which are used in the right-hand side of the CUDA source code in lines 4 and 5. Instructions 6 to 11 in the SASS code compute the thread access positions stored in the variables on the left-hand side of source code lines 4 and 5. Instructions 12 to 16 calculate the index within the brackets of source code line 6. Instruction pairs [17,18] and [20,21] calculate the memory addresses for arrays d_x and d_y respectively. Instruction 19 performs the load access from d_x, while instruction 22 carries out the store access to d_y. Finally, instruction 23 terminates the kernel.

As shown in the kernel invocation in line 22 of the source code, the kernel is launched with 1024x1024 threads. Each instruction is executed by all threads, which allows us to compute the expected number of instructions of each type, in order to validate it against the measurements obtained with performance counters in the next step. Therefore, we expect each instruction of the SASS code in Figure 5.1 (bottom) to be executed 1,048,576 times, leading to 25,165,824 (24 · 2^20) instructions. Those instructions break down into 1 · 2^20 data movement (MOV), 4 · 2^20 miscellaneous (S2R), 15 · 2^20 integer (XMAD, IADD, SHL, SHR), 2 · 2^20 load/store (LDG and STG), and 2 · 2^20 control flow (EXIT and BRA). Note that the final BRA acts as a safeguard following the kernel termination (EXIT).

Table 5.2: Measured/Expected values for matrix copy benchmark on the NVIDIA Jetson TX2

Event                      Expected     Measured   Discrepancy
(1) 'DMOV'                1,048,576            0    -1,048,576
(2) inst_misc             4,194,304    5,242,880     1,048,576
(3) inst_integer         15,728,640   15,728,640             0
(4) inst_compute_ld_st    2,097,152    2,097,152             0
(5) inst_control          2,097,152    1,048,576    -1,048,576
(6) Total                25,165,824   24,117,248    -1,048,576


5.2 First Validation Step

From the collected values we have detected several discrepancies in comparison to the expected

values, as shown in Table 5.2. For each instruction type we report the number of instructions

expected based on our analysis of the SASS code, those counted with the event monitors, and

the discrepancies. Note that we exclude those types for which we both expect and count zero

instructions. We extract the following conclusions:

(1) Data movement instructions, as expected, are not counted at all since there is no specific event

to count them.

(2) The number of miscellaneous instructions measured is higher than that in the SASS code. In particular, there are 4 S2R instructions in the SASS code, executed ≈1 million times each (1,048,576 threads), so we would expect ≈4 million MISC instructions to be counted. However, inst_misc reports ≈5 million MISC instructions, as if there were 1 additional MISC instruction per thread in the SASS code.

(3), (4) Integer and loads/stores are counted properly.

(5) The total number of instructions measured matches the addition of the individual types counted.

However, this number is different from the total number of expected instructions.

Hence, we need to further analyse the event counters inst_misc, 'DMOV', and Total. Conversely, since the counts we observe for inst_integer, inst_control and inst_compute_ld_st in the experiment of Figure 5.1 (shown later) are precise, we consider them reliable.

Set of Hypotheses. From these results, we formulate the following hypotheses. The inst_misc monitor counts one instruction beyond those appearing in the SASS code that are regarded as MISC according to NVIDIA's documentation [2]. We hypothesize that other instructions are also counted as MISC:

• Hypothesis 1a. The Jetson TX2 behaviour for the MISC counter is the same as in the Jetson AGX Xavier and, therefore, MOV instructions are being counted as MISC.

• Hypothesis 1b. The behaviour of the Jetson TX2 MISC counter differs from the observations and validation performed on the Jetson AGX Xavier.

Table 5.3: Measured/Expected values for matrix copy benchmark on the NVIDIA Jetson TX2 after applying the PMC correction

Event                      Expected     Measured   Discrepancy
(1) 'DMOV'                        0            0             0
(2) inst_misc             5,242,880    5,242,880             0
(3) inst_integer         15,728,640   15,728,640             0
(4) inst_compute_ld_st    2,097,152    2,097,152             0
(5) inst_control          1,048,576    1,048,576             0
(6) Total                24,117,248   24,117,248             0


Table 5.3 shows the expected, measured and discrepancy values after applying the corrections. Since the number of MISC instructions now matches, we conclude that the MISC counter of the Jetson TX2 behaves in the same way as that of the Jetson AGX Xavier, confirming hypothesis 1a. This shows that some deviations of the PMCs w.r.t. their documentation are replicated across different GPU generations, and that the validation process can be applied to a different processor family with ease.


 1  #include <stdio.h>
 2
 3  __global__ void copy(int N, float *d_x, float *d_y) {
 4      int x = blockDim.x*blockIdx.x + threadIdx.x;
 5      int y = blockDim.y*blockIdx.y + threadIdx.y;
 6      d_y[N*y + x] = d_x[N*y + x];
 7  }
 8
 9  int main(void) {
10      int N = 1024;
11      float *x, *y, *d_x, *d_y;
12      x = (float *) malloc(N*N*sizeof(float));
13      y = (float *) malloc(N*N*sizeof(float));
14      dim3 grid(32, 32);
15      dim3 block(N/32, N/32);
16      cudaMalloc(&d_x, N*N*sizeof(float));
17      cudaMalloc(&d_y, N*N*sizeof(float));
18      for (int i = 0; i < N*N; i++) {
19          x[i] = 42.0f;
20      }
21      cudaMemcpy(d_x, x, N*N*sizeof(float), cudaMemcpyHostToDevice);
22      copy<<<grid, block>>>(N, d_x, d_y);
23      cudaMemcpy(y, d_y, N*N*sizeof(float), cudaMemcpyDeviceToHost);
24      cudaFree(d_x);
25      cudaFree(d_y);
26      free(x);
27      free(y);
28  }

 1  /*0008*/ MOV R1, c[0x0][0x20];
 2  /*0010*/ S2R R0, SR_CTAID.X;
 3  /*0018*/ S2R R2, SR_TID.X;
 4  /*0028*/ S2R R3, SR_CTAID.Y;
 5  /*0030*/ S2R R4, SR_TID.Y;
 6  /*0038*/ XMAD R2, R0.reuse, c[0x0][0x8], R2;
 7  /*0048*/ XMAD.MRG R5, R0, c[0x0][0x8].H1, RZ;
 8  /*0050*/ XMAD R4, R3, c[0x0][0xc], R4;
 9  /*0058*/ XMAD.MRG R6, R3.reuse, c[0x0][0xc].H1, RZ;
10  /*0068*/ XMAD.PSL.CBCC R0, R0.H1, R5.H1, R2;
11  /*0070*/ XMAD.PSL.CBCC R2, R3.H1, R6.H1, R4;
12  /*0078*/ XMAD R0, R2.reuse, c[0x0][0x140], R0;
13  /*0088*/ XMAD.MRG R3, R2.reuse, c[0x0][0x140].H1, RZ;
14  /*0090*/ XMAD.PSL.CBCC R2, R2.H1, R3.H1, R0;
15  /*0098*/ SHL R4, R2.reuse, 0x2;
16  /*00a8*/ SHR R5, R2, 0x1e;
17  /*00b0*/ IADD R2.CC, R4.reuse, c[0x0][0x148];
18  /*00b8*/ IADD.X R3, R5, c[0x0][0x14c];
19  /*00c8*/ LDG.E R2, [R2];
20  /*00d0*/ IADD R4.CC, R4, c[0x0][0x150];
21  /*00d8*/ IADD.X R5, R5, c[0x0][0x154];
22  /*00e8*/ STG.E [R4], R2;
23  /*00f0*/ EXIT;
24  /*00f8*/ BRA 0xf8;

Figure 5.1: CUDA/SASS code of matrix copy benchmark at TX2.


Chapter 6

Assessment of EMVP on the Xilinx

Zynq Ultrascale+

As a confirmation of the generality of our approach, we also applied it to an architecture from a different vendor, the Xilinx Zynq UltraScale+, and in particular to the CPUs in its Application Processing Unit cluster (see Figure 6.1). The ARM Cortex-A53 CPUs [1] provide more than 63 events, from which we focus on a subset, selecting again the number of instructions executed as well as the number of memory-related events. A notable difference between this platform and the Xavier is the lack of event counters breaking down the arithmetic instruction categories. Instead, there are only PMCs for memory operations, branches and total instructions. On the other hand, more events are provided regarding the microarchitectural activity taking place.

It is important to note that, according to the ARM Cortex-A53 CPU technical reference manual [1] and the ARMv8-A architecture reference manual [3], the event values are not expected to be completely accurate, and the microarchitectural implementation may introduce small absolute variations in the number of events reported due to pipeline effects. For this reason, we perform our validation in rough numbers, reporting only large discrepancies whenever found.

6.1 Experiment and representative benchmark design

Table 6.1 lists the selected PMCs for validation. For the validation experiment, we use a bare-metal configuration in order to guarantee no interference from the operating system, something that was not possible on the Xavier, since the use of the GPU can only be supported by a driver within the operating system. As representative benchmark we selected the same application presented in the previous chapters and used in the validation of the NVIDIA platforms, matrix copy, which we compile with the ARM gcc compiler. For the PMC readings we directly read their values from their memory-mapped locations. We disable the hardware prefetcher in order to force a more predictable behaviour.



Figure 6.1: Diagram of the Cortex-A53 CPU cluster [1]

Table 6.1: Instruction types used in the analysis for the Xilinx UltraScale+ ARM Cortex-A53 CPUs [1].

Event             | Official Description                                              | Instruction types counted [3]
L1D_CACHE_REFILL  | Level 1 data cache refill                                         | Loads and stores missing L1
L1D_CACHE         | Level 1 data cache access                                         | Loads and stores
LD_RETIRED        | Instruction architecturally executed, condition code check pass, load  | Loads
ST_RETIRED        | Instruction architecturally executed, condition code check pass, store | Stores
INST_RETIRED      | Instruction architecturally executed                              | All instructions
MEM_ACCESSES      | Data memory access                                                | Loads and stores
L2D_CACHE         | Level 2 data cache access                                         | Loads, stores and instructions missing L1 caches
L2D_CACHE_REFILL  | Level 2 data cache refill                                         | Loads, stores and instructions missing L2
BUS_ACCESS        | Bus access                                                        | Bus accesses from loads and stores missing the last level cache

In Figure 6.2 we show the C code implementing the matrix copy benchmark, followed by its assembly form. The memory instructions of interest are shown in bold in the original, with load operations additionally in italics to ease load and store identification. In the assembly code we again distinguish different code sections. The main loop of the benchmark spans lines 4-24.

Load instructions in lines 5, 11, 16 and 21, and the store instruction in line 19, serve the purpose of loading or updating the loop index, which is located in a fixed memory location, so these instructions cause 5 cache hits (4 load hits, 1 store hit) per iteration.

The load instruction in line 9 loads the data from the source array (from[]), which may cause cache


misses when crossing cache line boundaries, and the store instruction in line 14 stores the value

loaded from the source array into the destination array (to[]), which may also cause cache misses

when crossing cache line boundaries.

Lines 20-24 perform the out-of-bounds check to determine whether more iterations are needed or the algorithm has ended. Knowing the total number of iterations and the assembly representation, we can tightly estimate the number of expected instructions and events for each of the selected PMCs.

1  #define SIZE 2*1024*1024/sizeof(int) /* 512K */
2  int from[SIZE], to[SIZE];
3
4  for (int i = 0; i < SIZE; i++) {
5      to[i] = from[i];
6  }

 1  /*3358*/ add   x0, x29, #0x400, lsl #12
 2  /*335c*/ str   wzr, [x0, #24]
 3  /*3360*/ b     33a4 <main+0xd4>
 4  /*3364*/ add   x0, x29, #0x400, lsl #12
 5  /*3368*/ ldrsw x0, [x0, #24]
 6  /*336c*/ lsl   x0, x0, #2
 7  /*3370*/ add   x1, x29, #0x200, lsl #12
 8  /*3374*/ add   x1, x1, #0x18
 9  /*3378*/ ldr   w2, [x1, x0]
10  /*337c*/ add   x0, x29, #0x400, lsl #12
11  /*3380*/ ldrsw x0, [x0, #24]
12  /*3384*/ lsl   x0, x0, #2
13  /*3388*/ add   x1, x29, #0x18
14  /*338c*/ str   w2, [x1, x0]
15  /*3390*/ add   x0, x29, #0x400, lsl #12
16  /*3394*/ ldr   w0, [x0, #24]
17  /*3398*/ add   w0, w0, #0x1
18  /*339c*/ add   x1, x29, #0x400, lsl #12
19  /*33a0*/ str   w0, [x1, #24]
20  /*33a4*/ add   x0, x29, #0x400, lsl #12
21  /*33a8*/ ldr   w1, [x0, #24]
22  /*33ac*/ mov   w0, #0x7ffff  // #524287
23  /*33b0*/ cmp   w1, w0
24  /*33b4*/ b.ls  3364 <main+0x94>
25  /*33b8*/ adrp  x0, 1e000 <__exidx_end>
26  /*33bc*/ add   x0, x0, #0x3c0
27  /*33c0*/ ldr   w2, [x0]
28  /*33c4*/ adrp  x0, 14000 <zeroes.5791+0x1c0>
29  /*33c8*/ add   x0, x0, #0x450
30  /*33cc*/ ldr   x0, [x0]
31  /*33d0*/ mov   x1, x0
32  /*33d4*/ mov   w0, w2

Figure 6.2: C/ARM Assembly code of matrix copy in the Zynq.


Table 6.2: Measured/Expected values for matrix copy

Event                   Expected     Measured   Discrepancy
(A) L1D_CACHE_REFILL         64K       65,566             0
(B) L1D_CACHE               3.5M    3,670,319             0
(C) LD_RETIRED              2.5M    2,621,612             0
(D) ST_RETIRED                1M    1,048,626             0
(E) INST_RETIRED           10.5M   11,010,313             0
(F) MEM_ACCESSES            3.5M    3,670,057             0
(G) L2D_CACHE                64K      130,772           64K
(H) L2D_CACHE_REFILL         64K       65,559             0
(I) BUS_ACCESS              352K      360,309             0

6.2 Assessment

Table 6.2 shows the values obtained from the PMCs, together with their expected values. As a first validation step, we validate the accuracy of the instruction counts that we know. We perform 512K iterations, each executing 5 load and 2 store instructions. The numbers of loads and stores are as expected (rows C and D). Likewise, the number of L1 cache accesses and the total number of memory operations match their expected values (rows B and F). The total number of instructions within the loop is 21, resulting in a total of 10.5M instructions executed (row E).

The total memory footprint of the application is 4MB, as it copies a 2MB array into a new location. Since no data is reused, every read or write accessing a new cache line is expected to generate a miss in both the L1 data and L2 caches, whereas the remaining accesses to those cache lines are expected to hit due to spatial locality. Given that our application has a sequential access pattern, that the cache line size is 64B in both the L1 data and L2 caches, and that the data type used by the application is 4B, we expect both source load and destination store instructions to produce 1 miss followed by 15 hits for each cache line. Therefore, we expect 32K read misses and 32K write misses out of a total of 512K accesses of each type, as a new line is accessed once every 16 memory accesses. Note that only one load and one store instruction may miss the caches each iteration, as explained in Section 6.1. Note also that the code is small enough to fit in the instruction cache after the first loop iteration. Thus, instructions should cause few (below 10) L2 cache misses, so L2 misses roughly correspond to data misses only. The measured misses in each cache are 64K, as expected (rows A and H).

The bus access counter (row I) counts the number of bus transactions issued, which are caused by L2 load misses, L2 store misses, or L2 dirty evictions. A total of 32K load misses and 32K store misses are expected, while only 24K dirty evictions are expected, as 512KB (a fourth) of the data stored will still remain in the L2 when the execution finishes. The total number of lines issued by the L2 to the bus is therefore 88K; however, the bus width is 16B [1], so each line is split into 4 transactions, causing a total of 352K bus accesses. As shown in row I, this counter is precise.


Finally, L2 accesses (row G) should be either counting the 32K load misses and 32K store misses

(so 64K accesses), or include also the almost 32K dirty evictions (so 96K accesses). However, it

counts 128K accesses. According to the ARM Cortex-A53 CPU technical reference manual [1],

L1 load misses sent to L2 are served through a 16B bus, whereas write operations use a 32B bus.

We have leveraged this information to assess whether these bus widths influence the number of L2 accesses counted, but no reasonable combination leads to 128K accesses. In fact, we have conducted additional experiments (e.g. read misses only), and in all cases the number of L2 accesses has been double the number of L1 data cache misses. However, we could not formulate any reasonable hypothesis to

justify this behavior. Indeed, our past work on ARM-based platforms already revealed mismatches

between event counters obtained and values expected [?].
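This kind of hypothesis testing can be illustrated with a short sketch. The candidate counts and the 128K measurement come from the text; the model names are hypothetical labels of ours, not platform terminology.

```python
# Sketch: candidate models for what the L2 access counter could be
# counting, compared against the measured value (figures from the text).
K = 1024
l1d_misses = 64 * K            # 32K load + 32K store misses reaching L2
dirty_evictions = 24 * K       # write-backs issued during the run
measured = 128 * K             # value actually reported by the counter

candidates = {
    "misses only": l1d_misses,                                  # 64K
    "misses + dirty evictions": l1d_misses + dirty_evictions,   # 88K
    "misses + all stored lines": l1d_misses + 32 * K,           # 96K
    "2x L1D misses (empirical)": 2 * l1d_misses,                # 128K
}
matching = [name for name, count in candidates.items() if count == measured]
print(matching)  # ['2x L1D misses (empirical)']
```

Only the purely empirical "twice the L1 data misses" model matches, which is exactly why no architectural justification could be found.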

Observation 3: In the absence of any evidence on the existence of additional L2 cache access

activity, we regard L2D_CACHE as unreliable for timing validation purposes.


Chapter 7

Conclusions and Future Work

7.1 Conclusions

The increasing need to adopt high-performance hardware to execute performance-demanding critical real-time software poses stringent V&V constraints on those systems. More complex and sophisticated safety-related software is used in CRTES, which can benefit from advanced SoCs. Consequently, software timing analysis, a mandatory requirement for CRTES, becomes much more complex.

PMUs are key to efficiently enabling monitoring activities such as quota monitoring and software debugging. However, existing solutions and software approaches rely on the assumption that event monitors and their documentation are fully correct and can be trusted. Therefore, the trustworthiness of event monitors is a precondition for the reliability of the critical V&V processes built atop them.

In this Thesis, first, we show that even some of the most basic event counters may fail to match

their specifications. This can result in misunderstanding or misinterpretation of the outcome of

event counters and, therefore, HEMs cannot be trusted blindly.

Next, we propose a methodology to validate the event monitors via the analysis of GPUs for the

automotive domain (NVIDIA Xavier and TX2) and multicores for the railway and avionics domains

(Xilinx Zynq UltraScale+). Our methodology allows us, in many cases, to discern what they actually count, which is the basis for building reliable processes for critical real-time systems atop them. In particular, by

performing specific empirical tests, we are able to accept or reject plausible hypotheses and collect

evidence supporting our conclusions. We show how some instructions are misclassified, some others

are counted in non-obvious ways, and some events may fully mismatch expectations. However, once

this information is obtained and verified empirically, validated event counters in complex hardware

can be used for the V&V of critical real-time systems.

Lastly, we have discussed the automated validation and tooling support and the existing challenges


in the assessment process. We have seen that, by defining representative benchmarks, tool support can be used to derive the expected values of some event monitors. We have also discussed that, while full automation of the procedure is not feasible, it is very important to establish well-defined procedures that perform the verification process exhaustively and to provide an easy review process free of ambiguity and misunderstanding.

7.2 Future work

This Thesis has only scratched the surface of the validation of performance counters; therefore, it can be extended in several ways. For example, as we discussed in Section 3.3, there are several opportunities for automating the Event Monitor Validation Process we introduced, in order to ease the inevitable manual work required from the expert analyst who has to supervise the process. In particular, some directions to follow in this respect are developing tools to derive the expected values of event monitors for a given rbe, or creating a database of rbe(s) with precomputed values from previous analyses. An interesting extension is to work on transferring this technology to industry by qualifying and commercialising the new tools to be developed.

As another future work direction, the methodology of this Thesis can be replicated to verify more performance counters on the three evaluated platforms than the ones already analysed in this work. In addition, it will be interesting to apply our methodology to more architectures considered for critical real-time systems. Some candidate platforms for such future analysis are the T2080 from NXP, the hard real-time R5 cores found in Xilinx FPGA SoCs, other embedded GPUs such as the GPU of the NVIDIA Jetson Nano SoC, which was introduced after the TX2 and the Xavier used in this work, and the latest generation of ARM GPUs found in newly released products such as the Mali G72.


Bibliography

[1] ARM Cortex-A53 MPCore Processor, 2014.

[2] CUDA Binary Utilities, 2018.

[3] ARMv8 Reference Manual v8.5 EAC, 2019.

[4] Jaume Abella, Carles Hernandez, Eduardo Quinones, Francisco J Cazorla, Philippa Ryan

Conmy, Mikel Azkarate-Askasua, Jon Perez, Enrico Mezzetti, and Tullio Vardanega. Wcet

analysis methods: Pitfalls and challenges on their trustworthiness. In 10th IEEE International

Symposium on Industrial Embedded Systems (SIES), pages 1–10. IEEE, 2015.

[5] Sergi Alcaide, Leonidas Kosmidis, Hamid Tabani, Carles Hernandez, Jaume Abella, and Fran-

cisco J Cazorla. Safety-Related Challenges and Opportunities for GPUs in the Automotive

Domain. IEEE Micro, 38(6):46–55, 2018.

[6] Tanya Amert, Nathan Otterness, Ming Yang, James H Anderson, and F Donelson Smith. GPU

scheduling on the NVIDIA TX2: Hidden details revealed. In 2017 IEEE Real-Time Systems

Symposium (RTSS), pages 104–115. IEEE, 2017.

[7] ARM. ARM Expects Vehicle Compute Performance to Increase 100x in Next Decade, 2015. https://www.arm.com/about/newsroom/arm-expects-vehicle-compute-performance-to-increase-100x-in-next-decade.php.

[8] Javier Barrera, Leonidas Kosmidis, Hamid Tabani, Enrico Mezzetti, Jaume Abella, Mikel

Fernandez, Guillem Bernat, and Francisco J Cazorla. On the reliability of hardware event

monitors in mpsocs for critical domains. In Proceedings of the 35th Annual ACM Symposium

on Applied Computing, pages 580–589, 2020.

[9] Kostiantyn Berezovskyi, Konstantinos Bletsas, and Bjorn Andersson. Makespan computa-

tion for GPU threads running on a single streaming multiprocessor. In 2012 24th Euromicro

Conference on Real-Time Systems, pages 277–286. IEEE, 2012.

[10] Kostiantyn Berezovskyi, Fabrice Guet, Luca Santinelli, Konstantinos Bletsas, and Eduardo

Tovar. Measurement-based probabilistic timing analysis for graphics processor units. In In-

ternational Conference on Architecture of Computing Systems, pages 223–236. Springer, 2016.


[11] Adam Betts and Alastair Donaldson. Estimating the WCET of GPU-accelerated applications

using hybrid analysis. In 2013 25th Euromicro Conference on Real-Time Systems, pages 193–

202. IEEE, 2013.

[12] Nicola Capodieci, Roberto Cavicchioli, Marko Bertogna, and Aingara Paramakuru. Deadline-

based scheduling for gpu with preemption support. In 2018 IEEE Real-Time Systems Sympo-

sium (RTSS), pages 119–130. IEEE, 2018.

[13] Francisco J Cazorla, Jaume Abella, Enrico Mezzetti, Carles Hernandez, Tullio Vardanega, and

Guillem Bernat. Reconciling time predictability and performance in future computing systems.

IEEE Design & Test, 35(2):48–56, 2018.

[14] CENELEC. EN50128 Railway Applications: Software for Railway Control and Protection,

2001.

[15] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W Sheaffer, Sang-Ha Lee,

and Kevin Skadron. Rodinia: A benchmark suite for heterogeneous computing. In 2009 IEEE

international symposium on workload characterization (IISWC), pages 44–54. IEEE, 2009.

[16] Shuai Che, Jeremy W Sheaffer, Michael Boyer, Lukasz G Szafaryn, Liang Wang, and Kevin

Skadron. A characterization of the rodinia benchmark suite with comparison to contemporary

cmp workloads. In IEEE International Symposium on Workload Characterization (IISWC’10),

pages 1–11. IEEE, 2010.

[17] Fabrice Cros, Leonidas Kosmidis, Franck Wartel, David Morales, Jaume Abella, Ian Broster,

and Francisco J Cazorla. Dynamic software randomisation: Lessons learned from an aerospace

case study. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017,

pages 103–108. IEEE, 2017.

[18] Dakshina Dasari, Bjorn Andersson, Vincent Nelis, Stefan M Petters, Arvind Easwaran, and

Jinkyu Lee. Response time analysis of COTS-based multicores considering the contention on

the shared memory bus. In 2011 IEEE 10th International Conference on Trust, Security and

Privacy in Computing and Communications, pages 1068–1075. IEEE, 2011.

[19] Enrique Díaz, Enrico Mezzetti, Leonidas Kosmidis, Jaume Abella, and Francisco J Cazorla. Modelling multicore contention on the AURIX TC27x. In Proceedings of the 55th Annual

Design Automation Conference, pages 1–6, 2018.

[20] Boris Dreyer, Christian Hochberger, Alexander Lange, Simon Wegener, and Alexander Weiss.

Continuous non-intrusive hybrid WCET estimation using waypoint graphs. In 16th Interna-

tional Workshop on Worst-Case Execution Time Analysis (WCET 2016). Schloss Dagstuhl-

Leibniz-Zentrum fuer Informatik, 2016.

[21] Boris Dreyer, Christian Hochberger, Simon Wegener, and Alexander Weiss. Precise continuous

non-intrusive measurement-based execution time estimation. In 15th International Workshop


on Worst-Case Execution Time Analysis (WCET 2015). Schloss Dagstuhl-Leibniz-Zentrum

fuer Informatik, 2015.

[22] Federal Aviation Administration, Certification Authorities Software Team (CAST). CAST-

32A Multi-core Processors, 2016.

[23] Gabriel Fernandez, Francisco Cazorla, and Jaume Abella. Consumer Electronics Processors

for Critical Real-Time Systems: a (Failed) Practical Experience. 2018.

[24] Gabriel Fernandez, Francisco J Cazorla, Jaume Abella, and Sylvain Girbal. Assessing Time

Predictability Features of ARM Big. LITTLE Multicores. In 2018 30th International Sympo-

sium on Computer Architecture and High Performance Computing (SBAC-PAD), pages 258–

261. IEEE, 2018.

[25] Mikel Fernandez, David Morales, Leonidas Kosmidis, Alen Bardizbanyan, Ian Broster, Car-

les Hernandez, Eduardo Quinones, Jaume Abella, Francisco Cazorla, Paulo Machado, et al.

Probabilistic timing analysis on time-randomized platforms for the space domain. In Design,

Automation & Test in Europe Conference & Exhibition (DATE), 2017, pages 738–739. IEEE,

2017.

[26] Kees Goossens, Arnaldo Azevedo, Karthik Chandrasekar, Manil Dev Gomony, Sven Goossens,

Martijn Koedam, Yonghui Li, Davit Mirzoyan, Anca Molnos, Ashkan Beyranvand Nejad, et al.

Virtual execution platforms for mixed-time-criticality systems: the compsoc architecture and

design flow. ACM SIGBED Review, 10(3):23–34, 2013.

[27] David Griffin, Benjamin Lesage, Iain Bate, Frank Soboczenski, and Robert I Davis. Forecast-

based interference: Modelling multicore interference from observable factors. In Proceedings of

the 25th International Conference on Real-Time Networks and Systems, pages 198–207, 2017.

[28] Fabrice Guet, Luca Santinelli, and Jerome Morio. Probabilistic analysis of cache memories

and cache memories impacts on multi-core embedded systems. In 2016 11th IEEE Symposium

on Industrial Embedded Systems (SIES), pages 1–10. IEEE, 2016.

[29] Intel. Intel GO, 2017.

[30] International Electrotechnical Commission. IEC 61508. Functional safety of electrical/electronic/programmable electronic safety-related systems, 2010.

[31] International Organization for Standardization. ISO/DIS 26262. Road Vehicles – Functional

Safety, 2009.

[32] Javier Jalle, Mikel Fernandez, Jaume Abella, Jan Andersson, Mathieu Patte, Luca Fossati,

Marco Zulianello, and Francisco J Cazorla. Contention-aware performance monitoring counter

support for real-time MPSoCs. In 2016 11th IEEE Symposium on Industrial Embedded Systems

(SIES), pages 1–10. IEEE, 2016.


[33] Hermann Kopetz and Gunther Bauer. The time-triggered architecture. Proceedings of the

IEEE, 91(1):112–126, 2003.

[34] Leonidas Kosmidis, Jaume Abella, Franck Wartel, Eduardo Quinones, Antoine Colin, and

Francisco J Cazorla. PUB: Path upper-bounding for measurement-based probabilistic timing

analysis. In 2014 26th Euromicro Conference on Real-Time Systems, pages 276–287. IEEE,

2014.

[35] Leonidas Kosmidis, Davide Compagnin, David Morales, Enrico Mezzetti, Eduardo Quinones,

Jaume Abella Ferrer, Tullio Vardanega, and Francisco Javier Cazorla Almeida. Measurement-

based timing analysis of the aurix caches. In 16th International Workshop on Worst-Case

Execution Time Analysis (WCET 2016), pages 9–1. Schloss Dagstuhl-Leibniz-Zentrum fur

Informatik, 2016.

[36] Leonidas Kosmidis, Cristian Maxim, Victor Jegu, Francis Vatrinet, and Francisco J Cazorla.

Industrial experiences with resource management under software randomization in ARINC653

avionics environments. In 2018 IEEE/ACM International Conference on Computer-Aided

Design (ICCAD), pages 1–7. IEEE, 2018.

[37] Lei Liu, Zehan Cui, Mingjie Xing, Yungang Bao, Mingyu Chen, and Chengyong Wu. A soft-

ware memory partition approach for eliminating bank-level interference in multicore systems.

In 2012 21st International Conference on Parallel Architectures and Compilation Techniques

(PACT), pages 367–375. IEEE, 2012.

[38] Renato Mancuso, Roman Dudko, Emiliano Betti, Marco Cesati, Marco Caccamo, and Rodolfo

Pellizzoni. Real-time cache management framework for multi-core architectures. In 2013 IEEE

19th Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 45–54.

IEEE, 2013.

[39] MarketsandMarkets. Embedded System Market Worth 116.2 Billion Dollars by 2025 - Exclusive Report by MarketsandMarkets, 2020. https://www.bloomberg.com/press-releases/2020-03-17/embedded-system-market-worth-116-2-billion-by-2025-exclusive-report-by-marketsandmarkets.

[40] Fabio Mazzocchetti, Pedro Benedicte, Hamid Tabani, Leonidas Kosmidis, Jaume Abella, and

Francisco J Cazorla. Performance analysis and optimization of automotive gpus. In 2019

31st International Symposium on Computer Architecture and High Performance Computing

(SBAC-PAD), pages 96–103. IEEE, 2019.

[41] Enrico Mezzetti, Jaume Abella, Carles Hernandez, and Francisco J Cazorla. Work-in-Progress

paper: An Analysis of the Impact of Dependencies on Probabilistic Timing Analysis and Task

Scheduling. In 2017 IEEE Real-Time Systems Symposium (RTSS), pages 357–359. IEEE, 2017.


[42] Enrico Mezzetti, Leonidas Kosmidis, Jaume Abella, and Francisco J Cazorla. High-integrity

performance monitoring units in automotive chips for reliable timing V&V. IEEE Micro,

38(1):56–65, 2018.

[43] Suzana Milutinovic, Jaume Abella, Enrico Mezzetti, and Francisco J Cazorla. Measurement-

based cache representativeness on multipath programs. In Proceedings of the 55th Annual

Design Automation Conference, pages 1–6, 2018.

[44] Suzana Milutinovic, Jaume Abella Ferrer, and Francisco Javier Cazorla Almeida. Validating

the Reliability of WCET Estimates with MBPTA. In Book of abstracts, pages 68–70. Barcelona

Supercomputing Center, 2015.

[45] Jan Nowotsch, Michael Paulitsch, Daniel Buhler, Henrik Theiling, Simon Wegener, and

Michael Schmidt. Multi-core interference-sensitive WCET analysis leveraging runtime resource

capacity enforcement. In 2014 26th Euromicro Conference on Real-Time Systems, pages 109–

118. IEEE, 2014.

[46] Nvidia. Nvidia Jetson AGX Xavier, 2017. https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-agx-xavier/.

[47] Nvidia. Nvidia Jetson TX2, 2017. https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-tx2/.

[48] NVIDIA. CUDA 10.0 toolkit documentation., 2018.

[49] NVIDIA. CUDA 9.2 toolkit documentation., 2018.

[50] NXP. Chip Errata for the i.MX 6SLL, 2017. https://www.nxp.com/docs/en/errata/IMX6SLLCE.pdf.

[51] Nathan Otterness, Ming Yang, Sarah Rust, Eunbyung Park, James H Anderson, F Donelson

Smith, Alex Berg, and Shige Wang. An evaluation of the NVIDIA TX1 for supporting real-

time computer-vision workloads. In 2017 IEEE Real-Time and Embedded Technology and

Applications Symposium (RTAS), pages 353–364. IEEE, 2017.

[52] Xing Pan and Frank Mueller. Controller-aware memory coloring for multicore real-time sys-

tems. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing, pages

584–592, 2018.

[53] Milos Panic, Eduardo Quinones, Pavel G Zavkov, Carles Hernandez, Jaume Abella, and Fran-

cisco J Cazorla. Parallel many-core avionics systems. In 2014 International Conference on

Embedded Software (EMSOFT), pages 1–10. IEEE, 2014.

[54] Rodolfo Pellizzoni, Andreas Schranzhofer, Jian-Jia Chen, Marco Caccamo, and Lothar Thiele.

Worst case delay analysis for memory interference in multicore systems. In 2010 Design,


Automation & Test in Europe Conference & Exhibition (DATE 2010), pages 741–746. IEEE,

2010.

[55] Qualcomm. Qualcomm Snapdragon 820, 2017. https://www.qualcomm.com/products/snapdragon-820-mobile-platform.

[56] Jan Reineke. Challenges for Timing Analysis of Multi-Core Architectures. Workshop on

Foundational and Practical Aspects of Resource Analysis, 2017. Invited Talk.

[57] Renesas. Renesas R-Car H3, 2017. https://www.renesas.com/us/en/solutions/automotive/soc/r-car-h3.html.

[58] RTCA and EUROCAE. DO-178C / ED-12C, Software Considerations in Airborne Systems

and Equipment Certification, 2011.

[59] Martin Schoeberl, Florian Brandner, Jens Sparsø, and Evangelia Kasapaki. A statically sched-

uled time-division-multiplexed network-on-chip for real-time systems. In 2012 IEEE/ACM

Sixth International Symposium on Networks-on-Chip, pages 152–160. IEEE, 2012.

[60] Jens Sparsø. Design of networks-on-chip for real-time multi-processor systems-on-chip. In 2012

12th International Conference on Application of Concurrency to System Design, pages 1–5.

IEEE, 2012.

[61] Noriaki Suzuki, Hyoseung Kim, Dionisio De Niz, Bjorn Andersson, Lutz Wrage, Mark Klein,

and Ragunathan Rajkumar. Coordinated bank and cache coloring for temporal protection of

memory accesses. In 2013 IEEE 16th International Conference on Computational Science and

Engineering, pages 685–692. IEEE, 2013.

[62] Hamid Tabani, Leonidas Kosmidis, Jaume Abella, Francisco J Cazorla, and Guillem Bernat.

Assessing the adherence of an industrial autonomous driving framework to iso 26262 software

guidelines. In 2019 56th ACM/IEEE Design Automation Conference (DAC), pages 1–6. IEEE,

2019.

[63] Prathap Kumar Valsan, Heechul Yun, and Farzad Farshchi. Taming non-blocking caches to

improve isolation in multicore real-time systems. In 2016 IEEE Real-Time and Embedded

Technology and Applications Symposium (RTAS), pages 1–12. IEEE, 2016.

[64] Franck Wartel, Leonidas Kosmidis, Adriana Gogonel, Andrea Baldovin, Zoe Stephenson, Benoit Triquet, Eduardo Quinones, Code Lo, Enrico Mezzetti, Ian Broster, et al. Timing

analysis of an avionics case study on complex hardware/software platforms. In 2015 Design,

Automation & Test in Europe Conference & Exhibition (DATE), pages 397–402. IEEE, 2015.

[65] Franck Wartel, Leonidas Kosmidis, Code Lo, Benoit Triquet, Eduardo Quinones, Jaume

Abella, Adriana Gogonel, Andrea Baldovin, Enrico Mezzetti, Liliana Cucu, et al.


Measurement-based probabilistic timing analysis: Lessons from an integrated-modular avion-

ics case study. In 2013 8th IEEE International Symposium on Industrial Embedded Systems

(SIES), pages 241–248. IEEE, 2013.

[66] Vincent M Weaver, Dan Terpstra, and Shirley Moore. Non-determinism and overcount on

modern hardware performance counter implementations. In 2013 IEEE International Sym-

posium on Performance Analysis of Systems and Software (ISPASS), pages 215–224. IEEE,

2013.

[67] Adam West. NASA Study on Flight Software Complexity. Final Report. Technical report,

NASA, 2009.

[68] Reinhard Wilhelm and Jan Reineke. Embedded systems: Many cores—Many problems. In 7th

IEEE International Symposium on Industrial Embedded Systems (SIES’12), pages 176–180.

IEEE, 2012.

[69] Xilinx. Rockwell Collins Uses Zynq UltraScale+ RFSoC Devices in Revolutionizing How Arrays

are Produced and Fielded: Powered by Xilinx, 2019.

[70] Ming Yang, Nathan Otterness, Tanya Amert, Joshua Bakita, James H Anderson, and F Donel-

son Smith. Avoiding pitfalls when using nvidia gpus for real-time tasks in autonomous sys-

tems. In 30th Euromicro Conference on Real-Time Systems (ECRTS 2018). Schloss Dagstuhl-

Leibniz-Zentrum fuer Informatik, 2018.

[71] Heechul Yun, Renato Mancuso, Zheng-Pei Wu, and Rodolfo Pellizzoni. PALLOC: DRAM

bank-aware memory allocator for performance isolation on multicore platforms. In 2014 IEEE

19th Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 155–

166. IEEE, 2014.

[72] Heechul Yun, Gang Yao, Rodolfo Pellizzoni, Marco Caccamo, and Lui Sha. Memguard: Mem-

ory bandwidth reservation system for efficient performance isolation in multi-core platforms. In

2013 IEEE 19th Real-Time and Embedded Technology and Applications Symposium (RTAS),

pages 55–64. IEEE, 2013.

[73] Dmitrijs Zaparanuks, Milan Jovic, and Matthias Hauswirth. Accuracy of performance counter

measurements. In 2009 IEEE International Symposium on Performance Analysis of Systems

and Software, pages 23–32. IEEE, 2009.

[74] Marco Ziccardi, Enrico Mezzetti, Tullio Vardanega, Jaume Abella, and Francisco Javier Ca-

zorla. Epc: extended path coverage for measurement-based probabilistic timing analysis. In

2015 IEEE Real-Time Systems Symposium, pages 338–349. IEEE, 2015.
