67
Timing Analysis and Timing Predictability Reinhard Wilhelm Saarland University http://www.predator-project.eu/

Timing Analysis and Timing Predictability Reinhard Wilhelm Saarland University

Embed Size (px)

Citation preview

Timing Analysis and Timing PredictabilityReinhard Wilhelm

Saarland University

http://www.predator-project.eu/

- 2 -

Deriving Run-Time Guarantees for Hard Real-Time Systems

Given:1. a software to produce a reaction, 2. a hardware platform, on which to

execute the software,3. a required reaction time.Derive: a guarantee for timeliness.

- 3 -

Timing Analysis

• Sound methods that determine upper bounds for all execution times,

• can be seen as the search for a longest path, – through different types of graphs,– through a huge space of paths.

I will show 1. how this huge state space originates,2. how and how far we can cope with this huge

state space,3. what synchronous languages contribute.

- 4 -

Architecture

(constant execution

times)

Timing Analysis – the Search Space

• all control-flow paths (through the binary executable) – depending on the possible inputs.

• Feasible as search for a longest path if –Iteration and recursion are bounded,–Execution time of instructions are (positive) constants.

• Elegant method: Timing Schemata (Shaw 89) – inductive calculation of upper bounds.

Software

Input

ub (if b then S1 else S2) := ub (b) + max (ub (S1), ub (S2))

- 5 -

High-Performance Microprosessors

• increase (average-case) performance by using: Caches, Pipelines, Branch Prediction, Speculation

• These features make timing analysis difficult:Execution times of instructions vary widely– Best case - everything goes smoothly: no cache miss,

operands ready, resources free, branch correctly predicted

– Worst case - everything goes wrong: all loads miss the cache, resources are occupied, operands not ready

– Span may be several hundred cycles

- 6 -

Variability of Execution TimesLOAD r2, _a

LOAD r1, _b

ADD r3,r2,r1

0

50

100

150

200

250

300

350

Best Case Worst Case

Execution Time (Clock Cycles)

Clock Cycles

PPC 755

x = a + b;

In most cases, executionwill be fast.So, assuming the worst caseis safe, but very pessimistic!

- 7 -

AbsInt‘s WCET Analyzer aiT

IST Project DAEDALUS final review report:

"The AbsInt tool is probably thebest of its kind in the world and it is justified to consider this result as a breakthrough.”

Several time-critical subsystems of the Airbus A380 have been certified using aiT;aiT is the only validated tool for these applications.

- 8 -

Tremendous Progressduring the past 14 Years

1995 2002 2005

over

-est

imat

ion

20-30%15%

30-50%

4

25

60

200ca

che-

mis

s pe

nalt

y

Lim et al. Thesing et al. Souyris et al.

The explosion of penalties has been compensated by the improvement of the analyses!

10%

25%

- 9 -

State-dependent Execution Times

• Execution time depend on the execution state.

• Execution state results from the execution history.

semantics state:values of variables

execution state:occupancy of resources

state

- 10 -

Architecture

Timing Analysis – the Search Spacewith State-dependent Execution Times

• all control-flow paths – depending on the possible inputs

• all paths through the architecture for potential initial states

Software

Input

initialstate

mul rD, rA, rB execution states for paths reaching this program point

instructionin I-cache

instructionnot in I-cache

1

bus occupied

bus not occupied

small operands

large operands

1

4

≥ 40

- 11 -

Architecture

Timing Analysis – the Search Spacewith out-of-order execution

• all control-flow paths – depending on the possible inputs

• all paths through the architecture for potential initial states

• including different schedules for instruction sequences

Software

Input

initialstate

- 12 -

Architecture

Timing Analysis – the Search Spacewith multi-threading

• all control-flow paths – depending on the possible inputs

• all paths through the architecture for potential initial states

• including different schedules for instruction sequences

• including different interleavings of accesses to shared resources

Software

Input

initialstate

- 13 -

Why Exhaustive Exploration?

• Naive attempt: follow local worst-case transitions only

• Unsound in the presence of Timing Anomalies:A path starting with a local worst case may have a lower overall execution time,Ex.: a cache miss preventing a branch mis-prediction

• Caused by the interference between processor components: Ex.: cache hit/miss influences branch prediction; branch prediction causes prefetching; prefetching pollutes the I-cache.

- 14 -

State Space Explosion in Timing Analysis

constantexecutiontimes

constantexecutiontimes

state-dependentexecution timesstate-dependentexecution times

out-of-orderexecutionout-of-orderexecution

preemptiveschedulingpreemptivescheduling

concurrency +shared resourcesconcurrency +shared resources

years +methods

~1995 ~2000 2010

Timing schemataStatic analysis ???

Caches, pipelines,

speculation:

combined cache and

pipeline analysis

Superscalar processors:

interleavings

of all schedules

Multi-core with

shared resources:

interleavings

of several threads

- 15 -

Coping with the Huge Search Space

statespace

+ compact representation+ efficient update-loss in precision -over-estimation

Timing anomalies

abstraction no abstraction

-exhaustive exploration of full program - intolerable effort

-exhaustive exploration of basic blocks + tolerable effort

Splitting into•Local analysis on basic blocks•Global bounds analysis

For caches:O(2 cache capacity)

For caches:O(cache capacity)

- 16 -

Abstraction-Induced Imprecision

- 17 -

NotionsDeterminism: allows the prediction of the future behavior given the actual state

and knowing the future inputsState is split into • the semantics state, i.e. a point in the program + a mapping of variables to

values, and • the execution state, i.e. the allocation of variables to resources and the

occupancy of resourcesTiming Repeatability: • same execution time for all inputs to and initial execution states of a program• allows the prediction of the future timing behavior given the PC and the control-

flow context, without knowing the future inputsPredictability of a derived property of program behavior:• expresses how this property of all behaviors of a program independent of inputs

and initial execution state can be bounded, • bounds are typically derived from invariants over the set of all execution traces• P. expresses how this property of all future behaviors can be bounded given an

invariant about a set of potential states at a program point without knowing future inputs

- 18 -

Some Observations/Thoughts

Repeatability • doesn’t make sense for behaviors,• it may make sense for derived properties, e.g. time,

space and energy consumptionPredictability • concerns properties of program behaviors,• the set of all behaviors (collecting semantics) is

typically not computable• abstraction is applied

– to soundly approximate program behavior and – to make their determination (efficiently) computable

- 19 -

Approaches• Excluding arch.-state-dependent variability

by design -> • Excluding state- and input-dependent

variability by design -> repeatability• Bounding the variability, but ignoring

invariants about the set of possible states, i.e. assuming arch.-state-independent worst cases for each “transition”

• Using invariants about sets of possible arch. states to bound future behaviors

• Designing architecture as to support analyzability• independent of the applications or• for a given set of applications

PRET with PRET programming

MERASA

PRET architecturew/o PRET

programming

Predator

Predatorwith

PROMPT

- 20 -

Time Predictability

•There are analysis-independent notions, • predictability as an inherent system

property,e.g. Reineke et al. for caches, Grund et al. 2009i.e., predictable by an optimal analysis method

•There are analysis-method dependent notions, • predictable by a certain analysis method. • In case of static analysis by abstract

interpretation: designing a timing analysis means essentially designing abstract domains.

•To achieve predictability is not difficult, not to loose much performance at the same time is!

- 21 -

Tool Architecture

Abstract Interpretations

Abstract InterpretationInteger LinearProgramming

combined cache and pipeline

analysis

determines enclosing intervals

for the values in registers and local

variables

determines loop bounds determines

infeasible paths

derives invariants about architectural execution states,

computes bounds on execution times of basic blocks

determines a worst-case path and an

upper bound

- 22 -

Timing Accidents and Penalties

Timing Accident – cause for an increase of the execution time of an instruction

Timing Penalty – the associated increase

• Types of timing accidents– Cache misses– Pipeline stalls– Branch mispredictions– Bus collisions– Memory refresh of DRAM– TLB miss

- 23 -

Our Approach

• Static Analysis of Programs for their behavior on the execution platform

• computes invariants about the set of all potential execution states at all program points,

• the execution states result from the execution history,

• static analysis explores all execution histories

semantics state:values of variables

execution state:occupancy of resources

state

- 24 -

Deriving Run-Time Guarantees

• Our method and tool derives Safety Properties from these invariants : Certain timing accidents will never happen.Example: At program point p, instruction fetch will never cause a cache miss.

• The more accidents excluded, the lower the upper bound.

Murphy’sinvariant

Fastest Variance of execution times Slowest

- 25 -

Architectural Complexity impliesAnalysis Complexity

Every hardware component whose state has an influence on the timing behavior

• must be conservatively modeled, • may contribute a multiplicative factor to the

size of the search space• Exception: Caches

– some have good abstractions providing for highly precise analyses (LRU), cf. Diss. of J. Reineke

– some have abstractions with compact representations, but not so precise analyses

- 26 -

Abstraction and Decomposition

Components with domains of states C1, C2, … ,

Ck

Analysis has to track domain C1 C2 … Ck

Start with the powerset domain 2 C1

C2

… Ck

Find an abstract domain C1#

transform into C1# 2 C

2 … C

k

This has worked for caches and cache-like devices.

Find abstractions C11# and C12

# factor out C11

# and transform rest into 2 C

12# … C

kThis has worked for the arithmeticof the pipeline.

C11# program program with

annotations2 C

12# … C

k

value analysis

microarchitecturalanalysis

- 28 -

Complexity Issues for Predictability by Abstract Interpretation

Independent-attribute analysis

• Feasible for domains with no dependences or tolerable loss in precision

• Examples: value analysis, cache analysis

• Efficient!

Relational analysis

• Necessary for mutually dependent domains

• Examples: pipeline analysis

• Highly complex

Other parameters: Structure of the underlying domain, e.g. height of lattice;Determines speed of convergence of fixed-point iteration.

- 29 -

My Intuition

• Current high-performance processors have cyclic dependences between components

• Statically analyzing components in isolation (independent-attribute method) looses too much precision.

• Goals: – cutting the cycle without loosing too much performance,– designing architectural components with compact abstract

domains,– avoiding interference on shared resources in multi-core

architectures (as far as possible).

• R.Wilhelm et al.: Memory Hierarchies, Pipelines, and Buses for Future Architectures in Time-critical Embedded Systems, IEEE TCAD, July 2009

- 30 -

Caches: Small & Fast Memory on Chip

• Bridge speed gap between CPU and RAM• Caches work well in the average case:

– Programs access data locally (many hits)– Programs reuse items (instructions, data)– Access patterns are distributed evenly across the

cache

• Cache performance has a strong influence on system performance!

• The precision of cache analysis has a strong influence on the degree of over-estimation!

- 31 -

Caches: How they work

CPU: read/write at memory address a, – sends a request for a to bus

Cases:• Hit:

– Block m containing a in the cache: request served in the next cycle

• Miss:– Block m not in the cache:

m is transferred from main memory to the cache, m may replace some block in the cache,request for a is served asap while transfer still continues

• Replacement strategy: LRU, PLRU, FIFO,...determine which line to replace in a full cache (set)

a

m

- 32 -

Cache Analysis

How to statically precompute cache contents:

• Must Analysis:For each program point (and context), find out which blocks are in the cache prediction of cache hits

• May Analysis: For each program point (and context), find out which blocks may be in the cacheComplement says what is not in the cache prediction of cache misses

• In the following, we consider must analysis until otherwise stated.

- 33 -

(Must) Cache Analysis

• Consider one instruction in the program.

• There may be many paths leading to this instruction.

• How can we compute whether a will always be in cache independently of which path execution takes?

load a

Question: Is the access to a

always a cache hit?

- 34 -

Determine LRU-Cache-Information(abstract cache states) at each Program Point

{a, b}{x}

youngest age - 0

oldest age - 3

Interpretation of this cache information:describes the set of all concrete cache states in which x, a, and b occur • x with an age not older than 1• a and b with an age not older than 2,

Cache information contains 1. only memory blocks guaranteed to be in cache.2. they are associated with their maximal age.

- 35 -

Cache- Information

Cache analysis determines safe information about Cache Hits.Each predicted Cache Hit reduces the upper bound by the cache-miss penalty.

load a {a, b}

Computedcache information

Access to a is a cache hit;assume 1 cycle access time.

{x}

- 36 -

Cache Analysis – how does it work?

• How to compute for each program point an abstract cache state representing a set of memory blocks guaranteed to be in cache each time execution reaches this program point?

• Can we expect to compute the largest set?• Trade-off between precision and efficiency –

quite typical for abstract interpretation

- 37 -

(Must) Cache analysis of a memory access with LRU replacement

{a, b}{x}

access to a

{b, x}

{a}

After the access to a, a is the youngest memory block in cache, and we must assume that x has aged.

ba

access to a

b

a

x

y

y

x

concretetransferfunction(cache)

abstracttransferfunction(analysis)

- 38 -

Combining Cache Information• Consider two control-flow paths to a program point with sets S1

and S2 of memory blocks in cache,.– Cache analysis should not predict more than S1 S2 after the

merge of paths.

– elements in the intersection with their maximal age from S1 and S2.

• Suggests the following method: Compute cache information along all paths to a program point and calculate their intersection – but too many paths!

• More efficient method: – combine cache information on the fly,

– iterate until least fixpoint is reached.

• There is a risk of losing precision, but not in case of distributive transfer functions.

- 39 -

What happens when control-paths merge?

{ a }{ }

{ c, f }{ d }

{ c }{ e }{ a }{ d }

{ }{ }

{ a, c }{ d }

“intersection + maximal age”

We canguaranteethis content

on this path.

We canguarantee

this content

on this path.

Which contentcan we

guaranteeon this path?

combine cache information at each control-flow merge point

- 40 -

Predictability of Caches - Speed of Recovery from Uncertainty -

read y

mul x, y

read x

write z

1. Initial cache contents?2. Need to combine information3. Cannot resolve address of x...4. Imprecise analysis domain/ update functions

Need to recover information: Predictability = Speed of Recovery

J. Reineke et al.: Predictability of Cache Replacement Policies, Real-Time Systems, Springer, 2007

- 41 -

Metrics of Predictability:...

......

[f,e,d]

[f,e,c]

[f,d,c]

[h,g,f]

fillevict

Seq: a b c d e f g h

Two Variants:M = Misses OnlyHM

evict & fill

- 42 -

Results: tight bounds

Generic examples prove tightness.

- 43 -

The Influence of the Replacement Strategy

Information gainthrough access to mLRU:

m

m

m

m

+ aging ofprefix of unknown length of the cache contents

FIFO:

m

m

m

m

m cacheat least k-1 youngeststill in cache

- 44 -

- 45 -

Pipelines

Ideal Case: 1 Instruction per Cycle

Fetch

Decode

Execute

WB

Fetch

Decode

Execute

WB

Inst 1 Inst 2 Inst 3 Inst 4

Fetch

Decode

Execute

WB

Fetch

Decode

Execute

WB

Fetch

Decode

Execute

WB

- 46 -

CPU as a (Concrete) State Machine

• Processor (pipeline, cache, memory, inputs) viewed as a big state machine, performing transitions every clock cycle

• Starting in an initial state for an instruction, transitions are performed, until a final state is reached:– End state: instruction has left the pipeline– # transitions: execution time of instruction

- 47 -

Pipeline Analysis

• simulates the concrete pipeline on abstract states

• counts the number of steps until an instruction retires

• non-determinism resulting from abstraction and timing anomalies require exhaustive exploration of paths

- 48 -

Integrated Analysis: Overall Picture

Basic Block

s1

s10

s2s3

s11 s12

s1

s13

Fixed point iteration over Basic Blocks (in context) {s1, s2, s3} abstract state

move.1 (A0,D0),D1

Cyclewise evolution of processor modelfor instruction

s1 s2 s3

- 49 -

Implementation

• Abstract model is implemented as a DFA• Instructions are the nodes in the CFG• Domain is powerset of set of abstract states• Transfer functions at the edges in the CFG

iterate cycle-wise updating each state in the current abstract value

• max{# iterations for all states} gives bound • From this, we can obtain bounds for basic

blocks

- 50 -

Classification of Pipelined Architectures

• Fully timing compositional architectures:– no timing anomalies.– analysis can safely follow local worst-case paths

only, – example: ARM7.

• Compositional architectures with constant-bounded effects: – exhibit timing anomalies, but no domino effects,– example: Infineon TriCore

• Non-compositional architectures: – exhibit domino effects and timing anomalies.– timing analysis always has to follow all paths,– example: PowerPC 755

- 51 -

Extended the Predictability Notion

• The cache-predictability concept applies to all cache-like architecture components:

• TLBs, BTBs, other history mechanisms• It does not cover the whole

architectural domain.

- 52 -

The Predictability Notion

Unpredictability• is an inherent system property• limits the obtainable precision of static predictions

about dynamic system behaviorDigital hardware behaves deterministically (ignoring

defects, thermal effects etc.)• Transition is fully determined by current state and

input• We model hardware as a (hierarchically structured,

sequentially and concurrently composed) finite state machine

• Software and inputs induce possible (hardware) component inputs

- 53 -

Uncertainties About State and Input

• If initial system state and input were known only one execution (time) were possible.

• To be safe, static analysis must take into account all possible initial states and inputs.

• Uncertainty about state implies a set of starting states and different transition paths in the architecture.

• Uncertainty about program input implies possibly different program control flow.

• Overall result: possibly different execution times

- 54 -

Source and Manifestation of Unpredictability

• “Outer view” of the problem: Unpredictability manifests itself in the variance of execution time

• Shortest and longest paths through the automaton are the BCET and WCET

• “Inner view” of the problem: Where does the variance come from?

• For this, one has to look into the structure of the finite automata

- 55 -

Connection Between Automata and Uncertainty

• Uncertainty about state and input are qualitatively different:

• State uncertainty shows up at the “beginning” number of possible initial starting states the automaton may be in.

• States of automaton with high in-degree lose this initial uncertainty.

• Input uncertainty shows up while “running the automaton”.

• Nodes of automaton with high out-degree introduce uncertainty.

- 56 -

State Predictability – the Outer View

Let T(i;s) be the execution time with component input istarting in hardware component state s.

The range is in [0::1], 1 means perfectly timing-predictable

The smaller the set of states, the smaller the variance and the larger the predictability.

The smaller the set of component inputs to consider, the larger the predictability.

- 58 -

The Main Culprit: Interference on Shared Resources

• They come in many flavors:– instructions interfere on the caches,– bus masters interfere on the bus,– several threads interfere on shared caches, shared

memory.

• some directly cause variability of execution times, e.g. different bus access times in case of collision,

• some allow for different interleavings of control or architectural flow resulting in different execution states and subsequently different timings.

- 59 -

Dealing with Shared Resources

Alternatives:• Avoiding them,• Bounding their effects on timing

variability

- 60 -

Embedded Systems go Multicore

• Multicore systems are expected to solve all problems: performance, energy consumption, etc.

• Current designs share resources.• Recent experiment at Thales:

– running the same application on 2 cores, code and data duplicated 50% loss compared to execution on one core

– running two different applications on 2 cores 30% loss – Reason: interference on shared resources– Worst-case performance loss will be larger due to

uncertainty about the interferences!

- 61 -

Principles for the PROMPT Architecture and Design Process

• No shared resources where not needed for performance,

• Harmonious integration of applications: not introducing interferences on shared resources not existing in the applications.

- 62 -

Steps of the Design Process

1. Hierarchical privatization– decomposition of the set of applications according to the

sharing relation on the global state– allocation of private resources for non-shared code and

state– allocation of the shared global state to non-cached

memory, e.g. scratchpad,– sound (and precise) determination of delays for accesses

to the shared global state

2. Sharing of lonely resources – seldom accessed resources, e.g. I/O devices

3. Controlled socialization• introduction of sharing to reduce costs• controlling loss of predictability

- 63 -

Sharing of Lonely Resources

• Costly lonely resources will be shared.• Accesses rate is low compared to CPU

and memory bandwidth.• The access delay contributes little to the

overall execution time because accesses happen infrequently.

- 65 -

PROMPT Design Principles for Predictable Systems

• reduce interference on shared resources in architecture design

• avoid introduction of interferences in mapping application to target architecture

Applied to Predictable Multi-Core Systems• Private resources for non-shared components

of applications• Deterministic regime for the access to

shared resources

- 66 -

Comments on the Project ProposalPrecision-Timed Synchronous Reactive Processing

“Traditional” WCET analysis of MBD systems•analyzes binaries, which are generated from C code, which is generated from models•good precision due to disciplined code•precision is increased by synergetic integration of MDB code generator, compiler, and WCET toolSeparation of WCET analysis and WCRT analysis •multiplies the bounds on the number of synchronous cycles with the bound on the tick durations, •these two upper bounds do not necessarily happen together danger of loosing precision

- 67 -Comments on the Project ProposalPrecision-Timed Synchronous Reactive

Processing

• Taking the compiler into the boat is a good idea• increase of precision and efficiency of WCET

analysis by WCET-aware code generation• Operating modes should be supported to gain

precision

- 68 -

Conclusion

• Timing analysis for single tasks running on single-processor systems solved.

• Determination of context-switch costs for preemptive execution almost solved.

• TA for concurrent threads on multi-core platforms with shared resources not solved.

• PROMPT is a rather radical approach,• requires new design and fabrication process.• Reconciling Predictability with Performance still

an interesting research problem.

- 69 -Some Relevant Publications from my Group

• C. Ferdinand et al.: Cache Behavior Prediction by Abstract Interpretation. Science of Computer Programming 35(2): 163-189 (1999)

• C. Ferdinand et al.: Reliable and Precise WCET Determination of a Real-Life Processor, EMSOFT 2001

• R. Heckmann et al.: The Influence of Processor Architecture on the Design and the Results of WCET Tools, IEEE Proc. on Real-Time Systems, July 2003

• St. Thesing et al.: An Abstract Interpretation-based Timing Validation of Hard Real-Time Avionics Software, IPDS 2003

• L. Thiele, R. Wilhelm: Design for Timing Predictability, Real-Time Systems, Dec. 2004

• R. Wilhelm: Determination of Execution Time Bounds, Embedded Systems Handbook, CRC Press, 2005

• St. Thesing: Modeling a System Controller for Timing Analysis, EMSOFT 2006• J. Reineke et al.: Predictability of Cache Replacement Policies, Real-Time Systems,

Springer, 2007• R. Wilhelm et al.:The Determination of Worst-Case Execution Times - Overview of

the Methods and Survey of Tools. ACM Transactions on Embedded Computing Systems (TECS) 7(3), 2008.

• R.Wilhelm et al.: Memory Hierarchies, Pipelines, and Buses for Future Architectures in Time-critical Embedded Systems, IEEE TCAD, July 2009

• R. Wilhelm et al.: Designing Predictable Multicore Architectures for Avionics and Automotive Systems, RePP Workshop, Grenoble, Oct. 2009

• D. Grund, J. Reineke, and R. Wilhelm: A Template for Predictability Definitions with• Supporting Evidence, PPES Workshop, Grenoble 2011

- 70 -

Some other Publications dealing with Predictability

• R. Pellizzoni, M. Caccamo: Toward the Predictable Integration of Real-Time COTS Based Systems. RTSS 2007: 73-82

• J. Rosen, A. Andrei, P. Eles, and Z. Peng:Bus access optimization for predictable implementation of real-time applications on multiprocessor systems-on-chip, RTSS 2007

• B. Lickly, I. Liu, S. Kim, H. D. Patel, S. A. Edwards and E. A. Lee: Predictable Programming on a Precision Timed Architecture, CASES 2008

• M. Schoeberl, A Java processor architecture for embedded real-time systems, Journal of Systems Architecture, 54/1--2:265--286, 2008

• M. Paolieri et al.: Hardware Support for WCET Analysis of Hard Real-Time Multicore Systems, ISCA 2009

• A. Hansson et al.: CompSoC: A Template for Composable and Predictable Multi-Processor System on Chips, ACM Trans. Des. Autom. Electr. Systems, 2009