8/3/2019 unit 6 -ST
1/14
SOFTWARE TESTING
DEPENDENCE AND DATA FLOW
MODELS
Objective
Understand the basics of data-flow models and the related concepts (def-use pairs, dominators)
Understand some analyses that can be performed with the data-flow model of a program
The data flow analyses that build the models
Analyses that use the data flow models
Understand the basic trade-offs in modeling data flow: variations of data-flow models and analyses differ in precision and cost
Data Flow
An abstract representation of the sequence and possible changes of state of data objects, where
the state of an object is any of creation, usage, modification or destruction.
Data Flow graph
A picture of the system as a network of functional processes connected to each other by
pipelines of data (Data Flows) and holding tanks of data (Data Stores).
The processes transform the data.
The data flow graph partitions the system into its component parts. The data flow structure
follows the trail of a data item as it is accessed and modified by the code.
Why data flow models?
The models seen so far emphasize control, e.g., control flow graphs, call graphs, and finite state machines.
We also need to reason about dependence
Where does this value of x come from?
What would be affected by changing this?
Many program analyses and test design techniques use data flow information, often in combination with control flow. Example: taint analysis to prevent SQL injection attacks
Example: Dataflow test criteria
8/3/2019 unit 6 -ST
3/14
Definition- Use
A definition-use (or def-use) pair is the most basic data flow relationship: a value was defined
(produced or stored) at one point and used (referenced or retrieved from a variable) at another
point.
A def-use (du) pair associates a point in a program where a value is produced with a point
where it is used
Definition: where a variable gets a value
Variable declaration (often the special value uninitialized)
Variable initialization
Assignment
Values received by a parameter
Use: extraction of a value from a variable
Expressions
Conditional statements
Parameter passing
Returns
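As a minimal illustration of the definition and use kinds listed above, consider this hypothetical method (not from the original slides; the names are invented), annotated with each def and use:

```java
public class DuExample {
    public static int scale(int a) {  // def a (value received by a parameter)
        int r = 0;                    // def r (variable initialization)
        if (a > 10) {                 // use a (conditional statement)
            r = a * 2;                // def r; use a (expression + assignment)
        }
        return r;                     // use r (return)
    }

    public static void main(String[] args) {
        System.out.println(scale(20)); // exercises the def of r inside the if
        System.out.println(scale(5));  // exercises the initial def of r
    }
}
```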
Definition-Clear or Killing
When x gets a new value in line B, we say the earlier definition at line A is killed. This means
a def-use path with A as the definition cannot pass through B.
Dominator
Pre-dominators in a rooted, directed graph can be used to make this intuitive notion of
controlling decision precise.
Node M dominates node N if every path from the root to N passes through M.
A node will typically have many dominators, but except for the root, there is a unique
immediate dominator of node N which is closest to N on any path from the root, and which is
in turn dominated by all the other dominators of N.
Because each node (except the root) has a unique immediate dominator, the immediate
dominator relation forms a tree.
Post-dominators: Calculated in the reverse of the control flow graph, using a special exit node
as the root.
Data flow analysis
The algorithms used to compute data flow information. These are basic algorithms used widely
in compilers, test and analysis tools, and other software tools.
It is worthwhile to understand basic data flow analysis algorithms, even though we can usually
use tools without writing them ourselves.
The basic ideas are applicable to other problems for which tools are not readily available.
A basic understanding of how they work is key to understanding their limitations and the trade-offs we must make between efficiency and precision.
DF Algorithm
An efficient algorithm for computing reaching definitions (and several other properties) is based
on the way reaching definitions at one node are related to the reaching definitions at an adjacent
node.
Suppose we are calculating the reaching definitions of node n, and there is an edge (p,n) from an
immediate predecessor node p.
If the predecessor node p can assign a value to variable v, then the definition vp reaches n. We
say the definition vp is generated at p.
If a definition vq of variable v (made at some other node q) reaches a predecessor node p, and if v is not redefined at p
(in which case we say vq is killed at that point), then the definition is propagated on from p to
n.
public class GCD {
    public int gcd(int x, int y) {
        int tmp;           // A: def x, y, tmp
        while (y != 0) {   // B: use y
            tmp = x % y;   // C: def tmp; use x, y
            x = y;         // D: def x; use y
            y = tmp;       // E: def y; use tmp
        }
        return x;          // F: use x
    }
}
Reach(E) = ReachOut(D)
ReachOut(E) = (Reach(E) \ {yA}) ∪ {yE}
The first (simple) equation says that values can reach E only by coming through D.
The second describes what happens at E: the previous definition of y at node A is overwritten, and a new
value of y is defined here at node E.
We can associate two data flow equations of this form (what reaches here, and what this node
changes) with every line in the program or node in the control flow graph.
General equation for reaching analysis
Each node of the control flow graph (or each line of the program) can be associated with set
equations of the same general form, built from gen and kill sets.
For reaching definitions, the gen set is just the variable that gets a new value there. It could be
the empty set if no variable is changed in that line.
The kill set is the set of values that will be replaced.
The subscripts here are the lines or nodes where the assignment takes place, so that we can keep
track of different assignments to the same variable.
For example, v99 would represent a value that v received at node (or line) 99.
Reach(n) = ∪ { ReachOut(m) | m ∈ pred(n) }
ReachOut(n) = (Reach(n) \ kill(n)) ∪ gen(n)
gen(n) = { vn | v is defined or modified at n }
kill(n) = { vx | v is defined or modified at x, x ≠ n }
Avail equations
Available expressions is another classic data flow analysis. An expression becomes available
where it is computed, and remains available until one of the variables in the expression is
changed. Unlike reaching definitions, though, it is not enough for an expression to be available
on ANY path to a node. Rather, it has to be available on ALL paths to the node.
Avail(n) = ∩ { AvailOut(m) | m ∈ pred(n) }
AvailOut(n) = (Avail(n) \ kill(n)) ∪ gen(n)
gen(n) = { exp | exp is computed at n }
kill(n) = { exp | exp has variables assigned at n }
The gen and kill sets are different, reflecting the definition of availability.
We use intersection instead of union when there is more than one path to a node, because
this is an all-paths analysis rather than an any-path analysis.
Live variable equations
Live variables is a third classic data flow analysis. A variable is live at a node if the value that
is currently in the variable may be used later (before being replaced by another value). It is an
any-path property, like reaching definitions. Note what is different here: we look at successors
of the current node, rather than predecessors. We say live is a backward analysis, while reach
and avail were forward analyses.
Live(n) = ∪ { LiveOut(m) | m ∈ succ(n) }
LiveOut(n) = (Live(n) \ kill(n)) ∪ gen(n)
gen(n) = { v | v is used at n }
kill(n) = { v | v is modified at n }
Classification of analyses
Forward/backward: a node's set depends on that of its predecessors/successors
Any-path/all-path: a node's set contains a value iff it comes from any/all of its inputs
                  Any-path (∪)   All-paths (∩)
Forward (pred)    Reach          Avail
Backward (succ)   Live           Inevitable
Reach, Avail, and Live fill three of the four combinations of { forward, backward } x { any
path, all paths }. For the fourth (all paths, backward), a natural term would be inevitability: it would
propagate values backward from the points where they appear in a gen set, but only so long as
all possible forward paths definitely reach the point of gen.
From Execution to Conservative Flow Analysis
We can use the same data flow algorithms to approximate other dynamic properties.
The gen set holds facts that become true here; the kill set holds facts that are no longer true here.
Flow equations describe the propagation.
Example: Taintedness (in web form processing)
Taint: a user-supplied value (e.g., from a web form) that has not been validated
Gen: we get this value from an untrusted source here
Kill: we validated the value here to make sure it is proper
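The gen/kill view of taintedness can be sketched as a tiny set-propagation over a straight-line sequence of events (the event encoding here is invented for illustration; a real analysis would run over a CFG):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TaintSketch {
    // Each step is {"gen", var} (var receives an untrusted value here)
    // or {"kill", var} (var is validated here).
    public static Set<String> propagate(List<String[]> steps) {
        Set<String> tainted = new HashSet<>();
        for (String[] step : steps) {
            if (step[0].equals("gen")) tainted.add(step[1]);          // fact becomes true here
            else if (step[0].equals("kill")) tainted.remove(step[1]); // fact no longer true here
        }
        return tainted;
    }

    public static void main(String[] args) {
        Set<String> result = propagate(List.of(
            new String[]{"gen", "user"},   // raw form input
            new String[]{"gen", "query"},  // built from user input
            new String[]{"kill", "user"}   // user is validated
        ));
        System.out.println(result); // only "query" remains tainted
    }
}
```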
Monotonic: y ≥ x implies f(y) ≥ f(x) (where f is the application of the flow equations to values from
successor or predecessor nodes, and ≥ is movement up the lattice)
Flow equations must be monotonic. We initialize to the bottom element of a lattice of
approximations; each value that changes must move up the lattice.
Typically this is a powerset lattice: the bottom is the empty set and the top is the universe
(or, for an all-paths analysis, the empty set is at the top).
Work list Algorithm for Data Flow
One way to iterate to a fixed-point solution.
General idea: initially, all nodes are on the work list and have default values.
-Default for an any-path problem is the empty set; default for an all-path
problem is the set of all possibilities (the union of all gen sets).
While the work list is not empty:
-Pick any node n on the work list; remove it from the list
-Apply the data flow equations for that node to get new values
-If the new value differs from the old value at that node, then
add its successors (for a forward analysis) or predecessors (for a backward analysis) to the
work list
Eventually the work list will be empty (because the newly computed values equal the old values at
each node) and the algorithm stops.
This is a classic iteration to a fixed point. Perhaps the only subtlety is how we initialize values.
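The steps above can be sketched concretely. The following computes reaching definitions for the GCD example with a work list; the node numbering (A..F as 0..5) and definition numbering (xA, yA, tmpA, tmpC, xD, yE as 0..5) are my own encoding of its CFG:

```java
import java.util.*;

public class ReachingDefs {
    // CFG nodes of the GCD example: A=0, B=1, C=2, D=3, E=4, F=5
    // Definitions: 0=xA, 1=yA, 2=tmpA, 3=tmpC, 4=xD, 5=yE
    static final int[][] PRED = { {}, {0, 4}, {1}, {2}, {3}, {1} };
    static final int[][] GEN  = { {0, 1, 2}, {}, {3}, {4}, {5}, {} };
    static final int[][] KILL = { {3, 4, 5}, {}, {2}, {0}, {1}, {} };

    public static List<Set<Integer>> analyze() {
        int n = PRED.length;
        List<Set<Integer>> reach = new ArrayList<>();
        List<Set<Integer>> out = new ArrayList<>();
        for (int i = 0; i < n; i++) { reach.add(new HashSet<>()); out.add(new HashSet<>()); }
        Deque<Integer> work = new ArrayDeque<>();
        for (int i = 0; i < n; i++) work.add(i);       // initially all nodes on the work list
        while (!work.isEmpty()) {
            int node = work.remove();
            Set<Integer> in = new HashSet<>();
            for (int p : PRED[node]) in.addAll(out.get(p));   // Reach(n) = U ReachOut(pred)
            reach.set(node, in);
            Set<Integer> newOut = new HashSet<>(in);
            for (int k : KILL[node]) newOut.remove(k);        // (Reach(n) \ kill(n))
            for (int g : GEN[node]) newOut.add(g);            //   U gen(n)
            if (!newOut.equals(out.get(node))) {              // value changed: revisit successors
                out.set(node, newOut);
                for (int s = 0; s < n; s++)
                    for (int p : PRED[s]) if (p == node) work.add(s);
            }
        }
        return reach;
    }

    public static void main(String[] args) {
        // Both xA (0) and xD (4) reach the return at F (node 5)
        System.out.println(analyze().get(5));
    }
}
```

The loop terminates because the sets only grow (monotonicity) over a finite powerset lattice.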
Data flow analysis with arrays and pointers
Arrays and pointers introduce uncertainty: Do different expressions access the same storage?
a[i] same as a[k] when i = k
a[i] same as b[i] when a = b (aliasing)
The uncertainty is accommodated differently depending on the kind of analysis.
Any-path: gen sets should include all potential aliases, and kill sets should include only what is
definitely modified.
All-path: vice versa.
Scope of Data Flow Analysis
Intraprocedural: within a single method or procedure, as described so far.
Interprocedural: across several methods (and classes) or procedures. Cost/precision trade-offs for interprocedural
analysis are critical, and difficult.
Context sensitivity
A context-sensitive (interprocedural) analysis distinguishes sub() called from foo() from sub()
called from bar();
a context-insensitive (interprocedural) analysis does not separate them, as if foo() could call
sub() and sub() could then return to bar().
The problem with context sensitivity is that it is not enough to keep track of the immediate calling
context: we may also need to consider all the methods that could call foo(), all the methods
that could call those methods, and so on.
A single procedure or method can have exponentially many calling contexts, relative to the
number of procedures or methods in the program. A context-insensitive analysis does not
distinguish each context, so it can be much cheaper --- but it can be very imprecise.
For example, we could get an analysis result that corresponds to foo calling sub and sub returning to bar, which is
impossible.
Flow sensitivity
Within a single method or procedure, we typically use a flow-sensitive data flow analysis, like
those we have seen (avail, live, reach, etc.). The cost may be cubic in the size of a procedure or
method, but procedures and methods are usually short, and for well-structured code the algorithms often
run much faster than their theoretical worst-case time bound.
Interprocedural analysis, on the other hand, could need to handle many thousands or even
millions of lines of code.
For the properties that really demand interprocedural analysis, a flow-insensitive analysis is often
good enough. It may be imprecise, but it is much better than just making worst-case
assumptions.
Type checking is a familiar example of flow-insensitive analysis.
Data flow concept
Value of x at 6 could be computed at 1 or at 4
Bad computation at 1 or 4 could be revealed only if they are used at 6
(1,6) and (4,6) are def-use (DU) pairs
defs at 1,4
- use at 6
DU pair: a pair of a definition and a use of the same variable, such that at least one DU path exists
from the definition to the use
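The numbered program behind the line references above is not reproduced in these notes; here is a minimal sketch with the same shape (defs of x at lines 1 and 4, a use at line 6; the method and variable names are invented):

```java
public class DuPairs {
    public static int f(int a, boolean c) {
        int x = a + 1;     // line 1: def x
        if (c) {           // lines 2-3
            x = a * 2;     // line 4: def x (kills the def at line 1 on this path)
        }
        return x;          // line 6: use x -> DU pairs (1,6) and (4,6)
    }

    public static void main(String[] args) {
        System.out.println(f(3, false)); // exercises DU pair (1,6)
        System.out.println(f(3, true));  // exercises DU pair (4,6)
    }
}
```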
x = ... is a definition of x
= ... x ... is a use of x
DU path: a definition-clear path on the CFG from a definition to a use of the same
variable
Definition-clear: the value is not replaced along the path
Note: loops can create infinitely many DU paths between a def and a use
Adequacy criteria
All DU pairs: each DU pair is exercised by at least one test case
All DU paths: each simple (non-looping) DU path is exercised by at least one test case
All definitions: for each definition, there is at least one test case which exercises a DU pair
containing it (every computed value is used somewhere)
Corresponding coverage fractions can also be defined:
CDUpairs = (# of exercised DU pairs) / (# of DU pairs)
CDUpaths = (# of exercised simple DU paths) / (# of simple DU paths)
Difficult cases
x[i] = ... ; ... ; y = x[j]
DU pair (only) if i == j
p = &x ; ... ; *p = 99 ; ... ; q = x
*p is an alias of x
m.putFoo(...); ... ; y = n.getFoo(...);
Are m and n the same object?
Do m and n share a foo field?
The problem of aliases: which references are (always or sometimes) the same?
Data flow coverage with complex structures
Arrays and pointers are critical for data flow analysis. Under-estimation of aliases may fail to
include some DU pairs; over-estimation, on the other hand, may introduce infeasible test
obligations. For testing, it may be preferable to accept under-estimation of the alias set rather than
over-estimation or expensive analysis.
Controversial: in other applications (e.g., compilers), a conservative over-estimation of
aliases is usually required.
Alias analysis may rely on external guidance or other global analysis to calculate good
estimates
Undisciplined use of dynamic storage, pointer arithmetic, etc. may make the whole
analysis infeasible
Infeasibility
The path-oriented nature of data flow analysis makes the infeasibility problem especially
relevant: combinations of elements matter. It is impossible to (infallibly) distinguish feasible from
infeasible paths, and more paths mean more work to check manually. In practice, reasonable coverage is
(often, though not always) achievable. The number of paths is exponential in the worst case, but often linear.
All DU paths is more often impractical:
Not all elements of a program are executable
The path-oriented nature of data flow aggravates the problem
Infeasibility creates test obligations not only for isolated unexecutable elements but also
for infeasible combinations of feasible elements
Complex data structures may amplify the infeasibility problem by adding infeasible paths
as a result of alias computation
E.g., x[i] is an alias of x[j] when i = j, but we can't determine whether i = j at a given point in program execution
But the problem of infeasible paths is modest for the DU pair adequacy criterion