8/3/2019 unit 6 -ST
1/14
SOFTWARE TESTING
DEPENDENCE AND DATA FLOW
MODELS
Objective
Understand the basics of data-flow models and the related concepts (def-use pairs, dominators)
Understand some analyses that can be performed with the data-flow model of a program
The data flow analyses that build the models
Analyses that use the data flow models
Understand the basic trade-offs in modeling data flow: variations of data-flow models and analyses differ in precision and cost
Data Flow
An abstract representation of the sequence and possible changes of state of data objects, where
the state of an object is any of creation, usage, modification or destruction.
Data Flow graph
A picture of the system as a network of functional processes connected to each other by
pipelines of data (Data Flows) and holding tanks of data (Data Stores).
The processes transform the data.
The data flow graph partitions the system into its component parts. The data flow structure
follows the trail of a data item as it is accessed and modified by the code.
Why data flow models?
The models seen so far emphasize control, e.g., control flow graphs, call graphs, and finite state machines.
We also need to reason about dependence
Where does this value of x come from?
What would be affected by changing this?
Many program analyses and test design techniques use data flow information, often in combination with control flow. Example: taint analysis to prevent SQL injection attacks
Example: Dataflow test criteria
8/3/2019 unit 6 -ST
3/14
Definition- Use
A definition-use (or def-use) pair is the most basic data flow relationship: a value was defined
(produced or stored) at one point and used (referenced or retrieved from a variable) at another
point.
A def-use (du) pair associates a point in a program where a value is produced with a point
where it is used
Definition: where a variable gets a value
Variable declaration (often the special value uninitialized)
Variable initialization
Assignment
Values received by a parameter
Use: extraction of a value from a variable
Expressions
Conditional statements
Parameter passing
Returns
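As a minimal illustration of the definition and use kinds listed above, consider this hypothetical method (not from the original slides; the names are invented), annotated with each def and use:

```java
public class DuExample {
    public static int scale(int a) {  // def a (value received by a parameter)
        int r = 0;                    // def r (variable initialization)
        if (a > 10) {                 // use a (conditional statement)
            r = a * 2;                // def r; use a (expression + assignment)
        }
        return r;                     // use r (return)
    }

    public static void main(String[] args) {
        System.out.println(scale(20)); // exercises the def of r inside the if
        System.out.println(scale(5));  // exercises the initial def of r
    }
}
```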
Definition-Clear or Killing
When x gets a new value in line B, we say the earlier definition at line A is killed. This means
a def-use path with A as the definition cannot pass through B.
Dominator
Pre-dominators in a rooted, directed graph can be used to make this intuitive notion of
controlling decision precise.
Node M dominates node N if every path from the root to N passes through M.
A node will typically have many dominators, but except for the root, there is a unique
immediate dominator of node N which is closest to N on any path from the root, and which is
in turn dominated by all the other dominators of N.
Because each node (except the root) has a unique immediate dominator, the immediate
dominator relation forms a tree.
Post-dominators: Calculated in the reverse of the control flow graph, using a special exit node
as the root.
Data flow analysis
The algorithms used to compute data flow information. These are basic algorithms used widely
in compilers, test and analysis tools, and other software tools.
It is worthwhile to understand basic data flow analysis algorithms, even though we can usually
use tools without writing them ourselves.
The basic ideas are applicable to other problems for which tools are not readily available.
A basic understanding of how they work is key to understanding their limitations and the trade-offs we must make between efficiency and precision.
DF Algorithm
An efficient algorithm for computing reaching definitions (and several other properties) is based
on the way reaching definitions at one node are related to the reaching definitions at an adjacent
node.
Suppose we are calculating the reaching definitions of node n, and there is an edge (p,n) from an
immediate predecessor node p.
If the predecessor node p can assign a value to variable v, then the definition vp reaches n. We
say the definition vp is generated at p.
If a definition vq of variable v (made at some other node q) reaches a predecessor node p, and if v is not redefined at p
(in which case we say vq is killed at that point), then the definition is propagated on from p to
n.
public class GCD {
    public int gcd(int x, int y) {
        int tmp;           // A: def x, y, tmp
        while (y != 0) {   // B: use y
            tmp = x % y;   // C: def tmp; use x, y
            x = y;         // D: def x; use y
            y = tmp;       // E: def y; use tmp
        }
        return x;          // F: use x
    }
}
Reach(E) = ReachOut(D)
ReachOut(E) = (Reach(E) \ {yA}) ∪ {yE}
The first (simple) equation says that values can reach E only by coming through D.
The second describes what happens at E: the previous definition of y at node A is overwritten, and a new
value of y is defined here at node E.
We can associate two data flow equations of this form (what reaches here, and what this node
changes) with every line in the program or node in the control flow graph.
General equation for reaching analysis
Each node of the control flow graph (or each line of the program) can be associated with set
equations of the same general form, built from gen and kill sets.
For reaching definitions, the gen set is just the variable that gets a new value there. It could be
the empty set if no variable is changed in that line.
The kill set is the set of values that will be replaced.
The subscripts here are the lines or nodes where the assignment takes place, so that we can keep
track of different assignments to the same variable.
For example, v99 would represent a value that v received at node (or line) 99.
Reach(n) = ∪ { ReachOut(m) | m ∈ pred(n) }
ReachOut(n) = (Reach(n) \ kill(n)) ∪ gen(n)
gen(n) = { vn | v is defined or modified at n }
kill(n) = { vx | v is defined or modified at x, x ≠ n }
Avail equations
Available expressions is another classic data flow analysis. An expression becomes available
where it is computed, and remains available until one of the variables in the expression is
changed. Unlike reaching definitions, though, it is not enough for an expression to be available
on ANY path to a node. Rather, it has to be available on ALL paths to the node.
Avail(n) = ∩ { AvailOut(m) | m ∈ pred(n) }
AvailOut(n) = (Avail(n) \ kill(n)) ∪ gen(n)
gen(n) = { exp | exp is computed at n }
kill(n) = { exp | exp has variables assigned at n }
The gen and kill sets are different, reflecting the definition of availability.
We use intersection instead of union when there is more than one path to a node, because
this is an all-paths analysis rather than an any-path analysis.
Live variable equations
Live variables is a third classic data flow analysis. A variable is live at a node if the value that
is currently in the variable may be used later (before being replaced by another value). It is an
any-path property, like reaching definitions. Note what is different here: we look at successors
of the current node, rather than predecessors. We say live is a backward analysis, while reach
and avail were forward analyses.
Live(n) = ∪ { LiveOut(m) | m ∈ succ(n) }
LiveOut(n) = (Live(n) \ kill(n)) ∪ gen(n)
gen(n) = { v | v is used at n }
kill(n) = { v | v is modified at n }
Classification of analyses
Forward/backward: a node's set depends on that of its predecessors/successors
Any-path/all-path: a node's set contains a value iff it comes from any/all of its inputs
                  Any-path (∪)   All-paths (∩)
Forward (pred)    Reach          Avail
Backward (succ)   Live           Inevitable
Reach, Avail, and Live fill three of the four combinations of { forward, backward } x { any
path, all paths }. For the fourth (all paths, backward), a natural term would be inevitability: it would
propagate values backward from the points where they appear in a gen set, but only so long as
all possible forward paths definitely reach the point of gen.
From Execution to Conservative Flow Analysis
We can use the same data flow algorithms to approximate other dynamic properties.
The gen set holds facts that become true here; the kill set holds facts that are no longer true here.
Flow equations describe the propagation.
Example: Taintedness (in web form processing)
Taint: a user-supplied value (e.g., from a web form) that has not been validated
Gen: we get this value from an untrusted source here
Kill: we validated the value here to make sure it is proper
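The gen/kill view of taintedness can be sketched as a tiny set-propagation over a straight-line sequence of events (the event encoding here is invented for illustration; a real analysis would run over a CFG):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TaintSketch {
    // Each step is {"gen", var} (var receives an untrusted value here)
    // or {"kill", var} (var is validated here).
    public static Set<String> propagate(List<String[]> steps) {
        Set<String> tainted = new HashSet<>();
        for (String[] step : steps) {
            if (step[0].equals("gen")) tainted.add(step[1]);          // fact becomes true here
            else if (step[0].equals("kill")) tainted.remove(step[1]); // fact no longer true here
        }
        return tainted;
    }

    public static void main(String[] args) {
        Set<String> result = propagate(List.of(
            new String[]{"gen", "user"},   // raw form input
            new String[]{"gen", "query"},  // built from user input
            new String[]{"kill", "user"}   // user is validated
        ));
        System.out.println(result); // only "query" remains tainted
    }
}
```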
Monotonic: y ≥ x implies f(y) ≥ f(x) (where f is the application of the flow equations to values from
successor or predecessor nodes, and ≥ is movement up the lattice)
Flow equations must be monotonic. We initialize to the bottom element of a lattice of
approximations; each value that changes must move up the lattice.
Typically this is a powerset lattice: the bottom is the empty set and the top is the universe
(or, for an all-paths analysis, the empty set is at the top).
Work list Algorithm for Data Flow
One way to iterate to a fixed-point solution.
General idea: initially, all nodes are on the work list and have default values.
-Default for an any-path problem is the empty set; default for an all-path
problem is the set of all possibilities (the union of all gen sets).
While the work list is not empty:
-Pick any node n on the work list; remove it from the list
-Apply the data flow equations for that node to get new values
-If the new value differs from the old value at that node, then
add its successors (for a forward analysis) or predecessors (for a backward analysis) to the
work list
Eventually the work list will be empty (because the newly computed values equal the old values at
each node) and the algorithm stops.
This is a classic iteration to a fixed point. Perhaps the only subtlety is how we initialize values.
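The steps above can be sketched concretely. The following computes reaching definitions for the GCD example with a work list; the node numbering (A..F as 0..5) and definition numbering (xA, yA, tmpA, tmpC, xD, yE as 0..5) are my own encoding of its CFG:

```java
import java.util.*;

public class ReachingDefs {
    // CFG nodes of the GCD example: A=0, B=1, C=2, D=3, E=4, F=5
    // Definitions: 0=xA, 1=yA, 2=tmpA, 3=tmpC, 4=xD, 5=yE
    static final int[][] PRED = { {}, {0, 4}, {1}, {2}, {3}, {1} };
    static final int[][] GEN  = { {0, 1, 2}, {}, {3}, {4}, {5}, {} };
    static final int[][] KILL = { {3, 4, 5}, {}, {2}, {0}, {1}, {} };

    public static List<Set<Integer>> analyze() {
        int n = PRED.length;
        List<Set<Integer>> reach = new ArrayList<>();
        List<Set<Integer>> out = new ArrayList<>();
        for (int i = 0; i < n; i++) { reach.add(new HashSet<>()); out.add(new HashSet<>()); }
        Deque<Integer> work = new ArrayDeque<>();
        for (int i = 0; i < n; i++) work.add(i);       // initially all nodes on the work list
        while (!work.isEmpty()) {
            int node = work.remove();
            Set<Integer> in = new HashSet<>();
            for (int p : PRED[node]) in.addAll(out.get(p));   // Reach(n) = U ReachOut(pred)
            reach.set(node, in);
            Set<Integer> newOut = new HashSet<>(in);
            for (int k : KILL[node]) newOut.remove(k);        // (Reach(n) \ kill(n))
            for (int g : GEN[node]) newOut.add(g);            //   U gen(n)
            if (!newOut.equals(out.get(node))) {              // value changed: revisit successors
                out.set(node, newOut);
                for (int s = 0; s < n; s++)
                    for (int p : PRED[s]) if (p == node) work.add(s);
            }
        }
        return reach;
    }

    public static void main(String[] args) {
        // Both xA (0) and xD (4) reach the return at F (node 5)
        System.out.println(analyze().get(5));
    }
}
```

The loop terminates because the sets only grow (monotonicity) over a finite powerset lattice.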
Data flow analysis with arrays and pointers
Arrays and pointers introduce uncertainty: Do different expressions access the same storage?
a[i] same as a[k] when i = k
a[i] same as b[i] when a = b (aliasing)
The uncertainty is accommodated differently depending on the kind of analysis.
Any-path: gen sets should include all potential aliases, and kill sets should include only what is
definitely modified.
All-path: vice versa.
Scope of Data Flow Analysis
Intraprocedural: within a single method or procedure, as described so far.
Interprocedural: across several methods (and classes) or procedures. Cost/precision trade-offs for interprocedural
analysis are critical, and difficult.
Context sensitivity
A context-sensitive (interprocedural) analysis distinguishes sub() called from foo() from sub()
called from bar();
a context-insensitive (interprocedural) analysis does not separate them, as if foo() could call
sub() and sub() could then return to bar().
The problem with context sensitivity is that it is not enough to keep track of the immediate calling
context: we may also need to consider all the methods that could call foo(), all the methods
that could call those methods, and so on.
A single procedure or method can have exponentially many calling contexts, relative to the
number of procedures or methods in the program. A context-insensitive analysis does not
distinguish each context, so it can be much cheaper --- but it can be very imprecise.
For example, we could get an analysis result that corresponds to foo calling sub and sub returning to bar, which is
impossible.
Flow sensitivity
Within a single method or procedure, we typically use a flow-sensitive data flow analysis, like
those we have seen (avail, live, reach, etc.). The cost may be cubic in the size of a procedure or
method, but procedures and methods are usually short, and for well-structured code the algorithms often
run much faster than their theoretical worst-case time bound.
Interprocedural analysis, on the other hand, could need to handle many thousands or even
millions of lines of code.
For the properties that really demand interprocedural analysis, a flow-insensitive analysis is often
good enough. It may be imprecise, but it is much better than just making worst-case
assumptions.
Type checking is a familiar example of flow-insensitive analysis.
Data flow concept
Value of x at 6 could be computed at 1 or at 4
Bad computation at 1 or 4 could be revealed only if they are used at 6
(1,6) and (4,6) are def-use (DU) pairs
defs at 1,4
- use at 6
DU pair: a pair of a definition and a use of the same variable, such that at least one DU path exists
from the definition to the use
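The numbered program behind the line references above is not reproduced in these notes; here is a minimal sketch with the same shape (defs of x at lines 1 and 4, a use at line 6; the method and variable names are invented):

```java
public class DuPairs {
    public static int f(int a, boolean c) {
        int x = a + 1;     // line 1: def x
        if (c) {           // lines 2-3
            x = a * 2;     // line 4: def x (kills the def at line 1 on this path)
        }
        return x;          // line 6: use x -> DU pairs (1,6) and (4,6)
    }

    public static void main(String[] args) {
        System.out.println(f(3, false)); // exercises DU pair (1,6)
        System.out.println(f(3, true));  // exercises DU pair (4,6)
    }
}
```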
x = ... is a definition of x
= ... x ... is a use of x
DU path: a definition-clear path on the CFG from a definition to a use of the same
variable
Definition-clear: the value is not replaced along the path
Note: loops can create infinitely many DU paths between a def and a use
Adequacy criteria
All DU pairs: each DU pair is exercised by at least one test case
All DU paths: each simple (non-looping) DU path is exercised by at least one test case
All definitions: for each definition, there is at least one test case which exercises a DU pair
containing it (every computed value is used somewhere)
Corresponding coverage fractions can also be defined:
CDUpairs = (# of exercised DU pairs) / (# of DU pairs)
CDUpaths = (# of exercised simple DU paths) / (# of simple DU paths)
Difficult cases
x[i] = ... ; ... ; y = x[j]
DU pair (only) if i == j
p = &x ; ... ; *p = 99 ; ... ; q = x
*p is an alias of x
m.putFoo(...); ... ; y = n.getFoo(...);
Are m and n the same object?
Do m and n share a foo field?
The problem of aliases: which references are (always or sometimes) the same?
Data flow coverage with complex structures
Arrays and pointers are critical for data flow analysis. Under-estimation of aliases may fail to
include some DU pairs; over-estimation, on the other hand, may introduce infeasible test
obligations. For testing, it may be preferable to accept under-estimation of the alias set rather than
over-estimation or expensive analysis.
Controversial: in other applications (e.g., compilers), a conservative over-estimation of
aliases is usually required.
Alias analysis may rely on external guidance or other global analysis to calculate good
estimates
Undisciplined use of dynamic storage, pointer arithmetic, etc. may make the whole
analysis infeasible
Infeasibility
The path-oriented nature of data flow analysis makes the infeasibility problem especially
relevant: combinations of elements matter. It is impossible to (infallibly) distinguish feasible from
infeasible paths, and more paths mean more work to check manually. In practice, reasonable coverage is
(often, though not always) achievable. The number of paths is exponential in the worst case, but often linear.
All DU paths is more often impractical:
Not all elements of a program are executable
The path-oriented nature of data flow aggravates the problem
Infeasibility creates test obligations not only for isolated unexecutable elements but also
for infeasible combinations of feasible elements
Complex data structures may amplify the infeasibility problem by adding infeasible paths
as a result of alias computation
E.g., x[i] is an alias of x[j] when i = j, but we can't determine whether i = j at a given point in program execution
But the problem of infeasible paths is modest for the DU pair adequacy criterion