unit 6 -ST


  • 8/3/2019 unit 6 -ST


SOFTWARE TESTING

DEPENDENCE AND DATA FLOW MODELS


    Objective

Understand basics of data-flow models and the related concepts (def-use pairs, dominators)

Understand some analyses that can be performed with the data-flow model of a program: the data flow analyses used to build models, and analyses that use the data flow models

Understand basic trade-offs in modeling data flow: variations and limitations of data-flow models and analyses, differing in precision and cost

    Data Flow

    An abstract representation of the sequence and possible changes of state of data objects, where

    the state of an object is any of creation, usage, modification or destruction.

    Data Flow graph

    A picture of the system as a network of functional processes connected to each other by

    pipelines of data (Data Flows) and holding tanks of data (Data Stores).

    The processes transform the data.

The data flow graph partitions the system into its component parts. Data flow structure follows the trail of a data item as it is accessed and modified by the code.

    Why data flow models?

The models so far emphasized control, e.g., the control flow graph, call graph, finite state machines.

    We also need to reason about dependence

    Where does this value of x come from?

    What would be affected by changing this?

Many program analyses and test design techniques use data flow information, often in combination with control flow. Example: taint analysis to prevent SQL injection attacks.

    Example: Dataflow test criteria


    Definition- Use

A definition-use (or def-use) pair is the most basic data flow relationship: A value was defined

    (produced or stored) at one point and used (referenced or retrieved from a variable) at another

    point.

    A def-use (du) pair associates a point in a program where a value is produced with a point

    where it is used

    Definition: where a variable gets a value

    Variable declaration (often the special value uninitialized)

    Variable initialization

    Assignment

    Values received by a parameter

    Use: extraction of a value from a variable

    Expressions

    Conditional statements

    Parameter passing

    Returns

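As an illustration of these categories of definitions and uses, a small annotated Java method (a stand-in sketch, not from the original slides):

```java
public class DefUseExample {
    // Hypothetical example: each comment marks where values of
    // variables are defined (def) and used (use).
    public static int square(int x) { // def x: parameter receives a value
        int y;                        // def y: declaration (value uninitialized)
        y = x * x;                    // def y: assignment; use x: expression
        return y;                     // use y: return
    }
}
```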


    Definition-Clear or Killing

    When x gets a new value in line B, we say the earlier definition at line A is killed. This means

    a def-use path with A as the definition cannot pass through B.
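A minimal Java sketch of a killed definition (hypothetical method, not from the original slides):

```java
public class KillExample {
    public static int demo(int a) {
        int x = a + 1; // A: def x
        x = a * 2;     // B: def x -- the earlier definition at A is killed here
        return x;      // C: use x; (A,C) is NOT a du pair, only (B,C) is
    }
}
```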

    Dominator

    Pre-dominators in a rooted, directed graph can be used to make this intuitive notion of

    controlling decision precise.

    Node M dominates node N if every path from the root to N passes through M.

    A node will typically have many dominators, but except for the root, there is a unique

    immediate dominator of node N which is closest to N on any path from the root, and which is

    in turn dominated by all the other dominators of N.

    Because each node (except the root) has a unique immediate dominator, the immediate

    dominator relation forms a tree.

    Post-dominators: Calculated in the reverse of the control flow graph, using a special exit node

    as the root.
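Dominator sets can be computed by a simple fixed-point iteration: Dom(root) = {root}, and for every other node n, Dom(n) = {n} ∪ (intersection of Dom(p) over predecessors p). A sketch in Java (the graph encoding and method names are my own assumptions, not from the slides):

```java
import java.util.*;

public class Dominators {
    // preds maps each node to its predecessors; nodes is the full node set.
    public static Map<Integer, Set<Integer>> compute(
            Map<Integer, List<Integer>> preds, int root, Set<Integer> nodes) {
        Map<Integer, Set<Integer>> dom = new HashMap<>();
        for (int n : nodes) {
            // Initialize: root dominates only itself; others start with all nodes.
            dom.put(n, n == root ? new HashSet<>(Set.of(root)) : new HashSet<>(nodes));
        }
        boolean changed = true;
        while (changed) {                     // iterate to a fixed point
            changed = false;
            for (int n : nodes) {
                if (n == root) continue;
                Set<Integer> newDom = new HashSet<>(nodes);
                for (int p : preds.getOrDefault(n, List.of())) {
                    newDom.retainAll(dom.get(p)); // intersect over predecessors
                }
                newDom.add(n);                    // every node dominates itself
                if (!newDom.equals(dom.get(n))) {
                    dom.put(n, newDom);
                    changed = true;
                }
            }
        }
        return dom;
    }
}
```

For the diamond graph 1→2, 1→3, 2→4, 3→4, this yields Dom(4) = {1, 4}: node 1 dominates 4 (and is its immediate dominator), while 2 and 3 do not, since a path to 4 can avoid either one.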

    Data flow analysis

    The algorithms used to compute data flow information. These are basic algorithms used widely

    in compilers, test and analysis tools, and other software tools.


    It is worthwhile to understand basic data flow analysis algorithms, even though we can usually

    use tools without writing them ourselves.

    The basic ideas are applicable to other problems for which tools are not readily available.

A basic understanding of how they work is key to understanding their limitations and the trade-offs we must make between efficiency and precision.

    DF Algorithm

    An efficient algorithm for computing reaching definitions (and several other properties) is based

    on the way reaching definitions at one node are related to the reaching definitions at an adjacent

    node.

    Suppose we are calculating the reaching definitions of node n, and there is an edge (p,n) from an

    immediate predecessor node p.

    If the predecessor node p can assign a value to variable v, then the definition vp reaches n. We

    say the definition vp is generated at p.

If a definition of variable v made at some earlier node reaches a predecessor node p, and if v is not redefined at p (in which case we would say the definition is killed at that point), then the definition is propagated on from p to n.

public class GCD {
    public int gcd(int x, int y) {
        int tmp;           // A: def x, y, tmp
        while (y != 0) {   // B: use y
            tmp = x % y;   // C: def tmp; use x, y
            x = y;         // D: def x; use y
            y = tmp;       // E: def y; use tmp
        }
        return x;          // F: use x
    }
}


Reach(E) = ReachOut(D)

ReachOut(E) = (Reach(E) \ {yA}) ∪ {yE}

The first (simple) equation says that values can reach E only by going through D.

The second describes what happens at E: the previous definition of y at node A is overwritten (killed), and a new value of y is defined here at E.

    We can associate two data flow equations of this form (what reached here, and what this node

    changes) for every line in the program or node in the control flow graph.

    General equation for reaching analysis

    Each node of the control flow graph (or each line of the program) can be associated with set

    equations of the same general form, with gen and kill

    For reaching definitions, the gen set is just the variable that gets a new value there. It could be

    the empty set if no variable is changed in that line.

    The kill set is the set of values that will be replaced.

    The subscripts here are the lines or nodes where the assignment takes place, so that we can keep

    track of different assignments to the same variable.

    For example, v99 would represent a value that v received at node (or line) 99.

Reach(n) = ∪ { ReachOut(m) | m ∈ pred(n) }

ReachOut(n) = (Reach(n) \ kill(n)) ∪ gen(n)

gen(n) = { vn | v is defined or modified at n }

kill(n) = { vx | v is defined or modified at x, x ≠ n }
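A one-node application of the ReachOut equation using Java sets (a sketch with my own encoding: a definition is a string like "yA", meaning y defined at node A):

```java
import java.util.*;

public class ReachStep {
    // ReachOut(n) = (Reach(n) \ kill(n)) ∪ gen(n)
    public static Set<String> reachOut(Set<String> reach,
                                       Set<String> kill,
                                       Set<String> gen) {
        Set<String> out = new HashSet<>(reach); // start from what reaches n
        out.removeAll(kill);                    // drop definitions killed at n
        out.addAll(gen);                        // add definitions generated at n
        return out;
    }
}
```

For node E of the GCD example (y = tmp): if Reach(E) = {xD, yA, tmpC}, kill(E) = {yA}, and gen(E) = {yE}, the equation gives ReachOut(E) = {xD, tmpC, yE}.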

    Avail equations

    Available expressions is another classic data flow analysis. An expression becomes available

    where it is computed, and remains available until one of the variables in the expression is


    changed. Unlike reaching definitions, though, it is not enough for an expression to be available

    on ANY path to a node. Rather, it has to be available on ALL paths to the node.

Avail(n) = ∩ { AvailOut(m) | m ∈ pred(n) }

AvailOut(n) = (Avail(n) \ kill(n)) ∪ gen(n)

gen(n) = { exp | exp is computed at n }

kill(n) = { exp | exp has variables assigned at n }

    The gen and kill sets are different, reflecting the definition

We use intersection instead of union when there is more than one path to a node, because this is an all-paths analysis rather than an any-paths analysis.
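The all-paths join can be sketched in Java as a set intersection over the predecessor outputs (encoding and names are my own assumptions):

```java
import java.util.*;

public class AvailJoin {
    // Avail(n) = ∩ { AvailOut(m) | m ∈ pred(n) }: an expression is available
    // at n only if it is available on ALL incoming paths.
    public static Set<String> avail(List<Set<String>> predOuts) {
        if (predOuts.isEmpty()) return new HashSet<>(); // entry node: nothing available
        Set<String> result = new HashSet<>(predOuts.get(0));
        for (Set<String> out : predOuts) {
            result.retainAll(out); // intersection, unlike the union used for Reach
        }
        return result;
    }
}
```

If one predecessor makes {x+y, a*b} available and the other only {a*b}, then only a*b is available at the join point.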

    Live variable equations

    Live variables is a third, classic data flow analysis. A variable is live at a node if the value that

    is currently in the variable may be used later (before being replaced by another value). It is an

any-paths property like reaching definitions. Note what is different here: we look at successors

    to the current node, rather than predecessors. We say live is a backward analysis, while reach

    and avail were forward analyses.

Live(n) = ∪ { LiveOut(m) | m ∈ succ(n) }

LiveOut(n) = (Live(n) \ kill(n)) ∪ gen(n)

gen(n) = { v | v is used at n }

kill(n) = { v | v is modified at n }

Classification of analyses

Forward/backward: a node's set depends on that of its predecessors/successors

Any-path/all-path: a node's set contains a value iff it comes from any/all of its inputs


                    Any-path (∪)    All-paths (∩)
    Forward (pred)  Reach           Avail
    Backward (succ) Live            Inevitability

Live, Avail, and Reach fill three of the four possible combinations of { forward, backward } x { any-path, all-paths }. For the fourth combination (all-paths, backward), a natural term would be inevitability: it would propagate values backward from the points where they appear in a gen set, but only so long as all possible paths forward definitely reach the point of gen.

    From Execution to Conservative Flow Analysis

    We can use the same data flow algorithms to approximate other dynamic properties

Gen set will be facts that become true here; kill set will be facts that are no longer true here.

    Flow equations will describe propagation

    Example: Taintedness (in web form processing)

    Taint: a user-supplied value (e.g., from web form) that has not been validated

    Gen: we get this value from an untrusted source here

    Kill: we validated to make sure the value is proper

Monotonic: y > x implies f(y) ≥ f(x) (where f is application of the flow equations on values from successor or predecessor nodes, and > is movement up the lattice)


Flow equations must be monotonic. Initialize to the bottom element of a lattice of approximations; each new value that changes must move up the lattice.

Typically a powerset lattice: bottom is the empty set, top is the universe (or empty at the top for an all-paths analysis).

    Work list Algorithm for Data Flow

One way to iterate to a fixed-point solution. General idea:

Initially all nodes are on the work list and have default values. The default for an any-path problem is the empty set; the default for an all-path problem is the set of all possibilities (the union of all gen sets).

While the work list is not empty:

- Pick any node n on the work list; remove it from the list

- Apply the data flow equations for that node to get new values

- If the new value changed (from the old value at that node), add the successors (for a forward analysis) or predecessors (for a backward analysis) to the work list

Eventually the work list will be empty (because the new computed values equal the old values for each node) and the algorithm stops.

This is a classic iteration to a fixed point. Perhaps the only subtlety is how we initialize values.
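The work list algorithm for reaching definitions can be sketched in Java as follows (the graph encoding, map names, and string encoding of definitions such as "xA" are my own assumptions, not from the slides):

```java
import java.util.*;

public class ReachingDefs {
    // preds maps each node to its predecessor nodes; gen/kill map each node
    // to its gen and kill sets of definitions (e.g. "xA" = x defined at A).
    public static Map<String, Set<String>> solve(Map<String, List<String>> preds,
                                                 Map<String, Set<String>> gen,
                                                 Map<String, Set<String>> kill) {
        // Derive the successor relation, needed to re-queue nodes after a change.
        Map<String, List<String>> succs = new HashMap<>();
        for (var e : preds.entrySet())
            for (String p : e.getValue())
                succs.computeIfAbsent(p, k -> new ArrayList<>()).add(e.getKey());

        // Any-path problem: default value is the empty set; all nodes start on the list.
        Map<String, Set<String>> reachOut = new HashMap<>();
        for (String n : preds.keySet()) reachOut.put(n, new HashSet<>());
        Deque<String> work = new ArrayDeque<>(preds.keySet());

        while (!work.isEmpty()) {
            String n = work.remove();
            Set<String> reach = new HashSet<>();             // union over predecessors
            for (String p : preds.get(n)) reach.addAll(reachOut.get(p));
            reach.removeAll(kill.getOrDefault(n, Set.of())); // apply kill(n)
            reach.addAll(gen.getOrDefault(n, Set.of()));     // apply gen(n)
            if (!reach.equals(reachOut.get(n))) {            // value changed:
                reachOut.put(n, reach);                      // record it, and
                work.addAll(succs.getOrDefault(n, List.of())); // revisit successors
            }
        }
        return reachOut; // ReachOut for every node
    }
}
```

On the GCD control flow graph (edges A→B, B→C, C→D, D→E, E→B, B→F) this converges to ReachOut(E) = {tmpC, xD, yE}, matching the equations worked out above.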

    Data flow analysis with arrays and pointers

    Arrays and pointers introduce uncertainty: Do different expressions access the same storage?

    a[i] same as a[k] when i = k

    a[i] same as b[i] when a = b (aliasing)

The uncertainty is accommodated depending on the kind of analysis. Any-path: gen sets should include all potential aliases and kill sets should include only what is definitely modified. All-path: vice versa.

    Scope of Data Flow Analysis


Intraprocedural: within a single method or procedure, as described so far. Interprocedural: across several methods (and classes) or procedures. Cost/precision trade-offs for interprocedural analysis are critical, and difficult.

Context sensitivity

    A context-sensitive (interprocedural) analysis distinguishes sub() called from foo() from sub()

    called from bar();

    A context-insensitive (interprocedural) analysis does not separate them, as if foo() could call

    sub() and sub() could then return to bar()

The problem with context sensitivity is that it is not enough to keep track of the immediate calling context --- we may also need to consider all the methods that could call foo, and all the methods that could call those methods, and so on.

    A single procedure or method can have exponentially many calling contexts, relative to the

    number of procedures or methods in the program. A context-insensitive analysis does not

    distinguish each context, so it can be much cheaper --- but it can be very imprecise.

For example, we could get an analysis result that corresponds to foo calling sub and then sub returning to bar, which is impossible.

Flow sensitivity

    Within a single method or procedure, we typically use a flow-sensitive data flow analysis, like

those we have seen (avail, live, reach, etc.). The cost may be cubic in the size of a procedure or method, but procedures and methods are usually short, and for well-structured code the algorithms often run much faster than their theoretical worst-case time bound.


    Interprocedural analysis, on the other hand, could need to handle many thousands or even

    millions of lines of code.

    For the properties that really demand interprocedural analysis, a flow-insensitive analysis is often

    good enough. It may be imprecise, but it is much better than just making worst-case

    assumptions.

Type checking is a familiar example of flow-insensitive analysis: a variable's declared type is taken to hold everywhere, regardless of the order in which statements execute.

    Data flow concept

Value of x at 6 could be computed at 1 or at 4

A bad computation at 1 or 4 could be revealed only if the value is used at 6

(1,6) and (4,6) are def-use (DU) pairs: defs at 1 and 4, use at 6

DU pair: a pair of definition and use for some variable, such that at least one DU path exists from the definition to the use


    x = ... is a definition of x

    = ... x ... is a use of x

DU path: a definition-clear path on the CFG starting from a definition to a use of the same variable

Definition-clear: the value is not replaced on the path

Note: loops could create infinitely many DU paths between a def and a use

Adequacy criteria

All DU pairs: each DU pair is exercised by at least one test case

All DU paths: each simple (non-looping) DU path is exercised by at least one test case

All definitions: for each definition, there is at least one test case which exercises a DU pair containing it (every computed value is used somewhere)

Corresponding coverage fractions can also be defined:

C_DU pairs = (# of exercised DU pairs) / (# of DU pairs)

C_DU paths = (# of exercised simple DU paths) / (# of simple DU paths)
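As a trivial worked example of the coverage fraction (my own helper, not from the slides):

```java
public class DuCoverage {
    // C_DU pairs = (# exercised DU pairs) / (# DU pairs); the same formula
    // applies to simple DU paths for the all-DU-paths criterion.
    public static double duPairCoverage(int exercised, int total) {
        return (double) exercised / total;
    }
}
```

A test suite exercising 3 of 4 DU pairs achieves a coverage of 0.75; the all-DU-pairs criterion is satisfied only at 1.0.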

    Difficult cases

    x[i] = ... ; ... ; y = x[j]

    DU pair (only) if i==j

p = &x ; ... ; *p = 99 ; ... ; q = x

*p is an alias of x

m.putFoo(...); ... ; y = n.getFoo(...);

Are m and n the same object? Do m and n share a foo field?

Problem of aliases: which references are (always or sometimes) the same?

    Data flow coverage with complex structures

Arrays and pointers are critical for data flow analysis. Under-estimation of aliases may fail to include some DU pairs; over-estimation, on the other hand, may introduce unfeasible test obligations. For testing, it may be preferable to accept under-estimation of the alias set rather than over-estimation or expensive analysis.

Controversial: in other applications (e.g., compilers), a conservative over-estimation of aliases is usually required.


    Alias analysis may rely on external guidance or other global analysis to calculate good

    estimates

    Undisciplined use of dynamic storage, pointer arithmetic, etc. may make the whole

    analysis infeasible

    Infeasibility

The path-oriented nature of data flow analysis makes the infeasibility problem especially relevant: combinations of elements matter! It is impossible to (infallibly) distinguish feasible from infeasible paths, and more paths mean more work to check manually. In practice, reasonable coverage is (often, not always) achievable. The number of paths is exponential in the worst case, but often linear; all DU paths is more often impractical.

Not all elements of a program are executable.

The path-oriented nature of data flow aggravates the problem: infeasibility creates test obligations not only for isolated unexecutable elements but also for infeasible combinations of feasible elements.

Complex data structures may amplify the infeasibility problem by adding infeasible paths as a result of alias computation. E.g., x[i] is an alias of x[j] when i = j, but we cannot determine whether i = j during program execution.

The problem of infeasible paths is, however, modest for the DU pair adequacy criterion.
