Context-Sensitive, Interprocedural Dataflow Analysis as CFL Reachability

Seth Hallem and Eric Watkins

Exhaustive Analysis Papers

• “Precise Interprocedural Dataflow Analysis via Graph Reachability” (Reps, Horowitz, Sagiv, POPL 1995)

– applies CFL reachability to context-sensitive, interprocedural dataflow analysis

• “Program Analysis via Graph Reachability” (Reps, ILPS 1997)

– describes two additional applications: interprocedural program slicing and shape analysis

The Reduction to CFL Reachability

• Question 1: What problems can we solve?

• Question 2: How do we set up the problem?

• Question 3: How do we solve the problem?

• Question 4: What is the complexity of this approach?

• Running example: possibly uninitialized variables

What problems can we solve?

• IFDS problems

– Finite set of dataflow facts (D)

– Mapping from edges in the CFG to dataflow functions $f : 2^D \to 2^D$

– Each $f$ is distributive with respect to the meet operator: $f(a \sqcap b) = f(a) \sqcap f(b)$

• Possibly uninitialized vars:

– Each program variable corresponds to a dataflow fact. When that fact holds, the variable may be uninitialized.

– Transfer functions: a variable is possibly uninitialized if it was just declared without an initializer, or if it is assigned an expression containing possibly uninitialized variables.

Simple Example

    int z;

    int main (void) {
        int x, y = 0;   /* {x, z} */
        y = y + x;      /* {x, y, z} */
        z = 0;          /* {x, y} */
    }

• D = {x, y, z}, domain/range of transfer functions is the power set of D ($2^D$)
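The annotations in the simple example can be reproduced with a few lines of Python. This is only a sketch: the helper names `declare`, `assign`, `subsets` and `is_distributive` are made up for illustration, the meet operator is assumed to be set union, and `z` is assumed to be possibly uninitialized on entry to `main`, as the slide's annotations imply.

```python
from itertools import combinations

D = ("x", "y", "z")

def declare(var):
    """Declaring `var` without an initializer makes it possibly uninitialized."""
    return lambda s: frozenset(s) | {var}

def assign(target, used):
    """`target = expr`, where `used` lists the variables read by expr."""
    used = frozenset(used)
    def f(s):
        s = frozenset(s)
        # target becomes possibly uninitialized iff expr reads such a variable
        return s | {target} if s & used else s - {target}
    return f

def subsets(items):
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def is_distributive(f):
    """Check f(a ∪ b) == f(a) ∪ f(b) for every pair of subsets of D."""
    return all(f(a | b) == f(a) | f(b) for a in subsets(D) for b in subsets(D))

# The statements of main(), in order ("int x, y = 0" is split in two).
steps = [
    ("int x", declare("x")),
    ("y = 0", assign("y", [])),
    ("y = y + x", assign("y", ["y", "x"])),
    ("z = 0", assign("z", [])),
]

facts = frozenset({"z"})   # assumption: global z is possibly uninitialized
trace = []
for label, f in steps:
    facts = f(facts)
    trace.append((label, sorted(facts)))

for label, result in trace:
    print(f"{label:10} {result}")
print(all(is_distributive(f) for _, f in steps))
```

Each transfer function here has the form "keep everything except the target, then conditionally add the target", which is why the brute-force distributivity check succeeds.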

How do we set up and solve IFDS problems?

• Inputs to the algorithm:

– Exploded supergraph (next couple of slides)

• Outputs from the algorithm:

– Meet-over-all-realizable-paths solution: $MRP_n = \bigsqcap_{q \in \mathrm{Rpaths}(start_{main},\, n)} pf_q(\top)$

The Supergraph

Representation Relations

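• Each dataflow function, $f$, is converted to a representation relation, which is represented as a graph consisting of $2|D| + 2$ nodes

– $|D|$ input nodes, one for each dataflow fact, plus the node $\Lambda$ (or 0), which corresponds to the empty set.

– $|D|$ output nodes plus the node $\Lambda$

– There is an edge from input node $d_1$ to output node $d_2$ if $d_2 \in f(S)$ whenever $d_1 \in S$, and $d_2 \notin f(\emptyset)$

– There is an edge from $\Lambda$ to output node $d_2$ if $d_2 \in f(\emptyset)$, and always an edge from $\Lambda$ to $\Lambda$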
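As a rough illustration of the construction above, the sketch below builds the edge set of a representation relation from a distributive function and then reads the function back off the edges. The names `representation_relation` and `interpret` are hypothetical, the meet is assumed to be union, and the edge test uses the singleton set {d1}, which suffices for distributive functions.

```python
from itertools import combinations

LAMBDA = "Λ"          # the extra node standing for the empty set
D = ("x", "y", "z")

def representation_relation(f, facts):
    """Return the set of (input node, output node) edges representing f."""
    base = f(frozenset())                      # facts generated from nothing
    edges = {(LAMBDA, LAMBDA)}
    edges |= {(LAMBDA, d2) for d2 in base}
    for d1 in facts:
        edges |= {(d1, d2) for d2 in f(frozenset({d1})) if d2 not in base}
    return edges

def interpret(edges, s):
    """Recover f(s): follow edges out of Λ and out of every fact in s."""
    sources = set(s) | {LAMBDA}
    return frozenset(t for (src, t) in edges if src in sources and t != LAMBDA)

# Two transfer functions from the possibly-uninitialized example.
def declare_x(s):                  # int x;
    return frozenset(s) | {"x"}

def y_gets_y_plus_x(s):            # y = y + x;
    s = frozenset(s)
    return s | {"y"} if s & {"x", "y"} else s - {"y"}

rel_decl = representation_relation(declare_x, D)
rel_assign = representation_relation(y_gets_y_plus_x, D)
print(sorted(rel_decl))
print(sorted(rel_assign))
```

Note that `declare_x` produces no edge from x to x: x is already reachable from Λ, so the extra edge would be redundant.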

More Representation Relations

• (a) and (b) show representation relations for two functions (nodes $s_{main}$ and $n_1$)

• (c) and (d) show two ways to compose these relations

– (d) illustrates the need for the $\Lambda$ node in each relation
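The role of Λ in composition can be checked on a tiny case. The two relations below are made-up stand-ins (they are not the ones from the lost figure): `r1` models `int x;` and `r2` models `y = x;` over the facts {x, y}. Composing with and without the Λ nodes shows why generated facts need Λ as a source.

```python
LAMBDA = "Λ"

def compose(r1, r2):
    """Relational composition: follow an edge of r1, then an edge of r2."""
    return {(a, c) for (a, b) in r1 for (b2, c) in r2 if b == b2}

def interpret(edges, s):
    """Facts reachable from Λ or from any fact in s."""
    sources = set(s) | {LAMBDA}
    return frozenset(t for (src, t) in edges if src in sources and t != LAMBDA)

def drop_lambda(edges):
    """The same relation with the Λ node (and its edges) removed."""
    return {(a, b) for (a, b) in edges if LAMBDA not in (a, b)}

# int x;   f1(S) = S ∪ {x}
r1 = {(LAMBDA, LAMBDA), (LAMBDA, "x"), ("y", "y")}
# y = x;   f2(S) = (S - {y}) ∪ ({y} if x ∈ S)
r2 = {(LAMBDA, LAMBDA), ("x", "x"), ("x", "y")}

composite = compose(r1, r2)
broken = compose(drop_lambda(r1), drop_lambda(r2))

print(sorted(composite))
print(sorted(interpret(composite, set())))   # facts after both statements
print(sorted(interpret(broken, set())))      # without Λ the generated x is lost
```

With Λ, the composite relation agrees with applying f2 after f1; without it, the fact generated by `int x;` has nowhere to start from and the composition comes out empty.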

Exploding the Supergraph

CFL Reachability

• Want to solve the dataflow problem with a reachability query on the exploded supergraph.

• Not all paths in G# are valid, though. Must match calls w/returns.

• Insight: context-sensitivity = matching parens; language of matching parens is a CFL

Context-Sensitivity = CFL

• Assign a unique index to each callsite, define a CFL of matching calls and returns.

• Suppose we have two call-sites to function P(), which we label i and k

– (i (k )k )i is a valid path

– (i (k )k is a valid path (the call at i has simply not returned yet)

– (i (k )i is not (the return )i does not match the pending call (k)
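The three cases above can be checked with a small stack-based matcher. This is a sketch under assumed conventions: a path is given as a list of `("call", site)` and `("ret", site)` labels (intraprocedural edges are simply omitted), and `is_realizable` is a hypothetical helper name.

```python
def is_realizable(path):
    """True if every return matches the most recent pending call.

    Calls that are still pending at the end of the path are allowed,
    because a path may stop inside a callee.
    """
    pending = []
    for kind, site in path:
        if kind == "call":
            pending.append(site)
        elif kind == "ret":
            if not pending or pending.pop() != site:
                return False
    return True

balanced = [("call", "i"), ("call", "k"), ("ret", "k"), ("ret", "i")]
still_open = [("call", "i"), ("call", "k"), ("ret", "k")]
mismatched = [("call", "i"), ("call", "k"), ("ret", "i")]

print(is_realizable(balanced))     # True
print(is_realizable(still_open))   # True
print(is_realizable(mismatched))   # False
```

A real solver does not enumerate paths like this; the matcher only makes the definition of a valid path concrete.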

Reachability Algorithm

• Dynamic programming is the key

– Start at the entry point to the program. Follow the edges in G#, recording what dataflow facts we can reach.

– At a procedure call, follow the call. To avoid re-doing any work, though, maintain a cache of edges that summarize pieces of the computation.

– Summary edges record the results of an entire procedure: they start at a callsite and end at the corresponding return-site.

– Path edges record the suffix of a valid path.
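The bullets above can be sketched as a worklist algorithm over path edges and summary edges. This is a simplified reading of the tabulation idea, not the paper's exact pseudocode. The input encoding is an assumption made for this example: exploded nodes are `(node, fact)` pairs, `intra` holds intraprocedural and call-to-return-site edges, `call_edges` maps an exploded call node to callee start nodes, and `ret_edges` is keyed by `(exploded exit node, call site)`.

```python
from collections import defaultdict, deque

LAMBDA = "Λ"

def tabulate(entry, intra, call_edges, ret_edges, exit_nodes):
    path_edges = set()                  # (procedure start, node) pairs
    starts_reaching = defaultdict(set)  # node -> starts with a path edge to it
    summary = defaultdict(set)          # exploded call node -> return-site nodes
    worklist = deque()

    def propagate(src, dst):
        if (src, dst) not in path_edges:
            path_edges.add((src, dst))
            starts_reaching[dst].add(src)
            worklist.append((src, dst))

    propagate(entry, entry)
    while worklist:
        src, dst = worklist.popleft()
        for callee_start in call_edges.get(dst, ()):     # follow the call
            propagate(callee_start, callee_start)
        for nxt in intra.get(dst, set()) | summary[dst]:  # stay in this procedure
            propagate(src, nxt)
        if dst[0] in exit_nodes:                         # procedure finished
            for call, targets in call_edges.items():
                if src not in targets:
                    continue
                for ret in ret_edges.get((dst, call[0]), ()):
                    if ret not in summary[call]:
                        summary[call].add(ret)           # cache the result
                        for caller_start in list(starts_reaching[call]):
                            propagate(caller_start, ret)
    return {dst for _, dst in path_edges}

# Made-up program: main calls an identity procedure P at sites c1 and c2.
# Fact a holds before c1 only; fact b holds before c2 only.
facts = [LAMBDA, "a", "b"]
intra = {
    ("s_main", LAMBDA): {("c1", LAMBDA), ("c1", "a")},
    ("c1", LAMBDA): {("r1", LAMBDA)},
    ("r1", LAMBDA): {("c2", LAMBDA), ("c2", "b")},   # a is killed, b generated
    ("c2", LAMBDA): {("r2", LAMBDA)},
}
intra.update({("s_P", d): {("e_P", d)} for d in facts})
call_edges = {(c, d): {("s_P", d)} for c in ("c1", "c2") for d in facts}
ret_edges = {(("e_P", d), c): {(r, d)}
             for c, r in (("c1", "r1"), ("c2", "r2")) for d in facts}

reached = tabulate(("s_main", LAMBDA), intra, call_edges, ret_edges, {"e_P"})

def facts_at(node):
    return {d for n, d in reached if n == node and d != LAMBDA}

print(facts_at("r1"), facts_at("r2"))
```

In this example a context-insensitive analysis would report both a and b at each return site, because both facts reach the exit of P. Matching each return to its own call site keeps a at r1 and b at r2 only.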

Dynamic Programming Details

Complexity

• Worst case for general CFL reachability is cubic in the number of nodes in the graph

• Can do better for dataflow analysis: $O(ED^3)$ for any distributive problem, $O(Call \cdot D^3 + hED^2)$ for h-sparse problems, where E is the number of supergraph edges, D the number of dataflow facts, and Call the number of call sites

– possibly uninitialized variables is 2-sparse when aliasing is ignored: a variable’s status as initialized or uninitialized can only affect itself and one other variable (if it is assigned to that variable)

Other Applications

• Interprocedural slicing

– identify all pieces of a program relevant to a particular statement

• Shape Analysis

– For any DAG data structure, determines a superset of the possible shapes for that data structure.

– Each dataflow fact corresponds to a single possible shape.

– Problem: infinite number of shapes. Solution is to define shape at program point q in terms of shape at previous program points.

– The ILPS paper has an example of shape analysis of a linked list.

The other papers

• “Demand Interprocedural Dataflow Analysis” (Horowitz, Reps, Sagiv, FSE 1995)

• “Demand-driven Computation of Interprocedural Data Flow” (Duesterwald, Gupta, Soffa, POPL 1995)

• These papers provide two possible frameworks for transforming any IFDS analysis into a demand-driven analysis

Steps to Demand-driven analysis

• Define problem in the IFDS framework

• Reverse the flow functions, or reverse the flow edges

• Start with initial query < d, n >

• Propagate the query backwards until solved

Reversing dataflow

• In Duesterwald et al., the dataflow problem is specified with flow functions

– Reverse the functions

• For CFL problems, the problem is represented as a set of edges

– Just reverse the edges
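For the edge-based formulation, "just reverse the edges" can be sketched as a backward search from the query node. The example below is intraprocedural only (no call/return matching) and uses a hand-built exploded graph for the earlier possibly-uninitialized example. The node names `start`, `p1`, `p2`, `p3` and the helpers `reverse_edges` and `query` are assumptions made for illustration.

```python
from collections import defaultdict, deque

LAMBDA = "Λ"

def reverse_edges(edges):
    """Map each exploded node to the set of its predecessors."""
    rev = defaultdict(set)
    for src, dsts in edges.items():
        for dst in dsts:
            rev[dst].add(src)
    return rev

def query(edges, entry, target):
    """Does fact d hold at node n, for target = (n, d)?

    Searches backwards from the target and stops as soon as the
    entry node is found (early termination).
    """
    rev = reverse_edges(edges)
    seen = {target}
    work = deque([target])
    while work:
        node = work.popleft()
        if node == entry:
            return True
        for pred in rev.get(node, ()):
            if pred not in seen:
                seen.add(pred)
                work.append(pred)
    return False

# start -> p1: declarations (x and z possibly uninitialized, y = 0)
# p1 -> p2:   y = y + x
# p2 -> p3:   z = 0
edges = {
    ("start", LAMBDA): {("p1", LAMBDA), ("p1", "x"), ("p1", "z")},
    ("p1", LAMBDA): {("p2", LAMBDA)},
    ("p1", "x"): {("p2", "x"), ("p2", "y")},
    ("p1", "y"): {("p2", "y")},
    ("p1", "z"): {("p2", "z")},
    ("p2", LAMBDA): {("p3", LAMBDA)},
    ("p2", "x"): {("p3", "x")},
    ("p2", "y"): {("p3", "y")},
}
entry = ("start", LAMBDA)

print(query(edges, entry, ("p3", "y")))   # True: y may be uninitialized
print(query(edges, entry, ("p3", "z")))   # False: z was just assigned 0
```

Only the part of the graph that can influence the queried fact is visited, which is where the savings over exhaustive analysis come from.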

Example: CCP

Notation

• $x$ – set of dataflow facts

• $x_w$ – dataflow fact for variable w

• $f_n(x)_w$ – transfer fn for variable w at node n

• $[w = c]$ – set of dataflow facts, where the fact for variable w equals c

Query Algorithm

• Worklist holds the set of outstanding queries

• While not empty, remove a query

• Propagate backwards one node in the flowgraph

• For a function call, create a backwards summary for that function and apply that
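A minimal sketch of the worklist loop is given below. It assumes that CCP means copy constant propagation and covers a single procedure only, so the backwards-summary step for calls is left out. The CFG encoding (`cfg` maps a node to `(kind, target, value)`), the node names, and the helper `holds` are all made up for this example. A query asks whether a variable has a given constant value immediately after a node.

```python
from collections import deque

def holds(cfg, preds, var, const, node):
    """Is `var == const` guaranteed immediately after `node`?"""
    seen = {(var, node)}
    work = deque(seen)                 # outstanding queries
    while work:
        v, n = work.popleft()
        kind, target, value = cfg[n]
        if target == v:
            if kind == "const":        # v = c': resolved on this path
                if value != const:
                    return False       # early termination
                continue
            if kind != "copy":         # v = <non-constant expression>
                return False
            v = value                  # v = u: ask about u instead
        if not preds[n]:
            return False               # reached entry with no defining assignment
        for p in preds[n]:             # propagate one node backwards
            if (v, p) not in seen:
                seen.add((v, p))
                work.append((v, p))
    return True                        # every raised query was resolved

cfg = {
    "n1": ("const", "x", 5),        # x = 5
    "n2": ("copy", "y", "x"),       # y = x
    "n3": ("nop", None, None),      # branch
    "n4": ("copy", "z", "y"),       # then: z = y
    "n5": ("const", "z", 5),        # else: z = 5
    "n6": ("nop", None, None),      # join
    "n7": ("other", "x", None),     # x = read()
}
preds = {"n1": [], "n2": ["n1"], "n3": ["n2"], "n4": ["n3"],
         "n5": ["n3"], "n6": ["n4", "n5"], "n7": ["n6"]}

print(holds(cfg, preds, "z", 5, "n6"))   # True on both branches
print(holds(cfg, preds, "x", 5, "n7"))   # False: x was overwritten
```

The `seen` set doubles as a simple query cache: a query that has already been raised is never processed twice, which also guarantees termination on loops.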

Query Propagation

More notation

• $r_p$ – entry node for procedure p

• m, n – normal nodes

• $f_m$ – reverse dataflow fn for node m

• $N_{call}$ – all nodes that are callsites

• call(m) – the procedure called at node m

• $(r_p, e_p)$ – summary fn for procedure p

Backwards edge propagation

Query Algorithm Efficiency

• Optimizations: function summaries, early termination, query result cache

• In the worst case, it’s the same as exhaustive analysis

• Some problems work better than others for demand-driven analysis.

– Depends how much information you need to answer queries, or how many queries need to be made.

Conclusions

• Demand-driven analysis is a powerful idea

• Saves time and space, but in the worst case it’s no better than exhaustive analysis

• Only works for distributive problems

• Two approaches for demand-driven analysis are equivalent

Discussion

• Are these algorithms generally applicable?

• Are they fast?

– No evidence in the papers, but the answer is yes (see ESP in a couple of weeks)

• Why are they efficient (beyond the complexity guarantee)?

• Is it always cheap to compute the exploded supergraph?

– How can an imprecise alias analysis influence this step and the overall performance of the algorithm?
