Context-Sensitive Interprocedural Points-to Analysis in the Presence of Function Pointers


Presentation by Patrick, Kaleem, and Justin

Abstract

• Emami et al. introduced a new method for dealing with the alias problem in C-like languages. The method provides context-sensitive interprocedural information based on “invocation graphs” and supports recursive functions as well.

Introduction/Motivation

Similar analyses had already been applied to block-structured languages like Fortran.

C proved to be a greater challenge because of the many pointer-related options available to the programmer.

Simply applying Fortran’s optimizations was practically useless due to the many pointer-related operations used in C.

Difficulties With C

• The address-of operator (&) can create new points-to relationships at any point in a program

• Handling both stack-allocated and dynamically-allocated (heap) variables

• Proper analysis of recursion and function pointers

Data Structures

Two main data structures are used during Emami’s analysis:

• An invocation graph to deal with function call paths.

• A set of abstract stack locations to deal with Steensgaard/Andersen-like points-to relationships.

Points-to Abstraction

• Based on relationships between stack locations instead of simple alias pairs

• Stack location x points-to stack location y at program point p if x == &y

• Computes both possible and definite points-to relationships

Points-to Abstraction

• Definite relationships provide valuable “killing” information as well as providing pointer replacement

• If we know q definitely points to y, we can replace the statement x=*q with x=y

• Later in compilation, these replacements can reduce the number of loads and stores needed
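As a minimal sketch of pointer replacement (the function names are hypothetical, chosen only for this illustration), the two functions below compute the same value. The second is what the first could become once the analysis has established that q definitely points to y.

```c
/* Before replacement: x is loaded through the pointer q. */
int load_through_pointer(void)
{
    int y = 42;
    int *q = &y;   /* after this statement, (q, y, D) holds */
    int x = *q;    /* indirect reference */
    return x;
}

/* After replacement: the indirect reference *q is rewritten to y. */
int load_direct(void)
{
    int y = 42;
    int x = y;     /* direct reference, no load through q */
    return x;
}
```

The replacement is only safe because the relationship is definite. With a merely possible relationship, q could hold the address of another location and the rewrite would change the program's meaning.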

Context-sensitive/Interprocedural

• Emami’s approach is to use invocation graphs, which generate accurate results and correctly handle recursion in the presence of function pointers

• This is important because function pointers previously proved difficult for interprocedural analysis

• Paths along the graph represent possible paths of execution from one function call to another

Stack-based Analysis

• Aliasing problems come in one of three forms: aliases between references to the stack, between references to dynamically allocated heap space, and between two references to the same array

• Emami holds that stack-based analysis can be safely decoupled from heap-based analysis. The given algorithm deals only with the stack analysis.

Stack vs. Heap

• C’s pointer-related complexity involves aliases in the stack as well as the heap. Dynamically allocated memory is said to be on the heap. Emami’s algorithm deals with resolving stack-based aliases

• For example, the statements

int a; int* x=&a;

create a pointer to a stack location.

Setting – McCAT Compiler

• SIMPLE is a compact representation for program statements that is easier to deal with, while retaining ALL the functionality required by real C programs. In effect, it is a strict subset of C.

• Using SIMPLE, points-to analysis rules need only be implemented for 15 basic statements and for the simplified conditions/control statements (if, while, etc).

• The 15 statements conveniently require only one-level pointer indirection per reference.

Previous Methods

Past analysis methods, like Steensgaard and Andersen, approximated aliases via alias pairs. Two variable references were said to be aliased if they referred to the same location.

These algorithms ignored both the calling context and the order (flow) of statements. Although they could be used to optimize real programs, there was still room for improvement because they would likely include many false alias pairs.

Context/Flow sensitive algorithms can eliminate even more false information, allowing more optimization possibilities when implemented.

Stack-based Method

• Emami’s method abstracts the set of all accessible stack locations with a finite set of named “abstract stack locations”.

• The approximation consists of a set of points-to relationships between the abstract stack locations.

• After the statement p=&y, we say abstract stack location p points-to abstract stack location y.

Properties of the Abstraction

• Every real stack location involved in a pointer reference is represented by exactly one named abstract stack location.

• Each of these named abstract locations represents one or more real stack locations.

Abstract Stack Locations

Each abstract location corresponds to one of three things:

• The name of a local/global variable or parameter

• A symbolic name that corresponds to locations indirectly accessible through pointer variables

• The symbolic name heap

Definitely Points-to

Abstract stack location x definitely points to abstract stack location y, if x and y each represent exactly one real stack location, and the real stack location corresponding to x contains the address of the real stack location corresponding to y. This is denoted by the triple (x, y, D)

Possibly Points-to

Abstract stack location x possibly points-to abstract stack location y if it is possible that one of the real stack locations corresponding to x contains the address of one of the real stack locations corresponding to y. This is denoted by

(x, y, P)

L- vs R-locations

• L-locations and R-locations are abstract locations referred to by a variable reference on the left or right side of an assignment statement, respectively.

• Both are represented as pairs of the form (x, D) or (x, P), where x is an abstract location name, and D/P indicate definite and possible locations. These are described in Table 1.

L- and R-locations

• L-locations refer to the stack location of the variable itself

• An R-location is

{(x, d) | a points to x with the relationship d}

• Table 1 shows these sets for all references available in SIMPLE

SIMPLE References

Ref        L-location set               R-location set
&a         N/A                          {(a, D)}
&a[0]      N/A                          {(a_head, D)}
a          {(a, D)}                     {(x, d) | (a, x, d) in S}
*a         {(x, d) | (a, x, d) in S}    {(y, d1 /\ d2) | (a, x, d1) in S /\ (x, y, d2) in S}
malloc()   N/A                          {(heap, P)}
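The rows for a and *a can be sketched in C as follows. This is an illustrative sketch, not the McCAT implementation: the type names (Rel, Triple, Loc) and function names (targets_of, rlocs_of_deref) are assumptions made for this example, and the points-to set S is modelled as a plain array of triples.

```c
#include <stddef.h>
#include <string.h>

typedef enum { POSSIBLE = 0, DEFINITE = 1 } Rel;

/* One points-to triple (src, dst, rel), e.g. (a, x, D). */
typedef struct { const char *src; const char *dst; Rel rel; } Triple;

/* One element (name, rel) of an L-location or R-location set. */
typedef struct { const char *name; Rel rel; } Loc;

/* d1 /\ d2: definite only if both relationships are definite. */
static Rel rel_and(Rel d1, Rel d2)
{
    return (d1 == DEFINITE && d2 == DEFINITE) ? DEFINITE : POSSIBLE;
}

/* {(x, d) | (a, x, d) in S}: the R-locations of `a`, which are also
   the L-locations of `*a`. Returns the number of entries written. */
size_t targets_of(const Triple *S, size_t n, const char *a,
                  Loc *out, size_t cap)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (strcmp(S[i].src, a) == 0 && k < cap)
            out[k++] = (Loc){ S[i].dst, S[i].rel };
    return k;
}

/* {(y, d1 /\ d2) | (a, x, d1) in S and (x, y, d2) in S}: the
   R-locations of `*a`. Duplicates are not merged in this sketch. */
size_t rlocs_of_deref(const Triple *S, size_t n, const char *a,
                      Loc *out, size_t cap)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (strcmp(S[i].src, a) != 0)
            continue;
        for (size_t j = 0; j < n; j++)
            if (strcmp(S[j].src, S[i].dst) == 0 && k < cap)
                out[k++] = (Loc){ S[j].dst, rel_and(S[i].rel, S[j].rel) };
    }
    return k;
}
```

For example, with S = {(b, x, D), (x, y, D)} the R-location set of *b is {(y, D)}. If the first relationship were only possible, the result would weaken to {(y, P)}.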

Basic Analysis Rules

• If the statement is a sequence, process both statements separately.

• If the statement is a basic assignment, process it with process_basic_stmt().

• If the statement is a control statement, call the corresponding function to process it.


If we are assigning to a pointer variable:

kill_set = those relationships of definite L-locations of lhs(S)

change_set = those relationships from possible L-locations of lhs(S)

gen_set = all relationships between L-locations of lhs(S) and R-locations of rhs(S)
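One way to read the three sets is sketched below. The function name process_assignment and the parameters llocs and rlocs are hypothetical. The sketch also assumes that relationships in the change set are weakened from definite to possible rather than removed, which the slide does not spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { POSSIBLE = 0, DEFINITE = 1 } Rel;
typedef struct { const char *src; const char *dst; Rel rel; } Triple;
typedef struct { const char *name; Rel rel; } Loc;

/* Find the L-location entry for `name`, or NULL if it is not assigned. */
static const Loc *find_loc(const Loc *locs, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(locs[i].name, name) == 0)
            return &locs[i];
    return NULL;
}

/* Apply one pointer assignment to the input set `in`. Writes the
   resulting set to `out` (duplicates are not merged) and returns
   its size. */
size_t process_assignment(const Triple *in, size_t n_in,
                          const Loc *llocs, size_t n_l,
                          const Loc *rlocs, size_t n_r,
                          Triple *out, size_t cap)
{
    size_t k = 0;
    for (size_t i = 0; i < n_in; i++) {
        const Loc *l = find_loc(llocs, n_l, in[i].src);
        if (l != NULL && l->rel == DEFINITE)
            continue;                    /* kill_set: relationship dropped */
        assert(k < cap);
        out[k] = in[i];
        if (l != NULL)                   /* change_set: weaken D to P */
            out[k].rel = POSSIBLE;
        k++;
    }
    for (size_t i = 0; i < n_l; i++)     /* gen_set: every L x R pair */
        for (size_t j = 0; j < n_r; j++) {
            Rel d = (llocs[i].rel == DEFINITE && rlocs[j].rel == DEFINITE)
                        ? DEFINITE : POSSIBLE;
            assert(k < cap);
            out[k++] = (Triple){ llocs[i].name, rlocs[j].name, d };
        }
    return k;
}
```

For p = &y with input {(p, x, D), (q, x, P)}, the L-location set is {(p, D)} and the R-location set is {(y, D)}. The old target of p is killed and the output is {(q, x, P), (p, y, D)}.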


Interprocedural Analysis

• When measuring the effect of a procedure call, we estimate it within the context in which it was called

• A calling context depends on the chain of procedure calls starting at main() and ending with the procedure being processed

Invocation Graph

• All invocation paths, beginning with main(), are represented in an invocation graph.

• If we could disregard recursion, the graph could be built using a depth-first traversal of the program’s calling structure

• With recursion, we have to approximate all possible “unrollings” of the recursion.

• Nodes for recursive calls are marked as an approximate node, and a special approximating edge is added to the graph
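The construction described above can be sketched as a depth-first expansion that stops unrolling when a callee is already on the current invocation path. Everything named here (InvGraph, CallMatrix, build_invocation_graph, the fixed array sizes) is an assumption of this sketch. For brevity it also models at most one call site per caller/callee pair and ignores function pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FUNCS 8
#define MAX_NODES 64

typedef enum { ORDINARY, RECURSIVE, APPROXIMATE } NodeKind;

typedef struct {
    int func;        /* index of the function this node invokes */
    int parent;      /* index of the calling node, -1 for the root */
    NodeKind kind;
} InvNode;

typedef struct {
    InvNode nodes[MAX_NODES];
    size_t count;
} InvGraph;

/* calls[f][g] is true if function f contains a call to function g. */
typedef bool CallMatrix[MAX_FUNCS][MAX_FUNCS];

static int add_node(InvGraph *g, int func, int parent, NodeKind kind)
{
    assert(g->count < MAX_NODES);
    g->nodes[g->count] = (InvNode){ func, parent, kind };
    return (int)g->count++;
}

/* Nearest node on the path from `node` up to the root that invokes
   `func`, or -1 if there is none. */
static int find_on_path(const InvGraph *g, int node, int func)
{
    for (int n = node; n != -1; n = g->nodes[n].parent)
        if (g->nodes[n].func == func)
            return n;
    return -1;
}

static void expand(InvGraph *g, CallMatrix calls, int n_funcs, int node)
{
    int f = g->nodes[node].func;
    for (int callee = 0; callee < n_funcs; callee++) {
        if (!calls[f][callee])
            continue;
        int anc = find_on_path(g, node, callee);
        if (anc != -1) {
            /* Recursion: pair the ancestor with an approximate node. */
            g->nodes[anc].kind = RECURSIVE;
            add_node(g, callee, node, APPROXIMATE);
        } else {
            expand(g, calls, n_funcs, add_node(g, callee, node, ORDINARY));
        }
    }
}

void build_invocation_graph(InvGraph *g, CallMatrix calls,
                            int n_funcs, int main_func)
{
    g->count = 0;
    expand(g, calls, n_funcs, add_node(g, main_func, -1, ORDINARY));
}
```

Because each path from the root gets its own nodes, a function called from two different contexts appears twice in the graph. This is what makes the analysis context-sensitive.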

Mapping Points-to Information

• When a procedure is called, the input points-to set for the called procedure must be mapped from the points-to information at the call site.

• The mapping must include information about any multi-level pointer relationships


• Parameters and globals may point to locations outside the scope of the called function. These locations are called “invisible” variables.

• Invisible variables are represented by a symbolic name in the abstraction.

• A symbolic name can represent more than one invisible variable.


• A good mapping scheme must minimize the number of invisible variables mapped to a symbolic name to improve accuracy

• The output points-to set is mapped back to obtain the output points-to information at the call site.
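A small C example may help show why the mapping is needed. The function names and the symbolic names mentioned in the comments (1_pp, 2_pp and so on) are illustrative choices for this sketch, not names taken from the source.

```c
/* Inside swap_targets, the caller's locals a, b, p and q are out of
   scope. They are reachable only through the parameters pp and qq, so
   the analysis must stand in for them with symbolic names, for example
   1_pp for the caller's p and 2_pp for whatever p points to. */
void swap_targets(int **pp, int **qq)
{
    int *tmp = *pp;   /* *pp refers to an invisible variable (caller's p) */
    *pp = *qq;
    *qq = tmp;
}

/* Returns 1 if, after the call, p points to b and q points to a. */
int caller(void)
{
    int a = 1, b = 2;
    int *p = &a;      /* call-site input: (p, a, D) */
    int *q = &b;      /* call-site input: (q, b, D) */

    /* Map in: pp -> 1_pp -> 2_pp and qq -> 1_qq -> 2_qq, two levels
       deep, so the callee can be analyzed without seeing a, b, p, q. */
    swap_targets(&p, &q);

    /* Map back: the callee's output (1_pp, 2_qq, D), (1_qq, 2_pp, D)
       becomes (p, b, D) and (q, a, D) at the call site. */
    return p == &b && q == &a;
}
```

If two invisible variables had to share one symbolic name, the relationships mapped back would have to be weakened to possible, which is why a good mapping keeps that sharing to a minimum.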

Recursion

• Figure 4 outlines the recursion algorithm

• All possible unrollings of recursive calls are approximated by introducing matched pairs of recursive and approximate nodes in the invocation graph.

• Each approximate node marks a place where the current stored approximation for the function should be used. Instead of evaluating the call again (and again…), the stored output set is used as an approximation.

Recursion – Figure 4

• For approximate nodes, the current input is compared to the stored input of the matching recursive node. If the current input is contained in the stored input, then we safely use the stored output as the result.

• An approximate node never evaluates the body of a function. It either uses the stored result or returns BOTTOM.
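The check made at an approximate node can be sketched as follows. This is a simplification under stated assumptions: points-to sets are encoded as bitmasks, BOTTOM is encoded as the empty mask, and the pending_inputs field (which remembers inputs the stored approximation does not yet cover, so the matching recursive node could be re-evaluated with a larger input) is an assumption of this sketch rather than something stated on the slide.

```c
#include <stdbool.h>
#include <stdint.h>

/* A points-to set as a bitmask: bit i set means relationship i holds.
   BOTTOM is encoded as the empty mask in this sketch. */
typedef uint32_t PtSet;
#define BOTTOM ((PtSet)0)

typedef struct {
    PtSet stored_input;    /* input approximation of the recursive node */
    PtSet stored_output;   /* current output approximation */
    PtSet pending_inputs;  /* inputs not covered by stored_input */
} RecursiveNode;

static bool is_subset(PtSet a, PtSet b)
{
    return (a & ~b) == 0;
}

/* Evaluate an approximate node against its matching recursive node.
   The function body is never analyzed here. */
PtSet eval_approximate_node(RecursiveNode *rec, PtSet current_input)
{
    if (is_subset(current_input, rec->stored_input))
        return rec->stored_output;       /* stored result is safe to use */
    rec->pending_inputs |= current_input;
    return BOTTOM;                       /* no usable result yet */
}
```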

Function Pointers

With function pointers, the invocation graph cannot be constructed correctly with just a single pass through the source code.

When a function is called through a function pointer, program execution could potentially jump to a number of functions, which we cannot know for sure ahead of time.

Function Pointers

• The safest approximation would be to assume that these calls can point to any function in the program. But this would make our results too conservative.

• Emami improves this approximation by building the invocation graph while points-to analysis is performed.

Function Pointers

• When the analysis reaches a function pointer call, the set of currently-known possible points-to destinations for the pointer is used to limit the number of false assumptions.

• This is not perfect, because the graph is most likely incomplete at the point where the function is called, but it greatly improves upon assuming that all functions are reachable.

Function Pointer Algorithm

1. Begin building invocation graph

2. Perform points-to analysis using the incomplete graph

3. Each pointed-to function is analyzed in the context of the call
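As a small illustration of the kind of program these steps handle (the functions are hypothetical and this is not the example from the original slides), consider a call through fp whose targets the analysis can narrow to two of three candidate functions.

```c
int inc(int v)   { return v + 1; }
int dec(int v)   { return v - 1; }
int twice(int v) { return v * 2; }   /* never assigned to fp */

int apply(int use_inc, int v)
{
    int (*fp)(int);

    if (use_inc)
        fp = inc;    /* on this branch, fp definitely points to inc */
    else
        fp = dec;    /* on this branch, fp definitely points to dec */

    /* After the branches merge, the points-to set of fp is
       {(fp, inc, P), (fp, dec, P)}. The invocation graph node for this
       call site therefore gets children for inc and dec only, and each
       is analyzed in the context of this call. twice is never added. */
    return fp(v);
}
```

A naive approximation would have to treat all three functions, and every other function in the program, as possible targets of the call.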

Function Pointer Example

Experimental Results

• 17 C programs analyzed

• All programs were converted to SIMPLE, and then processed.

Results

• The average number of stack locations pointed to by a dereferenced pointer is close to 1, which is the ideal case

• This means that the analysis is very precise. Very few false assumptions make it past the analysis.

Results

• About 30% of indirect references refer to a pointer that definitely points to a single stack location. About 20% of indirect references can then be replaced by direct references in the final program. For this replacement, about two-thirds of the replacements are useful.

• The remaining 10% cannot be replaced because of the scoping problems of invisible variables.

Heap Pointers

• Emami’s algorithm deals only with stack-based pointers. Heap-based pointers are ignored. In the 17 test programs, about 30% of points-to relationships used have heap locations as their target.

• Stack analysis alone is insufficient. For a more complete analysis, a companion heap analysis is needed.

Heap vs Stack

Surprisingly, all test programs had zero pointer relationships from the heap to the stack.

This strongly supports the authors’ claim that stack and heap analysis should be performed separately.

Results

• Each call site was found to have an average of only two locations in the invocation graphs.

• This implies that for real programs, explicitly following call chains in an invocation graph is not too costly.

• In other words, there are not too many edges coming from each node.

Problems

• Even though there are only two locations per call site on average in the invocation graph, the algorithm can theoretically take exponential time.

• Possible explanations for this include recursion, and function pointers.

Applications

• Once the analysis has been completed, other analyses can use the existing invocation graph and abstract stack information to provide further optimizations.

• These other analyses won’t have to take invisible variables or function pointers into account, as those problems will have already been resolved.

Related Work

• Alias Analysis

– Alias analysis was the precursor to points-to analysis.

Related Work

• Landi and Ryder

– The points-to abstraction provides more alias information in a more compact form.

– The authors’ method can give more accurate results for multi-level pointers.

– The points-to method also provides a safe approximation even in the presence of pointers from the heap to the stack.

– Function pointers cannot be handled by Landi and Ryder’s method.

Conclusion

• The method presented provides context-sensitive interprocedural information, as well as handling general function pointers.

• This information can be used for optimizations and transformations, including pointer replacement and array dependence testing.

The End

Patrick

Kaleem

Justin
