Mining Function Usage Patterns to Find Bugs

Chadd Williams

University of Maryland

open(f)
tmp = cnt = 0
while (cnt < sz && tmp != -1)
    tmp = read(f, sz)
    if (tmp != -1)
        cnt += tmp
close(f)

Thesis

Source code is full of interesting properties
  – they describe how the source code is written
  – rules that one must adhere to for the code to work correctly
  – what to do with values from a function
  – how to use an API

Can we find the properties?
  – every change is committed
  – changes highlight misunderstood code

We can discover important properties by looking at source code changes

Can we use these rules to help the developer to find bugs?


Why?

We wrote the code, we know the rules!

Implicit rules build up over time
  – little or no documentation
  – failure to understand implicit rules causes bugs
      • 32% of bugs detected during maintenance [1]

How much do you know about your 10 year old code base?
  – Didn’t someone rewrite the matrix objects?
  – What about that third party library?

[1] Matsumura, T., Monden, A., Matsumoto, K., The Detection of Faulty Code Violating Implicit Coding Rules, IWPSE ’02


Static Analysis

Analysis of code without execution
  – examine the source code only

Many successful static analysis tools check for violations of system specific rules
  – how to use an internal API
  – specialized lock/unlock functionality
  – data validation requirements

Often produces many false warnings
  – can historical information improve this?


General Technique

Inspect each commit to each file

Identify properties in each version

Compare sets of properties to determine new instances of properties

Identify commonly added properties

Before the commit:

    …
    value = foo();
    newPosition += value;
    …

After the commit:

    …
    value = foo();
    if (value != error_code) {
        newPosition += value;
    }
    …
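The comparison step can be sketched in C as a set difference between the properties found in two consecutive revisions of a file. This is only an illustration under stated assumptions: properties are represented as plain strings such as "checked_return:foo", and the names `contains` and `new_properties` are hypothetical, not taken from the tool described in the talk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if `prop` occurs in the first `n` entries of `set`, else 0. */
static int contains(const char *const *set, size_t n, const char *prop)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(set[i], prop) == 0)
            return 1;
    }
    return 0;
}

/*
 * Copies into `out` (capacity `cap`) every property that is present in the
 * revision after the commit but absent from the revision before it, and
 * returns how many were copied. Property strings are a hypothetical
 * encoding chosen for this sketch.
 */
size_t new_properties(const char *const *before, size_t n_before,
                      const char *const *after, size_t n_after,
                      const char **out, size_t cap)
{
    size_t n_new = 0;
    assert(out != NULL || cap == 0);
    for (size_t i = 0; i < n_after && n_new < cap; i++) {
        if (!contains(before, n_before, after[i]))
            out[n_new++] = after[i];
    }
    return n_new;
}
```

Counting how often each property shows up in these per-commit differences would then give the "commonly added properties" from the last step.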


Evaluation

Does historical information help?
  – can we get the same value by only looking at the latest version of the source code?

Metric
  – are the likely bugs near the top?
  – cumulative precision
      • Precision: number of likely bugs divided by the number of warnings inspected
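As a rough sketch of the metric, cumulative precision after k inspected warnings is the number of likely bugs among the first k warnings divided by k. The C function below assumes the ranked warnings have already been judged by hand; the name `cumulative_precision` and the flag-array representation are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fills out[k] with the cumulative precision after inspecting the first
 * k + 1 ranked warnings: (likely bugs so far) / (warnings inspected so far).
 * is_bug[i] is nonzero when the i-th ranked warning was judged a likely bug.
 */
void cumulative_precision(const int *is_bug, size_t n, double *out)
{
    size_t bugs_so_far = 0;
    assert(n == 0 || (is_bug != NULL && out != NULL));
    for (size_t k = 0; k < n; k++) {
        if (is_bug[k])
            bugs_so_far++;
        out[k] = (double)bugs_so_far / (double)(k + 1);
    }
}
```

A ranking that puts likely bugs near the top keeps this curve high for small k, which is what the metric is meant to reward.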


Return Value Check Bug

Identify functions whose return value induces a code change

    …
    value = foo();
    newPosition += value;    // ???
    …

    …
    value = foo();
    if (value != error_code) {    // Check
        newPosition += value;
    }
    …

Tool Inferred

Bug Fix

Apache Results

Provide developers a list of sorted warnings
  – use historical information for sorting

[Figure: cumulative precision (y-axis, 0 to 1) vs. warnings inspected (x-axis, 1 to 21), comparing Naive Ranking with History-Aware Ranking]

Chi-square = 6.15, p ≤ 0.025


Discovering Function Usage Patterns

Function Usage Pattern
  – describe function invocations with respect to each other
      • static analysis
      • intraprocedural
  – describe relationships between functions
      • implicit rules

    mdi = HeapAlloc(GetProcessHeap());
    if (!mdi)
        HeapFree(GetProcessHeap(), 0, cs);

    HDC hdc = BeginPaint( hwnd, &ps );
    if( hdc )
        DrawIcon( hdc, x, y, hIcon );
    EndPaint( hwnd, &ps );


Goals

Discover valid patterns
  – use data mining techniques to identify patterns

Identify buggy patterns
  – which patterns commonly cause a code change

Find violations of these patterns
  – static analysis
  – use history to rank violations


Mining Changes in Function Usage Patterns

Find new instances of patterns
  – where that instance was not found in the revision immediately prior

This finds a large number of patterns
  – need context to strengthen the ties between the pair of functions
  – Data Flow

Before the commit:

    int foo() {
        open();
    }

After the commit (new instance of the pattern open() -> read()):

    int foo() {
        open();
        read();
    }
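A minimal C sketch of the "new instance" test follows. It assumes each function body has been reduced to an ordered list of callee names and ignores control flow, which is a simplification of the intraprocedural static analysis mentioned earlier. The names `has_call_pair` and `is_new_instance` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns 1 if a call to `first` appears before a call to `second`
 * in the ordered call list of one function body, else 0.
 */
int has_call_pair(const char *const *calls, size_t n,
                  const char *first, const char *second)
{
    int seen_first = 0;
    assert(n == 0 || calls != NULL);
    for (size_t i = 0; i < n; i++) {
        /* Check `second` before marking `first`, so that a single call
         * never pairs with itself. */
        if (seen_first && strcmp(calls[i], second) == 0)
            return 1;
        if (strcmp(calls[i], first) == 0)
            seen_first = 1;
    }
    return 0;
}

/*
 * A pattern instance is new when the current revision of the function
 * contains the pair and the immediately prior revision did not.
 */
int is_new_instance(const char *const *prior, size_t n_prior,
                    const char *const *current, size_t n_current,
                    const char *first, const char *second)
{
    return has_call_pair(current, n_current, first, second) &&
           !has_call_pair(prior, n_prior, first, second);
}
```

For the foo() example on this slide, the prior call list is {"open"} and the current one is {"open", "read"}, so open() -> read() counts as a new instance.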


Data Flow

Identify data flow relationships between function pairs
  – produce/consume
  – use same data
  – update same data

    HDC hdc = BeginPaint( hwnd, &ps );
    if( hdc )
        DrawIcon( hdc, x, y, hIcon );
    EndPaint( hwnd, &ps );

    HDC hdc = BeginPaint( hwnd, &ps );
    if( hdc )
        hdc.x = genX();

Data flow confidence
  – what percent of new instances of foo() -> bar() have a data flow relationship?
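Data flow confidence is a simple ratio, sketched below in C. The sketch assumes an earlier analysis step has already decided, for each new instance of a pattern, whether the two calls are linked by data flow; the struct `new_instance` and the function name are illustrative, not part of the original tool.

```c
#include <assert.h>
#include <stddef.h>

/* One new instance of a pattern foo() -> bar() found in the change history. */
struct new_instance {
    int has_data_flow; /* nonzero if the two calls are linked by data flow */
};

/*
 * Data flow confidence: the fraction of new instances of a pattern whose
 * two calls have a data flow relationship. Returns 0.0 when there are no
 * instances (a convention chosen for this sketch).
 */
double data_flow_confidence(const struct new_instance *instances, size_t n)
{
    size_t linked = 0;
    assert(n == 0 || instances != NULL);
    if (n == 0)
        return 0.0;
    for (size_t i = 0; i < n; i++) {
        if (instances[i].has_data_flow)
            linked++;
    }
    return (double)linked / (double)n;
}
```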


Bug-Prone Patterns

How does a new instance enter the source code?
  – both of the function calls were added
  – one function call was added
      • the added function completed the pairing
      • bug fix? refactoring?

Bug confidence
  – what percent of new instances of foo() -> bar() are created by adding one function call?

    int foo() {
    }

Commit

    int foo() {
        open();
        read();
    }

Commit

    int foo() {
        open();
        read();
        close();
    }

And which function call is most likely to be added?
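Both questions on this slide reduce to ratios over per-pattern counters, as in the C sketch below. The struct layout and the names `pattern_history`, `bug_confidence`, and `second_added_fraction` are assumptions made for illustration; only the two definitions themselves come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* How new instances of one pattern foo() -> bar() entered the code. */
struct pattern_history {
    unsigned both_added;        /* both calls added in the same commit */
    unsigned first_added_only;  /* foo() added where bar() already existed */
    unsigned second_added_only; /* bar() added where foo() already existed */
};

/* Bug confidence: fraction of new instances created by adding one call. */
double bug_confidence(const struct pattern_history *h)
{
    unsigned one_added, total;
    assert(h != NULL);
    one_added = h->first_added_only + h->second_added_only;
    total = one_added + h->both_added;
    return total == 0 ? 0.0 : (double)one_added / (double)total;
}

/* Among instances created by one added call, fraction where it was bar(). */
double second_added_fraction(const struct pattern_history *h)
{
    unsigned one_added;
    assert(h != NULL);
    one_added = h->first_added_only + h->second_added_only;
    return one_added == 0
        ? 0.0
        : (double)h->second_added_only / (double)one_added;
}
```

In the open()/read()/close() example, the second commit adds only close(), so it would increment `second_added_only` for the pattern open() -> close().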


Valid, Bug Prone Patterns

Patterns added completely (both calls in one commit) could indicate valid patterns

Patterns created by adding one function call indicate:
  – refactoring or a very misunderstood pattern
  – random noise

Which are likely to be buggy?

[Figure labels: Two Function Calls Added, One Function Call Added]


Ranking of Violations

Number of violations for each pattern
  – experience from the current code base

Data Flow Confidence
  – which are valid patterns

Bug Confidence
  – which have caused code changes in the past

Confidence
  – how often, when foo() is added, is foo() -> bar() created
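The slides name the ranking inputs but do not give the formula that combines them, so the C sketch below uses a purely hypothetical score: the product of the two confidences, discounted by the number of violations of the pattern. That discount rests on the assumption that a pattern violated in many places is less likely to be a real rule. All names here are illustrative, and the fourth input (Confidence) is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One reported violation together with statistics of the violated pattern. */
struct violation {
    const char *pattern;         /* illustrative label, e.g. "foo->bar" */
    unsigned pattern_violations; /* violations of this pattern in the code */
    double data_flow_conf;       /* data flow confidence of the pattern */
    double bug_conf;             /* bug confidence of the pattern */
};

/* Hypothetical score: NOT the formula used in the talk. */
static double score(const struct violation *v)
{
    return v->data_flow_conf * v->bug_conf /
           (1.0 + (double)v->pattern_violations);
}

/* qsort comparator: higher score sorts first. */
static int by_score_desc(const void *a, const void *b)
{
    double sa = score(a);
    double sb = score(b);
    return (sa < sb) - (sa > sb);
}

/* Sorts the warnings in place so the most suspicious come first. */
void rank_violations(struct violation *vs, size_t n)
{
    assert(n == 0 || vs != NULL);
    qsort(vs, n, sizeof vs[0], by_score_desc);
}
```

Any monotone combination of the same inputs would fit the slide equally well; the point is only that the history-derived confidences decide the order in which warnings reach the developer.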


Preliminary Results

Student Projects
  – CS 3
  – Introduction to C
  – CVS history for each student for each project
      • CVS commit to see automated test results
  – 50% precision on final submission

Apache web server
  – 50% precision rate for the top 10 warnings
  – identified a refactoring

Wine

    TREEVIEW_ValidItem(tree,item);
    TREEVIEW_SendTreeviewNotify(tree,command,item);


Apache Case Study

1,129 C source files
  – includes modules
  – Apache Portable Runtime

41,000 CVS commits
  – 6,000 compilable CVS transactions that change source files for the Linux version

Studied httpd-2.0 branch
  – July 1996 through Oct 2003
  – some files have history back through 1.0 branch


Apache Refactoring

Found many patterns of this form:

    Function 1            Function 2      Bug Confidence   Add Second Function
    shmcb_get_safe_uint   ap_log_error    1.0              1.0
    ssl_util_vhostid      ap_log_error    0.8              1.0

  – Bug Confidence: how often is this pattern created by adding exactly one function call
  – Add Second Function: how often, when one function call is added to create this pattern, is it the second function call

Change debug logging
  – previously printf
  – now ap_log_error or ap_log_rerror

Commit log, Thu Nov 18 23:07:53 1999 UTC (6 years, 3 months ago):

    … I then changed all the fprintf(stderr calls to ap_log_error …


Can we find bugs?

Static analysis to identify violations of ap_log_error patterns
  – 16 of first 20 warnings are likely bugs
      • first 20 warnings involving ap_log_error
  – ranking based on
      • violations per pattern
      • bug confidence
      • data flow confidence

Why do these bugs exist?
  – missed refactorings
  – bugs caused by not knowing implicit rules

This refactoring started in 1999


Conclusions

Interesting properties can be mined from change history
  – function usage patterns

Using historical information has improved static analysis tools
  – provide a list of ranked warnings to user
  – reduced false positive rate