Introducing Natural Language Program Analysis Lori Pollock, K. Vijay-Shanker, David Shepherd, Emily Hill, Zachary P. Fry, Kishen Maloor



NLPA Research Team Leaders

• Lori Pollock, “Team Captain”
• K. Vijay-Shanker, “The Umpire”

Problem: Modern software is large and complex

[Figure: object-oriented class hierarchy]

Software development tools are needed

Successes in Software Development Tools

[Figure: object-oriented class hierarchy]

Good with local tasks

Good with traditional structure

Issues in Software Development Tools

[Figure: object-oriented class hierarchy]

Scattered tasks are difficult

Programmers use more than traditional program structure

Observations in Software Development Tools

[Figure: object-oriented system. Code shown: public interface Storable{... and public void Circle.save() with the comment //Store the fields in a file.... Concerns shown: activate tool, save drawing, update drawing, undo action]

Key Insight: Programmers leave natural language clues that can benefit software development tools

Studies on choosing identifiers

Impact of human cognition on names [Liblit et al. PPIG 06]
• Metaphors, morphology, scope, part of speech hints
• Hints for understanding code

Analysis of Function identifiers [Caprile and Tonella WCRE 99]
• Lexical, syntactic, semantic
• Use for software tools: metrics, traceability, program understanding

Carla, the compiler writer: “I don’t care about names.”

Pete, the programmer: “So, I could use x, y, z. But no one will understand my code.”

Our Research Path

• Motivated usefulness of exploiting natural language (NL) clues in tools [MACS 05, LATE 05]
• Developed extraction process and an NL-based program representation [AOSD 06]
• Created and evaluated a concern location tool and an aspect miner with NL-based analysis [ASE 05, AOSD 07, PASTE 07]

Name: David C Shepherd
Nickname: Leadoff Hitter
Current Position: PhD May 30, 2007
Future Position: Postdoc, Gail Murphy

Stats
Year   coffees/day   redmarks/paper draft
2002   0.1           500
2007   2.2           100

Applying NL Clues for Aspect Mining

Aspect-Oriented Programming

Aspect Mining Task: Locate refactoring candidates

Molly, the Maintainer: “How can I fix Paul’s atrocious code?”

Timna: An Aspect Mining Framework [ASE 05]

• Uses program analysis clues for mining
• Combines clues using machine learning
• Evaluated vs. Fan-in
• Precision (quality) and Recall (completeness)

          P    R
Fan-In    37   2
Timna     62   60

Integrating NL Clues into Timna

iTimna (Timna with NL)
• Integrates natural language clues
• Example: Opposite verbs (open and close)

          P    R
Fan-In    37   2
Timna     62   60
iTimna    81   73

Natural language information increases the effectiveness of Timna [Come back Thurs 10:05am]

Applying NL Clues for Concern Location

Motivation

60-90% of software costs are spent on reading and navigating code for maintenance* (fixing bugs, adding features, etc.)

*[Erlikh] Leveraging Legacy System Dollars for E-Business

Key Challenge: Concern Location

Find, collect, and understand all source code related to a particular concept

Concerns are often crosscutting

State of the Art for Concern Location

Mining Dynamic Information [Wilde ICSM 00]

Program Structure Navigation [Robillard FSE 05, FEAT, Schaefer ICSM 05]

Search-Based Approaches
• RegEx [grep, Aspect Mining Tool 00]
• LSA-Based [Marcus 04]
• Word-Frequency Based [GES 06]

Reduced to similar problem

Slow

Fast

Fragile

Sensitive

No Semantics

Limitations of Search Techniques

1. Return large result sets

2. Return irrelevant results

3. Return hard-to-interpret result sets

The Find-Concept Approach

[Figure labels: concept, concrete query, recommendations, Find-Concept, source code, NL-based code representation, result graph (Method a, Method b, Method c, Method d, Method e)]

Natural Language Information
1. More effective search
2. Improved search terms
3. Understandable results

Underlying Program Analysis

Action-Oriented Identifier Graph (AOIG) [AOSD 06]
• Provides access to NL information
• Provides interface between NL and traditional

Word Recommendation Algorithm
• NL-based
  • Stemmed/Rooted: complete, completing
  • Synonym: finish, complete
• Combining NL and Traditional
  • Co-location: completeWord()
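The three recommendation sources listed above (shared stem, synonym, co-location inside an identifier) can be illustrated with a small Java sketch. This is not the Find-Concept implementation: the class name `WordRecommender`, the crude suffix-stripping `stem` method, the two-entry synonym table, and the sample identifiers `completingTask` and `finishJob` are all assumptions made for illustration. Only `completeWord`, `complete`, `completing`, and `finish` come from the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of NL-based word recommendation; not the authors' tool.
class WordRecommender {
    // Tiny hand-made synonym table (assumption; a real tool would consult a thesaurus).
    static final Map<String, List<String>> SYNONYMS = Map.of(
        "complete", List.of("finish"),
        "finish", List.of("complete"));

    // Crude suffix stripping that stands in for a real stemmer.
    static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : new String[] {"ing", "ed", "es", "e", "s"}) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Split a camelCase identifier such as completeWord into lower-case words.
    static List<String> split(String identifier) {
        List<String> words = new ArrayList<>();
        for (String part : identifier.split("(?<=[a-z0-9])(?=[A-Z])")) {
            words.add(part.toLowerCase());
        }
        return words;
    }

    // Recommend synonyms, words sharing the query's stem, and co-located words.
    static Set<String> recommend(String queryWord, List<String> identifiers) {
        Set<String> result = new TreeSet<>();
        String query = queryWord.toLowerCase();
        String queryStem = stem(query);
        result.addAll(SYNONYMS.getOrDefault(query, List.of()));
        for (String identifier : identifiers) {
            List<String> words = split(identifier);
            boolean mentionsQuery = false;
            for (String w : words) {
                if (stem(w).equals(queryStem)) {
                    mentionsQuery = true;
                    if (!w.equals(query)) {
                        result.add(w); // stemmed/rooted variant
                    }
                }
            }
            if (mentionsQuery) {
                for (String w : words) {
                    if (!stem(w).equals(queryStem)) {
                        result.add(w); // co-located word
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(recommend("complete",
            List.of("completeWord", "completingTask", "finishJob")));
    }
}
```

In this sketch the co-location step only looks inside a single identifier; a representation such as the AOIG could supply co-locations across a whole program.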

Experimental Evaluation

Research Questions
• Which search tool is most effective at forming and executing a query for concern location?
• Which search tool requires the least human effort to form an effective query?

Methodology: 18 developers complete nine concern location tasks on medium-sized (>20KLOC) programs

Measures: Precision (quality), Recall (completeness), F-Measure (combination of both P & R)

Search tools: Find Concept, GES, ELex
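The three measures can be computed from the set of methods a search returns and the set of methods that belong to the concern. The sketch below assumes the balanced F-measure (harmonic mean of P and R); the slide only says the F-Measure combines both, so that choice, the class name `SearchMeasures`, and the method names in `main` are assumptions.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for scoring one query result against a known concern.
class SearchMeasures {
    // Number of returned methods that are actually relevant.
    static int overlap(Set<String> returned, Set<String> relevant) {
        Set<String> both = new HashSet<>(returned);
        both.retainAll(relevant);
        return both.size();
    }

    // Precision (quality): fraction of returned methods that are relevant.
    static double precision(Set<String> returned, Set<String> relevant) {
        return returned.isEmpty() ? 0.0 : (double) overlap(returned, relevant) / returned.size();
    }

    // Recall (completeness): fraction of relevant methods that were returned.
    static double recall(Set<String> returned, Set<String> relevant) {
        return relevant.isEmpty() ? 0.0 : (double) overlap(returned, relevant) / relevant.size();
    }

    // Balanced F-measure (assumed): harmonic mean of precision and recall.
    static double fMeasure(double p, double r) {
        return (p + r == 0.0) ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        Set<String> returned = Set.of("a", "b", "c", "d");
        Set<String> relevant = Set.of("a", "b", "e");
        double p = precision(returned, relevant);
        double r = recall(returned, relevant);
        System.out.println("P=" + p + " R=" + r + " F=" + fMeasure(p, r));
    }
}
```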

Overall Results

Across all tasks

Effectiveness
• FC > ELex with statistical significance
• FC >= GES on 7/9 tasks
• FC is more consistent than GES

Effort
• FC = ELex = GES

FC is more consistent and more effective in experimental study without requiring more effort

Natural Language Extraction from Source Code

Key Challenges:
• Decode name usage
• Develop automatic extraction process
• Create NL-based program representation

Molly, the Maintainer: “What was Pete thinking when he wrote this code?”

Natural Language: Which Clues to Use?

Software Maintenance
• Typically focused on actions
• Objects are well-modularized

[Figure: Maintenance Requests]

Focus on actions
• Correspond to verbs
• Verbs need Direct Object (DO)

Extract verb-DO pairs

Extracting Verb-DO Pairs

Two types of extraction:
• Extraction from comments
• Extraction from method signatures

    class Player {
        /**
         * Play a specified file with specified time interval
         */
        public static boolean play(final File file, final float fPosition, final long length) {
            fCurrent = file;
            try {
                playerImpl = null;
                //make sure to stop non-fading players
                stop(false);
                //Choose the player
                Class cPlayer = file.getTrack().getType().getPlayerImpl();
                …
    }

Extracting Clues from Signatures

    public UserList getUserListFromFile( String path ) throws IOException {
        try {
            File tmpFile = new File( path );
            return parseFile(tmpFile);
        } catch( java.io.IOException e ) {
            throw new IOException( "UserList format issue" + path + " file " + e );
        }
    }

1. POS Tag Method Name
2. Chunk Method Name
3. Identify Verb and Direct-Object (DO)

POS Tag: get<verb> User<adj> List<noun> From<prep> File<noun>

Chunk: get<verb phrase> User List<noun phrase> From File<prep phrase>
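As a rough illustration of the three steps on `getUserListFromFile`, the Java sketch below splits the method name into words, treats a leading word as the verb if it appears in a small verb list, and takes the words up to the first preposition as the direct object. A real extractor would use a part-of-speech tagger and chunker; the class name `SignatureClues` and the hand-made `VERBS` and `PREPOSITIONS` lists are assumptions that stand in for those tools.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: word lists replace a real POS tagger and chunker.
class SignatureClues {
    static final Set<String> VERBS = Set.of("get", "parse", "remove", "add", "play");
    static final Set<String> PREPOSITIONS = Set.of("from", "to", "in", "on", "with");

    // Split a camelCase method name into its words, keeping the original case.
    static List<String> splitCamelCase(String name) {
        List<String> words = new ArrayList<>();
        for (String part : name.split("(?<=[a-z0-9])(?=[A-Z])")) {
            if (!part.isEmpty()) {
                words.add(part);
            }
        }
        return words;
    }

    // Returns {verb, directObject}, or null when the name does not start with a known verb.
    static String[] extractVerbDO(String methodName) {
        List<String> words = splitCamelCase(methodName);
        if (words.isEmpty() || !VERBS.contains(words.get(0).toLowerCase())) {
            return null;
        }
        StringBuilder directObject = new StringBuilder();
        for (int i = 1; i < words.size(); i++) {
            if (PREPOSITIONS.contains(words.get(i).toLowerCase())) {
                break; // the prepositional phrase is not part of the direct object
            }
            if (directObject.length() > 0) {
                directObject.append(' ');
            }
            directObject.append(words.get(i));
        }
        return new String[] {words.get(0).toLowerCase(), directObject.toString()};
    }

    public static void main(String[] args) {
        String[] pair = extractVerbDO("getUserListFromFile");
        System.out.println("<" + pair[0] + ", " + pair[1] + ">");
    }
}
```

The sketch covers only names that begin with a verb; names such as `mouseDragged` need further rules.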

Name: Zak Fry
Nickname: The Rookie
Current Position: Upcoming senior
Future Position: Graduate School

Stats
Year   diet cokes/day   lab days/week
2006   1                2
2007   6                8

Developing rules for extraction

For many methods:
• Identify relevant verb (V) and direct object (DO) in method signature
• Classify pattern of V and DO locations
• If new pattern, create new extraction rule

[Figure: verb and DO positions in example method signatures]

Our Current Extraction Rules

4 general rules with subcategories:

Rule                Example               DO        Verb
Left Verb           URL parseURL()        URL       parse
Right Verb          void mouseDragged()   mouse     dragged
Generic Verb        void Host.onSaved()   host      saved
Unidentified Verb   void message()        message   -
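A minimal Java sketch of how the four general rules might be told apart, using the four slide examples. The decision order, the `VERBS` list, and the `GENERIC_PREFIXES` list are assumptions for illustration, as is the class name `ExtractionRules`. The actual rules and their subcategories also inspect parameters, return types, and the type hierarchy, which this sketch ignores.

```java
import java.util.Arrays;
import java.util.Set;

// Hypothetical classifier for the four general extraction rules; word lists are assumptions.
class ExtractionRules {
    static final Set<String> VERBS = Set.of("parse", "dragged", "saved", "remove", "play", "add");
    static final Set<String> GENERIC_PREFIXES = Set.of("on", "do", "handle");

    // Join words[from..to) with single spaces.
    static String join(String[] words, int from, int to) {
        return String.join(" ", Arrays.copyOfRange(words, from, to));
    }

    // Returns {rule, verb, directObject} for a method declared in declaringClass.
    static String[] classify(String declaringClass, String methodName) {
        String[] w = methodName.split("(?<=[a-z0-9])(?=[A-Z])");
        String first = w[0].toLowerCase();
        String last = w[w.length - 1].toLowerCase();
        if (w.length > 1 && GENERIC_PREFIXES.contains(first)) {
            // Generic verb: the action follows the prefix, the DO is the declaring class.
            return new String[] {"Generic Verb",
                join(w, 1, w.length).toLowerCase(), declaringClass.toLowerCase()};
        }
        if (VERBS.contains(first)) {
            return new String[] {"Left Verb", first, join(w, 1, w.length)};
        }
        if (w.length > 1 && VERBS.contains(last)) {
            return new String[] {"Right Verb", last, join(w, 0, w.length - 1)};
        }
        // No verb found: keep the name as the DO and mark the verb as missing.
        return new String[] {"Unidentified Verb", "-", methodName};
    }

    public static void main(String[] args) {
        String[][] examples = {{"", "parseURL"}, {"", "mouseDragged"}, {"Host", "onSaved"}, {"", "message"}};
        for (String[] e : examples) {
            System.out.println(String.join(" | ", classify(e[0], e[1])));
        }
    }
}
```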

Example: Sub-Categories for Left-Verb General Rule

Look beyond the method name: Parameters, Return type, Declaring class name, Type hierarchy

Subcategory
1) Standard left verb
2) No DO in method name; has parameters; non object return type
3) No DO in method name; no parameters; no return type
4) Creational left verb; has return type
5) No DO in method name; has parameters; return type is more specific than parameters in type hierarchy
6) No DO in method name; parameters are more specific than parameters in type hierarchy

Example (Left Verb, subcategory 2): No DO in method name; has parameters; non object return type

Verb-DO pair: <remove, UserID>

Representing Verb-DO Pairs

Action-Oriented Identifier Graph (AOIG)

[Figure: AOIG with verb nodes (verb1, verb2, verb3), DO nodes (DO1, DO2, DO3), verb-DO pair nodes (<verb1, DO1>, <verb1, DO2>, <verb3, DO2>, <verb2, DO3>), and "use" edges linking the graph to source code files]

[Figure: example AOIG with verb nodes (play, add, remove), DO nodes (file, playlist, listener), verb-DO pair nodes (<play, file>, <play, playlist>, <remove, playlist>, <add, listener>), and "use" edges linking the graph to source code files]
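One way to picture the AOIG as a data structure is a two-level map from verb to direct object to the source locations that use the pair. The Java sketch below is an assumption about a convenient in-memory shape, not the representation described in [AOSD 06]. The class name `Aoig` and the location strings such as `Player.play` are hypothetical, while the four verb-DO pairs come from the example graph.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory sketch of an Action-Oriented Identifier Graph.
class Aoig {
    // verb -> direct object -> source locations that use the pair
    private final Map<String, Map<String, List<String>>> uses = new TreeMap<>();

    // Record that a source location uses the pair <verb, directObject>.
    void addUse(String verb, String directObject, String location) {
        uses.computeIfAbsent(verb, v -> new TreeMap<>())
            .computeIfAbsent(directObject, d -> new ArrayList<>())
            .add(location);
    }

    // Source locations attached to one verb-DO pair.
    List<String> usesOf(String verb, String directObject) {
        return uses.getOrDefault(verb, Map.of()).getOrDefault(directObject, List.of());
    }

    // All direct objects a verb acts on.
    Set<String> directObjectsOf(String verb) {
        return uses.getOrDefault(verb, Map.of()).keySet();
    }

    // All verbs applied to a direct object.
    Set<String> verbsOf(String directObject) {
        Set<String> verbs = new TreeSet<>();
        for (Map.Entry<String, Map<String, List<String>>> e : uses.entrySet()) {
            if (e.getValue().containsKey(directObject)) {
                verbs.add(e.getKey());
            }
        }
        return verbs;
    }

    // Builds the example graph; the location names are made up for illustration.
    static Aoig example() {
        Aoig graph = new Aoig();
        graph.addUse("play", "file", "Player.play");
        graph.addUse("play", "playlist", "Playlist.play");
        graph.addUse("remove", "playlist", "PlaylistManager.removePlaylist");
        graph.addUse("add", "listener", "Player.addListener");
        return graph;
    }

    public static void main(String[] args) {
        Aoig graph = example();
        System.out.println(graph.directObjectsOf("play"));
        System.out.println(graph.verbsOf("playlist"));
    }
}
```

Looking up by verb or by direct object is what lets a search tool move from an action word in a query to the code that performs it.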

Evaluation of Extraction Process

Compare automatic vs ideal (human) extraction
• 300 methods from 6 medium open source programs
• Annotated by 3 Java developers

Promising Results
• Precision: 57%
• Recall: 64%

Context of Results
• Did not analyze trivial methods
• On average, at least verb OR direct object obtained

Name: Emily Gibson Hill
Nickname: Batter on Deck
Current Position: 2nd year PhD Student
Future Position: PhD Candidate

Stats
Year   cokes/day   meetings/week
2003   0.2         1
2007   2           5

Program Exploration

Purpose: Expedite software maintenance and program comprehension

Key Insight: Automated tools can use program structure and identifier names to save the developer time and effort

Ongoing work:

Dora the Program Explorer*

* Dora comes from exploradora, the Spanish word for a female explorer.

Dora

Natural Language Query
• Maintenance request
• Expert knowledge
• Query expansion

Program Structure
• Representation
  • Current: call graph
• Seed starting point

Relevant Neighborhood
• Subgraph relevant to query

State of the Art in Exploration

Structural (dependence, inheritance)
• Slicing
• Suade [Robillard 2005]

Lexical (identifier names, comments)
• Regular expressions: grep, Eclipse search
• Information Retrieval: FindConcept, Google Eclipse Search [Poshyvanyk 2006]

Example Scenario: Motivating need for structural and lexical information

Program: JBidWatcher, an eBay auction sniping program

Bug: User-triggered add auction event has no effect

Task: Locate code related to ‘add auction’ trigger

Seed: DoAction() method, from prior knowledge

Using only structural information

Looking for: ‘add auction’ trigger

DoAction() has 38 callees, only 2/38 are relevant
• Relevant Methods: DoAdd(), DoPasteFromClipboard()
• Irrelevant Methods: the remaining callees

Locates locally relevant items, but many irrelevant

And what if you wanted to explore more than one edge away?

Using only lexical information

50/1812 methods contain matches to ‘add*auction’ regular expression query

Only 2/50 are relevant

Locates globally relevant items, but many irrelevant

Looking for: ‘add auction’ trigger


Combining Structural & Lexical Information

Looking for: ‘add auction’ trigger

• Structural: guides exploration from seed
• Lexical: prunes irrelevant edges

[Figure: relevant neighborhood of DoAction(): DoAdd(), DoPasteFromClipboard()]

The Dora Approach

Prune irrelevant structural edges from seed
• Determine method relevance to query
  • Calculate lexical-based relevance score
• Low-scored methods pruned from neighborhood
• Recursively explore

Calculating Relevance Score: Term Frequency

Query: ‘add auction’

Score based on query term frequency of the method

[Figure: two example methods, one with 6 query term occurrences and one with only 2 occurrences]

Calculating Relevance Score: Location Weights

Query: ‘add auction’

Weigh term frequency based on location:
• Method name more important than body
• Method body statements normalized by length
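The two ideas above, query term frequency and location weights, can be combined in a short Java sketch. The slides do not give Dora's actual formula, so the weights `NAME_WEIGHT` and `BODY_WEIGHT`, the way the two parts are added, the class name `RelevanceScore`, and the method bodies used in `main` are all assumptions. The sketch normalizes body frequency by the number of words in the body, as one reading of "normalized by length".

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical relevance score; weights and formula are illustrative assumptions.
class RelevanceScore {
    static final double NAME_WEIGHT = 0.7; // method name counts more than the body
    static final double BODY_WEIGHT = 0.3;

    // Break text into lower-case words, splitting identifiers on camelCase boundaries.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.split("[^A-Za-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            for (String part : token.split("(?<=[a-z0-9])(?=[A-Z])")) {
                out.add(part.toLowerCase());
            }
        }
        return out;
    }

    // Fraction of words that are query terms (term frequency normalized by length).
    static double frequency(List<String> words, Set<String> queryTerms) {
        if (words.isEmpty()) {
            return 0.0;
        }
        long hits = words.stream().filter(queryTerms::contains).count();
        return (double) hits / words.size();
    }

    // Weighted sum of query term frequency in the method name and in the body.
    static double score(String methodName, String body, String query) {
        Set<String> terms = new HashSet<>(words(query));
        return NAME_WEIGHT * frequency(words(methodName), terms)
             + BODY_WEIGHT * frequency(words(body), terms);
    }

    public static void main(String[] args) {
        System.out.println(score("DoAdd", "addAuction(entry); refresh();", "add auction"));
        System.out.println(score("DoSave", "saveAuction(entry);", "add auction"));
    }
}
```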

Dora explores ‘add auction’ trigger

From DoAction() seed:

Correctly identified at 0.5 threshold
• DoAdd() (0.93)
• DoPasteFromClipboard() (0.60)

With only one false positive
• DoSave() (0.52)
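The exploration step can be sketched in Java as a recursive walk over the call graph that keeps only callees whose relevance score reaches the threshold. The three scores and the 0.5 threshold are the ones reported above. The class name `NeighborhoodExplorer`, the map-based call graph, and the 0.10 score given to the placeholder `DoNada` are assumptions for illustration.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of threshold-pruned call graph exploration from a seed method.
class NeighborhoodExplorer {
    // Returns the seed plus every method reachable through callees that pass the threshold.
    static Set<String> explore(String seed, Map<String, List<String>> callGraph,
                               Map<String, Double> scores, double threshold) {
        Set<String> neighborhood = new LinkedHashSet<>();
        neighborhood.add(seed);
        visit(seed, callGraph, scores, threshold, neighborhood);
        return neighborhood;
    }

    private static void visit(String method, Map<String, List<String>> callGraph,
                              Map<String, Double> scores, double threshold,
                              Set<String> neighborhood) {
        for (String callee : callGraph.getOrDefault(method, List.of())) {
            double score = scores.getOrDefault(callee, 0.0);
            // Prune low-scored callees; recurse only into newly added relevant ones.
            if (score >= threshold && neighborhood.add(callee)) {
                visit(callee, callGraph, scores, threshold, neighborhood);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> callGraph = Map.of(
            "DoAction", List.of("DoAdd", "DoPasteFromClipboard", "DoSave", "DoNada"));
        Map<String, Double> scores = Map.of(
            "DoAdd", 0.93, "DoPasteFromClipboard", 0.60, "DoSave", 0.52,
            "DoNada", 0.10); // hypothetical score for an irrelevant callee
        System.out.println(explore("DoAction", callGraph, scores, 0.5));
    }
}
```

Because `neighborhood.add` returns false for methods already seen, the walk also terminates on recursive call chains.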

Summary

NL technology used
• Synonyms, collocations, morphology, word frequencies, part-of-speech tagging, AOIG

Evaluation indicates
• Natural language information shows promise for improving software development tools

Key to success
• Accurate extraction of NL clues

Our Current and Future Work

Basic NL-based tools for software
• Abbreviation expander
• Program synonyms
• Determining relative importance of words

Integrating information retrieval techniques

Posed Questions for Discussion

What open problems faced by software tool developers can be mitigated by NLPA?

Under what circumstances is NLPA not useful?