Page 1: Introducing Natural Language Program Analysis

Introducing Natural Language Program Analysis

Lori Pollock, K. Vijay-Shanker, David Shepherd,

Emily Hill, Zachary P. Fry, Kishen Maloor

Page 2: Introducing Natural Language Program Analysis

NLPA Research Team Leaders

Lori Pollock, "Team Captain"

K. Vijay-Shanker, "The Umpire"

Page 3: Introducing Natural Language Program Analysis

Problem

Modern software is large and complex

object oriented class hierarchy

Software development tools are needed

Page 4: Introducing Natural Language Program Analysis

Successes in Software Development Tools

object oriented class hierarchy

Good with local tasks

Good with traditional structure

Page 5: Introducing Natural Language Program Analysis

object oriented class hierarchy

Scattered tasks are difficult

Programmers use more than traditional program structure

Issues in Software Development Tools

Page 6: Introducing Natural Language Program Analysis

public interface Storable{...

activate tool

save drawing

update drawing

undo action

public void Circle.save()

//Store the fields in a file....

object oriented system

Key Insight: Programmers leave natural language clues that can benefit software development tools

Observations in Software Development Tools

Page 7: Introducing Natural Language Program Analysis

Studies on choosing identifiers

Impact of human cognition on names [Liblit et al. PPIG 06]: metaphors, morphology, scope, part-of-speech hints; hints for understanding code

Analysis of function identifiers [Caprile and Tonella WCRE 99]: lexical, syntactic, semantic; uses for software tools: metrics, traceability, program understanding

Carla, the compiler writer: "I don't care about names."

Pete, the programmer: "So, I could use x, y, z. But no one will understand my code."

Page 8: Introducing Natural Language Program Analysis

Our Research Path

Motivated usefulness of exploiting natural language (NL) clues in tools [MACS 05, LATE 05]

Developed extraction process and an NL-based program representation [AOSD 06]

Created and evaluated a concern location tool and an aspect miner with NL-based analysis [ASE 05, AOSD 07, PASTE 07]

Page 9: Introducing Natural Language Program Analysis


Name: David C. Shepherd
Nickname: Leadoff Hitter
Current Position: PhD, May 30, 2007
Future Position: Postdoc, Gail Murphy

Stats
Year    coffees/day    red marks/paper draft
2002    0.1            500
2007    2.2            100

Page 10: Introducing Natural Language Program Analysis

Applying NL Clues for Aspect Mining

Aspect-Oriented Programming

Aspect Mining Task: Locate refactoring candidates

Molly, the Maintainer: "How can I fix Paul's atrocious code?"

Page 11: Introducing Natural Language Program Analysis

Timna: An Aspect Mining Framework [ASE 05]

Uses program analysis clues for mining
Combines clues using machine learning
Evaluated vs. Fan-in: Precision (quality) and Recall (completeness)

          P     R
Fan-In    37     2
Timna     62    60

Page 12: Introducing Natural Language Program Analysis

iTimna (Timna with NL)
  Integrates natural language clues
  Example: opposite verbs (open and close)

          P     R
Fan-In    37     2
Timna     62    60
iTimna    81    73

Integrating NL Clues into Timna

Natural language information increases the effectiveness of Timna [Come back Thurs 10:05am]

Page 13: Introducing Natural Language Program Analysis

Applying NL Clues for Concern Location: Motivation

60-90% software costs spent on reading and navigating code for maintenance*

(fixing bugs, adding features, etc.)

*[Erlikh] Leveraging Legacy System Dollars for E-Business


Page 14: Introducing Natural Language Program Analysis

Key Challenge: Concern Location

Find, collect, and understand all source code related to a particular concept

Concerns are often crosscutting

Page 15: Introducing Natural Language Program Analysis

State of the Art for Concern Location

Mining Dynamic Information [Wilde ICSM 00]

Program Structure Navigation [Robillard FSE 05, FEAT, Schaefer ICSM 05]

Search-Based Approaches: RegEx [grep, Aspect Mining Tool 00]

LSA-Based [Marcus 04]

Word-Frequency Based [GES 06]

[Slide annotations on these approaches: reduced to a similar problem, slow, fast, fragile, sensitive, no semantics]

Page 16: Introducing Natural Language Program Analysis

Limitations of Search Techniques

1. Return large result sets

2. Return irrelevant results

3. Return hard-to-interpret result sets

Page 17: Introducing Natural Language Program Analysis

The Find-Concept Approach

[Diagram: the user's concept is turned into a concrete query for Find-Concept, which combines natural language information with an NL-based code representation of the source code (methods a-e) to produce recommendations and a result graph]

1. More effective search

2. Improved search terms

3. Understandable results

Page 18: Introducing Natural Language Program Analysis

Underlying Program Analysis

Action-Oriented Identifier Graph (AOIG) [AOSD 06]
  Provides access to NL information
  Provides interface between NL and traditional program structure

Word Recommendation Algorithm
  NL-based
    Stemmed/rooted: complete, completing
    Synonym: finish, complete
  Combining NL and traditional
    Co-location: completeWord()
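To make the expansion step concrete, here is a minimal sketch of the word-recommendation idea in Java. The tiny synonym table and the naive suffix stemmer are illustrative stand-ins chosen for this sketch; the slides do not specify the actual NLP components used.

import java.util.*;

// Minimal sketch of the word-recommendation idea: expand a query word into
// related words via a stemmer, a synonym table, and co-location in identifiers.
// The suffix stemmer and the tiny synonym map below are illustrative stand-ins,
// not the components of the actual tool.
public class WordRecommender {

    // Synonym table keyed by stemmed form (illustrative entries only).
    private static final Map<String, List<String>> SYNONYMS =
            Map.of("complet", List.of("finish"), "finish", List.of("complete"));

    // Naive stemmer: strip a few common suffixes so "complete"/"completing" agree.
    static String stem(String word) {
        for (String suffix : List.of("ing", "ed", "es", "e", "s")) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Recommend related words for a query term, given the identifiers in the code.
    static Set<String> recommend(String query, List<String> identifiers) {
        Set<String> result = new TreeSet<>();
        String root = stem(query.toLowerCase());
        result.addAll(SYNONYMS.getOrDefault(root, List.of()));          // synonyms
        for (String id : identifiers) {
            for (String part : id.split("(?<=[a-z])(?=[A-Z])")) {       // co-location: split camelCase
                if (stem(part.toLowerCase()).equals(root)) {
                    result.add(id);                                     // identifier that uses the word
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Query "completing" recommends the synonym "finish" and the co-located identifier "completeWord".
        System.out.println(recommend("completing", List.of("completeWord", "saveFile")));
    }
}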

Page 19: Introducing Natural Language Program Analysis

Experimental Evaluation

Research Questions
  Which search tool is most effective at forming and executing a query for concern location?
  Which search tool requires the least human effort to form an effective query?

Methodology: 18 developers complete nine concern location tasks on medium-sized (>20KLOC) programs

Measures: Precision (quality), Recall (completeness), F-Measure (combination of both P & R)

Tools compared: Find-Concept, GES, ELex

Page 20: Introducing Natural Language Program Analysis

Overall Results

Effectiveness (across all tasks)
  FC > ELex with statistical significance
  FC >= GES on 7/9 tasks
  FC is more consistent than GES

Effort
  FC = ELex = GES

FC is more consistent and more effective in the experimental study without requiring more effort

Page 21: Introducing Natural Language Program Analysis

Natural Language Extraction from Source Code

Key Challenges:
  Decode name usage
  Develop automatic extraction process
  Create NL-based program representation

Molly, the Maintainer: "What was Pete thinking when he wrote this code?"

Page 22: Introducing Natural Language Program Analysis

Natural Language: Which Clues to Use?

Software Maintenance
  Typically focused on actions
  Objects are well-modularized

Maintenance Requests

Page 23: Introducing Natural Language Program Analysis

Natural Language: Which Clues to Use?

Software Maintenance
  Typically focused on actions
  Objects are well-modularized

Focus on actions
  Correspond to verbs
  Verbs need a Direct Object (DO)

Extract verb-DO pairs

Page 24: Introducing Natural Language Program Analysis

Extracting Verb-DO Pairs

Two types of extraction:

class Player {
  /**
   * Play a specified file with specified time interval
   */
  public static boolean play(final File file, final float fPosition, final long length) {
    fCurrent = file;
    try {
      playerImpl = null;
      // make sure to stop non-fading players
      stop(false);
      // Choose the player
      Class cPlayer = file.getTrack().getType().getPlayerImpl();
      ...
}

Extraction from comments

Extraction from method signatures
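As a rough illustration of the first type (extraction from comments), the sketch below uses a deliberately naive heuristic: take the first word of the Javadoc summary as the verb and the first non-determiner word after it as the direct object. A real extractor would use a part-of-speech tagger; everything here is an assumption for illustration only.

// Naive sketch of comment-based extraction: treat the first word of the Javadoc
// summary as the verb and pick a likely direct object after it. A real extractor
// would use a part-of-speech tagger; this heuristic is only illustrative.
public class CommentExtractor {
    static String[] verbDoFromComment(String summary) {
        // e.g. "Play a specified file with specified time interval"
        String[] words = summary.trim().toLowerCase().split("\\s+");
        String verb = words.length > 0 ? words[0] : "";
        String directObject = "";
        for (int i = 1; i < words.length; i++) {
            String w = words[i];
            // skip determiners/adjectives we know about; stop at the first other word
            if (!w.matches("a|an|the|specified|new|given")) {
                directObject = w;
                break;
            }
        }
        return new String[] { verb, directObject };
    }

    public static void main(String[] args) {
        String[] pair = verbDoFromComment("Play a specified file with specified time interval");
        System.out.println("<" + pair[0] + ", " + pair[1] + ">");  // prints <play, file>
    }
}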

Page 25: Introducing Natural Language Program Analysis

public UserList getUserListFromFile( String path ) throws IOException {

try {

File tmpFile = new File( path );

return parseFile(tmpFile);

} catch( java.io.IOException e ) {

throw new IOException( "UserList format issue " + path + " file " + e );

}

}

Extracting Clues from Signatures

1. POS Tag Method Name

2. Chunk Method Name

3. Identify Verb and Direct-Object (DO)

POS Tag:  get<verb> User<adj> List<noun> From<prep> File<noun>

Chunk:    get<verb phrase> User List<noun phrase> From File<prep phrase>
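A minimal sketch of the signature side, under simplifying assumptions: split the name at camelCase boundaries, treat the first word as the verb (a left-verb pattern), and take the words before the first preposition as the direct object. No real POS tagger or chunker is used here, so this only approximates the three-step process above.

import java.util.*;

// Minimal sketch of signature-based extraction under simplifying assumptions:
// split the method name at camelCase boundaries, assume the first word is the
// verb (left-verb pattern), and take the words before the first preposition as
// the direct object. A real extractor would POS-tag and chunk the name.
public class SignatureExtractor {
    private static final Set<String> PREPOSITIONS =
            new HashSet<>(Arrays.asList("from", "to", "with", "for", "by", "on", "in"));

    static String[] verbDoFromName(String methodName) {
        String[] words = methodName.split("(?<=[a-z])(?=[A-Z])");
        String verb = words[0].toLowerCase();
        StringBuilder directObject = new StringBuilder();
        for (int i = 1; i < words.length; i++) {
            if (PREPOSITIONS.contains(words[i].toLowerCase())) break;  // stop at "From"
            directObject.append(words[i]);
        }
        return new String[] { verb, directObject.toString() };
    }

    public static void main(String[] args) {
        String[] pair = verbDoFromName("getUserListFromFile");
        System.out.println("<" + pair[0] + ", " + pair[1] + ">");  // prints <get, UserList>
    }
}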

Page 26: Introducing Natural Language Program Analysis


Name: Zak Fry
Nickname: The Rookie
Current Position: Upcoming senior
Future Position: Graduate School

Stats
Year    diet cokes/day    lab days/week
2006    1                 2
2007    6                 8

Page 27: Introducing Natural Language Program Analysis

Developing rules for extraction

For many methods:
  Identify relevant verb (V) and direct object (DO) in method signature
  Classify pattern of V and DO locations
  If new pattern, create new extraction rule


Page 28: Introducing Natural Language Program Analysis

Our Current Extraction Rules

4 general rules with subcategories: Left Verb, Right Verb, Generic Verb, Unidentified Verb

Examples (extracted DO / Verb):
  URL parseURL()       ->  URL / parse
  void mouseDragged()  ->  mouse / dragged
  void Host.onSaved()  ->  host / saved
  void message()       ->  message / (none)

Page 29: Introducing Natural Language Program Analysis

Example: Sub-Categories for Left-Verb General Rule

Look beyond the method name:

Parameters, Return type, Declaring class name, Type hierarchy

Subcategories:
1) Standard left verb
2) No DO in method name; has parameters; non-object return type
3) No DO in method name; no parameters; no return type
4) Creational left verb; has return type
5) No DO in method name; has parameters; return type is more specific than parameters in type hierarchy
6) No DO in method name; parameters are more specific than parameters in type hierarchy

Example (subcategory 2: no DO in method name; has parameters; non-object return type)
Verb-DO pair: <remove, UserID> (Left Verb)

Page 30: Introducing Natural Language Program Analysis

Representing Verb-DO Pairs

Action-Oriented Identifier Graph (AOIG)

[Figure: AOIG with verb nodes (verb1, verb2, verb3), direct-object nodes (DO1, DO2, DO3), verb-DO pair nodes (verb1,DO1; verb1,DO2; verb3,DO2; verb2,DO3), and use edges linking the pairs to the source code files where they occur]

Page 31: Introducing Natural Language Program Analysis

Action-Oriented Identifier Graph (AOIG)

[Figure: AOIG instance with verbs (play, add, remove), direct objects (file, playlist, listener), pair nodes (play,file; play,playlist; remove,playlist; add,listener), and use edges into the source code files]

Representing Verb-DO Pairs
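As a rough illustration of the data structure, the sketch below indexes verb-DO pairs and the locations that use them. The real AOIG also keeps explicit verb and direct-object nodes with edges to their pair nodes; this simplified version only keeps enough structure to answer "where is <verb, DO> used?".

import java.util.*;

// Minimal sketch of an AOIG-like index: each verb-DO pair maps to the source
// locations (here, just method names) where it occurs, plus lookup maps from
// verbs and direct objects to their pairs. Illustrative only, not the real AOIG.
public class Aoig {
    private final Map<String, Set<String>> pairToUses = new HashMap<>();
    private final Map<String, Set<String>> verbToPairs = new HashMap<>();
    private final Map<String, Set<String>> doToPairs = new HashMap<>();

    public void addUse(String verb, String directObject, String location) {
        String pair = verb + "," + directObject;
        pairToUses.computeIfAbsent(pair, k -> new HashSet<>()).add(location);
        verbToPairs.computeIfAbsent(verb, k -> new HashSet<>()).add(pair);
        doToPairs.computeIfAbsent(directObject, k -> new HashSet<>()).add(pair);
    }

    public Set<String> usesOf(String verb, String directObject) {
        return pairToUses.getOrDefault(verb + "," + directObject, Set.of());
    }

    public Set<String> pairsForVerb(String verb) {
        return verbToPairs.getOrDefault(verb, Set.of());
    }

    public static void main(String[] args) {
        Aoig aoig = new Aoig();
        aoig.addUse("play", "file", "Player.play(File,...)");
        aoig.addUse("remove", "playlist", "PlaylistManager.removePlaylist()");
        System.out.println(aoig.usesOf("play", "file"));   // locations using <play, file>
        System.out.println(aoig.pairsForVerb("play"));     // all pairs for the verb "play"
    }
}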

Page 32: Introducing Natural Language Program Analysis

Evaluation of Extraction Process

Compare automatic vs. ideal (human) extraction
  300 methods from 6 medium open-source programs
  Annotated by 3 Java developers

Promising Results
  Precision: 57%
  Recall: 64%

Context of Results
  Did not analyze trivial methods
  On average, at least the verb or the direct object was obtained

Page 33: Introducing Natural Language Program Analysis


Name: Emily Gibson Hill
Nickname: Batter on Deck
Current Position: 2nd-year PhD Student
Future Position: PhD Candidate

Stats
Year    cokes/day    meetings/week
2003    0.2          1
2007    2            5

Page 34: Introducing Natural Language Program Analysis

Program Exploration

Purpose: Expedite software maintenance and program comprehension

Key Insight: Automated tools can use program structure and identifier names to save the developer time and effort

Ongoing work: Dora the Program Explorer

Page 35: Introducing Natural Language Program Analysis

Dora the Program Explorer*

* Dora comes from exploradora, the Spanish word for a female explorer.

[Diagram: a natural language query and a program structure seed flow into Dora, which produces the relevant neighborhood]

Natural Language Query
  • Maintenance request
  • Expert knowledge
  • Query expansion

Program Structure
  • Representation (current: call graph)
  • Seed starting point

Relevant Neighborhood
  • Subgraph relevant to query

Page 36: Introducing Natural Language Program Analysis

State of the Art in Exploration

Structural (dependence, inheritance)
  Slicing
  Suade [Robillard 2005]

Lexical (identifier names, comments)
  Regular expressions: grep, Eclipse search
  Information Retrieval: FindConcept, Google Eclipse Search [Poshyvanyk 2006]

Page 37: Introducing Natural Language Program Analysis

Motivating need for structural and lexical information

Program: JBidWatcher, an eBay auction sniping program

Bug: User-triggered add auction event has no effect

Task: Locate code related to ‘add auction’ trigger

Seed: DoAction() method, from prior knowledge

Example Scenario

Page 38: Introducing Natural Language Program Analysis

[Figure: call graph around DoAction() filled with dozens of irrelevant DoNada() callees]

Using only structural information

DoAction() has 38 callees; only 2/38 are relevant (the rest are irrelevant methods)

Looking for: ‘add auction’ trigger

DoAction()

DoAdd()

DoPasteFromClipboard()

And what if you wanted to explore more than one edge away?

Locates locally relevant items, but many irrelevant

Page 39: Introducing Natural Language Program Analysis

Using only lexical information

50/1812 methods contain matches to ‘add*auction’ regular expression query

Only 2/50 are relevant

Locates globally relevant items, but many irrelevant

Looking for: ‘add auction’ trigger

Page 40: Introducing Natural Language Program Analysis

[Figure: the same call graph around DoAction(), with the many irrelevant DoNada() callees pruned away]

Combining Structural & Lexical Information

Structural: guides exploration from seed

Looking for: ‘add auction’ trigger

Relevant Neighborhood

DoAction()

DoPasteFromClipboard()

DoAdd()

Lexical: prunes irrelevant edges

Page 41: Introducing Natural Language Program Analysis

The Dora Approach

Determine method relevance to query
  Calculate lexical-based relevance score
  Low-scored methods are pruned from the neighborhood

Recursively explore
  Prune irrelevant structural edges from the seed
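A minimal sketch of that explore-and-prune loop, assuming a precomputed call graph and a placeholder relevance function; the actual scoring Dora uses is summarized on the next slides.

import java.util.*;

// Minimal sketch of Dora-style exploration under simplifying assumptions:
// starting from a seed method, walk the call graph, score each callee's text
// against the query, and keep only callees above a threshold. The scoring
// function here is a stand-in for Dora's lexical relevance score.
public class DoraExplorer {
    private final Map<String, List<String>> callGraph;   // method -> callees
    private final Map<String, String> methodText;        // method -> name + body text

    public DoraExplorer(Map<String, List<String>> callGraph, Map<String, String> methodText) {
        this.callGraph = callGraph;
        this.methodText = methodText;
    }

    // Fraction of query terms that appear in the method's text (placeholder score).
    double relevance(String method, Set<String> queryTerms) {
        String text = methodText.getOrDefault(method, "").toLowerCase();
        long hits = queryTerms.stream().filter(text::contains).count();
        return queryTerms.isEmpty() ? 0.0 : (double) hits / queryTerms.size();
    }

    // Recursively explore from the seed, pruning low-scored callees.
    Set<String> explore(String seed, Set<String> queryTerms, double threshold) {
        Set<String> neighborhood = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>(List.of(seed));
        while (!work.isEmpty()) {
            String current = work.pop();
            if (!neighborhood.add(current)) continue;              // already visited
            for (String callee : callGraph.getOrDefault(current, List.of())) {
                if (relevance(callee, queryTerms) >= threshold) {
                    work.push(callee);                             // keep and explore further
                }
            }
        }
        return neighborhood;
    }
}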

Page 42: Introducing Natural Language Program Analysis

Calculating Relevance Score: Term Frequency

Score based on query term frequency of the method

[Figure: one candidate method contains 6 query term occurrences; another contains only 2]

Query: ‘add auction’

Page 43: Introducing Natural Language Program Analysis

Calculating Relevance Score: Location Weights

Weigh term frequency based on location:
  Method name more important than body
  Method body statements normalized by length

Query: 'add auction'
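As a rough sketch of such a weighting: the slides do not give Dora's actual weights or normalization, so the 0.5/0.25 weights and the per-statement normalization below are assumptions for illustration only.

// Rough sketch of a location-weighted relevance score. The specific weights and
// the exact normalization used by Dora are not given on the slides; the values
// here (name hits weighted by 0.5, body hits normalized by statement count and
// weighted by 0.25) are assumptions for illustration only.
public class RelevanceScore {
    static double score(String methodName, String[] bodyStatements, String[] queryTerms) {
        double nameScore = 0.0;
        double bodyScore = 0.0;
        String name = methodName.toLowerCase();
        for (String term : queryTerms) {
            String t = term.toLowerCase();
            if (name.contains(t)) {
                nameScore += 1.0;                                  // name hits count fully
            }
            for (String stmt : bodyStatements) {
                if (stmt.toLowerCase().contains(t)) {
                    bodyScore += 1.0 / bodyStatements.length;      // normalized by body length
                }
            }
        }
        // Name matches weighted more heavily than body matches (assumed weights).
        return 0.5 * nameScore + 0.25 * bodyScore;
    }

    public static void main(String[] args) {
        String[] body = { "auctions.add(auction);", "refreshDisplay();" };
        System.out.println(score("DoAdd", body, new String[] { "add", "auction" }));
    }
}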

Page 44: Introducing Natural Language Program Analysis

Dora explores ‘add auction’ trigger

From the DoAction() seed, correctly identified at a 0.5 threshold:
  DoAdd() (0.93)
  DoPasteFromClipboard() (0.60)
with only one false positive:
  DoSave() (0.52)

Page 45: Introducing Natural Language Program Analysis

Summary

NL technology used: synonyms, collocations, morphology, word frequencies, part-of-speech tagging, AOIG

Evaluation indicates: natural language information shows promise for improving software development tools

Key to success: accurate extraction of NL clues

Page 46: Introducing Natural Language Program Analysis

Our Current and Future Work

Basic NL-based tools for software
  Abbreviation expander
  Program synonyms
  Determining relative importance of words

Integrating information retrieval techniques

Page 47: Introducing Natural Language Program Analysis

Posed Questions for Discussion

What open problems faced by software tool developers can be mitigated by NLPA?

Under what circumstances is NLPA not useful?