
Page 1

Analyzing Natural-Language Artifacts Of The Software Process M. Hasan, E. Stroulia, D. Barbosa, M. Alalfi

University of Alberta http://ssrg.cs.ualberta.ca Oct-23-10

ICSM 2010 - ERA

Page 2

WikiDev2.0 101

• Objective: to make explicit the elements around software collaboration: artifacts, people and communication
  • Integrate artifacts, people and communications on the web
  • URLs and (hyper)links are the mechanism
• How it works:
  • Information from SVN, tickets, emails, IRC chats, source code, and code-analysis tools (JDEvAn) is incrementally imported
  • Everything gets a URL
  • Analyses generate hyperlinks between these URLs
  • Interactive views enable the exploration of the original information and analysis results


Page 3

The WikiDev2.0 Architecture

[Architecture diagram: WikiDev 2.0 builds on a stock MediaWiki core and database, extended by Annoki (control and common functions: template editor, calendars, visualizations, text replace, tagging, LaTeX convertor, access controls, content analysis, page feedback, structural differencing, 3D OWL visualization, with a custom Annoki database) and by the WikiDev 2.0 system itself (custom WikiDev database, common WikiDev functions, UML model integration, text analysis, bug-tracker integration, project lifecycle graphs, mailing-list integration, IRC integration, SVN integration, artifact clustering, communication graphs).]

WikiDev2.0 is a test-bed for experiments. The question here: what information can we find in text, and how?

Page 4

[Figure: the underlying conceptual model. Developers know/use programming languages, work with tools, create/add artifacts, handle tasks, fix/resolve tickets, check/commit revisions, and cooperate/work with other developers; artifacts change/modify tasks.]

Text analysis in WikiDev

• Two text-analysis methods
  • Lexical analysis with TAPoR
  • Syntactic/semantic analysis
• The underlying model
• Five sources of textual data
  • Wiki pages, ticket text, email conversations, IRC chats, SVN comments


Page 5

Some interesting sentences


• User9 tried to run the Php code; User5 has started learning Php.
• User7 neaten up the UI along the side of UML display; User8 take care of documentation.
• User5 handled associations in XMIParser; User8 changed wikiroamer and wikiviewfactor.
• User6 modify User5’s Php file to allow …; User6 and User9 should focus on preparing parser

Page 6

[Analysis pipeline figure: WikiDev 2.0 textual data sources → syntactic parsing → sentence parse trees → semantic annotation (using TAPoR analysis and the categories & wordlist dictionary) → semantically and syntactically annotated XMLs → pattern extraction (XQuery patterns) → RDF triples.]

Step 1: Syntactic parsing

Sentence:
I used Java and Eclipse before.

Syntactic tags:
I/PRP used/VBD java/NN and/CC Eclipse/NN before/RB ./.

Dependency relations:
nsubj(used-2, I-1)
dobj(used-2, java-3)
conj_and(java-3, Eclipse-5)
advmod(used-2, before-6)
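The relation names above (e.g. conj_and) are Stanford-style collapsed dependencies; as a rough stand-in for this step, a few lines of spaCy (an assumption, not the parser behind the original tool; its labels differ slightly, e.g. conj instead of conj_and) produce the same kind of output:

```python
# Minimal sketch of Step 1 using spaCy.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I used Java and Eclipse before.")

# Word/POS pairs, analogous to "I/PRP used/VBD java/NN ...".
print(" ".join(f"{tok.text}/{tok.tag_}" for tok in doc))

# Dependency relations, analogous to "nsubj(used-2, I-1)".
for tok in doc:
    if tok.dep_ != "ROOT":
        print(f"{tok.dep_}({tok.head.text}-{tok.head.i + 1}, {tok.text}-{tok.i + 1})")
```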


Page 7

[Analysis pipeline figure, as on Page 6.]

Step 2: Semantic annotation, using a domain-specific dictionary:

Users = { User1, User2, ..., User9 }
Programming Languages = { Java, PHP, XML, ... }
Tools = { Eclipse, Bugzilla, IBMJazz, ... }
Tickets = { ticket1, ticket2, ... }
Revisions = { revision1, revision2, ... }
Action Verbs = { create, debug, implement, fix, make, ... }
Project Tasks = { visualization, documentation, user interface, testing, ... }
Project Artifacts = { class, method, database, script, ... }
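A minimal sketch of how such a dictionary can attach semantic tags to tokens (the word lists below are abbreviated and illustrative, not the full vocabulary used in the study):

```python
# Simplified dictionary-based semantic annotation: tag each token whose
# lowercased form appears in one of the category word lists.
# A real implementation would match on lemmas/stems rather than raw tokens.
CATEGORIES = {
    "Developer": {"user1", "user2", "user5", "user6", "user8", "user9"},
    "language": {"java", "php", "xml"},
    "tool": {"eclipse", "bugzilla", "ibmjazz"},
    "action": {"create", "debug", "implement", "fix", "make", "use"},
    "task": {"visualization", "documentation", "testing"},
    "artifact": {"class", "method", "database", "script", "parser"},
}

def semantic_tags(tokens):
    """Return (token, tag) pairs; tokens outside every category get None."""
    result = []
    for tok in tokens:
        tag = next((cat for cat, words in CATEGORIES.items()
                    if tok.lower().strip(".,") in words), None)
        result.append((tok, tag))
    return result

print(semantic_tags("User5 used Eclipse to fix the XML parser".split()))
```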

Page 8

[Charts (TAPoR results; Figures 5 and 6 in the poster): frequency of activity-related terms (planning, communication, development, testing, deployment) per month (Month1-Month4), and volume of communication per team member (user1-user9) per month.]

Page 9

[Analysis pipeline figure, as on Page 6.]

Step 3: Semantically and syntactically annotated XML for the sentence “I used Java and Eclipse before.”:

<S Type="ticket-description" ticketId="1" sentId="127" Author="User1">
  <Verb stem="use" ID="1" POS="VBN" Relation="root"> used
    <PRP ID="1" POS="PRP" Relation="nsubj" semanticTag="Developer" Name="User1"> I </PRP>
    <Noun ID="2" POS="NNP" Relation="dobj" semanticTag="language"> Java </Noun>
    <Noun ID="4" POS="NNP" Relation="conj_and" semanticTag="tool"> Eclipse </Noun>
    <Adverb ID="5" POS="RB" Relation="advmod"> before </Adverb>
  </Verb>
</S>
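A rough sketch (not the authors’ annotation code) of how parser output and dictionary tags could be merged into an XML element of this shape, using Python’s ElementTree; attribute names simply mirror the example above:

```python
# Build an annotated sentence element like the one shown above.
import xml.etree.ElementTree as ET

s = ET.Element("S", Type="ticket-description", ticketId="1",
               sentId="127", Author="User1")
verb = ET.SubElement(s, "Verb", stem="use", ID="1", POS="VBN", Relation="root")
verb.text = " used "
ET.SubElement(verb, "PRP", ID="1", POS="PRP", Relation="nsubj",
              semanticTag="Developer", Name="User1").text = " I "
ET.SubElement(verb, "Noun", ID="2", POS="NNP", Relation="dobj",
              semanticTag="language").text = " Java "
ET.SubElement(verb, "Noun", ID="4", POS="NNP", Relation="conj_and",
              semanticTag="tool").text = " Eclipse "
ET.SubElement(verb, "Adverb", ID="5", POS="RB", Relation="advmod").text = " before "

print(ET.tostring(s, encoding="unicode"))
```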

Page 10

[Analysis pipeline figure, as on Page 6.]

Step 4: Pattern extraction

[Figure: triple-extraction rules over the annotated parse trees. Rule 1 relates two annotated entities Ei and Ej through an object relation; Rule 2 relates entities Ei, Ej and Ek through an object relation and a noun-modifier relation.]
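The tool itself runs XQuery patterns over the annotated XMLs; purely as a Python stand-in, the simple subject-object case (Rule 1) can be sketched over an annotated sentence like the Step 3 example:

```python
# Rough stand-in for one XQuery pattern: find a verb with a Developer subject
# and a semantically tagged direct object, and emit an RDF-style triple.
import xml.etree.ElementTree as ET

ANNOTATED = """
<S Type="ticket-description" ticketId="1" sentId="127" Author="User1">
  <Verb stem="use" ID="1" POS="VBN" Relation="root"> used
    <PRP ID="1" POS="PRP" Relation="nsubj" semanticTag="Developer" Name="User1"> I </PRP>
    <Noun ID="2" POS="NNP" Relation="dobj" semanticTag="language"> Java </Noun>
  </Verb>
</S>
"""

def child_with(elem, **attrs):
    """First child whose attributes match all the given key/value pairs."""
    for child in elem:
        if all(child.get(k) == v for k, v in attrs.items()):
            return child
    return None

def extract_triples(sentence_xml):
    """Rule 1: <developer, action verb, semantically tagged object>."""
    triples = []
    root = ET.fromstring(sentence_xml)
    for verb in root.iter("Verb"):
        subj = child_with(verb, Relation="nsubj", semanticTag="Developer")
        obj = child_with(verb, Relation="dobj")
        if subj is not None and obj is not None and obj.get("semanticTag"):
            triples.append((subj.get("Name"), verb.get("stem"), obj.text.strip()))
    return triples

print(extract_triples(ANNOTATED))   # -> [('User1', 'use', 'Java')]
```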

Page 11

Results of the syntactic/semantic analysis. Counts per data source: SVN comments / tickets / email messages / IRC chats (plus an extracted sample triple):

• Number of sentences: 353 / 169 / 484 / 3130
• Number of annotated XMLs: 346 / 161 / 461 / 3018
• Number of triplets: 154 / 77 / 165 / 830
• <developer, use/..., language>: 10 / 6 / 15 / 85 (sample: <user6 focus on Java>)
• <developer, work/..., tool>: 16 / 4 / 20 / 74 (sample: <user5 used Eclipse>)
• <developer, create/..., artifact>: 82 / 34 / 55 / 210 (sample: <user5 handle XMIParser>)
• <developer, handle/..., task>: 12 / 5 / 10 / 68 (sample: <user8 work on UI>)
• <developer, fix/..., ticket>: 4 / 5 / 4 / 16 (sample: <user9 fixed bug5>)
• <developer, check/..., revision>: 5 / 8 / 0 / 3 (sample: <user6 done revision53>)
• <artifact, change/..., task>: 25 / 12 / 7 / 64 (sample: <UMLHandler modify Xml>)
• <developer, cooperate/..., developer>: 0 / 3 / 56 / 310 (sample: <user6 & user17 focus on parser>)

Page 12

Empirical Evaluation


Triples found by the developer (D): 54
Triples found by the tool: 39
Triples missed by the tool: 28 (52% of the 54 identified by D)
Correct triples: 19 (49% of the 39 found)
Incorrect triples: 20 (51% of the 39 found)
Missed triples: 3

Page 13

Conclusions


• There is substantial information in the text associated with the software process:
  • Developer experience
  • Decision rationale
  • Problems and solutions considered
• We developed a method for lexical, syntactic, and semantic analysis of textual data produced during the life-cycle of a software project.
• The empirical analysis shows that:
  • Interesting data can be extracted
  • More robust parsing and better entity-phrase recognition are necessary
• We are currently working towards (a) extending the suite of RDF-triple patterns, and (b) developing a domain-specific query language for flexible question answering on the project lifecycle.

Page 14

And the Poster!



Analyzing Natural-Language Artifacts of the Software Process

Maryam Hasan, Eleni Stroulia, Denilson Barbosa, and Manar Alalfi
Department of Computing Science, University of Alberta
http://ssrg.cs.ualberta.ca/index.php/WikiDev_2.0

Conclusion & Future Work

• The conversation among the team developers in email messages, IRC chats, SVN commit messages, ticket descriptions and wiki pages contains valuable information about their activities and artifacts, the issues the team members faced during their work, and the decisions they made.
• We are currently working towards:
  • Extending the suite of RDF-triple patterns.
  • Developing a domain-specific query language (based on our underlying conceptual model of Figure 4) for flexible question answering on the project lifecycle.
  • Running the experiment on a bigger dataset and evaluating the results.

TAPoR Analysis

TAPoR (Text Analysis Portal for Research) is a web-based application that supports a suite of lexical text-analysis tools, including word counts, word co-occurrence, word-cloud visualizations, word collocations, and pattern extraction. In our work, we used TAPoR for two purposes:

• We used the “most-frequent-words” functionality to identify important keywords, which are then used by the syntactic/semantic analysis method.
• We applied the word-count and keyword-in-context services to gain insights about interesting trends in the information contained in the different data sources, as provided by the team members over the various stages of the project.
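TAPoR itself is a web portal; purely as an illustration of these two lexical analyses (most frequent words and keyword-in-context), a few lines of Python over a made-up snippet of project text:

```python
import re
from collections import Counter

text = "User5 has started learning Php. User9 tried to run the Php code."
words = re.findall(r"[a-z0-9]+", text.lower())

# Most frequent words, ignoring a tiny illustrative stop list.
stop = {"the", "to", "has"}
print(Counter(w for w in words if w not in stop).most_common(3))

# Keyword-in-context: each occurrence of a keyword with a small window around it.
def kwic(tokens, keyword, window=3):
    for i, tok in enumerate(tokens):
        if tok == keyword:
            yield " ".join(tokens[max(0, i - window):i + window + 1])

for line in kwic(words, "php"):
    print(line)
```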

From the triples extracted by the syntactic/semantic analysis we can recover useful pieces of information, such as the following:

• Expertise of developers: <User5 started learning PHP>
• Responsibility of developers: <User8 do documentation>
• Developer contributions: <User5 handle associations in XMIParser>
• Developers’ relationships: <User6 modify User5’s PHP file>

Motivation

• Software developers generate a substantial stream of textual data as they communicate during the life-cycle of their projects.
• Through emails and chats, developers discuss the requirements of their software system, negotiate the distribution of tasks among them, and make decisions about the system design and the internal structure and functionality of its code modules.
• A careful analysis of such text reveals valuable information about various aspects of the software life-cycle.

Methodology

• We applied two complementary text-analysis methods to examine five different sources of textual data of a team project and gain valuable information about various aspects of the software life-cycle. The five textual-data sources are (a) wiki pages, (b) SVN comments, (c) tickets, (d) email messages, and (e) IRC chats.
• The first method is based on an approximate and efficient analysis at the lexical level, using the off-the-shelf lexical-analysis toolkit TAPoR.
• The second is much more accurate, albeit more expensive from a computational point of view, and focuses on the text at the syntactic and semantic level.

Figure 1: Tool architecture. WikiDev 2.0 textual data → syntactic parsing → parse trees → semantic annotation (with TAPoR analysis and the categories & dictionary) → annotated XMLs → pattern extraction (XQuery patterns) → RDF triples.

Syntactic/Semantic Analysis

Syntactic/semantic analysis integrates computational-linguistic techniques with domain-specific knowledge in order to extract useful pieces of information as RDF triples. This method consists of the following stages:

• Syntactic parsing: assigns a syntactic tag to each word and identifies the grammatical relationships between word pairs.
• Semantic annotation: annotates terms with semantic tags from a domain-specific vocabulary. The vocabulary created for this domain includes these categories: Users, Programming Languages, Tools, Project Tasks, Project Artifacts, Tickets, Revisions.
• Pattern extraction: extracts subject-predicate-object patterns from the annotated XMLs using XQuery to retrieve interesting RDF triples. The extracted triples constitute an instance of a rich conceptual model of the domain that captures interesting relations between developers and software products.

Figure 2: Syntax parse tree: Use (Verb) with nsubj I (PRP), dobj Java (Noun), and aux Have (Verb).
Figure 3: Semantically annotated tree: the same tree with I (PRP) annotated with name = User1 and STag = developer, and Java (Noun) annotated with STag = Lang.
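The triples produced by the pattern-extraction stage lend themselves to off-the-shelf RDF tooling. As a sketch of the kind of question answering we are working towards (using rdflib and an illustrative namespace; this is not part of WikiDev 2.0 itself):

```python
# Store a few triples of the kinds listed in Table 1 and ask a question.
# Requires: pip install rdflib
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/wikidev/")   # illustrative namespace
g = Graph()
g.bind("ex", EX)

g.add((EX.user5, EX.use, EX.Eclipse))
g.add((EX.user5, EX.handle, EX.XMIParser))
g.add((EX.user8, EX.work_on, EX.UI))

# "What has user5 worked with or on?"
query = """
    PREFIX ex: <http://example.org/wikidev/>
    SELECT ?p ?o WHERE { ex:user5 ?p ?o . }
"""
for p, o in g.query(query):
    print(p, o)
```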

Figure 4: Semantic relations of the RDF triples created by pattern extraction. Entities: Developer, Programming Language, Tool, Artifact, Task, Ticket, Revision. Relations include know/use, develop/write, work with, create/add, change/modify, handle/work, resolve/fix, commit/check, and cooperate/work with.

Syntactic/Semantic Analysis Results

Table 1: Results of the syntactic/semantic analysis. Counts per data source: SVN comments / tickets / email messages / IRC chats (plus a sample triple):

• Number of Sentences: 353 / 169 / 484 / 3130
• Number of annotated XMLs: 346 / 161 / 461 / 3018
• Number of Triplets: 154 / 77 / 165 / 830
• <developer, use/..., language>: 10 / 6 / 15 / 85 (sample: <user6 focus on Java>)
• <developer, work/..., tool>: 16 / 4 / 20 / 74 (sample: <user5 used Eclipse>)
• <developer, create/..., artifact>: 82 / 34 / 55 / 210 (sample: <user5 handle XMIParser>)
• <developer, handle/..., task>: 12 / 5 / 10 / 68 (sample: <user8 work on UI>)
• <developer, fix/..., ticket>: 4 / 5 / 4 / 16 (sample: <user9 fixed bug5>)
• <developer, check/..., revision>: 5 / 8 / 0 / 3 (sample: <user6 done revision53>)
• <artifact, change/..., task>: 25 / 12 / 7 / 64 (sample: <UMLHandler modify Xml>)
• <developer, cooperate/..., developer>: 0 / 3 / 56 / 310 (sample: <user6 & user17 focus on parser>)

TAPoR Analysis Results

Figure 5: Trend of team activities throughout the project life-cycle

Figure 6: Trend of team members’ communication throughout the project life-cycle


Acknowledgments

The authors wish to thank Marios Fokaefs and Ken Bauer for their help with the experiments.

