Analyzing Natural-Language Artifacts Of The Software Process M. Hasan, E. Stroulia, D. Barbosa, M. Alalfi
University of Alberta http://ssrg.cs.ualberta.ca Oct-23-10
ICSM 2010 - ERA
WikiDev2.0 101
Objective: to make explicit the elements around software collaboration: artifacts, people, and communication
  Integrate artifacts, people, and communications on the web
  URLs and (hyper)links are the mechanism
How it works:
  Information from SVN, tickets, emails, IRC chats, source code, and code-analysis tools (JDEvAn) is incrementally imported
  Everything gets a URL
  Analyses generate hyperlinks between these URLs
  Interactive views enable the exploration of the original information and analysis results
The WikiDev2.0 Architecture
[Figure: WikiDev 2.0 architecture diagram.
 Layer labels: Stock MediaWiki core; Stock MediaWiki database; Annoki; Annoki Control and common functions; Custom Annoki database; WikiDev 2.0 System; WikiDev 2.0; Common WikiDev functions; Custom WikiDev database.
 Feature labels: LaTeX Convertor; Template Editor; Calendars; Visualizations; Text Replace; Tagging; Access Controls; Content Analysis; Page Feedback; Structural Differencing; 3D OWL Visualization; UML model integration; Text analysis; Bug-tracker integration; Project lifecycle graphs; Mailing list integration; IRC integration; Artifact clustering; Communication graphs; SVN integration.]
WikiDev2.0 is a test-bed for experiments.
The question here: what information can we find in text, and how?
[Figure: conceptual model.
 Entities: developer; task; programming language; tool; artifact; ticket; revision.
 Relation labels: develop/know/...; use/work/...; create/add/...; modify/change/...; handle/modify/...; fix/resolve/...; check/commit/...; work with/cooperate/...]
Text analysis in WikiDev
Two text-analysis methods
  Lexical analysis with TAPoR
  Syntactic/semantic analysis
The underlying model
Five sources of textual data
  Wiki pages, ticket text, email conversations, IRC chats, SVN comments
Some interesting sentences
User9 tried to run the Php code; User5 has started learning Php.
User7 neaten up the UI along the side of UML display; User8 take care of documentation.
User5 handled associations in XMIParser; User8 changed wikiroamer and wikiviewfactor.
User6 modify User5’s Php file to allow …; User6 and User9 should focus on preparing parser
[Figure: analysis pipeline. Boxes: WikiDev 2.0 textual data sources; TAPoR Analysis; Categories & Wordlist Dictionary; Syntactic Parsing; Sentence Parse Trees; Semantic Annotation; Semantically & Syntactically annotated XMLs; Pattern Extraction; XQuery Patterns; RDF Triples.]
Step 1
Sentence:
  I used Java and Eclipse before.
Syntactic tags:
  I/PRP used/VBD java/NN and/CC Eclipse/NN before/RB ./.
Dependency Relations:
  nsubj (used-2, I-1)
  dobj (used-2, java-3)
  conj_and (java-3, Eclipse-5)
  advmod (used-2, before-6)
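The dependency relations on this slide are plain text of the form relation(head-index, dependent-index). As a minimal sketch of how such output could be loaded for later steps, the Python below parses that notation into tuples. The `Dependency` type and `parse_dependencies` function are hypothetical names introduced for illustration; the slides do not show how the authors consumed the parser output.

```python
import re
from typing import NamedTuple


class Dependency(NamedTuple):
    """One typed dependency, e.g. nsubj(used-2, I-1). Hypothetical helper type."""
    relation: str
    head: str
    head_index: int
    dependent: str
    dependent_index: int


# relation ( word-index , word-index ), tolerating optional whitespace.
_DEP = re.compile(
    r"(\w+)\s*\(\s*([^\s,()]+)-(\d+)\s*,\s*([^\s,()]+)-(\d+)\s*\)"
)


def parse_dependencies(text: str) -> list[Dependency]:
    """Parse every relation(head-i, dependent-j) occurrence found in text."""
    return [
        Dependency(rel, head, int(hi), dep, int(di))
        for rel, head, hi, dep, di in _DEP.findall(text)
    ]


# The relations shown on the slide for "I used Java and Eclipse before."
relations = (
    "nsubj (used-2, I-1) dobj (used-2, java-3) "
    "conj_and (java-3, Eclipse-5) advmod (used-2, before-6)"
)
deps = parse_dependencies(relations)
for d in deps:
    print(d.relation, d.head, d.dependent)
```

Keeping the token indices makes it possible to tell repeated words apart when the relations are later matched against semantic tags.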
Step 2
Users = { User1, User2, ... , User9 }.
Programming Languages = { Java, PHP, XML, ... }.
Tools = { Eclipse, Bugzilla, IBMJazz, ... }.
Tickets = { ticket1, ticket2, ... }.
Revisions = { revision1, revision2, ... }.
Action Verbs = { create, debug, implement, fix, make, ... }.
Project Tasks = { visualization, documentation, user interface, testing, ... }.
Project Artifacts = { class, method, database, script, ... }.
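A vocabulary like the one above lends itself to dictionary lookup. The sketch below is an assumption about how terms could be mapped to categories, not the authors' implementation. It contains only the members listed on the slide (the elided "..." entries are not filled in), handles single-word terms only, and does no stemming. `VOCABULARY`, `TERM_TO_TAG`, and `annotate` are hypothetical names.

```python
# Category -> known terms, lowercased. Only the members shown on the slide.
VOCABULARY = {
    "user": {f"user{i}" for i in range(1, 10)},
    "language": {"java", "php", "xml"},
    "tool": {"eclipse", "bugzilla", "ibmjazz"},
    "action": {"create", "debug", "implement", "fix", "make"},
    "task": {"visualization", "documentation", "testing"},
    "artifact": {"class", "method", "database", "script"},
}

# Inverted index for constant-time lookup of a term's category.
TERM_TO_TAG = {term: tag for tag, terms in VOCABULARY.items() for term in terms}


def annotate(tokens, author=None):
    """Return (token, tag) pairs; tag is None for out-of-vocabulary tokens.

    Assumption: a first-person "I" refers to the message author, so it is
    tagged as a user whenever an author is supplied.
    """
    annotated = []
    for token in tokens:
        if token == "I" and author is not None:
            annotated.append((token, "user"))
        else:
            annotated.append((token, TERM_TO_TAG.get(token.lower())))
    return annotated


tags = annotate(["I", "used", "Java", "and", "Eclipse", "before"], author="User1")
print(tags)
```

In this sketch "used" stays untagged because the lookup is exact; matching inflected verbs would need a stemmer or lemmatizer in front of the dictionary.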
[Chart: frequency of activities (Planning, Communication, Development, Testing, Deployment), one series per month (Month1 to Month4); y-axis from 0 to 1400.]
[Chart: values per month (Month1 to Month4), one series per team member (user1 to user9); y-axis from 0 to 800.]
Step 3
Annotated XML for the sentence “I used Java and Eclipse before.”
<S Type="ticket-description" ticketId="1" sentId="127" Author="User1">
  <Verb stem="use" ID="1" POS="VBN" Relation="root"> Used
    <PRP ID="1" POS="PRP" Relation="nsubj" semanticTag="Developer" Name="User1"> I </PRP>
    <Noun ID="2" POS="NNP" Relation="dobj" semanticTag="language"> Java
      <Noun ID="4" POS="NNP" Relation="conj_and" semanticTag="tool"> Eclipse </Noun>
    </Noun>
    <Adverb ID="5" POS="RB" Relation="advmod"> before </Adverb>
  </Verb>
</S>
Step 4
[Figure: dependency trees and extraction rules.
 Tree with noun modifier: Verb: created; PRP: I; Noun: Program; Noun: Java; Verb: Have; edge labels: nsubj, dobj, aux, noun-modifier.
 Simple tree: Verb: used; PRP: I; Noun: Java; Verb: Have; edge labels: nsubj, dobj, aux.
 Rule1: nodes Ei, Ej; labels: object, relation.
 Rule2: nodes Ei, Ej, Ek; labels: object, relation, noun-modifier relation.]
Pattern                                  SVN comment  Ticket  Email message  IRC chats  Extracted Samples
Number of Sentences                              353     169            484       3130
Number of annotated XMLs                         346     161            461       3018
Number of Triplets                               154      77            165        830
<developer, use/..., language>                    10       6             15         85  <user6 focus on Java>
<developer, work/..., tool>                       16       4             20         74  <user5 used Eclipse>
<developer, create/..., artifact>                 82      34             55        210  <user5 handle XMIParser>
<developer, handle/..., task>                     12       5             10         68  <user8 work on UI>
<developer, fix/..., ticket>                       4       5              4         16  <user9 fixed bug5>
<developer, check/..., revision>                   5       8              0          3  <user6 done revision53>
<artifact, change/..., task>                      25      12              7         64  <UMLHandler modify Xml>
<developer, cooperate/..., developer>              0       3             56        310  <user6 & user17 focus on parser>
Empirical Evaluation
Triples found: Developer (D) 54; the tool 39
Triples missed: 28, 52% (out of the 54 identified by D)
Correct triples: 19, 49% (out of the 39 found)
Incorrect triples: 20, 51% (out of the 39 found)
Missed triples: 3
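The percentages on this slide follow directly from the raw counts. The short Python check below reproduces them; the variable names are illustrative, and the final "Missed triples: 3" row is left out because the slide does not state what it is measured against.

```python
def percent(part: int, whole: int) -> int:
    """Share of `part` in `whole`, rounded to the nearest whole percent."""
    return round(100 * part / whole)


found_by_developer = 54  # triples identified manually by developer D
found_by_tool = 39       # triples extracted by the tool
missed_by_tool = 28
correct = 19
incorrect = 20

# Correct and incorrect triples together account for everything the tool found.
assert correct + incorrect == found_by_tool

print(percent(missed_by_tool, found_by_developer))  # 52
print(percent(correct, found_by_tool))              # 49
print(percent(incorrect, found_by_tool))            # 51
```

The share of correct triples among those the tool found (19 of 39) is what is conventionally called precision.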
Conclusions
There is substantial information in the text associated with the software process:
  Developer experience
  Decision rationale
  Problems and solutions considered
We developed a method for lexical, syntactic, and semantic analysis of textual data produced during the life-cycle of a software project.
The empirical analysis shows that:
  Interesting data can be extracted
  More robust parsing and better entity-phrase recognition are necessary
We are currently working towards (a) extending the suite of RDF-triple patterns, and (b) developing a domain-specific query language for flexible question answering on the project lifecycle.
And the Poster!
Motivation
Analyzing Natural-Language Artifacts of the Software Process
Maryam Hasan, Eleni Stroulia, Denilson Barbosa, and Manar Alalfi
Department of Computing Science, University of Alberta
http://ssrg.cs.ualberta.ca/index.php/WikiDev_2.0
Conclusion & Future Work
• The conversations among team developers in email messages, IRC chats, SVN commit messages, ticket descriptions, and wiki pages contain valuable information about their activities and artifacts, the issues team members faced during their work, and the decisions they made.
• We are currently working towards:
  • Extending the suite of RDF-triple patterns.
  • Developing a domain-specific query language (based on our underlying conceptual model of Figure 4) for flexible question answering on the project lifecycle.
  • Running the experiment on a bigger dataset and evaluating the results.
TAPoR Analysis
TAPoR (Text Analysis Portal for Research) is a web-based application that supports a suite of lexical text-analysis tools, including word counts, word co-occurrence, word-cloud visualizations, word collocations, and pattern extraction. In our work, we used TAPoR for two purposes:
• We used the “most-frequent-words” functionality to identify important keywords, which are then used by the syntactic/semantic analysis method.
• We applied the word-count and keyword-in-context services to gain insights into interesting trends in the information contained in the different data sources, as provided by the team members over the various stages of the project.
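The two lexical operations named above, most-frequent words and keyword-in-context, can be illustrated without TAPoR itself. The following is a plain-Python stand-in under stated assumptions: it does not call TAPoR or mirror its interface, and the tokenizer, stopword handling, and function names are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def most_frequent_words(texts, n=10, stopwords=frozenset()):
    """Return the n most common non-stopword tokens across all texts."""
    counts = Counter(
        word for text in texts for word in tokenize(text) if word not in stopwords
    )
    return counts.most_common(n)


def keyword_in_context(text, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence."""
    tokens = tokenize(text)
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Two sentences quoted earlier in the slides, used here as sample input.
messages = [
    "User9 tried to run the Php code",
    "User5 has started learning Php.",
]
top = most_frequent_words(messages, n=1, stopwords={"the", "to", "has"})
kwic = keyword_in_context(messages[0], "php", width=2)
print(top)
print(kwic)
```

Running the frequency count per data source or per month would give the kind of counts from which trend charts can be drawn.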
From the triples extracted by syntactic/semantic analysis we can find useful pieces of information, such as the following:
• Expertise of developers: <User5 started learning PHP>
• Responsibility of developers: <User8 do documentation>
• Developers’ contributions: <User5 handle associations in XMIParser>
• Developers’ relationships: <User6 modify User5’s PHP file>
Methodology
• Software developers generate a substantial stream of textual data as they communicate during the life-cycle of their projects.
• Through emails and chats, developers discuss the requirements of their software system, negotiate the distribution of tasks among themselves, and make decisions about the system design and about the internal structure and functionality of its code modules.
• A careful analysis of such text reveals valuable information about various aspects of the software life-cycle.
• We applied two complementary text-analysis methods to examine five different sources of textual data from a team project and gain valuable information about various aspects of the software life-cycle. The five textual-data sources are (a) wiki pages, (b) SVN comments, (c) tickets, (d) email messages, and (e) IRC chats.
• The first method is based on approximate and efficient analysis at the lexical level, using the off-the-shelf lexical-analysis toolkit TAPoR.
• The second is much more accurate, albeit computationally more expensive, and focuses on the text at the syntactic and semantic level.
Figure 1: Tool Architecture
[Boxes: WikiDev 2.0 Textual Data; TAPoR Analysis; Categories & Dictionary; Syntactic Parsing; Parse Trees; Semantic Annotation; Annotated XMLs; Pattern Extraction; XQuery Patterns; RDF Triples.]
Syntactic/Semantic Analysis
Syntactic/semantic analysis integrates computational-linguistic techniques with domain-specific knowledge in order to extract useful pieces of information as RDF triples. This method consists of the following stages:
Figure 4: Semantic relations of the RDF triples created by pattern extraction
[Entities: Developer; Programming Language; Tool; Artifact; Task; Ticket; Revision.
 Relation labels: Cooperate/Work with/...; Commit/Check/...; Resolve/Fix/...; Handle/Work/...; Change/Modify/...; Know/Use/...; Create/Add/...; Develop/Write/...]
Pattern                                  SVN comment  Ticket  Email message  IRC chat  Sample Triple
Number of Sentences                              353     169            484      3130
Number of annotated XMLs                         346     161            461      3018
Number of Triplets                               154      77            165       830
<developer, use/..., language>                    10       6             15        85  <user6 focus on Java>
<developer, work/..., tool>                       16       4             20        74  <user5 used Eclipse>
<developer, create/..., artifact>                 82      34             55       210  <user5 handle XMIParser>
<developer, handle/..., task>                     12       5             10        68  <user8 work on UI>
<developer, fix/..., ticket>                       4       5              4        16  <user9 fixed bug5>
<developer, check/..., revision>                   5       8              0         3  <user6 done revision53>
<artifact, change/..., task>                      25      12              7        64  <UMLHandler modify Xml>
<developer, cooperate/..., developer>              0       3             56       310  <user6 & user17 focus on parser>
TAPoR Analysis Results
Figure 5: Trend of team activities throughout the project life-cycle
Figure 6: Trend of team members communication throughout the project life-cycle
Syntactic/Semantic Analysis Results
Acknowledgments
• Syntactic parsing: assigns a syntactic tag to each word and identifies the grammatical relationships between word pairs.
• Semantic annotation: assigns semantic tags to terms from a domain-specific vocabulary. The vocabulary created for this domain includes these categories: Users, Programming Languages, Tools, Project Tasks, Project Artifacts, Tickets, Revisions.
• Pattern extraction: extracts subject-predicate-object patterns from the annotated XMLs using XQuery to retrieve interesting RDF triples. The extracted triples constitute an instance of a rich conceptual model of the domain that captures interesting relations between developers and software products.
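The pattern-extraction stage can be sketched as follows. The authors use XQuery; the example swaps in Python's standard `xml.etree.ElementTree` for illustration only. The sample XML reuses the element and attribute names from the annotated sentence in the slides (`Verb`, `PRP`, `Noun`, `stem`, `Relation`, `semanticTag`, `Name`). The matching rule (a developer subject plus tagged object nouns under the same verb) and the `extract_triples` function are assumptions, not the published patterns.

```python
import xml.etree.ElementTree as ET

# Annotated form of "I used Java and Eclipse before." (attribute names as in the slides).
ANNOTATED = """<S Type="ticket-description" ticketId="1" sentId="127" Author="User1">
  <Verb stem="use" ID="1" POS="VBN" Relation="root"> Used
    <PRP ID="1" POS="PRP" Relation="nsubj" semanticTag="Developer" Name="User1"> I </PRP>
    <Noun ID="2" POS="NNP" Relation="dobj" semanticTag="language"> Java
      <Noun ID="4" POS="NNP" Relation="conj_and" semanticTag="tool"> Eclipse </Noun>
    </Noun>
    <Adverb ID="5" POS="RB" Relation="advmod"> before </Adverb>
  </Verb>
</S>"""


def extract_triples(xml_text):
    """Return (subject, predicate, object, object tag) tuples.

    Hypothetical rule: for each verb, pair its developer subject (nsubj)
    with every semantically tagged noun reached via dobj or conj_and.
    """
    root = ET.fromstring(xml_text)
    triples = []
    for verb in root.iter("Verb"):
        subjects = [
            child for child in verb
            if child.get("Relation") == "nsubj"
            and child.get("semanticTag") == "Developer"
        ]
        for subject in subjects:
            for noun in verb.iter("Noun"):
                tag = noun.get("semanticTag")
                if noun.get("Relation") in ("dobj", "conj_and") and tag:
                    triples.append(
                        (subject.get("Name"), verb.get("stem"),
                         (noun.text or "").strip(), tag)
                    )
    return triples


triples = extract_triples(ANNOTATED)
for triple in triples:
    print(triple)
```

Because the conjoined noun is nested under the direct object, a single verb yields one triple per object, here one for the language and one for the tool.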
Figure 2: Syntax parse tree
[Nodes: Use (Verb); I (PRP); Java (Noun); Have (Verb). Edge labels: nsubj, dobj, aux.]
Figure 3: Semantically annotated tree
[Nodes: Use (Verb); I (PRP), name = User1, STag = developer; Java (Noun), STag = Lang; Have (Verb). Edge labels: nsubj, dobj, aux.]
The authors wish to thank Marios Fokaefs and Ken Bauer for their help with the experiments.
Table 1: Summary of the results of syntactic/semantic analysis.