CMCD: Count Matrix based Code Clone Detection
Yang Yuan and Yao Guo
Key Laboratory of High-Confidence Software Technologies (Ministry of Education)
Peking University
Code Clones
• In software development, it is common to reuse code fragments by copying them, with or without minor modifications.
• Such code fragments are called code clones. [Jurgens et al., ICSE 2009]
Scenario-based Evaluation
[Figures: original and copy code examples for Scenarios #1, #2, #3, and #4]
Importance of Code Clones
• Code clones cause trouble:
  – Increase the complexity of the source code
  – Increase the maintenance cost of the software system
  – Increase the likelihood of bugs
• 7%-23% of the code in large software systems is cloned. [Roy et al., SCP 2009]
• Detecting code clones may help:
  – Analyze the programming habits of the programmers
  – Find the design patterns in the source code
Previous Work in Clone Detection
• Lower level:
  – Textual approach
    • SDD [Lee and Jeong, OOPSLA 2005]
    • NICAD [Roy and Cordy, ICPC 2008]
    • ...
  – Lexical approach
    • DUP [Baker, WCRE 1995]
    • CCFinder [Kamiya et al., TSE 2002]
    • CP-Miner [Li et al., OSDI 2004, TSE 2006]
    • ...
• Higher level:
  – Syntactic approach
    • CloneDr [Baxter et al., ICSM 1998]
    • Deckard [Jiang et al., ICSE 2007]
    • CloneDigger [Bulychev, SyRCoSE 2008]
    • ...
  – Semantic approach
    • Duplix [Krinke, WCRE 2001]
    • GPLAG [Liu et al., KDD 2006]
    • ...
Challenges
Low-level approaches
• Faster
• Usually focus on local characteristics
• No idea of global meaning
High-level approaches
• Slower
• Better understanding of the programs
• Difficult to scale
GAP between low-level and high-level approaches
Our idea
• A novel count matrix based clone detection approach.
• Benefits of counting
  – By ignoring the order of variables, it can identify clones in which statements have been swapped, which is difficult for both lexical and syntactic approaches.
  – Easy to calculate and implement
    • Reduces space and time complexity
Count Matrix Construction
Token Sequence -> Count Vector -> Count Matrix
Token sequence:
tot,=,n,+,Find,(,n,),for,i,=,1,to,n,-,1,if,a,[,i,],>,a,[,j,],,k,=,a,[,i,]….
Count matrix (one count vector per variable):
tot  1 0 0 … 0
i    3 0 0 … 2
j    1 0 0 … 1
a    3 0 0 … 3
n    2 1 0 … 0
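The construction above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the slides do not list the actual counting dimensions, so the three columns used here (USED, DEFINED, IN_SUBSCRIPT) and the class name CountMatrixSketch are hypothetical, and the input is assumed to be an already tokenized sequence plus the set of variable names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CountMatrixSketch {
    // Hypothetical counting dimensions; the real CMCD columns are not given in the slides.
    static final int USED = 0;          // every occurrence of the variable
    static final int DEFINED = 1;       // occurrences immediately followed by "="
    static final int IN_SUBSCRIPT = 2;  // occurrences inside "[ ... ]"

    /** Builds one count vector per variable from a flat token sequence. */
    static Map<String, int[]> build(List<String> tokens, Set<String> variables) {
        Map<String, int[]> matrix = new LinkedHashMap<>();
        int bracketDepth = 0;
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.equals("[")) { bracketDepth++; continue; }
            if (t.equals("]")) { bracketDepth--; continue; }
            if (!variables.contains(t)) continue;   // operators, keywords, literals
            int[] cv = matrix.computeIfAbsent(t, k -> new int[3]);
            cv[USED]++;
            if (i + 1 < tokens.size() && tokens.get(i + 1).equals("=")) cv[DEFINED]++;
            if (bracketDepth > 0) cv[IN_SUBSCRIPT]++;
        }
        return matrix;
    }
}
```

Because each row only accumulates counts, reordering whole statements in the token sequence leaves every count vector unchanged, which is the property the counting idea relies on.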
Comparison Algorithms
• Goal:
  – Find more scenario #4 clones, covering more transformations such as statement swapping
  – Run fast
• General principles:
  – Compare individual variables instead of variable sequences
  – Ignore variable order in the count matrix
Bipartite Graph Matching
• Use bipartite graph matching to find code clones at different granularities:
  – Bottom-up approach
    • Can be used to compute the similarity between two projects, two classes, or two methods
  – Use two kinds of bipartite graph matching
    • KM algorithm (low-level, slow, accurate)
    • Hungarian algorithm (high-level, fast, inaccurate)
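As a rough illustration of matching variables between two count matrices, the sketch below runs an unweighted maximum bipartite matching with augmenting paths. It is not the CMCD implementation: the edge rule (two count vectors are connected when their Euclidean distance is at most maxDistance), the similarity score (matched rows divided by the larger row count), and all names are assumptions made for this example. A weighted matching in the KM style would use the distances as edge weights instead of thresholding them.

```java
import java.util.Arrays;

final class VariableMatchingSketch {
    /** Euclidean distance between two count vectors of equal length. */
    static double distance(int[] u, int[] v) {
        double sum = 0;
        for (int k = 0; k < u.length; k++) {
            double d = u[k] - v[k];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Returns matchOfB, where matchOfB[j] is the row of a matched to b[j], or -1. */
    static int[] match(int[][] a, int[][] b, double maxDistance) {
        boolean[][] edge = new boolean[a.length][b.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < b.length; j++)
                edge[i][j] = distance(a[i], b[j]) <= maxDistance;  // assumed edge rule
        int[] matchOfB = new int[b.length];
        Arrays.fill(matchOfB, -1);
        for (int i = 0; i < a.length; i++)
            augment(i, edge, matchOfB, new boolean[b.length]);
        return matchOfB;
    }

    /** Tries to find an augmenting path starting from row i of a. */
    private static boolean augment(int i, boolean[][] edge, int[] matchOfB, boolean[] visited) {
        for (int j = 0; j < matchOfB.length; j++) {
            if (!edge[i][j] || visited[j]) continue;
            visited[j] = true;
            if (matchOfB[j] == -1 || augment(matchOfB[j], edge, matchOfB, visited)) {
                matchOfB[j] = i;
                return true;
            }
        }
        return false;
    }

    /** Assumed score: fraction of rows matched, relative to the larger matrix. */
    static double similarity(int[][] a, int[][] b, double maxDistance) {
        int matched = 0;
        for (int m : match(a, b, maxDistance)) if (m >= 0) matched++;
        int larger = Math.max(a.length, b.length);
        return larger == 0 ? 1.0 : (double) matched / larger;
    }
}
```

Since rows are matched individually, two methods whose variables appear in a different order still receive the same score. Applying the same matching one level up (methods within classes, classes within projects) would give the bottom-up comparison described above.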
Optimization
• Use the Euclidean metric to compute the similarity of count vectors (CVs)
• Use a quick rejection algorithm to improve speed
• Eliminate false positives:
  – Cut and check
  – Slice and match
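The slides do not say how quick rejection works, so the following is only one plausible sketch. It assumes that two CVs count as similar when their Euclidean distance is at most a threshold, and it rejects a pair early using the reverse triangle inequality, | ||u|| - ||v|| | <= ||u - v||: if the norms already differ by more than the threshold, the full distance cannot be within it. The class and method names are hypothetical.

```java
final class QuickRejectSketch {
    /** Euclidean norm of a count vector; can be cached once per variable. */
    static double norm(int[] v) {
        double sum = 0;
        for (int x : v) sum += (double) x * x;
        return Math.sqrt(sum);
    }

    /** Euclidean distance between two count vectors of equal length. */
    static double distance(int[] u, int[] v) {
        double sum = 0;
        for (int k = 0; k < u.length; k++) {
            double d = u[k] - v[k];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** True when the norms alone prove that distance(u, v) exceeds the threshold. */
    static boolean quickReject(double normU, double normV, double threshold) {
        return Math.abs(normU - normV) > threshold;
    }

    /** Assumed similarity test: cheap rejection first, full distance only if needed. */
    static boolean similar(int[] u, int[] v, double threshold) {
        if (quickReject(norm(u), norm(v), threshold)) return false;
        return distance(u, v) <= threshold;
    }
}
```

The rejection test never discards a pair that the full test would accept, so it changes running time but not results. With cached norms it costs a single subtraction per pair.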
Implementation
• Use Soot to convert Java -> Jimple [Vallee-Rai et al., CASCON 1999]
  – 3-address intermediate representation
  – Smaller language set
  – Breaks complex statements into basic ones
  – Does not change the meaning of the program
• A new version of CMCD does not use Soot
Overview
Performance Comparison to Deckard
[Figure: compare time in seconds (log scale, 0.1 to 10000) versus similarity setting: 1.0(1.0), 0.95(0.9999), 0.9(0.999), 0.85(0.99), 0.8(0.95). Series: Stage1, Stage2, Stage3, Stage2+key, Stage3+key, Deckard. Data labels as extracted: 833565 571 636 2274]
Scenario-based Evaluation
Based on the scenario classification from the paper by Roy et al., “Comparison and Evaluation of Code Clone Detection Techniques”
Detecting Plagiarism
• Student-submitted compiler lab projects
  – 29 submissions
  – 106-251 Java classes
  – 7,825-38,086 lines of code
• Experimental results
  – Running time: 123 minutes
  – 2 clusters of code clones, each with 3 copies
  – Confirmed
  – Now used by two courses at Peking University for detecting plagiarism in students’ homework
Analyzing JDK 1.6 Source Code
• JDK 1.6.0_18
  – 7,197 files
  – 2,079,166 LoC
• Experimental results
  – Running time: 163 minutes
  – Found: 786 methods in 174 clusters (small methods are omitted)
Code Comparison: Two Clones
Method 1: (in com.sun.corba.se.impl.ior.iiop.SyncFactory)
public static SyncFactory getSyncFactory() {
    if (syncFactory == null) {
        synchronized (SyncFactory.class) {
            if (syncFactory == null) {
                syncFactory = new SyncFactory();
            } //end if
        } //end synchronized block
    } //end if
    return syncFactory;
}
Method 2: (in javax.swing.JComponent)
static Set<KeyStroke> getManagingFocusBackwardTraversalKeys() {
    synchronized (JComponent.class) {
        if (managingFocusBackwardTraversalKeys == null) {
            managingFocusBackwardTraversalKeys = new HashSet<KeyStroke>(1);
            managingFocusBackwardTraversalKeys.add(
                KeyStroke.getKeyStroke(KeyEvent.VK_TAB,
                    InputEvent.SHIFT_MASK | InputEvent.CTRL_MASK));
        }
    }
    return managingFocusBackwardTraversalKeys;
}
Detected a Bug
Method 1: (in com.sun.corba.se.impl.ior.iiop.SyncFactory)
public static SyncFactory getSyncFactory() {
    if (syncFactory == null) {
        synchronized (SyncFactory.class) {
            if (syncFactory == null) {
                syncFactory = new SyncFactory();
            } //end if
        } //end synchronized block
    } //end if
    return syncFactory;
}
Method 3: (in com.sun.corba.se.impl.ior.iiop.JavaSerializationComponent)
public static JavaSerializationComponent singleton() {
    if (singleton == null) {
        synchronized (JavaSerializationComponent.class) {
            singleton = new JavaSerializationComponent(Message.JAVA_ENC_VERSION);
        }
    }
    return singleton;
}
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6999537
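Comparing the two clones shows what Method 3 lacks: the second null check inside the synchronized block. Without it, two threads that both pass the outer check will each construct an instance, one after the other. The sketch below shows the usual double-checked locking shape on a hypothetical class, LazySingletonSketch. It is not the fix applied in the JDK; the volatile modifier and the construction counter are additions made for this illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class LazySingletonSketch {
    // Counts constructor calls so the single-construction property can be observed.
    private static final AtomicInteger CREATED = new AtomicInteger();

    // volatile (an addition in this sketch) publishes the constructed object safely.
    private static volatile LazySingletonSketch instance;

    private LazySingletonSketch() {
        CREATED.incrementAndGet();
    }

    static LazySingletonSketch singleton() {
        if (instance == null) {                      // first check, no lock taken
            synchronized (LazySingletonSketch.class) {
                if (instance == null) {              // second check, absent in Method 3
                    instance = new LazySingletonSketch();
                }
            }
        }
        return instance;
    }

    static int constructionCount() {
        return CREATED.get();
    }
}
```

However many threads call singleton() concurrently, the constructor runs once, because any thread that waited on the lock sees the non-null field at the second check.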
Conclusion
• We propose a code clone detection approach, CMCD:
  – Extracts count-based information
  – Language independent
  – Scales to large programs (> 1M LoC)
• Capabilities
  – Performs well in scenario-based evaluation
  – Detects code plagiarism in students’ homework
  – Identifies a potential bug in JDK source code