CodeSimian

CS491B – Andrew Weng

Motivation

• Academic integrity is a universal issue

• Plagiarism is still common today
  • Kaavya Viswanathan (Harvard student)
    • Book contains many plagiarized passages
  • Yoshihiko Wada (painter, Japan)
    • Artwork plagiarized from Alberto Sughi
  • Scott D. Miller (Wesley College President)
    • Plagiarized material found on his website

Is Plagiarism Harmful?

• Who does plagiarism really hurt?
  • The student
  • The class
  • The University

• Plagiarism is not only concerned with the protection of intellectual property rights

Plagiarism Detection

Benefits of Utilizing Plagiarism Detection

• Prevention

• Enforcement

• Objective standpoint

Platform Overview

• Developed on Visual Studio .NET 2005

• Coded in Microsoft Visual C# .NET

• Windows Forms application

• Simple and familiar GUI (Windows)

• Intended focus is ease of use

Theoretical Overview

CodeSimian is based on two primary principles

• Kolmogorov Complexity

• Information Distance

Kolmogorov Complexity

• Simple definition: the length of the shortest program that can be written on a universal Turing machine to produce a specified output

• Purely theoretical

• Impossible to calculate exactly


Define x to be a desired output string

K(x) = The length of the shortest program that produces x

K(x|y) = The length of the shortest program that produces x given y as an input

K(xy) = The length of the shortest program that produces x concatenated with y


Compare two numbers with infinitely long decimal expansions, π and a randomly generated number n between 0 and 1:

π = 3.1415926535897932384626433832795…

n = 0.5234958723957329875320935293853…

K(π) is a small and finite number, which represents the code required to generate the value of π to an infinite number of digits



Perhaps something as simple as the implementation of Leibniz’s formula:

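$$\pi = 4 \sum_{n=0}^{\infty} \frac{(-1)^n}{2n+1} = 4\left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9} - \frac{1}{11} + \cdots\right)$$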
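To make the point concrete, here is a minimal Python sketch of such a program. The function name `leibniz_pi` and the `terms` cutoff are illustrative assumptions, not part of CodeSimian; a true "infinite" generator would simply never stop adding terms.

```python
def leibniz_pi(terms):
    """Approximate pi with the first `terms` terms of Leibniz's series."""
    total = 0.0
    for n in range(terms):
        # n-th term of the series: (-1)^n / (2n + 1)
        total += (-1) ** n / (2 * n + 1)
    return 4 * total


print(leibniz_pi(100_000))
```

The program stays the same few lines no matter how many digits are requested, which is why K(π) is small.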


n = 0.5234958723957329875320935293853…

In order to generate the full output of a truly random number n, the length of the program would be infinitely long.

The code would essentially be System.out.println("0.52349587…");
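Since K cannot be computed exactly, a compressor can serve as a rough upper bound on it. The sketch below illustrates the contrast between a patterned string and a random one using Python's `zlib` module as a stand-in compressor. This is an assumption for illustration only, not the compressor CodeSimian uses, and the helper name `compressed_size` is hypothetical.

```python
import random
import zlib


def compressed_size(data):
    """Size in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


# A highly regular string: a short program ("repeat 0123456789") describes it.
patterned = b"0123456789" * 1000

# Pseudo-random bytes of the same length (fixed seed so the run is repeatable).
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(len(patterned)))

print(compressed_size(patterned), compressed_size(noise))
```

The patterned input shrinks to a small fraction of its length, while the random input does not shrink at all.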


So how does this apply to plagiarism detection?

Define x = π and y = π/4

K(x|y) would be a very small value. Given y, one can calculate π with a single multiplication by 4. In the same way, if one program is a lightly modified copy of another, K(x|y) is small.

Information Distance

The distance (or difference) between two objects

Formula used:

$$d(x, y) = 1 - \frac{K(x) - K(x \mid y)}{K(xy)}$$

Information Distance

• Similarity Factor

If we take the amount of information about x that is contained in y, K(x) − K(x|y), and normalize it by the amount of information in x and y together, K(xy), we obtain a percentage of similarity

$$s(x, y) = \frac{K(x) - K(x \mid y)}{K(xy)}$$
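Two limiting cases show how the similarity factor behaves. This is a sketch that ignores the small additive terms that appear in exact statements about K.

$$y = x: \quad K(x \mid y) \approx 0, \quad K(xy) \approx K(x) \quad\Rightarrow\quad s(x, y) \approx \frac{K(x) - 0}{K(x)} = 1$$

$$x, y \text{ unrelated}: \quad K(x \mid y) \approx K(x) \quad\Rightarrow\quad s(x, y) \approx \frac{K(x) - K(x)}{K(xy)} = 0$$

Identical inputs therefore score close to 100% and unrelated inputs close to 0%, with the distance given by d(x, y) = 1 − s(x, y).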

Implementation

What does CodeSimian do to obtain the similarity factors?

1. Parse and Tokenize the code

2. Compress the tokenized strings

3. Compare the compressed strings

Parsing the Code

• Utilized ANTLR to parse and tokenize the code

• ANTLR, ANother Tool for Language Recognition, (formerly PCCTS) is a language tool that provides a framework for constructing recognizers, compilers, and translators from grammatical descriptions containing Java, C#, C++, or Python actions. (www.antlr.org)

Tokenizing the Code

• The tokenized output is a string of characters, each of which represents a token within the code

• For Example:

{ int c = 0; } contains 7 “letters”:

1. {  (open bracket)

2. int  (integer type declaration)

3. c  (variable name)

4. =  (assignment operator)

5. 0  (integer value)

6. ;  (statement end)

7. }  (close bracket)
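The idea can be sketched in a few lines of Python. This is not ANTLR and not CodeSimian's grammar: it is a hypothetical regex-based lexer covering only the token kinds in the example, and the one-letter alphabet (`T`, `V`, `N`, and so on) is an assumption made for illustration.

```python
import re

# (token name, regex, one-letter code). TYPE must come before IDENT so that
# keywords such as "int" are not classified as variable names.
TOKEN_SPEC = [
    ("TYPE", r"\b(?:int|long|float|double|char|boolean)\b", "T"),
    ("IDENT", r"[A-Za-z_]\w*", "V"),
    ("NUMBER", r"\d+", "N"),
    ("ASSIGN", r"=", "="),
    ("SEMI", r";", ";"),
    ("LBRACE", r"\{", "{"),
    ("RBRACE", r"\}", "}"),
]

MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern, _ in TOKEN_SPEC))
LETTER = {name: letter for name, _, letter in TOKEN_SPEC}


def tokenize(source):
    """Map source text to a string with one letter per recognized token.

    Whitespace and any character not covered by TOKEN_SPEC are skipped.
    """
    return "".join(LETTER[match.lastgroup] for match in MASTER.finditer(source))


print(tokenize("{ int c = 0; }"))         # {TV=N;}
print(tokenize("{ int counter = 42; }"))  # {TV=N;}
```

Because every variable name maps to the same letter, renaming variables or changing literal values leaves the tokenized string unchanged.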

Compressing the String

This string is then compressed using a Lempel-Ziv compression algorithm with unbounded buffers

• As the string is read, a library of the text seen so far is built up.

• When a repeat is detected, the compressor emits a pointer into the library instead of writing the repeated section out again
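A minimal sketch of this kind of parse is shown below. It is a greedy Lempel-Ziv-style scheme written for illustration, assuming the library is simply everything read so far and that matches may be of any length. The function name `lz_items` and the item format are hypothetical, and the naive search is quadratic, which is acceptable only because efficiency is not the goal here.

```python
def lz_items(text, library=""):
    """Greedy Lempel-Ziv-style parse with an unbounded library.

    Returns a list of ("literal", symbol) and ("pointer", start, length)
    items. Pointers refer to positions in `library + text read so far`.
    """
    items = []
    i = 0
    while i < len(text):
        seen = library + text[:i]  # everything a pointer may refer to
        start, length = -1, 0
        # Extend the candidate match one symbol at a time while it still occurs in `seen`.
        while i + length < len(text):
            found = seen.find(text[i:i + length + 1])
            if found == -1:
                break
            start, length = found, length + 1
        if length == 0:
            items.append(("literal", text[i]))
            i += 1
        else:
            items.append(("pointer", start, length))
            i += length
    return items


print(lz_items("abcabc"))
# [('literal', 'a'), ('literal', 'b'), ('literal', 'c'), ('pointer', 0, 3)]
print(len(lz_items("{TV=N;}{TV=N;}")))  # 8
```

In the second call the repeated block costs a single pointer, so a duplicated 7-token block adds only one item to the output.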

Compressing the String

• Normally, limits exist on the library size and on the length of the “words” stored

• Memory utilization and efficiency are not important here, so those limits are dropped

• Lempel-Ziv is suitable for this application

Comparing the Compressed String

• K(x) is the size of the tokenized and compressed code x.

• K(x|y) is the size of the tokenized and compressed code x, given y as a “free” library

• K(xy) is the size of the tokenized and compressed concatenation of x and y.
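The three quantities above can be combined into a similarity factor as in the following sketch. It assumes a greedy Lempel-Ziv-style parse with an unbounded library and uses the number of emitted items as the "size". The function names `lz_size` and `similarity` are hypothetical, and CodeSimian's actual size measure may differ.

```python
def lz_size(text, library=""):
    """Number of literals/pointers in a greedy LZ-style parse of `text`.

    `library` is extra text the compressor may point into at no cost.
    """
    count = 0
    i = 0
    while i < len(text):
        seen = library + text[:i]
        length = 0
        # Longest run starting at i that already occurs in `seen`.
        while i + length < len(text) and text[i:i + length + 1] in seen:
            length += 1
        i += max(length, 1)  # a pointer covers `length` symbols, a literal covers one
        count += 1
    return count


def similarity(x, y):
    """s(x, y) = (K(x) - K(x|y)) / K(xy), with K approximated by lz_size."""
    k_x = lz_size(x)
    k_x_given_y = lz_size(x, library=y)
    k_xy = lz_size(x + y)
    if k_xy == 0:  # both inputs empty
        return 0.0
    return (k_x - k_x_given_y) / k_xy


print(similarity("abcdefgh", "abcdefgh"))  # 7/9, about 0.78
print(similarity("abcdefgh", "ijklmnop"))  # 0.0
```

For identical inputs the score is (K(x) − 1) / (K(x) + 1) under this measure, so it approaches 100% as the inputs grow. Very short strings score lower simply because the single pointer is a noticeable share of the total.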

Results

Using the test on trivial examples:

• LinkedList.java

• LinkedList2.java

• LinkedList3.java

• Changes were limited to renaming variables, reformatting, removing comments, rearranging variable declarations, and adding “junk” code such as random debugging text output.

• All files came out as >85% similar

Results

Using the test on a small real-world sample

Professor Kang’s CS201 HW1

• Relatively simple homework assignment

• 30-50% similarity average

• 95% similarity detected on one pair of submissions

• Confirmed by Professor Kang as correct

Results

Using the test on another small real-world sample

Professor Kang’s CS201 HW4

• More complex homework assignment involving 2-3 files; breakdown of Java files according to function

• Problem: specialized function files may possibly present false positives?

• 30-70% similarity average

• 95+% similarity detected on pairs of submissions

• Confirmed by Professor Kang as correct

Results

• Things to note…

• The results showed a similarity of 80% on one pair of submissions, which is deemed significant by the application but is not necessarily conclusive

• Careful inspection by hand of the suspected files revealed one block of code that was apparently copied with variable name changes

Conclusions

• Successful test cases

• Simple and straightforward to use

• Based on an objective principle which works!

Future Work

• Enhancing the application to be able to compare internal “blocks” of code

• Improving the compression algorithm to better handle and adapt to “approximate matches”

• Improving the functionality with the GUI

• Providing a report-printing capability for directories
