CodeSimian CS491B – Andrew Weng

Page 1: CodeSimian

CodeSimian

CS491B – Andrew Weng

Page 2: CodeSimian

Motivation

• Academic integrity is a universal issue

• Plagiarism is still common today

  • Kaavya Viswanathan (Harvard student): book contains many plagiarized passages

  • Yoshihiko Wada (painter, Japan): artwork plagiarized from Alberto Sughi

  • Scott D. Miller (Wesley College president): plagiarized material found on his website

Page 3: CodeSimian

Is Plagiarism Harmful?

• Who does plagiarism really hurt?

  • The student

  • The class

  • The University

• Plagiarism is not only a matter of protecting intellectual property rights

Page 4: CodeSimian

Plagiarism Detection

Benefits of Utilizing Plagiarism Detection

• Prevention

• Enforcement

• Objective standpoint

Page 5: CodeSimian

Platform Overview

• Developed on Visual Studio .NET 2005

• Coded in Microsoft Visual C# .NET

• Windows Forms application

• Simple and familiar GUI (Windows)

• Intended focus is ease of use

Page 6: CodeSimian

Theoretical Overview

CodeSimian is based on two primary principles

• Kolmogorov Complexity

• Information Distance

Page 7: CodeSimian

Kolmogorov Complexity

• Simple definition: The length of the shortest program that can be written on a universal Turing machine to produce a specified output

• Purely theoretical

• Impossible to calculate exactly

Page 8: CodeSimian

Kolmogorov Complexity

Define x to be a desired output string

K(x) = The length of the shortest program that produces x

K(x|y) = The length of the shortest program that produces x given y as an input

K(xy) = The length of the shortest program that produces x concatenated with y

Page 9: CodeSimian

Kolmogorov Complexity

Compare two numbers with infinitely long decimal expansions, π and a randomly generated number n between 0 and 1:

π = 3.1415926535897932384626433832795…

n = 0.5234958723957329875320935293853…

K(π) is a small, finite number: the length of the code required to generate the value of π to an infinite number of digits.

Page 10: CodeSimian

Kolmogorov Complexity

π = 3.1415926535897932384626433832795…

K(π) is a small, finite number: the length of the code required to generate the value of π to an infinite number of digits.

Perhaps something as simple as the implementation of Leibniz’s formula:

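$$\frac{\pi}{4} = \sum_{n=0}^{\infty} \frac{(-1)^n}{2n+1} = \frac{1}{1} - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9} - \frac{1}{11} + \dots$$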
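As a rough illustration of why K(π) is small, the series can be written as a program of only a few lines. This is a sketch in Python, not code from CodeSimian; the function name `leibniz_pi` is made up for the example, and floating-point arithmetic limits the precision actually reached. The point is only that the program's length does not grow with the number of terms requested.

```python
def leibniz_pi(terms: int) -> float:
    """Approximate pi using the first `terms` terms of Leibniz's series."""
    total = 0.0
    for n in range(terms):
        # n-th term of the series: (-1)^n / (2n + 1)
        total += (-1) ** n / (2 * n + 1)
    # The series converges to pi/4, so multiply by 4.
    return 4 * total


# More terms give a closer approximation; the source code stays the same size.
approx = leibniz_pi(100_000)
```

The series converges slowly (the error shrinks roughly in proportion to 1/terms), so it is a poor way to compute π in practice, but it serves as a short description of it.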

Page 11: CodeSimian

Kolmogorov Complexity

n = 0.5234958723957329875320935293853…

To generate the full output of a truly random number n, the program itself would have to be infinitely long, because the digits follow no pattern that shorter code could exploit.

The code would essentially be System.out.println("0.52349587…");
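The contrast between patterned and patternless output can be observed with any general-purpose compressor. The sketch below uses Python's `zlib` as a stand-in, which is an assumption made for illustration and not the compressor used by CodeSimian. One caveat: the "random" digits come from a seeded pseudo-random generator, so strictly speaking they have a short description (the generator plus its seed); the compressor simply cannot discover it.

```python
import random
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of `text` after zlib compression at the highest level."""
    return len(zlib.compress(text.encode("ascii"), 9))


rng = random.Random(0)  # fixed seed so the example is repeatable
random_digits = "".join(rng.choice("0123456789") for _ in range(10_000))
patterned_digits = "0123456789" * 1_000  # same length, obvious pattern

random_size = compressed_size(random_digits)
patterned_size = compressed_size(patterned_digits)

# The patterned string shrinks to a tiny fraction of its length, while the
# random-looking one stays close to its information content.
print(patterned_size, random_size)
```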

Page 12: CodeSimian

Kolmogorov Complexity

So how does this apply to plagiarism detection?

Define x = π and y = π/4

K(x|y) would be a very small value. Given y, one can compute π with a single multiplication by 4.

Page 13: CodeSimian

Information Distance

The distance (or difference) between two objects

Formula used:

$$d(x, y) = 1 - \frac{K(x) - K(x \mid y)}{K(xy)}$$
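Two limiting cases help in reading the distance. They are sketched here under the usual idealized approximations (which ignore small additive terms), not derived in the slides. If y is identical to x, then y supplies everything needed to produce x, so K(x|y) ≈ 0 and K(xy) ≈ K(x). If x and y share no information, then knowing y does not help, so K(x|y) ≈ K(x).

$$y = x: \quad d(x, y) \approx 1 - \frac{K(x) - 0}{K(x)} = 0$$

$$x, y \text{ unrelated}: \quad d(x, y) \approx 1 - \frac{K(x) - K(x)}{K(xy)} = 1$$

Identical objects are therefore at distance close to 0 and unrelated objects at distance close to 1.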

Page 14: CodeSimian

Information Distance

• Similarity Factor

If we take the amount of information in x that is already contained in y, K(x) − K(x|y), and normalize it by the amount of information in x and y combined, K(xy), we obtain a percentage of similarity

$$s(x, y) = \frac{K(x) - K(x \mid y)}{K(xy)}$$
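The arithmetic can be written out directly. The sketch below takes the three quantities as plain numbers; the function names and the sample values are hypothetical and chosen only to show how the similarity factor and the distance relate (s = 1 − d).

```python
def similarity(k_x: float, k_x_given_y: float, k_xy: float) -> float:
    """Similarity factor s(x, y) = (K(x) - K(x|y)) / K(xy)."""
    return (k_x - k_x_given_y) / k_xy


def distance(k_x: float, k_x_given_y: float, k_xy: float) -> float:
    """Information distance d(x, y) = 1 - s(x, y)."""
    return 1.0 - similarity(k_x, k_x_given_y, k_xy)


# Hypothetical sizes: x alone needs 1000 units, only 100 once y is known,
# and x and y together need 1100.
s = similarity(1000, 100, 1100)
d = distance(1000, 100, 1100)
print(f"similarity = {s:.1%}, distance = {d:.1%}")
```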

Page 15: CodeSimian

Implementation

What does CodeSimian do to obtain the similarity factors?

1. Parse and Tokenize the code

2. Compress the tokenized strings

3. Compare the compressed strings

Page 16: CodeSimian

Parsing the Code

• Utilized ANTLR to parse and tokenize the code

• ANTLR, ANother Tool for Language Recognition, (formerly PCCTS) is a language tool that provides a framework for constructing recognizers, compilers, and translators from grammatical descriptions containing Java, C#, C++, or Python actions. (www.antlr.org)

Page 17: CodeSimian

Tokenizing the Code

• The tokenized output is a string of characters, each of which represents a token within the code

• For Example:

{ int c = 0; } contains 7 “letters”:

• {  Open Bracket

• int  Integer type declaration

• c  Variable name

• =  Assignment operator

• 0  Integer Value

• ;  Statement end

• }  Close Bracket
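The idea of one letter per token can be sketched with a toy tokenizer. This is not ANTLR and not CodeSimian's grammar: the token classes, the regular expressions, and the letters chosen below are all assumptions made for illustration, and they cover only the tiny subset of Java needed for the example above.

```python
import re

# Hypothetical token classes and their one-letter codes. Order matters:
# type keywords must be tried before generic identifiers.
TOKEN_SPEC = [
    ("T", r"\b(?:int|long|double|boolean|char)\b"),  # type declaration
    ("N", r"\d+"),                                    # integer value
    ("V", r"[A-Za-z_]\w*"),                           # variable name
    ("=", r"="),                                      # assignment operator
    (";", r";"),                                      # statement end
    ("{", r"\{"),                                     # open bracket
    ("}", r"\}"),                                     # close bracket
]

# One alternation with a named group per token class (g0, g1, ...).
MASTER = re.compile(
    "|".join(f"(?P<g{i}>{pattern})" for i, (_, pattern) in enumerate(TOKEN_SPEC))
)


def tokenize(code: str) -> str:
    """Return one letter per recognized token; whitespace is skipped."""
    letters = []
    for match in MASTER.finditer(code):
        index = int(match.lastgroup[1:])  # "g2" -> 2
        letters.append(TOKEN_SPEC[index][0])
    return "".join(letters)


original = tokenize("{ int c = 0; }")
renamed = tokenize("{   int counter = 42;   }")
```

Because variable names and literal values collapse to a single letter each, renaming `c` to `counter` or reformatting the whitespace leaves the tokenized string unchanged, which is what makes such edits ineffective as a disguise.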

Page 18: CodeSimian

Compressing the String

This string is then compressed using a Lempel-Ziv compression algorithm with unbounded buffers

• As the string is read, a library of the sequences seen so far is built up.

• When a repeat is detected, the compressor emits a pointer into the library instead of storing the repeated section again
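The library-and-pointer mechanism can be sketched with LZ78, one member of the Lempel-Ziv family whose dictionary grows without bound. The slides do not say which variant CodeSimian uses, so treating it as LZ78 is an assumption; the function names are also made up for the example.

```python
def lz78_compress(text: str) -> list[tuple[int, str]]:
    """Encode `text` as (library index, next character) pairs."""
    library = {"": 0}  # phrase -> index; grows as the text is read
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in library:
            phrase += ch  # keep extending a phrase already in the library
        else:
            output.append((library[phrase], ch))  # pointer + one new character
            library[phrase + ch] = len(library)
            phrase = ""
    if phrase:
        # Leftover phrase is already known; emit it as prefix pointer + last char.
        output.append((library[phrase[:-1]], phrase[-1]))
    return output


def lz78_decompress(pairs: list[tuple[int, str]]) -> str:
    """Rebuild the text by replaying the same library construction."""
    phrases = [""]
    pieces = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        pieces.append(phrase)
    return "".join(pieces)


tokens = "{TV=N;}" * 20
pairs = lz78_compress(tokens)
```

Repetitive token strings produce far fewer pairs than characters, because each new pair can point at an ever longer phrase already stored in the library.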

Page 19: CodeSimian

Compressing the String

• Normally, limits are placed on the library size and on the length of the “words” stored

• Memory utilization and efficiency are not important for this application, so those limits are dropped

• Lempel-Ziv is suitable for this application

Page 20: CodeSimian

Comparing the Compressed String

• K(x) is approximated by the size of the tokenized and then compressed code x.

• K(x|y) is approximated by the size of the tokenized and then compressed code x, given y as a “free” library

• K(xy) is approximated by the size of the tokenized and then compressed concatenation of x and y.
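These three estimates can be sketched with Python's `zlib`, using its preset-dictionary option (`zdict`) to play the role of y as a “free” library. This is a stand-in, not CodeSimian's compressor: DEFLATE has a bounded 32 KB window rather than unbounded buffers, and the function names and the random sample token strings are assumptions made for the example.

```python
import random
import zlib


def size(data: bytes) -> int:
    """Estimate of K(x): compressed size of x."""
    return len(zlib.compress(data, 9))


def size_given(x: bytes, y: bytes) -> int:
    """Estimate of K(x|y): compressed size of x with y preloaded as a dictionary."""
    compressor = zlib.compressobj(level=9, zdict=y)
    return len(compressor.compress(x) + compressor.flush())


def estimated_similarity(x: bytes, y: bytes) -> float:
    """(K(x) - K(x|y)) / K(xy), with every K replaced by a compressed size."""
    return (size(x) - size_given(x, y)) / size(x + y)


def random_tokens(seed: int, length: int = 2000) -> bytes:
    """Hypothetical tokenized program: random letters from a small alphabet."""
    rng = random.Random(seed)
    return "".join(rng.choice("TV=N;{}()") for _ in range(length)).encode("ascii")


x = random_tokens(1)
unrelated = random_tokens(2)

same_score = estimated_similarity(x, x)               # y is a copy of x
unrelated_score = estimated_similarity(x, unrelated)  # y shares nothing with x
print(f"copy: {same_score:.0%}, unrelated: {unrelated_score:.0%}")
```

A copy scores close to 100% and an unrelated string close to 0%; compressor overhead keeps the estimates from reaching those limits exactly.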

Page 21: CodeSimian

Results

Using the test on trivial examples:

• LinkedList.java

• LinkedList2.java

• LinkedList3.java

• The only changes between the files were renamed variables, reformatting, removed comments, rearranged variable declarations, and added “junk” code such as random debugging text output.

• All files came out as >85% similar

Page 22: CodeSimian

Results

Using the test on a small real-world sample

Professor Kang’s CS201 HW1

• Relatively simple homework assignment

• 30-50% similarity average

• 95% similarity detected on one pair of submissions

• Confirmed by Professor Kang as correct

Page 23: CodeSimian

Results

Using the test on another small real-world sample

Professor Kang’s CS201 HW4

• More complex homework assignment involving 2-3 files; Java files broken down according to function

• Possible problem: specialized function files may present false positives

• 30-70% similarity average

• 95+% similarity detected on pairs of submissions

• Confirmed by Professor Kang as correct

Page 24: CodeSimian

Results

• Things to note…

• The results showed a similarity of 80% on one pair of submissions, which is deemed significant by the application but is not necessarily conclusive

• Careful inspection by hand of the suspected files revealed one block of code that was apparently copied with variable name changes

Page 25: CodeSimian

Conclusions

• Successful test cases

• Simple and straightforward to use

• Based on an objective principle which works!

Page 26: CodeSimian

Future Work

• Enhancing the application to be able to compare internal “blocks” of code

• Improving the compression algorithm to better handle and adapt to “approximate matches”

• Improving the functionality of the GUI

• Providing the ability to print reports for directories