flickr, cc by-nc jobadge, 2011
Technical possibilities of detecting plagiarism -
Comparative analysis of detection tools
Katrin Köhler (B.Sc.)
Plagiarism - legal, moral and educational aspects, Amsterdam, 2011-12-14
Slides based on Debora Weber-Wulff, edited by Katrin Köhler
About me
• Research assistant of Prof. Dr. Weber-Wulff
since 2007
• Software tests in 2008 and 2010
• Master's thesis on “Cryptographic Watermarking
for Texts”
Contents
• Plagiarism Detection Test 2010
• Doctoral thesis of Karl-Theodor zu Guttenberg
• Discovering plagiarism
Teachers and administrations
want a simple solution
Photo: Flickr cc-by-nc-sa: xtrarant, 2008
Art Installation: Jamie Pawlus, Indianapolis, Indiana, 2003
Plagiarism detection software
• Can be extremely expensive!
• Teachers want to have all papers
marked original or plagiarism before
they start reading them.
• Students are afraid of wrongly being
labeled plagiarists.
• Only a teacher can decide if it is indeed
plagiarism! Software cannot be used to solve
social problems.
• Prof. Dr. Weber-Wulff has tested plagiarism
detection software 4.5 times: 2004, 2007, 2008 and
2010, with zu Guttenberg’s thesis counting as the half
Test process 2010
• 9 months of work by 2 people
• 42 test cases in English, German
and Japanese
• Different types of plagiarism,
a few originals
• Market survey
• Access to the systems
• 48 systems found, 26 could be
completely evaluated
Evaluation metric: Effectiveness
• Plagiarism or not:
What was found?
• Total
• Without the first 10 tests
(Google accident)
• English cases
• Japanese cases as additional
challenge
➡ No winner: the scores form a continuum
between 55% and 64%
Flickr, cc-by, arthit, 2005
Evaluation metric: Usability
• Design, language consistency, navigation,
labelling, print quality of the reports, fits in
university processes
• Support by email:
Speed, good answers
• Top: PlagScan, followed by
PlagiarismFinder, Ephorus,
PlagAware and TurnItIn
Flickr, cc-by, Quapan, 2008
Evaluation metric: Professionalism
• Street address with town, telephone
number, name of a person
• Domain registration in own name
• No parallel offers of term papers or
pornography or advertising for such services
• German-speaking availability by telephone
during German working hours
• No installation of viruses
➡PlagiarismFinder, followed by PlagAware,
Strike Plagiarism, TurnItIn, Docoloc,
PlagScan, Blackboard
Flickr, cc-by-sa, sludgegulper, 2008
Problems: Effectiveness
• Nothing found from books - not
even if they are in Google
Books!
• We had one 100% plagiarism
from Google books register at
less than 25%
• Translations are not found
Problems: Effectiveness
• Umlauts cause problems, although less so than
in earlier tests
• Redacted texts are found less often
• Many systems very
difficult to use
• Not all companies
trustworthy
• Some keep copies - and
award themselves
rights to use the text!
Problems: Usability
• Language mix
• Workflow problems
• The reports are generally not useful
Problems: Professionalism
• No info, no names
• The address listed is a parking lot
• Support questions go unanswered, nobody picks up
the telephone
• Term papers or pornography offered in parallel,
all rights given to the company
How to rank?
• No system was best in all of the metrics
• We set up a ranking for each of the five criteria
(three effectiveness, one usability, one
professionalism)
• Calculated the average ranking
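The averaging step can be sketched roughly as follows. This is a minimal illustration that assumes all five criteria carry equal weight; the criterion names, system names and rank values are made-up placeholders, not data from the 2010 test.

```python
from statistics import mean

# Assumed labels for the five criteria (three effectiveness, one usability,
# one professionalism). The names are illustrative, not from the slides.
CRITERIA = [
    "effectiveness_1",
    "effectiveness_2",
    "effectiveness_3",
    "usability",
    "professionalism",
]


def average_ranking(ranks_by_system):
    """Return (system, mean rank) pairs, best (lowest mean rank) first."""
    averages = {
        system: mean(ranks[criterion] for criterion in CRITERIA)
        for system, ranks in ranks_by_system.items()
    }
    return sorted(averages.items(), key=lambda item: item[1])


# Hypothetical ranks (1 = best); placeholder values only.
example = {
    "System A": dict(zip(CRITERIA, [1, 2, 1, 5, 4])),
    "System B": dict(zip(CRITERIA, [3, 1, 2, 1, 2])),
    "System C": dict(zip(CRITERIA, [2, 3, 3, 3, 3])),
}

for system, avg in average_ranking(example):
    print(f"{system}: {avg:.1f}")
```

In this toy data no system is best in every criterion, yet the mean rank still yields a single ordering, which is the point of the averaging step.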
Results: Useful
• There were no systems in
this category - only humans
are able to reach this level of
effectiveness.
Flickr, cc-by-nc, dianejp, 2009
Partially useful: PlagAware
• German system
• Good documentation
• Average effectiveness: 61%
• But: each file must be submitted by itself (5
clicks!), this does not fit with the workflow
• Looks for plagiarism in online texts
Partially useful: turnitin
• Best results for material that is stored in their
database
• Translation problems
• Umlaut problems
• Returns Wikipedia copies with ads for porn
• The source URLs reported are often no longer
valid
• Just adds up the percent values for the
“originality” report
• Only system to deal with Japanese properly
Partially useful: Ephorus
• Dutch system
• Direct mail-in using Hand-In-Code
• Reports by E-Mail
• Stores texts aggressively
• Problems with umlauts
Partially useful: PlagScan
• Newcomer from Germany
• One purchases “PlagPoints”
• Useful: Subaccounts for teachers
• First place in usability
• Three kinds of report, none of which is a
side-by-side report
• Only 60% in effectiveness
Partially useful: Urkund
• Swedish system
• Second in overall effectiveness
• 13th in usability and professionalism
• Language problems
• Complex navigation
• Catastrophic layout
• Unusable reports
• Cryptic error messages
• Test cases from 2008 were still stored
Barely useful systems
• They find something, but miss a lot
• They are not really easy to use
• They have professionalism problems
• Docoloc, Copyscape, Blackboard/SafeAssign, PlagiarismFinder, Plagiarisma, Compilatio, StrikePlagiarism, The Plagiarism Checker
checkforplagiarism.net
• In 2007 it was called
iPlagiarismcheck.com
• Was a plagiarism of
turnitin, but they said:
These are the sources!
• Charges 15 €
for 5 tests; students
are the target group
• turnitin set up a
honeypot
Viper
• Is installed on a PC
• In the terms of use: You give us
irrevocable rights to use your text
as we see fit
• Also runs a paper mill
• Complicated reports
• Only 24% effectiveness -
better to toss a coin!
• Advertises in the UK by power-cleaning
the sidewalks
Test Results
• 38 of the (at the time of the test) 131 known
sources were found by at least one of the
systems
• Many of these sources are no longer online
• Share of all known sources found, per system:
System        Sources found   Share of 131
iThenticate   30              23 %
PlagScan      19              15 %
Urkund        16              12 %
PlagAware     7               5 %
Ephorus       6               5 %
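The percentages listed above can be reproduced from the raw counts. The sketch below assumes each share is the number of sources a system found divided by the 131 known sources, rounded to a whole percent; the function and variable names are illustrative.

```python
KNOWN_SOURCES = 131  # sources known at the time of the test (from the slide)

# Sources found per system, as listed on the slide.
found = {
    "iThenticate": 30,
    "PlagScan": 19,
    "Urkund": 16,
    "PlagAware": 7,
    "Ephorus": 6,
}


def share_found(count, total=KNOWN_SOURCES):
    """Percentage of known sources found, rounded to a whole percent."""
    return round(100 * count / total)


shares = {name: share_found(count) for name, count in found.items()}

for name, percent in shares.items():
    print(f"{name:<12} {found[name]:>3} {percent:>3} %")
```

Under this assumption the computed values match the slide: even the best system located fewer than a quarter of the known sources.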
We tested these systems on
zu Guttenberg's thesis
• The usability for such large
works was extremely poor
• The numbers appear to be
random
• Many sources throw a 404
“file not found” error with
iThenticate
• Nothing from books (or the
Bundestag) was found
The major problem is:
• They don’t find plagiarism! They find only (marginally
changed) copies of text, even when properly referenced!
Flickr, cc-by-nc, Leeks, 2006
So let’s have a look ourselves...
• But doesn’t the thesis have to be available
digitally?
• And the thesis is so long?
• And the Internet
is extremely
large?
Flickr, cc-by-nc-nd, t_buchtele, 2009
Suspicion
• Upon careful reading you find it nicely written,
but ...
• The style is too polished, the vocabulary not that
of your students.
• There is some
strange formatting
• Interesting spelling
errors
• Lurching breaks in style
Flickr, cc-by, redcctshirt, 2009
Searching with Google & Co
• Phrase in "..."
• 3-5 nouns
• The typo
• Check the second page
of hits
• Set a time limit
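The first two search tips can be turned into query strings mechanically. The sketch below is only an illustration: the helper names are hypothetical, and the base URL is an assumed example of a search engine address that takes the query in a `q` parameter.

```python
from urllib.parse import urlencode


def phrase_query(phrase):
    """Wrap a suspicious phrase in double quotes for an exact-match search."""
    return '"' + phrase.strip().replace('"', "") + '"'


def keyword_query(nouns):
    """Join 3 to 5 distinctive nouns into a keyword query."""
    if not 3 <= len(nouns) <= 5:
        raise ValueError("use 3 to 5 nouns")
    return " ".join(nouns)


def search_url(query, base="https://www.google.com/search"):
    """Build a search URL; the base address is an assumption."""
    return base + "?" + urlencode({"q": query})


print(search_url(phrase_query("lurching breaks in style")))
print(search_url(keyword_query(["watermarking", "texts", "cryptography"])))
```

A typo copied along with the text makes a particularly good exact phrase, since few pages other than the source will contain it.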
Flickr, cc-by-nc-nd, Athena1970, 2008
Thank you!
• Portal Plagiarism
http://plagiat.htw-berlin.de
• Plagiarism-Blog:
http://copy-shake-paste.blogspot.com/
• Homepage:
http://www.f4.htw-berlin.de/~weberwu/
• Contact: katrin.koehler@student.htw-berlin.de
c. 2011: Axel Völcker,
DerWedding.de