Re: Brainstorming session: detecting plagiarism

Professors already use essay-comparison software. It's not new, but it does a fairly effective job. Looking for matching fragments is straightforward, even tiny fragments like Markov chains or statistically improbable word pairs. However, if you compared two news feeds, say Reuters and Associated Press on the same day, even those criteria would fall apart, since both report the same events in much the same words.

Separately, you might need to refine your personal definition of plagiarism. I'd say the two lines below show that fairly mechanical paraphrasing is going on, but it's not clear that it rises to the level of plagiarism. Also, how do you allow for proper quotations?

Macbeth is presented as a  mature    man of definitely established character.
Macbeth is shown     as an empowered man of well-      established character.
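
To make the "mechanical paraphrasing" point concrete, here is a minimal sketch that aligns the two Macbeth lines word by word and lists the runs they share. It uses Python's standard `difflib`; the helper name `shared_word_runs` is made up for illustration, and this is not the comparison method anyone in this thread has settled on.

```python
from difflib import SequenceMatcher

def shared_word_runs(a: str, b: str):
    """Return (similarity ratio, word runs common to both sentences, in order)."""
    wa, wb = a.lower().split(), b.lower().split()
    sm = SequenceMatcher(None, wa, wb, autojunk=False)
    # get_matching_blocks() ends with a zero-length sentinel, so skip size 0.
    runs = [" ".join(wa[m.a:m.a + m.size])
            for m in sm.get_matching_blocks() if m.size]
    return sm.ratio(), runs

a = "Macbeth is presented as a mature man of definitely established character."
b = "Macbeth is shown as an empowered man of well-established character."
ratio, runs = shared_word_runs(a, b)
print(round(ratio, 3), runs)
```

The shared runs are the skeleton of the sentence, while the substituted words sit in between. That pattern suggests paraphrase, but as noted above it says nothing about whether the passage was properly quoted.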

--
[ e d @ h a l l e y . c c ]

Re^2: Brainstorming session: detecting plagiarism by Ovid (Cardinal) on Jun 08, 2005 at 19:57 UTC
Eventually, this code will be released with full documentation. One of the first things the documentation will make clear is that matching text does not necessarily mean plagiarism. Instead, the person looking at the text will have to compare the two documents (with the HTML linking I hope to provide) and determine for themselves whether plagiarism took place. My software will not be able to tell whether someone gave proper credit for a particular passage.

If the above were the only matching sentence in a 10,000-word document, I wouldn't say it's plagiarism. If that and several other sentences grouped together in one paragraph have a decent match and there's no attribution, then that's something which merits further study. Deciding whether plagiarism has occurred is not something software can do. It can merely flag likely candidates, and it will always have false positives and false negatives.

And I'm aware that professors already have software to do this. The free software I've seen is very limited. (One merely does a "longest substring" match.) I'd like to provide free tools for them.

Cheers,
Ovid
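
The "several matching sentences grouped in one paragraph" idea could be sketched roughly as follows. This is only an illustration in Python, not the code being discussed: `flag_paragraphs`, the 0.5 similarity threshold, and the `min_hits=2` cutoff are all assumptions picked for the example, and the sentence splitter is deliberately naive.

```python
import re
from difflib import SequenceMatcher

def sentences(text):
    # Naive split on sentence-ending punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def similar(a, b):
    # Word-level similarity in [0, 1].
    return SequenceMatcher(None, a.lower().split(), b.lower().split(),
                           autojunk=False).ratio()

def flag_paragraphs(suspect, source, threshold=0.5, min_hits=2):
    """Return (paragraph index, matching-sentence count) for paragraphs
    with at least min_hits sentences that resemble some source sentence."""
    source_sents = sentences(source)
    paragraphs = [p for p in suspect.split("\n\n") if p.strip()]
    flagged = []
    for idx, para in enumerate(paragraphs):
        hits = sum(1 for s in sentences(para)
                   if any(similar(s, t) >= threshold for t in source_sents))
        if hits >= min_hits:
            flagged.append((idx, hits))
    return flagged

source = ("Macbeth is presented as a mature man of definitely established "
          "character. He is courageous and ambitious.")
suspect = ("Macbeth is shown as an empowered man of well-established "
           "character. He is brave and ambitious.\n\n"
           "The weather was fine that day.")
print(flag_paragraphs(suspect, source))
```

A flagged paragraph is only a candidate for a human to review. The sketch has no notion of quotation or attribution, which is exactly the limitation described above.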
Re^2: Brainstorming session: detecting plagiarism by Ovid (Cardinal) on Jun 08, 2005 at 22:01 UTC
By the way, do you have any information about calculating statistically improbable word pairs? I would be most fascinated by that. I'd like to create an architecture whereby people could, at the potential cost of performance, pick and choose which features they would like to use when comparing. This sounds like a great choice.

Cheers,
Ovid
Re^3: Brainstorming session: detecting plagiarism by halley (Prior) on Jun 09, 2005 at 00:57 UTC
There are many lexicons out there, and they often include a ranking by frequency found in a large source such as the Bible or the New York Times. One such popular lexicon for English is the Moby Project, and it includes two such rankings. Google will give you hints there.

To find statistically improbable word pairs, one method is trivial: you take the product of word frequencies for each consecutive pair of words, and search for the smallest results. For example, "statistically=0.0004" and "improbable=0.0003" would give a very statistically improbable 0.00000012, and yet, this posting uses that phrase more than once. It's a pretty good indicator of a work's overall topics and themes.

--
[ e d @ h a l l e y . c c ]
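
The product-of-frequencies method described above might look something like this in Python. Treat it as a sketch: only the two frequencies for "statistically" and "improbable" come from the post, the rest of the toy table is invented, and the `default` value for words missing from the lexicon is an assumption.

```python
import re

def improbable_pairs(text, freq, default=1e-6, top=3):
    """Score each consecutive word pair by the product of the two word
    frequencies and return the `top` pairs with the smallest product."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = {}
    for w1, w2 in zip(words, words[1:]):
        # Words absent from the lexicon get an assumed tiny frequency.
        scores[(w1, w2)] = freq.get(w1, default) * freq.get(w2, default)
    return sorted(scores.items(), key=lambda kv: kv[1])[:top]

# Toy lexicon: only the first two values are from the post, the rest are made up.
freq = {
    "statistically": 0.0004, "improbable": 0.0003,
    "the": 0.05, "word": 0.002, "pairs": 0.001, "are": 0.01, "rare": 0.0008,
}
result = improbable_pairs("The statistically improbable word pairs are rare", freq)
for pair, score in result:
    print(pair, score)
```

With a real lexicon, the interesting signal is a pair that scores as very unlikely yet shows up repeatedly in the same document.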
Re^3: Brainstorming session: detecting plagiarism by planetscape (Chancellor) on Jun 09, 2005 at 05:48 UTC
You might also want to check out Ted Pedersen's Ngram Statistics Package, with regard to the problem of improbable word pairs. The output can be easily sorted to highlight least likely occurrences. Of course you would want to compare to a corpus (of written English, say), to get a fairly good idea of "normal" parameters.

Good luck, and keep us posted, please!

planetscape
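
For the "compare to a corpus" step, a baseline frequency table can be built from any reference text. The snippet below is a minimal Python sketch of that idea only. It does not use or imitate the Ngram Statistics Package, and `relative_frequencies` is a hypothetical helper name.

```python
import re
from collections import Counter

def relative_frequencies(corpus_text):
    """Map each word in the reference corpus to its share of all words."""
    words = re.findall(r"[a-z']+", corpus_text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {w: n / total for w, n in Counter(words).items()}

# A real baseline would use a large corpus of written English.
baseline = relative_frequencies("the cat and the dog")
print(baseline["the"])
```

A table like this could stand in for a published lexicon when scoring word pairs, provided the corpus is large enough to give stable frequencies.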


	PerlMonks