Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

Re: Brainstorming session: detecting plagiarism

by halley (Prior)
on Jun 08, 2005 at 19:35 UTC ( [id://464809]=note: print w/replies, xml ) Need Help??


in reply to Brainstorming session: detecting plagiarism

Professors use essay-comparing software already. It's not new, but it does a pretty effective job. It's pretty straightforward to look for matching fragments, even tiny fragments like Markov chains or statistically improbable word pairs. However, if you compare some feeds, say, Reuters to Associated Press on the same day, even those criteria would fall apart.

Separately, you might need to refine your personal definition of plagiarism. I'd say that the two examples below show that a fairly mechanical paraphrasing is going on, but it's not clear that it would rise to the definition of plagiarism. Also, how do you allow for proper quotations?

Macbeth is presented as a mature man of definitely established cha +racter. Macbeth is shown as an empowered man of well- established cha +racter.

--
[ e d @ h a l l e y . c c ]

Replies are listed 'Best First'.
Re^2: Brainstorming session: detecting plagiarism
by Ovid (Cardinal) on Jun 08, 2005 at 19:57 UTC

    Eventually, this code will be released with full documentation. One of the first things the documentation would make clear is that matching text does not necessarily mean plagiarism. Instead, the person looking at the text would have to compare the two documents (with the HTML linking I hope to provide) and determine for themselves if plagiarism took place. My software will not be able to tell whether or not someone gave proper credit for a particular passage.

    If the above was the only sentence in a 10,000 word document, I wouldn't say it's plagiarism. If that and several other sentences grouped together in one paragraph have a decent match and there's no attribution, then that's something which merits further study. Deciding whether or not plagiarism has occurred is not something software can do. It can merely flag likely candidates and will always have false positives and negatives.

    And I'm aware that professors already have software to do this. The free software I've seen is very limited. (One merely does a "longest substring" match.) I'd like to provide free tools for them.

    Cheers,
    Ovid

    New address of my CGI Course.

Re^2: Brainstorming session: detecting plagiarism
by Ovid (Cardinal) on Jun 08, 2005 at 22:01 UTC

    By the way, do you have any information about calculating statistically improbable word pairs? I would be most fascinated with that. I'd like to create an architechture whereby people could, at the potential cost of performance, pick and choose which features they would like to use when comparing. This sounds like a great choice.

    Cheers,
    Ovid

    New address of my CGI Course.

      There are many lexicons out there, and they often include a ranking by frequency found in a large source such as the Bible or the New York Times. One such popular lexicon for English is the Moby Project, and it includes two such rankings. Google will give you hints there.

      To find statistically improbable word pairs, one method is trivial: you take the product of word frequencies for each consecutive pair of words, and search for the smallest results. For example, "statistically=0.0004" and "improbable=0.0003" would give a very statistically improbable 0.00000012, and yet, this posting uses that phrase more than once. It's a pretty good indicator of a work's overall topics and themes.

      --
      [ e d @ h a l l e y . c c ]

      You might also want to check out Ted Pedersen's Ngram Statistics Package, with regard to the problem of improbable word pairs. The output can be easily sorted to highlight least likely occurrences. Of course you would want to compare to a corpus (of written English, say), to get a fairly good idea of "normal" parameters.

      Good luck, and keep us posted, please!

      planetscape

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://464809]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others perusing the Monastery: (5)
As of 2024-04-19 06:28 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found