PerlMonks  

Re: Brainstorming session: detecting plagiarism

by CountZero (Bishop)
on Jun 08, 2005 at 21:13 UTC


in reply to Brainstorming session: detecting plagiarism

An interesting approach! I particularly liked the simple and elegant way of "hashing" the sentences and calculating the "distance" between them.

Another approach I was thinking of is to move a "sliding window" of 3 or 4 words over the texts to be compared. Each window of 3 or 4 words is stored in a hash, with the words concatenated into a single key and the number of occurrences as the value. The hashes of the two texts are then compared: every "match" scores plagiarism points and every non-match scores originality points. The ratio of the plagiarism score to the originality score could perhaps serve as another metric.
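A minimal sketch of that idea in Python, under a few assumptions of my own: words are split on whitespace and lowercased, a "match" is a window that occurs in both texts (counted as often as it occurs in both), and a "non-match" is any window left over in either text. The names `window_counts` and `plagiarism_points` are made up for illustration, and the scoring is only one possible reading of the points scheme.

```python
from collections import Counter

def window_counts(text, size=3):
    """Count every run of `size` consecutive words (key = the words joined by a space)."""
    words = text.lower().split()
    return Counter(
        " ".join(words[i:i + size])
        for i in range(len(words) - size + 1)
    )

def plagiarism_points(text_a, text_b, size=3):
    """Return (plagiarism_points, originality_points) for two texts.

    Assumption: a window shared by both texts scores one plagiarism point
    per shared occurrence; every unshared window scores one originality point.
    """
    a = window_counts(text_a, size)
    b = window_counts(text_b, size)
    matches = sum((a & b).values())                           # min count per shared key
    non_matches = sum((a - b).values()) + sum((b - a).values())
    return matches, non_matches

original = "the quick brown fox jumps over the lazy dog"
suspect = "the quick brown fox leaps over the lazy dog"
print(plagiarism_points(original, suspect))  # (4, 6)
```

Note how a single substituted word ("jumps" to "leaps") knocks out three windows in each text, so the ratio drops quickly with small edits when the window is contiguous.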

It would be a more abstract metric than your comparison, but it would perhaps be less easily fooled by small deliberate changes or the use of synonyms. To that end one could vary the "sliding window" to look at, e.g., the first, second and fourth words (skipping the third) so as to account for substitutions.
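The skipping variant could look something like the sketch below, again in Python and again with assumed details: the window is described by a tuple of word offsets (here `(0, 1, 3)` for first, second and fourth word), and the helper names `skip_window_counts` and `shared_windows` are hypothetical.

```python
from collections import Counter

def skip_window_counts(text, offsets=(0, 1, 3)):
    """Count windows built from the words at the given offsets only.

    With offsets (0, 1, 3) the window spans four words and ignores the third,
    so a substitution in that position does not change the key.
    """
    words = text.lower().split()
    span = max(offsets) + 1
    return Counter(
        " ".join(words[start + off] for off in offsets)
        for start in range(len(words) - span + 1)
    )

def shared_windows(text_a, text_b, offsets=(0, 1, 3)):
    """Return the windows (with counts) that occur in both texts."""
    return skip_window_counts(text_a, offsets) & skip_window_counts(text_b, offsets)

original = "the quick brown fox jumps over the lazy dog"
suspect = "the quick brown fox leaps over the lazy dog"
print(sorted(shared_windows(original, suspect)))
# ['brown fox over', 'over the dog', 'the quick fox']
```

The key "brown fox over" matches even though the skipped word differs ("jumps" versus "leaps"), which is exactly the substitution case. Running several offset patterns and combining the scores would cover substitutions in other positions.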

CountZero

"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
