
Re: Re: Re: Re: calculate matching words/sentence

by tachyon (Chancellor)
on Sep 05, 2003 at 11:07 UTC ( [id://289168]=note )


in reply to Re: Re: Re: calculate matching words/sentence
in thread calculate matching words/sentence

We aren't told what this is for, but my wild guess was plagiarism detection. We wrote one for a client a while back. It depended to a degree on the temptation to cut and paste chunks of text. The approach was simple enough. Tokenize by splitting on \n\n, '.' and ',' into roughly sentence-sized fragments, then strip case and all punctuation. MD5 hash the fragments and index them in a DB against the original docs (hash as primary key, link list to docs). To test for a ripped-off doc, all you need to do is tokenize it, look in the DB for matching tokens and accumulate a count for each doc the token exists in. The higher the count for a given doc, the more fragments the test doc has in common with the known doc. Linear regression shows obvious threshold values.
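For anyone who wants to play with the idea, here is a rough sketch of those steps. It is not the code we wrote for the client. It is in Python only to keep it short, and a dict in memory stands in for the DB. The names `fragments`, `digest` and `FragmentIndex` are made up for illustration, and treating every non-alphanumeric character as punctuation is an assumption on my part.

```python
import hashlib
import re
from collections import Counter, defaultdict

# Split on blank lines (\n\n), full stops and commas, as described above.
_SPLIT = re.compile(r"\n\n|[.,]")
# Assumption: anything that is not a letter, digit or whitespace is punctuation.
_PUNCT = re.compile(r"[^\w\s]|_")


def fragments(text):
    """Yield lower-cased, punctuation-free, roughly sentence-sized fragments."""
    for chunk in _SPLIT.split(text):
        cleaned = _PUNCT.sub("", chunk.lower())
        cleaned = " ".join(cleaned.split())  # collapse runs of whitespace
        if cleaned:
            yield cleaned


def digest(fragment):
    """MD5 hex digest of a normalised fragment."""
    return hashlib.md5(fragment.encode("utf-8")).hexdigest()


class FragmentIndex:
    """In-memory stand-in for the DB: hash -> set of doc ids containing it."""

    def __init__(self):
        self._docs_by_hash = defaultdict(set)

    def add(self, doc_id, text):
        """Tokenize a known doc once and file its fragment hashes."""
        for frag in fragments(text):
            self._docs_by_hash[digest(frag)].add(doc_id)

    def score(self, text):
        """Count, per known doc, how many fragments of text it shares."""
        counts = Counter()
        for frag in fragments(text):
            for doc_id in self._docs_by_hash.get(digest(frag), ()):
                counts[doc_id] += 1
        return counts


index = FragmentIndex()
index.add("doc-a", "The quick brown fox. It jumped, over the dog.")
index.add("doc-b", "Something else entirely.")

# Case and the stray "!" are normalised away, so two fragments match doc-a.
scores = index.score("the QUICK brown fox!. Over the dog. unrelated text")
```

Calling `scores.most_common()` gives the known docs ranked by shared fragments. Where to draw the cut-off is a separate question that this sketch does not try to answer.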

This is of course a somewhat naive approach, but it works surprisingly well in practice and is certainly a reasonable way to screen thousands of docs very rapidly. Each doc only gets tokenized once and accumulates in the DB. Stripping case and punctuation means a cut'n'paster has to work that much harder to avoid detection. Rearranging word order (but not changes in case or most punctuation) will of course defeat this. However, if you have to rearrange every sentence to avoid detection, it starts to become a hell of a lot more effort to R&D someone else's work. The reality is that you only have to catch a percentage of would-be cheats to discourage the practice. In WWI the French officers in the trenches would shoot the first man to refuse to go over the top, "pour encourager les autres" (to encourage the others).
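A small illustration of that last point, again just a sketch in Python with a made-up helper called `fingerprint` that normalises one fragment and hashes it. Changing case or sprinkling in punctuation leaves the hash alone, while reordering the words changes it. Commas and full stops are a special case because they also move the fragment boundaries, which is presumably why only "most" punctuation is harmless.

```python
import hashlib
import re


def fingerprint(fragment):
    """MD5 of a fragment after lower-casing and stripping punctuation."""
    # Assumption: every non-alphanumeric, non-space character is punctuation.
    cleaned = re.sub(r"[^\w\s]|_", "", fragment.lower())
    normalised = " ".join(cleaned.split())  # collapse runs of whitespace
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


original = "Stripping case and punctuation makes evasion harder"
recased = "STRIPPING case and Punctuation; makes evasion harder!"
reordered = "Evasion is made harder by stripping case and punctuation"

# Same words in the same order: the cosmetic changes do not alter the hash.
same_after_recasing = fingerprint(original) == fingerprint(recased)

# Different word order: the hash changes and the fragment no longer matches.
same_after_reordering = fingerprint(original) == fingerprint(reordered)
```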

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
