PerlMonks  

Re: Fingerprinting text documents for approximate comparison

by johndageek (Hermit)
on Mar 24, 2005 at 19:06 UTC


in reply to Fingerprinting text documents for approximate comparison

I would look at creating a fingerprint file for each document (you will need to refine the parameters you use).

In this file I would put perhaps:
The number of significant words
The average number of letters in the five most common words
The three least common significant words (alphabetized)
The three most common significant words (alphabetized)
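
To make the list above concrete, here is a rough sketch in Python. It is only one way to read those four fields: the stop-word list, the field names, and the function name `fingerprint` are all made up for illustration, "number of significant words" is taken to mean the total count of non-stop-word occurrences, and ties in frequency are broken alphabetically so the result is repeatable.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the post does not define "significant".
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "on", "for"}

def fingerprint(text, stop_words=STOP_WORDS):
    """Build a small fingerprint dictionary for one document."""
    # Lower-case the text, split it into words, and drop the stop words.
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stop_words]
    counts = Counter(words)
    # Rank by descending frequency, then alphabetically to break ties.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    top5 = ranked[:5]
    return {
        "significant_words": len(words),
        "avg_top5_length": sum(map(len, top5)) / len(top5) if top5 else 0.0,
        "least_common": sorted(ranked[-3:]),
        "most_common": sorted(ranked[:3]),
    }

if __name__ == "__main__":
    print(fingerprint("the cat sat on the mat and the cat ate the fish fish fish"))
```

Each resulting dictionary could then be written out as the fingerprint file for its document.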

You can either use your current checksum, or create a checksum on the fingerprint files.

Use similar checksums to select which fingerprint files to compare; those fingerprints that are within a tolerance you set would be deemed matches.
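
One possible reading of that selection step, sketched in Python. The assumptions here are mine: each fingerprint is a dictionary with the hypothetical keys `significant_words`, `avg_top5_length`, `most_common` and `least_common`; the checksum covers only the word lists, so that small differences in the numeric fields do not change it; and the two tolerance defaults are arbitrary starting points that you would need to tune.

```python
import hashlib

def fingerprint_checksum(fp):
    """Checksum over the word lists only (an assumption, not from the post)."""
    key = "|".join(fp["most_common"] + fp["least_common"])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def is_match(fp_a, fp_b, count_tolerance=0.05, length_tolerance=0.5):
    """Return True if two fingerprints agree within the given tolerances."""
    # Cheap first pass: only fingerprints with the same checksum are compared.
    if fingerprint_checksum(fp_a) != fingerprint_checksum(fp_b):
        return False
    # Relative difference in word counts (guard against division by zero).
    larger = max(fp_a["significant_words"], fp_b["significant_words"], 1)
    count_diff = abs(fp_a["significant_words"] - fp_b["significant_words"]) / larger
    # Absolute difference in average length of the top words.
    length_diff = abs(fp_a["avg_top5_length"] - fp_b["avg_top5_length"])
    return count_diff <= count_tolerance and length_diff <= length_tolerance

if __name__ == "__main__":
    a = {"significant_words": 1000, "avg_top5_length": 5.2,
         "most_common": ["data", "file", "hash"],
         "least_common": ["xylem", "yak", "zebra"]}
    b = dict(a, significant_words=1020, avg_top5_length=5.4)
    print(is_match(a, b))
```

Note that an exact checksum only groups fingerprints whose word lists are identical; if that proves too strict, the tolerance comparison could be run on all pairs instead.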

Just my 2 cents worth, good luck!

Enjoy!
Dageek
