Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Re: Module for Approximate File Comparison

by blokhead (Monsignor)
on Sep 24, 2007 at 13:19 UTC ( [id://640714]=note: print w/replies, xml ) Need Help??


in reply to Module for Approximate File Comparison

A pretty neat and simple method is one outlined by Zaxo in Re: similar texts !?. Basically, you measure how much it helps a compression algorithm to concatenate the two files together, compared to when you compress them independently.

If you think of a compression algorithm as a mild approximation to Shannon entropy, then this approach is essentially computing the corresponding approximation of mutual information (normalized to the range 0.5 - 1), which is the intuitive concept you seem to be looking for.

blokhead

  • Comment on Re: Module for Approximate File Comparison

Replies are listed 'Best First'.
Re^2: Module for Approximate File Comparison
by atcroft (Abbot) on Sep 24, 2007 at 18:44 UTC

    You mean something like:

    The code above uses IO::Compress::Gzip, and takes the two files as arguments on the command line. The ratio in this case is computed by saying that 100% would be if the content of the two files compressed to be the same size as one of them, 0% as compressing to the sum of the sizes of the two files.

    (By the way, for the example files from my earlier post, this gave a result of 90.61%, which is a fair approximation.)

    Update: 2007-09-24
    Added comments to code; had reversed description of ratio.

      If I recall correctly, zlib (and so, gzip) operate over a data window of 64Kb or less, so your method would not work if the files are bigger than that.

        True-any compression method that uses a data window in a similar manner would experience the same issue, albeit the size at which it would occur would be dependent upon the size of said window. The code above was merely an example.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://640714]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others scrutinizing the Monastery: (4)
As of 2024-04-18 18:09 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found