PerlMonks  

Is there a Perl version of UNIX "cmp" ?

by Amphiaraus (Beadle)
on Feb 06, 2009 at 19:03 UTC ( [id://741983]=perlquestion )

Amphiaraus has asked for the wisdom of the Perl Monks concerning the following question:

Is there a Perl version of UNIX "cmp"?
The UNIX "cmp" command is used to compare 2 very large, non-human-readable files to confirm if they are the same, or different. Unlike UNIX "diff", UNIX "cmp" is "large-file-aware" and does not bungle the job when comparing 2 large files as UNIX "diff" is prone to do.

Sample "cmp" operations:

TWO FILES ARE SAME:
> cmp engine_security.a@@/main/par_x_rush_r55.1/2 engine_security.a@@/main/par_x_rush_r55.1/2

> echo $?
0

TWO FILES ARE DIFFERENT:
> cmp engine_security.a@@/main/par_x_rush_r55.1/1 engine_security.a@@/main/par_x_rush_r55.1/2

engine_security.a@@/main/par_x_rush_r55.1/1 engine_security.a@@/main/par_x_rush_r55.1/2 differ: char 29, line 2

> echo $?
1


I work for Motorola, and our Perl coding practices say to avoid qx and system calls to non-Perl functions, in all cases in which a Perl equivalent to a UNIX etc. function can be found.

Replies are listed 'Best First'.
Re: Is there a Perl version of UNIX "cmp" ?
by zentara (Archbishop) on Feb 06, 2009 at 19:29 UTC
    File::Compare
    #!/usr/bin/perl
    # generally use File::Compare available in Perl5.8
    use strict;
    use warnings;
    use constant BUF_SIZE => 4096;

    print cmp_file(@ARGV) ? "equal\n" : "not equal\n";

    #################
    sub cmp_file {
        my ( $f1, $f2 ) = @_;
        open my $h1, '<', $f1 or die "gah $f1: $!";
        open my $h2, '<', $f2 or die "gah $f2: $!";
        binmode $h1;
        binmode $h2;
        my ( $buf1, $buf2 );
        my $equal = 1;    # start as equal so that two empty files match
        while ( read $h1, $buf1, BUF_SIZE ) {
            read $h2, $buf2, BUF_SIZE;
            last unless $equal = ( $buf1 eq $buf2 );
        }
        # "return $equal and eof $h2" parses as "(return $equal) and ...",
        # so the eof test was never applied; use && inside parentheses.
        return ( $equal && eof($h2) );
    }
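
For comparison, here is a minimal sketch of the File::Compare route mentioned above. The helper name cmp_status and the choice of 2 as the "trouble" status are assumptions made for illustration; compare() itself returns 0 for equal files, 1 for different files, and -1 on error.

```perl
use strict;
use warnings;
use File::Compare qw(compare);

# Hypothetical helper: map compare()'s result onto cmp-style statuses
# (0 = same, 1 = different, 2 = trouble such as an unreadable file).
sub cmp_status {
    my ( $file1, $file2 ) = @_;
    my $result = compare( $file1, $file2 );
    return 2 if $result < 0;    # compare() signals errors with -1
    return $result;
}
```

A script could then end with something like exit cmp_status(@ARGV) to mimic the "echo $?" behaviour shown in the question.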

    I'm not really a human, but I play one on earth Remember How Lucky You Are
Re: Is there a Perl version of UNIX "cmp" ?
by zentara (Archbishop) on Feb 06, 2009 at 19:10 UTC
    The unix cmp utility will probably be faster on large files, so why not run it thru system or backticks? It's commonly done with sort on large files.
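
A rough sketch of what that could look like with system, assuming a cmp binary is on the PATH and supports the POSIX -s (silent) option. The helper name run_cmp is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the external cmp utility.
# Returns cmp's exit status: 0 = same, 1 = different, >1 = trouble.
sub run_cmp {
    my ( $file1, $file2 ) = @_;

    # List form of system() bypasses the shell, so odd file names are safe.
    system( 'cmp', '-s', $file1, $file2 );

    die "could not run cmp: $!\n" if $? == -1;
    die "cmp was killed by a signal\n" if $? & 127;
    return $? >> 8;
}
```

The list form matters with file names like the ClearCase-style paths in the question, since nothing is interpreted by a shell.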

    I'm not really a human, but I play one on earth Remember How Lucky You Are
Re: Is there a Perl version of UNIX "cmp" ?
by Anonymous Monk on Feb 06, 2009 at 19:36 UTC
      Does anyone know if File::Compare's compare() function is "large-file-aware"? i.e. does it reliably return a correct boolean value when comparing large non-human-readable files?
        If perl was compiled with the USE_LARGE_FILES flag (which it likely was if your OS handles large files), it will handle large files (see "perl -V" for that info). Still, "cmp" is likely going to be much faster than File::Compare. The only way to tell is to try both on your large files. Coding practices are fine, but they should be guidelines, not absolutes.
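
The same information can be read from inside a script through the core Config module. This is only a sketch: it assumes the uselargefiles and lseeksize entries of %Config reflect the build option mentioned above, and the helper name large_file_report is invented here.

```perl
use strict;
use warnings;
use Config;

# Hypothetical helper: summarize this perl's large-file build settings.
sub large_file_report {
    my $enabled = ( $Config{uselargefiles} // '' ) eq 'define';
    return sprintf "uselargefiles=%s lseeksize=%s",
        ( $enabled ? 'yes' : 'no' ),
        ( $Config{lseeksize} // 'unknown' );
}

print large_file_report(), "\n";
```

On the command line, perl -V:uselargefiles prints the same setting.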

        Update: quick benchmark on two identical 1GB files on HP-UX - 13.5 secs (cmp) vs. 17.5 seconds (File::Compare). "much faster" is relative it seems :-)

        Looking at the source for File::Compare, there is an undocumented third argument that is used as the read buffer size. If that argument is not provided, the default is the size of the first file (-s FROM). If that size, or the third argument, is larger than 1024 * 1024 * 2 bytes, that number (2 MB) is used as the buffer size instead.

        Basically, it reads the files in chunks of up to 2 MB, so it should be able to handle files of virtually any size given enough time.
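
A small sketch of passing that third argument explicitly. It relies on the behaviour described above (a positive number being taken as the buffer size); the 64 KB value and the helper name compare_with_buffer are arbitrary choices for illustration.

```perl
use strict;
use warnings;
use File::Compare qw(compare);

# Arbitrary illustrative buffer size, not a recommendation.
use constant BUFFER_SIZE => 64 * 1024;

# Hypothetical helper: compare two files with an explicit read buffer size.
# Returns what compare() returns: 0 = equal, 1 = different, -1 = error.
sub compare_with_buffer {
    my ( $file1, $file2 ) = @_;
    return compare( $file1, $file2, BUFFER_SIZE );
}
```

Whether a smaller or larger buffer helps is something to measure on the actual files, as with the benchmark above.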

        ---
        It's all fine and dandy until someone has to look at the code.
Re: Is there a Perl version of UNIX "cmp" ?
by samtregar (Abbot) on Feb 06, 2009 at 19:22 UTC
    I agree that there's no compelling reason not to just call it from system() but if you did need it in Perl code it'd be a fun challenge. I think I'd do it by reading in chunks from each file and doing an MD5 on each chunk with Digest::MD5. Compare the MD5s and if they're different then you've got a difference. Oh, and start by comparing file sizes!

    -sam
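
The "compare file sizes first" step can be sketched as below. The helper name sizes_differ is hypothetical, and it assumes both paths are regular files that stat can see.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the two files have different byte sizes,
# in which case they cannot be identical and no data needs to be read.
sub sizes_differ {
    my ( $file1, $file2 ) = @_;
    my @st1 = stat $file1 or die "stat $file1: $!\n";
    my @st2 = stat $file2 or die "stat $file2: $!\n";
    return $st1[7] != $st2[7];    # element 7 of stat() is the size in bytes
}
```

Equal sizes prove nothing by themselves, so a content comparison is still needed when this returns false.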

      Uh, you start with block A and block B in memory. Not sure how you think computing the MD5 on each block (hitting every byte, doing math) is going to be any faster than just comparing the blocks themselves (comparing byte by byte, but stopping on first difference). Bizarre.
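
To illustrate the direct block comparison described here, the following sketch reads both files in chunks and stops at the first differing byte, roughly like the "differ: char 29" report from cmp in the question. The helper name first_difference and the 64 KB chunk size are assumptions, and it does not track line numbers.

```perl
use strict;
use warnings;
use constant CHUNK => 64 * 1024;    # arbitrary illustrative chunk size

# Hypothetical helper: return the 1-based offset of the first byte at which
# the two files diverge, or 0 if they are identical.
sub first_difference {
    my ( $file1, $file2 ) = @_;
    open my $h1, '<:raw', $file1 or die "open $file1: $!\n";
    open my $h2, '<:raw', $file2 or die "open $file2: $!\n";

    my $offset = 0;
    while (1) {
        my $n1 = read( $h1, my $buf1, CHUNK );
        my $n2 = read( $h2, my $buf2, CHUNK );
        die "read error: $!\n" unless defined $n1 && defined $n2;
        return 0 if $n1 == 0 && $n2 == 0;    # both at EOF, no difference

        if ( $buf1 ne $buf2 ) {
            # XOR the common prefix; the first non-NUL byte marks the mismatch.
            my $common = $n1 < $n2 ? $n1 : $n2;
            my $xor = substr( $buf1, 0, $common ) ^ substr( $buf2, 0, $common );
            return $offset + $-[0] + 1 if $xor =~ /[^\0]/;
            return $offset + $common + 1;    # one file is a prefix of the other
        }
        $offset += $n1;
    }
}
```

No digest is computed at any point; the string comparison already touches each byte at most once and bails out early.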
        MD5 is magic! Ok, good point.

        -sam

Re: Is there a Perl version of UNIX "cmp" ?
by jdporter (Paladin) on Feb 07, 2009 at 12:37 UTC

    The tag line of tye's Algorithm::Diff modules says that it computes "'intelligent' differences between two files / lists"... but the doc doesn't explain how to use it on files, only on arrays. I suppose you could tie the two files to arrays using Tie::File, but I'm not sure how efficient that would be...

    Between the mind which plans and the hands which build, there must be a mediator... and this mediator must be the heart.
