compressing VERY LARGE files

by gnu@perl (Pilgrim)
on Sep 26, 2002 at 19:35 UTC ( [id://201013] )

gnu@perl has asked for the wisdom of the Perl Monks concerning the following question:

I need to compress some very large files (>600MB) using a standard compression format such as gzip or bzip2. I have looked at Compress::Bzip2 as well as some of the others on CPAN. The problem is that Compress::Bzip2 (like the others) compresses a string, which means I would have to read the whole 600MB file into memory, compress it, then write it out. That just can't be done; it would use far too much memory on the box.

Does anyone have another way I can do this without shelling out to bzip2 or gzip? This program will be placed on different boxes, and the location of the system compression binaries will not always be the same or even in the user's path. :(
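One way around the memory problem, as the replies below note, is to stream the compression: Compress::Zlib can write gzip output incrementally, so the file can be fed to it in fixed-size chunks. A minimal sketch, where the input file name (big.dat) and the 64KB buffer size are illustrative assumptions:

    use strict;
    use warnings;
    use Compress::Zlib;

    # Stream a large file into a .gz in fixed-size chunks, so the whole
    # file never sits in memory at once. The file name and buffer size
    # are illustrative assumptions.
    my $in  = 'big.dat';
    my $out = "$in.gz";

    open my $fh, '<', $in or die "Cannot open $in: $!\n";
    binmode $fh;

    my $gz = gzopen($out, 'wb') or die "Cannot open $out: $gzerrno\n";
    my $buf;
    while (read($fh, $buf, 64 * 1024)) {
        $gz->gzwrite($buf) or die "error writing: $gzerrno\n";
    }
    $gz->gzclose;
    close $fh;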

Replies are listed 'Best First'.
Re: compressing VERY LARGE files
by RMGir (Prior) on Sep 26, 2002 at 20:10 UTC

    Don't give up on bzip2 or gzip just because they may not always be in the same place.

    It may be simpler to write a routine to find the executable and run it than to get a module installed across all of those machines... If you can guarantee that at least gzip is on all the boxes, it shouldn't be too hard, in fact...

    sub findCompressor {
        my @paths = (
            "/usr/bin", "/bin", "/usr/local/bin", "/usr/share/bin",
            "/opt/bin", "$ENV{HOME}/bin", "$ENV{HOME}/local/bin",
        );
        foreach my $prog ('bzip2', 'gzip', 'compress') {
            foreach my $dir (@paths) {
                my $path = "$dir/$prog";
                return $path if -x $path;
            }
        }
        # return some default non-compressing compressor??
    }

    # run once
    my $compressor = findCompressor();
    Not sure if that makes sense in your case, but I think it would work...
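    For instance, the found path could then be handed straight to system(). A sketch only: the file name is an assumption, and it relies on bzip2, gzip, and compress all taking a file name and compressing it in place:

        # Illustrative use of the found compressor: compress one file in place.
        my $file = 'big.dat';             # assumed example file
        my $compressor = findCompressor()
            or die "no compression program found\n";
        system($compressor, $file) == 0
            or die "$compressor $file failed: exit status $?\n";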
    --
    Mike

      Thanks, I really had not thought of that. What I might do is write the location of the compression binary to a file once I find it, then check that file first each time to be sure the binary is still there; if not, just look for it again.
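      A minimal sketch of that cache-and-revalidate idea (the cache file name is an illustrative assumption, and see the reply below for why re-checking every run may be simpler):

          # Try the cached location first, but only trust it if the binary
          # is still executable; otherwise search again and refresh the cache.
          my $cache = "$ENV{HOME}/.compressor_path";   # assumed cache file
          my $compressor;

          if (open my $fh, '<', $cache) {
              my $line = <$fh>;
              close $fh;
              if (defined $line) {
                  chomp $line;
                  $compressor = $line if -x $line;
              }
          }

          unless ($compressor) {
              $compressor = findCompressor();
              if ($compressor) {
                  if (open my $fh, '>', $cache) {
                      print $fh "$compressor\n";
                      close $fh;
                  }
              }
          }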

        Glad you like it :)

        But I wouldn't save the location to a file.

        Imagine a very bare bones server, where only compress is installed. Your script runs once, saves /bin/compress as the preferred compressor, and never notices the next week when bzip is installed. (It could also have trouble if the executable was moved or deleted...)

        Compared with compressing a 600MB file, I wouldn't worry about the overhead of doing (at most!) 21 -x checks each time your script starts up...
        --
        Mike

      Many Linux boxes also have the which command, so this could be simplified to:
      my $prog;
      for (qw/bzip2 gzip compress/) {
          chomp($prog = qx|which $_|);
          last if $prog;
      }
      print "Found: $prog\n";
        As gnu@perl said in his original post, the compression binaries may not be in the user's path.

        Since which only tells you where a program lies on your PATH, it may not help in this case... It could be added as an additional test, though.
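        For instance, the same check can be done in Perl by walking $ENV{PATH} directly, which also works when which itself is missing. A minimal sketch; the sub name find_in_path is an assumption:

            use File::Spec;

            # Look for each program in the directories on the user's PATH,
            # without shelling out to `which`.
            sub find_in_path {
                my @dirs = split /:/, ($ENV{PATH} || '');
                for my $prog ('bzip2', 'gzip', 'compress') {
                    for my $dir (@dirs) {
                        my $path = File::Spec->catfile($dir, $prog);
                        return $path if -x $path;
                    }
                }
                return;
            }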
        --
        Mike

Re: compressing VERY LARGE files
by perrin (Chancellor) on Sep 26, 2002 at 19:46 UTC
      You'll probably find most things in Meta as it seems to be the oddest module on CPAN :)

      There are over 300 modules in the distribution, and most look like wrappers for other modules or programs.

      gav^

Re: compressing VERY LARGE files
by Anonymous Monk on Sep 27, 2002 at 11:23 UTC
    Why didn't Compress::Zlib work for you? Study this example from the documentation:
    use strict;
    use warnings;
    use Compress::Zlib;

    binmode STDOUT;   # gzopen only sets it on the fd
    my $gz = gzopen(\*STDOUT, "wb")
        or die "Cannot open stdout: $gzerrno\n";
    while (<>) {
        $gz->gzwrite($_)
            or die "error writing: $gzerrno\n";
    }
    $gz->gzclose;
    though if the input is binary data you probably want to binmode it and read() it in blocks rather than relying on line-oriented <> (see the sketch below).
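    A variant sketch for binary input, where both handles are put into binmode and the data is read in fixed-size blocks (the 64KB buffer is an illustrative assumption):

        use strict;
        use warnings;
        use Compress::Zlib;

        # Same filter as above, but for binary data: binmode both handles
        # and read fixed-size blocks instead of lines.
        binmode STDIN;
        binmode STDOUT;   # gzopen only sets it on the fd
        my $gz = gzopen(\*STDOUT, "wb") or die "Cannot open stdout: $gzerrno\n";
        my $buf;
        while (read(STDIN, $buf, 64 * 1024)) {
            $gz->gzwrite($buf) or die "error writing: $gzerrno\n";
        }
        $gz->gzclose;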
      Thanks. It would seem that I didn't think past the obvious when reading the docs for Compress::Zlib. This would work fine, but I think I am going to do what was suggested and just look for whatever compression program has been loaded onto the box. That way I can check for bzip2, then gzip, and if all else fails just use compress.

      This should let me use whichever is the better compression binary no matter what system I am on, and I won't need to worry about whether the so-and-so libraries are on the destination machine or not.

      Thanks for your solution. I am positive I will be using it in another project in the near future here.
