Different compression behaviours with Compress::Zlib

by grinder (Bishop)
on Sep 10, 2001 at 18:59 UTC

grinder has asked for the wisdom of the Perl Monks concerning the following question:

I thought I had the solution to my Compress::Zlib problem when I did a Super Search and came across this node. Unfortunately, what I want to do is build gzipped files while playing around with Z_BEST_SPEED and Z_BEST_COMPRESSION, to see the differences. I can write the files that way, but I can't read the data back in... I just get garbage.

Consider the two following ways to write a gzipped data file:

#! /usr/bin/perl -w

use strict;
use Compress::Zlib;

my $file  = shift || 'testdata.gz';
my $file2 = shift || 'testdata2.gz';
my @data  = <DATA>;

# method 1: gzopen / gzwrite
my $gz = gzopen( $file, 'wb' )
    or die "Cannot open $file for gzwrite: $gzerrno\n";

my $line;
foreach $line( @data ) {
    $gz->gzwrite( $line )
        or die "Could not write gzipped data to $file: $gzerrno\n";
}
$gz->gzclose();

# method 2: deflateInit / deflate
my( $gzstat, $gzout );
($gz, $gzstat) = deflateInit( { -Level => Z_BEST_COMPRESSION } )
    or die "Could not construct gz writer: $gzstat\n";

open OUT, ">$file2" or die "Could not open $file2 for output: $!\n";
binmode OUT;

foreach $line( @data ) {
    ($gzout, $gzstat) = $gz->deflate($line);
    die "Could not deflate data: $gzstat\n$line\n" unless $gzstat == Z_OK;
    print OUT $gzout;
}

($gzout, $gzstat) = $gz->flush();
die "Could not flush gz writer: $gzstat\n" unless $gzstat == Z_OK;
print OUT $gzout;
close OUT;

__DATA__
foo
bar
Judge my vow, sphinx of black quartz
__END__

If I use the funky hex viewer by OeufMayo, I see that these two files are not the same:

+--------------------------------------------------+------------------+
| 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F  | 0123456789ABCDEF |
+--------------------------------------------------+------------------+
| 1F 8B 08 00 00 00 00 00 00 03 4B CB CF E7 4A 4A  | ..........K...JJ |
| 2C E2 F2 2A 4D 49 4F 55 C8 AD 54 28 CB 2F D7 51  | ,..*MIOU..T(./.Q |
| 28 2E C8 C8 CC AB 50 C8 4F 53 48 CA 49 4C CE 56  | (.....P.OSH.IL.V |
| 28 2C 4D 2C 2A A9 E2 02 00 7B AE 4A 0D 2D 00 00  | (,M,*....{.J.-.. |
| 00                                                | .                |
+--------------------------------------------------+------------------+

compared to

+--------------------------------------------------+------------------+
| 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F  | 0123456789ABCDEF |
+--------------------------------------------------+------------------+
| 78 DA 4B CB CF E7 4A 4A 2C E2 F2 2A 4D 49 4F 55  | x.K...JJ,..*MIOU |
| C8 AD 54 28 CB 2F D7 51 28 2E C8 C8 CC AB 50 C8  | ..T(./.Q(.....P. |
| 4F 53 48 CA 49 4C CE 56 28 2C 4D 2C 2A A9 E2 02  | OSH.IL.V(,M,*... |
| 00 66 87 0F C8                                    | .f...            |
+--------------------------------------------------+------------------+

They are nearly the same, but they have different headers and footers, which explains why the latter cannot be decoded via the following snippet:

$gz = gzopen( $file, 'rb' )
    or die "Cannot open $file for gzread: $gzerrno\n";

while( $gz->gzreadline($line) > 0 ) {
    print $line;
}

die "Error reading from $file: [$gzerrno]\n"
    unless Z_STREAM_END == $gzerrno;

$gz->gzclose();

Can someone enlighten me as to what is going on?

--
g r i n d e r

Re: Different compression behaviours with Compress::Zlib
by lemming (Priest) on Sep 11, 2001 at 00:23 UTC

    Try using the inflate method, as shown in the perldoc for Compress::Zlib.
    There is a difference in how the headers are written, which explains why the second method's output is incompatible with gzip.

    Note that for zip file manipulation the docs suggest using Archive::Zip, which I would guess is one reason for the default behaviour. I'm still looking into this, but maybe this will help?

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Compress::Zlib;

    my $x = inflateInit()
        or die "Cannot create inflation stream";

    my $input = "";
    binmode STDIN;
    binmode STDOUT;

    my ($output, $status);
    while (read(STDIN, $input, 4096)) {
        ($output, $status) = $x->inflate(\$input);
        print $output if $status == Z_OK or $status == Z_STREAM_END;
        last if $status != Z_OK;
    }
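
    Note that inflateInit() expects a plain zlib stream, which is exactly what the deflate() half of your script produces, so piping your second file through this (saved as, say, inflate.pl: perl inflate.pl < testdata2.gz) should give you your lines back. The gzopen-written file carries the gzip wrapper (the 1F 8B magic in your first dump), and plain inflate() will not accept that.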

      A little more background. I want to rewrite a backup script that is currently implemented in shell. The first step is to write the names of all the files to be backed up to a catalog file, and then write that file to tape. That step alone takes two to three hours. The fact that several processes are spawned per file probably explains why it is so slow.

      So I want to use File::Find to get the file names, and then write them, line by line, to a gzipped file directly on tape.

      When it comes to restoring, I'd like to be able to read the catalog back off tape, line by line, so that I can apply a regex to see whether the file in question is to be restored.

      Using gzopen will do the trick, but I'd like to benchmark whether spending more time compressing the file harder, in order to write fewer bytes onto a glacially slow medium, beats spending less time compressing lightly and more time writing to the device. OS buffering may make the issue moot, but I'd like some hard numbers.
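
      Something like this is what I have in mind for the timing run. It is only a sketch: the output names and the catalogue data are made up, and I am relying on gzopen() accepting the compression level in the mode string ('wb1' vs 'wb9').

      #!/usr/bin/perl -w
      use strict;
      use Benchmark qw(timethese);
      use Compress::Zlib;

      # stand-in for the real catalogue built with File::Find
      my @records = map { "/some/path/file$_\n" } 1 .. 50_000;

      sub write_gz {
          my ($file, $mode) = @_;
          my $gz = gzopen($file, $mode)
              or die "Cannot open $file: $gzerrno\n";
          foreach my $rec (@records) {
              $gz->gzwrite($rec) or die "gzwrite failed: $gzerrno\n";
          }
          $gz->gzclose();
      }

      timethese(10, {
          fast => sub { write_gz('catalog-fast.gz', 'wb1') },
          best => sub { write_gz('catalog-best.gz', 'wb9') },
      });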

      Currently I have a Perl script that takes 17 minutes instead of 150-180 minutes, and generates a 2MB catalog instead of a 40-45MB uncompressed one.

      I could read the file back in block by block, but to simplify the algorithm (to avoid having to worry about a record that is split across two chunks) I'd have to write the whole thing back onto disk first, then reopen it (or seek to zero) and loop through applying the regex. I was hoping that I could read the compressed catalog off the tape, inflate it record by record, apply the regex and push any hits onto a @todo list in one pass, something like the sketch below.
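
      Roughly what I am hoping for, assuming the catalog was written as a plain deflate stream as in the second half of my test script (the tape device path and the restore pattern below are made up):

      #!/usr/bin/perl -w
      use strict;
      use Compress::Zlib;

      my $tape    = '/dev/nst0';           # made-up tape device
      my $pattern = qr{^/home/grinder};    # made-up restore pattern
      my @todo;

      my ($i, $status) = inflateInit();
      die "inflateInit failed: $status\n" unless $status == Z_OK;

      open TAPE, "<$tape" or die "Cannot open $tape: $!\n";
      binmode TAPE;

      my ($chunk, $leftover) = ('', '');
      while (read(TAPE, $chunk, 64 * 1024)) {
          my ($out, $st) = $i->inflate(\$chunk);
          die "inflate failed: $st\n"
              unless $st == Z_OK or $st == Z_STREAM_END;

          # carry any partial record over to the next chunk
          $leftover .= $out;
          while ($leftover =~ s/^([^\n]*)\n//) {
              my $record = $1;
              push @todo, $record if $record =~ $pattern;
          }
          last if $st == Z_STREAM_END;
      }
      close TAPE;

      # a final record with no trailing newline, if any
      push @todo, $leftover if length $leftover and $leftover =~ $pattern;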

      --
      g r i n d e r
Re: Different compression behaviours with Compress::Zlib
by bikeNomad (Priest) on Sep 12, 2001 at 02:02 UTC
    Yes. There is more to a gzipped file than a gzipped data stream. There's also a file header and footer, as you've found out.

    If you do the in-memory compression without using gzopen() etc. you will need to do in-memory decompression to read it.

    Or you could just write the header and footer yourself.
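
    Something along these lines ought to work (a rough, untested sketch: the header fields are hard-coded, the output name is made up, and you should compare the result against a real gzip file). The idea is to ask deflateInit() for a raw deflate stream by passing a negative -WindowBits, write the ten-byte gzip header yourself, and append the CRC32 and uncompressed length at the end:

    #!/usr/bin/perl -w
    use strict;
    use Compress::Zlib;

    my $file = 'testdata3.gz';    # made-up output name
    my @data = ( "foo\n", "bar\n", "Judge my vow, sphinx of black quartz\n" );

    # negative window bits => raw deflate, no zlib (78 DA) wrapper
    my ($d, $status) = deflateInit(
        -Level      => Z_BEST_COMPRESSION,
        -WindowBits => -MAX_WBITS,
    );
    die "deflateInit failed: $status\n" unless $status == Z_OK;

    open OUT, ">$file" or die "Cannot open $file: $!\n";
    binmode OUT;

    # gzip header: magic 1F 8B, CM=8 (deflate), no flags, mtime, XFL, OS=3 (Unix)
    print OUT "\x1f\x8b\x08\x00", pack('V', time), "\x02\x03";

    my ($crc, $size) = (0, 0);
    foreach my $line (@data) {
        $crc   = crc32($line, $crc);
        $size += length $line;
        my ($out, $st) = $d->deflate($line);
        die "deflate failed: $st\n" unless $st == Z_OK;
        print OUT $out;
    }
    my ($out, $st) = $d->flush();
    die "flush failed: $st\n" unless $st == Z_OK;
    print OUT $out;

    # gzip footer: CRC32 then uncompressed length, both little-endian
    print OUT pack('VV', $crc, $size);
    close OUT;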

    You might consider trying afio, which is like cpio but can compress individual files. It also knows about tapes. When combined with reasonable buffering (like Kbackup's Multibuf), you won't see any slowdown from its spawning gzip to compress streams.

    In general, the answer to streaming tape drives is good buffering; this may be difficult to do in a single Perl process. You may want to put a dual-buffer or buffer-pool program in between your program and the tape. If you use Multibuf, it will also detect end-of-medium conditions and allow you to change tapes so you can have multiple volumes.
