http://qs321.pair.com?node_id=520199

steves has asked for the wisdom of the Perl Monks concerning the following question:

I'm wondering if anyone has used Archive::Zip or a similar module for reading partial/corrupt zip files. These would normally be large files that didn't make it all the way before FTP or whatever dropped the connection. I've found that some Windows-based zip programs read what they can and tell me the rest is corrupt, so I'm guessing there are standard approaches to this. All my attempts with existing Archive::Zip methods result in errors and no data. If someone has done this sort of thing, I'd gladly start my new year off on a lazy foot and take advantage of what you've done.

Re: Reading partial/corrupt zip files
by abcde (Scribe) on Jan 01, 2006 at 12:11 UTC
    I am not sure of any library that can read corrupted files, but it might be possible to turn an incomplete file into a complete one, adding a phony footer and removing the last file.

    Look at the ZIP file format:
    http://www.pkware.com/business_and_developers/developer/appnote/
    The files in the zip are not connected to each other, so it is possible to read through the file, parsing each file as it comes and bailing out if the file ends unexpectedly:
    #!/usr/bin/perl
    use strict;
    use warnings;

    open( my $z, '<', 'test.zip' ) or die "Cannot open test.zip: $!";
    binmode $z;    # zip data is binary

    sub error { print "The file is corrupt.\n"; exit; }

    # Read exactly $_[0] raw bytes, bailing out if the file ends early.
    sub readstr {
        my $n = read( $z, my $received, $_[0] );
        error() unless defined $n && $n == $_[0];
        return $received;
    }

    # Read an unsigned integer of $_[0] bytes (zip stores them little-endian).
    sub readint {
        my $received = 0;
        $received = $received * 256 + ord($_) for reverse split //, readstr( $_[0] );
        return $received;
    }

    while ( !eof $z ) {
        my $head     = readstr(4);    # PK^C^D
        my $versions = readstr(4);    # ...
        my $filenamelength = readint(2);    # ...
        # Parse the rest of this file
        # Until we get an error or go on to the next file header
    }
    close($z);

    My code doesn't actually produce a correct footer for the file, but it should start you off.
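    For the phony footer itself, here is a rough sketch of what it might look like, going by the central directory and end-of-central-directory layouts in the PKWARE appnote. The function name build_footer and the hash keys are made up for illustration; the version fields are assumed to be 2.0, and ZIP64 archives and per-member comments are not handled.

```perl
use strict;
use warnings;

# Build a central directory plus end-of-central-directory record.
# $members:   array ref of hash refs describing intact members, with the
#             (hypothetical) keys name, flags, method, time, date, crc32,
#             csize, usize and offset (position of the local header).
# $cd_offset: byte position in the archive where this footer will start.
sub build_footer {
    my ( $members, $cd_offset ) = @_;
    my $cd = '';
    for my $m (@$members) {
        $cd .= pack( 'V v6 V3 v5 V2',
            0x02014b50,             # central directory header signature
            20, 20,                 # version made by / needed (assumed 2.0)
            @{$m}{qw(flags method time date crc32 csize usize)},
            length( $m->{name} ),   # file name length
            0, 0,                   # extra field length, comment length
            0, 0,                   # disk number start, internal attributes
            0,                      # external attributes
            $m->{offset},           # offset of the member's local header
        ) . $m->{name};
    }
    my $count = scalar @$members;
    my $eocd  = pack( 'V v4 V2 v',
        0x06054b50,                 # end of central directory signature
        0, 0,                       # this disk, disk holding the directory
        $count, $count,             # entries on this disk, entries in total
        length($cd),                # size of the central directory
        $cd_offset,                 # where the central directory starts
        0,                          # archive comment length
    );
    return $cd . $eocd;
}
```

    The idea would be to cut the damaged archive off right after the last complete member, append the returned bytes at that position, and then let a normal zip reader have a go at the result.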

      You've hit on the key -- that the files are not connected. Looking at the code, Archive::Zip appears to always first access the central directory information, which is at the end of the file. For files that are not fully sent, that never works since it's the last part of the file that's missing. It makes sense to build the code around the central directory -- it's surely faster than parsing the entire zip file to get the pieces that are available. So I think a recovery method would have to try and piece things together the slow way as you state.
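      To make "the slow way" concrete, here is a minimal sketch that walks the local file headers from the start of the data and records every member it can see. It assumes the 30-byte local header layout from the PKWARE appnote; scan_local_headers and the hash keys are names invented for this example. It gives up on members written with a trailing data descriptor (flag bit 3), because their sizes are not in the local header, and it ignores ZIP64.

```perl
use strict;
use warnings;

# Walk the local file headers of a (possibly truncated) zip held in $data.
# Returns one hash ref per member whose header and name could be read.
sub scan_local_headers {
    my ($data) = @_;
    my @members;
    my $pos = 0;

    while ( $pos + 30 <= length($data) ) {
        my ( $sig, $version, $flags, $method, $time, $date,
             $crc, $csize, $usize, $nlen, $xlen )
            = unpack 'V v5 V3 v2', substr( $data, $pos, 30 );

        last unless $sig == 0x04034b50;    # not "PK\003\004": stop scanning
        # Bit 3 set: the sizes live in a data descriptor after the data,
        # so this sketch cannot skip over the member and stops here.
        last if $flags & 0x08;
        last if $pos + 30 + $nlen > length($data);    # name is cut off

        my $start = $pos + 30 + $nlen + $xlen;    # first byte of member data
        push @members, {
            name     => substr( $data, $pos + 30, $nlen ),
            flags    => $flags,
            method   => $method,
            time     => $time,
            date     => $date,
            crc32    => $crc,
            csize    => $csize,
            usize    => $usize,
            offset   => $pos,
            # complete only if all of the compressed data made it across
            complete => ( $start + $csize <= length($data) ) ? 1 : 0,
        };
        $pos = $start + $csize;    # next local header, if there is one
    }
    return @members;
}
```

      To try it on a real file, slurp the zip in binary mode (open with '<:raw' and read with $/ undefined) and pass the resulting string in. Members marked complete should be the ones worth trying to extract; the last entry in the list is typically the one the dropped connection cut short.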

Re: Reading partial/corrupt zip files
by Anonymous Monk on Jan 01, 2006 at 08:46 UTC
    All my attempts with existing Archive::Zip methods result in errors and no data.
    Code, errors? Did you try Low-level member data reading?

      Code, errors? Did you try Low-level member data reading?

      I figured the key is probably in the low level routines. I didn't get too far with those (yet anyway). The issue is that the low level routines are member oriented. So I first have to know what members are in the zip file. In order to get members, you first have to read an existing zip file. The read method fails. After that failure none of the member methods return anything.

      Next I was going to take a corrupt zip I was able to read using one of the many Windows zip readers that handles corrupt files and try pulling members it tells me exist using the low level methods -- bypass the necessary step of having to first identify the members for now.