
Correct Perl settings for sending zipfile to browser

by Anonymous Monk
on Nov 14, 2019 at 02:18 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

First the Facts:

  1. A perfect zipfile is being generated, which unzips perfectly on the server (Linux).
  2. The zipfile contains a tab-delimited text file of UTF-8 text (stored as utf8mb4 in MariaDB).
  3. Following transfer to the client, via browser download, the zip cannot be opened.
  4. The zipped file is about 2MB in size.
  5. Several settings/variations have been tried, to no avail (see code and comments below).

Because some clients are on Windows and some on MacOS, the file needs to be exported in the line-ending format required by the client. MacOS (darwin), Linux, and UNIX all have the same "\n" endings, whereas Windows/DOS uses "\r\n". Some code has been added to facilitate this variation.

However, once the file is downloaded (via the browser's "Save as..." popup window), it cannot be opened with the usual zip tools like UnArchiver, nor can it be opened from the terminal with the "unzip" command. The latter gives the following error message:

$ unzip DB_ExportFile_2019-11-12.txt.zip 
Archive:  DB_ExportFile_2019-11-12.txt.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of DB_ExportFile_2019-11-12.txt.zip or
        DB_ExportFile_2019-11-12.txt.zip.zip, and cannot find DB_ExportFile_2019-11-12.txt.zip.ZIP, period.

The following code shows the subroutine that provides this download.

sub exportdatabase {
    fork: {
        my ($recnum,$revnum,$book,$chap,$verse,$text) = '';
        my @resp = ();
        my $timestamp = "$curdate_$curtime";
        $timestamp =~ s/[\/:.]/-/g;
        my $to_windows = '';
        my $CRLF = "\n";
        if ($OS eq "Windows") {
            $to_windows = '--to-crlf';   # SAME AS -l
            $CRLF = "\r\n";
        }
        $statement = qq|
            SELECT a.RecordNum, a.RevisionNum, a.Book, a.Chapter, a.Verse, a.Text
            from $table a
            INNER JOIN (SELECT RecordNum, max(RevisionNum) RevisionNum
                        FROM $table GROUP BY RecordNum) b
            USING (RecordNum,RevisionNum);
        |;
        &connectdb('exportdatabase');
        push @resp, "RECORD#\tREVISION#\tBOOK#\tCHAP#\tVERSE#\tTEXT, AS EDITED BY: $curdate $curtime (Pacific Time)$CRLF";
        while (($recnum,$revnum,$book,$chap,$verse,$text) = $quest->fetchrow_array()) {
            push @resp, "$recnum\t$revnum\t$book\t$chap\t$verse\t$text$CRLF";
        }
        chdir $exportdir or print "Cannot change directory to $exportdir\n";
        {
            open TARGET, ">$db_export_file"
                or die "Cannot export the database to a file. Please contact the system administrator.\n";
            print TARGET @resp;
            close TARGET;
        }
        my $zipfile = "$db_export_file.zip";

        # TO escape the 'taint' function on the $ENV{PATH}
        $ENV{'PATH'} =~ /(.*)/;
        $ENV{'PATH'} = $1;

        # -T         test the file
        # -l         flag will convert to CRLF line endings (Windows)
        # --to-crlf  convert to CRLF line endings (not for binary files)
        my $command = `/usr/bin/zip -T $to_windows $zipfile $db_export_file`;

        if ($zipfile eq '') {
            print "Content-type: text/html\n\n";
            print "You must specify a file to download.\n<p>";
        } else {
            open FILE, "<$zipfile" or die "Can't open $zipfile ... $!";
            # open(FILE, "<:raw:perlio", $zipfile);
            binmode FILE;
            # binmode FILE, ":encoding(UTF-8)";
            print qq|Content-Type: application/x-download\n|;
            # print "Content-Type: application/zip\n\n",
            print qq|Content-Length: | . (stat $zipfile)[7] . "\n";
            print qq|Content-Disposition: attachment; filename="$zipfile";\n\n|;
            while (<FILE>) { print };
            close FILE;
            # print chr(0);
            unlink("$zipfile")        || print $!;
            unlink("$db_export_file") || print $!;
            exit(0);
        }
    } # END fork
} # END SUB exportdatabase

Note that I have attempted several variations on the file's "binmode" options, as well as specifying the Content-Type two different ways. Guessing and checking, trying all the various possible combinations, is not working out. I'm obviously missing something, and it's probably something very simple.

Thanks in advance to the eagle-eyed coder who spots the flaw.

Replies are listed 'Best First'.
Re: Correct Perl settings for sending zipfile to browser
by tobyink (Canon) on Nov 14, 2019 at 10:11 UTC

    MacOS (darwin), Linux, and UNIX all have the same "\n" endings, whereas Windows/DOS uses "\r\n". Some code has been added to facilitate this variation.

    Firstly, I'd get rid of that. Most Windows software will cope fine with "\n" line endings, including pretty much any spreadsheet or database software you're planning to import the tab-delimited file into. It's only a few really basic tools like Notepad that might not. Once you get your code working, if you decide you really need "\r\n" on Windows, then you can add that code back in, but for now, I'd suggest removing it. Simplifying your code will help you find where your bug is.

    For now, I'd also suggest not generating a Content-Length HTTP header; perhaps it is somehow wrong and this is resulting in the browser truncating the file.

    Once you've done that, compare the size of the file and the MD5 sum of the file at both the client and server end, and check they match. (Don't unlink the ZIP file at the end, so you can compare them.)
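
    For the size/checksum comparison, something along these lines could be dropped into the script temporarily (a minimal sketch, assuming the core Digest::MD5 module and that $zipfile still points at the not-yet-unlinked file on disk); printing to STDERR keeps it out of the HTTP response:

    use Digest::MD5;

    open my $fh, '<:raw', $zipfile or die "Can't open $zipfile: $!";
    my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    print STDERR "server copy: ", -s $zipfile, " bytes, MD5 $md5\n";
    # then run md5sum (Linux) or md5 (macOS) on the downloaded copy and compare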

    My gut feeling is that you're somehow appending some extra data at the end of the file when you send it. ZIP files are unusual in that they include header-like information at the end of the file instead of the start. This is a throwback to the days when people would write ZIP files that spanned multiple floppy disks, and the header information couldn't be written until the process was finished, so got written onto the last floppy disk. (When you unzipped the file, you needed to insert the last disk first so the header could be read, then start at the beginning and insert them all in order until you got to the last one again which would need to be read again for its non-header data.) So yeah, you've probably got some extra data like an error message or warning or even a few line break characters, at the end of your ZIP.
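
    One quick (hedged) way to check for trailing junk: the End Of Central Directory record starts with "PK\x05\x06" and, when the archive has no comment, occupies the last 22 bytes of the file:

    use Fcntl qw/SEEK_END/;

    open my $fh, '<:raw', $zipfile or die "Can't open $zipfile: $!";
    seek $fh, -22, SEEK_END or die "seek failed: $!";
    read $fh, my $tail, 22;
    close $fh;
    print STDERR substr($tail, 0, 4) eq "PK\x05\x06"
        ? "EOCD where expected - no trailing junk\n"
        : "EOCD not at the end - extra bytes (or an archive comment) follow it\n";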

      Following your advice, I removed the Content-Length header. The file size of the downloaded zip increased. An error message indicated that I should check the binmode of the transferred file, so I changed that to the UTF-8 version. My file size increased again.

      Clearly, there is still some issue with the proper transfer from server to client.

      After learning this, I decided to retry the download a few times to see if the file sizes would be the same. I found the following issues, and made notes of them:

      #File size, (download attempt), unzip's reported missing bytes
      #4,569,935 (1) missing 942842225
      #4,569,765 (2) missing 942842396
      #4,571,674 (3) missing 942840486
      #4,569,765 (4) missing 942842396
      #4,571,656 (5) missing 942840504
      

      As the data indicates, only the second and fourth attempts resulted in identical numbers. Note that no change to the code was made between any of the attempts.

      With erratic results like this, I'm not sure where to look next.

      Here's a current error message, for comparison.

      $ unzip -v DB_ExportFile_2019-11-14.txt\(4\).zip 
      Archive:  DB_ExportFile_2019-11-14.txt(4).zip
      
      caution:  zipfile comment truncated
      error DB_ExportFile_2019-11-14.txt(4).zip:  missing 942842396 bytes in zipfile
        (attempting to process anyway)
      error DB_ExportFile_2019-11-14.txt(4).zip:  start of central directory not found;
        zipfile corrupt.
        (please check that you have transferred or created the zipfile in the
        appropriate BINARY mode and that you have compiled UnZip properly)
      

      It should also be noted that the original text file, before being zipped, weighs in at about 22 MB -- far short of the number of supposed missing bytes indicated in the error message.

        I wouldn't expect the file sizes to always be perfectly identical anyway. Your SELECT statement doesn't include an ORDER BY, so won't always return rows in the same order, and depending on what order they're returned, this will make subtle differences to how compressible the file is.

        Pretty sure you don't want to be reading the file as UTF-8. It should be raw. Might want to binmode STDOUT to raw too.
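
        Something along these lines for the output side (just a sketch, reusing the $zipfile variable from the original script):

        binmode STDOUT, ':raw';   # no encoding layer on the bytes we send
        open my $fh, '<:raw', $zipfile or die "Can't open $zipfile: $!";
        print while <$fh>;        # pass the ZIP through untouched
        close $fh;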

Re: Correct Perl settings for sending zipfile to browser
by cavac (Parson) on Nov 14, 2019 at 14:42 UTC

    You seem to be mixing up some stuff in my opinion.

    First of all, you don't seem to check the HTTP method, but you unlink (delete) the file no matter what. It's important to realise that, depending on the browser/client/useragent, it might make multiple requests, for example first a HEAD lookup, then partial downloads with GET (via range requests). HEAD and GET are idempotent, meaning that multiple requests to the same resource will yield the same result, unless you explicitly send an expiry header that says the browser can't rely on that.

    To quote the Wikipedia article on idempotence: "Idempotence is the property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application." There is also a nice explanation on Stack Overflow, which you should read.

    If you re-create the ZIP file for each request, it might change - not only depending on the data from the database, but also on the exact way ZIP is implemented (I'm not sure, but there could be a random element in how your implementation calculates the internal lookup tables). In that case, the partially downloaded parts might or might not match up to make a valid file.

    Here is an example header from my own webserver software for a zip file download (which works):

    200 OK
    Cache-Control: no-cache, no-store, must-revalidate
    Date: Thu, 14 Nov 2019 14:09:02 GMT
    Accept-Ranges: bytes
    ETag: b5e50bf7ab874841149553d71570c0ba4b54e940
    Server: PAGECAMEL/2.4
    Allow: GET, HEAD
    Content-Language: en
    Content-Length: 5199736
    Content-Type: application/octet-stream
    Expires: Thu, 14 Nov 2019 14:09:02 GMT
    Last-Modified: Sat, 03 Aug 2019 22:38:27 GMT

    As you can see, I disable caching, have a stable ETag header (basically the checksum of the file) and an Expires header that says "it expires now". The ETag is there so that when a browser requests a file again, it can first check with HEAD whether the ETag has changed - if it hasn't, it doesn't have to download the file again. Also, on partial downloads, it can check the ETag, and if that has changed it can show a proper error message along the lines of "can't continue download because the requested resource has changed".
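
    For example (just a sketch; Digest::SHA has been in core since Perl 5.10, and the choice of SHA-1 here is arbitrary), a content-based ETag could be computed like this:

    use Digest::SHA;

    my $etag = Digest::SHA->new(1)->addfile($zipfile, 'b')->hexdigest;
    print "ETag: $etag\r\n";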

    If you want to implement one-time downloads, you should make sure you implement it in a POST method (and check that the browser has used the correct one), as POST is not idempotent and allows resources on the server to change in response to client action.
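
    A rough sketch of such a check in a plain CGI script (REQUEST_METHOD is set by the web server):

    my $method = $ENV{REQUEST_METHOD} || '';
    unless ($method eq 'POST') {
        print "Status: 405 Method Not Allowed\r\nAllow: POST\r\n\r\n";
        exit;
    }
    # ... only now is it safe to do one-time work like unlinking the file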

    You are also using an experimental content type, application/x-download, which may or may not be supported by browsers. Try application/octet-stream instead; this tells the browser "I'm sending you some binary junk, do whatever you want with it".

    The Accept-Ranges header is there because I allow range requests. This helps a lot with the download managers integrated into modern browsers, especially on large files.

    The Content-Length header is also very important, because it allows session reuse and lets the browser verify that it got ALL of the data. It must be byte-exact; any error will mess up the download. This includes any stray line endings at the end of the content.
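
    As a sketch of byte-exact headers for this case (assuming the ZIP is already on disk and $zipfile holds its name; -s returns the size in bytes):

    print "Content-Type: application/octet-stream\r\n";
    print "Content-Length: ", -s $zipfile, "\r\n";
    print "Content-Disposition: attachment; filename=\"$zipfile\"\r\n";
    print "\r\n";   # end of headers; from here on, only the raw ZIP bytes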

    There are a few other things that you should fix in your code, like using a proper three-argument open.
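
    For instance, the TARGET open in the original could be written with a lexical handle and three arguments:

    open my $target_fh, '>', $db_export_file
        or die "Cannot export the database to a file: $!\n";
    print {$target_fh} @resp;
    close $target_fh or die "close failed: $!";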

    If you have LWP installed, you could check your headers with GET https://myurl on the command line.

Re: Correct Perl settings for sending zipfile to browser
by haukex (Archbishop) on Nov 14, 2019 at 19:00 UTC

    Personally, I would go about this differently: Instead of writing a file to disk, zipping it, and then re-reading it for download, you can do it all on the fly*. (If you did want to do it via temporary files like in your current script, please read my nodes on File::Temp examples and running external programs, as there are potential security and concurrency issues with your current script.)

    IO::Compress::Zip is a core module, and you can use it to generate and output a ZIP file on the fly (see its docs for details):

    use warnings;
    use strict;
    use IO::Compress::Zip qw/$ZipError/;

    my @lines = (qw/ Hello World Foo Bar /);
    my $eol = "\r\n";
    binmode STDOUT; # just to play it safe
    my $z = IO::Compress::Zip->new('-', # STDOUT
        Name => "Filename.txt" )
        or die "zip failed: $ZipError\n";
    for my $line (@lines) {
        $z->print($line, $eol);
    }
    $z->close();

    It's also possible to write the ZIP file to a scalar, e.g. if you need to know its length before writing it out, although that of course increases the memory usage. At the very least, you don't need to buffer the output lines like you're doing in your current script with @resp.
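
    A sketch of that in-memory variant (writing the archive into a scalar so its length is known before any output; the buffer and filename here are only placeholders):

    use IO::Compress::Zip qw/$ZipError/;

    my $buf = '';
    my $z = IO::Compress::Zip->new(\$buf, Name => "Filename.txt")
        or die "zip failed: $ZipError\n";
    $z->print("$_$eol") for @lines;
    $z->close();

    binmode STDOUT;
    print "Content-Length: ", length($buf), "\r\n\r\n";   # plus the other headers
    print $buf;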

    * OTOH, I agree with cavac that if these files are going to be unchanged across multiple downloads, it'd certainly be more efficient to not re-generate them on every request and use appropriate HTTP caching methods instead.

    Update: Minor edits.

      Implementing your example with my code, where @lines is changed to @resp, and where I remove the $CRLF from the lines going into the array, results in the following two errors in the log file, and the browser gives an Internal Server Error.

      ...AH01215: Wide character in IO::Compress::Zip::write...

      ...malformed header from script '___.pl': Bad header: PK\x03\x04\x14,...

      In this case, the file will very likely change every time it is downloaded, as the database is regularly updated. I experimented earlier with multiple downloads during a time when I knew that no one was logged in to the database to make changes, but ordinarily changes are to be expected. That means that if I could get a direct download like you suggest working, it would be a perfect solution in this case.

        the browser gives an Internal Server Error

        The code I showed isn't a complete CGI example, since it doesn't output the headers, so those would need to be added back in. Since in the original code those are being written by hand, I'd suggest at least upgrading to one of the CGI modules, such as e.g. CGI::Simple, to generate those for you.
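
        For instance (a sketch only; the -attachment filename here is made up), CGI::Simple can emit the whole header block, including the terminating blank line, in one call:

        use CGI::Simple;

        my $q = CGI::Simple->new;
        print $q->header(
            -type       => 'application/zip',
            -attachment => 'DB_ExportFile.zip',   # hypothetical filename
        );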

        Wide character in IO::Compress::Zip::write

        That would mean that there's Unicode in your @lines. (Although I don't see an encoding being set on TARGET in the original code, so I think it would have the same issue?) Anyway, although IO::Compress::Zip provides a filehandle-like interface, it looks like it doesn't (yet?) support encoding layers. A manual encoding with Encode does work though:

        use warnings;
        use strict;
        use IO::Compress::Zip qw/$ZipError/;
        use Encode qw/encode/;

        my @lines = (qw/ Hello World Foo Bar /, "\N{U+1F42A}");
        my $eol = "\r\n";
        my $encoding = "UTF-8"; # or maybe "CP1252" for Windows
        binmode STDOUT; # just to play it safe
        my $z = IO::Compress::Zip->new('-', # STDOUT
            Name => "Filename.txt" )
            or die "zip failed: $ZipError\n";
        for my $line (@lines) {
            $z->print( encode($encoding, $line.$eol,
                Encode::FB_CROAK|Encode::LEAVE_SRC) );
        }
        $z->close();

        Note: For encodings such as UTF-16, it seems encode adds a Byte Order Mark for every string it encodes, and I don't see an option in the module to disable that. One way to get rid of them is to remove them manually, but an alternative might be to replace the for loop with this, at the expense of higher memory usage: $z->print( encode($encoding, join('', map {$_.$eol} @lines), Encode::FB_CROAK|Encode::LEAVE_SRC) ); - or just stick to UTF-8, as that's pretty ubiquitous.

        Update:

        I remove the $CRLF from the lines

        You can leave that in and remove my $eol, as they're the same thing (I missed that on my first read of the original source, sorry).

Re: Correct Perl settings for sending zipfile to browser
by bliako (Monsignor) on Nov 14, 2019 at 09:13 UTC

    use \r\n to separate headers.

      Is the size of the downloaded zip file the same as the one on the server? Does it start with PK?

      The downloaded zip is NOT the same size as the one on the server. The one on the server is SMALLER.

      For example, after commenting out the "unlink" lines so the files stay on the server and then comparing them: the zipped file on the server is 2,002,621 bytes, the downloaded file is 4,568,860 bytes, and unzip claims an additional 942,843,300 bytes are missing, even though the unzipped .txt file on the server was only 12,145,337 bytes. Irritatingly unexplainable, so far.

      Yes, the file starts with "PK".

        \r\n aka CRLF should be used to separate the headers. Use a module like CGI to print the headers so you don't have to worry about that.

        After commenting out the unlink, could you unzip the file successfully on the server?
Re: Correct Perl settings for sending zipfile to browser
by Anonymous Monk on Nov 14, 2019 at 06:40 UTC

    The zip file printing code looks OK to me. You could run two tests related to the download, if you haven't already ...

    1. make a zipfile available to be downloaded straight via a link (put under password if you want);
    2. extract the code that prints the HTTP headers & the zipfile content into its own program; then just run that to download the file via the browser prompt.

    ... after download, test if you can successfully extract contents.
