PerlMonks
Fetch URL Contents to File Handle

by grahjenk (Initiate)
on Jun 09, 2020 at 00:17 UTC [id://11117844]

grahjenk has asked for the wisdom of the Perl Monks concerning the following question:

I need to download a very large zipfile containing thousands of records, and print the first ten records as shown here:
    use File::Fetch;
    use IO::Uncompress::AnyUncompress qw(anyuncompress $AnyUncompressError);

    # Download a very large zipfile
    my $ff = File::Fetch->new(uri => "https://tranco-list.eu/top-1m.csv.zip");
    my $scalar;
    my $where = $ff->fetch( to => \$scalar ) or die $ff->error;

    # Print the first 10 lines
    my $z = IO::Uncompress::AnyUncompress->new(\$scalar);
    for (my $i = 0; $i < 10; $i++) {
        my $line = $z->getline();
        $line =~ s/\r|\n//g;
        print $line, "\n";
    }
Is there a way of downloading into a filehandle or pipe so that I don't have to download the entire large file?

Replies are listed 'Best First'.
Re: Fetch URL Contents to File Handle
by haukex (Archbishop) on Jun 09, 2020 at 08:19 UTC

    A little bit of research on that site shows that getting the "top 10" is as easy as using the URL https://tranco-list.eu/download/K3VW/10. You can also register an account to customize the download even further.

        use warnings;
        use strict;
        use HTTP::Tiny;
        use Text::CSV qw/csv/;  # also install Text::CSV_XS for speed

        my $resp = HTTP::Tiny->new->get('https://tranco-list.eu/download/K3VW/10');
        $resp->{success} or die "$resp->{status} $resp->{reason}\n";
        my $topten = csv(in => \$resp->{content});

        use Data::Dump; dd $topten;

        __END__
        [
          [1, "google.com"],
          [2, "facebook.com"],
          [3, "youtube.com"],
          [4, "netflix.com"],
          [5, "microsoft.com"],
          [6, "twitter.com"],
          [7, "tmall.com"],
          [8, "instagram.com"],
          [9, "qq.com"],
          [10, "linkedin.com"],
        ]

      Research ++. Answers like this always cheer me up: answers that take a step back and achieve the goal quickly and efficiently, without assuming the question includes all relevant information.

Re: Fetch URL Contents to File Handle
by haukex (Archbishop) on Jun 09, 2020 at 07:28 UTC
    I need to download a very large zipfile containing thousands of records, and print the first ten records ... Is there a way of downloading into a filehandle or pipe so that I don't have to download the entire large file?

    A ZIP file's central directory is at the end of the file. Although you could get fancy with range requests, it might be easier to actually download the whole file. How big is "very large"? Update: Even though 10MB isn't that big for a daily download, it turns out to be much easier to use the site's API.
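    Not the route taken here, but for the record, a suffix range request with HTTP::Tiny might look like the sketch below. The 64 KiB tail size is an arbitrary assumption about how much of the file covers the End of Central Directory record and the central directory itself, and the server must honour Range requests for this to return a 206:

    ```perl
    use strict;
    use warnings;
    use HTTP::Tiny;

    # Hedged sketch: fetch only the final 64 KiB of the ZIP, where the
    # End of Central Directory record and (usually) the central directory
    # live. 'bytes=-65536' is an HTTP suffix range: the last 65536 bytes.
    my $url  = 'https://tranco-list.eu/top-1m.csv.zip';
    my $resp = HTTP::Tiny->new->get($url, {
        headers => { Range => 'bytes=-65536' },
    });
    die "$resp->{status} $resp->{reason}\n" unless $resp->{success};

    # A status of 206 Partial Content means the server honoured the range;
    # a 200 means it ignored it and sent the whole file anyway.
    printf "status=%s, got %d bytes\n", $resp->{status}, length $resp->{content};
    ```

    You would then still have to parse the central directory yourself, which is why downloading the whole file (or using the site's API) is easier in practice.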

      A ZIP file's central directory is at the end of the file. Although you could get fancy with range requests, ...

      This is true, but it is also possible to read a zip file in streaming mode without using the central directory at the end of the file. That's what IO::Uncompress::AnyUncompress does (via IO::Uncompress::Unzip).

      If there is an HTTP module that exposes a filehandle interface, then IO::Uncompress::AnyUncompress can read from it.
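      To illustrate the streaming point without any network access, here is a self-contained sketch: it builds a small ZIP in memory with IO::Compress::Zip (the CSV contents and member name are made up for the example), then reads it back through a filehandle with IO::Uncompress::Unzip, which walks the local file headers sequentially and never needs to seek to the central directory. The same pattern works with any readable filehandle, e.g. the read end of a pipe fed by an HTTP download.

      ```perl
      use strict;
      use warnings;
      use IO::Compress::Zip qw(zip $ZipError);
      use IO::Uncompress::Unzip qw($UnzipError);

      # Build a 100-record CSV and zip it into an in-memory scalar
      # (illustrative data; any ZIP stream would do).
      my $csv = join '', map { "$_,example$_.com\r\n" } 1 .. 100;
      zip \$csv => \my $zipped, Name => 'list.csv'
          or die "zip failed: $ZipError";

      # Open a filehandle on the in-memory ZIP and decompress in
      # streaming mode, reading only as far as the first ten records.
      open my $fh, '<', \$zipped or die $!;
      my $z = IO::Uncompress::Unzip->new($fh)
          or die "unzip failed: $UnzipError";

      for (1 .. 10) {
          defined(my $line = $z->getline) or last;
          $line =~ s/\r?\n\z//;
          print $line, "\n";
      }
      ```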

        This is true, but it is also possible to read a zip file in streaming mode without using the central directory at the end of the file.

        Yes, that's a good point, thanks! My understanding is that files can have been deleted or replaced in the central directory yet still be present in the body of the ZIP file, though I haven't encountered such a ZIP file in the wild myself. I wrote the parent node before looking into the ZIP file in question and discovering that it contains only a single file.

Re: Fetch URL Contents to File Handle
by perlfan (Vicar) on Jun 09, 2020 at 03:35 UTC
    You might be able to override HTTP::Tiny's mirror method, since it surely uses a file handle internally. It might provide you some inspiration.
      No cigar. HTTP::Tiny's mirror internally uses a file handle, but the corresponding parameter is a file name.
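      For what it's worth, HTTP::Tiny does expose a streaming hook on get itself: the data_callback option, which receives each chunk of the body as it arrives. A hedged sketch (not tested against the live site) that feeds those chunks through a pipe so IO::Uncompress::Unzip can consume them as a filehandle, and aborts the transfer once ten records have been read:

      ```perl
      use strict;
      use warnings;
      use HTTP::Tiny;
      use IO::Handle;
      use IO::Uncompress::Unzip qw($UnzipError);

      # Sketch: child process downloads and writes each chunk into a pipe;
      # parent decompresses the stream as it arrives.
      pipe(my $reader, my $writer) or die "pipe: $!";
      my $pid = fork() // die "fork: $!";

      if ($pid == 0) {
          close $reader;
          $writer->autoflush(1);
          HTTP::Tiny->new->get(
              'https://tranco-list.eu/top-1m.csv.zip',
              { data_callback => sub { print {$writer} $_[0] } },
          );
          close $writer;
          exit 0;
      }

      close $writer;
      my $z = IO::Uncompress::Unzip->new($reader)
          or die "unzip failed: $UnzipError";

      for (1 .. 10) {
          defined(my $line = $z->getline) or last;
          $line =~ s/\r?\n\z//;
          print $line, "\n";
      }

      # Closing the read end sends the child SIGPIPE on its next write,
      # which ends the download early instead of fetching the whole file.
      close $reader;
      waitpid $pid, 0;
      ```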
      Please stop guessing; these posts are grasping at straws.

Node Type: perlquestion [id://11117844]
Approved by kcott