using LWP::Simple to fetch binary file (gnu zip)

by Largins (Acolyte)
on Jan 24, 2012 at 01:47 UTC

Largins has asked for the wisdom of the Perl Monks concerning the following question:

Hello

I am attempting to fetch some .gz files in a small web spider.
First I get the robots.txt file and look for a sitemap entry. This works fine.
Next, I download sitemap.xml, parse it, and fetch the files named in its links. This works great for text-based files. For binary files getstore appears to work, but after it finishes the files aren't there.
I'm not sure what's wrong, or even whether getstore can handle binary files (it seems to me I have used it for .gifs, etc. with success).
The getstore call is in the text subroutine.
Here's the code:

#!/usr/bin/env perl
#
# Name: TestFetch.pl
#
# Requires Internet access
#
use strict;
use warnings;
use LWP::Simple;
use HTML::Parser;

# Global Variables
my $debug = 1;

package MyParser;
my $sitetrigger = 0;
my $lastmodtrigger = 0;
my $tofile = "";
my $pos = -1;

use base qw(HTML::Parser);

sub start {
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    $sitetrigger = 0;
    $lastmodtrigger = 0;
    if((index $tagname, "loc") ne -1) {
        $sitetrigger = 1;
    }
    if((index $tagname, "lastmod") ne -1) {
        $lastmodtrigger = 1;
    }
    if($debug == 1) {
        print "------------START-----------\n";
        print "tagname: $tagname\n";
    }
}

sub text {
    my ($self, $text) = @_;
    my $filename = "";
    if($sitetrigger == 1) {
        $filename = "";
        $pos = rindex($text, '/', );
        if($pos ne -1) {
            $filename = substr($text, ($pos + 1));
        }
        print "fetching: $text into $filename\n";
        LWP::Simple->getstore ($text, $filename);
        sleep(6);
    }
    if($debug == 1) {
        print "------------TEXT-----------\n";
        print "sitetrigger: $sitetrigger\n";
        print "lastmodtrigger: $lastmodtrigger\n";
        print "filename: $filename\n";
        print "text: $text\n";
    }
}

sub end {
    my($self, $end, $origtext) = @_;
    if($debug == 1) {
        print "------------END-----------\n";
        print "end: $end\n";
    }
}

package main;

my $htmlparse = new MyParser;
my $loc = "";
my $siteurl;
my $filefound = 0;
my $pos1 = -1;
my $content = "";
my $url = $ARGV[0];

$loc = $url . '/robots.txt';
if($loc ne "") {
    if ($debug == 1) {
        print "loc: $loc\n";
    }
    getstore($loc, 'robots.txt') or die "Couldn't get robots.txt";
    open IN, 'robots.txt' or die $!;
    while (<IN>) {
        $pos1 = index (uc $_, 'SITEMAP');
        if($pos1 ne -1) {
            $siteurl = substr($_, ($pos1 + 8));
            if ($debug == 1) {
                print "siteurl: $siteurl\n";
            }
            $content = get($siteurl);
            $filefound = 1;
            last;
        }
    }
    close IN or die "IN: $!";
    if($filefound == 1) {
        $htmlparse->parse($content);
    }
}

Any assistance would be appreciated.

Thanks
Largins

Replies are listed 'Best First'.
Re: using LWP::Simple to fetch binary file (gnu zip)
by Anonymous Monk on Jan 24, 2012 at 08:41 UTC

    after it finishes, the files aren't there

    The files aren't where?

    You're looking in the wrong place. Look in the cwd / pwd that was in effect when you launched this program,

    or specify full paths, or chdir to the directory you're interested in.
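    The full-path suggestion could be sketched roughly as follows. This is only an illustration: the helper name save_path is made up for this example, and it assumes the file name is whatever follows the last slash in the URL, as in the original script.

```perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Spec;

# Hypothetical helper: build a save path from a directory and the
# last path segment of a URL.
sub save_path {
    my ($dir, $url) = @_;
    my ($name) = $url =~ m{([^/]+)\z};    # text after the final slash
    die "no file name in '$url'\n" unless defined $name;
    return File::Spec->catfile($dir, $name);
}

# Show where a relative file name would end up right now.
my $dir = getcwd();
print "working directory: $dir\n";
print "would save to: ",
    save_path($dir, 'http://www.archive.org/sitemap/sitemap_00000.xml.gz'), "\n";
```

    Passing the result of save_path as the second argument to getstore removes any doubt about which directory the file lands in.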

      I am indeed looking in the cwd / pwd from which the program was launched. Since I am on a Microsoft system, I also searched the entire disk drive (thinking that Microsoft might think they know better than me where the files should be stored) and couldn't locate the files.
      I'll try a full path.

Re: using LWP::Simple to fetch binary file (gnu zip)
by lune (Pilgrim) on Jan 24, 2012 at 15:39 UTC
    I suppose your print statement, which appears before the getstore call, gives you the expected result.

    So why don't you just check for errors returned by getstore first?

    My guess is: 401 (RC_UNAUTHORIZED)
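    Checking the value returned by getstore, as suggested above, might look like the sketch below. The helper name describe_fetch is invented for this example, and the URL and file name are taken from the command line so nothing is fetched unless both are given.

```perl
use strict;
use warnings;
use LWP::Simple qw(getstore);
use HTTP::Status qw(is_success status_message);

# Hypothetical helper: turn a getstore() return value into a report line.
sub describe_fetch {
    my ($url, $status) = @_;
    return is_success($status)
        ? "fetched $url"
        : "failed $url: $status " . status_message($status);
}

# Usage (script name is an assumption): perl fetch_check.pl URL FILENAME
if (@ARGV == 2) {
    my ($url, $file) = @ARGV;
    # getstore is a plain function; it returns the HTTP status code.
    my $status = getstore($url, $file);
    print describe_fetch($url, $status), "\n";
}
```

    Any status outside the 2xx range means nothing useful was stored, so the code and its message are the first things to look at.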

      The actual error is 501 (Not Implemented).
      Here is a simplified bit of code that returns the same error
      First, a copy of the robots.txt file from the site used (www.archive.org):

      ##############################################
      #
      # Welcome to the Archive!
      #
      ##############################################
      # Please crawl our files.
      # We appreciate if you can crawl responsibly.
      # Stay open!
      ##############################################

      # slow down the ask jeeves crawler which was hitting our SE a little too fast
      # via collection pages. --Feb2008 tracey--
      User-agent: Teoma
      Disallow: /control/
      Disallow: /report/
      Sitemap: http://www.archive.org/sitemap/sitemap.xml
      Crawl-delay: 10

      User-agent: *
      Disallow: /control/
      Disallow: /report/
      Disallow: /details/goldenbull2007john/
      Disallow: /stream/goldenbull2007john/
      Disallow: /download/goldenbull2007john/
      Disallow: /14/items/goldenbull2007john/goldenbull2007john_djvu.txt
      Sitemap: http://www.archive.org/sitemap/sitemap.xml
      Crawl-delay: 10

      Next a small portion of the sitemap.xml
      <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <sitemap>
          <loc>http://www.archive.org/sitemap/sitemap_00000.xml.gz</loc>
          <lastmod>2012-01-24T11:32:13Z</lastmod>
        </sitemap>
        <sitemap>
          <loc>http://www.archive.org/sitemap/sitemap_00001.xml.gz</loc>
          <lastmod>2012-01-24T11:32:18Z</lastmod>
        </sitemap>

      And a simple Perl script with error checking:
      #!/usr/bin/env perl
      #
      # Name: TestFetch.pl
      #
      # Requires Internet access
      #
      use strict;
      use warnings;
      use LWP::Simple;
      use HTML::Parser;
      use HTTP::Status qw(:constants :is status_message);

      package main;

      my $text = 'http://www.archive.org/sitemap/sitemap_00000.xml.gz';
      my $filename = 'sitemap_00000.xml.gz';
      my $hstatus = 0;

      $hstatus = LWP::Simple->getstore ($text, $filename);
      if($hstatus != HTTP_OK) {
          print "$hstatus: ", status_message($hstatus), "\n";
      }

      I am able to fetch the file manually
      Largins

        The problem is you are calling a function as a method:

        $hstatus = LWP::Simple->getstore ($text, $filename);

        Change that to:

        $hstatus = getstore ($text, $filename);

        And it will work.

        Effectively you are calling the function with the string 'LWP::Simple' as the first argument, where it is expecting a URL. It tries to parse that string to discover the protocol (http://, https://, ftp:// etc.) it should use, doesn't find anything it recognises, and so returns 501 (Not Implemented).
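        The argument shift can be seen without any network access. The sketch below uses a made-up package Demo with a plain function show_args standing in for getstore; neither name comes from LWP.

```perl
use strict;
use warnings;

package Demo;

# A plain function, like getstore: it expects (url, file).
sub show_args {
    my ($url, $file) = @_;
    return "url=$url file=" . (defined $file ? $file : '(undef)');
}

package main;

# Function call: the arguments arrive as written.
my $as_function = Demo::show_args('http://example.com/a.gz', 'a.gz');

# Method call: Perl prepends the invocant 'Demo', so everything shifts right.
my $as_method = Demo->show_args('http://example.com/a.gz', 'a.gz');

print "$as_function\n";    # url=http://example.com/a.gz file=a.gz
print "$as_method\n";      # url=Demo file=http://example.com/a.gz
```

        In the method-call form the class name lands in the URL slot and the real URL lands in the file-name slot, which matches the behaviour described above.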


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

