The actual error message is "501 Not Implemented".
Here is a simplified bit of code that returns the same error.
First, a copy of the robots.txt file from the site used (www.archive.org):
##############################################
#
# Welcome to the Archive!
#
##############################################
# Please crawl our files.
# We appreciate if you can crawl responsibly.
# Stay open!
##############################################
# slow down the ask jeeves crawler which was hitting our SE a little too fast
# via collection pages. --Feb2008 tracey--
User-agent: Teoma
Disallow: /control/
Disallow: /report/
Sitemap: http://www.archive.org/sitemap/sitemap.xml
Crawl-delay: 10
User-agent: *
Disallow: /control/
Disallow: /report/
Disallow: /details/goldenbull2007john/
Disallow: /stream/goldenbull2007john/
Disallow: /download/goldenbull2007john/
Disallow: /14/items/goldenbull2007john/goldenbull2007john_djvu.txt
Sitemap: http://www.archive.org/sitemap/sitemap.xml
Crawl-delay: 10
Next, a small portion of the sitemap.xml:
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>
http://www.archive.org/sitemap/sitemap_00000.xml.gz
</loc>
<lastmod>
2012-01-24T11:32:13Z
</lastmod>
</sitemap>
<sitemap>
<loc>
http://www.archive.org/sitemap/sitemap_00001.xml.gz
</loc>
<lastmod>
2012-01-24T11:32:18Z
</lastmod>
</sitemap>
And a simple Perl script with error checking:
#!/usr/bin/env perl
#
# Name: TestFetch.pl
#
# Requires Internet access
#
use strict;
use warnings;
use LWP::Simple;
use HTML::Parser;
use HTTP::Status qw(:constants :is status_message);
package main;
my $text = 'http://www.archive.org/sitemap/sitemap_00000.xml.gz';
my $filename = 'sitemap_00000.xml.gz';
my $hstatus = 0;
$hstatus = LWP::Simple->getstore ($text, $filename);
if ($hstatus != HTTP_OK) {
    print "$hstatus: ", status_message($hstatus), "\n";
}
I am able to fetch the file manually, so the URL itself appears to be fine.
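One possible cause, offered as a guess rather than a confirmed diagnosis: getstore is an exported function, not a class method, so writing LWP::Simple->getstore(...) would pass the string 'LWP::Simple' as the first argument. The URL would then be 'LWP::Simple', which LWP may read as a URL with the unsupported scheme 'LWP' and answer with 501. The sketch below shows only the argument shift and needs no network access; Demo::Fetch and its getstore are hypothetical stand-ins, not part of LWP.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical stand-in for a two-argument function such as
# LWP::Simple::getstore($url, $file). It only reports what it received.
package Demo::Fetch;

sub getstore {
    my ($url, $file) = @_;
    return "url=$url file=$file";
}

package main;

my $text     = 'http://www.archive.org/sitemap/sitemap_00000.xml.gz';
my $filename = 'sitemap_00000.xml.gz';

# Function call: arguments arrive as written.
print Demo::Fetch::getstore($text, $filename), "\n";

# Method call: Perl prepends the class name, so the "URL" becomes
# 'Demo::Fetch' and the real URL lands in the file-name slot.
print Demo::Fetch->getstore($text, $filename), "\n";
```

If that is what is happening here, calling the imported function directly, as in getstore($text, $filename), would be the first thing to try.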
Largins