Having a little bit of time to myself, I decided I'd finally put together a script for checking the links on one of my web sites. As a well-trained monk, I found a modified version of Recipe 20.7 from the Perl Cookbook.
All fine and dandy, especially after I spiffed things up a bit to use -w and strict and to add some basic error-checking. (Note: The version in the Cookbook is actually slightly [c]leaner than the one I linked to. Start with that one first.)
As I'm sure most of you already know, this Recipe (as published) relies on the head() routine exported by LWP::Simple (alt). In practice, though, it appears that some sites (pages?) don't respond to HEAD requests. The workaround, as mentioned in lwpcook, is to use get() instead.
Since this can take some time, especially when bandwidth is a little lean, it occurred to me that it might be wiser to perform a simple optimization: try head() and then try get() only if the former fails. (The idea being, of course, to not invoke the longer download if the smaller one succeeds.)
This works, but in putting it together, I got to wondering if:

1. There might be a better, more efficient (and possibly even faster) way to do this.
2. I'm fully understanding the problems with HEAD requests. (Can anyone point to a link that discusses the problems in more detail?)
3. There are any known subtleties or considerations regarding Perl's short-circuit evaluation and/or this sort of optimization. (I see nothing mentioned in perlop, but I thought I'd check; I've run into various subtleties with other languages in the past.)
4. I've fallen into any of the [non?]standard traps that I tend to obsess about, e.g. CCP, memes, security risks, etc. (I plan to put a cleaned-up version of the code on my site as a CGI script, so I worry.)
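Regarding question 3, here's a tiny self-contained sketch of what I believe || does (fake_head() and fake_get() are stand-ins I made up for illustration; they just mimic the true/false results of the real LWP::Simple calls). The point is that the right-hand side is never evaluated when the left-hand side is true, which is exactly what lets head() || get() skip the full download on success:

```perl
#!/usr/bin/perl -w
# Demonstrate short-circuiting of || with stand-in subs.
use strict;

my $get_called = 0;

sub fake_head { return shift; }               # pretend HEAD result
sub fake_get  { $get_called++; return 1; }    # pretend full download

# HEAD "succeeds": fake_get() is never invoked.
my $ok = fake_head( 1 ) || fake_get();
print "after success: get called $get_called time(s)\n";   # 0

# HEAD "fails": only now does fake_get() run.
$ok = fake_head( 0 ) || fake_get();
print "after failure: get called $get_called time(s)\n";   # 1
```

If that's all there is to it, the optimization should be safe; but as I said, subtleties welcome.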
Here's the code as it currently stands. (It's still a little crude):
#!/usr/bin/perl -w
# Adapted, in part, from Recipe 20.7 of The Perl Cookbook.
use strict;
use HTML::LinkExtor;
use LWP::Simple;
my @links;    # holds all links found on target page
my %dead;     # list of unresolvable (dead) links.

my $base_url = shift
    or die "Usage: $0 <url_to_parse>\n\n",
           "(including scheme prefix, e.g. http://)\n";

unless ( url_exists( $base_url ) )
{
    die "Unable to test links;\n",
        "$base_url could not be reached\n\n";
}
print "Parsing $base_url...\n";
my $parser = HTML::LinkExtor->new( undef, $base_url );
$parser->parse( get( $base_url ) );
@links = $parser->links;
print "Checking links";
foreach my $linkarray ( @links )
{
    print '.';    # show user that something is happening.
    my @element  = @$linkarray;
    my $elt_type = shift @element;
    while ( @element )
    {
        my ( $attr_name, $attr_value ) = splice( @element, 0, 2 );

        # skip mailto: links, which you shouldn't have anyway.
        next if ( $attr_value =~ /^mailto:/i );

        unless ( url_exists( $attr_value ) )
        {
            $dead{ $attr_value }++;
        }
    }
}
print "\nAll links checked.\n\n",
      "$base_url contains ", scalar @links, " link(s)";

unless ( scalar keys %dead )
{
    print "; all live.\n";
}
else
{
    print "; sadly, the following links aren't working:\n\n";
    print "$_\n" for sort keys %dead;
    print "\n";
}
sub url_exists
{
    # Two tests are done because some sites do not appear to
    # properly support HEAD requests. Ex: www.borland.com
    # As a result, we were getting some false positives; the
    # extra test prevents those.
    my $url = shift;
    return head( $url ) || get( $url );
}
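For what it's worth, one direction a "better way" might take (a sketch under my own assumptions, not drop-in code; the 10-second timeout is just a number I picked) is to drop to LWP::UserAgent directly, so the script can set a timeout and inspect the actual response object instead of LWP::Simple's bare true/false:

```perl
#!/usr/bin/perl -w
# Sketch: url_exists() via LWP::UserAgent, with the same
# HEAD-then-GET fallback for servers that mishandle HEAD.
use strict;
use LWP::UserAgent;

sub url_exists
{
    my $url = shift;
    my $ua  = LWP::UserAgent->new( timeout => 10 );    # assumed value

    # Try the cheap HEAD first; fall back to GET only on failure.
    my $response = $ua->head( $url );
    $response = $ua->get( $url ) unless $response->is_success;

    return $response->is_success;
}
```

Whether the extra dependency and verbosity are worth it over LWP::Simple, I'm not sure; that's part of what I'm asking.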
Thanks in advance for any assistance/feedback...
--f