Having a little bit of time to myself, I decided I'd finally put together a script for checking the links on one of my web sites. As a well-trained monk, I found a modified version of Recipe 20.7 from the Perl Cookbook.
All fine and dandy, especially after I spiffed things up a bit to use -w and strict and to add some basic error-checking. (Note: The version in the Cookbook is actually slightly [c]leaner than the one I linked to. Start with that one first.)
As I'm sure most of you already know, this Recipe (as published) relies on the head() routine exported by LWP::Simple (alt). In practice, though, it appears that some sites (pages?) don't respond to HEAD requests. The workaround, as mentioned in lwpcook, is to use get() instead.
Since this can take some time, especially when bandwidth is a little lean, it occurred to me that it might be wiser to perform a simple optimization: try head() and then try get() only if the former fails. (The idea being, of course, to not invoke the longer download if the smaller one succeeds.)
This works, but in putting it together, I got to wondering if:

1. There might be a better, more efficient (and possibly even faster) way to do this.
2. I'm fully understanding the problems with HEAD requests. (Can anyone point to a link that discusses the problems in more detail?)
3. There are any known subtleties or considerations regarding Perl's short-circuit evaluation and/or this sort of optimization. (I see nothing mentioned in perlop, but I thought I'd check; I've run into various subtleties with other languages in the past.)
4. I've fallen into any of the [non?]standard traps that I tend to obsess about, e.g. CCP, memes, security risks, etc. (I plan to put a cleaned-up version of the code on my site as a CGI script, so I worry.)
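Regarding question 3, here's a tiny self-contained sketch of what I believe || does (fake_head() and fake_get() are stand-ins I made up for illustration; they just mimic the true/false results of the real LWP::Simple calls). The point is that the right-hand side is never evaluated when the left-hand side is true, which is exactly what lets head() || get() skip the full download on success:

```perl
#!/usr/bin/perl -w
# Demonstrate short-circuiting of || with stand-in subs.
use strict;

my $get_called = 0;

sub fake_head { return shift; }               # pretend HEAD result
sub fake_get  { $get_called++; return 1; }    # pretend full download

# HEAD "succeeds": fake_get() is never invoked.
my $ok = fake_head( 1 ) || fake_get();
print "after success: get called $get_called time(s)\n";   # 0

# HEAD "fails": only now does fake_get() run.
$ok = fake_head( 0 ) || fake_get();
print "after failure: get called $get_called time(s)\n";   # 1
```

If that's all there is to it, the optimization should be safe; but as I said, subtleties welcome.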
Here's the code as it currently stands. (It's still a little crude):
#!/usr/bin/perl -w
# Adapted, in part, from Recipe 20.7 of The Perl Cookbook.
use strict;
use HTML::LinkExtor;
use LWP::Simple;
my @links;    # holds all links found on target page
my %dead;     # list of unresolvable (dead) links.

my $base_url = shift
    or die "Usage: $0 <url_to_parse>\n\n",
           "(including scheme prefix, e.g. http://)\n";

unless ( url_exists( $base_url ) )
{
    die "Unable to test links;\n",
        "$base_url could not be reached\n\n";
}
print "Parsing $base_url...\n";
my $parser = HTML::LinkExtor->new( undef, $base_url );
$parser->parse( get( $base_url ) );
@links = $parser->links;
print "Checking links";
foreach my $linkarray ( @links )
{
    print '.';    # show user that something is happening.
    my @element  = @$linkarray;
    my $elt_type = shift @element;
    while ( @element )
    {
        my ( $attr_name, $attr_value ) = splice( @element, 0, 2 );

        # skip mailto: links, which you shouldn't have anyway.
        next if ( $attr_value =~ /^mailto:/i );

        unless ( url_exists( $attr_value ) )
        {
            $dead{ $attr_value }++;
        }
    }
}
print "\nAll links checked.\n\n",
      "$base_url contains ", scalar @links, " link(s)";

unless ( scalar keys %dead )
{
    print "; all live.\n";
}
else
{
    print "; sadly, the following links aren't working:\n\n";
    print "$_\n" for sort keys %dead;
    print "\n";
}
sub url_exists
{
    # Two tests are done because some sites do not appear to
    # properly support HEAD requests. Ex: www.borland.com
    # As a result, we were getting some false positives; the
    # extra test prevents those.
    my $url = shift;
    return head( $url ) || get( $url );
}
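For what it's worth, one direction a "better way" might take (a sketch under my own assumptions, not drop-in code; the 10-second timeout is just a number I picked) is to drop to LWP::UserAgent directly, so the script can set a timeout and inspect the actual response object instead of LWP::Simple's bare true/false:

```perl
#!/usr/bin/perl -w
# Sketch: url_exists() via LWP::UserAgent, with the same
# HEAD-then-GET fallback for servers that mishandle HEAD.
use strict;
use LWP::UserAgent;

sub url_exists
{
    my $url = shift;
    my $ua  = LWP::UserAgent->new( timeout => 10 );    # assumed value

    # Try the cheap HEAD first; fall back to GET only on failure.
    my $response = $ua->head( $url );
    $response = $ua->get( $url ) unless $response->is_success;

    return $response->is_success;
}
```

Whether the extra dependency and verbosity are worth it over LWP::Simple, I'm not sure; that's part of what I'm asking.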
Thanks in advance for any assistance/feedback...
--f