On Detecting Broken Links

by footpad (Abbot)
on Jan 24, 2002 at 06:35 UTC ( [id://141115] )

footpad has asked for the wisdom of the Perl Monks concerning the following question:

Having a little bit of time to myself, I decided I'd finally put together a script for checking the links on one of my web sites. As a well-trained monk, I found a modified version of Recipe 20.7 from the Perl Cookbook.

All fine and dandy, especially after I spiffed things up a bit to use -w and strict and to add some basic error-checking. (Note: The version in the Cookbook is actually slightly cleaner than the one I linked to. Start with that one first.)

As I'm sure most of you already know, this Recipe (as published) relies on the head() routine exported by LWP::Simple. In practice, though, it appears that some sites (pages?) don't respond to HEAD requests. The workaround, as mentioned in lwpcook, is to use get() instead.

Since this can take some time, especially when bandwidth is a little lean, it occurred to me that it might be wiser to perform a simple optimization: try head() first, and only try get() if the former fails. (The idea being, of course, not to invoke the longer download if the smaller request succeeds.)

This works, but in putting it together, I got to wondering if:

  1. There might be a better, more efficient (and possibly even faster) way to do this.

  2. I'm fully understanding the problems with HEAD requests. (Can anyone point to a link that discusses any problems in more detail?)

  3. There are any known subtleties or considerations regarding Perl's short-circuit evaluation and/or this sort of optimization. (I see nothing mentioned in perlop, but I thought I'd check. I've run into various subtleties using other languages in the past.)

  4. I've fallen into any of the [non?]standard traps that I tend to obsess about, e.g. CCP, memes, security risks, etc. (I plan to put a cleaned up version of the code on my site as a CGI script, so I worry.)

Here's the code as it currently stands. (It's still a little crude):

#!/usr/bin/perl -w
# Adapted, in part, from Recipe 20.7 of The Perl Cookbook.

use strict;

use HTML::LinkExtor;
use LWP::Simple;

my @links;    # holds all links found on target page
my %dead;     # list of unresolvable (dead) links

my $base_url = shift
    or die "Usage: $0 <url_to_parse>\n\n",
           "(including scheme prefix, e.g. http://)\n";

unless ( url_exists( $base_url ) ) {
    die "Unable to test links;\n",
        "$base_url could not be reached\n\n";
}

print "Parsing $base_url...\n";
my $parser = HTML::LinkExtor->new( undef, $base_url );
$parser->parse( get( $base_url ) );
@links = $parser->links;

print "Checking links";
foreach my $linkarray ( @links ) {
    print '.';    # show user that something is happening.

    my @element  = @$linkarray;
    my $elt_type = shift @element;

    while ( @element ) {
        ( my $attr_name, my $attr_value ) = splice( @element, 0, 2 );

        # skip MAILTO's, which you shouldn't have anyway.
        next if ( $attr_value =~ /\b(mailto)\b/ );

        unless ( url_exists( $attr_value ) ) {
            $dead{ $attr_value }++;
        }
    }
}

print "\nAll links checked.\n\n",
      "$base_url contains ", scalar @links, " link(s)";

unless ( scalar keys %dead ) {
    print "; all live.\n";
}
else {
    print "; sadly, the following links aren't working:\n\n";
    for ( sort keys %dead ) { print $_, "\n" }
    print "\n";
}

sub url_exists {
    # Two tests are done because some sites do not appear to
    # properly support HEAD requests. Ex: www.borland.com
    # As a result, we were getting some false positives; the
    # extra test prevents those.
    my $url = shift;
    return head( $url ) || get( $url );
}
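A sample run might look something like this (hypothetical output; the filename checklinks.pl, the URL, and the link count are placeholders, not taken from the post above):

    $ perl checklinks.pl http://www.example.com/
    Parsing http://www.example.com/...
    Checking links...
    All links checked.

    http://www.example.com/ contains 3 link(s); all live.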

Thanks in advance for any assistance/feedback...

--f

Replies are listed 'Best First'.
Re: On Detecting Broken Links
by dws (Chancellor) on Jan 24, 2002 at 07:28 UTC
    One problem you may discover with relying on LWP::Simple is that

        return head( $url ) || get( $url );

    doesn't provide enough information. You need to know why you weren't able to get the URL. In particular, you might want to detect things like loss of connectivity to a site, and not report the link as broken until you've been unable to contact that site for some interval. (You can also use that as an optimization when checking. For example, if you have a lot of links to www.example.com, and you're unable to connect to the site the first time, why bother paying for n-1 additional timeouts?)
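    For illustration only (this is not dws's code), here is a minimal sketch of the same check done with LWP::UserAgent and HTTP::Request instead of LWP::Simple, so the full response object, and with it the status code, is available; the 15-second timeout and the check_url name are arbitrary choices:

        use strict;
        use LWP::UserAgent;
        use HTTP::Request;

        my $ua = LWP::UserAgent->new;
        $ua->timeout( 15 );    # don't wait forever on unreachable hosts

        sub check_url {
            my $url = shift;

            # Try the cheap HEAD request first...
            my $response = $ua->request( HTTP::Request->new( HEAD => $url ) );

            # ...and fall back to a full GET for servers that mishandle HEAD.
            $response = $ua->request( HTTP::Request->new( GET => $url ) )
                unless $response->is_success;

            return $response;    # caller can inspect code(), status_line(), etc.
        }

    With the response object in hand, a genuine 404 can be told apart from a connection problem: LWP reports connection failures as a synthesized 500 response whose status line names the error, so a checker could retry those later instead of flagging the link as dead.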

    And I could swear that merlyn has written a half-dozen columns or so on link checking. Might be worth checking his site.

(sacked) Re: On Detecting Broken Links
by sacked (Hermit) on Jan 24, 2002 at 09:04 UTC
        foreach my $linkarray ( @links ) {
            print '.';    # show user that something is happening.
    I like the status dots, but you need to disable buffering on STDOUT so the user will see each one as it's printed.

    Set the $| var to a nonzero value before the loop to fix this.
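    For instance, a minimal, self-contained sketch of the effect (not part of footpad's script):

        use strict;

        $| = 1;    # autoflush STDOUT

        print "Checking links";
        for ( 1 .. 5 ) {
            print '.';    # appears immediately instead of all at once at the end
            sleep 1;
        }
        print "\n";

    Without the $| = 1 line, a terminal's line buffering usually holds the dots back until the final newline, which defeats the purpose of a progress indicator.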

    --sacked
