Here's my pretty simple attempt (without using files, and just for a single page). Adjust to taste:
#!/usr/bin/perl
use strict;
use warnings;

use LWP::RobotUA;
use HTML::SimpleLinkExtor;

sub grab_links {
    my ( $ua, $url ) = @_;
    my @links;

    my $response = $ua->get($url);
    if ( $response->is_success ) {
        my $extor = HTML::SimpleLinkExtor->new();
        $extor->parse( $response->content );
        @links = $extor->a;    # all <a href> links - check the docs, these may be relative paths
    }
    else {
        die $response->status_line;
    }

    return @links;
}

my $visit = shift or die "usage: $0 <url>\n";

my $ua = LWP::RobotUA->new( 'my-robot/0.1', 'me@foo.com' );    # change this to suit
$ua->delay( 0.1 );    # note: delay is in *minutes*, so this waits 6 seconds between requests

my @links = grab_links( $ua, $visit );

my %uniq;
$uniq{$_}++ for @links;    # count each link so duplicates collapse into one key

print "Visited: $visit and found these links:\n", join( "\n", keys %uniq ), "\n";
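One thing to watch: HTML::SimpleLinkExtor hands back the href values exactly as they appear in the page, so many of them will be relative. If you want absolute URLs, the URI module (which LWP already depends on) can resolve them against the page you fetched. A minimal sketch - this loop is my own addition, not part of the script above:

use URI;

# Turn each extracted link into an absolute URL by resolving it
# against the page it was found on ($visit in the script above).
my @absolute = map { URI->new_abs( $_, $visit )->as_string } @links;

(HTML::SimpleLinkExtor's constructor can also take a base URL to do this for you - see its docs.)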
Update: I put this code here after talking to mkurtis in the CB. It appears to do most of what mkurtis is after, so I'm posting it for future reference.
Most of the code was taken straight from the docs for HTML::SimpleLinkExtor, LWP::RobotUA and LWP::UserAgent.
This is the first time I've used any of those modules and it was quite cool :-)
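Since this only does a single page, extending it to follow the links it finds is mostly a matter of feeding them back into a queue. Here's an untested sketch of that idea - the %seen hash, @queue and $max_pages cap are my own names, and a real crawler would also want to restrict itself to one host:

use URI;

my %seen;                  # URLs we have already fetched
my @queue = ($visit);      # pages still to visit
my $max_pages = 50;        # safety cap - adjust to taste

while ( @queue and keys %seen < $max_pages ) {
    my $url = shift @queue;
    next if $seen{$url}++;                     # skip anything already fetched

    my @found = eval { grab_links( $ua, $url ) };
    warn "skipping $url: $@" if $@;            # one bad page shouldn't kill the crawl

    push @queue, map { URI->new_abs( $_, $url )->as_string } @found;
}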
If the information in this post is inaccurate, or just plain wrong, don't just downvote - please post a reply explaining what's wrong. That way everyone learns.