Here's my pretty simple try (without using files and just for a single page). Adjust to taste:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;
use HTML::SimpleLinkExtor;

# Fetch $url with the given user agent and return the href of every <a> tag.
sub grab_links {
    my ( $ua, $url ) = @_;
    my @links;
    my $response = $ua->get($url);
    if ( $response->is_success ) {
        my $extor   = HTML::SimpleLinkExtor->new();
        my $content = $response->content;
        $extor->parse($content);
        @links = $extor->a;    # <a href> links. Check docs - these may be relative paths.
    }
    else {
        die $response->status_line;
    }
    return @links;
}

my $visit = $ARGV[0] or die "Usage: $0 URL\n";
my $ua = LWP::RobotUA->new( 'my-robot/0.1', 'me@foo.com' );    # Change this to suit.
$ua->delay( 0.1 );    # delay is in minutes: 0.1 = 6 seconds (0.1/60 would be 1/10 second)
my @links = grab_links( $ua, $visit );

# Count each link in a hash so duplicates collapse into one key.
my %uniq;
foreach (@links) {
    $uniq{$_}++;
}
print "Visited: ", $visit, " found these links:\n", join( "\n", keys %uniq ), "\n";
Update: this code was put here after talking to mkurtis in the CB. It appears to do most of the things mkurtis is after, so I posted it for future reference.
Most of the code was taken straight from the docs for HTML::SimpleLinkExtor, LWP::RobotUA and LWP::UserAgent.
This is the first time I've used any of those modules and it was quite cool :-)
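On the relative-path caveat in the grab_links comment: here is a minimal sketch of one way to tidy the link list, assuming the URI module is available (it is installed as a dependency of LWP). The helper name clean_links and the example URLs are made up for illustration. It resolves each link against the page URL with URI->new_abs, then drops duplicates while keeping first-seen order, which keys %uniq does not guarantee.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Hypothetical helper (not part of any module): resolve each link against
# the page URL, then drop duplicates while keeping first-seen order.
sub clean_links {
    my ( $base, @links ) = @_;
    my %seen;
    return grep { !$seen{$_}++ }
           map  { URI->new_abs( $_, $base )->as_string } @links;
}

# Example URLs are made up for illustration.
my @clean = clean_links(
    'http://www.example.com/docs/index.html',
    'faq.html', '../about.html', 'faq.html', 'http://other.example.org/',
);
print "$_\n" for @clean;
```

In the script above this would slot in as clean_links( $visit, grab_links( $ua, $visit ) ), replacing the %uniq loop. The HTML::SimpleLinkExtor docs also describe passing a base URL to new, which may make the URI step unnecessary; check them before relying on either approach.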
If the information in this post is inaccurate, or just plain wrong, don't just downvote - please post explaining what's wrong.
That way everyone learns.