As you can see, what I originally posted was pseudo-code: it described the logic, the algorithm. I used Perl syntax and wrapped the block in a sub to help you with the actual implementation.
Below is a piece of REAL executable code in real-world Perl. It actually crawls the web (provide it with URLs on the command line).
Just run it as a separate script, no need to "put it into" your code.
#!/usr/bin/perl -w
use strict;
use LWP::RobotUA;
use HTML::SimpleLinkExtor;

use vars qw/$http_ua $link_extractor/;

sub crawl {
    my @queue = @_;
    my %visited;

    while (my $url = shift @queue) {
        next if $visited{$url};
        my $content = $http_ua->get($url)->content;

        # do useful things with $content,
        # for example save it to a file or feed it to an indexer;
        # here I just print the url
        print qq{Downloaded: "$url"\n};

        push @queue,
            do { $link_extractor->parse($content); $link_extractor->a };
        $visited{$url} = 1;
    }
}

$http_ua        = new LWP::RobotUA theusefulbot => 'bot@theusefulnet.com';
$link_extractor = new HTML::SimpleLinkExtor;

crawl(@ARGV);
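To run it, save it under any name you like (crawler.pl, say) and pass your start URLs as arguments:

    perl crawler.pl http://example.com/

It will crawl from there until the queue of extracted links is exhausted.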
Thanks so much kappa, I sure wish I could vote more than once for your post. But I still have a problem: how do I make it keep following the links it extracts? It just stops. For example, when I start it on wired.com, it creates 77 files and then drops back to the command prompt.
I have modified your code into this:
#!/usr/bin/perl -w
use strict;
use LWP::RobotUA;
use HTML::SimpleLinkExtor;

use vars qw/$http_ua $link_extractor/;

my @queue = ('http://www.wired.com');    # seed URL

sub crawl {
    my $a = 0;    # counter used to name the output files
    my %visited;
    my @links;

    while (my $url = shift @queue) {
        next if $visited{$url};
        my $content = $http_ua->get($url)->content;

        open(FILE, ">/var/www/data/$a.txt")
            or die "can't open /var/www/data/$a.txt: $!";
        print FILE "$url\n";
        print FILE $content;
        close(FILE);

        print qq{Downloaded: "$url"\n};

        push @queue,
            do { $link_extractor->parse($content);
                 @links = $link_extractor->a };
        foreach my $link (@links) {
            unshift @queue, $link;
        }

        $visited{$url} = 1;
        $a++;
    }
}

$http_ua = new LWP::RobotUA theusefulbot => 'bot@theusefulnet.com';
$http_ua->delay(10/6000);    # RobotUA's delay is given in minutes, so this is 0.1 second
$link_extractor = new HTML::SimpleLinkExtor;

crawl(@ARGV);
Also, what do I do when the array gets too large?
Thanks again
You unshift links onto the queue right after pushing them there several lines above. That's weird, but it does no harm, as the script never crawls the same URL twice. My original code already did everything you need with links and queueing, btw.
Next, I can't debug this by mirroring wired.com myself, sorry :) I pay for traffic. Try watching the growing queue of pending visits and catch the moment your script decides it is finished.
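For example (just a debugging sketch, to be dropped into the while loop of your crawl sub), print the sizes on every iteration:

    printf "queued: %d, visited: %d\n", scalar(@queue), scalar(keys %visited);

If "queued" reaches zero, the script stopped because it ran out of links rather than crashed.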
And last: your arrays won't get too large anytime soon. Really. Your computer should handle an array of a million links without much trouble, I suppose. As a first possible optimization, I'd suggest filtering out visited links before adding new ones to the queue, rather than only when crawling.
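A minimal sketch of that, using the %visited hash and the push already in your script: grep the extracted links against %visited before they ever enter @queue:

    push @queue,
        grep { !$visited{$_} }
        do { $link_extractor->parse($content); $link_extractor->a };

That way duplicates never pile up in the queue in the first place, which is what actually makes it grow.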