Re: Cutting Out Previously Visited Web Pages in A Web Spider

by kappa (Chaplain)
on Mar 11, 2004 at 13:09 UTC ( [id://335794] )


in reply to Cutting Out Previously Visited Web Pages in A Web Spider

Uh. You wanna keep two lists: one full of URLs queued for crawling and the other with those you successfully visited (this one will be searched on each iteration, so let it be a hash). So the logic is:
sub crawl {
    my @queue = @_;
    my %visited;

    while (my $url = shift @queue) {
        next if $visited{$url};

        my $content = $http_ua->get($url);

        # do useful things with $content

        push @queue, $link_extractor->links($content);
        $visited{$url} = 1;
    }
}

That's all. When size and efficiency really start to matter, you can evaluate migrating the data to something like Cache::Cache or Berkeley DB.
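For example, here is a minimal sketch of moving %visited onto disk with DB_File (the Perl interface to Berkeley DB). The file name visited.db and the tie setup are my assumptions, not part of the original post:

use DB_File;
use Fcntl qw(O_CREAT O_RDWR);

# Tie %visited to an on-disk Berkeley DB hash so the "already seen"
# check no longer has to fit in memory and survives between runs.
# "visited.db" is a hypothetical file name.
tie my %visited, 'DB_File', 'visited.db', O_CREAT | O_RDWR, 0666, $DB_HASH
    or die "Cannot tie visited.db: $!";

# crawl() would then share this tied hash instead of declaring its own,
# and use it exactly as above:
#   next if $visited{$url};
#   ...
#   $visited{$url} = 1;

untie %visited;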

Replies are listed 'Best First'.
Re: Re: Cutting Out Previously Visited Web Pages in A Web Spider
by kappa (Chaplain) on Mar 12, 2004 at 16:59 UTC
    Define $http_ua and $link_extractor and the above code will work.
      But where exactly do I put that? Which portions of the code do I replace?

      Thanks kappa

        As you see, originally I posted a piece of pseudo-code. It described the logic, the algorithm. I used Perl syntax and made the block into a sub to help you with the actual implementation.

        Below is a piece of REAL executable code in real-world Perl. It actually crawls the web (provide it with URLs on the command line). And I --you, sorry.

        Just run it as a separate script, no need to "put it into" your code.

        #!/usr/bin/perl -w
        use strict;

        use LWP::RobotUA;
        use HTML::SimpleLinkExtor;

        use vars qw/$http_ua $link_extractor/;

        sub crawl {
            my @queue = @_;
            my %visited;

            while (my $url = shift @queue) {
                next if $visited{$url};

                my $content = $http_ua->get($url)->content;

                # do useful things with $content
                # for example, save it into a file or index or whatever
                # i just print the url
                print qq{Downloaded: "$url"\n};

                push @queue, do { $link_extractor->parse($content); $link_extractor->a };
                $visited{$url} = 1;
            }
        }

        $http_ua        = new LWP::RobotUA theusefulbot => 'bot@theusefulnet.com';
        $link_extractor = new HTML::SimpleLinkExtor;

        crawl(@ARGV);
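        One possible refinement (my sketch, not part of the script above): relative links and trivially different spellings of the same URL, such as "http://example.com" versus "http://example.com/", will slip past a plain hash lookup. The URI module can resolve each link against the page it was found on and canonicalize it before it is queued or checked:

        use URI;

        # Hypothetical helper: resolve $link against the page it came from
        # and normalize it, so %visited sees one key per page no matter how
        # the page was linked to.
        sub normalize_url {
            my ($link, $base) = @_;
            return URI->new_abs($link, $base)->canonical->as_string;
        }

        # Inside crawl(), the push would then become something like:
        #   push @queue, map { normalize_url($_, $url) }
        #                do { $link_extractor->parse($content); $link_extractor->a };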
Re: Cutting Out Previously Visited Web Pages in A Web Spider
by mkurtis (Scribe) on Mar 12, 2004 at 03:01 UTC
    I'm sorry, I don't understand. Where exactly do I place your code? I am not sure how to fit it into the crawler.

    Thanks for your post
