Re: Cutting Out Previously Visited Web Pages in A Web Spider

by kappa (Chaplain)
on Mar 11, 2004 at 13:09 UTC ( [id://335794] )


in reply to Cutting Out Previously Visited Web Pages in A Web Spider

Uh. You wanna keep two lists: one full of URLs queued for crawling and the other with those you successfully visited (this one will be searched on each iteration, so let it be a hash). So the logic is:
sub crawl {
    my @queue = @_;
    my %visited;

    while (my $url = shift @queue) {
        next if $visited{$url};

        my $content = $http_ua->get($url);

        # do useful things with $content

        push @queue, $link_extractor->links($content);
        $visited{$url} = 1;
    }
}

That's all. When size and efficiency really start to matter, you can evaluate migrating the data to something like Cache::Cache or Berkeley DB.
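For example, here is a minimal sketch of moving %visited onto disk with DB_File (the Perl interface to Berkeley DB). The file name visited.db and the tie setup are my assumptions, not part of the original post:

use DB_File;
use Fcntl qw(O_CREAT O_RDWR);

# Tie %visited to an on-disk Berkeley DB hash so the "already seen"
# check no longer has to fit in memory and survives between runs.
# "visited.db" is a hypothetical file name.
tie my %visited, 'DB_File', 'visited.db', O_CREAT | O_RDWR, 0666, $DB_HASH
    or die "Cannot tie visited.db: $!";

# crawl() would then share this tied hash instead of declaring its own,
# and use it exactly as above:
#   next if $visited{$url};
#   ...
#   $visited{$url} = 1;

untie %visited;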

Replies are listed 'Best First'.
Re: Re: Cutting Out Previously Visited Web Pages in A Web Spider
by kappa (Chaplain) on Mar 12, 2004 at 16:59 UTC
    Define $http_ua and $link_extractor and the above code will work.
      But where exactly do I put that? Which portions of the code do I replace?

      Thanks kappa

        As you see, originally I posted a piece of pseudo-code. It described the logic, the algorithm. I used Perl syntax and made the block into a sub to help you with the actual implementation.

        Below is a piece of REAL executable code in real-world Perl. It actually crawls the web (provide it with URLs on the command line). And I --you, sorry.

        Just run it as a separate script, no need to "put it into" your code.

        #!/usr/bin/perl -w
        use strict;

        use LWP::RobotUA;
        use HTML::SimpleLinkExtor;

        use vars qw/$http_ua $link_extractor/;

        sub crawl {
            my @queue = @_;
            my %visited;

            while (my $url = shift @queue) {
                next if $visited{$url};

                my $content = $http_ua->get($url)->content;

                # do useful things with $content
                # for example, save it into a file or index or whatever
                # i just print the url
                print qq{Downloaded: "$url"\n};

                push @queue, do { $link_extractor->parse($content); $link_extractor->a };
                $visited{$url} = 1;
            }
        }

        $http_ua        = new LWP::RobotUA theusefulbot => 'bot@theusefulnet.com';
        $link_extractor = new HTML::SimpleLinkExtor;

        crawl(@ARGV);
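        One possible refinement (my sketch, not part of the script above): relative links and trivially different spellings of the same URL, such as "http://example.com" versus "http://example.com/", will slip past a plain hash lookup. The URI module can resolve each link against the page it was found on and canonicalize it before it is queued or checked:

        use URI;

        # Hypothetical helper: resolve $link against the page it came from
        # and normalize it, so %visited sees one key per page no matter how
        # the page was linked to.
        sub normalize_url {
            my ($link, $base) = @_;
            return URI->new_abs($link, $base)->canonical->as_string;
        }

        # Inside crawl(), the push would then become something like:
        #   push @queue, map { normalize_url($_, $url) }
        #                do { $link_extractor->parse($content); $link_extractor->a };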
Re: Cutting Out Previously Visited Web Pages in A Web Spider
by mkurtis (Scribe) on Mar 12, 2004 at 03:01 UTC
    I'm sorry, I don't understand. Where exactly do I place your code? I am not sure how to fit it into the crawler.

    Thanks for your post
