Re: Cutting Out Previously Visited Web Pages in A Web Spider

by mkurtis (Scribe)
on Mar 13, 2004 at 01:32 UTC [id://336323]


in reply to Re: Re: Cutting Out Previously Visited Web Pages in A Web Spider
in thread Cutting Out Previously Visited Web Pages in A Web Spider

But where exactly do I put that? Which portions of the code do I replace?

Thanks kappa

Re: Re: Cutting Out Previously Visited Web Pages in A Web Spider
by kappa (Chaplain) on Mar 13, 2004 at 11:02 UTC
    As you can see, I originally posted a piece of pseudo-code. It described the logic, the algorithm. I used Perl syntax and made the block into a sub to help you with the actual implementation.

    Below is a piece of REAL executable code in real-world Perl. It actually crawls the web (provide it with URLs on the command line). And I --'ed you, sorry.

    Just run it as a separate script, no need to "put it into" your code.

    #!/usr/bin/perl -w
    use strict;

    use LWP::RobotUA;
    use HTML::SimpleLinkExtor;

    use vars qw/$http_ua $link_extractor/;

    sub crawl {
        my @queue = @_;
        my %visited;

        while (my $url = shift @queue) {
            next if $visited{$url};

            my $content = $http_ua->get($url)->content;

            # do useful things with $content
            # for example, save it into a file or index or whatever
            # i just print the url
            print qq{Downloaded: "$url"\n};

            push @queue, do { $link_extractor->parse($content); $link_extractor->a };
            $visited{$url} = 1;
        }
    }

    $http_ua        = new LWP::RobotUA theusefulbot => 'bot@theusefulnet.com';
    $link_extractor = new HTML::SimpleLinkExtor;

    crawl(@ARGV);
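    For example (the file name and start URL here are just placeholders), you could save it as crawl.pl and start it with "perl crawl.pl http://example.com/"; every URL given on the command line seeds the @queue that crawl() works through.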
      Thanks so much kappa, I sure wish I could vote more than once for your post. But I still have some problems: how do I make it follow the links it extracts? It just stops. For example, when I start it on wired.com, it creates 77 files and then drops back to the command prompt. I have modified your code into this:
      #!/usr/bin/perl -w
      use strict;

      use LWP::RobotUA;
      use HTML::SimpleLinkExtor;

      use vars qw/$http_ua $link_extractor/;

      my @queue;
      @queue = qw ("http://www.wired.com");

      sub crawl {
          my $a = 0;
          my %visited;
          my $links;
          my @links;

          while (my $url = shift @queue) {
              next if $visited{$url};

              my $content = $http_ua->get($url)->content;

              open(FILE, ">/var/www/data/$a.txt");
              print FILE "$url\n";
              print FILE "$content";
              close(FILE);

              print qq{Downloaded: "$url"\n};

              push @queue, do { $link_extractor->parse($content); @links = $link_extractor->a };
              foreach $links (@links) {
                  unshift @queue, $links;
              }

              $visited{$url} = 1;
              $a++;
          }
      }

      $http_ua = new LWP::RobotUA theusefulbot => 'bot@theusefulnet.com';
      $http_ua->delay(10/6000);
      $link_extractor = new HTML::SimpleLinkExtor;

      crawl(@ARGV);
      Also, what do I do when the array gets too large?

      Thanks again

        You unshift links into the queue after pushing them there several lines above. That's weird, but it does not matter, since it never crawls the same URL twice. My original code already did everything you need about links and queueing, btw.

        Next, I can't debug mirroring wired.com, sorry :) I pay for traffic. Try watching the growing queue of pending visits and catch the moment your script finishes.

        And last: your arrays won't get too large anytime soon. Really. Your computer should be able to handle an array of a million links without much trouble, I suppose. As a first possible optimization, I'd suggest filtering visited links before adding new ones to the queue, not before crawling.
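        A minimal sketch of that last suggestion, assuming the same modules, user-agent setup, and print-only loop body as the script above (what you actually do with $content is up to you): links are checked against %visited at the moment they are queued, so nothing duplicate ever enters @queue and no check is needed at crawl time.

        #!/usr/bin/perl -w
        # Sketch: filter already-seen links before they enter the queue,
        # instead of skipping them when they are shifted off for crawling.
        use strict;

        use LWP::RobotUA;
        use HTML::SimpleLinkExtor;

        use vars qw/$http_ua $link_extractor/;

        sub crawl {
            my @queue   = @_;
            my %visited = map { $_ => 1 } @queue;   # seed with the start URLs

            while (my $url = shift @queue) {
                my $content = $http_ua->get($url)->content;
                print qq{Downloaded: "$url"\n};

                $link_extractor->parse($content);
                # keep only URLs never seen before; the ++ marks each link as
                # seen the moment it is queued, so nothing is queued twice
                push @queue, grep { !$visited{$_}++ } $link_extractor->a;
            }
        }

        $http_ua        = new LWP::RobotUA theusefulbot => 'bot@theusefulnet.com';
        $link_extractor = new HTML::SimpleLinkExtor;

        crawl(@ARGV);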
