Re: Re: Cutting Out Previously Visited Web Pages in A Web Spider

by kappa (Chaplain)
on Mar 13, 2004 at 11:02 UTC


in reply to Re: Cutting Out Previously Visited Web Pages in A Web Spider
in thread Cutting Out Previously Visited Web Pages in A Web Spider

As you can see, I originally posted a piece of pseudo-code. It described the logic, the algorithm. I used Perl syntax and made the block into a sub to help you with the actual implementation.

Below is a piece of REAL executable code in real-world Perl. It actually crawls the web (provide it with URLs on the command line). And I --you, sorry.

Just run it as a separate script; there's no need to "put it into" your code.

#!/usr/bin/perl -w
use strict;
use LWP::RobotUA;
use HTML::SimpleLinkExtor;
use vars qw/$http_ua $link_extractor/;

sub crawl {
    my @queue = @_;
    my %visited;

    while (my $url = shift @queue) {
        next if $visited{$url};            # skip URLs we have already fetched

        my $content = $http_ua->get($url)->content;

        # do useful things with $content
        # for example, save it into a file or index or whatever
        # i just print the url
        print qq{Downloaded: "$url"\n};

        # queue every <a href="..."> link found on the page
        push @queue, do { $link_extractor->parse($content); $link_extractor->a };
        $visited{$url} = 1;
    }
}

$http_ua        = new LWP::RobotUA theusefulbot => 'bot@theusefulnet.com';
$link_extractor = new HTML::SimpleLinkExtor;

crawl(@ARGV);
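For example, assuming you save the script as crawl.pl (the filename is just for illustration), you would start it from the command line with one or more seed URLs:

    perl crawl.pl http://www.example.com/ http://www.example.org/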

Replies are listed 'Best First'.
Re: Re: Re: Cutting Out Previously Visited Web Pages in A Web Spider
by mkurtis (Scribe) on Mar 13, 2004 at 18:05 UTC
    Thanks so much kappa, I sure wish I could vote more than once for your post. But I still have some problems: how do I make it follow the links it extracts? It just stops. For example, when I start it on wired.com, it creates 77 files and then brings up the command prompt. I have modified your code into this:
    #!/usr/bin/perl -w
    use strict;
    use LWP::RobotUA;
    use HTML::SimpleLinkExtor;
    use vars qw/$http_ua $link_extractor/;

    my @queue;
    @queue = qw ("http://www.wired.com");

    sub crawl {
        my $a = 0;
        my %visited;
        my $links;
        my @links;

        while (my $url = shift @queue) {
            next if $visited{$url};

            my $content = $http_ua->get($url)->content;

            open(FILE, ">/var/www/data/$a.txt");
            print FILE "$url\n";
            print FILE "$content";
            close(FILE);

            print qq{Downloaded: "$url"\n};

            push @queue, do { $link_extractor->parse($content); @links = $link_extractor->a };
            foreach $links (@links) {
                unshift @queue, $links;
            }
            $visited{$url} = 1;
            $a++;
        }
    }

    $http_ua = new LWP::RobotUA theusefulbot => 'bot@theusefulnet.com';
    $http_ua->delay(10/6000);
    $link_extractor = new HTML::SimpleLinkExtor;

    crawl(@ARGV);
    Also, what do I do when the array gets too large?

    Thanks again

      You unshift links into the queue right after pushing them there several lines above. That's weird, but it doesn't matter, since it never crawls the same URL twice. My original code did everything you need about links and queueing, btw. Next, I can't debug mirroring wired.com, sorry :) I pay for traffic. Try watching the growing queue of pending visits and catch the moment your script finishes. And last: your arrays won't get too large anytime soon. Really. Your computer should be able to handle an array of a million links without much problem, I suppose. As a first possible optimization, I'd suggest filtering visited links out before adding new ones to the queue rather than before crawling; see the sketch below.
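      A minimal sketch of that enqueue-time filter, assuming the same %visited, @queue and $link_extractor as in the script above (the %queued hash is new, added here just for illustration):

          my %queued;                     # URLs already waiting in @queue
          $link_extractor->parse($content);
          for my $link ($link_extractor->a) {
              next if $visited{$link};    # already crawled
              next if $queued{$link}++;   # already scheduled for crawling
              push @queue, $link;
          }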
        How do I get it to go to other pages, though? When I visit wired.com, for example, I want it to take all the links off of it and visit them, and for each page it visits off of wired, take the links off those pages and visit them, and so on. This one only takes the links off of wired.com, not off any of the pages that wired links to.

        Thank you
