Re: Re: Cutting Out Previously Visited Web Pages in A Web Spider

by kappa (Chaplain)
on Mar 12, 2004 at 16:59 UTC


in reply to Re: Cutting Out Previously Visited Web Pages in A Web Spider
in thread Cutting Out Previously Visited Web Pages in A Web Spider

Define $http_ua and $link_extractor and the above code will work.
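
For instance, a minimal pair of definitions might look like the sketch below. It assumes the same two modules the full script further down uses; the bot name and contact address are placeholders, not anything from the thread.

    use LWP::RobotUA;
    use HTML::SimpleLinkExtor;

    # A robot user agent: identifies itself and honours robots.txt.
    # The agent name and contact address here are placeholders.
    my $http_ua = LWP::RobotUA->new('mybot/0.1', 'me@example.com');

    # Pulls links out of an HTML page.
    my $link_extractor = HTML::SimpleLinkExtor->new;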

Replies are listed 'Best First'.
Re: Cutting Out Previously Visited Web Pages in A Web Spider
by mkurtis (Scribe) on Mar 13, 2004 at 01:32 UTC
    But where exactly do I put that, though? Which portions of the code do I replace?

    Thanks kappa

      As you can see, what I originally posted was a piece of pseudo-code: it described the logic, the algorithm. I used Perl syntax and made the block into a sub to help you with the actual implementation.

      Below is a piece of REAL executable code in real-world Perl. It actually crawls the web (provide it with URLs on the command line). And I --'d you, sorry.

      Just run it as a separate script; there's no need to "put it into" your code.

      #!/usr/bin/perl -w
      use strict;

      use LWP::RobotUA;
      use HTML::SimpleLinkExtor;

      use vars qw/$http_ua $link_extractor/;

      sub crawl {
          my @queue = @_;
          my %visited;

          while (my $url = shift @queue) {
              next if $visited{$url};

              my $content = $http_ua->get($url)->content;

              # do useful things with $content
              # for example, save it into a file or index or whatever
              # i just print the url
              print qq{Downloaded: "$url"\n};

              push @queue, do {
                  $link_extractor->parse($content);
                  $link_extractor->a;
              };

              $visited{$url} = 1;
          }
      }

      $http_ua        = new LWP::RobotUA theusefulbot => 'bot@theusefulnet.com';
      $link_extractor = new HTML::SimpleLinkExtor;

      crawl(@ARGV);
        Thanks so much kappa, I sure wish I could vote more than once for your post. But I still have some problems: how do I make it follow the links it extracts? It just stops. For example, when I start it on wired.com, it creates 77 files and then drops back to the command prompt. I have modified your code into this:
        #!/usr/bin/perl -w
        use strict;

        use LWP::RobotUA;
        use HTML::SimpleLinkExtor;

        use vars qw/$http_ua $link_extractor/;

        my @queue;
        @queue = qw ("http://www.wired.com");

        sub crawl {
            my $a = 0;
            my %visited;
            my $links;
            my @links;

            while (my $url = shift @queue) {
                next if $visited{$url};

                my $content = $http_ua->get($url)->content;

                open(FILE, ">/var/www/data/$a.txt");
                print FILE "$url\n";
                print FILE "$content";
                close(FILE);

                print qq{Downloaded: "$url"\n};

                push @queue, do {
                    $link_extractor->parse($content);
                    @links = $link_extractor->a;
                };

                foreach $links (@links) {
                    unshift @queue, $links;
                }

                $visited{$url} = 1;
                $a++;
            }
        }

        $http_ua = new LWP::RobotUA theusefulbot => 'bot@theusefulnet.com';
        $http_ua->delay(10/6000);
        $link_extractor = new HTML::SimpleLinkExtor;

        crawl(@ARGV);
        Also, what do I do when the array gets too large?

        Thanks again
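
        On both questions above, here is a minimal sketch of one approach, assuming the URI module for resolving links; the %seen hash and the enqueue_links sub are illustrative names, not part of the posted scripts. Relative links extracted from a page have to be made absolute before the crawler can fetch them, and recording every URL that has ever been queued keeps duplicates out of the queue, so the array only grows with the number of distinct URLs seen:

            use strict;
            use warnings;
            use URI;

            my %seen;    # every URL ever queued, so nothing is queued twice
            my @queue;   # URLs still waiting to be fetched

            # Resolve each extracted link against the page it came from, so
            # relative links become absolute, fetchable URLs, and skip any
            # URL that has already been queued or visited.
            sub enqueue_links {
                my ($base_url, @links) = @_;
                for my $link (@links) {
                    my $abs = URI->new_abs($link, $base_url)->as_string;
                    next if $seen{$abs}++;
                    push @queue, $abs;
                }
            }

        HTML::SimpleLinkExtor can also, if memory serves, be handed a base URL when it is constructed so that it resolves relative links for you; either way, deduplicating at enqueue time is what keeps the queue from growing without bound.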
