Re: Cutting Out Previously Visited Web Pages in A Web Spider

by kappa (Chaplain)
on Mar 14, 2004 at 22:35 UTC ( [id://336552] )


in reply to Cutting Out Previously Visited Web Pages in A Web Spider

mkurtis, I tried to mimic the behaviour you seem to expect. Try this.

Updated: HTML::SimpleLinkExtor returns links only from its first parse, so the code now creates a fresh extractor for each page.

#!/usr/bin/perl -w
use strict;

use LWP::RobotUA;
use HTML::SimpleLinkExtor;

use vars qw/$http_ua/;

sub crawl {
    my @queue = @_;
    my %visited;
    my $a = 0;

    while (my $url = shift @queue) {
        next if $visited{$url};

        my $content = $http_ua->get($url)->content;

        # Save each page to a numbered file: 1.txt, 2.txt, ...
        open FILE, '>' . ++$a . '.txt' or die "Can't write ${a}.txt: $!";
        print FILE $content;
        close FILE;

        print qq{Downloaded: "$url"\n};

        # Fresh extractor per page -- HTML::SimpleLinkExtor returns
        # links only from its first parse.
        push @queue, do {
            my $link_extractor = new HTML::SimpleLinkExtor;
            $link_extractor->parse($content);
            $link_extractor->a;
        };

        $visited{$url} = 1;
    }
}

$http_ua = new LWP::RobotUA theusefulbot => 'bot@theusefulnet.com';

crawl(@ARGV);
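For anyone who wants to try it, a minimal invocation sketch, assuming the script is saved as crawl.pl (the filename is mine, not part of the code above):

    perl crawl.pl http://www.example.com/

Each fetched page lands in a numbered file (1.txt, 2.txt, ...) in the current directory. Keep in mind that LWP::RobotUA honours robots.txt and throttles itself (its default delay is one minute between requests to the same host), so the crawl is slow by design, not broken.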
