Re: Re: Cutting Out Previously Visited Web Pages in A Web Spider

by kappa (Chaplain)
on Mar 12, 2004 at 16:59 UTC


in reply to Re: Cutting Out Previously Visited Web Pages in A Web Spider
in thread Cutting Out Previously Visited Web Pages in A Web Spider

Define $http_ua and $link_extractor and the above code will work.
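
For instance, a minimal pair of definitions might look like the sketch below. It assumes the same two modules the full script further down uses; the bot name and contact address are placeholders, not anything from the thread.

    use LWP::RobotUA;
    use HTML::SimpleLinkExtor;

    # A robot user agent: identifies itself and honours robots.txt.
    # The agent name and contact address here are placeholders.
    my $http_ua = LWP::RobotUA->new('mybot/0.1', 'me@example.com');

    # Pulls links out of an HTML page.
    my $link_extractor = HTML::SimpleLinkExtor->new;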

Replies are listed 'Best First'.
Re: Cutting Out Previously Visited Web Pages in A Web Spider
by mkurtis (Scribe) on Mar 13, 2004 at 01:32 UTC
    But where exactly do I put that, though? Which portions of the code do I replace?

    Thanks kappa

      As you can see, what I originally posted was a piece of pseudo-code: it described the logic, the algorithm. I used Perl syntax and made the block into a sub to help you with the actual implementation.

      Below is a piece of REAL executable code in real-world Perl. It actually crawls the web (provide it with URLs on the command line). And I --'d you, sorry.

      Just run it as a separate script; there's no need to "put it into" your code.

      #!/usr/bin/perl -w
      use strict;

      use LWP::RobotUA;
      use HTML::SimpleLinkExtor;

      use vars qw/$http_ua $link_extractor/;

      sub crawl {
          my @queue = @_;
          my %visited;

          while (my $url = shift @queue) {
              next if $visited{$url};

              my $content = $http_ua->get($url)->content;

              # do useful things with $content
              # for example, save it into a file or index or whatever
              # i just print the url
              print qq{Downloaded: "$url"\n};

              push @queue, do {
                  $link_extractor->parse($content);
                  $link_extractor->a;
              };

              $visited{$url} = 1;
          }
      }

      $http_ua        = new LWP::RobotUA theusefulbot => 'bot@theusefulnet.com';
      $link_extractor = new HTML::SimpleLinkExtor;

      crawl(@ARGV);
        Thanks so much kappa, I sure wish I could vote more than once for your post. But I still have some problems: how do I make it follow the links it extracts? It just stops. For example, when I start it on wired.com, it creates 77 files and then drops back to the command prompt. I have modified your code into this:
        #!/usr/bin/perl -w
        use strict;

        use LWP::RobotUA;
        use HTML::SimpleLinkExtor;

        use vars qw/$http_ua $link_extractor/;

        my @queue;
        @queue = qw ("http://www.wired.com");

        sub crawl {
            my $a = 0;
            my %visited;
            my $links;
            my @links;

            while (my $url = shift @queue) {
                next if $visited{$url};

                my $content = $http_ua->get($url)->content;

                open(FILE, ">/var/www/data/$a.txt");
                print FILE "$url\n";
                print FILE "$content";
                close(FILE);

                print qq{Downloaded: "$url"\n};

                push @queue, do {
                    $link_extractor->parse($content);
                    @links = $link_extractor->a;
                };

                foreach $links (@links) {
                    unshift @queue, $links;
                }

                $visited{$url} = 1;
                $a++;
            }
        }

        $http_ua = new LWP::RobotUA theusefulbot => 'bot@theusefulnet.com';
        $http_ua->delay(10/6000);
        $link_extractor = new HTML::SimpleLinkExtor;

        crawl(@ARGV);
        Also, what do I do when the array gets too large?

        Thanks again
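
        On both questions above, here is a minimal sketch of one approach, assuming the URI module for resolving links; the %seen hash and the enqueue_links sub are illustrative names, not part of the posted scripts. Relative links extracted from a page have to be made absolute before the crawler can fetch them, and recording every URL that has ever been queued keeps duplicates out of the queue, so the array only grows with the number of distinct URLs seen:

            use strict;
            use warnings;
            use URI;

            my %seen;    # every URL ever queued, so nothing is queued twice
            my @queue;   # URLs still waiting to be fetched

            # Resolve each extracted link against the page it came from, so
            # relative links become absolute, fetchable URLs, and skip any
            # URL that has already been queued or visited.
            sub enqueue_links {
                my ($base_url, @links) = @_;
                for my $link (@links) {
                    my $abs = URI->new_abs($link, $base_url)->as_string;
                    next if $seen{$abs}++;
                    push @queue, $abs;
                }
            }

        HTML::SimpleLinkExtor can also, if memory serves, be handed a base URL when it is constructed so that it resolves relative links for you; either way, deduplicating at enqueue time is what keeps the queue from growing without bound.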
