Multithread Web Crawler

by xuqy (Initiate)
on Sep 22, 2005 at 00:16 UTC

xuqy has asked for the wisdom of the Perl Monks concerning the following question:

I am interested in "focused crawling" (crawling web pages on a specific topic and ignoring all the others) and have recently written a "focused crawler". Perl is a reasonable choice for writing a web crawler because of its LWP module and CPAN. However, when I planned to make the crawling multithreaded, I was confused by perl 5.8.4's "threads" module, especially threads::shared. How is an object reference shared by multiple threads? I want to use the "Cache::File::Heap" module to sort the urls in the "crawling frontier" by a heuristic prediction of their "harvest outcome". Below is the relevant part of the code:
#!/usr/bin/perl -w
use strict;
use threads;
use threads::shared;
use Cache::File::Heap;

my $heap = Cache::File::Heap->new('frontier');
my $heap_lock : shared = 0;
...
sub go {    # crawling thread's control flow
    ...
    # extract the most promising url
    {
        lock $heap_lock;
        my ($value, $url) = $heap->extract_minimum;
    }
    ...
    # after downloading and extracting hyperlinks
    {
        lock $heap_lock;
        $heap->add($value, $url) for ...
    }
    ...
}
my @threads;
for (1..10) {
    push @threads, threads->new(\&go);
}
for (@threads) {
    $_->join;
}
Everything is fine until all the threads have been joined by the main thread and the main thread exits. Then the following error message appears: Scalar leaks: -1 Segmentation fault. My question is: how do I share an object reference (such as a Cache::File::Heap) between threads? Cache::File::Heap is a wrapper around BerkeleyDB's BTREE; is BerkeleyDB thread-safe?
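One way around the problem of sharing a blessed, XS-backed object like Cache::File::Heap is not to share it at all: keep the heap in the main thread and hand urls to worker threads through Thread::Queue. Below is a minimal sketch of that pattern (my own assumption, not code from this node); fetch_and_extract() is a hypothetical helper for the download/parse step.

use strict;
use warnings;
use threads;
use Thread::Queue;

my $work_q   = Thread::Queue->new;    # urls waiting to be fetched
my $result_q = Thread::Queue->new;    # extracted links going back to the main thread

sub worker {
    while (defined(my $url = $work_q->dequeue)) {
        my @links = fetch_and_extract($url);    # hypothetical fetch/parse helper
        $result_q->enqueue(@links);
    }
}

my @workers = map { threads->create(\&worker) } 1 .. 10;

# main thread: pop the best url from the (unshared) heap, enqueue it on $work_q,
# drain $result_q and add newly found links back onto the heap
# ...

$work_q->enqueue(undef) for @workers;    # one undef per worker signals shutdown
$_->join for @workers;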

Replies are listed 'Best First'.
Re: Multithread Web Crawler
by merlyn (Sage) on Sep 22, 2005 at 00:23 UTC
      Agreed here.

      A web crawling application is not going to see much benefit from the lightweight nature of threads, since the work is by its nature fairly heavy.

      If you decide that threads don't really hold an advantage for your application, you can save yourself a whole load of work by forking off processes.

      As pointed to in a recent node, Parallel::ForkManager might be of use to you. The module description includes:

      This module is intended for use in operations that can be done in parallel where the number of processes to be forked off should be limited. Typical use is a downloader which will be retrieving hundreds/thousands of files.
      Sounds right up your tree? Or is that down your tree? (I never did work out where the roots for a red-black tree would go).
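      A minimal sketch of the Parallel::ForkManager pattern being suggested, assuming a plain @urls list; get_page() is a hypothetical stand-in for the actual download code.

      use strict;
      use warnings;
      use Parallel::ForkManager;

      my @urls = @ARGV;                          # urls to fetch (assumed)
      my $pm = Parallel::ForkManager->new(10);   # at most 10 children at once

      for my $url (@urls) {
          $pm->start and next;    # parent: fork a child, then move to the next url
          get_page($url);         # child: do the download
          $pm->finish;            # child exits here
      }
      $pm->wait_all_children;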
        Thank you so much. I did try Parallel::ForkManager, but ran into a puzzle: how do I share data between processes? To avoid crawling the same page repeatedly, a global tied hash has to be shared by all the crawling processes. I experimented and found that all the forked processes just ended up with the same crawling history. Can you do me a favor and suggest a fix?
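        One possible approach (my own sketch, not from this thread): forked children cannot update the parent's tied hash directly, but Parallel::ForkManager can pass a data structure back to the parent through finish() and a run_on_finish() callback, so the parent can merge each child's crawling history. crawl() and @frontier below are hypothetical names.

        use strict;
        use warnings;
        use Parallel::ForkManager;

        my %seen;                                  # crawling history, lives in the parent only
        my $pm = Parallel::ForkManager->new(5);

        # runs in the parent each time a child finishes
        $pm->run_on_finish(sub {
            my ($pid, $exit, $ident, $signal, $core, $data) = @_;
            %seen = (%seen, %$data) if ref $data eq 'HASH';    # merge the child's history
        });

        for my $url (@frontier) {                  # @frontier: urls to crawl (assumed)
            next if $seen{$url};
            $pm->start and next;                   # parent continues the loop
            my %crawled = crawl($url);             # hypothetical: returns urls the child visited
            $pm->finish(0, \%crawled);             # ship the hash back to the parent
        }
        $pm->wait_all_children;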
