Re: Advice on Efficient Large-scale Web Crawling

by Anonymous Monk
on Dec 19, 2005 at 15:57 UTC


in reply to Advice on Efficient Large-scale Web Crawling

(OP here)

salva, yes, that was my thinking too. With the advice here and quite a few optimisations it looks as if I can push this rate up; more tweaking, I think. The POSIX::_exit advice is new to me, though. How would I use that in the context of Parallel::ForkManager?

matija, good point. I'll eventually use ReiserFS, which has superb support for large numbers of files, but I should probably use your approach for now. I agree that it would probably give better performance.
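
Roughly what I have in mind for the bucketing (untested sketch; MD5 and the spool/ directory are placeholder choices):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);
    use File::Path qw(mkpath);

    # Fan files out across two levels of subdirectories keyed on the
    # URL's hash, so no single directory grows unmanageably large.
    sub spool_path {
        my ($url) = @_;
        my $hash = md5_hex($url);
        my $dir  = join '/', 'spool', substr($hash, 0, 2), substr($hash, 2, 2);
        mkpath($dir) unless -d $dir;
        return "$dir/$hash";
    }

    print spool_path('http://www.example.com/'), "\n";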

Regarding HTTP::GHTTP and HTTP::MHTTP: MHTTP doesn't respect the Host header, so it can't handle virtual hosts. GHTTP is indeed nice, but it supports neither HTTPS nor the features of LWP. (My main attraction to HTTP::Lite was that it was pure Perl and easy enough to hack to get the remote IP address; now that I can get that from LWP, Lite is less useful.) It looks as if LWP can use GHTTP internally, though, which sounds like a win-win. :-) I'll have to run some benchmarks on this...
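
If the LWP-over-GHTTP route works out, I'd expect the wiring to look something like this (untested; it needs LWP::Protocol::GHTTP, which in turn needs HTTP::GHTTP installed):

    use strict;
    use warnings;
    use LWP::UserAgent;
    use LWP::Protocol ();

    # Route plain http:// requests through the GHTTP-based backend;
    # https:// stays on the default stack, since GHTTP has no SSL.
    LWP::Protocol::implementor(http => 'LWP::Protocol::GHTTP');

    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get('http://www.example.com/');
    print $res->status_line, "\n";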

merlyn, I'm afraid I do have to hit this number of external URLs. :-) It's for a research project that does have many merits. (I don't agree that we don't need a better search engine, but I guess that's academic.) I'm going some way toward supporting the Robots Exclusion Protocol: I pre-process the list of URLs to identify the few hosts that will be hit more than a couple of times, then fetch their robots.txt. If they forbid crawling, I nix them from the input.

By working with batches indexed by the hash of the URL, I severely reduce the risk of hitting any server too hard: a host would have to have more than a trivial number of URLs in the index whose hashes start with the same two characters (even more once I implement matija's suggestion). I've just written a script to double-check this, and only two hosts have multiple URLs in the same job bin: one has 2, the other 3. I appreciate your concern -- I run large sites myself, and am perfectly aware of the damage a runaway spider can cause. ;-)
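
For reference, the robots.txt pre-pass looks roughly like this with WWW::RobotRules (untested sketch; the host and URL lists and the agent name are placeholders):

    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use WWW::RobotRules;

    my @busy_hosts = qw(www.example.com);   # hosts hit more than a couple of times
    my @urls       = ('http://www.example.com/index.html');

    # One robots.txt fetch per frequently-hit host, then drop any URL
    # those rules forbid for our agent before the crawl proper starts.
    my $rules = WWW::RobotRules->new('ResearchCrawler/0.1');
    for my $host (@busy_hosts) {
        my $robots_url = "http://$host/robots.txt";
        my $txt = get($robots_url);
        $rules->parse($robots_url, $txt) if defined $txt;
    }
    @urls = grep { $rules->allowed($_) } @urls;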

Replies are listed 'Best First'.
Re^2: Advice on Efficient Large-scale Web Crawling
by merlyn (Sage) on Dec 19, 2005 at 16:07 UTC
    merlyn, I'm afraid I do have to hit this number of external URLs. :-) It's for a research project that does have many merits.
    Then use the Google API and their database, or the newly announced Alexa API from Amazon.

    There's no justified reason to re-crawl the world, unless you're also providing benefit to the world, and you haven't yet convinced me of your benefit ("research project" could mean anything).

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      A Google API license key only allows 1,000 automated queries per day. This page, while somewhat dated, provides some data relevant to this discussion. A couple of key points from that data:

      - Netcraft estimated that 42.8 million web servers existed; assuming 50 URLs per web server gives over 2.1 billion URLs. If the OP is randomly selecting URLs, the chances of any particular server being significantly inconvenienced are small, in my estimation.
Re^2: Advice on Efficient Large-scale Web Crawling
by salva (Canon) on Dec 19, 2005 at 16:16 UTC
    How would I use that in the context of Parallel::ForkManager?

    Not tested, but just calling POSIX::_exit (as suggested by Celada) instead of $fm->finish should do the trick.
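
    Something like this minimal sketch (equally untested; fetch_one() and the URL list are placeholders):

        use strict;
        use warnings;
        use POSIX ();
        use Parallel::ForkManager;

        sub fetch_one { my ($url) = @_; warn "fetching $url\n" }   # stub

        my @urls = @ARGV;                          # placeholder work list
        my $pm   = Parallel::ForkManager->new(20); # up to 20 concurrent children

        for my $url (@urls) {
            $pm->start and next;   # parent keeps looping; child falls through
            fetch_one($url);
            # Bypass $pm->finish and Perl's global destruction: _exit()
            # ends the child immediately and lets the kernel reclaim its
            # memory, which is much cheaper for a large child process.
            POSIX::_exit(0);
        }
        $pm->wait_all_children;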

    Or you can switch to my own Proc::Queue module. I am sure it supports exiting via POSIX::_exit.
