PerlMonks  

Re^2: Advice on Efficient Large-scale Web Crawling

by Anonymous Monk
on Dec 19, 2005 at 14:11 UTC ( #517727=note )


in reply to Re: Advice on Efficient Large-scale Web Crawling
in thread Advice on Efficient Large-scale Web Crawling

Yeah, I'm leaning towards a local DNS cache as well. Thanks.

Currently the pool is a hierarchy of directories like this:

pool/
pool/todo
pool/doing
pool/done

A sample file path is

pool/todo/a/6/a6869c08bcaa2bb6f878de99491efec4f16d0d69

This way readdir() doesn't struggle too much when enumerating a directory's contents, it's trivial to select a random batch of jobs (just pick two random hex digits, then read the resulting directory), I get metadata for free from the filesystem, and I can easily track what state each job is in and recover from errors.
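The batch-selection step above can be sketched like this (a minimal illustration assuming the single-hex-digit layout shown; `random_batch` is a hypothetical helper, not from the original post):

```perl
use strict;
use warnings;

# Pick a random two-level bucket under the given root and list the
# job files in it. Bucket names are assumed to be single hex digits,
# as in the pool/todo/a/6/... layout above.
sub random_batch {
    my ($root) = @_;
    my @hex = (0 .. 9, 'a' .. 'f');
    my $dir = join '/', $root, $hex[int rand 16], $hex[int rand 16];
    opendir my $dh, $dir or return ();    # bucket may not exist yet
    my @jobs = grep { !/^\./ } readdir $dh;
    closedir $dh;
    return map { "$dir/$_" } @jobs;
}

my @batch = random_batch('pool/todo');
```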

I have quite a lot of symmetric bandwidth, but as you say, it's certainly a potential bottleneck. Other than benchmarking and tweaking, are there any good ways to approach this issue?

I'm monitoring the memory pretty closely. I/O is in good shape, and nothing's touching swap. To achieve this with the current architecture I'm limited to about 12-15 concurrent processes -- this is one of the reasons I want to improve things.

Does this sound somewhat sensible? :-)


Replies are listed 'Best First'.
Re^3: Advice on Efficient Large-scale Web Crawling
by matija (Priest) on Dec 19, 2005 at 14:57 UTC
    With a single hex digit per directory level you get an average of 15625 files per directory, which is still too many (IMHO). It might work if the filesystem has hashed directory lookups, but I can't remember offhand which filesystems have those and which don't.

    I suggest you simply change that to two hex digits per directory name, e.g.

    pool/todo/a6/86/a6869c08bcaa2bb6f878de99491efec4f16d0d69
    
    
    That should reduce the average number of files per directory to a much more reasonable 60 and change.
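For illustration, the two-hex-digit scheme could be derived straight from each job's SHA-1 digest (a hedged sketch; `job_path` and the choice of the URL as the hashed key are assumptions, not part of matija's reply):

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

# Two-hex-digit sharding as suggested above: the bucket names are
# simply the first four characters of the job's SHA-1 digest, so the
# path can always be recomputed from the URL alone.
sub job_path {
    my ($state, $url) = @_;    # $state is todo, doing or done
    my $digest = sha1_hex($url);
    return sprintf 'pool/%s/%s/%s/%s',
        $state, substr($digest, 0, 2), substr($digest, 2, 2), $digest;
}
```

Moving a job between states is then just a rename() between the corresponding pool/todo, pool/doing and pool/done trees.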

    And yes, benchmarking (lots and lots of benchmarking) and tweaking seem to be the best way to tackle this kind of problem.

Re^3: Advice on Efficient Large-scale Web Crawling
by salva (Abbot) on Dec 19, 2005 at 14:22 UTC
    To achieve this with the current architecture I'm limited to about 12 -15 concurrent processes

    That limit seems too low for the task you want to accomplish, especially if you have a good internet connection. Have you actually tried raising it to 30 or even 50? Forking is not so expensive on modern Unix/Linux systems with copy-on-write (COW) support.

    Update: actually, much of the overhead generated by the forked processes can be caused by perl cleaning everything up on exit. On Unix this cleanup is mostly useless, and you can get rid of it by calling

    exec $ok ? '/bin/true' : '/bin/false';
    instead of exit($ok) to finalize child processes. Just remember to close any files you have written to first.
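A minimal sketch of that trick (hedged; `do_work` is a hypothetical stand-in for the real child routine):

```perl
use strict;
use warnings;

sub do_work { return 1 }    # stand-in for the real job

my $pid = fork;
die "fork failed: $!" unless defined $pid;
if ($pid == 0) {
    my $ok = do_work();
    # Flush/close anything written before exec replaces the process.
    close STDOUT;
    # exec skips perl's global destruction entirely; /bin/true and
    # /bin/false also set the child's exit status (0 or 1).
    exec $ok ? '/bin/true' : '/bin/false';
    die "exec failed: $!";
}
waitpid $pid, 0;
my $status = $? >> 8;    # 0 on success, 1 on failure
```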

      That's what POSIX::_exit is for: it exits the process without giving Perl (or anything else) a chance to clean up.
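A corresponding sketch with POSIX::_exit (POSIX is a core module; the job result here is a stand-in):

```perl
use strict;
use warnings;
use POSIX ();

my $pid = fork;
die "fork failed: $!" unless defined $pid;
if ($pid == 0) {
    my $ok = 1;    # stand-in for the real job result
    # _exit bypasses END blocks and destructors, so any buffered
    # output must be flushed by hand before this point.
    POSIX::_exit($ok ? 0 : 1);
}
waitpid $pid, 0;
my $status = $? >> 8;    # 0 on success, 1 on failure
```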
