Re^2: Advice on Efficient Large-scale Web Crawling
by Anonymous Monk on Dec 19, 2005 at 14:11 UTC
Yeah, I'm leaning towards a local DNS cache as well. Thanks.
Currently the pool is a hierarchy of directories like this:
pool/
pool/todo
pool/doing
pool/done
A sample file path is
This way readdir() doesn't struggle too much when enumerating a directory's contents; it is trivial to select a random batch of jobs (just generate two random hex digits, each from 0 to f, then read the resulting directory); I get metadata for free (from the filesystem); and I can easily keep track of what jobs are in what state, and recover from errors.
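For what it's worth, here is a rough sketch of that batch selection in Python (not my actual code). The two-level layout pool/todo/<hex>/<hex>/<job>, the claim_batch name and the limit parameter are all assumptions made up for illustration. It picks one random bucket and moves jobs from todo to doing with rename, which is atomic within a single filesystem, so two workers can't both claim the same job.

```python
import os
import random

HEX = "0123456789abcdef"

def claim_batch(pool, limit=10, rng=random):
    """Move up to `limit` jobs from one random todo bucket into doing.

    Assumed (hypothetical) layout: <pool>/todo/<hex>/<hex>/<job file>.
    Returns the new paths of the jobs this worker claimed.
    """
    # Two random hex digits select one of 256 buckets.
    a, b = rng.choice(HEX), rng.choice(HEX)
    src = os.path.join(pool, "todo", a, b)
    dst = os.path.join(pool, "doing", a, b)
    try:
        names = sorted(os.listdir(src))[:limit]
    except FileNotFoundError:
        return []  # nothing has been queued under this prefix
    os.makedirs(dst, exist_ok=True)
    claimed = []
    for name in names:
        target = os.path.join(dst, name)
        try:
            # rename is atomic on one filesystem, so only one worker wins
            os.rename(os.path.join(src, name), target)
        except FileNotFoundError:
            continue  # another worker claimed this job first
        claimed.append(target)
    return claimed
```

A worker would call claim_batch("pool") in a loop and move each finished job from doing to done the same way. Anything left sitting in doing after a crash can be renamed back into todo.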
I have quite a lot of symmetric bandwidth, but as you say, it's certainly a potential bottleneck. Other than benchmarking and tweaking, are there any good ways to approach this issue?
I'm monitoring the memory pretty closely. I/O is in good shape, and nothing's touching swap. To achieve this with the current architecture I'm limited to about 12-15 concurrent processes -- this is one of the reasons why I want to improve things.
Does this sound somewhat sensible? :-)