in reply to Re: Advice on Efficient Large-scale Web Crawling
in thread Advice on Efficient Large-scale Web Crawling
Yeah, I'm leaning towards a local DNS cache as well. Thanks.
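For illustration, a local DNS cache can be as simple as an in-process memoized resolver (as opposed to running a caching daemon such as dnsmasq in front of the crawler). A minimal sketch in Python, with the class name and TTL chosen here purely for the example and the resolver injectable so it can be tested offline:

```python
import socket
import time

class DNSCache:
    """Memoize hostname -> address lookups with a TTL, so repeated
    requests to the same host skip the network round-trip."""

    def __init__(self, ttl=300, resolver=socket.gethostbyname):
        self.ttl = ttl
        self.resolver = resolver   # injectable for testing
        self._cache = {}           # host -> (address, expiry time)

    def resolve(self, host):
        now = time.monotonic()
        hit = self._cache.get(host)
        if hit is not None and hit[1] > now:
            return hit[0]          # still fresh: served from cache
        addr = self.resolver(host)
        self._cache[host] = (addr, now + self.ttl)
        return addr
```

A real crawler would also want negative caching (remembering failed lookups) and per-record TTLs from the DNS response, which this sketch omits.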
Currently the pool is a hierarchy of directories like this:
    pool/
    pool/todo
    pool/doing
    pool/done
A sample file path is
pool/todo/a/6/a6869c08bcaa2bb6f878de99491efec4f16d0d69
This way readdir() doesn't struggle when enumerating a directory's contents; it is trivial to select a random batch of jobs (just generate two random hex digits and read the resulting directory); I get metadata for free from the filesystem; and I can easily keep track of what state each job is in, and recover from errors.
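The scheme above can be sketched as follows. This is an illustrative Python rendering (function names and the use of SHA-1 for the job key are assumptions; the original post doesn't say how the 40-hex-digit filename is derived): jobs are sharded under two single-hex-digit directory levels, a random batch is picked by choosing a random shard, and a job is claimed by renaming it from todo to doing, which is atomic within one filesystem.

```python
import hashlib
import os
import random

HEX = "0123456789abcdef"

def job_path(pool, state, url):
    # Key a job by a hex digest of its URL (SHA-1 here is an assumption),
    # sharded as pool/<state>/<d0>/<d1>/<digest>.
    digest = hashlib.sha1(url.encode()).hexdigest()
    return os.path.join(pool, state, digest[0], digest[1], digest)

def add_job(pool, url):
    path = job_path(pool, "todo", url)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(url)
    return path

def random_batch(pool):
    """Pick two random hex digits and list the resulting todo shard."""
    shard = os.path.join(pool, "todo", random.choice(HEX), random.choice(HEX))
    if not os.path.isdir(shard):
        return []
    return [os.path.join(shard, name) for name in os.listdir(shard)]

def claim(path):
    """Move a job from todo to doing; os.rename() is atomic on one filesystem,
    so two workers cannot both claim the same job."""
    dest = path.replace(os.sep + "todo" + os.sep, os.sep + "doing" + os.sep)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    os.rename(path, dest)
    return dest
```

Completing a job would be the same rename from doing to done, and crash recovery amounts to sweeping stale entries in doing back into todo.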
I have quite a lot of symmetric bandwidth, but as you say, it's certainly a potential bottleneck. Other than benchmarking and tweaking, are there any good ways to approach this issue?
I'm monitoring memory pretty closely. I/O is in good shape, and nothing's touching swap. To achieve this with the current architecture I'm limited to about 12-15 concurrent processes -- this is one of the reasons why I want to improve things.
Does this sound somewhat sensible? :-)
Replies are listed 'Best First'.
Re^3: Advice on Efficient Large-scale Web Crawling
by matija (Priest) on Dec 19, 2005 at 14:57 UTC
Re^3: Advice on Efficient Large-scale Web Crawling
by salva (Canon) on Dec 19, 2005 at 14:22 UTC
by Celada (Monk) on Dec 19, 2005 at 15:28 UTC