http://qs321.pair.com?node_id=517727


in reply to Re: Advice on Efficient Large-scale Web Crawling
in thread Advice on Efficient Large-scale Web Crawling

Yeah, I'm leaning towards a local DNS cache as well. Thanks.
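
A minimal sketch of an in-process stopgap (not a real local resolver; it assumes each worker resolves hosts via Socket's inet_aton and treats the name-to-IP mapping as stable for the life of the worker):

use Socket qw(inet_aton inet_ntoa);

my %dns_cache;    # host name => packed IPv4 address, one cache per worker process

sub resolve_host {
    my ($host) = @_;
    unless (exists $dns_cache{$host}) {
        $dns_cache{$host} = inet_aton($host);   # one real resolver round trip per host
    }                                           # (failed lookups get cached too)
    return defined $dns_cache{$host} ? inet_ntoa($dns_cache{$host}) : undef;
}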

Currently the pool is a hierarchy of directories like this:

pool/
pool/todo
pool/doing
pool/done

A sample file path is

pool/todo/a/6/a6869c08bcaa2bb6f878de99491efec4f16d0d69
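
For illustration only -- assuming the 40-character name is a SHA-1 of the URL -- a path like that could be derived along these lines:

use Digest::SHA1 qw(sha1_hex);
use File::Path qw(mkpath);

sub job_path {
    my ($state, $url) = @_;                  # $state is 'todo', 'doing' or 'done'
    my $digest = sha1_hex($url);             # 40 hex characters
    my $dir    = join '/', 'pool', $state, substr($digest, 0, 1), substr($digest, 1, 1);
    mkpath($dir) unless -d $dir;             # create the two-level bucket on demand
    return "$dir/$digest";
}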

This way readdir() doesn't struggle too much when enumerating a directory's contents, and selecting a random batch of jobs is trivial: just pick two random hex digits (0-f) and read the resulting directory. I also get metadata for free from the filesystem, and I can easily keep track of which jobs are in which state and recover from errors.
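
Selecting such a batch is roughly this (a sketch; error handling and shuffling within the bucket are omitted):

sub random_batch {
    my ($limit) = @_;
    my @hex = (0 .. 9, 'a' .. 'f');
    my $dir = sprintf 'pool/todo/%s/%s', $hex[rand @hex], $hex[rand @hex];
    opendir my $dh, $dir or return;                 # missing or empty bucket: caller retries
    my @jobs = grep { !/^\./ } readdir $dh;
    closedir $dh;
    my $n = @jobs < $limit ? scalar @jobs : $limit; # take at most $limit entries
    return map { "$dir/$_" } @jobs[0 .. $n - 1];
}

Claiming a job would then presumably be a single rename() from todo/ into the matching doing/ bucket, which keeps the state tracking atomic at the filesystem level.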

I have quite a lot of symmetric bandwidth, but as you say, it's certainly a potential bottleneck. Other than benchmarking and tweaking, are there any good ways to approach this issue?

I'm monitoring memory pretty closely. I/O is in good shape, and nothing's touching swap. To achieve this with the current architecture I'm limited to about 12-15 concurrent processes -- this is one of the reasons why I want to improve things.
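
A minimal sketch of the kind of cap I mean, assuming a Parallel::ForkManager-style worker pool (@batch and fetch_job() are placeholders, not the real code):

use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(15);   # hard ceiling on simultaneous workers

for my $job (@batch) {
    $pm->start and next;                   # parent: hand the job off, queue the next one
    fetch_job($job);                       # child: placeholder for the actual crawl work
    $pm->finish;                           # child exits, freeing its slot
}
$pm->wait_all_children;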

Does this sound somewhat sensible? :-)