PerlMonks
Yeah, I'm leaning towards a local DNS cache as well. Thanks.

Currently the pool is a hierarchy of directories like this:

    pool/
    pool/todo
    pool/doing
    pool/done

A sample file path is pool/todo/a/6/a6869c08bcaa2bb6f878de99491efec4f16d0d69

This way readdir() doesn't struggle too much when enumerating a directory's contents; it is trivial to select a random batch of jobs (just generate two random hex digits, 0-f, then read the resulting directory); I get metadata for free from the filesystem; and I can easily keep track of which jobs are in which state, and recover from errors.

I have quite a lot of symmetric bandwidth, but as you say, it's certainly a potential bottleneck. Other than benchmarking and tweaking, are there any good ways to approach this issue?

I'm monitoring memory pretty closely. I/O is in good shape, and nothing's touching swap. With the current architecture I'm limited to about 12-15 concurrent processes -- this is one of the reasons why I want to improve things.

Does this sound somewhat sensible? :-)

In reply to Re^2: Advice on Efficient Large-scale Web Crawling
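The batch-selection scheme described above can be sketched roughly like this. This is a minimal illustration, not the poster's actual code: it assumes the two single-hex-digit directory levels under a pool root as described, and the function name, batch-size parameter, and dotfile filtering are my own additions.

```perl
use strict;
use warnings;

# Pick a random two-hex-digit bucket under $root (e.g. pool/todo)
# and return up to $max job file paths from it. Buckets that don't
# exist yet simply yield an empty batch.
sub random_batch {
    my ($root, $max) = @_;
    my @hex = (0 .. 9, 'a' .. 'f');
    my $dir = join '/', $root, $hex[int rand 16], $hex[int rand 16];
    opendir my $dh, $dir or return ();       # bucket may not exist yet
    my @jobs = grep { !/^\./ } readdir $dh;  # skip . and ..
    closedir $dh;
    @jobs = @jobs[0 .. $max - 1] if @jobs > $max;
    return map {"$dir/$_"} @jobs;
}

# Example: my @batch = random_batch('pool/todo', 50);
```

One nice property of the todo/doing/done split is that moving a job between states on the same filesystem is a single rename(), which is atomic, so a crashed worker leaves its jobs clearly visible in pool/doing for recovery.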
by Anonymous Monk