comment on

Yeah, I'm leaning towards a local DNS cache as well. Thanks.

Currently the pool is a hierarchy of directories like this:

pool/
pool/todo
pool/doing
pool/done

A sample file path is

pool/todo/a/6/a6869c08bcaa2bb6f878de99491efec4f16d0d69

This way readdir() doesn't struggle too much when enumerating the directory's contents, it is trivial to select a random batch of jobs (just generate two random hex numbers between 0 and 16, then read the resulting directory), I get metadata for free (from the filesystem), and I can easily keep track of what jobs are in what state, and recover from errors.

I have quite a lot of symetric bandwidth, but as you say, it's certainly a potential bottleneck. Other than benchmarking and tweaking, are there any good ways to approach this issue?

I'm monitoring the memory pretty closely. I/O is in good shape, and nothing's touching swap. To achieve this with the current architecture I'm limited to about 12 -15 concurrent processes -- this is one of the reasons why I want to improve things.

Does this sound somewhat sensible? :-)

In reply to Re^2: Advice on Efficient Large-scale Web Crawling by Anonymous Monk
in thread Advice on Efficient Large-scale Web Crawling by Anonymous Monk

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Perl Monk, Perl Meditation
	PerlMonks