Re: Advice on Efficient Large-scale Web Crawling

in reply to Advice on Efficient Large-scale Web Crawling

Personaly, I think you're engaging in premature optimization here: when fetching 4M urls, the DNS traffic is unlikely to be your biggest concern.

Having said that, the cheapest/cleanest method would be to install a caching-only DNS server on your localhost, and let it handle the DNS caching.

Some reasons why your current solution might be slow:

are all those 4 pages each in a flat file, and all the flat files in one directory? You'd be better off distributing them over a tree of directories.
Do you have enough bandwidth to download all those pages? The line might be saturated with that much data. If you are connected through some asymetric line (like ADSL), your downloads could be chocked by the lack of bandwidth for the ACK traffic.
Do you have enough memory for all the processes you've started? If your processes are being swapped out, they will not only be running more slowly as different processes are getting swapped in and out, but they'll probably compete for disk bandwidth with the files you're writing out.

Comment on Re: Advice on Efficient Large-scale Web Crawling

Replies are listed 'Best First'.
Re^2: Advice on Efficient Large-scale Web Crawling by Anonymous Monk on Dec 19, 2005 at 14:11 UTC
Yeah, I'm leaning towards a local DNS cache as well. Thanks. Currently the pool is a hierarchy of directories like this: pool/ pool/todo pool/doing pool/done A sample file path is pool/todo/a/6/a6869c08bcaa2bb6f878de99491efec4f16d0d69 This way readdir() doesn't struggle too much when enumerating the directory's contents, it is trivial to select a random batch of jobs (just generate two random hex numbers between 0 and 16, then read the resulting directory), I get metadata for free (from the filesystem), and I can easily keep track of what jobs are in what state, and recover from errors. I have quite a lot of symetric bandwidth, but as you say, it's certainly a potential bottleneck. Other than benchmarking and tweaking, are there any good ways to approach this issue? I'm monitoring the memory pretty closely. I/O is in good shape, and nothing's touching swap. To achieve this with the current architecture I'm limited to about 12 -15 concurrent processes -- this is one of the reasons why I want to improve things. Does this sound somewhat sensible? :-)	[reply]
Re^3: Advice on Efficient Large-scale Web Crawling by matija (Priest) on Dec 19, 2005 at 14:57 UTC
With a single hex digit in the directory you get an average of 15625 files per directory, which is still too many (IMHO). It might work if the filesystem has hashed directory lookups, but I can't remember offhand which file systems do and which don't have that. I suggest you simply change that to two hex digits per directory name, e.g. pool/todo/a6/86/a6869c08bcaa2bb6f878de99491efec4f16d0d69 That should reduce the average number of files per directory to a much more reasonable 60 and change. And yes, benchmarking (lots and lots of benchmarking) and tweaking seem to be the best way to tackle this kind of problems.	[reply]
Re^3: Advice on Efficient Large-scale Web Crawling by salva (Canon) on Dec 19, 2005 at 14:22 UTC
To achieve this with the current architecture I'm limited to about 12 -15 concurrent processes that limit seems too low for the task you want to accomplish, specially if you have a good internet connection. Have you actually tried incrementing it to 30 or even 50. Forking is not so expensive in moderm Unix/Linux systems with support for COW. update: actually, much of the overhead generated by the forked processes can be caused by perl cleaning up everything. On Unix, this cleanup is mostly useless, and you can get rid of it calling `exec $ok ? '/bin/true' : '/bin/false';` [download] instead of `exit($ok)` to finalize child processes. Just remember to close first any file you had written to.	[reply] [d/l] [select]
Re^4: Advice on Efficient Large-scale Web Crawling by Celada (Monk) on Dec 19, 2005 at 15:28 UTC
That's what `POSIX::_exit` is for. Exit the process without giving Perl (or anything else) a chance to clean up.	[reply] [d/l]

In Section Seekers of Perl Wisdom