Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Advice on Efficient Large-scale Web Crawling

by Anonymous Monk
on Dec 19, 2005 at 12:45 UTC ( #517702=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a system which needs to fetch well over four million URLs (and this number will grow), save their contents, and record their host's IP address(es). (Note: I'm only doing GET $url -- I'm not spidering the sites).

I've written a Pool:: and Pool::Job module which let me do something like this:

my $pool = Pool->new; while (my $job = $pool->get_next_job) { #do something with $job->url if ($error) { $job->skip; } else { $job->done; } }

(Yeah, I'm sure there's a better way to do this just waiting on CPAN, but... ;-))

The reason for this complexity is that I need to process these jobs in parallel, will need to retry some jobs, and need the crawler to start working as soon as possible after it starts up if it crashes. This architecture allows that.

My current crawler uses LWP::UserAgent and Parallel::ForkManager. It writes the page content to individual flat files. This works, but is slow, appears to just die occasionally, and doesn't allow me to get the IP address(es) efficiently.

Regarding the IPs, I could simply use URI:: to extract the hostname and then use inet_ntoa() and/or gethostbyname(), but this would require performing a DNS lookup twice per URL: once for LWP and once for inet_ntoa/gethostbyname. Ideally, I'd have LWP return the IP it resolved, but I can't see an easy way to do this. Of course, even then for hosts which have multiple IP addresses, I'm resolving them multiple times. I imagine that the answer is: use a cache, but where? Use a Memoize type wrapper for inet_ntoa/gethostbyname? Use a DNS cache at the OS level (something like djbdns?)? Alternatively, I can just accept that this inefficiency is OK, and bulk resolve the hostnames independently (which will make removing duplicates trivial) using something like http://he.fi/slookup/ . I would like to do as much in Perl as possible, however, so POE::Component::Client::DNS looks useful if I do look them up in bulk...

If I do fetch the URLs and resolve the IPs in the same process, there appear to be two main options: use a hacked version of HTTP::Lite which returns the IP address (a trivial change) along with Parallel::ForkManager, or use POE::Component::Client::HTTP and figure out a way to have it return the IP.

Regarding POE::Component::Client::HTTP, I'm a bit confused as to how it works under the hood... What limits the amount of concurrent processes does it run, or is this a stupid question? With most parallel modules that I've used you can configure a maximum level of processes to run concurrently.

Anyway, that's a brief overview of the problem. Any thoughts on the most efficient way to approach this? Better modules to use? :-)

Replies are listed 'Best First'.
Re: Advice on Efficient Large-scale Web Crawling
by matija (Priest) on Dec 19, 2005 at 13:45 UTC
    Personaly, I think you're engaging in premature optimization here: when fetching 4M urls, the DNS traffic is unlikely to be your biggest concern.

    Having said that, the cheapest/cleanest method would be to install a caching-only DNS server on your localhost, and let it handle the DNS caching.

    Some reasons why your current solution might be slow:

    • are all those 4 pages each in a flat file, and all the flat files in one directory? You'd be better off distributing them over a tree of directories.
    • Do you have enough bandwidth to download all those pages? The line might be saturated with that much data. If you are connected through some asymetric line (like ADSL), your downloads could be chocked by the lack of bandwidth for the ACK traffic.
    • Do you have enough memory for all the processes you've started? If your processes are being swapped out, they will not only be running more slowly as different processes are getting swapped in and out, but they'll probably compete for disk bandwidth with the files you're writing out.

      Yeah, I'm leaning towards a local DNS cache as well. Thanks.

      Currently the pool is a hierarchy of directories like this:

      pool/
      pool/todo
      pool/doing
      pool/done

      A sample file path is

      pool/todo/a/6/a6869c08bcaa2bb6f878de99491efec4f16d0d69

      This way readdir() doesn't struggle too much when enumerating the directory's contents, it is trivial to select a random batch of jobs (just generate two random hex numbers between 0 and 16, then read the resulting directory), I get metadata for free (from the filesystem), and I can easily keep track of what jobs are in what state, and recover from errors.

      I have quite a lot of symetric bandwidth, but as you say, it's certainly a potential bottleneck. Other than benchmarking and tweaking, are there any good ways to approach this issue?

      I'm monitoring the memory pretty closely. I/O is in good shape, and nothing's touching swap. To achieve this with the current architecture I'm limited to about 12 -15 concurrent processes -- this is one of the reasons why I want to improve things.

      Does this sound somewhat sensible? :-)

        With a single hex digit in the directory you get an average of 15625 files per directory, which is still too many (IMHO). It might work if the filesystem has hashed directory lookups, but I can't remember offhand which file systems do and which don't have that.

        I suggest you simply change that to two hex digits per directory name, e.g.

        pool/todo/a6/86/a6869c08bcaa2bb6f878de99491efec4f16d0d69
        
        
        That should reduce the average number of files per directory to a much more reasonable 60 and change.

        And yes, benchmarking (lots and lots of benchmarking) and tweaking seem to be the best way to tackle this kind of problems.

        To achieve this with the current architecture I'm limited to about 12 -15 concurrent processes

        that limit seems too low for the task you want to accomplish, specially if you have a good internet connection. Have you actually tried incrementing it to 30 or even 50. Forking is not so expensive in moderm Unix/Linux systems with support for COW.

        update: actually, much of the overhead generated by the forked processes can be caused by perl cleaning up everything. On Unix, this cleanup is mostly useless, and you can get rid of it calling

        exec $ok ? '/bin/true' : '/bin/false';
        instead of exit($ok) to finalize child processes. Just remember to close first any file you had written to.
Re: Advice on Efficient Large-scale Web Crawling
by salva (Abbot) on Dec 19, 2005 at 13:30 UTC
    LWP adds the IP address of the remote host to the response object as a header.
      LWP adds the IP address of the remote host to the response object as a header.
      That is so helpful. Thank you. :-)
Re: Advice on Efficient Large-scale Web Crawling
by merlyn (Sage) on Dec 19, 2005 at 15:30 UTC
    I would have contacted you privately on this, but that's hard to do since you aren't logged in.

    However, I question the meta-question here. Do you really have 4 million (and growing) internal URLs to hit? And if so, why don't you just hit the database behind the URLs instead... no need to do an HTTP query when a DBI query will give you the same information.

    But if not... if these are external URLs, do you have permission to fetch them for your purposes? I tolerate having Google and other search engines hit my site because they give value to me and my customers. For this, I put up with reduced CPU performance and increased bandwidth usage while these public search engines hit my 44000 photograph pages and 250 magazine articles. If you're setting up another search engine, we really don't need one. If you're using this information for your own gain, I really don't want you hitting my site. Also, if you're hitting public URLs, you should use a distinct robot agent and follow the robot rules.

    So, it'll help me to help you in the future to establish some sort of context for this. Why the heck do you want to hit 4 million (plus) URLs "efficiently"?

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

Re: Advice on Efficient Large-scale Web Crawling
by perrin (Chancellor) on Dec 19, 2005 at 15:24 UTC
    FWIW, HTTP::GHTTP and HTTP::MHTTP are both a lot faster than HTTP::Lite.
Re: Advice on Efficient Large-scale Web Crawling
by Anonymous Monk on Dec 19, 2005 at 15:57 UTC

    (OP here)

    salva, yes that was my thinking, too. With the advice here and quite a few optimisations it looks as if I can push this up. More tweaking, I think. That exit advice is new to me. How would I use that in the context of Parallel::ForkManager?

    matija, good point. I'll eventually use ReiserFS which has superb support for large numbers of files, but I should probably use your approach now. I agree that it would probably give better performance.

    Regarding HTTP::GHTTP and HTTP::MHTTP, MHTTP doesn't respect the Host header thus can't handle virtual hosts. GHTTP is indeed nice, but it neither supports HTTPS nor the features of LWP. (My main attraction to HTTP::Lite was that it was pure Perl and easy enough to hack to get the remote IP address -- now I can get that from LWP, Lite is less useful). It looks as if LWP supports using GHTTP internally, though, which sounds like a win-win. :-) I'll have to run some benchmarks on this...

    merlyn, I'm afraid I do have to hit this number of external URLs. :-) It's for a research project that does have many merits. (I don't agree that we don't need a better search engine, but I guess that's academic). I'm going some way to support the Robots Exclusion Protocol. I do pre-processing on the list of URLs to identify the few hosts which will be hit more than a couple of times, then fetch their robots.txt. If they forbid crawling, I nix them from the input. By working with batches indexed by the hash of the URL I severely reduce the risk of hitting any server too hard (the host would have to have more than a trivial number of URLs in the index whose hashes start with at least the same two characters (even more when I implement matija's suggestion)). I've just wrote a script to double check this and only two hosts have multiple URLs in the same job bin: one has 2, the other 3. I appreciate your concern -- I run large sites myself, and am perfectly aware of the damage a runaway spider can cause. ;-)

      merlyn, I'm afraid I do have to hit this number of external URLs. :-) It's for a research project that does have many merits.
      Then use the Google API and their database. Or, you can also use the newly announced Alexa API from Amazon.

      There's no justified reason to re-crawl the world, unless you're also providing benefit to the world, and you haven't yet convinced me of your benefit ("research project" could mean anything).

      -- Randal L. Schwartz, Perl hacker
      Be sure to read my standard disclaimer if this is a reply.

        According to Google API a license key only allows for 1,000 automated queries per day. This page while somewhat dated, provides some data relevant to this discussion. A couple of key points from that data include:

        -Netcraft estimated that 42.8 million web servers existed. Assuming 50 URLs per web server gives over 2.1 billion URLs. If the OP is randomly selecting URLs the chances of any particular server being significantly inconvenienced are small, in my estimation.
      How would I use that in the context of Parallel::ForkManager?

      Not tested, but just calling POSIX::_exit (as suggested by Celada) instead of $fm->finnish should do the trick.

      Or you can switch to my own Proc::Queue module. I am sure it supports exiting via POSIX::_exit.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://517702]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (3)
As of 2020-10-28 11:06 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    My favourite web site is:












    Results (260 votes). Check out past polls.

    Notices?