Re: LWP::Parallel vs. HTTP::GHTTP vs. IO::Socket

by ajt (Prior)
on May 16, 2003 at 07:55 UTC


in reply to LWP::Parallel vs. HTTP::GHTTP vs. IO::Socket

hacker

I can't comment on LWP::Parallel directly, but I did some crude benchmarks on LWP, HTTP::GHTTP, HTTP::Lite and HTTP::MHTTP (something along the lines of the sketch after the list below). What I found was hardly surprising:

  • LWP is a big, slow-to-load module, and once it's loaded, it's still pretty slow. It can do just about anything, but it's not a speed demon.
  • Lite is quicker than LWP to load, and quicker in use, but it's still not what you would call fast.
  • GHTTP, as expected, was fast to load and fast in use - much faster than either of the pure Perl modules. I can't get it to work under mod_perl on Windows, but that's my only complaint.
  • MHTTP was the only surprise. It has the most basic API and it's not object-oriented like the others, but it's even faster than GHTTP - in both module load time and in actual use.
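
Something like the standard Benchmark module is enough to reproduce this kind of crude comparison. A minimal sketch - the URL is a placeholder, and once you're on a real network the wire time will dominate the per-module differences:

    use strict;
    use Benchmark qw(cmpthese);
    use LWP::Simple ();
    use HTTP::Lite ();
    use HTTP::GHTTP ();

    # Placeholder URL - point at something local so the network
    # doesn't swamp the differences between the modules.
    my $url = 'http://localhost/index.html';

    cmpthese(20, {
        lwp   => sub { LWP::Simple::get($url) },
        lite  => sub { my $h = HTTP::Lite->new;
                       $h->request($url);
                       $h->body },
        ghttp => sub { HTTP::GHTTP::get($url) },
    });

Load time is a separate cost that this doesn't measure; timing perl -MLWP -e1 against perl -MHTTP::GHTTP -e1 with your shell's time builtin shows that difference.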

UPDATE: It should be possible to compile the two C-based modules, GHTTP and MHTTP, on Windows. I believe that currently only GHTTP has a precompiled PPM available. Building the module on Windows is just a case of asking a nice person with a compiler to do the work for you - CrazyPPM repository, interested? I've recently spoken with Piers, and if you have any bugs to submit for MHTTP, let him know and he'll have a look at them for you.


--
"It's not magic, it's work..."
ajt

Replies are listed 'Best First'.
Re: LWP::Parallel vs. HTTP::GHTTP vs. IO::Socket
by hacker (Priest) on May 16, 2003 at 15:16 UTC
    Along these lines, would it be faster to use LWP::Parallel, even though it is a bit heavier and slower, to fetch requests in parallel, or to use something like HTTP::MHTTP and fork() or Thread, and grab the requests from @urls one at a time?

    My concern here is that I'll have an array and some hashes of urls that are seen, unseen, down, bad, and so on, and I need to make sure that the urls being put into those hashes and arrays (as links are yanked from the pages in %seen) can be fetched by processes already running in fork() or registered in parallel. Would this require some sort of shared memory to get working properly? Can a forked process read and write to an array or hash created by the parent of the fork?

    I've got a lot of this code "functioning", but now is the time to refactor and get the performance up to speed (pun intended) for a production distribution of the tool.

      You can use Parallel::ForkManager to parallelize HTTP::MHTTP or HTTP::GHTTP calls easily and apply a limit to the maximum number of child processes.
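
      A minimal, untested sketch of that pattern - it assumes HTTP::GHTTP's simple get interface and a made-up @urls list:

          use strict;
          use Parallel::ForkManager;
          use HTTP::GHTTP qw(get);

          my @urls = ('http://www.example.com/', 'http://www.example.org/');

          # Never run more than 10 children at once.
          my $pm = Parallel::ForkManager->new(10);

          for my $url (@urls) {
              $pm->start and next;      # parent: spawn a child, move on
              my $content = get($url);  # child: do the fetch
              # ... process $content in the child ...
              $pm->finish;              # child exits here
          }
          $pm->wait_all_children;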

      There are a number of ways to handle getting the retrieved data back to the parent or another process that don't require shared memory (a forked child only gets a copy of the parent's variables, so anything it writes to them is lost when it exits): temporary files, pipes, or a database, for example.
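
      For instance, each child can drop its result into a temporary file and the parent can collect the files once wait_all_children returns. A sketch along those lines (the /tmp path and the digest-based naming are just one arbitrary choice):

          use strict;
          use Parallel::ForkManager;
          use HTTP::GHTTP qw(get);
          use Digest::MD5 qw(md5_hex);

          my @urls = ('http://www.example.com/', 'http://www.example.org/');
          my $pm   = Parallel::ForkManager->new(10);

          for my $url (@urls) {
              $pm->start and next;
              my $content = get($url);
              my $file = '/tmp/fetch.' . md5_hex($url);   # one file per URL
              open my $fh, '>', $file or die "can't write $file: $!";
              print $fh $content if defined $content;
              close $fh;
              $pm->finish;
          }
          $pm->wait_all_children;

          # Back in the parent: every child's result is now on disk,
          # addressable by the same URL-derived filename.
          for my $url (@urls) {
              my $file = '/tmp/fetch.' . md5_hex($url);
              print "$url: ", (-s $file || 0), " bytes\n" if -e $file;
          }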

        As it turns out, HTTP::MHTTP seems to have an 'issue' with name-based virtual hosts, exhibited by the code below, so I can't use that, and it doesn't appear to work on Windows machines either, which puts it in the non-portable category for me:
        use strict;
        use HTTP::MHTTP;

        # This url REALLY exists, but is a virtual host
        # on a domain shared by multiple hosts.
        my $url = 'http://advogato.org/recentlog.html';

        http_init();
        switch_debug(1);
        http_call("GET", $url);
        print http_response();
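
        If the failure is just a missing Host: header (which is the usual way name-based virtual hosting breaks), it might be possible to work around it by sending the header explicitly. A speculative, untested sketch, assuming HTTP::MHTTP's http_add_headers behaves as its docs describe:

        use strict;
        use HTTP::MHTTP;

        my $url = 'http://advogato.org/recentlog.html';
        http_init();
        # Name-based virtual hosts are selected on the Host header,
        # so supply one by hand (speculative workaround).
        http_add_headers('Host' => 'advogato.org');
        http_call("GET", $url);
        print http_response();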

        Thanks to bart and ChemBoy for the enlightening discussion that exposed this issue.

        Based on my loose testing (excluding HTTP::MHTTP), it looks like HTTP::GHTTP is the fastest, followed closely by HTTP::Lite, with LWP::Simple behind that. I haven't benchmarked these under Parallel::ForkManager yet, so that remains to be seen.

        The other issue is the speed at which DNS queries are resolved. I think I can speed that up with a local database of resolved sites, but the first run will still take the hit.
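
        A first cut at that cache could be as small as a tied DBM file in front of the resolver. A sketch (the dns_cache.db filename is arbitrary, and it deliberately ignores TTLs):

        use strict;
        use Socket qw(inet_aton inet_ntoa);
        use DB_File;

        # Hostname -> dotted-quad cache, persisted across runs.
        tie my %dns_cache, 'DB_File', 'dns_cache.db'
            or die "can't tie dns_cache.db: $!";

        sub resolve_cached {
            my $host = shift;
            return $dns_cache{$host} if exists $dns_cache{$host};
            my $packed = inet_aton($host) or return undef;
            return $dns_cache{$host} = inet_ntoa($packed);
        }

        Ignoring TTLs is fine for a single crawl run, but a long-lived daemon would want to expire entries.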

        Thanks for the tips and hints though. I'm closer to a functional solution, but it seems the more I test, the farther down the stack I get, closer to writing my own code around IO::Socket. I'd like to avoid that if I can.
