Re: LWP::Parallel vs. HTTP::GHTTP vs. IO::Socket

by hacker (Priest)
on May 16, 2003 at 15:16 UTC [id://258679]


in reply to Re: LWP::Parallel vs. HTTP::GHTTP vs. IO::Socket
in thread LWP::Parallel vs. HTTP::GHTTP vs. IO::Socket

Along these lines, would it be faster to use LWP::Parallel, even though it is a bit heavier and slower per request, to fetch the requests in parallel, or to use something like HTTP::MHTTP with fork() or threads and grab the requests from @urls one at a time?

My concern here is that I'll have an array and some hashes tracking URLs that are seen, unseen, down, bad, and so on, and I need to make sure that what the parent puts into those hashes and arrays (as links are yanked from the pages in %seen) is visible to the processes already forked or registered in parallel. Would this require some sort of shared memory to work properly? Can a forked process read and write to an array or hash created by the parent of the fork?
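For what it's worth, a minimal self-contained sketch (the hash contents are just illustrative) showing that a forked child reads a copy of the parent's data, but its writes never make it back to the parent:

    use strict;
    use warnings;

    my %seen = ( 'http://example.com/' => 1 );

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: sees a copy of %seen; this write stays in the child.
        $seen{'http://example.com/child-only'} = 1;
        exit 0;
    }

    waitpid($pid, 0);

    # The child's addition never reaches the parent's copy.
    print scalar(keys %seen), " key(s) in the parent\n";   # prints "1 key(s) ..."

So reading works (each child gets a snapshot as of the fork), but sharing writes back requires explicit IPC or shared memory.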

I've got a lot of this code "functioning", but now is the time to refactor and get the performance up to speed (pun intended) for a production distribution of the tool.

Replies are listed 'Best First'.
Re: Re: LWP::Parallel vs. HTTP::GHTTP vs. IO::Socket
by mp (Deacon) on May 16, 2003 at 16:42 UTC
    You can use Parallel::ForkManager to parallelize HTTP::MHTTP or HTTP::GHTTP calls easily and apply a limit to the maximum number of child processes.
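    A minimal sketch of what that might look like with HTTP::GHTTP (the URL list and the child limit are illustrative, not from a real crawl):

        use strict;
        use warnings;
        use Parallel::ForkManager;
        use HTTP::GHTTP;

        my @urls = ('http://advogato.org/recentlog.html');   # example list
        my $pm   = Parallel::ForkManager->new(10);           # at most 10 children

        for my $url (@urls) {
            $pm->start and next;    # parent continues the loop; child falls through

            my $r = HTTP::GHTTP->new($url);
            $r->process_request;
            my $body = $r->get_body;
            # ...hand $body back via a file, pipe, or database (see below)...

            $pm->finish;            # child exits here
        }
        $pm->wait_all_children;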

    There are a number of ways to handle getting the retrieved data back to the parent or other process that don't require use of shared memory:
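    For instance, one such approach is to have each child serialize its result to a per-PID file with Storable and let the parent collect the files after reaping. A rough sketch, with the spool location and the fetch itself as stand-ins:

        use strict;
        use warnings;
        use Storable qw(store retrieve);
        use File::Temp qw(tempdir);

        my $spool = tempdir(CLEANUP => 1);                    # scratch directory
        my @urls  = ('http://advogato.org/recentlog.html');   # example list

        for my $url (@urls) {
            my $pid = fork();
            die "fork failed: $!" unless defined $pid;
            if ($pid == 0) {
                my $body = "fetched content of $url";   # do the real fetch here
                store({ url => $url, body => $body }, "$spool/$$.result");
                exit 0;
            }
        }
        1 while wait() != -1;    # reap every child

        for my $file (glob "$spool/*.result") {
            my $r = retrieve($file);
            print "got ", length($r->{body}), " bytes from $r->{url}\n";
        }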

      As it turns out, HTTP::MHTTP seems to have an 'issue' with name-based virtual hosts, exhibited by the code below, so I can't use it. It doesn't appear to work on Windows either, which puts it in the non-portable category for me:
      use strict;
      use HTTP::MHTTP;

      # This url REALLY exists, but is a virtual host
      # on a domain shared by multiple hosts.
      my $url = 'http://advogato.org/recentlog.html';

      http_init();
      switch_debug(1);
      http_call("GET", $url);
      print http_response();

      Thanks to bart and ChemBoy for the enlightening discussion that exposed this issue.
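      The usual culprit with name-based virtual hosts is a missing Host: header; a raw IO::Socket::INET sketch (host and path borrowed from the example above) shows the request a server needs in order to pick the right site:

          use strict;
          use warnings;
          use IO::Socket::INET;

          my $host = 'advogato.org';
          my $sock = IO::Socket::INET->new(
              PeerAddr => $host,
              PeerPort => 80,
              Proto    => 'tcp',
          ) or die "connect failed: $!";

          # Without the Host: line, a name-based virtual host server has
          # no way to know which site is wanted and serves (or refuses)
          # whatever its default is.
          print $sock "GET /recentlog.html HTTP/1.0\r\n",
                      "Host: $host\r\n",
                      "\r\n";
          print while <$sock>;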

      Based on my loose testing (excluding HTTP::MHTTP), it looks like HTTP::GHTTP is the fastest, followed closely by HTTP::Lite, with LWP::Simple behind that. I haven't benchmarked these under Parallel::ForkManager yet, so that remains to be seen.
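      A rough sketch of how such a timing comparison might be run with the core Benchmark module (the URL and iteration count are arbitrary):

          use strict;
          use warnings;
          use Benchmark qw(timethese);
          use HTTP::GHTTP;
          use HTTP::Lite;
          use LWP::Simple qw(get);

          my $url = 'http://advogato.org/recentlog.html';

          timethese(50, {
              'HTTP::GHTTP' => sub {
                  my $r = HTTP::GHTTP->new($url);
                  $r->process_request;
                  my $body = $r->get_body;
              },
              'HTTP::Lite'  => sub {
                  my $http = HTTP::Lite->new;
                  $http->request($url);
                  my $body = $http->body;
              },
              'LWP::Simple' => sub {
                  my $body = get($url);
              },
          });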

      The other issue is the speed at which DNS queries are resolved. I think I can speed that up with a local database of already-resolved hosts, though the first run will still take the hit.
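      A minimal in-process version of that idea, caching each resolved hostname for the life of the run (the helper name is made up; the persistent database proposed above would extend this across runs):

          use strict;
          use warnings;
          use Socket qw(inet_ntoa);

          my %dns_cache;

          # Hypothetical helper: resolve each hostname at most once.
          sub resolve_cached {
              my ($host) = @_;
              $dns_cache{$host} ||= do {
                  my $packed = gethostbyname($host)
                      or die "cannot resolve $host\n";
                  inet_ntoa($packed);
              };
              return $dns_cache{$host};
          }

          print resolve_cached('advogato.org'), "\n";   # hits the resolver
          print resolve_cached('advogato.org'), "\n";   # served from the cache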

      Thanks for the tips and hints, though. I'm closer to a functional solution, but it seems the more I test, the farther down the stack I get, edging closer to writing my own code around IO::Socket. I'd like to avoid that if I can.
