Re^2: IPC::Open, Parallel::ForkManager, or Sockets::IO for parallelizing?

by mldvx4 (Friar)
on Sep 05, 2023 at 15:19 UTC


in reply to Re: IPC::Open, Parallel::ForkManager, or Sockets::IO for parallelizing?
in thread IPC::Open, Parallel::ForkManager, or Sockets::IO for parallelizing?

Thanks. I took a closer look at LWP::Parallel and it may come in handy later¹. From the documentation I see how to fetch batches of links, and I wonder if there is any way to parallelize the processing of the results in the same pass, so that fetch+process runs continuously rather than fetching in parallel, waiting, and then processing in parallel.

I'll be digging into the options mentioned in the other threads, too.

#!/usr/bin/perl

use strict;
use warnings;

use HTTP::Request;
use LWP::Parallel::UserAgent;

my @feeds = (
    'http://localhost/feed1.xml',   # rss
    'http://localhost/feed2.xml',   # atom
    'http://localhost/foo',         # 404
);

my $requests = prepare_requests(@feeds);
my $entries  = fetch_feeds($requests);

# report on each answer once all transfers have finished
foreach my $k (keys %$entries) {
    my $res = $entries->{$k}->response;
    print "Answer for '", $res->request->url, "' was \t",
        $res->code, ": ", $res->message, "\n";
        # $res->content, "\n";
}

exit(0);

# turn a list of URLs into HTTP::Request objects
sub prepare_requests {
    my (@feeds) = @_;
    my $requests = [];
    foreach my $url (@feeds) {
        push(@$requests, HTTP::Request->new('GET', $url));
    }
    return($requests);
}

# register all requests and let the user agent fetch them in parallel
sub fetch_feeds {
    my ($requests) = @_;

    my $pua = LWP::Parallel::UserAgent->new();
    $pua->in_order  (0);  # do not force answers into registration order
    $pua->duplicates(1);  # do not discard duplicate requests
    $pua->timeout   (9);  # in seconds
    $pua->redirect  (1);  # follow redirects
    $pua->max_hosts (3);  # max locations accessed in parallel

    foreach my $req (@$requests) {
        print "Registering '" . $req->url . "'\n";
        if (my $res = $pua->register($req, \&handle_answer, 8192, 1)) {
            # print STDERR $res->error_as_HTML;
            print $res->error_as_HTML;
        }
        else {
            print qq(ok\n);
        }
    }

    my $entries = $pua->wait();
    return($entries);
}

# per-chunk callback: accumulate the body of each response as it arrives
sub handle_answer {
    my ($content, $response, $protocol, $entry) = @_;

    if (length($content)) {
        $response->add_content($content);
    }
    else {
        1;
    }
    return(undef);
}

¹ That's the thing about CPAN: there are so many useful modules with great accompanying documentation that discovery can be a challenge. So I am very appreciative of everyone's input here.

Re^3: IPC::Open, Parallel::ForkManager, or Sockets::IO for parallelizing?
by hippo (Bishop) on Sep 05, 2023 at 15:55 UTC
    wonder if there is any way to parallelize the processing of the results in the same move

    Have you tried just putting your processing in the handle_answer callback? That may be all you need. But I would still profile it first because it would be a big surprise if the feed processing step weren't dwarfed by the fetch times.
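    For illustration, a minimal, untested sketch of that idea, reusing the handle_answer shape from the parent node. process_feed is a hypothetical placeholder for whatever parsing the feeds actually need, and whether an empty $content chunk reliably marks the end of a response is something to confirm against the LWP::Parallel::UserAgent docs.

    # Untested sketch: same callback as above, with a processing hook added.
    # process_feed() is a hypothetical placeholder for the real feed parsing.
    sub process_feed {
        my ($response) = @_;
        printf "processed %s (%d bytes)\n",
            $response->request->uri, length($response->content);
    }

    sub handle_answer {
        my ($content, $response, $protocol, $entry) = @_;
        if (length($content)) {
            $response->add_content($content);   # keep accumulating the body
        }
        else {
            # If an empty chunk really does mean this response is finished,
            # the per-feed work can happen here, while other fetches are
            # still in flight, instead of after $pua->wait() returns.
            process_feed($response);
        }
        return(undef);
    }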


    🦛

      Have you tried just putting your processing in the handle_answer callback?

      Yes, I had started to look at that. As far as I can tell, handle_answer keeps getting called with chunks until the HTTP response is complete. I suppose there is a way to identify when the response is finally complete? For now, I will try building out the script as-is, with LWP in parallel and the rest serial. Some of the preparatory parts seem much faster than expected, based on trials in Perl compared with another scripting language, so speed may very well not be an issue. Although I have lots of RAM on this computer, I'd like to find a way to process the responses as they come in so that all 700 to 800 of them don't sit around whole in memory at the same time.
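      One untested sketch of what I have in mind for keeping memory bounded: buffer each response outside the HTTP::Response object, then process and discard it as soon as it looks complete. The Content-Length comparison is only a heuristic (it won't fire for chunked or compressed transfers), and process_feed is again a hypothetical placeholder, so anything left in %buffer after $pua->wait() returns would still need a final pass.

      # Untested sketch: process-and-discard as responses appear complete.
      my %buffer;    # raw bytes so far, keyed by request URL

      # Hypothetical placeholder for the real feed parsing.
      sub process_feed {
          my ($url, $body) = @_;
          printf "%s: %d bytes fetched\n", $url, length($body);
      }

      sub handle_answer {
          my ($content, $response, $protocol, $entry) = @_;
          my $url = $response->request->uri->as_string;
          $buffer{$url} //= '';
          $buffer{$url} .= $content if defined $content;

          # Heuristic completeness check: only works when the server sent a
          # Content-Length header and the body was not chunked or compressed.
          my $expected = $response->content_length;
          if (defined $expected && length($buffer{$url}) >= $expected) {
              process_feed($url, $buffer{$url});
              delete $buffer{$url};    # free the body right away
          }
          return(undef);
      }

      (Because this version never calls add_content, the entries returned by wait() would carry status codes but empty bodies.)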

      Resuming more reading...