PerlMonks  

Re^2: Parallel::ForkManager and multiple datasets

by Speed_Freak (Sexton)
on Jul 06, 2018 at 12:57 UTC ( [id://1218048] )


in reply to Re: Parallel::ForkManager and multiple datasets
in thread Parallel::ForkManager and multiple datasets

Thanks! Working on trying this out now.

In reading through this, I do see a problem that I'm not entirely sure how to handle. I have a multidimensional hash that is created in the loop, and values are pulled from it later in the script. Will I need to rewrite all the follow-on code to accommodate the extra layer of data (the $pid key)? Or is there a way to "push" each de-serialized chunk into the parent structure without changing the child structure?

Won't this line:

    $results{$pid} = $data;

turn this:

    $VAR1 = {
        'id_1' => {
            'thing_1' => { 'a' => 1, 'b' => 4.5, 'c' => 1200 },
            'thing_2' => { 'a' => 0, 'b' => 3.2, 'c' => 100 }
        },
        'id_2' => {
            'thing_1' => { 'a' => 1, 'b' => 4.5, 'c' => 1200 },
            'thing_2' => { 'a' => 0, 'b' => 3.2, 'c' => 100 }
        }
    };

into something much more complex, since each child is forked on the list of things and then loops through a list of 1 million id's?

The code has a for loop inside a for loop, and I am trying to fork at the outer loop. This will generate around 200 child processes, and the inner loop then repeats one million times. The data structure is keyed on the inner loop first, then the outer loop: there are a million id's, around 200 things per id, and 6 or so placeholders per thing. I'm worried that adding the $pid in front of the data structure for each child process will add a ton of data to the hash.
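
For reference, a minimal sketch of the shape described above, forking at the outer loop and handing each child's chunk back through Parallel::ForkManager. The @things and @ids lists, the placeholder values, and the cap of 8 concurrent children are all made up for illustration:

    use strict;
    use warnings;
    use Parallel::ForkManager;
    use Sereal::Encoder ();

    # Hypothetical inputs: ~200 "things" for the outer loop, ~1 million id's for the inner loop.
    my @things = map { "thing_$_" } 1 .. 200;
    my @ids    = map { "id_$_"    } 1 .. 1_000_000;

    my $pfm = Parallel::ForkManager->new(8);    # cap on concurrent children

    for my $thing (@things) {
        $pfm->start and next;                   # parent keeps looping, child falls through

        # Child: build only its own slice, keyed id => thing => placeholders.
        my %chunk;
        for my $id (@ids) {
            $chunk{$id}{$thing} = { a => 0, b => 0, c => 0 };
        }

        # Hand the chunk back to the parent; run_on_finish merges it there.
        $pfm->finish( 0, \Sereal::Encoder::encode_sereal(\%chunk) );
    }
    $pfm->wait_all_children;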

Replies are listed 'Best First'.
Re^3: Parallel::ForkManager and multiple datasets
by bliako (Monsignor) on Jul 07, 2018 at 08:46 UTC

    I am not sure I understand correctly what the challenge is. Each child must return its results as a data chunk independent of any other child's. The run_on_finish() sub receives each child's data and, in my example, puts all children's data together in a hash keyed on child pid (see the note below about that). Why? Because I had assumed that you want to keep each child's results separate, since it is possible that child1 returns data with id=12 and so can child2. If that is not necessary, e.g. if each child returns results whose keys never clash with another child's, then fine, it is not set in stone: just merge the children's returned hashes into a larger hash like so:

        my %results = ();
        $pfm->run_on_finish( sub {
            my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data_structure_reference) = @_;
            my $data = Sereal::Decoder::decode_sereal($$data_structure_reference);
            # surely this is sequential code here so no need to lock %results, right?
            @results{keys %$data} = values %$data;
        });

    This will create a "flatter" hash without pid information, but there is the risk of key clashes: if %child1 contains key id=12 and %child2 also contains key id=12 (at the top level of their hashes), the new hash %results can of course hold only one value for it, and that will be whatever the last child returned.
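
    If such clashes should be caught rather than silently overwritten, one option (a small sketch, replacing the hash-slice assignment in the callback above) is a key-by-key merge with a warning:

        for my $key (keys %$data) {
            # flag any id that an earlier child has already returned
            warn "duplicate key '$key' returned by child $pid, overwriting\n"
                if exists $results{$key};
            $results{$key} = $data->{$key};
        }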

    A nested hash is probably more efficient than a flat hash of 1 million items, at least as far as possible key collisions are concerned. Other Monks can correct me on that. In general, I would assume that a hash with 1 million items is child's play for Perl.

    Note on using PIDs as hash keys: using a child's pid as a hash key to collect its results is not a good idea, because pid numbers can be recycled by the OS and two children at different times may end up with the same pid. A better idea is to assign each child its own unique id, drawn from a pool of unique ids and handed to the child at fork time just like its input data.
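
    One way to do that with Parallel::ForkManager itself is the optional identifier argument to start(), which is passed back untouched as the third argument of the run_on_finish() callback. A minimal sketch (the @things list and the placeholder chunk are made up):

        use strict;
        use warnings;
        use Parallel::ForkManager;
        use Sereal::Encoder ();
        use Sereal::Decoder ();

        my @things = map { "thing_$_" } 1 .. 200;   # hypothetical work list
        my $pfm    = Parallel::ForkManager->new(8);

        my %results;
        $pfm->run_on_finish( sub {
            my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data_structure_reference) = @_;
            my $data = Sereal::Decoder::decode_sereal($$data_structure_reference);
            $results{$ident} = $data;   # key on the stable ident, not the recyclable pid
        });

        my $job_id = 0;
        for my $thing (@things) {
            # Whatever is passed to start() comes back as $ident in run_on_finish().
            $pfm->start( $job_id++ ) and next;
            my %chunk = ( $thing => { a => 0, b => 0, c => 0 } );   # placeholder child result
            $pfm->finish( 0, \Sereal::Encoder::encode_sereal(\%chunk) );
        }
        $pfm->wait_all_children;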

    Let me know if I got something wrong or if you have more questions.

    bw, bliako
