in reply to Parallel::ForkManager and multiple datasets

In each child process, create as complex a data structure as you wish, totally contained within the child block. Then, when you are done processing, serialise it using one of the many existing serialisers, e.g. Sereal:

my $complex_data_structure = {'a'=>[1,2,3], 'b'=>{'c'=>[4,5,6],'d'=>LWP::UserAgent->new()}};
my $serialised_data = Sereal::Encoder::encode_sereal($complex_data_structure);
$pfm->finish(0, \$serialised_data); # <<< note that we pass a reference to our serialised data

The callback registered with run_on_finish() is called every time a child is done processing. There we de-serialise our data via the $data_structure_reference, like this:

my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data_structure_reference) = @_;
# de-reference the ref to the serialised data and then de-serialise it
my $data = Sereal::Decoder::decode_sereal($$data_structure_reference);

Below is something to get you started. Note a few points: 1) how to get the pid of the child, 2) how to pass the data back via its reference. But the main point is that you serialise your complex data in the child into, let's say, a huge zipped string, and that is what gets passed on to the parent process. I am not sure how well Sereal can handle references to objects created within the child and how well it can re-constitute them back in the parent (see the quick round-trip check after the script).

#!/usr/bin/env perl

use strict;
use warnings;

use Parallel::ForkManager;
use Data::Dump qw/dump/;
# bliako
use Sereal::Encoder qw(encode_sereal sereal_encode_with_object);
use Sereal::Decoder qw(decode_sereal sereal_decode_with_object);

my @names = ();
my %list = ();
my %thing = ();

my $threads = 20;
my $pfm = Parallel::ForkManager->new( $threads );
my %results = ();

$pfm->run_on_finish( sub {
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data_structure_reference) = @_;
    my $data = Sereal::Decoder::decode_sereal($$data_structure_reference);
    # surely this is sequential code here so no need to lock %results, right?
    $results{$pid} = $data;
    # using pid as key is not a good idea because a pid number may eventually be recycled
});

my $things_hr = {
    'job1' => 'this is job 1 data',
    'job2' => 'this is job 2 data',
    'job3' => 'this is job 3 data',
    'job4' => 'this is job 4 data',
    'job5' => 'this is job 5 data',
};

THELOOP:
foreach my $thing (keys %{$things_hr}) {
    print "THING = $thing\n";
    $pfm->start and next THELOOP;
    my $pid = $$;
    my $returned_data = {
        'item1' => "item1 from pid $pid, for item $thing and value ".$things_hr->{$thing},
        'item2' => "item2 from pid $pid, for item $thing and value ".$things_hr->{$thing},
        "item3 are some array refs for pid: $pid" => [1,2,3,4],
    };
    my $serialised_data = Sereal::Encoder::encode_sereal($returned_data);
    print "pid=$pid, this is what I am sending:\n".dump($returned_data)."\n";
    $pfm->finish(0, \$serialised_data);
}
$pfm->wait_all_children;
print "Here are the results:\n".dump(%results)."\n";
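Regarding the object question above: a quick way to probe whether a particular structure survives the round-trip is to encode and decode it inside an eval and compare the dumps; Sereal normally croaks on things it cannot serialise (code references, file handles and the like), so the eval catches those cases. This is only a sketch for testing your own data ($candidate is a hypothetical stand-in), not a statement of what Sereal does or does not support:

#!/usr/bin/env perl
use strict;
use warnings;
use Sereal::Encoder qw(encode_sereal);
use Sereal::Decoder qw(decode_sereal);
use Data::Dump qw/dump/;

# a hypothetical structure standing in for whatever your child produces
my $candidate = { 'id_1' => { 'thing_1' => { 'a' => 1, 'b' => 4.5, 'c' => 1200 } } };

# encode then decode; eval catches a croak from unserialisable contents
my $roundtripped = eval { decode_sereal(encode_sereal($candidate)) };
if ($@) {
    print "this structure will not survive serialisation: $@\n";
} else {
    print "original : ".dump($candidate)."\n";
    print "roundtrip: ".dump($roundtripped)."\n";
}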

bw, bliako

Re^2: Parallel::ForkManager and multiple datasets
by Speed_Freak (Sexton) on Jul 06, 2018 at 12:57 UTC

    Thanks! Working on trying this out now.

    In reading through this, I do see a problem that I'm not entirely sure how to handle. I have a multidimensional hash that is created in the loop, and values are pulled from it later in the script. Will I need to rewrite all the follow-on code to accommodate the extra layer of data (the $pid)? Or is there a way to "push" each de-serialized chunk into the parent structure without changing the child structure?

    Won't this line: $results{$pid} = $data; turn this:

    $VAR1 = {
        'id_1' => {
            'thing_1' => { 'a' => 1, 'b' => 4.5, 'c' => 1200 },
            'thing_2' => { 'a' => 0, 'b' => 3.2, 'c' => 100 },
        },
        'id_2' => {
            'thing_1' => { 'a' => 1, 'b' => 4.5, 'c' => 1200 },
            'thing_2' => { 'a' => 0, 'b' => 3.2, 'c' => 100 },
        },
    };

    into something much more complex, since each child is forked on the list of things and then loops through a list of 1 million ids?

    The code has a for loop inside a for loop. I am trying to fork at the main (outer) loop, which will generate around 200 child processes. The inner loop then repeats one million times. The data structure is keyed on the inner loop first, then the outer loop. So there are a million ids, around 200 things per id, and 6 or so placeholders per thing. I'm worried that putting the $pid in front of the data structure for each child process will add a ton of data to the hash.

      I am not sure I understand correctly what the challenge is. Each child must return its results as a data chunk independent of any other child's. The run_on_finish() sub receives each child's data and, in my example, puts all the children's data together in a hash keyed on the child's pid (see the note below about that). Why? Because I had assumed you want to keep each child's results separate, since it is possible that child1 returns data with id=12 and so can child2. If that is not necessary, e.g. if each child returns results whose keys never clash with those of other children, then fine, it is not set in stone; just merge the children's returned hashes into a larger hash like so:

      my %results = ();
      $pfm->run_on_finish( sub {
          my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data_structure_reference) = @_;
          my $data = Sereal::Decoder::decode_sereal($$data_structure_reference);
          # surely this is sequential code here so no need to lock %results, right?
          @results{keys %$data} = values %$data;
      });

      This will create a "flatter" hash without pid information, but there is the risk of key clashes: if %child1 contains key id=12 and %child2 also contains key id=12 (at the top level of their hashes), the new hash %results can of course hold only one value for it, and that will be whatever the last child to be merged put there.
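      If you want to be warned when such a clash happens, here is a minimal sketch of the same merge with a clash check (the warn-and-overwrite policy is just my assumption; you may prefer to die, or to keep the first value instead). It reuses $pfm and %results from the snippet above:

      $pfm->run_on_finish( sub {
          my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data_structure_reference) = @_;
          my $data = Sereal::Decoder::decode_sereal($$data_structure_reference);
          for my $key (keys %$data) {
              # warn if another child has already returned this top-level key
              warn "key '$key' already returned by another child, overwriting\n" if exists $results{$key};
              $results{$key} = $data->{$key};
          }
      });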

      A nested hash is probably more efficient than one flat hash of 1 million items, at least as far as possible key collisions are concerned; other Monks can correct me on that. In general, I would assume that a hash with 1 million items is child's play for Perl.

      Note on using PIDs as hash keys: using a child's pid as a hash key to collect that child's results is not a good idea, because pid numbers can be recycled by the OS and two children at different times may end up with the same pid. A better idea is to assign each child its own unique id, drawn from a pool of unique ids and given to the child at fork, just like its input data.
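      Parallel::ForkManager already has a hook for that: start() accepts an optional identification value, which is handed back to run_on_finish() as its third argument ($ident). A minimal sketch, reusing $pfm, %results and $things_hr from the script above and using the loop key itself as the unique id:

      $pfm->run_on_finish( sub {
          my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data_structure_reference) = @_;
          my $data = Sereal::Decoder::decode_sereal($$data_structure_reference);
          # $ident is whatever we passed to start() below, e.g. 'job1'
          $results{$ident} = $data;
      });

      THELOOP:
      foreach my $thing (keys %{$things_hr}) {
          # pass the job name (or any token unique across children) to start()
          $pfm->start($thing) and next THELOOP;
          my $returned_data = { 'result' => "child working on $thing" }; # whatever the child computes
          my $serialised_data = Sereal::Encoder::encode_sereal($returned_data);
          $pfm->finish(0, \$serialised_data);
      }
      $pfm->wait_all_children;

      This way %results is keyed on the job name rather than on a possibly recycled pid.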

      Let me know if I got something wrong or if you have more questions.

      bw, bliako