http://qs321.pair.com?node_id=11115499


in reply to Re^2: Create parallel database handles... (MCE::Loop)
in thread Create parallel database handles or SQL statements for multi-threaded/process access to Postgres DB using DBI, DBD::Pg and Parallel::ForkManager

Hi again perlygapes,

The MCE::Loop code simply abstracts away all your Parallel::ForkManager logic and improves it, just as Parallel::ForkManager abstracts away and improves some of the tedious manual work of using fork() directly. Note how the logic is encapsulated in a sub, just like in your code, only with less concurrency boilerplate.

"using a separate DB connection instead for each child feels intuitively right"

I agree; the code I shared opens one connection per child, and each child stays alive and handles multiple jobs from the job list as managed by MCE.
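As an illustration, here's a minimal sketch of that pattern with MCE::Loop (the DSN, credentials, table and job list are placeholders, not your actual values): each worker connects once in user_begin, then reuses that handle for every job it pulls from the list.

use strict;
use warnings;
use DBI;
use MCE::Loop;

my $dbh;    # per-worker database handle

MCE::Loop->init(
    max_workers => 4,
    chunk_size  => 1,
    user_begin  => sub {
        # runs once per worker: open one connection per child
        $dbh = DBI->connect(
            'dbi:Pg:dbname=testdb;host=localhost',    # placeholder DSN
            'user', 'pass',
            { AutoCommit => 1, RaiseError => 1 },
        );
    },
    user_end    => sub {
        $dbh->disconnect if $dbh;    # tidy up when the worker exits
    },
);

# each worker stays alive and handles multiple jobs from this list
mce_loop {
    my $job = $_;    # with chunk_size => 1, $_ holds a single item
    my ($count) = $dbh->selectrow_array(
        'SELECT count(*) FROM mytable WHERE batch = ?', undef, $job );
    MCE->say("batch $job: $count rows");
} 1 .. 20;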

Here's a simpler example I've shared recently showing how to parallelize existing code for making a series of HTTP requests. How would you do the same using P::FM?

single process

use strict;
use warnings;
use 5.010;
use Data::Dumper;
use HTTP::Tiny;
use Time::HiRes 'gettimeofday', 'tv_interval';

my $ua   = HTTP::Tiny->new( timeout => 10 );
my @urls = qw< gap.com amazon.com ebay.com lego.com wunderground.com
               imdb.com underarmour.com disney.com espn.com dailymail.com >;

my %report;
foreach (@urls) {
    my $start = [gettimeofday];
    $ua->get( 'https://' . $_ );
    $report{$_} = tv_interval( $start, [gettimeofday] );
}
say Dumper \%report;

six processes
(workers stay alive, looping through the list, writing to a shared hash)
(one added line, two slightly changed lines)

use strict;
use warnings;
use 5.010;
use Data::Dumper;
use HTTP::Tiny;
use Time::HiRes 'gettimeofday', 'tv_interval';
use MCE;
use MCE::Shared;

my $ua   = HTTP::Tiny->new( timeout => 10 );
my @urls = qw< gap.com amazon.com ebay.com lego.com wunderground.com
               imdb.com underarmour.com disney.com espn.com dailymail.com >;

my $report = MCE::Shared->hash;
MCE->new( max_workers => 6 )->foreach( \@urls, sub {
    my $start = [gettimeofday];
    $ua->get( 'https://' . $_ );
    $report->set( $_, tv_interval( $start, [gettimeofday] ) );
});
say Dumper $report->export;

Update: fixed error in first demo code, ++choroba

Hope this helps!



The way forward always starts with a minimal test.

Re^4: Create parallel database handles... (MCE::Loop)
by perlygapes (Sexton) on May 08, 2020 at 06:10 UTC
    Something I just realised I neglected to mention in my example is that I need to apply CPU affinity in the script. That is, I need to be able to specify that 'worker 1' MUST use CPU0, 'worker 2' MUST use CPU1, etc.

    This is because I need another parallel code block in which each worker launches an external single-threaded executable that reads from another DB and writes results to a third DB, but these instances MUST NOT access and write to the same table at the same time. The affinity is, in essence, to avoid access conflicts/violations.

    How can this be done in MCE?

    Thanks again.

      Hi again,

      Now that's a classic XY problem statement! One usually gets better help by asking about how to achieve the goal, not how to implement the technique one has already decided is the way to achieve it ;-)

      I can think of no reason why one should ever have to concern oneself with which CPU core was used by a given worker. You should be able to write a program where you don't even have to concern yourself with workers.

      From your problem description, it sounds like you might need some kind of job queue. You can achieve this in many ways, but if you are already using MCE for parallelization, you can use MCE::Flow and MCE::Queue to handle enqueuing jobs based on the output of the first task, handled by multiple workers. Look at the demo shown in the MCE::Flow doc; a sketch of that pattern follows.
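      Here's a minimal sketch of that demo (the job names below are made up, not code against your actual DBs): one producer task fills a shared queue while a pool of consumer tasks dequeues and processes jobs until the queue is ended.

      use strict;
      use warnings;
      use MCE::Flow;
      use MCE::Queue;

      my $q = MCE::Queue->new;

      mce_flow {
          max_workers => [ 1, 4 ],
          task_name   => [ 'producer', 'consumer' ],
          task_end    => sub {
              my ( $mce, $task_id, $task_name ) = @_;
              # when the producer finishes, end the queue so
              # consumers stop waiting on dequeue
              $q->end if $task_name eq 'producer';
          },
      },
      sub {
          # producer: the first task fills the queue
          $q->enqueue("job_$_") for 1 .. 20;
      },
      sub {
          # consumers: four workers pull jobs until the queue ends
          while ( defined ( my $job = $q->dequeue ) ) {
              MCE->say( MCE->wid . " processing $job" );
          }
      };

      No worker ever needs to know which CPU core it runs on; the queue guarantees each job is handed to exactly one worker.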

      Hope this helps!


      The way forward always starts with a minimal test.
        I don't know why I missed your answer, but thanks again very much. Sorry it took me so long to respond.
        Yes, there is a very specific reason why I want to create CPU affinity: I am using this script to launch multiple instances of an old single-threaded application, and each instance will be working on the same overall dataset, but the dataset is a collection of files and I do not want any file clobbering. I am specifically and purposefully trying to eliminate any chance of one of these processes interfering with (even just reading) a file another process is currently working on.

        Sorry if I seem to you to be asking basic questions - I am a plumber by trade...teaching oneself to code is very difficult.

        I tried amending your example slightly to this - and it does not give the result I expect with regard to the $process values:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use 5.010;
        use Data::Dumper;
        use HTTP::Tiny;
        use Time::HiRes 'gettimeofday', 'tv_interval';
        use MCE;
        use MCE::Shared;

        my $ua   = HTTP::Tiny->new( timeout => 10 );
        my @urls = qw< gap.com amazon.com ebay.com lego.com wunderground.com
                       imdb.com underarmour.com disney.com espn.com dailymail.com >;

        my $report  = MCE::Shared->hash;
        my $process = MCE::Shared->scalar;
        $process = 0;
        MCE->new( max_workers => 6 )->foreach( \@urls, sub {
            my $start = [gettimeofday];
            $process++;
            say $process . "->GETting https://" . $_;
            $ua->get( 'https://' . $_ );
            $report->set( $_, tv_interval( $start, [gettimeofday] ) );
        });
        say Dumper $report->export;

        The output reads:
        1->GETting https://gap.com
        1->GETting https://amazon.com
        1->GETting https://ebay.com
        1->GETting https://lego.com
        1->GETting https://wunderground.com
        1->GETting https://imdb.com
        2->GETting https://underarmour.com
        2->GETting https://disney.com
        2->GETting https://espn.com
        3->GETting https://dailymail.com
        $VAR1 = bless( {
            'disney.com' => '1.15682',
            'amazon.com' => '4.607657',
            'wunderground.com' => '0.46855',
            'dailymail.com' => '2.355818',
            'espn.com' => '1.170818',
            'gap.com' => '3.819699',
            'ebay.com' => '1.479624',
            'underarmour.com' => '2.919818',
            'imdb.com' => '2.540127',
            'lego.com' => '0.919592'
        }, 'MCE::Shared::Hash' );
        whereas I had expected:
        1->GETting https://gap.com
        2->GETting https://amazon.com
        3->GETting https://ebay.com
        4->GETting https://lego.com
        5->GETting https://wunderground.com
        6->GETting https://imdb.com
        7->GETting https://underarmour.com
        8->GETting https://disney.com
        9->GETting https://espn.com
        10->GETting https://dailymail.com
        $VAR1 = bless( {
            'disney.com' => '1.15682',
            'amazon.com' => '4.607657',
            'wunderground.com' => '0.46855',
            'dailymail.com' => '2.355818',
            'espn.com' => '1.170818',
            'gap.com' => '3.819699',
            'ebay.com' => '1.479624',
            'underarmour.com' => '2.919818',
            'imdb.com' => '2.540127',
            'lego.com' => '0.919592'
        }, 'MCE::Shared::Hash' );

        Can you explain this?

        Thanks.