Re^3: Need to speed up many regex substitutions and somehow make them a here-doc list (MCE solution)

The OP mentioned a large number of text files (thousands to millions at a time, up to a couple of MB each). I think parallelization is better done at the file level: create a list of input files and chunk that list instead. Since the list may hold anywhere from thousands to millions of entries, go with a chunk_size of 1 or 2.

Notice that the workers are spawned early, before the large array is created. Then create the array and pass a reference to it to MCE, which avoids making an extra copy. This is how to tackle a big job while keeping overhead low. And then, fasten your seat belt and enjoy watching the parallelization in top or htop.

use strict;
use warnings;
use MCE;
use Time::HiRes 'time';

sub process_file {
    my ($file) = @_;
    # per-file work goes here; left empty so the run measures IPC overhead only
}

my $mce = MCE->new(
    max_workers => MCE::Util::get_ncpu(),
    chunk_size  => 2,
    user_func   => sub {
        my ($mce, $chunk_ref, $chunk_id) = @_;
        process_file($_) for @{ $chunk_ref };
    }
)->spawn;

my @file_list = (1 .. 1_000_000); # simulate a list of 1 million files

my $start = time;
$mce->process(\@file_list);
printf "%0.3f seconds\n", time - $start;

$mce->shutdown; # reap workers
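
The process_file stub above is empty on purpose. For the real job it might look something like the sketch below. The two rules in @rules and the helper name apply_rules are made up for illustration; the OP's actual substitution list goes there. It slurps the file, applies each rule, and rewrites the file only when something changed.

```perl
use strict;
use warnings;

# Hypothetical rules for illustration only: [ pattern, replacement ].
my @rules = (
    [ qr/\bcolour\b/ => 'color' ],   # spelling fix
    [ qr/[ \t]+$/m   => ''      ],   # strip trailing blanks on each line
);

# Apply every rule, in order, to a string and return the result.
sub apply_rules {
    my ($text) = @_;
    for my $rule (@rules) {
        my ($re, $replacement) = @{ $rule };
        $text =~ s/$re/$replacement/g;
    }
    return $text;
}

# Slurp one file and rewrite it in place only if a rule changed it.
# Returns 1 if rewritten, 0 if unchanged, nothing on I/O failure.
sub process_file {
    my ($file) = @_;

    open my $in, '<', $file or do { warn "read $file: $!"; return };
    my $text = do { local $/; <$in> };
    close $in;

    my $new = apply_rules($text);
    return 0 if $new eq $text;

    open my $out, '>', $file or do { warn "write $file: $!"; return };
    print {$out} $new;
    close $out;
    return 1;
}
```

Rewriting in place is the simplest thing to show here; writing to a temporary file and renaming it would be safer if a run can be interrupted.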

Let's measure the IPC overhead. I was curious about it myself.

chunk_size   1  3.773 seconds    1 million chunks
chunk_size   2  1.930 seconds  500 thousand chunks
chunk_size  10  0.423 seconds  100 thousand chunks
chunk_size  20  0.234 seconds   50 thousand chunks

It is mind-boggling nonetheless: just a fraction of a second for 50 thousand chunks. Moreover, 2 seconds of overhead will not be felt when processing 500 thousand files, nor will 4 seconds when handling 1 million files.
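
To put the timings in per-chunk terms, here is a small sketch that divides each elapsed time by the number of chunks. The figures are copied from the runs above; the helper name per_chunk_usec is just something I made up for this calculation. On these numbers the cost works out to roughly 4 to 5 microseconds per chunk.

```perl
use strict;
use warnings;

# [ chunk_size, elapsed seconds ] from the runs above, 1 million items each.
my @runs  = ( [ 1, 3.773 ], [ 2, 1.930 ], [ 10, 0.423 ], [ 20, 0.234 ] );
my $items = 1_000_000;

# Elapsed time divided by the number of chunks, in microseconds.
sub per_chunk_usec {
    my ($chunk_size, $seconds) = @_;
    my $chunks = $items / $chunk_size;
    return $seconds / $chunks * 1e6;
}

for my $run (@runs) {
    printf "chunk_size %3d  %.2f microseconds per chunk\n",
        $run->[0], per_chunk_usec(@{ $run });
}
```

Against a file of up to a couple of MB, a few microseconds of IPC per chunk is negligible.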
