> Please feel free to respond away with solutions...

I made a parallel demonstration. The results are taken from a 32-core Linux box. On Windows, run Cygwin's Perl for best performance.

Results

# Modification: our (%by_count, %by_word);
$ time perl choroba_mod.pl big1.txt big2.txt big3.txt >oo1.txt
start
get properties: 9 secs
sort + output: 14 secs
total: 23 secs

real  0m 23.083s
user  0m 22.568s
sys   0m  0.491s

# Run parallel on physical cores: max_workers => 4
$ time taskset -c 0-31 perl mce.pl big1.txt big2.txt big3.txt >oo2.txt
start
get properties + pre-sort: 14 secs
final sort + output: 2 secs
total: 16 secs

real  0m 15.434s
user  0m 52.223s
sys   0m  0.824s

# Verify correctness:
$ diff cpp.tmp oo1.txt  # choroba
$ diff cpp.tmp oo2.txt  # MCE  4 workers

Parallel code

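Basically, each worker processes one starting letter, "a" through "z", and the results are gathered in letter order. I was hoping to gather an array of dualvars, but I had forgotten that serialization removes the numeric part, so each worker sends flat (count, word) pairs and the gather routine rebuilds the dualvars.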
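To make the dualvar point concrete, here is a minimal sketch of the flatten-and-rebuild idea in isolation. It does not use MCE; the variable names and the sample word are made up for illustration.

```perl
use strict;
use warnings;
use Scalar::Util qw{ dualvar };

# A dualvar carries a numeric value and a string value at once.
my $dv = dualvar(42, 'apple');    # hypothetical sample: count 42, word "apple"

# Flatten it into a plain (number, string) pair before sending it anywhere
# that might keep only one of the two values.
my @flat = (0 + $dv, "$dv");

# Rebuild the dualvar on the receiving side.
my $rebuilt = dualvar($flat[0], $flat[1]);

printf "%s\t%d\n", $rebuilt, $rebuilt;
```

The script below does the same thing across processes: the worker pushes `0+$_, "$_"` pairs and the gather routine calls `dualvar` on each pair.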

#!/usr/bin/env perl
# https://www.perlmonks.org/?node_id=11148465

use warnings;
use strict;
use feature qw{ say };
use Scalar::Util qw{ dualvar };
use MCE;

die "Usage: $0 input1 [ input2 ... ]\n" unless @ARGV;

# Ensure given input files are readable.
my @infiles = @ARGV; @ARGV = ();

for (@infiles) {
    die "Cannot open '$_'" unless -r "$_";
}

# MCE gather and parallel routines.
our @DATA;

sub gather_routine {
    my ($data_ref) = @_;
    while (@{ $data_ref }) {
        push @DATA, dualvar(
            shift @{ $data_ref },
            shift @{ $data_ref }
        );
    }
}

sub parallel_routine {
    my ($char, %by_word, @data, @ret) = ($_);

    for my $file (@infiles) {
        open my $fh, '<', $file or die "Cannot open '$file': $!";
        while (<$fh>) {
            if (substr($_,0,1) eq $char) {
                chomp;
                my ($k, $v) = split /\t/, $_;
                $by_word{$k} += $v;
            }
        }
        close $fh;
    }

    while (my ($k, $v) = each %by_word) {
        push @data, dualvar($v, $k);
    }

    push(@ret, 0+$_, "$_") for sort @data;

    MCE::relay { MCE->gather(\@ret) };
}

# Run parallel using MCE.
warn "start\n";
my $tstart1 = time;

MCE->new(
    input_data  => ['a'..'z'],
    max_workers => 7,
    chunk_size  => 1,
    init_relay  => 1,
    posix_exit  => 1,
    gather      => \&gather_routine,
    user_func   => \&parallel_routine,
    use_threads => 0,
)->run(1);

my $tend1 = time;
warn "get properties + pre-sort: ", $tend1 - $tstart1, " secs\n";

# Output dualvar data, sorted by count.
$| = 0;  # enable output buffering

my $tstart2 = time;
say "$_\t".(0+$_) for sort { $b <=> $a } @DATA;
my $tend2 = time;

warn "final sort + output: ", $tend2 - $tstart2, " secs\n";
warn "total: ", $tend2 - $tstart1, " secs\n";
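For anyone who wants to follow the data flow without installing MCE, here is a sequential sketch of the same steps: aggregate per starting letter, pre-sort each letter's words, append in letter order, then do the final sort by count. The sample lines are made up, and the explicit `cmp` tie-breaker is my addition for the sketch; the script above sorts with `$b <=> $a` alone on input that is already in word order.

```perl
use strict;
use warnings;
use feature qw{ say };
use Scalar::Util qw{ dualvar };

# Hypothetical sample input: "word\tcount" lines.
my @lines = ("banana\t2", "apple\t3", "avocado\t5", "banana\t3", "apple\t1");

my @all;
for my $char ('a' .. 'z') {
    # Per-letter step: sum the counts of words starting with $char.
    my %by_word;
    for my $line (@lines) {
        next unless substr($line, 0, 1) eq $char;
        my ($k, $v) = split /\t/, $line;
        $by_word{$k} += $v;
    }
    # Pre-sort this letter's words, then append them in letter order.
    push @all, map { dualvar($by_word{$_}, $_) } sort keys %by_word;
}

# Final step: count descending; equal counts fall back to word order.
my @sorted = sort { $b <=> $a || $a cmp $b } @all;
say "$_\t", 0 + $_ for @sorted;
```

With the sample lines this prints avocado (5), banana (5), apple (4). In the parallel version, the per-letter loop body is what each worker runs, and the relay keeps the gather in letter order.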

In reply to Re: Rosetta Code: Long List is Long -- parallel by marioroy
in thread Rosetta Code: Long List is Long by eyepopslikeamosquito
