
Re^2: Adding parallel processing to a working Perl script

by Jim (Curate)
on Apr 26, 2014 at 06:15 UTC ( [id://1083900] )


in reply to Re: Adding parallel processing to a working Perl script
in thread Adding parallel processing to a working Perl script

This is a terrific response. Thank you very much, Preceptor. Your post titled A basic 'worker' threading example is exactly the kind of beginning Perl threads tutorial I was looking for. I'll study it this weekend and then try to apply its lessons to my application.

Redraft your code such that you have a 'worker' subroutine, which handles one thing at a time.

Here's my refactored code. My intention was to make it readily adaptable to threading. The intended 'worker' subroutine is probe_volume(). I've probably missed the mark entirely, but with guidance from you and other kind monks, I'm hoping I can finally write my first truly useful parallel program.

#!perl
#
# CountFilesRecords.pl

use strict;
use warnings;

use Capture::Tiny qw( capture_stdout );
use English qw( -no_match_vars );
use File::Glob qw( bsd_glob );
use Text::CSV_XS;

@ARGV or die "Usage: perl $PROGRAM_NAME <export volume folder> ...\n";

# Expand globs...
local @ARGV = map { $ARG =~ tr{\\}{/}; bsd_glob($ARG) } @ARGV;

local $OUTPUT_RECORD_SEPARATOR = "\n";
local $OUTPUT_AUTOFLUSH = 1;

my @CSV_FIELD_LABELS = qw(
    ExportVolumeFolder
    TotalDATRecords
    TotalTextFiles
    TotalLFPRecords
    TotalImageFiles
);

for my $volume_folder (@ARGV) {
    -d $volume_folder
        or die "Export volume folder $volume_folder doesn't exist\n";
}

my @volume_folders;
my %stuff_by;

VOLUME_FOLDER:
for my $volume_folder (@ARGV) {
    my $volume_name   = (split m{/}, $volume_folder)[-1];
    my $text_folder   = "$volume_folder/TEXT";
    my $images_folder = "$volume_folder/IMAGES";
    my $dat_file      = "$volume_folder/$volume_name.dat";
    my $lfp_file      = "$volume_folder/$volume_name.lfp";

    # Check for completed export volumes, report incomplete ones...
    unless (-d $text_folder && -d $images_folder && -f $dat_file && -f $lfp_file) {
        select STDERR;
        print $volume_folder;
        select STDOUT;
        next VOLUME_FOLDER;
    }

    push @volume_folders, $volume_folder;

    $stuff_by{$volume_folder} = {
        FOLDER_NAME => $volume_folder,
        TEXT_FILES  => {
            COMMAND => qq( find "$text_folder" -type f -name "*.txt" | wc -l ),
            COUNT   => 0,
        },
        IMAGE_FILES => {
            COMMAND => qq( find "$images_folder" -type f ! -name Thumbs.db | wc -l ),
            COUNT   => 0,
        },
        DAT_RECORDS => {
            COMMAND => qq( wc -l "$dat_file" ),
            COUNT   => 0,
        },
        LFP_RECORDS => {
            COMMAND => qq( wc -l "$lfp_file" ),
            COUNT   => 0,
        },
    };
}

# Quit if there are no completed export volume folders...
exit 1 unless @volume_folders;

my $csv = Text::CSV_XS->new();

# Print CSV header...
$csv->print(\*STDOUT, \@CSV_FIELD_LABELS);

for my $volume_folder (@volume_folders) {
    # Print CSV record...
    $csv->print(\*STDOUT, probe_volume($stuff_by{$volume_folder}));
}

exit 0;

sub probe_volume {
    my $vol = shift;

    for my $stuff (qw( TEXT_FILES IMAGE_FILES DAT_RECORDS LFP_RECORDS )) {
        (undef, $vol->{$stuff}{COUNT}) = capture_stdout {
            count_stuff($vol->{$stuff}{COMMAND})
        };
    }

    # The first line of every DAT file is a header
    $vol->{DAT_RECORDS}{COUNT}--;

    return [
        $vol->{FOLDER_NAME},
        $vol->{DAT_RECORDS}{COUNT},
        $vol->{TEXT_FILES}{COUNT},
        $vol->{LFP_RECORDS}{COUNT},
        $vol->{IMAGE_FILES}{COUNT},
    ];
}

sub count_stuff {
    my $command  = shift;
    my $output   = qx( $command );
    my ($count)  = $output =~ m/(\d+)/;
    return $count;
}

Replies are listed 'Best First'.
Re^3: Adding parallel processing to a working Perl script
by Preceptor (Deacon) on Apr 28, 2014 at 10:37 UTC

    I think you may still be trying to pass a bit too much back and forth. Thread::Queue is a lovely way of handling queuing, but it works best with single values. You're passing a hash into probe_volume, which works single-threaded but can get quite complicated once you're multithreading.

    I think you need to step back a little and consider the design: threading increases throughput through parallelism, but as a result each of your threads runs asynchronously and non-deterministically, so you will never know in which order your threads will complete their tasks. You therefore can't do something like 'print probe_volume' - you'll have to collate your data and (potentially) reorder it first.

    You will also need to think about sharing variables - you pass a hash into probe_volume and return a list, and this will probably cause you pain. Sharing variables between threads is potentially quite complicated and a source of some really annoying bugs. Try to avoid doing it.

    I would therefore suggest that what you want is a 'standalone' probe_volume subroutine that takes _just_ a volume name (passed via a sub call, but ideally 'fed' through a Thread::Queue) and outputs the results the same way (returned from the sub call, or through a Thread::Queue), without using anything from the global namespace. (Read-only access to, e.g., the command definitions would be OK.)
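
    For what it's worth, here is a minimal sketch of that queue-fed design, using the standard threads and Thread::Queue modules. The worker count, the placeholder probe_volume body, and the comma-joined result format are illustrative assumptions, not part of Jim's script; the real counting commands from the code above would slot into probe_volume.

    #!perl
    # Sketch only: a queue-fed worker pool where each queue item is a
    # single scalar, so nothing needs to be marked :shared.

    use strict;
    use warnings;

    use threads;
    use Thread::Queue;

    my $NUM_WORKERS = 4;    # assumption: tune to your disks/CPUs

    my $work_queue   = Thread::Queue->new();
    my $result_queue = Thread::Queue->new();

    # Placeholder worker body: takes just a volume folder name and
    # returns one scalar. The real counting logic would go here.
    sub probe_volume {
        my $volume_folder = shift;
        my $text_count = () = glob "$volume_folder/TEXT/*.txt";
        return join ',', $volume_folder, $text_count;
    }

    # Each worker pulls one folder name at a time from the work queue
    # and pushes one scalar back on the result queue.
    sub worker {
        while (defined(my $volume_folder = $work_queue->dequeue())) {
            $result_queue->enqueue(probe_volume($volume_folder));
        }
    }

    my @workers = map { threads->create(\&worker) } 1 .. $NUM_WORKERS;

    # Feed the queue, then send one undef per worker as an
    # end-of-work marker.
    $work_queue->enqueue(@ARGV);
    $work_queue->enqueue(undef) for 1 .. $NUM_WORKERS;

    $_->join() for @workers;

    # Results arrive in completion order, not submission order;
    # collate and reorder here if the output order matters.
    while (defined(my $line = $result_queue->dequeue_nb())) {
        print "$line\n";
    }

    Because every item on either queue is a single scalar, the workers never touch shared data structures, which sidesteps the shared-variable pain described above.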
