
Re^2: Adding parallel processing to a working Perl script

by Jim (Curate)
on Apr 26, 2014 at 06:15 UTC ( [id://1083900] )


in reply to Re: Adding parallel processing to a working Perl script
in thread Adding parallel processing to a working Perl script

This is a terrific response. Thank you very much, Preceptor. Your post titled A basic 'worker' threading example is exactly the kind of beginning Perl threads tutorial I was looking for. I'll study it this weekend and then try to apply its lessons to my application.

Redraft your code such that you have a 'worker' subroutine, which handles one thing at a time.

Here's my refactored code. My intention was to make it readily adaptable to threading. The intended 'worker' subroutine is probe_volume(). I've probably missed the mark entirely, but with guidance from you and other kind monks, I'm hoping I can finally write my first truly useful parallel program.

#!perl
#
# CountFilesRecords.pl

use strict;
use warnings;

use Capture::Tiny qw( capture_stdout );
use English qw( -no_match_vars );
use File::Glob qw( bsd_glob );
use Text::CSV_XS;

@ARGV or die "Usage: perl $PROGRAM_NAME <export volume folder> ...\n";

# Expand globs...
local @ARGV = map { $ARG =~ tr{\\}{/}; bsd_glob($ARG) } @ARGV;

local $OUTPUT_RECORD_SEPARATOR = "\n";
local $OUTPUT_AUTOFLUSH = 1;

my @CSV_FIELD_LABELS = qw(
    ExportVolumeFolder
    TotalDATRecords
    TotalTextFiles
    TotalLFPRecords
    TotalImageFiles
);

for my $volume_folder (@ARGV) {
    -d $volume_folder
        or die "Export volume folder $volume_folder doesn't exist\n";
}

my @volume_folders;
my %stuff_by;

VOLUME_FOLDER:
for my $volume_folder (@ARGV) {
    my $volume_name   = (split m{/}, $volume_folder)[-1];
    my $text_folder   = "$volume_folder/TEXT";
    my $images_folder = "$volume_folder/IMAGES";
    my $dat_file      = "$volume_folder/$volume_name.dat";
    my $lfp_file      = "$volume_folder/$volume_name.lfp";

    # Check for completed export volumes, report incomplete ones...
    unless (-d $text_folder && -d $images_folder && -f $dat_file && -f $lfp_file) {
        select STDERR;
        print $volume_folder;
        select STDOUT;
        next VOLUME_FOLDER;
    }

    push @volume_folders, $volume_folder;

    $stuff_by{$volume_folder} = {
        FOLDER_NAME => $volume_folder,
        TEXT_FILES  => {
            COMMAND => qq( find "$text_folder" -type f -name "*.txt" | wc -l ),
            COUNT   => 0,
        },
        IMAGE_FILES => {
            COMMAND => qq( find "$images_folder" -type f ! -name Thumbs.db | wc -l ),
            COUNT   => 0,
        },
        DAT_RECORDS => {
            COMMAND => qq( wc -l "$dat_file" ),
            COUNT   => 0,
        },
        LFP_RECORDS => {
            COMMAND => qq( wc -l "$lfp_file" ),
            COUNT   => 0,
        },
    };
}

# Quit if there are no completed export volume folders...
exit 1 unless @volume_folders;

my $csv = Text::CSV_XS->new();

# Print CSV header...
$csv->print(\*STDOUT, \@CSV_FIELD_LABELS);

for my $volume_folder (@volume_folders) {
    # Print CSV record...
    $csv->print(\*STDOUT, probe_volume($stuff_by{$volume_folder}));
}

exit 0;

sub probe_volume {
    my $vol = shift;

    for my $stuff (qw( TEXT_FILES IMAGE_FILES DAT_RECORDS LFP_RECORDS )) {
        (undef, $vol->{$stuff}{COUNT}) = capture_stdout {
            count_stuff($vol->{$stuff}{COMMAND})
        };
    }

    # The first line of every DAT file is a header
    $vol->{DAT_RECORDS}{COUNT}--;

    return [
        $vol->{FOLDER_NAME},
        $vol->{DAT_RECORDS}{COUNT},
        $vol->{TEXT_FILES}{COUNT},
        $vol->{LFP_RECORDS}{COUNT},
        $vol->{IMAGE_FILES}{COUNT},
    ];
}

sub count_stuff {
    my $command  = shift;
    my $output   = qx( $command );
    my ($count)  = $output =~ m/(\d+)/;
    return $count;
}

Replies are listed 'Best First'.
Re^3: Adding parallel processing to a working Perl script
by Preceptor (Deacon) on Apr 28, 2014 at 10:37 UTC

    I think you may still be trying to pass a bit too much back and forth. Thread::Queue is a lovely way of handling queuing, but it works best with single values. You're passing a hash into probe_volume, which works single-threaded but can get quite complicated once you're multithreading.

    I think you need to step back a little and consider the design: threading increases throughput through parallelism, but as a result each of your threads runs asynchronously and non-deterministically, so you will never know in which order your threads will complete their tasks. You therefore can't do something like 'print probe_volume' - you'll have to collate your data and (potentially) reorder it first.

    You will also need to think about sharing variables - you pass a hash into probe_volume and return a list, and this will probably cause you pain. Sharing variables between threads is potentially quite complicated and a source of some really annoying bugs. Try to avoid doing it.

    I would therefore suggest that what you want is a 'standalone' probe_volume subroutine that takes _just_ a volume name (passed via a sub call, but ideally 'fed' through a Thread::Queue) and outputs the results the same way (returned from the sub call, or through a Thread::Queue), without using anything from the global namespace. (Read-only access to, e.g., the command definitions would be OK.)
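
    For what it's worth, here is a minimal sketch of that queue-fed design, using the standard threads and Thread::Queue modules. The worker count, the placeholder probe_volume body, and the comma-joined result format are illustrative assumptions, not part of Jim's script; the real counting commands from the code above would slot into probe_volume.

    #!perl
    # Sketch only: a queue-fed worker pool where each queue item is a
    # single scalar, so nothing needs to be marked :shared.

    use strict;
    use warnings;

    use threads;
    use Thread::Queue;

    my $NUM_WORKERS = 4;    # assumption: tune to your disks/CPUs

    my $work_queue   = Thread::Queue->new();
    my $result_queue = Thread::Queue->new();

    # Placeholder worker body: takes just a volume folder name and
    # returns one scalar. The real counting logic would go here.
    sub probe_volume {
        my $volume_folder = shift;
        my $text_count = () = glob "$volume_folder/TEXT/*.txt";
        return join ',', $volume_folder, $text_count;
    }

    # Each worker pulls one folder name at a time from the work queue
    # and pushes one scalar back on the result queue.
    sub worker {
        while (defined(my $volume_folder = $work_queue->dequeue())) {
            $result_queue->enqueue(probe_volume($volume_folder));
        }
    }

    my @workers = map { threads->create(\&worker) } 1 .. $NUM_WORKERS;

    # Feed the queue, then send one undef per worker as an
    # end-of-work marker.
    $work_queue->enqueue(@ARGV);
    $work_queue->enqueue(undef) for 1 .. $NUM_WORKERS;

    $_->join() for @workers;

    # Results arrive in completion order, not submission order;
    # collate and reorder here if the output order matters.
    while (defined(my $line = $result_queue->dequeue_nb())) {
        print "$line\n";
    }

    Because every item on either queue is a single scalar, the workers never touch shared data structures, which sidesteps the shared-variable pain described above.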
