Parallel::ForkManager for any array

by MissPerl (Sexton)
on Oct 16, 2018 at 13:09 UTC

MissPerl has asked for the wisdom of the Perl Monks concerning the following question:

I wrote this simple script after some reading online.

To achieve the fastest execution time, I want the number of jobs to equal the array size (for any array in the script that loops over a repeated process). In other words, instead of running things serially (one after the other), I want to run them in parallel to minimize the time taken.

In this case I want to delete a directory, directory1 (/user/home/directory1/), but I am clueless as to why it does not work: all the files and subdirectories are still there. Does the script automatically assign each file/subdirectory to a child?

Let me know if I have overlooked something. I am ready to learn. Thank you. :)

#! /usr/bin/perl
use Parallel::ForkManager;

$directory1 = "/user/home/directory1/";

my $pm     = Parallel::ForkManager->new(10);
my @files  = glob("$directory1/*");
my $number = scalar(@files);

for (my $i = 0; $i < $number; $i++) {
    $pm->start and next;
    system("rm", "-fr", $files[i]);
    exit(0);
    $pm->finish;
}
$pm->wait_all_children;

Replies are listed 'Best First'.
Re: Parallel::ForkManager for any array
by toolic (Bishop) on Oct 16, 2018 at 13:30 UTC
      Oh yes, what's wrong with me, forgetting strict and warnings. Thank you!

      Oh, it is the $i. Thanks very much!! =D

      I cannot test it right now, but on a side note: say the directory has 8 files and 2 subdirectories. Does that mean it assigns 8 children to the 8 files and 2 children to the 2 subdirectories?

      What if I have 11 files/subdirectories in the directory but only 10 forks? Does the quickest child/process get the 11th?

Re: Parallel::ForkManager for any array
by 1nickt (Canon) on Oct 16, 2018 at 14:18 UTC

    Hello, ++toolic already pointed out the error in your code. Here are a couple of other observations:

    • You don't need to worry about the size of the array. Perl handles that.
      for my $file (@files) {
          # Perl will exit the loop when the array has been processed
      }
    • You don't need to shell out to delete files. See unlink and rmdir and File::Path::remove_tree (a combined sketch follows this list).
      use File::Path 'remove_tree';

      for my $file (@files) {
          unlink($file)      if -f $file;
          remove_tree($file) if -d $file;
      }
    • Consider using bsd_glob from File::Glob instead of just glob to get the contents of a directory.

    • Yes, the first worker to finish will get the next job.

    • It's probably not an optimization to parallelize file deletions as the workers can likely only access the filesystem serially anyway (I am not an expert on that).

    • You will often have conditions that must be met before you want to delete a file, e.g. the file is not empty, or is older than a certain date, etc. For finding files conditionally, see Path::Iterator::Rule.
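
    Putting the first few points together, here is a minimal, untested sketch of your delete loop with those fixes applied (strict/warnings, a foreach loop, bsd_glob, and remove_tree instead of shelling out), using the path from your post:

      use strict;
      use warnings;
      use File::Glob ':bsd_glob';
      use File::Path 'remove_tree';
      use Parallel::ForkManager;

      my $directory1 = "/user/home/directory1";
      my $pm         = Parallel::ForkManager->new(10);
      my @files      = bsd_glob("$directory1/*");

      for my $file (@files) {
          $pm->start and next;              # parent: move on to the next file
          unlink($file)      if -f $file;   # plain file
          remove_tree($file) if -d $file;   # directory and its contents
          $pm->finish;                      # child: exit and report back
      }
      $pm->wait_all_children;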

    Hope this helps!


    The way forward always starts with a minimal test.
      Thanks, 1nickt, for the valuable input!

      remove_tree is definitely new to me!

      Now the code runs and deletes the main directory containing 8 files and 2 subdirectories. The 8 children assigned to the files return fast enough, since the files are lightweight, but the other two children, assigned to delete the 2 subdirectories, are still running.

      I wonder if it is possible to let the returned children help out, or to create more children during the process, to shorten the deletion?

        Possibly you could improve performance by getting a list of files to be deleted with their full path name, and then allowing the workers to loop through that list, rather than handing them a top-level "file" that may be a directory. That would even out the workload among workers.
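
        For example, here is an untested sketch of that flat-list approach with Parallel::ForkManager, using core File::Find to collect the full path names first (the path is the one from the original post):

        use strict;
        use warnings;
        use File::Find;
        use Parallel::ForkManager;

        my $root = '/user/home/directory1';
        my @paths;
        find( sub { push @paths, $File::Find::name }, $root );

        # Deepest paths first, so files are dispatched before the
        # directories that contain them
        @paths = sort { ($b =~ tr{/}{}) <=> ($a =~ tr{/}{}) } @paths;

        my $pm = Parallel::ForkManager->new(10);
        for my $path (@paths) {
            $pm->start and next;
            unlink($path) if -f $path;
            rmdir($path)  if -d $path;   # may fail if another worker has not
                                         # yet emptied it; no error checking
            $pm->finish;
        }
        $pm->wait_all_children;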

        If this is really a problem that can benefit from parallelization, i.e. there are really a lot of files, and the task is not I/O bound (as I suspect), I would consider also using a technique that employs chunking, so each worker is given a block of files to process before pulling the next one, such as is provided by default by the excellent parallelization engine MCE. The following is untested and lacks error checking, debug output, etc., but should give you some ideas:

        use strict;
        use warnings;
        use Path::Iterator::Rule;
        use MCE;

        my $rule = Path::Iterator::Rule->new;
        # Add constraints to the rule here
        my $root = '/some/path';
        my $iter = $rule->iter( $root, { depthfirst => 1 } );

        my @list;
        while ( my $file = $iter->() ) {
            push @list, $file;
        }

        my $chunk_size = 100;    # whatever makes sense for you

        my $mce = MCE->new(
            user_func   => \&task,
            max_workers => 10,
            chunk_size  => $chunk_size,
        );
        $mce->process( \@list );
        $mce->shutdown;

        exit 0;

        sub task {
            my ( $mce, $chunk_ref, $chunk_id ) = @_;
            # With chunk_size > 1, each worker receives an array ref of files
            for my $file ( @{$chunk_ref} ) {
                unlink($file) if -f $file;
                rmdir($file)  if -d $file;
            }
        }
        __END__

        Hope this helps!


        The way forward always starts with a minimal test.
Re: Parallel::ForkManager for any array
by ForgotPasswordAgain (Priest) on Oct 16, 2018 at 17:18 UTC
    You might want to check out GNU parallel (which is written in Perl :)
    $ mkdir subdir
    $ for f in {00..10}; do touch subdir/$f; done
    $ for f in {a..z}; do mkdir subdir/$f; done
    $ parallel --verbose rm -r ::: subdir/*
    rm -r subdir/00
    rm -r subdir/01
    [elided...]
    rm -r subdir/z
    --verbose is just to show that it's doing what you think, but isn't necessary. I didn't test whether it's actually faster or not with parallel.
      Thanks, ForgotPasswordAgain :)

      But I want to learn Parallel::ForkManager at the moment.

      I will surely get back to trying out GNU parallel very soon!
