http://qs321.pair.com?node_id=1211103


in reply to Perl Program to efficiently process 500000 small files in a Directory (AIX)

Some time ago I wrote a program that also read a large number of files into memory by setting $/ to undef. At the time it took 45 minutes to complete. After switching the reading mechanism to File::Slurp, the runtime went down to 3 minutes; I did not change anything else. This will of course depend on a number of factors, but maybe you could give it a try. Your example adapted to File::Slurp:
use File::Slurp;   # Update: or use File::Slurper, which Athanasius mentioned.
use File::Copy qw(move);

opendir(DIR, $dir) or die "$!\n";
while ( defined( my $txtFile = readdir DIR ) ) {
    next if ( $txtFile !~ /\.txt$/ );
    $cnt++;
    my $data      = read_file("$dir/$txtFile");
    my ($channel) = $data =~ /A\|CHNL_ID\|(\d+)/i;
    move( "$dir/$txtFile", "$outDir/$channel" ) or die $!, $/;
}
closedir(DIR);
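If you go with File::Slurper instead, only the slurping call changes; a minimal (untested) sketch of just that substitution, keeping the rest of the loop as above:

use File::Slurper qw(read_text);

# Same loop body as above, only the read differs:
my $data      = read_text("$dir/$txtFile");
my ($channel) = $data =~ /A\|CHNL_ID\|(\d+)/i;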
Since your program blocks while reading and moving each file, you might also want to parallelize it, e.g. with Parallel::ForkManager or with MCE. Then the reading and the moving of the files happen in parallel. To some extent you are of course I/O bound, but I think it should still give you some improvement if implemented correctly.
Update: I whipped up a quick (untested) example for Parallel::ForkManager:
use strict;
use warnings;
use File::Slurp;
use File::Copy qw(move);
use Parallel::ForkManager;

sub read_next_batch_of_filenames {
    my ($DH, $MAX_FILES) = @_;
    my @files = ();
    while (my $fn = readdir $DH) {
        next if ($fn !~ m/\.txt\z/);
        push @files, $fn;
        last if (scalar(@files) >= $MAX_FILES);
    }
    if (@files) {
        return \@files;
    }
    else {
        return;
    }
}

sub move_files {
    my ($dir, $outDir, $files) = @_;
    foreach my $f (@$files) {
        my $data      = read_file("$dir/$f");
        my ($channel) = $data =~ /A\|CHNL_ID\|(\d+)/i;
        move("$dir/$f", "$outDir/$channel")
            or die "Failed to move '$f' to '$outDir/$channel' ($!)\n";
    }
}

sub parallelized_move {
    my $dir    = 'FIXME';
    my $outDir = 'FIXME';

    my $MAX_PROCESSES     = 4;    # tweak this to find the best number
    my $FILES_PER_PROCESS = 1000; # process in batches of 1000, to limit forking

    my $pm = Parallel::ForkManager->new($MAX_PROCESSES);

    opendir my $DH, $dir or die "Failed to open '$dir' for reading ($!)\n";

    DATA_LOOP:
    while (my $files = read_next_batch_of_filenames($DH, $FILES_PER_PROCESS)) {
        # Forks and returns the pid for the child:
        my $pid = $pm->start and next DATA_LOOP;
        move_files($dir, $outDir, $files);
        $pm->finish; # Terminates the child process
    }
    $pm->wait_all_children;
    closedir $DH or die "Failed to close directory handle for '$dir' ($!)\n";
}
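And since MCE was mentioned above, here is an equally untested sketch of the same idea with MCE::Loop, which does the batching (chunking) for you. The directory names are placeholders and it reuses the move_files() helper from the Parallel::ForkManager example, so treat it as a rough outline rather than a drop-in solution:

use strict;
use warnings;
use MCE::Loop;

my $dir    = 'FIXME';
my $outDir = 'FIXME';

# Let MCE manage the worker processes and hand each one a chunk of filenames.
MCE::Loop::init {
    max_workers => 4,
    chunk_size  => 1000,
};

opendir my $DH, $dir or die "Failed to open '$dir' for reading ($!)\n";
my @txt_files = grep { /\.txt\z/ } readdir $DH;
closedir $DH;

mce_loop {
    my ($mce, $chunk_ref, $chunk_id) = @_;
    # Reuses move_files() from the Parallel::ForkManager example above.
    move_files($dir, $outDir, $chunk_ref);
} @txt_files;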