Re^3: Perl Program to efficiently process 500000 small files in a Directory (AIX)

Perl's built-in rename may be faster than File::Copy's move, but mostly because move contains logic to decide whether it can do a simple rename or must fall back to a copy and unlink (for example, when the move spans filesystems).
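To make that distinction concrete, here is a minimal sketch of the "rename first, copy only if necessary" idea. The helper name fast_move is made up for illustration. It assumes a failed cross-filesystem rename reports EXDEV, which is the usual behaviour on POSIX systems.

```perl
use strict;
use warnings;
use Errno qw(EXDEV);
use File::Copy qw(move);

# Hypothetical helper: try the cheap rename first, and fall back to
# File::Copy::move only when rename fails because the source and
# destination are on different filesystems.
sub fast_move {
    my ($src, $dest) = @_;

    return 1 if rename $src, $dest;

    # EXDEV ("cross-device link") means rename cannot span filesystems,
    # so a copy followed by an unlink is needed; move() handles that.
    if ($! == EXDEV) {
        return move($src, $dest) ? 1 : 0;
    }

    return 0;    # some other failure; $! still holds the reason
}
```

When the source and destination are on the same filesystem, this never copies any file data.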

I was curious how quickly I could burn through 5,000 files while reading them line by line, finishing the read when I find the CHNL_ID line, and then renaming them within the same filesystem to a subdir based on the ID found. So I created a script that does just that:

Create a temporary directory (in /tmp on most Linux systems).
Create 5,000 files in the temporary directory that contain a repetition of the sample input you provided, expanded out to approximately 350k per file. I injected a single CHNL_ID line at a random position within each of the 5,000 files. The ID I generated was an integer between 0 and 31.
Create 32 subdirectories. This may have to be moved into the processor's loop, to occur on demand, if you don't know the range that the IDs will fall within.
Start timing the elapsed time.
Open my temp directory to read its contents.
Iteratively read the directory's contents to grab one file at a time.
Decide if the thing we're looking at is the type of file we want to deal with.
Open one file at a time and get an exclusive lock.
Read the file line by line until we find the CHNL_ID line. Grab the ID integer.
Rename the file into the new subdir.
Close the file so that we drop its lock.
Move on to the next file.
Capture our total elapsed time and display it.

It was interesting to me that after creating the files (which took some time), I was able to process 5,000 of them in under six seconds. My SSD is pretty fast, so your mileage will certainly vary. But I'm not seeing performance being a big problem, particularly where this only runs on a nightly basis. Here's the code:

#!/usr/bin/env perl

use strict;
use warnings;
use feature qw/say/;
use File::Temp;
use File::Path qw(make_path);
use File::Spec::Functions qw(catdir catfile);
use Time::HiRes qw(tv_interval gettimeofday);
use Fcntl qw(:flock);

my $FILE_COUNT = 5_000;
# SETUP - Create $FILE_COUNT files of approximately 350k each, with the
# CHNL_ID line inserted at a random position in each file.


say "Generating $FILE_COUNT temporary files.";
my @base_content = grep {!m/^\QA|CHNL_ID|\E\d+\n/} <DATA>;
@base_content = (@base_content) x 1024;

my $td = File::Temp->newdir(
    TEMPLATE    => 'pm_tempXXXXX',
    TMPDIR      => 1,
    CLEANUP     => 1,
);

for my $n (0 .. 31) {
    make_path(catdir($td->dirname, sprintf("%02d", $n)));
}

for (1 .. $FILE_COUNT) {
    my $rand_ix = int(rand(scalar(@base_content)));
    my $chnl_id = sprintf "%02d", int(rand(32));
    my @output;
    for my $line_ix (0 .. $#base_content) {
        push @output, "A|CHNL_ID|$chnl_id\n" if $line_ix == $rand_ix;
        push @output, $base_content[$line_ix];
    }

    my $tf = File::Temp->new(
        TEMPLATE => 'pm_XXXXXXXXXXXX',
        SUFFIX   => '.txt',
        DIR      => $td->dirname,
        UNLINK   => 0,
    );
    print $tf @output;
    $tf->flush;
    close $tf;
}

# Sample file processor:
say "Processing of $FILE_COUNT files.";
my $t0 = [gettimeofday];

opendir my $dh, $td->dirname
    or die "Cannot open temporary directory (", $td->dirname, "): $!\n";

FILE: while (defined(my $dirent = readdir($dh))) {
    next if     $dirent =~ m/^\.\.?$/;
    next unless $dirent =~ m/\.txt$/;

    my $path = catfile($td->dirname, $dirent);
    next unless -f $path;

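    open my $fh, '<', $path or die "Cannot open $path for read: $!";
    # NOTE: where Perl emulates flock via fcntl, LOCK_EX requires a handle
    # opened with write intent (for example '+<'); adjust if this fails.
    flock $fh, LOCK_EX or die "Error obtaining a lock on $path: $!";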

    while (defined(my $line = <$fh>)) {
        if ($line =~ m/^\QA|CHNL_ID|\E(\d+)$/m) {
            my $target_dir = catdir($td->dirname, $1);

            make_path($target_dir) unless -d $target_dir;
            my $dest = catfile($target_dir, $dirent);

            rename $path, $dest or die "Could not rename $path into $dest: $!";

            close $fh;
            next FILE;
        }
    }
    warn "Did not find CHNL_ID in $path. Skipping.\n";
    close $fh;
}

my $elapsed = tv_interval($t0);
say "Completed processing $FILE_COUNT files in $elapsed seconds.";

__DATA__
A|RCPNT_ID|92299999
A|RCPNT_TYP_CD|QL
A|ALERT_ID|264
A|FROM_ADDR_TX|14084007183
A|RQST_ID|PT201803989898
A|CRTEN_DT|02072018
A|CHNL_ID|17
A|RCPNT_FRST_NM|TESTSMSMIGRATION
A|SBJ_TX|Subject value from CDC
A|CLT_ID|14043
A|ALRT_NM|Order Shipped
A|CNTCT_ADDR|16166354429
A|RCPNT_LAST_NM|MEMBER
A|ORDR_NB|2650249999
A|LOB_CD|PBM
D|QL_922917566|20180313123311|1|TESTSMSMIGRATION MEMBER||

The output:

Generating 5000 temporary files.
Processing of 5000 files.
Completed processing 5000 files in 5.581455 seconds.

If you are using a slow spindle drive and a solution similar to this one really does take too long, then you may want to run once per hour instead of nightly. That will require a little more effort to ensure that only one process runs at a time, and that you only touch files that the program creating them has finished with, but all of those concerns can be solved with a little thought and code.
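As a rough sketch of those two safeguards: the first helper takes a non-blocking exclusive lock on a lock file so that a second instance can bow out, and the second skips files that were modified recently. Both helper names, the lock-file path and the age threshold are assumptions for illustration. The age check is only a heuristic; if the producing program can write to a temporary name and rename when it is done, that is more reliable.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical helper: take a non-blocking exclusive lock on a lock file.
# Returns the open handle (keep it in scope for the whole run, since
# closing it releases the lock) or undef if the lock is already held.
sub acquire_run_lock {
    my ($lock_path) = @_;
    open my $lock_fh, '>>', $lock_path
        or die "Cannot open lock file $lock_path: $!";
    return flock($lock_fh, LOCK_EX | LOCK_NB) ? $lock_fh : undef;
}

# Hypothetical helper: treat a file as finished only if it has not been
# modified for at least $min_age seconds. A heuristic, not a guarantee.
sub is_quiescent {
    my ($path, $min_age) = @_;
    my $mtime = (stat $path)[9];
    return 0 unless defined $mtime;    # file vanished or is unreadable
    return (time() - $mtime) >= $min_age ? 1 : 0;
}

# Example use (the path and threshold are placeholders):
#   my $lock = acquire_run_lock('/tmp/chnl_sorter.lock')
#       or exit 0;                     # another run is still active
#   next unless is_quiescent($path, 60);
```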

If you are dealing with 500,000 files instead of the 5,000 I sampled here, then I would expect that on an equivalent system you should be able to process those 500,000 in about 558 seconds, or roughly 9 minutes, 18 seconds. You mentioned you are processing 80k files per hour, but on my system this script processes about 3,200,000 per hour, so roughly 40x more per hour than you have been experiencing. It's possible some of the improvement comes from not reading each file in its entirety, but given that I'm distributing the trigger line randomly throughout the file, that shouldn't account for more than a halving, on average, of the total run time. Possibly your move was doing a full copy, which would account for a lot more of the time.
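The extrapolation is plain proportional scaling from the measured run, and it can be recomputed with a few lines. This sketch assumes throughput stays linear as the file count grows, which very large directories may not honour. The helper name project_throughput is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: scale a measured sample run to a larger file count,
# assuming throughput stays constant (linear scaling).
sub project_throughput {
    my ($sample_files, $sample_seconds, $target_files) = @_;
    my $files_per_second = $sample_files / $sample_seconds;
    return (
        files_per_second  => $files_per_second,
        files_per_hour    => $files_per_second * 3600,
        projected_seconds => $target_files / $files_per_second,
    );
}

# Measured above: 5,000 files in 5.581455 seconds.
my %p = project_throughput(5_000, 5.581455, 500_000);

printf "Projected: %.0f seconds (%d min %d s)\n",
    $p{projected_seconds},
    int($p{projected_seconds} / 60),
    $p{projected_seconds} % 60;
printf "Rate: %.0f files/hour, %.1fx the reported 80,000/hour\n",
    $p{files_per_hour}, $p{files_per_hour} / 80_000;
```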

I'll suggest that if a method such as this one isn't fast enough, and running it more frequently isn't possible, you're going to have to do some profiling to determine where all the time is being spent.
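A light-weight way to start, before reaching for a full profiler such as Devel::NYTProf from CPAN, is to accumulate wall-clock time per phase (open, read, rename) with Time::HiRes. The sketch below is one way to do that. The helper names timed and report_phases and the phase labels are illustrative, not part of the script above.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my %phase_seconds;    # accumulated wall-clock seconds per phase name

# Hypothetical helper: run a code block and add its elapsed time to the
# named phase. Returns whatever the block returns.
sub timed {
    my ($phase, $code) = @_;
    my $t0     = [gettimeofday];
    my @result = $code->();
    $phase_seconds{$phase} += tv_interval($t0);
    return wantarray ? @result : $result[0];
}

# Print phases sorted from most to least expensive.
sub report_phases {
    for my $phase (sort { $phase_seconds{$b} <=> $phase_seconds{$a} }
                   keys %phase_seconds) {
        printf "%-10s %.6f s\n", $phase, $phase_seconds{$phase};
    }
}

# Example use inside the processing loop:
#   timed(rename => sub { rename $path, $dest or die "rename: $!" });
#   ...
#   report_phases();
```

If one phase dominates, for example rename taking far longer than read, that points at the filesystem or at a copy-based move rather than at the Perl code itself.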

Dave
