in reply to Re: Perl Program to efficiently process 500000 small files in a Directory (AIX)
in thread Perl Program to efficiently process 500000 small files in a Directory (AIX)

Thanks for the reply, Dave. Can you tell me whether Perl's "rename" is more efficient than "move", given that these files are on the same file system?

Re^3: Perl Program to efficiently process 500000 small files in a Directory (AIX)
by davido (Cardinal) on Mar 18, 2018 at 06:22 UTC

    Perl's built-in rename may be faster than File::Copy's move, but mostly because move contains logic to decide whether it can do a rename, or whether it must fall back to a copy and unlink (for example, when spanning file systems).
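    That decision logic can be sketched in a few lines. This is a hypothetical helper, not File::Copy's actual implementation: try the cheap rename first, and fall back to move only when rename fails (as it would across file systems).

```perl
#!/usr/bin/env perl
# Hypothetical sketch: prefer the single-syscall rename, fall back to
# File::Copy::move (copy + unlink) only when rename cannot work.
use strict;
use warnings;
use File::Copy qw(move);
use File::Temp qw(tempdir);
use File::Spec::Functions qw(catfile);

sub relocate {
    my ($src, $dst) = @_;
    return 1 if rename $src, $dst;   # one syscall when on the same file system
    return move($src, $dst);         # falls back to copy + unlink otherwise
}

# Quick demonstration on a throwaway directory:
my $dir = tempdir(CLEANUP => 1);
my $src = catfile($dir, 'a.txt');
open my $fh, '>', $src or die "Cannot create $src: $!";
print $fh "demo\n";
close $fh;

relocate($src, catfile($dir, 'b.txt')) or die "relocate failed: $!";
print -e catfile($dir, 'b.txt') ? "moved\n" : "missing\n";
```

    Within one file system the fallback branch is never taken, which is why a plain rename and move should perform about the same there apart from move's extra bookkeeping.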

    I was curious how quickly I could burn through 5,000 files while reading them line by line, finishing the read when I find the CHNL_ID line, and then renaming them within the same filesystem to a subdir based on the ID found. So I created a script that does just that:

    1. Create a temporary directory (in /tmp on most Linux systems).
    2. Create 5,000 files in the temporary directory that contain a repetition of the sample input you provided, expanded out to approximately 350k per file. I injected a single CHNL_ID line at a random position within each of the 5,000 files. The ID I generated was an integer between 0 and 31.
    3. Create 32 subdirectories. This may have to be moved into the processor's loop, to occur on demand, if you don't know the range that the IDs will fall within.
    4. Start timing the elapsed time.
    5. Open my temp directory to read its contents.
    6. Iteratively read the directory's contents to grab one file at a time.
    7. Decide if the thing we're looking at is the type of file we want to deal with.
    8. Open one file at a time and get an exclusive lock.
    9. Read the file line by line until we find the CHNL_ID line. Grab the ID integer.
    10. Rename the file into the new subdir.
    11. Close the file so that we drop its lock.
    12. Move on to the next file.
    13. Capture our total elapsed time and display it.

    It was interesting to me that after creating the files (which took some time), I was able to process 5,000 of them in under six seconds. My SSD is pretty fast, so your mileage will certainly vary. But I'm not seeing performance being a big problem, particularly where this only runs on a nightly basis. Here's the code:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use feature qw/say/;
    use File::Temp;
    use File::Path qw(make_path);
    use File::Spec::Functions qw(catdir catfile);
    use Time::HiRes qw(tv_interval gettimeofday);
    use Fcntl qw(:flock);

    my $FILE_COUNT = 5_000;

    # SETUP - Create $FILE_COUNT files of approximately 350k each, with the
    # CHNL_ID line randomly distributed in each file.

    say "Generating $FILE_COUNT temporary files.";

    my @base_content = grep {!m/^\QA|CHNL_ID|\E\d+\n/} <DATA>;
    @base_content = (@base_content) x 1024;

    my $td = File::Temp->newdir(
        TEMPLATE => 'pm_tempXXXXX',
        TMPDIR   => 1,
        CLEANUP  => 1,
    );

    for my $n (0 .. 31) {
        make_path(catdir($td->dirname, sprintf("%02d", $n)));
    }

    for (1 .. $FILE_COUNT) {
        my $rand_ix = int(rand(scalar(@base_content)));
        my $chnl_id = sprintf "%02d", int(rand(32));
        my @output;
        for my $line_ix (0 .. $#base_content) {
            push @output, "A|CHNL_ID|$chnl_id\n" if $line_ix == $rand_ix;
            push @output, $base_content[$line_ix];
        }
        my $tf = File::Temp->new(
            TEMPLATE => 'pm_XXXXXXXXXXXX',
            SUFFIX   => '.txt',
            DIR      => $td->dirname,
            UNLINK   => 0,
        );
        print $tf @output;
        $tf->flush;
        close $tf;
    }

    # Sample file processor:

    say "Processing of $FILE_COUNT files.";

    my $t0 = [gettimeofday];

    opendir my $dh, $td->dirname
        or die "Cannot open temporary directory (", $td->dirname, "): $!\n";

    FILE: while (defined(my $dirent = readdir($dh))) {
        next if $dirent =~ m/^\.\.?$/;
        next unless $dirent =~ m/\.txt$/;
        my $path = catfile($td->dirname, $dirent);
        next unless -f $path;
        open my $fh, '<', $path or die "Cannot open $path for read: $!";
        flock $fh, LOCK_EX or die "Error obtaining a lock on $path: $!";
        while (defined(my $line = <$fh>)) {
            if ($line =~ m/^\QA|CHNL_ID|\E(\d+)$/m) {
                my $target_dir = catdir($td->dirname, $1);
                make_path($target_dir) unless -d $target_dir;
                my $dest = catfile($target_dir, $dirent);
                rename $path, $dest
                    or die "Could not rename $path into $dest: $!";
                close $fh;
                next FILE;
            }
        }
        warn "Did not find CHNL_ID in $path. Skipping.\n";
        close $fh;
    }

    my $elapsed = tv_interval($t0);
    say "Completed processing $FILE_COUNT files in $elapsed seconds.";

    __DATA__
    A|RCPNT_ID|92299999
    A|RCPNT_TYP_CD|QL
    A|ALERT_ID|264
    A|FROM_ADDR_TX|14084007183
    A|RQST_ID|PT201803989898
    A|CRTEN_DT|02072018
    A|CHNL_ID|17
    A|RCPNT_FRST_NM|TESTSMSMIGRATION
    A|SBJ_TX|Subject value from CDC
    A|CLT_ID|14043
    A|ALRT_NM|Order Shipped
    A|CNTCT_ADDR|16166354429
    A|RCPNT_LAST_NM|MEMBER
    A|ORDR_NB|2650249999
    A|LOB_CD|PBM
    D|QL_922917566|20180313123311|1|TESTSMSMIGRATION MEMBER||

    The output:

    Generating 5000 temporary files.
    Processing of 5000 files.
    Completed processing 5000 files in 5.581455 seconds.

    If you are using a slow spindle drive and a solution similar to this one actually does require way too much time, then you may want to run once per hour instead of nightly. That will require a little more effort to assure that only one process runs at a time, and to assure that you're only dealing with files that the program that creates them is done with, but all of those concerns can be solved with a little thought and code.
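    The "only one process at a time" guard can be as simple as a non-blocking flock on a well-known file. A minimal sketch (the lock path is an assumption; put it wherever suits your system):

```perl
#!/usr/bin/env perl
# Hypothetical single-instance guard: take a non-blocking exclusive lock
# on a well-known file and bail out if another instance already holds it.
use strict;
use warnings;
use Fcntl qw(:flock);

my $lock_path = '/tmp/file_sorter.lock';   # assumed location
open my $lock, '>', $lock_path or die "Cannot open $lock_path: $!";
unless (flock $lock, LOCK_EX | LOCK_NB) {
    print "Another instance is running; exiting.\n";
    exit 0;
}

# ... do the hourly processing here ...
# The lock is released automatically when $lock is closed or the
# process exits, so a crashed run cannot wedge the next one.
print "Lock acquired.\n";
```

    Because the lock dies with the process, there is no stale-pidfile problem to clean up after.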

    If you are dealing with 500,000 files instead of the 5,000 I sampled here, then I would expect that an equivalent system should be able to process those 500,000 in about 558 seconds, or roughly 9 minutes, 18 seconds. You mentioned you are processing 80k files per hour, but on my system this script processes up to about 3,000,000 per hour, so about 37x more per hour than you have been experiencing. It's possible some of the improvement comes from not reading each file in its entirety, but given how I'm distributing the trigger line randomly throughout the file, that shouldn't account for more than a halving, on average, of the total run time. Possibly your move was doing a full copy, which would account for a lot more of the time.

    I'll suggest that if a method such as this one isn't fast enough, and running it more frequently isn't possible, you're going to have to do some profiling to determine where all the time is being spent.
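    Devel::NYTProf is the usual tool for a full profile, but before reaching for it you can get coarse per-phase numbers with a few lines of Time::HiRes. A minimal sketch (the phase names and the stand-in workloads are made up for illustration):

```perl
#!/usr/bin/env perl
# Coarse phase-timing sketch with Time::HiRes: accumulate elapsed wall
# time per named phase, then print a small report. The select() calls
# are stand-ins for real work (each sleeps a fraction of a second).
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my %elapsed;

sub timed {
    my ($name, $code) = @_;
    my $t0 = [gettimeofday];
    $code->();
    $elapsed{$name} += tv_interval($t0);
}

timed(read_dir => sub { select undef, undef, undef, 0.01 });
timed(rename   => sub { select undef, undef, undef, 0.02 });

printf "%-8s %.3fs\n", $_, $elapsed{$_} for sort keys %elapsed;
```

    If the coarse numbers point at one phase, a proper profiler run (e.g. perl -d:NYTProf) can then tell you which lines inside it are hot.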


Re^3: Perl Program to efficiently process 500000 small files in a Directory (AIX)
by Marshall (Canon) on Mar 17, 2018 at 19:00 UTC
    move is part of File::Copy and can work across file systems. rename is a built-in Perl function and cannot work across file systems. Often more restrictions mean faster. This is easy enough to try that I'd just try it and compare the benchmark results. I would imagine that processing just the first 50,000 files would give you enough of a performance picture when comparing the alternatives. The if statement to stop after 50,000 files will make no speed difference.
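    A benchmark along those lines can be sketched as follows. The file count and sizes here are small, made-up stand-ins; scale them up to your 50,000-file sample for a realistic comparison:

```perl
#!/usr/bin/env perl
# Sketch of a rename-vs-move benchmark on throwaway files, with the
# early-stop guard Marshall describes. SAMPLE is an assumption; raise
# it (e.g. to 50_000) on real data.
use strict;
use warnings;
use File::Basename qw(basename);
use File::Copy qw(move);
use File::Path qw(make_path);
use File::Spec::Functions qw(catdir catfile);
use File::Temp qw(tempdir);
use Time::HiRes qw(gettimeofday tv_interval);

my $SAMPLE = 200;
my $dir    = tempdir(CLEANUP => 1);
my $sub    = catdir($dir, 'sorted');
make_path($sub);

sub make_files {
    my @files;
    for my $n (1 .. $SAMPLE) {
        my $path = catfile($dir, "f$n.txt");
        open my $fh, '>', $path or die "Cannot create $path: $!";
        print $fh 'x' x 1024;
        close $fh;
        push @files, $path;
    }
    return @files;
}

for my $method (['rename', sub { rename $_[0], $_[1] }],
                ['move',   sub { move($_[0], $_[1]) }]) {
    my @files = make_files();
    my $t0    = [gettimeofday];
    my $done  = 0;
    for my $f (@files) {
        last if $done++ >= $SAMPLE;   # the early-stop guard
        $method->[1]->($f, catfile($sub, basename($f)))
            or die "$method->[0] failed on $f: $!";
    }
    printf "%-6s %.4fs\n", $method->[0], tv_interval($t0);
    unlink glob catfile($sub, '*');   # reset for the next method
}
```

    Both methods stay on one file system here, so the comparison isolates move's extra decision logic rather than any copy cost.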

    Update: The rename should be faster because the actual data bits don't have to be moved - just a directory modification. An actual copy would move the data bits to a new location on the disk - that is way slower. I am also not sure that slurping the whole file in is best in this case. It sounds like although there are few lines, they are long lines. You could also benchmark letting Perl's readline do the line division for you, throwing away the first lines and only running the regex on the line where CHNL_ID is expected. My thinking here is that the line division probably uses the C index function, which is faster than the regex engine. Also, there apparently is no need to process the rest of the lines. The overall effect might (or might not) be a speed increase. I again suggest using say 10% of the data for testing so that you can test 4-5 scenarios within a couple of hours.
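    That fixed-position idea can be sketched in a few lines. It assumes the CHNL_ID record always lands on the same line (line 7, as in davido's sample record); an in-memory filehandle stands in for a real file here:

```perl
#!/usr/bin/env perl
# Sketch: let readline split the file and run the regex only on the one
# line where CHNL_ID is expected. The fixed position (line 7) is an
# assumption about the data layout.
use strict;
use warnings;

my $sample = join '', map { "$_\n" }
    'A|RCPNT_ID|92299999', 'A|RCPNT_TYP_CD|QL', 'A|ALERT_ID|264',
    'A|FROM_ADDR_TX|14084007183', 'A|RQST_ID|PT201803989898',
    'A|CRTEN_DT|02072018', 'A|CHNL_ID|17';

open my $fh, '<', \$sample or die $!;   # in-memory stand-in for a real file

my $chnl_id;
while (my $line = <$fh>) {
    next if $. < 7;    # skip earlier lines without touching the regex engine
    ($chnl_id) = $line =~ m/^\QA|CHNL_ID|\E(\d+)/;
    last;              # stop reading; the rest of the file is irrelevant
}
close $fh;

print defined $chnl_id ? "CHNL_ID=$chnl_id\n" : "not found\n";
```

    If the line position is not guaranteed, davido's scan-until-match loop above is the safe variant; this one only pays off when the layout is fixed.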

Re^3: Perl Program to efficiently process 500000 small files in a Directory (AIX)
by afoken (Canon) on Mar 17, 2018 at 18:35 UTC
    Can you tell me if Perl's "rename" is more efficient than "move"? As these files are on the same file system.

    rename avoids spawning a new process to do the same syscall, so it should be faster and create less load.


    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)