http://qs321.pair.com?node_id=1211102

DenairPete has asked for the wisdom of the Perl Monks concerning the following question:

I am at my wit's end with this Perl script. Running on AIX. I am processing a large directory every night. It accumulates around 1 million files each night, half of which are ".txt" files that I need to process. Each ".txt" file is pipe delimited and only contains 20 records - Record #6 is the one that contains the info I need in order to determine which directory to move the file to (in this case the file would be moved to "/out/4"). Example record: A|CHNL_ID|4 (the third-party software creating these files didn't think to include the channel in the file name). As of now this script is processing at a rate of 80,000 files per hour. Are there any recommendations on how I could speed this up?

use File::Copy qw(move);   # provides move()

opendir(DIR, $dir) or die "$!\n";
while ( defined( my $txtFile = readdir DIR ) ) {
    next if ( $txtFile !~ /\.txt$/ );
    $cnt++;
    local $/;                                   # slurp the whole file at once
    open my $fh, '<', $txtFile or die $!, $/;
    my $data = <$fh>;
    my ($channel) = $data =~ /A\|CHNL_ID\|(\d+)/i;
    close($fh);
    move( $txtFile, "$outDir/$channel" ) or die $!, $/;
}
closedir(DIR);

Replies are listed 'Best First'.
Re: Perl Program to efficiently process 500000 small files in a Directory (AIX)
by davido (Cardinal) on Mar 17, 2018 at 16:12 UTC

    If the files are on the same physical device (as BrowserUk asked and you replied yes), then you can eliminate the cost of moving them by assuring you're using a version of 'move' that doesn't do a physical move, just a logical one. If they're on a different device, you're kind of stuck on that point.

    Are the files individually written as atomic chunks throughout the day and never touched again until you process them nightly? If so, consider this: you can process 80k/hour, but you're acquiring only about 21k per hour. You have a surplus capacity of 59k/hour; to put it another way, each hour's worth of files takes you about 16 minutes to process. So, could you run your script as a cron job that fires once per hour for 16 minutes? Or once per half-hour for 8 minutes? Or once per quarter-hour for 4 minutes? In such cases I also suggest proactively stopping the process after 150% of the expected time slot and logging that it didn't finish its work. The next cron run will pick up where it left off, but you would like to know if you're getting bogged down in the future.

    If you choose to take this approach you will need to deal with file locking to assure you're processing complete files, and also to assure that a given file is only dealt with by a single runtime instance of your processing script. If the writing process doesn't lock, you would still want to do so to prevent your own processes from stumbling over each other if one happens to run a little long.

    If the writing process doesn't lock, you could also simply skip any file newer than 5 minutes old just to assure the writing process is done with it (this is making the assumption that the writing process is spitting out a file, closing it, and then leaving it alone from that point forward).
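
    Something along these lines could cover all three concerns. It is only a sketch, and the lock-file path, the directory, and the 16-minute budget are illustrative assumptions:

    # A minimal sketch, not production code: single-instance guard, skip files
    # younger than 5 minutes, stop once 150% of the time slot has elapsed.
    use strict;
    use warnings;
    use Fcntl qw(:flock);

    my $dir      = '/path/to/incoming';        # assumption: adjust to your paths
    my $lockfile = '/var/tmp/chnl_mover.lock';
    my $deadline = time + int(16 * 60 * 1.5);  # 150% of the expected 16-minute slot

    open my $lock, '>', $lockfile or die "Cannot open $lockfile: $!";
    exit 0 unless flock $lock, LOCK_EX | LOCK_NB;   # another instance is still running

    opendir my $dh, $dir or die "Cannot open $dir: $!";
    while (defined(my $name = readdir $dh)) {
        next unless $name =~ /\.txt\z/;
        my $path = "$dir/$name";
        next unless -f $path;
        next if time - (stat _)[9] < 300;   # the writer may still own files < 5 min old
        if (time > $deadline) {
            warn "Ran past 150% of the time slot; leaving the rest for the next run\n";
            last;
        }
        # ... read the file, extract CHNL_ID, rename into the target directory ...
    }
    closedir $dh;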


    Dave

      Thanks for the reply, Dave. Can you tell me if Perl's "rename" is more efficient than "move", given that these files are on the same file system?

        Perl's built-in rename may be faster than File::Copy's move, but mostly due to the fact that move contains logic to decide whether it can do a rename, or whether it must fall back to a copy and unlink (for example, spanning file systems).
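
        As a minimal sketch of that decision in your own code (the paths here are placeholders): try the cheap rename first, and fall back to move only if it fails, for example across filesystems.

        use strict;
        use warnings;
        use File::Copy qw(move);

        my ($src, $dest) = ('incoming/foo.txt', 'out/4/foo.txt');   # placeholders

        # rename is a single syscall but only works within one filesystem;
        # move falls back to copy + unlink when a rename is not possible.
        rename $src, $dest
            or move($src, $dest)
            or die "Could not move $src to $dest: $!";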

        I was curious how quickly I could burn through 5,000 files while reading them line by line, finishing the read when I find the CHNL_ID line, and then renaming them within the same filesystem to a subdir based on the ID found. So I created a script that does just that:

        1. Create a temporary directory (in /tmp on most Linux systems).
        2. Create 5,000 files in the temporary directory that contain a repetition of the sample input you provided, expanded out to approximately 350k per file. I injected a single CHNL_ID line at a random position within each of the 5,000 files. The ID I generated was an integer between 0 and 31.
        3. Create 32 subdirectories. This may have to be moved into the processor's loop, to occur on demand, if you don't know the range that the IDs will fit within.
        4. Start timing the elapsed time.
        5. Open my temp directory to read its contents.
        6. Iteratively read the directory's contents to grab one file at a time.
        7. Decide if the thing we're looking at is the type of file we want to deal with.
        8. Open one file at a time and get an exclusive lock.
        9. Read the file line by line until we find the CHNL_ID line. Grab the ID integer.
        10. Rename the file into the new subdir.
        11. Close the file so that we drop its lock.
        12. Move on to the next file.
        13. Capture our total elapsed time and display it.

        It was interesting to me that after creating the files (which took some time), I was able to process 5,000 of them in under six seconds. My SSD is pretty fast, so your mileage will certainly vary. But I'm not seeing performance being a big problem, particularly where this only runs on a nightly basis. Here's the code:

        #!/usr/bin/env perl

        use strict;
        use warnings;
        use feature qw/say/;

        use File::Temp;
        use File::Path qw(make_path);
        use File::Spec::Functions qw(catdir catfile);
        use Time::HiRes qw(tv_interval gettimeofday);
        use Fcntl qw(:flock);

        my $FILE_COUNT = 5_000;

        # SETUP - Create $FILE_COUNT files of roughly 350k each, with the
        # CHNL_ID line randomly distributed in each file.

        say "Generating $FILE_COUNT temporary files.";

        my @base_content = grep {!m/^\QA|CHNL_ID|\E\d+\n/} <DATA>;
        @base_content = (@base_content) x 1024;

        my $td = File::Temp->newdir(
            TEMPLATE => 'pm_tempXXXXX',
            TMPDIR   => 1,
            CLEANUP  => 1,
        );

        for my $n (0 .. 31) {
            make_path(catdir($td->dirname, sprintf("%02d", $n)));
        }

        for (1 .. $FILE_COUNT) {
            my $rand_ix = int(rand(scalar(@base_content)));
            my $chnl_id = sprintf "%02d", int(rand(32));
            my @output;
            for my $line_ix (0 .. $#base_content) {
                push @output, "A|CHNL_ID|$chnl_id\n" if $line_ix == $rand_ix;
                push @output, $base_content[$line_ix];
            }
            my $tf = File::Temp->new(
                TEMPLATE => 'pm_XXXXXXXXXXXX',
                SUFFIX   => '.txt',
                DIR      => $td->dirname,
                UNLINK   => 0,
            );
            print $tf @output;
            $tf->flush;
            close $tf;
        }

        # Sample file processor:

        say "Processing of $FILE_COUNT files.";

        my $t0 = [gettimeofday];

        opendir my $dh, $td->dirname
            or die "Cannot open temporary directory (", $td->dirname, "): $!\n";

        FILE: while (defined(my $dirent = readdir($dh))) {
            next if $dirent =~ m/^\.\.?$/;
            next unless $dirent =~ m/\.txt$/;
            my $path = catfile($td->dirname, $dirent);
            next unless -f $path;
            open my $fh, '<', $path or die "Cannot open $path for read: $!";
            flock $fh, LOCK_EX or die "Error obtaining a lock on $path: $!";
            while (defined(my $line = <$fh>)) {
                if ($line =~ m/^\QA|CHNL_ID|\E(\d+)$/m) {
                    my $target_dir = catdir($td->dirname, $1);
                    make_path($target_dir) unless -d $target_dir;
                    my $dest = catfile($target_dir, $dirent);
                    rename $path, $dest
                        or die "Could not rename $path into $dest: $!";
                    close $fh;
                    next FILE;
                }
            }
            warn "Did not find CHNL_ID in $path. Skipping.\n";
            close $fh;
        }

        my $elapsed = tv_interval($t0);

        say "Completed processing $FILE_COUNT files in $elapsed seconds.";

        __DATA__
        A|RCPNT_ID|92299999
        A|RCPNT_TYP_CD|QL
        A|ALERT_ID|264
        A|FROM_ADDR_TX|14084007183
        A|RQST_ID|PT201803989898
        A|CRTEN_DT|02072018
        A|CHNL_ID|17
        A|RCPNT_FRST_NM|TESTSMSMIGRATION
        A|SBJ_TX|Subject value from CDC
        A|CLT_ID|14043
        A|ALRT_NM|Order Shipped
        A|CNTCT_ADDR|16166354429
        A|RCPNT_LAST_NM|MEMBER
        A|ORDR_NB|2650249999
        A|LOB_CD|PBM
        D|QL_922917566|20180313123311|1|TESTSMSMIGRATION MEMBER||

        The output:

        Generating 5000 temporary files.
        Processing of 5000 files.
        Completed processing 5000 files in 5.581455 seconds.

        If you are using a slow spindle drive and a solution similar to this one actually does require way too much time, then you may want to run once per hour instead of nightly. That will require a little more effort to assure that only one process runs at a time, and to assure that you're only dealing with files that the program that creates them is done with, but all of those concerns can be solved with a little thought and code.

        If you are dealing with 500,000 files instead of the 5,000 I sampled here, then I would expect that with an equivalent system you should be able to process those 500,000 in about 558 seconds, or a little over nine minutes. You mentioned you are processing 80k files per hour, but on my system this script processes up to about 3,000,000 per hour, so about 37x more per hour than you have been experiencing. It's possible some of the improvement comes from not reading each file in its entirety, but given how I'm distributing the trigger line randomly throughout the file, that shouldn't account for more than a halving, on average, of the total run time. Possibly your move was doing a full copy, which would account for a lot more of the time.

        I'll suggest that if a method such as this one isn't fast enough, and running it more frequently isn't possible, you're going to have to do some profiling to determine where all the time is being spent.


        Dave

        move is part of File::Copy and can work across file systems. rename is a built-in Perl function and cannot work across file systems. Often, more restrictions mean faster. This is easy enough to try that I'd just try it and look at the benchmark results. I would imagine that just processing the first 50,000 files would give you enough of a performance idea when comparing the various alternatives. The if statement needed to stop after 50,000 files will make no measurable speed difference.

        Update: the rename should be faster because the actual data bits don't have to be moved - just a directory modification. An actual copy moves the data bits to a new location on the disk - that is way slower. I am also not sure that slurping the file in is best in this case. It sounds like although there are few lines, they are long lines. You could also benchmark letting readline do the line division for you: throw away the first 5 lines and only run the regex on the 6th line. My thinking here is that the line division probably uses the C index function, which is faster than the regex engine. Also, there is apparently no need to process the rest of the lines. The overall effect might (or might not) be a speed increase. I again suggest using, say, 10% of the data for testing so that you can test 4-5 scenarios within a couple of hours.
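
        A rough sketch of both ideas combined; the directory names and the 50,000 cap are placeholders. It reads each file line by line, stops at the CHNL_ID record instead of slurping, and quits after a fixed number of files so variants can be benchmarked on a subset:

        use strict;
        use warnings;
        use Time::HiRes qw(time);

        my $dir    = '/path/to/incoming';   # placeholders
        my $outDir = '/path/to/out';
        my $limit  = 50_000;                # benchmark on a subset first
        my $done   = 0;
        my $t0     = time;

        opendir my $dh, $dir or die "$dir: $!";
        while (defined(my $name = readdir $dh)) {
            next unless $name =~ /\.txt\z/;
            open my $fh, '<', "$dir/$name" or die "$dir/$name: $!";
            my $channel;
            while (my $line = <$fh>) {                   # line by line, no slurp
                if ($line =~ /\AA\|CHNL_ID\|(\d+)/) {
                    $channel = $1;
                    last;                                # skip the remaining records
                }
            }
            close $fh;
            if (defined $channel) {
                rename "$dir/$name", "$outDir/$channel/$name"
                    or die "rename $name: $!";
            }
            last if ++$done >= $limit;
        }
        closedir $dh;
        printf "%d files in %.2f seconds\n", $done, time() - $t0;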

        Can you tell me if Perl's "rename" is more efficient than "move", given that these files are on the same file system?

        rename avoids spawning a new process to do the same syscall, so it should be faster and create less load.

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: Perl Program to efficiently process 500000 small files in a Directory (AIX)
by rminner (Chaplain) on Mar 17, 2018 at 07:11 UTC
    Historically, I wrote a program which also read a large number of files into memory by setting $/ to undef. At the time the program took 45 minutes to complete. After changing the reading mechanism to File::Slurp the runtime went down to 3 minutes. I did not change anything else. This will of course depend on a number of factors, but maybe you could give it a try. Your example adapted to File::Slurp:
    use File::Slurp;         # Update: or use File::Slurper, which Athanasius mentioned.
    use File::Copy qw(move); # provides move()

    opendir(DIR, $dir) or die "$!\n";
    while ( defined( my $txtFile = readdir DIR ) ) {
        next if ( $txtFile !~ /\.txt$/ );
        $cnt++;
        my $data = read_file($txtFile);
        my ($channel) = $data =~ /A\|CHNL_ID\|(\d+)/i;
        move( $txtFile, "$outDir/$channel" ) or die $!, $/;
    }
    closedir(DIR);
    Since your program blocks while reading and moving a file, you might also want to parallelize it, e.g. with Parallel::ForkManager or with MCE. Then you can do the reading of the files and the moving in parallel. To some extent you are of course I/O bound, but I think it should still give you some improvement, if implemented correctly.
    Update: I whipped up a quick (untested) example for Parallel::ForkManager:
    use strict;
    use warnings;
    use File::Slurp;
    use File::Copy qw(move);
    use Parallel::ForkManager;

    sub read_next_batch_of_filenames {
        my ($DH, $MAX_FILES) = @_;
        my @files = ();
        while (my $fn = readdir $DH) {
            next if ($fn !~ m/\.txt\z/);
            push @files, $fn;
            last if (scalar(@files) >= $MAX_FILES);
        }
        if (@files) {
            return \@files;
        }
        else {
            return;
        }
    }

    sub move_files {
        my ($outDir, $files) = @_;
        foreach my $f (@$files) {
            my $data = read_file($f);
            my ($channel) = $data =~ /A\|CHNL_ID\|(\d+)/i;
            move($f, "$outDir/$channel")
                or die "Failed to move '$f' to '$outDir/$channel' ($!)\n";
        }
    }

    sub parallelized_move {
        my $dir    = 'FIXME';
        my $outDir = 'FIXME';
        my $MAX_PROCESSES     = 4;    # tweak this to find the best number
        my $FILES_PER_PROCESS = 1000; # process in batches of 1000, to limit forking

        my $pm = Parallel::ForkManager->new($MAX_PROCESSES);

        opendir my $DH, $dir or die "Failed to open '$dir' for reading ($!)\n";
        DATA_LOOP:
        while (my $files = read_next_batch_of_filenames($DH, $FILES_PER_PROCESS)) {
            # Forks and returns the pid for the child:
            my $pid = $pm->start and next DATA_LOOP;
            move_files($outDir, $files);
            $pm->finish;   # Terminates the child process
        }
        $pm->wait_all_children;   # make sure every batch has finished
        closedir $DH or die "Failed to close directory handle for '$dir' ($!)\n";
    }
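
    MCE, mentioned above, offers a similar batching model with less bookkeeping. Another rough, untested sketch along the same lines; the directory paths are placeholders and the worker/chunk numbers are only starting points:

    use strict;
    use warnings;
    use File::Copy qw(move);
    use MCE::Loop;

    my ($dir, $outDir) = ('/path/to/incoming', '/path/to/out');   # placeholders

    opendir my $dh, $dir or die "Failed to open '$dir' ($!)\n";
    my @txt_files = grep { /\.txt\z/ } readdir $dh;
    closedir $dh;

    MCE::Loop->init(
        max_workers => 4,      # tune for your I/O subsystem
        chunk_size  => 1000,   # each worker gets 1000 file names at a time
    );

    mce_loop {
        my ($mce, $chunk_ref, $chunk_id) = @_;
        for my $name (@$chunk_ref) {
            open my $fh, '<', "$dir/$name" or next;
            my $channel;
            while (my $line = <$fh>) {
                if ($line =~ /A\|CHNL_ID\|(\d+)/i) { $channel = $1; last; }
            }
            close $fh;
            move("$dir/$name", "$outDir/$channel/$name") if defined $channel;
        }
    } @txt_files;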
        Which slurping module is used cannot possibly be the bottleneck for the OP's code.
      File::Slurp cannot make that kind of speed improvement.
        Actually, it gave exactly this performance gain. I changed the reading mechanism after benchmarking the program. The benchmarks showed me that more than 90% of the real runtime of my program was spent on I/O. The files were however larger (a few thousand XML files), and were located on a normal HDD, not an SSD. He could simply try it and see whether it changes anything for him.
Re: Perl Program to efficiently process 500000 small files in a Directory (AIX)
by BrowserUk (Patriarch) on Mar 17, 2018 at 08:32 UTC

    Are the out/n directories on the same physical device as this giant directory?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
      Yes
        Yes

        Then multi-threading/multi-processing your problem will not help (much). Adding contention between reading and writing will probably slow things down.

        An alternative strategy: separate the reading and writing.

        The first pass reads the files, extracts the relevant field and constructs a hash mapping original path/filename to new path/filename.

        The second pass reads the filenames again using opendir. That (should) give you the filenames in whatever order the filesystem considers its native ordering. That might be alphabetically sorted, or it might be ordered by creation date. Whatever it is, it should be the fastest way to access the on-disk directory structure.

        Process the filenames in whatever order the OS/opendir gives them to you. Look up the original name in the hash to find the new name, and use rename to move them.

        Rationale: separating the reading and writing removes contention at the hardware level; renaming in the same order the OS gives you the names reduces inode/FAT32/HPFS cache misses.

        Moving (renaming) a file does not cause any (file) data to be duplicated; it is simply a change to a field within the filesystem directory structure. Making that change in the same order the filesystem gives you the names ensures that the modification is made immediately after the inode (or equivalent) is read, and therefore while it is still in cache, saving a re-read/cache miss; that should be the fastest approach. The filesystem LRU cache is optimised for this case.
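
        A rough sketch of that two-pass strategy, with illustrative paths; the first pass only reads, the second only renames, in readdir order:

        use strict;
        use warnings;

        my $dir    = '/path/to/incoming';   # illustrative paths
        my $outDir = '/path/to/out';

        # Pass 1: read only; remember where each file should go.
        my %dest_for;
        opendir my $dh, $dir or die "$dir: $!";
        while (defined(my $name = readdir $dh)) {
            next unless $name =~ /\.txt\z/;
            open my $fh, '<', "$dir/$name" or die "$dir/$name: $!";
            while (my $line = <$fh>) {
                if ($line =~ /A\|CHNL_ID\|(\d+)/) {
                    $dest_for{$name} = "$outDir/$1/$name";
                    last;
                }
            }
            close $fh;
        }
        closedir $dh;

        # Pass 2: write only; rename in the order readdir hands the names back,
        # which should match the on-disk directory order.
        opendir $dh, $dir or die "$dir: $!";
        while (defined(my $name = readdir $dh)) {
            next unless exists $dest_for{$name};
            rename "$dir/$name", $dest_for{$name}
                or warn "rename $name: $!\n";
        }
        closedir $dh;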

        HTH.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
        In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
Re: Perl Program to efficiently process 500000 small files in a Directory (AIX)
by LanX (Saint) on Mar 17, 2018 at 14:10 UTC
    Provided there are no unidentified bottlenecks in your code.
    • How busy is your system (other processes, nice levels)?
    • Is your filesystem under heavy load?
    • Did you try running parallel scripts, each dedicated to a separate subset of file names?

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Wikisyntax for the Monastery

Re: Perl Program to efficiently process 500000 small files in a Directory (AIX)
by bliako (Monsignor) on Mar 17, 2018 at 17:57 UTC

    Would it be faster if you do reading and moving using a shell script?

    Something like:

    for afile in $dir/*; do
        achannel=$(awk -F'|' '{print $20}' "${afile}")
        mv "${afile}" "${outdir}/${achannel}"
    done

      Would it be faster if you do reading and moving using a shell script?

      Something like:

      for afile in $dir/*; do
          achannel=$(awk -F'|' '{print $20}' "${afile}")
          mv "${afile}" "${outdir}/${achannel}"
      done

      Unlikely. Creating an awk and a mv process for each of half a million files adds up to spawning a million processes. I doubt that this idea will be faster than running a single process (the Perl script), even on an insanely fast machine.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: Perl Program to efficiently process 500000 small files in a Directory (AIX)
by Anonymous Monk on Mar 17, 2018 at 13:41 UTC
    It would be instructive to comment-out various parts of the program to see exactly which one is slowing you down most – walking the directory, reading the content, or moving the file. It could possibly be that directory manipulation or directory walking is the culprit, such that things might move faster if your program stashed a list of files that need to be moved, then moves them after completing all or part of the walk. It could also be that memory-mapped files could help. You have several I/O operations here any one of which could be the bad(dest) guy.
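
    One way to act on both suggestions at once, as a rough sketch with placeholder paths: time each phase separately, and stash the pending moves in a list file so the renames happen only after the walk is finished.

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    my $dir    = '/path/to/incoming';        # placeholders
    my $outDir = '/path/to/out';
    my $list   = '/var/tmp/pending_moves.txt';

    # Phase 1: walk the directory and read the files; nothing is renamed yet,
    # so the directory structure is not modified while we are walking it.
    my $t = time;
    opendir my $dh, $dir or die "$dir: $!";
    open my $out, '>', $list or die "$list: $!";
    my $seen = 0;
    while (defined(my $name = readdir $dh)) {
        next unless $name =~ /\.txt\z/;
        $seen++;
        open my $fh, '<', "$dir/$name" or next;
        while (my $line = <$fh>) {
            if ($line =~ /A\|CHNL_ID\|(\d+)/) {
                print {$out} "$dir/$name\t$outDir/$1/$name\n";
                last;
            }
        }
        close $fh;
    }
    close $out;
    closedir $dh;
    printf "walk+read: %.1fs for %d files\n", time - $t, $seen;

    # Phase 2: renames only; this could also live in a separate program run later.
    $t = time;
    open my $in, '<', $list or die "$list: $!";
    while (my $pair = <$in>) {
        chomp $pair;
        my ($old, $new) = split /\t/, $pair, 2;
        rename $old, $new or warn "rename $old: $!\n";
    }
    close $in;
    printf "move:      %.1fs\n", time - $t;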
      Thumbs up, exactly the points I wanted to make.

      Improvement is only possible after identifying the bottlenecks, and who knows how performant single file moves are on AIX ...

      > 80,000 files per hour. 

      Means about 4 per second; that's hard to believe on modern hardware.

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Wikisyntax for the Monastery

        80,000/hr /3600 = 22.222/s?


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
        In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
        Well, I can assure you that you can believe it. Remember there are around 1 million files dropped into this directory each day. Of these 1 million, around half are the files I'm looking to move into another directory. I commented the "move" out of the above script in this thread and it ran in 1 hour and 20 minutes.
      Mike, I'm curious why you post your better, less provocative stuff anonymously. It leads one to surmise that you are intentionally trolling when you post as sundialsvc4.
        It's still the same ol' same ol'; he's just repeating/restating things already said by others before him.
        Quite the opposite. I use AM more and more frequently just to piss seven people off. :-P

        If you had liked the previous post, its reputation would be only -6 instead of -7, and that gets tiresome after a while.
Re: Perl Program to efficiently process 500000 small files in a Directory (AIX)
by haukex (Archbishop) on Mar 19, 2018 at 06:39 UTC
Re: Perl Program to efficiently process 500000 small files in a Directory (AIX)
by Anonymous Monk on Mar 17, 2018 at 07:54 UTC
    Which part of your program is the slow part? How long does it take to simply readdir and print the filenames into a file?
      I reran it with the "move" commented out. It took 1 hour and 20 minutes to process 467,000 files.

        How big are the files in bytes? Can you post an example of one?

        poj
Re: Perl Program to efficiently process 500000 small files in a Directory (AIX)
by Anonymous Monk on Mar 17, 2018 at 19:45 UTC

    What filesystem is being used? JFS? How long does it take to merely ls those 1/2 million directory entries? Have you tried processing the files in particular order, like reverse list order (last entry first), or maybe reverse chronological order (newest entry first)?
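
    If the ordering experiment is worth trying, here is a rough sketch of the newest-first variant (the path is a placeholder); note the extra stat pass has its own cost, so it needs benchmarking too:

    use strict;
    use warnings;

    my $dir = '/path/to/incoming';   # placeholder

    opendir my $dh, $dir or die "$dir: $!";
    my @txt = grep { /\.txt\z/ } readdir $dh;
    closedir $dh;

    # Newest entry first; (stat)[9] is the modification time.
    my %mtime = map { $_ => ((stat "$dir/$_")[9] // 0) } @txt;
    @txt = sort { $mtime{$b} <=> $mtime{$a} } @txt;

    # ... process @txt in this order, or reverse @txt for oldest-first ...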

Re: Perl Program to efficiently process 500000 small files in a Directory (AIX)
by Anonymous Monk on Mar 19, 2018 at 18:24 UTC
    "Commenting out the move" does not mean that "move is the culprit." It might well instead mean that the directory-walk ran much faster! Try this: modify the program to walk the directory and to write out the qualifying filenames to a separate temporary file. Next, run (and time) a separate program which reads that temporary file, and which therefore is not performing a directory walk at the same time. Directory walks often rely on caches (for speed) that must be invalidated when the content of the filesystem is changed. If you measured the amount of time between each successive "hit" in your directory walk, in the present program, you just might discover that it's running slower, and slower, and slower . . . all this because you're altering the directory structure at the same time.
      ... also note that dividing this into two parallel processes, e.g. linked by a Unix/Linux pipe, would not be the same as using a temporary disk file as the intermediate buffer and doing the work in two separate stages.