Re^3: Perl Program to efficiently process 500000 small files in a Directory (AIX)
by davido (Cardinal) on Mar 18, 2018 at 06:22 UTC
Perl's built-in rename may be faster than File::Copy's move, but mostly because move contains logic to decide whether it can do a rename or must fall back to a copy and unlink (for example, when the source and destination are on different file systems).

I was curious how quickly I could burn through 5,000 files while reading them line by line, stopping the read as soon as I find the CHNL_ID line, and then renaming each file within the same filesystem to a subdir based on the ID found. So I created a script that does just that.

It was interesting to me that after creating the files (which took some time), I was able to process 5,000 of them in under six seconds. My SSD is pretty fast, so your mileage will certainly vary. But I'm not seeing performance being a big problem, particularly where this only runs on a nightly basis. Here's the code:
The output:
If you are using a slow spindle drive and a solution similar to this one actually does require way too much time, then you may want to run once per hour instead of nightly. That will require a little more effort to ensure that only one process runs at a time, and that you only touch files the program that creates them has finished writing, but all of those concerns can be solved with a little thought and code.

If you are dealing with 500,000 files instead of the 5,000 I sampled here, then I would expect that with an equivalent system you should be able to process those 500,000 in about 558 seconds, or 9 minutes, 18 seconds. You mentioned you are processing 80k files per hour, but on my system this script processes up to about 3,000,000 per hour, so about 37x more per hour than you have been experiencing.

It's possible some of the improvement comes from not reading each file in its entirety, but given that I'm distributing the trigger line randomly throughout the file, that shouldn't account for more than a halving, on average, of the total run time. Possibly your move was doing a full copy, which would account for a lot more of the time.

I'll suggest that if a method such as this one isn't fast enough, and running it more frequently isn't possible, you're going to have to do some profiling to determine where all the time is being spent.

Dave
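The approach described in this post (read line by line, stop at the CHNL_ID line, rename into a subdirectory named after the ID, and keep overlapping runs apart) can be sketched roughly as below. This is not the script from the original post, only a minimal illustration. The sub names, the lock-file idea, and the `CHNL_ID=<value>` line layout are all assumptions and would need adjusting to the real data.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);
use File::Basename qw(basename);
use File::Path qw(make_path);

# Take a non-blocking exclusive lock so that an overlapping run can exit early.
# Returns the handle (keep it in scope for the whole run) or undef if the
# lock is already held by another process.
sub acquire_run_lock {
    my ($lock_path) = @_;
    open my $fh, '>>', $lock_path or die "Cannot open $lock_path: $!";
    return flock($fh, LOCK_EX | LOCK_NB) ? $fh : undef;
}

# Read line by line and stop at the first CHNL_ID line.
# Assumption: the line looks like "CHNL_ID=<value>" (or "CHNL_ID: <value>").
sub find_channel_id {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        if ($line =~ /CHNL_ID\s*[=:]\s*(\w+)/) {
            my $id = $1;
            return $id;    # the rest of the file is never read
        }
    }
    return;
}

# Rename one file into "$out_dir/<id>/". Returns the new path, or undef
# if no CHNL_ID line was found (the file is then left where it is).
sub file_by_channel {
    my ($path, $out_dir) = @_;
    my $id = find_channel_id($path);
    return unless defined $id;
    my $target_dir = "$out_dir/$id";
    make_path($target_dir) unless -d $target_dir;
    my $target = "$target_dir/" . basename($path);
    rename($path, $target) or die "Cannot rename $path to $target: $!";
    return $target;
}
```

A driver would call `acquire_run_lock` once, exit quietly if it returns undef, and then loop over the `readdir` results calling `file_by_channel` on each entry. Because `rename` only works within one file system, the output directory is assumed to live on the same file system as the input files.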
Re^3: Perl Program to efficiently process 500000 small files in a Directory (AIX)
by Marshall (Canon) on Mar 17, 2018 at 19:00 UTC
Update: The rename should be faster because the actual data bits don't have to be moved; it is just a directory modification. An actual copy would move the data bits to a new location on the disk, which is way slower.

I am also not sure that slurping the file in is best in this case. It sounds like although there are few lines, they are long lines. You could also benchmark letting the filesystem do the line division for you, throwing away the first 5 lines and running the regex only on the 6th line. My thinking here is that the line division probably uses the C index function, which is faster than the regex engine. Also, there apparently is no need to process the rest of the lines. The overall effect might (or might not) be a speed increase.

I again suggest using, say, 10% of the data for testing so that you can test 4-5 scenarios within a couple of hours.
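The two read strategies compared in this post can be put side by side for benchmarking. The following is a minimal sketch. The sub names are hypothetical, and both the `CHNL_ID=<value>` layout and the "ID is on line 6" position are assumptions taken from the discussion, not verified against the real files.

```perl
use strict;
use warnings;

# Assumed layout of the trigger line; adjust to the real format.
my $ID_RE = qr/CHNL_ID\s*[=:]\s*(\w+)/;

# Strategy A: slurp the whole file and run the regex over all of it.
sub id_by_slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $content = do { local $/; <$fh> } // '';
    return $content =~ $ID_RE ? $1 : undef;
}

# Strategy B: let readline split the lines, discard the lines before
# $wanted (default 6), run the regex on that one line only, and stop.
sub id_from_line {
    my ($path, $wanted) = @_;
    $wanted //= 6;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        next if $. < $wanted;    # $. is the current line number of $fh
        return $line =~ $ID_RE ? $1 : undef;
    }
    return;    # file has fewer than $wanted lines
}
```

Both subs can be handed to `cmpthese` from the core Benchmark module over a sample of the real files to see which one wins on a given system. Strategy B only helps if the ID really is always on the same line.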
Re^3: Perl Program to efficiently process 500000 small files in a Directory (AIX)
by afoken (Chancellor) on Mar 17, 2018 at 18:35 UTC
> Can you tell me if Perl's "rename" is more efficient than "move"? As these files are on the same file system.

rename avoids spawning a new process to do the same syscall, so it should be faster and create less load.

Alexander
-- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
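The relationship between the two calls discussed in this thread can be made explicit: try the built-in `rename` first and hand over to File::Copy's `move` only when the target turns out to be on another file system. This is a minimal sketch under that assumption. `relocate` is a hypothetical helper name, and the `EXDEV` check relies on the `%!` hash provided by the core Errno module.

```perl
use strict;
use warnings;
use File::Copy qw(move);

# Try the cheap directory-entry rename first; fall back to File::Copy::move
# (copy, then unlink) only when the target is on a different file system.
sub relocate {
    my ($src, $dst) = @_;
    return 1 if rename($src, $dst);
    # Any failure other than "cross-device link" is a real error.
    die "Cannot rename $src to $dst: $!" unless $!{EXDEV};
    move($src, $dst) or die "Cannot move $src to $dst: $!";
    return 1;
}
```

When source and destination are known to share a file system, as in the question, the fallback branch is never taken and the cost is a single `rename` call per file.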