
Performance Trap - Opening/Closing Files Inside a Loop

by Limbic~Region (Chancellor)
on Dec 09, 2004 at 23:57 UTC ( [id://413719] : perlquestion )

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

There is a long story behind this that involves a Java programmer asking for some help with Perl. I won't get into the particulars other than to say the question asked was:

What's the easiest way to loop through a comma delimited file and append the line minus 1 column into a new file that is the same name as the excluded column?

The file in question was about 20 lines long. I gave my disclaimer about normally using a module to handle CSV, but the following code should work:

#!/usr/bin/perl
use strict;
use warnings;

while ( <DATA> ) {
    my @field = split /,/;
    my $file = splice @field, 2, 1;
    open (OUTPUT, '>>', $file) or die $!;
    print OUTPUT join ',', @field;
}
__DATA__
1,2,foo,3
4,5,bar,6
7,8,foo,9
I asked the Java programmer the next day how it worked and was informed that it was too slow and that a Java program was being written instead. Scratching my head, I asked if the same file I was shown before was the one actually being used. It wasn't - multiple files, each millions of lines long. Opening and closing file(s) that many times is bound to be slow. I offered the following modification* of the code, provided the column being excluded was fairly repetitive in the file:
#!/usr/bin/perl
use strict;
use warnings;

my %fh;
while ( <DATA> ) {
    my @field = split /,/;
    my $file = splice @field, 2, 1;
    if ( ! $fh{$file} ) {
        open ($fh{$file}, '>>', $file) or die $!;
    }
    print { $fh{$file} } join ',', @field;
}
__DATA__
1,2,foo,3
4,5,bar,6
7,8,foo,9
I explained that the reason for the disclaimer was that the hash only bought performance if a file had more than 1 line getting appended to it. Additionally, if there are too many unique files, memory and/or open file descriptors may become a problem. I was then told that the Java code was nearly done but thanks anyway. *shrug* - exit stage right.

I think I am missing how Java is going to be that much faster. I assume Java is still going to open and close the file each time through the loop unless there is a similar trick. Given that I don't really know Java I could be out in left field here.

Leaving Java aside, is there a more run-time efficient way than my second suggestion in Perl? I haven't given it a lot of thought because the Java developer is just being silly. It is a run-once-and-done script, so it would already be finished if the first version (wrapped in a tiny shell script) had been allowed to run. On the other hand, this is the sort of thing that I like to be aware of in the future. (Prior Planning Prevents Poor Performance)**

Cheers - L~R

* The actual code used ARGV
** I learned this in the military, but there were a couple of extra expletive Ps

Replies are listed 'Best First'.
Re: Performance Trap - Opening/Closing Files Inside a Loop
by tachyon (Chancellor) on Dec 10, 2004 at 04:44 UTC

    If you have the memory, something like this will probably be faster. You save the if/else test on every line, as well as only doing the minimum in the loop (like not splicing and joining when we don't really need to). Even a saving of a few microseconds times millions of lines is substantial. Multiple calls to print are significantly slower than a single call, as the OS can buffer/write more efficiently.

    #!/usr/bin/perl
    my %fh;
    while ( <DATA> ) {
        my @field = split /,/;
        $fh{$field[2]} .= "$field[0],$field[1],$field[3]";
    }
    for my $file ( keys %fh ) {
        open F, ">$file" or die $!;
        print F $fh{$file};
        close F;
    }



      tachyon, your code is likely to be faster not so much because it shaves away Perl cycles but because it will greatly reduce disk seeks, which are probably dominating L~R's run time. (See my other post in this thread for more on this.)

      L~R: Assuming that you have the RAM, can you compare tachyon's code's run time to the other implementations? My guess is that tachyon's code will fare well. (If you don't have the RAM, just tweak the code so that it will process, say, 100_000 or so lines per pass and clear out %fh between passes. Also, you'll need to open files in append mode.)
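      For concreteness, here is a minimal sketch of that chunked variant. The tiny $CHUNK, the column index (2), and the in-memory stand-in for the real input are all illustrative choices, not anything from the original code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of the suggested tweak: buffer output lines per file, flush
# every $CHUNK input lines, and clear the buffers between passes.
# Opening in append mode ('>>') lets later passes add to earlier ones.
# $CHUNK is tiny here for demonstration; think 100_000 in real use.
my $CHUNK = 2;
my (%buf, $count);

sub flush_buffers {
    for my $file (keys %buf) {
        open my $out, '>>', $file or die "$file: $!";
        print $out $buf{$file};
        close $out;
    }
    %buf = ();
}

# In-memory stand-in for the real multi-million-line input.
my $data = "1,2,alpha,3\n4,5,beta,6\n7,8,alpha,9\n";
open my $in, '<', \$data or die $!;
while (<$in>) {
    my @field = split /,/;
    my $file  = splice @field, 2, 1;
    $buf{$file} .= join ',', @field;
    flush_buffers() if ++$count % $CHUNK == 0;
}
flush_buffers();    # write whatever is left from the final partial pass
```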


        I agree reducing the number of seeks you need is vital. Given an average 3 msec seek time you can only have 333 seeks per second. This is of course glacial. Ignoring buffering, the original code effectively needed 2 seeks (or more) per line; the improved version required at least 1 seek. In the example I presented, the number of seeks required is a function of the number of files we need to create, not the number of lines in the input file. This will be a significant improvement provided that the number of unique files is less than the number of input lines.



        I had thought about this myself after posting. The reason I didn't give it a lot of initial thought is because the Java developer made it clear that I was not welcome in the sandbox. My guess is that some sort of limited buffer would be best since that's still a whole lot of lines to be keeping all in memory.

        Cheers - L~R

      for my $file (keys %fh) will (or should) be slower than while (my ($file, $data) = each %fh)

        Doesn't placing a variable declaration in the conditional statement for a conditional loop occasionally lead to strange errors?

        print substr("Just another Perl hacker", 0, -2);
        - apotheon
        CopyWrite Chad Perrin

        While you are correct that accessing key/value pairs with each is a little faster, this is unlikely to influence runtime in any measurable way, as the output bottleneck lies with the OS and disk IO.



Re: Performance Trap - Opening/Closing Files Inside a Loop
by sgifford (Prior) on Dec 10, 2004 at 00:14 UTC

    Unless the Java programmer knows something you don't (like there are only 100 possibilities for filenames), I don't think their performance will beat an implementation that caches filehandles, as you described in your second example. I don't even think a C or Assembly program would be much faster; IMHO the time will be completely dominated by I/O and system calls.

    One trick to be able to keep many filehandles open without worrying about consuming more resources than you intend to is the FileCache module.
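    For the curious, a minimal sketch of the FileCache approach. The maxopen value, the in-memory input, and the %seen bookkeeping are illustrative; see the module's docs for the real knobs:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use FileCache maxopen => 16;    # cap on simultaneously open handles

# FileCache opens a path with '>' the first time and '>>' thereafter,
# quietly closing less recently used handles when the cap is reached,
# so the loop never exhausts file descriptors.
my %seen;
my $data = "1,2,logs_a,3\n4,5,logs_b,6\n7,8,logs_a,9\n";
open my $in, '<', \$data or die $!;
while (<$in>) {
    my @field = split /,/;
    my $file  = splice @field, 2, 1;
    my $fh    = cacheout $file;
    print $fh join ',', @field;
    $seen{$file}++;
}
cacheout_close($_) for keys %seen;    # flush everything before exit
```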

      like there are only 100 possibilities for filenames

      And knows what they are, so the files can be opened before entering the loop. This isn't the case.

      Cheers - L~R

Re: Performance Trap - Opening/Closing Files Inside a Loop
by runrig (Abbot) on Dec 10, 2004 at 01:15 UTC
    Have they mentioned yet how fast the java solution is? And how many LOC? Is it even done yet :) ?? Without knowing what data is typical, there's no way to tell which solution is best. Your first solution is the simplest and safest without knowing what the data is (and like you said, it would've been done by now anyway). The FileCache mentioned earlier or your hash solution might be the best depending on the data. But don't be too disappointed if they don't listen and think perl is "too slow" and don't realize that perl (or some other "scripting language") would've been perfect for this. Also, if they are too perl-inexperienced (and/or java-centric) to realize this, then any perl solution is unmaintainable to them.
Re: Performance Trap - Opening/Closing Files Inside a Loop
by tmoertel (Chaplain) on Dec 10, 2004 at 06:51 UTC
    Limbic~Region axed:
    Leaving Java aside, is there a more run-time efficient way than my second suggestion in Perl?
    Probably. (But in order to answer your question with confidence, I would need to know more about the OS and filesystem that you are using, the input size, and the distribution of files that must be opened for writing. Lacking that information, here is my best guess.)

    Assuming sufficiently small input size, we can load the entire input into RAM and build an optimal write plan before attempting further I/O. The plan's goal would be to minimize disk seek time, which is likely the dominant run-time factor under our control. An optimal strategy would probably be to open one file at a time, write all of its lines, close it, and then move on to the next file. If input size is larger than RAM, the speediest approach would then be to divide the input into RAM-sized partitions and process each according to its own optimal write plan.

    Caching the output filehandles (as in your second implementation) probably will not be competitive. Even if you can hold all of the output files open simultaneously, a write pattern that jumps among files seemingly at random will probably kill you with seek time. Your OS will do its best to combine writes and reduce head movement with elevator (and better) algorithms, but you'll still pay a heavy price. You'll do much better if you can keep the disk head writing instead of seeking.

    If it turns out that the number of distinct files to be created is nearly the same as the number of input lines, no strategy is likely to improve performance significantly over the naive strategy of opening and closing files as you walk line by line through the input.

    One more thing. If the input that Mr. Java tested your program against was millions of lines long, does that imply that your code may have been creating thousands of files? If so, you might want to determine whether the filesystem you were using has a hashed or tree-based directory implementation. If not, your run time may have been dominated by filesystem overhead. Many filesystems (e.g., ext2/3) bog down once you start getting more than a hundred or so entries in a directory.


      ... Many filesystems (e.g., ext2/3) bog down once you start getting more than a hundred or so entries in a directory.

      Actually, I have found VFAT to pale in comparison to ext3 and even ext2. ReiserFS should be even better, I've heard. YMMV, of course - RAM/processor(s) etc.


      A well-known approach to this 'many files' problem is to create an n-level directory structure based on filenames. File abc goes into a/b/abc, def goes into d/e/def, etc. (for n=2). Filenames are then typically randomly generated - and if they're not, you can use some transformation to create input for the directory. Reportedly, ReiserFS does this internally.
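      A sketch of that scheme in Perl. The fanout_path name and the underscore padding for names shorter than the nesting depth are invented for illustration:

```perl
use strict;
use warnings;

# Map a file name to an n-level directory path: "abc" -> "a/b/abc"
# for n = 2. Short names are padded with '_', an arbitrary choice.
sub fanout_path {
    my ($name, $levels) = @_;
    $levels ||= 2;
    my @chars = split //, $name;
    my @dirs  = map { defined $chars[$_] ? $chars[$_] : '_' } 0 .. $levels - 1;
    return join '/', @dirs, $name;
}
```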


        CPAN does something similar with author directories. Only a few authors are actually in the authors directory. The rest of us are in the id directory. :-)

        Being right, does not endow the right to be rude; politeness costs nothing.
        Being unknowing, is not the same as being stupid.
        Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
        Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

      We found that ext3 with the 2.4.x Linux kernel was reasonably happy with 10,000 files in a directory but obviously unhappy with 1 million. By reasonably happy I mean other bottlenecks dominated affairs. I would be interested if anyone has done a study on the relation of file count per dir vs access time for different file systems that is a little more precise.



        Out of curiosity, did you test with the 'dir_index' feature flag set? It allows the filesystem to use hashed b-trees for lookups in large directories.

      There is something to be said for letting the IO system and OS handle the buffering and writes. All filehandles have a write buffer which is only written out when it is full. One way to reduce seek times is to increase the buffer size for the opened files. The advantage is that Perl decides when to write the buffers, the OS decides when to write them to disk, and both are pretty good at this.

      The big advantage of caching filehandles is that the open files can hold output in the buffers until they are full. If they are continually being closed and reopened, then each line is being written individually.

      What is needed is some way to keep a limited number of filehandles open, to keep from hitting the limit. An LRU cache would be perfect. I see a couple of modules that implement this. Or reimplementing it would be pretty easy.
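      A rough sketch of such a hand-rolled LRU filehandle cache. The FHCache name and API are invented for illustration; CPAN modules (Tie::Cache, for one) offer sturdier versions:

```perl
use strict;
use warnings;

# Hand-rolled LRU filehandle cache: at most $max handles stay open;
# the least recently used one is closed to make room, and reopened
# in append mode if it is needed again later.
package FHCache {
    sub new {
        my ($class, $max) = @_;
        return bless { max => $max, fh => {}, stamp => {}, tick => 0 }, $class;
    }

    sub handle {
        my ($self, $file) = @_;
        unless ($self->{fh}{$file}) {
            if (keys %{ $self->{fh} } >= $self->{max}) {
                # evict the handle with the oldest access stamp
                my ($lru) = sort { $self->{stamp}{$a} <=> $self->{stamp}{$b} }
                            keys %{ $self->{fh} };
                my $old = delete $self->{fh}{$lru};
                close $old;
                delete $self->{stamp}{$lru};
            }
            open $self->{fh}{$file}, '>>', $file or die "$file: $!";
        }
        $self->{stamp}{$file} = ++$self->{tick};
        return $self->{fh}{$file};
    }
}
```

      A print in the main loop then becomes: print { $cache->handle($file) } $line;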

Re: Performance Trap - Opening/Closing Files Inside a Loop
by graff (Chancellor) on Dec 10, 2004 at 03:01 UTC
    is there a more run-time efficient way than my second suggestion in Perl?

    Given the original statement of the problem, with or without the (rather disingenuous) extension to the original problem, I would have suggested that it would help matters noticeably if the input were sorted with respect to the column containing the file name.

    The sorting would be really easy to do, either prior to passing the data to perl, or within the perl script (though there might be memory issues doing it in the script, if we're talking about millions of lines instead of dozens). I hope the esteemed java programmer knows about the unix "sort" command (and the fact that it's ported to windows)...

      Presumably, the files need to be appended in the order encountered. Part of the long story unmentioned is a lot of guarded responses to my inquiries for additional information. A cut | sort | uniq might not be a bad idea to pre-process the file to get a list of unique file names though.

      Cheers - L~R

        I wonder if all java programmers are this cagey/evasive about describing their problem sets...

        Even so, now we're just talking about a two-stage sort:

        ## let's suppose the file names are in column 3 of "table.txt":
        perl -pe 's/^/$.,/' table.txt | sort -t, -k 4,4 -k 1,1n | cut -f2- -d, |
        where "" is a version of your suggested script that assumes lines are pre-sorted by output file name -- so it really needs only one output file handle open at any one time. By prepending the original line numbers before sorting, and using the line numbers as a secondary sort field, the (presumably) intended result is achieved.

        (update: if the original table has file names in column 3, and a perl script prepends a line number to each line, then the primary sort column has to be 4, not 3.)
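        Under that scheme the final stage of the pipeline only ever needs one output handle open; a sketch, with an in-memory string standing in for the sorted stream:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumes input is already sorted by the file-name column (column 3,
# index 2 after split), so a new file is opened only when the name
# changes. $data stands in for the sorted pipeline output.
my $data = "4,5,blue,6\n1,2,red,3\n7,8,red,9\n";    # sorted by column 3
open my $in, '<', \$data or die $!;

my ($current, $out) = ('', undef);
while (<$in>) {
    my @field = split /,/;
    my $file  = splice @field, 2, 1;
    if ($file ne $current) {
        close $out if $out;
        open $out, '>>', $file or die "$file: $!";
        $current = $file;
    }
    print $out join ',', @field;
}
close $out if $out;
```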

Re: Performance Trap - Opening/Closing Files Inside a Loop
by kvale (Monsignor) on Dec 10, 2004 at 00:05 UTC
    Assume that there is a reasonably small number of column names. Then just pre-open all possible files before the main loop and write to the appropriate file handle each time through the loop. Caching is your friend :)


      I am likely being thick, but I don't understand. The value of the column (not a column name) is what is being used as the file name. It is not possible to know in advance the values without going through every line of every file first. Even if you did that, you would still need to store the information in a hash so that you could look up the filehandle corresponding to that value later so I see this as a slower variation on my proposed solution. What am I missing?

      Cheers - L~R

        Ah, sorry I wasn't clear. I assumed that one knew the (small) set of possible column values to be used as filenames. If you do not know this set of values, my method may still be faster, but prescanning the table will add some time to the execution.

        Once you have established a hashmap from column values to filehandles, then you can print to the desired filehandle. I expect a single hash lookup to be much faster than a pair of system calls for opening and closing files; in addition to the OS bookkeeping and disk IO overhead for opening and closing, each file buffer is flushed (and, depending on the OS and filesystem, the disk is written to) for every line written.

        Another completely different method is to append the lines to different strings, one for each column value. Then write all the strings out to files after the loop.


Re: Performance Trap - Opening/Closing Files Inside a Loop
by edoc (Chaplain) on Dec 10, 2004 at 03:21 UTC

    woah! way to get distracted! I just realised I've spent way too much time messin with this.. back to work..



Re: Performance Trap - Opening/Closing Files Inside a Loop
by TedPride (Priest) on Dec 10, 2004 at 10:03 UTC
    tachyon is correct that printing once to each file is much more efficient than printing hundreds of thousands of times. The question is, do you have enough memory to do this? I'd personally store output in a hash, with a filehandle and output buffer for each file, and just print to the filehandle whenever a buffer reached its share of data (allowable memory / number of files). Anything still in the buffers at the end of the run would also be printed to the files.

    Allowing 10 MB of memory and 100 files, for instance, you could print whenever the buffer for an individual file reached 100K or so.
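    A sketch of that buffering scheme. The byte limit is tiny here for demonstration (think ~100K per file as above), and the in-memory input is illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One buffer per output file, flushed (in append mode) whenever it
# crosses $LIMIT bytes, and again at end of run for the stragglers.
my $LIMIT = 8;
my %buf;

sub flush_file {
    my $file = shift;
    open my $fh, '>>', $file or die "$file: $!";
    print $fh $buf{$file};
    close $fh;
    $buf{$file} = '';
}

my $data = "1,2,buf_p,3\n4,5,buf_q,6\n7,8,buf_p,9\n";
open my $in, '<', \$data or die $!;
while (<$in>) {
    my @field = split /,/;
    my $file  = splice @field, 2, 1;
    $buf{$file} .= join ',', @field;
    flush_file($file) if length $buf{$file} >= $LIMIT;
}
flush_file($_) for grep length $buf{$_}, keys %buf;
```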

Re: Performance Trap - Opening/Closing Files Inside a Loop
by xorl (Deacon) on Dec 10, 2004 at 15:08 UTC
    Now this is a perfect example of the classic setup. The Java guy did not give you enough information before you began the project. Now the boss is going to be told that Perl is old and too slow. You'll be seeing a memo soon saying that all Perl programs will need to be converted to Java and that all Perl programmers will need to either learn Java or be fired. You'd better strike now. Have a meeting with the boss. Explain the problem, and have a working demo of your revised code that shows how much faster it is now that you have enough information. Good luck.
      I left out the office politics. This isn't about Java versus Perl in the workplace though it very much might be for this one developer. Read my reply here for more information. While it isn't my job to write code (at least not on a day to day basis), I do try and help out the contractors when I can. In this particular case, my help was only wanted if the problem was insignificant.

      Cheers - L~R

        Ah then you might be safe. But I don't think so.

        The more jobs I have, the more I realize with very few exceptions there is always some kind of office politics motivating most people's actions at work (and sometimes even outside of work). This has proven to be the case in the small companies (3-10 employees) to the large companies (4,000-10,000 employees) and all the sizes in between.

        I would still wonder why they'd ask you (the guy with the non-programming title) to solve a "simple" task with an "inferior" language. It could just be the contractor wants to show in his own mind that your company needs him, or it could be something more complex. I'd dig a little deeper and watch out for any rumors.

        Again Good Luck.

        On the bright side, your question has helped me fix a problem I was having. Thanks!

Re: Performance Trap - Opening/Closing Files Inside a Loop
by runrig (Abbot) on Dec 10, 2004 at 18:23 UTC
    (Warning: awkmonk post ahead): I almost forgot. I once had a similar problem, so I wrote an awk solution. Though in my case, I knew there were only about 5 different values for the file name, and there was a bit more error checking than this, but it was basically something like:
    #!/usr/bin/awk -f
    { print $1, $3 >> $2 }
    And just use a2p to get a perl version of this :)
Re: Performance Trap - Opening/Closing Files Inside a Loop
by DrHyde (Prior) on Dec 10, 2004 at 12:06 UTC
    I bet you're going to run out of file handles.

      I bet you're going to run out of file handles.

      I'll bet you I won't :-)

      $ ulimit -HSn 8192

      A file descriptor is really just an entry in a C array which is maintained by the OS. Although the default size of that array is typically 512, 1024, or some other power of 2 depending on the OS, you can make it virtually anything you want within reason. See this for some more details.



Re: Performance Trap - Opening/Closing Files Inside a Loop
by mattr (Curate) on Dec 11, 2004 at 18:38 UTC
    Cheers for an interesting post. I agree it stinks of personalities but anyway.

    I was agreeing with the idea of doing 10MB segments with an in-memory hash. But what I didn't quite understand is why they are using the filesystem as the database, sure it is possible but hardly seems useful. Add to that the restriction on filenames and even potential security problems if something gets injected there..

    I was thinking one of the tons of db-like options available might be useful, either as the end product or as the intermediate stage. If the java guys can't read a tied hash, mldbm or whatever you could use an sql db, anyway these things ought to be good at dealing with memory and disk write optimization. You can always dump the db to separate files if that's what you want.

    Anyway, the point is that you are intentionally not being told what the project is supposed to do, so watch your back! I would personally ask why on earth they are writing thousands of files to the disk, that is so 70s. Don't the java guys know how to use Oracle or whatever they have in the same room? :) And they waste your time too, talk about inefficient use of resources!

    Anyway, it would be really funny if the answer is just to use the SQL LOAD DATA INFILE command on a database you already have to solve the problem. You may be interested in the mysqlimport utility, which is an interface to that command.