http://qs321.pair.com?node_id=894160

TRoderic has asked for the wisdom of the Perl Monks concerning the following question:

Hello!

I'm currently working on a document processor that reads most of a directory into memory, parses the files hidden therein, and then dumps the resultant data to several different files on disc (the reason is obscure and distressing).

I have found, however, that the log file writes are becoming a bottleneck, since they go to several different files (and, at some point soon, probably over network shares). That is an issue given there are about 800k files of roughly 1 MB each to process on a more-than-daily basis (while waiting for the DBA to get on with it).

Therefore I would like to hold some quantity of the output in memory until a limit is reached, then dump it to disc, viz:

<<loop>> {
    if ($hashmatch{$scrutiny}) {
        $$bufmem = $$bufmem . "<<secret parsing output goes here!>>\n";
        if (length($$bufmem) > 1024) {
            print("buffer dump\n");
            $outfh->print($$bufmem);
            $$bufmem = "";
        }
    }
}
$outfh->print($$bufmem);
I am sure there is a better/more portable way of doing this, but I can't seem to describe it to the search engine daemons so as to get the right sort of answer. Can anyone describe, direct, or otherwise enlighten me in this regard?

yrs,

TR

Re: temp hold logfiles in memory?
by Eliya (Vicar) on Mar 19, 2011 at 16:14 UTC

    Writing to a file handle is by default buffered anyway (i.e. unless it's unbuffered (autoflushed) or line-buffered (interactive/terminal)). As the buffer size is 4k, it doesn't seem to make much sense to do additional buffering yourself at 1k granularity (though maybe that's not the real value you finally intend to use...).

    If you want a larger buffer (the 4k cannot be changed without recompiling perl), maybe you could open the log file handle to a scalar, and periodically check its size...

    open(my $outfh, '>', \$buf)
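
    A minimal, untested sketch of that idea (the flush_log() helper, the parse.log path, and the 1 MB limit are all illustrative, not from the original post):

    use strict;
    use warnings;

    # Collect log output in memory via a filehandle opened on a scalar.
    my $buf = '';
    open(my $logfh, '>', \$buf) or die "can't open in-memory handle: $!";

    # The real destination on disc.
    open(my $diskfh, '>>', 'parse.log') or die "can't open parse.log: $!";

    my $LIMIT = 1_048_576;    # flush once roughly 1 MB has accumulated

    sub flush_log {
        return unless length $buf;
        print {$diskfh} $buf;    # one big write to the real file
        $buf = '';               # empty the in-memory log...
        seek($logfh, 0, 0);      # ...and rewind the handle so new writes start at 0
    }

    # Stand-in for the real parsing loop: the existing code keeps printing
    # to $logfh exactly as before; only the periodic size check is new.
    while (my $line = <STDIN>) {
        print {$logfh} "parsed: $line";
        flush_log() if length($buf) > $LIMIT;
    }

    flush_log();    # write out whatever is left at the end
    close $diskfh;
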
Re: temp hold logfiles in memory?
by ELISHEVA (Prior) on Mar 19, 2011 at 20:35 UTC

    You mention that your program is 4-5 seconds per minute slower. Depending on the total number of minutes, that may not be a lot of time in the long view. Compare the amount of time you will save over the lifetime of the program's use to the amount of time it will take to find the true source of the slowdown and optimize it away. Unless you are running this program over and over, need real-time responsiveness, or have a marketing reason to look fast and slick, you may want to ask whether optimization is even necessary.

    If you still feel you need to optimize, you might be able to confirm your suspicions about the cause by running your script with the -d:DProf option. This will generate a file named tmon.out in the current directory, which contains raw profiling data. You can analyze the contents of that file by running the command dprofpp, which should be part of your Perl installation. This will give you an idea of how much time is taken by each subroutine. Your shell session would look something like this:

    $ perl -d:DProf myscript.pl
    $ dprofpp

    If you are still convinced that buffering is the source of your problem, you may want to consider using the "sys" family of functions. These are low-level functions that work directly with the file and will allow you to define your own buffer size and buffering strategy (a small sketch follows below):

    • sysopen to open the file
    • syswrite to dump out your buffer
    • sysread to read
    • sysseek to move the cursor to some place other than the end of the previous read or write.

    If you go this route, do not use the normal Perl file I/O (open, read, print, seek, tell) on the same file handle, or your file handle will give you confusing and not terribly helpful results.
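
    A minimal, untested sketch of that route (the parse.log path and the 1 MB limit are illustrative, not from the original post); per the caution above, only the sys* calls ever touch the handle:

    use strict;
    use warnings;
    use Fcntl qw(O_WRONLY O_CREAT O_APPEND);

    # Open the log with sysopen so every write bypasses Perl's own buffering.
    sysopen(my $logfh, 'parse.log', O_WRONLY | O_CREAT | O_APPEND, 0644)
        or die "sysopen parse.log: $!";

    my $buffer = '';
    my $LIMIT  = 1_048_576;    # flush once roughly 1 MB is queued up

    sub flush_buffer {
        return unless length $buffer;
        my $written = syswrite($logfh, $buffer);
        die "syswrite failed: $!" unless defined $written;
        substr($buffer, 0, $written, '');   # keep any unwritten tail for next time
    }

    # Stand-in for the real parsing loop: append records, flush past the limit.
    for my $record (@ARGV) {
        $buffer .= "$record\n";
        flush_buffer() if length($buffer) > $LIMIT;
    }

    flush_buffer();    # write out whatever is left
    close $logfh;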

    Before you make any changes, you should restructure your program so that the old file buffering code is encapsulated in a subroutine. The new code should also be in a subroutine. That way you can swap out the old and new code at will to compare them. Encapsulating the two approaches (Perl buffering vs. custom buffering) will also let you use a wonderful optimization tool, a core module built into Perl: Benchmark. This module lets you compare two subroutines to see which is faster. If you are going to put in the effort to optimize, it pays to make sure that you are actually making an improvement.
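
    For instance, a sketch using Benchmark's cmpthese, assuming the two approaches have been wrapped in subroutines (the names and bodies below are placeholders, not real implementations):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Placeholders for the real, encapsulated implementations.
    sub with_perl_buffering   { return }   # print-based version goes here
    sub with_custom_buffering { return }   # syswrite-based version goes here

    # Run each variant for at least one CPU second and print a comparison table.
    cmpthese(-1, {
        'perl buffering'   => \&with_perl_buffering,
        'custom buffering' => \&with_custom_buffering,
    });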

    Finally, you may want to take a look at this Unix Review article, which goes over some of the basic tools for optimizing programs: Speeding up your Perl programs.

      Also, don't forget to have a look at Devel::NYTProf, which generates nice HTML reports :). It has saved me a lot of hours of debugging and profiling...
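
      Usage is much like the DProf example above; something along these lines (the script name is illustrative):

      $ perl -d:NYTProf myscript.pl
      $ nytprofhtml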

      Thank you *so much* for that; it's exactly the sort of information I was looking for.

      The program is internal, but likely to be repackaged and re-purposed a good many times in the near future, hence my attempt to get it as close to right as possible now, before some new purpose highlights an old oversight.

      With the information from that subroutine timer, I would not be surprised if the actual process can be improved in the right direction!

      thanks again.

Re: temp hold logfiles in memory?
by graff (Chancellor) on Mar 19, 2011 at 17:04 UTC
    As indicated in the first reply, it seems likely that your diagnosis of the bottleneck is a bit off the mark. What sort of evidence do you have about where most of the runtime is being spent? Have you profiled the code in any way?
      My diagnosis is, perhaps incorrectly, based on noticing an improvement of about 4-5 seconds per minute when running the same batch with a 1 megabyte buffer (using the loop I posted) versus the default file handle method.

      I'm generally trying to move most of the operation into memory, as the process itself consists of 300 lines of regex match statements for looping through plain text and/or html files.

      As for profiling the code, I'm still very much a novice and am only now branching into code optimisation. If you've got some pointers there, it would be greatly appreciated :)

        You may also want to benchmark the "300 lines of regex match statements". These statements may be taking more time than the I/O, so don't rule them out. For example, could a simple "index" replace some of the regex tests? Also look at the order of the tests; sometimes just reordering the sequence of tests will improve performance.
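
        For example (the marker string and input line are illustrative), a fixed-string test can often be done with index instead of a regex:

        use strict;
        use warnings;

        my $line = 'some plain text containing the marker ERROR somewhere';

        # Regex version: works, but the regex engine is overkill for a fixed string.
        if ($line =~ /ERROR/) {
            print "regex: found it\n";
        }

        # index() version: a plain substring search, often faster for literals.
        if (index($line, 'ERROR') >= 0) {
            print "index: found it\n";
        }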

        Good Luck

        "Well done is better than well said." - Benjamin Franklin