

in reply to read X number of lines?

I don't think there's a way to do exactly what you want and pull out X number of lines in one shot. You're better off doing something like limiting the number of bytes you read at a time. I threw together a bit of code (posted below) that reads in a certain number of bytes, splits that buffer into an array of lines, and then processes those lines. With a reasonable read size, it seems to be consistently about 1.5 times faster.
#!/usr/bin/perl

use Benchmark;
use strict;

timethese(10, {
    'linebyline' => \&linebyline,
    'chunk'      => \&chunk
});

sub linebyline {
    open(FILE, "file");
    while(<FILE>) { }
    close(FILE);
}

sub chunk {
    my($buf, $leftover, @lines);
    open(FILE, "file");
    while(read FILE, $buf, 10240) {
        $buf = $leftover.$buf;
        @lines = split(/\n/, $buf);
        $leftover = ($buf !~ /\n$/) ? pop @lines : "";
        foreach (@lines) { }
    }
    close(FILE);
}

Benchmark: timing 10 iterations of chunk, linebyline...
     chunk: 60 wallclock secs (55.20 usr +  3.48 sys = 58.68 CPU)
linebyline: 95 wallclock secs (91.67 usr +  2.16 sys = 93.83 CPU)
These tests were run on a 25 meg file with roughly 1 million lines in it. This code is not guaranteed to work 100%, but I believe it is correct enough to serve benchmarking purposes well.
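One known gap in chunk(), for what it's worth: if the file doesn't end in a newline, whatever is left in $leftover when the read loop finishes never gets processed. A minimal sketch of a version that handles that case (the chunk_with_tail name and the post-loop handling are just illustrative, and I haven't benchmarked it) might look like:

sub chunk_with_tail {
    my($buf, $leftover, @lines);
    $leftover = "";
    open(FILE, "file");
    while(read FILE, $buf, 10240) {
        $buf = $leftover.$buf;
        @lines = split(/\n/, $buf);
        $leftover = ($buf !~ /\n$/) ? pop @lines : "";
        foreach (@lines) { }            # process each complete line
    }
    if (length $leftover) {
        # file had no trailing newline; process the final partial line
        foreach ($leftover) { }
    }
    close(FILE);
}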

RE: Re: read X number of lines?
by mikfire (Deacon) on May 26, 2000 at 16:38 UTC
    Just for something kind of different, you can play fun games with the input record separator and end up with something like
    sub IRS_chunky {
        my($buf, $leftover, @lines);
        local $/ = \10240;
        open(FILE, "totalfoo");
        while( $buf = <FILE> ) {
            $buf = $leftover.$buf;
            @lines = split(/\n/, $buf);
            $leftover = ($buf !~ /\n$/) ? pop @lines : "";
            foreach (@lines) { }
        }
        close(FILE);
    }
    which is comparable, on my machine, to chunky.

    My real question, though, is what kind of machines are you running this on? My benchmark results are completely different from yours. Having run this three times on a Sun Ultra 5, on a file of roughly 500,000 lines and 24Mb read from my local IDE drive, my results were pretty consistently like this:

    perl test_read.pl
    Benchmark: timing 10 iterations of Chunky IRS, chunk, linebyline...
    Chunky IRS: 41 wallclock secs (31.31 usr +  3.71 sys = 35.02 CPU)
         chunk: 40 wallclock secs (31.00 usr +  3.81 sys = 34.81 CPU)
    linebyline: 27 wallclock secs (17.67 usr +  2.47 sys = 20.14 CPU)
    The code I used was identical to that in the earlier post, with the subroutine I wrote added. Running under perl 5.6 generated the same basic results, plus or minus 1 on each stat. I am now somewhat confused. Is this a difference in the way Solaris uses its buffers? What platform/OS were the original tests run on?

    mikfire

      I'm missing something here. What does assigning $/=\10240 mean? Thanks,

      --ZZamboni

        Something new and twisted they added in perl 5.005. To quote perldoc perlvar:
        Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer. So this:

            $/ = \32768; # or \"32768", or \$var_containing_32768
            open(FILE, $myfile);
            $_ = <FILE>;

        will read a record of no more than 32768 bytes from FILE. If you're not reading from a record-oriented file (or your OS doesn't have record-oriented files), then you'll likely get a full chunk of data with every read. If a record is larger than the record size you've set, you'll get the record back in pieces.

        mikfire

RE: Re: read X number of lines?
by eduardo (Curate) on May 26, 2000 at 08:32 UTC
    wow... I changed the read size to 8192 (two 4K blocks...) and I got these results on a 16 meg "test" file:
    Benchmark: timing 10 iterations of chunk, linebyline...
         chunk:  9 wallclock secs ( 7.66 usr +  1.58 sys =  9.24 CPU)
    linebyline: 34 wallclock secs (33.17 usr +  0.80 sys = 33.97 CPU)
    damn... that makes a BIG difference... screw line by line...
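    If you want to find the sweet spot on your own box, one way (just a sketch; the test file name and the list of sizes are whatever you feel like trying) is to benchmark the same chunked read at several buffer sizes:

    #!/usr/bin/perl
    use Benchmark;
    use strict;

    # time the chunked read at several buffer sizes
    my %tests;
    foreach my $size (1024, 4096, 8192, 10240, 32768) {
        $tests{"chunk_$size"} = sub { chunk($size) };
    }
    timethese(10, \%tests);

    sub chunk {
        my $bufsize = shift;
        my($buf, $leftover, @lines);
        $leftover = "";
        open(FILE, "file") or die "can't open file: $!";
        while(read FILE, $buf, $bufsize) {
            $buf = $leftover.$buf;
            @lines = split(/\n/, $buf);
            $leftover = ($buf !~ /\n$/) ? pop @lines : "";
            foreach (@lines) { }
        }
        close(FILE);
    }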
      You can even squeeze a little more out. This is ~5% faster on my system.
      my($buf, $chunk, @lines);
      while(read FILE, $buf, BUFFER_SIZE) {
          $chunk .= $buf;
          @lines = split /\n/, $chunk;
          $chunk = chomp($buf) ? '' : pop @lines;   # parens make the precedence explicit
          foreach (@lines) {}
      }
      (BUFFER_SIZE is just a constant I was using)
      Although this is a Perl site, when you start processing data a block at a time I start thinking C/C++. Perhaps you should consider that as well... For data organized with \t and \n delimiters you should be able to parse it quickly in C. Whether you can do whatever operations you need on that data as easily in C is another question, but one worth asking yourself if runtime is a serious consideration. I'd also double-check for any place you might need an eof(). Possible corner cases that would make good tests for the chunk() code are a file with \n as its last character and one without; a sketch of such a test follows.
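      A quick way to build those two test cases and check them (the file names and the 10240 read size here are just for illustration):

      #!/usr/bin/perl
      use strict;

      # build two tiny test files: one with a trailing "\n", one without
      open(OUT, "> with_nl")    or die $!;
      print OUT "one\ntwo\nthree\n";
      close(OUT);
      open(OUT, "> without_nl") or die $!;
      print OUT "one\ntwo\nthree";        # no trailing newline
      close(OUT);

      # a correct chunked reader should count 3 lines in both files
      foreach my $file ("with_nl", "without_nl") {
          my($buf, $leftover, $count) = ("", "", 0);
          open(FILE, $file) or die $!;
          while(read FILE, $buf, 10240) {
              $buf = $leftover.$buf;
              my @lines = split(/\n/, $buf);
              $leftover = ($buf !~ /\n$/) ? pop @lines : "";
              $count += @lines;
          }
          $count++ if length $leftover;   # last partial line, if any
          close(FILE);
          print "$file: $count lines\n";
      }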