read X number of lines?

by eduardo (Curate)
on May 26, 2000 at 04:34 UTC ( [id://14900] )

eduardo has asked for the wisdom of the Perl Monks concerning the following question:

Let us say that I have a file that is very big... let us say that it is larger than the amount of RAM on the box I am working on. I have to process each and every single line, and the typical way to do this is in a
while (<INFILE>) { ... do stuff here ... }
block. Now, as we all know (thanks to the von Neumann bottleneck), disk I/O is going to be the slowest part of this process, and I have to do this for well over 22 million lines, so I can't just slurp it all into one massive mega-scalar. So, my question is: in Perl, is there a way of saying "read the next 10,000 lines in one shot and put them in this array"? I mean, I want to read a large number of lines, put them into memory, and then do something like "write the entire contents of this array in one shot". Am I dreaming here? Is this even possible? Help me out here!
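A minimal sketch of the kind of batching being asked about; "infile", "outfile", the 10,000-line batch size, and the processing step are all placeholders:

# A minimal sketch of the batching described above; file names, batch size,
# and the processing step are placeholders.
use strict;

open(IN,  "infile")   or die "can't read infile: $!";
open(OUT, ">outfile") or die "can't write outfile: $!";

my @batch;
while (my $line = <IN>) {
    push @batch, $line;
    next if @batch < 10_000;      # keep collecting until we have 10,000 lines

    # ... process each line in @batch here ...
    print OUT @batch;             # write the whole batch out in one call
    @batch = ();
}
if (@batch) {                     # flush whatever is left (fewer than 10,000 lines)
    # ... process the final partial batch ...
    print OUT @batch;
}
close(IN);
close(OUT);

Note that <IN> is already buffered under the hood, so this mostly batches the writes; the replies below are about cutting down the per-line read overhead as well.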

Re: read X number of lines?
by plaid (Chaplain) on May 26, 2000 at 07:37 UTC
    I don't think there's a way to do exactly what you want and pull out X number of lines in one shot. You're better off doing something like limiting the number of bytes you read in at a time. I threw together a bit of code (posted below) which reads in a certain number of bytes, splits that into an array, and then processes those lines. Used with a reasonable number of bytes to read in, it seems to be consistently about 1.5 times faster.
    #!/usr/bin/perl
    use Benchmark;
    use strict;

    timethese(10, { 'linebyline' => \&linebyline, 'chunk' => \&chunk });

    sub linebyline {
        open(FILE, "file");
        while(<FILE>) { }
        close(FILE);
    }

    sub chunk {
        my($buf, $leftover, @lines);
        open(FILE, "file");
        while(read FILE, $buf, 10240) {
            $buf = $leftover.$buf;
            @lines = split(/\n/, $buf);
            $leftover = ($buf !~ /\n$/) ? pop @lines : "";
            foreach (@lines) { }
        }
        close(FILE);
    }

    Benchmark: timing 10 iterations of chunk, linebyline...
         chunk: 60 wallclock secs (55.20 usr +  3.48 sys = 58.68 CPU)
    linebyline: 95 wallclock secs (91.67 usr +  2.16 sys = 93.83 CPU)
    These tests were run on a 25 meg file with roughly 1 million lines in it. This code is not guaranteed to work 100%, but I believe it is correct enough to serve benchmarking purposes well.
      Just for something kind of different, you can play fun games with the input record separator and end up with something like
      sub IRS_chunky {
          my($buf, $leftover, @lines);
          local $/ = \10240;
          open(FILE, "totalfoo");
          while( $buf = <FILE> ) {
              $buf = $leftover.$buf;
              @lines = split(/\n/, $buf);
              $leftover = ($buf !~ /\n$/) ? pop @lines : "";
              foreach (@lines) { }
          }
          close(FILE);
      }
      which is comparable, on my machine, to chunky.

      My real question, though, is what kind of machine are you running this on? My benchmark results are completely different from yours. Having run this three times on a Sun Ultra 5, against a file of approximately 500,000 lines and 24 MB on my local IDE drive, my results were pretty consistently like this

      perl test_read.pl
      Benchmark: timing 10 iterations of Chunky IRS, chunk, linebyline...
      Chunky IRS: 41 wallclock secs (31.31 usr +  3.71 sys = 35.02 CPU)
           chunk: 40 wallclock secs (31.00 usr +  3.81 sys = 34.81 CPU)
      linebyline: 27 wallclock secs (17.67 usr +  2.47 sys = 20.14 CPU)
      The code I used was identical to the earlier post, with the subroutine I wrote added. Using perl-5.6 generated the same basic results, plus or minus 1 for each stat. I am now somewhat confused. Is this a difference in the way Solaris uses its buffers? What platform/OS were the original tests run on?

      mikfire

        I'm missing something here. What does assigning $/=\10240 mean? Thanks,

        --ZZamboni
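        Assigning a reference to an integer to $/ puts the filehandle into fixed-size record mode: each <FILE> then returns up to that many bytes instead of reading through to the next newline (see perlvar). A minimal sketch, with the file name and the 10240-byte record size as placeholders:

        # A minimal sketch of record-mode reads; file name and size are placeholders.
        local $/ = \10240;                   # $/ holds a reference to an integer...
        open(FH, "somefile") or die "can't open somefile: $!";
        while (defined(my $record = <FH>)) { # ...so each read returns a 10240-byte record
            # $record is a raw block, not a line; it may well end mid-line
        }
        close(FH);

        That is why IRS_chunky above still has to split the block on \n and carry the leftover partial line into the next pass.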

      Wow... I changed the read size to 8192 (two 4k blocks...) and I got these results on a 16 meg "test" file:
      Benchmark: timing 10 iterations of chunk, linebyline...
           chunk:  9 wallclock secs ( 7.66 usr +  1.58 sys =  9.24 CPU)
      linebyline: 34 wallclock secs (33.17 usr +  0.80 sys = 33.97 CPU)
      damn... that makes a BIG difference... screw line by line...
        You can even squeeze a little more out. This is ~5% faster on my system.
        my($buf, $chunk, @lines);
        while(read FILE, $buf, BUFFER_SIZE) {
            $chunk .= $buf;
            @lines = split /\n/, $chunk;
            $chunk = chomp($buf) ? '' : pop @lines;   # parens matter: chomp is a list op
            foreach (@lines) {}
        }
        (BUFFER_SIZE is just a constant I was using)
        Although this is a Perl site, when you start playing with things as a block of data, I start thinking C/C++. Perhaps you should consider that as well... For data organized with \t and \n delimiters, you should be able to parse it quickly in C. Whether you can do whatever operations on that data easily in C is another question, but one I think is worth asking yourself if runtime is a serious consideration. I'd also double-check for any place you might need an eof(). Corner cases that would make good tests for the chunk() code are a file whose last character is a \n and a file whose last character is not; a sketch of such a test follows.
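        A minimal sketch of that test; the file name and sizes are placeholders, and count_chunked() is just the chunk() loop recast to count lines, with an extra check at the end so an unterminated final line isn't dropped:

        # A minimal sketch of the trailing-newline corner-case test suggested above.
        use strict;

        sub make_file {
            my($name, $trailing_newline) = @_;
            open(OUT, ">$name") or die "can't write $name: $!";
            print OUT "line $_\n" for 1 .. 5000;
            print OUT "last line, no newline" unless $trailing_newline;
            close(OUT);
        }

        sub count_chunked {
            my($name) = @_;
            my($buf, $leftover, $count) = ("", "", 0);
            open(FILE, $name) or die "can't read $name: $!";
            while (read FILE, $buf, 10240) {
                $buf = $leftover . $buf;
                my @lines = split(/\n/, $buf);
                $leftover = ($buf !~ /\n$/) ? pop @lines : "";
                $count += @lines;
            }
            close(FILE);
            $count++ if length $leftover;   # the leftover is a final line with no trailing \n
            return $count;
        }

        for my $trailing (1, 0) {
            make_file("cornercase.tmp", $trailing);
            my $expected = $trailing ? 5000 : 5001;
            my $got      = count_chunked("cornercase.tmp");
            print "trailing newline = $trailing: got $got, expected $expected\n";
            unlink "cornercase.tmp";
        }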
RE: read X number of lines?
by lhoward (Vicar) on May 26, 2000 at 18:06 UTC
    There should be a CPAN module to do the "read a block and buffer" method transparently (especially since we have tied filehandles), but I can't seem to find one. I'm going to poke around CPAN a bit more and see if I can find one. If I can't find one, and no one here objects, I'll probably build a module to do that (giving credit to Perlmonks.org and the monks who contributed to this discussion).
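    A minimal sketch of what such a tied filehandle might look like; the package name, buffer size, interface, and "bigfile" path are placeholders, not an existing CPAN module:

    # A tied filehandle that reads in big blocks but hands lines back one at a
    # time. Only scalar-context <FH> is handled; names and sizes are placeholders.
    package BlockLines;
    use strict;
    use IO::File;

    sub TIEHANDLE {
        my($class, $path, $size) = @_;
        my $fh = IO::File->new($path) or die "can't open $path: $!";
        return bless { fh => $fh, size => $size || 65536, lines => [], tail => '' }, $class;
    }

    sub READLINE {
        my $self = shift;
        while (!@{ $self->{lines} }) {
            my $buf;
            if (!read($self->{fh}, $buf, $self->{size})) {   # EOF (or read error)
                return undef unless length $self->{tail};
                my $last = $self->{tail};                    # unterminated final line
                $self->{tail} = '';
                return $last;
            }
            $buf = $self->{tail} . $buf;
            my @lines = split(/\n/, $buf, -1);               # -1 keeps the trailing field
            $self->{tail} = pop @lines;                      # partial line after the last "\n"
            @{ $self->{lines} } = map { "$_\n" } @lines;
        }
        return shift @{ $self->{lines} };
    }

    sub CLOSE { $_[0]->{fh}->close }

    package main;
    tie *BIG, 'BlockLines', "bigfile", 10240;
    while (defined(my $line = <BIG>)) {
        # ... process one line at a time; reads happen 10240 bytes at a time ...
    }
    close(BIG);

    A real module would also want to handle list-context READLINE, PRINT, EOF, and friends, but this is the basic shape of the idea.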
Re: read X number of lines?
by BigJoe (Curate) on May 26, 2000 at 07:11 UTC
    What are you reading in, the Windows kernel? If you have the Perl Cookbook, look at page 274. Do a:
    read(HANDLE, $buffer, $size);          # read $size bytes into $buffer
    then use:
    sysseek(HANDLE, $position, $whence);   # $whence: 0 = start, 1 = current, 2 = end


    But this is the extent of what I understand about what you are trying to do.
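    For what it's worth, a small sketch of the direction those calls point in; note that sysseek bypasses Perl's buffered I/O, so it is normally paired with sysread rather than read (the file name, offset, and block size below are placeholders):

    # A minimal sketch, not from the Cookbook: seek to a byte offset, then read
    # fixed-size blocks with sysread. Names and sizes are placeholders.
    use strict;
    use Fcntl qw(SEEK_SET);

    open(HANDLE, "bigfile") or die "can't open bigfile: $!";

    my $offset = 0;          # where to start reading
    my $block  = 10240;      # how many bytes per read
    my $buffer;

    sysseek(HANDLE, $offset, SEEK_SET) or die "sysseek failed: $!";
    while (my $got = sysread(HANDLE, $buffer, $block)) {
        # $buffer holds up to $block raw bytes; split it into lines (carrying a
        # partial last line over to the next pass) the same way chunk() does above
        $offset += $got;
    }
    close(HANDLE);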
      Hm... let me try to describe it a bit better. I have a series of files, some of them up to 800 megs (and actually I think larger... yep, 1.6 gigs... damn...), and they are all organized so that there is one record per line. Instead of: read a line, process a line, write out a line, repeat... which is slow (and hard on the I/O subsystem), I would like to: read 10,000 lines, process 10,000 lines, write them out, repeat. My theory is that this would be more efficient I/O-wise... does that make better sense?
