read X number of lines?

eduardo has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: read X number of lines? by plaid (Chaplain) on May 26, 2000 at 07:37 UTC
I don't think there's a way to do exactly what you want and pull out X number of lines in one shot. You're better off doing something like limiting the number of bytes you read in at a time. I threw together a bit of code (posted below) which reads in a fixed number of bytes, splits that into an array, and then processes those lines. Used with a reasonable read size, it seems to consistently be about 1.5 times faster.

    #!/usr/bin/perl
    use Benchmark;
    use strict;

    timethese(10, {
        'linebyline' => \&linebyline,
        'chunk'      => \&chunk
    });

    # Baseline: let Perl hand back one line per iteration.
    sub linebyline {
        open(FILE, "file");
        while(<FILE>) { }
        close(FILE);
    }

    # Read 10240 bytes at a time and split each buffer into lines.
    sub chunk {
        my($buf, $leftover, @lines);
        open(FILE, "file");
        while(read FILE, $buf, 10240) {
            # Prepend the partial line carried over from the last buffer.
            $buf = $leftover.$buf;
            @lines = split(/\n/, $buf);
            # If the buffer stopped mid-line, hold that piece back.
            $leftover = ($buf !~ /\n$/) ? pop @lines : "";
            foreach (@lines) { }
        }
        close(FILE);
    }

    Benchmark: timing 10 iterations of chunk, linebyline...
    chunk: 60 wallclock secs (55.20 usr + 3.48 sys = 58.68 CPU)
    linebyline: 95 wallclock secs (91.67 usr + 2.16 sys = 93.83 CPU)

These tests were run on a 25 meg file with roughly 1 million lines in it. This code is not guaranteed to work 100%, but I believe it is correct enough to serve benchmarking purposes well.
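The post above says the chunk code is not guaranteed to be fully correct. Two gaps are visible in the loop as written: a final line with no trailing newline stays in `$leftover` and is never processed, and `split` without a limit drops blank lines that fall at the end of a buffer. What follows is a minimal sketch of one way to close both gaps, not the original poster's code; the helper name `for_each_line_chunked` and its callback interface are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): read $fh in $size-byte chunks
# and call $callback once per line, with the newline already removed.
sub for_each_line_chunked {
    my ($fh, $size, $callback) = @_;
    my $leftover = '';
    my $buf;
    while (read($fh, $buf, $size)) {
        $buf = $leftover . $buf;
        # A limit of -1 keeps trailing empty fields, so blank lines survive.
        my @lines = split /\n/, $buf, -1;
        # The last field is '' if the buffer ended in "\n",
        # otherwise it is a partial line to carry forward.
        $leftover = pop @lines;
        $callback->($_) for @lines;
    }
    # Flush a final line that had no trailing newline.
    $callback->($leftover) if length $leftover;
}

# Demo on an in-memory file with a blank line and an unterminated last line.
my $data = "one\ntwo\n\nthree";
open(my $fh, '<', \$data) or die "open failed: $!";
my @seen;
for_each_line_chunked($fh, 4, sub { push @seen, $_[0] });
close($fh);
print join('|', @seen), "\n";
```

The 4-byte read size is only there to force chunk boundaries to land inside lines; a real run would use something closer to the 8192 or 10240 bytes discussed in the thread.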
RE: Re: read X number of lines? by mikfire (Deacon) on May 26, 2000 at 16:38 UTC
Just for something kind of different, you can play fun games with the input record separator and end up with something like

`sub IRS_chunky { my($buf, $leftover, @lines); local $/ = \10240; open(FILE, "totalfoo"); while( $buf = <FILE> ) { $buf = $leftover.$buf; @lines = split(/\n/, $buf); $leftover = ($buf !~ /\n$/) ? pop @lines : ""; foreach (@lines) { } } close(FILE); }`

which is comparable, on my machine, to chunk.

My real question, though, is what kind of machines are you running this on? My benchmark results are completely different from yours. Having run this three times on a Sun Ultra 5, on a file of approximately 500,000 lines and 24 MB read from my local IDE drive, my results were pretty consistently like this

`perl test_read.pl Benchmark: timing 10 iterations of Chunky IRS, chunk, linebyline... Chunky IRS: 41 wallclock secs (31.31 usr + 3.71 sys = 35.02 CPU) chunk: 40 wallclock secs (31.00 usr + 3.81 sys = 34.81 CPU) linebyline: 27 wallclock secs (17.67 usr + 2.47 sys = 20.14 CPU)`

The code I used was identical to the earlier post, with the subroutine I wrote added. Using perl-5.6 generated the same basic results, plus or minus 1 for each stat. I am now somewhat confused. Is this a difference in the way Solaris uses its buffers? What platform/OS were the original tests run on?

mikfire
RE: RE: Re: read X number of lines? by ZZamboni (Curate) on May 26, 2000 at 17:17 UTC
I'm missing something here. What does assigning `$/ = \10240` mean? Thanks, --ZZamboni
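For reference, perlvar documents this form: when `$/` holds a reference to an integer, `<FILE>` stops returning newline-terminated lines and instead returns fixed-size records of at most that many bytes. The short sketch below shows the behavior on a tiny in-memory string; the 4-byte record size and the sample data are assumptions chosen only to keep the output readable.

```perl
use strict;
use warnings;

my $data = "abcdefghij";    # 10 bytes of sample input
open(my $fh, '<', \$data) or die "open failed: $!";

my @records;
{
    # A reference to an integer switches readline to record mode:
    # each read returns up to 4 bytes, regardless of newlines.
    local $/ = \4;
    while (my $rec = <$fh>) {
        push @records, $rec;
    }
}
close($fh);

print join('|', @records), "\n";
```

Using `local` confines the change to the enclosing block, so any other code that reads lines still sees the default separator.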
RE: RE: RE: Re: read X number of lines? by mikfire (Deacon) on May 26, 2000 at 17:37 UTC
RE: Re: read X number of lines? by eduardo (Curate) on May 26, 2000 at 08:32 UTC
wow... I changed the read size to 8192 (2 4k blocks...) and I got these results on a 16 meg "test" file: `Benchmark: timing 10 iterations of chunk, linebyline... chunk: 9 wallclock secs ( 7.66 usr + 1.58 sys = 9.24 CPU) linebyline: 34 wallclock secs (33.17 usr + 0.80 sys = 33.97 CPU)` damn... that makes a BIG difference... screw line by line...
RE: RE: Re: read X number of lines? by takshaka (Friar) on May 26, 2000 at 10:59 UTC
You can even squeeze a little more out. This is ~5% faster on my system. `my($buf, $chunk, @lines); while(read FILE, $buf, BUFFER_SIZE) { $chunk .= $buf; @lines = split /\n/, $chunk; $chunk = chomp $buf ? '' : pop @lines; foreach (@lines) {} }` (BUFFER_SIZE is just a constant I was using)
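The `chomp $buf ? '' : pop @lines` line relies on `chomp` returning the number of characters it removed, so the result is true exactly when the buffer ended on a line boundary. A small sketch of that return value, using made-up sample strings:

```perl
use strict;
use warnings;

my $ended_with_newline = "last complete line\n";
my $ended_mid_line     = "partial li";

# chomp returns how many characters it removed (0 if there was no newline).
my $removed_a = chomp($ended_with_newline);
my $removed_b = chomp($ended_mid_line);

print "removed: $removed_a and $removed_b\n";
```

That replaces the regex test from the earlier version with a single built-in call, which is plausibly where the small speedup comes from.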
RE: RE: Re: read X number of lines? by Anonymous Monk on May 26, 2000 at 11:33 UTC
Although this is a Perl site, when you start playing with things as a block of data, I start thinking C/C++. Perhaps you should consider that as well... For data organized with \t and \n delimiters you should be able to parse it quickly in C. Whether you can easily do your operations on that data in C is another question, but one I think is worth asking yourself if runtime is a serious consideration. I'd also double check for any place you might need an eof(). Possible corner cases that might be good tests for the chunk() code are where you have a \n as the last char in the file, and where you do not.
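The two corner cases named above can be checked with a small harness. This is a sketch under stated assumptions: `count_chunk` copies the loop body of chunk() from the thread but counts lines instead of discarding them, the helper names are invented here, and the 4-byte buffer is chosen so that chunk boundaries matter on tiny inputs.

```perl
use strict;
use warnings;

# Count lines the ordinary way, one readline call per line.
sub count_linebyline {
    my ($data) = @_;
    open(my $fh, '<', \$data) or die "open failed: $!";
    my $n = 0;
    while (my $line = <$fh>) { $n++ }
    close($fh);
    return $n;
}

# Count lines using the same loop body as chunk() in the thread.
sub count_chunk {
    my ($data, $size) = @_;
    open(my $fh, '<', \$data) or die "open failed: $!";
    my ($buf, @lines);
    my $leftover = '';
    my $n = 0;
    while (read $fh, $buf, $size) {
        $buf = $leftover . $buf;
        @lines = split(/\n/, $buf);
        $leftover = ($buf !~ /\n$/) ? pop @lines : "";
        $n += @lines;
    }
    close($fh);
    return $n;
}

my $with_newline    = "a\nb\nc\n";
my $without_newline = "a\nb\nc";

my @results = (
    count_linebyline($with_newline),    count_chunk($with_newline, 4),
    count_linebyline($without_newline), count_chunk($without_newline, 4),
);
print "with trailing newline:    linebyline=$results[0] chunk=$results[1]\n";
print "without trailing newline: linebyline=$results[2] chunk=$results[3]\n";
```

With a trailing newline the two counts agree; without one, the chunk loop leaves the last line sitting in `$leftover`, so it reports one line fewer. That is harmless for a benchmark with an empty loop body but matters for real processing.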
RE: read X number of lines? by lhoward (Vicar) on May 26, 2000 at 18:06 UTC
There should be a CPAN module to do the "read a block and buffer" method transparently (especially since we have tied filehandles), but I can't seem to find one. I'm gonna poke around CPAN a bit more and see if I can find one. If I can't find one on CPAN and no one here objects I'll probably build a module to do that (giving credit to Perlmonks.org and the monks who contributed to this discussion).
Re: read X number of lines? by BigJoe (Curate) on May 26, 2000 at 07:11 UTC
What are you reading in, the Windows kernel? If you have the Perl Cookbook, see page 274. Do a: `read(HANDLE, $buffer, size(k));` then use: `sysseek(HANDLE, $var, length, offset);` but this is the extent of what I understand about what you are trying to do.
RE: Re: read X number of lines? by eduardo (Curate) on May 26, 2000 at 07:18 UTC
hm... lemme try to describe it a bit better. I have a series of files, some of them up to 800 megs (and actually I think larger... yep, 1.6 gigs... damn...) and they are all organized in such a way that there is one record per line. Instead of: read line, process line, write out line, repeat... which is slow (and hard on the IO subsystem) I would like to: read 10000 lines, process 10000 lines, write them out, repeat. My theory being that this would be more efficient IO wise... make better sense?
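The "read 10000 lines, process, write out, repeat" loop described above can be sketched directly. This is an illustration only: `read_batch` is a hypothetical helper, the batch size of 3 and the in-memory input stand in for 10000 lines and a real file, and upper-casing stands in for the real per-record processing. Note that Perl's `<$fh>` already reads through a buffer, so batching lines this way mainly groups the processing and the writes; it is not expected to reduce the number of underlying reads much.

```perl
use strict;
use warnings;

# Hypothetical helper: return up to $n lines from $fh as an array ref,
# or undef once the file is exhausted.
sub read_batch {
    my ($fh, $n) = @_;
    my @batch;
    while (@batch < $n) {
        my $line = <$fh>;
        last unless defined $line;
        push @batch, $line;
    }
    return @batch ? \@batch : undef;
}

# Seven sample records stand in for a large input file.
my $input = join '', map { "record $_\n" } 1 .. 7;
open(my $in, '<', \$input) or die "open input failed: $!";

my $output = '';
open(my $out, '>', \$output) or die "open output failed: $!";

my @batch_sizes;
while (my $batch = read_batch($in, 3)) {
    push @batch_sizes, scalar @$batch;
    # "Process" the whole batch, then write it out with a single print.
    print {$out} map { uc $_ } @$batch;
}
close($in);
close($out);

print "batch sizes: @batch_sizes\n";
```

The last batch is simply shorter than the others, so no special end-of-file handling is needed in the calling loop.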

