eduardo has asked for the wisdom of the Perl Monks concerning the following question:
Let us say that I have a file that is very big... let us say
that it is larger than the amount of RAM on the box I am
working on. I have to process each and every single line.
Now, the typical way to do this is in a
while (<INFILE>) {
    ... do stuff here ...
}
block. Now, as we all know (thanks to the von Neumann
bottleneck), disk IO is going to be the slowest part of this
process, and I have to do this to well over 22 million
lines, so I can't just slurp it all into one massive mega
scalar. So, my question is: in Perl, is there a way of
saying "read the next 10,000 lines in one shot and put
them in this array"? I mean, I want to read a large number
of lines, put them into memory, and then do something like
"write the entire contents of this array in one shot". Am I
dreaming here? Is this even possible? Help me out here!
Re: read X number of lines?
by plaid (Chaplain) on May 26, 2000 at 07:37 UTC
I don't think there's a way to do exactly what you want and
pull out X number of lines in one shot. You're better off
doing something like limiting the number of bytes you read
in at a time. I threw together a bit of code (posted
below) which reads in a fixed number of bytes, splits
that into an array, and then processes those lines. With a
reasonable read size, it seems to consistently be about
1.5 times faster.
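#!/usr/bin/perl
use Benchmark;
use strict;

timethese(10, { 'linebyline' => \&linebyline, 'chunk' => \&chunk });

# Baseline: let Perl hand back one line per read.
sub linebyline {
    open(FILE, "file") or die "can't open file: $!";
    while(<FILE>) { }
    close(FILE);
}

# Read 10240 bytes at a time and split each block into lines.
sub chunk {
    my($buf, $leftover, @lines);
    $leftover = "";
    open(FILE, "file") or die "can't open file: $!";
    while(read FILE, $buf, 10240) {
        $buf = $leftover.$buf;
        @lines = split(/\n/, $buf);
        # A block rarely ends on a line boundary; carry the partial line over.
        $leftover = ($buf !~ /\n$/) ? pop @lines : "";
        foreach (@lines) { }
    }
    # Note: a final line with no trailing newline is still in $leftover here.
    close(FILE);
}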
Benchmark: timing 10 iterations of chunk, linebyline...
chunk: 60 wallclock secs (55.20 usr + 3.48 sys = 58.68 CPU)
linebyline: 95 wallclock secs (91.67 usr + 2.16 sys = 93.83 CPU)
These tests were run on a 25 meg file with roughly 1 million
lines in it. This code is not guaranteed to work 100%, but
I believe it is correct enough to serve benchmarking
purposes well.
|
Just for something kind of different, you can play fun games
with the input record separator and end up with something like
sub IRS_chunky {
    my($buf, $leftover, @lines);
    # A reference to an integer makes <FILE> return records of up to
    # 10240 bytes instead of lines.
    local $/ = \10240;
    open(FILE, "totalfoo");
    while( $buf = <FILE> ) {
        $buf = $leftover.$buf;
        @lines = split(/\n/, $buf);
        $leftover = ($buf !~ /\n$/) ? pop @lines : "";
        foreach (@lines) { }
    }
    close(FILE);
}
which is comparable, on my machine, to chunk.
My real question, though, is what kind of machines are you running
this on? My benchmark results are completely different than
yours. Having run this three times on a Sun Ultra 5 on a file
that is approx 500,000 lines and 24 MB in size from my local
IDE drive, my results were pretty consistently like this
perl test_read.pl
Benchmark: timing 10 iterations of Chunky IRS, chunk, linebyline...
Chunky IRS: 41 wallclock secs (31.31 usr + 3.71 sys = 35.02 CPU)
chunk: 40 wallclock secs (31.00 usr + 3.81 sys = 34.81 CPU)
linebyline: 27 wallclock secs (17.67 usr + 2.47 sys = 20.14 CPU)
The code I used was identical to the earlier post, with the
subroutine I wrote added. Using perl-5.6 generated the same
basic results, plus or minus 1 for each stat. I am now somewhat
confused. Is this a difference in the way Solaris uses its
buffers? What platform/OS were the original tests run on?
mikfire
|
I'm missing something here. What does assigning $/=\10240 mean?
Thanks,
--ZZamboni
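For what it's worth, perlvar covers this: setting $/ to a reference to an integer puts readline into record mode, so each <FILE> returns at most that many bytes instead of one line. Here is a small sketch of the behaviour. The in-memory filehandle, the sample string, and the 4-byte record size are all made up for the demo so the effect is easy to see.

```perl
use strict;
use warnings;

# In-memory filehandle so the sketch needs no file on disk (illustrative data).
my $data = "abcdefghij";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";

my @records;
{
    # A reference to an integer switches readline to fixed-size records:
    # each read returns at most 4 bytes, regardless of where newlines fall.
    local $/ = \4;
    while (defined(my $rec = <$fh>)) {
        push @records, $rec;
    }
}
close($fh);

print join('|', @records), "\n";   # prints abcd|efgh|ij
```

The last record is simply shorter when the data runs out, which is why the chunking code still has to carry a partial line over to the next read.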
|
|
Wow... I changed the read size to 8192 (two 4k blocks...) and
I got these results on a 16 meg "test" file:
Benchmark: timing 10 iterations of chunk, linebyline...
chunk: 9 wallclock secs ( 7.66 usr + 1.58 sys = 9.24 CPU)
linebyline: 34 wallclock secs (33.17 usr + 0.80 sys = 33.97 CPU)
damn... that makes a BIG difference... screw line by line...
|
You can even squeeze a little more out. This is ~5% faster
on my system.
my($buf, $chunk, @lines);
while(read FILE, $buf, BUFFER_SIZE) {
    $chunk .= $buf;
    @lines = split /\n/, $chunk;
    # chomp returns true when $buf ended in a newline, i.e. no partial line
    $chunk = chomp($buf) ? '' : pop @lines;
    foreach (@lines) {}
}
(BUFFER_SIZE is just a constant I was using)
|
Although this is a Perl site, when you start treating things as a block of data, I start thinking C/C++.
Perhaps you should consider that as well...
Data delimited by \t and \n should be quick to parse in C.
Whether you can easily do your operations on that data in C is another question, but one I think is worth asking yourself if runtime is a serious consideration.
I'd also double-check for any place you might need an eof().
Two corner cases that would make good tests for the chunk() code: a file whose last character is \n, and one whose last character is not.
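Those two corner cases are easy to try out. The sketch below is one possible variant of the chunk reader, not the code that was benchmarked above: it passes a limit of -1 to split so blank lines at the end of a block are kept, and it hands over whatever is left after the last read. The helper name for_each_line, the callback interface, and the tiny 4-byte block size are assumptions made for the demo.

```perl
use strict;
use warnings;

# Read $fh in blocks of $size bytes and call $cb->($line) for every line.
# for_each_line is a hypothetical helper name, not from the posts above.
sub for_each_line {
    my ($fh, $size, $cb) = @_;
    my $leftover = '';
    my $buf;
    while (read($fh, $buf, $size)) {
        # The -1 limit keeps trailing empty fields, so blank lines survive
        # and the last element is always the (possibly empty) partial line.
        my @lines = split /\n/, $leftover . $buf, -1;
        $leftover = pop @lines;
        $cb->($_) for @lines;
    }
    # A file that does not end in "\n" leaves its last line in $leftover.
    $cb->($leftover) if length $leftover;
}

# Exercise both corner cases with in-memory files and a tiny block size.
for my $text ("one\n\ntwo\nthree\n", "one\n\ntwo\nthree") {
    open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";
    my @seen;
    for_each_line($fh, 4, sub { push @seen, $_[0] });
    close($fh);
    print scalar(@seen), " lines: ", join(',', @seen), "\n";
}
```

Both inputs should report the same four lines (including the blank one), whether or not the file ends in a newline.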
RE: read X number of lines?
by lhoward (Vicar) on May 26, 2000 at 18:06 UTC
There should be a CPAN module to do the
"read a block and buffer" method
transparently (especially since we have tied
filehandles), but I can't seem to find one. I'm gonna
poke around CPAN a bit more and see if I can find one.
If I can't find one on CPAN and no one here objects,
I'll probably build a module to do that (giving credit to
Perlmonks.org and the monks who contributed to this
discussion).
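To make the tied-filehandle idea concrete, here is a rough sketch of what such a wrapper could look like. BlockReader is a made-up package name, not an existing CPAN module, and the sketch makes some simplifying assumptions: it always splits on "\n" (it ignores $/), it only supports reading, and the demo reads from an in-memory string with a 4-byte block size.

```perl
package BlockReader;    # hypothetical name, not an existing CPAN module
use strict;
use warnings;

# Usage: tie *FH, 'BlockReader', $path, $block_size;
sub TIEHANDLE {
    my ($class, $path, $size) = @_;
    open(my $fh, '<', $path) or die "cannot open $path: $!";
    return bless { fh => $fh, size => $size || 8192, buf => '' }, $class;
}

# Return the next line from the buffer, refilling it one block at a time.
sub _next_line {
    my ($self) = @_;
    while (index($self->{buf}, "\n") < 0) {
        my $got = read($self->{fh}, my $block, $self->{size});
        last unless $got;                 # EOF or read error: stop refilling
        $self->{buf} .= $block;
    }
    return undef unless length $self->{buf};
    my $nl  = index($self->{buf}, "\n");
    my $len = $nl < 0 ? length($self->{buf}) : $nl + 1;
    return substr($self->{buf}, 0, $len, '');   # remove and return the line
}

# Called for <FH>: one line in scalar context, all remaining lines in list context.
sub READLINE {
    my ($self) = @_;
    return $self->_next_line unless wantarray;
    my @all;
    while (defined(my $line = $self->_next_line)) { push @all, $line }
    return @all;
}

sub CLOSE { return close $_[0]{fh} }

package main;

# Demo: the caller keeps its ordinary line-by-line loop.
my $text = "alpha\nbeta\n\ngamma";
tie *LINES, 'BlockReader', \$text, 4;
my @lines;
while (defined(my $line = <LINES>)) { push @lines, $line }
close(LINES);
print scalar(@lines), " lines read\n";
```

The point of the design is that the calling code does not change at all. Whether it is actually faster than a plain filehandle would need benchmarking, since every line now goes through a method call.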
Re: read X number of lines?
by BigJoe (Curate) on May 26, 2000 at 07:11 UTC
What are you reading in, the Windows kernel? If you have the Perl Cookbook, see page 274. Do a:
read(HANDLE, $buffer, size(k));
then use: sysread(HANDLE, $var, length, offset);
but this is the extent of what I understand about what you are trying to do.
|
Hm... lemme try to describe it a bit better. I have a
series of files, some of them up to 800 megs (and actually
I think larger... yep, 1.6 gigs... damn...) and they are
all organized with one record per line. Instead of:
read line, process line, write out line, repeat...
which is slow (and hard on the IO subsystem), I would like
to: read 10000 lines, process 10000 lines, write them out,
repeat. My theory being that this would be more efficient
IO-wise... make better sense?
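The read-a-batch, process-a-batch, write-a-batch loop described here can be sketched in plain Perl as below. This is only an illustration: read_batch is a made-up helper name, the batch size of 10 and the in-memory handles stand in for 10000 lines and the real files, and uc stands in for the real processing. Keep in mind that Perl's I/O layer already reads the file in buffered blocks behind <$in>, so batching like this mostly changes how the program is organized. Whether it saves any time is something to benchmark.

```perl
use strict;
use warnings;

# Read up to $n lines from $in into an array; returns an empty list at EOF.
# read_batch is a hypothetical helper name used only for this sketch.
sub read_batch {
    my ($in, $n) = @_;
    my @batch;
    while (@batch < $n) {
        my $line = <$in>;
        last unless defined $line;
        push @batch, $line;
    }
    return @batch;
}

# Demo on in-memory handles; a real run would open the big files instead.
my $input  = join '', map { "record $_\n" } 1 .. 25;
my $output = '';
open(my $in,  '<', \$input)  or die "cannot open input: $!";
open(my $out, '>', \$output) or die "cannot open output: $!";

my $batches = 0;
while (my @batch = read_batch($in, 10)) {
    $_ = uc for @batch;          # stand-in for the real per-line processing
    print {$out} @batch;         # one print call for the whole batch
    $batches++;
}
close($in);
close($out);

print "$batches batches written\n";
```

With 25 input lines and a batch size of 10, the loop runs three times and the last batch is simply shorter.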