perlmeditation
Tyke
The standard idiom to slurp a file is
<CODE>my $file = do {local $/; <FILE>};</code> (or the 'slurp idiom' in the
rest of this node). An alternative idiom is
<CODE>read FILE, my $file, -s FILE</code> (or the read idiom).
<P>
The thread [50525|Slurp a file] discusses all this fairly extensively, but it
does seem to have left the suggestion that the read idiom is inherently
unportable. After a quick exchange with [tye] on the CB the other day, I was
curious enough to do a little research.
<READMORE>
<P>
Actually let's be honest, another title for this node could have been
<CODE>$ground[eval{die $horse->flog}]</code>.
<TABLE BGCOLOR="WHITE"><TR><TD><FONT COLOR="WHITE">Flogging a dead horse into the ground</font></td></tr></table>
But it would be nice to rehabilitate read!
<H3>Is the read idiom portable or not?</h3>
This technique works for normal text files, and for binary files (with the
proviso that <CODE>binmode</code> is applied to the handle). It will not work
for pipes, consoles or other file handles whose size cannot be determined up
front.
<P>
In particular it will work for DOS \r\n line ends as well as for normal Un*x
\n line ends. In the first case <CODE>-s FILE</code> gives the real size of
the file on disk. On Windows, without <CODE>binmode</code>, the length that is
read will be shorter because the \r characters are hidden. This just means
that read is asked to do a little more work than is strictly necessary.
Luckily it's clever enough to realize this when it gets to the end of the
file.
<BR>
Any incredulous masochists can look at <A HREF="#SAMPLE_1">Read portability</a>
below.
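<P>
As a minimal sketch of the point above (this is not the test script used for
this node), the return value of <CODE>read</code> can simply be compared with
<CODE>-s</code>. The sub name <CODE>read_whole</code>, the lexical file handle
and the three-argument <CODE>open</code> are my own choices for illustration,
and the sample content just imitates the DOS test file described further down.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# The read idiom wrapped in a sub (hypothetical helper name).
# Returns the buffer and the count that read() reported.
sub read_whole {
    my ($path, $binary) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    binmode $fh if $binary;
    my $size = -s $fh;                       # size on disk, \r included
    my $got  = read $fh, my $buffer, $size;  # may report fewer than $size
    defined $got or die "Can't read $path: $!";
    close $fh;
    return ($buffer, $got);
}

# Build a small DOS-format sample file: A, B and C on three lines.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "A\r\nB\r\nC\r\n";
close $tmp or die "Can't close $path: $!";

# Read it once in text mode and once with binmode.
for my $binary (0, 1) {
    my ($buffer, $got) = read_whole($path, $binary);
    printf "binmode %d: -s says %d, read returned %d, buffer holds %d\n",
        $binary, -s $path, $got, length $buffer;
}
```

With <CODE>binmode</code> the three numbers should agree on any platform.
Without it they should only differ where the platform translates line ends.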
<P>
However, just because it is <em>portable</em>, doesn't mean that it is
<em>applicable</em> in all situations. In particular this idiom should not
be used when
<UL>
<LI>The size of the file handle makes no sense, or cannot be determined when
the file is opened. (Forget <CODE><STDIN></code>!)
<LI>The file is too large to fit into memory.
<LI>The file will be processed in chunks that have well-identified
delimiters (in which case we can just redefine <CODE>local $/</code> and
process the file with a <CODE>while (<FILE>)</code> loop).
</ul>
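<P>
For the last case in the list above, here is a small sketch of chunk-wise
processing with a redefined <CODE>$/</code>. The delimiter
<CODE>"\n%%\n"</code>, the sample records and the sub name
<CODE>read_records</code> are all invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Return the records of a file, split on an arbitrary delimiter string.
sub read_records {
    my ($path, $delimiter) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    local $/ = $delimiter;      # restored automatically when the sub returns
    my @records;
    while (my $record = <$fh>) {
        chomp $record;          # chomp removes the current $/, i.e. the delimiter
        push @records, $record;
    }
    close $fh;
    return @records;
}

# Sample file: three records, each terminated by a line holding "%%".
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "alpha\n%%\nbeta\n%%\ngamma\n%%\n";
close $tmp or die "Can't close $path: $!";

print "record: $_\n" for read_records($path, "\n%%\n");
```

Only one record is held in memory at a time inside the loop, which is the
whole point of not slurping here. The sub collects them into a list purely to
keep the example short.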
<H3>How do the read and slurp idioms compare?</h3>
The two idioms have the same behaviour on normal files. However, the 'slurp'
idiom can also be used effectively on file handles whose length can't be
measured when they are opened, since it just carries on until it hits the
end of the file. The other applicability constraints (avoid very large or
'chunkable' files) apply to this idiom too.
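<P>
One way to get the best of both is to test the handle first. What follows is
only a sketch under my own assumptions: the helper name
<CODE>slurp_handle</code> is made up, <CODE>-f</code> is used as the test for
"a plain file whose size means something", and the demo opens a pipe with the
list form of <CODE>open</code>, which may not be available on older perls
(notably on Windows).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Use the read idiom on plain files, fall back to the slurp idiom otherwise.
sub slurp_handle {
    my ($fh) = @_;
    if (-f $fh) {
        my $size   = (-s _) || 0;    # reuse the stat done by -f
        my $buffer = '';
        my $got    = read $fh, $buffer, $size;
        defined $got or die "read failed: $!";
        return $buffer;
    }
    local $/;                        # undef: read until end of file
    my $buffer = <$fh>;
    return defined $buffer ? $buffer : '';
}

# A plain file: its size is known up front.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "plain file\n";
close $tmp or die "Can't close $path: $!";
open my $file_fh, '<', $path or die "Can't open $path: $!";
my $from_file = slurp_handle($file_fh);
close $file_fh;

# A pipe: its size cannot be known up front.
open my $pipe_fh, '-|', $^X, '-e', 'print qq{from a pipe\n}'
    or die "Can't start child perl: $!";
my $from_pipe = slurp_handle($pipe_fh);
close $pipe_fh;

print $from_file, $from_pipe;
```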
<P>
The down side is that slurp is less efficient than read: in these
<A HREF="#SAMPLE_2">benchmarks</a> read came out 4% to 27% faster, with the
largest gap on the largest file.
<P>
<EM>OK, if you're bored you can leave now 'cos the rest is just the scripts and
results to justify the wild assertions above.</em>
<HR>
<H3><A NAME="SAMPLE_1">Read portability</a></h3>
<H4>Environment</h4>
<UL>
<LI>(a) Win95 (doesn't need passport ;-) ) with ActiveState's perl v5.6.1 (build
626)
<LI>(b) Cygwin with perl V5.6.1 built for Cygwin to simulate a un*x environment
</ul>
In this environment I have three files... one is in DOS format, the second in
UNIX format. Each file just contains the characters A..C on three lines: one
character/line. The third file is a binary executable built under cygwin.
This is the script that I ran:
<CODE>
#!/usr/bin/perl
use strict;
use warnings;

my $i = 1;    # running number for the output lines

# Read every file twice: first in text mode, then with binmode
for my $file (@ARGV) {
    open_read($file, $_) for (0,1);
}

sub open_read {
    my ($file, $binary) = @_;
    my $file_size = -s $file;           # size according to the file name
    open IT, $file or die "Can't open $file: $!";
    binmode IT if $binary;
    my $handle_size = -s IT;            # size according to the open handle
    my $read_len = read IT, my $buffer, -s IT;
    close IT;
    printf "%d) %5s (%s): File size %5d, Read length %5d,\n", $i++,
        $file, $binary ? 'b' : ' ', $file_size, $read_len;
    printf " : Handle size %5d, Buffer length %5d,\n",
        $handle_size, length $buffer
}
__END__
</code>
And here are the results:
<P>
(a) (WINDOWS ENVIRONMENT)
<CODE>
1) dos1 ( ): File size 9, Read length 6,
: Handle size 9, Buffer length 6,
2) dos1 (b): File size 9, Read length 9,
: Handle size 9, Buffer length 9,
3) unix1 ( ): File size 6, Read length 6,
: Handle size 6, Buffer length 6,
4) unix1 (b): File size 6, Read length 6,
: Handle size 6, Buffer length 6,
5) p.exe ( ): File size 20069, Read length 4716,
: Handle size 20069, Buffer length 4716,
6) p.exe (b): File size 20069, Read length 20069,
: Handle size 20069, Buffer length 20069,
</code>
<P>
<UL>
<LI>1) and 2) show clearly that <CODE>binmode</code> should not be used on a
DOS text file: 2) is broken since the loaded buffer still includes the \r
characters.
<LI>3) and 4) show that none of this makes the slightest difference on a Unix
text file.
<LI>5) and 6) show that <CODE>binmode</code> must be used for a binary file:
5) is broken since, in text mode, read 'compresses' the buffer in a less than
desirable way: only 4716 of the 20069 bytes arrive.
<LI>In each of the normal cases 1), 3) and 6) the buffer is loaded with
exactly the characters that we would expect.
</ul>
<P>
(b) (CYGWIN ENVIRONMENT)
<CODE>
1) dos1 ( ): File size 9, Read length 9,
: Handle size 9, Buffer length 9,
2) dos1 (b): File size 9, Read length 9,
: Handle size 9, Buffer length 9,
3) unix1 ( ): File size 6, Read length 6,
: Handle size 6, Buffer length 6,
4) unix1 (b): File size 6, Read length 6,
: Handle size 6, Buffer length 6,
5) p.exe ( ): File size 20069, Read length 20069,
: Handle size 20069, Buffer length 20069,
6) p.exe (b): File size 20069, Read length 20069,
: Handle size 20069, Buffer length 20069,
</code>
<P>
This just confirms that <CODE>binmode</code> really isn't necessary on Un*x
type systems, although it is best to use it anyway. It also shows that you
can't expect DOS files to be converted for you automagically... but you knew
that anyway.
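<P>
If a DOS file does have to be read as text on a Un*x type system, the
conversion can be done by hand after the read. This is just one possible
sketch: the sub name <CODE>read_text_normalized</code> is invented, and the
substitution is my own addition rather than something tested above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read a whole file with binmode, then turn \r\n line ends into \n ourselves,
# so the result does not depend on the platform's text mode.
sub read_text_normalized {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    binmode $fh;
    my $size   = (-s $fh) || 0;
    my $buffer = '';
    my $got    = read $fh, $buffer, $size;
    defined $got or die "Can't read $path: $!";
    close $fh;
    $buffer =~ s/\r\n/\n/g;     # leaves a lone \r or \n untouched
    return $buffer;
}

# DOS-format sample: A, B and C on three lines.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "A\r\nB\r\nC\r\n";
close $tmp or die "Can't close $path: $!";

my $text = read_text_normalized($path);
printf "%d bytes on disk, %d characters after conversion\n",
    -s $path, length $text;
```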
<H3><A NAME="SAMPLE_2">Benchmarks</a></h3>
Here's the benchmark code, inspired by that found in the [50525|Slurp a file]
node:
<CODE>
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw/cmpthese/;

# First argument: count or (negative) CPU seconds, as accepted by cmpthese
my $time = shift;

for my $file (@ARGV) {
    my $binary = -B $file;
    printf "Processing a %s file, %d bytes long\n",
        $binary ? 'binary' : 'text', -s $file;
    cmpthese($time, {
        'read'    => sub{open_read($file, $binary)},
        'sysread' => sub{open_sysread($file, $binary)},
        'slurp'   => sub{open_slurp($file, $binary)},
    });
    print "\n";
}

# The read idiom
sub open_read {
    my ($file, $binary) = @_;
    open IT, $file or die "Can't open $file: $!";
    binmode IT if $binary;
    read IT, my $buffer, -s IT;
    close IT;
}

# The same thing with unbuffered sysread
sub open_sysread {
    my ($file, $binary) = @_;
    open IT, $file or die "Can't open $file: $!";
    binmode IT if $binary;
    sysread IT, my $buffer, -s IT;
    close IT;
}

# The slurp idiom
sub open_slurp {
    my ($file, $binary) = @_;
    open IT, $file or die "Can't open $file: $!";
    binmode IT if $binary;
    my $buffer = do {local $/; <IT>};
    close IT;
}
__END__
</code>
This was executed in the Cygwin environment.
<CODE>
Processing a text file, 100 bytes long
Benchmark: running read, slurp, sysread, each for at least 20 CPU seconds...
   read: 20 wallclock secs (20.07 usr + 0.00 sys = 20.07 CPU) @ 3179.19/s (n=63819)
  slurp: 20 wallclock secs (20.00 usr + 0.00 sys = 20.00 CPU) @ 3027.40/s (n=60551)
sysread: 21 wallclock secs (20.88 usr + 0.00 sys = 20.88 CPU) @ 3411.66/s (n=71232)
          Rate   slurp    read sysread
slurp   3027/s      --     -5%    -11%
read    3179/s      5%      --     -7%
sysread 3412/s     13%      7%      --

Processing a text file, 10000 bytes long
Benchmark: running read, slurp, sysread, each for at least 20 CPU seconds...
   read: 20 wallclock secs (20.05 usr + 0.00 sys = 20.05 CPU) @ 2283.35/s (n=45772)
  slurp: 20 wallclock secs (20.11 usr + 0.00 sys = 20.11 CPU) @ 2135.80/s (n=42951)
sysread: 21 wallclock secs (21.59 usr + 0.00 sys = 21.59 CPU) @ 2710.40/s (n=58531)
          Rate   slurp    read sysread
slurp   2136/s      --     -6%    -21%
read    2283/s      7%      --    -16%
sysread 2710/s     27%     19%      --

Processing a binary file, 100 bytes long
Benchmark: running read, slurp, sysread, each for at least 20 CPU seconds...
   read: 22 wallclock secs (21.59 usr + 0.00 sys = 21.59 CPU) @ 3012.83/s (n=65032)
  slurp: 21 wallclock secs (20.89 usr + 0.00 sys = 20.89 CPU) @ 2898.21/s (n=60558)
sysread: 21 wallclock secs (20.15 usr + 0.00 sys = 20.15 CPU) @ 3173.55/s (n=63947)
          Rate   slurp    read sysread
slurp   2898/s      --     -4%     -9%
read    3013/s      4%      --     -5%
sysread 3174/s     10%      5%      --

Processing a binary file, 100000 bytes long
Benchmark: running read, slurp, sysread, each for at least 20 CPU seconds...
   read: 22 wallclock secs (21.43 usr + 0.00 sys = 21.43 CPU) @ 584.71/s (n=12531)
  slurp: 21 wallclock secs (21.04 usr + 0.00 sys = 21.04 CPU) @ 459.77/s (n=9673)
sysread: 21 wallclock secs (21.03 usr + 0.00 sys = 21.03 CPU) @ 710.41/s (n=14937)
         Rate   slurp    read sysread
slurp   460/s      --    -21%    -35%
read    585/s     27%      --    -18%
sysread 710/s     55%     21%      --
</code>
<P>
Slurp is consistently outperformed by read and especially sysread. (Actually I had the impression at one point that on Windows the difference is far less clear, especially for small files).
<P>
Disclaimer: these results hold at certain times for certain files under given
weather conditions on this particular computer.
<HR>
[tye], I hope this presents your point of view correctly.