http://qs321.pair.com?node_id=122443

The standard idiom to slurp a file is my $file = do {local $/; <FILE>}; (the 'slurp idiom' in the rest of this node). An alternative idiom is read FILE, my $file, -s FILE (the 'read idiom').
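For reference, here are the two idioms side by side in context. This is a minimal sketch: the temporary file and its contents are just scaffolding for the demonstration, not part of either idiom.

```perl
use strict;
use warnings;
use File::Temp qw/tempfile/;

# Create a small file to demonstrate on
my ($fh, $name) = tempfile();
print $fh "line one\nline two\n";
close $fh;

# The slurp idiom: localize $/ so <FILE> reads everything in one go
open my $slurp_fh, '<', $name or die "Can't open $name: $!";
my $slurped = do { local $/; <$slurp_fh> };
close $slurp_fh;

# The read idiom: ask read() for as many bytes as -s reports
open my $read_fh, '<', $name or die "Can't open $name: $!";
read $read_fh, my $read_buf, -s $read_fh;
close $read_fh;

print "identical\n" if $slurped eq $read_buf;   # prints "identical"
```

On a normal text file the two buffers come out byte-for-byte the same, which is the starting point for the portability discussion below.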

The thread Slurp a file discusses all this fairly extensively, but it does seem to have left the suggestion that the read idiom is inherently unportable. After a quick exchange with tye on the CB the other day, I was curious enough to do a little research.

Actually, let's be honest: another title for this node could have been $ground[eval{die $horse->flog}]
(that is, "Flogging a dead horse into the ground").
But it would be nice to rehabilitate read!

Is the read idiom portable or not?

This technique works for normal text files, and for binary files (with the proviso that binmode is applied to the handle). It will not work for pipes, consoles, or file handles whose size cannot be determined up front.

In particular, it works for DOS \r\n line ends as well as for normal Un*x \n line ends. In the DOS case, -s FILE gives the raw size of the file, while the length actually read is shorter because the \r characters are hidden. This just means that read is asked to do a little more work than is strictly necessary; luckily, it's clever enough to realize this when it gets to the end of the file.
Any incredulous masochists can look at Read portability below.
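The size discrepancy is easy to see for yourself. This sketch writes a file with literal CRLF line ends (binmode on the output handle stops perl translating \n on DOSish systems) and then compares what -s promises with what read delivers; on Windows in text mode read returns 6 bytes here while -s reports 9, exactly as result 1) in the tables below shows, whereas on Un*x both are 9.

```perl
use strict;
use warnings;
use File::Temp qw/tempfile/;

# Write a file with literal CRLF line ends, as a DOS editor would
my ($out, $name) = tempfile();
binmode $out;                    # keep the \r\n pairs exactly as written
print $out "A\r\nB\r\nC\r\n";
close $out;

open my $fh, '<', $name or die "Can't open $name: $!";
my $asked = -s $fh;                          # 9: the raw on-disk size
my $got   = read $fh, my $buffer, $asked;    # 6 on Windows (text mode), 9 on Un*x
close $fh;
printf "asked for %d, got %d\n", $asked, $got;
```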

However, just because it is portable, doesn't mean that it is applicable in all situations. In particular this idiom should not be used when

  • The size of the file handle is meaningless, or cannot be determined when the file is opened. (Forget <STDIN>!)
  • When the file is too large to fit into memory.
  • When the file will be processed in chunks that have well-identified delimiters (in which case we can just redefine local $/ and process the file with a while (<FILE>) loop).
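The third case above can be sketched like this, using paragraph mode ($/ = "") as the delimiter; the temporary file of blank-line-separated records is just illustrative scaffolding.

```perl
use strict;
use warnings;
use File::Temp qw/tempfile/;

# Build a small file of blank-line-separated records to chunk through
my ($out, $name) = tempfile();
print $out "first record\n\nsecond record\n\nthird record\n";
close $out;

open my $fh, '<', $name or die "Can't open $name: $!";
my @chunks;
{
    local $/ = "";                 # paragraph mode: records end at blank lines
    while (my $chunk = <$fh>) {
        chomp $chunk;              # strip the trailing record separator
        push @chunks, $chunk;      # process one record at a time, never the whole file
    }
}
close $fh;
print scalar(@chunks), " records\n";   # prints "3 records"
```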

How do the read and slurp idioms compare?

The two idioms behave identically on normal files. However, the 'slurp' idiom can also be used effectively on file handles whose length can't be measured when they are opened, since it simply carries on until it hits the end of the file. The other applicability constraints (avoid very large or 'chunkable' files) apply to this idiom too.

The downside is that slurp is much less efficient than read, as shown by the benchmarks below.

OK, if you're bored you can leave now 'cos the rest is just the scripts and results to justify the wild assertions above.


Read portability

Environment

  • (a) Win95 (doesn't need a passport ;-) ) with ActiveState's perl v5.6.1 (build 626)
  • (b) Cygwin with perl v5.6.1 built for Cygwin, to simulate a un*x environment
In each environment I have three files: one is in DOS format, the second in UNIX format. Each file just contains the characters A..C on three lines, one character per line. The third file is a binary executable built under Cygwin. This is the script that I ran:
#!/usr/bin/perl
use strict;
use warnings;

my $i = 1;
for my $file (@ARGV) {
    open_read($file, $_) for (0, 1);
}

sub open_read {
    my ($file, $binary) = @_;
    my $file_size = -s $file;
    open IT, $file or die "Can't open $file: $!";
    binmode IT if $binary;
    my $handle_size = -s IT;
    my $read_len = read IT, my $buffer, -s IT;
    close IT;
    printf "%d) %5s (%s): File size %5d, Read length %5d,\n",
        $i++, $file, $binary ? 'b' : ' ', $file_size, $read_len;
    printf "   : Handle size %5d, Buffer length %5d,\n",
        $handle_size, length $buffer;
}
__END__
And here are the results:

(a) (WINDOWS ENVIRONMENT)

1)  dos1 ( ): File size     9, Read length     6,
   : Handle size     9, Buffer length     6,
2)  dos1 (b): File size     9, Read length     9,
   : Handle size     9, Buffer length     9,
3) unix1 ( ): File size     6, Read length     6,
   : Handle size     6, Buffer length     6,
4) unix1 (b): File size     6, Read length     6,
   : Handle size     6, Buffer length     6,
5) p.exe ( ): File size 20069, Read length  4716,
   : Handle size 20069, Buffer length  4716,
6) p.exe (b): File size 20069, Read length 20069,
   : Handle size 20069, Buffer length 20069,

  • 1) and 2) show clearly that binmode should not be used on a DOS text file: 2) is broken, since the loaded buffer still includes the \r characters.
  • 3) and 4) show that none of this makes the slightest difference on a unix text file.
  • 5) and 6) show that binmode must be used for a binary file: 5) is broken since read 'compresses' the buffer in a less than desirable way.
  • In each of the normal cases 1), 3) and 6) the buffer is loaded with exactly the characters that we would expect.

(b) (CYGWIN ENVIRONMENT)

1)  dos1 ( ): File size     9, Read length     9,
   : Handle size     9, Buffer length     9,
2)  dos1 (b): File size     9, Read length     9,
   : Handle size     9, Buffer length     9,
3) unix1 ( ): File size     6, Read length     6,
   : Handle size     6, Buffer length     6,
4) unix1 (b): File size     6, Read length     6,
   : Handle size     6, Buffer length     6,
5) p.exe ( ): File size 20069, Read length 20069,
   : Handle size 20069, Buffer length 20069,
6) p.exe (b): File size 20069, Read length 20069,
   : Handle size 20069, Buffer length 20069,

This just confirms that binmode really isn't necessary on Un*x-type systems, though it's best to use it anyway. It also shows that you can't expect DOS files to be converted for you automagically... but you knew that anyway.
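Putting the portability notes together, a defensive wrapper might look like the sketch below. The name slurp_any is hypothetical, and the fallback to the slurp idiom when -s is unusable is my own addition rather than anything from the results above.

```perl
use strict;
use warnings;
use File::Temp qw/tempfile/;

# slurp_any: use the read idiom when the handle's size is known,
# fall back to the slurp idiom otherwise (e.g. pipes, consoles).
sub slurp_any {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    binmode $fh if -B $path;        # binary files need binmode on DOSish perls
    my $size = -s $fh;
    my $data;
    if (defined $size && $size > 0) {
        read $fh, $data, $size;     # read idiom: size known up front
    } else {
        local $/;                   # slurp idiom: carries on to end of file
        $data = <$fh>;
    }
    close $fh;
    return $data;
}

# Quick demonstration on a throwaway text file
my ($out, $name) = tempfile();
print $out "hello\nworld\n";
close $out;
print length(slurp_any($name)), " bytes\n";   # prints "12 bytes"
```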

Benchmarks

Here's the benchmark code, inspired by that found in the Slurp a file node:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw/cmpthese/;

my $time = shift;
for my $file (@ARGV) {
    my $binary = -B $file;
    printf "Processing a %s file, %d bytes long\n",
        $binary ? 'binary' : 'text', -s $file;
    cmpthese($time, {
        'read'    => sub { open_read($file, $binary) },
        'sysread' => sub { open_sysread($file, $binary) },
        'slurp'   => sub { open_slurp($file, $binary) },
    });
    print "\n";
}

sub open_read {
    my ($file, $binary) = @_;
    open IT, $file or die "Can't open $file: $!";
    binmode IT if $binary;
    read IT, my $buffer, -s IT;
    close IT;
}

sub open_sysread {
    my ($file, $binary) = @_;
    open IT, $file or die "Can't open $file: $!";
    binmode IT if $binary;
    sysread IT, my $buffer, -s IT;
    close IT;
}

sub open_slurp {
    my ($file, $binary) = @_;
    open IT, $file or die "Can't open $file: $!";
    binmode IT if $binary;
    my $buffer = do { local $/; <IT> };
    close IT;
}
__END__
This was executed in the Cygwin environment.
Processing a text file, 100 bytes long
Benchmark: running read, slurp, sysread, each for at least 20 CPU seconds...
      read: 20 wallclock secs (20.07 usr + 0.00 sys = 20.07 CPU) @ 3179.19/s (n=63819)
     slurp: 20 wallclock secs (20.00 usr + 0.00 sys = 20.00 CPU) @ 3027.40/s (n=60551)
   sysread: 21 wallclock secs (20.88 usr + 0.00 sys = 20.88 CPU) @ 3411.66/s (n=71232)
           Rate   slurp    read sysread
slurp    3027/s      --     -5%    -11%
read     3179/s      5%      --     -7%
sysread  3412/s     13%      7%      --

Processing a text file, 10000 bytes long
Benchmark: running read, slurp, sysread, each for at least 20 CPU seconds...
      read: 20 wallclock secs (20.05 usr + 0.00 sys = 20.05 CPU) @ 2283.35/s (n=45772)
     slurp: 20 wallclock secs (20.11 usr + 0.00 sys = 20.11 CPU) @ 2135.80/s (n=42951)
   sysread: 21 wallclock secs (21.59 usr + 0.00 sys = 21.59 CPU) @ 2710.40/s (n=58531)
           Rate   slurp    read sysread
slurp    2136/s      --     -6%    -21%
read     2283/s      7%      --    -16%
sysread  2710/s     27%     19%      --

Processing a binary file, 100 bytes long
Benchmark: running read, slurp, sysread, each for at least 20 CPU seconds...
      read: 22 wallclock secs (21.59 usr + 0.00 sys = 21.59 CPU) @ 3012.83/s (n=65032)
     slurp: 21 wallclock secs (20.89 usr + 0.00 sys = 20.89 CPU) @ 2898.21/s (n=60558)
   sysread: 21 wallclock secs (20.15 usr + 0.00 sys = 20.15 CPU) @ 3173.55/s (n=63947)
           Rate   slurp    read sysread
slurp    2898/s      --     -4%     -9%
read     3013/s      4%      --     -5%
sysread  3174/s     10%      5%      --

Processing a binary file, 100000 bytes long
Benchmark: running read, slurp, sysread, each for at least 20 CPU seconds...
      read: 22 wallclock secs (21.43 usr + 0.00 sys = 21.43 CPU) @ 584.71/s (n=12531)
     slurp: 21 wallclock secs (21.04 usr + 0.00 sys = 21.04 CPU) @ 459.77/s (n=9673)
   sysread: 21 wallclock secs (21.03 usr + 0.00 sys = 21.03 CPU) @ 710.41/s (n=14937)
           Rate   slurp    read sysread
slurp     460/s      --    -21%    -35%
read      585/s     27%      --    -18%
sysread   710/s     55%     21%      --

Slurp is consistently outperformed by read, and especially by sysread. (Actually, I had the impression at one point that on Windows the difference is far less clear-cut, especially for small files.)

Disclaimer: these results hold at certain times for certain files under given weather conditions on this particular computer.


tye, I hope this presents your point of view correctly.

Replies are listed 'Best First'.
Re: Slurp or Read
by dws (Chancellor) on Nov 01, 2001 at 03:25 UTC
    A note on performance:

    If you trace the system call activity that happens underneath slurp, at least on Linux and FreeBSD, you'll find that the slurp gets divided into a sequence of read(2) calls. If your intent is raw speed (and the file can fit into memory), you might be better off using sysread(), which will do a single read(2). That removes system call overhead, which on a loaded system will help avoid having other processes get their read(2)/write(2) requests in, which might move the disk head off of the sector that contains your partially-read file. Seeks can be very expensive.

    At this level, though, the game is probabilistic. The file might be fragmented, etc.
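    dws's suggestion, in idiom form, looks like the sketch below; note that sysread bypasses the buffering layer and hands you raw bytes, so binmode/CRLF translation does not apply, and whether you actually see a single read(2) per file depends on your system and PerlIO layers.

```perl
use strict;
use warnings;
use File::Temp qw/tempfile/;

# A throwaway 1024-byte file to read back in one call
my ($out, $name) = tempfile();
print $out "x" x 1024;
close $out;

open my $fh, '<', $name or die "Can't open $name: $!";
binmode $fh;                                  # sysread deals in raw bytes anyway
my $got = sysread $fh, my $buffer, -s $fh;    # one unbuffered read of the whole file
close $fh;
print "read $got bytes\n";                    # prints "read 1024 bytes"
```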

Re: Slurp or Read
by perrin (Chancellor) on Nov 01, 2001 at 21:21 UTC
    This is interesting, but honestly, how often will slurping a file be the bottleneck in your application? I'd go for the more standard and easier to read slurp option unless DProf showed me it was a performance problem in a finished app.
Re: Slurp or Read
by Anonymous Monk on Nov 01, 2001 at 23:17 UTC
    Personally I quite like:
    use File::Slurp;
    $file = read_file($filename);
    but then, I rarely care about how fast things are. -- Gavin