(RhetTbull) Re: Looking for a better way to get the number of lines in a file...

The special Perl variable $. holds the current line number of the filehandle you last read from. Hence, once you have read to the end of the file, it gives you the total number of lines:

while (<FILE>){};
print "length = $.";

As an aside, it's typical in Perl to reserve all-uppercase names for filehandles and the like, not for ordinary variables as you do. IMHO it makes your code easier to read.
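
For anyone who wants the $. approach as something they can drop into a script, here is a minimal sketch. The sub name count_lines_by_dot and the use of a lexical filehandle are my own choices for illustration, not part of the original snippet.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count lines by reading to end of file and
# then looking at $. (the line number of the last line read).
sub count_lines_by_dot {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "open $path: $!";
    while (defined(my $line = <$fh>)) { }   # read and discard every line
    my $lines = $.;                         # copy $. before closing; close resets it
    close($fh);
    return $lines;
}

# Usage: perl script.pl somefile
if (@ARGV) {
    print "length = ", count_lines_by_dot($ARGV[0]), "\n";
}
```

Copying $. into a variable before close matters, since closing the filehandle resets $. for that handle.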

Update: In light of gbarr's post below, I did some benchmarking. gbarr is correct that snarfing the file in chunks is much faster. I tested quite a few large files (>1 million lines) and found that in most cases, reading 100K chunks and counting linefeeds is several times faster than reading line by line. In a few cases the two approaches were close. So I definitely recommend gbarr's solution. Benchmark code follows:

#!/usr/bin/perl

use warnings;
use strict;

use Benchmark;

timethese(100, {
        'line_by_line' => q{
                open(INFILE,'test.dat') or die "open: $!";
                while(<INFILE>){};
                close(INFILE);
        },
        'chunk' => q{
                $count = 0;
                open(INFILE,'test.dat') or die "open: $!";
                local $/=\1024000;
                while(<INFILE>) { $count += tr/\n// }
                close(INFILE);
        }
});

On some sample data, this produces the following:

Benchmark: timing 100 iterations of chunk, line_by_line...
     chunk:  3 wallclock secs ( 1.58 usr +  0.70 sys =  2.28 CPU) @ 43.80/s (n=100)
     line_by_line: 12 wallclock secs ( 9.54 usr +  0.72 sys = 10.27 CPU) @  9.74/s (n=100)
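
Outside of a benchmark, the chunked read could be wrapped up roughly like this. It is only a sketch: the sub name count_lines_chunked and the optional $chunk_size argument are assumptions of mine, and the default simply mirrors the record size used in the benchmark code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read fixed-size records and count the newlines in each.
sub count_lines_chunked {
    my ($path, $chunk_size) = @_;
    $chunk_size ||= 1_024_000;          # same record size as the benchmark
    open(my $fh, '<', $path) or die "open $path: $!";
    local $/ = \$chunk_size;            # reference to an integer => fixed-size records
    my $count = 0;
    while (defined(my $chunk = <$fh>)) {
        $count += ($chunk =~ tr/\n//);  # tr with an empty replacement just counts
    }
    close($fh);
    return $count;
}

# Usage: perl script.pl somefile
if (@ARGV) {
    print "length = ", count_lines_chunked($ARGV[0]), "\n";
}
```

Note that this counts newline characters, so for a file whose last line has no trailing newline it reports one less than $. would.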

Update #2: After further playing around and looking at other nodes on the same topic, it seems that sysread is faster still. Running new benchmarks with the following code

open INFILE,'test.dat' or die "open: $!";
my $count = 0;
while (sysread(INFILE,$_,102400)) { $count += tr/\n//; }

yields

/home/rhet/misc> ./testfcount.pl
Benchmark: timing 100 iterations of chunk, line_by_line, sysread...
     chunk: 12 wallclock secs ( 8.00 usr +  3.52 sys = 11.53 CPU) @  8.68/s (n=100)
     line_by_line: 154 wallclock secs (149.20 usr +  3.29 sys = 152.50 CPU) @  0.66/s (n=100)
     sysread:  6 wallclock secs ( 4.09 usr +  1.62 sys =  5.71 CPU) @ 17.52/s (n=100)

I tried a variety of files and was able to find certain files that made one or the other faster, but in general sysread was the fastest, and quite often a lot faster. On files with really short lines the line_by_line method was VERY slow, but on files with much longer lines it was often faster than the other two. In general, though, it looks like sysread is your best bet. You could probably optimize further by changing the size of the block you read with sysread, but any gains would likely depend on the particular configuration of a particular platform.
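
To make the block size easy to experiment with, the sysread loop could be wrapped in a sub along these lines. This is a sketch under my own assumptions: the name count_lines_sysread, the $block_size parameter, and the explicit error check are not in the snippet above, and the default of 102400 bytes is just the value used there.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count newlines using unbuffered sysread calls.
sub count_lines_sysread {
    my ($path, $block_size) = @_;
    $block_size ||= 102_400;            # block size from the snippet above
    open(my $fh, '<', $path) or die "open $path: $!";
    binmode($fh);                       # read raw bytes, no layer translation
    my ($count, $buf) = (0, '');
    while (1) {
        my $n = sysread($fh, $buf, $block_size);
        die "sysread $path: $!" unless defined $n;   # undef signals an error
        last if $n == 0;                             # 0 signals end of file
        $count += ($buf =~ tr/\n//);
    }
    close($fh);
    return $count;
}

# Usage: perl script.pl somefile [block_size]
if (@ARGV) {
    print "length = ", count_lines_sysread(@ARGV[0, 1]), "\n";
}
```

Passing different block sizes as the second argument is one way to see how much the choice matters on a given machine.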
