Re: split and sysread()

When you originally asked this question at speed up one-line "sort|uniq -c" perl code you said that you only wanted the 10th field from an unspecified maximum number. In that case, using a regex to isolate that field alone rather than spliting them all out and then discarding all but one was an obvious way to save some cycles. Using the sliding buffer saved some more for an overall speed-up of about x4 in my tests.

You now appear to be wanting fields (0,3,4,9,17,18,31) which means that the benefits of using a regex over split are considerably lessened--though there is still some saving. Using this in conjunction with the sliding buffer--two variations on the theme, with sysread_1 giving consistantly the best results--and a buffer size of 64k seems to achieve the best results on my machine, with the main benefit seemingly coming from bypassing stdio.

The overall saving on my machine comes out at around 50%, whether this will get you close to your target of 2 minutes you have to see once you actually do something meaningful with the fields inside the loop. If not, I think you may need quicker hardware.

The file used in the tests below is (75MB) 500_000 records x 31 pipe-delimited fields of randomly generated data.

C:\test>215578 -BUFN=16 pipes.dat
1 trial of sysread (160.090s total)
1 trial of sysread2 (182.623s total)
1 trial of stdio (324.950s total)

sysread:20000 sysread2:20000 stdio:20000
[download]

Whilst I've tried various buffer sizes, the test are hardly definitive and you may well get better results with a different size (bigger or smaller)on your machine. Good luck.

Code

#! perl -slw
use strict;
use Benchmark::Timer;
use vars qw[$BUFN];

$BUFN ||= 1; # Buffer size in steps of 4096 bytes.

my $t=Benchmark::Timer->new();
my $re_fields = qr[
    ([^|]+)\|             # Capture field 0
    (?:[^|]+\|){2}        # Skip 1 .. 2
    ([^|]+)\|            # Capture 3 .. 4
    ([^|]+)\|
    (?:[^|]+\|){4}        # Skip 5..8
    ([^|]+)\|            # Capture 9
    (?:[^|]+\|){7}        # Skip 10..16
    ([^|]+)\|            # Capture 17..18
    ([^|]+)\|
    (?:[^|]+\|){11}     # Skip 19..30
    ([^|\n]+)            # Capture 31
    \n                    # Discard the newline
]ox;

my $buffer = "";

$t->start('sysread');
open my $in, $ARGV[0] or die "Couldn't open $ARGV[0]:$!";
my %h1;

while (sysread($in, $buffer, 4096*$BUFN, length $buffer)) {
    while($buffer =~ m[$re_fields]mog ) { # << Regex.
        $h1{$_}++ for ($1,$2,$3,$4,$5,$6,$7);
    }
    $buffer = substr($buffer, 1+rindex($buffer, "\n"));
}
close $in;

$t->stop('sysread');

$t->start('sysread2');
open $in, $ARGV[0] or die "Couldn't open $ARGV[0]:$!";
my %h2;

while (sysread($in, $buffer, 4096*$BUFN, length $buffer)) {
    my ($p1, $p2) = (0) x 2;
    while ($p2 = 1 + index($buffer, "\n", $p1)) {
        $h2{$_}++ for substr($buffer, $p1, $p2 - $p1) =~ $re_fields; #
+ << Regex.
        $p1 = $p2;
    }
    $buffer = substr($buffer, 1+rindex($buffer, "\n"));
}
close $in;
$t->stop('sysread2');

my %h3;

$t->start('perlio');
open $in, $ARGV[0] or die "Couldn't open $ARGV[0]:$!";

while(<$in>){
    my @fields = split'[\|\n]';
    $h3{$_}++ for @fields[0,3,4,9,17,18,30];
}
close $in;
$t->stop('stdio');

$t->report;

print 'sysread:', scalar keys %h1, ' sysread2:', scalar keys %h2, ' st
+dio:', scalar keys %h3;

$h1{$_} ne $h3{$_} and print "$h1{$_} ne $h3{$_}" for keys %h3;
$h2{$_} ne $h3{$_} and print "$h2{$_} ne $h3{$_}" for keys %h3;
__END__
C:\test>215578 -BUFN=16 pipes.dat
1 trial of sysread (160.090s total)
1 trial of sysread2 (182.623s total)
1 trial of stdio (324.950s total)

sysread:20000 sysread2:20000 perlio:20000
[download]

Examine what is said, not who speaks.

1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
3) Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke.

Comment on Re: split and sysread() Select or Download Code

Replies are listed 'Best First'.
Re: Re: split and sysread() by perlplexer (Hermit) on Apr 20, 2003 at 14:50 UTC
with the main benefit seemingly coming from bypassing stdio I just want to clarify what that means because to some people this may sound as if Perl IO is slow and it's always better to use sysread(). In the examples that BrowserUK++ provided, the main benefit comes from the fact that in case of sysread(), the code is looking at the data that is being read only once (plus a little overhead for looking for that last "\n"). In case of normal Perl IO, i.e, `<FH>`, every character is looked at twice -- first by Perl to figure out where each line ends, then by the code itself to split everything into separate fields. That's why you're seeing a ~50% increase in performance. You can also confirm this by checking the user and system times for normal Perl IO and sysread(). You'll see that system time is pretty much the same in both cases but user time will vary. --perlplexer	[reply] [d/l]
Re: Re: Re: split and sysread() by dws (Chancellor) on Apr 20, 2003 at 16:11 UTC
In case of normal Perl IO, i.e, <FH>, every character is looked at twice -- first by Perl to figure out where each line ends, then by the code itself to split everything into separate fields. That's why you're seeing a ~50% increase in performance. I disagree. The "looking at each character" is relatively cheap. Making a new string out of each line, however, is more expensive. The sysread() approach places less load on Perl's memory management.	[reply]
Re: Re: Re: Re: split and sysread() by perlplexer (Hermit) on Apr 20, 2003 at 17:49 UTC
CPU cycles necessary for memory allocation certainly count but they constitute a fairly small percentage of CPU cycles necessary for comparing each and every character in a 75MB file -- roughly 78 million comparisons. That is considerable even if you do them directly in assembly language. I obviously don't know but I'm pretty certain that in Perl every comparison involves at least one function call and those are much more expensive that simple cmp's. Another way to show that it's not just the memory allocation that matters is to plot a graph showing execution speed vs the size of the read buffer. You'll notice that after a certain point the graph will flatten out and you won't see much improvement no matter how much memory you allocate. --perlplexer	[reply]


"be consistent"
	PerlMonks