Re: Re3: Regexes on Streams

You don't see it as buggy that how $/ matches a record depends on the layout of the data and on your buffer size in a rather complex way, with the only guarantee being that if you suck in the whole stream then it works as expected? It isn't enough to say, "The program does what it is coded to." If you are going to promise that people can use regexes to match input, then make some attempt to do it consistently, and if you can't, then at least fail in an easily explainable way. For instance I can be OK with, "I can produce strange results if $/ is supposed to match more than one block." If you can't deliver it, then don't even seem to promise it.

If you want to attempt to deliver on the promise, then one idea is the approach that I suggested at Re: Regexes on Streams, which the code above implements a variation on. If the programmer wants to give a potentially infinite match, you do your best to satisfy them. Perhaps it is qr/.*/ and life sucks. Perhaps it is /[\r\n](?:\s*[\r\n]|)/ (i.e., match end of line and any following blank lines, with either Unix or DOS line endings) and even though it is potentially infinite, with real data it is also pretty sensible and shouldn't be broken too easily. (Note: the $/ example that Dominus used in his chapter was potentially infinite...)

Here is a sample program to play with. Save it and pass it different buffer sizes as the first argument. The end-of-record expression is greedy, and its longest match in this data is 12 characters. Yet for buffer sizes 1 through 17, only twice does it produce the result which Dominus' original description would lead you to expect. With a longer data example it would continue to mess up, and there is no simple way to say when it shouldn't.

#! /usr/bin/perl -w
use strict;

sub blocks {
  my $fh = shift;
  my $blocksize = shift || 8192;
  sub {
    return unless read $fh, my($block), $blocksize;
    return $block;
  }
}

sub records {
  my $blocks = shift;
  my $terminator = @_ ? shift : quotemeta($/);
  my @records;
  my ($buf, $finished) = ("");
  sub {
    while (@records == 0 && ! $finished) {
      if (defined(my $block = $blocks->())) {
        $buf .= $block;
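        # The capturing parens make split return each terminator as its own
        # field, so the fields alternate: record body, terminator, body, ...
        my @newrecs = split /($terminator)/, $buf;
        # Hold back the last one or two fields; the final terminator might
        # still grow once the next block arrives.
        while (@newrecs > 2) {
          push @records, shift(@newrecs).shift(@newrecs);
        }
        $buf = join "", @newrecs;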
      } else {
        @records = $buf;
        $finished = 1;
      }
    }
    return shift(@records);
  }
}

my $iter_block = blocks(\*DATA, shift || 10);
my $iter_record = records($iter_block, "(?:foo\n)+");
while (my $record = $iter_record->()) {
  print "GOT A RECORD:\n$record\n";
}

__DATA__
hello
foo
foo
foo
world
foo
foo
foo
!
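
To compare buffer sizes without rerunning the script by hand, something like the following could be used. It is only a sketch: string_blocks and all_records are names I made up for this illustration, the data is the same text as in the __DATA__ section, and all_records is meant to mirror the split logic of records() while collecting everything in one pass. It prints the number of records per buffer size; the description would lead you to expect 3 every time.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for blocks(): hands out a string in
# fixed-size chunks instead of reading from a filehandle.
sub string_blocks {
  my ($data, $size) = @_;
  my $pos = 0;
  sub {
    return if $pos >= length $data;
    my $block = substr $data, $pos, $size;
    $pos += $size;
    return $block;
  }
}

# Same split-and-hold-back logic as records(), but returns
# the full list of records instead of an iterator.
sub all_records {
  my ($blocks, $terminator) = @_;
  my @records;
  my $buf = "";
  while (defined(my $block = $blocks->())) {
    $buf .= $block;
    my @newrecs = split /($terminator)/, $buf;
    while (@newrecs > 2) {
      push @records, shift(@newrecs) . shift(@newrecs);
    }
    $buf = join "", @newrecs;
  }
  push @records, $buf if length $buf;   # whatever is left at end of input
  return @records;
}

# Same text as the __DATA__ section of the sample program.
my $data = "hello\n" . "foo\n" x 3 . "world\n" . "foo\n" x 3 . "!\n";

for my $size (1 .. 17) {
  my @recs = all_records(string_blocks($data, $size), "(?:foo\n)+");
  printf "blocksize %2d: %d records\n", $size, scalar @recs;
}
```

Whatever the buffer size, the records still concatenate back to the original input; what changes is where the boundaries fall.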

As for using capturing parens in a terminator pattern: there is no real reason to want to, but it is incredibly easy to do by accident when you aren't thinking about it. If you don't know a fair amount about regexes, you could take a good while to figure out why things look weird and what to do to fix it. (Confession: When writing the above sample code I originally had capturing parens and was somewhat surprised at the results. This is how I became aware of that issue...)
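
A small illustration of the surprise, using a made-up input string rather than anything from the original chapter. split returns one extra field for every capturing group in the pattern, so a terminator with its own capturing parens throws off the body/terminator pairing that records() relies on:

```perl
use strict;
use warnings;

my $text = "a\nfoo\nfoo\nb\n";   # made-up sample input

# Non-capturing group inside: fields alternate body, terminator, body.
my @ok  = split /((?:foo\n)+)/, $text;

# Capturing group inside: each match adds a second field holding the
# last repetition, so the alternation is no longer body, terminator.
my @bad = split /((foo\n)+)/, $text;

print scalar(@ok),  " fields with (?:...)\n";   # 3 fields
print scalar(@bad), " fields with (...)\n";     # 4 fields
```

In records() the extra field gets treated as the start of the next record, which is the weirdness I ran into.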
