
A recent discussion with Limbic~Region in the CB suggested a neat enhancement for Perl regexes. I hope this can make it into Perl 6 regexes, but I'll mostly talk in terms of Perl 5 regexes since I figure nearly everyone can follow those.

It would be nice to be able to efficiently use regexes on streams. You can do it now (though not that efficiently) so long as you set yourself a maximum match size. But that isn't always easy to figure out. This is probably part of why $/ is a fixed string and not a regex.

[ For the rest of the discussion, note that when I say "string", I am thinking of this: ( $match )= $string =~ /($pattern)/; So "string" is the string that the regex is being applied to and "match" is the substring of the string that the regex matches. ]

For an example of matching against a stream where a maximum match size can be determined, see Re: Shell Script Woes (tye's try). Note how I compute the longest possible match size and ensure that the string being given to the regex is always at least that long (until I get near the end of the stream).
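For readers who don't want to chase the link, here is a minimal sketch of the maximum-match-size approach. It is not the code from that node. The name stream_matches() and its parameters are made up for illustration, and it assumes the pattern never matches the empty string, never matches more than $max_len characters, and uses no anchors or look-arounds that depend on text outside the match.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the linked node): returns every match of
# $pattern in the stream $fh, assuming no match is longer than $max_len.
sub stream_matches {
    my( $fh, $pattern, $max_len, $blk )= @_;
    $blk ||= 8*1024;
    my( $buf, $eof, @found )= ( "", 0 );
    while(  1  ) {
        # Append the next block of the stream to the buffer:
        my $res= read( $fh, $buf, $blk, length($buf) );
        die "Error reading from stream: $!"   if  ! defined $res;
        $eof= 1   if  0 == $res;
        while(  1  ) {
            my $hit= $buf =~ /$pattern/;
            my( $start, $end )= $hit ? ( $-[0], $+[0] ) : ();
            if(  ! $hit  ) {
                # Only the last $max_len-1 chars can still start a match:
                my $keep= $eof ? 0 : $max_len - 1;
                substr( $buf, 0, length($buf) - $keep, "" )
                    if  length($buf) > $keep;
                last;
            }
            if(  ! $eof  &&  length($buf) < $start + $max_len  ) {
                # Unread data might still extend this match; get more first:
                substr( $buf, 0, $start, "" );
                last;
            }
            push @found, substr( $buf, $start, $end - $start );
            substr( $buf, 0, $end, "" );    # Leave only $' in the buffer
        }
        return @found   if  $eof;
    }
}

# Tiny demonstration using an in-memory file and a deliberately small block:
open( my $fh, '<', \ "ab12cd345ef6" )
    or  die "Can't open in-memory file: $!";
print join( ",", stream_matches( $fh, qr/\d+/, 3, 4 ) ), "\n";
```

The weak spot is $max_len: guess too low and long matches get cut short, guess too high and the buffer holds more than it needs to.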

My idea for an enhancement would be a new option that would tell the regex engine that reaching the end of the string would cause the regex to note the current pos() and then fail. I'll do this with /z just to make japhy happy (well, it also makes a bit of sense).

So the regex would fail, returning control to your code. Your code could then decide how much more data from the stream to append to the string and perhaps trim data from the front of the string that we know won't be part of any future match.

So you could then regex against a stream like so:

package Stream::Match;

sub new {
    my( $class, $stream, $pattern )= @_;
    return bless {
        str => $stream,     # Stream to read from
        pat => $pattern,    # Regex to find matches
        buf => "",          # Stream data that might match later
        blk => 8*1024,      # How many bytes to read each time
    }, $class;
}

sub nextMatch {
    my( $self )= @_;
    my $svBuf= \$self->{buf};
    pos( $$svBuf )= 0;      # Just in case
    my $res= 1;
    while(  1  ) {
        if(  $res  &&  length($$svBuf) < $self->{blk}  ) {
            # Append more data from stream to $self->{buf}:
            $res= sysread( $self->{str}, $$svBuf,
                $self->{blk}, length($$svBuf) );
            if(  ! defined $res  ) {
                die "Error reading from stream ($self->{str}): $!";
            }
        }
        if(  "" eq $$svBuf  ) {
            return;     # End of file
        }
        # Note that 0 == pos( $$svBuf )
        my $match=  $res
            ?  $$svBuf =~ /$self->{pat}/zg
            :  $$svBuf =~ /$self->{pat}/g;
        if(  $match  ) {        # We got a match!
            # Copy the match (like $&):
            my $line= substr( $$svBuf, $-[0], $+[0]-$-[0] );
            # Leave only $' in $self->{buf}:
            substr( $$svBuf, 0, $+[0], "" );
            return $line;
        }
        my $pos= pos( $$svBuf );
        if(  ! defined $pos  ) {    # Didn't hit end-of-string:
            return;     # This regex will NEVER match.
        }
        # Remove $self->{buf} chars that will never start a match:
        substr( $$svBuf, 0, $pos, "" );
    }
}

package main;

# Sample use that emulates the Unix "strings" command:
binmode( STDIN );
my $o= Stream::Match->new( \*STDIN, qr/[\s -~]+(?=\0)/ );
my $string;
while(  defined( $string= $o->nextMatch() )  ) {    # A match of "0" is false
    print "$string\n";
}

This allows things to be efficient (the regex engine can start matching at the position where it left off last time) and doesn't require some arbitrary limit on maximum match size to be specified.
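Since //z doesn't exist, the closest thing available today is to work out by hand, for one specific pattern, which tail of the buffer could still grow into a match. Here is a rough sketch of that for the "strings" pattern above. The name next_printable_run() and the layout of the $state hash are my own inventions for illustration; unlike the original pattern, this version consumes the trailing NUL rather than leaving it in the buffer.

```perl
use strict;
use warnings;

# Hypothetical stand-in for nextMatch(), hard-wired to one pattern.
# $state is a hash with keys: fh (stream), buf (pending data),
# eof (flag), blk (bytes to read each time).
sub next_printable_run {
    my( $state )= @_;
    my $buf= \$state->{buf};
    while(  1  ) {
        if(  $$buf =~ /([\s -~]+)\0/  ) {
            my $run= $1;
            substr( $$buf, 0, $+[0], "" );  # Drop through the NUL
            return $run;
        }
        return   if  $state->{eof};         # Nothing more can match
        # Done by hand, what //z would report via pos(): only a trailing
        # run of printable characters can still become part of a match.
        $$buf =~ /[\s -~]*\z/;              # Always matches (maybe empty)
        substr( $$buf, 0, $-[0], "" );
        my $res= read( $state->{fh}, $$buf, $state->{blk}, length($$buf) );
        die "Error reading from stream: $!"   if  ! defined $res;
        $state->{eof}= 1   if  0 == $res;
    }
}

# Tiny demonstration using an in-memory file and a deliberately small block:
open( my $fh, '<', \ "ab\0\x01\x02cd e\0fg" )
    or  die "Can't open in-memory file: $!";
my $state= { fh => $fh, buf => "", eof => 0, blk => 3 };
while(  defined( my $run= next_printable_run( $state ) )  ) {
    print "$run\n";
}
```

The second regex has to be rederived for every new pattern, which is exactly the chore that //z would hand over to the regex engine.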

Note that //z would mostly be useful as //zg in a scalar context, since that is how you would recover pos(). Also, if you use //zg and not //zgc, then pos() would be left undefined when the regex fails without hitting the end of the string. That lets you give up on a stream that is never going to match no matter how much data you collect (though using such a regex on a stream would be strange).
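The pos() behavior this relies on already exists in Perl 5 for //g versus //gc; only the /z part is new. A small sketch of the existing behavior (the variable names are just for illustration):

```perl
use strict;
use warnings;

my $str= "aaa bbb";
pos( $str )= 0;

my $ok= $str =~ /a+/g;          # Succeeds, matching "aaa"
my $after_match= pos( $str );   # Position just past the match

$ok= $str =~ /a+/g;             # Fails: no more "a"s after that position
my $after_g_fail= pos( $str );  # Without /c, a failed match resets pos()

pos( $str )= $after_match;
$ok= $str =~ /a+/gc;            # Fails again, but /c preserves pos()
my $after_gc_fail= pos( $str );

for my $pos (  $after_match, $after_g_fail, $after_gc_fail  ) {
    print defined $pos ? $pos : "undef", "\n";
}
```

So the "! defined $pos" test in nextMatch() is asking the same question you can already ask after a plain failed //g; the proposal only adds the case where pos() survives because the engine ran off the end of the string.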

Note that the regex engine does redo some of its work: everything it did since it last advanced pos(). We could avoid this by either saving the entire state of the regex engine (so difficult that it probably requires continuations) or by letting the regex engine call user code to "grow" the string being matched against.

If we go the latter route, I'd still like to be able to tell the regex engine to fail, because features offered only via callbacks tend to be rather limiting for the coder. For example, in my code above, a callback-only solution could force the buffer to grow extremely large when successive matches are very far apart, even when the matches themselves are quite small.

- tye

Updated as described in Re^4: Applying regexes to streams: Perl enhancement idea (bug+fix). Original code inside CODE tags in HTML comments, so the "d/l code" link will fetch both versions of the code.

In reply to Applying regexes to streams: Perl enhancement idea by tye
