PerlMonks  

Applying regexes to streams: Perl enhancement idea

by tye (Sage)
on Jan 07, 2003 at 21:22 UTC ( [id://225083]=perlmeditation )

A recent discussion with Limbic~Region in the CB highlighted the possibility for a neat enhancement for Perl regexes. I hope this can make it into Perl 6 regexes, but I'll mostly talk in terms of Perl 5 regexes as I figure nearly everyone can understand that.

It would be nice to be able to efficiently use regexes on streams. You can do it now (though not that efficiently) so long as you set yourself a maximum match size. But that isn't always easy to figure out. This is probably part of why $/ is a fixed string and not a regex.

[ For the rest of the discussion, note that when I say "string", I am thinking of this:

    ( $match )= $string =~ /($pattern)/;

So "string" is the string that the regex is being applied to and "match" is the substring of the string that the regex matches. ]

For an example of matching against a stream where a maximum match size can be determined, see Re: Shell Script Woes (tye's try). Note how I compute the longest possible match size and ensure that the string being given to the regex is always at least that long (until I get near the end of the stream).
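A minimal sketch of that bounded approach in today's Perl 5 follows. It is not the code from the linked node: `stream_matches`, its parameters, and the `$safe` bookkeeping are names made up for illustration. It assumes no match is ever longer than `$max_len` bytes and that the pattern does not look ahead past that distance. A match is only accepted when it starts far enough from the end of the buffer that more data could not change it.

```perl
use strict;
use warnings;

# Hypothetical helper: return every match of $re found in the stream $fh,
# assuming that no match is longer than $max_len bytes.
sub stream_matches {
    my( $fh, $re, $max_len, $blk )= @_;
    $blk ||= 8*1024;            # Bytes to request per read
    my $buf= "";
    my $eof= 0;
    my @found;
    while( 1 ) {
        # Keep at least $max_len + $blk bytes buffered until end of file:
        while(  ! $eof  &&  length($buf) < $max_len + $blk  ) {
            my $got= read( $fh, $buf, $blk, length($buf) );
            die "Error reading from stream: $!"   if  ! defined $got;
            $eof= 1   if  0 == $got;
        }
        last   if  $eof  &&  "" eq $buf;
        # A match starting at or before $safe has seen all the data it
        # could ever need (given the $max_len assumption):
        my $safe= $eof ? length($buf) : length($buf) - $max_len;
        if(  $buf =~ $re  &&  $-[0] <= $safe  ) {
            push @found, substr( $buf, $-[0], $+[0] - $-[0] );
            # Drop the match and what precedes it (always at least 1 char):
            substr( $buf, 0, $+[0] || 1, "" );
        } elsif(  $eof  ) {
            last;               # Nothing left that could match
        } else {
            # Start positions before $safe are settled; discard them:
            substr( $buf, 0, $safe, "" )   if  0 < $safe;
        }
    }
    return @found;
}

# Sample use on an in-memory "stream" with a deliberately tiny block size:
my $data= "zzzzzzzabc123zz";
open( my $in, '<', \$data )  or  die "Can't open in-memory file: $!";
print "$_\n"   for  stream_matches( $in, qr/abc\d+/, 8, 4 );
```

The cost is that `$max_len` has to be known up front, which is exactly the limit the rest of this node tries to remove.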

My idea for an enhancement would be a new option that would tell the regex engine that reaching the end of the string would cause the regex to note the current pos() and then fail. I'll do this with /z just to make japhy happy (well, it also makes a bit of sense).

So the regex would fail, returning control to your code. Your code could then decide how much more data from the stream to append to the string and perhaps trim data from the front of the string that we know won't be part of any future match.

So you could then regex against a stream like so:

    package Stream::Match;

    sub new {
        my( $class, $stream, $pattern )= @_;
        return bless {
            str => $stream,     # Stream to read from
            pat => $pattern,    # Regex to find matches
            buf => "",          # Stream data that might match later
            blk => 8*1024,      # How many bytes to read each time
        }, $class;
    }

    sub nextMatch {
        my( $self )= @_;
        my $svBuf= \$self->{buf};
        pos( $$svBuf )= 0;  # Just in case
        my $res= 1;
        while( 1 ) {
            if(  $res  &&  length($$svBuf) < $self->{blk}  ) {
                # Append more data from stream to $self->{buf}:
                $res= sysread(
                    $self->{str}, $$svBuf, $self->{blk}, length($$svBuf)
                );
                if(  ! defined $res  ) {
                    die "Error reading from stream ($self->{str}): $!";
                }
            }
            if(  "" eq $$svBuf  ) {
                return;     # End of file
            }
            # Note that 0 == pos( $$svBuf )
            my $match= $res
                ?   $$svBuf =~ /$self->{pat}/zg
                :   $$svBuf =~ /$self->{pat}/g;
            if(  $match  ) {
                # We got a match!
                # Copy the match (like $&):
                my $line= substr( $$svBuf, $-[0], $+[0]-$-[0] );
                # Leave only $' in $self->{buf}:
                substr( $$svBuf, 0, $+[0], "" );
                return $line;
            }
            my $pos= pos( $$svBuf );
            if(  ! defined $pos  ) {
                # Didn't hit end-of-string:
                return;     # This regex will NEVER match.
            }
            # Remove $self->{buf} chars that will never start a match:
            substr( $$svBuf, 0, $pos, "" );
        }
    }

    package main;

    # Sample use that emulates the Unix "strings" command:
    binmode( STDIN );
    my $o= Stream::Match->new( \*STDIN, qr/[\s -~]+(?=\0)/ );
    my $string;
    while(  $string= $o->nextMatch()  ) {
        print "$string\n";
    }

This allows things to be efficient (the regex engine can start matching at the position where it left off last time) and doesn't require some arbitrary limit on maximum match size to be specified.

Note how //z would mostly be useful when doing //zg in a scalar context so you could recover pos(). Also, if you do //zg and not //zgc, then pos() would be undefined if the regex fails without hitting the end of the string so you could give up on streams that are never going to match no matter how much data you collect (though using such a regex on a stream would be strange).
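For background, here is how existing Perl 5 treats pos() when a scalar-context //g match fails, which is the behavior the //zg versus //zgc distinction builds on. The helper name `pos_after_failure` is made up for this illustration, and nothing here uses the proposed /z.

```perl
use strict;
use warnings;

# Hypothetical helper: report pos() after a failed //g or //gc match.
sub pos_after_failure {
    my( $keep )= @_;
    my $buf= "aXbXc";
    $buf =~ /X/g;           # Succeeds; pos( $buf ) is now 2
    if(  $keep  ) {
        $buf =~ /Q/gc;      # Fails; /c leaves pos() where it was
    } else {
        $buf =~ /Q/g;       # Fails; pos() is reset to undef
    }
    return pos( $buf );
}

my $plain= pos_after_failure( 0 );
my $kept=  pos_after_failure( 1 );
print "after //g:  ", ( defined $plain ? $plain : "undef" ), "\n";
print "after //gc: ", ( defined $kept  ? $kept  : "undef" ), "\n";
```

Under the proposal, a failed //zg would leave pos() defined only when the failure came from running into the end of the string, so an undefined pos() would keep its current meaning of "give up".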

Note that the regex engine does redo some of its work: the work from the last time it incremented pos(). We could avoid this by either saving the entire state of the regex engine (so difficult that it probably requires continuations) or by letting the regex engine call user code to "grow" the string being matched against.

If we go the latter route, I'd still like to be able to tell the regex engine to fail as implementing features only via callbacks tends to be rather limiting for the coder. For example, in my example above, only providing the callback solution could require the buffer to grow extremely large in the case of successive matches being very far apart even when the matches themselves are quite small.

                - tye

Updated as described in Re^4: Applying regexes to streams: Perl enhancement idea (bug+fix). Original code inside CODE tags in HTML comments, so the "d/l code" link will fetch both versions of the code.

Replies are listed 'Best First'.
Re: Applying regexes to streams: Perl enhancement idea
by Juerd (Abbot) on Jan 07, 2003 at 22:28 UTC

    I hope this can make it into Perl 6 regexes

    Exegesis 5:

    The inability to do pattern matches immediately on an input stream is one of Perl 5's few weaknesses when it comes to text processing. Sure, we can read line-by-line and apply pattern matching to each line, but trying to match a construct that may be laid out across an unknown number of lines is just painful.

    Not in Perl 6 though. In Perl 6, we can bind an input stream to a scalar variable (i.e. like a Perl 5 tied variable) and then just match on the characters in that stream as if they were already in memory

    (...)

    The important point is that, after the match, only those characters that the pattern actually matched will have been removed from the input stream.

    It seems that your wish will come true.

    # Perl 6
    my $variable is from($stream);

    - Yes, I reinvent wheels.
    - Spam: Visit eurotraQ.
    

      The important point is that, after the match, only those characters that the pattern actually matched will have been removed from the input stream.

      In the more general case, what is removed from the input stream is the characters that actually matched and any characters prior to them.

      But note that this still leaves the final problem I outlined. It is extremely memory-inefficient when matching short strings that occur very infrequently in very long streams.

      Well, the description is vague enough that perhaps there are also already plans to support such cases more efficiently, though I'm skeptical of that.

      Also, by the time I finished writing this up I had decided that patching this into Perl 5 might be quite easy. So I still have a couple of wishes left.

      Thanks for the pointer.

                      - tye
Re: Applying regexes to streams: Perl enhancement idea
by theorbtwo (Prior) on Jan 07, 2003 at 23:11 UTC

    I suspect you could do that with a negative look-ahead assertion on $ at the end of your regex, i.e. assert that the end of the match isn't the end of the string. I don't really understand regexes, though, and that wouldn't set pos(), because the regex would fail. You could combine that technique with a (?{code}) block that sets a variable to pos(), though, no?


    Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).

      No. That would prevent the regex from succeeding at end-of-string. What I want is to prevent the regex from backtracking due to end-of-string. This can happen at any point in the regex so there is no one place in the pattern that you can put something to cause it to happen. It would be like putting a special token at the end of the string such that every part of the regex treats that token specially.

      It could/should actually do even more than that. Even "mel" =~ /l+/z should fail because it terminated the search due to the end-of-string and the next bytes on my stream might well be "low" and so I'd want that regex to match both "l"s.
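A rough approximation of that behavior in current Perl 5 is to distrust any match that touches the end of the buffer while more data may still arrive. The sketch below is only that heuristic, not the proposed /z; `try_match`, its `$eof` flag, and the 'more' / 'match' / 'fail' return values are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: decide what to do with the current buffer.
# Returns ( 'match', $text ), ( 'more' ) or ( 'fail' ).
sub try_match {
    my( $svBuf, $re, $eof )= @_;
    if(  $$svBuf =~ $re  ) {
        my( $start, $end )= ( $-[0], $+[0] );
        # A match ending at end-of-buffer might grow with more data:
        return( 'more' )
            if  ! $eof  &&  $end == length( $$svBuf );
        return( 'match', substr( $$svBuf, $start, $end - $start ) );
    }
    # Without /z we can't tell why it failed, so keep reading until EOF:
    return(  $eof ? 'fail' : 'more'  );
}

my $buf= "mel";
print join( " ", try_match( \$buf, qr/l+/, 0 ) ), "\n";   # More data wanted
$buf .= "low";
print join( " ", try_match( \$buf, qr/l+/, 0 ) ), "\n";   # Both "l"s
```

The heuristic misses cases where the engine reached end-of-string but the match it settled for ends earlier, for example /abc|a/ against "ab", or a failed look-ahead such as /a(?!bc)/ against "ab". On a failed match it also gives no position to trim the buffer to, which is the information /z would supply.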

                      - tye
        How would I go about telling /z to wrap it up and accept the end of string as end of match? There are really two things you are asking of the engine: to continue where it left off last time, and to fail without forgetting where it's at when it hits the end of string. You need a way to be able to ask for the first without the latter. Otherwise, as a silly example (but let's pretend it isn't), /.+/z would always fail, even at the end of my input stream where I'd want it to successfully match at end of string.

        Makeshifts last the longest.

      The regex engine does not allow (?{code}) blocks to alter pos(). I wanted to do that once and dug into the source. Just prior to executing the code it saves a copy of pos() and restores it immediately afterward. tye answered the other concern but didn't address this issue.
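Reading pos() inside a (?{code}) block is allowed, though, so the "copy pos() into a variable" half of the suggestion can be sketched as below. The helper name `pos_seen_in_block` and the package variable `$seen` are made up for the example. Treat it as an illustration only: the engine's optimizations can make a match fail without ever running the block, so this is not a reliable way to learn that end-of-string was reached.

```perl
use strict;
use warnings;

our $seen;      # Package variable; older perls had quirks with lexicals here

# Hypothetical helper: record the position the engine had reached
# when it ran the code block.
sub pos_seen_in_block {
    my( $buf )= @_;
    $seen= undef;
    # Inside the block, $_ is the string being matched, so pos() works:
    my $ok= $buf =~ /l+(?{ $seen= pos() })/;
    return( $ok, $seen );
}

my( $ok, $where )= pos_seen_in_block( "mel" );
print "matched, block ran at position $where\n"   if  $ok;
```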


      Fun Fun Fun in the Fluffy Chair

Re: Applying regexes to streams: Perl enhancement idea
by Anonymous Monk on Jan 13, 2003 at 17:15 UTC

    Two root replies to this excellent, provocative thread and 10+ to many other recent contentless ones. Too bad discussion groups end up like this more often than not.

    Anyways, it's a great idea tye, go for it :).

Re: Applying regexes to streams: Perl enhancement idea
by tilly (Archbishop) on Feb 17, 2004 at 05:35 UTC
    I'm sorry that I missed this post the first time around. (My excuse is that I wasn't here...)

    For anyone else who runs across this post, the idea actually came up on the p5p list a couple of years ago. Ilya suggested that it would not happen any time soon, but suggested getting the same effect with a convoluted internal code expression. I reposted a variation of Ilya's suggestion at Re: Regexes on Streams, and tsee did the hard work of implementing it at Regexes on Streams - Revisited!.

    Yes, it is horrible, ugly, and fragile. But it is here, it works (for most cases), and it is on CPAN as File::Stream.
