There was a meditation called Regexes on Streams recently that dealt with the evil I've been doing in the File::Stream module, and I received a lot of valuable feedback. (Many thanks to those who offered their advice.) I suggest you have a look at that node first, because what follows is what has happened to the module since.

The find() method is used internally by readline(). This is where the weirdness happens.

We start out by getting the function arguments (@terms is the set of strings/regexes/objects to incorporate into our regular expression). "use re 'eval'" is needed for the (?{}) regex construct, which we'll be using to do action-at-a-distance when reaching the end of the current string buffer. $End_Of_String is a global that will be incremented when the regex encounters the end of the buffer. Lexicals did not work here due to some (?{}) weirdness.

    sub find {
        my $self  = shift;
        my @terms = @_;
        use re 'eval';
        $End_Of_String = 0;

First, the input strings/regexes/objects are transformed into compiled regexes (the second map, which runs first). Then every regex is parsed into tokens using YAPE::Regex and reconstructed as a string with '(?:\z(?{$End_Of_String++})(?!)|)' appended after every token. The result is then compiled.

        my @regex_tokens =
          map {
            # rebuild each regex with the end-of-buffer marker after every token
            my $yp  = YAPE::Regex->new($_);
            my $str = '';
            my $token;
            while ( $token = $yp->next() ) {
                $str .= $token->string()
                  . '(?:\z(?{$End_Of_String++})(?!)|)';
            }
            qr/$str/;
          }
          map {
            # normalize every term to a compiled regex
            if ( not ref($_) ) {
                qr/\Q$_\E/;
            }
            elsif ( ref($_) eq 'Regexp' ) {
                $_;
            }
            else {
                my $string = "$_";
                qr/\Q$string\E/;
            }
          } @terms;

Some more on that weird piece of regular expression:

    '(?:\z(?{$End_Of_String++})(?!)|)'

We try to match \z, the end of the string. If we are not at the end, the empty alternative after the | at the end of the group matches the empty string instead. Voila - effectively a no-op unless at the end of the string. If the \z matches, the code in (?{}) is executed (that is, $End_Of_String is incremented). The (?!) that follows always fails, so the regex then falls back to the empty alternative and carries on; the only lasting effect is the incremented global.
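
The marker can be tried in isolation. The following is only a minimal sketch: it assumes a hand-built two-token pattern ("foo" followed by "bar") instead of YAPE::Regex output, and probe() is a hypothetical helper, not part of File::Stream. How often a (?{}) block actually runs is up to the regex engine and its optimizer, so the flag is best read as a hint ("the end of the buffer was touched") rather than as an exact count.

```perl
use strict;
use warnings;

our $End_Of_String = 0;    # package global, as in find()

# The marker from above: bump the flag at the end of the string, then
# fail via (?!) so that the empty alternative is what finally matches.
my $marker = '(?:\z(?{$End_Of_String++})(?!)|)';

my $re = do {
    use re 'eval';         # needed to interpolate (?{...}) at run time
    my $str = 'foo' . $marker . 'bar' . $marker;
    qr/$str/;
};

# Hypothetical helper: returns (matched?, end-of-buffer flag).
sub probe {
    my ($buffer) = @_;
    $End_Of_String = 0;
    my $matched = ( $buffer =~ $re ) ? 1 : 0;
    return ( $matched, $End_Of_String );
}

my @mid = probe('xxfoobarxx');    # match in the middle of the buffer
my @end = probe('xxfoobar');      # match touching the end of the buffer
print "mid: matched=$mid[0] flag=$mid[1]\n";
print "end: matched=$end[0] flag=$end[1]\n";
```

In the first call the marker never sees \z, so the flag stays at zero. In the second call the match ends exactly at the end of the buffer, which is the situation find() treats as "need more data", since a longer buffer might change the match.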

Now we construct one final regex with capturing parens around each of the regexes we munged above. We compile it.

        my $re       = '(' . join( ')|(', @regex_tokens ) . ')';
        my $compiled = qr/$re/s;
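
To see how the capturing groups identify the term that matched, here is a small self-contained sketch. The terms are hypothetical, and the end-of-buffer marker is left out to keep the example short.

```perl
use strict;
use warnings;

# Hypothetical terms: a literal string, a regex, another literal.
my @terms   = ( "\n", qr/\d+/, 'END' );
my @regexes = map { ref($_) eq 'Regexp' ? $_ : qr/\Q$_\E/ } @terms;

# One capturing group per term, joined as alternatives.
my $re       = '(' . join( ')|(', @regexes ) . ')';
my $compiled = qr/$re/s;

# In list context the match returns one slot per capturing group;
# groups that did not take part in the match are undef.
my @matches = 'abc 42 END' =~ $compiled;

# The first defined slot tells us which term matched.
my ($index) = grep { defined $matches[$_] } 0 .. $#matches;
print "term $index matched '$matches[$index]'\n";
```

Note that this only lines up slot numbers with term numbers as long as the terms bring no capturing groups of their own; a term like qr/(a|b)/ would add a slot and shift the ones after it.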

We match against the buffer. If either the $End_Of_String var was set via the regex or we didn't match the string at all, the global is reset and we append more data to the buffer. Repeat until match.
Once we have a match, we determine which capturing group matched and remove everything up to and including the match from the buffer. Then, the string before the match and the match itself are returned from find().

        while (1) {
            my @matches = $self->{buffer} =~ $compiled;
            if ( $End_Of_String or not @matches ) {
                # touched the end of the buffer or no match: get more data
                $End_Of_String = 0;
                return undef unless $self->fill_buffer();
                next;
            }
            else {
                # find the first capturing group that matched
                my $index = undef;
                for ( 0 .. $#matches ) {
                    $index = $_, last if defined $matches[$_];
                }
                die if not defined $index;    # sanity check
                my $match = $matches[$index];

                # cut pre-match and match off the front of the buffer
                $self->{buffer} =~ s/^(.*?)\Q$match\E//s or die;
                return ( $1, $match );
            }
        }
    }
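
The final cut can be looked at on its own. This is a minimal sketch with a hypothetical stand-in, cut_buffer(), that works on a plain scalar reference instead of $self->{buffer}:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the end of find(): remove everything up to
# and including $match from the buffer, return (pre-match, match).
sub cut_buffer {
    my ( $buffer_ref, $match ) = @_;
    $$buffer_ref =~ s/^(.*?)\Q$match\E//s or return;
    return ( $1, $match );
}

my $buffer = "first line\nsecond line\n";
my ( $pre, $sep ) = cut_buffer( \$buffer, "\n" );
print "pre='$pre' rest='$buffer'";
```

The substitution looks for the first literal occurrence of the matched text, so the sketch (like the code it mirrors) assumes that this occurrence is where the regex actually matched.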

Can you spot any bugs? (I found one while writing the above.)

Steffen


In reply to Regexes on Streams - Revisited! by tsee
