perlmeditation
japhy
<i>as modified from my Perl5-Porters post</i>
<hr>
If we're able to look at the specific nodes of a regular
expression, then we should be able to create a regex that
works from the BACK of a string to the FRONT. What do I
mean?
<code>
($last_numbers) = $string =~ /(\d+)(?!\D*\d)/;
</code>
That regex finds the last occurrence of digits in $string.
Now, it's not all that bad, but it could be made a great
deal simpler by:
<ol>
<li> reversing the input string
<li> reversing the nodes of the regex (optimize it, too)
<li> reversing the match
</ol>
"How in the world is that simpler?" you ask. "Just watch!"
I reply.
<code>
$last_numbers = scalar reverse( # reverse the match
reverse($string) =~ # reverse the input string
/(\d+)/ # reverse the regex
);
</code>
Now, the regex <tt>(\d+)(?!\D*\d)</tt> can also be written
as <tt>(\d+)(?=\D*$)</tt>. If digits aren't followed by
another digit, then they are followed by only non-digits,
if anything. In the process of reversal, the look-ahead
can be eliminated, since the regex would be <tt>\D*(\d+)</tt>
which has a redundant leading "node", <tt>\D*</tt>. The
regex's meaning is not changed by its removal.
<br><br>
We already have <tt>re.pm</tt> which can pick a regex apart
during compilation and during the actual matching stage.
This has the ability to pick apart individual nodes of a
regular expression.
<blockquote>
Note: the term "node" refers to a combination of regex atoms
and quantifiers which can match a given substring backwards
or forwards. In the regex <tt>\d+[abc]def(foo|bar)</tt>,
the nodes are:
<ul>
<li> <tt>\d+</tt>
<li> <tt>[abc]</tt>
<li> <tt>d</tt>
<li> <tt>e</tt>
<li> <tt>f</tt>
<li> <tt>f</tt>
<li> <tt>o</tt> <a href="#note">*</a>
<li> <tt>o</tt> <a href="#note">*</a>
<li> <tt>b</tt>
<li> <tt>a</tt>
<li> <tt>r</tt>
</ul>
<a name="note">*</a>
<i>technically, the two <tt>o</tt> nodes can be combined
here</i>
<br><br>
This regex, reversed, would be <tt>(rab|oof)fed[abc]\d+</tt>
</blockquote>
For the mean time, there is no way to automatically reverse
a regex. You have to do it by hand. But the benefits of it
are amazing. Maybe they need to be seen to be believed. If
so, here is an example.
<br><br>
You have a simple programming language, called KC ("kinda
crummy"). It has a VERY simple grammar:
<ul>
<li> comments are matched by <tt><[^>]*></tt>
<li> strings are matched by <tt>"[^"\\]*(?:\\.[^"\\]*)*"</tt>
<li> everything else is matched by <tt>[^<>"]+</tt>
</ul>
The string definition is a double-quoted strings with escapes
allowed in it, like "foo \"bar\" \\ blat". All programs
written in KC can be matched by the following regex:
<code>
$CODE =~ m{
\A
(?:
[^<>"]+ # non-string, non-comment
|
"[^"\\]*(?:\\.[^"\\]*)*" # string
|
<[^>]*> # comment
)*
\z
}x;
</code>
We can write a regex to match an entire file (shown above).
Oh, before I go any further, here's the sample program:
<code>
<this is a sample program> int x = 10;
<what a silly grammar> str y = "cool \" beans";
if (len(y) GREATER_THAN x) { <can't use gt and lt symbols... heehee>
<empty comment coming up !
print y, " is longer than ", x, " characters";
<>
<and more stuff to come soon>
chop(y,x);
print "I sliced 'y' down to ", x, " characters for you";
}
</code>
Let's say we wanted to extract the LAST comment from this
program as well as be sure it was coded properly. Here's
my regex. Note: I used the "cut" regex operator
<tt>(?>...)</tt> which matches without allowing a backtrack,
to speed up the non-string, non-comment part.
<code>
($last_comment) = $CODE =~ m{
\A
(?: <[^>]*> | "[^"\\]*(?:\\.[^"\\]*)*" | (?>[^"<>]*) )*
(<[^>]*>)
(?: "[^"\\]*(?:\\.[^"\\]*)*" | (?>[^"<>]*) )*
\z
}x;
</code>
I'll also implement a reversed version. The reversed terms
for the grammar are:
<ul>
<li> comments are matched by <tt>>[^>]*<</tt>
<li> strings are matched by <tt>"(?:[^"\\]*.\\)*[^"\\]*"</tt>
<li> everything else is matched by <tt>[^<>"]+</tt>
</ul>
<code>
$last = scalar reverse(reverse($CODE) =~ m{
\A
(?: "(?:[^"\\]*.\\)*[^"\\]*" | (?>[^"<>]*) )*
(>[^>]*<)
(?: "(?:[^"\\]*.\\)*[^"\\]*" | (?>[^"<>]*) | >[^>]*< )*
\z
}x);
</code>
If you look long enough (it shouldn't take too long) you'll
see that EACH node of the terms has been reversed properly.
Now, I ran a benchmark (available on my web site -- follow
the links for "Sex, Eger!") and it showed that the reverse
method was actually faster than the forward version, EVEN
with having to reverse the input string AND the value of
<tt>$1</tt>. When I implemented a cached version, where the
reverse of <tt>$CODE</tt> was already stored, it was even
faster (obviously). Here are the wonderful results:
<br><br>
<table border=1 cellpadding=2 cellspacing=2>
<tr>
<td colspan=3 align="center"><b>Running for at least 15 seconds</b></td>
</tr>
<tr>
<td align="center"><b>Method</b></td>
<td align="center"><b>Hz</b></td>
<td align="center"><b># iters</b></td>
</tr>
<tr><td>forwards</td><td>1072.42/s</td><td>16097</td></tr>
<tr><td>backwards</td><td>1448.33/s</td><td>21725</td></tr>
<tr><td>back+cache</td><td>1613.53/s</td><td>24574</td></tr>
</table>
<br><br>
This is, to say the least, very good. The backwards version
runs %33 faster, and the cached version runs %50 faster!
<br><br>
If we already know the KC code is well-formed (that is, it
will match the entire-code regex), then we can eliminate a
lot of the code in the regexes. First, to match the last
comment, forwards:
<code>
($last_comment) = $CODE =~ m{
(<[^>]*>)
(?: "[^"\\]*(?:\\.[^"\\]*)*" | (?>[^"<>]*) )*
\z
}x;
</code>
And now backwards:
<code>
$last = scalar reverse(reverse($CODE) =~ m{
\A
(?: "(?:[^"\\]*.\\)*[^"\\]*" | (?>[^"<>]*) )*
(>[^>]*<)
}x);
</code>
The benchmark gave me the following data:
<br><br>
<table border=1 cellpadding=2 cellspacing=2>
<tr>
<td colspan=3 align="center"><b>Running for at least 15 seconds</b></td>
</tr>
<tr>
<td align="center"><b>Method</b></td>
<td align="center"><b>Hz</b></td>
<td align="center"><b># iters</b></td>
</tr>
<tr><td>forwards</td><td>1438.02/s</td><td>21786</td></tr>
<tr><td>backwards</td><td>2888.13/s</td><td>43322</td></tr>
<tr><td>back+cache</td><td>3481.73/s</td><td>52226</td></tr>
</table>
<br><br>
Another tremendous difference. The backwards version is
%100 faster, and the cached version is %150 faster!
<br><br>
What does this mean? Perhaps you should look into rethinking
how you match your strings. If you're looking for the last
of something, or for something towards the end of a string,
and the string is potentially large, thinking about regex
reversal. It could save you a lot of time.
<br><br>
This text will be updated and modified at my web site. I'll
put headings in, and all that fancy stuff. <tt>:)</tt>
<br><br>
<tt>$_="goto+F.print+chop;\n=yhpaj";F1:eval</tt><br>
Jeff "<tt>japhy</tt>" Pinyan<br>
<a href="mailto:japhy@pobox.com"><tt>japhy@pobox.com</tt></a><br>
<a href="http://www.pobox.com/~japhy/"><tt>http://www.pobox.com/~japhy/</tt></a>