comment on

First of all you are right, the regex /vr(o*)*m/ makes a naive NFA go gaga into an infinite loop. In fact I believe that Perl used to break on that exact case.

Contrary to a naive reading, this is not the same as the match vr(o*)m, in fact it gives the same result as /vro*()m/ (that is $1 winds up matching an empty string before the m).

The solution currently used is to keep track of when you start following the same logic during a zero-length match and refuse to go down the same path a second time. So before the m it will try to match another o, cannot, then tries to go around the loop and match o*, succeeds, then tries to go around the loop, detects recursion and fails back to trying to match the m, and succeeds. (Thereby finding your empty match.) A naive optimization will get this wrong.

My proposed optimization has to be done carefully, but will get it right. What I am suggesting is that you first figure out all of the possible transitions depending on what the next character is, in the order an NFA will visit them, remove the duplicates, and then toss that into a psuedo-state. In figuring out the possible transitions you need to get the zero-length recursion right. However if you do that correctly, then at no point does my optimization change what match the RE engine will find. It just allows you to "parallelize" the reasoning process and thereby eliminate a possible spot to backtrack to.

Now for those who are horribly confused, grab Mastering Regular Expressions. In a nutshell the basic problem is this. We have a regular expression. We have a string. A DFA engine (which stands for "discrete finite automata") essentially walks the string once, keeping track on each character of what all of the possible places it could be in the RE match. This gives good guaranteed performance, but it makes it hard to figure out what captured parentheses should be, or to implement backreferences (which need to keep track of more information than just where you are in the string and the RE). An NFA engine just does a recursive brute-force search of possible ways to try to match the string to the RE. This allows for lots of features (for instance the difference between .* and .*? is in whether you try to match another . first or second) but you can get into cases where there is an exponential explosion in how much work needs to be done to recognize that a string does not match. Perl, and most other scripting languages, use NFA engines. A few use DFAs.

In reply to Re (tilly) 5: Research ideas by tilly
in thread Research ideas by Wodin

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


go ahead... be a heretic
	PerlMonks