http://qs321.pair.com?node_id=56230


in reply to Flex, converting regexes, and other Interesting Stuff.

Here be dragons. :-)

The fundamental problem here lies in the differing NFA/DFA semantics. No, there is no easy way to convert one to the other.

As Mastering Regular Expressions explains so well, NFAs are a brute-force recursive search. If there is a combinatorial explosion of possible ways to start trying to match pattern to text (think about /^(\s*yada\s*)+$/ for a second) you get a situation where an NFA will hit exponential failure modes. (No, Perl didn't fail to see that there was no match, you just didn't let it work for enough years.) But since an NFA is a brute force search, just adjusting the search pattern lets you specify which alternatives come first, and whether this wildcard pattern is greedy while that one is not. Plus things like backreferences are easy to implement.

By contrast DFAs walk the string once, keeping track of all of the places they might be in the pattern at once. This is guaranteed to finish quickly, but good luck if you want to keep track of things like backreferences. (Well actually there are quadratic algorithms that will capture backreferences, but you cannot use them within the pattern.) By an implementation quirk, a DFA will always find the longest overall match. (They can also be designed the find the shortest, the Unix grep utility does this for speed.)

So NFAs and DFAs find different matches by default. NFAs offer more features and finer control. DFAs better guaranteed performance. If you understand all of that then you should be able to follow this explanation of the dangers that lie in wait for someone attempting a native conversion of flex to using Perl's RE engine.

Now in addition to the other ideas you have had (most of which will be workable, but will involve some learning) if you don't mind trying something which is possibly insane you could investigate whether Inline makes it feasible to use the Flex specification directly in Perl...

UPDATE
danger pointed out that I had said that a DFA was a brute force search when I clearly meant that an NFA was. Fixed.

  • Comment on Re (tilly) 1: Flex, converting regexes, and other Interesting Stuff.