Re (tilly) 1: Donuts and DFA's

by tilly (Archbishop)
on Feb 20, 2001 at 23:26 UTC


in reply to Donuts and DFA's
in thread Research ideas

Sort of, only not so complicated. :-)

An NFA is already compiled internally into a series of states with transitions between them. What I am proposing to do is take that representation and rewrite part of it using new states that are ordered subsets of the original.

There are two tricks here.

The more obvious one is summed up by the phrase, "Put off figuring out why you have done this work as long as possible." That is, if you are going to have to backtrack and then go forward with identical logic, why not arrange things so that you go forward following the common logic as long as possible? This is next-character discrimination: for instance, an RE to match all of the US state names would be faster because, if you saw an A first, you would already have figured out that you only need to look at states whose names start with A.
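To make that concrete, here is a minimal sketch in Perl (the four state names are just an illustration I picked; nothing here comes from perl's RE engine itself). The second pattern is the flat alternation with its common prefixes factored out, so a first character other than 'A' rejects every branch at once:

    use strict;
    use warnings;

    # Flat alternation: on each failed branch the engine backtracks
    # and re-reads the same leading characters for the next branch.
    my $flat = qr/\b(?:Alabama|Alaska|Arizona|Arkansas)\b/;

    # Prefix-factored form: the shared 'A', and then 'la' or 'r', are
    # each examined once -- the "next character discrimination" above.
    my $factored = qr/\bA(?:la(?:bama|ska)|r(?:izona|kansas))\b/;

    for my $re ($flat, $factored) {
        print "Arkansas" =~ $re ? "match\n" : "no match\n";
    }

(Later perls grew an internal trie optimization for alternations along exactly these lines.)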

The more subtle and important one is that if you are aggressive about it, you can eliminate a lot of repeated following of the same reasoning. Basically, if you ever arrive at the same place in the string, with the same state information relevant to your match, at the same position in your RE, then the outcome is going to be the same. With a traditional NFA you might arrive at a state, go forward, fail, and backtrack. Then you follow a different line of reasoning forward, reach the same place, and of course do the same work to figure out that you fail again. That is all repeated, useless work; you already did this stuff.

Alternately, if it succeeds the first time, it will not backtrack, and again it is irrelevant that there is another way to reach the same state.

Either way, the *second* way to reach the same combination is irrelevant. So in the process of optimizing, if you see that you can reach the same state along multiple lines of reasoning, you can safely drop the second way of reaching it. By explicitly parallelizing the reasoning process that the NFA will use, you make it realize that two chains of reasoning come back to the same state, so the ordered sets of states can omit duplicates. And that is why you can avoid the combinatorial backtracking disaster!
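Here is a minimal sketch of that parallelized simulation, assuming a toy NFA (the machine, and names like %delta and match(), are mine for illustration). Each step carries an ordered set of live states, exactly the "ordered subsets of the original" from above, and the %seen check is what drops the second route to a state:

    use strict;
    use warnings;

    # Toy NFA for /a*a/: from state 0 an 'a' can either stay in the
    # loop or take the final step, so two routes can reach one state.
    my %delta     = ( 0 => { a => [ 0, 1 ] } );
    my %accepting = ( 1 => 1 );

    sub match {
        my ($string) = @_;
        my @current = (0);    # ordered subset of live NFA states
        for my $ch ( split //, $string ) {
            my ( @next, %seen );
            for my $s (@current) {
                for my $t ( @{ $delta{$s}{$ch} || [] } ) {
                    # A second route to the same state at the same
                    # position cannot change the outcome: drop it.
                    push @next, $t unless $seen{$t}++;
                }
            }
            @current = @next;
            return 0 unless @current;
        }
        return ( grep { $accepting{$_} } @current ) ? 1 : 0;
    }

    print match("aaaa") ? "match\n" : "no match\n";

Without the %seen check the number of live paths can grow with every character; with it, @current can never be larger than the original number of states.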

OTOH there is no requirement to optimize the whole RE in this way. In fact, when you are capturing a backreference used elsewhere in the RE, state information relevant to the global match is being captured, and you explicitly cannot apply this optimization. You might still optimize bits and pieces, but you won't be able to find a simple and efficient way to match crazy things like /(.*).*\1/...
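A tiny demonstration of why that last RE resists this treatment (I use (.+) rather than (.*) here so the capture is non-trivial): what \1 has to match depends on whatever (.+) happened to capture, so the relevant "state" is no longer one of finitely many subsets of NFA states:

    use strict;
    use warnings;

    my $s = "abcXabc";
    if ( $s =~ /(.+).*\1/ ) {
        print "repeated chunk: $1\n";    # prints "abc"
    }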

The Return of the Donuts
by gryng (Hermit) on Feb 21, 2001 at 00:28 UTC
    Ok tilly, I think I understand now. Basically, one simple way to do this optimization would be to keep a list of all the states you've visited so far that have failed: not the exact state that failed, but the furthest-back state that would definitively lead to that failed state.

    Then, before proceeding to any further state, you check that your current state isn't in the list of failed states. You could also prune states that are too old to match anymore, and heavily optimize the search of this failed-state list based on your current state.

    The other thing you could do to make it more efficient is, instead of keeping only the furthest-back state, to allow branching and store multiple fail-states for each failure that occurs.
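    A minimal sketch of that failed-state list, assuming the same toy NFA as above and ignoring captures (all the names here are illustrative). Once a (state, position) pair is known to fail, any later line of reasoning that reaches the same pair is cut off immediately:

        use strict;
        use warnings;

        my %delta     = ( 0 => { a => [ 0, 1 ] } );
        my %accepting = ( 1 => 1 );
        my %failed;    # "state,pos" => 1 for known dead ends

        sub try {
            my ( $s, $pos, $str ) = @_;
            return 1 if $pos == length($str) && $accepting{$s};
            return 0 if $failed{"$s,$pos"};    # been here, failed here
            my $ch = substr( $str, $pos, 1 );
            for my $t ( @{ $delta{$s}{$ch} || [] } ) {
                return 1 if try( $t, $pos + 1, $str );
            }
            $failed{"$s,$pos"} = 1;    # remember the dead end
            return 0;
        }

        print try( 0, 0, "aaaa" ) ? "match\n" : "no match\n";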

    State machines are odd beasts :) .

    Ciao,
    Gryn

      Yup. In fact Perl 5.6 does a version of this. However, doing it with a run-time cache has several serious disadvantages.

      First of all, it is much slower, because you are always having to check the cache and add information to it. In 5.6 the RE is labelled simple or complex and profiled at run time, with caching turned on only if the match seems to be running slowly.

      Secondly, there are a variety of situations in which you have to throw away the cache. At least one of the RE bugs in 5.6.0 is that it was not smart enough about this and got it wrong in a few cases. It appears that with my approach it takes less work to figure out when you can and cannot optimize.

      Thirdly, when you do the work at compilation you get multiple speed wins for the price of one. For instance, this approach also covers trie-ing the RE, which is something that run-time caching does not do.

      And so on.
