|Perl: the Markov chain saw|
Why machine-generated solutions will never cease to amaze meby grinder (Bishop)
|on Nov 10, 2004 at 11:24 UTC||Need Help??|
Consider the set of strings construction, contorsion, destruction, transaction. Using the module Regexp::Optimizer, you can generate the pattern (?:con(?:struct|tors)ion|(?:destru|transa)ction) as a result. This is an excellent pattern, since no backtracking is required (notwithstanding the inspections required to examine different a|b|c alternations). As soon as this pattern fails to track the target string being matched it can fail quickly. This is a very desirable property.
A contrario, the simple-minded pattern construction|contorsion|destruction|transaction will backtrack on a target like conduct, because it will have to try both "con-" alternations. (And apologies to RE engine internals specialists, I know this explanation skips over a few details).
All this is well and good, but consider the set of four words again. They all end in "-ion", and it would be nice to be able to factor that out. Because when you have lots and lots of strings (read: thousands) with common tails, the savings are considerable. My algorithm produces (?:con(?:struct|tors)|(?:destru|transa)ct)ion. The tradeoff is that at the cost of making the engine jump through a few more hoops, the pattern is shorter.
In putting the finishing touches on the module, and making sure my test suite covers all the edge cases, I prepared a particular test to cover what I expected the module should produce. When I ran the test, the module produced something else.
Consider the set of strings sad, salad, spread. Quick, think of a pattern that will match these strings, factoring out the trailing "-ad". My test case solution was that it should generate s(?:al|pre)?ad. When I ran the test, it failed. Instead, it produced s(?:(?:al)?|pre)ad. That looks so bizarre I had to paste in manually into a one-liner to prove to myself that it worked.
And the crazy thing is that it does work. What is more, when running with use re 'debug', to my untutored eye, it appears to do so more efficiently.
The (al|pre)? pattern produces the following debug output:
Whereas the ((al)?|pre) pattern produces:
And try as I might, I can't find a string that behaves differently between the two. I've tried all the possibilities I can think of, like sadad salalad saladad slad sal, but they behave identically. I still have a nagging doubt that there may be a difference; if anyone can point it out I be most appreciative.
Before I embarked on this journey into regular expressions, I would not have guessed that foo|(bar)?|rat is logically equivalent to (foo|bar|rat)?. It took a machine-generated solution to the problem for me to realise. You learn something every day.
- another intruder with the mooring of the heart of the Perl