http://qs321.pair.com?node_id=822898


in reply to Reverse regexp a regexp?

Is there any "reverse regex" module, code or information available which explains how to get a regexp back to a fully conditioned string in Perl? Like the regexp /^http:\/\/(?:www.)?youtube.com\/user\/(.*)\/$// could get translated back to an original string like http://www.youtube.com/user/username ?
This is possible by converting a regex into some sort of finite automaton and doing a traversal on it. At one point I had been working on a suite of modules that would let one manipulate regular expressions & automata in this way, with these kinds of things as a goal (tinkering with perl regexes). Unfortunately, the development has ground to a halt, and we never got quite that far. I know that other programming languages have more developed tools for this kind of stuff, specifically JFLAP for java. It might be worth checking out.

The regex you give is anchored with ^ and $, so that makes the problem easier -- this is the standard interpretation of "classical" (not Perl) regular expressions, so the techniques from automata theory are more readily applicable. And if all the regexes you care about are as simple as the one you gave, it should be easy to roll your own solution. You'll have to parse the regex, but once you have the regex structure in a recursive data structure, it should be straight-forward. For example, you can replace any (...)? or (....)* with nothing. You can replace a "." with, say, character "x". You can replace any ( ... | .... ) by recursing on just one of the alternations, etc etc..

To stay with regexes; is there way to create a regexp of different inputs when they look similar ? Like a data set of 10 url's, create a most compatible regexp from this? This way regexp could be dynamic usable with a set of data; not bothering the user at all to search for the best options...
This problem is a bit underspecified. For example, if you give me any sample of, say, 10 URLs, the regex /./ would match them all, but I doubt that's what you have in mind. I think it may be a challenge to be more precise about what you mean by "most compatible regex". And I think any non-trivial way of defining this problem will result in a very computationally difficult task.

Here is one way of formalizing the problem that seems natural, but is actually an exceedingly difficult computational problem: given a sample of some URLs that the regex should match, and some others that the regex should not match, find the smallest regex that is consistent. Even getting a good heuristic is NP-complete (minimum consistent regular expression problem).

blokhead