Syntactic Confectionery Delight | |
PerlMonks |
Re: [RFC] Building Regex Alternations Dynamicallyby AnomalousMonk (Archbishop) |
on Jan 19, 2017 at 19:08 UTC ( [id://1179929]=note: print w/replies, xml ) | Need Help?? |
Let's say you have a list of strings, like ("abc", "def", "ghi") ... I think it's very important to highlight the fact that these strings are assumed to be literal strings or their equivalent, and cannot contain regex "operators" per se. Perhaps "Let's say you have a list of literal strings or their moral equivalent, like ..." You make this point further on in point 3 in the first group of bullet points, but I think it should be emphasized early and often. sort { length $b <=> length $a } # 2. sort { length $b <=> length $a Your discussion of sorting by length in an alternation of this kind is critical. I feel the desired end can be achieved in a more uniform way, although it's a bit more tricky to explain.
Say we have the set qw(ab abc abcd wx wxy wxyz) of literal strings we wish to build into an alternation. To build a "longest match" ordered alternation, it's essential to have abcd appear before abc and abc before ab in the alternation as you've explained, but it's irrelevant where wx wxy wxyz appear relative to the former subgroup. The reason is that nothing in the literal wx wxy wxyz subgroup can possibly match anything that the literal ab abc abcd subgroup matches and vice versa. So a standard Does it ever matter that members of the wx wxy wxyz subgroup fall before, after, or within the ab abc abcd subgroup? Never. The regex (?^msx:abcd|wxyz|wxy|abc|ab|wx) is functionally equivalent (in terms of longest matching) to (?^msx:wxyz|wxy|wx|abcd|abc|ab) In addition, there's the whole $b/$a versus $a/$b issue in controlling descending versus ascending sorting. This is very easy to screw up and can be very hard to see to fix; reverse sort eliminates this ambiguity.
A situation in which a performance difference might emerge is if it were known that, e.g., ab was overwhelmingly more likely to occur than wx and so should be tested first (along with all its cohort) to avoid needlessly burning computrons in a heavy-duty matching situation: I tend to use a reverse sort step in all cases when I'm building this type of alternation, even when all substrings to be matched are known to be mutually exclusive or when the alternation will be anchored in such a way that alternation match length is not a consideration. (Of course, if the programer wants a shortest match alternation, that's an entirely different question, and also a point upon which you've touched.) Give a man a fish: <%-{-{-{-<
In Section
Meditations
|
|