Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Re: [RFC] Building Regex Alternations Dynamically

by AnomalousMonk (Archbishop)
on Jan 19, 2017 at 19:08 UTC ( [id://1179929]=note: print w/replies, xml ) Need Help??


in reply to Building Regex Alternations Dynamically

Let's say you have a list of strings, like ("abc", "def", "ghi") ...

I think it's very important to highlight the fact that these strings are assumed to be literal strings or their equivalent, and cannot contain regex "operators" per se. Perhaps "Let's say you have a list of literal strings or their moral equivalent, like ..." You make this point further on in point 3 in the first group of bullet points, but I think it should be emphasized early and often.

sort { length $b <=> length $a }  # 2.
sort { length $b <=> length $a
       or $a cmp $b }             # 3.

Your discussion of sorting by length in an alternation of this kind is critical. I feel the desired end can be achieved in a more uniform way, although it's a bit more tricky to explain.

Say we have the set  qw(ab abc abcd wx wxy wxyz) of literal strings we wish to build into an alternation. To build a "longest match" ordered alternation, it's essential to have  abcd appear before  abc and  abc before  ab in the alternation as you've explained, but it's irrelevant where  wx wxy wxyz appear relative to the former subgroup. The reason is that nothing in the literal  wx wxy wxyz subgroup can possibly match anything that the literal  ab abc abcd subgroup matches and vice versa. So a standard
    reverse sort
expression serves in all cases.

c:\@Work\Perl\monks>perl -wMstrict -le "my ($alt) = map qr{$_}xms, join '|', map quotemeta, reverse sort qw(wx ab wxy abc wxyz abcd) ; print $alt; " (?^msx:wxyz|wxy|wx|abcd|abc|ab)
Does it ever matter that members of the  wx wxy wxyz subgroup fall before, after, or within the  ab abc abcd subgroup? Never. The regex
    (?^msx:abcd|wxyz|wxy|abc|ab|wx)
is functionally equivalent (in terms of longest matching) to
    (?^msx:wxyz|wxy|wx|abcd|abc|ab)

In addition, there's the whole  $b/$a versus  $a/$b issue in controlling descending versus ascending sorting. This is very easy to screw up and can be very hard to see to fix;  reverse sort eliminates this ambiguity.

A situation in which a performance difference might emerge is if it were known that, e.g.,  ab was overwhelmingly more likely to occur than  wx and so should be tested first (along with all its cohort) to avoid needlessly burning computrons in a heavy-duty matching situation:
    (?^msx:abcd|abc|ab|wxyz|wxy|wx)
But this is a situation that requires carefully hand-crafting a regex, and may be outside the scope of your article.

I tend to use a  reverse sort step in all cases when I'm building this type of alternation, even when all substrings to be matched are known to be mutually exclusive or when the alternation will be anchored in such a way that alternation match length is not a consideration. (Of course, if the programer wants a shortest match alternation, that's an entirely different question, and also a point upon which you've touched.)


Give a man a fish:  <%-{-{-{-<

Replies are listed 'Best First'.
Re^2: [RFC] Building Regex Alternations Dynamically
by haukex (Archbishop) on May 14, 2017 at 11:56 UTC
    So a standard reverse sort expression serves in all cases.

    As I /msg'ed you at the time you posted this, I had a doubt about this - and now I've finally found the time to confirm that I wasn't being entirely paranoid ;-) Under locale, it's possible for longer strings to be sorted before shorter ones. Here are the instructions to reproduce (on a Debian-based system):

    $ mkdir -v /tmp/localetest; cd /tmp/localetest # copy "en_US" and the files it refers to via its "copy" statements $ cp -v /usr/share/i18n/locales/{en_US,en_GB,i18n,iso14651_t1,iso14651 +_t1_common} . $ mv -v en_US en_TESTING

    Now edit the file en_TESTING and insert the following into the LC_COLLATE section after the copy "iso14651_t1" statement:

    collating-element <bc> from "<U0062><U0063>" script <TESTING> order_start <TESTING>;forward;forward;forward;forward,position <bc> "<a><c>";"<BAS><BAS>";"<MIN><MIN>";IGNORE order_end

    Now compile and test the locale:

    $ localedef -i en_TESTING -f UTF-8 -c ./en_TESTING.UTF-8 $ LOCPATH=/tmp/localetest LC_ALL=en_TESTING.UTF-8 perl -Mlocale -le 'p +rint for sort qw/ab a b bc/' a ab bc b $ LOCPATH=/tmp/localetest LC_ALL=en_TESTING.UTF-8 perl -wMstrict -lMlo +cale print "1 $_" for "abcabbc"=~/${\join "|", reverse sort qw(b bc) }/g; print "2 $_" for "abcabbc"=~/${\join "|", sort {length$b<=>length$a} q +w(b bc) }/g; __END__ 1 b 1 b 1 b 2 bc 2 b 2 bc

    Now of course the real locale definitions are long and I don't plan on reading all of them, and so it's entirely possible that currently no locales exist that define this kind of sorting order. But I'll just be a bit paranoid and stick with sorting by length explicitly :-)

    Update: I did incorporate your suggestion about highlighting the fact that strings are matched literally, thank you!

      In all my CS/IS career, I've been able get away with taking no more than a couple of quick steps into locale-land and then turning and scurrying back to safety. I will have to do a lot more reading and experimenting to be able to respond meaningfully to what you've posted, but it looks... interesting.


      Give a man a fish:  <%-{-{-{-<

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1179929]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (4)
As of 2024-03-29 01:35 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found