Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
Let's say you have a list of strings, like ("abc", "def", "ghi") ...

I think it's very important to highlight the fact that these strings are assumed to be literal strings or their equivalent, and cannot contain regex "operators" per se. Perhaps "Let's say you have a list of literal strings or their moral equivalent, like ..." You make this point further on in point 3 in the first group of bullet points, but I think it should be emphasized early and often.

sort { length $b <=> length $a }  # 2.
sort { length $b <=> length $a
       or $a cmp $b }             # 3.

Your discussion of sorting by length in an alternation of this kind is critical. I feel the desired end can be achieved in a more uniform way, although it's a bit more tricky to explain.

Say we have the set  qw(ab abc abcd wx wxy wxyz) of literal strings we wish to build into an alternation. To build a "longest match" ordered alternation, it's essential to have  abcd appear before  abc and  abc before  ab in the alternation as you've explained, but it's irrelevant where  wx wxy wxyz appear relative to the former subgroup. The reason is that nothing in the literal  wx wxy wxyz subgroup can possibly match anything that the literal  ab abc abcd subgroup matches and vice versa. So a standard
    reverse sort
expression serves in all cases.

c:\@Work\Perl\monks>perl -wMstrict -le "my ($alt) = map qr{$_}xms, join '|', map quotemeta, reverse sort qw(wx ab wxy abc wxyz abcd) ; print $alt; " (?^msx:wxyz|wxy|wx|abcd|abc|ab)
Does it ever matter that members of the  wx wxy wxyz subgroup fall before, after, or within the  ab abc abcd subgroup? Never. The regex
    (?^msx:abcd|wxyz|wxy|abc|ab|wx)
is functionally equivalent (in terms of longest matching) to
    (?^msx:wxyz|wxy|wx|abcd|abc|ab)

In addition, there's the whole  $b/$a versus  $a/$b issue in controlling descending versus ascending sorting. This is very easy to screw up and can be very hard to see to fix;  reverse sort eliminates this ambiguity.

A situation in which a performance difference might emerge is if it were known that, e.g.,  ab was overwhelmingly more likely to occur than  wx and so should be tested first (along with all its cohort) to avoid needlessly burning computrons in a heavy-duty matching situation:
    (?^msx:abcd|abc|ab|wxyz|wxy|wx)
But this is a situation that requires carefully hand-crafting a regex, and may be outside the scope of your article.

I tend to use a  reverse sort step in all cases when I'm building this type of alternation, even when all substrings to be matched are known to be mutually exclusive or when the alternation will be anchored in such a way that alternation match length is not a consideration. (Of course, if the programer wants a shortest match alternation, that's an entirely different question, and also a point upon which you've touched.)


Give a man a fish:  <%-{-{-{-<


In reply to Re: [RFC] Building Regex Alternations Dynamically by AnomalousMonk
in thread Building Regex Alternations Dynamically by haukex

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others about the Monastery: (5)
As of 2024-04-25 08:55 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found