Let's say you have a list of strings, like ("abc", "def", "ghi") ...

I think it's very important to highlight the fact that these strings are assumed to be literal strings or their equivalent, and cannot contain regex "operators" per se. Perhaps "Let's say you have a list of literal strings or their moral equivalent, like ..." You make this point further on in point 3 in the first group of bullet points, but I think it should be emphasized early and often.

sort { length $b <=> length $a }  # 2.
sort { length $b <=> length $a
       or $a cmp $b }             # 3.

Your discussion of sorting by length in an alternation of this kind is critical. I feel the desired end can be achieved in a more uniform way, although it's a bit more tricky to explain.

Say we have the set  qw(ab abc abcd wx wxy wxyz) of literal strings we wish to build into an alternation. To build a "longest match" ordered alternation, it's essential to have  abcd appear before  abc and  abc before  ab in the alternation as you've explained, but it's irrelevant where  wx wxy wxyz appear relative to the former subgroup. The reason is that nothing in the literal  wx wxy wxyz subgroup can possibly match anything that the literal  ab abc abcd subgroup matches and vice versa. So a standard
    reverse sort
expression serves in all cases.

c:\@Work\Perl\monks>perl -wMstrict -le "my ($alt) = map qr{$_}xms, join '|', map quotemeta, reverse sort qw(wx ab wxy abc wxyz abcd) ; print $alt; " (?^msx:wxyz|wxy|wx|abcd|abc|ab)
Does it ever matter that members of the  wx wxy wxyz subgroup fall before, after, or within the  ab abc abcd subgroup? Never. The regex
    (?^msx:abcd|wxyz|wxy|abc|ab|wx)
is functionally equivalent (in terms of longest matching) to
    (?^msx:wxyz|wxy|wx|abcd|abc|ab)

In addition, there's the whole  $b/$a versus  $a/$b issue in controlling descending versus ascending sorting. This is very easy to screw up and can be very hard to see to fix;  reverse sort eliminates this ambiguity.

A situation in which a performance difference might emerge is if it were known that, e.g.,  ab was overwhelmingly more likely to occur than  wx and so should be tested first (along with all its cohort) to avoid needlessly burning computrons in a heavy-duty matching situation:
    (?^msx:abcd|abc|ab|wxyz|wxy|wx)
But this is a situation that requires carefully hand-crafting a regex, and may be outside the scope of your article.

I tend to use a  reverse sort step in all cases when I'm building this type of alternation, even when all substrings to be matched are known to be mutually exclusive or when the alternation will be anchored in such a way that alternation match length is not a consideration. (Of course, if the programer wants a shortest match alternation, that's an entirely different question, and also a point upon which you've touched.)


Give a man a fish:  <%-{-{-{-<


In reply to Re: [RFC] Building Regex Alternations Dynamically by AnomalousMonk
in thread Building Regex Alternations Dynamically by haukex

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.