
Re: [RFC] Building Regex Alternations Dynamically

by kcott (Archbishop)
on Jan 19, 2017 at 10:11 UTC


in reply to Building Regex Alternations Dynamically

G'day Hauke,

++ This looks like a very good start; seems reasonably complete; and covers most of the points I might have made. I have comments on two areas, as follows.

With any sort of tutorial, those reading it — to learn about the subject, rather than for reviewing, proof-reading, etc. — probably start with limited knowledge. Accordingly, any terms used should be unambiguous; unfortunately, you've used $regex to mean two different things:

$regex = join ...
$regex = qr/...

I'm familiar with both the subject matter and the technique, so this posed no problem for me; however, for someone learning this, it may do. While it's reasonably obvious in the short code example, half a page later, in the middle of descriptive text, the meaning of $regex might not be as obvious to the student as it is to you or me. Consider renaming those; purely as a suggestion:

$regex_base_str = join ...
$regex_compiled = qr/...

In points (4) & (5), in the first list, you show grouping. To resolve the same issue in both, you use explicit capture grouping in (4), and implicit non-capture grouping in (5).

Regex pieces used for alternation often occur as part of a larger regex; in fact, I suspect that's the more usual case. This may be as simple as the anchor assertions you show in (4), or could be a lot more complex. I'd suggest adding explicit non-capturing grouping to $regex_base_str (or whatever you call it) as part of the normal technique. To demonstrate:

# Simple case: OK - matches "a" or "b"
$ perl -E 'my $re = "a|b"; $re = qr{$re}; say $re'
(?^u:a|b)

# Complex case: NOT OK - matches "Xa" or "bY"
$ perl -E 'my $re = "a|b"; $re = qr{X${re}Y}; say $re'
(?^u:Xa|bY)

# Complex case: OK - matches "a" or "b" [fixed with "(?:...)"]
$ perl -E 'my $re = "(?:a|b)"; $re = qr{X${re}Y}; say $re'
(?^u:X(?:a|b)Y)
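The same difference shows up when the patterns are actually matched, not only in their stringified form. The following is a minimal sketch; the variable names ($alt_str, $ungrouped, $grouped) and the test inputs are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical alternation string, as in the one-liners above.
my $alt_str = 'a|b';

# Without grouping, the alternation splits the whole pattern: Xa|bY
my $ungrouped = qr{X${alt_str}Y};

# With an explicit non-capturing group, only "a|b" alternates: X(?:a|b)Y
my $grouped = qr{X(?:$alt_str)Y};

for my $input (qw(XaY XbY Xa bY)) {
    printf "%-3s  ungrouped=%-5s  grouped=%s\n",
        $input,
        ($input =~ /\A$ungrouped\z/ ? 'match' : 'no'),
        ($input =~ /\A$grouped\z/   ? 'match' : 'no');
}
```

With full-string anchoring, the ungrouped pattern accepts only "Xa" and "bY", while the grouped pattern accepts only "XaY" and "XbY".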

— Ken

Re^2: [RFC] Building Regex Alternations Dynamically (updated x2)
by haukex (Archbishop) on Jan 19, 2017 at 15:18 UTC

    Hi Ken,

    Thank you very much for your thoughtful reply!

    With any sort of tutorial, those reading it — to learn about the subject, rather than for reviewing, proof-reading, etc. — probably start with limited knowledge. Accordingly, any terms used should be unambiguous; unfortunately, you've used $regex to mean two different things

    Excellent point. I've renamed the variables to disambiguate (I kept the names shorter though), and I've added a link to perlretut.

    In points (4) & (5), in the first list, you show grouping. To resolve the same issue in both, you use explicit capture grouping in (4), and implicit non-capture grouping in (5).

    Another excellent point, I've switched to using the non-capturing groups in all the examples.

    I'd suggest adding explicit non-capturing grouping to $regex_base_str (or whatever you call it) as part of the normal technique.

    That is a very good point, but I haven't made the change yet because I need to think on it a bit more. On the one hand, I think that adding an extra (?:...) makes the generated regex look a little more complex than it needs to be (qr/(?:a|b)/ eq "(?^:(?:a|b))" and on older Perls qr/(?:a|b)/ eq "(?-xism:(?:a|b))"), and also it makes the code to generate the string less elegant (my $regex_str = '(?:'.join('|', map ... ).')'). But those are just stylistic concerns and you're right that it would eliminate the pitfall that I have to discuss at length in points 4 and 5.
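For comparison, the explicitly wrapped string could be built roughly as follows. This is only a sketch: the word list is hypothetical, and the use of quotemeta and longest-first sorting is an assumption about what the elided "map ..." contains, not something stated in this reply.

```perl
use strict;
use warnings;

# Hypothetical word list; not from the thread.
my @words = ('foo', 'foobar', 'a.bc');

# Assumption: escape metacharacters and try longer alternatives first.
my $regex_str = '(?:'
    . join('|', map { quotemeta } sort { length($b) <=> length($a) } @words)
    . ')';

print "$regex_str\n";    # prints (?:foobar|a\.bc|foo)

# Because of the explicit group, the string is safe to embed in a larger pattern.
my $regex = qr{\A${regex_str}\z};
print "a.bc matches\n"       if 'a.bc' =~ $regex;
print "axbc does not match\n" if 'axbc' !~ $regex;
```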

    The other potential solution, which I'm currently leaning towards, is to skip the intermediate string variable, like in Haarg's post here: my ($regex) = map {qr/$_/} join '|', map .... I like this latter approach better because it's more robust (no string for the user to potentially misuse), but it does add one more "trick" that has to be explained to the beginner. Currently I feel that the advantages of that outweigh the disadvantages...
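A sketch of the no-intermediate-string approach, under the same assumptions as before (hypothetical word list, quotemeta standing in for the elided "map ..."). The compiled Regexp object carries its own implicit group when stringified, so it can be embedded in a larger pattern without further wrapping.

```perl
use strict;
use warnings;

# Hypothetical word list; not from the thread.
my @words = ('foo', 'bar');

# join() yields one string; map compiles it; list assignment takes that one result.
my ($regex) = map { qr/$_/ } join '|', map { quotemeta } @words;

print ref($regex), "\n";    # prints Regexp

# Embedding the object keeps the alternation contained between X and Y.
for my $input (qw(XfooY XbarY Xfoo barY)) {
    printf "%-5s %s\n", $input,
        ($input =~ /\AX${regex}Y\z/ ? 'match' : 'no match');
}
```

Here only "XfooY" and "XbarY" match, because there is never a bare alternation string that could be interpolated without its group.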

    Update 2, 2017-05-14: This node used to contain a draft, which I've now incorporated into the root node. I wanted to preserve the original text here:

    Thanks,
    -- Hauke D

      (The following is perhaps a bit OT to the main thread, or else already touched upon in an update. Oh, well...)

      ... I think that adding an extra (?:...) makes the generated regex look a little more complex than it needs to be ... those are just stylistic concerns ...

      Please see Re: Recognizing 3 and 4 digit number and thereunder for a long discussion between myself and kcott on these "stylistic concerns." Personally, I still don't see the need for the extra explicit (?:...) wrap step. The implicit wrap becomes explicit quickly enough if you print the stringized Regexp object, and this feature of a Regexp object should be deeply understood from the moment one begins to use them.

      ... skip the intermediate string variable, like in Haarg's post here: my ($regex) = map {qr/$_/} join '|', map .... ... it's more robust ...

      To continue the previous point, I feel it's important to get a regex into a Regexp form as quickly as possible: no dilly-dallying. Once objectified, it can be used atomically when composing more complex regex expressions:

      my $rx = qr{ ... }xms;
      ...
      $string =~ m{ $rx* $rx+? $rx{3,4} }xms and do_something();
      (Of course, this compositional capability is also addressed by the (DEFINE) predicate of the (?(condition)yes-pattern) conditional expression of Perl 5.10+.)
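As a rough illustration of the (DEFINE) alternative mentioned above, the following sketch defines a named sub-pattern once and reuses it with (?&name). The pattern name "item", the variable $list_rx, and the comma-separated number format are all hypothetical.

```perl
use strict;
use warnings;

# Requires Perl 5.10 or later for (DEFINE) and named recursion.
my $list_rx = qr{
    \A (?&item) (?: , (?&item) )* \z   # one item, then zero or more ",item"
    (?(DEFINE)
        (?<item> \d+ )                 # hypothetical named sub-pattern
    )
}x;

for my $input ('1,22,333', '1,,2') {
    printf "%-8s %s\n", $input, ($input =~ $list_rx ? 'match' : 'no match');
}
```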

      The only situation which I'm aware of in which this compositional atomicity breaks down is for something like

      my ($n, $m) = (3, 4);
      ...
      $string =~ m{ $rx{$n,$m} }xms and do_something();
      where $rx{$n}, $rx{$n,} and $rx{$n,$m} are all taken as hash element accesses. This can be fixed simply by an explicit layer of non-capturing group wrapping (entirely necessary here!):
          (?:$rx){$n} (?:$rx){$n,} (?:$rx){$n,$m}
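A small sketch of both forms, assuming a hypothetical $rx of qr{ab} and bounds of 2 and 3. The unwrapped form is compiled inside a string eval so that the failure can be observed at run time; under "use strict" it is expected to be rejected because it refers to an undeclared hash %rx.

```perl
use strict;
use warnings;

my $rx = qr{ab};           # hypothetical sub-pattern
my ($n, $m) = (2, 3);      # hypothetical repetition bounds

# Wrapped form: "$rx" is followed by ")", so it cannot be read as a hash
# subscript; $n and $m then interpolate into an ordinary {2,3} quantifier.
my $repeat = qr{ \A (?:$rx){$n,$m} \z }x;

for my $input (qw(ab abab ababab abababab)) {
    printf "%-9s %s\n", $input, ($input =~ $repeat ? 'match' : 'no match');
}

# Unwrapped form: "$rx{$n,$m}" is parsed as an element of %rx.
my $ok = eval q{ my $bad = qr/$rx{$n,$m}/; 1 };
print "unwrapped form did not compile\n" unless $ok;
```

Without strictures the unwrapped form would not fail loudly; it would look up an element of an unrelated %rx instead, which is arguably worse.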


      Give a man a fish:  <%-{-{-{-<
