in reply to Re: Recognizing 3 and 4 digit number
in thread Recognizing 3 and 4 digit number

G'day AnomalousMonk,

[Your Update just appeared as I hit [reply]. I think your post is fine where it is: htmanning gets a notification of your response with an alternative point of view and you had sent me a /msg anyway, so I was aware of it (thanks for that).]

"I tend to differ with kcott in the area of regex best practice."

While we certainly differ in some areas, I don't think the gulf is as wide as you suggest. I had originally intended to mention PBP in my post: I had a very long (over an hour) interruption in the middle of typing it and, when I finally returned to it, forgot to include the PBP part. My response below covers the points I wanted to make.

I was very impressed with PBP when I first read it over a decade ago (in fact, I read it cover-to-cover twice) and started using most (if not all) of its recommendations in my code. I suspect that, 10 years ago, our views on "regex best practice" may have been perfectly aligned. I still use much of PBP, although these days it has simply become part of my standard practices and I don't really think of it in terms of following those specific recommendations. One area where I have departed from it is adding /msx to the end of every regex.

"kcott implies that one should avoid using the  /x /m /s modifiers where they are not necessary. I think they are (almost) always necessary: ..."

I wasn't trying to imply anything as strong as "should avoid"; rather, my comments were intended to convey something closer to "could avoid".

Many organisations have Perl coding standards based on PBP. These are often quite inflexible: "You must write your matches like this: m{...}msx!". On the odd occasion that I've been faced with this, especially for short-term contracts, I just take the pragmatic approach and do it. Unfortunately, many of the programmers have no idea why they're doing this: I consider this to be a real problem. So, use all of those modifiers if your pay packet relies on it, but understand what they do and which are really required for the code being written.

I think we're pretty much on the same page with /x, so I'll say no more about that.

We definitely seem to be at odds with /m and /s. Perhaps it's a function of the type of data we normally process but I rarely need those: sometimes I need one of them; I need both far less often. There's not a lot more I can say about that: "(almost) always necessary" is not my experience.

Using the qr{(?mods:...)} form over the qr{...}mods form is something of a personal preference. I've only been using it for a year or two. The latter form makes the modifiers global: you can't get finer control such as qr{(?mo:...)(?ds:...)} or qr{(?mo:...(?ds:...)...)}. Having said that, my requirements for such fine control are exceptionally limited. I really have no strong feelings regarding which form people choose to use. I don't think your arguments against using qr{(?mods:...)} because of potential typos are particularly compelling: I'm far more likely to not release the Shift key quickly enough and terminate a statement with a colon (and that can be a much harder bug to track down).
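
A minimal sketch of the difference between the two forms, assuming a made-up two-line string. The variable names ($re_global, $re_scoped, $re_mixed) are illustrative only and do not come from the thread.

```perl
use strict;
use warnings;

my $s = "foo\nbar";

# Trailing-modifier form: /s applies to the whole pattern.
my $re_global = qr{ \A (.+) \z }xs;

# Inline form: /s applies only inside the (?s: ... ) group.
my $re_scoped = qr{(?x: \A (?s: (.+) ) \z )};

# Mixed scopes: the first dot cannot cross a newline, the second can.
my $re_mixed = qr{(?x: \A ( .+ ) (?s: ( .+ ) ) \z )};

my ($all_global)  = $s =~ $re_global;
my ($all_scoped)  = $s =~ $re_scoped;
my ($head, $tail) = $s =~ $re_mixed;

printf "global: %d chars, scoped: %d chars\n",
    length($all_global), length($all_scoped);
printf "head: '%s', tail length: %d\n", $head, length($tail);
```

Here the first two patterns behave identically; only the third needs a scope that the trailing form cannot express on its own.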

Whether or not it's a good idea to include captures in qr// is a matter of context: hardly something to be "assiduously" avoided. Where it's used like I did (s/$re/.../), there's no problem. The issue with the OP code was capturing the entire match (s/($re)/.../) when only part of the match was wanted in $1.
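
The capture-numbering point can be sketched as follows. The pattern and the input string are hypothetical stand-ins, not the OP's actual code; the only thing being shown is how $1 changes when the whole qr// object is wrapped in another capture group.

```perl
use strict;
use warnings;

# Hypothetical pattern: a 3- or 4-digit number after "No.";
# only the digits are captured inside the qr// object.
my $re = qr{(?x: No\. \s+ (\d{3,4}) \b )};

my $text = 'Invoice No. 1234 received';

# Used directly: $1 is the capture inside $re (the digits).
(my $direct = $text) =~ s/$re/<$1>/;

# Wrapped in an extra group: $1 is now the entire match,
# and the digits have moved to $2.
(my $wrapped = $text) =~ s/($re)/<$1>/;

print "$direct\n";     # Invoice <1234> received
print "$wrapped\n";    # Invoice <No. 1234> received
```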

— Ken

Re^3: Recognizing 3 and 4 digit number
by AnomalousMonk (Archbishop) on Jan 03, 2017 at 19:33 UTC
    We definitely seem to be at odds with /m and /s. ... I rarely need those: sometimes I need one of them; I need both far less often.

    My motive for always using the  /ms modifier cluster (in addition to /x, of course) is to foster clarity, and clarity is always a necessity :) Clarity is improved because the  . ^ $ operators have unvarying behaviors. Sometimes one is forced to be devious and must sacrifice clarity of expression, but that's what comments are for!

    ... the qr{(?mods:...)} form over the qr{...}mods form ... The latter form makes the modifiers global: you can't get finer control such as qr{(?mo:...)(?ds:...)} or qr{(?mo:...(?ds:...)...)}.

    The docs say this finer control is possible:  (?mo-ds) and  (?mo-ds:pattern) are rigorously scoped:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $s = qq{aa \n bb \n cc};
     ;;
     print qq{A: match, \$1 '$1' @ $-[1] \$2 '$2' @ $-[2] \$3 '$3' @ $-[3]}
       if $s =~ m{ \A ((?-s: .+)) .+ ((?-s: .+)) .+ ((?-s: .+)) \z }xms;
     ;;
     print qq{B: match, \$1 '$1' @ $-[1]}
       if $s =~ m{ \A ((?-s: .+)) \z }xms;
     ;;
     print qq{C: match, \$1 '$1' @ $-[1]}
       if $s =~ m{ \A ((?-s: (?s: .+))) \z }xms;
    "
    A: match, $1 'aa ' @ 0 $2 ' ' @ 9 $3 'c' @ 11
    C: match, $1 'aa 
     bb 
     cc' @ 0
    (Tricky to put together a meaningful example for this!)

    That said, I would never write regex A as above, but rather as:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $s = qq{aa \n bb \n cc};
     ;;
     print qq{A: match, \$1 '$1' @ $-[1] \$2 '$2' @ $-[2] \$3 '$3' @ $-[3]}
       if $s =~ m{ \A ([^\n]+) .+ ([^\n]+) .+ ([^\n]+) \z }xms;
    "
    A: match, $1 'aa ' @ 0 $2 ' ' @ 9 $3 'c' @ 11
    Don't mess with dot (or  ^ $ either): much less potential for brain-hurt.

    Update: Another version of regex A:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $s = qq{aa \n bb \n cc};
     ;;
     print qq{A: match, \$1 '$1' @ $-[1] \$2 '$2' @ $-[2] \$3 '$3' @ $-[3]}
       if $s =~ m{ \A (?-s) (.+) (?s) .+ (?-s) (.+) (?s) .+ (?-s) (.+) \z }xms;
    "
    A: match, $1 'aa ' @ 0 $2 ' ' @ 9 $3 'c' @ 11
    In the context of global dot-matches-newline behavior, successive  (?-s) and  (?s) turn newline matching off and on, respectively. Again, I wouldn't actually write a regex this way unless my feet were being held to the fire.


    Give a man a fish:  <%-{-{-{-<

      "... clarity is always a necessity ..."

      You'll get no argument from me on that one.

      "My motive for always using the  /ms modifier cluster (in addition to /x, of course) is to foster clarity, ..."

      I think we both agree about /x: no need to discuss that any further. However, I still disagree about /ms. Compare these three lines from your code (above):

      \A ((?-s: .+)) .+ ((?-s: .+)) .+ ((?-s: .+)) \z
      \A ((?-s: .+)) \z
      \A ((?-s: (?s: .+))) \z

      with the equivalent lines from my code (below):

      ^ ( $re_nonl ) $re_all ( $re_nonl ) $re_all ( $re_nonl ) $
      ^ ( $re_nonl ) $
      ^ ( $re_all ) $

      Your regexes all use /ms and then need (?s and (?-s in various places. My regexes don't use /ms or (?ms at all. $re_nonl and $re_all are tiny regexes: $re_nonl uses neither /ms nor (?ms, and $re_all uses only (?s.

      "The docs say this finer control is possible: ..."

      I wondered if you thought I was suggesting that level of control was not possible. If so, my apologies: that wasn't my intent. Perhaps I should have compared

      qr{(?mo:...)(?ds:...)}
      qr{(?mo:...(?ds:...)...)}

      with

      qr{(?mo-ds:...)(?ds-mo:...)}mods
      qr{(?mo-ds:...(?ds-mo:...)...)}mods

      In the code below, I've used the regexes described above ($reA, $reB & $reC). Those were written on the assumption that all were needed in the same script. I've also added $reAiso, $reBiso & $reCiso to show how I might have written these in isolation: none use the 'm' modifier; two use the 's' modifier. Finally, I added $reD as an example of when I might use both the 'm' and 's' modifiers. Throughout, I've used the input and output formats that you used in your code.

      #!/usr/bin/env perl -l

      use strict;
      use warnings;

      use Test::More tests => 7;

      my $expA = "A: match, \$1 'aa ' @ 0 \$2 ' ' @ 9 \$3 'c' @ 11";
      my $expB = '';
      my $expC = "C: match, \$1 'aa \n bb \n cc' @ 0";
      my $expD = "D: match, \$1 'aa ' @ 0 \$2 ' cc' @ 9";

      my $fmtA = "A: match, \$1 '%s' @ %d \$2 '%s' @ %d \$3 '%s' @ %d";
      my $fmtB = "B: match, \$1 '%s' @ %d";
      my $fmtC = "C: match, \$1 '%s' @ %d";
      my $fmtD = "D: match, \$1 '%s' @ %d \$2 '%s' @ %d";

      my $s = "aa \n bb \n cc";

      my $re_all = qr{(?sx: .+ )};
      my $re_nonl = qr{(?x: [^\n]+ )};

      my $reA = qr{(?x: ^ ( $re_nonl ) $re_all ( $re_nonl ) $re_all ( $re_nonl ) $ )};
      my $reB = qr{(?x: ^ ( $re_nonl ) $ )};
      my $reC = qr{(?x: ^ ( $re_all ) $ )};

      my $reAiso = qr{(?sx: ^ ([^\n]+) .+ ([^\n]+) .+ ([^\n]+) $ )};
      my $reBiso = qr{(?x: ^ ( .+ ) $ )};
      my $reCiso = qr{(?sx: ^ ( .+ ) $ )};

      my $reD = qr{(?msx: \A ( .+? ) $ .+ ^ ( .+ ) \z )};

      my ($gotA, $gotB, $gotC, $gotAiso, $gotBiso, $gotCiso, $gotD) = ('') x 7;

      $gotA = sprintf $fmtA, $1, $-[1], $2, $-[2], $3, $-[3] if $s =~ $reA;
      $gotB = sprintf $fmtB, $1, $-[1] if $s =~ $reB;
      $gotC = sprintf $fmtC, $1, $-[1] if $s =~ $reC;

      $gotAiso = sprintf $fmtA, $1, $-[1], $2, $-[2], $3, $-[3] if $s =~ $reAiso;
      $gotBiso = sprintf $fmtB, $1, $-[1] if $s =~ $reBiso;
      $gotCiso = sprintf $fmtC, $1, $-[1] if $s =~ $reCiso;

      $gotD = sprintf $fmtD, $1, $-[1], $2, $-[2] if $s =~ $reD;

      is($gotA, $expA, 'testA');
      is($gotB, $expB, 'testB');
      is($gotC, $expC, 'testC');

      is($gotAiso, $expA, 'testAiso');
      is($gotBiso, $expB, 'testBiso');
      is($gotCiso, $expC, 'testCiso');

      is($gotD, $expD, 'testD');

      All passed:

      1..7
      ok 1 - testA
      ok 2 - testB
      ok 3 - testC
      ok 4 - testAiso
      ok 5 - testBiso
      ok 6 - testCiso
      ok 7 - testD

      — Ken

        Sorry to be so long getting back to you. Events intervene...

        I think we both have strong personal styles and we're each unlikely to persuade the other to change any time soon, so this will likely be my last word in this thread.

        However, I want to take one more opportunity to state my position clearly. The following rationale is, of course, taken largely (if not entirely!) from TheDamian's regex PBPs.

        qr{(?x: ^ ( $re_nonl ) $re_all ( $re_nonl ) $re_all ( $re_nonl ) $ )};

        My understanding of your practice is that you might or might not include an m or an s modifier in the opening  (?x: modifier group depending on whether or not  ^ $ or  . were used in the expression, and on what behavior you wanted these operators to exhibit.

        When I look at the quoted expression, the first thing I ask is "Ok, what do  ^ and  $ do? How do they behave?" Now I have to go modifier hunting. In this case, there is no  /m modifier in sight, so  ^ $ have their default behaviors.

        If I want to change this expression so as to add a  ^ or  $ operator somewhere, I have to repeat the hunt and decide if the operator behavior selected by the existing (or not) m modifier is compatible with the behavior I want. If I'm tempted to add or delete an m modifier, I must look around for other  ^ $ operators already present so that I can be sure the new behavior selected is correct and compatible with pre-existing usages. Room here for bugs to creep in.

        But why should I have to ask these questions? Why should these operators have multiple behaviors? If  \A \Z exactly duplicate the default  ^ $ behaviors (with  \z thrown in to extend this functionality a bit), why not just nail down  ^ $ to their enhanced  /m behaviors? No further thought needed.

        But what if I don't use any  ^ $ in a given regex? Why should I bother with a useless  /m modifier? If m is always present in a standard /xms tail, no harm is done if no  ^ $ is used in the regex, and if one of these operators is ever added to a regex in which it was not present before, the further step of worrying about whether (and where) to add or not to add the corresponding modifier is totally eliminated.
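
        The anchor behaviors discussed above can be checked with a small sketch. The test string and variable names are made up for illustration; the expected values in the comments follow from the documented meanings of ^ $ \A \Z \z.

```perl
        use strict;
        use warnings;

        my $s = "one\ntwo\n";

        # Count how many times a line-start anchor followed by a word matches.
        my $caret_default = () = $s =~ m{ ^  \w+ }xg;     # 1: ^ acts like \A
        my $anchor_A      = () = $s =~ m{ \A \w+ }xg;     # 1
        my $caret_m       = () = $s =~ m{ ^  \w+ }xmg;    # 2: ^ matches after each embedded newline

        # End anchors: $ (no /m) and \Z allow one trailing newline; \z does not.
        my $dollar_default = $s =~ m{ two $  }x  ? 1 : 0;    # 1
        my $big_Z          = $s =~ m{ two \Z }x  ? 1 : 0;    # 1
        my $small_z        = $s =~ m{ two \z }x  ? 1 : 0;    # 0

        # With /m, $ also matches before an embedded newline; \Z is unaffected.
        my $dollar_m     = $s =~ m{ one $  }xm ? 1 : 0;      # 1
        my $big_Z_with_m = $s =~ m{ one \Z }xm ? 1 : 0;      # 0

        print "caret: $caret_default/$anchor_A/$caret_m\n";
        print "end:   $dollar_default/$big_Z/$small_z/$dollar_m/$big_Z_with_m\n";
```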

        A similar argument applies to the dot operator: If  [^\n] exactly duplicates the default match behavior of dot, why not set the /s-modified "dot matches all" behavior in cement (especially since the latter behavior is the one most commonly needed in regexes)? Again, the need for thought and the opportunity for confusion are reduced: If no dot appears in a regex, no harm is done; if one must be added later, it's a one-step process.
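
        A short sketch of the same point for dot, again with an invented string and illustrative variable names: under /xms, [^\n] reproduces what an unmodified dot would have matched, while dot itself always matches everything.

```perl
        use strict;
        use warnings;

        my $s = "ab\ncd";

        # Default dot (no /s) stops at the newline.
        my ($dot_default) = $s =~ m{ \A ( .+ ) }x;          # 'ab'

        # Under /xms, [^\n] gives the same result as the default dot.
        my ($negated)     = $s =~ m{ \A ( [^\n]+ ) }xms;    # 'ab'

        # Under /xms, dot matches every character, newline included.
        my ($dot_s)       = $s =~ m{ \A ( .+ ) }xms;        # "ab\ncd"

        printf "default dot: '%s', [^\\n]: '%s', /s dot length: %d\n",
            $dot_default, $negated, length($dot_s);
```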

        The end result is my (near) universal use of the  /xms tail in any  qr// m// s/// that I write. As I've said, there are exceptions due to the exigencies of the moment (usually my own want of ingenuity) or to the intricacies of the application, but they're few and far between.

        (The  '/flags' mode added to re in Perl version 5.14 seems very convenient for enforcing universal use of an  /xms tail with all regex operators, but I've never used it except for a bit of experimentation. I avoid it because it adds yet another versional boundary to worry about transgressing. Especially with postings to PerlMonks, I move back and forth between pre- and post-5.10 Perl versions so often that I have enough of a headache just with these extensions — but they're too enticing to ignore.)
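
        For reference, a minimal sketch of that pragma, assuming Perl 5.14 or later. The string and variable names are illustrative; the flags apply only within the enclosing lexical scope.

```perl
        use 5.014;
        use strict;
        use warnings;

        my $s = "aa\nbb";
        my ($with_pragma, $without_pragma);

        {
            # Every regex operator in this block gets /xms automatically.
            use re '/xms';
            ($with_pragma) = $s =~ m{ \A ( .+ ) \z };
        }

        # Outside the block only the explicit /x applies, so dot stops
        # at the newline and the anchored match fails.
        ($without_pragma) = $s =~ m{ \A ( .+ ) \z }x;

        printf "with pragma: %d chars\n", length($with_pragma);
        printf "without pragma: %s\n",
            defined($without_pragma) ? 'matched' : 'no match';
```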


        Give a man a fish:  <%-{-{-{-<