Re: Recognizing 3 and 4 digit number

Update: On second thought, this post is really more like a reply to kcott's Re: Recognizing 3 and 4 digit number and probably should have been posted as such originally. Oh, well...

htmanning: My remarks are further to the careful and detailed remarks of kcott here and, I hope, are in the same spirit.

I certainly agree with the recommendation (and its rationale) of doing development and posing questions to your fellow monks in a Test::More framework.

I tend to differ with kcott in the area of regex best practice. All the following are certainly personal best practices in this area, and are based largely on the regex recommendations in Perl Best Practices (PBP) by TheDamian.

kcott implies that one should avoid using the /x /m /s modifiers where they are not necessary. I think they are (almost) always necessary: They clarify intent and make it easier to think about what a regex, that most slippery and counterintuitive of things, is doing. When dealing with regexes, the less you have to think about the better. The result is that almost without exception, every qr// m// s/// operator I write ends up with an /xms tail.

The /x modifier allows comments, saviours of sanity, in regexes. kcott suggests the embedded
qr{(?x: pattern with whitespace )}
usage where comments are needed. This is undesirable IMHO for two reasons: two opportunities for inadvertent literal spaces before and after the (?x: ... ) expression, giving you, e.g.,
qr{ (?x: pattern with whitespace ) }
and potential brain-hurt. The alternate form
qr{(?x) pattern with whitespace }
is better, but still leaves room for a leading literal space to creep in:
qr{ (?x) pattern with whitespace }
Oops. Just write qr{ ... }xms and be done with it.
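
To make the stray-space hazard concrete, here is a minimal sketch. The target string 'foobar' and the variable names are my own illustrative assumptions, not anything from the original thread:

```perl
use strict;
use warnings;

my $target = 'foobar';    # hypothetical test string

# Without a trailing /x, whitespace outside the (?x: ... ) group is literal.
my $rx_embedded = qr{(?x: foo bar )};     # compiles to 'foobar'
my $rx_stray    = qr{ (?x: foo bar ) };   # needs a literal space on each side
my $rx_tail     = qr{ foo bar }xms;       # all pattern whitespace is ignored

for my $case (
    [ 'embedded (?x: ... )' => $rx_embedded ],
    [ 'stray spaces around' => $rx_stray    ],
    [ 'trailing /xms'       => $rx_tail     ],
) {
    my ($label, $rx) = @$case;
    printf "%-22s %s\n", $label, ($target =~ $rx ? 'match' : 'no match');
}
```

Only the middle pattern fails to match 'foobar', because the spaces outside the group have quietly become part of the pattern.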

What if you want literal space characters in your regex when using the /x modifier? I prefer the [ ] usage over the "\ " usage (which is hard to see and has to be explained: that's an escaped literal space). A string containing literal spaces can be represented as
qr{ \Qstring with some literal spaces\E }xms
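
A small sketch of the options for literal spaces under /x, assuming a made-up sample string (the names here are only for illustration):

```perl
use strict;
use warnings;

my $text = 'a string with some literal spaces in it';   # hypothetical input

my $rx_class  = qr{ with [ ] some }xms;   # [ ] is one literal space
my $rx_escape = qr{ with \  some }xms;    # backslash-space: works, but easy to miss
my $rx_quoted = qr{ \Qstring with some literal spaces\E }xms;   # \Q...\E escapes the spaces
my $rx_bare   = qr{ with some }xms;       # whitespace ignored: compiles to 'withsome'

for my $case (
    [ '[ ]'       => $rx_class  ],
    [ 'backslash' => $rx_escape ],
    [ '\Q...\E'   => $rx_quoted ],
    [ 'bare'      => $rx_bare   ],
) {
    my ($label, $rx) = @$case;
    printf "%-10s %s\n", $label, ($text =~ $rx ? 'match' : 'no match');
}
```

The first three forms match; the bare form does not, since /x has discarded the space between the two words.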

The justification for always using the /m /s modifiers is a bit different: They reduce the "degrees of freedom" of regex behavior.

What does . (dot) match? "Dot matches everything except a newline except where modified by the /s modifier, in which case it matches everything." That's too much to think about. "Dot matches all" is a lot simpler, and that's what you get with the /s modifier, even if you never use a dot operator. What if you actually want to match "everything but a newline"? Use [^\n] in that case; it does the job and perfectly conveys your intention. I have sometimes seen (?-s:.) and (?s:.) used to invoke the different behaviors of dot. Don't. It's just more potential brain-hurt.
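
As a rough illustration of the dot behaviors just described (the two-line sample string is my own assumption):

```perl
use strict;
use warnings;

my $s = "line one\nline two";   # hypothetical two-line input

my ($dot_default) = $s =~ m{ \A (.+) }x;          # no /s: dot stops at the newline
my ($dot_all)     = $s =~ m{ \A (.+) }xms;        # /s: dot matches everything
my ($no_newline)  = $s =~ m{ \A ([^\n]+) }xms;    # explicit "anything but newline"

print "default dot: '$dot_default'\n";
print "dot with /s: '$dot_all'\n";
print "[^\\n]:       '$no_newline'\n";
```

With /s always on, the only way to stop at a newline is to say so with [^\n], which is the point: the intent is visible in the pattern itself.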

Similarly, the behaviors of the ^ $ operators are expanded by the /m modifier. What if you want only their commonly used end-of-string behaviors? The \A \z \Z operators were invented for this purpose.
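
A quick sketch of how the anchors behave, using a sample string of my own invention with a trailing newline:

```perl
use strict;
use warnings;

my $s = "first\nsecond\n";   # hypothetical input ending in a newline

# With /m, ^ matches at the start of every line; \A only at the start of the string.
my @line_starts = $s =~ m{ ^  (\w+) }xmsg;
my @str_start   = $s =~ m{ \A (\w+) }xmsg;

# $ with /m matches before any newline; without /m only at (or just before) the end.
my $dollar_m  = $s =~ m{ first $ }xms ? 1 : 0;
my $dollar    = $s =~ m{ first $ }xs  ? 1 : 0;

# \Z tolerates one trailing newline; \z is the absolute end of the string.
my $ends_Z = $s =~ m{ second \Z }xms ? 1 : 0;
my $ends_z = $s =~ m{ second \z }xms ? 1 : 0;

print "line starts: @line_starts\n";
print "string start: @str_start\n";
print "\$ with /m: $dollar_m, \$ without /m: $dollar\n";
print "\\Z: $ends_Z, \\z: $ends_z\n";
```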

With regard to the use of capture groups in qr// operators: This is something else I try assiduously to avoid.

Say you have two Regexp objects $rx $ry with an embedded capture group in each. They might be used in a substitution:
$string =~ s{ foo $rx bar $ry baz }{$1$2}xmsg;
If you change the pattern match to
$string =~ s{ foo $ry bar $rx baz }{$1$2}xmsg;
do you also have to change the order of the capture variables $1 $2 in the replacement string? The problem, of course, is that capture variables correspond in an absolute way to the order of capture groups in the s/// match. The question is highlighted more sharply if the captures appear explicitly in the s/// match:
$string =~ s{ foo ($rx) bar ($ry) baz }{$1$2}xmsg;
to
$string =~ s{ foo ($ry) bar ($rx) baz }{$1$2}xmsg; # switch $1 $2 also?
The \gn relative back-reference extension of Perl release 5.10 eases the problem of capture group numbering somewhat, but capture group variables are still staunchly absolutist! (The (?|alternation|pattern) construct of 5.10 also eases the capture group numbering problem a bit.)
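
Here is a hedged sketch of both points: capture variables follow the order of the groups in the assembled pattern, and a relative back-reference survives embedding where an absolute one does not. The patterns and sample strings are hypothetical, chosen only for illustration:

```perl
use strict;
use warnings;

my $rx = qr{ (\d+)    }xms;   # hypothetical: captures digits
my $ry = qr{ ([a-z]+) }xms;   # hypothetical: captures lowercase letters

# Same replacement text, different group order: $1 and $2 track the
# left-to-right position of the groups in the assembled pattern.
(my $first  = 'foo 12 bar ab baz') =~ s{ foo \s $rx \s bar \s $ry \s baz }{$1-$2}xmsg;
(my $second = 'foo ab bar 12 baz') =~ s{ foo \s $ry \s bar \s $rx \s baz }{$1-$2}xmsg;
print "$first\n";
print "$second\n";

# A "doubled character" pattern written with absolute and relative back-references.
my $dbl_abs = qr{ (\w) \1     }xms;   # \1 means group 1 of whatever pattern this ends up in
my $dbl_rel = qr{ (\w) \g{-1} }xms;   # \g{-1} means the most recent group

# Embedded after another capture group, only the relative form still works.
my $abs_hit = 'x=aa' =~ m{ (\w) = $dbl_abs }xms ? 1 : 0;
my $rel_hit = 'x=aa' =~ m{ (\w) = $dbl_rel }xms ? 1 : 0;
print "absolute: $abs_hit, relative: $rel_hit\n";
```

The relative back-reference keeps the embedded pattern self-consistent, but $1 and $2 in the replacement still depend on the order in which the Regexp objects were assembled.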

Give a man a fish: <%-{-{-{-<

Re^2: Recognizing 3 and 4 digit number by kcott (Archbishop) on Jan 03, 2017 at 03:49 UTC
G'day AnomalousMonk,

[Your Update just appeared as I hit [reply]. I think your post is fine where it is: htmanning gets a notification of your response with an alternative point of view and you had sent me a `/msg` anyway, so I was aware of it (thanks for that).]

"I tend to differ with kcott in the area of regex best practice."

While we certainly differ in some areas, I don't think the gulf is as wide as you suggest. I had originally intended to mention PBP in my post: I had a very long (over an hour) interruption in the middle of typing it and, when I finally returned to it, forgot to include the PBP part. My response below covers the points I wanted to make.

I was very impressed with PBP when I first read it over a decade ago (in fact, I read it cover-to-cover twice) and started using most (if not all) of its recommendations in my code. I suspect that, 10 years ago, our views on "regex best practice" may have been perfectly aligned.

I still use much of PBP; although, these days, it's just become part of my standard practices and I don't really think of it in terms of following those specific recommendations. One area that I have departed from is adding `/msx` to the end of every regex.

"kcott implies that one should avoid using the `/x /m /s` modifiers where they are not necessary. I think they are (almost) always necessary: ..."

I wasn't trying to imply anything as strong as "should avoid"; rather, my comments were intended to convey something closer to "could avoid".

Many organisations have Perl coding standards based on PBP. These are often quite inflexible: "You must write your matches like this: `m{...}msx`!". On the odd occasion that I've been faced with this, especially for short-term contracts, I just take the pragmatic approach and do it. Unfortunately, many of the programmers have no idea why they're doing this: I consider this to be a real problem.

So, use all of those modifiers if your pay packet relies on it, but understand what they do and which are really required for the code being written.

I think we're pretty much on the same page with `/x`, so I'll say no more about that.

We definitely seem to be at odds with `/m` and `/s`. Perhaps it's a function of the type of data we normally process but I rarely need those: sometimes I need one of them; I need both far less often. There's not a lot more I can say about that: "(almost) always necessary" is not my experience.

Using the `qr{(?mods:...)}` form over the `qr{...}mods` form is something of a personal preference. I've only been using it for a year or two. The latter form makes the modifiers global: you can't get finer control such as `qr{(?mo:...)(?ds:...)}` or `qr{(?mo:...(?ds:...)...)}`. Having said that, my requirements for such fine control are exceptionally limited. I really have no strong feelings regarding which form people choose to use.

I don't think your arguments against using `qr{(?mods:...)}` because of potential typos are particularly compelling: I'm far more likely to not release the Shift key quickly enough and terminate a statement with a colon (and that can be a much harder bug to track down).

Whether or not it's a good idea to include captures in `qr//` is a matter of context: hardly something to be "assiduously" avoided. Where it's used like I did (`s/$re/.../`), there's no problem. The issue with the OP code was capturing the entire match (`s/($re)/.../`) when only part of the match was wanted in `$1`.

-- Ken
Re^3: Recognizing 3 and 4 digit number by AnomalousMonk (Archbishop) on Jan 03, 2017 at 19:33 UTC
"We definitely seem to be at odds with `/m` and `/s`. ... I rarely need those: sometimes I need one of them; I need both far less often."

My motive for always using the `/ms` modifier cluster (in addition to `/x`, of course) is to foster clarity, and clarity is always a necessity :) Clarity is improved because the `. ^ $` operators have unvarying behaviors. Sometimes one is forced to be devious and must sacrifice clarity of expression, but that's what comments are for!

"... the `qr{(?mods:...)}` form over the `qr{...}mods` form ... The latter form makes the modifiers global: you can't get finer control such as `qr{(?mo:...)(?ds:...)}` or `qr{(?mo:...(?ds:...)...)}`."

The docs say this finer control is possible: `(?mo-ds)` and `(?mo-ds:pattern)` are rigorously scoped:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $s = qq{aa \n bb \n cc};
     ;;
     print qq{A: match, \$1 '$1' @ $-[1] \$2 '$2' @ $-[2] \$3 '$3' @ $-[3]}
       if $s =~ m{ \A ((?-s: .+)) .+ ((?-s: .+)) .+ ((?-s: .+)) \z }xms;
     ;;
     print qq{B: match, \$1 '$1' @ $-[1]}
       if $s =~ m{ \A ((?-s: .+)) \z }xms;
     ;;
     print qq{C: match, \$1 '$1' @ $-[1]}
       if $s =~ m{ \A ((?-s: (?s: .+))) \z }xms;
    "
    A: match, $1 'aa ' @ 0 $2 ' ' @ 9 $3 'c' @ 11
    C: match, $1 'aa
     bb
     cc' @ 0

(Tricky to put together a meaningful example for this!) That said, I would never write regex A as above, but rather as:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $s = qq{aa \n bb \n cc};
     ;;
     print qq{A: match, \$1 '$1' @ $-[1] \$2 '$2' @ $-[2] \$3 '$3' @ $-[3]}
       if $s =~ m{ \A ([^\n]+) .+ ([^\n]+) .+ ([^\n]+) \z }xms;
    "
    A: match, $1 'aa ' @ 0 $2 ' ' @ 9 $3 'c' @ 11

Don't mess with dot (or `^ $` either): much less potential for brain-hurt.

Update: Another version of regex A:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $s = qq{aa \n bb \n cc};
     ;;
     print qq{A: match, \$1 '$1' @ $-[1] \$2 '$2' @ $-[2] \$3 '$3' @ $-[3]}
       if $s =~ m{ \A (?-s) (.+) (?s) .+ (?-s) (.+) (?s) .+ (?-s) (.+) \z }xms;
    "
    A: match, $1 'aa ' @ 0 $2 ' ' @ 9 $3 'c' @ 11

In the context of global dot-matches-newline behavior, successive `(?-s)` and `(?s)` turn newline matching off and on, respectively. Again, I wouldn't actually write a regex this way unless my feet were being held to the fire.

Give a man a fish: `<%-{-{-{-<`
Re^4: Recognizing 3 and 4 digit number by kcott (Archbishop) on Jan 04, 2017 at 00:36 UTC
"... clarity is always a necessity ..."

You'll get no argument from me on that one.

"My motive for always using the `/ms` modifier cluster (in addition to `/x`, of course) is to foster clarity, ..."

I think we both agree about `/x`: no need to discuss that any further. However, I still disagree about `/ms`. Compare these three lines from your code (above):

    \A ((?-s: .+)) .+ ((?-s: .+)) .+ ((?-s: .+)) \z
    \A ((?-s: .+)) \z
    \A ((?-s: (?s: .+))) \z

with the equivalent lines from my code (below):

    ^ ( $re_nonl ) $re_all ( $re_nonl ) $re_all ( $re_nonl ) $
    ^ ( $re_nonl ) $
    ^ ( $re_all ) $

Your regexes all use `/ms` and then need `(?s` and `(?-s` in various places. My regexes don't use `/ms` or `(?ms` at all; `$re_nonl` and `$re_all` are tiny regexes: `$re_nonl` doesn't use `/ms` or `(?ms` at all, and `$re_all` only uses `(?s`.

"The docs say this finer control is possible: ..."

I wondered if you thought I was suggesting that level of control was not possible. If so, my apologies: that wasn't my intent. Perhaps I should have compared

    qr{(?mo:...)(?ds:...)}
    qr{(?mo:...(?ds:...)...)}

with

    qr{(?mo-ds:...)(?ds-mo:...)}mods
    qr{(?mo-ds:...(?ds-mo:...)...)}mods

In the code below, I've used the regexes described above (`$reA`, `$reB` & `$reC`). Those were written on the assumption that all were needed in the same script. I've also added `$reAiso`, `$reBiso` & `$reCiso` to show how I might have written these in isolation: none use the '`m`' modifier; two use the '`s`' modifier. Finally, I added `$reD` as an example of when I might use both the '`m`' and '`s`' modifiers. Throughout, I've used the input and output formats that you used in your code.

    #!/usr/bin/env perl -l

    use strict;
    use warnings;

    use Test::More tests => 7;

    my $expA = "A: match, \$1 'aa ' @ 0 \$2 ' ' @ 9 \$3 'c' @ 11";
    my $expB = '';
    my $expC = "C: match, \$1 'aa \n bb \n cc' @ 0";
    my $expD = "D: match, \$1 'aa ' @ 0 \$2 ' cc' @ 9";

    my $fmtA = "A: match, \$1 '%s' @ %d \$2 '%s' @ %d \$3 '%s' @ %d";
    my $fmtB = "B: match, \$1 '%s' @ %d";
    my $fmtC = "C: match, \$1 '%s' @ %d";
    my $fmtD = "D: match, \$1 '%s' @ %d \$2 '%s' @ %d";

    my $s = "aa \n bb \n cc";

    my $re_all = qr{(?sx: .+ )};
    my $re_nonl = qr{(?x: [^\n]+ )};

    my $reA = qr{(?x: ^ ( $re_nonl ) $re_all ( $re_nonl ) $re_all ( $re_nonl ) $ )};
    my $reB = qr{(?x: ^ ( $re_nonl ) $ )};
    my $reC = qr{(?x: ^ ( $re_all ) $ )};

    my $reAiso = qr{(?sx: ^ ([^\n]+) .+ ([^\n]+) .+ ([^\n]+) $ )};
    my $reBiso = qr{(?x: ^ ( .+ ) $ )};
    my $reCiso = qr{(?sx: ^ ( .+ ) $ )};

    my $reD = qr{(?msx: \A ( .+? ) $ .+ ^ ( .+ ) \z )};

    my ($gotA, $gotB, $gotC, $gotAiso, $gotBiso, $gotCiso, $gotD) = ('') x 7;

    $gotA = sprintf $fmtA, $1, $-[1], $2, $-[2], $3, $-[3] if $s =~ $reA;
    $gotB = sprintf $fmtB, $1, $-[1] if $s =~ $reB;
    $gotC = sprintf $fmtC, $1, $-[1] if $s =~ $reC;

    $gotAiso = sprintf $fmtA, $1, $-[1], $2, $-[2], $3, $-[3] if $s =~ $reAiso;
    $gotBiso = sprintf $fmtB, $1, $-[1] if $s =~ $reBiso;
    $gotCiso = sprintf $fmtC, $1, $-[1] if $s =~ $reCiso;

    $gotD = sprintf $fmtD, $1, $-[1], $2, $-[2] if $s =~ $reD;

    is($gotA, $expA, 'testA');
    is($gotB, $expB, 'testB');
    is($gotC, $expC, 'testC');

    is($gotAiso, $expA, 'testAiso');
    is($gotBiso, $expB, 'testBiso');
    is($gotCiso, $expC, 'testCiso');

    is($gotD, $expD, 'testD');

All passed:

    1..7
    ok 1 - testA
    ok 2 - testB
    ok 3 - testC
    ok 4 - testAiso
    ok 5 - testBiso
    ok 6 - testCiso
    ok 7 - testD

-- Ken
Re^5: Recognizing 3 and 4 digit number by AnomalousMonk (Archbishop) on Jan 08, 2017 at 23:16 UTC
Re^6: Recognizing 3 and 4 digit number by kcott (Archbishop) on Jan 09, 2017 at 01:32 UTC
Some notes below your chosen depth have not been shown here

