in reply to Why multiline regex doesn't work?

In your second regex, you get no match because the expression [.\n] does not mean what (I think) you think it means. There is also a second problem: the predefined special variable $[ is interpolated in place of the first part of the $[.\n] sequence you intended.

c:\@Work\Perl\monks>perl -le "use warnings; use strict; ;; my $s = qq{aaa : AAA\n} . qq{bbb : BBB\n} . qq{ccc : CCC\n} ; print qq{[[$s]]}; ;; my $m = 'bbb'; ;; my $t = $s =~ s/[.\n]*?^$m *: (.*)$[.\n]*/$1/rm ; ;; print qq{[[$t]]}; "
[[aaa : AAA
bbb : BBB
ccc : CCC
]]
[[aaa : AAA
bbb : BBB
ccc : CCC
]]

The '.' (period) character is not special, i.e., not a metacharacter, in a [] regex character class; it just matches a literal period, and there are no such characters in your $s test string.
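
A minimal sketch of that point, using only core Perl. The two test strings and the variable names are made up for illustration and are not from the original question:

```perl
use strict;
use warnings;

# Inside a character class, '.' is a literal period, not "any character".
my $class = qr/\A[.\n]+\z/;

my $dots_and_newlines = "..\n.";   # hypothetical input: only periods and a newline
my $letters           = "abc";     # hypothetical input: no periods at all

my $dots_match    = $dots_and_newlines =~ $class ? 1 : 0;
my $letters_match = $letters           =~ $class ? 1 : 0;

print "dots/newlines: $dots_match\n";      # dots/newlines: 1
print "letters:       $letters_match\n";   # letters:       0
```

If the dot were a metacharacter here, the second string would match as well.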

I'm not sure what the  [.\n] expression was intended to represent (maybe  [^\n] "anything but a newline"?), so I can't comment further until you can provide greater clarity. Note, however, that disambiguating the  $ metacharacter at least produces a different output, i.e., a match and substitution, even though the output is still not what you expect:

c:\@Work\Perl\monks>perl -le "use warnings; use strict; ;; my $s = qq{aaa : AAA\n} . qq{bbb : BBB\n} . qq{ccc : CCC\n} ; print qq{[[$s]]}; ;; my $m = 'bbb'; ;; my $t = $s =~ s/[.\n]*?^$m *: (.*)$(?:[.\n]*)/$1/rm ; ;; print qq{[[$t]]}; "
[[aaa : AAA
bbb : BBB
ccc : CCC
]]
[[aaa : AAABBBccc : CCC
]]

(There is no warning because $[ has a defined default value.)

Update: Note that the ambiguity of  $[.\n] (regex) and the  $[ predefined special variable (see perlvar) is yet another argument in favor of the  /x embedded whitespace regex modifier (other than simply being able to see the darn regex). Consider:

c:\@Work\Perl\monks>perl -le "use warnings; use strict; ;; my $s = qq{aaa : AAA\n} . qq{bbb : BBB\n} . qq{ccc : CCC\n} ; print qq{[[$s]]}; ;; my $m = 'bbb'; ;; my $t = $s =~ s/ [.\n]*? ^ $m [ ]* : [ ] (.*) $ [.\n]* /$1/xrm ; ;; print qq{[[$t]]}; "
[[aaa : AAA
bbb : BBB
ccc : CCC
]]
[[aaa : AAABBBccc : CCC
]]

Still not what you expected, but one less pitfall to negotiate. (The [ ] expression is what I like to use to represent a single space, whereas \s represents any whitespace character, a larger set.)

Further Update: The interpolation of  $[ can be clearly seen here:

c:\@Work\Perl\monks>perl -wMstrict -e "my $rx = qr{$[.\n]*}m; print $rx;"
(?^m:0.\n]*)

The default value of $[ is 0.


Give a man a fish:  <%-(-(-(-<

Re^2: Why multiline regex doesn't work?
by nbd (Novice) on Jun 09, 2015 at 04:29 UTC

    Thanks for the detailed explanation. That was exactly what I was asking about: the exact parts of both regexes that work incorrectly.

    .\n was intended to match all characters, including the newline character (since with //m modifier '.' doesn't match newline). But I see that within square brackets the dot is only a literal period. So, if "all characters" is expressed as [\s\S], the regex now works:
    my $d = $s =~ s/[\s\S]*^$m *: (.*)$(?:[\s\S]*)/$1/rm;
    Thanks!
      ... with //m modifier '.' doesn't match newline ...

      Just to be clear: With or without the  //m regex modifier, the default behavior of the  . (dot) metacharacter is to match everything except a newline. It is only the  //s "dot matches all" modifier that causes dot to match absolutely everything.
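
      A small sketch of the difference, assuming a made-up two-line string (the variable names are illustrative only):

      ```perl
      use strict;
      use warnings;

      my $text = "one\ntwo";   # hypothetical two-line string

      # Without //s, dot stops at the newline, whether or not //m is present.
      my ($plain) = $text =~ /(.*)/;     # "one"
      my ($multi) = $text =~ /(.*)/m;    # "one"

      # With //s, dot also matches the newline.
      my ($single) = $text =~ /(.*)/s;   # "one\ntwo"

      # //m changes only the anchors: ^ and $ can match at embedded line boundaries.
      my @line_starts = $text =~ /^(\w+)/mg;   # ("one", "two")

      printf "plain=%s multi=%s single_len=%d starts=%s\n",
          $plain, $multi, length($single), join(',', @line_starts);
      # plain=one multi=one single_len=7 starts=one,two
      ```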


      Give a man a fish:  <%-(-(-(-<

      my $d = $s =~ s/[\s\S]*^$m *: (.*)$(?:[\s\S]*)/$1/rm;

      The expression  [\s\S] to express "match any character" cries out for comment. I assume it is used to avoid the  . (dot) metacharacter when promoted by  //s to "dot matches all" status.

      This rubs me the wrong way. If dot (with //s) matches all, why not just use it that way? (All code examples that follow enable warnings and strictures. Also note that the //r substitution modifier is only available with Perl versions 5.14+.)

      c:\@Work\Perl\monks>perl -wMstrict -le "my $s = qq{aaa : AAA\n} . qq{bbb : BBB\n} . qq{ccc : CCC\n} ; print qq{[[$s]]}; ;; my $m = 'bbb'; ;; my $t = $s =~ s/.*^$m *: (.*)$(?:.*)/$1/rsm; ;; print qq{[[$t]]}; "
      [[aaa : AAA
      bbb : BBB
      ccc : CCC
      ]]
      [[BBB
      ccc : CCC
      ]]

      This is arguably clearer, with only the tiny problem that it doesn't work! Why not?

      Consider the  (.*) capture group. With dot matching anything, it greedily grabs everything to the end of the string. To achieve an overall match, the regex still has to match  $ at the end of the string, which is easy, and  (?:.*) "zero or more of anything" after the end of the string, also easy. So capture group 1 and  $1 now contain everything to the end of the string, which is substituted back into the string.
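
      To see the over-capture directly, here is a sketch that uses the same pattern as a plain match rather than a substitution, so that the capture can be inspected. The $captured and $visible names are mine, not part of the original code:

      ```perl
      use strict;
      use warnings;

      my $s = qq{aaa : AAA\n} . qq{bbb : BBB\n} . qq{ccc : CCC\n};
      my $m = 'bbb';

      # Same pattern as in the failing substitution, used as a match.
      my ($captured) = $s =~ /.*^$m *: (.*)$(?:.*)/sm;

      # Make the embedded newlines visible before printing.
      (my $visible = $captured) =~ s/\n/\\n/g;
      print "captured: [$visible]\n";   # captured: [BBB\nccc : CCC\n]
      ```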

      But the intent of  (.*) was only to capture everything up to the  $ anchor before the first embedded newline (due to //m). How to restrain dot?

      One way would be to use a  *? "lazy" modifier for the normally greedy  * match quantifier: dot will then match as little as necessary to get to the first  $ anchor.

      c:\@Work\Perl\monks>perl -wMstrict -le "my $s = qq{aaa : AAA\n} . qq{bbb : BBB\n} . qq{ccc : CCC\n} ; print qq{[[$s]]}; ;; my $m = 'bbb'; ;; my $t = $s =~ s/.*^$m *: (.*?)$(?:.*)/$1/rsm; ;; print qq{[[$t]]}; "
      [[aaa : AAA
      bbb : BBB
      ccc : CCC
      ]]
      [[BBB]]

      Now we're getting somewhere!

      But one could argue that the intent of "anything except a newline" is more clearly expressed by  [^\n] and "capture as much as possible to the first newline" is better as  ([^\n]*) (remember that the code must be maintained, one must assume forever, so clear intent is important).

      c:\@Work\Perl\monks>perl -wMstrict -le "my $s = qq{aaa : AAA\n} . qq{bbb : BBB\n} . qq{ccc : CCC\n} ; print qq{[[$s]]}; ;; my $m = 'bbb'; ;; my $t = $s =~ s/.*^$m *: ([^\n]*)$(?:.*)/$1/rsm; ;; print qq{[[$t]]}; "
      [[aaa : AAA
      bbb : BBB
      ccc : CCC
      ]]
      [[BBB]]

      (In this version, the $ anchor is redundant, but does no harm and arguably serves to further clarify intent.)

      Lastly, an example in my own preferred style, taken from TheDamian's PBP (Perl Best Practices):

      c:\@Work\Perl\monks>perl -wMstrict -le "my $s = qq{aaa : AAA\n} . qq{bbb : BBB\n} . qq{ccc : CCC\n} ; print qq{[[$s]]}; ;; my $m = qr{ bbb }xms; ;; my $t = $s =~ s{ .* ^ $m [ ]* : [ ] ([^\n]*) $ .* }{$1}xmsr; ;; print qq{[[$t]]}; "
      [[aaa : AAA
      bbb : BBB
      ccc : CCC
      ]]
      [[BBB]]

      The $m is no longer defined as a raw string, but with qr// as a regex in its own right. This allows it to be used "atomically" within another regex, as it is in the substitution: expressions like $m+ or $m{4} work as expected. The $ is still redundant, but still arguably clarifies intent. The same could be said about the preceding ^ in the regex, but I would argue that anchoring the $m atom in some way is potentially important, so just leave it be.
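
      A sketch of what "atomically" means in practice. The 'abc' pattern and the test strings are hypothetical, chosen only so that the two cases give different results:

      ```perl
      use strict;
      use warnings;

      my $raw = 'abc';          # plain string (hypothetical pattern text)
      my $rx  = qr{ abc }xms;   # the same pattern compiled with qr//

      # Interpolated string: the regex is /\Aabc+\z/, so + applies only to the final 'c'.
      my $raw_repeats_whole = 'abcabc' =~ /\A$raw+\z/ ? 1 : 0;   # 0
      my $raw_repeats_c     = 'abccc'  =~ /\A$raw+\z/ ? 1 : 0;   # 1

      # Interpolated qr// object: it arrives wrapped in its own group,
      # so + applies to the whole 'abc'.
      my $rx_repeats_whole  = 'abcabc' =~ /\A$rx+\z/ ? 1 : 0;    # 1
      my $rx_repeats_c      = 'abccc'  =~ /\A$rx+\z/ ? 1 : 0;    # 0

      print "raw: $raw_repeats_whole $raw_repeats_c\n";   # raw: 0 1
      print "qr:  $rx_repeats_whole $rx_repeats_c\n";     # qr:  1 0
      ```

      With a raw string, the same effect needs an explicit group such as (?:$raw)+.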

      And that's the first several inches of the whole nine regex yards. HTH


      Give a man a fish:  <%-(-(-(-<