nbd has asked for the wisdom of the Perl Monks concerning the following question:

Greetings! I have read the manuals about using multiline regexes in Perl 5, but still cannot figure out why the following ones don't work as intended (they are intended to produce identical matches; I am just studying how the //m and //sm modifiers work):

#!/usr/bin/perl
use v5.20;

my $s = <<'ENDSTR';
aaa : AAA
bbb : BBB
ccc : CCC
ENDSTR

my $m = 'bbb';

my $a = $s =~ s/.*^$m *: (.*?)$.*/$1/rsm;
my $b = $s =~ s/[.\n]*?^$m *: (.*)$[.\n]*/$1/rm;

print "a: $a\n";
print "b: $b\n";

The intended output of the program is

a: BBB
b: BBB

But these regexes produce:

a: BBB
ccc : CCC

b: aaa : AAA
bbb : BBB
ccc : CCC

Which parts of these two regexes are missing something, or prevent them from making the intended matches?

Re: Why multiline regex doesn't work?
by AnomalousMonk (Archbishop) on Jun 09, 2015 at 01:43 UTC

    In your second regex, you achieve no match because the regex expression  [.\n] does not mean what (I think) you think it means. There is also another problem with a predefined special variable  $[ that is being interpolated instead of the first part of the  $[.\n] regex expression you intended.

    c:\@Work\Perl\monks>perl -le
    "use warnings; use strict;
     ;;
     my $s = qq{aaa : AAA\n} . qq{bbb : BBB\n} . qq{ccc : CCC\n} ;
     print qq{[[$s]]};
     ;;
     my $m = 'bbb';
     ;;
     my $t = $s =~ s/[.\n]*?^$m *: (.*)$[.\n]*/$1/rm ;
     ;;
     print qq{[[$t]]};
    "
    [[aaa : AAA
    bbb : BBB
    ccc : CCC
    ]]
    [[aaa : AAA
    bbb : BBB
    ccc : CCC
    ]]

    The  '.' (period) character is not special, i.e., not a metacharacter, in a  [] regex character class; it just matches a literal period, and there are no such characters in your  $s test string.
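
A minimal sketch of this point, using the thread's test string. The variable names are illustrative only; the counts rely on each of the three lines being 10 characters long including its newline.

```perl
use strict;
use warnings;

# Same test string as in the thread: three lines of 10 characters each.
my $s = "aaa : AAA\nbbb : BBB\nccc : CCC\n";

# Count matches with the "countof" idiom: list assignment in scalar context.
my $class_dot = () = $s =~ /[.]/g;      # literal periods only
my $bare_dot  = () = $s =~ /./g;        # any character except newline
my $any_char  = () = $s =~ /[\s\S]/g;   # any character at all

print "[.]    matched $class_dot times\n";
print ".      matched $bare_dot times\n";
print "[\\s\\S] matched $any_char times\n";
```

Inside the brackets the dot finds nothing in this string, while the bare dot skips only the three newlines.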

    I'm not sure what the  [.\n] expression was intended to represent (maybe  [^\n] "anything but a newline"?), so I can't comment further until you can provide greater clarity. Note, however, that disambiguating the  $ metacharacter at least produces a different output, i.e., a match and substitution, even though the output is still not what you expect:

    c:\@Work\Perl\monks>perl -le
    "use warnings; use strict;
     ;;
     my $s = qq{aaa : AAA\n} . qq{bbb : BBB\n} . qq{ccc : CCC\n} ;
     print qq{[[$s]]};
     ;;
     my $m = 'bbb';
     ;;
     my $t = $s =~ s/[.\n]*?^$m *: (.*)$(?:[.\n]*)/$1/rm ;
     ;;
     print qq{[[$t]]};
    "
    [[aaa : AAA
    bbb : BBB
    ccc : CCC
    ]]
    [[aaa : AAABBBccc : CCC
    ]]

    (There is no warning because  $[ has a default initialized value.)

    Update: Note that the ambiguity of  $[.\n] (regex) and the  $[ predefined special variable (see perlvar) is yet another argument in favor of the  /x embedded whitespace regex modifier (other than simply being able to see the darn regex). Consider:

    c:\@Work\Perl\monks>perl -le
    "use warnings; use strict;
     ;;
     my $s = qq{aaa : AAA\n} . qq{bbb : BBB\n} . qq{ccc : CCC\n} ;
     print qq{[[$s]]};
     ;;
     my $m = 'bbb';
     ;;
     my $t = $s =~ s/ [.\n]*? ^ $m [ ]* : [ ] (.*) $ [.\n]* /$1/xrm ;
     ;;
     print qq{[[$t]]};
    "
    [[aaa : AAA
    bbb : BBB
    ccc : CCC
    ]]
    [[aaa : AAABBBccc : CCC
    ]]

    Still not what you expected, but one less pitfall to negotiate. (The  [ ] expression is what I like to use to represent a space, where  \s represents any whitespace character, a larger set.)
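
A small sketch of why the [ ] idiom matters under /x, assuming a single line taken from the thread's data (the variable names are hypothetical):

```perl
use strict;
use warnings;

my $line = "bbb : BBB";

# Under /x, bare spaces in the pattern are ignored: this is really /bbb:/.
my $bare_spaces = $line =~ / bbb : /x ? 1 : 0;

# [ ] is a character class holding one space, so it survives /x.
my $class_spaces = $line =~ / bbb [ ] : [ ] /x ? 1 : 0;

print "bare spaces:  $bare_spaces\n";
print "class spaces: $class_spaces\n";
```

Only the second pattern matches, because the first one never asks for the spaces around the colon.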

    Further Update: The interpolation of  $[ can be clearly seen here:

    c:\@Work\Perl\monks>perl -wMstrict -e
    "my $rx = qr{$[.\n]*}m;
     print $rx;
    "
    (?^m:0.\n]*)

    The default value of  $[ is 0.


    Give a man a fish:  <%-(-(-(-<

      Thanks for the detailed explanation. That was exactly what I was asking about: exact parts of both regexes which work incorrectly.

      [.\n] was intended to match all characters, including the newline character (since with //m modifier '.' doesn't match newline). But I see that within square brackets the dot is only a literal period. So, if all characters are expressed as [\s\S], the regex now works:
      my $d = $s =~ s/[\s\S]*^$m *: (.*)$(?:[\s\S]*)/$1/rm;
      Thanks!
        ... with //m modifier '.' doesn't match newline ...

        Just to be clear: With or without the  //m regex modifier, the default behavior of the  . (dot) metacharacter is to match everything except a newline. It is only the  //s "dot matches all" modifier that causes dot to match absolutely everything.
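
To make this concrete, here is a minimal sketch with a hypothetical two-line string, trying the same pattern under each combination of modifiers:

```perl
use strict;
use warnings;

# "foo" and "bar" are separated by a newline, so the dot must cross it.
my $s = "foo\nbar\n";

my %dot_crosses_newline = (
    'none' => ($s =~ /foo.bar/   ? 1 : 0),
    'm'    => ($s =~ /foo.bar/m  ? 1 : 0),   # /m only affects ^ and $
    's'    => ($s =~ /foo.bar/s  ? 1 : 0),   # /s lets the dot match newline
    'sm'   => ($s =~ /foo.bar/sm ? 1 : 0),
);

print "$_: $dot_crosses_newline{$_}\n" for sort keys %dot_crosses_newline;
```

Only the variants that include /s match.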


        Give a man a fish:  <%-(-(-(-<

        my $d = $s =~ s/[\s\S]*^$m *: (.*)$(?:[\s\S]*)/$1/rm;

        The expression  [\s\S] to express "match any character" cries out for comment. I assume it is used to avoid the  . (dot) metacharacter when promoted by  //s to "dot matches all" status.

        This rubs me the wrong way. If dot (with //s) matches all, why not just use it that way? (All code examples that follow enable warnings and strictures. Also note that the  //r substitution modifier is only available with Perl versions 5.14+.)

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $s = qq{aaa : AAA\n} . qq{bbb : BBB\n} . qq{ccc : CCC\n} ;
         print qq{[[$s]]};
         ;;
         my $m = 'bbb';
         ;;
         my $t = $s =~ s/.*^$m *: (.*)$(?:.*)/$1/rsm;
         ;;
         print qq{[[$t]]};
        "
        [[aaa : AAA
        bbb : BBB
        ccc : CCC
        ]]
        [[BBB
        ccc : CCC
        ]]

        This is arguably clearer, with only the tiny problem that it doesn't work! Why not?

        Consider the  (.*) capture group. With dot matching anything, it greedily grabs everything to the end of the string. To achieve an overall match, the regex still has to match  $ at the end of the string, which is easy, and  (?:.*) "zero or more of anything" after the end of the string, also easy. So capture group 1 and  $1 now contain everything to the end of the string, which is substituted back into the string.

        But the intent of  (.*) was only to capture everything up to the  $ anchor before the first embedded newline (due to //m). How to restrain dot?

        One way would be to use a  *? "lazy" modifier for the normally greedy  * match quantifier: dot will then match as little as necessary to get to the first  $ anchor.

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $s = qq{aaa : AAA\n} . qq{bbb : BBB\n} . qq{ccc : CCC\n} ;
         print qq{[[$s]]};
         ;;
         my $m = 'bbb';
         ;;
         my $t = $s =~ s/.*^$m *: (.*?)$(?:.*)/$1/rsm;
         ;;
         print qq{[[$t]]};
        "
        [[aaa : AAA
        bbb : BBB
        ccc : CCC
        ]]
        [[BBB]]

        Now we're getting somewhere!

        But one could argue that the intent of "anything except a newline" is more clearly expressed by  [^\n] and "capture as much as possible to the first newline" is better as  ([^\n]*) (remember that the code must be maintained, one must assume forever, so clear intent is important).

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $s = qq{aaa : AAA\n} . qq{bbb : BBB\n} . qq{ccc : CCC\n} ;
         print qq{[[$s]]};
         ;;
         my $m = 'bbb';
         ;;
         my $t = $s =~ s/.*^$m *: ([^\n]*)$(?:.*)/$1/rsm;
         ;;
         print qq{[[$t]]};
        "
        [[aaa : AAA
        bbb : BBB
        ccc : CCC
        ]]
        [[BBB]]

        (In this version, the  $ anchor is redundant, but does no harm and arguably serves to further clarify intent.)

        Lastly, an example in my own preferred style, taken from TheDamian's PBP:

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $s = qq{aaa : AAA\n} . qq{bbb : BBB\n} . qq{ccc : CCC\n} ;
         print qq{[[$s]]};
         ;;
         my $m = qr{ bbb }xms;
         ;;
         my $t = $s =~ s{ .* ^ $m [ ]* : [ ] ([^\n]*) $ .* }{$1}xmsr;
         ;;
         print qq{[[$t]]};
        "
        [[aaa : AAA
        bbb : BBB
        ccc : CCC
        ]]
        [[BBB]]

        The  $m is no longer defined as a raw string, but with  qr// as a regex in its own right. This allows it to be used "atomically" within another regex, as it is in the substitution: expressions like  $m+ or  $m{4} work as expected. The  $ is still redundant, but still arguably clarifies intent. The same could be said about the preceding  ^ in the regex, but I would argue that anchoring the  $m atom in some way is potentially important, so just leave it be.
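
A short sketch of the "atomic" behavior described above. The two-character pattern 'ab' is a hypothetical stand-in chosen because a quantifier on 'bbb' would not show the difference clearly.

```perl
use strict;
use warnings;

my $raw = 'ab';       # plain string: interpolated as bare text
my $rx  = qr{ab};     # compiled regex: interpolated as a group

# With the raw string, "+" binds only to the final "b": /^ab+$/.
my $raw_abab = 'abab' =~ /^$raw+$/ ? 1 : 0;
my $raw_abbb = 'abbb' =~ /^$raw+$/ ? 1 : 0;

# With the qr// object, "+" applies to the whole "ab" unit.
my $rx_abab = 'abab' =~ /^$rx+$/ ? 1 : 0;
my $rx_abbb = 'abbb' =~ /^$rx+$/ ? 1 : 0;

print "raw string: abab=$raw_abab abbb=$raw_abbb\n";
print "qr object:  abab=$rx_abab abbb=$rx_abbb\n";
```

The raw string accepts 'abbb' and rejects 'abab'; the qr// version does the opposite.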

        And that's the first several inches of the whole nine regex yards. HTH


        Give a man a fish:  <%-(-(-(-<

Re: Why multiline regex doesn't work?
by FreeBeerReekingMonk (Deacon) on Jun 09, 2015 at 00:07 UTC
    #!/usr/bin/perl
    use strict;
    use warnings;
    use v5.20;

    my $s = <<'ENDSTR';
    aaa : AAA
    bbb : BBB
    ccc : CCC
    ENDSTR

    my $m = 'bbb';

    my $a = $1 if $s =~ s/^$m *: (.*?)$/$1/rsm;
    my $b = $1 if $s =~ s/^$m *: (.*)$/$1/rm;

    print "a: $a\n";
    print "b: $b\n";

    You see, $ only works as an anchor at the end of the pattern or just before a |. All the m modifier does is make ^ and $ match at the beginning/end of every line instead of only at the absolute beginning and absolute end. Anywhere else, $ gets confused with a variable; imagine: ~/(.)$./ versus ~/(.)\$./

      I was guided by this part of perldoc:

      - m modifier (//m): Treat string as a set of multiple lines. '.' matches any character except "\n" . ^ and $ are able to match at the start or end of any line within the string.

      - both s and m modifiers (//sm): Treat string as a single long line, but detect multiple lines. '.' matches any character, even "\n" . ^ and $ , however, are able to match at the start or end of any line within the string.
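
A minimal sketch of the anchor behavior quoted above, using the thread's test string; the array names are illustrative only.

```perl
use strict;
use warnings;

my $s = "aaa : AAA\nbbb : BBB\nccc : CCC\n";

# Without /m, ^ matches only at the very start of the string and
# $ only at the very end (or just before a final newline).
my @starts_plain = $s =~ /^(\w+)/g;
my @ends_plain   = $s =~ /(\w+)$/g;

# With /m, ^ and $ match at the start and end of every line.
my @starts_m = $s =~ /^(\w+)/mg;
my @ends_m   = $s =~ /(\w+)$/mg;

print "plain: starts=@starts_plain ends=@ends_plain\n";
print "/m:    starts=@starts_m ends=@ends_m\n";
```

The whole string is still scanned in one pass; /m only changes where the two anchors are allowed to match.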

      Does the correction in the code you made mean that Perl processes the multiline string line by line and not as a single string?

      UPDATE: I see that Perl processes the string as a whole. At first sight the code just looked like awk-style line-by-line pattern matching. Thanks.

        You should also enable warnings (and strictures; see strict), especially if you are a Perl novice. Consider your first regex with warnings enabled:

        c:\@Work\Perl\monks>perl -le
        "use warnings; use strict;
         ;;
         my $s = qq{aaa : AAA\n} . qq{bbb : BBB\n} . qq{ccc : CCC\n} ;
         print qq{[[$s]]};
         ;;
         my $m = 'bbb';
         ;;
         my $t = $s =~ s/.*^$m *: (.*?)$.*/$1/rsm ;
         ;;
         print qq{[[$t]]};
        "
        [[aaa : AAA
        bbb : BBB
        ccc : CCC
        ]]
        Use of uninitialized value $. in regexp compilation at -e line 1.
        [[BBB
        ccc : CCC
        ]]

        The Use of uninitialized value $. in regexp compilation... message gives you a clue about what is happening.

        If the  $ is unambiguously a regex metacharacter:

        c:\@Work\Perl\monks>perl -le
        "use warnings; use strict;
         ;;
         my $s = qq{aaa : AAA\n} . qq{bbb : BBB\n} . qq{ccc : CCC\n} ;
         print qq{[[$s]]};
         ;;
         my $m = 'bbb';
         ;;
         my $t = $s =~ s/.*^$m *: (.*?)$(?:.*)/$1/rsm ;
         ;;
         print qq{[[$t]]};
        "
        [[aaa : AAA
        bbb : BBB
        ccc : CCC
        ]]
        [[BBB]]

        You have your intended output for this regex.
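
The difference can be isolated in a few lines. This is a sketch under the assumption that no filehandle has been read yet, so $. is undefined; the test string 'xxx' and the variable names are hypothetical.

```perl
use strict;
use warnings;

my $text = 'xxx';

# Ambiguous: "$." is parsed as the input line number variable. It is
# undefined here, so the pattern that actually runs is ^x*\z.
my $ambiguous = do {
    no warnings 'uninitialized';    # silence the warning shown in the thread
    $text =~ /^x$.*\z/ ? 1 : 0;
};

# Unambiguous: "$" followed by "(" (or by whitespace under /x) stays an anchor,
# which cannot match after the first "x" of "xxx".
my $grouped = $text =~ /^x$(?:.*)\z/  ? 1 : 0;
my $spaced  = $text =~ /^ x $ .* \z/x ? 1 : 0;

print "ambiguous=$ambiguous grouped=$grouped spaced=$spaced\n";
```

The ambiguous form matches by accident, which is why the problem is easy to miss without warnings.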


        Give a man a fish:  <%-(-(-(-<

        You really should try to work with simpler examples before you make things complicated:

        use strict;
        use warnings;
        use Data::Dumper;

        my ($str,@match);

        # "foo" and "bar" on separate lines
        $str = "\nfoo\nbar\nbaz\n";

        @match = $str =~ /(foo.*bar)/;     # nope!
        print Dumper \@match;

        @match = $str =~ /(foo.*bar)/m;    # nope!
        print Dumper \@match;

        @match = $str =~ /(foo.*bar)/s;    # this one!
        print Dumper \@match;

        # "foo bar" at the start of a line, but not at the start of the string
        $str = "\nfoo bar\nfoo baz\n";

        @match = $str =~ /^(foo bar)/;     # nope!
        print Dumper \@match;

        @match = $str =~ /^(foo bar)/s;    # nope!
        print Dumper \@match;

        @match = $str =~ /^(foo bar)/m;    # this one!
        print Dumper \@match;

        The first set of matches illustrates a case when the 's' modifier gets the match and the second set of matches illustrates a case when the 'm' modifier gets the match. Hope this helps!

        jeffa

        L-LL-L--L-LL-L--L-LL-L--
        -R--R-RR-R--R-RR-R--R-RR
        B--B--B--B--B--B--B--B--
        H---H---H---H---H---H---
        (the triplet paradiddle with high-hat)
        

        See perlvar#$.
        rxrx and http://perldoc.perl.org/re.html#%27debug%27-mode and other regex tools
        The "anchor" misnomer in regexes (string location assertion)
        Why \n matches but not $^?
        Disabling regexp optimizations?

        ^ matches after newline (or beginning of string). $ matches before newline (or end of string)

        $ perl -MData::Dump -Mre=debug -le " dd( $_=qq{a\n\nb} ); s{^$}{boop}m; dd( $_ ); "
        Compiling REx "^$"
        Final program:
           1: MBOL (2)
           2: MEOL (3)
           3: END (0)
        anchored ""$ at 0 anchored(MBOL) minlen 0
        "a\n\nb"
        Matching REx "^$" against "a%n%nb"
           0 <> <a%n%nb>    | 1:MBOL(2)
           0 <> <a%n%nb>    | 2:MEOL(3)
                              failed...
           2 <a%n> <%nb>    | 1:MBOL(2)
           2 <a%n> <%nb>    | 2:MEOL(3)
           2 <a%n> <%nb>    | 3:END(0)
        Match successful!
        "a\nboop\nb"
        Freeing REx: "^$"

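        Trying to match a newline after end of line with $\n won't work. As the compiled pattern "^%nn" below shows, Perl interpolates $\ (the output record separator, which -l sets to a newline) and leaves a literal n behind: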
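
One reading of the debug output is that $\ is being interpolated rather than $ acting as an anchor. A sketch under that assumption, which mimics -l by setting $\ locally and then keeps the anchor separate with /x (variable names are hypothetical; //r needs Perl 5.14+):

```perl
use strict;
use warnings;

my $s = "a\n\nb";
my ($interpolated, $separated);
{
    local $\ = "\n";    # what the -l switch does to the output record separator

    # "$\" is interpolated, so this pattern is really ^ "\n" "n": no match.
    $interpolated = $s =~ s{^$\n}{boop}mr;

    # Under /x the space keeps "$" an anchor, followed by a real newline.
    $separated = $s =~ s{^ $ \n}{boop}xmr;
}

print $interpolated eq $s ? "first form: unchanged\n" : "first form: changed\n";
print $separated eq "a\nboopb" ? "second form: empty line replaced\n" : "second form: other\n";
```

With the anchor and the newline kept apart, the empty line together with its newline is replaced.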

        $ perl -MData::Dump -Mre=debug -le " dd( $_=qq{a\n\nb} ); s{^$\n}{boop}m; dd( $_ ); "
        "a\n\nb"
        Compiling REx "^%nn"
        Final program:
           1: MBOL (2)
           2: EXACT <\nn> (4)
           4: END (0)
        anchored "%nn" at 0 (checking anchored) anchored(MBOL) minlen 2
        Guessing start of match in sv for REx "^%nn" against "a%n%nb"
        Did not find anchored substr "%nn"...
        Match rejected by optimizer
        "a\n\nb"
        Freeing REx: "^%nn"

        But matching an OPTIONAL newline works (the pattern compiles to "^%nn?", so it is really the trailing literal n that is optional):

        $ perl -MData::Dump -Mre=debug -le " dd( $_=qq{a\n\nb} ); s{^$\n?}{boop}ms; dd( $_ ); "
        "a\n\nb"
        Compiling REx "^%nn?"
        Final program:
           1: MBOL (2)
           2: EXACT <\n> (4)
           4: CURLY {0,1} (8)
           6:   EXACT <n> (0)
           8: END (0)
        anchored "%n" at 0 (checking anchored) anchored(MBOL) minlen 1
        Guessing start of match in sv for REx "^%nn?" against "a%n%nb"
        Found anchored substr "%n" at offset 1...
        Found /^/m, restarting lookup for check-string at offset 2...
        Found anchored substr "%n" at offset 2...
        Position at offset 2 does not contradict /^/m...
        Guessed: match at offset 2
        Matching REx "^%nn?" against "%nb"
           2 <a%n> <%nb>    | 1:MBOL(2)
           2 <a%n> <%nb>    | 2:EXACT <\n>(4)
           3 <a%n%n> <b>    | 4:CURLY {0,1}(8)
                              EXACT <n> can match 0 times out of 1...
           3 <a%n%n> <b>    | 8: END(0)
        Match successful!
        "a\nboopb"
        Freeing REx: "^%nn?"

Re: Why multiline regex doesn't work?
by jeffa (Bishop) on Jun 08, 2015 at 23:31 UTC

    I really don't understand what you are trying to match ... perhaps you simplified your example too much? At any rate, why not use split?

    use strict;
    use warnings;
    use Data::Dumper;

    my $s = <<'ENDSTR';
    aaa : AAA
    bbb : BBB
    ccc : CCC
    ENDSTR

    my %hash = map { split( /\s+:\s+/, $_, 2) } split ( /\n/, $s );
    print Dumper \%hash;

    __END__
    $VAR1 = {
              'bbb' => 'BBB',
              'ccc' => 'CCC',
              'aaa' => 'AAA'
            };

    UPDATE:
    But simple splitting DOES work here.

    jeffa

    
      I am trying to learn how to use the ^ and $ anchors within a string for cases where simple splitting doesn't work, and I cannot see what's wrong with this simple example.
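
For the anchor-based approach, a minimal sketch that captures the value directly with a match in list context instead of a substitution. This is one possible formulation, not taken from any reply in the thread; \Q...\E is added on the assumption that the key should be matched literally.

```perl
use strict;
use warnings;

my $s = "aaa : AAA\nbbb : BBB\nccc : CCC\n";
my $m = 'bbb';

# /m lets ^ and $ match at line boundaries inside the string; without /s
# the dot stops at the newline, so (.*) captures only the rest of that line.
my ($value) = $s =~ /^\Q$m\E *: (.*)$/m;

print defined $value ? "value: $value\n" : "no match\n";
```

Because nothing has to consume the text before or after the target line, no leading or trailing .* is needed.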
Re: Why multiline regex doesn't work?
by nbd (Novice) on Jun 09, 2015 at 01:56 UTC

    Thanks to all for very helpful comments.

    AnomalousMonk - thanks for the very good note about turning on warnings and the ambiguous '$.' - this cleared things up.