nbd has asked for the wisdom of the Perl Monks concerning the following question:
The program is:

    #!/usr/bin/perl
    use v5.20;

    my $s = <<'ENDSTR';
    aaa : AAA
    bbb : BBB
    ccc : CCC
    ENDSTR

    my $m = 'bbb';
    my $a = $s =~ s/.*^$m *: (.*?)$.*/$1/rsm;
    my $b = $s =~ s/[.\n]*?^$m *: (.*)$[.\n]*/$1/rm;

    print "a: $a\n";
    print "b: $b\n";

The intended output of the program is:

    a: BBB
    b: BBB

But these regexes produce:

    a: BBB
    ccc : CCC

    b: aaa : AAA
    bbb : BBB
    ccc : CCC

What parts of both these regexes miss or prevent them from making the intended matches?
Re: Why multiline regex doesn't work?
by AnomalousMonk (Archbishop) on Jun 09, 2015 at 01:43 UTC

In your second regex, you achieve no match because the character class [.\n] does not mean what (I think) you think it means. There is also another problem: the predefined special variable $[ is being interpolated instead of the $ anchor followed by the [.\n] class you intended.

The '.' (period) character is not special, i.e., not a metacharacter, in a [] regex character class; it just matches a literal period, and there are no such characters in your $s test string. I'm not sure what the [.\n] expression was intended to represent (maybe [^\n] "anything but a newline"?), so I can't comment further until you can provide greater clarity.

Note, however, that disambiguating the $ metacharacter at least produces a different output, i.e., a match and substitution, even though the output is still not what you expect. (There is no warning because $[ has a default initialized value.)

Update: Note that the ambiguity between $[.\n] (regex) and the $[ predefined special variable (see perlvar) is yet another argument in favor of the /x embedded whitespace regex modifier (other than simply being able to see the darn regex). With /x the output is still not what you expected, but it is one less pitfall to negotiate. (The [ ] expression is what I like to use to represent a space, where \s represents any whitespace character, a larger set.)

Further Update: $[ really is interpolated into the regex; its default value is 0.

Give a man a fish: <%-(-(-(-<
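A minimal sketch of the two points above, using the $s and $m from the question. It is not the code from the original reply; the names $dots_or_newlines and $out_b are illustrative. It shows that [.\n] only matches literal dots and newlines, and that /x lets whitespace separate the $ anchor from the following [ so that $[ is no longer seen.

```perl
use strict;
use warnings;

my $s = "aaa : AAA\nbbb : BBB\nccc : CCC\n";
my $m = 'bbb';

# Inside a character class the dot is literal: [.\n] matches only '.' or "\n".
# The test string has no dots, so only its three newlines are counted.
my $dots_or_newlines = () = $s =~ /[.\n]/g;
print "[.\\n] matched $dots_or_newlines characters\n";    # 3

# With /x, whitespace after the $ anchor keeps it apart from the '[',
# so the pattern contains an anchor and a class rather than the variable $[.
my $out_b = $s =~ s/ [.\n]*? ^ $m [ ]* : [ ] (.*) $ [.\n]* /$1/xmr;
print "b: $out_b";
```

The substitution now matches, but only the newline before bbb and the one after BBB are consumed by the two classes, so the result is "aaa : AAABBBccc : CCC" followed by a newline: a match and substitution, but still not the intended output.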
by nbd (Novice) on Jun 09, 2015 at 04:29 UTC

Thanks for the detailed explanation. That was exactly what I was asking about: the exact parts of both regexes that work incorrectly.

.\n was intended to match all characters, including the newline character (since with the //m modifier '.' doesn't match newline). But I see that within square brackets the dot must be escaped. So, if all characters are expressed as \s\S, the regex now works. Thanks!
by AnomalousMonk (Archbishop) on Jun 09, 2015 at 15:00 UTC

    ... with //m modifier '.' doesn't match newline ...

Just to be clear: With or without the //m regex modifier, the default behavior of the . (dot) metacharacter is to match everything except a newline. It is only the //s "dot matches all" modifier that causes dot to match absolutely everything.

Give a man a fish: <%-(-(-(-<
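A small sketch of that distinction, on a made-up two-line string (the variable names are illustrative). The captured length shows how far the dot was allowed to run under each modifier.

```perl
use strict;
use warnings;

my $s = "aaa\nbbb\n";

my ($plain)  = $s =~ /(.*)/;     # no modifiers: dot stops at the first newline
my ($m_only) = $s =~ /(.*)/m;    # /m changes ^ and $ only, not the dot
my ($s_only) = $s =~ /(.*)/s;    # /s lets the dot match "\n" as well

print join(' ', length $plain, length $m_only, length $s_only), "\n";    # 3 3 8
```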
by AnomalousMonk (Archbishop) on Jun 09, 2015 at 17:14 UTC

    my $d = $s =~ s/[\s\S]*^$m *: (.*)$(?:[\s\S]*)/$1/rm;

The expression [\s\S] to express "match any character" cries out for comment. I assume it is used to avoid the . (dot) metacharacter when promoted by //s to "dot matches all" status. This rubs me the wrong way. If dot (with //s) matches all, why not just use it that way? (All code examples that follow enable warnings and strictures. Also note that the //r substitution modifier is only available with Perl versions 5.14+.)

Using . with //s in place of [\s\S] is arguably clearer, with only the tiny problem that it doesn't work! Why not? Consider the (.*) capture group. With dot matching anything, it greedily grabs everything to the end of the string. To achieve an overall match, the regex still has to match $ at the end of the string, which is easy, and (?:.*) "zero or more of anything" after the end of the string, also easy. So capture group 1 and $1 now contain everything to the end of the string, which is substituted back into the string.

But the intent of (.*) was only to capture everything up to the $ anchor before the first embedded newline (due to //m). How to restrain dot? One way would be to use a *? "lazy" modifier for the normally greedy * match quantifier: dot will then match as little as necessary to get to the first $ anchor. Now we're getting somewhere!

But one could argue that the intent of "anything except a newline" is more clearly expressed by [^\n], and "capture as much as possible to the first newline" is better as ([^\n]*) (remember that the code must be maintained, one must assume forever, so clear intent is important). (In this version, the $ anchor is redundant, but does no harm and arguably serves to further clarify intent.)

Lastly, my own preferred style, taken from TheDamian's PBP: $m is no longer defined as a raw string, but with qr// as a regex in its own right. This allows it to be used "atomically" within another regex, as it is in the substitution: expressions like $m+ or $m{4} work as expected.

The $ is still redundant, but still arguably clarifies intent. The same could be said about the preceding ^ in the regex, but I would argue that anchoring the $m atom in some way is potentially important, so just leave it be.

And that's the first several inches of the whole nine regex yards. HTH

Give a man a fish: <%-(-(-(-<
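The progression described above can be sketched as follows. This is a reconstruction under assumptions, not the code from the original reply: the /xms layout and the names $greedy, $lazy and $negated are illustrative, and $m is built with qr// as the reply recommends.

```perl
use strict;
use warnings;

my $s = "aaa : AAA\nbbb : BBB\nccc : CCC\n";
my $m = qr{ bbb }xms;    # the key as a regex object, usable as one atom

# Greedy dot under /s: (.*) runs to the end of the string, and $ then
# matches there, so $1 holds far more than the wanted field.
my $greedy = $s =~ s{ .* ^ $m [ ]* : [ ] (.*) $ .* }{$1}xmsr;

# Lazy quantifier: (.*?) grows only until $ can match before the first newline.
my $lazy = $s =~ s{ .* ^ $m [ ]* : [ ] (.*?) $ .* }{$1}xmsr;

# Negated class: "anything except a newline" stated directly.
my $negated = $s =~ s{ .* ^ $m [ ]* : [ ] ([^\n]*) $ .* }{$1}xmsr;

print "greedy:  [$greedy]\n";
print "lazy:    [$lazy]\n";       # lazy:    [BBB]
print "negated: [$negated]\n";    # negated: [BBB]
```

Under /x the whitespace after each $ anchor also keeps it from being read as the start of a variable name.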
by Anonymous Monk on Jun 09, 2015 at 07:46 UTC

    But I see, that within square brackets the dot must be escaped.

Yes, you've got the idea; it's just the lingo you need help with now :)

Escaping means prefixing it with a backslash, i.e. turning "." into "\.", but that isn't required: inside a character class [.] the dot is not a metacharacter, it is a literal character.

(?s:.) means any character (including \n), alias [\w\W] alias [\s\S] alias [\d\D] alias \p{All}
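A quick check of those aliases, as a sketch on a made-up three-character string (the names @any_char, @counts and so on are illustrative). Each alias should match every character, while [.] matches only the literal dot.

```perl
use strict;
use warnings;

my $s = "a\n.";    # a letter, a newline, a literal dot

# Each of these is meant to match any single character, newline included.
my @any_char = (qr/(?s:.)/, qr/[\s\S]/, qr/[\w\W]/, qr/[\d\D]/, qr/\p{All}/);

my @counts;
for my $re (@any_char) {
    my $count = () = $s =~ /$re/g;
    push @counts, $count;
}
print "any-character aliases: @counts\n";    # 3 3 3 3 3

my $class_dot = () = $s =~ /[.]/g;    # literal dot only
my $bare_dot  = () = $s =~ /./g;      # everything except the newline
print "[.] matched $class_dot, bare . matched $bare_dot\n";    # 1 and 2
```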
Re: Why multiline regex doesn't work?
by FreeBeerReekingMonk (Deacon) on Jun 09, 2015 at 00:07 UTC

You see, $ only works at the end or just before a |; anywhere else, $ gets confused with a variable. All the m modifier does is let ^ and $ match at the beginning/end of every line instead of only the absolute beginning and absolute end. Imagine:

    ~/(.)$./
    ~/(.)\$./
by nbd (Novice) on Jun 09, 2015 at 00:19 UTC

- m modifier (//m): Treat string as a set of multiple lines. '.' matches any character except "\n". ^ and $ are able to match at the start or end of any line within the string.
- both s and m modifiers (//sm): Treat string as a single long line, but detect multiple lines. '.' matches any character, even "\n". ^ and $, however, are able to match at the start or end of any line within the string.

Does the correction you made in the code mean that Perl processes the multiline string line by line and not as a single string?

UPDATE: I see that Perl processes the string as a whole. At first sight the code just looked like awk's line-by-line pattern matching. Thanks.
by AnomalousMonk (Archbishop) on Jun 09, 2015 at 01:14 UTC

The "Use of uninitialized value $. in regexp compilation..." message gives you a clue about what is happening. If the $ is unambiguously a regex metacharacter, you have your intended output for this regex.

Give a man a fish: <%-(-(-(-<
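Two ways of making the $ unambiguous in the first regex, as a sketch rather than the code from the original reply (the names $grouped and $spaced are illustrative). Both rely on the documented rule that $( is not interpolated inside a pattern, or on /x whitespace.

```perl
use strict;
use warnings;

my $s = "aaa : AAA\nbbb : BBB\nccc : CCC\n";
my $m = 'bbb';

# In the question's first regex, "$." is read as the input-line-number
# variable, so the intended "$" anchor never reaches the regex engine.

# Fix 1: follow the anchor with a group; "$(" is not interpolated in patterns.
my $grouped = $s =~ s/.*^$m *: (.*?)$(?:.*)/$1/rsm;

# Fix 2: use /x and put whitespace after the anchor.
my $spaced = $s =~ s/ .* ^ $m [ ]* : [ ] (.*?) $ .* /$1/xmsr;

print "grouped: $grouped\n";    # grouped: BBB
print "spaced:  $spaced\n";     # spaced:  BBB
```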
by jeffa (Bishop) on Jun 09, 2015 at 00:40 UTC

You really should try to work with simpler examples before you make things complicated:

The first set of matches illustrates a case when the 's' modifier gets the match and the second set of matches illustrates a case when the 'm' modifier gets the match. Hope this helps!

jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
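A pair of simple cases in that spirit, as a sketch only (the string and variable names are made up, and this is not the code from the original reply). The first pair needs 's' to match, the second pair needs 'm'.

```perl
use strict;
use warnings;

my $text = "foo\nbar";

# /s: the dot may cross the newline between the two lines.
my $dot_plain = $text =~ /foo.bar/  ? 1 : 0;    # 0
my $dot_s     = $text =~ /foo.bar/s ? 1 : 0;    # 1

# /m: ^ may match at the start of the second line.
my $caret_plain = $text =~ /^bar/  ? 1 : 0;     # 0
my $caret_m     = $text =~ /^bar/m ? 1 : 0;     # 1

print "dot: $dot_plain $dot_s  caret: $caret_plain $caret_m\n";
```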
by Anonymous Monk on Jun 09, 2015 at 01:11 UTC

See perlvar#$.

^ matches after newline (or beginning of string). $ matches before newline (or end of string).

Trying to match newline after end of line won't work, $\n won't work.

But matching an optional newline works.
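A sketch of the zero-width behaviour of ^ and $ under /m, and of an optional newline after the anchor. The string and variable names are made up, and /x is used so that whitespace keeps each $ clearly an anchor.

```perl
use strict;
use warnings;

my $s = "aaa\nbbb\nccc";    # the last line has no trailing newline

# Under /m, ^ and $ match at line boundaries without consuming the newline.
my @lines = $s =~ / ^ (.*) $ /xmg;    # ("aaa", "bbb", "ccc")

# An optional \n after the anchor consumes the line terminator when there is one...
(my $without_bbb = $s) =~ s/ ^ bbb $ \n? //xm;    # "aaa\nccc"

# ...and still lets the pattern match on the last line, where there is none.
(my $without_ccc = $s) =~ s/ ^ ccc $ \n? //xm;    # "aaa\nbbb\n"

print scalar(@lines), " lines\n";    # 3 lines
```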
Re: Why multiline regex doesn't work?
by jeffa (Bishop) on Jun 08, 2015 at 23:31 UTC

I really don't understand what you are trying to match ... perhaps you simplified your example too much? At any rate, why not use split?

UPDATE:

jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
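One way the split suggestion could look, as a sketch and not the code from the original reply. The hash name %value_for is hypothetical; the input is the question's $s and $m.

```perl
use strict;
use warnings;

my $s = "aaa : AAA\nbbb : BBB\nccc : CCC\n";
my $m = 'bbb';

# Split the text into lines, then split each line once on the
# colon (with optional surrounding whitespace) into key and value.
my %value_for = map { split /\s*:\s*/, $_, 2 } split /\n/, $s;

print "$m: $value_for{$m}\n";    # bbb: BBB
```

This sidesteps the anchors and modifiers entirely, at the cost of parsing every line even when only one key is wanted.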
by nbd (Novice) on Jun 08, 2015 at 23:57 UTC
Re: Why multiline regex doesn't work?
by nbd (Novice) on Jun 09, 2015 at 01:56 UTC

Thanks to all for the very helpful comments. AnomalousMonk, thanks for the very good note about turning on warnings and the ambiguous '$.': this cleared things up.