I was testing some regular expressions, when I came across some amusing behavior of a regex when compiled with the qr// operator.
    my $rx = 'abc';
    my $qr = qr/$rx/;
    if ('ABC' =~ /$qr/i) {
        print "ABC matches /abc/i\n"
    } else {
        print "ABC does not match /abc/i\n"
    }
If I use the /i modifier, the regex is supposed to match in case insensitive mode, i.e. "ABC" =~ /abc/i returns a match. However, if I first compile the pattern with qr// and only then apply /i, the match fails.
This is intriguing, and I have eventually found out why it happens, but before telling you, I would like to show some more examples and let you meditate on what may be happening behind the scenes.
This script shows some variations on the same tune. First a pattern that is applied in case insensitive mode won't match when we would expect it to. Then a pattern in dot-matches-all mode does not match a newline character.
However, when I use a literal pattern instead of a pre-compiled one, it matches.
    #!/usr/bin/perl -w
    use strict;

    my @patterns = ('abc', 'xyz');
    my %regexes = map { $_, qr/$_/ } @patterns;
    my @strings = (
        'the alphabet starts with ABC',
        'the alphabet ends with XYZ'
    );

    # Pre-compiled patterns, applied with /i
    for my $str (@strings) {
        for ( keys %regexes ) {
            print qq("$str" =~ /$_/i => );
            if ($str =~ /$regexes{$_}/i) {
                print "(qr) match\n";
            }
            else {
                print "(qr) no match\n"
            }
        }
    }

    # Plain string patterns, applied with /i
    for my $str (@strings) {
        for ( @patterns ) {
            print qq("$str" =~ /$_/i => );
            if ($str =~ /$_/i) {
                print "(pattern) match\n";
            }
            else {
                print "(pattern) no match\n"
            }
        }
    }

    my $string = <<END;
    This text
    spans across multiple
    lines
    END

    my $pattern = 'multiple .+ lines';
    my $regex = qr/$pattern/x;

    # Pre-compiled with /x, applied with /s
    if ($string =~ /$regex/s) {
        print "dot-matches-all (qr) matches\n";
    }
    else {
        print "dot-matches-all (qr) does not match\n";
    }

    # Plain string, applied with /xs
    if ($string =~ /$pattern/xs) {
        print "dot-matches-all (literal) matches\n";
    }
    else {
        print "dot-matches-all (literal) does not match\n";
    }

    __END__
    "the alphabet starts with ABC" =~ /abc/i => (qr) no match
    "the alphabet starts with ABC" =~ /xyz/i => (qr) no match
    "the alphabet ends with XYZ" =~ /abc/i => (qr) no match
    "the alphabet ends with XYZ" =~ /xyz/i => (qr) no match
    "the alphabet starts with ABC" =~ /abc/i => (pattern) match
    "the alphabet starts with ABC" =~ /xyz/i => (pattern) no match
    "the alphabet ends with XYZ" =~ /abc/i => (pattern) no match
    "the alphabet ends with XYZ" =~ /xyz/i => (pattern) match
    dot-matches-all (qr) does not match
    dot-matches-all (literal) matches
It is puzzling, isn't it?
OK. Enough suspense. Let's solve the mystery.
The reason for this behavior is that qr// compiles the pattern together with the modifiers we specify at its end. For example, qr/perl/i will happily match "Perl", "perl", and "PERL". The interesting thing that is silently happening, though, is that qr// records the state of all four of the /i, /x, /s and /m modifiers, not only the ones we wrote. Those we mention explicitly are switched on; those we don't are explicitly switched off. Let's ask Perl itself to unveil the truth.
    $ perl -e 'for (qw( i x s m )) {print eval "qr/perl/$_", "\n"}'
    (?i-xsm:perl)
    (?x-ism:perl)
    (?s-xim:perl)
    (?m-xis:perl)
As you can see, each pattern is compiled as if we had inserted a (?y-z:) block inside a regular expression. For those who don't recall it, such a block allows the insertion of a sub-expression with modifiers that apply only within the block's boundaries. Thus, we can insert a case sensitive sub-expression within a case insensitive regex. The modifiers between the question mark and the minus sign are set. The ones after the minus sign are unset.
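To make the scoping concrete, here is a minimal sketch of a case sensitive sub-expression inside a case insensitive regex. The strings and the variable name $re are made up for illustration.

```perl
use strict;
use warnings;

# The outer regex is case insensitive, but the (?-i:...) group
# switches /i off for "World" only.
my $re = qr/hello (?-i:World)/i;

for my $text ('HELLO World', 'hello world') {
    my $result = $text =~ $re ? 'matches' : 'does not match';
    print "'$text' $result\n";
}
```

The first string matches because only "hello" is compared without regard to case; the second fails because "world" inside the group must match exactly as written.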
Looking at the outcome of the last example, we can see that for each modifier that we set explicitly, qr// implicitly unsets the others.
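The exact text that a compiled regex stringifies to depends on the Perl version (from 5.14 on, the same information is printed in a shorter form such as (?^i:perl)). As a sketch of a way to inspect the modifiers without parsing that string, the regexp_pattern function from the re pragma, available since Perl 5.10, returns the pattern and its flags separately. The loop below is only an illustration and is not part of the original test.

```perl
use strict;
use warnings;
use re qw(regexp_pattern);    # requires Perl 5.10 or later

for my $mod (qw(i x s m)) {
    # Build qr/perl/i, qr/perl/x, ... at run time, as in the one-liner.
    my $qr = eval "qr/perl/$mod" or die $@;

    # In list context: (pattern text, modifiers that are switched on)
    my ($pattern, $flags) = regexp_pattern($qr);
    print "qr/perl/$mod => pattern '$pattern', flags '$flags', stringified as $qr\n";
}
```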
Coming back to our main example, the values in %regexes are (?-xism:abc) and (?-xism:xyz). Keeping in mind the above explanation of sub-expressions, it is clear why these pre-compiled patterns can't match: the /i on the enclosing regex is switched off again inside each block. The same is true for the "dot-matches-all" modifier. A pattern compiled with qr//x ends up as (?x-ism:pattern), and even though it is later embedded in a regex with the /s modifier, that modifier has no effect inside the block.
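Given this explanation, a few ways to get the intended match suggest themselves. The following is a sketch only, with variable names invented for the purpose; which variant fits best depends on where the pattern comes from.

```perl
use strict;
use warnings;

my $text = 'the alphabet starts with ABC';
my $str  = 'abc';

# For contrast: the outer /i does not reach inside a plain qr//.
my $qr_plain = qr/$str/;
print "qr// then /i:  ", ($text =~ /$qr_plain/i ? "match" : "no match"), "\n";

# 1. Give the modifier to qr// itself, so it is compiled into the regex.
my $qr_i = qr/$str/i;
print "qr//i:         ", ($text =~ $qr_i ? "match" : "no match"), "\n";

# 2. Keep the pattern as a plain string and let the outer /i apply to it.
print "string and /i: ", ($text =~ /$str/i ? "match" : "no match"), "\n";

# 3. Wrap the string in an inline modifier group before compiling.
my $qr_inline = qr/(?i:$str)/;
print "qr/(?i:...)/:  ", ($text =~ $qr_inline ? "match" : "no match"), "\n";
```

Only the first test reports no match; in the other three the case insensitivity is either part of the compiled regex or applied to an uncompiled string.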
perlre and perlop are vague about this issue. The only place I've found it mentioned and explained in plain English is Mastering Regular Expressions, 2nd Ed.
Update
Changed title upon Aristotle's suggestion. (Was qr// hidden risks)