Re: Empty pattern in regex
by hv (Prior) on Oct 19, 2023 at 00:40 UTC
|
Why is "f" printed?
I would have expected the question 'why are "f" and "g" printed'. Do you agree that printing "g" is also surprising, for the same reason? (If not, I may be misunderstanding random parts of your post.)
Why is there the empty regex (see (*))?
I don't know, seems very odd to me. I suggest reporting it as a possible bug.
It seems possible that since the last successfully matched regexp was /d/, and the last attempted match against that regexp was a fail, it may have somehow marked it as no longer successfully matched; but that doesn't explain the change of behaviour when you add the empty continue block.
I suspect rather that it is a scoping bug: I'm not sure if the docs make this clear, but it is intended to use the last successfully matched regexp visible to the current scope. Thus:
% perl -wle '"a" =~ /a/; { "b" =~ /b/ } "ab" =~ // and print $&'
a
%
FWIW p5p mostly regards the empty regexp behaviour as a misfeature reluctantly spared the axe only because of the constraints of backward compatibility - it is very rare to see anyone actually trying to make use of it. But since we have it, it certainly ought to work as advertised.
> but it is intended to use the last successfully matched regexp
Whatever its intent, it is one confusing puppy. //; always matches, always returns TRUE, but it never changes $&. $& is always whatever the previous regex set it to, whether it matched or not, effectively a NOP. Consider this:
$_ = 'Hello Perl';
say '$_ = \'Hello Perl\';';
print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";
# unsuccessful match
/Python/;
print "No match, \/Python\/\n";
print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";
# successful match
if (//) {
print "No nothing, if (\/\/) {\n";
print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";
} else {
print "\/\/ unsuccessful match";
}
# successful match, no captures
/Perl/;
print "Match \/Perl\/, No captures\n";
print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";
# successful match, empty pattern
if (//) {
print "No nothing, if (\/\/) {\n";
print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";
} else {
print "\/\/ unsuccessful match";
}
# successful match, no captures
/Perl/;
print "Match \/Perl\/, No captures\n";
print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";
# successful match, no pattern, empty parens
if (/()/) {
#//;
print "No nothing, if \(\/\(\)\/\) {\n";
print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";
} else {
print "\/\/ unsuccessful match";
}
Which results in:
$_ = 'Hello Perl';
$1:
$2:
$3:
$&:
No match, /Python/
$1:
$2:
$3:
$&:
No nothing, if (//) {
$1:
$2:
$3:
$&:
Match /Perl/, No captures
$1:
$2:
$3:
$&: Perl
No nothing, if (//) {
$1:
$2:
$3:
$&: Perl
Match /Perl/, No captures
$1:
$2:
$3:
$&: Perl
No nothing, if (/()/) {
$1:
$2:
$3:
$&:
What is or was the purpose of this construction? How would one use it?
> What is or was the purpose of this construction? How would one use it?
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
$_ = 'abacad';
say "/a(.)/";
if (/a(.)/g) {
say "\$1: $1";
say "\$&: $&";
} else {
say 'No match';
}
for my $try (1 .. 3) {
say "//";
if (//g) {
say "\$1: $1";
say "\$&: $&";
} else {
say 'No match';
}
}
Output:
/a(.)/
$1: b
$&: ab
//
$1: c
$&: ac
//
$1: d
$&: ad
//
No match
Update:
If I remember correctly, this was the original reason the feature was introduced:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $x = 'found 11';
my $y = 'found 12';
if ($x =~ /found (\d+)/ && $y =~ //) { # No need to repeat the long regex! Yay!
say "Found $1.";
}
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
> // always matches, always returns TRUE, but it never changes $&.
That is not correct, in either aspect:
% perl -wle '"a" =~ /a/; "b" =~ // or print "did not match, did not return TRUE"'
did not match, did not return TRUE
% perl -wle '"a" =~ /.*/; q{$& changed} =~ // and print $&'
$& changed
%
Re: Empty pattern in regex
by jo37 (Curate) on Oct 19, 2023 at 07:07 UTC
|
Maybe it's a bug, maybe it's an obscure feature.
Anyway, this is a fragile construct for border-checking of the flip-flop operator as it will be broken by a regex match within the if-block.
The required information for such a check is provided by the flip-flop operator itself: it returns the current "loop number", with "E0" appended to the final loop call.
Here is a more robust version:
perl -le 'print for a .. h' | perl -nle 'if (my $ff = (/d/ .. /h/)) { next unless $ff =~ /(?:^1|E0)$/; print }'
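To see what the flip-flop actually hands back, here is a minimal sketch that records its scalar return value for each item on which it is true. The helper name flipflop_trace is made up for this illustration; the /d/ and /h/ conditions are the ones from the one-liner above.

```perl
use strict;
use warnings;

# Hypothetical helper: record the scalar value returned by the
# flip-flop for every item on which it is true.
sub flipflop_trace {
    my @trace;
    for (@_) {
        my $ff = /d/ .. /h/;    # scalar context: sequence number, or '' when false
        push @trace, "$_=$ff" if $ff;
    }
    return @trace;
}

print join(' ', flipflop_trace('a' .. 'j')), "\n";
# prints: d=1 e=2 f=3 g=4 h=5E0
```

The first item of the range carries the value 1 and the last one carries the E0 suffix, which is all the border check needs.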
Greetings, -jo
$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$
Re: Empty pattern in regex [updated]
by jo37 (Curate) on Oct 25, 2023 at 20:11 UTC
|
I think it's a bug. It has nothing to do with the flip-flop operator and it seems to be caused by jumping out of a block.
Consider this example that emulates a flip flop and uses goto instead of next.
#!/usr/bin/perl
use v5.24;
use warnings;
my ($first, $last);
while (<DATA>) {
chomp;
$first ||= /d/;
undef($first) if $last ||= /h/;
if ($first || $last) {
undef $last;
#goto ewhile unless //;
goto eif unless //;
say;
eif:
}
ewhile:
}
__DATA__
c
d
e
f
g
h
i
goto eif:
d
h
goto ewhile:
d
f
g
h
Jumping to the end of the current block produces the expected result, while jumping to the end of the while loop reproduces choroba's strange results.
The jump out of the block seems to clear the "last successful match" causing // to be taken as an always matching empty pattern.
However, I'd prefer to check the flip-flop's return value as this works in all circumstances, even for if(foo($_) .. bar($_)) {...}.
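The same check can be sketched with ordinary function calls as range conditions instead of regex matches. The names is_start, is_end and borders are hypothetical and exist only for this example; they stand in for foo($_) and bar($_).

```perl
use strict;
use warnings;

sub is_start { return $_[0] eq 'd' }    # stand-in for foo($_)
sub is_end   { return $_[0] eq 'h' }    # stand-in for bar($_)

# Return only the first and last item of each flip-flop range.
sub borders {
    my @kept;
    for my $item (@_) {
        my $ff = is_start($item) .. is_end($item);
        next unless $ff;                    # outside the range
        next unless $ff =~ /(?:^1|E0)$/;    # keep only the two borders
        push @kept, $item;
    }
    return @kept;
}

print join(' ', borders('c' .. 'i')), "\n";    # prints: d h
```

Because the border test reads only the flip-flop's own return value, it does not depend on which regex matched last.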
Update: 26.10.2023
Here is a much simpler example demonstrating the behaviour without any flip-flop behaviour.
A jump out of a block transforms the empty pattern // from the last successful matching pattern to a true empty pattern.
#!/usr/bin/perl
use v5.24;
use warnings;
for my $label ('inner', 'outer') {
say "goto $label";
for ('c' .. 'g') {
say "loop: $_";
say "/d/ matched" if /d/;
{
goto $label unless //;
say "// matched";
inner:
}
outer:
}
say '';
}
__DATA__
goto inner
loop: c
// matched
loop: d
/d/ matched
// matched
loop: e
loop: f
loop: g
goto outer
loop: c
// matched
loop: d
/d/ matched
// matched
loop: e
loop: f
// matched
loop: g
// matched
Greetings, -jo
Hello jo37,
Do you still think it's a bug even though we seem to be able to get what we want with some Perl trickery? You duplicated the issue choroba raised but we did get the answer he expected.
// has been around at least since 5.6 as it is discussed in PP 3rd, as well as 4th (5.14) without, unfortunately, examples. The one in perlop is clear but is not suggestive, to me, of a real use case. I'm still looking for a definitive use case, or at least a realistic one.
I've come up with two. The first might be used by a grammarian or linguist researching comparative languages. The second extracts the string between HTML tags, although I show how to do this with a much simpler plain-old regex. Methinks it's a stretch to use // when there are other ways to do a thing, but TMTOWTDI. One of my examples parses a string while the other uses an array. if (/this/../that/) {... almost demands an array.
I would really like to hear a war story or two about how // was used to solve some really gnarly problem. Here be my two examples:
#!/usr/bin/env -S perl -w
##!/usr/bin/env -S perl -wd
use v5.30.0;
use strict;
use List::AllUtils qw( reduce );
my ($slurpee, $length, $sum);
{
local $/;
($slurpee) = <DATA>;
}
$length = length $slurpee;
my @regexes = (
[ qr/[A-Z]/, "uppercase characters", 0 ],
[ qr/[a-z]/, "lowercase characters", 0 ],
[ qr/\d/, "digits", 0 ],
[ qr/\s/, "whitespace characters", 0 ],
#
# Note: $ must be \$, and - must be first to avoid range interpretation.
#
[ qr/[-~`!@#\$%^&*()_+={}\[\]|\\:;"'<>,.?\/]/, "punctuation characters", 0 ],
);
#for my $c (split //, $slurpee) { print $c; }
for my $case (@regexes) {
say "seeding // with: $case->[0]";
"Aa5: " =~ $case->[0]; # seed the // iteration
say "matched: '$&'" if $&;
for (split //, $slurpee) {
// and $case->[2]++;
}
}
for my $case (@regexes) { printf("%4d %s\n", $case->[2], $case->[1]); }
$sum = reduce { $a + $b } (map $_->[2], @regexes);
printf(" sum and length: %3d and %3d\n", $sum, $length);
say "\nNow extract the string between HTML tags with //...";
my $str = "Before tag<i>between tags</i>after tag";
say "\n$str";
$str =~ s{ (?: (?<= \w) (?= <) | (?<= >) (?= \w) ) }{ }xg; # insert whitespace
say $str;
my @tokens = split / /, $str;
say "Tokens...\n";
for (@tokens) { say };
my $between;
for (@tokens) {
if (/<\w>/../<\/\w>/) {
$between .= "$_ " unless // and $&;
}
}
chop $between if $between;
say "'$between'";
$str = "\n'Before tag<i>between tags</i>after tag'";
say $str;
say "Parse it again with...";
my $regex = qr/ (<\w+>) (.*) (<\/\w+>) /x;
say $regex;
$str =~ $regex;
say "\$1: '$1'";
say "\$2: '$2'";
say "\$3: '$3'";
exit(0);
__END__
Last night I dreamt I went to Manderley again. This will come as a surprise to
Daphne since she did not write these lines. Here is a line containing stuff
,?- ! : that should/must be deleted/// ; : ! before using it as a one-time-pad.
A one-time-pad should contain only characters, no punctuation, no parentheticals like (this is bogus) or [(this is bogus, too)], or {also this}; no contractions, such as
I'll or it's or digits such as 0, 123, -75 or 8 P.M., and no numbers, such as $1,234.69. If
you want to use numbers in your message, spell them out; one-hundred dollars and sixty-nine cents, or theeepm. These non-alpha characters in the one-time-pad will be discarded, but they must be entered eactly as represented in the book used as the pad. Let the encoding program decide what to use and what to skip.
Some of the text is from "Rebecca", an out-of copyright but not out-of-print fictional
work that can be freely downloaded as an eBook from Project Gutenberg. I use it as the
raw source for one-time pads in a cryptologic research study; i.e., extract potential
pad bits from somewhere in the text, randomly chosen with seek from EOF. Munge the
characters, encrypt the message and delete the characters used for the pad. Since both
encoder and decoder use the same seek expression, both pads are guaranteed to be
identical, and since the characters used to create the pad are deleted, never to be seen
again, the pad is guaranteed to be used exactly once. Does not scale for large
organizations but works flawlessly for a small group of conspirators.
O U T P U T
seeding // with: (?^u:[A-Z])
matched: 'A'
seeding // with: (?^u:[a-z])
matched: 'a'
seeding // with: (?^u:\d)
matched: '5'
seeding // with: (?^u:\s)
matched: ' '
seeding // with: (?^u:[-~`!@#\$%^&*()_+={}\[\]|\\:;"'<>,.?/])
matched: ':'
26 uppercase characters
1168 lowercase characters
13 digits
283 whitespace characters
80 punctuation characters
sum and length: 1570 and 1570
Now extract the string between HTML tags with //...
Before tag<i>between tags</i>after tag
Before tag <i> between tags </i> after tag
Tokens...
Before
tag
<i>
between
tags
</i>
after
tag
'between tags'
'Before tag<i>between tags</i>after tag'
Parse it again with...
(?^ux: (<\w+>) (.*) (</\w+>) )
$1: '<i>'
$2: 'between tags'
$3: '</i>'
| [reply] [d/l] |
|
|
#!/usr/bin/perl
use v5.24;
use warnings;
for my $label ('inner', 'outer') {
say "goto $label";
for ('c' .. 'g') {
say "loop: $_";
say "/d/ matched" if /d/;
say "\$&: '$&'" if defined $&;
{
goto $label unless //;
say "// matched";
inner:
}
outer:
}
say '';
}
__DATA__
goto inner
loop: c
// matched
loop: d
/d/ matched
$&: 'd'
// matched
loop: e
$&: 'd'
loop: f
$&: 'd'
loop: g
$&: 'd'
goto outer
loop: c
// matched
loop: d
/d/ matched
$&: 'd'
// matched
loop: e
$&: 'd'
loop: f
// matched
loop: g
// matched
Greetings, -jo
Re: Empty pattern in regex
by jo37 (Curate) on Oct 30, 2023 at 20:08 UTC
|
Filed a bug report.
Greetings, -jo
Re: Empty pattern in regex
by perlboy_emeritus (Scribe) on Oct 19, 2023 at 00:15 UTC
|
perl -le 'print for a .. z' | perl -nle 'if (/d/ .. /h/) { next unless "m//"; print }'
and got:
% perl -le 'print for a .. z' | perl -nle 'if (/d/ .. /h/) { next unless "m//"; print }'
d
e
f
g
h
And then I pedantically did:
for my $c ( 'a'..'z') {
next unless ($c =~ /[d-h]/); say $& if $&;
}
and got:
d
e
f
g
h
I guess I don't understand. Isn't 'd e f g h' what is expected? I've never really trusted one-liners. Brian Foy wrote an interesting piece on SO, to wit:
https://stackoverflow.com/questions/22652393/regex-1-variable-reset
except his example using //; did not work for me. He expected all vars to be cleared but when I ran his code:
# The regex capture variables are only reset on the next successful match.
# This way, Perl saves a lot of time by not affecting variables when matches
# fail. As such, only use those variables with a guard, to wit:
# if ( /abc/ ) { # this tests for /abc/ success and now it's OK to use $&
# ...
# }
# Here's an extended demonstration, with a special surprise at the end:
say "First long example...\n";
$_ = 'Hello Perl';
print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";
# successful match
/(P)(erl)/;
print "First match\n";
print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";
# unsuccessful match
/(P)(ython)/;
print "Failed capture\n";
print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";
# successful match again
/(Pe)(r)(l)/;
print "Three captures\n";
print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";
# successful match, fewer captures
/(Perl)/;
print "One capture\n";
print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";
# successful match, no captures
/Perl/;
print "No captures\n";
print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";
# successful match, no pattern, special case
//;
print "No nothing\n";
print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";
I got:
$1:
$2:
$3:
$&:
First match
$1: P
$2: erl
$3:
$&: Perl
Failed capture
$1: P
$2: erl
$3:
$&: Perl
Three captures
$1: Pe
$2: r
$3: l
$&: Perl
One capture
$1: Perl
$2:
$3:
$&: Perl
No captures
$1:
$2:
$3:
$&: Perl
No nothing
$1:
$2:
$3:
$&: Perl
As you can see in 'No nothing' $& was not cleared for me as it was for him, as he reported in that piece. I don't trust using $n, $`, $& or $' unless I explicitly test for TRUE after the regex executes. Am I being overly paranoid?
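One way to act on that caution is to read the capture variables only on the success branch of a match. The following is a minimal sketch; first_number and stale_capture are hypothetical helpers written for this illustration, not code from the thread or from the linked answer.

```perl
use strict;
use warnings;

# Return the first run of digits in $text, or undef if there is none.
# $1 is read only when the match has just succeeded.
sub first_number {
    my ($text) = @_;
    return $text =~ /(\d+)/ ? $1 : undef;
}

# Without the guard, a failed match leaves the old value in place.
sub stale_capture {
    'found 11' =~ /(\d+)/;    # succeeds, sets $1 to 11
    'nothing'  =~ /(\d+)/;    # fails, $1 is left untouched
    my $seen = $1;
    return $seen;
}

print first_number('found 12') // 'undef', "\n";    # prints: 12
print first_number('nothing')  // 'undef', "\n";    # prints: undef
print stale_capture(), "\n";                         # prints: 11
```

The guarded version never reports a value left over from an earlier match, which is the behaviour the comment block in the quoted code recommends.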
The empty pattern "//"
If the *PATTERN* evaluates to the empty string, the last
*successfully* matched regular expression is used instead. In
this case, only the "g" and "c" flags on the empty pattern are
honored; the other flags are taken from the original pattern. If
no match has previously succeeded, this will (silently) act
instead as a genuine empty pattern (which will always match).
Note that it's possible to confuse Perl into thinking "//" (the
empty regex) is really "//" (the defined-or operator). Perl is
usually pretty good about this, but some pathological cases
might trigger this, such as "$x///" (is that "($x) / (//)" or
"$x // /"?) and "print $fh //" ("print $fh(//" or
"print($fh //"?). In all of these examples, Perl will assume you
meant defined-or. If you meant the empty regex, just use
parentheses or spaces to disambiguate, or even prefix the empty
regex with an "m" (so "//" becomes "m//").
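A short sketch of the first rule quoted above, that an empty pattern stands in for the last successfully matched one. The name reuse_last_pattern is hypothetical, and the example deliberately avoids the "no previous match" case, since that depends on what has already matched elsewhere in the running program.

```perl
use strict;
use warnings;

# Seed the "last successful pattern" with /b/, then apply the
# empty pattern (written m// to avoid any defined-or ambiguity).
sub reuse_last_pattern {
    my @results;
    'abc' =~ /b/ or die 'seed match failed';
    push @results, ('xyz' =~ m// ? 'match' : 'no match');        # /b/ tried on 'xyz'
    push @results, ('cba' =~ m// ? "match: $&" : 'no match');    # /b/ tried on 'cba'
    return @results;
}

print join(', ', reuse_last_pattern()), "\n";
# prints: no match, match: b
```

The failed attempt against 'xyz' does not change which pattern counts as the last successful one, so the third match still uses /b/.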
Re: Empty pattern in regex
by perlboy_emeritus (Scribe) on Oct 23, 2023 at 17:33 UTC
|
Hello choroba,
I don't like to give up without exhausting all avenues of research, and for me Perl is enjoyment and therapy (needed in this world we live in, and this age). This issue may now have dropped off the radars of the other participants, but not mine. jo37 came up with:
perl -le 'print for a .. h' | perl -nle 'if (my $ff = (/d/ .. /h/)) { next unless $ff =~ /(?:^1|E0)$/; print }'
d
h
which troubles me because of that alternation and the absence of //, which I think is/was your point. Mine are, granted, after debugging with strategic print statements:
perl -le 'print for a .. z' | perl -nle 'if (/d/ .. /h/) { next unless // and $_ eq $&; print; }'
d
h
or
perl -le 'print for a .. z' | perl -nle 'if (/d/ .. /h/) { next unless $_ eq $& and //; print; }'
d
h
and is a short-circuit operator so it works either way. Does this do what you expected it to do?
Regards, Will
It was not my intention to cause any troubles. $ff =~ /(?:^1|E0)$/ can be rewritten as $ff == 1 || $ff =~ /E0$/.
When the second operand of the flip-flop operator becomes true, the return value gets an E0 appended.
This does not change the value in numeric context as it is just one of its floating point representations. In string context it is distinguishable from all the other values, though.
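The dual nature of that value is easy to check directly. In this sketch the string "5E0" stands for what the flip-flop returns on the final item of a five-item range; position_in_range is a hypothetical helper name, and it should only be called with true flip-flop values.

```perl
use strict;
use warnings;

# Classify a (true) flip-flop return value.
sub position_in_range {
    my ($ff) = @_;
    my $first = $ff == 1;         # numeric comparison ignores the E0 suffix
    my $last  = $ff =~ /E0$/;     # string match sees it
    return $first && $last ? 'first and last'
         : $first          ? 'first'
         : $last           ? 'last'
         :                   'middle';
}

print "$_: ", position_in_range($_), "\n" for '1', '3', '5E0', '1E0';
# prints:
# 1: first
# 3: middle
# 5E0: last
# 1E0: first and last
```

The '1E0' case is a range whose start and end conditions are both true on the same item.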
HTH
Greetings, -jo
The $_ eq $& is an interesting idea. Note that it only works because of -l, otherwise \n would have been included in $_ but not $&.
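A quick way to see the effect of -l (which chomps each input line when combined with -n) is to compare a line with and without its trailing newline. The helper name whole_line_matched is an assumption made for this sketch.

```perl
use strict;
use warnings;

# True (1) if the text matched by /d/ is the entire line.
sub whole_line_matched {
    my ($line) = @_;
    return 0 unless $line =~ /d/;
    return $line eq $& ? 1 : 0;
}

print whole_line_matched("d\n"), "\n";    # prints: 0 (newline still attached)

my $line = "d\n";
chomp $line;                              # what -l does for each line read
print whole_line_matched($line), "\n";    # prints: 1
```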
Re: Empty pattern in regex
by Anonymous Monk on Oct 19, 2023 at 09:17 UTC
|
| [reply] |