Empty pattern in regex

choroba has asked for the wisdom of the Perl Monks concerning the following question:

Re: Empty pattern in regex by hv (Prior) on Oct 19, 2023 at 00:40 UTC
> Why is "f" printed?

I would have expected the question 'why are "f" and "g" printed'. Do you agree that printing "g" is also surprising, for the same reason? (If not, I may be misunderstanding random parts of your post.)

> Why is there the empty regex (see `()`)?

I don't know, seems very odd to me. I suggest reporting it as a possible bug.

It seems possible that since the last successfully matched regexp was `/d/`, and the last attempted match against that regexp was a fail, it may have somehow marked it as no longer successfully matched; but that doesn't explain the change of behaviour when you add the empty continue block.

I suspect rather that it is a scoping bug: I'm not sure if the docs make this clear, but it is intended to use the last successfully matched regexp visible to the current scope. Thus:

    % perl -wle '"a" =~ /a/; { "b" =~ /b/ } "ab" =~ // and print $&'
    a
    %

FWIW p5p mostly regards the empty regexp behaviour as a misfeature reluctantly spared the axe only because of the constraints of backward compatibility; it is very rare to see anyone actually trying to make use of it. But since we have it, it certainly ought to work as advertised.
Re^2: Empty pattern in regex by perlboy_emeritus (Scribe) on Oct 19, 2023 at 18:45 UTC
> but it is intended to use the last successfully matched regexp Whatever its intent, it is one confusing puppy. //; always matches, always returns TRUE, but it never changes $&. $& is always whatever the previous regex set it to, whether it matched or not, effectively a NOP. Consider this: $_ = 'Hello Perl'; say '$_ = \'Hello Perl\';'; print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n"; # unsuccessful match /Python/; print "No match, \/Python\/\n"; print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n"; # successful match if (//) { print "No nothing, if (\/\/) {\n"; print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n"; } else { print "\/\/ unsuccessfull match"; } # successful match, no captures /Perl/; print "Match \/Perl\/, No captures\n"; print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n"; # successful match, empty pattern if (//) { print "No nothing, if (\/\/) {\n"; print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n"; } else { print "\/\/ unsuccessfull match"; } # successful match, no captures /Perl/; print "Match \/Perl\/, No captures\n"; print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n"; # successful match, no pattern, empty parens if (/()/) { #//; print "No nothing, if $\/\($\/\) {\n"; print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n"; } else { print "\/\/ unsuccessfull match"; } [download] Which results in: $_ = 'Hello Perl'; $1: $2: $3: $&: No match, /Python/ $1: $2: $3: $&: No nothing, if (//) { $1: $2: $3: $&: Match /Perl/, No captures $1: $2: $3: $&: Perl No nothing, if (//) { $1: $2: $3: $&: Perl Match /Perl/, No captures $1: $2: $3: $&: Perl No nothing, if (/()/) { $1: $2: $3: $&: What is or was the purpose of this construction? How would one use it?	[reply] [d/l]
Re^3: Empty pattern in regex by choroba (Cardinal) on Oct 19, 2023 at 19:02 UTC
> What is or was the purpose of this construction? How would one use it?

    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };

    $_ = 'abacad';
    say "/a(.)/";
    if (/a(.)/g) {
        say "\$1: $1";
        say "\$&: $&";
    } else {
        say 'No match';
    }
    for my $try (1 .. 3) {
        say "//";
        if (//g) {
            say "\$1: $1";
            say "\$&: $&";
        } else {
            say 'No match';
        }
    }

Output:

    /a(.)/
    $1: b
    $&: ab
    //
    $1: c
    $&: ac
    //
    $1: d
    $&: ad
    //
    No match

Update: If I remember correctly, this was the original reason the feature was introduced:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };

    my $x = 'found 11';
    my $y = 'found 12';
    if ($x =~ /found (\d+)/ && $y =~ //) {  # No need to repeat the long regex! Yay!
        say "Found $1.";
    }

`map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]`
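An editorial aside on the "no need to repeat the long regex" use case above: the same saving can be had without relying on the last successful pattern, by keeping the regex in a `qr//` object. The sketch below is only an illustration under that assumption; the sub name `last_found_number` is hypothetical and does not appear in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: apply one stored pattern to two strings and
# return the capture from the second (most recent) successful match.
sub last_found_number {
    my ($x, $y) = @_;
    my $re = qr/found (\d+)/;    # the pattern is written exactly once
    return ($x =~ $re && $y =~ $re) ? $1 : undef;
}

print last_found_number('found 11', 'found 12'), "\n";    # prints 12
```

Because the pattern is named explicitly, the result does not depend on which match last succeeded in the enclosing scope.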
Re^4: Empty pattern in regex by perlboy_emeritus (Scribe) on Oct 20, 2023 at 20:08 UTC
Re^3: Empty pattern in regex by hv (Prior) on Oct 19, 2023 at 19:56 UTC
> `//` always matches, always returns TRUE, but it never changes $&.

That is not correct, in either aspect:

    % perl -wle '"a" =~ /a/; "b" =~ // or print "did not match, did not return TRUE"'
    did not match, did not return TRUE
    % perl -wle '"a" =~ /.*/; q{$& changed} =~ // and print $&'
    $& changed
    %
Re^4: Empty pattern in regex by perlboy_emeritus (Scribe) on Oct 19, 2023 at 20:44 UTC
Re: Empty pattern in regex by jo37 (Curate) on Oct 19, 2023 at 07:07 UTC
Maybe it's a bug, maybe it's an obscure feature. Anyway, this is a fragile construct for border-checking of the flip-flop operator, as it will be broken by a regex match within the if-block. The required information for such a check is provided by the flip-flop operator itself: it returns the current "loop number", with "E0" appended to the final loop call. Here is a more robust version:

`perl -le 'print for a .. h' | perl -nle 'if (my $ff = (/d/ .. /h/)) { next unless $ff =~ /(?:^1|E0)$/; print }'`

Greetings, -jo

`$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$`
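The return-value check described above can be sketched as a small sub. This is a minimal illustration, not jo37's code: the name `range_borders` is hypothetical, and the `/d/ .. /h/` range is simply borrowed from the one-liner.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the first and last line of each
# /d/ .. /h/ range, using the flip-flop's own return value.
sub range_borders {
    my @borders;
    for (@_) {
        my $ff = (/d/ .. /h/);    # scalar context: flip-flop, not a list range
        next unless $ff;          # empty string while outside the range
        # First line of a range yields 1; the last one ends in "E0".
        push @borders, $_ if $ff == 1 || $ff =~ /E0$/;
    }
    return @borders;
}

print join(' ', range_borders('a' .. 'h')), "\n";    # prints: d h
```

Note that the flip-flop keeps its state between calls of the sub, so a range left open by one call would continue in the next; the input here always closes its range.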
Re: Empty pattern in regex [updated] by jo37 (Curate) on Oct 25, 2023 at 20:11 UTC
I think it's a bug. It has nothing to do with the flip-flop operator and it seems to be caused by jumping out of a block. Consider this example that emulates a flip-flop and uses `goto` instead of `next`.

    #!/usr/bin/perl
    use v5.24;
    use warnings;

    my ($first, $last);
    while (<DATA>) {
        chomp;
        $first ||= /d/;
        undef($first) if $last ||= /h/;
        if ($first || $last) {
            undef $last;
            #goto ewhile unless //;
            goto eif unless //;
            say;
            eif:
        }
        ewhile:
    }
    __DATA__
    c
    d
    e
    f
    g
    h
    i

`goto eif`:

    d
    h

`goto ewhile`:

    d
    f
    g
    h

Jumping to the end of the current block produces the expected result, while jumping to the end of the while loop reproduces choroba's strange results. The jump out of the block seems to clear the "last successful match", causing `//` to be taken as an always-matching empty pattern.

However, I'd prefer to check the flip-flop's return value as this works in all circumstances, even for `if(foo($_) .. bar($_)) {...}`.

Update: 26.10.2023

Here is a much simpler example demonstrating the behaviour without any flip-flop behaviour. A jump out of a block transforms the empty pattern `//` from the last successful matching pattern to a true empty pattern.

    #!/usr/bin/perl
    use v5.24;
    use warnings;

    for my $label ('inner', 'outer') {
        say "goto $label";
        for ('c' .. 'g') {
            say "loop: $_";
            say "/d/ matched" if /d/;
            {
                goto $label unless //;
                say "// matched";
                inner:
            }
            outer:
        }
        say '';
    }
    __DATA__
    goto inner
    loop: c
    // matched
    loop: d
    /d/ matched
    // matched
    loop: e
    loop: f
    loop: g

    goto outer
    loop: c
    // matched
    loop: d
    /d/ matched
    // matched
    loop: e
    loop: f
    // matched
    loop: g
    // matched

Greetings, -jo

`$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$`
Re^2: Empty pattern in regex [updated] by perlboy_emeritus (Scribe) on Oct 27, 2023 at 21:54 UTC
Hello jo37, Do you still think it's a bug even though we seem to be able to get what we want with some Perl trickery? You duplicated the issue choroba raised but we did get the answer he expected. // has been around at least since 5.6 as it is discussed in PP 3rd, as well as 4th (5.14) without, unfortunately, examples. The one in perlop is clear but is not suggestive, to me, of a real use case. I'm still looking for a definitive use case, or at least realistic, if not definitive. I've come up with two. The first might be used by a grammarian or linguist researching comparative languages. The second extracts the string between html tags, although I show how to do this with a much simpler plain-old regex. Me thinks it's a stretch to use // when there are other ways to do a thing, but TMTOWTDI. One of my examples parses a string while the other uses an array. if (/this/../that/) {... almost demands an array. I would really like to hear a war story or two how // was used to solve some really gnarly problem. Here be my two examples: #!/usr/bin/env -S perl -w ##!/usr/bin/env -S perl -wd use v5.30.0; use strict; use List::AllUtils qw( reduce ); my ($slurpee, $length, $sum); { local $/; ($slurpee) = <DATA>; } $length = length $slurpee; my @regexes = ( [ qr/[A-Z]/, "uppercase characte +rs", 0 ], [ qr/[a-z]/, "lowercase characte +rs", 0 ], [ qr/\d/, "digits", + 0 ], [ qr/\s/, "whitespace charact +ers", 0 ], # # Note: $ must be \$, and - must be first to avoid range interpretat +ion. 
# [ qr/[-~`!@#\$%^&()_+={}\[\]\|\\:;"'<>,.?\/]/, "punctuation charac +ters", 0 ], ); #for my $c (split //, $slurpee) { print $c; } for my $case (@regexes) { say "seeding // with: $case->[0]"; "Aa5: " =~ $case->[0]; # seed the // iteration say "matched: '$&'" if $&; for (split //, $slurpee) { // and $case->[2]++; } } for my $case (@regexes) { printf("%4d %s\n", $case->[2], $case->[1]); +} $sum = reduce { $a + $b } (map $_->[2], @regexes); printf(" sum and length: %3d and %3d\n", $sum, $length); say "\nNow extract the string between HTML tags with //..."; my $str = "Before tag<i>between tags</i>after tag"; say "\n$str"; $str =~ s{ (?: (?<= \w) (?= <) \| (?<= >) (?= \w) ) }{ }xg; # insert + whitespace say $str; my @tokens = split / /, $str; say "Tokens...\n"; for (@tokens) { say }; my $between; for (@tokens) { if (/<\w>/../<\/\w>/) { $between .= "$_ " unless // and $&; } } chop $between if $between; say "'$between'"; $str = "\n'Before tag<i>between tags</i>after tag'"; say $str; say "Parse it again with..."; my $regex = qr/ (<\w+>) (.) (<\/\w+>) /x; say $regex; $str =~ $regex; say "\$1: '$1'"; say "\$2: '$2'"; say "\$3: '$3'"; exit(0); __END__ Last night I dreamt I went to Manderley again. This will come as a sur +prise to Daphne since she did not write these lines. Here is a line containing + stuff ,?- ! : that should/must be deleted/// ; : ! before using it as a o +ne-time-pad. A one-time-pad should contain only characters, no punctuation, no par +entheticals like (this is bogus) or [(this is bogus, too)], or {also +this}; no contractions, such as I'll or it's or digits such as 0, 123, -75 or 8 P.M., and no numbers, +such as $1,234.69. If you want to use numbers in your message, spell them out; one-hundred d +ollars and sixty-nine cents, or theeepm. These non-alpha characters +in the one-time-pad will be discarded, but they must be entered eactl +y as represented in the book used as the pad. Let the encoding progr +am decide what to use and what to skip. 
Some of the text is from "Rebecca", an out-of copyright but not out-of +-print fictional work that can be freely downloaded as an eBook from Project Gutenberg. + I use it as the raw source for one-time pads in a cryptologic research study; i.e., ex +tract potential pad bits from somewhere in the text, randomly chosen with seek from EO +F. Munge the characters, encrypt the message and delete the characters used for the + pad. Since both encoder and decoder use the same seek expression, both pads are guaran +teed to be identical, and since the characters used to create the pad are deleted +, never to be seen again, the pad is guaranteed to be used exactly once. Does not scale f +or large organizations but works flawlessly for a small group of conspirators. [download] O U T P U T seeding // with: (?^u:A-Z) matched: 'A' seeding // with: (?^u:a-z) matched: 'a' seeding // with: (?^u:\d) matched: '5' seeding // with: (?^u:\s) matched: ' ' seeding // with: (?^u:[-~`!@#\$%^&()_+={}\\\|\\:;"'<>,.?/]) matched: ':' 26 uppercase characters 1168 lowercase characters 13 digits 283 whitespace characters 80 punctuation characters sum and length: 1570 and 1570 Now extract the string between HTML tags with //... Before tag<i>between tags</i>after tag Before tag <i> between tags </i> after tag Tokens... Before tag <i> between tags </i> after tag 'between tags' 'Before tag<i>between tags</i>after tag' Parse it again with... (?^ux: (<\w+>) (.) (</\w+>) ) $1: '<i>' $2: 'between tags' $3: '</i>'	[reply] [d/l]
Re^3: Empty pattern in regex [updated] by jo37 (Curate) on Oct 28, 2023 at 10:40 UTC
Hello perlboy_emeritus,

to be more explicit in this issue, I do not only think it's a bug, I am absolutely convinced it is.

Some remarks:

- Having a workaround for a bug does in no way mean it is not a bug.
- Using `$&` in this scenario is dangerous, as it is affected by the very same bug. See extended example below.
- I cannot find anything in your code that would trigger the bug. This is fine and TIMTOWTDI.
- perlop is very precise in "The empty pattern `//`":

> If the PATTERN evaluates to the empty string, the last successfully matched regular expression is used instead. (...) If no match has previously succeeded, this will (silently) act instead as a genuine empty pattern (which will always match). (...)

As you can see from my example, `//` does not behave as described if there was a successful match and there happens a jump out of an inner block where `//` was applied. This clears `$&` and resets `//` to the genuine empty pattern.

    #!/usr/bin/perl
    use v5.24;
    use warnings;

    for my $label ('inner', 'outer') {
        say "goto $label";
        for ('c' .. 'g') {
            say "loop: $_";
            say "/d/ matched" if /d/;
            say "\$&: '$&'" if defined $&;
            {
                goto $label unless //;
                say "// matched";
                inner:
            }
            outer:
        }
        say '';
    }
    __DATA__
    goto inner
    loop: c
    // matched
    loop: d
    /d/ matched
    $&: 'd'
    // matched
    loop: e
    $&: 'd'
    loop: f
    $&: 'd'
    loop: g
    $&: 'd'

    goto outer
    loop: c
    // matched
    loop: d
    /d/ matched
    $&: 'd'
    // matched
    loop: e
    $&: 'd'
    loop: f
    // matched
    loop: g
    // matched

Greetings, -jo

`$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$`
Re: Empty pattern in regex by jo37 (Curate) on Oct 30, 2023 at 20:08 UTC
Filed a bug report. Greetings, -jo `$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$`
Re: Empty pattern in regex by perlboy_emeritus (Scribe) on Oct 19, 2023 at 00:15 UTC
Per that perlop discussion I wrapped // in quotes with 'm' and tried:

`perl -le 'print for a .. z' | perl -nle 'if (/d/ .. /h/) { next unless "m//"; print }'`

and got:

    % perl -le 'print for a .. z' | perl -nle 'if (/d/ .. /h/) { next unless "m//"; print }'
    d
    e
    f
    g
    h

And then I pedantically did:

    for my $c ( 'a'..'z') {
        next unless ($c =~ /[d-h]/);
        say $& if $&;
    }

and got:

    d
    e
    f
    g
    h

I guess I don't understand. Isn't 'd e f g h' what is expected? I've never really trusted one-liners. Brian Foy wrote an interesting piece on SO, to wit: https://stackoverflow.com/questions/22652393/regex-1-variable-reset except his example using //; did not work for me. He expected all vars to be cleared but when I ran his code:

    # The regex capture variables are only reset on the next successful match.
    # This way, Perl saves a lot of time by not affecting variables when matches
    # fail. As such, only use those variables with a guard, to wit:
    #     if ( /abc/ ) {   # this tests for /abc/ success and now it's OK to use $&
    #         ...
    #     }
    # Here's an extended demonstration, with a special surprise at the end:

    say "First long example...\n";
    $_ = 'Hello Perl';
    print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";

    # successful match
    /(P)(erl)/;
    print "First match\n";
    print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";

    # unsuccessful match
    /(P)(ython)/;
    print "Failed capture\n";
    print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";

    # successful match again
    /(Pe)(r)(l)/;
    print "Three captures\n";
    print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";

    # successful match, fewer captures
    /(Perl)/;
    print "One capture\n";
    print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";

    # successful match, no captures
    /Perl/;
    print "No captures\n";
    print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";

    # successful match, no pattern, special case
    //;
    print "No nothing\n";
    print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";

I got:

    $1:
    $2:
    $3:
    $&:

    First match
    $1: P
    $2: erl
    $3:
    $&: Perl

    Failed capture
    $1: P
    $2: erl
    $3:
    $&: Perl

    Three captures
    $1: Pe
    $2: r
    $3: l
    $&: Perl

    One capture
    $1: Perl
    $2:
    $3:
    $&: Perl

    No captures
    $1:
    $2:
    $3:
    $&: Perl

    No nothing
    $1:
    $2:
    $3:
    $&: Perl

As you can see in 'No nothing', $& was not cleared for me as it was for him, as he reported in that piece. I don't trust using $n, $`, $& or $' unless I explicitly test for TRUE after the regex executes. Am I being overly paranoid?
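The guard recommended in the quoted comments can be illustrated with a short sketch. Both sub names (`first_capture`, `stale_capture_demo`) are hypothetical; the patterns are the ones from the demonstration above.

```perl
use strict;
use warnings;

# Hypothetical helper: return $1 only when the match actually succeeded.
sub first_capture {
    my ($text, $re) = @_;
    return $text =~ $re ? $1 : undef;
}

# Without a guard, a failed match leaves the earlier captures in place.
sub stale_capture_demo {
    my $text = 'Hello Perl';
    $text =~ /(P)(erl)/;      # succeeds, $1 is 'P'
    $text =~ /(P)(ython)/;    # fails, capture variables are not reset
    my $stale = $1;           # still 'P' from the first match
    return $stale;
}

print first_capture('Hello Perl', qr/(P)(erl)/), "\n";    # prints P
print stale_capture_demo(), "\n";                          # prints P
```

Testing the match result before reading `$1` or `$&`, as in `first_capture`, is the usual way to avoid acting on leftover values.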
Re^2: Empty pattern in regex by choroba (Cardinal) on Oct 19, 2023 at 12:46 UTC
> Per that perlop discussion I wrapped // in quotes with `m` and tried:

Which discussion? `unless "m//"` is the same as `unless "1"`, it's just a string.

`map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]`
Re^3: Empty pattern in regex by perlboy_emeritus (Scribe) on Oct 19, 2023 at 13:00 UTC
I tried "m//" as reported on your 'next' expression and got 'd e f g h', as expected. From my perlop on 5.36:

The empty pattern "//"

> If the PATTERN evaluates to the empty string, the last successfully matched regular expression is used instead. In this case, only the "g" and "c" flags on the empty pattern are honored; the other flags are taken from the original pattern. If no match has previously succeeded, this will (silently) act instead as a genuine empty pattern (which will always match).
>
> Note that it's possible to confuse Perl into thinking "//" (the empty regex) is really "//" (the defined-or operator). Perl is usually pretty good about this, but some pathological cases might trigger this, such as "$x///" (is that "($x) / (//)" or "$x // /"?) and "print $fh //" ("print $fh(//" or "print($fh //"?). In all of these examples, Perl will assume you meant defined-or. If you meant the empty regex, just use parentheses or spaces to disambiguate, or even prefix the empty regex with an "m" (so "//" becomes "m//").
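To make the difference between the quoted string `"m//"` and a real `m//` concrete, here is a minimal sketch based on the perlop passage above. The sub names are hypothetical, and the seed pattern `/b/` is chosen only for illustration.

```perl
use strict;
use warnings;

# A quoted "m//" is an ordinary non-empty string, so it is always true.
sub quoted_is_true {
    return "m//" ? 1 : 0;
}

# A real m// reuses the last successfully matched pattern, here /b/.
sub empty_match_after_b {
    my ($probe) = @_;
    'abc' =~ /b/ or die 'seed match unexpectedly failed';
    return $probe =~ m// ? 1 : 0;    # behaves like $probe =~ /b/
}

print quoted_is_true(), "\n";                # prints 1
print empty_match_after_b('xbx'), "\n";      # prints 1
print empty_match_after_b('xyz'), "\n";      # prints 0
```

The seed match and the `m//` sit in the same block with no jump in between, so this sketch stays clear of the block-exit behaviour discussed elsewhere in the thread.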
Re^4: Empty pattern in regex by choroba (Cardinal) on Oct 19, 2023 at 13:23 UTC
Re^5: Empty pattern in regex by perlboy_emeritus (Scribe) on Oct 19, 2023 at 14:37 UTC
Re: Empty pattern in regex by perlboy_emeritus (Scribe) on Oct 23, 2023 at 17:33 UTC
Hello choroba,

I don't like to give up without exhausting all avenues of research, and for me Perl is enjoyment and therapy (needed in this world we live in, and this age). This issue may now have dropped off the radars of the other participants, but not mine. jo37 came up with:

    perl -le 'print for a .. h' | perl -nle 'if (my $ff = (/d/ .. /h/)) { next unless $ff =~ /(?:^1|E0)$/; print }'
    d
    h

which troubles me because of that alternation and the absence of //, which I think is/was your point. Mine are, granted, after debugging with strategic print statements:

    perl -le 'print for a .. z' | perl -nle 'if (/d/ .. /h/) { next unless // and $_ eq $&; print; }'
    d
    h

or

    perl -le 'print for a .. z' | perl -nle 'if (/d/ .. /h/) { next unless $_ eq $& and //; print; }'
    d
    h

`and` is a short-circuit operator, so it works either way. Does this do what you expected it to do?

Regards, Will
Re^2: Empty pattern in regex by jo37 (Curate) on Oct 23, 2023 at 18:04 UTC
It was not my intention to cause any trouble. `$ff =~ /(?:^1|E0)$/` can be rewritten as `$ff == 1 || $ff =~ /E0$/`. When the second operand of the flip-flop operator becomes `true`, the return value gets an `E0` appended. This does not change the value in numeric context, as it is just one of its floating point representations. In string context it is distinguishable from all the other values, though. HTH

Greetings, -jo

`$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$`
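The sequence numbers described above can be inspected directly. The sketch below collects the flip-flop's return values for the letters `a` to `h`; the sub name `flipflop_values` is hypothetical and used only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the non-empty return values of /d/ .. /h/.
sub flipflop_values {
    my @values;
    for (@_) {
        my $ff = (/d/ .. /h/);    # '' outside the range, a sequence number inside
        push @values, $ff if $ff;
    }
    return @values;
}

my @seen = flipflop_values('a' .. 'h');
print "@seen\n";                                    # prints: 1 2 3 4 5E0
print "numerically five\n" if $seen[-1] == 5;       # "5E0" == 5 in numeric context
```

The last value compares equal to 5 as a number, while the trailing `E0` remains visible to a string match such as `/E0$/`.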
Re^2: Empty pattern in regex by choroba (Cardinal) on Oct 23, 2023 at 17:52 UTC
The `$_ eq $&` is an interesting idea. Note that it only works because of `-l`, otherwise `\n` would have been included in `$_` but not `$&`.

`map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]`
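The dependence on `-l` can be shown with a small sketch. The sub name `whole_line_matched` is hypothetical; the pattern `/d/` is the range start from the one-liners above.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if /d/ matched and the match is the whole line.
sub whole_line_matched {
    my ($line) = @_;
    return ($line =~ /d/ && $line eq $&) ? 1 : 0;
}

print whole_line_matched("d"),   "\n";    # prints 1: chomped line, as under -l
print whole_line_matched("d\n"), "\n";    # prints 0: the newline is in the line but not in $&
```

Without `-l` (or an explicit `chomp`), every input line would look like the second case, so the comparison would never succeed.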
Re: Empty pattern in regex by Anonymous Monk on Oct 19, 2023 at 09:17 UTC
It behaves correctly; it's a warning not to use it ;)