Re^2: Empty pattern in regex [updated]

Hello jo37,

Do you still think it's a bug even though we seem to be able to get what we want with some Perl trickery? You reproduced the issue choroba raised, but we did get the answer he expected. // has been around since at least 5.6: it is discussed in PP 3rd as well as 4th (5.14), unfortunately without examples. The example in perlop is clear but does not suggest, to me, a real use case.

I'm still looking for a definitive use case, or at least a realistic one. I've come up with two. The first might be used by a grammarian or linguist doing comparative language research. The second extracts the string between HTML tags, although I also show how to do this with a much simpler plain-old regex. Methinks it's a stretch to use // when there are other ways to do a thing, but TIMTOWTDI. One of my examples parses a string while the other uses an array; if (/this/../that/) {... almost demands an array.

I would really like to hear a war story or two about how // was used to solve some really gnarly problem. Here be my two examples:

#!/usr/bin/env -S perl -w
##!/usr/bin/env -S perl -wd

use v5.30.0;
use strict;
use List::AllUtils qw( reduce );

my ($slurpee, $length, $sum);
{
    local $/;
    ($slurpee) = <DATA>;
}
$length = length $slurpee;

my @regexes = (
    [ qr/[A-Z]/,                                   "uppercase characters",   0 ],
    [ qr/[a-z]/,                                   "lowercase characters",   0 ],
    [ qr/\d/,                                      "digits",                 0 ],
    [ qr/\s/,                                      "whitespace characters",  0 ],
#
#   Note: $ must be \$, and - must be first to avoid range interpretation.
#
    [ qr/[-~`!@#\$%^&*()_+={}\[\]|\\:;"'<>,.?\/]/, "punctuation characters", 0 ],
);

#for my $c (split //, $slurpee) { print $c; }

for my $case (@regexes) {
    say "seeding // with: $case->[0]";
    "Aa5: " =~ $case->[0];       # seed the // iteration
    say "matched: '$&'" if $&;
    for (split //, $slurpee) {
        // and $case->[2]++;
    }
}    
for my $case (@regexes) { printf("%4d %s\n", $case->[2], $case->[1]); }

$sum = reduce { $a + $b } (map $_->[2], @regexes);
printf(" sum and length: %3d and %3d\n", $sum, $length);

say "\nNow extract the string between HTML tags with //...";
my $str = "Before tag<i>between tags</i>after tag";
say "\n$str";
$str =~ s{ (?: (?<= \w) (?= <) | (?<= >) (?= \w) ) }{ }xg;    # insert whitespace
say $str;
my @tokens = split / /, $str;
say "Tokens...\n";
for (@tokens) { say };

my $between;
for (@tokens) {
    if (/<\w>/../<\/\w>/) {
        $between .= "$_ " unless // and $&;
    }
}
chop $between if $between;
say "'$between'";

$str = "\n'Before tag<i>between tags</i>after tag'";
say $str;
say "Parse it again with...";
my $regex = qr/ (<\w+>) (.*) (<\/\w+>) /x;
say $regex;
$str =~ $regex;
say "\$1: '$1'";
say "\$2: '$2'";
say "\$3: '$3'";

exit(0);
__END__
Last night I dreamt I went to Manderley again. This will come as a surprise to
Daphne since she did not write these lines.  Here is a line containing stuff
   ,?- ! : that should/must be deleted/// ; : ! before using it as a one-time-pad.
A one-time-pad should contain only characters, no  punctuation, no parentheticals like (this is bogus) or [(this is bogus, too)], or {also this}; no contractions, such as
I'll or it's or digits such as 0, 123, -75 or 8 P.M., and no numbers, such as $1,234.69.  If
you want to use numbers in your message, spell them out; one-hundred dollars and sixty-nine cents, or theeepm.  These non-alpha characters in the one-time-pad will be discarded, but they must be entered eactly as represented in the book used as the pad.  Let the encoding program decide what to use and what to skip.

Some of the text is from "Rebecca", an out-of copyright but not out-of-print fictional
work that can be freely downloaded as an eBook from Project Gutenberg. I use it as the
raw source for one-time pads in a cryptologic research study; i.e., extract potential
pad bits from somewhere in the text, randomly chosen with seek from EOF. Munge the
characters, encrypt the message and delete the characters used for the pad. Since both
encoder and decoder use the same seek expression, both pads are guaranteed to be
identical, and since the characters used to create the pad are deleted, never to be seen
again, the pad is guaranteed to be used exactly once. Does not scale for large
organizations but works flawlessly for a small group of conspirators.

O U T P U T

  seeding // with: (?^u:[A-Z])
  matched: 'A'
  seeding // with: (?^u:[a-z])
  matched: 'a'
  seeding // with: (?^u:\d)
  matched: '5'
  seeding // with: (?^u:\s)
  matched: ' '
  seeding // with: (?^u:[-~`!@#\$%^&*()_+={}\[\]|\\:;"'<>,.?/])
  matched: ':'
    26 uppercase characters
  1168 lowercase characters
    13 digits
   283 whitespace characters
    80 punctuation characters
   sum and length: 1570 and 1570

  Now extract the string between HTML tags with //...

  Before tag<i>between tags</i>after tag
  Before tag <i> between tags </i> after tag

  Tokens...

  Before
  tag
  <i>
  between
  tags
  </i>
  after
  tag
  'between tags'

  'Before tag<i>between tags</i>after tag'
  Parse it again with...
  (?^ux: (<\w+>) (.*) (</\w+>) )
  $1: '<i>'
  $2: 'between tags'
  $3: '</i>'
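
Boiled down, the seeding trick in the first example looks roughly like the sketch below. The sub name count_via_empty_pattern and its arguments are made up for illustration and are not part of the script above. It relies on the perlop behaviour discussed in this thread (a bare // reuses the last successfully matched pattern) and on the documented special case that split // always uses the genuine empty pattern.

```perl
use strict;
use warnings;

# Hypothetical helper (name and arguments are illustrative only):
# count the characters of $text that match $seed_re, using a bare //
# instead of repeating the pattern.
sub count_via_empty_pattern {
    my ($seed_re, $seed_text, $text) = @_;

    # Seed: a successful match makes $seed_re the "last successful pattern".
    $seed_text =~ $seed_re
        or die "seed text must match the seed pattern\n";

    my $count = 0;
    for (split //, $text) {    # inside split, // is the genuine empty pattern
        $count++ if //;        # here, // reuses the seeded pattern against $_
    }
    return $count;
}

print count_via_empty_pattern(qr/\d/, "5", "a1b22"), "\n";
```

If the seed match fails, // would silently fall back to whatever matched earlier (or to the always-matching empty pattern), which is why the sketch dies instead of carrying on.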

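For the plain-regex route, one possible refinement is a non-greedy capture plus a backreference, so that the closing tag has to match the opening one and a second tag pair later in the string is not swallowed. This is only a sketch for simple, non-nested tags without attributes; the sub name between_tags is invented here, and a real HTML parser is the safer tool for anything more complex.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the script above): return the text between
# the first pair of matching simple tags, or undef if there is none.
sub between_tags {
    my ($html) = @_;
    # <(\w+)> captures the tag name, \1 requires the same name in the
    # closing tag, and .*? stops at the first such closing tag.
    return $html =~ m{<(\w+)>(.*?)</\1>}s ? $2 : undef;
}

print between_tags("Before tag<i>between tags</i>after tag"), "\n";
```

With the greedy (.*) from the script, a string such as "a<b>x</b>y<b>z</b>" would capture everything up to the last closing tag; the non-greedy form stops at the first one.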

Replies are listed 'Best First'.

Re^3: Empty pattern in regex [updated]
by jo37 (Curate) on Oct 28, 2023 at 10:40 UTC

Hello perlboy_emeritus,

to be more explicit on this issue: I do not merely think it's a bug, I am absolutely convinced it is. Some remarks:

Having a workaround for a bug in no way means it is not a bug.
Using $& in this scenario is dangerous, as it is affected by the very same bug. See the extended example below.
I cannot find anything in your code that would trigger the bug. This is fine and TIMTOWTDI.
perlop is very precise in "The empty pattern //":
If the *PATTERN* evaluates to the empty string, the last *successfully* matched regular expression is used instead. (...) If no match has previously succeeded, this will (silently) act instead as a genuine empty pattern (which will always match). (...)
As you can see from my example, // does not behave as described if there was a successful match and a jump then leaves an inner block where // was applied. This clears $& and resets // to the genuine empty pattern.

#!/usr/bin/perl

use v5.24;
use warnings;

for my $label ('inner', 'outer') {
    say "goto $label";
    for ('c' .. 'g') {
        say "loop: $_";
        say "/d/ matched" if /d/;
        say "\$&: '$&'" if defined $&;
        {
            goto $label unless //;
            say "// matched";
            inner:
        }
        outer:
    }
    say '';
}
__DATA__
goto inner
loop: c
// matched
loop: d
/d/ matched
$&: 'd'
// matched
loop: e
$&: 'd'
loop: f
$&: 'd'
loop: g
$&: 'd'

goto outer
loop: c
// matched
loop: d
/d/ matched
$&: 'd'
// matched
loop: e
$&: 'd'
loop: f
// matched
loop: g
// matched
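
For code that must not depend on this interpreter state at all, one option is to track the last successfully matched pattern yourself. The sketch below is an assumption-laden illustration, not what perl does internally: make_matcher and its calling convention are invented names, and it merely imitates the perlop wording quoted above (reuse the last pattern that matched, or always match if none has matched yet).

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that remembers the last pattern
# that matched successfully, independent of blocks, goto, or $&.
sub make_matcher {
    my $last_ok;    # most recent qr// that matched; undef until one does
    return sub {
        my ($text, $re) = @_;
        if (defined $re) {                 # explicit pattern given
            my $ok = $text =~ $re;
            $last_ok = $re if $ok;         # remember it only on success
            return $ok ? 1 : 0;
        }
        return 1 unless defined $last_ok;  # "genuine empty pattern": always matches
        return $text =~ $last_ok ? 1 : 0;  # reuse the last successful pattern
    };
}

my $match = make_matcher();
for my $c ('c' .. 'g') {
    $match->($c, qr/d/);    # plays the role of /d/ in the loop above
    print "$c: ", ($match->($c) ? "matched" : "no match"), "\n";
}
```

Because the state lives in a lexical captured by the closure, leaving a block with goto has no effect on it, so both loop variants above would report the same result for every letter.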

Greetings,
-jo

$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$
