in reply to Re: Empty pattern in regex [updated]
in thread Empty pattern in regex
Hello jo37,
Do you still think it's a bug, even though we seem to be able to get what we want with some Perl trickery? You duplicated the issue choroba raised, but we did get the answer he expected.

The empty pattern // has been around since at least 5.6: it is discussed in PP 3rd ed. as well as the 4th (5.14), unfortunately without examples. The example in perlop is clear, but to me it does not suggest a real use case. I'm still looking for a definitive use case, or at least a realistic one.

I've come up with two. The first might be used by a grammarian or linguist researching comparative languages. The second extracts the string between HTML tags, although I also show how to do this with a much simpler plain-old regex. Methinks it's a stretch to use // when there are other ways to do a thing, but TMTOWTDI. One of my examples parses a string while the other uses an array; if (/this/../that/) {... almost demands an array.

I would really like to hear a war story or two about how // was used to solve some really gnarly problem. Here be my two examples:
#!/usr/bin/env -S perl -w
##!/usr/bin/env -S perl -wd
use v5.30.0;
use strict;
use List::AllUtils qw( reduce );

my ($slurpee, $length, $sum);
{
    local $/;
    ($slurpee) = <DATA>;
}
$length = length $slurpee;

my @regexes = (
    [ qr/[A-Z]/, "uppercase characters",  0 ],
    [ qr/[a-z]/, "lowercase characters",  0 ],
    [ qr/\d/,    "digits",                0 ],
    [ qr/\s/,    "whitespace characters", 0 ],
    #
    # Note: $ must be \$, and - must be first to avoid range interpretation.
    #
    [ qr/[-~`!@#\$%^&*()_+={}\[\]|\\:;"'<>,.?\/]/, "punctuation characters", 0 ],
);

#for my $c (split //, $slurpee) { print $c; }
for my $case (@regexes) {
    say "seeding // with: $case->[0]";
    "Aa5: " =~ $case->[0];    # seed the // iteration
    say "matched: '$&'" if $&;
    for (split //, $slurpee) {
        // and $case->[2]++;
    }
}

for my $case (@regexes) { printf("%4d %s\n", $case->[2], $case->[1]); }
$sum = reduce { $a + $b } (map $_->[2], @regexes);
printf(" sum and length: %3d and %3d\n", $sum, $length);

say "\nNow extract the string between HTML tags with //...";
my $str = "Before tag<i>between tags</i>after tag";
say "\n$str";
$str =~ s{ (?: (?<= \w) (?= <) | (?<= >) (?= \w) ) }{ }xg;    # insert whitespace
say $str;
my @tokens = split / /, $str;
say "Tokens...\n";
for (@tokens) { say };

my $between;
for (@tokens) {
    if (/<\w>/../<\/\w>/) {
        $between .= "$_ " unless // and $&;
    }
}
chop $between if $between;
say "'$between'";

$str = "\n'Before tag<i>between tags</i>after tag'";
say $str;
say "Parse it again with...";
my $regex = qr/ (<\w+>) (.*) (<\/\w+>) /x;
say $regex;
$str =~ $regex;
say "\$1: '$1'";
say "\$2: '$2'";
say "\$3: '$3'";

exit(0);

__END__
Last night I dreamt I went to Manderley again. This will come as a surprise to Daphne since she did not write these lines.
Here is a line containing stuff ,?- ! : that should/must be deleted/// ; : ! before using it as a one-time-pad.
A one-time-pad should contain only characters, no punctuation, no parentheticals like (this is bogus) or [(this is bogus, too)], or {also this};
no contractions, such as I'll or it's or digits such as 0, 123, -75 or 8 P.M., and no numbers, such as $1,234.69.
If you want to use numbers in your message, spell them out; one-hundred dollars and sixty-nine cents, or theeepm.
These non-alpha characters in the one-time-pad will be discarded, but they must be entered eactly as represented in the book used as the pad.
Let the encoding program decide what to use and what to skip.
Some of the text is from "Rebecca", an out-of copyright but not out-of-print fictional work
that can be freely downloaded as an eBook from Project Gutenberg. I use it as the raw source
for one-time pads in a cryptologic research study; i.e., extract potential pad bits
from somewhere in the text, randomly chosen with seek from EOF. Munge the characters,
encrypt the message and delete the characters used for the pad.
Since both encoder and decoder use the same seek expression, both pads are guaranteed
to be identical, and since the characters used to create the pad are deleted, never to
be seen again, the pad is guaranteed to be used exactly once. Does not scale for large
organizations but works flawlessly for a small group of conspirators.
O U T P U T
seeding // with: (?^u:[A-Z])
matched: 'A'
seeding // with: (?^u:[a-z])
matched: 'a'
seeding // with: (?^u:\d)
matched: '5'
seeding // with: (?^u:\s)
matched: ' '
seeding // with: (?^u:[-~`!@#\$%^&*()_+={}\[\]|\\:;"'<>,.?/])
matched: ':'
  26 uppercase characters
1168 lowercase characters
  13 digits
 283 whitespace characters
  80 punctuation characters
 sum and length: 1570 and 1570

Now extract the string between HTML tags with //...

Before tag<i>between tags</i>after tag
Before tag <i> between tags </i> after tag
Tokens...

Before
tag
<i>
between
tags
</i>
after
tag
'between tags'

'Before tag<i>between tags</i>after tag'
Parse it again with...
(?^ux: (<\w+>) (.*) (</\w+>) )
$1: '<i>'
$2: 'between tags'
$3: '</i>'
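For anyone skimming the thread, here is a stripped-down sketch of the seeding trick the first example relies on. It is only an illustration, not a recommendation: the helper name count_with_empty_pattern and the sample text are made up for this note. It assumes the behaviour documented in perlop, namely that an empty pattern in a match reuses the last successfully matched regex, while split // is a special case that splits between characters.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): count the characters
# of $text that match $seed_re, letting the empty pattern // reuse the
# last successfully matched regex.
sub count_with_empty_pattern {
    my ($seed_re, $seed_str, $text) = @_;

    # The seed match must succeed; otherwise // would fall back to
    # whatever pattern happened to match last.
    $seed_str =~ $seed_re
        or die "seed string must match the seed regex";

    my $count = 0;
    # split // does NOT reuse the last pattern; it splits into characters.
    for my $char (split //, $text) {
        $count++ if $char =~ //;    # empty pattern: reuses $seed_re
    }
    return $count;
}

my $text = "Abc 123, Xyz!";
printf "%d uppercase\n", count_with_empty_pattern(qr/[A-Z]/, "A", $text);
printf "%d digits\n",    count_with_empty_pattern(qr/\d/,    "5", $text);
```

The explicit die on a failed seed is the one design choice worth noting: without it, a seed that silently fails leaves // pointing at some earlier pattern, which is exactly the kind of action at a distance that makes this feature hard to recommend.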
Re^3: Empty pattern in regex [updated]
by jo37 (Curate) on Oct 28, 2023 at 10:40 UTC