
Separate from the question of handling UTF-8 source code, here are some comments on the regexes.

... use "§" as delimiter ... because I use REGEX on written text and I've found, that nearly any character including brackets will be able to be included in the text ...

But the s/// or m// delimiter will not clash with any character in the "bound" text variable, nor with any character in an interpolated qr// regex object or plain string:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $text = 'foo/bar/baz/boff zip/zit/zot/zap';
     print qq{'$text'};
     ;;
     my $regex_object = qr{ /bar/baz/ }xms;
     my $plain_string = '/zit/zot/';
     ;;
     $text =~ s/ $regex_object | $plain_string /OTHER/xmsg;
     print qq{'$text'};
    "
    'foo/bar/baz/boff zip/zit/zot/zap'
    'fooOTHERboff zipOTHERzap'
(However, note that interpolation of plain strings is problematic if they may contain regex metacharacters; for this, see quotemeta and the  \Q...\E interpolation modifiers.)
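
As a minimal sketch of that caveat (the sample text and the variable names below are made up for illustration, not taken from the original question), compare interpolating a plain string as-is with interpolating it through \Q...\E:

```perl
use strict;
use warnings;

# Hypothetical sample data: the plain string contains the
# regex metacharacter '.'.
my $text         = 'a.c abc axc';
my $plain_string = 'a.c';

# Interpolated as-is, '.' matches any character, so 'abc' and
# 'axc' are replaced as well as the literal 'a.c'.
(my $unescaped = $text) =~ s/$plain_string/OTHER/g;

# \Q...\E escapes metacharacters in the interpolated string,
# so only the literal 'a.c' is replaced.
(my $escaped = $text) =~ s/\Q$plain_string\E/OTHER/g;

# quotemeta does the same escaping as a function call.
my $quoted = quotemeta $plain_string;

print "'$unescaped'\n";
print "'$escaped'\n";
print "'$quoted'\n";
```

Here the unescaped version replaces all three words, while the \Q...\E version replaces only the first.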

The use of () {} [] <> as balanced regex delimiters is useful because balanced delimiter characters within the regex pattern are handled properly (within reason: unescaped delimiter characters within the regex pattern must always be strictly balanced, even inside character classes, so [{}] would also have worked in the example below):

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $text = 'foo {bar} baz { whiz } boff';
     print qq{A: '$text'};
     ;;
     $text =~ s{ { \s* \w+ \s* } }{OTHER}xmsg;
     print qq{ '$text'};
     ;;
     $text = 'abc {tuvw} de { xyz } fghi';
     print qq{B: '$text'};
     ;;
     $text =~ s{ [\}\{] \s* \w+ \s* [\}\{] }{OTHER}xmsg;
     print qq{ '$text'};
    "
    A: 'foo {bar} baz { whiz } boff'
     'foo OTHER baz OTHER boff'
    B: 'abc {tuvw} de { xyz } fghi'
     'abc OTHER de OTHER fghi'

    do {
        $foundstring =~ s§(<a |\[)([^<>\"]*)(<span class=\"foundterm\">)~ +~([^~]+)~~(</span>)§$1$2$4§igs;
    } while $foundstring =~ m§(<a |\[) +([^<>\"]*)(<span class=\"foundterm\">)~~([^~]+)~~(</span>)§is;

Doing a substitution that is dependent on a separate, identical  m// match in this way is redundant because the  s/// replacement will only occur if its own match is successful, and the  /g modifier will cause all matches to be replaced:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $text = '123 abc 456 de 789 fghi 321';
     print qq{A: '$text'};
     ;;
     do {
         printf 'running s/// -> ';
         $text =~ s{ [a-z]+ }{OTHER}xmsg;
         print qq{'$text'};
     } while $text =~ m{ [a-z]+ }xms;
     print qq{done: '$text'};
     ;;
     $text = '123 rs 456 tuvw 789 xyz 321';
     print qq{B: '$text'};
     ;;
     $text =~ s{ [a-z]+ }{OTHER}xmsg;
     print qq{ '$text'};
    "
    A: '123 abc 456 de 789 fghi 321'
    running s/// -> '123 OTHER 456 OTHER 789 OTHER 321'
    done: '123 OTHER 456 OTHER 789 OTHER 321'
    B: '123 rs 456 tuvw 789 xyz 321'
     '123 OTHER 456 OTHER 789 OTHER 321'
In case A, the while-loop and substitution only run once because the  /g modifier of the  s/// causes anything that could match to be replaced. In case B, the same result is achieved with no separate  m// match.

Update: One sometimes sees something like
    my $match = qr{ ... }xms;
    $string =~ s{ $match }{replace}xms if $string =~ m{ $match }xms;
as a variation on this theme. Again, the substitution will only occur if the  $match pattern matches, so the separate  m// on the same pattern is redundant.
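
A small sketch of why the guard adds nothing, assuming a made-up pattern and made-up test strings: s/// already reports whether it did anything, returning the number of substitutions made, or a false value when the pattern did not match.

```perl
use strict;
use warnings;

# Hypothetical pattern, for illustration only.
my $match = qr{ [a-z]+ }xms;

# Guarded form: the pattern is tried twice.
my $guarded = '123 abc 456';
$guarded =~ s{ $match }{replace}xms if $guarded =~ m{ $match }xms;

# Unguarded form: one attempt, same result. The return value
# is the number of substitutions made.
my $unguarded = '123 abc 456';
my $count = $unguarded =~ s{ $match }{replace}xms;

# No match: the string is left untouched and s/// returns false.
my $untouched = '123 456';
my $none = $untouched =~ s{ $match }{replace}xms;

print "'$guarded' '$unguarded' count=$count\n";
print "'$untouched' replaced=", ($none ? 'yes' : 'no'), "\n";
```

If some action must depend on whether a replacement happened, the return value of the s/// itself can be tested, e.g. `if ($string =~ s{ $match }{replace}xms) { ... }`.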


Give a man a fish:  <%-{-{-{-<

Re^3: Regex delimiter
by toohoo (Beadle) on Jun 14, 2019 at 06:57 UTC

    Hello AnomalousMonk,

    Great answer with helpful hints; this is the best way to help others. Thank you very much.

    I had some reasons for writing it with this redundancy, but discussing them here would take too long. One was uncertainty about nested matches and the wish to process them one after the other. I was aware of the redundancy.

    So thanks again and have a nice day