Separate from the question of handling UTF-8 source code, here are some comments on the regexes.

... use "§" as delimiter ... because I use REGEX on written text and I've found, that nearly any character including brackets will be able to be included in the text ...

But the s/// or m// delimiter will not clash with any character in the "bound" text variable, nor with any character in an interpolated qr// regex object or plain string:

c:\@Work\Perl\monks>perl -wMstrict -le
"my $text = 'foo/bar/baz/boff zip/zit/zot/zap';
 print qq{'$text'};
 ;;
 my $regex_object = qr{ /bar/baz/ }xms;
 my $plain_string = '/zit/zot/';
 ;;
 $text =~ s/ $regex_object | $plain_string /OTHER/xmsg;
 print qq{'$text'};
"
'foo/bar/baz/boff zip/zit/zot/zap'
'fooOTHERboff zipOTHERzap'

(However, note that interpolation of plain strings is problematic if they may contain regex metacharacters; for this, see quotemeta and the  \Q...\E interpolation modifiers.)
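
To make that caveat concrete, here is a minimal sketch of the difference between raw and quoted interpolation. The variable names and sample strings are invented for illustration; they are not from the original question.

```perl
use strict;
use warnings;

# Hypothetical data: the plain string contains regex metacharacters.
my $plain_string = 'a.b (c)';
my $text         = 'axb c / a.b (c)';

# Unquoted: '.' matches any character and '(c)' is a capture group,
# so the pattern matches 'axb c' at the start of the text.
(my $unquoted = $text) =~ s/$plain_string/OTHER/;

# Quoted with \Q...\E: every metacharacter is matched literally.
(my $quoted = $text) =~ s/\Q$plain_string\E/OTHER/;

# quotemeta() applies the same escaping ahead of time.
my $escaped = quotemeta $plain_string;
(my $also_quoted = $text) =~ s/$escaped/OTHER/;

print "'$unquoted'\n";      # 'OTHER / a.b (c)'
print "'$quoted'\n";        # 'axb c / OTHER'
print "'$also_quoted'\n";   # 'axb c / OTHER'
```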

The use of () {} [] <> as balanced regex delimiters is useful because delimiter characters within the regex pattern are handled properly (within reason: unescaped delimiter characters within the regex pattern must always be strictly balanced, so [{}] would have worked in the example below):

c:\@Work\Perl\monks>perl -wMstrict -le
"my $text = 'foo {bar} baz { whiz } boff';
 print qq{A: '$text'};
 ;;
 $text =~ s{ { \s* \w+ \s* } }{OTHER}xmsg;
 print qq{ '$text'};
 ;;
 $text = 'abc {tuvw} de { xyz } fghi';
 print qq{B: '$text'};
 ;;
 $text =~ s{ [\}\{] \s* \w+ \s* [\}\{] }{OTHER}xmsg;
 print qq{ '$text'};
"
A: 'foo {bar} baz { whiz } boff'
 'foo OTHER baz OTHER boff'
B: 'abc {tuvw} de { xyz } fghi'
 'abc OTHER de OTHER fghi'
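
As a small sketch of the balanced case, the character class can also be written unescaped as [{}], because the brace pair nests properly inside the s{}{} delimiters. The sample string is the one from case B above.

```perl
use strict;
use warnings;

my $text = 'abc {tuvw} de { xyz } fghi';

# The unescaped '{' and '}' inside each character class are balanced,
# so the parser still finds the correct end of the s{...}{...} pattern.
$text =~ s{ [{}] \s* \w+ \s* [{}] }{OTHER}xmsg;

print "'$text'\n";   # 'abc OTHER de OTHER fghi'
```

Writing the class in the other order, [}{], would need the backslashes, since the first unescaped } would be taken as the closing delimiter.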

do {
    $foundstring =~ s§(<a |\[)([^<>\"]*)(<span class=\"foundterm\">)~ +~([^~]+)~~(</span>)§$1$2$4§igs;
} while $foundstring =~ m§(<a |\[) +([^<>\"]*)(<span class=\"foundterm\">)~~([^~]+)~~(</span>)§is;

Doing a substitution that is dependent on a separate, identical  m// match in this way is redundant because the  s/// replacement will only occur if its own match is successful, and the  /g modifier will cause all matches to be replaced:

c:\@Work\Perl\monks>perl -wMstrict -le
"my $text = '123 abc 456 de 789 fghi 321';
 print qq{A: '$text'};
 ;;
 do {
     printf 'running s/// -> ';
     $text =~ s{ [a-z]+ }{OTHER}xmsg;
     print qq{'$text'};
     } while $text =~ m{ [a-z]+ }xms;
 print qq{done: '$text'};
 ;;
 $text = '123 rs 456 tuvw 789 xyz 321';
 print qq{B: '$text'};
 ;;
 $text =~ s{ [a-z]+ }{OTHER}xmsg;
 print qq{ '$text'};
"
A: '123 abc 456 de 789 fghi 321'
running s/// -> '123 OTHER 456 OTHER 789 OTHER 321'
done: '123 OTHER 456 OTHER 789 OTHER 321'
B: '123 rs 456 tuvw 789 xyz 321'
 '123 OTHER 456 OTHER 789 OTHER 321'

In case A, the while-loop and substitution only run once because the  /g modifier of the  s/// causes anything that could match to be replaced. In case B, the same result is achieved with no separate  m// match.

Update: One sometimes sees something like
    my $match = qr{ ... }xms;
    $string =~ s{ $match }{replace}xms if $string =~ m{ $match }xms;
as a variation on this theme. Again, the substitution will only occur if the  $match pattern matches, so the separate  m// on the same pattern is redundant.
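
If the separate m// is there only to find out whether anything was replaced, the return value of s/// already answers that: it is the number of substitutions made, or false if there were none. A minimal sketch, with a made-up pattern and made-up strings:

```perl
use strict;
use warnings;

my $match = qr{ [a-z]+ }xms;   # hypothetical pattern

my @report;
for my $string ('123 abc 456 de', '123 456') {
    my $copy = $string;

    # s///g returns the number of substitutions made (false if none),
    # so no separate m// test on the same pattern is needed.
    my $count = $copy =~ s{ $match }{OTHER}xmsg;

    push @report, $count ? "replaced $count: '$copy'"
                         : "no match: '$copy'";
}

print "$_\n" for @report;
# replaced 2: '123 OTHER 456 OTHER'
# no match: '123 456'
```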


Give a man a fish:  <%-{-{-{-<


In reply to Re^2: Regex delimiter by AnomalousMonk
in thread Regex delimiter by Outaspace
