Regex delimiter

Outaspace has asked for the wisdom of the Perl Monks concerning the following question:

Re: Regex delimiter by davido (Cardinal) on Jun 26, 2007 at 19:50 UTC
For one thing, when you use non-standard delimiters, you need to preface them with `m`. In other words, you can use `/regex/`, `m/regex/`, or `m#regex#`, but you cannot use `#regex#` without the leading `m`.
Also, perlop gives the following explanation of which characters are permissible: "If "/" is the delimiter then the initial m is optional. With the m you can use any pair of non-alphanumeric, non-whitespace characters as delimiters. This is particularly useful for matching path names that contain "/", to avoid LTS (leaning toothpick syndrome). If "?" is the delimiter, then the match-only-once rule of ?PATTERN? applies. If "'" is the delimiter, no interpolation is performed on the PATTERN."
So to reiterate: in your example you're using 'a' as the delimiter, which is not allowed according to perlop. The documentation explains that the delimiter must be non-alphanumeric and non-whitespace.
Dave
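To make the perlop rules quoted above concrete, here is a minimal sketch. The variable names and sample strings are made up for illustration; only the delimiter forms themselves come from the post.

```perl
use strict;
use warnings;

my $path = '/usr/local/bin/perl';    # hypothetical sample data

# Default delimiter: the leading m is optional, but every slash in the
# pattern has to be escaped (leaning toothpick syndrome).
my $with_slashes = ($path =~ /^\/usr\/local\//) ? 1 : 0;

# Other delimiters need the leading m and avoid the escaping.
my $with_hash   = ($path =~ m#^/usr/local/#) ? 1 : 0;
my $with_braces = ($path =~ m{^/usr/local/}) ? 1 : 0;

# With single quotes as the delimiter nothing is interpolated, so the
# "@example" below is matched literally instead of being read as an array.
my $literal_at = ('user@example.com' =~ m'user@example\.com') ? 1 : 0;

print "slashes: $with_slashes, hash: $with_hash, ",
      "braces: $with_braces, single quotes: $literal_at\n";
```

All four matches test the same idea in different spellings, so picking a delimiter that does not occur in the pattern is purely a readability choice.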
Re^2: Regex delimiter by jwkrahn (Abbot) on Jun 26, 2007 at 20:17 UTC
The documentation is wrong in this case; you can use alphanumeric delimiters:
`$ perl -le'$_ = q[bcdefg]; print $1 if m x..(..)..x;'` prints `de`
`$ perl -le'$_ = q[bcdefg]; print $1 if m 7..(..)..7;'` prints `de`
Re^3: Regex delimiter by davido (Cardinal) on Jun 27, 2007 at 04:25 UTC
The documentation and reality are at odds with each other, it seems. However, I think there's probably some wisdom in sticking with documented behavior for code that needs to be robust, until the behavior that defies documentation is reconciled. You never know which way the pendulum will swing when they sort it out. Dave
Re: Regex delimiter by akho (Hermit) on Jun 26, 2007 at 19:49 UTC
You need to use the `m<delim>regexp<delim>` form (note the `m`). Bracketing delimiters are used as an opening and closing pair; any other delimiter is simply repeated.
Re: Regex delimiter by PreferredUserName (Pilgrim) on Jun 26, 2007 at 21:23 UTC
`perl -wle 'for (0 .. 255) { $delim = chr; next if $delim =~ /\w/; eval "m$delim...$delim"; print "Bad: $_ ($delim)" if $@ }'`
Output:
`Bad: 9 ( ) # tab`
`Bad: 10 ( ) # newline`
`Bad: 12 ( ) # formfeed`
`Bad: 13 ( ) # carriage return`
`Bad: 32 ( ) # substitute`
`Bad: 40 (() # OK paired`
`Bad: 46 (.)`
`Bad: 59 (;)`
`Bad: 60 (<) # OK paired`
`Bad: 91 ([) # OK paired`
`Bad: 123 ({) # OK paired`
`Bad: 137 (_) # warns only, but it's wrong`
Re^2: Regex delimiter by lidden (Curate) on Jun 26, 2007 at 21:55 UTC
Changing your code to `perl -wle 'for (0 .. 255) { $delim = chr; next if $delim =~ /\w/; eval "m${delim}foo$delim;"; print "Bad: $_ ($delim)" if $@ }'` makes '.' and ';' pass too. I also have no problem with '_'.
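The same probe can be written as a small script, which may be easier to adapt than the one-liner. This is only a sketch: the sub name `rejected_delimiters` is made up, and the result depends on the Perl version, so no expected output is given.

```perl
use strict;
use warnings;

# Try every non-word single-byte character as an m delimiter and return
# the code points for which the generated match fails to compile.
sub rejected_delimiters {
    my @bad;
    for my $code (0 .. 255) {
        my $delim = chr $code;
        next if $delim =~ /\w/;    # word characters are skipped, as in the one-liner
        no warnings;               # only compile failures matter here, not warnings
        local $_ = 'foo';          # something for the generated match to run against
        eval "m${delim}foo${delim};";
        push @bad, $code if $@;
    }
    return @bad;
}

printf "Bad: %d (%s)\n", $_, ($_ > 32 && $_ < 127 ? chr : '?')
    for rejected_delimiters();
```

Opening brackets such as `(` show up as failures only because the probe repeats the same character instead of supplying the matching closer.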
Re: Regex delimiter by RMGir (Prior) on Jun 26, 2007 at 22:36 UTC
All of this is very educational. I'd never have thought `m a...a` was legal. But don't expect anyone reading your code to be very happy if you use `a` as a regex delimiter. I suggest sticking to the usual suspects, like `//`, `m!!`, `m{}`, `m()`, `m@@`, and `m<>`. Those will delimit most of the things you need to match without excessive escaping, and still be fairly readable... Mike
Re: Regex delimiter by toohoo (Beadle) on Jun 13, 2019 at 07:01 UTC
Hello monks, I have the same problem. Mine is a little different, but it comes down to the same question: which delimiter can I use for a regex? I tried to use "§" as the delimiter. I did this because I apply the regex to written text, and I have found that nearly any character, including brackets, can appear in the text I want to process. Now, by accident, one of my tools changed the encoding of a script to UTF-8 on upload to GitHub. The script was originally in Windows-1252, which also worked under Unix/Linux. But now Perl reports a bad character before the "§": `Unrecognized character \xA7;`
The regex code is as follows:
`do { $foundstring =~ s§(<a |\[)([^<>\"])(<span class=\"foundterm\">)~~([^~]+)~~(</span>)§$1$2$4§igs; } while $foundstring =~ m§(<a |\[)([^<>\"])(<span class=\"foundterm\">)~~([^~]+)~~(</span>)§is;`
Does anyone have an idea or a hint as to which character I can use that does not need to be escaped in the text? Thanks in advance and regards.
Extension: I have made a workaround. To be able to use curly brackets as regex delimiters, I replace the curly brackets in the text before the operation and restore them afterwards:
`## hide the curly brackets`
`$foundstring =~ s|\{|#lcb#|igs; $foundstring =~ s|\}|#rcb#|igs;`
`do { $foundstring =~ s{(<a |\[)([^<>\"])(<span class=\"foundterm\">)~~([^~]+)~~(</span>)}{$1$2$4}igs; } while $foundstring =~ m{(<a |\[)([^<>\"])(<span class=\"foundterm\">)~~([^~]+)~~(</span>)}is;`
`## bring the curly brackets back`
`$foundstring =~ s|#lcb#|\{|igs; $foundstring =~ s|#rcb#|\}|igs;`
This means that in the end it does not matter if someone accidentally saves the Perl script in UTF-8; it will work nonetheless.
Re^2: Regex delimiter by hippo (Archbishop) on Jun 13, 2019 at 08:27 UTC
You just need to ensure that the character encoding is correctly treated in your source, e.g. `$ perl -Mutf8 -e 'print "Yes\n" if "a" =~ m§a§;'` prints `Yes`. See the docs for the utf8 pragma for more info on that. If your source is in some other encoding, now might be an excellent time to switch.
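In a script file, as opposed to a one-liner with `-Mutf8`, the same idea might look like the sketch below. It assumes the file really is saved as UTF-8; the variable names and sample text are invented for illustration.

```perl
use strict;
use warnings;
use utf8;    # declares that this source file is encoded as UTF-8

# With the pragma in effect, the section sign is read as one character
# and can serve as the delimiter for both m and s.
my $matched = ('a' =~ m§a§) ? 1 : 0;

my $text = 'left~~right';    # hypothetical sample data
(my $changed = $text) =~ s§~~§ - §;

print "matched: $matched\n";
print "changed: $changed\n";
```

If the file is later re-saved in a single-byte encoding such as Windows-1252, the `use utf8;` line would have to go again, so the pragma and the file encoding need to be kept in step.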
Re^3: Regex delimiter by toohoo (Beadle) on Jun 13, 2019 at 09:37 UTC
Hello hippo, it's not what I expected, but thanks for the hint. In some circumstances this would take me a little more effort. But maybe I will find some character I can use as a shortcut. Regards, and have a nice day.
Re^2: Regex delimiter by AnomalousMonk (Archbishop) on Jun 13, 2019 at 17:03 UTC
Separate from the question of handling UTF-8 source code, here are some comments on the regexes.
... use "§" as delimiter ... because I use REGEX on written text and I've found, that nearly any character including brackets will be able to be included in the text ...
But the `s///` or `m//` delimiter will not clash with any character in the "bound" text variable, nor in an interpolated `qr//` regex object or plain string:
`c:\@Work\Perl\monks>perl -wMstrict -le "my $text = 'foo/bar/baz/boff zip/zit/zot/zap'; print qq{'$text'}; ;; my $regex_object = qr{ /bar/baz/ }xms; my $plain_string = '/zit/zot/'; ;; $text =~ s/ $regex_object | $plain_string /OTHER/xmsg; print qq{'$text'}; "`
Output:
`'foo/bar/baz/boff zip/zit/zot/zap'`
`'fooOTHERboff zipOTHERzap'`
(However, note that interpolation of plain strings is problematic if they may contain regex metacharacters; for this, see quotemeta and the `\Q...\E` interpolation modifiers.)
The use of `() {} [] <>` as balanced regex delimiters is useful because balanced delimiter characters within the regex pattern are handled properly (within reason; unescaped delimiter characters within the regex pattern must always be strictly balanced, so `[{}]` would have worked in the example below):
`c:\@Work\Perl\monks>perl -wMstrict -le "my $text = 'foo {bar} baz { whiz } boff'; print qq{A: '$text'}; ;; $text =~ s{ { \s* \w+ \s* } }{OTHER}xmsg; print qq{ '$text'}; ;; $text = 'abc {tuvw} de { xyz } fghi'; print qq{B: '$text'}; ;; $text =~ s{ [\}\{] \s* \w+ \s* [\}\{] }{OTHER}xmsg; print qq{ '$text'}; "`
Output:
`A: 'foo {bar} baz { whiz } boff'`
`'foo OTHER baz OTHER boff'`
`B: 'abc {tuvw} de { xyz } fghi'`
`'abc OTHER de OTHER fghi'`
Regarding the quoted code:
`do { $foundstring =~ s§(<a |\[)([^<>\"])(<span class=\"foundterm\">)~~([^~]+)~~(</span>)§$1$2$4§igs; } while $foundstring =~ m§(<a |\[)([^<>\"])(<span class=\"foundterm\">)~~([^~]+)~~(</span>)§is;`
Doing a substitution that is dependent on a separate, identical `m//` match in this way is redundant because the `s///` replacement will only occur if its own match is successful, and the `/g` modifier will cause all matches to be replaced:
`c:\@Work\Perl\monks>perl -wMstrict -le "my $text = '123 abc 456 de 789 fghi 321'; print qq{A: '$text'}; ;; do { printf 'running s/// -> '; $text =~ s{ [a-z]+ }{OTHER}xmsg; print qq{'$text'}; } while $text =~ m{ [a-z]+ }xms; print qq{done: '$text'}; ;; $text = '123 rs 456 tuvw 789 xyz 321'; print qq{B: '$text'}; ;; $text =~ s{ [a-z]+ }{OTHER}xmsg; print qq{ '$text'}; "`
Output:
`A: '123 abc 456 de 789 fghi 321'`
`running s/// -> '123 OTHER 456 OTHER 789 OTHER 321'`
`done: '123 OTHER 456 OTHER 789 OTHER 321'`
`B: '123 rs 456 tuvw 789 xyz 321'`
`'123 OTHER 456 OTHER 789 OTHER 321'`
In case A, the `while`-loop and substitution only run once because the `/g` modifier of the `s///` causes anything that could match to be replaced. In case B, the same result is achieved with no separate `m//` match.
Update: One sometimes sees something like `my $match = qr{ ... }xms;` followed by `$string =~ s{ $match }{replace}xms if $string =~ m{ $match }xms;` as a variation on this theme. Again, the substitution will only occur if the `$match` pattern matches, so the separate `m//` on the same pattern is redundant.
Give a man a fish: `<%-{-{-{-<`
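The caveat above about interpolating plain strings that contain regex metacharacters can be illustrated with a short sketch. The sample text and variable names are invented; `quotemeta` and `\Q...\E` are the standard Perl features the post refers to.

```perl
use strict;
use warnings;

my $text  = 'price is 3.50 (approx), not 3x50';    # hypothetical sample data
my $plain = '3.50 (approx)';                       # contains the metacharacters . ( )

# Interpolated as-is, the string is compiled as a pattern: "." matches any
# character and the parentheses become a capture group, so the literal
# "(" and ")" in the text are never matched.
my $raw_hit = ($text =~ /$plain/) ? 1 : 0;

# \Q...\E escapes the metacharacters, so the string is matched literally.
my $quoted_hit = ($text =~ /\Q$plain\E/) ? 1 : 0;

# quotemeta produces the same escaped form as a plain string.
my $escaped = quotemeta $plain;

print "raw: $raw_hit, quoted: $quoted_hit, escaped: $escaped\n";
```

Neither form depends on the choice of delimiter, since the escaping concerns regex metacharacters in the interpolated string and not the characters that bound the pattern in the source.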
Re^3: Regex delimiter by toohoo (Beadle) on Jun 14, 2019 at 06:57 UTC
Hello AnomalousMonk, great answer, helpful hints. That is helping others in the best manner. Thanks very much for it. I had some reasons for this redundancy, but it might become too long to discuss here. One was uncertainty about nested matches and the wish to process them one after the other. I was aware of the redundancy. So thanks again and have a nice day.