pool has asked for the wisdom of the Perl Monks concerning the following question:

I want to know whether a numeric \x escape can be used to represent a character with a codepoint from U+0080 to U+00ff within a regular-expression character class.

In the code below, the pattern is intended to match any character in the 2-character range from U+007f to U+0080. The string doesn't contain either of those characters, so I expect the match to fail, but instead it succeeds.

The output is: "Your string is matched by /[\7f-\x80]/"

#!/usr/bin/perl -w
use warnings 'FATAL', 'all'; # Make every warning fatal.
use strict; # Require strict checking of variable references, etc.
use utf8; # Treat this script as encoded with UTF-8.
my $_ = 'abcdefg'; # Identify a string.
print 'Your string is ', (/[\7f-\x80]/ ? '' : 'NOT '), 'matched by /[\7f-\x80]/', "\n"; # Report the result.

Original content restored by GrandFather

Replies are listed 'Best First'.
Re: \x in RE character class
by ikegami (Patriarch) on Dec 10, 2010 at 02:21 UTC

    I want to know whether a numeric \x escape can be used to represent a character with a codepoint from U+0080 to U+00ff within a regular-expression character class.

    Yes, you can. \x80, \x{0080}, \200, \N{U+0080}, etc. all work. You could easily have ascertained that yourself.

    $ perl -E'say "\x80" =~ /^\x80\z/ ?"match":"no match"'
    match
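The same check can be run with the escape inside a bracketed character class, which is what the question actually asks about. This is a minimal sketch, assuming Perl 5.12 or later (for \N{U+...}) and no `use encoding` in effect; the `@patterns` table and its labels are only for illustration.

```perl
use strict;
use warnings;

# The target: a one-character string holding code point 0x80.
my $char = chr(0x80);

# Each entry pairs a label (the escape as written) with a pattern
# that places that escape inside a character class.
my @patterns = (
    [ '[\x80]'       => qr/^[\x80]\z/       ],
    [ '[\x{0080}]'   => qr/^[\x{0080}]\z/   ],
    [ '[\200]'       => qr/^[\200]\z/       ],
    [ '[\N{U+0080}]' => qr/^[\N{U+0080}]\z/ ],
);

for my $p (@patterns) {
    my ($label, $re) = @$p;
    printf "%-14s %s\n", $label, ($char =~ $re ? 'match' : 'no match');
}
```

Each of the four forms should report a match for the single character U+0080.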

    Elsewhere, you've asked the same question but under use encoding 'UTF-8'. The UTF-8 encoding of U+0080 is C2 80, so...

    ...well, I can't find a way of matching U+0080 with \x or \0, but \N works.

    $ perl -E'
        use encoding "UTF-8";
        say "\xC2\x80" =~ /^\N{U+0080}\z/ ?"match":"no match"
    '
    match

    Of course, placing the literal character in the regex pattern works too. You can insert it into the source code, or interpolate it in.

    $ perl -E'
        use encoding "UTF-8";
        $x = "\xC2\x80";
        say $x =~ /^\Q$x\E\z/ ?"match":"no match"
    '
    match

    This appears to be a known limitation, since the documentation shows a workaround for the similar operator tr///. tr/// needs its own workaround because it doesn't allow interpolation.
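Because tr/// builds its table at compile time, a variable in the search list is not interpolated; perlop suggests wrapping the operator in a string eval when variables are needed. The following is only a sketch of that general approach, not the exact workaround from the encoding documentation. It does not use `use encoding`, and `count_in_range`, `$text`, and `$range` are hypothetical names.

```perl
use strict;
use warnings;

# Count the characters of $text that fall inside the tr-style range $range.
# The range is spliced into the source of a tr/// through a string eval,
# so it must only ever come from trusted input.
sub count_in_range {
    my ($text, $range) = @_;
    my $count = eval "\$text =~ tr/$range//";
    die $@ if $@;
    return $count;
}

# Build the range U+007F..U+0080 from the literal characters themselves.
my $range = chr(0x7f) . '-' . chr(0x80);

print count_in_range("abc\x7f\x80", $range), "\n";
print count_in_range('abcdefg', $range), "\n";
```

With an empty replacement list and no /d modifier, tr/// leaves the string unchanged and simply returns the number of characters it found in the search list.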

Re: \x in RE character class
by roboticus (Chancellor) on Dec 10, 2010 at 00:13 UTC

    pool:

    Update: Never mind ... a range of 0x7f..0x80 isn't likely what you wanted.

    For want of an 'x', a character range was pooched. Perhaps you meant [\x7f-\x80]?
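To see why the typo made the match succeed: without the x, \7 inside a character class is read as an octal escape for the character with code 7, and the rest of the class, f-\x80, becomes a range that starts at the letter f. A short sketch comparing the two classes against the string from the question (the variable names are illustrative):

```perl
use strict;
use warnings;

my $string = 'abcdefg';

# Typo: \7 is the octal escape for chr(7); f-\x80 is then a range
# from 'f' (0x66) up to 0x80, which contains both 'f' and 'g'.
my $typo = qr/[\7f-\x80]/;

# Intended: the two-character range 0x7f..0x80.
my $intended = qr/[\x7f-\x80]/;

for my $case ([ typo => $typo ], [ intended => $intended ]) {
    my ($label, $re) = @$case;
    if ($string =~ $re) {
        print "$label: matched '$&'\n";
    } else {
        print "$label: NOT matched\n";
    }
}
```

The typo class matches because 'abcdefg' contains 'f', while the intended class finds nothing in that string.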

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Thanks for noticing my typo. Sorry for the inconvenience.