Ovid has asked for the wisdom of the Perl Monks concerning the following question:

According to perlre:

The following equivalences to Unicode \p{} constructs and equivalent backslash character classes (if available), will hold:

[snip]

digit IsDigit \d

However, according to Larry, they are not equivalent. Is the Perl documentation incorrect? What does this mean? Should constructs like \s also be converted to [[:space:]], or am I totally misunderstanding this?

Cheers,
Ovid

New address of my CGI Course.

Re: POSIX character classes in regular expressions
by ikegami (Patriarch) on Jan 31, 2005 at 19:26 UTC
    In (a parent to) the post you linked, Larry said \d and [0-9] weren't equivalent. He didn't say anything about \d and the character classes you mentioned not being equivalent.

      Thanks. I guess in reading that a bit closer, I see what you mean.

      Cheers,
      Ovid

Re: POSIX character classes in regular expressions
by ambrus (Abbot) on Jan 31, 2005 at 20:58 UTC

    I don't quite understand the distinction either, but locales come in here too.

    I've done this quick check:

    for l in C hu_HU de_DE; do
      echo $l:
      LANG=$l perl -we 'use locale;
        for $s ("\x{e1}", "\x{151}", "\x{a3}") {
          for $r (qr/\w/, qr/[[:alnum:]]/, qr/\pL/) {
            printf qq["\\x{%x}" %s /%s/\n], ord($s), ($s =~ qr/$r/) ? "=~" : "!~", $r;
          }
        }'
    done

    output is

    C:
    "\x{e1}" !~ /(?-xism:\w)/
    "\x{e1}" !~ /(?-xism:[[:alnum:]])/
    "\x{e1}" =~ /(?-xism:\pL)/
    "\x{151}" =~ /(?-xism:\w)/ # why?
    "\x{151}" =~ /(?-xism:[[:alnum:]])/ # why?
    "\x{151}" =~ /(?-xism:\pL)/
    "\x{a3}" !~ /(?-xism:\w)/
    "\x{a3}" !~ /(?-xism:[[:alnum:]])/
    "\x{a3}" !~ /(?-xism:\pL)/
    hu_HU:
    "\x{e1}" =~ /(?-xism:\w)/
    "\x{e1}" =~ /(?-xism:[[:alnum:]])/
    "\x{e1}" =~ /(?-xism:\pL)/
    "\x{151}" =~ /(?-xism:\w)/
    "\x{151}" =~ /(?-xism:[[:alnum:]])/
    "\x{151}" =~ /(?-xism:\pL)/
    "\x{a3}" =~ /(?-xism:\w)/
    "\x{a3}" =~ /(?-xism:[[:alnum:]])/
    "\x{a3}" !~ /(?-xism:\pL)/
    de_DE:
    "\x{e1}" =~ /(?-xism:\w)/
    "\x{e1}" =~ /(?-xism:[[:alnum:]])/
    "\x{e1}" =~ /(?-xism:\pL)/
    "\x{151}" =~ /(?-xism:\w)/
    "\x{151}" =~ /(?-xism:[[:alnum:]])/
    "\x{151}" =~ /(?-xism:\pL)/
    "\x{a3}" !~ /(?-xism:\w)/
    "\x{a3}" !~ /(?-xism:[[:alnum:]])/
    "\x{a3}" !~ /(?-xism:\pL)/

    With no locale;, you get the same output as with the C locale.