graff has asked for the wisdom of the Perl Monks concerning the following question:

I'm wondering if this is a bug in 5.8's handling of regex character classes, or whether it's just a case of the "perlre" man page being a bit off... (this applies to 5.8.0 and 5.8.1 equally)

Reading perlre, I would expect the following two regexes to match the same set of characters (in the ASCII range, at least), because they are said to be "equivalent":

    /[[:punct:]]/
    /\p{IsPunct}/

But when I tried the following little test snippet, I got a bit of a surprise:

    for $x ( 0x20 .. 0x7e ) {
        $_ = chr( $x );
        $res  = ( /[[:punct:]]/ ) ? "matches :punct:" : "is not a :punct:";
        $res .= ( /\p{IsPunct}/ ) ? " matches {IsPunct}" : " fails on {IsPunct}";
        printf( " 0x%x (%3d.) %s %s\n", $x, $x, $_, $res )
            if ( $res =~ /matches/ );
    }

Actually, when I look at the output, there seems to be some rhyme and reason to the discrepancies, so it looks like a "feature" (not a bug) to have the two different notions of "punctuation" (and the docs should be updated accordingly). I wasn't sure which Perl mailing list(s) to post this to, so I decided to check here first.

Re: [[:punct:]] vs. {IsPunct} in 5.8
by particle (Vicar) on Nov 02, 2003 at 14:26 UTC

    for some background, perlre (5.008) states:

    The following equivalences to Unicode \p{} constructs and equivalent backslash character classes (if available), will hold:

        [:...:]     \p{...}         backslash

        alpha       IsAlpha
        alnum       IsAlnum
        ascii       IsASCII
        blank       IsSpace
        cntrl       IsCntrl
        digit       IsDigit         \d
        graph       IsGraph
        lower       IsLower
        print       IsPrint
        punct       IsPunct
        space       IsSpace
                    IsSpacePerl     \s
        upper       IsUpper
        word        IsWord
        xdigit      IsXDigit

    For example "[:lower:]" and "\p{IsLower}" are equivalent.

    if your results match mine,

        #!/usr/bin/perl
        use strict;
        use warnings;
        $|++;

        # POSIX class name => Unicode property name to compare it with
        my %classes= qw/
            alpha  IsAlpha   alnum  IsAlnum   ascii  IsASCII
            blank  IsBlank   cntrl  IsCntrl   digit  IsDigit
            graph  IsGraph   lower  IsLower   print  IsPrint
            punct  IsPunct   space  IsSpace   upper  IsUpper
            word   IsWord    xdigit IsXDigit
        /;

        for( keys %classes ) {
            my( $r_posix, $r_unicode )= ( qr/[[:$_:]]/, qr/\p{$classes{$_}}/ );
            print "testing $r_posix and $r_unicode$/";

            # report every code point where exactly one of the two matches
            for my $x (0x00..0x7e) {
                local $_= chr $x;
                printf "0x%x (%3d.) differ$/" => $x, $x
                    if /$r_posix/ xor /$r_unicode/;
            }
        }
        __END__
        testing (?-xism:[[:digit:]]) and (?-xism:\p{IsDigit})
        testing (?-xism:[[:upper:]]) and (?-xism:\p{IsUpper})
        testing (?-xism:[[:xdigit:]]) and (?-xism:\p{IsXDigit})
        testing (?-xism:[[:cntrl:]]) and (?-xism:\p{IsCntrl})
        testing (?-xism:[[:alnum:]]) and (?-xism:\p{IsAlnum})
        testing (?-xism:[[:space:]]) and (?-xism:\p{IsSpace})
        testing (?-xism:[[:print:]]) and (?-xism:\p{IsPrint})
        testing (?-xism:[[:ascii:]]) and (?-xism:\p{IsASCII})
        testing (?-xism:[[:word:]]) and (?-xism:\p{IsWord})
        testing (?-xism:[[:alpha:]]) and (?-xism:\p{IsAlpha})
        testing (?-xism:[[:punct:]]) and (?-xism:\p{IsPunct})
        0x24 ( 36.) differ
        0x2b ( 43.) differ
        0x3c ( 60.) differ
        0x3d ( 61.) differ
        0x3e ( 62.) differ
        0x5e ( 94.) differ
        0x60 ( 96.) differ
        0x7c (124.) differ
        0x7e (126.) differ
        testing (?-xism:[[:lower:]]) and (?-xism:\p{IsLower})
        testing (?-xism:[[:blank:]]) and (?-xism:\p{IsBlank})
        testing (?-xism:[[:graph:]]) and (?-xism:\p{IsGraph})

    then i'd list this as a bug, and contact p5p. it seems only [[:punct:]] and \p{IsPunct} differ. this is not expected behavior.

    ~Particle *accelerates*

      It's a bug alright. A documentation bug...

      I checked the Unicode properties, and these are the results:

      Codepoint  Char  Class
      0024       $     Currency Symbol
      002B       +     Math Symbol
      003C       <     Math Symbol
      003D       =     Math Symbol
      003E       >     Math Symbol
      005E       ^     Modifier Symbol
      0060       `     Modifier Symbol
      007C       |     Math Symbol
      007E       ~     Math Symbol

      So those are not "punctuation" according to the Unicode standard... Time for a PunctPerl class, to keep SpacePerl company?
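
The table and the "PunctPerl" idea above can be checked with a short script. This is only a minimal sketch: `general_category`, `mismatches` and `$punct_perl` are hypothetical names made up for illustration, and `$punct_perl` is not a real Perl property. It simply assumes that "Punctuation plus Symbol" (`\p{P}` and `\p{S}`) is a reasonable stand-in for what [[:punct:]] covers in the ASCII range.

```perl
use strict;
use warnings;

# Report which Unicode general category (Punctuation, or one of the
# Symbol subcategories) a single character falls into.
sub general_category {
    my ($c) = @_;
    return 'Punctuation'     if $c =~ /\p{P}/;
    return 'Currency Symbol' if $c =~ /\p{Sc}/;
    return 'Math Symbol'     if $c =~ /\p{Sm}/;
    return 'Modifier Symbol' if $c =~ /\p{Sk}/;
    return 'Other Symbol'    if $c =~ /\p{So}/;
    return 'other';
}

# Hypothetical stand-in for a "PunctPerl" class (an assumption, not a
# built-in property): Unicode Punctuation plus Unicode Symbol.
my $punct_perl = qr/[\p{P}\p{S}]/;

# Code points in 0x20..0x7e where [[:punct:]] and the given regex disagree.
sub mismatches {
    my ($re) = @_;
    return grep {
        my $c = chr $_;
        ( $c =~ /[[:punct:]]/ ) xor ( $c =~ $re );
    } 0x20 .. 0x7e;
}

# List each character that [[:punct:]] and \p{IsPunct} disagree on,
# together with its general category.
for my $cp ( mismatches(qr/\p{IsPunct}/) ) {
    printf "%04X %s %s\n", $cp, chr($cp), general_category( chr($cp) );
}

# Count how many characters still differ once symbols are included.
my @left = mismatches($punct_perl);
printf "%d characters still differ from [[:punct:]]\n", scalar @left;
```

If the union leaves no mismatches in 0x20..0x7e, then `[\p{P}\p{S}]` could serve as a Unicode-side spelling of [[:punct:]] for ASCII text. Outside ASCII the two notions may well diverge, so treat the union as a workaround rather than a definition.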

      -- 
              dakkar - Mobilis in mobile
      

      Most of my code is tested...

      Perl is strongly typed, it just has very few types (Dan)

      Thanks for such a nicely crafted verification. (I wanted to check the other POSIX vs. Unicode classes as well, so you saved me some trouble, and showed a neat approach!)

      I have posted the observation to both the perl5-porters and perl-unicode mailing lists.

Re: [[:punct:]] vs. {IsPunct} in 5.8
by liz (Monsignor) on Nov 02, 2003 at 09:51 UTC
    I think your assessment is correct, but I don't have that much experience with Unicode regexes.

    One good place to ask this would be the perl-unicode@perl.org mailing list.

    Liz