in reply to nbsp in space character class

There's \p{Zs} that matches a normal space and an nbsp and a few other oddities:
$ perl -le'use charnames (); use warnings; no warnings "utf8"; chr($_) =~ /[\p{Zs}]/ && printf "%.4x: %s\n", $_, charnames::viacode($_) for 0..65500'
0020: SPACE
00a0: NO-BREAK SPACE
1680: OGHAM SPACE MARK
180e: MONGOLIAN VOWEL SEPARATOR
2000: EN QUAD
2001: EM QUAD
2002: EN SPACE
2003: EM SPACE
2004: THREE-PER-EM SPACE
2005: FOUR-PER-EM SPACE
2006: SIX-PER-EM SPACE
2007: FIGURE SPACE
2008: PUNCTUATION SPACE
2009: THIN SPACE
200a: HAIR SPACE
202f: NARROW NO-BREAK SPACE
205f: MEDIUM MATHEMATICAL SPACE
3000: IDEOGRAPHIC SPACE

(\s also matches it, but only if perl has the string marked as utf8). But \p{Zs} expects an actual NO-BREAK SPACE character, not the HTML entity for one. If you want to include entities that represent space characters, you'd probably be best off using HTML::Entities to decode them first.
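
A minimal sketch of the decode-first approach, assuming HTML::Entities is installed from CPAN. The sample string and variable names are made up for illustration. Once decoded, the entity is a real U+00A0 character, so \p{Zs} can see it:

```perl
use strict;
use warnings;
use HTML::Entities qw( decode_entities );

# Hypothetical input: the gap between the words is written as an entity.
my $html = 'foo&nbsp;bar';

# Before decoding, the string only holds the literal text "&nbsp;",
# so there is no space separator for \p{Zs} to match.
my $raw_matches = $html =~ /\p{Zs}/ ? 1 : 0;

# decode_entities turns &nbsp; into a real NO-BREAK SPACE (U+00A0).
my $text = decode_entities($html);
my $decoded_matches = $text =~ /\p{Zs}/ ? 1 : 0;

# Split on any run of Unicode space separators.
my @words = split /\p{Zs}+/, $text;

print "raw: $raw_matches, decoded: $decoded_matches, words: @words\n";
```

This should report no match for the raw string, a match for the decoded one, and the two words foo and bar after the split.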

Re^2: nbsp in space character class
by ikegami (Patriarch) on Jul 14, 2008 at 05:57 UTC

    \s also matches it, but only if perl has the string marked as utf8

    As shown below.

    use HTML::Entities qw( decode_entities );

    # Decode the entity into a real NO-BREAK SPACE character (U+00A0).
    my $ch = decode_entities('&nbsp;');

    print("Unicode Semantics\n");
    print("-----------------\n");
    utf8::upgrade($ch);      # string is now marked as utf8
    if ($ch =~ /\s/) {
        print("Match\n");
    } else {
        print("No Match\n");
    }

    print("\n");

    print("Byte Semantics\n");
    print("--------------\n");
    utf8::downgrade($ch);    # same character, stored as a single byte
    if ($ch =~ /\s/) {
        print("Match\n");
    } else {
        print("No Match\n");
    }

    Unicode Semantics
    -----------------
    Match

    Byte Semantics
    --------------
    No Match
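
For what it's worth, perls from 5.14 on (newer than this thread) have a /u regex modifier that asks for Unicode rules no matter how the string is stored. A small sketch, assuming such a perl and no `use locale` or `unicode_strings` feature in effect; the variable names are only illustrative:

```perl
use strict;
use warnings;

# U+00A0 NO-BREAK SPACE, forced into the downgraded (byte) representation.
my $ch = "\xA0";
utf8::downgrade($ch);

# Default rules: a downgraded string gets byte semantics, so \s does not match.
my $default_match = $ch =~ /\s/ ? 1 : 0;

# /u requests Unicode rules regardless of the internal representation.
my $unicode_match = $ch =~ /\s/u ? 1 : 0;

print "default: $default_match, /u: $unicode_match\n";
```

Under those assumptions only the /u pattern should match, which makes the result independent of whether the string happens to be upgraded.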