hsfrey has asked for the wisdom of the Perl Monks concerning the following question:

In regexes, it appears that the whitespace character class, \s, doesn't recognize the HTML non-breaking space entity, &nbsp;. Of course, I could define a pattern such as (\s|&nbsp;)+, but is there a predefined Perl character class that includes it along with the other space characters?

Re: nbsp in space character class
by ysth (Canon) on Jul 14, 2008 at 00:17 UTC
    There's \p{Zs} that matches a normal space and an nbsp and a few other oddities:
    $ perl -le'use charnames (); use warnings; no warnings "utf8"; chr($_) =~ /[\p{Zs}]/ && printf "%.4x: %s\n", $_, charnames::viacode($_) for 0..65500'
    0020: SPACE
    00a0: NO-BREAK SPACE
    1680: OGHAM SPACE MARK
    180e: MONGOLIAN VOWEL SEPARATOR
    2000: EN QUAD
    2001: EM QUAD
    2002: EN SPACE
    2003: EM SPACE
    2004: THREE-PER-EM SPACE
    2005: FOUR-PER-EM SPACE
    2006: SIX-PER-EM SPACE
    2007: FIGURE SPACE
    2008: PUNCTUATION SPACE
    2009: THIN SPACE
    200a: HAIR SPACE
    202f: NARROW NO-BREAK SPACE
    205f: MEDIUM MATHEMATICAL SPACE
    3000: IDEOGRAPHIC SPACE
    (\s also matches it, but only if perl has the string marked as utf8). But \p{Zs} expects an actual NO-BREAK SPACE character, not the HTML entity for one. If you want to include entities that represent space characters, you'd probably be best off using HTML::Entities to decode them first.
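A minimal sketch of the decode-then-match approach described above, using only core Perl. The helper name collapse_space is hypothetical, and the single s/&nbsp;/\xa0/g substitution is a simplified stand-in for decode_entities from HTML::Entities, which handles the full set of entities and is what the reply actually recommends.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse every run of whitespace, including
# NO-BREAK SPACE, into a single plain space.
sub collapse_space {
    my ($text) = @_;

    # Assumption: stand-in for HTML::Entities::decode_entities.
    # It only handles the literal entity &nbsp;.
    $text =~ s/&nbsp;/\xa0/g;

    # \s covers tab, newline and friends; \p{Zs} covers U+00A0 and
    # the other Unicode space separators listed above.
    $text =~ s/[\s\p{Zs}]+/ /g;

    return $text;
}

print collapse_space("one&nbsp;&nbsp;two \t three"), "\n";   # prints "one two three"
```

Combining \s and \p{Zs} in one bracketed class means the match does not depend on whether perl has the string marked as utf8.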

      \s also matches it, but only if perl has the string marked as utf8

      As shown below.

      use HTML::Entities qw( decode_entities );

      # Decode the entity into an actual NO-BREAK SPACE character (U+00A0)
      my $ch = decode_entities('&nbsp;');

      print("Unicode Semantics\n");
      print("-----------------\n");
      utf8::upgrade($ch);    # string is now marked as utf8
      if ($ch =~ /\s/) {
          print("Match\n");
      } else {
          print("No Match\n");
      }

      print("\n");

      print("Byte Semantics\n");
      print("--------------\n");
      utf8::downgrade($ch);  # same character, no longer marked as utf8
      if ($ch =~ /\s/) {
          print("Match\n");
      } else {
          print("No Match\n");
      }
      Unicode Semantics
      -----------------
      Match

      Byte Semantics
      --------------
      No Match
Re: nbsp in space character class
by pc88mxer (Vicar) on Jul 14, 2008 at 00:16 UTC
    I have found that &nbsp; is often represented by the character "\xa0".

    So, you could use m/(\s|\xa0)/ or m/[\s\xa0]/.
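As a rough illustration of the bracketed form, the sketch below splits a line on runs of ordinary whitespace or "\xa0". The helper name split_words and the sample string are made up for the example. It assumes the nbsp really is the single character "\xa0" (Latin-1 data or an already decoded string); in raw UTF-8 bytes a no-break space is the two-byte sequence "\xc2\xa0", so the input would need decoding first.

```perl
use strict;
use warnings;

# Hypothetical helper: split on runs of whitespace or NO-BREAK SPACE.
sub split_words {
    my ($line) = @_;
    # Listing \xa0 explicitly makes it match even when \s alone would not
    return split /[\s\xa0]+/, $line;
}

my @words = split_words("alpha\xa0beta  gamma\tdelta");
print join("|", @words), "\n";   # prints "alpha|beta|gamma|delta"
```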

Re: nbsp in space character class
by swampyankee (Parson) on Jul 14, 2008 at 02:48 UTC

    Have you checked the various HTML modules on CPAN? I, for one, think that modifying the regex engine to recognize HTML entities such as &nbsp; as members of the \s character class would be far too specific to HTML processing for a general-purpose language such as Perl.

