nbsp in space character class

hsfrey has asked for the wisdom of the Perl Monks concerning the following question:

Re: nbsp in space character class by ysth (Canon) on Jul 14, 2008 at 00:17 UTC
There's \p{Zs} that matches a normal space and an nbsp and a few other oddities:

    $ perl -le'use charnames (); use warnings; no warnings "utf8"; chr($_) =~ /[\p{Zs}]/ && printf "%.4x: %s\n", $_, charnames::viacode($_) for 0..65500'
    0020: SPACE
    00a0: NO-BREAK SPACE
    1680: OGHAM SPACE MARK
    180e: MONGOLIAN VOWEL SEPARATOR
    2000: EN QUAD
    2001: EM QUAD
    2002: EN SPACE
    2003: EM SPACE
    2004: THREE-PER-EM SPACE
    2005: FOUR-PER-EM SPACE
    2006: SIX-PER-EM SPACE
    2007: FIGURE SPACE
    2008: PUNCTUATION SPACE
    2009: THIN SPACE
    200a: HAIR SPACE
    202f: NARROW NO-BREAK SPACE
    205f: MEDIUM MATHEMATICAL SPACE
    3000: IDEOGRAPHIC SPACE

(\s also matches it, but only if perl has the string marked as utf8.)

But \p{Zs} expects an actual NO-BREAK SPACE character, not the HTML entity for one. If you want to include entities that represent space characters, you'd probably be best off using HTML::Entities to decode them first.
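The decode-first approach might look roughly like the sketch below. It assumes HTML::Entities is installed from CPAN; the helper name `normalize_spaces` and the sample input are made up for illustration and do not come from the thread.

```perl
use strict;
use warnings;
use HTML::Entities qw( decode_entities );

# Hypothetical helper (not from the thread): decode entities first, then
# collapse every run of Unicode space separators into one plain space.
sub normalize_spaces {
    my ($html_text) = @_;
    my $text = decode_entities($html_text);   # '&nbsp;' becomes U+00A0
    $text =~ s/\p{Zs}+/ /g;                   # U+0020, U+00A0, U+2003, ...
    return $text;
}

# '&#8195;' is the numeric entity for U+2003 EM SPACE.
print normalize_spaces('foo&nbsp;&nbsp;bar&#8195;baz'), "\n";
```

This should print `foo bar baz`. Decoding before matching keeps the regex free of HTML-specific alternatives such as a literal `&nbsp;`.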
Re^2: nbsp in space character class by ikegami (Patriarch) on Jul 14, 2008 at 05:57 UTC
\s also matches it, but only if perl has the string marked as utf8

As shown below.

    use HTML::Entities qw( decode_entities );

    my $ch = decode_entities('&nbsp;');

    print("Unicode Semantics\n");
    print("-----------------\n");
    utf8::upgrade($ch);
    if ($ch =~ /\s/) {
        print("Match\n");
    } else {
        print("No Match\n");
    }

    print("\n");

    print("Byte Semantics\n");
    print("--------------\n");
    utf8::downgrade($ch);
    if ($ch =~ /\s/) {
        print("Match\n");
    } else {
        print("No Match\n");
    }

Output:

    Unicode Semantics
    -----------------
    Match

    Byte Semantics
    --------------
    No Match
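On perls newer than this thread (5.14 and later), the `/u` regex modifier asks for Unicode rules regardless of how the string is stored internally, which sidesteps the upgrade/downgrade difference. The following is only a sketch using core Perl; the helper names `is_space_default` and `is_space_unicode` are invented for the example.

```perl
use strict;
use warnings;

# Deliberately no "use v5.12" (or later) here: that would enable the
# unicode_strings feature and hide the byte-semantics case.

my $nbsp = "\xA0";          # NO-BREAK SPACE
utf8::downgrade($nbsp);     # stored as a single byte, UTF8 flag off

# Hypothetical helpers for illustration only.
sub is_space_default { return $_[0] =~ /\s/  ? 1 : 0 }   # default rules
sub is_space_unicode { return $_[0] =~ /\s/u ? 1 : 0 }   # needs perl 5.14+

printf "default: %d\n", is_space_default($nbsp);
printf "with /u: %d\n", is_space_unicode($nbsp);
```

On a 5.14+ perl without locale or feature pragmas in effect, this should report 0 for the default match and 1 for the `/u` match.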
Re: nbsp in space character class by pc88mxer (Vicar) on Jul 14, 2008 at 00:16 UTC
I have found that `&nbsp;` is often represented by the character `"\xa0"`. So, you could use `m/(\s|\xa0)/` or `m/[\s\xa0]/`.
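A small sketch of the `[\s\xa0]` class in use, assuming the text has already been decoded to characters (in raw UTF-8 bytes a no-break space is the two bytes C2 A0, which this class would not handle cleanly). The helper name `trim_with_nbsp` and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: trim leading and trailing whitespace, treating
# the character \xA0 (NO-BREAK SPACE) as whitespace too.
sub trim_with_nbsp {
    my ($s) = @_;
    $s =~ s/\A[\s\xa0]+//;
    $s =~ s/[\s\xa0]+\z//;
    return $s;
}

# Split on runs of ordinary whitespace and/or no-break spaces.
my @fields = split /[\s\xa0]+/, "alpha\xa0beta  gamma";
print join('|', @fields), "\n";

print '[', trim_with_nbsp("\xa0 text \xa0"), "]\n";
```

This should print `alpha|beta|gamma` followed by `[text]`. The bracketed class avoids the capture group that `(\s|\xa0)` creates.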
Re: nbsp in space character class by swampyankee (Parson) on Jul 14, 2008 at 02:48 UTC
Have you checked the various HTML modules on CPAN? I, for one, think that modifying the regex engine to recognize HTML entities such as `&nbsp;` as members of the \s character class is far too specific to HTML processing for a general-purpose language such as Perl.