Re: nbsp in space character class

There's \p{Zs} (the Unicode "Space Separator" category), which matches a normal space, an nbsp, and a few other oddities:

$ perl -le'use charnames (); use warnings; no warnings "utf8";
    chr($_) =~ /[\p{Zs}]/ && printf "%.4x: %s\n", $_, charnames::viacode($_)
        for 0..65500'
0020: SPACE
00a0: NO-BREAK SPACE
1680: OGHAM SPACE MARK
180e: MONGOLIAN VOWEL SEPARATOR
2000: EN QUAD
2001: EM QUAD
2002: EN SPACE
2003: EM SPACE
2004: THREE-PER-EM SPACE
2005: FOUR-PER-EM SPACE
2006: SIX-PER-EM SPACE
2007: FIGURE SPACE
2008: PUNCTUATION SPACE
2009: THIN SPACE
200a: HAIR SPACE
202f: NARROW NO-BREAK SPACE
205f: MEDIUM MATHEMATICAL SPACE
3000: IDEOGRAPHIC SPACE

(\s also matches it, but only if perl has the string marked as utf8.) Note that \p{Zs} expects an actual NO-BREAK SPACE character (U+00A0), not the HTML entity for one, so the literal text &nbsp; will never match. If you want to include entities that represent space characters, you'd probably be best off using HTML::Entities to decode them first.
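
To make the decode-first idea concrete, here is a minimal sketch. The helper name has_space_separator and the sample strings are made up for illustration; decode_entities is the HTML::Entities function used elsewhere in this thread.

```perl
use strict;
use warnings;
use HTML::Entities qw( decode_entities );

# Hypothetical helper: true if the text contains a Unicode space separator
# once any HTML entities in it have been decoded.
sub has_space_separator {
    my ($html) = @_;
    my $text = decode_entities($html);    # '&nbsp;' becomes U+00A0
    return $text =~ /\p{Zs}/ ? 1 : 0;
}

# The undecoded entity is just the characters & n b s p ; so it cannot match.
print "raw entity: ", ('foo&nbsp;bar' =~ /\p{Zs}/ ? 'match' : 'no match'), "\n";

# After decoding, named and numeric entities for spaces both match.
for my $sample ('foo&nbsp;bar', 'foo&#x2003;bar', 'foobar') {
    printf "%-16s %s\n", $sample,
        has_space_separator($sample) ? 'match' : 'no match';
}
```

Decoding once up front keeps the regex simple, since it only has to know about characters and not about every entity spelling of them.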

--
Online Fortune Cookie Search


Replies are listed 'Best First'.

Re^2: nbsp in space character class
by ikegami (Patriarch) on Jul 14, 2008 at 05:57 UTC

\s also matches it, but only if perl has the string marked as utf8

As shown below.

use HTML::Entities qw( decode_entities );

my $ch = decode_entities('&nbsp;');

print("Unicode Semantics\n");
print("-----------------\n");
utf8::upgrade($ch);
if ($ch =~ /\s/) {
   print("Match\n");
} else {
   print("No Match\n");
}

print("\n");

print("Byte Semantics\n");
print("--------------\n");
utf8::downgrade($ch);
if ($ch =~ /\s/) {
   print("Match\n");
} else {
   print("No Match\n");
}

Unicode Semantics
-----------------
Match

Byte Semantics
--------------
No Match
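
If you'd rather not depend on how perl happens to store the string, here is a rough sketch of two alternatives, assuming perl 5.14 or later for the /u modifier. The helper name space_matches is made up for illustration. Note the script deliberately avoids "use v5.12" or later, since that turns on the unicode_strings feature and would change what the plain /\s/ does.

```perl
use strict;
use warnings;

# Hypothetical helper: try three patterns against the same string.
sub space_matches {
    my ($str) = @_;
    return {
        plain    => ($str =~ /\s/)     ? 1 : 0,   # depends on internal storage
        unicode  => ($str =~ /\s/u)    ? 1 : 0,   # /u asks for Unicode rules
        property => ($str =~ /\p{Zs}/) ? 1 : 0,   # \p{...} implies Unicode rules
    };
}

for my $form (qw( downgrade upgrade )) {
    my $ch = chr(0xA0);                 # NO-BREAK SPACE
    if ($form eq 'upgrade') {
        utf8::upgrade($ch);             # stored as UTF-8 internally
    } else {
        utf8::downgrade($ch);           # stored as a single byte
    }
    my $r = space_matches($ch);
    printf "%-9s  \\s: %d  \\s/u: %d  \\p{Zs}: %d\n",
        $form, @{$r}{qw( plain unicode property )};
}
```

Only the plain /\s/ column should differ between the two rows; the other two patterns don't care about the internal representation.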
