Re^3: Something strange in the world or Regexes

It's the UTF-8 encoding (0xC2 0xA0) of the non-breaking space (which is not included in the "whitespace" set of chars¹ — thus your regex didn't match).

___

¹ update: at least not the iso-latin-1 encoding of the character, i.e. 0xA0 (for backwards compatibility, Perl assumes iso-latin-1 by default):

print "\xa0" =~ /\s/ ? "space" : "no space";   # no space

But see below. Apparently, the 0xc2 part ("Â") somehow got lost in your case. Simply (incorrectly) treating the UTF-8 sequence as iso-latin-1 should have left you with two characters.
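To make the difference between the raw octets and the decoded character concrete, here is a minimal sketch. It assumes the input arrives as raw UTF-8 octets and that the core Encode module is available; the variable names are illustrative only, not taken from the original poster's code.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 octets of U+00A0 NO-BREAK SPACE, as read from a file or socket.
my $bytes = "\xC2\xA0";

# Decoding as UTF-8 yields a single character, U+00A0.
my $chars = decode('UTF-8', $bytes);

# Mis-decoding the same octets as iso-latin-1 yields two characters:
# U+00C2 followed by U+00A0.
my $latin = decode('ISO-8859-1', $bytes);

my $bytes_len = length $bytes;
my $chars_len = length $chars;
my $latin_len = length $latin;

# \h (Perl 5.10+) matches exactly one horizontal whitespace character,
# so the anchored pattern succeeds only on the properly decoded string.
my $bytes_is_h = $bytes =~ /^\h\z/ ? 1 : 0;
my $chars_is_h = $chars =~ /^\h\z/ ? 1 : 0;

printf "octets: length %d, single \\h: %d\n", $bytes_len, $bytes_is_h;
printf "chars:  length %d, single \\h: %d\n", $chars_len, $chars_is_h;
printf "latin1: length %d, first code point 0x%02X\n",
    $latin_len, ord $latin;
```

The sketch tests with \h rather than \s because whether \s matches 0xA0 depends on the string's internal encoding, as discussed in the replies.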


Re^4: Something strange in the world or Regexes by JavaFan (Canon) on Sep 30, 2009 at 11:38 UTC
If one has 5.10 or later, one can use `/\h/`, which will match a non-breaking space regardless of whether the string is encoded in UTF-8 or not.
Re^5: Something strange in the world or Regexes by ikegami (Patriarch) on Sep 30, 2009 at 18:48 UTC
What you said is very misleading.
$ perl -le' $_ = "\xc2\xa0"; print /^\h$/ ? "h" : "not h"; '
not h
Of course, you are referring to the internal encoding.
$ perl -le'
    $_ = "\xA0";
    utf8::downgrade $_; print /^\s$/ ? "s" : "not s"; print /^\h$/ ? "h" : "not h";
    utf8::upgrade $_;   print /^\s$/ ? "s" : "not s"; print /^\h$/ ? "h" : "not h";
'
not s
h
s
h
Unfortunately, that's irrelevant in the OP's case, since he needs to decode his UTF-8 first, and that will make the internal encoding UTF-8.
Re^5: Something strange in the world or Regexes by jakobi (Pilgrim) on Sep 30, 2009 at 11:47 UTC
Thanx for this nugget. It leads to the confusing situation that \s (all whitespace) matches less than `[\h\v]` (both horizontal and vertical WS) :).
Re^6: Something strange in the world or Regexes by JavaFan (Canon) on Sep 30, 2009 at 14:08 UTC
Yes, but at least this way \s is somewhat "fixed" without breaking code. "NEXT LINE" ("\x85") is matched by \s only in UTF-8 matching, but always by \v. And perhaps more importantly, a vertical tab (aka LINE TABULATION or "\x0b") is never matched by \s, but always by \v.
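
For anyone who wants to check these cases on their own perl, here is a small sketch that probes the three characters mentioned in this thread against \s, \h and \v, once with the string downgraded and once upgraded. The hash names are made up for illustration. No expected output is listed for \s, because that column is not stable: it depends on the internal encoding, on features such as unicode_strings, and on the perl version (5.18 added the vertical tab to \s).

```perl
use strict;
use warnings;

# Characters discussed in the thread (labels are illustrative).
my %probe = (
    'NO-BREAK SPACE (0xA0)'  => "\xA0",
    'NEXT LINE (0x85)'       => "\x85",
    'LINE TABULATION (0x0B)' => "\x0B",
);

my %result;
for my $name (sort keys %probe) {
    for my $form (qw(downgraded upgraded)) {
        my $str = $probe{$name};

        # Force the internal representation without changing the character.
        if ($form eq 'upgraded') { utf8::upgrade($str) }
        else                     { utf8::downgrade($str) }

        $result{$name}{$form} = {
            s => ($str =~ /^\s\z/ ? 1 : 0),
            h => ($str =~ /^\h\z/ ? 1 : 0),
            v => ($str =~ /^\v\z/ ? 1 : 0),
        };

        printf "%-24s %-10s \\s=%d \\h=%d \\v=%d\n",
            $name, $form, @{ $result{$name}{$form} }{qw(s h v)};
    }
}
```

The \h and \v columns should come out the same for both forms, which is the point made above: those two classes do not depend on how the string is stored internally.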