in reply to Re^2: Something strange in the world or Regexes
in thread Something strange in the world or Regexes

It's the UTF-8 encoding (0xC2 0xA0) of the non-breaking space, which is not included in the "whitespace" set of chars [1], so your regex didn't match.

___

[1] Update: at least not the iso-latin-1 encoding of the character, i.e. 0xA0 (for backwards compatibility, Perl assumes iso-latin-1 by default):

print "\xa0" =~ /\s/ ? "space" : "no space"; # no space

But see below. Apparently, the 0xC2 part ("Â") somehow got lost in your case. Simply (incorrectly) treating the UTF-8 sequence as iso-latin-1 should have left you with two characters: "Â" followed by the non-breaking space.
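To make the byte-versus-character distinction concrete, here is a minimal sketch. It assumes the input arrives as raw UTF-8 bytes and that the core Encode module is used to decode them; the helper name describe is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as they might arrive from a file or socket:
# the UTF-8 encoding of U+00A0 (NO-BREAK SPACE).
my $bytes = "\xC2\xA0";

# Decoding turns the two bytes into one character, U+00A0.
my $chars = decode('UTF-8', $bytes);

# Hypothetical helper: report the length and whether the whole string is \s.
sub describe {
    my ($label, $str) = @_;
    return sprintf "%s: length=%d, \\s %s",
        $label, length($str),
        ($str =~ /^\s+$/ ? "matches" : "does not match");
}

print describe("undecoded", $bytes), "\n";
print describe("decoded",   $chars), "\n";
```

The undecoded string is two characters long, and the first of them ("Â") is not whitespace under any rules, so the pattern cannot match until the bytes have been decoded.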

Re^4: Something strange in the world or Regexes
by JavaFan (Canon) on Sep 30, 2009 at 11:38 UTC
    If one has 5.10 or later, one can use /\h/, which will match a non-breaking space regardless of whether the string is encoded in UTF-8 or not.
      What you said is very misleading.
      $ perl -le' $_ = "\xc2\xa0"; print /^\h$/ ? "h" : "not h"; '
      not h

      Of course, you are referring to the internal encoding.

      $ perl -le'
          $_ = "\xA0";
          utf8::downgrade $_;
          print /^\s$/ ? "s" : "not s";
          print /^\h$/ ? "h" : "not h";
          utf8::upgrade $_;
          print /^\s$/ ? "s" : "not s";
          print /^\h$/ ? "h" : "not h";
      '
      not s
      h
      s
      h

      Unfortunately, that's irrelevant in the OP's case, since he needs to decode his UTF-8 first, and decoding will make the internal encoding UTF-8 anyway.

      Thanx for this nugget.

      Leading to the confusing situation that \s (all whitespace) matches less than [\h\v] (both horizontal and vertical WS) :).

        Yes, but at least this way \s is somewhat "fixed" without breaking code. "NEXT LINE" ("\x85") is matched by \s only in UTF-8 matching, but always by \v. And perhaps more importantly, a vertical tab (aka LINE TABULATION or "\x0b") is never matched by \s, but always by \v.
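The differences between \s, \h and \v described above can be checked with a small script. This is only a sketch: the helper name classes and the labels are invented for illustration, and it assumes Perl 5.10 or later without the unicode_strings feature or the /u modifier. The results depend on the Perl version. In particular, more recent releases (5.18 and later, as far as I know) changed \s to include the vertical tab, so that row may differ from what the thread describes.

```perl
use 5.010;    # \h and \v need 5.10; this does not enable unicode_strings
use strict;
use warnings;

# The three characters discussed in the thread.
my %chars = (
    'NO-BREAK SPACE (A0)' => "\xA0",
    'NEXT LINE (85)'      => "\x85",
    'VERTICAL TAB (0B)'   => "\x0B",
);

# Hypothetical helper: list which of \s, \h, \v match a one-character string.
sub classes {
    my ($str) = @_;
    return join ' ',
        ($str =~ /^\s$/ ? '\s' : '--'),
        ($str =~ /^\h$/ ? '\h' : '--'),
        ($str =~ /^\v$/ ? '\v' : '--');
}

for my $name (sort keys %chars) {
    # Same character, two internal representations.
    my $native = $chars{$name};
    utf8::downgrade($native);
    my $upgraded = $chars{$name};
    utf8::upgrade($upgraded);

    printf "%-20s native: %-9s upgraded: %s\n",
        $name, classes($native), classes($upgraded);
}
```

Comparing the "native" and "upgraded" columns shows which classes depend on the internal encoding: \h and \v should give the same answer in both columns, while \s may not.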