in reply to Re^3: Something strange in the world or Regexes
in thread Something strange in the world or Regexes

If one has Perl 5.10 or later, one can use /\h/, which will match a non-breaking space regardless of whether the string is encoded in UTF8 or not.

Re^5: Something strange in the world or Regexes
by ikegami (Patriarch) on Sep 30, 2009 at 18:48 UTC
    What you said is very misleading.
    $ perl -le' $_ = "\xc2\xa0"; print /^\h$/ ? "h" : "not h"; '
    not h

    Of course, you are referring to the internal encoding.

    $ perl -le'
        $_ = "\xA0";
        utf8::downgrade $_;
        print /^\s$/ ? "s" : "not s";
        print /^\h$/ ? "h" : "not h";
        utf8::upgrade $_;
        print /^\s$/ ? "s" : "not s";
        print /^\h$/ ? "h" : "not h";
    '
    not s
    h
    s
    h

    Unfortunately, that's irrelevant in the OP's case, since he needs to decode his UTF-8 first, and decoding will make the internal encoding UTF-8.
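
To make the decode-first point concrete, here is a minimal sketch. It assumes the input arrives as raw UTF-8 bytes and that the core Encode module is used for decoding; the helper name h_label and the sample input are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Return "h" if the whole string is exactly one horizontal-whitespace
# character, "not h" otherwise. (Hypothetical helper for this sketch.)
sub h_label {
    my ($str) = @_;
    return $str =~ /^\h$/ ? "h" : "not h";
}

# Hypothetical input: the UTF-8 encoding of U+00A0 (NO-BREAK SPACE),
# as raw bytes the way they would come from a file or socket.
my $bytes = "\xc2\xa0";

# Decoding turns the two bytes into a single character, U+00A0.
my $text = decode('UTF-8', $bytes);

# Undecoded, the regex sees two characters (0xC2 and 0xA0), so the
# anchored single-character match cannot succeed.
print "undecoded: ", h_label($bytes), " (length ", length($bytes), ")\n";
print "decoded:   ", h_label($text),  " (length ", length($text),  ")\n";
```

The same pattern is applied to both strings; only the decoding step changes what the regex engine sees as a character.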

Re^5: Something strange in the world or Regexes
by jakobi (Pilgrim) on Sep 30, 2009 at 11:47 UTC

    Thanx for this nugget.

    This leads to the confusing situation that \s (all whitespace) matches less than [\h\v] (horizontal plus vertical whitespace) :).

      Yes, but at least this way \s is somewhat "fixed" without breaking code. "NEXT LINE" ("\x85") is matched by \s only in UTF-8 matching, but always by \v. And perhaps more importantly, a vertical tab (aka LINE TABULATION or "\x0b") is never matched by \s, but always by \v.
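
The differences described above can be checked with a short script. This is only a sketch: classes_for is a hypothetical helper, and it assumes no "use feature 'unicode_strings'", no locale and no /u or /a modifier, so the plain rules discussed in this thread apply. The vertical tab result for \s depends on the Perl version (as far as I know, 5.18 and later do match it), so the script prints what it finds rather than assuming an answer.

```perl
use strict;
use warnings;

# Report which of \s, \h and \v match a one-character string, after
# forcing the internal encoding to bytes (downgrade) or UTF-8 (upgrade).
# (Hypothetical helper for this sketch.)
sub classes_for {
    my ($char, $upgraded) = @_;
    my $copy = $char;
    if ($upgraded) {
        utf8::upgrade($copy);
    }
    else {
        utf8::downgrade($copy);
    }
    my @hits;
    push @hits, 's' if $copy =~ /^\s$/;
    push @hits, 'h' if $copy =~ /^\h$/;
    push @hits, 'v' if $copy =~ /^\v$/;
    return join ',', @hits;
}

my @cases = (
    [ 'NO-BREAK SPACE',  "\xA0" ],
    [ 'NEXT LINE',       "\x85" ],
    [ 'LINE TABULATION', "\x0b" ],
);

for my $case (@cases) {
    my ($name, $char) = @$case;
    printf "%-16s downgraded: %-6s upgraded: %s\n",
        $name,
        classes_for($char, 0) || '-',
        classes_for($char, 1) || '-';
}
```

Each row lists the classes that matched, so it is easy to see where \s and [\h\v] disagree on a given perl.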