Skeeve has asked for the wisdom of the Perl Monks concerning the following question:

Hi fellow monks! Is it expected behaviour that
$ascii="\xa0\xa0";
# correction thanks to idsfa
$unicode="\x{2022}\xa0";
print "ascii: ", $ascii=~ /\s/ ? "yes":"no","\n";
print "unicode: ",$unicode=~/\s/ ? "yes":"no","\n";
prints
ascii: no
unicode: yes
I would expect it to be either both times yes or both times no.

$\=~s;s*.*;q^|D9JYJ^^qq^\//\\\///^;ex;print

Replies are listed 'Best First'.
Re: inconsistency in whitespace handling
by idsfa (Vicar) on May 12, 2005 at 14:17 UTC

    Just naming the variable "unicode" doesn't make it unicode. What you are actually creating is:

    $notunicode = "\x20" . "22" . "\xa0";

    And \x20 is definitely whitespace. Try:

    $unicode = "\x{2022}\xa0";

    perluniintro may have additional help.

    Updated: etiquette complaint removed by author.


    The intelligent reader will judge for himself. Without examining the facts fully and fairly, there is no way of knowing whether vox populi is really vox dei, or merely vox asinorum. -- Cyrus H. Gordon
Re: inconsistency in whitespace handling
by bart (Canon) on May 12, 2005 at 14:23 UTC
    I'd like to point out that despite the character error in Skeeve's post ("\x2022" is a space (!) followed by "22", oops), his report is for real.
    $latin = "\xa0"; # nbsp
    $unicode= $latin . pack 'U0'; # convert to UTF-8
    print "Latin 1: ", $latin =~ /\s/ ? "yes":"no", "\n";
    print "Unicode: ", $unicode =~/\s/ ? "yes":"no", "\n";
    result:
    Latin 1: no
    Unicode: yes
    

      True. That's because unicode changes the definition of whitespace. Until you go Unicode, perl defines whitespace as:

      \s A whitespace character [ \t\n\r\f]

      But once you're in Unicode, it honors the character's Unicode White_Space property. (Which is set for U+00A0, the non-breaking space in this case.)

      Updated: The same applies to thundergnat's discovery.
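
      A minimal sketch of that switch, assuming Perl 5.14 or later (needed for the /u and /a regex modifiers) and no locale or feature pragmas in effect. The core function utf8::upgrade changes only how the string is stored internally, yet the answer \s gives for chr(160) changes with it. The helper name matches_space is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether \s matches anywhere in the string,
# using Perl's default regex rules, which depend on how the string is stored.
sub matches_space {
    my ($str) = @_;
    return $str =~ /\s/ ? "yes" : "no";
}

my $bytes    = "\xa0";       # NBSP stored as a single Latin-1 byte
my $upgraded = $bytes;
utf8::upgrade($upgraded);    # same character, now stored internally as UTF-8

print "byte string:     ", matches_space($bytes),    "\n";
print "upgraded string: ", matches_space($upgraded), "\n";

# Explicit modifiers take the internal representation out of the picture:
print "byte string with /u:     ", ($bytes    =~ /\s/u ? "yes" : "no"), "\n"; # Unicode rules
print "upgraded string with /a: ", ($upgraded =~ /\s/a ? "yes" : "no"), "\n"; # ASCII-only \s
```

      Under those assumptions I would expect "no" and "yes" for the first two lines, then "yes" with /u and "no" with /a. If you need one consistent answer, picking a modifier explicitly seems safer than relying on how the string happens to be stored.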


Re: inconsistency in whitespace handling
by thundergnat (Deacon) on May 12, 2005 at 14:37 UTC

    Interestingly enough, if you use a named property class (\p{Space}) instead of the \s assertion, it finds the non-breaking space correctly in both strings.

    $ascii = "\xa0\x{a0}";
    $unicode = "\x{100}\xa0\x{a0}";
    print "Latin-1 \\s : ", $ascii =~ /\s/ ? "yes":"no","\n";
    print "Latin-1 \\p{Space}: ", $ascii =~ /\p{Space}/ ? "yes":"no","\n\n";
    print "Unicode \\s: ",$unicode =~ /\s/ ? "yes":"no","\n";
    print "Unicode \\p{Space}: ",$unicode =~ /\p{Space}/ ? "yes":"no","\n";

    I certainly wouldn't expect non-breaking space to be recognized or not as a space depending on what else was in the string.

    (I even experimented with whether the enclosing brackets in \x{a0} were significant... apparently not.)

      I think this is because, underneath, the \p{Foo} stuff generates a different regex opcode which calls through the utf routines even if the source string isn't marked as utf.

        You'd be correct: UTF-8 in either the text being matched or the pattern causes UTF-8 semantics to apply to the whole regex. A good example of the oddness this causes is the differing handling of the German sharp s (ß). If you use extended ASCII, a case-insensitive pattern will not match 'ss'; if you use UTF-8, it will. :-)
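
        A small sketch of the sharp s case, assuming a reasonably modern Perl (5.14 or later) with no locale or feature pragmas in effect. The helper name folds_to_ss is hypothetical; utf8::upgrade is the core function that changes the internal storage without changing the character.

```perl
use strict;
use warnings;

# Hypothetical helper: does the string match the pattern "ss"
# case-insensitively under Perl's default regex rules?
sub folds_to_ss {
    my ($str) = @_;
    return $str =~ /ss/i ? "yes" : "no";
}

my $sharp_s_bytes = "\xdf";          # sharp s stored as one Latin-1 byte
my $sharp_s_utf8  = $sharp_s_bytes;
utf8::upgrade($sharp_s_utf8);        # same character, stored as UTF-8

print "byte string matches /ss/i:     ", folds_to_ss($sharp_s_bytes), "\n";
print "upgraded string matches /ss/i: ", folds_to_ss($sharp_s_utf8),  "\n";
```

        As far as I can tell, the byte string gives "no" and the upgraded one gives "yes", because only the Unicode rules know that sharp s case-folds to "ss".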

        ---
        $world=~s/war/peace/g

Re: inconsistency in whitespace handling
by Fletch (Bishop) on May 12, 2005 at 14:13 UTC

    Erm, strictly speaking ASCII is characters 0x0-0x7f. 0xa0 is (I believe) Latin-1 and wouldn't be matched as \s unless it was in a Unicode string.

      The meaning of chr(160) shouldn't change between Latin-1 and UTF-8, despite the different representation as bytes. It is the same character, Latin-1 being a subset of Unicode.
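
      To illustrate that point, here is a short sketch (variable names are mine, and it assumes no feature or encoding pragmas are in effect): the two scalars hold the same one-character string as far as eq, length and ord are concerned, and differ only in the internal UTF-8 flag.

```perl
use strict;
use warnings;

my $latin1   = chr(160);     # stored as a single byte
my $upgraded = chr(160);
utf8::upgrade($upgraded);    # stored internally as UTF-8 (two bytes)

printf "equal as strings: %s\n", ($latin1 eq $upgraded ? "yes" : "no");
printf "lengths:          %d and %d\n", length($latin1), length($upgraded);
printf "code points:      %d and %d\n", ord($latin1), ord($upgraded);
printf "UTF-8 flag:       %s and %s\n",
    (utf8::is_utf8($latin1)   ? "on" : "off"),
    (utf8::is_utf8($upgraded) ? "on" : "off");
```

      So at the string level nothing distinguishes the two, which is why a regex answer that depends on the flag is surprising.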

        Right, but my point was that 0xa0 isn't considered a space character for a plain vanilla ASCII scalar without the utf magic enabled (underneath, Perl is calling isspace(3), which only considers space, form feed, newline, carriage return, horizontal tab, and vertical tab to be whitespace).