inconsistency in whitespace handling

Skeeve has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: inconsistency in whitespace handling by idsfa (Vicar) on May 12, 2005 at 14:17 UTC
Just naming the variable "unicode" doesn't make it Unicode. What you are actually creating is: `$notunicode = "\x20" . "22" . "\xa0";` And `\x20` is definitely whitespace. Try: `$unicode = "\x{2022}\xa0";` perluniintro may have additional help.

Updated: etiquette complaint removed by author.

The intelligent reader will judge for himself. Without examining the facts fully and fairly, there is no way of knowing whether vox populi is really vox dei, or merely vox asinorum. -- Cyrus H. Gordon
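For anyone who wants to see the difference directly, here is a minimal sketch. The variable names `$short` and `$braced` are made up for illustration and do not come from the original post. It only inspects the lengths and code points of the two literals.

```perl
use strict;
use warnings;

# "\x" without braces consumes at most two hex digits.
my $short  = "\x2022";     # U+0020 (space) followed by the literal characters "22"
my $braced = "\x{2022}";   # the single character U+2022 (BULLET)

for my $pair ( [ short => $short ], [ braced => $braced ] ) {
    my ($name, $str) = @$pair;
    my @points = map { sprintf('U+%04X', ord($_)) } split //, $str;
    printf "%-7s length %d, code points %s\n", "$name:", length($str), "@points";
}
# short:  length 3, code points U+0020 U+0032 U+0032
# braced: length 1, code points U+2022
```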
Re: inconsistency in whitespace handling by bart (Canon) on May 12, 2005 at 14:23 UTC
I'd like to point out that despite the character error in Skeeve's post (`"\x2022"` is a space (!) followed by "22", oops), his report is for real.

    $latin   = "\xa0";               # nbsp
    $unicode = $latin . pack 'U0';   # convert to UTF-8
    print "Latin 1: ", $latin   =~ /\s/ ? "yes" : "no", "\n";
    print "Unicode: ", $unicode =~ /\s/ ? "yes" : "no", "\n";

Result:

    Latin 1: no
    Unicode: yes
Re^2: inconsistency in whitespace handling by idsfa (Vicar) on May 12, 2005 at 14:40 UTC
True. That's because Unicode changes the definition of whitespace. Until you go Unicode, Perl defines whitespace as: `\s A whitespace character [ \t\n\r\f]` But once you're in Unicode, it honors the character's Unicode White_Space property (which is set in this case).

Updated: The same applies to thundergnat's discovery.
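The effect can be reproduced without `pack 'U0'` by upgrading a copy of the string explicitly. This is only a sketch under stated assumptions: no `use locale` and no `unicode_strings` feature in scope (both change how `\s` treats the 0x80-0xFF range), and the names `$native` and `$upgraded` are illustrative.

```perl
use strict;
use warnings;

# Assumption: neither "use locale" nor the "unicode_strings" feature is active.
my $native   = "\xa0";       # NBSP stored as a single byte
my $upgraded = $native;
utf8::upgrade($upgraded);    # same character, now stored internally as UTF-8

# The two scalars compare equal as strings; only the internal storage differs.
print "equal as strings: ", ($native eq $upgraded ? "yes" : "no"), "\n";
print "native   =~ /\\s/: ", ($native   =~ /\s/ ? "yes" : "no"), "\n";
print "upgraded =~ /\\s/: ", ($upgraded =~ /\s/ ? "yes" : "no"), "\n";
```

Under those assumptions the first and third lines should report "yes" and the second "no", which is the inconsistency bart describes.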
Re: inconsistency in whitespace handling by thundergnat (Deacon) on May 12, 2005 at 14:37 UTC
Interestingly enough, if you use a named property class (`\p{Space}`) instead of the `\s` assertion, it finds it correctly in both strings.

    $ascii   = "\xa0\x{a0}";
    $unicode = "\x{100}\xa0\x{a0}";
    print "Latin-1 \\s : ",        $ascii   =~ /\s/         ? "yes" : "no", "\n";
    print "Latin-1 \\p{Space}: ",  $ascii   =~ /\p{Space}/  ? "yes" : "no", "\n\n";
    print "Unicode \\s: ",         $unicode =~ /\s/         ? "yes" : "no", "\n";
    print "Unicode \\p{Space}: ",  $unicode =~ /\p{Space}/  ? "yes" : "no", "\n";

I certainly wouldn't expect non-breaking space to be recognized or not as a space depending on what else was in the string. (I even experimented with whether the enclosing brackets in `\x{a0}` were significant... apparently not.)
Re^2: inconsistency in whitespace handling by Fletch (Bishop) on May 12, 2005 at 14:45 UTC
I think this is because, underneath, the `\p{Foo}` stuff generates a different regex opcode, which calls through the UTF-8 routines even if the source string isn't marked as UTF-8.
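A small sketch that puts `\s` and `\p{Space}` side by side on a string that has not been upgraded. The `/u` and `/a` modifiers are not mentioned in this thread. They are an assumption that the code runs on perl 5.14 or later, where they force Unicode or ASCII rules explicitly. The names `$nbsp` and `@checks` are illustrative.

```perl
use strict;
use warnings;

# Assumptions: perl 5.14+ (for /u and /a), no "use locale",
# and no "unicode_strings" feature in scope.
my $nbsp = "\xa0";   # stored as a single byte, not upgraded to UTF-8

my @checks = (
    [ '/\s/'        => $nbsp =~ /\s/        ? 1 : 0 ],   # default rules
    [ '/\p{Space}/' => $nbsp =~ /\p{Space}/ ? 1 : 0 ],   # Unicode property
    [ '/\s/u'       => $nbsp =~ /\s/u       ? 1 : 0 ],   # force Unicode rules
    [ '/\s/a'       => $nbsp =~ /\s/a       ? 1 : 0 ],   # force ASCII rules
);

printf "%-12s %s\n", $_->[0], ($_->[1] ? "yes" : "no") for @checks;
```

If the behaviour discussed here holds on your perl, only the `\p{Space}` and `/u` rows should say "yes". Spelling out the modifier is one way to stop the result depending on how the string happens to be stored.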
Re^3: inconsistency in whitespace handling by demerphq (Chancellor) on May 13, 2005 at 11:18 UTC
You'd be correct: UTF-8 in either the text being matched or the pattern causes UTF-8 semantics to apply to the whole regex. A good example of the oddness this causes is the differing handling of the German sharp s (ß). If you use extended ASCII, a case-insensitive pattern will not match 'ss'; if you use UTF-8, it will. :-)

--- $world=~s/war/peace/g
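The sharp s case can be tried with the sketch below. It builds the same one-character pattern twice, once from a byte string and once from an upgraded copy, and reports what your perl does with each. The names `$native` and `$upgraded` are illustrative, and the exact results may depend on the perl version and on any `locale` or `unicode_strings` settings in scope, so no output is shown.

```perl
use strict;
use warnings;

my $native   = "\xdf";       # sharp s as a single Latin-1 byte
my $upgraded = $native;
utf8::upgrade($upgraded);    # same character, stored internally as UTF-8

# Interpolating an upgraded string makes the whole pattern UTF-8.
for my $case ( [ 'native pattern' => $native ], [ 'upgraded pattern' => $upgraded ] ) {
    my ($label, $pat) = @$case;
    my $matched = 'ss' =~ /$pat/i;
    printf "%-17s 'ss' matches: %s\n", "$label:", ($matched ? "yes" : "no");
}
```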
Re: inconsistency in whitespace handling by Fletch (Bishop) on May 12, 2005 at 14:13 UTC
Erm, strictly speaking ASCII is characters 0x0-0x7f. 0xa0 is (I believe) Latin-1 and wouldn't be matched as `\s` unless it was in a Unicode string.
Re^2: inconsistency in whitespace handling by bart (Canon) on May 12, 2005 at 14:25 UTC
The meaning of chr(160) shouldn't change between Latin-1 and UTF-8, despite the different representation as bytes. It is the same character, Latin-1 being a subset of Unicode.
Re^3: inconsistency in whitespace handling by Fletch (Bishop) on May 12, 2005 at 14:41 UTC
Right, but my point was that 0xa0 isn't considered a space character for a plain vanilla ASCII scalar without the utf magic enabled (underneath, Perl's calling `isspace(3)`, which only considers the characters space, form-feed, newline, carriage return, horizontal tab, and vertical tab to be whitespace).