Re^3: Something strange in the world of Regexes

Good catch! You caught me giving the brief answer. Let's start by demonstrating what you describe:

$ perl -MEncode -le'
    $_ = decode("UTF-8", encode("UTF-8", "\xA0"));
    print /\s/ ? "space" : "no space";

    $_ = decode("iso-latin-1", encode("iso-latin-1", "\xA0"));
    print /\s/ ? "space" : "no space";

    $_ = "\xA0";
    print /\s/ ? "space" : "no space";

    $_ = "\xA0\x{2660}";
    print /^\s/ ? "space" : "no space";
'
space
space
no space
space

Regex matching follows one of two sets of rules: "byte semantics" or "unicode semantics".[*1] The set of rules used is determined by the internal encoding of the string used to build the pattern and/or the internal encoding of the string against which the pattern is being matched.[*2]

By default, strings are internally encoded as iso-latin-1 if possible.[*3] On the other hand, the decoding facilities of Encode, utf8 and PerlIO::encoding return strings internally encoded as utf8. This enables unicode semantics on matching.
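To see which internal encoding a given string has, you can ask utf8::is_utf8. The following is only a minimal sketch: the helper name `storage` is made up for illustration, and it assumes the literal "\xA0" is stored in the default (iso-latin-1) format while the result of Encode's decode is stored as utf8, as described above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: report which internal storage format a string uses.
sub storage { return utf8::is_utf8($_[0]) ? "utf8" : "latin-1" }

my $literal = "\xA0";                        # plain literal, code point below 256
my $decoded = decode("UTF-8", "\xC2\xA0");   # same character, decoded from UTF-8 bytes

print "literal: ", storage($literal), "\n";
print "decoded: ", storage($decoded), "\n";

# Both hold the single character U+00A0; only the internal storage differs.
print "strings are ", ($literal eq $decoded ? "equal" : "different"), "\n";
```

The two strings compare equal, which is why the difference in matching behaviour is so surprising: the internal format is meant to be an implementation detail.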

Under byte semantics, \s matches whitespace in the ASCII range only. Under unicode semantics, \s matches anything Unicode considers whitespace[*4], which includes NBSP (U+00A0).

The internal encoding of a string can be manipulated using utf8::upgrade and utf8::downgrade.
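As a rough illustration of how those two functions flip the matching rules, here is a small sketch. The helper name `matches_space` is hypothetical. It assumes the default regex behaviour described in this post, so it deliberately does not enable the unicode_strings feature (for example via `use v5.12` or later), which would change the result.

```perl
use strict;
use warnings;

# Hypothetical helper: does the string start with something \s matches?
sub matches_space { return $_[0] =~ /^\s/ ? "space" : "no space" }

my $s = "\xA0";                  # NBSP, stored in the default format
print matches_space($s), "\n";   # byte semantics apply here

utf8::upgrade($s);               # same character, now stored as utf8
print matches_space($s), "\n";   # unicode semantics apply here

utf8::downgrade($s);             # back to the default format
print matches_space($s), "\n";   # byte semantics again
```

Under the rules above this should print "no space", "space", "no space": the string's contents never change, only its internal encoding.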

*1 — This post doesn't discuss the effects of use locale, if any.

*2 — Expect (backwards compatible) changes in this area in 5.12.

*3 — This post doesn't discuss the effects of (broken) use encoding, if any.

*4 — There are bugs in many properties, but I don't think \s has any errors. These are being fixed for 5.12.


Re^4: LC_*: Something horrible in the world of Regexes by jakobi (Pilgrim) on Sep 30, 2009 at 17:54 UTC
way outside the opener, but related: How will(/are) shell LANG/LC_* variables be handled? I'm 90% convinced that is sanest to ignore those. Esp. this heresy: There's that painful POSIXishly sick but "officially correct" problem of many utf8 locales suddenly having rather strange collating sequences, making a mess of the most trivial shell patterns like e.g. A-Z* in bash for e.g. de_DE.utf8 or en_US.utf8. Seeing an A-Z* glob suddenly match thisshouldnotmatchbutdoesgeethanxposix nearly made me return to bed hoping for the nightmare to stop. It required finding the antidote of LC_COLLATE=C to recover. Now while I place some trust that regexes won't fall victim to that collation malsequencing insanity, what about perl's glob patterns?
Re^5: Something horrible in the world of Regexes - attack of the posix zombies by ikegami (Patriarch) on Sep 30, 2009 at 18:14 UTC
"How will(/are) shell LANG/LC_* variables be handled? I'm 90% convinced that is sanest to ignore those."

Those are definitely ignored if you don't use locale. That's about all I know. `use open ':std', ':locale';` is useful if you want to use the locale's encoding (and nothing else) for STD*. ref: open