in reply to Re: Something strange in the world or Regexes
in thread Something strange in the world or Regexes

Interesting observation (to me at least):

The single line binmode STDIN, ":encoding(X)" does extend \s, for both X = LATIN1 and X = UTF8, to include the non-breaking space. Without it, regardless of the LANG and LC_* variables set in the shell, you get the old semantics for \s. This is not (yet?) mentioned in perldoc -f binmode, but OTOH that page does mention a nice way to flush STDIN.
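For anyone who wants to reproduce this without piping data into STDIN, here is a minimal sketch that reads the same bytes from an in-memory file. It passes the layer to open rather than calling binmode on STDIN, which I would expect to behave the same way. The helper name first_char_is_space is made up for this illustration, and the result assumes the default regex rules (no use locale, no unicode_strings feature, no /u flag).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): read one line from an
# in-memory file through the given PerlIO layer and report how \s
# treats its first character.
sub first_char_is_space {
    my ($layer, $bytes) = @_;
    open(my $fh, "<$layer", \$bytes) or die "open failed: $!";
    my $line = <$fh>;
    close $fh;
    return $line =~ /^\s/ ? "space" : "no space";
}

# NBSP is the single byte 0xA0 in Latin-1 and the pair 0xC2 0xA0 in UTF-8.
my $raw    = first_char_is_space("",                      "\xA0x");
my $latin1 = first_char_is_space(":encoding(iso-8859-1)", "\xA0x");
my $utf8   = first_char_is_space(":encoding(UTF-8)",      "\xC2\xA0x");

print "no layer:  $raw\n";
print "iso-8859-1: $latin1\n";
print "UTF-8:      $utf8\n";
```

With no layer the line comes back as plain bytes and \s keeps its old meaning; with either encoding layer the line is decoded and NBSP counts as whitespace.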

Looks like binmode and PerlIO got way more interesting in the meantime :).

Re^3: Something strange in the world or Regexes
by ikegami (Patriarch) on Sep 30, 2009 at 17:30 UTC
    Good catch! You caught me giving the brief answer. Let's start by demonstrating what you describe:

    $ perl -MEncode -le'
        $_ = decode("UTF-8", encode("UTF-8", "\xA0"));
        print /\s/ ? "space" : "no space";

        $_ = decode("iso-latin-1", encode("iso-latin-1", "\xA0"));
        print /\s/ ? "space" : "no space";

        $_ = "\xA0";
        print /\s/ ? "space" : "no space";

        $_ = "\xA0\x{2660}";
        print /^\s/ ? "space" : "no space";
    '
    space
    space
    no space
    space

    Regex matching follows one of two sets of rules: "byte semantics" or "unicode semantics".[*1] The set of rules used is determined by the internal encoding of the string used to build the pattern and/or the internal encoding of the string against which the pattern is being matched.[*2]

    By default, strings are internally encoded as iso-latin-1 if possible.[*3] On the other hand, the decoding facilities of Encode, utf8 and PerlIO::encoding return strings internally encoded as utf8. This enables unicode semantics on matching.
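The internal encoding can be inspected with utf8::is_utf8. The following is only a small sketch of the point above, with variable names chosen for the example: a literal and a decoded string hold the same text but are stored differently.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $literal = "\xA0";                        # string literal, stored as one byte
my $decoded = decode("iso-8859-1", "\xA0");  # same character, returned by Encode

# The two strings compare equal as text...
my $same_text = $literal eq $decoded ? 1 : 0;

# ...but only the decoded one is stored internally as utf8.
my $literal_flag = utf8::is_utf8($literal) ? 1 : 0;
my $decoded_flag = utf8::is_utf8($decoded) ? 1 : 0;

print "same text:       $same_text\n";
print "literal is utf8: $literal_flag\n";
print "decoded is utf8: $decoded_flag\n";
```

The flag is an implementation detail that normally should not matter; it matters here only because the regex engine consults it when choosing between the two sets of rules.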

    Under byte semantics, \s matches whitespace in the ASCII range only. Under unicode semantics, \s matches anything Unicode considers whitespace[*4], which includes NBSP (U+00A0).

    The internal encoding of a string can be manipulated using utf8::upgrade and utf8::downgrade.
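As a rough illustration of those two functions, the sketch below flips the internal encoding of one NBSP string back and forth and records what \s does each time. It assumes the default rules discussed in this post (no use locale, no unicode_strings feature, no /u flag); under other settings all three checks may match.

```perl
use strict;
use warnings;

my $nbsp = "\xA0";   # starts out stored as a single byte
my @results;

# Byte semantics: NBSP is outside the ASCII whitespace set.
push @results, ($nbsp =~ /\s/ ? "space" : "no space");

# Same character, now stored internally as utf8: unicode semantics apply.
utf8::upgrade($nbsp);
push @results, ($nbsp =~ /\s/ ? "space" : "no space");

# Back to the single-byte form: byte semantics again.
utf8::downgrade($nbsp);
push @results, ($nbsp =~ /\s/ ? "space" : "no space");

print join(", ", @results), "\n";
```

The string's contents never change here; only its storage does, and that alone is enough to change the match.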

    *1 — This post doesn't discuss the effects of use locale, if any.

    *2 — Expect (backwards compatible) changes in this area in 5.12.

    *3 — This post doesn't discuss the effects of (broken) use encoding, if any.

    *4 — There are bugs in many properties, but I don't think \s has any errors. These are being fixed for 5.12.

      way outside the opener, but related:

      How will(/are) shell LANG/LC_* variables be handled? I'm 90% convinced that it is sanest to ignore those.

      Esp. this heresy:

      There's that painful, POSIXishly sick but "officially correct" problem of many utf8 locales suddenly having rather strange collating sequences, making a mess of the most trivial shell patterns, e.g. A-Z* in bash under de_DE.utf8 or en_US.utf8. Seeing an A-Z* glob suddenly match thisshouldnotmatchbutdoesgeethanxposix nearly made me return to bed hoping for the nightmare to stop. It required finding the antidote of LC_COLLATE=C to recover.

      Now while I place some trust that regexes won't fall victim to that collation malsequencing insanity, what about perl's glob patterns?

        How will(/are) shell LANG/LC_* variables be handled? I'm 90% convinced that it is sanest to ignore those.

        Those are definitely ignored if you don't use locale. That's about all I know.

        use open ':std', ':locale'; is useful if you want to use the locale's encoding (and nothing else) for STD*. ref: open