Re^2: Something strange in the world or Regexes

Re^3: Something strange in the world or Regexes by ikegami (Patriarch) on Sep 30, 2009 at 17:30 UTC
Good catch! You caught me giving the brief answer. Let's start by demonstrating what you describe:

`$ perl -MEncode -le' $_ = decode("UTF-8", encode("UTF-8", "\xA0")); print /\s/ ? "space" : "no space"; $_ = decode("iso-latin-1", encode("iso-latin-1", "\xA0")); print /\s/ ? "space" : "no space"; $_ = "\xA0"; print /\s/ ? "space" : "no space"; $_ = "\xA0\x{2660}"; print /^\s/ ? "space" : "no space"; '`

The four tests print, in order: `space`, `space`, `no space`, `space`.

Regex matching follows one of two sets of rules: "byte semantics" and "unicode semantics".[1] The set of rules used is determined by the internal encoding of the string used to build the pattern and/or the internal encoding of the string against which the pattern is being matched.[2]

By default, strings are internally encoded as iso-latin-1 if possible.[3] On the other hand, the decoding facilities of Encode, utf8 and PerlIO::encoding return strings internally encoded as utf8. This enables unicode semantics on matching. That is why both decoded strings in the demonstration match `\s`, while the plain `"\xA0"` literal does not. The last string contains U+2660, which cannot be stored as iso-latin-1, so it is internally encoded as utf8 as well.

Under byte semantics, `\s` matches whitespace in the ASCII range only. Under unicode semantics, `\s` matches anything Unicode considers whitespace[4], which includes NBSP (U+00A0).

The internal encoding of a string can be manipulated using `utf8::upgrade` and `utf8::downgrade`.

[1] This post doesn't discuss the effects of `use locale`, if any.
[2] Expect (backwards compatible) changes in this area in 5.12.
[3] This post doesn't discuss the effects of (broken) `use encoding`, if any.
[4] There are bugs in many properties, but I don't think `\s` has any errors. These are being fixed for 5.12.
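To make the last point concrete, here is a minimal sketch of switching a string's internal encoding in place and re-running the same `\s` test. The helper name `has_space` is made up for this illustration. The sketch assumes the pattern is compiled under perl's default rules, that is, without `use feature 'unicode_strings'`, `use locale`, or an explicit `/u`, `/a` or `/l` modifier. By the rules described in this post, the first and third lines should report "no space" and the second "space".

```perl
use strict;
use warnings;

# Hypothetical helper: same test as in the one-liner above.
sub has_space {
    my ($str) = @_;
    return $str =~ /\s/ ? "space" : "no space";
}

# Print a label, the internal UTF8 flag, and the result of the \s test.
sub report {
    my ($label, $str) = @_;
    printf "%-11s utf8 flag=%d  %s\n",
        $label, (utf8::is_utf8($str) ? 1 : 0), has_space($str);
}

my $s = "\xA0";        # NBSP, stored one byte per character by default
report("literal", $s);

utf8::upgrade($s);     # same character, now internally encoded as utf8
report("upgraded", $s);

utf8::downgrade($s);   # back to the one-byte-per-character form
report("downgraded", $s);
```

The string holds the same single character (U+00A0) throughout; only its internal storage changes, which is what selects the matching rules here.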
Re^4: LC_*: Something horrible in the world of Regexes by jakobi (Pilgrim) on Sep 30, 2009 at 17:54 UTC
Way outside the opener, but related: How will(/are) shell LANG/LC_* variables be handled? I'm 90% convinced that it is sanest to ignore those.

Esp. this heresy: There's that painful, POSIXishly sick but "officially correct" problem of many utf8 locales suddenly having rather strange collating sequences, making a mess of the most trivial shell patterns, e.g. A-Z* in bash for de_DE.utf8 or en_US.utf8. Seeing an A-Z* glob suddenly match thisshouldnotmatchbutdoesgeethanxposix nearly made me return to bed hoping for the nightmare to stop. It required finding the antidote of LC_COLLATE=C to recover.

Now, while I place some trust that regexes won't fall victim to that collation malsequencing insanity, what about perl's glob patterns?
Re^5: Something horrible in the world of Regexes - attack of the posix zombies by ikegami (Patriarch) on Sep 30, 2009 at 18:14 UTC
"How will(/are) shell LANG/LC_* variables be handled? I'm 90% convinced that is sanest to ignore those."

Those are definitely ignored if you don't `use locale`. That's about all I know.

`use open ':std', ':locale';` is useful if you want to use the locale's encoding (and nothing else) for STD*.

ref: open