in reply to Re^2: Something strange in the world or Regexes
in thread Something strange in the world or Regexes
$ perl -MEncode -le'
   $_ = decode("UTF-8", encode("UTF-8", "\xA0"));
   print /\s/ ? "space" : "no space";
   $_ = decode("iso-latin-1", encode("iso-latin-1", "\xA0"));
   print /\s/ ? "space" : "no space";
   $_ = "\xA0";
   print /\s/ ? "space" : "no space";
   $_ = "\xA0\x{2660}";
   print /^\s/ ? "space" : "no space";
'
space
space
no space
space
Regex matching follows one of two sets of rules: "byte semantics" or "unicode semantics".[*1] Which set is used is determined by the internal encoding of the string used to build the pattern and/or the internal encoding of the string against which the pattern is being matched.[*2]
By default, strings are internally encoded as iso-latin-1 if possible.[*3] On the other hand, the decoding facilities of Encode, utf8 and PerlIO::encoding return strings internally encoded as utf8. This enables unicode semantics on matching.
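To see which internal encoding a given string ended up with, one option is utf8::is_utf8, which reports whether the string is stored as utf8. The sketch below is only illustrative: internal_format is a hypothetical helper name, not something from the original post, and the labels it returns are my own.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: name the internal storage format of a string.
sub internal_format {
    my ($str) = @_;
    return utf8::is_utf8($str) ? "utf8" : "latin-1";
}

my $literal = "\xA0";                                    # fits in iso-latin-1
my $decoded = decode("UTF-8", encode("UTF-8", "\xA0"));  # returned by Encode
my $wide    = "\xA0\x{2660}";                            # U+2660 cannot fit in iso-latin-1

print internal_format($literal), "\n";   # latin-1
print internal_format($decoded), "\n";   # utf8
print internal_format($wide),    "\n";   # utf8
```

All three strings start with the same character, U+00A0; only the storage format differs.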
Under byte semantics, \s matches whitespace in the ASCII range only. Under unicode semantics, \s matches anything Unicode considers whitespace[*4], which includes NBSP (U+00A0).
The internal encoding of a string can be manipulated using utf8::upgrade and utf8::downgrade.
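A minimal sketch of that manipulation, assuming the default regex semantics described in this post (no unicode_strings feature and no /u or /a modifier, both of which exist only in later perls). matches_space is a hypothetical helper name used just for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: does \s match anywhere in the string?
sub matches_space {
    my ($str) = @_;
    return $str =~ /\s/ ? "space" : "no space";
}

my $s = "\xA0";                    # stored as iso-latin-1 by default
print matches_space($s), "\n";     # no space  (byte semantics)

utf8::upgrade($s);                 # same character, now stored as utf8
print matches_space($s), "\n";     # space     (unicode semantics)

utf8::downgrade($s);               # back to iso-latin-1 storage
print matches_space($s), "\n";     # no space
```

The string holds the same single character throughout; only its internal storage changes, and with it the answer \s gives.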
*1 — This post doesn't discuss the effects of use locale, if any.
*2 — Expect (backwards compatible) changes in this area in 5.12.
*3 — This post doesn't discuss the effects of (broken) use encoding, if any.
*4 — There are bugs in many properties, but I don't think \s has any errors. These are being fixed for 5.12.
Re^4: LC_*: Something horrible in the world of Regexes
by jakobi (Pilgrim) on Sep 30, 2009 at 17:54 UTC
by ikegami (Patriarch) on Sep 30, 2009 at 18:14 UTC