$ perl -MEncode -le'
   $_ = decode("UTF-8", encode("UTF-8", "\xA0"));
   print /\s/ ? "space" : "no space";

   $_ = decode("iso-latin-1", encode("iso-latin-1", "\xA0"));
   print /\s/ ? "space" : "no space";

   $_ = "\xA0";
   print /\s/ ? "space" : "no space";

   $_ = "\xA0\x{2660}";
   print /^\s/ ? "space" : "no space";
'
space
space
no space
space
Regex matching follows one of two sets of rules: "byte semantics" or "unicode semantics".[*1] Which set applies is determined by the internal encoding of the string used to build the pattern and/or the internal encoding of the string against which the pattern is being matched.[*2]
By default, strings are internally encoded as iso-latin-1 if possible.[*3] On the other hand, the decoding facilities of Encode, utf8 and PerlIO::encoding return strings internally encoded as utf8. This enables unicode semantics on matching.
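To see which internal encoding a given string ended up with, utf8::is_utf8 can be used. The following is only a sketch: the variable names are made up for illustration, and it assumes a perl where the rules described above apply.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Illustrative names only; none of these come from the original one-liner.
my $native  = "\xA0";                                    # fits in iso-latin-1
my $decoded = decode("UTF-8", encode("UTF-8", "\xA0"));  # came through Encode
my $wide    = "\xA0\x{2660}";                            # U+2660 cannot be stored as iso-latin-1

for ( [ native => $native ], [ decoded => $decoded ], [ wide => $wide ] ) {
    my ($name, $str) = @$_;

    # utf8::is_utf8 reports the internal storage format, not the content.
    printf "%-8s %s\n", $name, utf8::is_utf8($str) ? "utf8" : "iso-latin-1";
}
```

Only $native should be reported as iso-latin-1, which matches the one case above where \s did not match.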
Under byte semantics, \s matches whitespace in the ASCII range only. Under unicode semantics, \s matches anything Unicode considers whitespace[*4], which includes NBSP (U+00A0).
The internal encoding of a string can be manipulated using utf8::upgrade and utf8::downgrade.
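A minimal sketch of switching semantics on the same string follows. The classify helper is hypothetical, and the sketch assumes that nothing else overrides the rules described above (no use feature 'unicode_strings', no use locale, no /u, /a or /l regex modifiers).

```perl
use strict;
use warnings;

# Hypothetical helper: reports how \s treats the first character of a string.
sub classify { $_[0] =~ /^\s/ ? "space" : "no space" }

my $s = "\xA0";              # stored as iso-latin-1 by default
print classify($s), "\n";    # matched under byte semantics

utf8::upgrade($s);           # same characters, now stored as utf8
print classify($s), "\n";    # matched under unicode semantics

utf8::downgrade($s);         # back to iso-latin-1; dies if that is impossible
print classify($s), "\n";    # matched under byte semantics again
```

Neither function changes the characters in the string ($s eq "\xA0" holds throughout); only the storage format changes, and with it the set of rules the regex engine picks.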
*1 — This post doesn't discuss the effects of use locale, if any.
*2 — Expect (backwards compatible) changes in this area in 5.12.
*3 — This post doesn't discuss the effects of (broken) use encoding, if any.
*4 — There are bugs in many properties, but I don't think \s has any errors. These are being fixed for 5.12.
In reply to Re^3: Something strange in the world or Regexes
by ikegami
in thread Something strange in the world or Regexes
by mrguy123