
Good catch! You caught me giving the brief answer. Let's start by demonstrating what you describe:

$ perl -MEncode -le'
    $_ = decode("UTF-8", encode("UTF-8", "\xA0"));
    print /\s/ ? "space" : "no space";

    $_ = decode("iso-latin-1", encode("iso-latin-1", "\xA0"));
    print /\s/ ? "space" : "no space";

    $_ = "\xA0";
    print /\s/ ? "space" : "no space";

    $_ = "\xA0\x{2660}";
    print /^\s/ ? "space" : "no space";
'
space
space
no space
space

Regex matching follows one of two sets of rules: "byte semantics" or "unicode semantics".[*1] Which set is used depends on the internal encoding of the string used to build the pattern and/or the internal encoding of the string against which the pattern is being matched.[*2]

By default, strings are internally encoded as iso-latin-1 if possible.[*3] On the other hand, the decoding facilities of Encode, utf8 and PerlIO::encoding return strings internally encoded as utf8. This enables unicode semantics on matching.
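To see which internal encoding a given string ended up with, utf8::is_utf8 can be used. The following is only a small sketch: the helper name internal_form and the variable names are made up for illustration, and it assumes the source file has no "use utf8" or "use encoding" in effect.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: names the internal encoding of a string.
sub internal_form {
    my ($s) = @_;
    return utf8::is_utf8($s) ? "utf8" : "iso-latin-1";
}

my $literal = "\xA0";                                    # plain literal, fits in iso-latin-1
my $decoded = decode("UTF-8", encode("UTF-8", "\xA0"));  # same character, returned by Encode
my $wide    = "\xA0\x{2660}";                            # U+2660 cannot be stored as iso-latin-1

printf "%-8s %s\n", "literal", internal_form($literal);
printf "%-8s %s\n", "decoded", internal_form($decoded);
printf "%-8s %s\n", "wide",    internal_form($wide);

# The internal encoding is not part of the string's value.
print $literal eq $decoded ? "equal\n" : "not equal\n";
```

The first string should report iso-latin-1 and the other two utf8, while eq still treats $literal and $decoded as the same string. That is why the difference in \s matching is so easy to miss.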

Under byte semantics, \s matches whitespace in the ASCII range only. Under unicode semantics, \s matches anything Unicode considers whitespace[*4], which includes NBSP (U+00A0).

The internal encoding of a string can be manipulated using utf8::upgrade and utf8::downgrade.
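As a rough illustration of those two functions, here is a sketch that flips the same NBSP string back and forth and reports what /\s/ does each time. The helper name report is made up for the example, and the expected results assume a perl that is not running with the unicode_strings feature or a /u modifier on the pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: reports the internal encoding of a string
# and whether /\s/ matches it.
sub report {
    my ($s) = @_;
    my $form  = utf8::is_utf8($s) ? "utf8" : "iso-latin-1";
    my $match = $s =~ /\s/ ? "space" : "no space";
    return "$form: $match";
}

my $s = "\xA0";          # NBSP

utf8::downgrade($s);     # make sure we start from the iso-latin-1 form
print report($s), "\n";  # iso-latin-1: no space

utf8::upgrade($s);       # same characters, now stored as utf8
print report($s), "\n";  # utf8: space

utf8::downgrade($s);     # and back again
print report($s), "\n";  # iso-latin-1: no space
```

The string holds the same single character throughout; only its internal storage changes, and with it the set of rules the regex engine applies.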

*1 — This post doesn't discuss the effects of use locale, if any.

*2 — Expect (backwards compatible) changes in this area in 5.12.

*3 — This post doesn't discuss the effects of (broken) use encoding, if any.

*4 — There are bugs in many properties, but I don't think \s has any errors. These are being fixed for 5.12.

In reply to Re^3: Something strange in the world or Regexes by ikegami
in thread Something strange in the world or Regexes by mrguy123
