Well, this is a confusing situation.

By default, chr turns a value in 0..255 into a single-character/single-byte ASCII string and turns a value of 256 or larger into a single-character/multi-byte UTF8 string (according to testing and the source code, not according to the documentation). (I'm using "ASCII" loosely here to mean a plain one-byte-per-character string; strictly, only 0..127 are ASCII.) If you have said 'use bytes', then a value of 256 or larger is instead ANDed with 0xff and the result is converted into a single-character/single-byte ASCII string.
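Here is a rough sketch for checking this yourself. It assumes Perl 5.8.1 or later (newer than what I tested above), because it relies on utf8::is_utf8 to peek at the internal flag; the describe helper is just a name made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report a string's length in characters, the
# length in bytes of its internal representation, and whether Perl
# has flagged it as UTF8 internally.
sub describe {
    my ($str) = @_;
    my $chars = length $str;
    my $bytes = do { use bytes; length $str };   # byte semantics in this block only
    my $flag  = utf8::is_utf8($str) ? 1 : 0;
    return "chars=$chars bytes=$bytes utf8=$flag";
}

my $low  = describe(chr(192));   # value in 0..255
my $high = describe(chr(256));   # value of 256 or larger

# Under 'use bytes', only the low 8 bits of the value are kept.
my $masked = do { use bytes; chr(256 + 65) };

print "chr(192): $low\n";
print "chr(256): $high\n";
print "use bytes; chr(321): $masked\n";
```

On such a Perl, chr(192) comes back as one byte with the flag off, chr(256) as two bytes with the flag on, and the 'use bytes' case yields "A" (321 & 0xff is 65).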

So chr(192) always produces an ASCII string. For each string, Perl keeps track of both the bytes that compose the string and whether it thinks those bytes are in ASCII or in UTF8. Regular expressions, however, don't yet know how to use that distinction to treat the two kinds of strings differently. Instead, whether to treat strings as ASCII or UTF8 is determined when the regular expression is compiled: it is compiled to expect UTF8 if you've said 'use utf8' or if there are Unicode characters in the regular expression itself (not in the string that is being matched).

So chr(192) creates an ASCII string that is not a valid UTF8 string, while 'use utf8' causes the regular expression to be compiled to expect UTF8 strings. You then hand it something that isn't valid as a UTF8 string, so the match fails.
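To see that the lone byte really isn't valid UTF8, here is a small sketch. It assumes a Perl with utf8::decode (5.8 or later), which returns false when its argument is not well-formed UTF8; is_valid_utf8 is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the given bytes form well-formed UTF8.
# utf8::decode modifies its argument in place, so work on a copy.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;
    return utf8::decode($copy) ? 1 : 0;
}

my $lone     = is_valid_utf8(chr(192));     # the single byte 0xC0
my $two_byte = is_valid_utf8("\xC3\x80");   # UTF8 encoding of character 192

print "chr(192) alone: ", ($lone     ? "valid" : "not valid"), "\n";
print "\\xC3\\x80:       ", ($two_byte ? "valid" : "not valid"), "\n";
```

The byte 0xC0 on its own is a lead byte with nothing following it, so it is rejected; the two-byte sequence is accepted.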

It might be smart (and is a very simple patch) to change chr so that, if you've said 'use utf8', it converts values in 128..255 into single-character/multi-byte UTF8 strings. It might even be wise to have it convert values in 0..127 into single-character/single-byte UTF8 strings (that requires a better vision of where Unicode support is headed in Perl than I currently have).
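Until something like that happens, you can approximate the idea in user code. This is only a sketch of the behavior, not the patch itself: chr_utf8 is a hypothetical wrapper, and it assumes utf8::upgrade and utf8::is_utf8 are available (Perl 5.8.1 or later).

```perl
use strict;
use warnings;

# Hypothetical wrapper (not part of Perl): behaves like chr, but the
# result is always stored internally as UTF8, even for 0..255.
sub chr_utf8 {
    my ($code) = @_;
    my $str = chr $code;
    utf8::upgrade($str);   # does nothing if the string is already UTF8
    return $str;
}

my $c = chr_utf8(192);
my $internal_bytes = do { use bytes; length $c };

printf "chars=%d internal bytes=%d utf8 flag=%d\n",
    length $c, $internal_bytes, utf8::is_utf8($c) ? 1 : 0;
```

The upgraded string is still one character and still compares equal to chr(192); only its internal representation changes.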

It is already planned to have regular expressions be compiled into polymorphic code such that the compiled regex can deal with both ASCII and Unicode strings. When that happens, 'use utf8' should no longer affect regular expressions.

Perl's support for Unicode is still in flux and so there are still some inconsistencies and lots of confusing bits.

Currently, if you want the UTF8 string for the character 192, you'll need to convert chr(192) into a UTF8 string. See the encode modules for some ways to do this. perlunicode has more on Unicode support for the different releases of Perl.
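As one possible way to do the conversion, here is a minimal sketch using the Encode module's encode function. It assumes a Perl where Encode is installed (it ships with 5.8 and later); the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode character 192 as UTF8 octets.
my $octets = encode('UTF-8', chr(192));

# Show the resulting bytes in hex.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $octets;

print "UTF8 bytes for character 192: $hex\n";
```

This prints the two bytes C3 80. Note that the result is a string of octets suitable for output, which is a different thing from a character string that Perl merely stores as UTF8 internally.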

- tye

In reply to (tye)Re: problem with chr function by tye
in thread problem with chr function by John M. Dlugosz
