
For instance, STDOUT takes it upon itself to attempt translation of data rather than just printing what I want...
(code snippet)
Is there any reason why STDOUT shouldn't just print whatever gets thrown at it? How about, if I want it to perform automatic translation, I turn that feature on?

Huh? In your snippet, STDOUT is "just printing whatever gets thrown at it", without doing any sort of "translation" on it.

The first output is a three-byte sequence that, when viewed on a utf8-aware display, shows up as a single Unicode character. If you want to see it as three separate bytes, print to something that does not do utf8 interpretation -- e.g. (on Unix):

perl -e 'print "\xE2\x98\xBA"' | od -txC

In the second output, you've told perl to treat your three-byte string internally as utf8 data, and then you print it to a file handle that has not been configured for that encoding. That is why you get the warning. But it is just a warning: the output is effectively the same as it was before, and how you see it will depend on what you use to view it.
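To make the point above concrete, here is a minimal sketch. The original snippet is not reproduced here, so the variable names are my own, and I'm assuming the string was turned into character data with something like Encode's decode. It only shows that the byte string and the decoded string differ in how perl counts them, not in the bytes that end up on output:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xE2\x98\xBA";            # three raw bytes
my $chars = decode('UTF-8', $bytes);   # the same data as one character (U+263A)

printf "bytes: length %d\n", length $bytes;
printf "chars: length %d\n", length $chars;

# Encoding the character string explicitly yields the original three bytes
my $out = encode('UTF-8', $chars);
print $out eq $bytes ? "same bytes on output\n" : "different bytes on output\n";
```

Piping either form through od, as in the one-liner above, should show the same three bytes.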

(In perl 5.8.0, esp. with Red Hat, perl actually referred to the user's "locale" settings in order to "automagically" do utf8 conversion on output whenever the locale cited utf8; everyone quickly agreed that this was a Big Mistake™, and the behavior was corrected in 5.8.1, never to return.)

Another "feature" that's bitten me in the butt is the silent "upgrading" (i.e. corruption) of non-UTF8 scalars when concatenated with UTF8 scalars...

Conceptually, appending a non-UTF8 string to a UTF8 string is a really bad idea, bordering on stupid. Don't do that. (Why would you want to? What would you hope to accomplish as a result?)

Your second snippet shows the "special" (quasi-ambiguous) status of byte values in the \x80-\xFF range in perl 5.8: (<update>:) when used in a "raw" (non-utf8) context, they are treated simply as single byte values without further ado -- e.g. print "\xA0" prints just one byte when STDOUT is in ":raw" (default) mode -- but (</update>) when used in a utf8 context (e.g. appended to a utf8 string or printed to a file handle that is set to utf8 mode), they are automatically "upgraded" to utf8 characters: each single byte is replaced by its two-byte utf8 equivalent. For people migrating from iso-8859-1 to Unicode (which is quite a few people, even now), this prevents a lot more trouble than it creates. Admittedly, a lot of people who don't yet understand Unicode and/or utf8 can and do get into trouble with this.
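A small sketch of that upgrading, under my own assumptions (the variable names are hypothetical, and I'm using \xA0 and U+263A as sample data rather than whatever the original snippet used):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $latin1 = "\xA0";       # one byte, no-break space in iso-8859-1
my $wide   = "\x{263A}";   # character string with a code point above 0xFF

# Concatenation: perl treats \xA0 as the character U+00A0, not as a raw byte
my $joined = $wide . $latin1;
printf "length: %d characters\n", length $joined;

# Encoded for output, U+00A0 becomes the two bytes C2 A0
my $octets = encode('UTF-8', $joined);
print unpack('H*', $octets), "\n";   # hex dump of the output bytes
```

The hex dump ends in c2a0, which is the "two-byte utf8 equivalent" of the single \xA0 byte.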

As for your "preferred API", I don't think I understand what you are trying to demonstrate with the first two "print" statements. As for the third print statement ("$utf8 . $non_utf8"), if the latter scalar contains data that cannot be parsed as utf8, any utf8-aware display will simply show question marks for the bytes that make no sense. That's what the Unicode Standard says is the appropriate thing to do. Perl will only tell you that your non-utf8 data cannot be used directly as utf8 if/when you try to do:

use Encode;
decode( 'utf8', $non_utf8, Encode::FB_CROAK ); # or Encode::FB_WARN

or you can do the "default" decoding, without the third "check" parameter, and the resulting string will contain one or more \x{FFFD} Unicode characters (rendered as three utf8 bytes each, of course). That code point is labeled "REPLACEMENT CHARACTER", and it will either be ignored or show up as a question mark, depending on what utf8-aware tool you use to view it.
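Here is a rough sketch of both decoding modes side by side. The sample string is my own assumption (an iso-8859-1 "naive" with an i-diaeresis), and I've added Encode::LEAVE_SRC so the strict call does not modify the input in place:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $non_utf8 = "na\xEFve";   # iso-8859-1 data; \xEF followed by "v" is not valid utf8

# Strict decoding dies on malformed input, so trap it with eval
my $strict = eval {
    decode('utf8', $non_utf8, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
print "strict decode refused the data\n" unless defined $strict;

# Default decoding substitutes U+FFFD for what it cannot parse
my $lenient = decode('utf8', $non_utf8);
print "lenient decode contains U+FFFD\n" if index($lenient, "\x{FFFD}") >= 0;
```

Which of the two you want depends on whether bad input should stop the program or just get flagged in the output.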

If you have non-utf8 data and you want to "display" it using a utf8-aware terminal/window, you need to figure out how to make it intelligible, both to the displayer and to the user.

To get rid of the "wide character in print" warnings, do binmode FILEHANDLE, ":utf8" or use the three-arg version of the "open" statement when opening an output file: open FH, ">:utf8", $filename -- check the man page for "open" (perldoc -f open).
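A minimal sketch of the three-arg open, assuming an in-memory file handle as a stand-in for a real output file (that stand-in, and the warning collector, are my additions for illustration):

```perl
use strict;
use warnings;

# Collect warnings so we can see whether "Wide character" shows up
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $smiley = "\x{263A}";   # one character above 0xFF

# Three-arg open with a :utf8 layer, writing into a scalar instead of a file
my $buffer = '';
open my $fh, '>:utf8', \$buffer or die "open failed: $!";
print {$fh} $smiley;
close $fh or die "close failed: $!";

printf "bytes written: %d, warnings: %d\n", length $buffer, scalar @warnings;

# For an already-open handle such as STDOUT, binmode does the same job
binmode STDOUT, ':utf8';
```

With the layer in place, the character goes out as its three utf8 bytes and no warning is raised.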

In reply to Re: Interventionist Unicode Behaviors by graff
in thread Interventionist Unicode Behaviors by creamygoodness
