And what I believe is happening is that for 90% of the people out there working with UTF8 the "guessing" that Perl does works most of the time

Perl doesn't guess at encodings, so I don't know to what you are referring.

Data sources typically return bytes since the data source has no idea what the bytes represent. It's the data reader's responsibility to convert those bytes into numbers, text or whatever.
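
To illustrate the reader's side, here is a minimal sketch using the core Encode module. The byte string is made up for the example; it stands in for whatever a file, socket or database driver hands back.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Bytes as they might arrive from a data source (assumed sample data):
# the UTF-8 encoding of U+00A2 CENT SIGN is the two bytes C2 A2.
my $bytes = "\xC2\xA2";

# The reader knows these bytes are UTF-8 text, so the reader decodes them.
my $text = decode('UTF-8', $bytes);

# Before decoding, length counts bytes; after decoding, it counts characters.
printf "bytes: %d, characters: %d\n", length($bytes), length($text);
```

Until the decode happens, Perl has no way of knowing those two bytes are meant to be one character.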

Same goes for output. For example, file handles expect bytes unless configured otherwise. It's the writer's responsibility to convert the output into bytes or to configure the file handle to do the conversion for it.
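
A sketch of both options on the writer's side, assuming UTF-8 is the encoding the consumer expects. The in-memory file handle is only there to keep the example self-contained; a real file or socket handle takes the same layer.

```perl
use strict;
use warnings;
use Encode qw( encode );

my $text = "\x{A2}";    # one character: U+00A2 CENT SIGN

# Option 1: convert the text to bytes yourself before printing it.
my $bytes = encode('UTF-8', $text);

# Option 2: configure the handle so it does the conversion on every print.
my $buffer = '';
open(my $fh, '>:encoding(UTF-8)', \$buffer)
    or die "Can't open in-memory file: $!";
print {$fh} $text;
close($fh);

# Both routes should produce the same two bytes.
print $bytes eq $buffer ? "same bytes\n" : "different bytes\n";
```

Pick one of the two for any given handle. Doing both gives you double-encoded output.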

but don't forget there's SIX places for UTF8 to get messed up:

It can get messed up anywhere data is serialised or deserialised, so I think you missed a couple. A more complete list:

Inputs:

Source code: Decoding.
HTML form data(?): Decoding.
Database: Decoding.
Template: Decoding.

Outputs:

Database (queries and parameters(?)): Encoding.
HTML response: Encoding and inclusion of Content-Type header.
HTTP response: Inclusion of Content-Type in header.
Error log: Encoding.

The nice thing is that they are all independent. Fixing a problem with one doesn't require the cooperation of others.

With a small change to the application the internal "guessing" of Perl was suddenly wrong 50% of the time, and the only way to fix it was to Encode EVERY piece of data coming from the database.

No, that wasn't the only way to fix it. Two wrongs made a right, but introduced many other problems. Specifically, it broke length, substr, regular expressions and much more.

$ perl -wE'use utf8; $_=chr(0xA2); utf8::encode($_) if $ARGV[0]; say length; say /^¢\z/ ?1:0' 0
1
1

$ perl -wE'use utf8; $_=chr(0xA2); utf8::encode($_) if $ARGV[0]; say length; say /^¢\z/ ?1:0' 1
2
0

but it has to be checked FIRST, because if you don't then Encode.pm spews warnings like an 18 year old after a bottle of Jack Daniels.

Good. You're checking for undef, which isn't a string. Encoding something that isn't a string is most definitely an error. I don't know why you mention this.
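
For what it's worth, the check can live in one place. This is only a sketch: decode_if_defined is a hypothetical helper (it is not part of Encode), and the row is made-up data standing in for a database row with a NULL column. The same shape works for encoding on the way out.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: convert only values that are actually strings,
# and pass undef (e.g. a NULL column) through untouched.
sub decode_if_defined {
    my ($value) = @_;
    return defined($value) ? decode('UTF-8', $value) : undef;
}

# Assumed sample row: UTF-8 bytes for U+00A2, a NULL, and plain ASCII.
my @row     = ("\xC2\xA2", undef, "abc");
my @decoded = map { decode_if_defined($_) } @row;

# Show the character count of each field, or 'undef' for the NULL.
print join(", ", map { defined($_) ? length($_) : 'undef' } @decoded), "\n";
```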

In reply to Re^4: Mugged by UTF8, this CANNOT be right by ikegami
in thread Mugged by UTF8, this CANNOT be right by tosh
