And what I believe is happening is that for 90% of the people out there working with UTF8 the "guessing" that Perl does works most of the time
Perl doesn't guess at encodings, so I don't know to what you are referring.
Data sources typically return bytes since the data source has no idea what the bytes represent. It's the data reader's responsibility to convert those bytes into numbers, text or whatever.
Same goes for output. For example, file handles expect bytes unless configured otherwise. It's the writer's responsibility to convert the output into bytes or to configure the file handle to do the conversion for it.
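As a minimal sketch of that division of labour (the sample bytes and variable names below are made up for illustration, and the in-memory file handle merely stands in for a real file), the reader decodes on the way in and the writer either encodes explicitly or lets the handle do it:

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Bytes as a data source would hand them over: "¢5" encoded as UTF-8.
my $bytes_in = "\xC2\xA2\x35";

# Reader's job: turn the bytes into text (a string of Unicode code points).
my $text = decode('UTF-8', $bytes_in);

# Writer's job, option 1: encode explicitly before printing to a byte handle.
my $bytes_out = encode('UTF-8', $text);

# Writer's job, option 2: configure the handle to do the encoding for us.
my $buffer = '';
open(my $fh, '>:encoding(UTF-8)', \$buffer)
    or die "Can't open in-memory file: $!";
print {$fh} $text;
close($fh);

# Prints "2 characters, 3 bytes".
printf "%d characters, %d bytes\n", length($text), length($bytes_out);
```

Both output options should produce the same three bytes; which one to use is mostly a matter of whether you control how the handle is opened.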
but don't forget there's SIX places for UTF8 to get messed up:
At least everywhere data is serialised or deserialised. I think you missed a couple. A more complete list:
Inputs:
Outputs:
The nice thing is that they are all independent. Fixing a problem with one doesn't require the cooperation of others.
With a small change to the application the internal "guessing" of Perl was suddenly wrong 50% of the time, and the only way to fix it was to Encode EVERY piece of data coming from the database.
No, that wasn't the only way to fix it. Two wrongs made a right, but introduced many other problems. Specifically, it broke length, substr, regular expressions and much more.
$ perl -wE'use utf8; $_=chr(0xA2); utf8::encode($_) if $ARGV[0]; say length; say /^¢\z/ ?1:0' 0
1
1
$ perl -wE'use utf8; $_=chr(0xA2); utf8::encode($_) if $ARGV[0]; say length; say /^¢\z/ ?1:0' 1
2
0
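To illustrate how a string that was never decoded misbehaves once it is encoded, here is a small sketch. It assumes the database hands back raw UTF-8 bytes, which is a guess about the original application and not something stated in the thread; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

my $raw     = "\xC2\xA2";             # UTF-8 bytes for "¢", never decoded
my $decoded = decode('UTF-8', $raw);  # the text "¢" (one character)

# Character operations only give sensible answers on decoded text.
# Prints "raw: length 2, decoded: length 1".
printf "raw: length %d, decoded: length %d\n",
    length($raw), length($decoded);

# Encoding the undecoded bytes treats each byte as a character,
# so the result is double-encoded.
my $double = encode('UTF-8', $raw);
my $single = encode('UTF-8', $decoded);

# Prints "double-encoded: 4 bytes, correctly encoded: 2 bytes".
printf "double-encoded: %d bytes, correctly encoded: %d bytes\n",
    length($double), length($single);
```

Decoding each input where it enters the program, instead of encoding everything afterwards, keeps length, substr and regular expressions working on characters.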
but it has to be checked FIRST, because if you don't then Encode.pm spews warnings like an 18 year old after a bottle of Jack Daniels.
Good. You're checking for undef, which isn't a string. Encoding something that isn't a string is most definitely an error. I don't know why you mention this.
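A check of that kind could be wrapped up roughly as follows. The helper name encode_if_defined and the sample row are hypothetical, invented here for illustration; this is not the code from the original application.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: pass undef (e.g. a NULL column) through untouched
# and encode everything else as UTF-8.
sub encode_if_defined {
    my ($text) = @_;
    return defined($text) ? encode('UTF-8', $text) : undef;
}

# A made-up row: a non-ASCII string, a NULL, and a plain ASCII string.
my @row     = ("\x{A2}5", undef, 'plain');
my @encoded = map { encode_if_defined($_) } @row;

printf "%s\n", defined($_) ? length($_) . ' bytes' : 'undef' for @encoded;
```

Keeping the check in one place means the "is this a string at all?" decision is made once, not at every call site.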
In reply to Re^4: Mugged by UTF8, this CANNOT be right by ikegami, in thread Mugged by UTF8, this CANNOT be right by tosh