in reply to Re^3: Mugged by UTF8, this CANNOT be right
in thread Mugged by UTF8, this CANNOT be right
And what I believe is happening is that for 90% of the people out there working with UTF8 the "guessing" that Perl does works most of the time
Perl doesn't guess at encodings, so I don't know to what you are referring.
Data sources typically return bytes since the data source has no idea what the bytes represent. It's the data reader's responsibility to convert those bytes into numbers, text or whatever.
Same goes for output. For example, file handles expect bytes unless configured otherwise. It's the writer's responsibility to convert the output into bytes or to configure the file handle to do the conversion for it.
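To make the two responsibilities concrete, here is a minimal sketch. It assumes the data is UTF-8 and uses the core Encode module; the sample bytes and the in-memory file handle are stand-ins I chose for illustration, not anything from the original application.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Bytes as they might arrive from a file, socket or database driver:
# the UTF-8 encoding of U+00A2 (cent sign) is the two bytes C2 A2.
my $bytes = "\xC2\xA2";

# Input: the reader decodes the bytes into text (a string of code points).
my $text = decode('UTF-8', $bytes);
printf "bytes: %d, characters: %d\n", length($bytes), length($text);

# Output, option 1: encode explicitly, then write to a handle that expects bytes.
my $out_bytes = encode('UTF-8', $text);

# Output, option 2: configure the handle to do the encoding itself.
# An in-memory handle stands in for a real file here.
my $buffer = '';
open(my $fh, '>:encoding(UTF-8)', \$buffer)
    or die "Can't open in-memory handle: $!";
print {$fh} $text;
close($fh);

print "both routes produce the same bytes\n" if $buffer eq $out_bytes;
```

Either output route is fine. What matters is that exactly one of them is applied to any given string.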
but don't forget there's SIX places for UTF8 to get messed up:
At least everywhere data is serialised or deserialised. I think you missed a couple. A more complete list:
Inputs:
Outputs:
The nice thing is that they are all independent. Fixing a problem with one doesn't require the cooperation of others.
With a small change to the application the internal "guessing" of Perl was suddenly wrong 50% of the time, and the only way to fix it was to Encode EVERY piece of data coming from the database.
No, that wasn't the only way to fix it. Two wrongs made a right, but introduced many other problems. Specifically, it broke length, substr, regular expressions and much more.
$ perl -wE'use utf8; $_=chr(0xA2); utf8::encode($_) if $ARGV[0]; say length; say /^¢\z/ ?1:0' 0
1
1
$ perl -wE'use utf8; $_=chr(0xA2); utf8::encode($_) if $ARGV[0]; say length; say /^¢\z/ ?1:0' 1
2
0
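The same point in script form, as a rough sketch: decoding the one undecoded input restores consistency, while encoding everything else only makes the strings match at the cost of character semantics. The `$from_db` value is a hypothetical stand-in for whatever the database driver returned, and UTF-8 is assumed.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical value handed back undecoded by a database driver:
# "5" followed by the cent sign, as UTF-8 bytes.
my $from_db = "5\xC2\xA2";

# The same value as text elsewhere in the program: two code points.
my $text = "5\x{A2}";

# Fix at the boundary: decode the input once, and everything lines up.
my $decoded = decode('UTF-8', $from_db);
print "decoded matches text: ", ($decoded eq $text ? 1 : 0), "\n";
print "decoded length: ", length($decoded), "\n";

# The workaround: encode the text to match the undecoded input.
# The comparison passes again, but character operations now see bytes.
my $encoded = encode('UTF-8', $text);
print "encoded matches raw input: ", ($encoded eq $from_db ? 1 : 0), "\n";
print "encoded length: ", length($encoded), "\n";
print "second character is the cent sign: ",
    (substr($encoded, 1, 1) eq "\x{A2}" ? 1 : 0), "\n";
```

The encoded string is three bytes long, and its second "character" is only the first half of the cent sign.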
but it has to be checked FIRST, because if you don't then Encode.pm spews warnings like an 18 year old after a bottle of Jack Daniels.
Good. You're checking for undef, which isn't a string. Encoding something that isn't a string is most definitely an error. I don't know why you mention this.
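For what it's worth, the check can live in one small wrapper rather than at every call site. This is only a sketch: `encode_if_defined` is a name I made up, the row data is invented, and UTF-8 is assumed as the output encoding.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: encode a text value for output, passing undef
# (for example a NULL database column) through untouched.
sub encode_if_defined {
    my ($value) = @_;
    return defined($value) ? encode('UTF-8', $value) : undef;
}

# Invented sample row: a non-ASCII string, a NULL, and an ASCII string.
my @row     = ("caf\x{E9}", undef, 'plain');
my @encoded = map { encode_if_defined($_) } @row;

for my $field (@encoded) {
    print defined($field)
        ? length($field) . " bytes\n"
        : "undef (left alone)\n";
}
```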