Re: Encoding Hell

You said:

Instead of returning a nice string of readable characters, $out (or $res I'm not sure which) returns a string of octets corresponding to the individual bytes for these multibyte characters... I'd like to know: at what point is perl carrying out this conversion process...

The point is: perl is not doing any conversion -- it is giving you the "raw" binary byte stream from the source, without doing any kind of "interpretation" of it.

Whatever display tool you are using to view the data as it arrives (and just what are you using to view the data?), it's that tool which is applying the "conversion" (the interpretation of the octet stream) that you find so confusing.
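To make the "raw octets" point concrete, here is a minimal sketch. The sample octets are an assumption picked for the demo (the UTF-8 encoding of two Japanese characters), not anything taken from the OP's data. Until something decodes them, perl simply treats them as six separate bytes:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Six octets: the UTF-8 encoding of the two characters U+65E5 U+672C.
# (Illustrative sample data, not the OP's input.)
my $octets = "\xE6\x97\xA5\xE6\x9C\xAC";

# Undecoded, perl sees six one-byte "characters".
my $raw_length = length $octets;

# After an explicit decode, perl sees two characters.
my $chars       = decode('UTF-8', $octets);
my $char_length = length $chars;

printf "raw: %d, decoded: %d\n", $raw_length, $char_length;
```

Nothing changes in the data until decode is called; what a terminal or browser makes of the six undecoded octets is up to that tool.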

The right track, as indicated by rhesa, is to figure out what character encoding is being used for a given chunk of input content, and use Encode so that perl will apply the correct interpretation to the data, and depending on what sort of display tool you use, convert it to the appropriate character set for viewing. Something like this:

use Encode;

...

my $inp_enc = ...;  # whatever it happens to be

my $out_enc = ':utf8';
# or: my $out_enc = ':encoding(big5)';
# (or whatever your display tool expects)

binmode STDOUT, $out_enc;

...

print decode( $inp_enc, $res->content ) if ( $res->is_success );

(updated to fix a discrepancy in the variable names).

The way that works is: the decode call converts the content to perl-internal utf8 encoding; then, whatever mode was set for STDOUT, the print will automatically do the right thing (or try to) -- converting utf8 to something else if need be -- as the content is written to that file handle.
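Here is a self-contained sketch of that two-step flow. The helper name transcode_via_handle and the sample Shift-JIS octets are my own assumptions for illustration; an in-memory file handle stands in for STDOUT so that the octets actually written can be inspected:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode $octets from $inp_enc, then print the result
# through a handle that carries an output encoding layer.
sub transcode_via_handle {
    my ($octets, $inp_enc, $out_layer) = @_;

    my $text = decode($inp_enc, $octets);   # octets -> perl characters

    my $written = '';
    open my $fh, '>', \$written
        or die "cannot open in-memory handle: $!";
    binmode $fh, $out_layer;                # same role as: binmode STDOUT, $out_enc
    print {$fh} $text;                      # characters -> octets per the layer
    close $fh;

    return $written;
}

# Sample input: U+65E5 U+672C encoded as Shift-JIS (octets 93 FA 96 7B).
my $sjis = "\x93\xFA\x96\x7B";
my $utf8 = transcode_via_handle($sjis, 'shiftjis', ':encoding(UTF-8)');

printf "%d octets in, %d octets out\n", length $sjis, length $utf8;
```

In a real script you would skip the helper and just set the layer on STDOUT once, as in the snippet above.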

(Of course, if you want to output a non-unicode encoding because of your display tool, understand that you will get lots of encoding errors, and nothing worth looking at, if you try printing, say, Chinese text when STDOUT is set to, say, cp1251. That's the problem with non-unicode character sets: they tend to be language-specific.)
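The same mismatch can be reproduced with encode() directly. This is only a sketch with sample characters of my choosing; it shows the two outcomes you can ask Encode for when the target character set has no slot for a character:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two Chinese characters (illustrative sample data).
my $chinese = "\x{4E2D}\x{6587}";

# Default behaviour: each unmappable character is replaced by the
# encoding's substitution character, so the text is lost.
my $lossy = encode('cp1251', $chinese);

# Strict behaviour: FB_CROAK makes encode() die on the first unmappable
# character. LEAVE_SRC keeps encode() from modifying $chinese in place.
my $ok = eval {
    encode('cp1251', $chinese, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
my $error = $ok ? '' : $@;

print "lossy result is ", length($lossy), " octets\n";
print "strict encode failed: $error" if $error;
```

Either way there is nothing worth looking at in the output, which is the reason to prefer a unicode encoding on the output side whenever the display tool allows it.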

Re^2: Encoding Hell by kettle (Beadle) on Aug 10, 2006 at 02:11 UTC
"Whatever display tool you are using to view the data as it arrives (and just what are you using to view the data?), it's that tool which is applying the "conversion" (the interpretation of the octet stream) that you find so confusing."

This is not precisely true - and I never said I found it confusing... It does matter that whatever one uses to view the data be set to the same encoding that the output has been set to, but this is not the whole story. The byte stream must also be decoded properly, i.e. it must match the encoding at the source - otherwise perl makes assumptions about the input byte stream. After that one can make changes according to one's 'display tool', but leaving a shift-jis encoded byte stream as is, and then expecting the unicode decoding of this stream to work properly is not Ok.

It is clear from the code that this is understood but the wording of this post unnecessarily obfuscates the fact that perl has default settings which are not always appropriate. I don't really know why this post turned so negative; but I guess it must be my fault. Anyway the problem as mentioned a ways above, is long solved, so I guess I shan't be harking back again.
Re^3: Encoding Hell by graff (Chancellor) on Aug 10, 2006 at 17:44 UTC
"The byte stream must also be decoded properly ..."

That's the point that rhesa and I were making, and which was absent in the OP code.

"... otherwise perl makes assumptions about the input byte stream."

Well, if you want to put it in those terms, you could say "perl assumes that whatever byte stream comes in, that is what will be printed" (unless your script specifically applies some other interpretation or conversion, either using Encode or via a PerlIO encoding layer on the output file handle).

"leaving a shift-jis encoded byte stream as is, and then expecting the unicode decoding of this stream to work properly is not Ok"

I'm not sure what you're talking about here. If you know you have shift-jis data, and you want to convert it to unicode, that's definitely okay, so long as you actually apply some process to do that (perl won't do it "implicitly").

(update: I just remembered something: in case you happen to be running Perl 5.8.0 on a Red-Hat 9 system, then there is a good chance that your defaults include a "locale" setting, which, on that combination of Perl/OS versions, caused Perl to make an implicit ("default") attempt to coerce input/output data between unicode and the encoding implied by the locale. This murdered countless applications and was fixed in later versions of Perl. If this is your situation, it's long past time to upgrade.)

"It is clear from the code that this is understood but the wording of this post unnecessarily obfuscates the fact that perl has default settings which are not always appropriate."

Again, this is a bit hard to follow... which code are you referring to here? Which wording is obfuscating? Of course default settings are not always appropriate -- that's why there are alternatives to default settings...

"I don't really know why this post turned so negative;"

Me neither. That first reply (and its subthread) really threw me.

If anything I said seemed negative, I apologize for that -- I generally try to keep my tone neutral, but of course I don't always succeed. (updated to fix typos)
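For anyone who wants to see the "whatever comes in is what gets printed" behaviour for themselves, here is a small sketch. The helper name write_through and the sample Shift-JIS octets are assumptions for the demo; an in-memory handle is used in place of STDOUT so the written octets can be compared:

```perl
use strict;
use warnings;

# Hypothetical helper: print $data to an in-memory handle, optionally
# with a PerlIO layer, and return the octets that were written.
sub write_through {
    my ($data, $layer) = @_;
    my $written = '';
    open my $fh, '>', \$written
        or die "cannot open in-memory handle: $!";
    binmode $fh, $layer if defined $layer;
    print {$fh} $data;
    close $fh;
    return $written;
}

# U+65E5 U+672C as Shift-JIS octets, never decoded (illustrative sample).
my $sjis = "\x93\xFA\x96\x7B";

# No layer: the octets pass through untouched.
my $unchanged = write_through($sjis);

# UTF-8 layer on undecoded octets: each octet is taken to be a single
# character with that code point and is encoded again, giving mojibake.
my $mangled = write_through($sjis, ':encoding(UTF-8)');

printf "no layer: %d octets, utf8 layer: %d octets\n",
    length $unchanged, length $mangled;
```

The second case is what happens when an output layer is set but the decode step is skipped: perl applies no guess about the source encoding, it just encodes the octets it was given.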