cheerful has asked for the wisdom of the Perl Monks concerning the following question:

I have some image files with Unicode captions. I used ExifTool to extract them. However, I don't know how to write them to an HTML file in their native encoding.

If I write them out without any encoding step (all output goes to STDOUT, which is redirected to a file), the output displays correctly when the charset is set to UTF-8. The font is ugly but the text is correct.

However, if I try any of the following, I get garbage:

    binmode(STDOUT, ":encoding(euc-cn)")
    binmode(STDOUT, ":encoding(gb2312)")

or convert each individual string:

    $text = encode('euc-cn', $text)

I even tried decode('UTF-8', $text) beforehand, but that does not work either. What's the proper way to produce output in the correct encoding/charset? Thanks!

Replies are listed 'Best First'.
Re: How to encode for non-unicode output
by moritz (Cardinal) on Nov 04, 2008 at 19:36 UTC
    I have some image files with Unicode captions.

    Unicode is not a character encoding. If ExifTool doesn't decode the strings for you, you have to do it yourself. And you have to know its encoding first. There's no way around that.

    However, I don't know how to write them to an HTML file in their native encoding.

    In which "native encoding"? That of the HTML files? Which encoding is that?

    Let me get this straight: When you want to change the encoding of something, Encode (or the IO layers) is the way to go, but you have to know both the source and destination encoding.
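
    A minimal sketch of that round trip, assuming the source bytes are UTF-8 and the destination is EUC-CN (both are assumptions here; substitute whatever your data really uses). The sample bytes stand in for a caption, and the in-memory filehandle is only there so the result can be inspected; a real script would apply the same layer to STDOUT with binmode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Sample input: the UTF-8 bytes for U+4E2D U+6587, standing in for a caption.
my $bytes = "\xE4\xB8\xAD\xE6\x96\x87";

# Step 1: source encoding -> Perl character string.
my $chars = decode('UTF-8', $bytes);

# Step 2: character string -> destination encoding, via an IO layer.
# An in-memory filehandle is used so the output bytes can be examined.
my $buf = '';
open(my $fh, '>:encoding(euc-cn)', \$buf) or die "open failed: $!";
print {$fh} $chars;
close($fh) or die "close failed: $!";

# prints "2 characters, 4 output bytes: d6d0cec4"
printf "%d characters, %d output bytes: %s\n",
    length($chars), length($buf), unpack('H*', $buf);
```

    Note that the layer is given a decoded string. Feeding it the undecoded bytes is what typically produces garbage.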

    Also make sure to always test with reliable tools and as soon as possible. hexdump in conjunction with an encoding table is reliable. Browsers (that often try to guess an encoding, and sometimes fail) are not.
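
    If no hexdump tool is at hand, the same check can be done from Perl itself. This is only a sketch; hex_bytes is a hypothetical helper name, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: render a byte string as space-separated hex pairs,
# so it can be compared against an encoding table.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack('C*', $bytes);
}

my $utf8   = "\xE4\xB8\xAD";   # U+4E2D encoded as UTF-8
my $euc_cn = "\xD6\xD0";       # the same character encoded as EUC-CN

print hex_bytes($utf8),   "\n";   # e4 b8 ad
print hex_bytes($euc_cn), "\n";   # d6 d0
```

    Looking at the bytes this way shows which encoding a string is really in, independent of what a browser guesses.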

        decode("Guess", $text) worked. Since I did not specify any suspects, trial and error led to UTF-8.

        Since the un-encoded output looks fine as UTF-8, the original text is probably UTF-8, or ExifTool already decoded it. But somehow Perl does not know that when it tries to encode. Does the decode call just tell Perl it's UTF-8?
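
        For reference, a sketch of what that guess looks like when spelled out with Encode::Guess. The sample bytes are an assumption standing in for what ExifTool returned, and no extra suspects are registered, so only the module's defaults are tried.

```perl
use strict;
use warnings;
use Encode::Guess;   # exports guess_encoding by default

# Sample input: UTF-8 bytes for U+4E2D U+6587 (an assumed stand-in for the caption).
my $bytes = "\xE4\xB8\xAD\xE6\x96\x87";

# On success guess_encoding returns an encoding object,
# on failure a plain string describing the problem.
my $enc = guess_encoding($bytes);

if (ref $enc) {
    my $chars = $enc->decode($bytes);
    printf "guessed %s, decoded to %d characters\n", $enc->name, length $chars;
}
else {
    print "could not guess: $enc\n";
}
```

        Guessing is a fallback. If the source encoding is known, naming it explicitly in decode is more reliable.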

      ExifTool uses UTF-8 by default. If I print it out without encoding, the text is correct with the charset set to UTF-8. So either the decoding is already done, or the source is UTF-8.
        If the source is UTF-8, most string operations (like encoding into a specified character encoding) behave very differently in the two cases (decoded or not decoded).

        If it's indeed decoded, encode($destination_encoding, $string) will work (but you still need to know in which encoding you want to store it).
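
        A small sketch of the difference, assuming UTF-8 source bytes and EUC-CN as the destination (both are assumptions for illustration). The undecoded string is treated as three separate characters, one per byte, so encoding it cannot produce the intended result.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $raw     = "\xE4\xB8\xAD";          # undecoded UTF-8 bytes for U+4E2D
my $decoded = decode('UTF-8', $raw);   # the same text as one character

print length($raw),     "\n";   # 3 (Perl sees three byte-sized characters)
print length($decoded), "\n";   # 1 (Perl sees one character)

# Encoding the decoded string gives the EUC-CN bytes for U+4E2D.
my $good = encode('euc-cn', $decoded);

# Encoding the undecoded string treats each byte as its own character,
# so the result is something other than the intended two bytes.
my $bad = encode('euc-cn', $raw);

print unpack('H*', $good), "\n";   # d6d0
print $good eq $bad ? "same\n" : "different\n";
```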