Unicode Korean problem

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Basic problem: perl 5.8 seems to refuse to decode Korean UTF-8 correctly.

I have an e-mail sending program that reads UTF-8 Korean (and Japanese) from a database and then formats it into an e-mail. I already have this routine working well for iso-8859.

I thought all I would have to do is change the MIME tags to utf-8 and have it print out the raw utf-8 characters, but perl 5.8.* complains about a wide character in a function call when I call encode_qp (to convert the subject line to quoted-printable format according to RFC 2047). The program then dies. I tried to follow the recommendations of 'man perlunicode' and converted the database strings to utf-8 flagged status using:

$subjecttxt = Encode::decode_utf8($subjecttxt);
$encodedsubject = encode_qp($subjecttxt);

This resulted in a blank string. I then changed it to use:
encode("utf8",$subjecttxt,Encode::FB_CROAK)
and it told me it couldn't convert the utf8, apparently thinking it was invalid. I verified that it was valid and was even able to view it correctly in Linux (with the LANG=en_US.utf-8 setting).

I also went to the extra step of verifying that the first 3 bytes of the subject line were a valid code. The UTF-8 sequence was "EC A0 9C", which converts to U+C81C in Unicode, a valid codepoint.
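The byte check described above can also be done from Perl itself. This is only a minimal sketch, assuming the Encode module that ships with perl 5.8; the variable names are made up for illustration. It decodes the three bytes strictly and prints the resulting codepoint:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The three raw bytes quoted above (hypothetical variable names).
my $bytes = "\xEC\xA0\x9C";

# FB_CROAK makes decode() die on malformed input instead of silently
# substituting a replacement character. LEAVE_SRC keeps $bytes intact,
# because a CHECK argument otherwise lets decode() modify its source.
my $chars = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);

# One character, U+C81C, is expected if the bytes are valid UTF-8.
printf "length=%d codepoint=U+%04X\n", length($chars), ord($chars);
```

If the bytes were not valid UTF-8, decode() would die here rather than return a string.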

I read further into a 'README.perl' in the lib/perl5/5.8.*/unicore area that mentioned downloading a couple of large files (Unihan.txt and NormalizeTesting.txt), which I did, and then followed the one step of 'perl mktables -makelist'. This build process seemed to work, but perl still complains about the invalid translations.

Is there more that I need to do to get a successful utf8 decode?
Is there a workaround that would let me pass the raw utf8 directly to the encode_qp() function without it complaining?

Thanks much in advance.


Re: Unicode Korean problem by Tanalis (Curate) on Jul 28, 2005 at 13:06 UTC
The documentation for MIME::QuotedPrint talks about doing this:

Perl v5.6 and better allow extended Unicode characters in strings. Such strings cannot be encoded directly, as the quoted-printable encoding is only defined for single-byte characters. The solution is to use the Encode module to select the byte encoding you want. For example:

use MIME::QuotedPrint qw(encode_qp);
use Encode qw(encode);
$encoded = encode_qp(encode("UTF-8", "\x{FFFF}\n"));
print $encoded;

This seems different to the `encode` statement you've posted above. Did you try the example from the docs? What was the outcome of that?

Assuming it works, which it does for me, it would seem straightforward enough to apply that example to your needs.

Update: Grammatical changes, and added a line to indicate that the example works for me.

-- Foxcub
#include www.liquidfusion.org.uk
Re^2: Unicode Korean problem by bivansc (Initiate) on Jul 28, 2005 at 14:07 UTC
That works! Thank you... I had seen stuff implying I needed to use 'decode()', when really it was encode()! After making the switch, it works!
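For anyone landing here later, the fix applied to a subject line might look roughly like the sketch below. It assumes the subject is already a Perl character string (as it would be after a successful decode); the sample text is made up, and the variable names are simply borrowed from the original question:

```perl
use strict;
use warnings;
use Encode qw(encode);
use MIME::QuotedPrint qw(encode_qp);

# Hypothetical subject: two Hangul characters, U+C81C and U+BAA9,
# held as a Perl character string.
my $subjecttxt = "\x{C81C}\x{BAA9}";

# Step 1: characters -> UTF-8 bytes with encode().
# Step 2: bytes -> quoted-printable with encode_qp().
# The empty second argument to encode_qp() suppresses soft line breaks,
# which would not belong inside a header value.
my $encodedsubject = encode_qp(encode('UTF-8', $subjecttxt), '');

print "$encodedsubject\n";   # prints =EC=A0=9C=EB=AA=A9
```

Note that this only produces the quoted-printable body. RFC 2047 also requires wrapping it as an encoded-word and treats a few characters (spaces, '?', '_') differently from plain quoted-printable. Encode's 'MIME-Header' and 'MIME-Q' encodings may be worth a look for building the whole header.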

