roundboy has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a system that processes a bunch of email, saves it all in a DB, and then allows web users to view and reply to the messages (this is a custom call center application). Why this is tricky for me:

  1. The messages we receive are sent in various languages and encodings, both Western and East Asian.
  2. We have an extensive knowledge base (KB) of canned answers to questions. We don't mind duplicating these in each language we support, but we'd rather not have them in multiple encodings as well. We currently have all the KB entries stored as UTF-8.

The difficulty is that when composing a reply, we want to quote the original message and also allow insertion of KB entries. If I set the charset of the page that displays the reply to UTF-8, the KB entries read fine as they're inserted, but parts of the original message come out garbled (especially if it was sent as gb2312 or big5). If I instead declare the page in the charset of the original message, I can't read the KB entries.

The solution seems to be to convert either the original message or the inserted KB entries into the other charset, and that's my question: how do I do this? I've poked around CPAN, and the most likely suspect (Unicode::MapUTF8) doesn't support the charsets I need. I've also tried forking an iconv(1) process to do the conversion, but it aborts as soon as it hits a byte sequence it doesn't recognize, which apparently happens fairly often with our data.
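
For reference, this is roughly what my forked-iconv attempt looks like (a sketch only: the 'gb2312' charset name is just an example, the real code takes it from the message's Content-Type header, and $original_message stands in for the body pulled from the DB):

    use File::Temp qw(tempfile);

    sub to_utf8_via_iconv {
        my ($octets, $from_charset) = @_;

        # Write the raw message body to a temp file so we can pipe it
        # through iconv(1) without worrying about pipe deadlocks.
        my ($fh, $tmp) = tempfile(UNLINK => 1);
        print $fh $octets;
        close $fh;

        # GNU iconv exits non-zero as soon as it hits a byte sequence it
        # can't map, which is the failure mode described above.
        open my $out, "iconv -f $from_charset -t UTF-8 $tmp |"
            or die "can't run iconv: $!";
        local $/;
        my $utf8 = <$out>;
        close $out or warn "iconv failed on charset $from_charset\n";
        return $utf8;
    }

    my $reply_body = to_utf8_via_iconv($original_message, 'gb2312');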

So does anyone have any wisdom or experience with charset conversions that they care to share?

Thanks, and happy holidays!

--roundboy

Re: Unicode & charset conversions - how?
by CountZero (Bishop) on Dec 25, 2002 at 11:03 UTC

    iconv looks to be your best bet. Perhaps upgrade to a newer version which has better support for the charsets you have to use?
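
    If forking the command-line tool is awkward, the Text::Iconv module from CPAN wraps the same iconv(3) library and lets you catch conversion failures per message instead of having the whole run abort. A minimal sketch (the big5 source charset is just an example, and $original_message stands in for the stored body):

        use Text::Iconv;

        # Return undef on an unconvertible byte sequence instead of dying,
        # so one bad message doesn't take down the whole reply page.
        Text::Iconv->raise_error(0);

        my $conv = Text::Iconv->new('big5', 'utf-8');
        my $utf8 = $conv->convert($original_message);

        unless (defined $utf8) {
            # Fall back, log, or flag the message for manual repair.
            warn "big5 -> utf-8 conversion failed for this message\n";
        }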

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Unicode & charset conversions - how?
by roundboy (Sexton) on Dec 24, 2002 at 20:04 UTC
    I should have mentioned...I'm using Perl 5.6.1 on RedHat Linux 7.3, neither of which can be upgraded. I can, however, use any modules and/or free/cheap software that's runnable in that context. Thanks again.
Re: Unicode & charset conversions - how?
by John M. Dlugosz (Monsignor) on Dec 25, 2002 at 20:14 UTC
    Where I work, we needed to convert a long list of character sets all mixed together using ISO 2022 escape codes, with a few custom quirks. It was easy to write our own transcoder that dealt with errors properly. The character sets themselves are all documented in tables available online: scanned documents are filed as part of the International Registry of Character Sets, and the Unicode databases provide ready-made, machine-readable conversion tables. I know big5 is one of the sets cross-referenced in the Unicode book.

    So my advice is get the tables for the charsets you need. They must exist somewhere, even if you need two "hops" to do it.
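
    For example, a toy table-driven Big5-to-UTF-8 converter along those lines, reading the Unicode consortium's published BIG5.TXT mapping file, might look like the sketch below (the file name, the U+FFFD fallback, and the simplified lead-byte handling are all placeholders for whatever policy you actually want):

        use strict;

        # Load the Big5 mapping table.  Lines in BIG5.TXT look like
        # "0xA140  0x3000  # IDEOGRAPHIC SPACE".
        my %big5_to_ucs;
        open my $map, 'BIG5.TXT' or die "can't read mapping table: $!";
        while (<$map>) {
            next if /^\s*#/;
            my ($big5, $ucs) = /^0x([0-9A-Fa-f]+)\s+0x([0-9A-Fa-f]+)/ or next;
            $big5_to_ucs{hex $big5} = hex $ucs;
        }
        close $map;

        # Encode one code point (BMP only, which covers Big5) as UTF-8 octets.
        sub ucs_to_utf8 {
            my $c = shift;
            return chr $c if $c < 0x80;
            return pack 'C2', 0xC0 | ($c >> 6), 0x80 | ($c & 0x3F) if $c < 0x800;
            return pack 'C3', 0xE0 | ($c >> 12),
                              0x80 | (($c >> 6) & 0x3F),
                              0x80 | ($c & 0x3F);
        }

        sub big5_to_utf8 {
            my @bytes = unpack 'C*', shift;
            my $out = '';
            while (@bytes) {
                my $b = shift @bytes;
                if ($b < 0x80) {               # plain ASCII passes through
                    $out .= chr $b;
                } elsif (@bytes) {             # lead byte of a two-byte Big5 character
                    my $ucs = $big5_to_ucs{($b << 8) | shift @bytes};
                    # Emit U+FFFD for anything the table doesn't cover,
                    # rather than aborting the way iconv does.
                    $out .= ucs_to_utf8(defined $ucs ? $ucs : 0xFFFD);
                } else {
                    $out .= ucs_to_utf8(0xFFFD);   # stray trailing byte
                }
            }
            return $out;
        }

    The same loop works for gb2312 once you swap in its mapping table; the per-character fallback is the part that's hard to get out of a forked iconv.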

    —John

      Thanks for the replies. In fact I looked at my data more carefully, and discovered that out of 600-odd documents, there were only 9 that iconv was unhappy with. By random coincidence my initial testing was using 3 of those 9, so the problem looked much more severe than it was!

      So the solution I arrived at was to simply recreate the 9 docs and eliminate the ostensibly bogus data.
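
      For anyone curious, finding those 9 was just a matter of running every stored message back through iconv and flagging the failures, roughly like this (the glob pattern and the charset_for() lookup are stand-ins for however your messages and their charsets are actually stored):

        for my $file (glob 'messages/*.raw') {
            my $charset = charset_for($file);   # hypothetical: look up in the DB
            system("iconv -f $charset -t UTF-8 $file > /dev/null 2>&1");
            print "iconv chokes on $file ($charset)\n" if $? != 0;
        }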